Indexing:

How do I switch off stop-word removal?

You need to supply your own StandardAnalyzer, built with an empty stop-word list, to your Index or IndexWriter. For example:

analyzer = Ferret::Analysis::StandardAnalyzer.new([])
index = Ferret::Index::Index.new(:analyzer => analyzer)

Do this if you want to search for terms like 'and' and 'the'. Ferret also comes with a number of predefined stop-word lists; see the constants in the Ferret::Analysis module.
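
To see why an empty stop-word list matters, here is a conceptual sketch in plain Ruby. It is not how Ferret's analyzers are implemented; the method name sketch_analyze and the simplified tokenizing rule are assumptions made for illustration only.

```ruby
# Conceptual sketch only; Ferret's real analyzers work differently.
# Lower-case the text, split it into word tokens, and drop every token
# that appears in the stop-word list.
def sketch_analyze(text, stop_words)
  text.downcase.scan(/[a-z0-9]+/).reject { |token| stop_words.include?(token) }
end

# With a stop-word list, 'the' and 'and' never reach the index,
# so a search for them cannot match anything.
with_stop_words = sketch_analyze("The cat and the hat", %w[and the])

# With an empty list, every token is kept and stays searchable.
without_stop_words = sketch_analyze("The cat and the hat", [])
```

Passing [] to StandardAnalyzer.new, as in the snippet above, corresponds to the second call.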

How do you create the first index?

There is no need to create a special first index. When you call Index::Index.new(), a new, empty index is created if one is not already present.

However, if you want to define one or more special Fields using FieldInfos, you might want to create the Index via the create_index method of FieldInfos.

  field_infos = FieldInfos.new(:term_vector => :no)
  field_infos.add_field(:title, :index => :untokenized, :boost => 10.0)
  field_infos.add_field(:content, :term_vector => :yes)
  field_infos.create_index("/path/to/index")

Or you can pass the FieldInfos object to the IndexWriter or Index constructor and let them take care of the creation for you.

  field_infos = FieldInfos.new(:term_vector => :no)
  field_infos.add_field(:title, :index => :untokenized, :boost => 10.0)
  field_infos.add_field(:content, :term_vector => :yes)

  index = Index.new(:field_infos => field_infos, :path => "/path/to/index")

See also: What is a Field?, What is a FieldInfo?

What is the default fieldinfo?

By default, a field's value is stored, and the field is searchable, tokenized, and keeps term vectors with positions and offsets (which can blow up the size of your index). You can replace these defaults with your own by passing them to the FieldInfos constructor.

  # Set the defaults for all fields
  FieldInfos.new(:term_vector => :no, :store => :no)

Can I add new fields after I've created the FieldInfos?

Yes, simply add them to the field_infos of the Index.

  index.field_infos.add_field(:new_field, :store => :yes)

What happens if I add new Fields to the Index later on?

You can add fields at any time; doing so has no impact on Ferret's speed. You should still remember to optimize your index.

What happens if I try to add a Field that is already in the index?

You will get an ArgumentError exception, so you should either handle the exception or check whether a field of that name is already defined.

  # check if field already defined
  index.field_infos.fields.include?(:my_new_field) 

  # handle potential exceptions while adding
  begin
    index.field_infos.add_field( :new_field, :store => :no )
  rescue ArgumentError
    # do something.. 
  end
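
The check and the add can be combined into one small helper. This is a minimal sketch: add_field_unless_defined is a hypothetical name, not part of Ferret, and it only assumes that the object passed in responds to fields and add_field the way index.field_infos does in the snippet above.

```ruby
# Hypothetical helper (not part of Ferret): add a field only when no field
# of that name is defined yet. Returns true if the field was added and
# false if it already existed.
def add_field_unless_defined(field_infos, name, options = {})
  return false if field_infos.fields.include?(name)
  field_infos.add_field(name, options)
  true
end

# Intended usage, assuming an open index:
#   add_field_unless_defined(index.field_infos, :new_field, :store => :no)
```

Checking first avoids using an exception for ordinary control flow; the begin/rescue form is still useful if another process might add the same field between the check and the add.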

Is there a way to optimize the Index?

Yes, and it is your task to do that. Use the optimize method of Index. It is not necessary to optimize the index after every added document, but we recommend optimizing once no more indexing requests are pending.
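
One way to follow that advice is to add a whole batch of documents and optimize once at the end. The sketch below assumes only that the index object responds to << and optimize, as Ferret's Index does; add_all_and_optimize is a hypothetical helper name.

```ruby
# Hypothetical helper (not part of Ferret): add every document in the
# batch, then optimize a single time instead of once per document.
def add_all_and_optimize(index, documents)
  documents.each { |doc| index << doc }
  index.optimize
  index
end

# Intended usage, assuming an open index:
#   add_all_and_optimize(index, [{:title => "First"}, {:title => "Second"}])
```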

How to index all files under a directory?

Brian McCallister wrote a few lines on how to do that in his blog. Here is some even simpler code.

#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'

index = Ferret::I.new(:path => '/path/to/index')

Dir["/path/to/dir/**/*.txt"].each do |path|
  index << {:path => path, :content => File.read(path)}
end

puts index.search("Ferret").to_s(:path)
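
If you want to skip unreadable files or reuse the directory walk elsewhere, the file collection can be separated from the indexing. This is a sketch in plain Ruby; each_text_document is a hypothetical helper name, and it builds the same :path and :content hashes as the loop above.

```ruby
# Hypothetical helper: yield one document hash per readable *.txt file
# found under dir (recursively), in sorted path order.
def each_text_document(dir)
  return enum_for(:each_text_document, dir) unless block_given?

  Dir.glob(File.join(dir, "**", "*.txt")).sort.each do |path|
    # Skip directories that happen to end in .txt and files we cannot read.
    next unless File.file?(path) && File.readable?(path)
    yield({ :path => path, :content => File.read(path) })
  end
end

# Intended usage, assuming an open index:
#   each_text_document("/path/to/dir") { |doc| index << doc }
```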

How to index an IMAP directory?

John Wells has written some lines of code to index via IMAP using Ferret. Code can be found in this thread on the Ruby Forum.

How can I crawl external content and add it to my index?

Take a look at RDig (http://rubyforge.org/projects/rdig). It contains a simple HTTP crawler and some support for extracting textual content from the fetched pages. Adding the content to your index is your task.

How to remove all documents from index?

You can either remove all documents from the index by deleting each document:

index.size.times {|i| index.delete(i)}

Or you can simply overwrite the current index by forcing the creation of a new one:

index = Index::Index.new(:path => '/path/to/index', :create => true)

Can I use an index that was built with Lucene?

Maybe, with 0.9.* versions or earlier, and then only if the index doesn't contain any multibyte characters, that is, if all the data is ASCII.

How to use Ferret on an Existing Java Lucene Index?

See the previous question first. Unless it is very easy for you to reindex all of your documents using Lucene, I recommend you make a copy of your index first. This hasn't been extensively tested, so I can't guarantee you won't corrupt your index. On *nix, that would be:

    cp -R /path/to/index /path/to/index_copy

Then simply open the index as you usually would in Ferret:

index = Index::Index.new(:path=>'/path/to/index_copy')
# Add documents, run searches etc.
index.close

Note: this does not appear to work if the Lucene index uses the compound file format (".cfs"). When I set IndexWriter.setUseCompoundFile(false) in the Java program, it works great.
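
The copy step can also be done from Ruby, which is handy if the rest of your setup is a Ruby script. This is a minimal sketch using the standard library's FileUtils; copy_index is a hypothetical helper name, and the refusal to overwrite an existing target is a design choice of this sketch rather than anything Ferret requires.

```ruby
require 'fileutils'

# Hypothetical helper: copy an index directory to a new location.
# Refuses to run if the target already exists, because FileUtils.cp_r
# would then copy the source *into* the target instead of alongside it.
def copy_index(source, target)
  raise ArgumentError, "#{target} already exists" if File.exist?(target)
  FileUtils.cp_r(source, target)
  target
end

# Intended usage:
#   copy_index("/path/to/index", "/path/to/index_copy")
```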