How Tos
Please add your own tutorials or how-tos to this page.
(You might want to check out the FerretArticles section as well)
How to Integrate Ferret With Rails
acts_as_ferret
acts_as_ferret is the recommended way to integrate Ferret with Rails. Its maintainers have put up an svn repository and Trac at http://projects.jkraemer.net/acts_as_ferret/
Thanks to Kasper Weibel who started this great plugin. There's a page on this wiki as well: FerretOnRails.
using ferret with activerecord
Aslak Hellesoy has another piece;
http://aslakhellesoy.com/articles/2005/11/18/using-ferret-with-activerecord
How to integrate Ferret with Rails on the Rails wiki (a little outdated)
Jan Prill has written up a great howto on integrating Ferret with Rails. You can check it out here:
http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails . The recommended way for integration is acts_as_ferret.
How to index all files under a directory
Check out Brian McCallister's blog for a description of how to do this.
How to create a persistent index.
I'll just quickly explain the :create and :create_if_missing options in index. To create a new index, use :create => true. This will create a new index, regardless of whether an index already exists in the specified directory. So, if you are going to use this option you should only use it the first time, i.e.;
index = Index::Index.new(:path => '/tmp/ferret', :create => true)
# Add fields etc. run searches etc.
index.close

# and in a new session
index = Index::Index.new(:path => '/tmp/ferret')
# Add fields etc. run searches etc.
index.close
If you want Ferret to only create an index when one is missing, you can explicitly set :create_if_missing => true. This is the default behaviour in 0.1.1. If you want an exception thrown if there is no index then set :create_if_missing => false.
How to search within any field of a document.
# And if you want to search all fields by default;
index = Index::Index.new(:default_field => "*")
# Note :default_field is already an option and currently defaults to ""
# but I'll probably make it default to "*" in the next version

# Now to search for foo in all fields in documents after Jan 1st 2005;
topdocs = index.search("foo AND created: >= 20050101")
Please see more on this here.
How to use Ferret on an Existing Java Lucene Index
Unless it is very easy for you to reindex all of your documents using Lucene, I recommend you make a copy of your index. This hasn't been extensively tested so I can't guarantee you won't corrupt your index. So on *nix, that would be;
cp -R /path/to/index /path/to/index_copy
Then simply open the index as you usually would in Ferret;
index = Index::Index.new(:path => '/path/to/index_copy')
# Add fields etc. run searches etc.
index.close
Note: it appears this doesn't work if the Lucene index uses the ".cfs" format (compound file format). When I call IndexWriter.setUseCompoundFile(false) in the Java program it works great.
How to use the C Indexer
Currently cFerret isn't really ready for release and it'll only run on Linux (it also compiles and passes the tests on Mac OS X v.4), but it is definitely usable if you know your way around C. If you want to index a lot of documents, it may be worth looking into. Indexing all of the text documents on my PC took over 20 minutes with Java Lucene and less than a minute with cFerret, and the indexes were identical. To get cFerret, you'll need subversion;
svn co svn://www.davebalmain.com/cferret/trunk cferret
Then change directory into cferret and run make. If you've got gcc 4.0 or above you'll get a lot of warnings which you can ignore. (I do plan to fix that). Running make will just compile all of the object files and run the unit tests. If the tests don't all pass, please let me know at dbalmain@gmail.com. Then look at bench.c to see how to use it.
How to use Index::Index in Multi-Threaded Applications
The Index::Index class is thread-safe so it should run well in threaded applications. One thing to note is that document numbers are ephemeral, i.e. they may change as an index is updated. Clients should thus not rely on a given document having the same number between requests. There are two possible solutions to this. You can synchronize on the index;
docs = [] # declared outside the block so it is still visible afterwards
index.synchronize do
  topdocs = index.search("foo AND created: >= 20050101")
  topdocs.each {|doc, score| docs << index[doc]}
end
docs.each do |doc|
  # You can now do whatever you want with your documents
end
Or perhaps you can do this a little more easily with the block search method;
docs = []
index.search_each("foo AND created: >= 20050101") do |doc, score|
  docs << index[doc]
end
docs.each do |doc|
  # You can now do whatever you want with your documents
end
You can also use the synchronize shown in the first example to run transactions on the index.
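The reason for holding the lock while resolving document numbers can be shown without Ferret at all. The following is only a sketch: TinyIndex is a made-up stand-in built on Ruby's standard MonitorMixin, and its search, delete and compact! methods are assumptions that merely imitate an index whose document numbers shift when it is updated.

```ruby
require 'monitor'

# Hypothetical stand-in for an index; the array position plays the role of the document number.
class TinyIndex
  include MonitorMixin

  def initialize
    super()
    @docs = []
  end

  def <<(doc)
    synchronize { @docs << doc }
    self
  end

  def delete(num)
    synchronize { @docs[num] = nil }
  end

  # Dropping deleted slots renumbers every later document.
  def compact!
    synchronize { @docs.compact! }
  end

  def search(term)
    synchronize { @docs.each_index.select { |i| @docs[i] && @docs[i].include?(term) } }
  end

  def [](num)
    synchronize { @docs[num] }
  end
end

index = TinyIndex.new
index << "bar zero" << "foo one" << "foo two"

docs = []
writer = nil
index.synchronize do
  hits = index.search("foo")
  # The writer wants to renumber the documents, but it has to wait
  # until this block releases the monitor.
  writer = Thread.new do
    index.delete(0)
    index.compact!
  end
  hits.each { |num| docs << index[num] }
end
writer.join

puts docs.inspect
puts index.search("foo").inspect
```

The monitor is reentrant, so the calls to search and [] inside the synchronize block do not deadlock. Once the writer has run, the same query returns different numbers, which is why the numbers collected inside the block should not be reused afterwards.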
Things get a little more complicated when you have multiple separate processes accessing an index, for example in a Rails web app served with multiple dispatch threads. To handle this, you need to make sure your index is flushed as soon as you perform an update, i.e. add or delete a document. Otherwise, other processes trying to update the index will time out while trying to get the write lock. Here is an example;
# Change all instances of the name David Grey to the correct David Gray
index.search_each('artist:"David Grey"') do |doc_num, score|
  document = index[doc_num]
  document[:artist] = "David Gray"
  index.delete(doc_num)
  index << document # Note that the document will now have a new document number
end
index.flush # <= this method should be called if you want other processes to be able to update the index.
How to not use the main Index::Index class
Right now it is fairly simple to use the Index::Index class. It handles most of the index updating and locking for you. The problem is that it does a lot of extra work to make sure that you are always searching on the latest index. It is actually a lot more efficient to have one object for updating the index and as many others as you like for searching the index. This gives you more control over what is going on in the index and leads to greater efficiency. The classes involved are Index::IndexReader, Index::IndexWriter, and Index::IndexSearcher. I'll cover the searcher here first, and most of the others follow suit.
include Ferret::Search
# In the current release (0.1.3) there is a bug in the IndexSearcher. In the initialize function
# the line that calls FSDirectory.open should be changed to FSDirectory.new(args, true).
# I believe this is fixed in the dev build.
sr = Index::IndexSearcher.new("path/to/index")
# Then to use the searcher we can do something like:
# You can include options in the QueryParser if you want. I decided to leave them blank for now
qp = Ferret::QueryParser.new()
# Need to get the field names out, or you won't be searching much
qp.field = sr.reader.get_field_names.to_a
# Then just search as follows
sr.search(qp.parse("whatever"))
That should allow you to get a read-only searcher up and running without requiring any write locks or such. One important note on this: if an external process changes your index, you will need to reset the reader object to get those changes.
include Ferret::Store
sr.reader = Index::IndexReader.open(FSDirectory.get_directory("path/to/index"))
How to remove all documents from index
If you want to build the index anew you need to remove all documents from your index first. You could do it with the following code.
index.size.times {|i| index.delete(i)}
How to use keys for document
Ferret has a very useful concept of document keys. You can think of the key as a document field that is unique across the index. Some code should make this clearer. Let's imagine that we want to index a Document object.
document = Document.find(some_id) # Document is our business class that we want to index with Ferret
index << {:id => document.id, :text => document.text}
If you run this code you will have an indexed document, which is exactly what we need. But what happens if we run this code again? We would then have two Document objects with the same id in our index, which is wrong. We need to store just one Document.
This is where Ferret's index keys help. In the code below we make the id field the key of the index, so after the code has run there is only one document in the index.
index = Index::Index.new(:key => :id)
index << {:id => 23, :data => "This is the data..."}
index << {:id => 23, :data => "This is the new data..."}
Remember also that we can fetch a document very quickly by its key (and I love Ferret for this feature).
index["23"] # Get document with key 23
index[112]  # Get document with internal number 112. It is NOT the key field.
            # It is just the internal Ferret id. This number is subject to change
            # whenever the document is updated or other documents are deleted and
            # the index is optimized.
# Now we will remove by key
index.remove("23")    # Remove Document with id=23 from index. The same as the following statement
index.remove("id:23")
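The difference between a key and an internal document number is easy to lose track of, so here is a toy model of the behaviour described above. It is not how Ferret implements keys: TinyKeyedIndex and everything in it are made up for illustration, and it only mimics the visible effect that adding a document with an existing key replaces the old one and hands out a new internal number.

```ruby
# Hypothetical in-memory model; the array position stands in for Ferret's internal document number.
class TinyKeyedIndex
  def initialize(key)
    @key = key
    @docs = []
  end

  # Adding a document first deletes any document that has the same key value.
  def <<(doc)
    @docs.each_with_index do |existing, num|
      @docs[num] = nil if existing && existing[@key].to_s == doc[@key].to_s
    end
    @docs << doc
    self
  end

  def size
    @docs.compact.size
  end

  # A String is treated as a key lookup, an Integer as an internal document number.
  def [](id)
    case id
    when String  then @docs.compact.find { |doc| doc[@key].to_s == id }
    when Integer then @docs[id]
    end
  end
end

index = TinyKeyedIndex.new(:id)
index << {:id => 23, :data => "This is the data..."}
index << {:id => 23, :data => "This is the new data..."}

puts index.size
puts index["23"][:data]
puts index[0].inspect
```

In this model the first copy lived at number 0 and the replacement lives at number 1, so code that remembered the number 0 now finds nothing, while a lookup by key still works.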
How to index an IMAP directory
John Wells has written some lines of code to index via IMAP using Ferret. Code can be found in this thread on the Ruby Forum.
How to do location-based searches (search by zip code)
You can find some example code posted on the tourbus blog at http://blog.tourb.us/archives/ferret-and-location-based-searches
How to build a ferret index from documents with different mime types
The FerretHelper module and Ferret Finder utility
Stuart Rackham wrote helpers and utilities for indexing the filesystem. With his tools you are able to use commands like
$ ff -i ~/doc ~/projects     # Create new index of doc and projects directories
$ ff instantiation ruby      # Find docs with both words
$ ff "array ruby -python"    # Find docs with array and ruby but not python
$ ff file:*ruby*.txt         # Find docs with file names like *ruby*.txt
His library uses the following tools for conversion to indexable text:
- PDF to text conversion with pdftotext
- HTML to text conversion with html2text
- Open Document to text conversion with odt2txt
- Word to text conversion with antiword
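If you want to roll something similar yourself, the dispatch from file type to converter can be sketched as below. This is not Stuart's code. The CONVERTERS table, converter_command and extract_text are hypothetical names, and the argument lists are assumptions about how each tool is usually invoked (for instance that pdftotext writes to standard output when the output file is given as "-"), so check them against the versions you have installed.

```ruby
# Hypothetical extension-to-converter table. Each entry builds an argument list
# whose command is assumed to print the extracted text on standard output.
CONVERTERS = {
  '.pdf'  => lambda { |path| ['pdftotext', path, '-'] },
  '.html' => lambda { |path| ['html2text', path] },
  '.odt'  => lambda { |path| ['odt2txt', path] },
  '.doc'  => lambda { |path| ['antiword', path] }
}

# Returns the command for a file, or nil when no converter is registered for it.
def converter_command(path)
  builder = CONVERTERS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Reads plain files directly and pipes everything else through its converter.
def extract_text(path)
  command = converter_command(path)
  return File.read(path) if command.nil?
  IO.popen(command) { |io| io.read }
end

puts converter_command('/home/me/doc/report.PDF').inspect
puts converter_command('/home/me/doc/notes.txt').inspect
```

Passing the command as an array means no shell is involved, so file names with spaces or quotes need no escaping.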
Converting these common document types for indexing is a task facing everyone who wants to do desktop search. If that's interesting for you, you might want to have a look at RDig as well (following right underneath...)
How to crawl internet-sites, an intranet or the filesystem and index the crawled documents - RDig
Jens Kraemer came up with a great tool for crawling documents that reside on the internet, your intranet or the file-system. Have a look at RDig: RDig provides an HTTP crawler and content extraction utilities to help building a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, the index can be built with a single call to rdig.
How to index word documents
antiword is a great tool for converting Word documents to text. You can use this to batch convert your Word documents to text so you can index them with Lucene. You can see a web demo of this in action at scattrbrain
How to make sure that the index gets valid UTF-8 text
Paul Battley has a good blog post on correcting UTF-8 text at http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/. Basically you use the iconv library (a standard library) and do this;
require 'iconv'
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)
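Note that iconv stopped being part of Ruby's standard library with Ruby 2.0. On Ruby 2.1 and later a similar clean-up can be done with String#scrub. The sketch below is an alternative to the approach in the blog post, not a copy of it, and clean_utf8 is just a name chosen for this example.

```ruby
# Returns a copy of the input that is guaranteed to be valid UTF-8.
def clean_utf8(untrusted_string)
  # Treat the bytes as UTF-8, then drop every byte sequence that is not valid.
  untrusted_string.dup.force_encoding(Encoding::UTF_8).scrub('')
end

valid_string = clean_utf8("abc\xFFdef")
puts valid_string
puts valid_string.valid_encoding?
```

Passing an empty string to scrub drops the invalid bytes, which matches the //IGNORE behaviour above. Calling scrub without an argument replaces them with U+FFFD instead.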
How to launch DRb server on reboot (linux)
Many people have had a difficult time getting their DRb server to launch at reboot on newer Linux distributions. This is caused by a PATH issue that comes about when users have installed Ruby in /usr/local/bin and their linux distribution utilizes SELinux. Here's a fix (and a startup script):
#!/bin/bash
#
# This script starts and stops the ferret DRb server
# chkconfig: 2345 89 36
# description: Ferret search engine for ruby apps.
#
# save the current directory
CURDIR=`pwd`
PATH=/usr/local/bin:$PATH
RORPATH="/path/to/ror_root"
case "$1" in
  start)
    cd $RORPATH
    echo "Starting ferret DRb server."
    FERRET_USE_LOCAL_INDEX=1 \
      script/runner -e production \
      vendor/plugins/acts_as_ferret/script/ferret_start
    ;;
  stop)
    cd $RORPATH
    echo "Stopping ferret DRb server."
    FERRET_USE_LOCAL_INDEX=1 \
      script/runner -e production \
      vendor/plugins/acts_as_ferret/script/ferret_stop
    ;;
  *)
    echo $"Usage: $0 {start, stop}"
    exit 1
    ;;
esac
cd $CURDIR
How to create synonym based searching
Most code listed here is based on examples from the "Lucene in Action" book along with examples of how to create filters/analyzers from the ferret mailing list. The wordnet_prolog_2_ferret.rb script is based on the Lucene program Syns2Index.java.
Creating the analyzer
The SynonymAnalyzer is fairly simple, like most analyzers. It is very similar to the StandardAnalyzer, with a few differences noted below.
A synonym engine must be supplied to the analyzer. The engine is required to do the lookup of a word and return the resulting synonyms. The SynonymAnalyzer also requires a SynonymTokenFilter that does most of the work and actually makes the calls to the specified synonym engine. Finally, unlike the StandardAnalyzer this class does not run tokens through the HyphenFilter because if there are hyphenated words that have synonyms, it would be nice to capture those.
class SynonymAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  def initialize(synonym_engine, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
    @synonym_engine = synonym_engine
    @lower = lower
    @stop_words = stop_words
  end

  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
    ts = LowerCaseFilter.new(ts) if @lower
    ts = StopFilter.new(ts, @stop_words)
    ts = SynonymTokenFilter.new(ts, @synonym_engine)
  end
end
Creating the token filter
SynonymTokenFilter does the job of taking a token from the supplied token stream and injecting all the synonyms for that token.
There are a couple of interesting pieces of code here, starting with the 'next' method. The first thing it does is check the @synonym_stack to see if there are any synonyms left in it, and if so it returns one of those instead of the next token in the @token_stream. If @synonym_stack is empty it fetches the next token and, if that is not nil, calls add_synonyms_to_stack.
The add_synonyms_to_stack method takes the supplied token, calls the get_synonyms method of the @synonym_engine, then loops over the results and adds them to the stack. While adding them to the stack it turns them into tokens that have the same start position and end position as the original token. It also makes sure to set the position increment to 0. That is very important because you want all the synonyms and the original token to have the same position.
class SynonymTokenFilter < Ferret::Analysis::TokenStream
  include Ferret::Analysis

  def initialize(token_stream, synonym_engine)
    @token_stream = token_stream
    @synonym_stack = []
    @synonym_engine = synonym_engine
  end

  def text=(text)
    @token_stream.text = text
  end

  def next
    return @synonym_stack.pop if @synonym_stack.size > 0
    if token = @token_stream.next
      add_synonyms_to_stack(token) unless token.nil?
    end
    return token
  end

  private

  def add_synonyms_to_stack(token)
    synonyms = @synonym_engine.get_synonyms(token.text)
    return if synonyms.nil?
    synonyms.each do |s|
      @synonym_stack.push(
        Token.new(s, token.start, token.end, 0))
    end
  end
end
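To see what actually comes out of such a filter, the same stack logic can be run without Ferret installed. The sketch below is a stand-in only: SketchToken, ArrayTokenStream and SketchSynonymFilter are made-up names, the synonym list is invented, and a plain Hash takes the place of the synonym engine.

```ruby
# Stand-in for a token: text, character offsets and position increment.
SketchToken = Struct.new(:text, :start_offset, :end_offset, :pos_inc)

# Minimal token stream over an array of already-built tokens.
class ArrayTokenStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift
  end
end

# Same idea as the filter above: emit the original token, then its synonyms.
class SketchSynonymFilter
  def initialize(token_stream, synonyms)
    @token_stream = token_stream
    @synonyms = synonyms # Hash mapping a word to an Array of synonyms
    @stack = []
  end

  def next
    return @stack.pop unless @stack.empty?
    token = @token_stream.next
    return nil if token.nil?
    (@synonyms[token.text] || []).each do |synonym|
      # Same offsets as the original and a position increment of 0.
      @stack.push(SketchToken.new(synonym, token.start_offset, token.end_offset, 0))
    end
    token
  end
end

source = ArrayTokenStream.new([
  SketchToken.new("pet", 0, 3, 1),
  SketchToken.new("ferret", 4, 10, 1)
])
filter = SketchSynonymFilter.new(source, "ferret" => ["polecat", "mustela"])

tokens = []
while (token = filter.next)
  tokens << token
end
tokens.each { |t| puts "#{t.text} [#{t.start_offset},#{t.end_offset}] +#{t.pos_inc}" }
```

Because the stack is popped from the end, the synonyms come out in the reverse of the order the engine returned them. That does not matter for searching, since they all share the original token's position.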
Create the synonym engine
The WordnetSynonymEngine does the actual job of querying an existing ferret index for the synonyms for any word passed to get_synonyms. The engine creates a searcher object to use for every call to get_synonyms. The 'existing ferret index' mentioned previously is created by wordnet_prolog_2_ferret.rb that'll be described in the next section.
When get_synonyms is called it creates a simple TermQuery object on the "word" field in the index and returns the first result it finds from the @searcher's search_each method.
Any synonym engine must implement get_synonyms, and the results get_synonyms returns must be an array.
# Accesses a ferret index created from the wordnet synonym database
class WordnetSynonymEngine
  include Ferret::Search

  def initialize(index_location)
    @searcher = Searcher.new(index_location)
  end

  def get_synonyms(word)
    @searcher.search_each(TermQuery.new(:word, word)) do |doc_id, score|
      return @searcher[doc_id][:syn]
    end
    return nil
  end
end
The engine described above is based on the example in the "Lucene in Action" book; however, other engines can easily be created.
Here's an example of using a YAML based synonym engine.
require 'yaml'

# Accesses a YAML file for synonym lookup.
class YAMLSynonymEngine
  def initialize(index_location)
    @searcher = YAML.load_file(index_location)
  end

  def get_synonyms(word)
    return @searcher[word]
  end
end
This is a fairly simple class that loads the file specified by the index_location parameter into the @searcher variable. Any call to get_synonyms then just returns the result of looking the word up in @searcher, which is nil if the word is not in the file. Again, the engines must return an array, so this YAML engine requires that the YAML file be set up using inline collections. Here's a short example:
# Notice that multi-word keys must be in quotes.
ferret: ['black-footed ferret', 'mustela nigripes', 'ferret out']
'black-footed ferret': ['ferret', 'mustela nigripes']
'ferret out': ['ferret']
'mustela nigripes': ['black-footed ferret', 'ferret']
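Since a mistake in the YAML layout only shows up later as an odd search result, it can be worth checking the file when it is loaded. The helper below is a small sketch of that idea. load_synonym_table is a hypothetical name, and the YAML is passed in as a string here only so that the example runs on its own.

```ruby
require 'yaml'

# Parses synonym YAML and checks that every entry maps a word to an array.
def load_synonym_table(yaml_text)
  table = YAML.load(yaml_text)
  table.each do |word, synonyms|
    unless synonyms.is_a?(Array)
      raise ArgumentError, "#{word.inspect} must map to an array of synonyms"
    end
  end
  table
end

yaml_text = <<-YAML
ferret: ['black-footed ferret', 'mustela nigripes', 'ferret out']
'ferret out': ['ferret']
YAML

table = load_synonym_table(yaml_text)
puts table['ferret'].inspect
puts table['weasel'].inspect
```

A word that is missing from the file simply gives nil, which is the value the token filter already treats as "no synonyms".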
Both engines work the same:
>> w = WordnetSynonymEngine.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/wordnet")
>> w.get_synonyms('ferret')
=> ["black-footed ferret", "mustela nigripes", "ferret out"]
>>
>> y = YAMLSynonymEngine.new("#{RAILS_ROOT}/extras/synonyms.yaml")
>> y.get_synonyms('ferret')
=> ["black-footed ferret", "mustela nigripes", "ferret out"]
Creating the Wordnet synonym index
Ferret version
This code is a port of the Syns2Index.java program into ruby with only a few minor changes to how it works. I did not want to exclude words with spaces in them so I removed any logic for that, and obviously I changed it so that it builds a ferret index instead of a Lucene index.
To use this script download the prolog wordnet database and extract it. Run the script without any arguments to see the usage. The file you will want to use is 'wn_s.pl'.
The index is built on the idea that there are two fields. A "word" field and a "syn" field. The word field is the word to look up, and the syn field is an array of all the synonyms. When ferret returns the syn field it will return the array as it was indexed.
require 'rubygems'
require 'ferret'
def index(index_dir, word2nums, num2words)
  row = 0
  mod = 1
  # override the specific index if it already exists
  field_infos = Ferret::Index::FieldInfos.new()
  field_infos.add_field(:word, :index => :untokenized, :term_vector => :no)
  field_infos.add_field(:syn, :index => :no, :term_vector => :no)
  index = Ferret::Index::Index.new(:path => index_dir, :field_infos => field_infos)
  word2nums.each do |key, value|
    doc = {:word => key}
    n = index_word(word2nums, num2words, key, doc)
    if n > 0
      if ((row = row + 1) % mod) == 0
        puts "\nrow=#{row}/#{word2nums.size} doc=#{doc}"
        mod = mod * 2
      end
      index << doc
    end # else degenerate
  end
end

# Given 2 maps fills a document for 1 word
def index_word(word2nums, num2words, key, doc)
  words = []
  word2nums[key].each do |value|
    words << num2words[value] unless num2words[value].nil?
  end
  words.flatten!
  words.uniq!
  num = 0
  words.delete(key) # remove itself
  doc[:syn] = []
  words.each do |value|
    num = num + 1
    doc[:syn] << value
  end
  num
end
def usage
  puts "ruby wordnet_prolog_to_ferret.rb <prolog file> <index dir>"
end

if ARGV.size.eql? 2
  @prolog_filename = ARGV[0]
  @index_dir = ARGV[1]
else
  usage
  exit(1)
end

# make sure the prolog file is readable
unless File.readable?(@prolog_filename)
  puts "Error: cannot read Prolog file: #{@prolog_filename}"
  exit(1)
end

# exit if the target index directory already exists
# (File.exist? rather than File.exists?, which newer Ruby versions no longer provide)
if File.exist?(@index_dir)
  puts "Error: index directory already exists: #{@index_dir}"
  puts "Please specify a name of a non-existent directory"
  exit(1)
end
puts "Opening Prolog file #{@prolog_filename}"
File.open(@prolog_filename, "r") do |file|
  word2nums = {}
  num2words = {}
  rejected_words = 0
  mod = 1 # used for
  row = 1 # status updates
  puts "[1/2] Parsing #{@prolog_filename}"
  while (line = file.gets)
    # occasional progress
    if ((row = row + 1) % mod) == 0 # periodically print out line we read in
      mod = mod * 2
      puts "\n#{row} #{line} word2num size: #{word2nums.size} num2words size: #{num2words.size} rejected words=#{rejected_words}"
    end

    # syntax check
    unless line[0..1] == "s("
      puts "OUCH: #{line}"
      exit(1)
    end

    # parse line
    line = line[2..-4]
    line_parts = line.split(',')
    line_parts[2] = line_parts[2].slice(1..-2).downcase # trim single quotes off word

    # 1/2: word2nums map
    # append to entry or add new one
    lis = word2nums[line_parts[2]]
    if lis.nil?
      word2nums[line_parts[2]] = [line_parts[0]]
    else
      lis << line_parts[0]
    end

    # 2/2: num2words map
    lis = num2words[line_parts[0]]
    if lis.nil?
      num2words[line_parts[0]] = [line_parts[2]]
    else
      lis << line_parts[2]
    end
  end
  puts "\n[2/2] Building index to store synonyms, map sizes are #{word2nums.size} and #{num2words.size}"
  index(@index_dir, word2nums, num2words)
end
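To make the two maps easier to picture, here is the same parsing and grouping logic run over a handful of lines. The sample lines are invented for illustration (they follow the s(synset_id, w_num, 'word', ...) layout of wn_s.pl but are not real WordNet entries), and parse_line and synonyms_for are hypothetical helper names.

```ruby
# Made-up lines in the wn_s.pl layout, each ending in ")." and a newline.
SAMPLE_LINES = [
  "s(100000001,1,'ferret',n,1,0).\n",
  "s(100000001,2,'polecat',n,1,0).\n",
  "s(100000002,1,'ferret',v,1,0).\n",
  "s(100000002,2,'Ferret out',v,1,0).\n"
]

# Returns [synset number, lower-cased word] for one line.
def parse_line(line)
  parts = line[2..-4].split(',') # drop the leading "s(" and the trailing ").\n"
  [parts[0], parts[2].slice(1..-2).downcase] # trim the single quotes off the word
end

word2nums = Hash.new { |hash, key| hash[key] = [] }
num2words = Hash.new { |hash, key| hash[key] = [] }
SAMPLE_LINES.each do |line|
  num, word = parse_line(line)
  word2nums[word] << num
  num2words[num] << word
end

# Every word sharing a synset with the given word, minus the word itself.
def synonyms_for(word, word2nums, num2words)
  words = word2nums.fetch(word, []).flat_map { |num| num2words.fetch(num, []) }
  words.uniq - [word]
end

puts word2nums.inspect
puts synonyms_for('ferret', word2nums, num2words).inspect
```

Like the script above, this splits on commas, so a word that itself contains a comma would be cut short.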
YAML Version
For completeness' sake here is a quick version thrown together to create a YAML version of the wordnet database. Due to the length of the code I'm only including the relevant methods that have changed. This script will take some time to complete and will use a lot of resources (over 250 megs of memory to create). The resultant YAML file will require a little over 50 megs of memory when loaded for searching.
The file that is output has a different format than the example YAML file listed above but it works exactly the same.
require 'yaml' # instead of require 'ferret'

def index(index_dir, word2nums, num2words)
  row = 0
  mod = 1
  doc = {}
  word2nums.each do |key, value|
    n = index_word(word2nums, num2words, key, doc)
    if n > 0
      if ((row = row + 1) % mod) == 0
        puts "\nrow=#{row}/#{word2nums.size} doc_count=#{doc.size}"
        mod = mod * 2
      end
    end
  end
  File.open(index_dir, 'w') do |out|
    YAML.dump(doc, out)
  end
end

# Given 2 maps fills a document for 1 word
def index_word(word2nums, num2words, key, doc)
  words = []
  word2nums[key].each do |value|
    words << num2words[value] unless num2words[value].nil?
  end
  words.flatten!
  words.uniq!
  num = 0
  words.delete(key) # remove itself
  doc[key] = [] if words.size > 0
  words.each do |value|
    num = num + 1
    doc[key] << value
  end
  num
end
Integrating with acts_as_ferret and Rails
Using this with Rails and acts_as_ferret is easy. Store these files in your "#{RAILS_ROOT}/lib" directory so they are loaded by the Rails system when it starts up.
Then modify one of your existing aaf enabled models similar to the following:
class Test < ActiveRecord::Base
  acts_as_ferret(
    :fields => [:your, :fields, :here],
    :store_class_name => true,
    :ferret => {
      :or_default => false,
      :analyzer => SynonymAnalyzer.new(
        # YAMLSynonymEngine.new("#{RAILS_ROOT}/extras/synonyms.yaml"), [])
        WordnetSynonymEngine.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/wordnet"), [])
    }
  )
end
Move the indexes you created in the section above into the relevant areas. Delete the engine reference you don't want to use. Then you're all set up.
Outstanding Issues
There are still some issues that need to be taken care of:
1. You cannot do a synonym based search for words with spaces yet. Since the tokenizer breaks words up by spaces it will not find these in the index.
2. Currently aaf doesn't support different analyzers for searching/indexing, so doing this actually causes the synonym insertion to be done twice (once during indexing and another time during query generation). Really it's only needed once: during indexing if you want it more transparent to the user, or during query generation if you want to be able to give the user control of when to search for synonyms. More on this second option in a minute.
3. Currently I get errors when trying to use this with a ferret index running on DRb.
Above I mentioned allowing the user to control the search for synonyms. I was considering a construct of "%{word or words}" to add to the grammar. This would give the user the ability to do "rabbits %{ferret}" and the resulting query would look like:
rabbits ferret|"black-footed ferret"|"mustela nigripes"|"ferret out"
That would actually solve issues 1 and 2 above, since enclosing the synonym search in curly braces would allow for multi-word synonyms. It would also remove the need for indexing your documents upon insertion into the database, keeping the size of the index down as well.
None of that has been done as of yet, so for now the synonym searching is not as robust as it will hopefully become.
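Until the grammar supports it, one way to experiment with the idea is to expand the "%{...}" construct in the query string before it reaches the query parser. The sketch below is only that, an experiment. expand_synonyms and HashSynonymEngine are hypothetical names, the engine is a plain Hash wrapper standing in for either engine above, and the output simply follows the query format shown earlier in this section.

```ruby
# Stand-in engine with the same get_synonyms contract as the engines above.
class HashSynonymEngine
  def initialize(table)
    @table = table
  end

  def get_synonyms(word)
    @table[word]
  end
end

# Replaces every %{word or words} with the word and its synonyms joined by "|".
def expand_synonyms(query, engine)
  query.gsub(/%\{([^}]*)\}/) do
    word = $1.strip
    terms = [word] + (engine.get_synonyms(word) || [])
    # Multi-word terms are quoted so they stay together as phrases.
    terms.map { |term| term.include?(' ') ? %("#{term}") : term }.join('|')
  end
end

engine = HashSynonymEngine.new(
  'ferret' => ['black-footed ferret', 'mustela nigripes', 'ferret out']
)
puts expand_synonyms('rabbits %{ferret}', engine)
```

A word the engine does not know is left as the bare word, so the query still parses.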
