Changes between Version 40 and Version 41 of HowTos


Timestamp: 02/06/08 19:47:24 (3 years ago)
Author: dbalmain
Comment: Corrected some spelling mistakes. Scanned for spam. Rolled back some spammed versions of the page.

  • HowTos

    v40 v41  
    21  21   http://aslakhellesoy.com/articles/2005/11/18/using-ferret-with-activerecord
    22  22
    23       === Howto integrate Ferret with rails on the rail wiki (a little outdated) ===
        23   === How to integrate Ferret with rails on the rail wiki (a little outdated) ===
    24  24
    25  25   Jan Prill has written up a great howto on integrating ferret with rails. You can check it out here:
     
    97  97   == How to use the C Indexer ==
    98  98
    99       Currently cFerret isn't really ready for release and it'll only run on linux (compile and pass the tests on Mac OS X v.4), but it is definitely useable if you know your way around C. If you want to index a lot of documents, it may be worth looking into. Indexing all of the text documents on my PC took over 20 minutes with Java Lucene and less than a minute with cFerret. And the indexes were identical. To get cFerret, you'll need subversion;
        99   Currently cFerret isn't really ready for release and it'll only run on linux (compile and pass the tests on Mac OS X v.4), but it is definitely usable if you know your way around C. If you want to index a lot of documents, it may be worth looking into. Indexing all of the text documents on my PC took over 20 minutes with Java Lucene and less than a minute with cFerret. And the indexes were identical. To get cFerret, you'll need subversion;
    100 100
    101 101  {{{
     
    109 109  == How to use Index::Index in Multi-Threaded Applications ==
    110 110
    111      The Index::Index class it threadsafe so it should run well in threaded applications. One thing to note is that document numbers are ephemeral, ie they may change
        111  The Index::Index class it thread-safe so it should run well in threaded applications. One thing to note is that document numbers are ephemeral, ie they may change
    112 112  as an index updated.  Clients should thus not rely on a given document having the same number between requests. There are two possible solutions to this. You can synchronize on the index;
    113 113
     
    159 159  == How to not use the main Index::Index class ==
    160 160
    161      Right now it is fairly simple to use the Index::Index class. It handles most of the index updating and locking for you. The problem is, it is doing a lot of extra work to make sure that you are always searching on the latest index. It is actually a lot more efficiant to have one object for updating the index and as many others as you like for searching the index. This gives you more control on what is going on in the index and leads to greater efficiency. These are, Index::IndexReader, Index::IndexWriter, and Index::IndexSearcher. I'll cover the searcher here first, and most others will follow suite.
        161  Right now it is fairly simple to use the Index::Index class. It handles most of the index updating and locking for you. The problem is, it is doing a lot of extra work to make sure that you are always searching on the latest index. It is actually a lot more efficient to have one object for updating the index and as many others as you like for searching the index. This gives you more control on what is going on in the index and leads to greater efficiency. These are, Index::IndexReader, Index::IndexWriter, and Index::IndexSearcher. I'll cover the searcher here first, and most others will follow suite.
    162 162
    163 163  {{{
     
    200 200  == How to use keys for document ==
    201 201  Ferret contains very useful concept of document keys. You could think about the key like as document field that unique across the index.
    202      Ok. Some code could help you undertand a bit more. Let's imaging that we want to index Document object.
        202  Ok. Some code could help you understand a bit more. Let's imaging that we want to index Document object.
    203 203  {{{
    204 204  #!ruby
     
    269 269  == How to crawl internet-sites, an intranet or the filesystem and index the crawled documents - RDig ==
    270 270
    271      Jens Kraemer came up with a great tool for crawling documents that reside on the internet, your intranet or the filesystem. Have a look at [http://rdig.rubyforge.org/ RDig]: RDig provides an HTTP crawler and content extraction utilities to help building a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, the index can be built with a single call to rdig.
        271  Jens Kraemer came up with a great tool for crawling documents that reside on the internet, your intranet or the file-system. Have a look at [http://rdig.rubyforge.org/ RDig]: RDig provides an HTTP crawler and content extraction utilities to help building a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, the index can be built with a single call to rdig.
    272 272
    273 273  -----------------
     
    342 342  The !SynonymAnalyzer is fairly simple, like most analyzers. It is very similar to the !StandardAnalyzer except for a few exceptions noted below.
    343 343
    344      A synonym engine must be supplied to the analyzer. The engine is required to do the lookup of a word and return the resulting synonyms. The !SynonymAnalyzer also requires a !SynonymTokenFilter that does most of the work and actually makes the calls to the specified synonym engine. Finally, unlike the !StandardAnalyzer this class does not run tokens through the !HyphenFilter because if there are hypenated words that have synonyms, it would be nice to capture those.
        344  A synonym engine must be supplied to the analyzer. The engine is required to do the lookup of a word and return the resulting synonyms. The !SynonymAnalyzer also requires a !SynonymTokenFilter that does most of the work and actually makes the calls to the specified synonym engine. Finally, unlike the !StandardAnalyzer this class does not run tokens through the !HyphenFilter because if there are hyphenated words that have synonyms, it would be nice to capture those.
    345 345
    346 346  {{{
     
    482 482  ==== Ferret version ====
    483 483
    484      This code is a port of the Syns2Index.java program into ruby with only a few minor changes to how it works. I did not want to exclude words with spaces in them so I removed any logic for that, and obviously I changed it so that it builds a ferret index instead of a lucene index.
        484  This code is a port of the Syns2Index.java program into ruby with only a few minor changes to how it works. I did not want to exclude words with spaces in them so I removed any logic for that, and obviously I changed it so that it builds a ferret index instead of a Lucene index.
    485 485
    486 486  To use this script download the [http://wordnet.princeton.edu/obtain prolog wordnet database] and extract it. Run the script without any arguments to see the usage. The file you will want to use is 'wn_s.pl'.
     
    686 686
    687 687
    688      === Oustanding Issues ===
        688  === Outstanding Issues ===
    689 689
    690 690  There are still some issues that need to be taken care of:
     
    699 699  {{{ rabbits ferret|"black-footed ferret"|"mustela nigripes"|"ferret out" }}}
    700 700
    701      That would actually solve issue 1 and issue 2 above, since by enclosing the synonym search in french braces would allow for multi-word synonyms. It would also remove the need for indexing your documents upon insertion into the database keeping the size of the index down as well.
    702
    703      None of that has been done as of yet, so for now the synonym searching is not as robust as it will hopefuly become.
        701  That would actually solve issue 1 and issue 2 above, since by enclosing the synonym search in French braces would allow for multi-word synonyms. It would also remove the need for indexing your documents upon insertion into the database keeping the size of the index down as well.
        702
     703None of that has been done as of yet, so for now the synonym searching is not as robust as it will hopefully become.