Changes between Version 40 and Version 41 of HowTos
- Timestamp: 02/06/08 19:47:24
HowTos
http://aslakhellesoy.com/articles/2005/11/18/using-ferret-with-activerecord

=== How to integrate Ferret with Rails on the Rails wiki (a little outdated) ===

Jan Prill has written up a great howto on integrating Ferret with Rails. You can check it out here:

…

== How to use the C Indexer ==

Currently cFerret isn't really ready for release and it'll only run on Linux (compile and pass the tests on Mac OS X v.4), but it is definitely usable if you know your way around C. If you want to index a lot of documents, it may be worth looking into. Indexing all of the text documents on my PC took over 20 minutes with Java Lucene and less than a minute with cFerret, and the indexes were identical. To get cFerret, you'll need Subversion:

{{{
…

== How to use Index::Index in Multi-Threaded Applications ==

The Index::Index class is thread-safe, so it should run well in threaded applications. One thing to note is that document numbers are ephemeral, i.e. they may change as an index is updated. Clients should thus not rely on a given document having the same number between requests. There are two possible solutions to this.
You can synchronize on the index;

…

== How to not use the main Index::Index class ==

Right now it is fairly simple to use the Index::Index class. It handles most of the index updating and locking for you. The problem is that it does a lot of extra work to make sure that you are always searching on the latest index. It is actually a lot more efficient to have one object for updating the index and as many others as you like for searching the index. This also gives you more control over what is going on in the index. The classes to use for this are Index::IndexReader, Index::IndexWriter, and Index::IndexSearcher. I'll cover the searcher here first, and most others will follow suit.

{{{
…

== How to use keys for documents ==
Ferret contains a very useful concept of document keys. You can think of a key as a document field that is unique across the index.
Ok. Some code could help you understand a bit more. Let's imagine that we want to index a Document object.
{{{
#!ruby
…

== How to crawl internet sites, an intranet or the filesystem and index the crawled documents - RDig ==

Jens Kraemer came up with a great tool for crawling documents that reside on the internet, your intranet or the file system. Have a look at [http://rdig.rubyforge.org/ RDig]: RDig provides an HTTP crawler and content extraction utilities to help build a site search for web sites or intranets. Internally, Ferret is used for the full-text indexing. After creating a config file for your site, the index can be built with a single call to rdig.

-----------------
…

The !SynonymAnalyzer is fairly simple, like most analyzers. It is very similar to the !StandardAnalyzer, with a few exceptions noted below.

A synonym engine must be supplied to the analyzer. The engine is required to do the lookup of a word and return the resulting synonyms.
The !SynonymAnalyzer also requires a !SynonymTokenFilter that does most of the work and actually makes the calls to the specified synonym engine. Finally, unlike the !StandardAnalyzer, this class does not run tokens through the !HyphenFilter, because if there are hyphenated words that have synonyms, it would be nice to capture those.

{{{
…

==== Ferret version ====

This code is a port of the Syns2Index.java program into Ruby with only a few minor changes to how it works. I did not want to exclude words with spaces in them, so I removed any logic for that, and obviously I changed it so that it builds a Ferret index instead of a Lucene index.

To use this script, download the [http://wordnet.princeton.edu/obtain prolog wordnet database] and extract it. Run the script without any arguments to see the usage. The file you will want to use is 'wn_s.pl'.

…


=== Outstanding Issues ===

There are still some issues that need to be taken care of:

…

{{{ rabbits ferret|"black-footed ferret"|"mustela nigripes"|"ferret out" }}}

That would actually solve issue 1 and issue 2 above, since enclosing the synonym search in French braces would allow for multi-word synonyms.
It would also remove the need for indexing your documents upon insertion into the database, keeping the size of the index down as well.

None of that has been done as of yet, so for now the synonym searching is not as robust as it will hopefully become.
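
As a rough illustration of the query form proposed above, the following plain-Ruby sketch only builds such a query string. It does not touch Ferret at all, and parsing a query like this is exactly the part that has not been implemented yet. The names SYNONYMS, quote_phrase and expand_with_synonyms are hypothetical, and the hard-coded hash merely stands in for a real synonym engine.

{{{
#!ruby
# Hypothetical stand-in for a synonym engine (e.g. a WordNet-backed lookup).
# The entries mirror the example query shown above.
SYNONYMS = {
  "ferret" => ["black-footed ferret", "mustela nigripes", "ferret out"]
}.freeze

# Wrap multi-word alternatives in double quotes so they stay together.
def quote_phrase(text)
  text.include?(" ") ? %("#{text}") : text
end

# Expand every whitespace-separated term into "term|synonym|synonym...".
# Terms without known synonyms are passed through unchanged.
def expand_with_synonyms(query, synonyms = SYNONYMS)
  query.split.map do |term|
    alternatives = [term] + synonyms.fetch(term.downcase, [])
    alternatives.map { |alt| quote_phrase(alt) }.join("|")
  end.join(" ")
end

puts expand_with_synonyms("rabbits ferret")
# prints: rabbits ferret|"black-footed ferret"|"mustela nigripes"|"ferret out"
}}}

Expanding at query time like this is what would keep synonyms out of the index itself; the lookup table could be swapped for any object that responds to fetch in the same way.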
