= My first benchmark: =
[[PageOutline]]

''Jan Prill''

I remember the day I read on Brian's Waste of Time about Dave Balmain's ongoing effort to port Lucene to ruby. Man, this was '''great''' news: full-text search with ruby and rails!! Since then a lot of work has gone into the library, and efforts like acts_as_ferret for integration with rails have turned up.

One of the '''most important things about search is the performance of the search engine'''. The search feature of an application will only be a success if the user is able to try out different search terms while getting '''immediate results'''. For huge amounts of data the indexing speed is of great importance as well.

As Dave wrote himself in a mail thread, he was at first a little too optimistic and aggressive in his predictions of how fast he would be able to make ferret complete and performant. It seems he learned a lesson at that time. Now he's leaving it to the community to check out the performance and compare it to java-lucene and other search engines. I think this is an '''honorable move''' by Dave, so I've tried to provide a first benchmark. As it turns out - regardless of how flawed this benchmark is - Dave did a great job. Once again it's time to say thank you, Dave, for providing a high-performance port of a great search library!!

'''As an important disclaimer''' I would like to say that I'm not an expert in benchmarking. I've read quite a lot of benchmarks, but this is the first one that I'm actually doing myself. I haven't made any optimization efforts on any of the tools but used them as they came out of the box. I've only read the first lines of the "how to get started" documentation on java-lucene, hyperestraier and ferret.
These were:

'''ferret 0.9.3''': http://kasparov.skife.org/blog/src/ruby/ferret.html [[BR]]
'''java-lucene 1.9.1''': http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/SearchTest.java?rev=150494&view=markup and http://lucene.apache.org/java/docs/gettingstarted.html [[BR]]
'''hyperestraier 1.2.5''': http://hyperestraier.sourceforge.net/intro-en.html [[BR]]

'''Please''': If you are an expert in one of these search engines, then provide some information about the best optimizations. I'm pretty sure that I've done something wrong with hyperestraier. This shouldn't be the start of a search-engine "war" but an effort to get the best out of each search engine. Therefore I'm describing what I did as exactly as I can. Please jump in, especially if you are a hyperestraier expert, so that we can get some better numbers for hyperestraier. Once again: Please correct my mistakes! This isn't meant to offend any of the developers or users of any of the search engines!!!

== advice provided by readers ==
Marvin Humphrey was the first one who actually followed my encouragement to submit some advice on optimizing one of lucene, ferret or hyperestraier (advice for other search engines is welcome as well, by the way). I'll take these tips into account when I do the next round of tests. If you are interested in doing benchmarks yourself, then don't miss the optimizations that Marvin describes in: http://www.ruby-forum.com/topic/65565 .

== used hardware & software: ==
hyperestraier 1.2.5 [[BR]]
java-lucene 1.9.1 [[BR]]
ferret 0.9.3

The hardware I've used is a small webserver.
The following outputs should give you an idea:

uname -a[[BR]]
Linux 2.6.16-gentoo-r6 no 1 Thu May 4 10:40:34 CEST 2006 i686 Intel(R) Celeron(R) CPU 2.40GHz GNU/Linux

Tasks: 48 total, 2 running, 46 sleeping, 0 stopped, 0 zombie[[BR]]
Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.3% si[[BR]]
Mem: 515340k total, 499816k used, 15524k free, 36784k buffers[[BR]]
Swap: 2562356k total, 70096k used, 2492260k free, 411860k cached[[BR]]

java -version[[BR]]
java version "1.5.0_06"[[BR]]
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)[[BR]]
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode)[[BR]]

ruby -v[[BR]]
ruby 1.8.4 (2005-12-24) [i686-linux][[BR]]

gcc -v[[BR]]
/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/specs[[BR]]
/var/tmp/portage/gcc-3.4.6-r1/work/gcc-3.4.6/configure --prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/3.4.6 --includedir=/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/include --datadir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6 --mandir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6/man --infodir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6/info --with-gxx-include-dir=/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/include/g++-v3 --host=i686-pc-linux-gnu --build=i686-pc-linux-gnu --disable-altivec --enable-nls --without-included-gettext --with-system-zlib --disable-checking --disable-werror --disable-libunwind-exceptions --disable-multilib --disable-libgcj --enable-languages=c,c++,f77 --enable-shared --enable-threads=posix {{{--enable-__cxa_atexit}}} --enable-clocale=gnu[[BR]]
Thread-Modell: posix[[BR]]
gcc-Version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)[[BR]]

== indexing: ==
Readers of the ferret and rails mailing lists are aware that there are people who need to index gigabytes of documents. When this is your task, indexing speed gets pretty important.

=== text material: ===
I've used text material from Project Gutenberg. This is a great place for getting loads of text.
You may download the same package that I did at: ftp://ibiblio.org/pub/docs/books/gutenberg/1/1/2/2/11220/PG2003-08_files.zip.

md5sum PG2003-08_files.zip [[BR]]
c09a4db30b099f479da5981aae3ccc51 PG2003-08_files.zip

I've extracted only the *.txt files to a single directory by using: [[BR]]
unzip -Cj PG2003-08_files.zip "*.txt" -d .

The total size of the text files in that directory was about 408 MB after extraction: [[BR]]
du -hs . [[BR]]
408M

=== indexing with ferret: ===
For the indexing process with ferret I've used a short script that was inspired by http://kasparov.skife.org/blog/src/ruby/ferret.html. You may look at the source code at BenchIndexWithFerret.

2316 janprill 18 0 50804 47m 2448 R 76.6 9.4 0:18.85 ruby // ranging from 75 - 90 % CPU usage.

first run: start: Fr Mai 12 16:04:24 CEST 2006, end: Fr Mai 12 16:05:38 CEST 2006 ~ 0h 1m 14s

''using time command:'' [[BR]]
''first run:'' [[BR]]
time ruby ferret_test.rb[[BR]]
real 1m9.647s[[BR]]
user 0m56.048s[[BR]]
sys 0m2.344s[[BR]]
''second run:'' [[BR]]
real 1m9.224s[[BR]]
user 0m55.739s[[BR]]
sys 0m2.252s[[BR]]

=== indexing with hyperestraier: ===
2347 janprill 18 0 77992 68m 3768 R 95.5 13.5 2:00.98 estcmd // ranging from 85 - 99.9 % CPU usage

''first run'' hyperestraier 1.2.3: estcmd: INFO: finished successfully: elapsed time: 0h 4m 6s[[BR]]
''second run'' changing to he 1.2.5: /usr/local/bin/estcmd: INFO: finished successfully: elapsed time: 0h 3m 34s

''using time command (continuing with he 1.2.5):'' [[BR]]
''first run:'' [[BR]]
time /usr/local/bin/estcmd gather -sd hyperestraierindex files[[BR]]
real 3m30.970s[[BR]]
user 3m16.100s[[BR]]
sys 0m3.268s[[BR]]
''second run:'' [[BR]]
real 3m31.795s[[BR]]
user 3m16.900s[[BR]]
sys 0m3.220s

=== indexing with java-lucene: ===
2452 root 25 0 229m 18m 6936 R 99.9 3.7 0:27.95 java // ranging from 99.1 - 99.9 % CPU usage

''first run'': 130102 total milliseconds ~ 0h 2m 10s

''using time command:'' [[BR]]
''first run:'' [[BR]]
time java -classpath ./lucene-core-1.9.1.jar:./lucene-demos-1.9.1.jar org.apache.lucene.demo.IndexFiles /home/janprill/searchtest/files[[BR]]
real 2m14.564s[[BR]]
user 2m6.420s[[BR]]
sys 0m4.972s[[BR]]
''second run:'' [[BR]]
real 2m22.966s[[BR]]
user 2m15.112s[[BR]]
sys 0m4.804s[[BR]]

=== index-size: ===
The indexes of ferret and lucene aren't the same size (19 MB versus 16 MB). I don't know whether the lucene demo jars use only the SimpleAnalyzer; maybe this is the reason. Both "lucene" indexes are quite small compared to the hyperestraier index (73 MB). Maybe this has something to do with the flags used on the he gatherer. I don't know...

du -hs ferretindex/ hyperestraierindex/ index/[[BR]]
19M ferretindex/[[BR]]
73M hyperestraierindex/[[BR]]
16M index/[[BR]]

-----------------------------------

== searching: ==
=== checking results count: ===
'''hyperestraier:'''

'english': 453[[BR]]
'gutenberg': 597[[BR]]
'yesterday': 215[[BR]]
'together': 545[[BR]]
'america': 181[[BR]]
'advanced': 284[[BR]]
'president': 118[[BR]]
'bunny': 1[[BR]]

'''ferret:'''

'english': 448[[BR]]
'gutenberg': 597[[BR]]
'yesterday': 198[[BR]]
'together': 537[[BR]]
'america': 176[[BR]]
'advanced': 260[[BR]]
'president': 114[[BR]]
'bunny': 1[[BR]]

'''lucene:'''

'english': 433[[BR]]
'gutenberg': 597[[BR]]
'yesterday': 157[[BR]]
'together': 524[[BR]]
'america': 163[[BR]]
'advanced': 222[[BR]]
'president': 103[[BR]]
'bunny': 1[[BR]]

=== search performance: ===
To test search performance I've written a '''loop that searches for all of the terms above'''. Because all of these search engines are pretty fast, I'm '''running this search loop a thousand times'''. Interestingly, ferret and java-lucene hit the CPU quite hard (60-70 % CPU utilization), while hyperestraier itself seems to use very little CPU (around 4 %) but loads the system heavily. This once again indicates that I'm doing something wrong with hyperestraier. Hopefully a he expert jumps right in.
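The search loop described above can be sketched roughly as follows. This is only a minimal sketch and not the code from BenchSearchWithFerret: `toy_search` and `TOY_INDEX` are hypothetical stand-ins (a tiny in-memory hash) so that the script runs without any search engine installed. To time a real engine, the block passed to `run_search_benchmark` would call that engine's search function instead.

```ruby
require 'benchmark'

# The eight search terms from the results-count section.
TERMS = %w[english gutenberg yesterday together america advanced president bunny].freeze
ITERATIONS = 1_000

# Hypothetical stand-in for a real index: term => list of document ids.
TOY_INDEX = {
  'english'   => [1, 2],
  'gutenberg' => [1, 2, 3],
  'bunny'     => [3]
}.freeze

# Hypothetical stand-in for an engine's search call.
def toy_search(term)
  TOY_INDEX.fetch(term, [])
end

# Searches every term `iterations` times using the given block and
# returns [elapsed wall-clock seconds, total number of hits].
def run_search_benchmark(terms, iterations)
  total_hits = 0
  elapsed = Benchmark.realtime do
    iterations.times do
      terms.each { |term| total_hits += yield(term).size }
    end
  end
  [elapsed, total_hits]
end

elapsed, hits = run_search_benchmark(TERMS, ITERATIONS) { |term| toy_search(term) }
queries = TERMS.size * ITERATIONS
puts format('%d queries, %d hits, %.3f s', queries, hits, elapsed)
```

Counting the hits inside the loop without printing them corresponds to the "no outputs" runs; printing each hit inside the block would correspond to the runs with output.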
I saw quite some CPU utilization by sshd while the search engines printed the id and score of each found document. Because of this I've eliminated the output in one test by commenting out the puts and System.out.println calls (ferret and java-lucene) or by sending the output of estcmd to /dev/null. You might have a look at the sources: BenchSearchWithHe | BenchSearchWithFerret | BenchSearchWithLucene and the commented-out lines in there.

Once again: I'm quite sure that I'm doing things wrong with hyperestraier. I think estcmd opens and closes the index each time. But I'm leaving it to the hyperestraier experts to fix this. Of course you could give me some hints and I'll then apply them to my little test case. Obviously using the ruby bindings would be great for a direct comparison with ferret. Maybe I'll do this myself in the next few days.

=== searching with ferret: ===
''no outputs:'' [[BR]]
''first run:'' [[BR]]
time ruby search_ferret.rb[[BR]]
real 0m24.330s[[BR]]
user 0m15.681s[[BR]]
sys 0m6.240s[[BR]]
''second run:'' [[BR]]
real 0m20.937s[[BR]]
user 0m15.397s[[BR]]
sys 0m5.520s[[BR]]

''with puts:'' [[BR]]
''first run:'' [[BR]]
real 0m33.276s[[BR]]
user 0m19.753s[[BR]]
sys 0m7.508s[[BR]]
''second run:'' [[BR]]
real 0m32.841s[[BR]]
user 0m20.061s[[BR]]
sys 0m7.284s[[BR]]

=== searching with hyperestraier: ===
''outputs to /dev/null:'' [[BR]]
''first run:'' [[BR]]
time sh search_he.sh[[BR]]
real 1m13.886s[[BR]]
user 0m34.194s[[BR]]
sys 0m39.590s[[BR]]
''second run:'' [[BR]]
real 1m14.292s[[BR]]
user 0m34.406s[[BR]]
sys 0m39.718s[[BR]]

''with outputs to stdout:'' [[BR]]
''first run:'' [[BR]]
real 1m30.498s[[BR]]
user 0m36.766s[[BR]]
sys 0m42.255s[[BR]]
''second run:'' [[BR]]
real 1m30.240s[[BR]]
user 0m36.906s[[BR]]
sys 0m42.155s[[BR]]

=== searching with java-lucene: ===
''no outputs:'' [[BR]]
''first run:'' [[BR]]
time java -classpath ./lucene-demos-1.9.1.jar:./lucene-core-1.9.1.jar:. SearchLucene[[BR]]
real 0m5.207s[[BR]]
user 0m4.584s[[BR]]
sys 0m0.528s[[BR]]
''second run:'' [[BR]]
real 0m5.083s[[BR]]
user 0m4.488s[[BR]]
sys 0m0.540s[[BR]]

''with System.out.println's:'' [[BR]]
''first run:'' [[BR]]
real 0m21.769s[[BR]]
user 0m10.033s[[BR]]
sys 0m2.740s[[BR]]
''second run:'' [[BR]]
real 0m21.332s[[BR]]
user 0m9.741s[[BR]]
sys 0m2.644s[[BR]]

== conclusions: ==
While the pure ruby version of ferret was orders of magnitude slower than the java version, '''this isn't the case any longer'''. All tested search engines are quite fast. Java-lucene is a de facto standard for open source search and is used on countless sites. It is also the base for nutch, an industrial-strength crawler and search engine. With ferret there is a port of the lucene standard to ruby that indexes even faster than java-lucene (about twice as fast in these runs: roughly 1m10s versus 2m15s to 2m23s) and searches in the same order of magnitude (without output about 21 to 24 s versus about 5 s for java-lucene, with output about 33 s versus about 21 s).

Taking into account that hyperestraier - as well as ferret and lucene - could be heavily optimized, it's a strong contender. There are papers as well as a thread on hyperestraier's mailing list stating that hyperestraier is four times faster at indexing than lucene. Unfortunately I can't recall the URL of this thread, and the mailing list of hyperestraier isn't archived on sourceforge. Once again: Maybe a hyperestraier expert can jump in to provide numbers for a better optimized hyperestraier...

Regards

Jan Prill

== comments ==
anonymous: I'd like to point out that ferret is not multithreaded, and it's not really fair to compare java-lucene (with all of its threading overheads) to ferret which has threading disabled. Threading adds a lot of overhead and this fact is not mentioned in this benchmark.

Jan Prill: Hey anon! I've encouraged readers to state their opinion and yours is perfectly fine. But: I think it is a matter of respect to post comments like these not anonymously but at least with a webname and / or an email address.
Maybe you'll let people know who you are next time you post... it's so much more fun to discuss when you know at least something about your vis-à-vis. Cheers, Jan

David Balmain: Ferret is multithreaded and the benchmarks are completely fair. If anything, they err in favour of Java Lucene, as the Lucene benchmark is only adding *.txt documents while the Ferret benchmark is adding all documents to the index. Not to mention the fact that there have been some rather large performance improvements since Ferret 0.9. Still, I'd prefer the error to be Ferret's way in case there are complaints about the validity of the benchmarks (like this one).