My first benchmark:

Jan Prill

I remember the day I read on Brian's Waste of Time about Dave Balmain's ongoing effort to port Lucene to ruby. Man, this was great news: full-text search with ruby and rails! Since then a lot of work has gone into the library, and efforts like acts_as_ferret for integration with rails have turned up. One of the most important things about search is the performance of the search engine. The search feature of an application will only be a success if the user can try out different search terms and get immediate results. For huge amounts of data the indexing speed is of great importance as well.

As Dave wrote himself in a mail thread, he was at first a little too optimistic and aggressive in his predictions of how quickly he would be able to make ferret complete and performant. It seems he learned a lesson from that. Now he's leaving it to the community to check the performance and compare it to java-lucene and other search engines. I think this is an honorable move on Dave's part, so I've tried to provide a first benchmark. As it turns out, however flawed this benchmark may be, Dave did a great job. Once again it's time to say thank you, Dave, for providing a high-performance port of a great search library!

As an important disclaimer I would like to say that I'm not an expert in benchmarking. I've read quite a lot of benchmarks, but this is the first one that I've actually done myself. I haven't made any optimization efforts on any of the tools but used them as they came out of the box. I've only read the first lines of the "how to get started" documentation for java-lucene, hyperestraier and ferret. These were:

ferret 0.9.3: http://kasparov.skife.org/blog/src/ruby/ferret.html
java-lucene 1.9.1: http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/SearchTest.java?rev=150494&view=markup and http://lucene.apache.org/java/docs/gettingstarted.html
hyperestraier 1.2.5: http://hyperestraier.sourceforge.net/intro-en.html

Please: if you are an expert in one of these search engines, then provide some information about the best optimizations. I'm pretty sure that I've done something wrong with hyperestraier. This shouldn't be the start of a search-engine "war" but an effort to get the best out of each search engine. Therefore I'm describing what I did as exactly as I can. Please jump in, especially if you are a hyperestraier expert, so that we can get some better numbers for hyperestraier. Once again: please correct my mistakes! This isn't meant to offend any of the developers or users of any of the search engines!

advice provided by readers

Marvin Humphrey was the first to actually follow my encouragement and submit some advice on optimizing lucene, ferret or hyperestraier (advice for other search engines is welcome as well, by the way). I'll take these tips into account when I do the next round of tests. If you are interested in doing benchmarks yourself, then don't miss the optimizations that Marvin describes in: http://www.ruby-forum.com/topic/65565 .

used hardware & software:

hyperestraier 1.2.5, java-lucene 1.9.1, ferret 0.9.3

The hardware I've used is a small webserver. The following outputs should give you an idea:

uname -a
Linux 2.6.16-gentoo-r6 no 1 Thu May 4 10:40:34 CEST 2006 i686 Intel(R) Celeron(R) CPU 2.40GHz GNU/Linux

Tasks: 48 total, 2 running, 46 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.3% si
Mem: 515340k total, 499816k used, 15524k free, 36784k buffers
Swap: 2562356k total, 70096k used, 2492260k free, 411860k cached

java -version
java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode)

ruby -v
ruby 1.8.4 (2005-12-24) [i686-linux]

gcc -v
/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/specs
/var/tmp/portage/gcc-3.4.6-r1/work/gcc-3.4.6/configure --prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/3.4.6 --includedir=/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/include --datadir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6 --mandir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6/man --infodir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6/info --with-gxx-include-dir=/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/include/g++-v3 --host=i686-pc-linux-gnu --build=i686-pc-linux-gnu --disable-altivec --enable-nls --without-included-gettext --with-system-zlib --disable-checking --disable-werror --disable-libunwind-exceptions --disable-multilib --disable-libgcj --enable-languages=c,c++,f77 --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
Thread model: posix
gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)

indexing:

Readers of the ferret and rails mailing lists are aware that there are people who need to index gigabytes of documents. When that is your task, indexing speed becomes pretty important.

text material:

I've used text material from Project Gutenberg. This is a great place for getting loads of text. You may download the same package that I did at: ftp://ibiblio.org/pub/docs/books/gutenberg/1/1/2/2/11220/PG2003-08_files.zip.

md5sum PG2003-08_files.zip
c09a4db30b099f479da5981aae3ccc51 PG2003-08_files.zip
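
If you would rather check the download from ruby than with md5sum, something like the following should do. This is only a minimal sketch using ruby's standard Digest library; the helper name md5_matches? is made up for this example, and the default file name assumes the archive sits in the current directory.

```ruby
require "digest/md5"

# Checksum of PG2003-08_files.zip as printed by md5sum above.
EXPECTED_MD5 = "c09a4db30b099f479da5981aae3ccc51"

# Hypothetical helper: true when the file's MD5 hex digest equals `expected`.
def md5_matches?(path, expected)
  Digest::MD5.file(path).hexdigest == expected.downcase
end

if __FILE__ == $PROGRAM_NAME
  # Assumption: the archive is in the current directory unless a path is given.
  archive = ARGV[0] || "PG2003-08_files.zip"
  if File.exist?(archive)
    puts(md5_matches?(archive, EXPECTED_MD5) ? "checksum OK" : "checksum MISMATCH")
  else
    warn "#{archive} not found"
  end
end
```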

I've extracted only the *.txt files to a single directory by using:
unzip -Cj PG2003-08_files.zip "*.txt" -d .

After extraction the total size of the text files in that directory was about 408 MB:
du -hs .
408M

indexing with ferret:

For the indexing process with ferret I've used a short script that was inspired by http://kasparov.skife.org/blog/src/ruby/ferret.html. You may look at the source code at BenchIndexWithFerret.

2316 janprill 18 0 50804 47m 2448 R 76.6 9.4 0:18.85 ruby // ranging from 75 - 90 % CPU usage.

first run: start: Fri May 12 16:04:24 CEST 2006, end: Fri May 12 16:05:38 CEST 2006 ~ 0h 1m 14s

using time command:
first run:
time ruby ferret_test.rb
real 1m9.647s
user 0m56.048s
sys 0m2.344s

second-run:
real 1m9.224s
user 0m55.739s
sys 0m2.252s

indexing with hyperestraier:

2347 janprill 18 0 77992 68m 3768 R 95.5 13.5 2:00.98 estcmd // ranging from 85 - 99.9 % CPU usage

first run hyperestraier 1.2.3: estcmd: INFO: finished successfully: elapsed time: 0h 4m 6s
second run changing to he 1.2.5: /usr/local/bin/estcmd: INFO: finished successfully: elapsed time: 0h 3m 34s

using time command (continuing with he 1.2.5):
first run:
time /usr/local/bin/estcmd gather -sd hyperestraierindex files
real 3m30.970s
user 3m16.100s
sys 0m3.268s

second run:
real 3m31.795s
user 3m16.900s
sys 0m3.220s

indexing with java-lucene:

2452 root 25 0 229m 18m 6936 R 99.9 3.7 0:27.95 java // ranging from 99.1 - 99.9 % CPU usage

first-run: 130102 total milliseconds ~ 0h 2m 10s

using time command:
first run:
time java -classpath ./lucene-core-1.9.1.jar:./lucene-demos-1.9.1.jar org.apache.lucene.demo.IndexFiles /home/janprill/searchtest/files
real 2m14.564s
user 2m6.420s
sys 0m4.972s

second run:
real 2m22.966s
user 2m15.112s
sys 0m4.804s
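
To put the three indexing runs side by side, the wall-clock times can be turned into a rough throughput figure. The little script below is only a sketch: it takes the 408 MB reported by du and the "real" times of the first runs from above, and the helper name throughput_mb_per_s is made up for this example. It says nothing about how the engines behave with other data or settings.

```ruby
# Corpus size (as reported by du) and first-run "real" times from this post.
CORPUS_MB = 408.0
REAL_SECONDS = {
  "ferret 0.9.3"        => 69.647,           # 1m9.647s
  "java-lucene 1.9.1"   => 2 * 60 + 14.564,  # 2m14.564s
  "hyperestraier 1.2.5" => 3 * 60 + 30.970   # 3m30.970s
}

# Hypothetical helper: megabytes of text indexed per second of wall-clock time.
def throughput_mb_per_s(megabytes, seconds)
  megabytes / seconds
end

fastest = REAL_SECONDS.values.min
REAL_SECONDS.each do |engine, seconds|
  printf("%-20s %6.2f MB/s  %.2fx the fastest time\n",
         engine, throughput_mb_per_s(CORPUS_MB, seconds), seconds / fastest)
end
```

With these numbers ferret comes out at a bit under 6 MB/s, and java-lucene needs a little less than twice ferret's time.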

index-size:

The indexes of ferret and lucene aren't the same size. I don't know whether the lucene demo jars use only the SimpleAnalyzer; maybe this is the reason. Both "lucene" indexes are quite small compared to the hyperestraier index. Maybe this has something to do with the flags I used on the he gatherer. I don't know...

du -hs ferretindex/ hyperestraierindex/ index/
19M ferretindex/
73M hyperestraierindex/
16M index/


searching:

checking results count:

hyperestraier:

'english': 453
'gutenberg': 597
'yesterday': 215
'together': 545
'america': 181
'advanced': 284
'president': 118
'bunny': 1

ferret:

'english': 448
'gutenberg': 597
'yesterday': 198
'together': 537
'america': 176
'advanced': 260
'president': 114
'bunny': 1

lucene:

'english': 433
'gutenberg': 597
'yesterday': 157
'together': 524
'america': 163
'advanced': 222
'president': 103
'bunny': 1
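
The three engines don't agree on the number of hits, which is probably down to different analyzers and defaults (that is my guess, I haven't checked it). To see where they differ most, here is a small sketch that computes the spread (largest minus smallest count) per term from the numbers above. The helper name spread_per_term is made up for this example.

```ruby
# Hit counts copied from the lists above.
HITS = {
  "hyperestraier" => { "english" => 453, "gutenberg" => 597, "yesterday" => 215, "together" => 545,
                       "america" => 181, "advanced" => 284, "president" => 118, "bunny" => 1 },
  "ferret"        => { "english" => 448, "gutenberg" => 597, "yesterday" => 198, "together" => 537,
                       "america" => 176, "advanced" => 260, "president" => 114, "bunny" => 1 },
  "lucene"        => { "english" => 433, "gutenberg" => 597, "yesterday" => 157, "together" => 524,
                       "america" => 163, "advanced" => 222, "president" => 103, "bunny" => 1 }
}

# Hypothetical helper: for each term, the difference between the highest
# and the lowest hit count reported by any engine.
def spread_per_term(hits)
  terms = hits.values.first.keys
  terms.map do |term|
    counts = hits.values.map { |per_engine| per_engine.fetch(term) }
    [term, counts.max - counts.min]
  end.to_h
end

# Print the terms with the largest disagreement first.
spread_per_term(HITS).sort_by { |_, spread| -spread }.each do |term, spread|
  printf("%-10s %d\n", term, spread)
end
```

Only 'gutenberg' and 'bunny' get identical counts from all three engines; the largest gap is on 'advanced' (284 versus 222).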

search performance:

To test search performance I've made up a loop that searches for all of the terms above. Because all of these search engines are pretty fast, I run this loop a thousand times. Interestingly, ferret and java-lucene hit the CPU quite hard (60-70 % CPU utilization), while hyperestraier itself seems to use very little CPU (around 4 %) but loads the system heavily. This once again indicates that I'm doing something wrong with hyperestraier. Hopefully a he expert jumps right in.

I saw quite some utilization by sshd while I let the search engines print the found document id and its score. Because of this I've eliminated the output in one test, by commenting out the puts and System.out.println calls (ferret and java-lucene) or by sending the output of estcmd to /dev/null. You might have a look at the sources: BenchSearchWithHe | BenchSearchWithFerret | BenchSearchWithLucene and the commented-out lines in there.

Once again: I'm quite sure that I'm doing things wrong with hyperestraier. I think estcmd opens and closes the index on every call. But I'm leaving it to the hyperestraier experts to fix this. Of course you could give me some hints and I'll then apply them to my little test case. Obviously, using the ruby bindings would be great for a direct comparison with ferret. Maybe I'll do this myself in the next few days.
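
The shape of such a search loop can be sketched roughly like this. Note that this is not the code from BenchSearchWithFerret: the search itself is passed in as anything that responds to call, and the in-memory FAKE_INDEX below is just a made-up stand-in so the sketch runs without any search engine installed. The name time_searches is made up as well.

```ruby
require "benchmark"

# The eight terms used in the result counts above.
TERMS = %w[english gutenberg yesterday together america advanced president bunny]

# Runs `search` (any object responding to #call) once per term, `rounds` times,
# and returns [elapsed_seconds, number_of_queries].
def time_searches(search, terms, rounds)
  queries = 0
  elapsed = Benchmark.realtime do
    rounds.times do
      terms.each do |term|
        search.call(term)
        queries += 1
      end
    end
  end
  [elapsed, queries]
end

# Stand-in for a real engine: a tiny hash lookup with hypothetical document ids.
FAKE_INDEX = { "bunny" => [42], "gutenberg" => [1, 2, 3] }
fake_search = ->(term) { FAKE_INDEX.fetch(term, []) }

elapsed, queries = time_searches(fake_search, TERMS, 1000)
printf("%d queries in %.3fs\n", queries, elapsed)
```

To measure a real engine you would swap fake_search for a lambda that calls that engine's search. Opening the index once, outside the timed block, keeps the startup cost out of the measurement.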

searching with ferret:

no outputs:
first run:
time ruby search_ferret.rb
real 0m24.330s
user 0m15.681s
sys 0m6.240s

second run:
real 0m20.937s
user 0m15.397s
sys 0m5.520s

with puts:
first run:
real 0m33.276s
user 0m19.753s
sys 0m7.508s

second run:
real 0m32.841s
user 0m20.061s
sys 0m7.284s

searching with hyperestraier:

outputs to /dev/null:
first run:
time sh search_he.sh
real 1m13.886s
user 0m34.194s
sys 0m39.590s

second run:
real 1m14.292s
user 0m34.406s
sys 0m39.718s

with outputs to stdout:
first run:
real 1m30.498s
user 0m36.766s
sys 0m42.255s

second run:
real 1m30.240s
user 0m36.906s
sys 0m42.155s

searching with java-lucene:

no outputs:
first run:
time java -classpath ./lucene-demos-1.9.1.jar:./lucene-core-1.9.1.jar:. SearchLucene
real 0m5.207s
user 0m4.584s
sys 0m0.528s

second run:
real 0m5.083s
user 0m4.488s
sys 0m0.540s

with System.out.println's:
first run:
real 0m21.769s
user 0m10.033s
sys 0m2.740s

second run:
real 0m21.332s
user 0m9.741s
sys 0m2.644s
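
For a rough feel of what these timings mean per query, here is a small sketch. It assumes that each run really performs 8 terms x 1000 rounds = 8000 searches, and it uses the "real" times of the first runs without output. Keep in mind that these times include process startup (ruby interpreter, JVM, one estcmd process per search), so the per-query numbers are upper bounds rather than pure search times. The helper name queries_per_second is made up for this example.

```ruby
# Assumption: 8 search terms, looped 1000 times.
QUERIES = 8 * 1000

# First-run "real" times without output, copied from above.
SEARCH_REAL_SECONDS = {
  "ferret"        => 24.330,
  "hyperestraier" => 60 + 13.886, # 1m13.886s
  "java-lucene"   => 5.207
}

# Hypothetical helper: searches completed per second of wall-clock time.
def queries_per_second(queries, seconds)
  queries / seconds
end

SEARCH_REAL_SECONDS.each do |engine, seconds|
  printf("%-14s %7.1f queries/s  %.2f ms per query\n",
         engine, queries_per_second(QUERIES, seconds), 1000.0 * seconds / QUERIES)
end
```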

conclusions:

While the pure ruby version of ferret was orders of magnitude slower than the java version, this isn't the case any longer. All tested search engines are quite fast. Java-lucene is a de facto standard for open source search and is used on countless sites. It's also the base for nutch: an industrial-strength crawler and search engine. With ferret there is a port of the lucene standard to ruby that in this test indexes even faster than java-lucene (roughly twice as fast) and searches in the same ballpark (about 21 to 24 seconds versus about 5 seconds for the search loop without output). Taking into account that hyperestraier, like ferret and lucene, could be heavily optimized, it's a strong contender. There are papers as well as a thread on hyperestraier's mailing list stating that hyperestraier is four times faster in indexing than lucene. Unfortunately I can't recall the URL of that thread, and the mailing list of hyperestraier isn't archived on sourceforge. Once again: maybe a hyperestraier expert can jump in to provide numbers for a better optimized hyperestraier...

Regards Jan Prill

comments

anonymous: I'd like to point out that ferret is not multithreaded, and it's not really fair to compare java-lucene (with all of its threading overheads) to ferret which has threading disabled. Threading adds a lot of overhead and this fact is not mentioned in this benchmark.

Jan Prill: Hey anon! I've encouraged readers to state their opinion and yours is perfectly fine. But: I think it is a matter of respect to post comments like these not anonymously but at least with a webname and / or an email-address. Maybe you'll let people know who you are next time you are posting... it's so much more fun to discuss when you know at least anything about your vis-à-vis, Cheers, Jan

David Balmain: Ferret is multithreaded and the benchmarks are completely fair. If anything they err in favour of Java Lucene as the Lucene benchmark is only adding *.txt documents while the Ferret benchmark is adding all documents to the index. Not to mention the fact that there have been some rather large performance improvements since Ferret 0.9. Still, I'd prefer the error to be Ferret's way in case there are complaints about the validity of the benchmarks (like this one).