I've been a fan of the acts_as_ferret plugin for a while and have had great success using it while developing Rails applications under WEBrick. It's a cinch to use and highly configurable. However, once I deployed applications that used acts_as_ferret to production, things started to come apart: acts_as_ferret simply didn't work in a multi-process (read: fastcgi) or multi-server environment.
The problem is due to concurrency issues between processes. Each fastcgi process assumes it is the only one using the index files and does not respect the other processes' actions. Under heavy load, with concurrent read/write access, file locking breaks down and the application begins to throw errors (including some nasty segfaults).
The problem gets worse when not only are there multiple processes on a server but there are multiple servers as well. In this configuration, keeping the ferret index files separated on each server isn't an option as that would lead to out of sync indices. The most obvious solution is to use a centralized location for these files and link each server to it. But that is the same situation as above, only this time there are more processes!
So, is acts_as_ferret a 'development-only' plugin that's not ready for production? Fortunately not!
There is now a DRb Server implementation for acts_as_ferret. From the authors:
"In production environments most often multiple processes are responsible for serving client requests. Sometimes these processes are even spread across several physical machines.
Just like the database, the Ferret index of an application is a unique resource that has to be shared among all servers. To achieve this, acts_as_ferret comes with a built in DRb server that acts as the central hub for all indexing and searching in your application."
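Setting this up is mostly configuration: you add a config/ferret_server.yml and start the server with 'script/ferret_server'. Here's a sketch of what that file looks like (file and script names per the plugin's README at the time; treat the exact keys and values as assumptions to verify against your version):

```yaml
# config/ferret_server.yml -- sketch; check the plugin's README for the
# options your version supports
production:
  host: localhost
  port: 9010
  pid_file: log/ferret_server.pid
  log_file: log/ferret_server.log
```

The server is then started with something like 'script/ferret_server -e production start', and the plugin routes all index reads and writes through it.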
Perfect! Now acts_as_ferret is ready for production environments. The only question is, "How does it scale?"
I recently ran across another 'acts_as_' plugin for Rails, called acts_as_solr, which has many of the same features (plus some that acts_as_ferret doesn't), but whose index server is implemented in Java on top of the apache-solr engine. I began to wonder which one was faster, and thus would scale better. Time for a benchmark!
- CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz (using non-smp kernel)
- RAM: 2GB
- OS: Kubuntu Edgy (latest updates as of 3/14/07)
- Ruby: ruby 1.8.4 (2005-12-24) [i686-linux]
- Rails: 1.1.6
- MySQL: mysql Ver 14.7 Distrib 4.1.15, for pc-linux-gnu (i486) using readline 5.1
- ferret gem: 0.11.3
- acts_as_ferret: svn://projects.jkraemer.net/acts_as_ferret/trunk/plugin/acts_as_ferret (as of 3/14/07)
- acts_as_solr: http://opensvn.csie.org/acts_as_solr/trunk (as of 3/13/07)
- apache-solr: apache-solr-1.1.0-incubating
Hard Drive information ('hdparm -I /dev/hda')
Everything was done on a single machine (see assumptions as to why).
This routine simply loops a specified number of times, evaluates a given routine, and outputs basic statistical information. It catches all exceptions thrown (caused by passing in invalid characters as part of the query) and treats that timing instance as noise (not included in success totals).
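The routine described above can be sketched as follows; the names here (run_benchmark and the stats keys) are mine, not the original script's:

```ruby
# Sketch of the benchmark helper: loop N times, time each call, and
# treat any exception (e.g. invalid characters in a query) as noise.
def run_benchmark(times)
  timings = []
  failures = 0
  times.times do
    begin
      start = Time.now
      yield
      timings << (Time.now - start)
    rescue StandardError
      # query the parser rejects, etc. -- not counted as a success
      failures += 1
    end
  end
  avg = timings.empty? ? 0.0 : timings.inject(0.0) { |s, t| s + t } / timings.size
  { :successes => timings.size, :failures => failures,
    :avg => avg, :min => timings.min, :max => timings.max }
end

# Hypothetical usage: run_benchmark(1000) { Article.find_by_contents(term) }
```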
Note: only one of the 'acts_as' declarations was uncommented at a time.
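The model declarations looked roughly like this (a sketch; the field list is my guess at the indexed columns, and both plugins happen to accept a :fields option):

```ruby
# app/models/article.rb -- only one declaration uncommented per run
class Article < ActiveRecord::Base
  acts_as_ferret :fields => [:url, :title, :description]
  # acts_as_solr :fields => [:url, :title, :description]
end
```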
The table I used contained 414 rows of news articles pulled from various news feeds (Yahoo, MSNBC, etc.). The columns indexed were the url of the full article, the title of the article, and a synopsis of the article content. The article table is as follows:
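A sketch of the migration for that table, assuming conventional Rails 1.x column types (names beyond url/title/description are guesses):

```ruby
# db/migrate/001_create_articles.rb -- sketch of the articles table
class CreateArticles < ActiveRecord::Migration
  def self.up
    create_table :articles do |t|
      t.column :url,         :string  # link to the full article
      t.column :title,       :string  # article title
      t.column :description, :text    # synopsis of the article content
    end
  end

  def self.down
    drop_table :articles
  end
end
```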
All routines were performed on the same local machine. Thus these tests do not account for any latency that may occur when in an environment distributed across different machines on a network. Any latency differences between a multi-server environment versus a localhost-only environment would be constant, and thus not affect the relative performance, as both plugins are opening sockets to connect to their respective servers.
I am assuming that 414 rows of aggregated news article summaries from several different sites provide a large enough pool of information that search terms pulled from these articles can be considered random.
There were four different routines run for each plugin: random search with no background updates, cached search with no background updates, random search with continuous background updates, and cached search with continuous background updates. Each routine/plugin combination was benchmarked at 10, 100, 1000, and 10000 queries.
Random search with No Background Updates:
The routine I used for random searching is as follows:
Article.find(:first, :order => 'rand()').description.split(' ')[rand()] is used as the search term, providing random queries by pulling a word from a random article description. The idea here is to see how non-cached queries perform. I am not counting the time taken by the query that fetches the random word from the database.
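For illustration, here is the word-picking part on its own (the Rails parts are omitted). One caveat, if I'm reading the snippet right: rand() with no argument returns a Float below 1.0, and Array#[] truncates a Float index to an integer, so [rand()] always yields the first word; rand(words.size) would pick a genuinely random one:

```ruby
# Sketch: pick a random word from an article description.
# Note: the original [rand()] indexes with a Float in [0, 1), which
# Array#[] truncates to 0 -- i.e. it always picked the first word.
def random_term(description)
  words = description.split(' ')
  words[rand(words.size)]
end
```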
Cached search with No Background Updates:
As many of the feeds used to generate the articles table were of a technical nature I chose the word 'computer' to maximize the number of matches. The routine I used for cached searching is as follows:
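With the term fixed, the query side reduces to each plugin's search entry point. A sketch, using the finder names from each plugin's API:

```ruby
# acts_as_ferret's full-text finder:
Article.find_by_contents('computer')

# acts_as_solr's equivalent:
Article.find_by_solr('computer')
```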
Random Search with Continuous Background Updates:
Using two different 'script/console's, one would continuously select a random article and save it, thus updating the indices, while the other continuously ran the random query as described above.
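The updater session can be sketched as a one-liner in its own script/console (the :order => 'rand()' trick mirrors the query side and is MySQL-specific):

```ruby
# Run in a second script/console: re-saving a random article forces the
# plugin to reindex that record, continuously.
loop do
  Article.find(:first, :order => 'rand()').save
end
```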
Cached search with Continuous Background Updates:
Using two different 'script/console's, one would continuously select a random article and save it, thus updating the indices, while the other continuously ran the cached query as described above.
Here is the output from two script/console sessions (one for acts_as_solr and the other for acts_as_ferret):
acts_as_solr Random Query Test (with no background updates):
acts_as_ferret Random Query Test (with no background updates):
acts_as_solr Cached Query Test (with no background updates):
acts_as_ferret Cached Query Test (with no background updates):
acts_as_solr Random Query Test (with background updates):
acts_as_solr Cached Query Test (with background updates):
acts_as_ferret Random Query Test (with background updates):
acts_as_ferret Cached Query Test (with background updates):
The results were surprisingly close, with the largest margin being approximately 0.01 seconds, and acts_as_ferret faster in most test cases. I honestly would have figured the Java implementation would win, given all the negative press out there about Ruby's performance in benchmarks.
Now to be fair to the solr server, it does appear to have many features that the acts_as_ferret DRb server does not, and thus could be doing a lot more than just building the index files. I didn't look into this further, though that would make a good follow up to see how much (if at all) these extra features affect these results and if they can be configured to improve performance.
Here is a breakdown of the results for the different tests:
| Test | Winner | Avg. Margin (s) |
| --- | --- | --- |
| random (no updates) | acts_as_ferret | 0.002734 |
| random (with updates) | acts_as_ferret | 0.012787 |
| cached (no updates) | acts_as_ferret | 0.000026 |
| cached (with updates) | acts_as_solr | 0.002395 |