The Joseph Rowntree Foundation's new site

At Torchbox we've recently rebuilt the website for the Joseph Rowntree Foundation from the ground up using Drupal. I was technical lead for the project and I'm incredibly proud of the work that the whole team, both at Torchbox and at the JRF, has put into launching such a great site.

Search indexing and retrieval are powered by Apache Solr on a Tomcat server; the Apache Tika extensions are used to scrape PDFs for their plain text. Torchbox chose a Lucene-based product like Solr because we've got a history of experience with the technology: our own proprietary CMS has used Lucene for some years, and we've built site spiders using it. So it was quite a surprise to see that the drupal.org redevelopment has also chosen Apache Solr as its search technology. Also, Lucid Imagination recently decloaked, with big players aligning themselves with Lucene. Although some of it is still a bit nebulous, this news really bodes well for support and development of this open-source search technology going forward.

We integrated Drupal with the search stack in the first instance using the apachesolr module, but found that the requirements of the site---category-based and year-based searching, archived publications, a visitor-configurable number of results per page and three different search pages---very quickly outstripped the capabilities of both the core module and its optional submodules. As a result, Tika submission and the serialization/deserialization between structured searches and Lucene syntax are almost entirely handled by custom code. The apachesolr module was very friendly to our extensions, though, and made a useful library for us to work with. Its alpha status (it's since moved to beta) did mean that we had some difficulty upgrading it to try to get extra functionality, and it doesn't seem to support such Lucene edge cases as on-the-fly weighting.
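To give a flavour of what serializing a structured search into Lucene syntax involves, here is a minimal Python sketch. It is not the code running on the site (that lives in a Drupal module), and the field names `title`, `body`, `category`, `year` and `archived` are hypothetical stand-ins for whatever a real schema defines. The `q`, `fq`, `rows` and `start` parameters and the `^` boost suffix are standard Solr and Lucene features.

```python
from urllib.parse import urlencode


def quote(text):
    """Wrap text in double quotes, escaping backslashes and embedded quotes,
    so that Lucene treats it as a literal term or phrase."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_solr_params(keywords, category=None, year=None,
                      include_archived=False, per_page=10, page=1,
                      title_boost=2.0):
    """Turn a structured search into a list of (name, value) Solr parameters.

    All field names here (title, body, category, year, archived) are
    assumptions made for illustration, not the site's real schema.
    """
    words = keywords.split()
    if words:
        # Every word must match; a match in the title is weighted more
        # heavily at query time using the Lucene ^boost suffix.
        q = " AND ".join(
            f"(title:{quote(w)}^{title_boost} OR body:{quote(w)})"
            for w in words
        )
    else:
        # No keywords given: match every document, then rely on the filters.
        q = "*:*"

    filters = []
    if category is not None:
        filters.append(f"category:{quote(category)}")
    if year is not None:
        filters.append(f"year:{int(year)}")
    if not include_archived:
        filters.append("archived:false")

    params = [("q", q)]
    params += [("fq", f) for f in filters]
    # Visitor-configurable page size, with a zero-based result offset.
    params += [("rows", per_page), ("start", (page - 1) * per_page)]
    return params


def to_query_string(params):
    """URL-encode the parameters, ready to append to a Solr select URL."""
    return urlencode(params)


if __name__ == "__main__":
    example = build_solr_params("child poverty", category="Housing",
                                year=2008, per_page=20, page=2)
    for name, value in example:
        print(name, "=", value)
    print(to_query_string(example))
```

Putting the category, year and archive restrictions into `fq` filters rather than into `q` keeps them out of relevance scoring, so only the visitor's keywords and the title boost influence the ranking. Going the other way, from a submitted query string back to a structured search to repopulate the form, is the deserialization half and is not shown here.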

Performance has been great so far (although I shouldn't jinx that sort of thing by even talking about it) and offloading search to Solr was definitely the right decision: Lucene/Tika solves a whole domain of problems related to free-text and vocabulary searches---fast indexing, stemming, stop words, conditional logic, weighting, text extraction---that Drupal's core search was always going to struggle with. The stack has been a (sometimes infuriating) pleasure to work with, and I look forward to using it in future projects.