New favorite toy

It certainly ain’t cutting edge, but I recently started using Sphinx. I had heard presentations about it and read a bit, but never had the occasion to use it. It’s very impressive as a search engine. The main downside that kept me away from it for so long is that it pretty much requires a dedicated server to run. As I primarily work on open source software where I can make no assumptions about the environment, Sphinx was never an option. For those environments, the PHP implementation of Lucene in the Zend Framework is a better candidate.

In most cases, I tend to stick with what I know. When I need to deliver software, I much prefer avoiding new problems and sticking with what I know is good enough. Given the option, though, a few details made me go for Sphinx rather than Lucene (always referring to the PHP project, not the Java one).

  • No built-in stemmer, and the only one I could find was for English. If you’ve never tried, not having a stemmer in a search engine is a good way to get search results only occasionally, and it makes everything very frustrating.
  • Pagination has to be handled manually. Because it runs in PHP and all memory is gone by the end of the request, the only way it can handle pagination decently is to let you handle it yourself.

However, it’s a matter of trade-offs. Sphinx has a few inconveniences of its own.

  • It runs as a separate server application and requires an additional PHP extension to connect to it (although recent versions support the MySQL protocol and let you query it with SQL, as sketched below).
  • No incremental updates to the index. The indexer runs from the command line and can only build indexes efficiently from scratch. Different configurations can be used to reduce the impact of this issue, but some delay in search index updates has to be tolerated.
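As an aside, the SQL route mentioned above makes the extra extension optional: searchd speaks the MySQL wire protocol, so a regular MySQL client can talk to it. A minimal sketch, assuming searchd listens on the default SphinxQL port (9306); the index name is illustrative, not from any real setup:

```php
<?php
// Query Sphinx over the MySQL protocol (SphinxQL) with plain PDO.
// Assumes searchd listens on the default SphinxQL port 9306.
$sphinx = new PDO('mysql:host=127.0.0.1;port=9306');

// "site_pages" is a placeholder index name for this example.
$stmt = $sphinx->prepare('SELECT * FROM site_pages WHERE MATCH(?) LIMIT 0, 20');
$stmt->execute(array('hello world'));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['id'], "\n"; // document id, plus any stored attributes
}
```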

If you can get past those issues, Sphinx really shines.

  • It handles pagination for you. Running as a daemon, it can keep buffers open, hold data internally, and manage its own memory properly. In fact, you don’t need to know how, and that’s just perfect.
  • It can store additional attributes and filter on them, including multi-valued fields.
  • It’s distributed, so you can scale the search capacity independently. It requires modifying the configuration file, but it’s entirely transparent to your application.
  • It offers several result sorting options, including one based on time segments, which ranks recent results higher depending on which segment they fall into (hour, day, week, month). Ideal when searching for recent changes (see the sketch after this list).
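As an illustration of that last point, here is a minimal sketch of the time-segment mode using the sphinxapi.php client shipped with Sphinx; the attribute and index names are assumptions for the example:

```php
<?php
// Time-segment sorting with the bundled sphinxapi.php client.
require 'sphinxapi.php';

$client = new SphinxClient();
$client->SetServer('localhost', 9312); // default searchd API port

// Rank matches by relevance within time segments (last hour, day, week,
// month, ...) computed from the "published" timestamp attribute.
$client->SetSortMode(SPH_SORT_TIME_SEGMENTS, 'published');

$result = $client->Query('recent changes', 'site_pages');
```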

Within just a few hours, it allowed me to solve one of the long-standing issues in every CMS I’ve come across: respecting access rights in search results efficiently. Typically, whichever search technique you use will provide you with a raw list of results. You then need to filter those results to hide the ones the user cannot see. If you can accept that not all pages have the same number of results (or none at all), this can work fairly efficiently. Otherwise, it adds either a lot of complexity or a lot of inefficiency.

Another option is to just display the results anyway to preserve the aesthetics and let the user face a 403-type error later on. It may be an acceptable solution in some cases. However, you need to hope that the generated excerpt or the title does not contain sensitive information, like “Should we fire John Doe?”. This can also happen if the page contains portions of restricted information.

First, the pagination issue. I solved this one by adding an attribute to each indexed document and a filter to all queries. The attribute contains the list of all roles that can view the page; the filter contains the list of all roles the user has. Magically, Sphinx paginates the results with only the pages that are visible to the current user.
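Here is a rough sketch of what that query looks like with the PHP client; the attribute name, index name, and role values are illustrative, not the actual ones from my setup:

```php
<?php
// Per-user filtering and pagination in a single Sphinx query.
require 'sphinxapi.php';

$client = new SphinxClient();
$client->SetServer('localhost', 9312); // default searchd API port

// Roles of the current user (normally loaded from the session).
$currentUserRoleIds = array(2, 7, 13);

// Keep only documents whose multi-valued "role_id" attribute intersects
// the user's roles. Sphinx applies the filter before pagination.
$client->SetFilter('role_id', $currentUserRoleIds);

// Page 3 with 20 results per page: offset 40, limit 20.
$client->SetLimits(40, 20);

$result = $client->Query('annual report', 'site_pages');
$total  = $result['total_found']; // post-filter count, usable for page links
```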

Of course, this required a bit of up-front design. The permission system allows me to obtain the list of all roles that have an impact on the permissions. Visibility can then be verified for each of those roles without having to scan every role in the system (potentially hundreds or thousands).

Sphinx can build the index either directly from the database, by providing it with the credentials and a query, or through an XML pipe. Because a lot happens in the application logic, I chose the second approach, which gives me much more flexibility. All you have to do is write a PHP script that (ideally using XMLWriter) gathers the data to be indexed and writes it to the output buffer.
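A minimal sketch of such a script producing the xmlpipe2 format follows; the schema and the fetch_pages() helper are assumptions for the example:

```php
<?php
// xmlpipe2 feed for the Sphinx indexer. The indexer launches this script
// (via the xmlpipe_command setting) and reads the XML from its standard
// output. fetch_pages() is a hypothetical stand-in for whatever loads
// the documents; field and attribute names are illustrative.
$w = new XMLWriter();
$w->openURI('php://output');
$w->startDocument('1.0', 'UTF-8');
$w->startElement('sphinx:docset');

// Declare the schema: one full-text field, one multi-valued attribute.
$w->startElement('sphinx:schema');
$w->startElement('sphinx:field');
$w->writeAttribute('name', 'content');
$w->endElement();
$w->startElement('sphinx:attr');
$w->writeAttribute('name', 'role_id');
$w->writeAttribute('type', 'multi');
$w->endElement();
$w->endElement(); // sphinx:schema

foreach (fetch_pages() as $page) {
    $w->startElement('sphinx:document');
    $w->writeAttribute('id', $page['id']);
    $w->writeElement('content', $page['full_content']); // unfiltered text
    $w->writeElement('role_id', implode(',', $page['roles']));
    $w->endElement(); // sphinx:document
}

$w->endElement(); // sphinx:docset
$w->endDocument();
$w->flush();
```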

The second part of the problem, exposing sensitive information in the results, was resolved as a side effect. The system allows granting or denying access to portions of the content. When building the index, absolutely all content is indexed. However, Sphinx does not generate the excerpts automatically when returning search results. One reason is that you may not need them, but the main reason is more likely that it does not preserve the original text; it only indexes it. Doing so avoids having to keep yet another copy of your data. Your database already contains it.

To generate the excerpt, you need to fetch the content and send it back to the server along with the search words. The trick here is that you don’t really need to send back the same content. While I send the full content during the indexing phase, I only send the filtered content when the time comes to generate the excerpt.
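A sketch of that round trip, again with illustrative names; load_page() and filter_for_user() are hypothetical helpers standing in for the CMS logic:

```php
<?php
// Build excerpts from the *filtered* content only. The index saw the
// full text, but the snippets are generated from what this user may read.
require 'sphinxapi.php';

$client = new SphinxClient();
$client->SetServer('localhost', 9312);

$searchTerms = 'annual report'; // example query
$currentUser = current_user(); // hypothetical session lookup
$result = $client->Query($searchTerms, 'site_pages');

$docs = array();
foreach ($result['matches'] as $docId => $match) {
    // Fetch the page, then strip the portions this user cannot see.
    $docs[] = filter_for_user(load_page($docId), $currentUser);
}

$excerpts = $client->BuildExcerpts($docs, 'site_pages', $searchTerms, array(
    'before_match' => '<strong>',
    'after_match'  => '</strong>',
    'limit'        => 200, // maximum excerpt length, in characters
));
```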

Sure, there may be false positives. Someone may see a search result and get nothing meaningful out of it. John Doe might find out that a page mentions his name, but the content will not be exposed in any case. Quite a good compromise.

So many possibilities. What is your favorite feature?
