I have not written much C++ in my life. Most of it goes back to college and university, and the short period I spent at Autodesk. Still, I have always considered the STL to be very influential. A few years back, I read Bjarne Stroustrup’s book and something hit me. For those not familiar with the STL, it’s a very template-intensive library that does not use the traditional interfaces you typically see in object-oriented libraries. Instead, everything is based on duck typing: if the argument you pass provides the right set of operators or methods, the compiler will do the job and go ahead with it. If it does not, the compiler will print out a few dozen lines of garbage with angle brackets all around, bearing little relation to your code. Still, one of the core concepts in the library is that if an operation is efficient, it gets an operator; if it’s not, it gets a function with a rather long name, thereby encouraging the use of efficient operations. How does this work in practice? A vector provides direct element access through square brackets, pretending to be an array, while a list won’t, because random access on a linked list would be O(n) and that’s not something they want to encourage.
The documentation also contains hints like this one:
 One might wonder why pop() returns void, instead of value_type. That is, why must one use front() and pop() to examine and remove the element at the front of the queue, instead of combining the two in a single member function? In fact, there is a good reason for this design. If pop() returned the front element, it would have to return by value rather than by reference: return by reference would create a dangling pointer. Return by value, however, is inefficient: it involves at least one redundant copy constructor call. Since it is impossible for pop() to return a value in such a way as to be both efficient and correct, it is more sensible for it to return no value at all and to require clients to use front() to inspect the value at the front of the queue.
Not all libraries in the world are so careful. Take this snippet from the Zend Search Lucene documentation:
Java Lucene uses the ‘contents’ field as a default field to search. Zend_Search_Lucene searches through all fields by default, but the behavior is configurable. See the “Default search field” chapter for details.
When I came across this while reading the documentation to build a fairly complex index, I thought searching all fields was a reasonable default. It’s very convenient. I could store the content in the fields it belongs to, make sure everything is searchable by default, and still allow for finer-grained searches when required. Fantastic. I went ahead with it and wrote the code to collect the information and index it properly, along with the query abstraction, in good old TDD fashion. Everything worked. Of course, I was testing on very small data sets.
I then went ahead and tested with larger data sets. I had an old back-up of a site with around 2000 documents in it. It felt like a decent test. Based on what I had read online, I expected indexing that amount of data to take around half an hour. The search component had not been selected for speed; it had been selected for portability. Speed was one of the attributes we were willing to sacrifice, as long as it was not too terrible. Of course, the initial indexing took longer than expected, but only by a factor of 2, and I knew some places were not fully optimized yet (it’s now down to 20-25 minutes).
The really big surprise came when I attempted a simple search on the index. It timed out. After 60 seconds. Initial attempts at profiling failed as it was getting late on a Friday afternoon. I closed up shop and had a bit of trouble getting it out of my head that night.
When I got back to it, I removed the time limit, started a profiling session and enjoyed my coffee for a little while. The results indicated that the search spent pretty much all of its time optimizing the search query, making tens of thousands of calls to functions that eventually read from disk. There was not much more in the report to help me, so I started adding var_dump() calls in the code to see what was going on. Well, it turns out that “search all fields by default” was not such a great idea. It actually searches through every field, expanding the query accordingly. Because of how I had interpreted the API and documentation, I had built my index to be quite expanded: it contained a few dozen fields, not all of which existed for all documents. That was a mistake. There was one valid reason the Java implementation did not behave that way: it is not possible to do it efficiently.
I ended up modifying the index generation to put all content in a contents field, duplicating the content you would actually want to search independently in its own field, and searching contents by default. Indexing time wasn’t altered much, the code changes were minor and easy thanks to the array of tests available, and search speed went up dramatically. It’s certainly not as fast as Sphinx, but it does offer decent performance and can run on pretty much any kind of cheap hosting, which is a good feature for an open source CMS. It still needs to be investigated, but it’s also likely to provide a smooth upgrade path to Solr for larger installations. Abstractions around indexing and searching will also make it easy to move to other index engines as needed.
Asymptotic analysis is one of the more boring parts of the computer science curriculum, but it’s really something to consider when building libraries that will be used with large numbers of documents. The API needs to reflect the limitations, and the documentation must explain them clearly.