Issues should be visible

The best way to improve a system’s stability is to make sure the issues are visible to everyone, every day. We all get too many emails of varying relevance. So many that our vision blurs when we look at our mailbox and our brains automatically filter out the parts that have been around for too long. Time to handle an issue becomes:

  1. Immediately
  2. End of day
  3. Never.

The best software systems are built with error handling. They detect errors, validate that they behaved as expected and that the results are within normal boundaries. Typically, the go-to technique to report these issues is to send an email to the right person. That’s until the right person moves on with their life. Then that notification goes to someone who has no clue how to handle it or how important it is. The issues escalate to developers and that’s how my mailbox fills up with useless redundant requests.

Don’t manage errors with a mailbox. Mailboxes are great for handling immediate issues, but their search capabilities are really not as good as you need them to be. When an issue lands in a mailbox, it has no context. It’s buried in endless noise out of which you cannot see the patterns. The hard issues can rarely be fixed without identifying a pattern.

I searched for a system that would allow me to do this. I found quite a few interesting solutions, but nothing that suited what I wanted to do. It turns out it’s a single database table, a couple of lines of SQL, and rendering using d3.js.

The table?

  • An autoincrement ID because the framework likes it better
  • An identifier for the notification
  • A date
  • A status (green, yellow or orange)
  • A time to live
  • A one-line message
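
In code, the whole schema amounts to something like this (a sketch; the table and column names are mine, adjust to taste):

    // One-time setup: the entire monitoring system is a single table.
    $pdo = new PDO('mysql:host=localhost;dbname=monitoring', 'user', 'password');
    $pdo->exec("
        CREATE TABLE IF NOT EXISTS task_status (
            id INT AUTO_INCREMENT PRIMARY KEY,          -- because the framework likes it better
            identifier VARCHAR(100) NOT NULL,           -- which task or verification reported
            happened_at DATETIME NOT NULL,              -- when it reported
            status ENUM('green', 'yellow', 'orange') NOT NULL,
            ttl_minutes INT NOT NULL,                   -- how long before silence turns red
            message VARCHAR(255) NOT NULL               -- the one-line explanation
        )
    ");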

Each task or verification in the system pushes its state at the end of the run, recording the result as green if all is good, orange if there is an error, and yellow if something is preventing execution.

Red is reserved for the moments when the task does not report within the expected TTL, meaning the task either did not run at all or crashed in the process.
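
Red is never written to the table; it is derived when rendering the dashboard by comparing each task’s latest entry against its TTL. Something along these lines, reusing the $pdo connection and hypothetical schema from the sketch above (not the exact query):

    // Latest entry per task; anything silent past its TTL shows up as red.
    $statuses = $pdo->query("
        SELECT t.identifier, t.message, t.happened_at,
               CASE WHEN DATE_ADD(t.happened_at, INTERVAL t.ttl_minutes MINUTE) < NOW()
                    THEN 'red'
                    ELSE t.status
               END AS effective_status
        FROM task_status t
        JOIN (SELECT identifier, MAX(happened_at) AS happened_at
              FROM task_status
              GROUP BY identifier) latest
          ON latest.identifier = t.identifier
         AND latest.happened_at = t.happened_at
    ")->fetchAll(PDO::FETCH_ASSOC);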

It turns out green is addictive. If you present a dashboard with green bubbles for every success, the few occasional errors stand out and effort will go into getting back to green. Sometimes resolving the issue involves contacting the customer, sometimes it requires adjusting tests, but having everything in a single location lets you see patterns. Failures rarely come alone.

To help further, clicking on one of those circles brings up a graph showing the history of the task. I plot time on the X axis and time between notifications on the Y axis, with the dot color showing the state. Interesting patterns emerge there.

As a bonus, an iframe points to a page on the wiki with troubleshooting information. Documentation is only ever used when it is in context.

The great thing about having a dashboard is that you can then include additional information in it. The more value it has, the more people will access it. It needs to make their daily job of managing the site easier.

I hope it will keep my inbox more manageable. Meanwhile, we can get a good handle on the system behavior, resolve long-lasting issues and improve performance.

Motivation

The last couple of months have been busy. Last November, I joined/co-founded HireVoice, a company that was setting out to do brand monitoring for employers. Helping companies improve to attract better talent seemed like a good cause and early feedback seemed promising. As the technical co-founder, I started building the solution. Soon enough, we had a decent application in place, ready to be used. It was partial and it had gaps, but it was built to be production-ready. I don’t believe in prototypes, and with the tools freely available, there is no reason to settle for anything less than high standards. This is when we figured out that getting people from being interested to actually writing checks was not going to be as easy as we had anticipated. Even though we had a good vision, and at the time our advisers were very encouraging, the reality of being a small supplier with no history makes selling hard. Our benefits were not tangible enough.

We moved on to different ideas, pitching all around, trying to find something that would cause more than interest and provide tangible results as soon as possible for the potential clients. To cut a long story short, we essentially built 5 different products to various levels over the course of 8 months, only to end up back at the starting point. The only difference is that we knew our market a whole lot better.

We came up with another plan. Instead of keeping the focus on perception as we previously had, we moved towards recruiting. We followed the money. With some early validation, we figured we could start building a solution again. I took the weekend off. I intended to get started early on Monday morning, fully refreshed and energized.

Monday morning arrived. I sat down at my computer with my double espresso ready to start. I had my plans, I knew exactly what I had to do. This is usually an ideal scenario. I do my best work in the morning and when I am fully aware of what needs to be done. I got to work and I realized the spark was gone. Even though we had a good plan, it wasn’t good for me. It did not serve a purpose meaningful enough to motivate me into making the sacrifices that come with starting a business. This was no longer about helping companies improve and make people happy at new jobs, it was just about recruiting and perpetuating the same old cycle.

My heart just wasn’t in it. With the motivation gone, I started thinking about what I had left behind for the past few months. My personal relationships had been affected by it. I had been working a lot. Partially still consulting to pay the bills, mostly on the company, spending most of my time thinking about it. The idea of keeping up that rhythm longer, and the consequences it could bring, was simply unacceptable.

I have read a lot of books. Recently, I had been reading more about business and start-ups. None of them had prepared me for this. How do you tell your business partner you are out? No chapter ever covers this. I had to do it the only way I could think of: say it frankly and openly, and order beers to end it cleanly. I did it the same day, then spent a few hours wondering if I had made the right decision. I did not get much work done in the two following weeks. There is no formula to tell you when you should cut your losses. Even if it did not amount to much as a company, there is still attachment to what we built and the time we spent.

This makes HireVoice part of the statistics, those 90% of start-ups that don’t make it. However, looking back, I don’t see it as a complete failure. I worked on fun challenges, met many interesting people, and got a better understanding of how businesses operate. Part of the code still lives through an open source project and is being used by a few people. I still maintain it for fun. Somehow, helping a handful of people I don’t know is more motivating to me.

Following great design

A couple of months ago, I started a new project. I had no legacy to follow. Everything was open. I decided to stick to PHP, because this is what I know best. New projects have enough unknowns by themselves; I didn’t need to learn a whole new platform. Instead, I decided to learn a new framework. I went with Symfony2, and of course, Doctrine2. I must say I was more than impressed. These two together completely change what PHP programming is, and for the best. They do what they need to do, and do it extremely well. Even profiling the code demonstrated how much care went into them. I never liked ORMs much because the previous generations polluted the object models. Doctrine 2 lets you write fully testable object code and hides away the storage… mostly.

As the project evolved, it became clear that the relational model was not a good fit. A graph was mostly what we needed. A little bit of study and experimentation later, I settled on Neo4j. An excellent PHP library handled the REST protocol. However, using it felt like a downgrade from my new Doctrine habits. The code was not as nice. I started writing a tiny entity manager with the persist() and flush() methods, just to handle the creation of the objects. Most of the work was done through queries anyway. I did not need much more than a write-only interface. A couple of hours later, it was up and running. It made the insert code much nicer. At this point, I was still experimenting a little. There was no strong commitment in the code.
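
The very first version was not much more than this kind of shape (a rough sketch, not the actual code; the $client calls stand in for whatever the REST library exposes):

    // A write-only entity manager: queue objects with persist(), write them
    // out with flush(). $client is whatever talks REST to Neo4j; makeNode(),
    // setProperty() and save() are placeholders for its real API.
    class WriteOnlyEntityManager
    {
        private $client;
        private $queued = array();

        public function __construct($client)
        {
            $this->client = $client;
        }

        public function persist($entity)
        {
            $this->queued[] = $entity;
        }

        public function flush()
        {
            foreach ($this->queued as $entity) {
                $node = $this->client->makeNode();

                // Public properties only, for the sake of the sketch.
                foreach (get_object_vars($entity) as $property => $value) {
                    $node->setProperty($property, $value);
                }

                $node->save();
            }

            $this->queued = array();
        }
    }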

As time went by, I started adding a few pieces. I figured just retrieving the properties back into the objects would not be so hard. With a couple of hours here and there, mostly in my spare time, because this was actually fun, I eventually ended up dynamically loading the relations and essentially had a nearly complete* entity manager for Neo4j (which I am glad to distribute under the MIT licence).

Most of the development was driven by actual needs I had, and I was willing to accept a few workarounds. It was a side project within a larger project with deliverables, after all. For example, the first generation of entity proxies were simply decorators. This worked great in most situations, but Twig, the favored template engine with Symfony, did not appreciate them much, as they relied too heavily on magic methods which it could not use reflection on. For a long time, I would just use getEntity() on the proxy to get the actual object, with the limitations that come with it. I eventually gave in and generated proxies, just like Doctrine does.
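
For illustration, a decorating proxy of that first generation looked roughly like this (a sketch, not the project’s code). Everything goes through __call(), which is exactly what Twig could not introspect:

    // Lazy-loads the real entity on first use and forwards every call to it.
    class EntityProxy
    {
        private $loader;
        private $entity;

        public function __construct($loader)  // $loader: callable returning the real entity
        {
            $this->loader = $loader;
        }

        public function getEntity()
        {
            if ($this->entity === null) {
                $this->entity = call_user_func($this->loader);
            }

            return $this->entity;
        }

        public function __call($method, $arguments)
        {
            // No real methods exist on the proxy class, so reflection sees nothing.
            return call_user_func_array(array($this->getEntity(), $method), $arguments);
        }
    }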

In fact, very early on in the project, the decorating proxies relied only on naming conventions to do their job. That worked great until a few edge cases became hard to work around. They now use declarative annotations, read with the Doctrine annotation readers.
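
The mapping ends up declared in the same spirit as Doctrine’s, along these lines (the annotation names here are invented for the example, not the ones the project actually uses):

    /**
     * @OGM\Entity
     */
    class Person
    {
        /** @OGM\Property */
        private $name;

        /** @OGM\Relation(type="WORKS_AT") */
        private $employer;
    }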

I never intended to write a complete entity manager, but it came naturally. It felt like a huge task initially. In the end, it was just a few hours of coding joy spread out over a few months. All along, I could just take a look at how they achieved it in Doctrine and orient the design in the same direction, taking shortcuts to meet the other schedules. One of the great aspects of Doctrine’s design is that it relies on very few methods. The tests are all very high level, meaning I could refactor major parts of it without changing the tests at all.

* To this day, there is absolutely no support for removal of nodes or relations, because I did not need that.

Experiment in the small

Technology moves fast. Every week there are new frameworks and libraries. In the past few years, it seems like data stores have been appearing at an even faster rate. Each of them claims to be a revolution. Those who have been around for a while know that revolutions don’t happen that often. Those claims set expectations very high.

Have you ever been in a situation where a new hire in a company is given a walk-through of the projects and the structures, and the mentor can’t really explain how it became such a mess? Where the mentor mentions technologies that were once promising and revolutionary, only to be left behind now as shameful legacy?

Only time can test new technologies. What looks promising based on demos and samples may simply not scale to larger applications, or may cause maintenance burdens in the long term. I grew to be conservative when it comes to technologies. I still use PHP daily, after all. I know it has flaws, but I also know it won’t fail me. Starting a new green field project is challenging. There are tons of decisions to be made. Tons of new and exciting toys to play with. However, trying to be too innovative hurts most of the time. Bleeding edge is a very well coined term. New technologies mean new problems to solve, which can be fun early on, but when you need to deliver and you start to hit limitations you were not aware of, the wasted time starts eating away at the benefits.

Immature products do not come with a huge body of knowledge and clear guidelines. You can use a great technology in a wrong way and create horrors. We have all seen some.

Of course, new technologies need to be adopted. In the long run, stable becomes obsolete. While I believe relational databases are not going away any time soon, those new data stores that used to live in the academic world will eventually mature and become mainstream. It started with all of the start-ups in the world using MongoDB or Cassandra or CouchDB, but this is not mainstream. Early adopters at best. I do try out new frameworks and databases on a regular basis, and enjoy it. However, I keep them outside of the critical path until I am confident that I understand the technology well enough.

There are plenty of places to experiment around projects. Perhaps a report needs to be built and SQL is not well suited for it. A prototype for a new feature can be a good place to experiment as well. If it is to go straight into the main application, I take extra precautions. I prepare a contingency plan. I make sure there are good abstractions in place that allow me to replace it if anything goes wrong. If the technology does not allow me to abstract it away, its design is probably not elegant enough for me to use it anyway. I always place maintainability above my desire to try new things, which can be hard.
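
The abstraction does not need to be anything fancy; a narrow interface that the rest of the application codes against is usually enough to keep the experiment replaceable (a sketch, with made-up names):

    // The application only knows about this interface. If the experimental
    // store disappoints, only the class behind it gets rewritten.
    interface ActivityReportStore
    {
        public function record($event, array $attributes);
        public function reportFor($period);
    }

    class GraphBackedReportStore implements ActivityReportStore
    {
        // The experimental technology lives here, and only here.
        public function record($event, array $attributes)
        {
            // ... push the event to the new data store
        }

        public function reportFor($period)
        {
            // ... query the new data store
            return array();
        }
    }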

Experiments are supposed to fail once in a while. If you end up in a situation where everything you try is wonderful and you end up using it, there is something wrong with the evaluation process. Even more so if you experiment on the bleeding edge, with technologies out for a couple of weeks. Failures are not a bad thing. Most of the time, new technologies come as a whole package that you are supposed to either take or discard. However, they are usually based on ideas that are simply less common. Ideas that you can take away and use to influence your designs.

Rushing to track

Working on larger projects comes with the annoyances of project management. Every stakeholder sends in someone to check on the project status and make sure it is on track. You can’t really blame them for keeping an eye on where their money is going, but it can be very counterproductive. Software projects are not like every other project. It’s not a production line and it is very hard to measure. Complexity is hidden, and progress can be too. A well-run project will tackle the bigger risks first to avoid uncertainty near the end. This means that early on, progress may just be invisible. Outsiders like to see visible progress, but focusing on visual details just builds pretty prototypes. It looks complete, but it’s not.

Trying to please project managers rather than meeting the actual needs is bad engineering practice, even though it creates successful projects on paper. Just like the vast majority will not remember who finished second, a pile of project reports finished on time and on budget is a manager’s pride. Reports don’t often state that the definition of complete was distorted along the way. Trying to track the invisible causes dysfunction. At some point, you need to trust the people working on the project.

Estimates are hardly ever accurate on day 0. Until some work is done, it’s just impossible to know how long something is going to take. You can have a wild guess. Having worked on similar problems before helps in assessing the risks and adjusting the estimate, but when it’s brand new work integrating with unknown systems, starting to code is the only way to feel the resistance the system will offer. The risks need to be assessed. Unstable or undocumented APIs, incoherent data, undefined business logic and dozens of other factors can affect how long something will take. If you have committed to a tight deadline and the developers know it, knowing precisely how long tasks are going to take won’t save you. Letting them work might.

When the primary risks are tackled and something is in place, management becomes possible. Making a list of changes that deviate from the current state, the baseline, to reach a target is quite simple. With the primary risks out of the way, it should be possible to break down the list of tasks into fairly even sizes and manage by tracking velocity. Managing becomes what it should be: looking at the time available and prioritizing. Trying to manage by tracking velocity during the inception, which ends when the primary risks are tackled, is just a waste of time. It leads to panic because velocity is not high enough early on, until for some magical reason velocity sharply increases at some point in the project, which is when the actual construction starts. That is, if panic decisions did not screw up the project first.

You can plan all you want, but software has its own agenda. The problem space defines how long inception and elaboration will be, not a schedule on the wall.

Chicken and eggs

Developing new features in the open source world is a long process. Not because coding takes time, but because the maturation cycle is much longer. In a normal business development cycle, the specifications are usually quite clear and they will be validated by QA before a release. In most cases I encounter, the initial need is driven by a specific case, but due to the open nature, the implementation must eventually cover broader cases, driven by feature requests or stories from other users.

The main issue is that those additional cases cannot be validated right away. Even if you contact people directly, it’s unlikely that they will get a development version to test with. Validation will have to wait until the software is released, and they might not even test it as soon as the software comes out. With a 6-month release schedule for major changes, that means the use case validation will take 6 to 12 months.

When the feedback finally arrives, changes are often needed. They are not usually very large changes: small changes to the user interface to expose existing capabilities, minor bug fixes or other issues that take less than 2 hours to resolve. Some say that figuring out the problem is half of the job. In this case, finding the issue consumes 99% of the schedule. However, fixing it is not the end of it. For re-validation, a release still needs to happen. It might be in a minor release depending on the timing of the fix, which may be a month away in the best cases. Still, the story is not over, as yet more issues may be found.

The reason it takes so long is that development is done for preemptive needs rather than immediate needs. They are nice-to-have features, but not having them is not a show stopper, or the users would not be using the software. Alternatively, it may be a show stopper, in which case they are not using the software at all and use something else in the meantime.

This is still in the best of cases, as some people will just try it and declare it broken, stick to their old ways and never signal an issue. In their minds, the feature remains broken forever and they will stay away from it. They might come back much later once the feature has matured. Because they have a work-around, they won’t ever feel the urge to transition, and the longer it takes, the harder it will be as the work-around probably uses some techniques that are not as clean and slightly corrupt the data structure.

Assuming the feature is useful enough for a critical mass to try it and report issues, it can easily take 2 years for a feature to go from functional to mature and broadly usable. It is a long time. This is for a feature that really worked from the start, had known behaviors, documentation, unit testing and all of what you would expect from production-ready code. It still takes years.

The only way to speed up the process is to find some other users with critical needs who will have a detailed case to resolve. Most of the time, they will not even know they can hook into some existing functionality. Getting a handful of those users who are brave enough to install a development version and actively test their use case can cut the maturation process in half. Every time an issue is resolved (in a good way, not a dirty hack), it unlocks many more use cases and allows for more improvements. That’s when the feature becomes first-class.

Faster iteration is the key. If your organization uses a waterfall process or has a distant QA team that does not work closely with the developers, the same issue is likely to hit you. If you can’t live with the long maturation process the way an open source project can, you need to plan for it and manage it as a risk. Don’t wait until the week before the release to tie up loose ends. Make sure the code is more than just a proof of concept early on.

Contributing organizations

Originally written in 2009, but never published. Conclusion was reworked.

The open source model is widely different from the typical business plan. There used to be a time when contributors were volunteers, working out of passion and love of the project in their free time. These days, I feel most of the contributors to projects make a living out of it. I do it. There is nothing wrong with that. In fact, it’s a good thing. It’s a lot more sustainable. Everyone needs to make a living, so the odds that you lose a contributor because the company he works for just got bought out (or is on the verge of it), he ends up working 70-hour weeks and he burns out are much lower. Contribution can become a top priority for a reasonable amount of work hours per week. The Linux kernel was probably one of the first examples where all the top contributors ended up being paid by companies to do it.

As an individual, I find it makes a lot of sense to focus on open source software. There is nothing I hate doing more than writing the same code twice. Building from an open platform, and contributing back, lets me avoid writing the same thing twice and sometimes means I never have to write the code at all. I’m mostly a developer. I don’t do much of the applying for contracts and filling specific needs. I tried it before. Not a happy place for me, even when basing the work on open solutions. I’d rather leave the pressure to deliver to someone else. It turns out many companies out there see open source as a business opportunity. They can build on an existing product to deliver more value fast. I get hired to take them further while they handle the day-to-day problems. It’s a good deal for both sides.

There is however a major difference between a company serving real, tangible clients and the open source world. After working for several clients, I can see a major difference between the successful ones and those that barely stay above water. It turns out it probably differentiates successful from unsuccessful in any field. It’s called vision. The good companies see ahead. They anticipate problems and make sure they are resolved before the client meets them, at least the fundamental ones. Unsuccessful companies just keep fighting fires, pushing the problems down and hoping someone else will resolve them.

A while back, I read one of Joel’s articles on micromanagement. Apart from the conference organizer’s inside jokes about terrible WIFI access in conference centers (and the great advice to make sure they give it for free if it does not handle the load — which is always), the following passage made me smile:

At the top of every company, there’s at least one person who really cares and really wants the product and the customer experience to be great. That’s you, and me, and Ryan. Below that person, there are layers of people, many of whom are equally dedicated and equally talented.

But at some point as you work your way through an organization, you find pockets of people who don’t care that much.

Apart from a stint as an employee (technically an intern, as I never really had a full-time job otherwise), I spent most of my time as a consultant working in fairly small organizations, kept closer to the developers than to the management types. I can say without hesitation that the people down the ladder mostly blame high-level management for getting in their way and preventing them from doing their job. Both are probably right. However, I still find Joel’s wording a bit harsh.

To function correctly with open source and make the relationship efficient, you need to embrace it. Companies trying to make it a one-way relationship have ended up failing. In open source, the project extends beyond the company. Every line of code you don’t contribute back is one line you will have to maintain yourself. At some point, it will simply break. Upgrading to the later versions will become harder. At that point, you had better hope you have killer traceability back to the original issue, because you will have to implement it all over again. With high turnover, it might just kill the company, and the community (and the consultants who are part of it) might not be motivated to help you out if you did not contribute back when times were good.

Refactoring sprint

I spent the last week in Boston tackling the refactoring of Tiki trackers with other developers. The code was getting old and had evolved in ways no one would recommend. The original author himself had described them as a hack. Yet, hundreds of people use them extensively and the interface has been polished over the years. The main issue is that the cruft-to-value ratio was reaching a tipping point. The collapse had been predicted for a long time but did not happen. Modifications only took longer to perform, leaving more cruft to remove each time. Worse, no one dared to get close to that code. Few had enough courage to modify it.

Before leaving for Boston to tackle the issue, I had gone through the code and cleaned up parts of it, making conditions more explicit and the code slightly more expressive. The objective was not so much to improve the code as to understand the raw design underneath it. I think all software has a design; only too often, it is unintentional and hidden. In those cases, refactoring is initially about making the current design explicit. Once that is done, it can be refactored further and improved to match new requirements. When initially setting the goals for the week, saying we’re not going to fix any issues, add new features or otherwise solve everyone’s own favorite problem is a tough sell.

A successful refactoring sprint is about discipline. The group must stay away from distractions and concentrate on the task. Our task was to extract the field rendering and input logic into cohesive units. The initial input was composed of a few files between 1 KLOC and 6 KLOC, handling around 40 field types, all mashed together. Some lines were between 500 and 700 characters long. Some parts were duplicated in multiple locations, with the mandatory differences that make them hard to reconcile. Removing that duplication was one of the primary objectives. It was challenging. I don’t think anyone thought it was possible when we began, but it had to be done.

Figuring out where to begin was not easy. Initially, you can’t even get everyone working. Even though the problem had a natural separation along the field types, the code would fight back when too many people worked on it. At first, an initial interface had to be defined as the target to reach. Then it had to be plugged in. Essentially, it comes down to this: if we have a handler to do it, use it; otherwise, fall back to whatever was there before. Those kinds of hooks had to be deployed in many places, but we began with one, working on a few handlers to see how it worked out and learning about the design.
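
In code, each hook looked something like this simplified sketch (the names are made up for illustration, not the actual Tiki ones):

    // Use the new handler when one is registered for the field type,
    // otherwise keep going through the legacy code path.
    class FieldRendererRegistry
    {
        private $handlers = array();

        public function register($type, $handler)   // $handler: callable rendering the field
        {
            $this->handlers[$type] = $handler;
        }

        public function render(array $field, $value)
        {
            if (isset($this->handlers[$field['type']])) {
                return call_user_func($this->handlers[$field['type']], $field, $value);
            }

            return $this->legacyRender($field, $value);
        }

        private function legacyRender(array $field, $value)
        {
            // Whatever was there before, untouched until a handler replaces it.
            return htmlspecialchars($value);
        }
    }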

As more handlers were created, more hooks were added in other places, leading us to revisit the previous ones. It’s a highly iterative process. I made the first iteration alone and others were introduced gradually. Everyone’s first handler took a whole day. It was much more than my most pessimistic estimates. There was a lot to learn. However, as each of us understood the design of the code and the patterns to be found, the pace accelerated. We could see those huge files melting. Each step of the way, it became easier. At least, that was the feeling.

Then someone asked how far along we were. I pulled out a whiteboard and made a list of the field types that were still to be done. The initial list came as a disappointment. The list was still long. We were only halfway and well past the week’s mid-mark. However, past the initial disappointment, having the list visible ended up being a motivator, because each item completed made the list shorter. It encouraged us to fully complete the handlers rather than leaving dangling issues.

We ended up finishing on the last evening. This was a one-week burn. The last few hours were hard for everyone. After spending a week working long hours on challenging code, I don’t think we could have accomplished more than we did. However, there was great satisfaction. The refactoring process is not complete. One of the issues was tackled, but there are other areas of the code that need to be worked on. However, the bulk of the job was done as a team effort, and now there are stronger foundations to build from. No one could have done it alone.

It should be noted that the week was not only hard work. It was also a social event where non-coding contributors of the community and users were welcome to stop by and chat. There were late night discussions around beers, leading to even less sleep, and the whole week was a great team building experience. While we were shuffling thousands of lines of code around, the documentation team also re-organized the structure of the documentation.

Where ugly code lies

There are multiple definitions of what software architecture is, notwithstanding that in some jurisdictions, the term cannot legally be used. Definitions vary from high-level code design to organizational issues. James O. Coplien and Gertrud Bjørnvig came up with a good summary in Lean Architecture.

  • Architecture is more about form than structure. Implementation details are not in the scope. Interfaces, classes and design paradigms are not even considered. Only the general form of the final solution is.
  • Architecture is more about compression than abstraction. Abstractions are in code. They are detailed. The architecture part of the work is about larger scale partitioning into well named patterns, which may have multiple implementations.
  • Much architecture is not about solving user problems. While I don’t fully agree with this one, it’s true that most users will not see the changes right away.

These are high-level concepts that have a huge impact on the code. The partitioning that results determines how additions are made to the code and what form they will take. There is a direct relationship to the software design and the API available to implementers.

When I look at code in software designed using good techniques, there is typically a clear distinction between some core managing the general process and the extensions following interfaces that are called at the appropriate time. When you look at code inside the core, it really does not seem to do much. There are usually a few strange incantations to call the extension points efficiently and massage the information sent through. The code is not really pretty, but the structure it represents is clean. The glue has to go somewhere.

The leaves, or extension points, are typically a complete jungle. Contortions must be made to fit the interface as mapping is done to an external system. Some pieces were written quickly to serve a pressing matter, fell into a technical debt backlog and eventually out of sight. Code is duplicated around, carried from the older generations to the new ones, evolving over time, except that the ancestors stay around and never get the improvements from the new generations. Quality varies widely, as do the implementers’ abilities and experience, but all of the components are isolated and do not cause trouble… most of the time.

Seeing how code rots in controlled environments, I’m always a bit scared when I see a developer searching the web and grabbing random plugins for an open source platform and including them in the code. Disregarding the license issues that are almost never studied, that practice is plain dangerous. There are security implications. Most developers publishing plugins are not malicious, they are simply ignorant of the flaws they introduce.

jQuery is probably the flagship in the category of a quality core containing arcane incantations, surrounded by a jungle of plugins. Surely, having 50,000 plugins may seem like a selling point, but most of them are lesser duplicates of other ones. Code quality is appallingly low. In most cases, it takes less than 30 seconds to realize they were written by people (self-proclaimed experts) who knew nothing of jQuery and just enough JavaScript to smash together a piece of functionality while following a tutorial, then brand it as a jQuery plugin for popularity’s sake. Never use a plugin without auditing the code first.

Even if good care is taken to control the leaves, ugly code will appear all over. There is no other solution than to go back and add the missing abstractions. Provide the additional tools needed to handle the frequent problems that were duplicated all around. No amount of planning will predict the detailed needs of those extension points. What allows architecture to work is compression, being able to skip details so the system can be understood as a whole. The job is not done when the core is in place. Some time must be allocated to watching the emergence of patterns and responding to them, either by modifying the core or by providing use-at-will components. It can be done in multiple steps too.

Recently, I was asked to do a lot of high-level refactoring in Tiki. Major components had systemic flaws known for a while, and many determined they had to be attacked after the release of the long term support version. High-level work has several impacts, but sometimes, just providing low-level tools can improve the platform significantly. Cleaner code will make the high-level changes easier to perform. It only takes a few hours to run through several thousand lines of code and identify commonalities that could be extracted. Extract them, deploy them around. Iterate. Automated tests to support those changes would be nice, but most of the time, those changes are so low level that they are almost impossible to get wrong.

Sensible defaults

I have not written much C++ in my life. Most of it goes back to college and university, and that short period of time I was at Autodesk. However, I always considered the STL to be very influential. A few years back, I read Bjarne Stroustrup’s book and something hit me. For those not familiar with the STL, it’s a very template-intensive library that does not use many of the traditional interfaces you typically see in object-oriented libraries. Instead, everything is based on duck typing. If the argument you pass provides the right set of operators or methods, the compiler will do the job and go ahead with it. If it does not, the compiler will print out a few dozen lines of garbage with angle brackets all around, which does not relate much to your code. Still, one of the core concepts in the library is that if an operation is efficient, it will have an operator to do it. If it’s not, it will have a function name that’s rather long, hence encouraging the use of efficient operations. How does this work in reality? A vector provides direct value access through square brackets, pretending to be an array; a list doesn’t, because that would be O(n) and that’s not something they want to encourage.

The documentation also contains hints like this one:

[3] One might wonder why pop() returns void, instead of value_type. That is, why must one use front() and pop() to examine and remove the element at the front of the queue, instead of combining the two in a single member function? In fact, there is a good reason for this design. If pop() returned the front element, it would have to return by value rather than by reference: return by reference would create a dangling pointer. Return by value, however, is inefficient: it involves at least one redundant copy constructor call. Since it is impossible for pop() to return a value in such a way as to be both efficient and correct, it is more sensible for it to return no value at all and to require clients to use front() to inspect the value at the front of the queue.

Not all libraries in the world are so careful. Take this snippet from the Zend Search Lucene documentation:

Java Lucene uses the ‘contents’ field as a default field to search. Zend_Search_Lucene searches through all fields by default, but the behavior is configurable. See the “Default search field” chapter for details.

When I came across this while reading the documentation to build a fairly complex index, I thought searching all fields was a reasonable default. It’s very convenient. I could store my content in the fields it belongs to, make sure it is searchable by default and still allow for finer-grained search when required. Fantastic. I went ahead with it, wrote the code to collect the information and index it properly, and built the query abstraction at the same time, in good old TDD fashion. Everything worked. Of course, I was testing on very small data sets.

I then went ahead and tested with larger data sets. I installed an old back-up from a site with around 2000 documents in it. It felt like a decent test. I expected the indexing to take around half an hour for that type of data, based on what I had read online. The search component had not been selected for speed; it had been selected for portability. Speed was one of those sacrificed attributes, as long as it was not too terrible. Of course, the initial indexing took longer than expected, but only by a factor of 2, and I knew some places were not fully optimized yet (it’s now down to 20-25 minutes).

The really big surprise came as I attempted to make a simple search in the index. It timed out. After 60 seconds. Initial attempts at profiling failed, as it was getting late on a Friday afternoon. I closed up shop and had a bit of trouble getting it out of my head that night.

When I got back to it, I took out the time limit, started a profiling session on it and enjoyed my coffee for a little while. The results indicated that the search spent pretty much all of its time optimizing the search query. It was making tens of thousands of calls to functions that eventually read from disk. There was not much more reporting in there to help me. I started adding some var_dumps in the code to see what was going on. Well, it turns out that "search all fields by default" was not such a great idea. It literally searched through all the fields and basically expanded the query. Because of how I had interpreted the API and documentation, I had built my index to be quite expansive and it contained a few dozen fields, not all of which existed for all documents. It was a mistake. There was one valid reason why the Java implementation did not behave that way: it was not possible to do it efficiently.

I ended up modifying the index generation to put all content in a contents field, duplicate the pieces you would actually want to search independently into their own fields, and search contents by default. Indexing time wasn’t altered much, the code changes were actually very minor and easy thanks to the array of tests available, and search speed went up dramatically. It’s not as fast as Sphinx for sure, but it does offer decent performance and can run on pretty much any kind of cheap hosting, which is a good feature for an open source CMS. It still needs to be investigated, but it’s also likely to be a smooth upgrade path to using Solr for larger installations. Abstractions around the indexing and searching will also allow moving quickly to other index engines as needed.
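
The shape of the fix, with Zend_Search_Lucene, is roughly this (a simplified sketch of the idea, not the project’s actual indexing code; it assumes the Zend Framework autoloader is already set up):

    // Everything searchable goes into a single 'contents' field, which becomes
    // the default search field. Specific fields stay around for targeted queries.
    $index = Zend_Search_Lucene::create('/path/to/index');

    $title = 'Dashboard';
    $body  = 'Issues should be visible to everyone, every day.';

    $document = new Zend_Search_Lucene_Document();
    $document->addField(Zend_Search_Lucene_Field::UnStored('contents', $title . ' ' . $body));
    $document->addField(Zend_Search_Lucene_Field::Text('title', $title));
    $document->addField(Zend_Search_Lucene_Field::Keyword('type', 'wiki page'));
    $index->addDocument($document);

    // Searching 'contents' by default keeps the query small instead of letting
    // it expand across every field in the index.
    Zend_Search_Lucene::setDefaultSearchField('contents');
    $hits = $index->find('dashboard');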

Asymptotic analysis is one really boring part of the computer science curriculum, but it’s really something to consider when building libraries that are going to be used with large numbers of documents. The API needs to reflect the limitations and the documentation must explain them clearly.