Following great design

A couple of months ago, I started a new project. I had no legacy to follow. Everything was open. I decided to stick to PHP, because this is what I know best. New projects have enough unknowns by themselves. I didn’t need to learn a whole new platform. Instead, I decided to learn a new framework. I went with Symfony2, and of course, Doctrine2. I must say I was more than impressed. These two together completely change what PHP programming is, and for the better. They do what they need to do, and do it extremely well. Even profiling the code demonstrated how much care went into them. I never liked ORMs much because the previous generations polluted the object models. Doctrine2 lets you write fully testable object code and hides away the storage… mostly.

As the project evolved, it became clear that the relational model was not a good fit. A graph was mostly what we needed. A little bit of study and experimentation later, I settled for Neo4j. An excellent PHP library handled the REST protocol. However, using it felt like a downgrade from my new Doctrine habits. The code was not as nice. I started writing a tiny entity manager with the persist() and flush() methods just to handle the creation of the objects. Most of the work was done through queries anyway. I did not need much more than a write-only interface. A couple of hours later, it was up and running. It made the insert code much nicer. At this point, I was still experimenting a little. There was no strong commitment in the code.
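
The whole thing really was that thin at first. Here is a minimal sketch of the idea, assuming a hypothetical REST client exposing a createNode() call; the actual library’s API differs:
<?php
class WriteOnlyEntityManager
{
    private $client;             // hypothetical REST client wrapper
    private $scheduled = array();

    public function __construct($client)
    {
        $this->client = $client;
    }

    // Queue an object for insertion; nothing is sent yet
    public function persist($entity)
    {
        $this->scheduled[] = $entity;
    }

    // Send everything that was queued to Neo4j
    public function flush()
    {
        foreach ($this->scheduled as $entity) {
            // Naive mapping: public properties become node properties
            $this->client->createNode(get_object_vars($entity));
        }

        $this->scheduled = array();
    }
}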

As time went by, I started adding a few pieces. I figured just retrieving the properties back into the object would not be so hard. With a couple of hours here and there, mostly in my spare time, because this was actually fun, I eventually ended up dynamically loading the relations and essentially had a nearly complete* entity manager for Neo4j (which I am glad to distribute under the MIT licence).

Most of the development was driven by actual needs I had, and I was willing to accept a few workarounds. It was a side project within a larger project with deliverables after all. For example, the first generation of entity proxies was simply decorators. This worked great for most situations, but Twig, the favored template engine with Symfony, did not appreciate them much, as they relied too heavily on magic methods it could not use reflection on. For a long time, I would just use getEntity() on the proxy to get the actual object, with the limitations that come with it. I eventually gave in and generated proxies, just like Doctrine does.

In fact, very early on in the project, the decorating proxies would only rely on naming conventions to do their job. That worked great until a few edge cases became hard to work around. They now use declarative annotations, read with the Doctrine annotation reader.
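
The mapping now looks roughly like this; the annotation names below are illustrative, not the ones the library actually defines, and the annotation classes themselves have to be declared and autoloaded for the reader to accept them:
<?php
/**
 * @Entity
 */
class Article
{
    /** @Property */
    protected $title;

    /** @Relation(type="AUTHORED_BY") */
    protected $author;
}
// The generated proxies read this mapping through Doctrine's annotation reader
// (Doctrine\Common\Annotations\AnnotationReader) instead of guessing from method names.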

I never intended to write a complete entity manager, but it came naturally. It felt like a huge task initially. In the end, it was just a few hours of coding joy spread out over a few months. All along, I could just take a look at how they achieved it in Doctrine, and orient the design in the same direction, taking shortcuts to meet the other schedules, but still. One of the great aspects of Doctrine’s design is that it relies on very few methods. The tests are all very high level, meaning I could refactor major parts of it without changing the tests at all.

* To this day, there is absolutely no support for removal of nodes or relations, because I did not need that.

Experiment in the small

Technology moves fast. Every week there are new frameworks and libraries. In the past few years, it seems like data stores have been appearing at an even faster rate. Each of them claims to be a revolution. Those that have been around for a while know that revolutions don’t happen that often. Those claims set expectations very high.

Have you ever been in a situation where a new hire in a company is given a walk-through of the projects and the structures, and the mentor can’t really explain how it became such a mess? Where the mentor mentions technologies that were once promising and revolutionary, only to be left behind now as shameful legacy?

Only time can test new technologies. What may look promising based on demos and samples may simply not scale to larger applications, or may cause maintenance burdens in the long term. I grew to be conservative when it comes to technologies. I still use PHP daily after all. I know it has flaws, but I also know it won’t fail me. Starting a new green field project is challenging. There are tons of decisions to be made. Tons of new and exciting toys to play with. However, trying to be too innovative hurts most of the time. Bleeding edge is a very well coined term. New technologies mean new problems to solve, which can be fun early on, but when you need to deliver and you start to hit limitations you were not aware of, the wasted time starts eating away at the benefits.

Immature products do not come with a huge body of knowledge and clear guidelines. You can use a great technology in a wrong way and create horrors. We have all seen some.

Of course, new technologies need to be adopted. In the long run, stable becomes obsolete. While I believe relational databases are not going away any time soon, those new data stores that used to live in the academic world will eventually mature and become mainstream. It started with all of the start-ups in the world using MongoDB or Cassandra or CouchDB, but this is not mainstream. Early adopters at best. I do try out new frameworks and databases on a regular basis, and enjoy it. However, I keep them outside of the critical path until I am confident that I understand the technology well enough.

There are plenty of places to experiment around projects. Perhaps a report needs to be built and SQL is not well suited for it. A prototype for a new feature can be a good place to experiment as well. If it is to go straight into the main application, I take extra precautions. I prepare a contingency plan. I make sure there are good abstractions in place that allow me to replace it if anything goes wrong. If the technology does not allow me to abstract it away, its design is probably not elegant enough for me to use it anyway. I always place maintainability above my desire to try new things, which can be hard.

Experiments are supposed to fail once in a while. If you end up in a situation where everything you try is wonderful and you end up using it, there is something wrong with the evaluation process. Even more so if you experiment on the bleeding edge, with technologies out for a couple of weeks. Failures are not a bad thing. Most of the time, new technologies come as a whole package that you are supposed to either take or discard. However, most of the time, they are based on ideas that are simply less common. Ideas that you can take away and use to influence your designs.

Chicken and eggs

Developing new features in the open source world is a long process. Not because coding takes time, but because the maturation cycle is much longer. In a normal business development cycle, the specifications are usually quite clear and they will be validated before a release by QA. In most cases I encounter, the initial need is driven by a specific case, but due to the open nature, the implementation must eventually cover broader cases, driven by feature requests or stories from other users.

The main issue is that those additional cases cannot be validated right away. Even if you contact people directly, it’s unlikely that they will get a development version to test with. Validation will have to wait until the software is released, and they might not even test as soon as the software comes out. With a 6-month release schedule for major changes, that means that the use case validation will take 6 to 12 months.

When the feedback finally arrives, changes are often needed. They are not usually very large changes. Small changes to the user interface to expose existing capabilities, minor bug fixes or other issues that take less than 2 hours to resolve. Some say that figuring out the problem is half of the job. In this case, finding the issue consumes 99% of the schedule. However, fixing it is not the end of it. For re-validation, a release still needs to happen. It might be in a minor release depending on the moment of the fix, which may be a month away in the best cases. Still, the story is not over, as yet more issues may be found.

The reason it takes so long is that development is made for preemptive needs rather than immediate needs. They are nice-to-have features, but not having them is not a show stopper or the users would not be using the software. Alternatively, it may be a show stopper, in which case they are not using the software at all and use something else in the meantime.

This is still in the best of cases, as some people will just try it, declare it broken, stick to their old ways and never report an issue. In their minds, the feature remains broken forever and they will stay away from it. They might come back much later once the feature has matured. Because they have a work-around, they won’t ever feel the urge to transition, and the longer it takes, the harder it will be, as the work-around probably uses some techniques that are not as clean and slightly corrupts the data structure.

Assuming the feature is useful enough for a critical mass to try it and report issues, it can easily take 2 years for a feature to go from functional to mature and broadly usable. It is a long time. This is for a feature that really worked from the start, had known behaviors, documentation, unit testing and all of what you would expect from production-ready code. It still takes years.

The only way to speed up the process is to find some other users with critical needs who will have a detailed case to resolve. Most of the time, they will not even know they can hook into some existing functionality. Getting a handful of those users who are brave enough to install a development version and actively test their use case can cut the maturation process in half. Every time an issue is resolved (in a good way, not a dirty hack), it unlocks many more use cases and allows for more improvements. That’s when the feature becomes first-class.

Faster iteration is the key. If your organization uses a waterfall process or has a distant QA team that does not work closely with the developers, the same issue is likely to hit you. If you can’t live with the long maturation process the way an open source project can, you need to plan for it and manage it as a risk. Don’t wait until the week before the release to tie up loose ends. Make sure the code is more than just a proof of concept early.


Refactoring sprint

I spent the last week in Boston tackling the refactoring of Tiki trackers with other developers. The code was getting old and had evolved in ways no one would recommend. The original author himself had described them as a hack. Yet, hundreds of people use them extensively and the interface had been polished over the years. The main issue is that the cruft-to-value ratio was reaching a tipping point. The collapse had been predicted for a long time but did not happen. Modifications only took longer to perform, leaving more cruft to remove each time. Worse, no one dared to get close to that code. Few had enough courage to modify it.

Before leaving for Boston to tackle the issue, I had gone through the code and cleaned up parts of it, making conditions more explicit and the code slightly more expressive. The objective was not so much to improve the code as to understand the raw design underneath it. I think all software has a design; only too often, it is unintentional and hidden. In those cases, refactoring is initially about making the current design explicit. Once that is done, it can be refactored further and improved to match new requirements. When initially setting the goals for the week, saying we’re not going to fix any issues, add new features or otherwise solve everyone’s own favorite issue is a tough sell.

A successful refactoring sprint is about discipline. The group must stay away from distractions and concentrate on the task. Our task was to extract the field rendering and input logic into cohesive units. The initial input was composed of a few files between 1KLOC and 6KLOC, containing around 40 field types being handled, all mashed together. Some lines were between 500 and 700 characters long. Some parts were duplicated in multiple locations, with the mandatory differences that make them hard to reconcile. Removing that duplication was one of the primary objectives. It’s challenging. I don’t think anyone thought it was possible when we began, but it had to be done.

Figuring out where to begin is not easy. Initially, you can’t even get everyone working. Even if the problem had a natural separation with the field types, the code would fight back when too many people worked on it. At first, an initial interface had to be defined as the target to reach. Then it had to be plugged in. Essentially, it comes down to: if we have a handler to do it, use it; otherwise, fall back to whatever was there before. Those kinds of hooks had to be deployed in many places, but we began with one, working on a few handlers to see how it worked out, learning about the design.
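
Stripped of the tracker-specific details, each hook looked roughly like this; the names are illustrative, not the actual Tiki ones:
<?php
// Somewhere in the legacy rendering code
function render_field($field, $value)
{
    $handler = Tracker_Field_Factory::build($field);

    if ($handler) {
        // New path: a cohesive handler renders its own field type
        return $handler->renderOutput($value);
    }

    // Old path: fall back to the legacy code that was already there
    return legacy_render_field($field, $value);
}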

As more handlers were created, more hooks were added in other places, leading to revisiting the previous ones. It’s a highly iterative process. I made the first iteration alone and others were introduced gradually. Everyone’s first handler took a whole day. It was much more than my most pessimistic estimates. There was a lot to learn. However, as each of us understood the design of the code and the patterns to be found, the pace accelerated. We could see those huge files melting. Each step of the way, it became easier. At least, that was the feeling.

Then someone asked how far along we were. I pulled out a whiteboard and made a list of the field types that were still to be done. The initial list came as a disappointment. The list was still long. We were only halfway, and well past the week’s mid-mark. However, past the initial disappointment, having the list visible ended up being a motivator, because each handler that was completed made the list shorter. It encouraged us to fully complete the handlers rather than leaving dangling issues.

We ended up finishing on the last evening. This was a one-week burn. The last few hours were hard for everyone. After spending a week working long hours on challenging code, I don’t think we could have accomplished more than we did. However, there was great satisfaction. The refactoring process is not complete. One of the issues was tackled, but there are other areas of the code that need to be worked on. However, the bulk of the job was done as a team effort, and now there are stronger grounds to build from. No one could have done it alone.

It should be noted that the week was not only hard work. It was also a social event where non-coding contributors of the community and users were welcome to stop by and chat. There were late night discussions around beers, leading to even less sleep, and the whole week was a great team building experience. While we were shuffling thousands of lines of code around, the documentation team also re-organized the structure of the documentation.

Where ugly code lies

There are multiple definitions of what software architecture is, notwithstanding that in some areas, the term cannot legally be used. Definitions vary from high level code design to organizational issues. James O. Coplien and Gertrud Bjørnvig came up with a good summary in Lean Architecture.

  • Architecture is more about form than structure. Implementation details are not in the scope. Interfaces, classes and design paradigms are not even considered. Only the general form of the final solution is.
  • Architecture is more about compression than abstraction. Abstractions are in code. They are detailed. The architecture part of the work is about larger scale partitioning into well named patterns, which may have multiple implementations.
  • Much architecture is not about solving user problems. While I don’t fully agree with this one, it’s true that most users will not see the changes right away.

These are high level concepts that have a huge impact on the code. The partitioning that results determines how additions are made to the code and what form they take. There is a direct relationship between the software design and the API available to implementers.

When I look at code in software designed using good techniques, there is typically a clear distinction between some core managing the general process and the extensions following interfaces that are called at the appropriate time. When you look at code inside the core, it really does not seem to do much. There are usually a few strange incantations to call the extension points efficiently and massage the information sent through. The code is not really pretty, but the structure it represents is clean. The glue has to go somewhere.

The leaves, or extension points, are typically a complete jungle. Contortions must be made to fit the interface as mapping is done to an external system. Some pieces were written quickly to serve a pressing matter, fell into a technical debt backlog and eventually out of sight. Code is duplicated around, taken from the older generations to the new ones, evolving over time, except that the ancestors stay around and never get the improvements from the new generations. Quality varies widely, as do the implementers’ abilities and experience, but all of the components are isolated and do not cause trouble… most of the time.

Seeing how code rots in controlled environments, I’m always a bit scared when I see a developer searching the web and grabbing random plugins for an open source platform and including them in the code. Disregarding the license issues that are almost never studied, that practice is plain dangerous. There are security implications. Most developers publishing plugins are not malicious, they are simply ignorant of the flaws they introduce.

jQuery is probably the flagship in the category of a quality core containing arcane incantations surrounded by a jungle of plugins. Surely, having 50,000 plugins may seem like a selling point, but most of them are lesser duplicates of other ones. Code quality is appallingly low. In most cases, it takes less than 30 seconds to realize they were written by people (self-proclaimed experts) who knew nothing of jQuery, just enough JavaScript to smash together a piece of functionality while following a tutorial, and who branded it as a jQuery plugin for popularity’s sake. Never use a plugin without auditing the code first.

Even if good care is taken to control the leaves, ugly code will appear all over. There is no other solution than to go back and add the missing abstractions. Provide the additional tools needed to handle the frequent problems that were duplicated all around. No amount of planning will predict the detailed needs of those extension points. What allows architecture to work is compression, being able to skip details so the system can be understood as a whole. The job is not done when the core is in place. Some time must be allocated to watching the emergence of patterns and responding to them, either by modifying the core or providing use-at-will components. It can be done in multiple steps too.

Recently, I was asked to do a lot of high level refactoring in Tiki. Major components had systemic flaws known for a while, and many determined they had to be attacked after the release of the long term support version. High level work has several impacts, but sometimes, just providing low level tools can improve the platform significantly. Cleaner code will make the high level changes easier to perform. It only takes a few hours to run through several thousand lines of code and identify commonalities that could be extracted. Extract them, deploy them around. Iterate. Automated tests to support those changes would be nice, but most of the time, those changes are so low level that it’s almost impossible to get them wrong.

Sensible defaults

I have not written much C++ in my life. Most of it goes back to college and university, and that short period of time I was at Autodesk. However, I always considered the STL to be very influential. A few years back, I read Bjarne Stroustrup’s book and something hit me. For those not familiar with the STL, it’s a very template-intensive library that does not use many of the traditional interfaces you typically see in object oriented libraries. Instead, everything is based on duck typing. If the argument you pass provides the right set of operators or methods, the compiler will do the job and go ahead with it. If it does not, the compiler will print out a few dozen lines of garbage with angle brackets all around, which does not relate much to your code. Still, one of the core concepts in the library is that if an operation is efficient, it will have an operator to do it. If it’s not, it will have a function name that’s rather long. Hence encouraging the use of efficient operations. How does this work in reality? A vector will provide direct value access through square brackets, pretending to be an array, and a queue won’t, because that would be O(n) and that’s not something they want to encourage.

The documentation also contains hints like this one:

[3] One might wonder why pop() returns void, instead of value_type. That is, why must one use front() and pop() to examine and remove the element at the front of the queue, instead of combining the two in a single member function? In fact, there is a good reason for this design. If pop() returned the front element, it would have to return by value rather than by reference: return by reference would create a dangling pointer. Return by value, however, is inefficient: it involves at least one redundant copy constructor call. Since it is impossible for pop() to return a value in such a way as to be both efficient and correct, it is more sensible for it to return no value at all and to require clients to use front() to inspect the value at the front of the queue.

Not all libraries in the world are so careful. Take this snippet from the Zend Search Lucene documentation:

Java Lucene uses the ‘contents’ field as a default field to search. Zend_Search_Lucene searches through all fields by default, but the behavior is configurable. See the “Default search field” chapter for details.

When I came across this while reading the documentation to build a fairly complex index, I thought it was a reasonable default to search all fields. It’s very convenient. I could store my content in the fields they belong to, make sure they are searchable by default and still allow for finer-grained search when required. Fantastic. I went ahead with it, wrote the code to collect the information and index it properly, and built the query abstraction at the same time, in good old TDD fashion. Everything worked. Of course, I was testing on very small data sets.

I then went ahead to test with larger data sets. I had an old back-up from a site installed with around 2000 documents in it. It felt like a decent test. I expected the indexing to be around half an hour for that type of data based on what I had read online. The search component had not been selected for speed, it had been selected for portability. Speed was one of those sacrificed attributes as long as it was not too terrible. Of course, the initial indexing took longer than expected, but only by a factor of 2, and I knew some places were not fully optimized yet (it’s now down to 20-25 minutes).

The really big surprise came as I attempted to make a simple search in the index. It timed out. After 60 seconds. Initial attempts at profiling failed as it was getting late on a Friday afternoon. I closed up shop and had a bit of trouble getting it out of my head that night.

When I got back to it, I took out the time limit, started a profiling session on it and enjoyed my coffee for a little while. The results indicated that the search spent pretty much all of its time optimizing the search query. It was making tens of thousands of calls to some functions eventually making reads on disk. There was not much more reporting in there to help me. I started adding some var_dumps in the code to see what was going on. Well, it turns out that “search all fields by default” was not such a great idea. It actually made it search through all the fields and basically expand the query. Because of how I interpreted the API and documentation, I had built my index to be quite expanded and it contained a few dozen fields, not all of which existed for all documents. It was a mistake. There was one valid reason why the Java implementation did not behave that way: it was not possible to do it efficiently.

I ended up modifying the index generation to put all content in a contents field, duplicating the content you would actually want to search independently into its own fields, and searching contents by default. Indexing time wasn’t altered by much, the code changes were actually very minor and easy due to the array of tests available, and search speed went up dramatically. It’s not as fast as Sphinx for sure, but it does offer decent performance and can run on pretty much any kind of cheap hosting, which is a good feature for an open source CMS. It still needs to be investigated, but it’s also likely to be a smooth upgrade path to using Solr for larger installations. Abstractions around the indexing and searching will also allow moving quickly to other index engines as needed.
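
The resulting indexing code looks something like this; the field layout is simplified from the real one:
<?php
// Everything searchable goes into 'contents'; fields you want to query
// independently are duplicated into their own fields
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $title . ' ' . $body));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $url));

$index = Zend_Search_Lucene::create($indexPath);
$index->addDocument($doc);

// Search 'contents' by default instead of expanding the query to every field
Zend_Search_Lucene::setDefaultSearchField('contents');
$hits = $index->find('tracker refactoring');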

Asymptotic analysis is a really boring part of the computer science curriculum, but it’s something to consider when building libraries that are going to be used with large numbers of documents. The API needs to reflect the limitations and the documentation must explain them clearly.

How not to fail miserably with system integration

In the last few weeks, I’ve had the opportunity to get my hands into an SOA system involving multiple components. I had prior experience, but not with that many evolving services at the same time. When I initially read the architecture’s documentation, I had serious doubts any of it was going to work. There were way too many components and each of them had to do a whole lot of work to get the simplest task accomplished. My role was to sit on the outside and call those services, so I did not really have to worry about the internals. Still, unless you’re working with Grade A services provided by reputable companies and used by tens of thousands of programmers, knowing the internals will save you a few pains.

Make sure you can work without them

I actually got to this one by accident, but it happened to be a good one. When I joined the team, there were plenty of services available. However, the entry point I needed in order to use any of them was not present at all. I had a vague idea of what the data would look like, so I started building static structures that would be sent off to the views to render the pages. I had a bit of a head start before the integrators joined the team and certainly did not want them to wait around for weeks until the service was delivered.

At some point, the said service actually entered someone’s iteration, which meant it would be delivered in the near future. Fairly quickly, a contract was made for the exact data format that would be provided. Although I was not wrong about the values that would be provided, the format was entirely different. My initial format then became an intermediate format for the view, providing the strict minimum required, and a layer was added in the system to translate the service’s format to our own. The service was not yet available, so it was just stubbed out. The conversion could be unit tested based on the static data in the service stub. Plugging in the service when it arrived was a charm and, except for a few environment configurations, it was transparent to the integrators.
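
In practice, the pieces looked something like this; the service and field names are simplified placeholders, not the real contract:
<?php
interface CatalogService
{
    // Mirrors the remote contract exactly
    public function getProducts($categoryId);
}

class StubCatalogService implements CatalogService
{
    public function getProducts($categoryId)
    {
        // Static data shaped exactly like the contract says the real service answers
        return array(
            array('id' => 1, 'label' => 'Example', 'price_cents' => 1000),
        );
    }
}

class CatalogTranslator
{
    // Converts the service format into the minimal format the views expect
    public function toViewModel(array $products)
    {
        $out = array();
        foreach ($products as $product) {
            $out[] = array(
                'name'  => $product['label'],
                'price' => $product['price_cents'] / 100,
            );
        }
        return $out;
    }
}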

During the entire development, this fake data ended up being very useful.

  • Whenever the services would change, there was an easy way to compare the expectations to what we actually received, allowing us to update the stub data and the intermediate layer.
  • When something would go terribly wrong and services would just fail, it was always possible to revert to the fake data and keep working.
  • It allowed us to reach code paths that would not normally be reached, like error handling.

Expect them to fail

One of the holy grails of SOA is that you can replace services on the fly and adjust capacity when needed. This may be partially true, but it also means that while your component works fine, the neighbor may be completely unreachable during maintenance or while restarting. If you happen to need it, you might as well be down in many cases. While one would hope services won’t crash in production, they happen to crash in development fairly often. To live with this, there is one simple rule: expect every single call to fail, and make a conscious decision about what to do about it. For one, it will make your system fault tolerant. If the only call that failed is fetching some information that is only used for rendering, it’s very likely that you don’t need to die with a 500 internal error. However, if you didn’t expect and handle the failure, that’s what will happen.
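
Concretely, every call ends up wrapped in something like this; the recommendation service, logger and ServiceException are placeholders for whatever your client layer provides and throws:
<?php
try {
    $recommendations = $recommendationService->forUser($userId);
} catch (ServiceException $e) {
    // Conscious decision: recommendations are cosmetic, so log and move on
    $logger->warn('Recommendation service unavailable: ' . $e->getMessage());
    $recommendations = array();
}

$view->recommendations = $recommendations;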

Adding this level of error handling does add a significant cost to the development. It’s a lot of code and a lot of thinking required. Live with it or reconsider your SOA strategy.

Early on, adding the try/catch blocks wasn’t much of a reflex. After all, you can write code that works without it and PHP sure won’t tell you that you forgot one. When the first crashes occurred in development, we still had few services integrated. Service interruptions just became worse as we integrated more. What really pushed us towards adding more granularity in catching exceptions is not really the non-gracefulness with which they would break the system, it’s the waste of time. With a few team members pushing features in a system, a 15-minute interruption may not seem like much, but it’s enough to break the flow, which is hard enough to get into in an office environment. Especially when the service that breaks has nothing to do with the task you’re on at the moment.

It does not take much.

  • When fetching information, have a fall-back for missing information.
  • When processing information, make sure you give a proper notification when something goes wrong.
  • Log all failures and have at least a subtle indication for the end-user that something went wrong and that logs should be verified before bothering someone else.

Build for testability

Services live independently in their own memory space with their own state. They just won’t reset when your new unit test begins, making them a pain to test. However, that is far from being an excuse for not automating at least some tests. Every shortcut made will come back to haunt you, or someone else on the team. It’s very likely that you won’t be able to get the remote system into every possible state to test all combinations. Even if you could, the suite would most likely take very long to execute, leading to poor feedback. Mocks and stubs can get you a long way just making sure your code makes the correct calls in sequence (when it matters), passing the right values and stopping correctly when an error occurs. That alone should give some confidence.

To be able to check all calls made, we ended up with an interface defining all possible remote calls, providing the exact same parameters and return values as the remote systems. There was a lot of refactoring to get to that solution. Essentially, every single time an attempt was made to regroup some calls because they were called at the same time and shared parameters, or because it was too much data to stub out for just those 2 tiny values, it had to be redone. Some error would happen with the real services because the very few lines of code that were not tested with the real service happened to contain errors, or something would come up and suddenly, those calls were no longer regrouped.
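
With all the remote calls behind one interface, PHPUnit mocks can verify the call sequence and parameters without any service running. A rough sketch, with made-up service methods and a hypothetical AccountPage consumer:
<?php
interface RemoteGateway
{
    public function authenticate($user, $password);
    public function fetchAccount($accountId);
}

class AccountPageTest extends PHPUnit_Framework_TestCase
{
    public function testFetchesAccountAfterAuthentication()
    {
        $gateway = $this->getMock('RemoteGateway');

        $gateway->expects($this->once())
            ->method('authenticate')
            ->with('bob', 'secret');

        $gateway->expects($this->once())
            ->method('fetchAccount')
            ->with(42)
            ->will($this->returnValue(array('id' => 42, 'status' => 'active')));

        // AccountPage is whatever consumes the gateway in your application
        $page = new AccountPage($gateway);
        $page->display(42);
    }
}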

As far as calling the real services goes, smoke testing is about the only thing I could really do: making a basic call and checking if the output seems to be in the appropriate format. In the best of worlds, the service implementers would also provide a stub in which the internal state can be modified, and maintain the stub to reflect the contract made by the real service. It could have solved some issues with the fact that some services are simply impossible to run in a development environment. Sticking to the contract is the only thing that can really be verified in an automated fashion during development. I first encountered that type of environment a few years back, where running a test actually implied walking to a room, possibly climbing a ladder, switching wires and getting back to the workstation to check.

Have an independent QA team

It might not be miserable, but the chances that you fail are fairly high when a lot of components need to talk to each other and there is no way you can replicate all of it at once. A good QA team testing in an environment that maps to the production environment will find the most mind-boggling issues. In most cases, they are caused by a mismatch in the understanding of the interface between the implementor and the client. When you have a clear log pointing out the exact source of the problem and all your expectations documented in tests and stubs, it does not take a very long discussion to find the source of the issue. Fixing it just becomes a matter of adjusting the stubs and fixing the broken tests.

If you’re lucky enough, there might not be issues left when it goes to production. Seriously, don’t over-do SOA. It’s not as magic as the vendors or “enterprise architects” say it is.

Branching, the cost is still too high

Everyone’s motivation to move to distributed version control systems (DVCS) was that the cost of branching was too high with Subversion. Part of it is true, but even with a DVCS, I find the cost of branching to be too high for my taste. I can create feature branches for features of a decent size, but I think traceability needs even more granularity.

Let’s begin by listing my typical process to handle feature branches these days.

  1. Branch trunk from the repository to my local copy
  2. Copy configuration files from another branch
  3. Make minor changes
  4. Run scripts to initialize the environment.
  5. Develop, commit, pull, merge – all of this is great
  6. Push to trunk

My problem is that dealing with those configuration files takes too much time and is still troublesome. However, there is no real way around it. The application needs to connect to MySQL, Gearman, Sphinx and Memcached. On development setups, they are all on the same machine. Still, because I am way too lazy to create new database instances and I often don’t change my prefixes as much as I should, I end up with multiple branches sitting there with only one really usable at any time. Of course, it would all be solved if I were more disciplined, but if it annoys me, it prevents me from doing it right. Just having to do the configuration part encourages me to re-use branches.

The goal of fine-grained branches is to represent the decision-making process as part of the revision control. The way I see it, top level branches represent a goal. It could be implementing a new feature, enhancing a piece of the user interface or anything. However, to reach those top level objectives, it may be required to perform some refactoring or upgrade a library. If those changes are made atomically through a branch and merged as a single commit, there would be ways to look at the hierarchy of commits to understand the flow of intentions. Bazaar can generate graphs from forks and merges. I can imagine tools to help traceability if the decision making is organized in the branch structure.

Why traceability you might ask. For many things that don’t seem to make sense in code, there is a good historical reason (unless it’s due to accidental complexity). Even in my own code written a few months prior, I find places that need refactoring. Most of the time, it’s simply because I was trying to look too far ahead at the time. I was anticipating the final shape of the software, but by the time it got there, new and better ways to achieve the same result had been implemented, leaving legacy behind. When this happens to be in my own code, I can think about the process that led to it, figure out what the original intention was and decide how the design should be adapted to the new reality. When the code is written by someone else, the original intention can only be guessed. I hope creating a hierarchy of branches can provide an outline of the thought process that would explain the decisions made.

My Subversion reflexes pointed me towards bzr switch. It brings a change to the way I got used to working with a DVCS. My transition had been to replace the concept of a working copy with branching. Check-outs simply had no use. I was wrong. They can actually fix my issue of configuration burden. If I keep a single check-out of the code that is configured for my local environment, I can then switch it from one branch to another. Because we are in the distributed world, those other branches can be kept locally, just not in the working copy. The process then changes.

  1. Create a new branch locally
  2. Switch the check-out to the new branch
  3. Develop, commit, pull, merge
  4. Switch check-out to parent branch
  5. Merge local branch

Of course, if changes happen in the configuration files outside of what was locally configured or the schema changes, this has to be dealt with, but I expect this to be much less frequent.

The next step will be to rebuild my development environment in a smarter way. Right now, I have way too many services running locally. I want to move all of those to a virtual machine, which I will fire up when I need them. For this step, I am waiting for the final release of Ubuntu 10.04, and probably a few more weeks. In the past, I had terrible experiences with pre-release OS versions and learned to stay away, no matter how fun and attractive new features are. It also means re-installing my entire machine, so I don’t look forward to that part so much. It should be easier now that almost everything is web-based, as long as I don’t lose those precious passwords.

Using a virtual machine to keep your primary host clean of any excess is nothing new. I guess I did not do it before because I thought my disk space was more limited than it is. My laptop has a 64G SSD drive. It was a downgrade from my previous laptop’s drive, which was continuously getting full. Too many check-outs, database dumps, log files. They just keep piling up over the years. It turns out the overhead of having an extra operating system isn’t that bad after all.

The good thing about virtual machines is that they are completely disposable. You can build one with the software you need, take a snapshot and move on from there. Simply reverting back to the snapshot will clean up all the mess created. Only one detail to keep in mind: no permanent data can be stored in there. I will keep my local branches on the main host and the check-out in the virtual machine. Having a shell on a virtual machine won’t make much more of a difference than a shell locally.

Improving rendering speed

Speed is a matter of perception. We’d like to believe it’s all due to computational power or the execution speed of queries. There are barriers that should not be crossed, but in most cases, getting your application to behave correctly while the user is waiting will improve the perception. Improving the rendering speed is a good step, and tweaking a few settings will improve perception more than trimming milliseconds off an SQL query.

A now classic example of the effects of perception is the one of progress bars. A progress bar moving forward at different rates, even though the total time remains the same, will give the impression of being shorter or longer.

Fiddling with HTTP headers is actually very simple and will help lower the load on your server too. A hit you don’t get is so much faster to serve. Both Yahoo! and Google turned this optimization pain into a game by providing scores to increase. If you are not familiar with them, consider installing YSlow and Page Speed right away. Now, if you’ve never used them before, chances are running them on your own website will produce terrible scores. Actually, running them on most of the websites out there produces terrible scores.

Both of them will complain about a few items:

  • Too many HTTP requests
  • Missing expires headers
  • Uncompressed streams
  • Unminified CSS and JavaScript
  • No use of a CDN

Fewer files

Now, the excess HTTP requests are likely caused by those multiple JavaScript and CSS files you include. The JavaScript part is very simple. All you have to do is concatenate the scripts in the appropriate order, minify them and deliver it all as a single file. There are good tools out there to do it. Depending on how you deploy the application, some may be better than others. I’ve used a PHP implementation to do it just in time and cache the result as a static file, and a Java implementation as part of a build process. I find the latter to be the better option when it is possible.

This is easy enough for production environments, but it really makes development a pain. Debugging a minified script is not quite pleasant. In Tikiwiki, this simply became another option. In a typical Zend Framework environment, APPLICATION_ENV is a good binding point for the behavior. Basically, you need to know the individual files that need to be served. In a development environment, serve them individually. In a production or staging environment, serve the compiled file (or build it just in time if a build step is not an option).
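
In a view helper or layout, the switch can be as simple as this; the file names are examples only:
<?php
// Scripts listed in the order they must be concatenated
$scripts = array('/js/jquery.js', '/js/app.js', '/js/forms.js');

if (APPLICATION_ENV == 'development') {
    // Development: serve individual files so debugging stays sane
    foreach ($scripts as $script) {
        echo '<script type="text/javascript" src="' . $script . '"></script>' . "\n";
    }
} else {
    // Production/staging: one minified file built ahead of time (or JIT and cached)
    echo '<script type="text/javascript" src="/js/compiled.min.js"></script>' . "\n";
}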

Unless you live with an application that has been shielded from the real world for a decade, it’s very likely that most of the code you use was not written by you. It comes from a framework. You can skip those files altogether by not distributing them at all. Google provides a content delivery network (CDN) for them. Why is this faster? You don’t have to serve the files, and your users likely won’t have to download them. Because the files are referenced by multiple websites, it’s very likely that users downloaded and cached them locally in the past. Google also serves the standard CSS files for jQuery UI (see bottom right corner), although that’s not quite as well indicated (you should be able to find the pattern).

Both of the minify libraries mentioned above also handle CSS minification. However, this is a bit trickier as you will need to worry about the relative paths to images and imports of other CSS files.

The final step is to make sure all the CSS is in the header and the JavaScript at the bottom of the page.

Web server tuning

Now that the number of files is reduced and your scores have already improved significantly, another class of issues will take over: compression, expiry dates and improper ETags. The easiest to set up is the compression. You will need to make sure mod_gzip or mod_deflate is installed in Apache. It almost always is. Everything is done transparently. All you need to do is make sure the right types are set. It can be done in the .htaccess file. Here is an example for mod_deflate.
<IfModule deflate_module>
AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css application/javascript
</IfModule>

Use Firebug to see the content type of any files YSlow is still complaining about and add them to the list.

Another easy target is the ETag declaration. In most installs, Apache will generate an ETag for static files. ETags are a good idea. The browser remembers the last ETag it received for a given URI and sends it back asking if the resource changed. The server compares it and either sends a 304 to indicate it was not modified, or the new version. The problem is that your server still gets a hit. You’re better off not having them at all.
<FilesMatch "\.(js|png|gif|jpg|css)$">
FileETag None
</FilesMatch>

Expiry headers are a bit trickier. When they apply to content generated by your scripts, you have to deal with them in the scripts. Setting an expiry date means accepting that your users might not see the most recent version of the content because they won’t query your server to check. These may not be easy decisions to make.

However, static files are much easier to handle. You will need mod_expires in Apache, which is not quite as common as the compression counterpart. The goal is just to set an arbitrary date in the future. Page Speed likes dates further than a month away. YSlow seems to settle for 2 weeks. The documentation uses 10 years. It should be far enough.
<FilesMatch "\.(js|png|gif|jpg|css|ico)$">
ExpiresActive on
ExpiresDefault "access plus 10 years"
</FilesMatch>

Cookies

Your website most likely uses a cookie to track the session. Cookies are great for your PHP scripts that need them to track who’s visiting, but they also happen to be sent for static files, because the browser does not know they make no difference there. Cookies alter the request and confuse intermediate caches, especially when the cookies change, like when you regenerate the session id to avoid session hijacking.

The easiest way to avoid those cookies being sent for the static files is to serve the files from a different server. Luckily, browsers don’t really know how things are organized on the other end, so just using a different domain or sub-domain pointing to the exact same application will do the trick. If you have more load, you might want to serve them with a different HTTP server altogether, but that requires more infrastructure. It should be easy to push JavaScript and CSS to the other domain. Reaching the images will depend on the structure of your application. You will thank those view helpers if you have any.

If you serve some semi-dynamic files through that domain, make sure PHP does not start the session, otherwise, all this was futile.

You can then configure YSlow’s CDN list to include that other domain and the Google CDN, and observe blazing scores. To modify the configuration, you need to edit Firefox preferences. Type about:config in the URL bar, say you will be careful, search for yslow and modify the cdnHostnames property to contain a comma-separated list of domains.

One more

By default, PHP sends a ridiculous Cache-Control header. It basically asks the browser to check for a new version of the script on every request. When your user presses back, you get a new request, and he will likely lose local modifications in the form. Not really nice, and one hit too many on your server. Setting the header to something like max-age=3600, must-revalidate will resolve that issue and make navigation on your site feel so much faster.
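
A minimal way to do it, assuming the page can tolerate being an hour stale; depending on your session cache limiter you may also want to adjust the Expires and Pragma headers:
<?php
session_start();

// Replace PHP's default "never cache anything" policy for this response
header('Cache-Control: max-age=3600, must-revalidate');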

These items should cover most of the frequent issues. Both tools will report a few minor issues, some of which may be easy to fix, some not so much. Make those verifications part of the release procedure. A new content type may get introduced in the application and cause less than optimal behavior due to the lack of a few headers. It may not be possible to get a perfect score on all pages of a site, but if you can cover the most important ones, your users may believe your site is fast, even though you use a heavy framework.

Decision making

As part of the day to day work of a software developer, decisions have to be made every single day. Some have a minor impact and can be reverted at nearly no cost. Others have a significant impact on the project, and reverting them would be a fundamental change. I have found that, in most cases, not making a decision at all is a much better solution. A lot of time is wasted evaluating technology. Out of all the options available out there, there is a natural tendency to do everything possible to pick the best of the crop, the one that will offer the most to the project and provide the largest amount of features for future developments. While the reasoning sounds valid, it’s an attempt to predict the future and will most likely be wrong.

Of course, the project you are working on is great and you truly believe it will be revolutionary. However, you’re not alone. Every day, thousands of other teams work on their own projects. Chances are they are not competitors, but most likely complements to yours, and they will likely make the package you spent so much time selecting completely obsolete before you’ve used all those advanced features.

Too often, I see a failure to classify the type of decision that has to be made in projects. They are not all equal. Some deserve more time. In the end, it’s all about managing risks and contingencies. The very first step is to identify the real need and what the boundaries are with your system. No one needs Sphinx. People need to search for content in their system. Sphinx is one option. You could also use Lucene or even external engines if your data is public. What matters when integrating in this case is how the data will be indexed and how searching will be done. When trying out new technology, most will begin with a prototype, which then evolves into production code. At that point, a critical error has been made. Your application has become dependent on the API.

If you begin by making clear that the objective is to index the content in your system, you can design boundaries that isolate the engine-specific aspects and leave a cohesive — neutral — language in your application.

Effectively, having such a division allows you not to choose between Sphinx or Lucene or something else. You can implement one that makes sense for you today and be certain that required changes to move to something else will be localized. With your application logic to extract the data to be indexed and the logic for fetching results and displaying them left independent, the decision-making step becomes irrelevant.
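
The boundary does not need to be elaborate. An interface expressed in the application’s own terms is usually enough; this is a sketch, not code lifted from an actual project:
<?php
interface SearchIndex
{
    // Content is expressed in the application's terms, not the engine's
    public function addDocument($objectType, $objectId, array $content);

    public function removeDocument($objectType, $objectId);

    // Returns an array of array('type' => ..., 'id' => ..., 'score' => ...)
    public function find($query, $offset = 0, $limit = 20);
}

A LuceneSearchIndex or SphinxSearchIndex adapter then implements it, and the rest of the application never mentions either engine by name.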

Certainly, there is some overhead. You need to convert the data to a neutral format and then convert it to what the API wants, rather than simply fetching what the API wants directly. Some look at the additional layer and see a performance loss. In most cases when integrating with other systems, the additional level of indirection really does not matter. You are about to call a remote system performing a complex operation over a network stack. If that wasn’t complex, you would have written it yourself.

A common pitfall is to create an abstraction that is too closely bound to the implementation rather than the real needs of the system. The abstraction must speak your system’s language and completely hide the implementation, otherwise, the layer serves no purpose. It’s a good idea to look at multiple packages and see how they work conceptually when designing the abstraction. While you’re not going to implement all of them, looking at other options gives a different perspective and helps in adjusting the level of abstraction.

Once the abstraction is in place, the integration problem is out of the critical path. You can implement the simplest solution, knowing that it won’t scale to the appropriate level down the road, but the simplest solution now will allow you to focus on more important aspects until the limit is reached. When it is, you will be able to re-assess the situation and select a better option, knowing that the changes will be localized to an adapter.

Abstracting away is a good design practice and it can be applied to almost any situation. It allows your code to remain clean and breaks dependencies on external systems that would otherwise make it hard to set up the environment and would decrease testability. Because the code is isolated, it leaves room for experimentation with a safety net. If the chosen technology proves to be unstable or a poor performer, you can always switch to something else.

While it works in most cases, it certainly does not work for some fundamental decisions, like the implementation language, unless you plan on writing your own language that compiles to other languages. Some abstractions just don’t make sense.

When you can’t defer decision making, stick with what you know. Sure, you might want to try out this new framework in the cool new language. The core of your project, if you expect it to live, is no place to experiment. I have been using PHP for nearly a decade now. I’ve learned all the subtleties of the language. It is a better choice for me. I’ve used the Zend Framework on a few projects and know my way around it well enough. It’s a good solution for me. Both together are a much safer path than Python/Django or any alternative, no matter what Slashdot may say.

It might not sound like a good thing to say, but experimenting as part of projects is important. You can’t test a technology well enough unless it’s used as part of a real project, and a project is unlikely to be real unless it’s part of your job. It’s just important to isolate experiments to the less critical aspects. It’s the responsible thing to do.

It’s all about risk management. Make sure all the decisions you make are either irrelevant, because they can be reverted at a low cost, or rely on technologies you trust based on past experience, and you will avoid bad surprises.