Issues should be visible

The best way to improve a system’s stability is to make sure the issues are visible to everyone, every day. We all get too many emails of varying relevance. So many that our vision blurs when looking at our mailbox, and our brains automatically filter out the parts that have been around for too long. Time to handle an issue becomes:

  1. Immediately
  2. End of day
  3. Never.

The best software systems are built with error handling. They detect errors, validate that tasks behaved as expected, and check that the results are within normal boundaries. Typically, the go-to technique to report these issues is to send an email to the right person. That works until the right person moves on with their life. Then the notification goes to someone who has no clue how to handle it or how important it is. The issues escalate to developers, and that’s how my mailbox fills up with useless, redundant requests.

Don’t manage errors with a mailbox. Mailboxes are fine for handling immediate issues, but their search capabilities are nowhere near as good as you need them to be. When an issue lands in a mailbox, it arrives without context. It’s buried in endless noise, out of which you cannot see the patterns. The hard issues can rarely be fixed without identifying a pattern.

I searched for a system that would let me do this. I found quite a few interesting solutions, but nothing that suited what I wanted to do. It turns out all it takes is a single database table, a couple of lines of SQL, and rendering with d3.js.

The table?

  • An autoincrement ID because the framework likes it better
  • An identifier for the notification
  • A date
  • A status (green, yellow or orange)
  • A time to live
  • A one-line message
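
Sketched in SQLite, with column names that are my own guesses rather than the actual schema, the table could look like this:

```python
import sqlite3

# In-memory database for illustration; a real deployment would use a file.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE notification (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- the framework likes it better
        identifier TEXT NOT NULL,                      -- which task or verification reported
        reported   TEXT NOT NULL,                      -- date of the report
        status     TEXT NOT NULL CHECK (status IN ('green', 'yellow', 'orange')),
        ttl        INTEGER NOT NULL,                   -- time to live, in minutes
        message    TEXT NOT NULL                       -- one-line message
    )
""")
```

Red is deliberately absent from the check constraint: tasks never report it, the dashboard derives it from the TTL.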

Each task or verification in the system pushes its state at the end of the task, recording the result as green if all is good, orange if there is an error, and yellow if something is preventing execution.

Red is reserved for when the task does not report within the expected TTL, meaning the task either did not run at all or crashed in the process.
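
A minimal sketch of that rule, assuming a table with hypothetical identifier, reported and ttl columns: the displayed color is the reported status, unless the last report is older than its TTL, in which case it turns red.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notification (identifier TEXT, reported TEXT, status TEXT, ttl INTEGER)")

now = datetime(2024, 1, 1, 12, 0)
conn.executemany("INSERT INTO notification VALUES (?, ?, ?, ?)", [
    ("backup",  (now - timedelta(minutes=30)).isoformat(), "green", 60),  # fresh report
    ("billing", (now - timedelta(hours=3)).isoformat(),    "green", 60),  # stale report
])

def display_status(identifier):
    """Latest reported status, or red when the TTL has expired."""
    reported, status, ttl = conn.execute(
        "SELECT reported, status, ttl FROM notification"
        " WHERE identifier = ? ORDER BY reported DESC LIMIT 1",
        (identifier,),
    ).fetchone()
    expired = datetime.fromisoformat(reported) + timedelta(minutes=ttl) < now
    return "red" if expired else status

print(display_status("backup"))   # green: reported recently
print(display_status("billing"))  # red: last report is past its TTL
```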

It turns out green is addictive. If you present a dashboard with green bubbles for every success, the few occasional errors stand out, and effort will go into getting back to green. Sometimes resolving the issue involves contacting the customer, sometimes it requires adjusting tests, but having everything in a single location lets you see patterns. Failures rarely come alone.

To help further, clicking on one of those circles brings up a graph showing the history of the task. I used a plot with time on the X axis and time between notifications on the Y axis. Dot color shows the state. Interesting patterns emerge there.
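
The data behind that graph is just consecutive differences; a sketch with made-up timestamps:

```python
from datetime import datetime

# Hypothetical history for one task, oldest first.
runs = [
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 2, 6, 0),
    datetime(2024, 1, 3, 6, 5),
    datetime(2024, 1, 5, 6, 0),  # a skipped day stands out as a high dot
]

# X is the time of the notification, Y is the gap since the previous one.
points = [
    (current, (current - previous).total_seconds() / 3600)
    for previous, current in zip(runs, runs[1:])
]

for when, gap_hours in points:
    print(when.date(), round(gap_hours, 1))
```

Regular runs draw a flat band; drifting schedules and missed runs jump out immediately.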

As a bonus, an iframe points to a page on the wiki with troubleshooting information. Documentation is only ever used when it is in context.

The great thing about having a dashboard is that you can then include additional information in it. The more value it has, the more people will access it. It needs to make their daily job of managing the site easier.

I hope it will keep my inbox more manageable. Meanwhile, we can get a good handle on the system’s behavior, resolve long-standing issues and improve performance.

Rushing to track

Working on larger projects comes with the annoyances of project management. Every stakeholder sends in someone to check on the project status and make sure it is on track. You can’t really blame them for keeping an eye on where their money is going, but it can be very counterproductive. Software projects are not like every other project. It’s not a production line, and it is very hard to measure. Complexity is hidden, and so can progress be. A well-run project will tackle the bigger risks first to avoid uncertainty near the end. This means that early on, progress may simply be invisible. Outsiders like to see visible progress, but focusing on visual details just builds pretty prototypes. It looks complete, but it’s not.

Trying to please project managers rather than fulfilling the actual needs is bad engineering practice, even though it creates successful projects on paper. Just as the vast majority will not remember who finished second, a pile of project reports finished on time and on budget is a manager’s pride. Reports don’t often state that the definition of complete was distorted. Trying to track the invisible causes dysfunction. At some point, you need to trust the people working on the project.

Estimates are hardly ever accurate on day 0. Until some work is done, it’s just impossible to know how long something is going to take. You can make a wild guess. Having worked on similar problems before helps in assessing the risks and adjusting the estimate, but when it’s brand new work integrating with unknown systems, starting to code is the only way to feel the resistance the system will offer. The risks need to be assessed. Unstable or undocumented APIs, incoherent data, undefined business logic and dozens of other factors can affect how long something will take. If you have committed to a tight deadline and the developers know it, knowing precisely how long tasks are going to take won’t save you. Letting them work might.

When the primary risks are tackled and something is in place, management becomes possible. Making a list of changes that deviate from the current state, the baseline, to reach a target is quite simple. With the primary risks out of the way, it should be possible to break down the list of tasks into fairly even sizes and manage by tracking velocity. Managing becomes what it should be: looking at the time available and prioritizing. Trying to manage by tracking velocity during inception, which ends when the primary risks are tackled, is just a waste of time. It leads to panic because velocity is not high enough early on, until velocity magically and sharply increases at some point in the project, which is when the actual construction starts. That is, if panic decisions did not screw up the project.

You can plan all you want, but software has its own agenda. The problem space defines how long inception and elaboration will be, not a schedule on the wall.

Contributing organizations

Originally written in 2009, but never published. The conclusion was reworked.

The open source model is wildly different from the typical business plan. There used to be a time when contributors were volunteers, working out of passion and love of the project in their free time. These days, I feel most contributors to projects make a living out of it. I do. There is nothing wrong with that. In fact, it’s a good thing. It’s a lot more sustainable. Everyone needs to make a living, so the odds of losing a contributor because the company they work for was just bought out (or is on the verge of being), forcing them into 70-hour weeks and burnout, are much lower. Contribution can become a top priority for a reasonable number of work hours per week. The Linux kernel was probably one of the first examples where all the top contributors ended up being paid by companies to do it.

As an individual, I find it makes a lot of sense to focus on open source software. There is nothing I hate doing more than writing the same code twice. Building from an open platform, and contributing back, allows me to avoid writing the same thing twice and prevents me from ever writing some code at all. I’m mostly a developer. I don’t do much of the applying for contracts and filling specific needs. I tried it before. Not a happy place for me, even when based on open solutions. I’d rather leave the pressure to deliver to someone else. It turns out many companies out there see open source as a business opportunity. They can build on an existing product to deliver more value fast. I get hired to take them further while they handle the day-to-day problems. It’s a good deal for both sides.

There is, however, a major difference between a company serving real, tangible clients and the open source world. After working for several clients, I can see a major difference between the successful ones and those that barely stay above water. It probably differentiates the successful from the unsuccessful in any field. It’s called vision. The good companies see ahead. They anticipate problems and make sure they are resolved before the client meets them, at least the fundamental ones. Unsuccessful companies just keep fighting fires, pushing problems down and hoping someone resolves them right away.

A while back, I read one of Joel’s articles on micromanagement. Apart from the conference organizers’ inside jokes about terrible Wi-Fi access in conference centers (and the great advice to make sure they provide it for free if it does not handle the load, which is always), the following passage made me smile:

At the top of every company, there’s at least one person who really cares and really wants the product and the customer experience to be great. That’s you, and me, and Ryan. Below that person, there are layers of people, many of whom are equally dedicated and equally talented.

But at some point as you work your way through an organization, you find pockets of people who don’t care that much.

Having never really had a full-time job otherwise (technically, I was once an intern), I have spent most of my time as a consultant working in fairly small organizations, and I have been kept closer to developers than to the management types. I can say without hesitation that the people down the ladder mostly blame high-level management for getting in their way and preventing them from doing their job. Both are probably right. However, I still find Joel’s wording a bit harsh.

To function correctly with open source and make the relationship efficient, you need to embrace it. Companies that tried to make it a one-way relationship ended up failing. In open source, the project extends beyond the company. Every line of code you don’t contribute back is a line you will have to maintain yourself. At some point, it will simply break. Upgrading to newer versions will become harder. At that point, you had better hope you have killer traceability back to the original issue, because you will have to implement it all over again. With high turnover, it might just kill the company, and the community (and the consultants who are part of it) might not be motivated to help you out if you did not contribute back when times were good.

Only the words change

Amazon has brought me back to 1975 and The Mythical Man-Month. It had been on my reading list for quite a while, but at some point around two years ago, it became available. After that, it sat on a shelf for a few months until the stack got down to it. I must say, skip a few technical details and this book could very well have been written last year. After all, in 1975, structured programming (that is, conditions and loops) was a recent concept and not widely adopted. Surprisingly, Brooks knew a whole lot about software development, testing and management. I have the feeling we have learned nothing since it was written. Concepts were only refined, renamed and spread out. As far as I can tell, just a few paragraphs in chapter 13 lay out the founding concepts of TDD.

Build plenty of scaffolding. By scaffolding, I mean all programs and data built for debugging purposes but never intended to be in the final product. It is not unreasonable for there to be half as much code in scaffolding as there is in product.

One form of scaffolding is the dummy component, which consists only of interfaces and perhaps some faked data or some small test cases. For example, a system may include a sort program which isn’t finished yet. Its neighbors can be tested by using a dummy program that merely reads and tests the format of input data, and spews out a set of well-formatted meaningless but ordered data.

Another form is the miniature file. A very common form of system bug is misunderstanding of formats for tape and disk files. So it is worthwhile to build some little files that have only a few typical records, but all the descriptions, pointers, etc.

Yet another form of scaffolding are auxiliary programs. Generators for test data, special analysis printouts, cross-reference table analyzers, are all examples of the special-purpose jigs and fixtures one may want to build.

Add one component at a time. This precept, too, is obvious, but optimism and laziness tempt us to violate it. To do it requires dummies and other scaffolding, and that takes work. And after all, perhaps all that work won’t be needed? Perhaps there are no bugs?

No! Resist the temptation! That is what systematic system testing is all about. One must assume that there will be lots of bugs, and plan an orderly procedure for snaking them out.

Note that one must have thorough test cases, testing the partial systems after each new piece is added. And the old ones, run successfully on the last partial sum, must be rerun on the new one to test for system regression.

Does it sound familiar? I see test cases, test data, mock objects, fuzzing and quite a lot of the things we hear about these days. Certainly, it was different. They had different constraints at the time, like having to schedule access to a batch-processing machine. There is some discussion of interactive programming and how it would speed up the code-and-test cycles.
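
Brooks’ dummy sort component is, in today’s vocabulary, a stub. A sketch of the idea (the names and format checks are mine):

```python
import random

def dummy_sort(records):
    """Stand-in for an unfinished sort program, so its neighbors can be tested."""
    # Validate the input format the real component will expect.
    for record in records:
        assert isinstance(record, str) and record, "each record must be a non-empty string"
    # Spew out well-formatted, meaningless but ordered data.
    fake = ["record-%03d" % random.randint(0, 999) for _ in records]
    return sorted(fake)

# A "miniature file": only a few typical records, but a complete format.
miniature = ["alpha", "bravo", "charlie"]
output = dummy_sort(miniature)
assert output == sorted(output)  # downstream components can rely on the ordering
```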

I find it impressive given that they had so little to work with. I wasn’t even born when they figured that out.

Because the experience is based on system programming for an operating system to be run on a machine built in parallel, there is a strong emphasis on top-down design, “the most important new programming formalization of the decade,” and on requirements. To me, the word requirement is a scary one. I don’t do system programming, and for what I do, prototyping and communication do a much better job. However, I found the take interesting.

Designing the Bugs Out

Bug-proofing the definition. The most pernicious and subtle bugs are system bugs arising from mismatched assumptions made by the authors of various components. The approach to conceptual integrity discussed above in Chapters 4, 5 and 6 addresses these problems directly. In short, conceptual integrity of the product not only makes it easier to use, it also makes it easier to build and less subject to bugs.

So does the detailed, painstaking architectural effort implied by that approach. V. A. Vyssotsky, of Bell Telephone Laboratories’ Safeguard Project, says, “The crucial task is to get the product defined. Many, many failures concern exactly those aspects that were never quite specified.” Careful function definition, careful specification, and the disciplined exorcism of frills of function and flights of technique all reduce the number of system bugs that have to be found.

Testing the specification

Long before any code exists, the specification must be handed to an outside testing group to be scrutinized for completeness and clarity. As Vyssotsky says, the developers themselves cannot do this: “They won’t tell you they don’t understand it; they will happily invent their way through the gaps and obscurities.”

Beyond the punch line, this does call for very detailed specifications. They felt to me like in-retrospect comments. I don’t think the specifications were ever made detailed enough for the bugs to be driven out. If they had been, you would end up with an issue introduced in the previous chapter: “There are those who would argue that the OS/360 six-foot shelf of manuals represents verbal diarrhea, that the very voluminosity introduces a new kind of incomprehensibility. And there is some truth in that.”

Very detailed specifications, just like exhaustive documentation, reach a point where they bring no value because no one can get through all of them in a reasonable time frame. Attempting to find inconsistencies in a few pages of requirements is possible. Not in thousands of pages. The effort required is just surreal. Sadly, following the advice of strong and detailed requirements, the world of software development sank into waterfall in the years that followed. Of all the great insights in the book that we only collectively realized decades later, this one proved too influential.

Imagine where software development would be today if, instead, the industry had taken the book’s other advice as primary: smaller, more productive teams when possible, higher-level programming languages and extensive scaffolding.

Go for changes that matter

I don’t have any stats to support this, but I’m pretty certain that every second, a developer somewhere complains about legacy code. Most of the time, no one person can be blamed for it. Other than a few classics demonstrating a complete lack of understanding, most bad code out there was not written by any single person. It just grew organically until merely looking at it makes it fall apart. Most of the time, it begins with the good intention of keeping the design simple for a simple task. Augment this with a lot of vision from an inspired third party and the next implementor’s will to keep the original design unchanged, and you have made a step in the wrong direction. At some point, the code becomes so bad that people just put in their little fix and avoid looking at the larger picture.

No matter how you got there, knowing it won’t really help you. You’re stuck with a large pile of code you have no interest in maintaining and whose functionality is a complete mystery. The initial reaction is to vote for a complete rewrite from the ground up using more modern technologies. Well, the past has demonstrated this to be a failure over and over again. Starting a new project from scratch is a good way to implement a brand new idea. If the objectives are completely different, it deserves to be a different project. However, a new project to do something the old one did is hardly a good idea. Development will take so long that you will either have to sacrifice your entire user base, who will move away as their current solution is neglected, or keep maintaining the current version for a while, which will kill the new initiative through lack of resources.

The real solution is to devote time to making things better. Take all the time wasted on complaining and actually use it to make things better. While the code smell is often generalized, very few parts are usually rotten. Cleaning up those areas can transform the project without that much effort. The one thing you don’t want to do is begin with the first file and clean up all the parts you don’t like about it, or polish some feature because it could be improved and you understand it well enough to do it.

Improving code quality is just like improving performance. Unless you target the areas that really matter, there will be no significant impact. If you spend an hour optimizing a query and gain a 50% improvement on it, you can be happy with it, but if that query accounted for 1% of the total execution time, your real impact is 0.5%. Sadly, software quality does not have many direct numbers that can be observed. There are metrics, but the impact will only be seen over the longer term, mixed up with dozens of other issues, making it nearly impossible to measure. It also affects those weird factors like team morale.
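
The arithmetic deserves to be spelled out, because it is the same as Amdahl’s law: the overall gain is capped by the share of time the optimized part accounts for.

```python
def overall_gain(share_of_total, local_improvement):
    """Fraction of total execution time saved by improving one part.

    share_of_total: fraction of total time spent in the part you optimized
    local_improvement: fraction by which that part got faster
    """
    return share_of_total * local_improvement

# The query from the example: 1% of total time, made 50% faster.
print(overall_gain(0.01, 0.50))  # 0.005, i.e. half a percent overall
```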

To me, the main attributes refactoring candidates have are:

  • Obstructive
  • Untrustworthy
  • Inconsistent

Obstructive issues hamper your ability to grow. They are roadblocks. If you drew a directed graph of all the issues and feature requests as nodes and dependencies as edges, those issues would stand in the middle. They cause problems everywhere for no obvious reason and always prevent you from going up to the next level. In TikiWiki, the permission system was one of those. For a long time, and it still is, the high granularity of the permission system has been one of the key features of the CMS. There are currently no fewer than 200 permissions that can be assigned. However, the naive implementation caused so many problems that a word was created to identify it in bug reports. It also prevented the feature most demanded by large enterprise customers: project workspaces.

Obstructiveness may also apply in terms of development. Every time you have to perform a simple task in a given area, you find yourself juggling with complexity. For historical reasons, just getting close to a piece of code requires a complete ritual dance, so much so that you just attempt to work around the issue. It’s likely that the API does not provide the functionality that is required. A lot of copy-pasting is needed and, as a result, a lot of time is wasted.

Untrustworthy code often looks innocent: a function call that looks simple and that you would expect to work. However, for some reason, every time you use it somewhere, you close your eyes before execution. You’re just not convinced it will act as it should. There are also multiple bugs filed against the feature under corner conditions, and they are always fixed by adding a line or a condition. Typically, it will be a very long function with disparate branching, overgrown by feature requests over time. It’s not rare to see multiple different ways to do the same thing with different parameters. It was so complicated that someone made a request for something that already existed, and the developer did not even notice. The only way out is to map out what it does and what it’s supposed to do, and begin writing tests for it.
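
That last step, mapping out what the function does and writing tests before touching it, is what is now called characterization testing: pin down the current behavior, whatever it is, so refactoring can begin safely. A sketch with a hypothetical overgrown function:

```python
# Hypothetical overgrown function of the kind described above: multiple
# ways to do the same thing, grown one condition at a time.
def format_name(first, last=None, full=None, reverse=False):
    if full:
        return full
    if last is None:
        return first
    if reverse:
        return "%s, %s" % (last, first)
    return "%s %s" % (first, last)

# Characterization tests: record what the code does today, not what it
# should do.  Once these pass, the cleanup can start.
assert format_name("Ada") == "Ada"
assert format_name("Ada", "Lovelace") == "Ada Lovelace"
assert format_name("Ada", "Lovelace", reverse=True) == "Lovelace, Ada"
assert format_name("Ada", "Lovelace", full="Countess of Lovelace") == "Countess of Lovelace"
```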

Inconsistency is a different kind of smell. There is nothing wrong, except that you always find yourself looking up how to use components. Different conventions are used. Sometimes you need to send an array, other times an object. For no apparent reason, things are done differently from one place to the next. Most of the time, these are easy to fix. Find out which way is the right one and deploy it all over. Don’t let the wrong way be used again. Most of the time, the wrong ways spread because someone looked up an example and took the wrong one. Fixing those issues does not have such a large impact by itself, but it will often reduce the clutter in the code. With less code remaining, it will be easier to see the other problems.

What have I done to this airport?

No matter what I have done wrong in this world, my punishment has a name: Newark. It seems I can’t think about going there without something going wrong. A few months ago, it was a missed connection caused by the US border security system going down, which created extremely long waiting lines. Today, I get to spend a lot more quality time in this sterile environment called Newark. At least, this time, it’s possible to sit somewhere to eat.

My flight was supposed to leave Montreal at 17:10. As I got to the airport, it was cancelled. No problem. They booked me on the 14:30 flight, which was delayed until 18:30. One more hour in Montreal, one less in New Jersey. Felt like a good deal. My connection there was planned to be slightly over 3 hours anyway.

Bad luck. They got a green light to take off at 16:30. That’s one more hour in New Jersey.

After a terribly bumpy ride in a regional jet under strong winds, in which every passenger’s face was blank (except for those laughing nervously, clearly having lost control), I was greeted in Newark by an announcement: My next flight was delayed by 3 hours.

That does leave plenty of time to write documentation.

Update: 4 hours delay.

Update: 4.5 hours delay.

Happy New Year?

Year 2004 ended up being a very bad year for our planet. With war, George W. Bush being re-elected and the destructive tsunami right before the end of the year, we can be assured the new year won’t be worse. For once, the media did a great job covering the event and pushing populations worldwide to support the victims of the tsunami. Was it the Christmas spirit? By now, over two billion has been unlocked for various purposes. The question is: how long will it last? Everyone’s favorite collective encyclopedia has great coverage of the tragedy (I took the animation from there; a complete one is also available).

Tsunami Representation Animation

Satellite images of the disaster have been taken (before, after, sometimes during). Seeing those images, it’s easy to understand why there are so many missing and dead. According to the BBC, America is not safe from such disasters. Note that this source has been refuted multiple times since, even by themselves.

From a non-human point of view, the earthquake that caused the tsunami was so strong that it actually changed Earth’s rotation axis. The shift should not make any noticeable difference, since the planet usually sees greater variations over a year (whew!). Still, we lost around 3 microseconds in the process.