The best way to improve a system’s stability is to make sure its issues are visible to everyone, every day. We all get too many emails of varying relevance. So many that our vision blurs when we look at our mailbox, and our brains automatically filter out the messages that have been around for too long. The time to handle an issue becomes:
- Immediately
- End of day
- Never.
The best software systems are built with error handling. They detect errors and verify that each task behaved as expected and that its results are within normal boundaries. Typically, the go-to technique to report these issues is to send an email to the right person. That’s until the right person moves on with their life. Then that notification goes to someone who has no clue how to handle it or how important it is. The issues escalate to developers, and that’s how my mailbox fills up with useless, redundant requests.
Don’t manage errors with a mailbox. Mailboxes are great for handling immediate issues, but their search capabilities are really not as good as you need them to be. When an issue arrives in a mailbox, it does not have any context. It’s buried in endless noise in which you cannot see the patterns. The hard issues can rarely be fixed without identifying a pattern.
I searched for a system that would give errors that kind of visibility and context. I found quite a few interesting solutions, but nothing that suited what I wanted to do. It turns out all I needed was a single database table, a couple of lines of SQL, and some rendering with d3.js.
The table?
- An autoincrement ID because the framework likes it better
- An identifier for the notification
- A date
- A status (green, yellow or orange)
- A time to live
- A one-line message
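A minimal sketch of such a table, assuming SQLite through Python's standard `sqlite3` module. The table name `notification`, the column names, and the choice of Unix timestamps and a TTL in seconds are all assumptions made for illustration; the list above only names the fields.

```python
import sqlite3

# Hypothetical names throughout: the post lists the fields, not the schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS notification (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- kept for the framework
    task        TEXT    NOT NULL,                   -- identifier for the notification
    recorded_at INTEGER NOT NULL,                   -- Unix timestamp, in seconds
    status      TEXT    NOT NULL
                CHECK (status IN ('green', 'yellow', 'orange')),
    ttl         INTEGER NOT NULL,                   -- seconds until the next report is due
    message     TEXT    NOT NULL                    -- one-line message
);
CREATE INDEX IF NOT EXISTS notification_task_time
    ON notification (task, recorded_at);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the notification table and its lookup index."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Red is deliberately absent from the `CHECK` constraint in this sketch: a task never reports red itself, so that state is never stored.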
Each task or verification in the system pushes its state when it finishes: green if all is good, orange if there is an error, yellow if something is preventing execution.
Red is reserved for moments where the task does not execute within the expected TTL, meaning the task either did not run at all or crashed in the process.
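The reporting and the red rule can be sketched as follows, again with Python's `sqlite3`. The names `report`, `current_status`, and the table layout are hypothetical, and the sketch assumes that the row with the highest `id` is the latest report for a task. Red is computed at read time, when the latest report is older than its TTL.

```python
import sqlite3
import time


def open_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical notification table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notification ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " task TEXT NOT NULL, recorded_at INTEGER NOT NULL,"
        " status TEXT NOT NULL, ttl INTEGER NOT NULL, message TEXT NOT NULL)"
    )
    return conn


def report(conn, task, status, ttl, message, now=None):
    """Called by a task when it finishes, with green, yellow or orange."""
    now = int(time.time()) if now is None else now
    conn.execute(
        "INSERT INTO notification (task, recorded_at, status, ttl, message)"
        " VALUES (?, ?, ?, ?, ?)",
        (task, now, status, ttl, message),
    )


# Latest row per task; it turns red once its TTL has expired.
CURRENT_STATUS_SQL = """
SELECT n.task,
       CASE WHEN n.recorded_at + n.ttl < :now THEN 'red' ELSE n.status END,
       n.message
FROM notification AS n
WHERE n.id = (SELECT MAX(m.id) FROM notification AS m WHERE m.task = n.task)
ORDER BY n.task
"""


def current_status(conn, now=None):
    """Return (task, status, message) for the dashboard bubbles."""
    now = int(time.time()) if now is None else now
    return conn.execute(CURRENT_STATUS_SQL, {"now": now}).fetchall()


conn = open_db()
report(conn, "nightly-import", "green", ttl=3600, message="42 rows", now=1000)
report(conn, "invoice-check", "orange", ttl=3600, message="totals mismatch", now=1000)

fresh = current_status(conn, now=2000)  # both reports still within their TTL
stale = current_status(conn, now=5000)  # both TTLs expired: everything is red
```

Because red is derived from the clock rather than stored, a task that crashes or never runs needs no cooperation to show up on the dashboard.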
It turns out green is addictive. If you present a dashboard with green bubbles for every success, the few occasional errors stand out and people put in the effort to get back to green. Sometimes resolving the issue involves contacting the customer, sometimes it requires adjusting tests, but having everything in a single location makes the patterns visible. Failures rarely come alone.
To help further, when you click on one of those circles, you get to see a graph showing the history of the task. The plot puts time on the X axis and the time between consecutive notifications on the Y axis. Dot color shows the state. Interesting patterns emerge there.
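The data behind that history graph is simple to derive. A sketch in Python, assuming each notification is available as a `(recorded_at, status)` pair with a Unix timestamp; the `Point` type and `history_points` function are hypothetical names. The first notification of a task has no predecessor, so it produces no dot.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Point(NamedTuple):
    x: int      # time of the notification
    y: int      # seconds elapsed since the previous notification
    color: str  # reported state of the notification


def history_points(rows: Iterable[Tuple[int, str]]) -> List[Point]:
    """Turn (recorded_at, status) rows for one task into plot points."""
    ordered = sorted(rows)  # oldest first
    return [
        Point(t, t - previous_t, status)
        for (previous_t, _), (t, status) in zip(ordered, ordered[1:])
    ]


points = history_points(
    [(0, "green"), (3600, "green"), (7300, "orange"), (14500, "green")]
)
```

For a task that reports hourly, the dots sit on a flat line near 3600 seconds. A dot that floats well above that line marks a gap where runs were skipped, which is the kind of pattern that never shows up in a mailbox.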
As a bonus, an iframe points to a page on the wiki with troubleshooting information. Documentation is only ever used when it is in context.
The great thing about having a dashboard is that you can then include additional information in it. The more value it has, the more people will access it. It needs to make their daily job of managing the site easier.
I hope it will keep my inbox more manageable. Meanwhile, we can get a good handle on the system’s behavior, resolve long-standing issues and improve performance.