The best way to improve a system’s stability is to make sure the issues are visible to everyone, every day. We all get too many emails of varying relevance. So many that our vision blurs when we look at our mailbox, and our brains automatically filter out anything that has been around for too long. The time to handle an issue becomes:
- Immediately
- End of day
- Never.
The best software systems are built with error handling. They detect errors and validate that each task behaved as expected and that its results are within normal boundaries. Typically, the go-to technique to report these issues is to send an email to the right person. That works until the right person moves on with their life. Then the notification goes to someone who has no clue how to handle it or how important it is. The issues escalate to developers, and that’s how my mailbox fills up with useless, redundant requests.
Don’t manage errors with a mailbox. Mailboxes are great for handling immediate issues, but their search capabilities are really not as good as you need them to be. When an issue arrives in a mailbox, it does not have any context. It’s buried in endless noise in which you cannot see the patterns. The hard issues can rarely be fixed without identifying a pattern.
I searched for a system that would allow me to do this. I found quite a few interesting solutions, but nothing that suited what I wanted to do. It turns out it’s a single database table, a couple of lines of SQL, and rendering using d3.js.
The table?
- An autoincrement ID because the framework likes it better
- An identifier for the notification
- A date
- A status (green, yellow or orange)
- A time to live
- A one-line message
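As a rough sketch of what such a table could look like, here is one possible layout using SQLite from Python. The table name `notification` and the column names are my own assumptions for illustration; the list above only names the fields, not how they are stored. Red is left out of the allowed statuses because it is never pushed by a task.

```python
import sqlite3

# Hypothetical table and column names: the field list only says what is
# stored, not what anything is called.
SCHEMA = """
CREATE TABLE IF NOT EXISTS notification (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- the ID the framework likes
    task        TEXT    NOT NULL,                   -- identifier for the notification
    reported_at TEXT    NOT NULL,                   -- ISO 8601 date and time
    status      TEXT    NOT NULL
                CHECK (status IN ('green', 'yellow', 'orange')),
    ttl_seconds INTEGER NOT NULL,                   -- time to live
    message     TEXT    NOT NULL                    -- one-line message
);
CREATE INDEX IF NOT EXISTS idx_notification_task_date
    ON notification (task, reported_at);
"""


def create_schema(conn):
    """Create the notification table and its index if they do not exist."""
    conn.executescript(SCHEMA)


# In-memory database for demonstration purposes only.
conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The index on task and date is an assumption as well; it simply matches the two lookups a dashboard like this needs, the latest row per task and the history of one task.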
Each task or verification in the system pushes its state at the end of the task: green if all is good, orange if there is an error, yellow if something is preventing execution.
Red is reserved for moments where the task does not execute within the expected TTL, meaning the task either did not run at all or crashed in the process.
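To make the rule concrete, here is a minimal sketch of how the reporting and the red rule could work, again assuming a hypothetical `notification` table in SQLite. The function names `report` and `current_status` are illustrative, and treating a task that has never reported as red is my own assumption. The key point is that red is never stored: it is derived when reading, by comparing the age of the latest notification to its TTL.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_db():
    """Open an in-memory database with a hypothetical notification table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE notification (
               id          INTEGER PRIMARY KEY AUTOINCREMENT,
               task        TEXT    NOT NULL,
               reported_at TEXT    NOT NULL,
               status      TEXT    NOT NULL,
               ttl_seconds INTEGER NOT NULL,
               message     TEXT    NOT NULL
           )"""
    )
    return conn


def report(conn, task, status, ttl_seconds, message, now=None):
    """Called by a task when it finishes: record green, yellow or orange."""
    if status not in ("green", "yellow", "orange"):
        raise ValueError("tasks never report red; it is derived from the TTL")
    now = now or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO notification (task, reported_at, status, ttl_seconds, message) "
        "VALUES (?, ?, ?, ?, ?)",
        (task, now.isoformat(), status, ttl_seconds, message),
    )


def current_status(conn, task, now=None):
    """Return the color to display for a task right now."""
    now = now or datetime.now(timezone.utc)
    row = conn.execute(
        "SELECT reported_at, status, ttl_seconds FROM notification "
        "WHERE task = ? ORDER BY id DESC LIMIT 1",
        (task,),
    ).fetchone()
    if row is None:
        return "red"  # assumption: a task that never reported is shown as red
    reported_at, status, ttl_seconds = row
    age = now - datetime.fromisoformat(reported_at)
    # The task did not report within its TTL: it did not run, or it crashed.
    if age > timedelta(seconds=ttl_seconds):
        return "red"
    return status
```

With a one-hour TTL, a task that reported green thirty minutes ago shows green, while the same row read two hours later shows red without anything having been written.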
It turns out green is addictive. If you present a dashboard with green bubbles for every success, the few occasional errors stand out and people put in the effort to get back to green. Sometimes resolving the issue involves contacting the customer, sometimes it requires adjusting tests, but having everything in a single location makes it possible to see patterns. Failures rarely come alone.
To help further, when you click on one of those circles, you get to see a graph showing the history of the task. I used a plot of time for X and time between notifications for Y. Dot color shows the state. Interesting patterns emerge there.
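The data behind such a graph is simple to derive. The sketch below shows one way to turn a task's history into plot points, with time on X, the gap since the previous notification on Y, and the status as the dot color. The function name `history_points` and the shape of the output are assumptions for illustration; the rendering side is not shown.

```python
from datetime import datetime


def history_points(rows):
    """Build plot points from (iso_timestamp, status) pairs sorted by time.

    x is the notification time, y is the number of seconds since the
    previous notification, and color is the reported status. The first
    notification has no predecessor, so it produces no point.
    """
    points = []
    previous = None
    for reported_at, status in rows:
        current = datetime.fromisoformat(reported_at)
        if previous is not None:
            points.append(
                {
                    "x": reported_at,
                    "y": (current - previous).total_seconds(),
                    "color": status,
                }
            )
        previous = current
    return points


# Example history for one hypothetical hourly task that skipped a run.
rows = [
    ("2024-01-01T00:00:00+00:00", "green"),
    ("2024-01-01T01:00:00+00:00", "green"),
    ("2024-01-01T03:00:00+00:00", "orange"),
]
points = history_points(rows)
```

A task that normally runs hourly draws a flat line at 3600 seconds, so a skipped run shows up as a spike even when every individual notification was green.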
As a bonus, an iframe points to a page on the wiki with troubleshooting information. Documentation is only ever used when it is in context.
The great thing about having a dashboard is that you can then include additional information in it. The more value it has, the more people will access it. It needs to make their daily job of managing the site easier.
I hope it will keep my inbox more manageable. Meanwhile, we can get a good handle on the system’s behavior, resolve long-standing issues and improve performance.
I like the idea.
Green lights are indeed quite addictive.
I wonder if you have actually implemented it or if you are explaining something you are currently building.
As for the statuses, errors and states in an application, which elements trigger those lights? Do you tail many log files and raise flags when something happens? How do you make sure the cause of the trigger is effectively solved? Is there some matrix-like table with boolean flags?
Tell us more 🙂
They are mostly triggered manually from the code. These are primarily to monitor tasks that happen on a regular basis, either because they are expected to happen or because they run as a cron job. A couple of months in, it has been quite useful for detecting some silent errors and understanding some error patterns in the system. It’s quite easy to get reporting when a failure occurs. It’s harder to see the problem when something did not occur.
It does not really matter if the issue is resolved or not. What matters is that the system is in a good state right now. There is a complete history that can be displayed as a graph, which is sometimes useful too.