How not to fail miserably with system integration

In the last few weeks, I’ve had the opportunity to get my hands on an SOA system involving multiple components. I had prior experience, but not with that many evolving services at the same time. When I initially read the architecture’s documentation, I had serious doubts any of it was going to work. There were way too many components, and each of them had to do a whole lot of work to get the simplest task accomplished. My role was to sit on the outside and call those services, so I did not really have to worry about the internals. Still, unless you’re working with Grade A services provided by reputable companies and used by tens of thousands of programmers, knowing the internals will save you a few pains.

Make sure you can work without them

I actually got to this one by accident, but it happened to be a good one. When I joined the team, there were plenty of services available. However, the entry point I needed in order to use any of them was not present at all. I had a vague idea of what the data would look like, so I started building static structures that would be sent off to the views to render the pages. I had a bit of a head start before the integrators joined the team and certainly did not want them to wait around for weeks until the service was delivered.

At some point, said service actually entered someone’s iteration, which meant it would be delivered in the near future. Fairly quickly, a contract was made for the exact data format that would be provided. Although I was not wrong about the values that would be provided, the format was entirely different. My initial format then became an intermediate format for the view, providing the strict minimum required, and a layer was added in the system to translate the service’s format into our own. The service was not yet available, so it was simply stubbed out. The conversion could be unit tested based on the static data in the service stub. Plugging in the service when it arrived was a charm and, except for a few environment configurations, it was transparent to integrators.
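
A minimal sketch of the arrangement, with hypothetical names (the real system had more moving parts than this): the stub and the real client share one interface, the stub returns static data matching the contract, and a converter translates the service format into the intermediate format the views expect.

    <?php
    // Hypothetical names, for illustration only.

    interface ProductServiceClient
    {
        /** Returns the raw structure exactly as the remote service defines it. */
        public function fetchProduct($id);
    }

    class StubProductServiceClient implements ProductServiceClient
    {
        public function fetchProduct($id)
        {
            // Static data matching the agreed-upon contract.
            return [
                'product-id'   => $id,
                'display-name' => ['en' => 'Blue widget'],
                'price-info'   => ['amount' => '19.99', 'currency' => 'CAD'],
            ];
        }
    }

    class ProductConverter
    {
        /** Translates the service format into the intermediate format the views use. */
        public function toViewFormat(array $raw)
        {
            return [
                'id'    => $raw['product-id'],
                'name'  => $raw['display-name']['en'],
                'price' => $raw['price-info']['amount'],
            ];
        }
    }

The views only ever see the output of the converter, so swapping the stub for the real client later is a configuration detail rather than a code change.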

During the entire development, this fake data ended up being very useful.

  • Whenever the services would change, there was an easy way to compare the expectations to what we actually received, allowing us to update the stub data and the intermediate layer.
  • When something would go terribly wrong and services would just fail, it was always possible to revert to the fake data and keep working.
  • It allowed us to reach code paths that would not normally be reached, like error handling.

Expect them to fail

One of the holy grails of SOA is that you can replace services on the fly and adjust capacity when needed. This may be partially true, but it also means that while your component works fine, the neighbor may be completely unreachable during maintenance or while restarting. If you happen to need it, you might as well be down in many cases. While one would hope services won’t crash in production, they happen to crash in development fairly often. To live with this, there is one simple rule: expect every single call to fail, and make a conscious decision about what to do about it. For one, it will make your system fault tolerant. If the only call that failed was fetching some information that is only used for rendering, it’s very likely that you don’t need to die with a 500 Internal Server Error. However, if you didn’t expect and handle the failure, that’s exactly what will happen.

Adding this level of error handling does add a significant cost to the development. It’s a lot of code and it requires a lot of thought. Live with it or reconsider your SOA strategy.

Early on, adding the try/catch blocks wasn’t much of a reflex. After all, you can write code that works without them, and PHP sure won’t tell you that you forgot one. When the first crashes occurred in development, we still had few services integrated. Service interruptions only got worse as we integrated more. What really pushed us towards more granularity in catching exceptions was not so much the ungraceful way they would break the system; it was the waste of time. With a few team members pushing features into a system, a 15-minute interruption may not seem like much, but it’s enough to break the flow, which is hard enough to get into in an office environment. Especially when the service that breaks has nothing to do with the task you’re on at the moment.

It does not take much.

  • When fetching information, have a fallback for missing information.
  • When processing information, make sure you give a proper notification when something goes wrong.
  • Log all failures and have at least a subtle indication for the end user that something went wrong and that logs should be checked before bothering someone else (a short sketch follows this list).
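
A minimal sketch of what that can look like for a block of data that is only used for rendering, reusing the hypothetical ProductServiceClient from earlier:

    <?php
    // Hypothetical helper: wrap the remote call, decide what a failure means
    // for this particular page, log it, and fall back instead of dying.

    function loadProductTeaser(ProductServiceClient $client, $id)
    {
        try {
            return $client->fetchProduct($id);
        } catch (Exception $e) {
            // Purely cosmetic data: not worth a 500 Internal Server Error.
            error_log('Product teaser unavailable for ' . $id . ': ' . $e->getMessage());
            return null; // The view skips the block and shows a subtle notice instead.
        }
    }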

Build for testability

Services live independently in their own memory space with their own state. They just won’t reset when your new unit test begins, making them a pain to test. However, that is far from being an excuse for not automating at least some tests. Every shortcut made will come back to haunt you, or someone else on the team. It’s very likely that you won’t be able to get the remote system into every possible state to test all combinations. Even if you could, the suite would most likely take a very long time to execute, leading to poor feedback. Mocks and stubs can get you a long way by making sure your code makes the correct calls in sequence (when it matters), passes the right values and stops correctly when an error occurs. That alone should give some confidence.
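
As a sketch, assuming PHPUnit and the hypothetical names from the earlier examples (ProductServiceClient, ProductConverter, loadProductTeaser), that kind of test can be as small as this:

    <?php
    // Two mock-based tests: one checks that the right call is made with the
    // right values, the other that a failure is handled gracefully.

    use PHPUnit\Framework\TestCase;

    class ProductIntegrationTest extends TestCase
    {
        public function testPassesTheRightIdAndConvertsTheResult()
        {
            $client = $this->createMock(ProductServiceClient::class);
            $client->expects($this->once())
                   ->method('fetchProduct')
                   ->with('42')
                   ->willReturn([
                       'product-id'   => '42',
                       'display-name' => ['en' => 'Blue widget'],
                       'price-info'   => ['amount' => '19.99', 'currency' => 'CAD'],
                   ]);

            $converter = new ProductConverter();
            $view = $converter->toViewFormat($client->fetchProduct('42'));

            $this->assertSame('Blue widget', $view['name']);
        }

        public function testFallsBackWhenTheServiceFails()
        {
            $client = $this->createMock(ProductServiceClient::class);
            $client->method('fetchProduct')
                   ->willThrowException(new RuntimeException('service down'));

            $this->assertNull(loadProductTeaser($client, '42'));
        }
    }

None of this proves the remote service behaves as documented, but it does document what we expect from it, which is exactly what makes the QA discussions later on so much shorter.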

To be able to check all the calls made, we ended up with an interface defining all possible remote calls, with the exact same parameters and return values as the remote systems. There was a lot of refactoring to get to that solution. Essentially, every single time an attempt was made to group some calls together, because they were called at the same time and shared parameters, or because it was too much data to stub out for just those two tiny values, it had to be redone. Either some error would happen with the real services, because the very few lines of code that were not covered by tests happened to contain errors, or something would come up and suddenly those calls were no longer grouped together.

As far as calling the real services goes, smoke testing is about the only thing I could really do: making a basic call and checking that the output seems to be in the appropriate format. In the best of worlds, the service implementers would also provide a stub in which the internal state can be modified, and maintain that stub to reflect the contract made by the real service. It could have solved some issues with the fact that some services are simply impossible to run in a development environment. Sticking to the contract is the only thing that can really be verified in an automated fashion during development. I first encountered that type of environment a few years back, where running a test actually meant walking to a room, possibly climbing a ladder, switching wires and getting back to the workstation to check.
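
For what it’s worth, here is a sketch (hypothetical names again) of the kind of stub I would have loved service implementers to ship and maintain alongside the contract: it honours the same interface, but lets tests put it in whatever state they need.

    <?php
    // In-memory stand-in for the real service. Tests control its state through
    // the extra methods; production code only ever sees the shared interface.

    class InMemoryProductServiceClient implements ProductServiceClient
    {
        private $products = [];
        private $available = true;

        // Test-only back doors to control the internal state.
        public function addProduct(array $raw)
        {
            $this->products[$raw['product-id']] = $raw;
        }

        public function makeUnavailable()
        {
            $this->available = false;
        }

        // The contract itself, behaving the way the real service is documented to.
        public function fetchProduct($id)
        {
            if (!$this->available) {
                throw new RuntimeException('Service unavailable');
            }
            if (!isset($this->products[$id])) {
                throw new RuntimeException('Unknown product ' . $id);
            }
            return $this->products[$id];
        }
    }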

Have an independent QA team

You might not fail miserably, but the chances that you fail are fairly high when a lot of components need to talk to each other and there is no way you can replicate all of it at once. A good QA team testing in an environment that maps to the production environment will find the most mind-boggling issues. In most cases, they are caused by a mismatch between the implementor’s and the client’s understanding of the interface. When you have a clear log pointing out the exact source of the problem and all your expectations documented in tests and stubs, it does not take a very long discussion to find the source of the issue. Fixing it then just becomes a matter of adjusting the stubs and fixing the broken tests.

If you’re lucky enough, there might not be any issues left when it goes to production. Seriously, don’t overdo SOA. It’s not as magic as the vendors or “enterprise architects” say it is.
