
It’s not done yet

Every software developer needs to experience joining projects at different stages of their development lifecycle. At the start, you’re faced with an empty repository where you get to make important decisions every day, write a lot of code and functionality, and understand how everything fits together. In the middle, you’re worrying about all the little details, dealing with testers raising defects, running into gotchas that obliterate fundamental assumptions, and taking an application into production. And at the ‘end’, when the project is already running in live, you’re fixing bugs and adding new functionality to a codebase you don’t really understand, where writing one line of code that doesn’t cause breakage is an achievement.

At my current job I’ve been fortunate enough to work on three projects covering each of these cases and I’ve learnt a few things along the way.

The first is that writing lots of code and making decisions is fun! But when you only work on that part it’s easy to look down on those who work on the project later. Ha! I wrote 90% of that in 4 weeks; now you’re spending months just adding trivial enhancements around the edges without really understanding what’s going on.

What you miss, though, if you never have to take applications into live and beyond, is that there are a whole lot of things you could do that would make other people’s lives a whole lot easier, if only you knew about them. You’ve got to see the issues operations have to deal with in order to get the “yeah, there’s no way you could know that” empathy.

Logging

When you’re doing development, you don’t need very good logging. There’s a problem? Well, I’ll just read the code or step through it in a debugger. Once the application goes to testing it gets harder – um, I can’t reproduce that, can you give me more information? Production is worse still, with weird intermittent issues that operations need to detect and deal with even before talking to developers.

The first rule of logging is that when the application is running normally, the logs should be “clean”. Right from the beginning of development, we should keep an eye on what’s being logged. Do we get errors or warnings under normal operation? Fix them. You don’t want to be in a situation where you’re looking at a log full of errors and trying to figure out which ones are “normal” and which indicate a problem.

What do the logs contain when run at the level they’ll be at in production? We’re not running at DEBUG level – is there still enough information to see what’s going on? Are there lots of messages about trivial things and very little about other important parts of the system? Are there multiple lines of code that log the exact same message for different reasons?

Logging is one of those things that you need to care about right from the start. If you don’t keep an eye on it you’re never going to get it right. Functional (end-to-end) tests are useful here – check the output when they’re run.
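
As a rough illustration of what that check might look like, here’s a sketch assuming the application logs through java.util.logging (substitute whatever logging framework you actually use):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Captures WARNING-and-above records emitted while a functional test scenario runs. */
public class CleanLogCheck extends Handler {
    private final List<LogRecord> problems = new ArrayList<>();

    @Override public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
            problems.add(record);
        }
    }
    @Override public void flush() {}
    @Override public void close() {}

    /** Attach to the root logger before the scenario starts. */
    public static CleanLogCheck install() {
        CleanLogCheck check = new CleanLogCheck();
        Logger.getLogger("").addHandler(check);
        return check;
    }

    /** Fail the test if "normal" operation produced warnings or errors. */
    public void assertClean() {
        if (!problems.isEmpty()) {
            throw new AssertionError("Expected clean logs but got " + problems.size()
                    + " warning/error records, first: " + problems.get(0).getMessage());
        }
    }
}
```

Run your functional scenarios, then call assertClean() at the end. If the happy path already produces warnings, fix them now rather than teaching everyone to ignore them.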

Environment issues

I get really sick of dealing with P1 “defects” due to incorrectly set up test environments. The application fails with an error? What error’s that, then? “table FOO doesn’t exist”. Or the reference data isn’t set up correctly – “just don’t do that” and everything will be ok.

What I do these days is keep a list of environment issues in a table on the wiki. Every time someone screws up an environment because they forgot to apply a database patch, I add a row to the table with a query and expected output that you can use to verify that the error hasn’t occurred again. You can bet that if a mistake is made setting up one environment, it’s pretty likely to happen again in others.
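
You can take this a step further and turn the wiki table into a check that anyone can run against a suspect environment. A minimal JDBC sketch – the checks themselves are invented examples; the real ones come straight from the wiki table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Runs the environment checks from the wiki table and reports any mismatches. */
public class EnvironmentCheck {
    // Each check is a description, a query, and the output we expect in a correct environment.
    private static final String[][] CHECKS = {
        // Hypothetical example: the patch that added table FOO also inserts a version row.
        { "missing database patch 42", "SELECT COUNT(*) FROM schema_version WHERE patch = 42", "1" },
        { "reference data FOO not loaded", "SELECT COUNT(*) FROM foo_config", "12" },
    };

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(args[0]); // JDBC URL for the environment
             Statement stmt = conn.createStatement()) {
            for (String[] check : CHECKS) {
                try (ResultSet rs = stmt.executeQuery(check[1])) {
                    rs.next();
                    String actual = rs.getString(1);
                    if (!check[2].equals(actual)) {
                        System.out.println("FAILED: " + check[0]
                                + " (expected " + check[2] + ", got " + actual + ")");
                    }
                }
            }
        }
    }
}
```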

For reference data the application really needs to apply validation rules to the configuration at load time. It’s not good enough to say that “we don’t support configuring those things together” if the application could just as easily detect the situation and give an error.
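
The check itself is usually cheap to write. Something along these lines at load time – the rule and the fields here are invented, but the principle isn’t:

```java
/** Illustrative only: the rule and fields are made up, the point is to fail loudly at startup. */
public final class ReferenceDataValidator {

    public static void validate(String customerId, boolean prepaid, boolean hasCreditLimit) {
        // "We don't support configuring those things together" is not good enough --
        // detect the combination when the configuration loads and say exactly what's wrong.
        if (prepaid && hasCreditLimit) {
            throw new IllegalStateException("Invalid reference data for customer " + customerId
                    + ": a prepaid account cannot also have a credit limit");
        }
    }
}
```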

One of my favourite head-smacking issues with testing is when multiple deployments of the application end up pointing at the same database. This is especially fun when they’re different versions, and you spend hours trying to figure out how on earth this behaviour is happening in version 2.5.2 when we fixed this back in 2.5.1. This happens during development as well when you “temporarily” tell a colleague to use your development database. One way to avoid this problem is to have the application post regular updates to a ‘heartbeat’ table including a unique runtime identifier (e.g. random number generated at startup), last update timestamp and version information. Then you can run a query to find those identifiers that have updated in the last 5 minutes and see who’s really touching those tables.
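
A sketch of what that heartbeat might look like – the table name, columns and schedule are only suggestions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Posts regular heartbeats so you can see which running instances touch this database. */
public class Heartbeat {
    // Hypothetical table:
    //   CREATE TABLE heartbeat (instance_id VARCHAR(36), version VARCHAR(20), last_update TIMESTAMP)
    private final String instanceId = UUID.randomUUID().toString(); // unique per startup
    private final String version;
    private final Connection conn;

    public Heartbeat(Connection conn, String version) {
        this.conn = conn;
        this.version = version;
    }

    public void start() {
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::beat, 0, 1, TimeUnit.MINUTES);
    }

    private void beat() {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO heartbeat (instance_id, version, last_update) "
                + "VALUES (?, ?, CURRENT_TIMESTAMP)")) {
            ps.setString(1, instanceId);
            ps.setString(2, version);
            ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace(); // a failed heartbeat shouldn't take the application down
        }
    }

    // Then, to see who's really touching those tables:
    //   SELECT instance_id, version, MAX(last_update)
    //   FROM heartbeat
    //   GROUP BY instance_id, version
    //   -- and look at anything updated in the last 5 minutes
}
```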

Requirements changes

Requirements churn is a good thing. As we find out more about an application and how it’s used we will naturally want to change earlier decisions. The problem is that it becomes difficult for anyone to know how the system is supposed to behave – or actually behaves. The requirements have changed, but did we ever actually have a chance to implement those changes? How did we “interpret” those parts where there was hand-waving? Where did we intentionally deviate without telling anyone? ;-) What was that conversation where I spoke to a business analyst and they “clarified” some details of the requirements but never wrote it down?

On one of my projects back in NZ I was a bit laissez-faire about coming up with system behaviour that “made sense” without telling anyone. It wasn’t until we hit testing, when the testers had no better source of information than the original business requirements, that I realised that although the software functioned well, there were a whole bunch of people who needed to understand how the application worked without reading the code.

What I do these days is to write up wiki pages whenever I’m implementing a major piece of functionality, describing precisely the behaviour I’ve actually implemented. If a requirement was ambiguous and I had a conversation with an analyst, I write down the conclusion in my document. Don’t leave it up to them to do it.

The great thing about this approach is that it gives you a safe way forward when presented with imperfect requirements. You can go ahead and get most of the work done, and when it gets to testing and the testers interpret the requirements in a different way, go and mark the defect as “Functions as Specified” with a link to your document. I did that twice today and it makes life just that bit more pleasant. Quite often when there are multiple ways of designing some functionality it doesn’t matter all that much which one you pick, so long as people can easily find out which one it was. Even if it later turns out you misinterpreted the requirements, presented with a clear description of the actual system behaviour the business may decide that the implementation is good enough, and that the requirements should be retrospectively changed to match. Won’t Fix!

Bugs in live

If you hit a bug outside of production, there’s not too much of a problem. We’ll just fix the code, re-run the test, sorted. When you hit a bug in live though, you have to recover from it.

Bugger.

This is the hard part of writing “business applications”. Making them reliable. Most of the time we just don’t bother and then there’s a major panic when anything goes wrong. You’ll have requirements that cover ‘negative flows’ by saying “create an alert”. Great, so the live system has just logged an alert saying we’re fucked. Time to go on holiday. Otherwise you’re sitting in a room full of worried managers who want you to find a workaround and fast.

The best approach I’ve found for creating reliable applications is to expect them to fail. Expect that there will be bugs. Expect that hardware will fail. Expect that exceptions will occur. Expect that we’re going to have to recover.

How do we do that?

There’s no such thing as a database table that just contains internal application state that no-one other than the developers needs to understand. In some cases operations are going to have to hack this data – the structure and contents need to be well documented and understood. Functional tests are useful here; although some people claim that these tests should only assert against external interfaces and not against the database, I argue that the database is a semi-public interface. It’s not as formal and fixed as a web service, but developers, testers and operations all have a need to know how the database changes under processing.
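
In practice that means functional tests can legitimately look inside the database. A small helper along these lines is usually enough – the table and column in the usage comment are made up:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/** Helper for functional tests that treat the database as a semi-public interface. */
public class DbCheck {
    private final Connection conn;

    public DbCheck(Connection conn) { this.conn = conn; }

    /** Assert a single-value query, e.g. the state a row should be left in after processing. */
    public void assertValue(String query, String expected) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            if (!rs.next() || !expected.equals(rs.getString(1))) {
                throw new AssertionError(query + " did not return '" + expected + "'");
            }
        }
    }
}

// In a functional test (table and column names are made up):
//   db.assertValue("SELECT status FROM orders WHERE order_ref = 'FT-001'", "COMPLETE");
```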

Have hooks for inserting workarounds into live without deploying a new release (aka legitimate backdoors). For example, if the system applies validation rules to input messages, have some support for conditionally turning some of the rules off. One application I worked on was composed of dozens of little callable services with XML interfaces. The web interface included a developer-only page that had a textfield (service name) and textarea (xml) allowing you to send arbitrary internal messages to the live system. Dangerous but handy. Make sure you try your call in a test environment first! And don’t forget to secure it.
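
The rule switch doesn’t need to be anything clever, either. A sketch, assuming a hypothetical rule_switch table that operations can update directly:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Runtime kill-switch for individual validation rules, read from a (hypothetical) table. */
public class RuleSwitches {
    // CREATE TABLE rule_switch (rule_name VARCHAR(50) PRIMARY KEY, enabled CHAR(1))
    private final Connection conn;

    public RuleSwitches(Connection conn) { this.conn = conn; }

    /** Rules default to ON; inserting a row with enabled = 'N' turns one off without a release. */
    public boolean isEnabled(String ruleName) {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT enabled FROM rule_switch WHERE rule_name = ?")) {
            ps.setString(1, ruleName);
            try (ResultSet rs = ps.executeQuery()) {
                return !rs.next() || !"N".equals(rs.getString(1));
            }
        } catch (SQLException e) {
            return true; // if in doubt, keep validating
        }
    }
}

// At a validation point:
//   if (switches.isEnabled("CHECK_POSTCODE_FORMAT")) { validatePostcode(message); }
```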

Expect that processing is going to fail, manual workarounds will be applied, and we’ll then want the system to continue. The project with the callable services was at a Telco and integrated with billing systems and other monstrosities. Naturally some customers would have accounts that were … unusually set up, and would contradict any assumptions we could possibly make. To deal with this I turned the processing flow into a state diagram and then made it so that if an error occurred at any state the system would set a “manual” flag on that customer’s record, generate an alert message, allow for any manual fixups and have a “reprocess” button to try again.
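
Boiled down, the shape of that flow looks something like this – the states, the order type and the alert mechanism are stand-ins for whatever your system actually has:

```java
/** Sketch of the "flag it, alert, fix by hand, reprocess" flow; the types and states are made up. */
public class ProvisioningProcessor {

    enum State { RECEIVED, VALIDATED, BILLED, COMPLETE, MANUAL }

    interface Order {
        long id();
        State state();
        void setState(State state); // assumed to persist the change
    }

    interface Alerts {
        void raise(String message);
    }

    private final Alerts alerts;

    public ProvisioningProcessor(Alerts alerts) { this.alerts = alerts; }

    public void process(Order order) {
        try {
            while (order.state() != State.COMPLETE) {
                order.setState(nextStep(order)); // persist after every transition
            }
        } catch (Exception e) {
            // Any failure at any state parks the order for a human rather than killing the run.
            order.setState(State.MANUAL);
            alerts.raise("Order " + order.id() + " needs manual attention: " + e.getMessage());
        }
    }

    /** Wired to the "reprocess" button, once someone has fixed the data by hand. */
    public void reprocess(Order order) {
        order.setState(State.RECEIVED); // or resume from the last successfully persisted state
        process(order);
    }

    private State nextStep(Order order) {
        switch (order.state()) {
            case RECEIVED:  return State.VALIDATED; // validate(order)
            case VALIDATED: return State.BILLED;    // bill(order)
            case BILLED:    return State.COMPLETE;  // notifyCustomer(order)
            default: throw new IllegalStateException("Cannot process from " + order.state());
        }
    }
}
```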

Taking that approach introduces some design constraints into the system that you simply can’t add as an afterthought. It’s easy and convenient to say, well, at this processing stage we really need this piece of information, so instead of working it out again we’ll just persist it to the database at an earlier stage. That’s all well and good until the earlier stage didn’t actually go through automated processing – someone hacked it together by hand and then continued.

Whenever anyone pontificates about “loosely-coupled components” that’s what I think of. If Component A doesn’t actually run, will Component B still work, or does it depend on some of the internal behaviour of A?

Don’t try too hard to enumerate all the different things that can go wrong. Just assume that any component can fall over half way through in some unexpected way, possibly leaving data in a corrupted state. If you can handle that you can handle anything. Data corruption is an interesting one, actually. Try to segment data so that if, say, the data related to one customer is corrupted, processing can still continue for other customers. This is similar to the sharding that’s done for performance reasons.
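
The segmentation point is easy to lose in a batch-style design. What I aim for is roughly this shape, where one bad record is skipped and flagged rather than aborting the whole run (Customer and the work itself are placeholders):

```java
import java.util.List;

/** Sketch of per-customer isolation in a batch run; Customer and the step are placeholders. */
public class BatchRun {

    interface Customer { String id(); }
    interface Step { void run(Customer customer) throws Exception; }

    /** One corrupted customer record shouldn't stop the nightly run for everyone else. */
    public int runForAll(List<Customer> customers, Step step) {
        int failures = 0;
        for (Customer customer : customers) {
            try {
                step.run(customer);
            } catch (Exception e) {
                failures++;
                System.err.println("Skipping customer " + customer.id() + ": " + e.getMessage());
                // Mark the record for manual attention and carry on with the rest.
            }
        }
        return failures;
    }
}
```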

Functional Tests

Ok, so I’ve talked about functional tests in a previous post. That’s because they’re great. When a problem occurs in live and you can in ten minutes write a test to replicate the problem and experiment with workarounds – I’ve done that many times. When a new developer joins the project and can actually understand and discover functionality and avoid breaking half the application – yup.

The rest

There are plenty of other issues to consider. How do we deal with upgrades – can we migrate old data? So you’re changing the database schema – are you aware that people have written systems that perform queries directly against your database? They’re going to break and you won’t even know it.

Reporting is a great one. It seems that “reports” always get implemented at the last minute and are a nightmare because, surprise surprise, they want some information that, sorry, we just aren’t representing in our domain model. It gets even worse when you have performance challenges and you’d really like it if the application had persisted certain intermediate statistics as it processed, rather than having to work everything out afterwards. Actually, I could do a whole ’nother post on performance, and on system integrity checking (reconciliation), deployment (clustering)…

I’d be interested in hearing any ideas or war stories other people have about dealing with live systems. Leave them in the comments!
