Live from IBM InterConnect: DevOps meets data science
I just attended a session at IBM InterConnect on DevOps analytics. I expected a few nice BI-style charts and so on. But these guys are thinking big.
The problem can be described relatively simply: you are running a DevOps team and you need to push out a new build as soon as possible. Even assuming you have subjected the build to a sensible and comprehensive set of automated tests, many risks remain as you go live. Typically you will go live in a staged manner.
IBM uses a subset of machines that it calls the “Canary stage.” As the new build is deployed there and some of the traffic is directed to it, a number of metrics are collected and compared live with the remainder of your production system, which still runs the previous release. You compare the metrics with SLAs, and if they run fine over a meaningful period of time you can make the call of either going live completely or initiating a rollback.
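To make the go/no-go call concrete, here is a minimal sketch of such a canary check in Python. The metric (response time), the SLA threshold and the 10% tolerance against the baseline are all hypothetical; the session did not describe IBM's actual decision logic.

```python
# Sketch of a canary check: compare a metric sampled from the canary
# subset against the same metric from the baseline (previous release),
# and against an SLA threshold. All numbers are illustrative.

from statistics import mean

def canary_ok(canary_samples, baseline_samples, sla_ms, tolerance=1.10):
    """Pass if the canary stays under the SLA and within 10% of baseline."""
    canary_avg = mean(canary_samples)
    baseline_avg = mean(baseline_samples)
    return canary_avg <= sla_ms and canary_avg <= baseline_avg * tolerance

# Example: response times in milliseconds over the observation window
baseline = [120, 125, 118, 130]
canary   = [122, 127, 121, 126]
decision = "go live" if canary_ok(canary, baseline, sla_ms=200) else "roll back"
```

In practice you would run such a comparison per metric and over a sliding window rather than a single average, but the shape of the decision is the same.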
So far, this is all quite reasonable. But where it gets interesting is what you can do when things go wrong. In that situation you are under pressure, because a planned deployment has to be undone, a fix prepared, a new build made, new automated tests run and a new deployment started. Apart from the rollback, none of this can happen until you have identified what went wrong. And the answer to that is a needle lying in a haystack of possibilities. Let me list a few:
- changes in the deployment logic
- changes in build configuration
- changes in application configuration
- changes in networking
- a missed defect in the application code
So you have a potentially large search space. In classical situations you’d have teams scouring logs, rerunning the application in controlled environments with a few selected changes, developing a lot of theories… the usual process of defect isolation we all know. Only that this takes too long.
So the IBM folks said: let’s capture everything, from configuration files and log files to Git commits. And then use data mining to compare this current (bad) deployment to the previous (good) deployment.
Sounds easy – it’s just a diff, isn’t it? Except that we are dealing with
a) a large set of data
b) data that is only partially structured (think of diffing build scripts in an automated manner)
c) data that has temporal meaning that may be relevant (or not)
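For the structured slice of that data, the idea can be sketched as a diff of two deployment snapshots flattened into key/value pairs. The snapshot format and keys below are my own illustration, not anything IBM showed; the hard part the session described is getting the semi-structured and temporal data into a comparable shape at all.

```python
# Minimal sketch of diffing two deployment snapshots. Each snapshot is
# flattened into key/value pairs (config entries, build flags, ...).
# Keys and values here are purely illustrative.

def snapshot_diff(good, bad):
    """Return keys that were added, removed, or changed between snapshots."""
    added   = {k: bad[k] for k in bad.keys() - good.keys()}
    removed = {k: good[k] for k in good.keys() - bad.keys()}
    changed = {k: (good[k], bad[k]) for k in good.keys() & bad.keys()
               if good[k] != bad[k]}
    return added, removed, changed

good = {"jvm.heap": "2g", "db.pool": "20", "feature.x": "off"}
bad  = {"jvm.heap": "2g", "db.pool": "5",  "feature.x": "on", "cache.ttl": "60"}
added, removed, changed = snapshot_diff(good, bad)
# changed → {'db.pool': ('20', '5'), 'feature.x': ('off', 'on')}
```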
The DevOps Data Scientist
So what needs to be done looks like the typical big data problem – you need to add meta-information in the form of tags and patterns that let you decode the essential information. Unless you do that, it’s nothing but a big soup of ones and zeros.
Much of this can be built up when you set up the data collection feeds. It is a bit of work, but nothing uncommon.
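As a sketch of what tagging at ingestion time might look like: attach metadata (deployment id, source, an extracted pattern such as the log level) to each raw record as it is collected, so it is queryable later when you are under pressure. The field names here are hypothetical.

```python
# Sketch of enriching raw collected records with tags at ingestion time.
# Tag and field names are illustrative assumptions.

import re

def tag_record(raw_line, deployment_id, source):
    record = {"raw": raw_line, "deployment": deployment_id, "source": source}
    # extract a simple pattern (here: the log level) so it is queryable later
    m = re.search(r"\b(ERROR|WARN|INFO)\b", raw_line)
    record["level"] = m.group(1) if m else "UNKNOWN"
    return record

rec = tag_record("2015-02-24 ERROR pool exhausted", "build-1042", "app-log")
# rec["level"] → "ERROR"
```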
What is different, though, is the added dimension of urgency – you can’t do this work while you are facing the problem; you need to have it in place by then. So you need a data scientist. And not just any data scientist, but an agile one who can tweak all the data mining on the fly. Why? Because nothing is ever so simple that you can do it all beforehand and expect it to be complete.
As you run through your deployments and the odd one breaks, you will find new issues that your current set of diffs and metrics can’t answer. So you need to be able to identify new questions and build ways to answer them while the DevOps team is waiting for you.
There are many more challenges, as the IBM team highlighted. As you collect data you need to apply filters to deal with:
- uncorrelated noise (other systems running on the same hardware as your canary VMs, network traffic…)
- warmup behavior (saturation of caches for instance)
- uneven loads
- normalisation of data
- non-identical machines
But so are the opportunities. If you collect comprehensive data, you can potentially use it to predict the behaviour of specific builds and configurations. You can take builds, stick them into a test environment and bombard them with historic data to see how they fare. That, together with historical analysis, can help you make the deployment and ops side more robust.
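The replay idea can be sketched like this. The transport function `send` stands in for whatever client the test harness uses; nothing here reflects IBM tooling.

```python
# Sketch of replaying recorded historic traffic against a candidate build
# in a test environment. `send` is a stand-in for the real test client.

def replay(recorded_requests, send):
    """Replay requests and collect any that the candidate answers with a 5xx."""
    failures = []
    for req in recorded_requests:
        status = send(req)
        if status >= 500:
            failures.append((req, status))
    return failures

# Example with a stubbed transport
history = ["GET /a", "GET /b", "POST /c"]
stub = lambda req: 500 if req.startswith("POST") else 200
failures = replay(history, stub)
# failures → [('POST /c', 500)]
```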
I think much of this is still very experimental and speculative, with more learning required. But the fundamental premise is sound and will most probably become part of accepted practice over time.