Digging into MLops: how it can help get your Machine Learning models into production
For companies with data science departments, the honeymoon phase is coming to an end, and they are facing practical issues.
The teams are often effective in delivering machine learning models, but these do not make it to a release. The data scientists have no clue how (and argue that this is not part of their job), and the operations department does not know how to deal with these black-box algorithms.
We may be in the Golden Age of AI, but the question I hear most from companies is: how the hell do we get these models into production?
ML-what?
‘MLops’, ‘AIops’, ‘DataOps’, DevOps for Data Science…
It goes by many names.
As the name suggests it is a variation of DevOps, a set of concepts that has become a best practice in tech over the last few years. It is “a set of practices that automates the processes between software development and IT teams, in order that they can build, test, and release software faster and more reliably. The concept of DevOps is founded on building a culture of collaboration between teams that historically functioned in relative siloes” (source).
Finally gaining ground, MLops attempts to bridge the gap between operations and data scientists. There is no dictionary term available yet, but according to Wikipedia:
MLops is a practice for collaboration and communication between data scientists and operations professionals to help manage production machine learning (or deep learning) lifecycle.
Working with ISV companies as a cloud solution architect I have seen the AI momentum first hand for the last few years. Hell, at times I was part of it.
Many of these companies have started building their own machine learning models, with some even running full-fledged data science departments.
There are, however, practical issues.
The biggest complaint is that the models are not getting into production. And even if companies succeed in doing this, the models’ accuracy dwindles quickly. This can even result in having to retract the entire model (making the investment worthless).
MLops promises to solve these issues. Let’s dig deeper into what this means.
How is it different from DevOps?
Data science is not development. There are some unique problems that need to be dealt with:
Collaboration
Collaborative coding is not the norm in most data science teams.
This is not the case everywhere, but in my experience it holds for the majority of companies.
In my opinion there are two main causes:
1. Data scientists are expensive and hard to find. Many companies are happy when they have just one on payroll.
2. In academia, working together on code in collaborative teams is less common than it is in business. And the bulk of data scientists are trained in the ‘traditional’ academic way.
As a result, a lot of coding is still being done in local environments. This means that code needs to be exported once the model is finished, or when collaboration is necessary. Not only does this slow down collaboration, it also adds to the problem of entanglement of systems.
Entanglement of systems means that one change in the code can change (and break) everything. So even if the code is written in notebooks (or somewhere else in the cloud), there are no code reviews or wikis that describe what’s in the model. If the code’s author leaves the company and the model needs to be maintained or optimized, problems arise quickly. Correct use of source and version control is therefore an important skill that should be taught to all teams.
Machine learning models are not built to go into production.
Even before we start programming or processing our data, there is an important question each data science team should ask: “how will this model be applied?”.
Algorithms often do not serve a business case. They might be good academic models, but they are horrible models in practice because they cannot be applied.
For example: you’re working for an online retail company. The company’s goal is to sell its products better and faster. Your data science team has created a model that classifies product images into categories: trousers, shirts, bags, etc.
The algorithm hits a high accuracy using the latest techniques, it was fun to build, and the team learned a lot. But the model does not serve a company use case, does not solve a customer blocker, and does not enhance the customer experience.
Your model may be the best one out there, but it does not serve a GOAL. And therefore it has no practical value.
A lot of the pain for companies lies here: they have invested in a data science team that delivers models, but those models are not applicable anywhere.
But let’s assume your model is finished and does something useful (yay!). The data science team itself does not understand what is happening in the model (see ‘Collaboration’ above). And when the model is passed on to operations (or any other department for that matter), progress comes to an immediate halt.
This is the black box paradox. In the words of Kaz Sato (developer advocate at Google): “a black box that no one understands. The researcher doesn’t understand the code. The engineer does not understand the model”.
In the worst-case scenario, this can result in having to rewrite the entire codebase or discard the model as a whole.
Testing (and maybe some day: automated testing)
There are different kinds of tests in software engineering, and automating them in your pipeline is one of DevOps’ main goals.
But data science is not software development. And the field is still struggling to define how to apply these tests to a machine learning pipeline.
Testing is a broad term, and it comes in different forms. Tests examine the model and in some cases interact with it. The aim is usually to verify that your model is robust and will not break (resulting in unexpected behaviour) when exposed to the outside world.
An example is unit testing, or component testing, which “validates that each unit of the software performs as designed” (source). It prevents time loss in debugging and training and makes your code more reusable.
Chase Roberts wrote a great article on how to unit test machine learning code. He even wrote ‘mltest’, a library that runs these unit tests on your code. It is one of the few articles I could find that gives a concrete example of what a test looks like for machine learning models. Other resources are very, very welcome, so please share if you have more.
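To make this more concrete, below is a minimal sketch of what such unit tests could look like. It uses pytest and scikit-learn rather than Chase Roberts’ mltest, and the model and data are toy examples I made up for illustration.

```python
# A minimal sketch of unit tests for a model, using pytest and scikit-learn.
# The model and data here are toy examples, not a real production pipeline.
import numpy as np
import pytest
from sklearn.linear_model import LogisticRegression


@pytest.fixture
def toy_data():
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 5))
    y = (X[:, 0] + rng.normal(scale=0.1, size=100) > 0).astype(int)
    return X, y


def test_output_shape(toy_data):
    # The model should return exactly one prediction per input row.
    X, y = toy_data
    model = LogisticRegression().fit(X, y)
    assert model.predict(X).shape == (X.shape[0],)


def test_probabilities_are_valid(toy_data):
    # Predicted probabilities must lie between 0 and 1 and contain no NaNs.
    X, y = toy_data
    model = LogisticRegression().fit(X, y)
    proba = model.predict_proba(X)
    assert np.all((proba >= 0) & (proba <= 1))
    assert not np.isnan(proba).any()


def test_training_beats_chance(toy_data):
    # A trained model should do better than random guessing on this toy task.
    X, y = toy_data
    model = LogisticRegression().fit(X, y)
    assert model.score(X, y) > 0.7
```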
Testing your model should be the first step, and automating these tests the second. With automated testing, your MLops pipeline automatically runs through predefined tests. This helps with quality control and security of the code.
Microsoft’s MLops repository on GitHub talks about how “once the Azure DevOps build pipeline is triggered, it performs code quality checks, data sanity tests, unit tests, builds an Azure ML Pipeline and publishes it in an Azure ML Service Workspace”. The details of what these tests entail are unclear to me at the time of writing this article. There are many different kinds of tests beyond unit testing (e.g. penetration testing, …). Applying them to a machine learning pipeline is a challenge that different parties are trying to find an answer to.
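Still, to give a feel for what a ‘data sanity test’ could mean in practice, here is a purely illustrative sketch; the column names and thresholds are my own inventions, not what Microsoft’s pipeline actually checks.

```python
# Purely illustrative: one possible shape of a data sanity test, run before training.
# The expected columns and thresholds are invented for this example.
import pandas as pd

EXPECTED_COLUMNS = {"product_id": "int64", "price": "float64", "category": "object"}


def check_data_sanity(df: pd.DataFrame) -> None:
    # 1. Schema: every expected column is present with the expected dtype.
    for col, dtype in EXPECTED_COLUMNS.items():
        assert col in df.columns, f"missing column: {col}"
        assert str(df[col].dtype) == dtype, f"unexpected dtype for {col}: {df[col].dtype}"
    # 2. Completeness: no more than 1% missing values per column.
    assert (df.isna().mean() <= 0.01).all(), "too many missing values"
    # 3. Plausibility: prices must be positive.
    assert (df["price"] > 0).all(), "non-positive prices found"


# Example usage as a pipeline step (file name is hypothetical):
# check_data_sanity(pd.read_csv("training_data.csv"))
```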
Monitoring
Because a model isn’t finished once you hit that right accuracy.
The model might need to be optimized or adjusted somewhere down the line. Monitoring your model is a great way to keep track of data drift, but also to keep an audit trail. It can be a dashboard that is checked regularly, but you can also set alerts that notify you when a threshold has been reached.
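As a sketch of what such an automated check could look like, the snippet below compares the distribution of each incoming feature against a reference sample kept from training and flags the features that have drifted. The statistical test and the alert threshold are assumptions on my part, not a standard recipe.

```python
# A minimal sketch of a data drift check, assuming you keep a reference sample
# of the training data around. The test and threshold below are assumptions.
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE_THRESHOLD = 0.01  # assumed alert threshold


def check_drift(reference, live, feature_names):
    """Return the names of features whose live distribution has drifted
    away from the training reference, per a two-sample KS test."""
    drifted = []
    for i, name in enumerate(feature_names):
        _, p_value = ks_2samp(reference[:, i], live[:, i])
        if p_value < DRIFT_P_VALUE_THRESHOLD:
            drifted.append(name)
    return drifted


# Example: alert on any drifted features seen in the latest batch of requests
# (variable names here are hypothetical).
# drifted = check_drift(reference_sample, latest_requests, ["price", "num_items"])
# if drifted:
#     print(f"Data drift detected in: {drifted}")
```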
Conclusion (benefits and challenges)
We may be in the golden age of AI, but having a successful implementation requires investment and training.
During my Master’s degree I was trained as a traditional data scientist. That means I spent my days building mathematical models and applying them to a (prepped) dataset. The goal was academic: to test a hypothesis.
With this background the world of DevOps is a shocking one: source code, version control, Git, branches and testing are new terms that conceptually make sense but are hard to apply in practice, and that we do not always have an answer to yet.
Writing algorithms that apply to business cases requires a different way of working. It changes how you approach problems and set goals. For one, you are not testing a hypothesis. You are creating a model that takes input (often in real time) and needs to produce output immediately.
It also needs to be robust, so it can interact with the real world without breaking down. This requires new coding principles that I and other data scientists need to skill up in (and fast!):
Simplicity > complexity, without losing accuracy
Applicable to business / use cases
Interpretability > black box
In the winter of 2019 I decided that I wanted to dig deeply into MLops. The first stop for me was to learn more about DevOps and software engineering. I studied for (and passed) the AZ-400 exam, started building an MLops environment, and had regular discussions with my mentor Tim Scarfe.
My biggest conclusion is that MLops is designed too much from the perspective of the software engineer and developer.
The learning curve is large from a data scientist perspective, and there are still large gaps to fill (for example: what does testing mean for a machine learning pipeline?).
Let me stress once more that a data scientist is not a software engineer. Assuming they are familiar with DevOps terminology, or have experience with version control, is not fair. Your data scientists need to skill up in software engineering principles and learn a new approach to creating models. Machine learning lifecycle management is an important investment required to succeed.
Despite these gaps, MLops promises many advancements for the data science department. It provides the tools to make machine learning models more applicable, and it can remove hurdles to collaborating with operations and getting models into production.