In this series on DevOps for Data Science, I’ve explained the concept of a DevOps “Maturity Model” – a list of things you can do, in order, that will set you on the path for implementing DevOps in Data Science. The first thing you can do in your projects is to implement Infrastructure as Code (IaC), and the second thing to focus on is Continuous Integration (CI). However, in order to set up CI, you need to have as much automated testing as you can – and in the case of Data Science programs, that’s difficult to do. You can, however, mitigate this problem a great deal, and get your part of the solution as automated as possible.
The next step in the DevOps Maturity Model is Continuous Delivery (CD). There’s actually some discussion we need to cover here, since the definitions of DevOps and Continuous Delivery are quite similar, and to some, CD doesn’t belong “under” DevOps. Both DevOps and CD involve an agile mindset of releasing smaller, faster, and automated bits of code into the process rather than waiting for several changes to integrate at once. But DevOps is more a philosophy of teams working together to that end, and CD is a guided process involving all of the steps of design, coding, tracking, testing and release. CD is often more tool-aligned than DevOps is (or at least DevOps shouldn’t be tool oriented). If you look at a standard workflow in Visual Studio Team Services, you’re effectively looking at CD, but not necessarily DevOps.
Just to confuse things a bit further, some DevOps references define the “CD” acronym as Continuous Deployment – which is another implementation function. Continuous Deployment means automating the build so that changes flow automatically all the way out to the deployment of the end user’s software. Imagine a smartphone app that can take a picture of a plant and identify it. The Data Science function within this application is a trained model using custom vision APIs, and perhaps you make a change that improves the recognition score. Once tested, your change would not only be placed into the build, but pushed all the way out to the user automatically – perhaps within minutes of the test completing. That’s Continuous Deployment – the mechanisms that make that push possible.
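To make the plant-identification example a little more concrete, here is a minimal sketch of the kind of gate a Continuous Deployment pipeline might apply before pushing a retrained model out to users. Everything in it is hypothetical and chosen for illustration: the `Candidate` class, the `should_deploy` and `release` functions, the `min_gain` threshold, and the `deploy` callback are assumptions, not part of VSTS or any other real CD product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    """A retrained model waiting to be released (hypothetical structure)."""
    version: str
    recognition_score: float  # score on a held-out validation set
    tests_passed: bool        # result of the automated test stage


def should_deploy(candidate: Candidate, production_score: float,
                  min_gain: float = 0.0) -> bool:
    """Allow deployment only if tests passed and the score improved."""
    if not candidate.tests_passed:
        return False
    return candidate.recognition_score - production_score > min_gain


def release(candidate: Candidate, production_score: float,
            deploy: Callable[[str], None]) -> bool:
    """Push the candidate to users automatically when the gate allows it."""
    if should_deploy(candidate, production_score):
        deploy(candidate.version)  # assumed hook into the real rollout step
        return True
    return False


# Usage: a list stands in for the real rollout mechanism.
deployed = []
release(Candidate("v2", 0.93, True), production_score=0.90, deploy=deployed.append)
release(Candidate("v3", 0.95, False), production_score=0.93, deploy=deployed.append)
```

In this sketch only `v2` reaches the `deployed` list, because `v3` failed its tests even though its score was higher. A real pipeline would express the same decision as a stage in whatever CD tool your team uses rather than in a standalone script.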
So I’ve included Continuous Delivery as the third maturity level of DevOps, which I’m certain will annoy the purists on both sides. However, I think it belongs there because until your teams have a DevOps mindset, it will be harder to implement a true Continuous Delivery system effectively. And I think that starting with IaC and CI is essential to begin the CD journey.
So with those explanations in mind, how does the Data Science team fit into CD? It’s here that we face another change in your day-to-day routine. You’ll need to learn, understand and use whatever CD system your company uses. Here at Microsoft we use Visual Studio Team Services (VSTS), which includes CD and the ability to implement DevOps. And yes, some of the Data Scientists have had to go back to school on it. Learning these systems – the “plumbing” – isn’t often desirable to a bona fide Data Scientist, but it’s essential to being part of a team, and to having a DevOps mindset. Underneath VSTS we use Git and GitHub, which have other implications. Most Data Scientists I’ve worked with do understand Git commands, so there’s less pushback there.
See you in the next installment of the DevOps for Data Science series, where I’ll cover the next level in your DevOps Maturity Model for Data Science teams.
- Need a quick introduction to DevOps? Check out this series: https://channel9.msdn.com/Series/DevOps-Fundamentals
- Here’s a complete, full course on DevOps on the Microsoft Virtual Academy - https://mva.microsoft.com/en-us/training-courses/devops-with-visual-studio-team-services-and-team-foundation-server-16779#!