Continuous Integration for ML Projects
Published: 2017-10-31
Over the last year we have deployed quite a few services that contain Machine Learning components. This post shares what we learnt in the process and what helped us minimise risk and time to production.
What does our usual development cycle look like?
- Use some version of Gitflow to manage branches
- Containerize application using Docker
- Deploy application to Kubernetes cluster
It’s important to mention that Docker is not only our packaging format of choice but also our development environment. This helps us minimise the risk of “works on my machine” situations and simplifies dependency management (especially useful for scientific libraries in Python).
Jenkins is our continuous integration system of choice and we use it for building and deploying applications. All of our repos contain a Jenkinsfile which defines the pipeline that Jenkins will execute. Most common steps of the pipeline are:
- Build a Docker image
- Run our unit/integration tests (within an instance of that Docker image)
- Run acceptance tests (end-to-end, may require some orchestration)
- Deploy to staging and production
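The steps above can be outlined in plain Python as a rough sketch; the image name, test commands and deploy script are hypothetical stand-ins, not our actual Jenkinsfile:

```python
# Sketch of the CI stages a Jenkinsfile would encode, expressed as shell
# commands. Image name, test commands and deploy target are hypothetical.

IMAGE = "registry.example.com/my-service"


def pipeline_stages(git_sha: str) -> list:
    """Return the commands for each stage, in execution order."""
    tag = f"{IMAGE}:{git_sha}"
    return [
        # 1. Build a Docker image tagged with the commit SHA
        ["docker", "build", "-t", tag, "."],
        # 2. Unit/integration tests run inside an instance of that image
        ["docker", "run", "--rm", tag, "pytest", "tests/unit"],
        # 3. End-to-end acceptance tests (may need extra orchestration)
        ["docker", "run", "--rm", tag, "pytest", "tests/acceptance"],
        # 4. Deploy the same image to staging, then production
        ["./deploy.sh", "staging", tag],
        ["./deploy.sh", "production", tag],
    ]


stages = pipeline_stages("abc1234")
```

Tagging the image with the commit SHA means the exact artifact that passed the tests is the one that gets deployed.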
Every time a developer pushes their code to the remote git server, Jenkins reads the pipeline file and follows the appropriate steps. Defining this pipeline per repository has proven to give us a lot of flexibility to tailor the steps for each service.
Here is a visualization of these processes:
Enter Machine Learning in the service
Adding machine learning components to new or existing services means that now we need to resolve a few things:
- How do we associate the code with the (usually) large files needed for the models?
- How can we increase confidence in changes to models or inference-related code?
- Where does training fit into our development lifecycle?
To answer the first question, one of our engineers has written about the approach we took. To summarise: our models live in S3 and are linked to the code using a dependency file, which is easily versioned in the Git repo.
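The post doesn't show the dependency file itself; as a purely illustrative sketch, assuming a simple JSON format mapping model names to S3 keys (the bucket, keys and checksums below are made up), resolving it might look like:

```python
import json

# Hypothetical dependency file, committed alongside the code. The real
# format, bucket and keys are assumptions for illustration only.
MODELS_JSON = """
{
  "bucket": "example-ml-models",
  "models": {
    "intent-classifier": {"key": "intent/v12/model.pkl", "sha256": "9f2c..."},
    "ner-tagger":        {"key": "ner/v4/model.pkl",     "sha256": "41ab..."}
  }
}
"""


def model_uris(dependency_file: str) -> dict:
    """Map each model name to the S3 URI the build step should download."""
    spec = json.loads(dependency_file)
    bucket = spec["bucket"]
    return {name: f"s3://{bucket}/{m['key']}"
            for name, m in spec["models"].items()}


uris = model_uris(MODELS_JSON)
```

Because the file is small and plain text, git history tracks which model version shipped with which code revision, even though the binaries themselves stay in S3.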
Testing models before deploying
As we do with all other software, before we release changes to ML models we want a certain degree of confidence that our changes have not negatively impacted how our system behaves (at least not in unexpected ways). But how can we be confident that our models perform as we expect?
The solution we found is to introduce a new type of test in our test suite: accuracy tests.
An accuracy test exercises the inference code on a sample test dataset for a model and verifies that the output metric is above an expected threshold.
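As a minimal sketch, an accuracy test is just an ordinary test with a metric threshold. The `predict` function, the tiny dataset and the 0.9 threshold below are hypothetical stand-ins for real inference code and a real held-out sample:

```python
# Minimal accuracy test sketch: run inference over a held-out sample and
# assert the metric stays above a threshold. `predict` stands in for real
# inference code; the dataset and threshold are illustrative.

TEST_SET = [
    ("the service is down", "outage"),
    ("reset my password", "account"),
    ("cannot log in", "account"),
    ("site returns 500", "outage"),
]


def predict(text: str) -> str:
    # Stand-in for real inference using the model from the dependency file.
    return "account" if "password" in text or "log in" in text else "outage"


def test_accuracy_above_threshold():
    correct = sum(predict(x) == y for x, y in TEST_SET)
    accuracy = correct / len(TEST_SET)
    assert accuracy >= 0.9, f"accuracy regressed: {accuracy:.2f}"
```

Because it is just another test, the CI system fails the build on a metric regression exactly as it would on a broken unit test.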
If you remember our CI lifecycle above, we made the following changes:
- Allow docker image building step to resolve the model dependencies
- Run unit/integration tests (fast to fail)
- Run acceptance test (usually slower than the previous set)
- Download the test dataset (currently using S3 to store this information)
- Trigger the accuracy tests (speed can vary greatly depending on hardware, sample size, etc.)
With this setup, we have a high degree of confidence when making changes in our services that the performance of our models has been unaffected, enabling us to move a lot faster.
As this runs inside a container, we can also run these tests locally.
In some scenarios, we have had to make different compromises:
- For services using small and fast inference models, we can use small (but still statistically significant) test datasets which — with some parallelising — can run on every build and finish in seconds/minutes
- For other services using models with higher resource needs/slower inference time, we chose to run them on a scheduled basis on integration branches. This balances time for feedback with certainty that we’ll still catch regressions before they reach production
At this point we have a system that allows us to add new models, make changes with confidence and deploy to production in a streamlined way. The pipeline looks similar to before:
Fitting training on this flow
From the perspective of the CI pipeline, adding new models simply amounts to modifying the code and the dependency files we mentioned, and we are good to go.
We are also interested in systematically training new models, lowering the knowledge barrier to do so and ultimately automating the process. So far, what we have found helps is:
- Move all code required for training into the same git repository
- Use a dedicated docker image for training
- Structure your training steps with libraries like Luigi or Airflow; this makes it a lot simpler to refactor later on, and brings other goodies like the ability to resume from a failed step.
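Luigi and Airflow implement this properly; stripped to the core idea, resuming from a failed step comes down to checkpointing each step's output. A toy illustration (the step names and checkpoint scheme are made up, not Luigi's actual API):

```python
import os
import tempfile

# Toy illustration of step checkpointing, the property Luigi/Airflow give
# you for free: a step whose output already exists is skipped, so a failed
# run can resume where it stopped. Steps and markers are illustrative.


def run_step(name, fn, workdir, log):
    marker = os.path.join(workdir, f"{name}.done")
    if os.path.exists(marker):   # completed in a previous run: skip
        return
    fn()
    open(marker, "w").close()    # checkpoint only after success
    log.append(name)


with tempfile.TemporaryDirectory() as wd:
    log = []
    for step in ("fetch_data", "preprocess", "train", "evaluate"):
        run_step(step, lambda: None, wd, log)
    first_run = list(log)
    # A second invocation skips every step: all checkpoints already exist.
    for step in ("fetch_data", "preprocess", "train", "evaluate"):
        run_step(step, lambda: None, wd, log)
    second_run = log[len(first_run):]
```

In a real pipeline the "marker" is the step's actual output (a dataset, a trained model file), so a crashed training run picks up from the last completed step instead of starting over.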
Moving code into the same repo means that we can use the accuracy tests as one of the steps in the pipeline, and share code (at least to a certain degree).
Docker was a natural decision based on our existing workflows, and provided us with the flexibility to:
- Run training locally, especially during the early stages of a project.
- Ensure the right dependencies are installed (images will vary if you need to use GPUs or CPUs between inference and training).
- Take advantage of services like AWS Batch, which can handle all the infrastructure management (including GPU nodes) with little to no effort.
We will soon write more on this specific topic, since there is a lot more to learn from continuously training models with little to no human intervention.
Handling ML is not too different from other components in your systems, but it requires solving some particular problems to get you going.
I believe it's extremely important to streamline this process right from the start; setting up CI that combines all these components will make it a lot easier to keep growing your solutions and improving them in production.