Machine Learning with Kubernetes



Machine learning is continuously evolving as information asymmetry lessens and complex models and algorithms become easier to implement and use. Python libraries like scikit-learn require only a few lines of code (excluding pre-processing) to fit and make predictions using complex, high-level ensemble learning techniques like Random Forest. The question then arises: what gives you the edge?

There are numerous guides and resources available online for writing these few lines of code and predicting accurately. The challenge then comes down to using ML efficiently and dynamically. In a real use case, we know that we will not be using just one model to solve a problem; we use hundreds of combinations of models and optimizers to obtain the best results possible.
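To get a feel for how quickly those combinations multiply, here is a small sketch using scikit-learn's GridSearchCV on synthetic data; all of the parameter values below are illustrative, not a recommendation:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# toy regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, random_state=42)

# 2 x 2 x 2 = 8 parameter combinations, each fit with 3-fold CV -> 24 model fits
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={
        "n_estimators": [10, 50],
        "max_depth": [None, 5],
        "min_samples_leaf": [1, 2],
    },
    cv=3,
)
grid.fit(X, y)
print(len(grid.cv_results_["params"]))  # 8 combinations searched
```

Even this tiny grid trains 24 models; real searches easily reach hundreds, which is exactly when orchestrating the work across containers starts to pay off.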

Hence the rise of ML operations (MLOps), whose purpose is to facilitate complex machine learning pipelines and processes. MLOps is a very new field, with rules and processes being composed every day; I will link a couple of articles below where you can explore more about it. In this article, I will take a small example of a machine learning workflow and show how we can deploy it efficiently on a Kubernetes cluster. I will be using the following:

  • Kubernetes cluster (minikube will also work)

  • Argo

  • AWS S3

  • Docker

Please note these installations and setups are prerequisites for this article.

As mentioned before, a complex machine learning workflow has several stages. To be time-efficient, we will run each of those stages in a separate container. For this small example, we will do data preprocessing in the first container and then train two different models in the second and third containers.

1. Python code

The code is divided into three scripts: the first handles data preprocessing, and the second and third handle model training.

Note: the data needs to be hosted somewhere so the containers can access it.
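In a real setup the data would live somewhere like S3; for quick local testing you can serve a directory over HTTP with Python's standard library, which is what the `http://localhost:8000/...` URL in the script below assumes. Keep in mind that `localhost` inside a cluster pod will not resolve to your machine, so this is for local runs only:

```python
# Serve the current directory (expected to contain data/sales.csv)
# on http://localhost:8000 in a background thread.
import http.server
import socketserver
import threading

httpd = socketserver.TCPServer(("", 8000), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=httpd.serve_forever, daemon=True).start()
print("Serving current directory at http://localhost:8000")
```

The equivalent one-liner from a shell is `python3 -m http.server 8000`.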

File 1: Pre-Processing script

import pandas as pd
from sklearn.model_selection import train_test_split

# read the hosted data (I have it served locally here)
df = pd.read_csv('http://localhost:8000/data/sales.csv')

# for this small example, dropping a column stands in for data preprocessing
df.drop('size', axis=1, inplace=True)

x = df.drop('sales', axis=1)
y = df['sales']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

x_train.to_csv('x_train.csv', index=False)
x_test.to_csv('x_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

File 2: Random Forest regression

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# reading the preproc data
x_train = pd.read_csv('x_train.csv')
x_test = pd.read_csv('x_test.csv')  
y_train = pd.read_csv('y_train.csv') 
y_test = pd.read_csv('y_test.csv')

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)

# Train the model on training data
rf.fit(x_train, y_train)

# Use the forest's predict method on the test data
predictions = rf.predict(x_test)

# Calculate the MSE
mse = mean_squared_error(y_test, predictions)

# Print out the mean squared error (MSE)
print('Mean Squared Error:', mse)

File 3: Lasso Regression

from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
import pandas as pd

# reading the preproc data
x_train = pd.read_csv('x_train.csv')
x_test = pd.read_csv('x_test.csv')  
y_train = pd.read_csv('y_train.csv') 
y_test = pd.read_csv('y_test.csv')

# initialising and fitting the model
model = LassoCV()
model.fit(x_train, y_train)

# Use the model's predict method on the test data
predictions = model.predict(x_test)

# Calculate MSE
mse = mean_squared_error(y_test, predictions)

# Print out the mean squared error (MSE)
print('Mean Squared Error:', mse)
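Before baking these scripts into an image, it can be worth sanity-checking the same split-train-score flow in a single process on synthetic data. Everything below is illustrative; the column names and coefficients are made up to mimic a sales dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic "sales" data with a known linear relationship
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "price": rng.uniform(1, 10, 300),
    "ads": rng.uniform(0, 5, 300),
})
df["sales"] = 3 * df["ads"] - 0.5 * df["price"] + rng.normal(0, 0.1, 300)

# same flow as the three scripts, without the CSV round-trip
x = df.drop("sales", axis=1)
y = df["sales"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

results = {}
for name, model in [("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
                    ("lasso", LassoCV())]:
    model.fit(x_train, y_train)
    results[name] = mean_squared_error(y_test, model.predict(x_test))
    print(name, round(results[name], 3))
```

If the numbers look wildly off here, there is no point debugging the same problem later inside a container.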

2. Creating the images

Once we have the code ready, we can build the image and push it to Docker Hub (or any other image registry). A simple Dockerfile will do:

FROM python:3.6

RUN mkdir codes

COPY . codes/

RUN pip3 install -r codes/requirements.txt
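The Dockerfile above installs from a requirements.txt that lives next to the scripts. For these three scripts, a minimal one could look like this (unpinned here for brevity; in practice you would pin the versions you tested against):

```text
pandas
scikit-learn
```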

Creating and pushing the image

docker image build -t manikmal/ml_pipline .
docker image push manikmal/ml_pipline

3. Defining the ML pipeline using Argo

You can find a quick guide to installing and setting up Argo on your Kubernetes cluster in the Argo documentation. Also, make sure to set up an artifact repository; I use AWS S3, but many other options are available.

Argo provides a robust workflow engine which enables us to implement each step in a workflow as a container on Kubernetes. Argo uses YAML files to define and write the pipelines. There is, however, a more popular alternative to Argo called Kubeflow. Kubeflow uses Argo's engine underneath and provides a Python API to define the pipelines. I will be using Argo in this post, but may do a short tutorial on Kubeflow in the future.

Let’s start building our pipeline; I will call it pipeline.yml.

# Our pipeline
# We will make a DAG. That will allow us to do pre proc first 
# and then train models in parallel.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: ml-pipeline
  templates:
 
  # defining the pipeline flow
  - name: ml-pipeline
    dag:
      tasks:
      - name: preprocessing
        template: preproc
      - name: training-rf
        dependencies: [preprocessing]
        template: randomforrest
        arguments:
          artifacts:
          - name: x_train
            from: "{{tasks.preprocessing.outputs.artifacts.x_train}}"
          - name: x_test
            from: "{{tasks.preprocessing.outputs.artifacts.x_test}}"
          - name: y_train
            from: "{{tasks.preprocessing.outputs.artifacts.y_train}}"
          - name: y_test
            from: "{{tasks.preprocessing.outputs.artifacts.y_test}}"
      - name: training-lasso
        dependencies: [preprocessing]
        template: lasso
        arguments:
          artifacts:
          - name: x_train
            from: "{{tasks.preprocessing.outputs.artifacts.x_train}}"
          - name: x_test
            from: "{{tasks.preprocessing.outputs.artifacts.x_test}}"
          - name: y_train
            from: "{{tasks.preprocessing.outputs.artifacts.y_train}}"
          - name: y_test
            from: "{{tasks.preprocessing.outputs.artifacts.y_test}}"

  # defining the individual steps of our pipeline
  - name: preproc
    container:
      image: docker.io/manikmal/ml_pipline
      command: [sh, -c]
      args: ["python3 codes/preproc.py"]
    outputs:
      artifacts:
      - name: x_train
        path: x_train.csv
      - name: x_test
        path: x_test.csv
      - name: y_train
        path: y_train.csv
      - name: y_test
        path: y_test.csv

  - name: randomforrest
    inputs:
      artifacts:
      - name: x_train
        path: x_train.csv
      - name: x_test
        path: x_test.csv
      - name: y_train
        path: y_train.csv
      - name: y_test
        path: y_test.csv
    container:
      image: docker.io/manikmal/ml_pipline
      command: [sh, -c]
      args: ["python3 codes/rf.py"]

  - name: lasso
    inputs:
      artifacts:
      - name: x_train
        path: x_train.csv
      - name: x_test
        path: x_test.csv
      - name: y_train
        path: y_train.csv
      - name: y_test
        path: y_test.csv
    container:
      image: docker.io/manikmal/ml_pipline
      command: [sh, -c]
      args: ["python3 codes/lasso.py"]

That’s a big file! This is one of the reasons why data scientists prefer using Kubeflow to define and run ML pipelines on Kubernetes. I will leave a link to my repo for the code, so don’t worry about copying it from here. In a nutshell, the YAML file above declares our workflow by defining the order of the tasks and then defining each task separately.

4. Deploying our ML pipeline on Kubernetes cluster

Argo comes with a handy CLI for submitting our workflows and getting the results easily. On the master node, submit the workflow using


argo submit pipeline.yml --watch

The --watch flag shows the pipeline in action. Once it finishes, we can just use

argo logs <name of the workflow>

And you will get your results!

We were able to successfully deploy a machine learning workflow and gain a good amount of time and resource efficiency by training the different models in parallel. These pipelines are capable of running many complex workflows, with this being a very basic example of their capabilities. Hence the growing need for MLOps practices to give you the edge.


Source: Medium.com