Machine learning is the newest tool in the developer’s tool belt, but most developers working in the field today don’t know how to deploy and scale machine learning models.
In this article I will describe the two ways I’ve used to deploy machine learning models and why one is better than the other.
But first, let’s talk about how to make your model in the first place.
Picking an Architecture
All machine learning projects start by identifying a problem, usually a prediction or categorization problem, and looking into what architecture the current machine learning research says you should use.
While I know a fair bit about machine learning architecture, I’m by no means an expert and never will be. If you’re reading this blog you, like me, are probably more interested in building products than in spending all of your company’s time and money researching new architectures. There is a reason machine learning researchers exist and publish papers for anyone to read: they do the work for you. Take advantage of it and pick the best-known architecture for the problem you’re trying to solve at the moment.
As an example, most machine learning practitioners use some form of CNN for image categorization and an RNN for time series and other sequential data. This might not end up being the best fit for your product, but it’s a good place to start.
Training Your Model
After you’ve picked your architecture you need to train your model on hundreds of thousands of pieces of data. Most machine learning projects in production use supervised learning techniques with training data that is clearly labeled. You should do the same for your first model.
Training can take a long time and a lot of processing power. Most training is done on a system with a GPU, and all of the most popular machine learning libraries support GPU acceleration. Get a GPU-accelerated machine on GCP or AWS and use it to train. Shut it down between training runs, because you don’t need all that raw GPU power once the model is ready. In fact, as long as the model fits in memory, you can deploy it on relatively weak servers.
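As a sketch of what GPU-accelerated training setup looks like in PyTorch (the `Linear` layer here is just a stand-in for a real architecture), the code can fall back to the CPU automatically when no GPU is present:

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 2)  # stand-in for your real architecture
model = model.to(device)

# Inputs must live on the same device as the model.
batch = torch.randn(32, 10).to(device)
output = model(batch)
```

The same pattern means your training script runs unchanged on a cheap CPU machine for smoke tests and on the rented GPU instance for real training.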
Saving a model after training outputs the model’s weights to a file or folder that can be loaded back in when you deploy. PyTorch uses pickle files, while TensorFlow uses a directory of files corresponding to the weights.
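A minimal PyTorch sketch of that save-and-reload cycle; again the `Linear` layer is a stand-in for your real, trained architecture, and `model_weights.pt` is just an example filename:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your trained model

# Save only the weights (the state dict), not the whole model object.
torch.save(model.state_dict(), "model_weights.pt")

# At deploy time, rebuild the same architecture and load the weights.
deployed = torch.nn.Linear(10, 2)
deployed.load_state_dict(torch.load("model_weights.pt"))
deployed.eval()  # switch off training-only behaviour like dropout
```

Saving the state dict rather than the whole model keeps the file portable: the deploy server only needs the architecture code and the weights file, not the training script.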
Deploy the Model
After the model is trained and saved you can deploy the model using a few techniques. First, the bad way…
The Bad Way
I’ve seen several developers integrate machine learning calls into their application’s ORM layer by having ORM models call an external CLI command, passing the relevant data to the command. This works but is bad for several reasons.
While a trained machine learning model doesn’t take as much processing power to run as it took to train, it still takes some. Most of the time you won’t be running your machine learning code, but because each CLI call has to load the model’s weights into memory, make the prediction, and then unload everything again, every call can take several seconds. That is more latency than most companies can afford.
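The anti-pattern looks roughly like this from the ORM side; `predict.py` and its `--input` flag are hypothetical names for illustration:

```python
import json
import subprocess

def predict_from_orm(record):
    """Anti-pattern: every call pays the full model-load cost.

    Spawning a fresh process means re-importing the ML libraries and
    reloading the weights from disk before predicting, then throwing
    all of that work away when the process exits.
    """
    result = subprocess.run(
        ["python", "predict.py", "--input", json.dumps(record)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Nothing here is cached or shared between calls, which is exactly why the per-request latency is so high.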
Additionally, when you package your machine learning code with your API / ORM layer, there is no easy way to scale it or cache its results. It’s like putting your database on the same machine as your API code and then trying to scale by syncing every instance of the database: more complicated, with no upside.
The Good Way
Instead, you should treat your machine learning code as a microservice running on its own cluster. Package the model with API code dedicated to serving the model’s results, and run it on a machine with the specs the model needs. If desired, you can easily add a caching layer to the API code itself, the same way you would add one to any other API.
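A minimal sketch of such a caching layer, using Python’s built-in `functools.lru_cache`; the `run_model` function is a toy, deterministic stand-in for real inference:

```python
from functools import lru_cache

def run_model(features):
    # Stand-in for real model inference; deterministic, so cacheable.
    return sum(features) / len(features)

@lru_cache(maxsize=4096)
def predict(features):
    # The cache key is the argument, so features must be hashable
    # (e.g. a tuple rather than a list).
    return run_model(features)

predict((1.0, 2.0, 3.0))  # computed
predict((1.0, 2.0, 3.0))  # served from the cache
```

An in-process cache like this only helps for repeated identical inputs; for anything shared across instances you would reach for an external cache such as Redis, just as you would for any other API.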
Since most machine learning code is written in Python, you can use a framework like Flask to serve requests to your machine learning code and keep everything in one language. You also get to reuse the same preprocessing functions in production that you used during training.
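A minimal Flask sketch of such a service; `model_predict` is a trivial stand-in for real inference so the example is self-contained, and the route name and JSON shape are just one reasonable choice:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In a real service you would load your trained weights here, once,
# at startup, rather than on every request.
def model_predict(features):
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # This is the place to reuse the preprocessing from training.
    features = payload["features"]
    return jsonify({"prediction": model_predict(features)})

# To serve it: app.run(host="0.0.0.0", port=5000)
```

Loading the model once at startup is the whole point: every request after that pays only the inference cost, not the multi-second weight-loading cost of the CLI approach.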
Since your main application’s API is probably written in PHP, Node.js, Java, Ruby, or any number of other languages, this makes it easy for the application layers to talk to each other without having to be written in the same language. They just have to understand HTTP requests.
If you like Docker or Kubernetes you can also package the machine learning API layer as a Docker image and deploy it the same way as all the other API code you have.
Machine learning in general can be complicated, but deploying machine learning code doesn’t have to be any more complicated than deploying regular API code.