Scaling up Python to parallel ML pipelines

We used cutting edge tools existing in Python, and also transferred best programming practices & software development processes known from other programming language ecosystems.

scaling-min
Industry
Retail
Technology
ML/Python
The problem – arduous work on financial forecasts

The client had a data science-based application in Python for computing financial forecasts. The existing solution required many manual steps during the deployment of a forecast for only one customer store out of thousands. The client’s team of data scientists needed to scale up their machine learning models written in Python to support multiple contexts,  quickly productionize new models and be able to maintain the codebase on their own. Since Python’s ecosystem is less focused on production-grade solutions, creating such a system in the approach pre-taken by the client would quickly lead to a situation in which maintenance is expensive and development is very slow or even impossible.

 

image for article: Scaling up Python to parallel ML pipelines

 

Our solution

The development of new forecasts and expansion of existing ones have been an ongoing process, so we had to create an application architecture that would help in building new components based on already implemented functionalities, while maintaining the stability of already productionized forecasts. To achieve that, we used cutting edge tools existing in Python, and also transferred best programming practices & software development processes known from other programming language ecosystems, which are usually used in production systems (like Java, C#, Scala).

We created a framework using the Apache Spark’s Python API (pySpark) and Hadoop, which allows the rapid implementation and productionization of ML models of finance forecasts (1), ensuring their correctness at a level unknown in this environment, thanks to static type checking, workflow generation, dependency injection, and applying them to new problems (2). In addition, we have speeded up the calculations of the models themselves (3). The use of state-of-the-art approaches to the software production process has multiplied the number of pipelines that can be maintained and developed in production (4).

 

image for article: Scaling up Python to parallel ML pipelines

The results
1.

The rapid development of new ML model by Data Scientists – from weeks to days (the model with tests in a day and productionization in a matter of hours)

2.

Possibility to add new forecasting pipelines in about 1-2 weeks (compared to months)

3.

Scaling up the processing of ML models 140 times (radical speedup from 24h to 10 minutes in the initial version)

4.

Possibility to maintain and safely improve 70+ production pipelines written in Python

Using the skills of own’s data science team the customer is now able to create more forecasting models for new business contexts in less time.
Mikołaj
Software Engineer & Partner
What the future holds

Our main plan is to increase the scale of forecasts by an order of magnitude. We also want to use this battle-tested solution in the client’s other projects written in Python.

 

image for article: Scaling up Python to parallel ML pipelines

What can we do for you?
  • Our solution can be adapted to any data science project using Python and Hadoop, regardless of the amount of data
  • We have knowledge & tools to develop very large Python-based production projects
We will improve your project
Drop us a line!

"*" indicates required fields

If you click the “Send” button you agree to the privacy policy. Your personal data given in the contact form above will be processed for purposes of answering your inquiry and for any further correspondence regarding this inquiry. The controller of your personal data is VirtusLab Sp. z o.o. For more information, see our Privacy Policy