The client’s Data Science department wished to introduce a fully automated end-to-end solution to deliver recommendations to their webpage. Automation was to be achieved in a number of areas.
The existing solution relied on semi-automatic processes that slowed the delivery of the Data Science department's new solutions to their end client.
To improve on this, we approached the problem with a decoupled-components methodology. Small code pipelines, each representing a single data transformation, feature, or model, were easy to understand, test, and extend. We could use libraries such as PySpark for Big Data processing and TensorFlow for machine learning only where applicable.
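The idea of small, decoupled pipeline steps can be sketched as follows. This is a minimal illustration in plain Python rather than the client's actual code: the step names (`drop_missing_user`, `add_click_rate`), the record fields, and the `run_pipeline` helper are all hypothetical, and a real step might wrap PySpark or TensorFlow instead of list operations.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[List[Record]], List[Record]]


def drop_missing_user(records: List[Record]) -> List[Record]:
    """Single transformation: keep only records that carry a user id."""
    return [r for r in records if r.get("user_id") is not None]


def add_click_rate(records: List[Record]) -> List[Record]:
    """Single feature: clicks divided by views, 0.0 when there are no views."""
    enriched = []
    for r in records:
        views = r.get("views", 0)
        rate = r.get("clicks", 0) / views if views else 0.0
        enriched.append({**r, "click_rate": rate})
    return enriched


def run_pipeline(steps: List[Step], records: List[Record]) -> List[Record]:
    """Apply each step in order; every step can be tested in isolation."""
    for step in steps:
        records = step(records)
    return records


raw = [
    {"user_id": 1, "clicks": 2, "views": 4},
    {"user_id": None, "clicks": 1, "views": 1},
    {"user_id": 2, "clicks": 0, "views": 0},
]
features = run_pipeline([drop_missing_user, add_click_rate], raw)
```

Because each step only takes and returns records, a new transformation can be added or replaced without touching the others.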
To maintain control over the growing number of pipelines, we proposed a composable configuration. It lets a common configuration be shared between a number of environments, which makes scaling and productionization easier.
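One way to picture a composable configuration is a shared base layer with small per-environment overrides merged on top. The sketch below assumes dictionary-based configuration; the keys, values, and the `merge_configs` and `config_for` helpers are hypothetical and only illustrate the layering idea, not the format used in the project.

```python
from copy import deepcopy
from typing import Any, Dict, Mapping


def merge_configs(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Recursively merge `override` into a copy of `base` without mutating either."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def config_for(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Compose layers left to right; later layers win on conflicts."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = merge_configs(result, layer)
    return result


# Hypothetical shared settings and a production-only override.
BASE = {
    "spark": {"executors": 2, "memory": "4g"},
    "model": {"name": "recommender", "epochs": 5},
}
PRODUCTION = {"spark": {"executors": 20}}

production_config = config_for(BASE, PRODUCTION)
```

Each environment then states only what differs from the base, so a new pipeline or environment does not require copying the full configuration.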
We built a number of common building blocks that extract complex logic from the Data Science code and ensure consistent behavior across the decoupled modules. A prominent example is a data validation mechanism that no longer clutters the business logic.
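A validation building block of this kind could, for instance, take the form of a decorator that checks the input before the business logic runs. The following is a sketch under that assumption; `requires_columns`, `ValidationError`, and `count_items_per_user` are hypothetical names, and the actual mechanism used in the project may look quite different.

```python
from functools import wraps
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class ValidationError(ValueError):
    """Raised when input records do not satisfy a step's requirements."""


def requires_columns(*columns: str) -> Callable:
    """Decorator: verify every record has the given columns before the step runs."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(records: List[Record], *args: Any, **kwargs: Any) -> Any:
            for index, record in enumerate(records):
                missing = [c for c in columns if c not in record]
                if missing:
                    raise ValidationError(
                        f"record {index} is missing columns: {missing}"
                    )
            return func(records, *args, **kwargs)
        return wrapper
    return decorator


@requires_columns("user_id", "item_id")
def count_items_per_user(records: List[Record]) -> Dict[Any, int]:
    """Business logic only: no validation code inside the function body."""
    counts: Dict[Any, int] = {}
    for record in records:
        counts[record["user_id"]] = counts.get(record["user_id"], 0) + 1
    return counts
```

The function body stays focused on the computation, while the requirement on its input is declared once at the top and enforced the same way in every module that uses the decorator.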
To achieve the best results, we cooperated closely with the client’s Data Science team to build a solution that integrates well with their ecosystem. We continuously support the adoption of best practices and build the solution with engineering quality in mind, relying on unit and acceptance tests, static type checking, linting, code reviews, and continuous integration.
Leveraging the capabilities of the cloud for Machine Learning is yet another step toward developing and delivering complex models faster. Such solutions, however, come with complexity that must be tamed with a well-built architecture, a set of guidelines, and a common understanding between engineers and data scientists.