The real big data
The client has been collecting data for 30+ years. The data volume had outgrown what standard SQL storage could handle (even a specialized, clustered Teradata setup). The business was afraid that moving the data to a Hadoop cluster would worsen the performance of daily queries.
Integration of data
This project consists of 3 teams responsible for 3 domains: Store, Product, and Fulfilment. We integrate large volumes of data from many sources (such as RDBMS, Kafka, REST APIs, and files) in each profile to later build analytics on them.
Data is saved on the Hadoop cluster in a proper structure. Each data flow is scheduled using Oozie. There are also flows that publish pre-aggregated data from the cluster to external clients. Teams have their own CI pipelines (almost CD) on Jenkins, which allow changes to be built and deployed quickly.
Monitoring and alerting
We established a common logging model that allows for optimal transfer of logs from the cluster to Splunk. Dashboards defined in Splunk let us monitor the system and surface various types of warnings for the team and clients.
VL’s team also improved the process of metadata gathering and the metadata coverage. Previously, the client had very large amounts of data from various systems, but most of it did not have metadata. We integrated with a third-party system to download existing metadata, then defined a set of metrics to check to what extent the metadata of various objects from the system met certain criteria. Thanks to that, the people responsible for data quality were able to fill in the missing metadata and better understand the data they are using or the data they need.
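The idea of a metadata coverage metric can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the `Table` and `Column` classes, the three criteria (table description, table owner, column description), and the `coverage` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    description: str = ""  # hypothetical criterion: column has a description


@dataclass
class Table:
    name: str
    description: str = ""  # hypothetical criterion: table has a description
    owner: str = ""        # hypothetical criterion: table has an owner
    columns: list = field(default_factory=list)


def coverage(tables):
    """Return, per criterion, the share of objects that satisfy it (0.0 to 1.0)."""
    all_columns = [c for t in tables for c in t.columns]

    def ratio(hits, total):
        # Avoid division by zero when there are no objects to check.
        return hits / total if total else 0.0

    return {
        "table_description": ratio(
            sum(1 for t in tables if t.description.strip()), len(tables)
        ),
        "table_owner": ratio(
            sum(1 for t in tables if t.owner.strip()), len(tables)
        ),
        "column_description": ratio(
            sum(1 for c in all_columns if c.description.strip()), len(all_columns)
        ),
    }


# Made-up example catalogue: one documented table, one undocumented table.
catalogue = [
    Table("sales", "Daily sales", "store-team",
          [Column("id", "Primary key"), Column("amount")]),
    Table("tmp_load", columns=[Column("x"), Column("y")]),
]
report = coverage(catalogue)
```

A report like this gives the people responsible for data quality a per-criterion number to track while they fill in the gaps.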
The analytical view
By using data from various sources, we are able to track product lifecycles in multiple dimensions and build the whole timeline of a given product: from recipe specification, through agreements with suppliers, packaging, choosing a range, setting a price, and promotions, to quality checks and decommissioning.
Such an analytical view is used for various kinds of reporting, such as checking whether the products being sold are getting healthier, or helping to choose the cheapest supplier for a given product.
VL’s teams have built pipelines that perform spatial and graph-based analytics to help optimize deliveries and delivery van schedules. The objective was to compute statistics for both road networks and delivery journeys, in order to shorten the time wasted on expensive routes and to make better use of van capacity through better grouping of customers per journey.
Our work upgraded a semi-manual process, which previously took several days per delivery center and used only a subset of the data, to a job that takes a few hours while processing all the historical data available for entire delivery centers. Tools used include Kafka and HBase for data ingestion and long-term persistence, GeoSpark for distributed spatial computations, and the JGraphT library for high-performance, in-memory graph analysis of road networks.
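To give a feel for the kind of statistic the road-network analysis can produce, here is a small sketch in plain Python. It stands in for the JGraphT-based analysis and is not the project's code: the toy graph, the travel times, and the function names `shortest_times` and `journey_overhead` are assumptions made for illustration. It compares the time actually driven on a journey with the shortest possible time between consecutive stops.

```python
import heapq


def shortest_times(graph, source):
    """Dijkstra's algorithm: shortest travel time in minutes from source to each reachable node."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, minutes in graph.get(node, {}).items():
            candidate = d + minutes
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


def journey_overhead(graph, stops, driven_minutes):
    """Minutes driven beyond the sum of shortest legs between consecutive stops.

    Raises KeyError if a stop cannot be reached from the previous one.
    """
    best = 0.0
    for a, b in zip(stops, stops[1:]):
        best += shortest_times(graph, a)[b]
    return driven_minutes - best


# Hypothetical road network: node -> {neighbour: travel time in minutes}.
roads = {
    "depot": {"a": 5, "b": 12},
    "a": {"depot": 5, "b": 4},
    "b": {"depot": 12, "a": 4},
}
from_depot = shortest_times(roads, "depot")
overhead = journey_overhead(roads, ["depot", "b", "depot"], driven_minutes=30)
```

In a real pipeline the graph would come from the road-network data and the journeys from historical delivery records; the sketch only shows the shape of the computation.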
Data structure optimization to ensure the performance of data serving.
200+ dataflows with dedicated monitoring and quality assurance.
Metadata metrics covering more than 200 000 tables with over 4 million columns across the data lake.
Reduced analytical customer view building time (over 60 TB of data) from 24 h to 1.5 h with similar resources.
Geographical and spatial analytics.
Lower processing time of spatial and graph data, from months to hours.
Upgrade of the delivery process from one that took several days per delivery center to a job taking a few hours for all of them.
6+ types of generic pipelines for the most common source types (Kafka, JDBC, CSV, REST).
Shorter time of generating customer health data, from hours to minutes, so the customer may access the most recent statistics in near real-time.
Single source of truth for all data across the company.
Decreased number of in-store incidents and fraud cases, achieved by identifying suspicious events and alerting the in-store team.
Support for the coordination of supply and transport during the difficult times of COVID-19, through streamlined data access and improved observability of data pipelines.