Handling billions of rows in Spark

Client: NDA

Industry: Retail

Our client, a retailer operating worldwide, relies on billions of rows of data for crucial sales decisions. The data set contains product prices in stores around the world. To maintain and implement their sales strategy, our client wanted to track and analyze price changes over time. However, every time the data set was recalculated, it was unavailable in the analytical platform for several hours, delaying significant processes. VirtusLab’s extensive knowledge of Spark and big data enabled us to implement a solution quickly, so our client could work more efficiently, stay adaptable, and always have up-to-date data.

The challenge

Our client’s data analysis was hindered by the inability to isolate price changes. To work around this, they recalculated the entire dataset each time, a price table containing billions of rows. While the recalculation ran, the data was unavailable to other users, some of whom waited up to 3 hours. As a result, employees from different departments had difficulty performing their work efficiently, which affected our client's overall productivity. Because our client operates globally and its employees work across many time zones, rescheduling the data transformation to a quieter time was not a viable solution. A new approach to data processing was needed to maintain a competitive edge. This was when our client reached out to VirtusLab.

The solution

Working within our client's tight schedule, VirtusLab (VL) proposed an interim solution that improves their existing setup, drawing on our extensive knowledge and experience in Spark and big data. Instead of relying on Spark's default method of overwriting the entire table in place, VL manipulates the table's files directly. The recalculated data is saved in the background, and the finished files are then moved quickly within the dedicated data storage file system. As a result, we were able to:

Save recalculated data files separately from the table
Replace the original table files with the moved data files
Repair the table's metadata to ensure data quality and completeness
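
The three steps above can be illustrated with a minimal sketch. It is not the client's implementation, which is not published here. It assumes a partitioned table stored as `key=value` directories and uses the local file system as a stand-in for the dedicated storage file system. All function names and paths are hypothetical. On a real cluster, the staging write would likely be a Spark write such as `df.write.parquet(staging_path)`, and the metadata step, for a Hive-style partitioned table, could be `MSCK REPAIR TABLE`.

```python
import os
import shutil
import uuid
from typing import List


def write_partition(base_dir: str, partition: str, rows: List[str]) -> None:
    """Step 1 (stand-in): save recalculated data outside the live table."""
    part_dir = os.path.join(base_dir, partition)
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part-00000.csv"), "w") as f:
        f.write("\n".join(rows))


def swap_table_directory(table_dir: str, staging_dir: str) -> None:
    """Step 2: replace the table's files with the staged files.

    The table is missing only between the two renames, not for the
    whole duration of the recalculation.
    """
    backup_dir = f"{table_dir}.old-{uuid.uuid4().hex}"
    had_old_data = os.path.exists(table_dir)
    if had_old_data:
        os.rename(table_dir, backup_dir)      # move old files out of the way
    try:
        os.rename(staging_dir, table_dir)     # publish the new files
    except OSError:
        if had_old_data:
            os.rename(backup_dir, table_dir)  # roll back if publishing fails
        raise
    if had_old_data:
        shutil.rmtree(backup_dir)             # drop the old files


def discover_partitions(table_dir: str) -> List[str]:
    """Step 3 (stand-in): list the partitions a metadata repair would register."""
    return sorted(
        name
        for name in os.listdir(table_dir)
        if "=" in name and os.path.isdir(os.path.join(table_dir, name))
    )
```

The design choice in this sketch is that the expensive work (writing the recalculated files) happens while the old table is still readable, and only the cheap renames touch the live path. Both directories are assumed to live on the same file system, so each rename is a metadata operation and not a copy.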

Our solution benefits our client by significantly reducing the time during which the data is unavailable, allowing their employees to work more efficiently. Moreover, it follows current industry practices, positioning our client as a competitive and forward-thinking organization.

The results

Our consultancy services enabled our global retail client to optimize their data processing and achieve measurable improvements in efficiency and productivity, including:

Shorter waiting time, reduced from several hours to a matter of seconds.
Increased availability of the table, allowing users to perform their daily work more efficiently.
Adoption of state-of-the-art data processing methods, resulting in proper data handling and management.
A universal and easily adaptable solution that our client can apply to future processing tasks.
