Handling billions of rows in Spark

4 minutes read

Our client, a globally operating retailer, relies on billions of data rows for crucial sales decisions. The data set includes product prices in various stores around the world. To maintain and implement their sales strategy, our client aimed to capture and analyze price changes over time. However, every recalculation left the data set unavailable in the analytical platform for several hours, delaying significant processes. VirtusLab’s extensive knowledge of Spark and big data enabled us to implement a solution quickly, so our client could work more efficiently, stay adaptable, and always be up to date.

The challenge

Our client’s data analysis was hindered by the inability to isolate price changes. To work around this, they had to recalculate the entire dataset, a price table containing billions of rows. Each recalculation delayed data availability for other users, with some waiting up to 3 hours. As a result, employees from different departments could not work efficiently, which hurt our client's overall productivity. Because our client's business is global and its employees work across many time zones, rescheduling the data transformation was not a viable solution. A new approach to data processing was needed to maintain a competitive edge, so our client reached out to VirtusLab.

The solution

Working within our client's tight schedule, VirtusLab (VL) proposed an interim solution that enhances their existing setup, drawing on our extensive knowledge of and experience with Spark and big data. VL enhanced Spark's default method of overwriting the entire table by manipulating the table's files directly. Our team combined several techniques to save data in the background and move files faster within the dedicated data storage file system. As a result, we were able to:

  1. Save recalculated data files separately from the table
  2. Replace the original table files with the moved data files
  3. Repair the table's metadata to ensure data quality and completeness
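
The three steps above can be illustrated with a minimal Scala sketch. It is not VirtusLab's actual implementation: it uses only the JDK's java.nio.file API, a local directory stands in for the HDFS table location, and every name in it (TableSwap, writeStaging, swapIn, the directory layout) is hypothetical. The point it tries to show is that the slow write happens outside the table directory, so readers are only affected during the short rename step.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper; a local directory stands in for the table location on HDFS.
object TableSwap {

  // Step 1: save the recalculated data files separately from the table.
  // The table directory is untouched while this (slow) write runs.
  def writeStaging(stagingDir: Path, files: Map[String, String]): Unit = {
    Files.createDirectories(stagingDir)
    files.foreach { case (name, content) =>
      Files.write(stagingDir.resolve(name), content.getBytes(StandardCharsets.UTF_8))
    }
  }

  // Step 2: replace the table files with the staged files using two renames.
  // Within one file system a rename does not copy data, so the table is
  // missing only for the moment between the two moves.
  def swapIn(tableDir: Path, stagingDir: Path, backupDir: Path): Unit = {
    deleteRecursively(backupDir) // drop the backup left by a previous run
    if (Files.exists(tableDir)) Files.move(tableDir, backupDir)
    Files.move(stagingDir, tableDir)
    ()
  }

  // Step 3 is not modelled here. Assumption: on Hive or Spark SQL the table
  // metadata could then be refreshed, for example with MSCK REPAIR TABLE.

  // Deletes a directory tree if it exists (children first, then the directory).
  def deleteRecursively(dir: Path): Unit = {
    val f = dir.toFile
    if (f.isDirectory) f.listFiles().foreach(child => deleteRecursively(child.toPath))
    Files.deleteIfExists(dir)
    ()
  }
}
```

A caller would run writeStaging while users keep querying the table, and call swapIn only once the staged files are complete. The previous table files remain in the backup directory until the next run, which leaves room for a manual rollback.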

Our proposed solution benefits our client by significantly reducing the time of data unavailability, allowing their employees to work more efficiently. Moreover, our solution leverages the latest industry practices, positioning our client as a competitive and forward-thinking organization.

The results

Our consultancy services enabled our client to optimize their data processing and achieve measurable improvements in efficiency and productivity. Key results for our global retail client include:

  1. Shorter waiting time: data unavailability reduced from several hours to a matter of seconds.
  2. Increased availability of the table, allowing users to perform their daily work activities more efficiently.
  3. Adoption of state-of-the-art data processing methods, resulting in proper data handling and management.
  4. A universal and easily adaptable solution that our client can apply to their future processing tasks.

The tech stack

Languages: Scala, SQL, HiveQL

Database: Hive

Eventing platform: Kafka

Infrastructure: Hortonworks Data Platform / Spark, Hive, HDFS, YARN, Oozie, Sqoop, Ranger
