7 August 2017 / Jan Paw

Diving in the data lake

The rapid growth of unstructured data is a serious business challenge for organizations. Data repositories known as data lakes are well placed to play an important role in extracting valuable business information from enormous amounts of data. Storing and processing data on such a scale is a very complex and demanding task. Existing RDBMS-based systems […]

Read more

31 August 2017 / Bartłomiej Tomala

Navigating data lakes using Atlas

Nowadays almost every company wants to have its own Big Data system to analyse client behaviour and optimise operating costs. One of the most popular solutions for implementing such systems is a Data Lake based on the Hadoop ecosystem. If you don’t know what exactly a Data Lake is, you can read about it in […]

Read more

19 September 2017 / Jan Paw

Hadoop legacy

In the previous blog post I explained the basic concepts of data lakes. I defined some core problems that can occur in data lakes and gave some hints for avoiding them. Most of these pitfalls are caused by the traits of data lakes. Unfortunately, current Hadoop distributions can’t resolve them entirely. Additionally, the architecture […]

Read more

17 April 2018 / Tomasz Lichoń

Benchmarking Spark SQL, Presto and Hive for BI processing on Google’s Cloud Dataproc

Recently our big data team was approached by one of our clients – a company wanting to minimize the cost of analyzing its sales data. The client had an analytics team working with the Microstrategy BI tool [1]. The tool fetches data by running SQL queries on an underlying Teradata DB [2]. As Microstrategy can […]

Read more