Metadata exploration and measurement tools suite

Managing more than 200 000 tables with over 4 million columns across the data lake.

Customer: Global retailer
Technology: Scala, Apache Spark, Apache Hive
Scope: metadata management, integration, visualisation

Our client - Top global retailer

Our client is one of the largest retailers in the world, delivering billions of items every year and constantly innovating and disrupting the shopping experience.

The problem - very low metadata coverage

The client had very large amounts of data from various systems, but most of it lacked metadata. As a result, analyzing the data and using it in other systems was arduous and sometimes impossible. The customer was already using a commercial metadata management system that gave data stewards basic functionality for manipulating metadata but no easy way to measure the results of their work. Because of the overwhelming amount of metadata that needed to be supplied, the stewards often found it difficult to determine which data in the system needed more curation.

Our solution - metadata exploration and measurement tools suite

The VL team integrated with the aforementioned metadata management system. After extracting the metadata stored in it, we defined a parameterizable set of criteria for measuring metadata quality. We then matched the quality scores with the data stewards responsible for supplying the metadata and with the areas of the system they were supposed to describe. This let us determine exactly which data (down to a single table and column within a database) needed better curation, and by whom.
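To make the idea of parameterizable criteria more concrete, here is a minimal sketch in plain Scala. It is an illustration only, not the code delivered to the client: the `ColumnMeta` and `Criterion` types, the two example criteria, their weights, and the steward names are all assumptions introduced for this example.

```scala
// Hypothetical model: none of these names come from the client's system.
final case class ColumnMeta(
  database: String,
  table: String,
  column: String,
  description: Option[String],
  steward: String
)

// A criterion is a named, weighted predicate over one column's metadata.
final case class Criterion(name: String, weight: Double, passes: ColumnMeta => Boolean)

object MetadataQuality {
  // Parameterizable criteria; the weights and the length threshold are illustrative.
  def defaultCriteria(minDescriptionLength: Int): Seq[Criterion] = Seq(
    Criterion("has description", 0.6, _.description.exists(_.trim.nonEmpty)),
    Criterion("description is long enough", 0.4,
      _.description.exists(_.trim.length >= minDescriptionLength))
  )

  // Weighted share of the criteria that a column satisfies, in [0, 1].
  def score(criteria: Seq[Criterion])(c: ColumnMeta): Double = {
    val total = criteria.map(_.weight).sum
    if (total == 0.0) 0.0
    else criteria.filter(_.passes(c)).map(_.weight).sum / total
  }

  // Columns scoring below the threshold, grouped by the responsible steward.
  def backlogBySteward(
    columns: Seq[ColumnMeta],
    criteria: Seq[Criterion],
    threshold: Double
  ): Map[String, Seq[ColumnMeta]] =
    columns.filter(c => score(criteria)(c) < threshold).groupBy(_.steward)
}

object MetadataQualityDemo {
  def main(args: Array[String]): Unit = {
    val columns = Seq(
      ColumnMeta("sales", "orders", "order_id",
        Some("Unique identifier of a customer order"), "alice"),
      ColumnMeta("sales", "orders", "amt", None, "alice"),
      ColumnMeta("stock", "items", "sku", Some("SKU"), "bob")
    )
    val criteria = MetadataQuality.defaultCriteria(minDescriptionLength = 10)
    val backlog = MetadataQuality.backlogBySteward(columns, criteria, threshold = 1.0)
    // Print each steward with the columns that still need curation.
    backlog.toSeq.sortBy(_._1).foreach { case (steward, cols) =>
      val names = cols.map(c => s"${c.database}.${c.table}.${c.column}").mkString(", ")
      println(s"$steward: $names")
    }
  }
}
```

Because each criterion is just a value, changing what "good metadata" means is a matter of passing a different list, and the grouping step yields a per-steward work list at single-column precision.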

 

The resulting metrics were then visualized as interactive dashboards with diagrams and presented to data stewards and their managers. This helped the client allocate its limited human resources more effectively to the problem of incomplete metadata. That mattered because the amount of data to be cataloged was so large that only a small subset could be given metadata in a reasonable time, so choosing the right focus was crucial.

Extracting the data from the external metadata management system also structured and exposed the metadata in a better way. As a positive side effect, the client’s analysts could easily run custom queries on the metadata stored in the system to answer important business questions.
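As an illustration of the kind of custom query this enables, the sketch below computes metadata coverage (the share of columns that have a description) at a granularity chosen by the caller. It uses plain Scala collections so that it runs without a cluster; the `ColumnRecord` type and the sample rows are hypothetical, and the real queries may have looked quite different.

```scala
// Hypothetical flattened view of the extracted metadata, one row per column.
final case class ColumnRecord(
  database: String,
  table: String,
  column: String,
  described: Boolean
)

object CoverageQuery {
  // Share of described columns per group; the key function sets the granularity.
  def coverageBy[K](records: Seq[ColumnRecord])(key: ColumnRecord => K): Map[K, Double] =
    records.groupBy(key).map { case (k, rs) =>
      k -> (rs.count(_.described).toDouble / rs.size)
    }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      ColumnRecord("sales", "orders", "order_id", described = true),
      ColumnRecord("sales", "orders", "amt", described = false),
      ColumnRecord("sales", "customers", "email", described = true),
      ColumnRecord("stock", "items", "sku", described = false)
    )

    // Coarse granularity: one figure per database.
    val perDatabase = coverageBy(records)(_.database)
    // Finer granularity: one figure per (database, table) pair.
    val perTable = coverageBy(records)(r => (r.database, r.table))

    perDatabase.toSeq.sortBy(_._1).foreach { case (db, c) =>
      println(f"$db%-18s $c%.2f")
    }
    perTable.toSeq.sortBy(_._1).foreach { case ((db, t), c) =>
      println(f"$db.$t%-18s $c%.2f")
    }
  }
}
```

The same grouping could be keyed by steward or business area; only the key function changes.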

Main technologies we used

Scala

Apache Spark

Apache Hive

The final results

1. Making the client aware of the scale of the problem, which we consider our greatest success

2. Giving the users of the client’s system more confidence in the data and making their work easier, more efficient, and less error-prone

3. Identifying business entities whose metadata do not fulfill given requirements

4. Answering complex business queries based on metadata

5. Metadata metrics covering more than 200 000 tables with over 4 million columns across the data lake

6. Integration with other commercial and open-source metadata management tools

7. Visualization of metadata coverage within different parts of the managed system with adjustable granularity

Any problems with data or metadata?

Contact us!