Data lake for edge reference architecture: What is it, and how does it integrate with modern edge computing solutions?
Tomasz Jażdżewski
ML Engineer
Published: May 13, 2025 | 15 min read
Edge computing offers tremendous opportunities across industries. However, regardless of the application, it generates large amounts of many different types of data, which are challenging to store even for organizations experienced in data management. This is where data lakes come in.
Read on to find out what they are and what makes them a valid storage option for solutions built around edge reference architecture. You will also learn how the combination of data lakes and edge computing applies in real life.
What is a data lake?
A data lake is a type of storage repository. Different types of repositories hold different types of information, and data lakes store large amounts of data in their native, raw format, whether the data is structured, semi-structured, or unstructured. A data lake also supports a schema-on-read approach: data is stored as-is, and a schema is applied only when the data is needed, for example, for reading, analysis, or running a query.
Data lakes can scale to terabytes or even petabytes, making them suitable for storing different data types like video, audio, and customer records in one place. Unlike traditional storage repositories such as data warehouses, which typically require data to be processed and transformed before ingestion, data lakes accept raw data in its unaltered form. Data is refined only when it is actually needed for analysis, not at ingestion, which saves time and resources.
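To make schema-on-read concrete, here is a minimal PySpark sketch. It assumes a Spark session with access to S3-compatible storage; the bucket path and field names are illustrative, not taken from a real deployment.

```python
# Minimal schema-on-read sketch in PySpark. The raw JSON events were
# landed in the lake as-is; the schema is declared only now, at read time.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

raw_path = "s3a://example-lake/raw/sensor-events/"  # illustrative path

# Schema applied at read time, for this particular analysis only.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("recorded_at", TimestampType()),
])

events = spark.read.schema(schema).json(raw_path)
events.createOrReplaceTempView("sensor_events")
spark.sql(
    "SELECT device_id, avg(temperature) FROM sensor_events GROUP BY device_id"
).show()
```

A different team can read the same raw files tomorrow with a different schema, which is exactly the flexibility that a warehouse's schema-on-write model lacks.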
Without strong governance, processes, and metadata management, your data lake runs the risk of becoming a waste of time and resources: a so-called “data swamp,” a disorganized, inconsistent, and unreliable storage repository. A data swamp often ends up as an ineffective storage system that people are afraid to clean up for fear of accidentally deleting something valuable.
What types of data lakes are out there?
Depending on where their storage infrastructure is located and which storage solutions they use, data lakes fall into one of three categories: on-premises, cloud-based, and hybrid.
On-premises data lakes: These are hosted within an organization's physical infrastructure, offering greater control, but their maintenance requires significant resources and a long, expensive planning process. Tools like the Apache Hadoop Distributed File System (HDFS) are commonly used for on-premises setups.
Cloud-based data lakes: These leverage cloud storage solutions like Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, and IBM Cloud Object Storage. Cloud-based lakes are highly scalable and cost-effective, making them popular with modern enterprises (see the landing sketch after this list).
Hybrid data lakes: Combining on-premises and cloud storage, hybrid data lakes allow organizations to balance control and scalability while optimizing costs.
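As a small illustration of how simple raw ingestion into a cloud object store can be, here is a boto3 sketch. The bucket name, key layout, and payload are assumptions made for the example.

```python
# Landing a raw edge payload in a cloud object store (Amazon S3 via boto3).
# Bucket name and key naming scheme are illustrative assumptions.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

payload = {
    "device_id": "edge-042",
    "temperature": 21.7,
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

# The lake accepts the payload as-is; partition-style key prefixes
# (dt=YYYY-MM-DD) let query engines skip whole prefixes later.
key = f"raw/sensor-events/dt={datetime.now(timezone.utc):%Y-%m-%d}/edge-042.json"
s3.put_object(Bucket="example-lake", Key=key, Body=json.dumps(payload))
```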
To understand the importance of edge reference architecture, we first need to recap edge computing.
What is edge computing?
Edge computing is a distributed computing model that processes and stores data closer to its source, rather than relying on centralized data centers. Examples of data sources include IoT devices, sensors on manufacturing lines, and local servers.
Edge computing handles data locally before transmitting critical insights to central servers for further processing. This approach reduces latency, improves response times, and minimizes bandwidth usage.
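The following sketch illustrates the idea of local processing: an edge node reduces a batch of raw readings to one compact summary before anything crosses the network. The batch size, anomaly threshold, and send callback are illustrative assumptions.

```python
# Sketch of an edge node that processes readings locally and forwards
# only a compact summary upstream. Thresholds are illustrative.
import statistics

READINGS_PER_BATCH = 60  # e.g. one reading per second, summarized each minute

def summarize(batch: list[float]) -> dict:
    # Local processing: reduce 60 raw readings to one small record.
    return {
        "mean": statistics.fmean(batch),
        "max": max(batch),
        "anomaly": max(batch) > 80.0,  # decided at the edge, low latency
    }

def on_batch(batch: list[float], send) -> None:
    # Only the summary crosses the network, cutting bandwidth roughly
    # 60x in this example.
    send(summarize(batch))
```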
Edge reference architecture is a framework for implementing edge computing in line with current standards.
Data lakes and edge reference architecture are complementary: Edge computing provides real-time, localized processing, while data lakes offer centralized, scalable storage and advanced analytics. Data lakes also store data collected through edge computing.
The integration between data lakes and edge computing enables organizations to maximize the value of both real-time and historical data. This drives operational efficiency and informed decision-making in distributed, data-rich environments. It also allows for optimizations, such as scheduling data transfers during nighttime or other periods when most of the hardware is idle.
An example of the role of a data lake in edge reference architecture: data flow from edge node to BI report and AI model.
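As an example of the idle-window optimization just mentioned, here is a small sketch that buffers uploads until a nightly window; the window boundaries and the upload function are assumptions for illustration.

```python
# Sketch: defer bulk uploads from an edge site to a nightly idle window.
from datetime import datetime

IDLE_WINDOW = range(1, 5)  # 01:00-04:59 local time, when hardware is mostly idle

def maybe_flush(queued_files: list[str], upload) -> list[str]:
    if datetime.now().hour in IDLE_WINDOW:
        for path in queued_files:
            upload(path)       # bulk transfer to the central data lake
        return []              # queue drained
    return queued_files        # keep buffering until the idle window
```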
Implementing a data lake for edge reference architecture empowers organizations with data management that is scalable, flexible, and cost-effective. In other words, organizations gain tools that help them unlock the full potential of edge computing.
However, it comes with a trade-off: data lakes introduce challenges in multiple areas, such as governance, security, and performance. Addressing these challenges requires careful planning, robust frameworks, and ongoing management.
Benefits of implementing a data lake for edge reference architecture
The biggest benefit of implementing a data lake for edge reference architecture is reduced data silos and centralized management. This makes it easier for organizations to gather all the data from different sources into one big, organized space.
This also provides them with a unified and consistent view of their data, which helps them in analysis and improves collaboration between their teams.
Other benefits vary from performance-related to managerial and financial.
Improved scalability and performance of your architecture: Data lakes are designed to absorb and store large amounts of data from multiple sources. This scalability ensures that the architecture can handle increased data loads without performance bottlenecks, despite a growing number of edge sources.
Enablement of object storage and open table formats: The combination of object storage with open table formats gives organizations a highly scalable, cost-effective, and flexible foundation for modern data lakes. This enables more advanced analytics and machine learning, as well as wider availability of data across your organization.
Increased flexibility of your data management: A data lake’s schema-on-read approach allows organizations to ingest raw data from edge sources without transforming it. This enables organizations to quickly adapt to new data types, sources, and use cases.
Improved quality of analytics and insights: Data lakes centralize data from edge devices, which enables advanced analytics, machine learning, and predictive modeling. Organizations can perform deep analysis on aggregated edge data, uncovering trends, anomalies, and actionable insights that drive smarter decisions.
Additional opportunities for financial savings: Typically, data lakes leverage cloud-based or object storage solutions that offer a cost-efficient way to store massive data volumes. This is especially beneficial for organizations whose data grows in a fluctuating or unpredictable manner. Cost savings also come from the ability to move data to cheaper storage tiers, such as archives.
Better accessibility and usability of data: Data lakes make it easier for data scientists, analysts, and other users to independently access and analyze edge-generated data. This reduces reliance on IT and accelerates innovation within the organization.
Improvements in the decision-making process: Organizations gain a holistic view of their operations by integrating real-time and historical data from edge sources. This enables faster, better-informed decisions that are vital for applications like predictive maintenance, supply chain optimization, and smart environments.
Additional support for compliance and data governance: Modern data lake architectures include robust security, data cataloging, and governance features, ensuring that data from edge sources is managed, protected, and compliant with regulations.
Challenges of implementing a data lake for edge reference architecture
The challenges of implementing a data lake for edge reference architecture appear in multiple domains, from implementation and management to governance and security.
Data governance and the risk of becoming a “data swamp”: Data lakes require strong governance; without it, they devolve into “data swamps” where data is difficult to find, trust, and use effectively, especially when integrating multiple edge sources with different types of data.
Security and compliance challenges: Centralizing large volumes of raw data from edge devices increases the risk of security breaches and non-compliance with regulations. Sensitive data may be exposed if access controls, encryption, and monitoring are not rigorously enforced.
Complexity in data integration: Aggregating data from diverse edge sources and systems is complex. Data often arrives in different formats and requires robust ETL (Extract, Transform, Load) processes to maintain its consistency and integrity. Poor integration of these different data formats leads to data quality issues and hinders the ability to perform analytics.
Potential performance issues: As data lakes grow, query performance can degrade, especially with millions of files or poorly optimized storage. Issues like small files, unnecessary reads, and slow query response times hamper real-time analytics and user satisfaction (see the compaction sketch after this list).
High complexity of implementation and operation: Setting up and managing a data lake for edge architectures demands significant technical expertise. The infrastructure, ingestion pipelines, access controls, and ongoing optimization require specialized skills and extensive resources to maintain.
Concerns regarding data quality: Ingesting raw, unstructured data from numerous edge sources introduces inconsistencies, inaccuracies, and duplicate records. Without strong data quality frameworks, users may struggle to trust the data for decision-making.
Difficulties with access control: It is challenging to provide the right level of data access to various users. Too broad access increases security risks, while too restrictive access hinders legitimate collaboration within your organization.
Challenges of proper indexing and metadata management: Robust indexing and metadata management are essential for efficient querying in large, unstructured data lakes. Without these, users may find it difficult to discover and utilize relevant data.
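As a taste of how the small-files issue from the performance point above is commonly addressed, here is a sketch using Apache Iceberg's rewrite_data_files maintenance procedure via Spark SQL. It assumes an Iceberg-enabled Spark session; the catalog and table names are illustrative.

```python
# Sketch: compacting small files in an Iceberg table with Spark's
# rewrite_data_files procedure. Catalog and table names are illustrative,
# and the session is assumed to have the Iceberg runtime configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction-demo").getOrCreate()

# Merge many small data files into fewer, larger ones so queries
# open fewer objects and scan less metadata.
spark.sql("""
    CALL lake_catalog.system.rewrite_data_files(
        table => 'edge.sensor_events'
    )
""")
```

Running maintenance like this on a schedule, rather than ad hoc, is what keeps query latency predictable as edge sources multiply.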
At VirtusLab, we choose solutions based on the needs and requirements of each project. For projects that implement a data lake for edge reference architecture, we use a combination of tools for storage, the data platform, integration, and data democratization. Here’s an example of a tech stack we used for a client operating in the retail industry.
Core platform (framework)
Hadoop: A framework for large-scale data storage and processing.
Querying and computing layer
Apache Spark: An analytics engine for fast, distributed data processing.
HiveQL: A query language similar to SQL, but dedicated to Hadoop.
Apache Tez: Another analytics and data processing engine, optimized for interactive and batch workloads.
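To show what this querying layer looks like in practice, here is a sketch that runs an SQL-like aggregation through Spark; a similar query could equally run via Hive on Tez. The table and column names are illustrative assumptions.

```python
# Sketch of the querying layer: an SQL-like aggregation submitted
# through Spark. Table and column names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("query-layer-demo")
         .enableHiveSupport()   # reuse table definitions from the Hive metastore
         .getOrCreate())

daily_counts = spark.sql("""
    SELECT store_id,
           to_date(recorded_at) AS day,
           count(*)             AS events
    FROM edge.sensor_events
    GROUP BY store_id, to_date(recorded_at)
""")
daily_counts.show()
```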
Data democratization
Alation: A data catalog and query interface that helps different teams explore, understand, and work with data.
Storage
Hadoop Distributed File System (HDFS): A file system used for data storage in the initial stages of the project. As the project grew, we suggested migrating to S3-compatible blob storage, which supports scalable, cloud-friendly storage.
Data integration
Kafka: An event streaming platform for real-time data streaming and integration with various sources.
Secure File Transfer Protocol (SFTP): A protocol for secure transfers of data.
Change Data Capture (CDC) via Kafka: A process for streaming database changes in real-time.
Database Snapshots: A method for capturing a database state at a given time.
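Here is a sketch of what the Kafka-based CDC path might look like on the consuming side, using the kafka-python client. The topic name, broker address, and Debezium-style event shape (the "op" field) are assumptions made for illustration.

```python
# Sketch: consuming change-data-capture events from Kafka and landing
# them in the lake. Topic, brokers, and event shape are illustrative.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "retail.inventory.cdc",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    if change.get("op") in ("c", "u", "d"):  # create / update / delete
        # In a real pipeline this record would be appended to raw
        # storage; printing stands in for the write here.
        print(change["op"], change.get("after") or change.get("before"))
```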
Data lake solutions
Apache Iceberg: We also implement Apache Iceberg as a data lake solution, combined with GCP or S3 storage. It provides scalable, open table formats that support ACID transactions, schema evolution, and high-efficiency queries, making it ideal for cloud-native architectures (a minimal setup sketch follows this list).
Custom solutions: In some cases, we utilize custom-made solutions that the client has already used. For example, in the case of our retail client, we worked with their data lake solution based on the components of their on-premise infrastructure.
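Below is a minimal sketch of the Iceberg setup referenced above: a Spark session with an Iceberg catalog over object storage, a partitioned table, and a schema-evolution step. The catalog name, warehouse path, and table layout are illustrative, and an Iceberg Spark runtime on the classpath is assumed.

```python
# Sketch: an Iceberg table over object storage via Spark SQL.
# Names and paths are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("iceberg-demo")
         .config("spark.sql.catalog.lake_catalog",
                 "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake_catalog.type", "hadoop")
         .config("spark.sql.catalog.lake_catalog.warehouse",
                 "s3a://example-lake/warehouse")
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake_catalog.edge.sensor_events (
        device_id   STRING,
        temperature DOUBLE,
        recorded_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(recorded_at))
""")

# Schema evolution is a metadata-only operation, no data rewrite needed:
spark.sql("ALTER TABLE lake_catalog.edge.sensor_events ADD COLUMN humidity DOUBLE")
```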
The previously mentioned retail client has successfully applied edge computing to visual object recognition. Some of its stores experimented with recognizing products from camera images instead of scanning barcodes. The checkouts were equipped with cameras, and the camera data was analyzed locally, which allowed for instant image recognition. Adding a data lake solution could take this project even further by reducing errors and delays during peak hours.
Data lake solutions also have great potential in renewable energy. For example, they help collect and manage the large volumes of data generated by multiple solar parks. Currently, these systems mostly rely on measuring current in electrical lines, which is far from accurate.
With data lakes and edge computing solutions, each solar panel could be equipped with an IoT device and sensors to collect detailed data on energy production and consumption. This data would then be transferred to the cloud. To prevent data loss during connectivity outages, backfilling techniques would ensure all data is eventually captured and transmitted.
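A backfill mechanism of the kind described above can be as simple as an append-only local queue on the gateway. The sketch below is illustrative: the buffer path, the upload function, and the online flag are assumptions.

```python
# Sketch: a backfill buffer on a solar-panel IoT gateway. Readings are
# queued locally during outages and replayed once connectivity returns.
import json
from pathlib import Path

BUFFER = Path("/var/spool/solar-readings.jsonl")  # illustrative path

def record(reading: dict, upload, online: bool) -> None:
    if online:
        backfill(upload)          # drain anything queued during the outage
        upload(reading)
    else:
        # Append-only local queue; survives gateway restarts.
        with BUFFER.open("a") as f:
            f.write(json.dumps(reading) + "\n")

def backfill(upload) -> None:
    if BUFFER.exists():
        for line in BUFFER.read_text().splitlines():
            upload(json.loads(line))
        BUFFER.unlink()           # queue drained, all data eventually captured
```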
Once in the cloud, the data is stored in a centralized data lake, where cloud applications process the raw sensor inputs and generate insights to support more effective management of solar farms.
Organizing various types of data in a single, structured space is one of the main reasons why data lakes are a recommended solution for edge computing. Regardless of the industry or application, there is value in being able to modify, adjust, or grow your pool of data-collection sensors without worrying about the implications for data management.
Like any technology, it’s not a one-size-fits-all solution. Security risks and the potential for creating a “data swamp” make it a less-than-ideal option for organizations with limited experience handling these kinds of challenges. That’s why it’s wise to consult data engineering experts before deciding on the right solution.