Introduction
Starting a data lake initiative is a significant step for any organization looking to move toward fact-based decision making. While traditional EDW (Enterprise Data Warehouse) and BI tools have proven effective for many years, supporting structured analysis for recurring business questions, they have shown limitations in handling ad-hoc requirements and answering unexpected questions.
This article explores how organizations can approach building a data lake in an agile way, presents the high-level architecture of a data lake and its components, and identifies technologies suited for today's data landscape.
What is a Data Lake?
A data lake is a centralized repository for all the data an organization collects from a variety of sources, with these key characteristics:
- Raw Data Storage: Stores both relational and non-relational data in raw form with the lowest granularity
- Scalable Storage: Leverages scale-out, cost-effective storage such as HDFS (Hadoop Distributed File System) or Amazon S3
- Staging Layer: Serves as a staging layer for further structured and unstructured analysis
Starting with a Single Use Case
Getting all data in one place is non-trivial. Organizations should not start with the objective of building a data lake simply because competitors are doing so. The best approach is to begin with one specific business problem in mind: getting the required data for that particular business problem streamed into the lake. Additional data sources can be added as needs evolve, gradually building a comprehensive repository.
This iterative approach aligns with industry best practices described in "The Data Warehouse Toolkit" by Ralph Kimball, which advocates a similarly iterative, one-business-process-at-a-time methodology even for traditional DW/BI systems. Over time, the repository becomes a complete collection of organizational data readily available for answering new questions and operationalizing new analytics pipelines.
Data Lake Architecture and Components
A typical data lake architecture involves several layers and components working together:
Raw Data Staging Layer
The ingestion layer extracts data from various conventional and unstructured data sources in batch or streaming modes:
- Batch Processing: Extracting data from relational OLTP sources nightly or every few hours
- Streaming: Processing data in near real-time fashion
The choice of tools depends on acceptable latency for data availability in decision-making.
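To make the latency trade-off concrete, here is a minimal sketch of how that decision might be encoded. The function name and the thresholds are illustrative assumptions, not prescriptions:

```python
# Hypothetical helper: pick an ingestion mode from a latency SLA.
# Thresholds below are illustrative assumptions, not prescriptions.

def choose_ingestion_mode(max_latency_seconds: float) -> str:
    """Return 'streaming' when decisions need near real-time data,
    'micro-batch' for intermediate needs, and 'batch' otherwise."""
    if max_latency_seconds <= 60:      # sub-minute freshness needed
        return "streaming"
    if max_latency_seconds <= 3600:    # data needed within the hour
        return "micro-batch"
    return "batch"                     # nightly / periodic loads suffice

print(choose_ingestion_mode(5))        # streaming
print(choose_ingestion_mode(86400))    # batch
```

In practice this decision also factors in source system load and tooling maturity, but acceptable latency is the dominant driver.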
Data Sources:
- Logs: Tools like Flume or Logstash forward log data from web servers and application servers. Log data is valuable for clickstream analysis, A/B testing, and recommender systems when combined with other data sources.
- Social Networks: Data from Facebook, Twitter, and LinkedIn can be streamed and stored for offline analysis, or collected on an ad-hoc basis for monitoring campaign responses, product launches, or public events.
- Raw Storage: Ingested data from any source is initially stored in raw form in HDFS, representing the true data lake or data reservoir. This raw staged data serves as a source for ad-hoc analysis should future needs arise.
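A common convention for the raw zone is to land files under source- and date-based paths, which keeps raw data browsable and easy to reprocess. The `/raw/<source>/<yyyy>/<mm>/<dd>` scheme below is a hypothetical layout for illustration; both HDFS and S3 accept this kind of key structure:

```python
from datetime import datetime, timezone

# Hypothetical raw-zone layout: /raw/<source>/<yyyy>/<mm>/<dd>
# The path scheme is an assumption for illustration only.

def raw_path(source: str, ingested_at: datetime) -> str:
    """Build a date-partitioned landing path for a raw file."""
    return "/raw/{}/{:%Y/%m/%d}".format(source, ingested_at)

ts = datetime(2016, 3, 14, tzinfo=timezone.utc)
print(raw_path("weblogs", ts))   # /raw/weblogs/2016/03/14
```

Keeping the raw zone immutable and append-only means any downstream layer can always be rebuilt from it.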
Performance Layer (Access Layer)
Raw data is subsequently deduplicated, cleaned, transformed, and stored in an efficiently queryable format:
- Columnar Formats: Parquet and similar columnar file formats offer excellent compression and query performance
- Partitioning: Data is partitioned by time intervals to reduce the amount of data scanned during queries
- Streaming Data: Parquet is not ideal for streaming data requiring low-latency access. NoSQL databases like Cassandra, HBase, or Elasticsearch serve this purpose better
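The payoff of time-based partitioning is partition pruning: a query over a date range only has to scan the partitions that overlap that range. Here is a toy, stdlib-only illustration; the `dt=YYYY-MM-DD` naming convention is assumed for the example (it mirrors the Hive-style layout many engines use):

```python
from datetime import date

# Toy illustration of partition pruning: a query over a date range
# only scans partitions that overlap the range.
# The dt=YYYY-MM-DD naming is an assumed, Hive-style convention.

partitions = ["dt=2016-01-01", "dt=2016-01-02",
              "dt=2016-01-03", "dt=2016-01-04"]

def prune(parts, start: date, end: date):
    """Keep only partitions whose date falls inside [start, end]."""
    kept = []
    for p in parts:
        d = date.fromisoformat(p.split("=")[1])
        if start <= d <= end:
            kept.append(p)
    return kept

print(prune(partitions, date(2016, 1, 2), date(2016, 1, 3)))
# ['dt=2016-01-02', 'dt=2016-01-03']
```

Query engines such as Impala and Spark SQL perform this pruning automatically when the partition column appears in the WHERE clause.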
Integration with Existing Systems:
Hadoop serves as the landing zone and long-term repository for relational data from OLTP systems. It can also perform ETL operations using Spark's compute power and export data to existing Enterprise Data Warehouses.
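The extract-transform-load shape of that offloaded work can be sketched in a few lines. In practice this step would run on Spark over HDFS files; the plain-Python stand-in below only shows the three-stage structure, and all field names are hypothetical:

```python
# Minimal ETL sketch in plain Python. In practice this would run on
# Spark over HDFS; only the three-stage shape is shown here, and the
# field names ('id', 'country') are hypothetical.

def extract(raw_rows):
    """Stand-in for reading rows from the raw staging layer."""
    return list(raw_rows)

def transform(rows):
    """Deduplicate on 'id' and normalize the country code."""
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        out.append({**row, "country": row["country"].upper()})
    return out

def load(rows, target):
    """Append cleaned rows to the performance layer / EDW stand-in."""
    target.extend(rows)
    return target

warehouse = []
raw = [{"id": 1, "country": "de"},
       {"id": 1, "country": "de"},   # duplicate, dropped in transform
       {"id": 2, "country": "us"}]
load(transform(extract(raw)), warehouse)
print(len(warehouse))   # 2
```

Moving this kind of transformation into the lake frees EDW capacity for the query workloads it is best at.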
Query Tools:
Existing BI tools and custom web applications can connect directly to Hadoop using SQL-on-Hadoop engines:
- Cloudera Impala
- Spark SQL with its Thrift JDBC/ODBC server
Data Discovery Lab
The data discovery lab serves as a sandbox and playground for innovation:
- Exploratory Analysis: Data scientists can bring raw and/or processed data for exploratory analysis
- Machine Learning: Building and testing new machine learning models
- Findings and Models: Exploration findings can be presented as one-time reports, while validated ML models can be operationalized
- Key Benefit: The data lake enables the existence of such labs where data scientists and business analysts can freely explore data and innovate
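A typical discovery-lab exploration is a quick aggregate over raw records, long before any model is built. The sketch below computes churn rate per customer segment with the standard library only; the records and field names are made up for illustration:

```python
from collections import defaultdict

# Sandbox-style exploration: churn rate per customer segment from
# raw event records. Records and field names are made up.

records = [
    {"segment": "consumer", "churned": True},
    {"segment": "consumer", "churned": False},
    {"segment": "consumer", "churned": False},
    {"segment": "business", "churned": False},
    {"segment": "business", "churned": False},
]

def churn_by_segment(rows):
    """Return the fraction of churned customers per segment."""
    totals, churned = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["segment"]] += 1
        churned[r["segment"]] += r["churned"]   # True counts as 1
    return {s: churned[s] / totals[s] for s in totals}

print(churn_by_segment(records))
```

If an exploration like this proves valuable, the same logic can be operationalized as a scheduled pipeline over the performance layer.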
Benefits of a Data Lake
A data lake provides significant advantages:
- Enterprise Data Hub: Creates a centralized data repository supporting the entire organization
- Data Discovery: Enables exploration of available data assets
- Real-Time Analytics: Supports near real-time analysis capabilities
- Machine Learning: Facilitates development and operationalization of ML models
- Agility: Allows organizations to answer ad-hoc questions quickly without lengthy ETL cycles
Getting Started
Building a data lake doesn't require realizing the entire vision up front. Initial use cases that commonly motivate data lake development include:
- Recommender systems for customer portals
- Offloading ETL workloads from existing EDW systems
- Customer churn analysis
- A/B testing capabilities
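As a taste of the A/B testing use case, the significance check behind an experiment can be a back-of-the-envelope two-proportion z-score. The sketch below uses only the standard library; the conversion numbers are made up:

```python
import math

# Back-of-the-envelope A/B test: two-proportion z-score using only
# the standard library. The conversion numbers below are made up.

def ab_z_score(conv_a, n_a, conv_b, n_b):
    """z-score for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # std. error
    return (p_b - p_a) / se

z = ab_z_score(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(round(z, 2))   # |z| > 1.96 suggests significance at ~95%
```

With clickstream data already landed in the lake, computing inputs like these becomes a query rather than a new data collection project.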
Starting with a single, well-defined use case and approaching data lake development in an agile, iterative manner can yield significant long-term organizational benefits. As the data lake matures, it becomes an increasingly valuable asset for analytics, reporting, and innovation.
Conclusion
The data lake approach complements rather than replaces traditional EDW and BI systems. By taking an agile, iterative approach to building an enterprise data hub, organizations can gradually expand their data capabilities while solving immediate business problems. This strategy reduces risk while building the foundation for advanced analytics and data-driven decision making.