
Hadoop, Spark and Effect on Data Warehousing/BI

Pranav Shukla May 10, 2016 3 min
Big Data, Data Warehouse, Hadoop, Spark

Introduction

The nature of the data that enterprises deal with has been changing significantly. It is no longer just transactional data: the volume, velocity, and variety of data are increasing steadily. Organizations that do not capture, store, and analyze these new data types risk losing their competitive edge; conversely, leveraging them can provide significant competitive advantages.

Key Topics

This article covers:

  1. How data warehouses excel at dealing with structured data
  2. The rise of unconventional data
  3. Shortcomings of traditional data warehouses
  4. How Hadoop ecosystem and Apache Spark can help

How Data Warehouses Excel with Structured Data

Business Intelligence, analytics, and data warehousing are mature fields that have existed for over three decades. BI software is typically powered by underlying data warehouses and data marts. Traditional data warehouses and data marts rely on relational databases such as Oracle, Teradata, and MySQL, with data typically modeled as star schemas. SQL constructs enable slicing, dicing, and drilling down/up across different dimensions.
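As a minimal sketch of the slicing and rolling up mentioned above (using SQLite and invented table names rather than any real warehouse schema), an OLAP-style query simply joins a fact table with a dimension and groups by a dimension attribute:

```python
import sqlite3

# In-memory stand-in for a warehouse; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_key INTEGER, sale_date TEXT, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Books'), (3, 'Games');
    INSERT INTO fact_sales VALUES (1, '2016-01-05', 10.0),
                                  (2, '2016-01-06', 15.0),
                                  (3, '2016-01-06', 40.0);
""")

# Roll up the fact table along the product dimension (a classic OLAP query).
rollup = dict(conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.category
""").fetchall())
```

Swapping `category` for a date or geography column gives the other "slices"; drill-down is the same query with a finer-grained GROUP BY.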

Data warehouses have worked effectively and continue to serve their purpose for highly structured transactional data suitable for relational databases. However, big data introduces new unconventional semi-structured and unstructured data. Additionally, data arriving at thousands or millions of events per second poses challenges for traditional systems.

The Rise of Unconventional Data

Much discussion surrounds the growth of data and the 3 (or 4) V's:

  • Volume
  • Velocity
  • Variety
  • Veracity (the fourth V)

New data types include:

  • Sensor data
  • Smartphone data
  • Images
  • Social media content
  • Machine logs

This unstructured and semi-structured data can drive real competitive advantage when properly leveraged.

Traditional data warehouses cannot handle these new data types efficiently, nor can they manage all of the V's; licensing and infrastructure expenses make scaling them prohibitively costly.

Shortcomings of Traditional Data Warehouses

Inability to Handle the 3 V's

Traditional systems struggle to handle volume, velocity, and variety without excessive costs. They lack efficient mechanisms for unstructured and semi-structured data processing.

Extensive ETL Requirements

Feeding the warehouse from source transactional (OLTP) systems requires numerous ETL jobs to extract, transform, and load data into target data marts. This process is time-consuming and resource-intensive.
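A single ETL step of the kind described above can be sketched as follows; both the OLTP source and the mart target are hypothetical in-memory SQLite databases with invented table names, standing in for real systems:

```python
import sqlite3

# Hypothetical OLTP source, in-memory for illustration.
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 50.0), (12, 2, 25.0);
""")

# Hypothetical target data mart.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")

# Extract + transform: denormalize and pre-aggregate in one query.
rows = oltp.execute("""
    SELECT c.region, SUM(o.total)
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region
""").fetchall()

# Load into the target mart.
mart.executemany("INSERT INTO sales_by_region VALUES (?, ?)", rows)
result = dict(mart.execute("SELECT region, total FROM sales_by_region"))
```

A real warehouse has dozens or hundreds of such jobs, each with scheduling, dependency, and failure-handling overhead, which is where the time and resource cost comes from.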

Inability to Retain Data

High storage costs force data to be archived to tape, making it inaccessible for future analysis.

How Hadoop Ecosystem and Apache Spark Help

Proven Track Record

The Hadoop ecosystem has matured significantly since its inception in 2005, proving itself capable of handling the 3 V's. Extensive research and engineering have gone into distributed file systems, storage formats, distributed computing, SQL-on-Hadoop engines, and security.

Apache Spark Advantages

Apache Spark is a distributed computing framework that typically reads data from HDFS or Amazon S3 and performs fast, in-memory analytics. Spark can be 10x to 100x faster than MapReduce, even when the data does not all fit in cluster memory.

Storage Solutions

  • HDFS and Amazon S3: Resilient, distributed file systems scaling from terabytes to petabytes/exabytes

Storage Formats

  • Columnar Formats: Column-oriented formats such as Apache Parquet store the values of each column together, enabling fast analytical queries
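A toy model of why the columnar layout helps (this illustrates the idea only, not Parquet's actual on-disk encoding):

```python
# Row layout: each record stored together, as in an OLTP database.
rows = [
    {"user": "u1", "country": "DE", "amount": 10.0},
    {"user": "u2", "country": "US", "amount": 15.0},
    {"user": "u3", "country": "DE", "amount": 40.0},
]

# Columnar layout: each column stored contiguously, so an aggregate over
# one column never touches the others (and similar values compress better).
columns = {name: [r[name] for r in rows] for name in rows[0]}

total = sum(columns["amount"])  # scans only the "amount" column
```

With billions of rows and wide tables, reading one column instead of every record is the difference between scanning gigabytes and scanning terabytes.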

SQL-on-Hadoop Engines

Distributed query engines providing low-latency interactive SQL:

  • Cloudera's Impala
  • Spark SQL
  • Apache Drill

These engines are closing the performance gap with relational databases at a rapid pace. JDBC/ODBC connectivity allows existing BI tools (OBIEE, Tableau, MicroStrategy, Qlik) to connect to and query big data directly.

Modernizing BI Infrastructure

While traditional data warehouses will remain relevant for the near future, the Hadoop ecosystem, Spark, and NoSQL databases help modernize BI infrastructure by providing:

  • Ability to store and analyze new types of unstructured data
  • High-volume, high-velocity data processing
  • Cost-effective solutions for large-scale analytics
  • Integration capabilities with existing BI tools

Conclusion

Organizations looking to modernize their DW/BI infrastructure should explore Hadoop and Spark technologies. These solutions complement rather than replace traditional EDW systems, enabling organizations to handle modern data challenges while maintaining existing investments.

For organizations investing in big data analytics technologies, including Hadoop, Spark, NoSQL databases, machine learning, and predictive analytics, modernization opportunities abound.


Pranav Shukla

A member of the Brevitaz team sharing insights on software engineering, big data, and cloud technologies.
