Introduction
The nature of the data enterprises deal with has been changing significantly. It is no longer just transactional data: the volume, velocity, and variety of data are increasing steadily. Organizations that do not capture, store, and analyze these new data types risk losing their competitive edge; conversely, leveraging them can provide significant competitive advantages.
Key Topics
This article covers:
- How data warehouses excel at dealing with structured data
- The rise of unconventional data
- Shortcomings of traditional data warehouses
- How the Hadoop ecosystem and Apache Spark can help
How Data Warehouses Excel with Structured Data
Business Intelligence, analytics, and data warehousing are mature fields that have existed for over three decades. BI software is typically powered by underlying data warehouses and data marts. Traditional data warehouses and data marts rely on relational databases (Oracle, Teradata, MySQL) and typically follow star schemas. SQL constructs enable slicing, dicing, and drilling down or up across different dimensions.
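As a small illustration of star-schema slicing and dicing, the sketch below builds a hypothetical fact table and date dimension in SQLite (the table and column names are invented for this example, not taken from any particular warehouse) and rolls sales up by quarter and region:

```python
import sqlite3

# Hypothetical star schema: a sales fact table joined to a date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, region TEXT, amount REAL);
    INSERT INTO dim_date VALUES (1, 2015, 'Q1'), (2, 2015, 'Q2');
    INSERT INTO fact_sales VALUES (1, 'East', 100.0), (1, 'West', 250.0),
                                  (2, 'East', 175.0);
""")

# Slice by quarter and dice by region: a typical roll-up across dimensions.
rows = conn.execute("""
    SELECT d.quarter, f.region, SUM(f.amount) AS total
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.quarter, f.region
    ORDER BY d.quarter, f.region
""").fetchall()
```

The same GROUP BY pattern, extended with more dimension tables (product, store, customer), is what powers the drill-down and roll-up operations BI tools expose.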
Data warehouses have worked effectively and continue to serve their purpose for highly structured transactional data suitable for relational databases. However, big data introduces new unconventional semi-structured and unstructured data. Additionally, data arriving at thousands or millions of events per second poses challenges for traditional systems.
The Rise of Unconventional Data
Much discussion surrounds the growth of data and the 3 (or 4) V's:
- Volume
- Velocity
- Variety
- Veracity (the fourth V)
New data types include:
- Sensor data
- Smartphone data
- Images
- Social media content
- Machine logs
These unstructured and semi-structured data types can drive real competitive advantage when properly leveraged.
Traditional data warehouses cannot handle these new data types or manage all the V's efficiently. Licensing and infrastructure expenses make the associated costs prohibitive, and there is no efficient way to handle unstructured and semi-structured data.
Shortcomings of Traditional Data Warehouses
Inability to Handle the 3 V's
Traditional systems struggle to handle volume, velocity, and variety without excessive cost, and they lack efficient mechanisms for processing unstructured and semi-structured data.
Extensive ETL Requirements
Getting data out of source transactional (OLTP) systems and into target data marts requires numerous ETL jobs to extract, transform, and load the data. This process is time-consuming and resource-intensive.
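A minimal ETL sketch makes the extract/transform/load steps concrete. The source records and target schema below are hypothetical, standing in for an OLTP extract feeding a data mart:

```python
import sqlite3

# Extract: records pulled from a hypothetical OLTP source.
source_rows = [
    {"order_id": 1, "customer": "acme corp", "amount": "100.50"},
    {"order_id": 2, "customer": "globex", "amount": "75.00"},
]

def transform(row):
    # Typical ETL transformations: type conversion and normalization.
    return (row["order_id"], row["customer"].title(), float(row["amount"]))

# Load: write the transformed rows into a target data-mart table.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE mart_orders (order_id INTEGER, customer TEXT, amount REAL)")
target.executemany("INSERT INTO mart_orders VALUES (?, ?, ?)",
                   [transform(r) for r in source_rows])

total = target.execute("SELECT SUM(amount) FROM mart_orders").fetchone()[0]
```

In a real warehouse this pipeline is multiplied across dozens of sources and hundreds of transformations, which is where the time and resource cost comes from.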
Inability to Retain Data
High storage costs force older data to be archived to tape, making it inaccessible for future analysis.
How Hadoop Ecosystem and Apache Spark Help
Proven Track Record
The Hadoop ecosystem has matured significantly since its inception in 2005 and has proven itself capable of handling the 3 V's. Extensive research has gone into distributed file systems, storage formats, distributed computing, SQL-on-Hadoop engines, and security.
Apache Spark Advantages
Apache Spark is a distributed computing framework that runs over data in HDFS or Amazon S3 and performs fast, in-memory analytics. Spark can be 10x to 100x faster than MapReduce, even when all the data cannot fit in cluster memory.
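Spark expresses analytics as chained transformations over distributed collections, followed by an action that materializes a result. The pure-Python sketch below mimics that style to show the shape of the programming model; it is an illustration only, not actual PySpark code, and the log lines are invented:

```python
from functools import reduce

# Illustrative input: in real Spark this would be an RDD/DataFrame
# partitioned across a cluster and cached in memory between steps.
log_lines = [
    "2015-06-01 ERROR disk full",
    "2015-06-01 INFO started",
    "2015-06-02 ERROR timeout",
]

# Transformations: filter, then map -- lazy in Spark, lazy here too
# (filter/map return iterators that do no work until consumed).
errors = filter(lambda line: "ERROR" in line, log_lines)
dates = map(lambda line: line.split()[0], errors)

# Action: count errors per date (akin to Spark's reduceByKey).
counts = reduce(lambda acc, d: {**acc, d: acc.get(d, 0) + 1}, dates, {})
```

Spark's speedup comes from keeping intermediate results like `errors` in memory across the cluster instead of writing them to disk between stages, as MapReduce does.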
Storage Solutions
- HDFS and Amazon S3: Resilient, distributed file systems scaling from terabytes to petabytes/exabytes
Storage Formats
- Columnar Formats: Special-purpose efficient formats like Parquet enable fast analytics queries
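The toy sketch below shows why columnar layouts suit analytics: each column is stored contiguously, so a query like SUM over one column never touches the others. The data and names are hypothetical, and real formats like Parquet add encoding, compression, and per-column statistics on top of this idea:

```python
# Row-oriented layout: each record stored together.
rows = [
    {"user": "a", "country": "US", "spend": 10.0},
    {"user": "b", "country": "DE", "spend": 20.0},
    {"user": "c", "country": "US", "spend": 5.0},
]

# Columnar layout: each column stored contiguously.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An analytics query such as SUM(spend) reads only one column's data,
# which is why columnar formats scan far less I/O than row storage.
total_spend = sum(columns["spend"])
```

For wide tables with hundreds of columns, scanning one column instead of every full row is the main source of the speedup.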
SQL-on-Hadoop Engines
Distributed query engines providing low-latency interactive SQL:
- Cloudera's Impala
- Spark SQL
- Apache Drill
These engines are steadily closing the speed gap with relational databases. JDBC/ODBC connectivity allows existing BI tools (OBIEE, Tableau, MicroStrategy, Qlik) to connect to and query big data.
Modernizing BI Infrastructure
While traditional data warehouses will remain relevant for the near future, the Hadoop ecosystem, Spark, and NoSQL databases help modernize BI infrastructure by providing:
- Ability to store and analyze new types of unstructured data
- High-volume, high-velocity data processing
- Cost-effective solutions for large-scale analytics
- Integration capabilities with existing BI tools
Conclusion
Organizations looking to modernize their DW/BI infrastructure should explore Hadoop and Spark technologies. These solutions complement rather than replace traditional EDW systems, enabling organizations to handle modern data challenges while preserving existing investments.
Modernization opportunities abound across big data analytics technologies, including Hadoop, Spark, NoSQL databases, machine learning, and predictive analytics.