As Data volumes grow to new, unprecedented levels, new tools and techniques are coming into picture to handle this growth. One of the fields that evolved is Data Lakes. In this post we'll take a look at the story of evolution of Data Lakes and how modern Data Lakes like Iceberg, Delta Lake are solving important problems.
Traditionally Data Warehouse tools were used to drive business intelligence from data. Industry then recognized that Data Warehouses limit the potential of intelligence by enforcing schema on write. It was clear that all the dimensions of data-set being collected could not be thought of at the time of data collection. This meant, enforcing a schema or dropping few entries because they look useless would potentially harm the business intelligence in long run. Additionally Data Warehouse technology was just not able to keep up with the pace of data growth. Since Data Warehouses were generally based on databases and structured data formats, it proved insufficient for the challenges that the current data driven world faces.
This led to the advent of Data Lakes that are optimized for unstructured and semi-structured data, can scale to PetaBytes easily and allowed better integration of a wide range of tools to help businesses get the most out of their data.
Data Lake is a generally overloaded term with different meanings in different contexts, but there are few important properties that are consistent across all Data Lake definitions:
- Support for unstructured and semi-structured data.
- Scalability to PetaBytes and higher.
- SQL like interface to interact with the stored data.
- Ability to connect various analytics tools as seamlessly as possible.
- Finally, modern data lakes are generally a combination of decoupled storage and analytics tools.
Last few years saw rise of Hadoop as the defacto Big Data platform and its subsequent downfall. Initially, HDFS served as the storage layer, and Hive as the analytics layer. When pushed really hard, Hadoop was able to go up to few 100s of TBs, allowed SQL like querying on semi-structured data and was fast enough for its time.
Later on, data volumes grew to new scales and the demands of businesses became more ambitions, i.e. users now expected faster query times, better scalability, ease of management and so on. This is when Hive and HDFS started to make way for new and better technology platforms.
To address Hadoop's complications and scaling challenges, Industry is now moving towards a disaggregated architecture, with Storage and Analytics layers very loosely coupled using REST APIs. This makes each layer much more independent (in terms of scaling and management) and allows using the perfect tool for each job. For example, in this disaggregated model, users can choose to use Spark for batch workloads for analytics, while Presto for SQL heavy workloads, with both Spark and Presto using the same backend storage platform.
This approach is now rapidly becoming the standard. Commonly used Storage platforms are object storage platforms like AWS S3, Azure Blob Storage, GCS, Ceph, MinIO among others. While analytics platforms vary from simple Python & R based notebooks to Tensorflow to Spark, Presto to Splunk, Vertica and others.
The new model of disaggregation, using right tools for each job is generally better at scalability and ease of management. This approach fits very well architecturally as well. But there are data consistency and management challenges that still need to be fixed.
- File or Tables: Disaggregated model means the storage system sees data as a collection of objects or files. But end users are not interested in the physical arrangement of data, they instead want to see a more logical view of their data. RDBMS databases did a great job in making this abstraction. With Big Data platforms taking shape as the data platform of future, it is now expected of these systems to behave in a user friendly way, i.e. don't enforce users to know anything about the physical storage.
- SQL Interface: As explained above, users are no longer willing to consider inefficiencies of underlying platforms. For example, data lakes are now also expected to be ACID compliant, so that the end user doesn't have the additional overhead of ensuring data related guarantees. ACID stands for Atomicity (an operation either succeeds completely or fails, it does not leave partial data), Consistency (once an application performs an operation the results of that operation are visible to it in every subsequent operation), Isolation (an incomplete operation by one user does not cause unexpected side effects for other users), and Durability (once an operation is complete it will be preserved even in the face of machine or system failure).
- Change management: Another very important aspect of managing data at this scale is the ability to rollback and see what changed when and examine specific details. Currently this may be possible using version management of object store, but that as we saw earlier is at a lower layer of physical detail which may not be useful at higher, logical level. It is now an expectation from users to see versioning at logical layer.
The usability challenges along with data consistency requirements led to a fresh new category of software projects. These projects sit between the storage and analytical platforms and offer strong ACID guarantees to the end user while dealing with the object storage platforms in a native manner. Let us take a high level look at some of these projects and see how they compare to each other.
Delta Lake is an open-source platform that brings ACID transactions to Apache Spark™. Delta Lake is developed by Spark experts, Databricks. It runs on top of your existing storage platform (S3, HDFS, Azure) and is fully compatible with Apache Spark APIs. Specifically it offers:
- ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
- Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
- Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
- Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
- Upserts and deletes: Supports merge, update and delete operations to enable complex usecases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
Apache Iceberg is an open table format for huge analytic data sets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table. Iceberg is focussed towards avoiding unpleasant surprises, helping evolve schema and avoid inadvertent data deletion. Users don’t need to know about partitioning to get fast queries.
- Schema evolution supports add, drop, update, or rename, and has no side-effects
- Hidden partitioning prevents user mistakes that cause silently incorrect results or extremely slow queries
- Partition layout evolution can update the layout of a table as data volume or query patterns change
- Time travel enables reproducible queries that use exactly the same table snapshot, or lets users easily examine changes
- Version rollback allows users to quickly correct problems by resetting tables to a good state
The Apache Hive data warehouse software has been around for a while. With new challenges coming up, Hive is now trying to address consistency and usability. It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. Transactions with ACID semantics have been added to Hive to address the following use cases:
- Streaming ingest of data: Tools such as Apache Flume, Apache Storm, or Apache Kafka that they use to stream data into their Hadoop cluster. While these tools can write data at rates of hundreds or more rows per second, Hive can only add partitions every fifteen minutes to an hour. Adding partitions more often leads quickly to an overwhelming number of partitions in the table. These tools could stream data into existing partitions, but this would cause readers to get dirty reads (that is, they would see data written after they had started their queries) and leave many small files in their directories that would put pressure on the NameNode. Hive now supports this use case i.e. allowing readers to get a consistent view of the data and avoiding too many files.
- Slow changing dimensions: In a typical star schema data warehouse, dimensions tables change slowly over time. For example, a retailer will open new stores, which need to be added to the stores table, or an existing store may change its square footage or some other tracked characteristic. These changes lead to inserts of individual records or updates of records (depending on the strategy chosen). Starting with 0.14, Hive is able to support this.
- Data restatement: Sometimes collected data is found to be incorrect and needs correction. Starting with Hive 0.14 these use cases can be supported via INSERT, UPDATE, and DELETE.
- Bulk updates using SQL MERGE statement.
Here is a high level comparison of the tools we reviewed above: