Time to move data analytics to the cloud: do you go for a data warehouse or a data lake solution? Read about the strengths and weaknesses of the two approaches.
Cloud environments offer benefits such as scalability, availability, and reliability. In addition, cloud providers have plenty of native components to build on. There is also a wide selection of third-party tools, some specifically designed for the cloud, available via cloud marketplaces.
Tools naturally tend to emphasize their own role in the analytics ensemble. This is often confusing when you are trying to select the best toolset. In this post, we go through general guidelines on the strengths and weaknesses of many of the tools.
This is the first post of a three-part series where we evaluate differences in the basic approaches, or paradigms, of data warehouse and data lake based solutions.
Data Lakes and Warehouses Part 1: Intro to Paradigms
Data Lakes and Warehouses Part 2: Databricks and Snowflake
Data Lakes and Warehouses Part 3: Microsoft Azure Point of View (to be published)
Based on some major component choices, cloud analytics solutions can be divided into two categories: data lakes and data warehouses. Simply put, data warehouse solutions are traditionally centralized while data lake solutions are decentralized to the core. Both approaches have their strengths and are often used for slightly different purposes. Nowadays, products commonly have features typical of both categories. Even then, a product still tends to reflect its original category and point of view.
Let us call this basic category approach a paradigm. Understanding the basic philosophies of paradigms helps in understanding the big picture.
In this post, we dig deeper into the characteristics and differences of the paradigms. We start by dividing analytics platforms into typical component stages. After this, we discuss ways to select the components from the point of view of each paradigm.
In the next posts of the series, we will discuss how the paradigm can be seen in some popular products.
Data analytics platforms are usually divided into multiple stages based on the part of the process they cover. A typical batch data pipeline platform is shown in the figure above. However, the analysis in this article also applies to real-time platforms. The tools can be categorized from either a processing (pictured in green) or a storage (blue) perspective. The tool lines below correspond to their usability in different stages of the platform.
For example, a typical data lake solution consists of separate processing and storage tools. In the case of data warehouses, a single solution usually takes care of both the processing and the storage functionalities. Let us clarify the picture a bit more.
From a processing (green) perspective, the data platform stages are:
Moreover, the current trend in the big data world is to store data in multiple layers according to the level of processing applied. The layers of data storage (blue) usually include at least bronze, silver, and gold.
The exact scope of each data storage layer varies from source to source, but the details are irrelevant here. However, it is important to note that, especially in the silver and gold layers, data can be stored more than once. For example, gold layers typically offer multiple versions of the data for different use scenarios.
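To make the layering concrete, here is a minimal sketch in Python using only the standard library. A local folder stands in for cloud object storage, and the file names, record fields, and cleaning rules are all hypothetical choices made for this illustration, not a prescribed layout. It shows raw records kept untouched in bronze, a cleaned copy in silver, and two gold versions of the same data for different consumers.

```python
import csv
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical file names and record fields, chosen only for this sketch.
BRONZE_FILE = "bronze/orders.jsonl"
SILVER_FILE = "silver/orders.csv"


def write_bronze(lake: Path, raw_lines: list[str]) -> None:
    """Bronze: keep the source records exactly as they arrived."""
    path = lake / BRONZE_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(raw_lines) + "\n", encoding="utf-8")


def build_silver(lake: Path) -> None:
    """Silver: parse and normalize records, skipping those that cannot be parsed."""
    rows = []
    for line in (lake / BRONZE_FILE).read_text(encoding="utf-8").splitlines():
        try:
            record = json.loads(line)
            rows.append({
                "customer": str(record["customer"]).strip().lower(),
                "amount": float(record["amount"]),
            })
        except (ValueError, KeyError, TypeError):
            continue  # the raw line still exists in bronze for later inspection
    path = lake / SILVER_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["customer", "amount"])
        writer.writeheader()
        writer.writerows(rows)


def build_gold(lake: Path) -> tuple[dict[str, float], dict[str, int]]:
    """Gold: two versions of the same silver data for different use scenarios."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    with (lake / SILVER_FILE).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            totals[row["customer"]] += float(row["amount"])
            counts[row["customer"]] += 1
    gold = lake / "gold"
    gold.mkdir(parents=True, exist_ok=True)
    for name, data in (("revenue_by_customer", totals), ("orders_by_customer", counts)):
        with (gold / f"{name}.csv").open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["customer", "value"])
            writer.writerows(sorted(data.items()))
    return dict(totals), dict(counts)
```

Calling `write_bronze`, `build_silver`, and `build_gold` in that order against a temporary directory (for example one created with `tempfile.mkdtemp()`) produces all three layers. Note that the unparseable input is dropped from silver but remains available in bronze.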
Traditionally, data analytics platforms were solutions for company reporting purposes. For this use case, data warehouses based on relational databases were the de facto standard. However, data warehouses were not very suitable for processing new kinds of data, often called big data. The problem was due to data volumes, real-time requirements, and type diversity, which included unstructured and semi-structured data. To complement the toolset, data lake type solutions were developed during the last decade or so.
According to a very broad definition in Wikipedia, a data lake is a solution where data can be stored in its original form. In general, this means potentially infinite storage capacity for any file format. In practice, the term also covers the tools for processing the stored data.
There is a tendency in the market to showcase a product as a “holistic data lake solution.” Usually the claim is technically correct: in theory, even a virtual machine with a large hard drive would enable a capable coder to create a data lake solution. Naturally, this kind of minimalist definition is not very useful.
Instead, it makes more sense to consider the differences of the paradigms: basic principles of data warehouse and data lake based solutions.
For the data warehouse paradigm, the basic approach is to offer a centralized product that enables data to be stored in an organized hierarchical structure, usually in the form of database tables. This solution includes such things as foreign key references between tables, granular data encryption, and detailed user access management. Access to the data is primarily handled through a specific data warehouse product, typically using SQL.
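As a rough illustration of this centralized model, the following sketch uses Python's built-in `sqlite3` module as a stand-in for a warehouse product. SQLite is of course not a cloud data warehouse, and the table, column, and view names are hypothetical. Real warehouse products differ in detail, and some treat foreign keys as informational rather than enforced. The point is only to show tables with a foreign key reference, a rule on allowed values, and a SQL view as the curated interface offered to users.

```python
import sqlite3


def create_warehouse() -> sqlite3.Connection:
    """Create an in-memory database with related tables and a curated view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE sales_order (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            amount      REAL NOT NULL CHECK (amount >= 0)
        );
        -- Users query the view, so the underlying tables can change behind it.
        CREATE VIEW revenue_by_customer AS
            SELECT c.name, SUM(o.amount) AS revenue
            FROM customer c JOIN sales_order o USING (customer_id)
            GROUP BY c.name;
    """)
    return conn


def add_customer(conn: sqlite3.Connection, customer_id: int, name: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO customer VALUES (?, ?)", (customer_id, name))


def try_insert_order(conn: sqlite3.Connection, order_id: int,
                     customer_id: int, amount: float) -> bool:
    """Return True if the order was accepted, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO sales_order VALUES (?, ?, ?)",
                         (order_id, customer_id, amount))
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch, an order that refers to an unknown customer or carries a negative amount is rejected by the database itself, regardless of which tool submitted it. That is the kind of guarantee a centralized product can give.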
The advantage of a data warehouse paradigm is the ability to define what data and format are provided to a user. In general, data is offered in a processed and clean format. This way we can guarantee the validity of the data, for instance. In addition, changes in source systems and data can be hidden from a user, at least to some extent.
On the other hand, a limitation is the reliance on a single product vendor. For example, retrieving data from a data warehouse solution is only possible in ways supported by the product. Moreover, we need to pay, in one way or another, for retrieval of the data. The data warehouse solution may also become a resource bottleneck for data processing. Recently, there has been significant progress in solving this last limitation.
The core principle of the data lake paradigm is decentralization of responsibility. With a huge ensemble of tools, anyone, within the limits of access management, can use data in any data layer: bronze, silver, and gold. It is possible to organize data and table relationships, but their use is usually not enforced, and they can easily be bypassed.
A major advantage of a data lake solution is decentralization of both computation and processing tools. A data scientist can work on Python image analysis on their own machine with bronze layer data, a data engineer can modify silver layer data using Apache Spark, and an analyst may utilize gold layer data with a reporting tool. SQL is typically available as one of the possibilities. Because computation is decentralized, there are virtually no shared resource bottlenecks.
A major weakness of a data lake paradigm solution is the lack of data organization, including a centralized metadata repository. It may be extremely difficult to track whether processed data has changed due to error corrections or source system modifications. Moreover, the validity or structure of data cannot always be guaranteed. Centralized data lake metadata management tools are increasingly available, but it is up to development processes to take them into use. Technology rarely forces this.
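The tracking problem can be illustrated with a deliberately tiny sketch of what a metadata catalog does. This is not how any particular metadata product works; the catalog here is just a Python dictionary, and the function names are hypothetical. Each registered file gets a content fingerprint, so a later silent correction to the file can at least be detected.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return a SHA-256 digest of the file content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def register(catalog: dict[str, dict[str, str]], path: Path, layer: str) -> None:
    """Record which layer a file belongs to and what it contained at registration."""
    catalog[str(path)] = {"layer": layer, "sha256": fingerprint(path)}


def changed_files(catalog: dict[str, dict[str, str]]) -> list[str]:
    """Return registered files that are missing or no longer match the catalog."""
    changed = []
    for name, meta in catalog.items():
        path = Path(name)
        if not path.exists() or fingerprint(path) != meta["sha256"]:
            changed.append(name)
    return sorted(changed)


# Small demonstration in a temporary folder standing in for the lake.
with tempfile.TemporaryDirectory() as folder:
    silver = Path(folder) / "orders.csv"
    silver.write_text("customer,amount\nalice,10.5\n", encoding="utf-8")
    catalog: dict[str, dict[str, str]] = {}
    register(catalog, silver, layer="silver")
    unchanged = changed_files(catalog)  # nothing has been modified yet
    silver.write_text("customer,amount\nalice,11.5\n", encoding="utf-8")  # silent correction
    modified = changed_files(catalog)  # the corrected file is now reported
```

The sketch only works if every writer remembers to call `register`, which mirrors the point above: the tooling exists, but the process has to require its use.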
In this post, we went through differences in the basic approaches, or paradigms, of data warehouse and data lake based solutions. Data warehouse based solutions are typically centralized while data lake solutions are decentralized to the core. However, tools in both categories are developing and the division is becoming less and less clear. Yet, understanding the paradigm approach helps in understanding the big picture.
In principle, you can build a cloud data analytics platform purely on either a data lake or a data warehouse based solution.
I have seen fully functional platforms that are based heavily on data lake tools. In these cases, information may be served with use case specific database data marts without a data warehouse at all.
On the other hand, there are successful solutions where the entire platform is built on top of a data warehouse product. The data is read directly into the data warehouse, where it is processed and served.
However, because of the differences explained here, a solution based on one of the paradigms is not necessarily optimal in all cases. Their strengths and basic philosophies are different. It may make sense to take advantage of a data lake based approach in the early stages when working on bronze and silver level data. The data can then be loaded into a data warehouse for further organizing into silver and gold data. This way, all data is available both in raw format for rapid experimentation and in structured format for reporting.
That way, we can draw from the strengths of both approaches.
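A minimal sketch of such a combined flow, under heavy simplifying assumptions: a list of raw JSON strings stands in for bronze data in a lake, plain Python stands in for the lake-side processing tool, and an in-memory SQLite database stands in for the warehouse. The record fields and the table and view names are hypothetical.

```python
import json
import sqlite3
from collections.abc import Iterable, Iterator


def clean(raw_lines: Iterable[str]) -> Iterator[tuple[str, float]]:
    """Lake-side step: free-form Python over raw bronze records."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield str(record["customer"]).strip().lower(), float(record["amount"])
        except (ValueError, KeyError, TypeError):
            continue  # unparseable records stay in the raw data for later study


def load_into_warehouse(rows: Iterable[tuple[str, float]]) -> sqlite3.Connection:
    """Warehouse-side step: structured storage and a SQL view for reporting."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE silver_orders (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    with conn:  # commit the load as one transaction
        conn.executemany("INSERT INTO silver_orders VALUES (?, ?)", rows)
    conn.execute(
        "CREATE VIEW gold_revenue AS "
        "SELECT customer, SUM(amount) AS revenue "
        "FROM silver_orders GROUP BY customer"
    )
    return conn
```

The raw records remain available for experimentation, while report users only see the `gold_revenue` view through SQL.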
Timo is a cloud big data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies of storing, organizing and querying data efficiently in cloud environments. He has worked in big data roles both as a consultant and in-house.