Time to move data analytics to the cloud. We compare Databricks and Snowflake to assess the differences between data lake-based and data warehouse-based solutions.
For background on the topic, please read my previous blog post about the data lake and data warehouse paradigms.
Data Lakes and Warehouses Part 1: Intro to Paradigms
Data Lakes and Warehouses Part 2: Databricks and Snowflake
Data Lakes and Warehouses Part 3: Microsoft Azure Point of View (to be published)
As we learnt in the previous post, a data analytics platform can be divided into multiple stages. Above, we can see a picture giving a general understanding of the roles of Snowflake and Databricks in the pipelines. Here we can categorize the tools as either processing (green) or storage (blue). Databricks is a processing tool, and Snowflake covers both processing and storage. Delta Lake, on the other hand, is a storage solution related to Databricks. We will cover it later.
According to the definitions given in the previous article, we can roughly say that Databricks is a data lake-based tool and Snowflake is a data warehouse-based tool. Let us now dig a bit deeper into these tools.
Databricks is an Apache Spark-based processing tool that provides a programming environment with highly and automatically scalable computing capacity. Apache Spark is the de facto standard programming framework for code-based big data processing.
Databricks billing is essentially usage based: you pay for the computational resources you use and nothing else. In principle, Databricks is particularly suitable for processing data in the early stages of a pipeline, especially between the bronze and silver layers. It can also be used for preparing gold layer data, but it is not at its best in providing data for, say, reporting tools.
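To make the bronze-to-silver step more concrete, here is a minimal sketch in plain Python. In Databricks this kind of logic would normally be written against Spark DataFrames; plain Python is used here only to keep the example self-contained. The record fields (`device_id`, `ts`, `temp_c`) and the function name `bronze_to_silver` are hypothetical and chosen for illustration.

```python
from typing import Dict, Iterable, List


def bronze_to_silver(raw_rows: Iterable[Dict]) -> List[Dict]:
    """Turn raw (bronze) records into cleaned, typed, deduplicated (silver) records.

    Hypothetical cleaning rules, for illustration only:
    - trim whitespace from the device id and drop rows without one
    - parse the temperature to float and drop rows where that fails
    - keep only the first row for each (device_id, ts) pair
    """
    seen = set()
    silver = []
    for row in raw_rows:
        device_id = (row.get("device_id") or "").strip()
        if not device_id:
            continue  # missing identifier
        try:
            temp_c = float(row.get("temp_c"))
        except (TypeError, ValueError):
            continue  # measurement cannot be parsed
        key = (device_id, row.get("ts"))
        if key in seen:
            continue  # duplicate delivery of the same reading
        seen.add(key)
        silver.append({"device_id": device_id, "ts": row.get("ts"), "temp_c": temp_c})
    return silver


bronze = [
    {"device_id": " a1 ", "ts": "2021-01-01T00:00", "temp_c": "21.5"},
    {"device_id": "a1", "ts": "2021-01-01T00:00", "temp_c": "21.5"},  # duplicate
    {"device_id": "b2", "ts": "2021-01-01T00:00", "temp_c": "n/a"},   # unparsable
    {"device_id": "", "ts": "2021-01-01T00:05", "temp_c": "19.0"},    # missing id
]
silver = bronze_to_silver(bronze)
```

Of the four raw rows, only the first survives the cleaning rules, now with a trimmed id and a numeric temperature.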
Recently, Databricks has significantly extended its capabilities in the direction of a traditional data warehouse. Databricks provides a ready-made SQL query interface and a lightweight visualization layer. In addition, Databricks offers a database-like table structure. This database-like functionality is built specifically on the Delta file format.
The Delta file format is an approach for bringing database strengths into the data lake world. Among other things, the format provides data schema versioning and database-style ACID transactions. In accordance with the data lake paradigm, the file format itself is open and free to be used by anyone.
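The idea of versioned, transactional tables on top of plain files can be illustrated with a toy model. The sketch below is not the Delta format or its API; it is a simplified, hypothetical append-only commit log (`ToyTableLog`) in which every commit records the table schema and the data files added or removed. Replaying the log up to a given commit reconstructs that version of the table, and a commit either applies completely or not at all.

```python
class ToyTableLog:
    """Toy append-only commit log. Hypothetical; not the Delta Lake API."""

    def __init__(self):
        self._commits = []  # commit number i describes table version i

    def latest_version(self):
        """Return the newest version number, or -1 for an empty log."""
        return len(self._commits) - 1

    def commit(self, schema, add_files=(), remove_files=()):
        """Validate first, then append one commit record (all or nothing)."""
        current = self.files_at(self.latest_version())
        missing = set(remove_files) - current
        if missing:
            # Nothing has been written yet, so the table is left unchanged.
            raise ValueError(f"cannot remove unknown files: {sorted(missing)}")
        self._commits.append({
            "schema": tuple(schema),
            "add": tuple(add_files),
            "remove": tuple(remove_files),
        })
        return self.latest_version()

    def files_at(self, version):
        """Replay the log up to `version` to find the live data files."""
        files = set()
        for record in self._commits[: version + 1]:
            files -= set(record["remove"])
            files |= set(record["add"])
        return files

    def schema_at(self, version):
        """Return the schema that was recorded with the given version."""
        return self._commits[version]["schema"]


log = ToyTableLog()
v0 = log.commit(["id", "name"], add_files=["part-0.parquet"])
v1 = log.commit(
    ["id", "name", "email"],  # schema change recorded as a new version
    add_files=["part-1.parquet"],
    remove_files=["part-0.parquet"],
)

try:
    log.commit(["id"], remove_files=["does-not-exist.parquet"])
    rejected = False
except ValueError:
    rejected = True  # the invalid commit left no trace in the log
```

Because old commits are never rewritten, both the earlier file set and the earlier schema remain readable after the table has moved on to a newer version.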
Building on the Delta format and the Databricks tool, the company is promoting the notion of a novel “Data Lakehouse” paradigm, a hybrid of the data lake and data warehouse approaches.
Snowflake is a scalable data warehouse solution developed specifically for cloud environments. Snowflake stores data in cloud storage in a proprietary file format. The data is therefore only available through Snowflake, in keeping with the data warehouse paradigm. In addition to computational resources, you also pay for the data storage in the Snowflake file format. In return, you get typical data warehouse features such as granular permission management.
Snowflake disrupted the data warehouse market a few years ago by offering highly distributed and scalable computation capacity. This is done by completely separating the storage and processing layers in the data warehouse architecture. Traditionally, limited scalability has been a major obstacle for data warehouse solutions in the big data world. The separation is one of the ways Snowflake is expanding its solution in the direction of the data lake paradigm. Nowadays it offers, among other things, efficient tools for real-time data ingestion.
It is probably not an overstatement to say that the success of Snowflake caused a crisis in Amazon Redshift and Azure SQL Data Warehouse development. The scalability of the latter two data warehouse solutions was significantly more restricted: if you wanted to avoid high expenses, you needed to choose between small storage capacity and slow processing. Very often, a suitable combination was difficult to find. Thus, you usually paid a significant amount of money for reserve resources you did not actually use. Nevertheless, both products have taken steps towards solving this issue.
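The cost effect of coupling storage and processing can be shown with simple arithmetic. The sketch below compares a cluster whose nodes bundle storage and compute with a setup where the two are billed separately. All figures (node size, prices, workload) are made-up assumptions for illustration and are not actual prices or limits of any product mentioned here.

```python
import math

HOURS_IN_MONTH = 720  # assumption: a 30-day month


def coupled_monthly_cost(storage_tb, compute_units, tb_per_node,
                         units_per_node, node_price_per_hour):
    """Nodes bundle storage and compute, so the larger need sets the node count.

    The cluster runs all month because the data lives on the nodes.
    Returns (cost, idle_compute_units).
    """
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(compute_units / units_per_node)
    nodes = max(nodes_for_storage, nodes_for_compute)
    cost = nodes * node_price_per_hour * HOURS_IN_MONTH
    idle_units = nodes * units_per_node - compute_units
    return cost, idle_units


def separated_monthly_cost(storage_tb, compute_units, busy_hours,
                           storage_price_per_tb_month,
                           compute_price_per_unit_hour):
    """Storage is billed by volume; compute is billed only while it runs."""
    storage_cost = storage_tb * storage_price_per_tb_month
    compute_cost = compute_units * compute_price_per_unit_hour * busy_hours
    return storage_cost + compute_cost


# Hypothetical workload: a lot of data, modest compute, busy 100 hours a month.
coupled_cost, idle_units = coupled_monthly_cost(
    storage_tb=40, compute_units=4,
    tb_per_node=2, units_per_node=1, node_price_per_hour=1.0,
)
separated_cost = separated_monthly_cost(
    storage_tb=40, compute_units=4, busy_hours=100,
    storage_price_per_tb_month=25.0, compute_price_per_unit_hour=3.0,
)
```

With these assumed numbers, the storage need forces the coupled cluster to 20 nodes even though only 4 compute units are required, so 16 units sit idle as paid reserve. In the separated model a higher hourly compute price still yields a lower bill, because compute is paid for only during the busy hours.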
In this post we discussed two very popular multi-cloud data analytics products: Databricks and Snowflake. We specifically studied them from the viewpoint of their background paradigms as discussed in the previous blog post.
We noted that Snowflake has its basis in the data warehouse world, while Databricks is more data lake oriented. However, both have extended their reach beyond the typical limits of their paradigms.
Both tools can definitely be used alone to fulfill the needs of a data analytics platform. Databricks can serve data directly from storage or export data into data marts. There is no need for a separate data warehouse. On the other hand, data can be ingested directly into Snowflake for processing, modeling, and serving. In my experience, pure Snowflake solutions are more common, perhaps because Databricks has not been around for as long.
However, as brought up in the previous post, it might be a good idea to use both products in a single platform. The breakdown of this kind of solution is depicted in the picture, with Databricks reading and processing raw data and Snowflake taking care of the publishing end of a pipeline. It is also important to note that Databricks and Snowflake are collaborating on better integration between the products.
All in all, the future seems even brighter for hybrid solutions.
Timo is a cloud big data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies for storing, organizing and querying data efficiently in cloud environments. He has worked in big data roles both as a consultant and in-house.