Time to move data analytics to the cloud. We discuss where Azure Synapse sits on the scale between the data lake and data warehouse paradigms.
To familiarize yourself with the topic, I suggest you read the previous articles in this series first.
Data Lakes and Warehouses Part 1: Intro to Paradigms
Data Lakes and Warehouses Part 2: Databricks and Snowflake
Data Lakes and Warehouses Part 3: Azure Synapse Point of View
We now consider a newer solution that takes a slightly different angle on the topic. Namely, we will discuss the Microsoft Azure Synapse Analytics environment. As a matter of fact, the motivation for this post was the number of questions along the lines of “Should we take Snowflake, Databricks or Synapse?” After this post, I hope you understand why the question is difficult to answer.
In the previous posts we noted that a data analytics platform can be divided into stages. In the picture above, green marks a processing tool and blue a storage tool. We can see how the Azure Synapse environment covers both processing and storage. For the other products mentioned, please check the previous posts.
To be exact, Synapse is not a single product but a framework that offers a set of tools as components. This way, we have multiple cloud data products under a single brand and interface, covering all the stages of a cloud big data analytics platform. Moreover, the Synapse environment offers tools for both data warehouse building and data lake development.
Now, the first question is whether we gain anything by branding multiple tools again. Why not use the tools separately? Personally, I am starting to think the Synapse umbrella product makes sense. We will come back to the question later. Let us first start with an overview of the Azure Synapse environment.
Let us briefly go through the Azure Synapse Analytics environment as I understand it. The Azure Synapse Analytics platform can be described as having the following components:
In addition to these, the environment offers the following functionalities between the components:
As far as I know, this kind of overall framework is unique: no other cloud provider offers one yet.
Some of the tools, especially Data Factory and the data warehouse, were already available before the Synapse environment. Thus, they do not really bring new value on their own. It might very well make sense to use the components separately without the full framework.
However, the Serverless SQL pool, for instance, is a great new functionality in the Azure big data offering. It is an SQL querying tool available as a service: you do not need to build any infrastructure. It is available right away and you pay by usage. The best comparison point would be the Athena service in the AWS cloud environment. Moreover, the Apache Spark pool is a tool that could briefly be described as a light version of Databricks.
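To make the "no infrastructure" point a bit more concrete, here is a minimal sketch of what a Serverless SQL pool query over files in a data lake could look like. The Python helper only builds the T-SQL text. The function name `build_openrowset_query` and the storage URL are my own illustrative assumptions, and actually running the query would require a Synapse workspace and a database connection, which are left out here.

```python
def build_openrowset_query(parquet_url: str, limit: int = 10) -> str:
    """Return T-SQL text that reads Parquet files in place via OPENROWSET.

    The helper name and the URL passed in are hypothetical; only the
    OPENROWSET(BULK ..., FORMAT = 'PARQUET') pattern is the point here.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Double any single quotes so the URL stays a valid T-SQL string literal.
    safe_url = parquet_url.replace("'", "''")
    return (
        f"SELECT TOP {limit} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


if __name__ == "__main__":
    # Hypothetical storage account and path, for illustration only.
    url = "https://example.dfs.core.windows.net/data/sales/*.parquet"
    print(build_openrowset_query(url, limit=5))
```

The query reads the files where they already are. No table needs to be loaded first, which is what makes the tool feel closer to the data lake paradigm than to a traditional warehouse.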
All in all, do we gain something with the Synapse framework? I must admit I was initially really sceptical about this. However, after getting some experience, my personal answer would be affirmative, to some extent at least. First, there is real integration between the components. For instance, it is possible to define common relational-database-style tables that are accessible from multiple tools.
Second, having a single workspace as a graphical user interface is beneficial. Typically, you need quite a wide understanding of cloud big data components when building a new analytics platform. With Synapse, they are easily available as a package. This helps new developers get started and might also help in handling the security of the overall solution. Thus, I would say the Synapse framework has been quite a successful investment for Microsoft, at least from a technology perspective.
An interesting detail arises when we go back to the data warehouse and data lake paradigm distinction presented in the first post of the series. From an expenses point of view, the two paradigms can both be seen in the Synapse environment components. Except for the Synapse Dedicated SQL pool data warehouse, all the processing components are paid for by usage, as is typical of the data lake paradigm. All of these pay-per-use tools even shut down automatically. Thus, if you try the Synapse environment, remember to shut down the data warehouse yourself to stop it from accumulating expenses. The other components take care of themselves.
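The difference between the two billing styles can be sketched with a few lines of arithmetic. Note that every price below is a made-up placeholder, not an actual Azure rate, and the model only covers compute: it assumes a pay-per-use component is billed per terabyte processed and a dedicated pool per hour it is running.

```python
HOURS_PER_MONTH = 730  # common approximation for one month of wall-clock time


def pay_per_use_cost(tb_processed: float, price_per_tb: float) -> float:
    """Compute cost of a usage-billed component: nothing processed, nothing paid."""
    return tb_processed * price_per_tb


def dedicated_pool_cost(hours_running: float, price_per_hour: float) -> float:
    """Compute cost of a dedicated pool: billed for every hour it is running."""
    return hours_running * price_per_hour


if __name__ == "__main__":
    # Hypothetical placeholder prices, NOT real Azure pricing.
    price_per_tb = 5.0
    price_per_hour = 1.5

    # A small experiment: 2 TB scanned by queries during the month.
    print("pay per use:", pay_per_use_cost(2.0, price_per_tb))
    # The same month with the warehouse left running the whole time ...
    print("dedicated, never paused:", dedicated_pool_cost(HOURS_PER_MONTH, price_per_hour))
    # ... versus paused whenever it was not needed (40 hours of actual use).
    print("dedicated, paused when idle:", dedicated_pool_cost(40, price_per_hour))
```

With these placeholder numbers, the forgotten warehouse costs many times more than the actual work did, which is exactly why it is the one component you need to remember to shut down.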
The Azure Synapse environment is quite unique in the sense that all the relevant data lake and data warehouse tools for big data are gathered in the same package. Even if you can use some of them separately, combining them has its advantages.
Timo is a cloud big data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies for storing, organizing and querying data efficiently in cloud environments. He has worked in big data roles both as a consultant and in-house.