Data lakes and warehouses part 3: Azure Synapse point of view

Time to move data analytics to the cloud. We discuss where Azure Synapse sits on the scale between the data lake and data warehouse paradigms.

Timo Aho / October 25, 2021

In this article, we discuss the Azure Synapse Analytics framework by Microsoft. Specifically, we focus on how the distinction between the data warehouse and data lake paradigms shows up in it.

To familiarize yourself with the topic, I suggest you read the previous articles in this series first.

Data lakes and warehouses part 1: Intro to paradigms

Data lakes and warehouses part 2: Databricks and Snowflake

Data lakes and warehouses part 3: Azure Synapse point of view

Data lakes and warehouses part 4: Challenges

We now consider a newer solution that takes a slightly different angle on the topic: the Microsoft Azure Synapse Analytics environment. As a matter of fact, the motivation for this post was the number of questions along the lines of “Should we take Snowflake, Databricks or Synapse?” After this post, I hope you understand why the question is difficult to answer.

Azure Synapse phases

Azure Synapse collects multiple products under the same umbrella

In the previous posts, we noted that a data analytics platform can be divided into stages. In the picture above, green marks a processing tool and blue a storage tool. We can see how the Azure Synapse environment covers both processing and storage. For the other products mentioned, please check the previous posts.

To be exact, Synapse is not a single product but a framework that offers a set of tools as components. This way, we have multiple cloud data products under a single brand and interface, covering all the stages of a cloud big data analytics platform. Moreover, the Synapse environment offers tools for both data warehouse building and data lake development.

Now, the first question is whether we gain anything from rebranding multiple tools. Why not use the tools separately? Personally, I am starting to think the Synapse umbrella product makes sense. We will come back to the question later. Let us first start with an overview of the Azure Synapse environment.

Azure Synapse components

Let us briefly go through the Azure Synapse Analytics environment as I understand it. The platform can be described as having the following components:

  • A graphical ELT/ETL tool, named Pipelines, for data ingestion and processing. In practice, the component is the same as the older Azure Data Factory service.
  • A Dedicated SQL pool data warehouse for data structuring. Related to this, Microsoft made a blunder when launching Synapse: initially, this component alone was presented as the whole of Synapse. I still run into the misunderstanding that Synapse is just a new name for the data warehouse.
  • A programming-language-based Apache Spark pool and a Serverless SQL pool for querying and processing data in the cloud. These components are new and available only in the Synapse environment.

In addition to these, the environment offers the following functionality shared between the components:

  • A centralized graphical workspace user interface which gives access to all the tools
  • Light visualization capabilities and integration with Power BI reporting
  • A common data lake table schema repository usable in all the tools
  • A natural connection to Azure Data Lake Storage Gen2 cloud storage service and Azure AD permission management

As far as I know, this kind of overall framework is unique and not yet offered by any other cloud provider.

So, what is new for analytics?

Some of the tools, especially Data Factory and the data warehouse, were already available before the Synapse environment. Thus, they do not bring new value as such. It might very well make sense to use these components separately, without the full framework.

However, the Serverless SQL pool, for instance, is a great new addition to the Azure big data offering. It is an SQL querying tool available as a service: you do not need to build any infrastructure. It is available right away, and you pay by usage. The closest comparison point would be the Athena service in the AWS cloud environment. Moreover, the Apache Spark pool is a tool that could be briefly described as a light version of Databricks.
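To give a feel for how little setup the Serverless SQL pool needs, here is a minimal Python sketch that only builds the text of a query reading files directly from the data lake with OPENROWSET. The helper name build_openrowset_query and the storage URL are hypothetical and made up for illustration. The sketch does not connect to Synapse or execute anything.

```python
def build_openrowset_query(storage_url: str, file_format: str = "PARQUET", limit: int = 10) -> str:
    """Return a T-SQL query that reads data lake files via OPENROWSET.

    Hypothetical helper for illustration only: it assembles the query
    text and never talks to a Synapse workspace.
    """
    # Only formats that need no extra parser options in this sketch.
    allowed_formats = {"PARQUET", "DELTA"}
    fmt = file_format.upper()
    if fmt not in allowed_formats:
        raise ValueError(f"Unsupported format in this sketch: {file_format}")
    if limit <= 0:
        raise ValueError("limit must be a positive integer")

    # Escape single quotes so the URL stays a valid T-SQL string literal.
    escaped_url = storage_url.replace("'", "''")

    return (
        f"SELECT TOP {int(limit)} *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{escaped_url}',\n"
        f"    FORMAT = '{fmt}'\n"
        f") AS [result];"
    )


if __name__ == "__main__":
    # Assumed example location; replace with your own storage account and path.
    print(build_openrowset_query(
        "https://examplelake.dfs.core.windows.net/raw/sales/*.parquet",
        limit=5,
    ))
```

The point is that the query addresses the files where they already are: there is no cluster to size and no table to load first, which is what makes the pay-by-usage model possible.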

Conclusions – Tool packaging helps

All in all, do we gain something with the Synapse framework? I must admit I was initially really sceptical about this. However, after getting some experience, my personal answer would be affirmative, to some extent at least. First, there is real integration between the components. For instance, it is possible to define common relational-database-style tables that are accessible from multiple tools.

Second, having a single workspace as a graphical user interface is beneficial. Typically, you need quite a wide understanding of cloud big data components when building a new analytics platform. With Synapse, they are easily available as a package. This helps new developers to start working and might also help in handling the security of the overall solution. Thus, I would say the Synapse framework has been quite a successful investment for Microsoft, at least from a technology perspective.

An interesting detail arises when we go back to the data warehouse and data lake paradigm distinction presented in the first post of the series. From a cost point of view, the two paradigms can be seen in the Synapse environment components. Except for the Synapse Dedicated SQL pool data warehouse, all the processing components are paid for by usage, as is typical of the data lake paradigm. These pay-by-usage tools even shut down automatically. Thus, if you try the Synapse environment, remember to shut down the data warehouse yourself to stop it from accumulating costs. The other components take care of themselves.
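The cost difference between the two paradigms can be sketched with some simple arithmetic. The Python snippet below is only an illustration: the function names are hypothetical and the rates are placeholder numbers, not actual Azure prices. It assumes the data warehouse is charged for every hour it is running and the serverless querying for the amount of data processed.

```python
HOURS_IN_MONTH = 30 * 24  # simplification: a 30-day month


def dedicated_pool_cost(hours_running: float, rate_per_hour: float) -> float:
    """Provisioned model: you pay for every hour the warehouse is up."""
    return hours_running * rate_per_hour


def serverless_cost(tb_processed: float, rate_per_tb: float) -> float:
    """Pay-by-usage model: you pay only for the data your queries process."""
    return tb_processed * rate_per_tb


if __name__ == "__main__":
    # Placeholder rates for illustration only, NOT real Azure prices.
    rate_per_hour = 1.5
    rate_per_tb = 5.0

    # Warehouse left running all month versus shut down outside
    # office hours (assumed 8 hours on 22 working days).
    always_on = dedicated_pool_cost(HOURS_IN_MONTH, rate_per_hour)
    office_hours = dedicated_pool_cost(8 * 22, rate_per_hour)

    # Serverless querying that processes 2 TB during the month.
    on_demand = serverless_cost(2, rate_per_tb)

    print(f"Warehouse always on:       {always_on:.2f}")
    print(f"Warehouse, office hours:   {office_hours:.2f}")
    print(f"Serverless, 2 TB queried:  {on_demand:.2f}")
```

Whatever the real rates are, the shape of the result stays the same: a warehouse that is forgotten running costs the same whether or not anyone queries it, while the pay-by-usage components cost nothing when idle.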

The Azure Synapse environment is quite unique in the sense that all the relevant big data tools, for both the data lake and the data warehouse, are gathered in the same package. Even if you can use some of them separately, combining them has its advantages.

Timo Aho
Cloud Data Expert, Tietoevry Create

Timo is a cloud data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies of storing, organizing and querying data efficiently in cloud environments. He has worked in big data roles both as a consultant and in-house.