noun_Email_707352 noun_917542_cc Map point Play Untitled Retweet Group 3 Fill 1

Data lakes and warehouses part 4: Challenges

All data lakes and warehouses have technical challenges. In this post, I will cover some of them.

Timo Aho / July 02, 2022

In the earlier posts, I went through the two paradigms, data lakes and data warehouses, and how Databricks, Snowflake and Azure Synapse fit into the landscape. I also introduced their strengths. I strongly suggest reading them before this one.

This is the blog series:

Data lakes and warehouses part 1: Intro to paradigms

Data lakes and warehouses part 2: Databricks and Snowflake

Data lakes and warehouses part 3: Azure Synapse point of view

Data lakes and warehouses part 4: Challenges

Then, let's go to the challenges.

They can be divided into three distinct categories:

  1. Challenges related to tools that are third-party cloud applications, not created by the cloud provider like Microsoft or Amazon.
  2. Issues typical for the paradigm.
  3. Tool specific features.

Let us tackle the first two in this blog post. I will use Databricks, Snowflake and Azure Synapse components as examples. However, it is important to note that both the challenge and categories are not limited to them.

I will try to find some solutions for the challenges in future posts.

Scenario

I need to highlight that even if the discussion is generalizable, most of my and my colleagues’ experience is related to Microsoft Azure cloud environment.

To add up, we work in DevOps and continuous delivery type projects. We always have a common version control like Git and code review practices. We also have multiple identical infrastructure environments that are typically configured with infrastructure such as code tools including Terraform. We target continuous deployments to production.

Consequently, graphical user interface tools do not fit into the approach very well. At least, we should be able to export the configuration to a readable format like JSON. For instance, Azure Data Factory supports this.

Third-party cloud application: Network and permissions

The first category is about the challenges related to third-party solutions. From the introduced technologies, this covers evenly both Databricks and Snowflake. There is a wide range of other tools that have the same questions.

Synapse framework, on the other hand, is developed by Azure cloud provider Microsoft. Thus, in this category its position is stronger even if some questions are nontrivial.

First, security. We usually minimize data movement outside the customer’s own cloud environment. We should avoid taking data from the cloud provider network to public internet. Cloud providers have tools like virtual networks and firewalls for this.

Very often the third-party tools run in a separate Azure environment: either in their own account or at least in a separate network architecture. The network configuration is usually more limited compared to native solutions so we might be able to configure secure access for web browser usage but not for the background engine. Sometimes third-party tools even need to be available for public internet connections. This was originally the case for both example technologies, but things are a lot better nowadays.

Related questions are the permissions. In Azure native services, we have a standardized way of working. We can use Azure Active Directory (AAD) users, groups, and role-based access control (RBAC). We can usually synchronize our groups to third-party tools, but the permission model is typically different and must be configured separately.

As explained in the previous posts, the problem might not be as serious for data lake tools due to their simpler permission model. However, data warehouse tools have significantly more granular permission control.

Associated questions relate to getting out logs, error messages and alerts from these tools. We usually need to make some additional configuration to do this.

Paradigm – Apache Spark and Data Lake

In a typical data lake scenario, the major component is based on Apache Spark - a distributed computation framework. This is the case for tools such as Databricks and Synapse Spark pool. Let us make it clear: Apache Spark is not a simple tool to use.

The following example is descriptive. Assume that we have an optimized terabyte level data set <data 1>, which has information about names, ages, and countries of residence.

A typical simple Spark code could include the following phases:

  1. Read the data set <data 1>
  2. Print the number of rows in <data 1>
  3. Sort <data 1> according to age and store the result to data set <data 2>
  4. Take only European people from <data 2>
  5. Print the first ten rows of the <data 2>

The code looks harmless. We could assume that lines 3 and 4, sorting and filtering, would take the most resources to compute. However, instead line 2, printing the number of rows, might explode in the face: it may fill the computer memory or just freeze computation.

The reason is that Spark is a lazy framework and tries to postpone processing as long as possible. Spark might notice that only the ten oldest European people are needed on line 5. Thus, it is enough to find the persons and skip most of the sorting and storage data reading. Nevertheless, printing the row amount on line 2 might force reading all storage data.

On the other hand, if the data in storage were distributed in another fashion, we could end up with the opposite situation. Counting rows might be immediate, and everything else is slow. We need to understand both the Spark framework and the way the stored data is organized.

Paradigm – SQL and Data warehouses

In data warehousing, data is almost exclusively processed with SQL language. Its compatibility with the scenario is, thus, a critical issue. This covers not only data warehouse products like Snowflake and Synapse Dedicated SQL pool, but also, to some extent, SQL based data lake tools like Synapse Serverless SQL pool.

SQL is a declarative query language. Other typically used languages are either imperative or functional. Illustrating the significance needs a bit of effort so bear with me.

With functional and imperative languages, a typical data processing workflow is:

  1. Read data from storage file <file 1>.
  2. Do modifications to the data.
  3. Store the result data to a file <file 2>.
  4. Later, use data in <file 2>.

Assume we noticed a mistake in the modification code on line 2. We repair the code, re-run the workflow, and replace data <file 2> with repaired one. For future development, we store the source data <file 1>and the latest code version.

Nevertheless, SQL language processes the data by either inserting new data, altering values, or deleting in-place. Thus, a typical SQL workflow is:

  1. Do modifications to the data in <data table 1>.
  2. Later, use data in <data table 1>.

Again, let us repair a hypothetical bug on line 1. We need to revert the changes done to <data table 1> and execute the repaired code for it. In practice, we need to re-read the original source data and execute the full processing history for it. We hope we have the processing history stored somewhere. In addition, we hope nobody bypassed the process by making some manual changes or so – most likely a false hope.

To make things even worse, we typically have multiple modules that depend on each other. For example, we might store data warehouse modification code, data mart code and permission management code in different repositories. Interdependent ordering is rarely stored, and we need to painfully re-execute everything manually in guessed order.

The same situation appears if we want to create one more identical infrastructure environment for, say, testing purposes.

To be precise, we can use the functional/imperative style workflow for SQL development by, e.g., having multiple table versions. This is cumbersome and rarely done. Moreover, databases might also support writing non-SQL code via a component called stored procedures. However, they are not meant to replace SQL but add some more functionality on top of it.

Conclusions

We have gone through some major challenges that are common for third-party cloud tools, data lake solutions and data warehouse products in a continuously deployed development environment.

In Azure cloud, third-party tools such as Databricks and Snowflake typically have challenges with configuration of, for instance, network infrastructure, permissions, log collecting and alerts. On the other hand, Azure Synapse framework has strengths in this case.

Many data lake paradigm solutions are at least partially based on a major data lake processing environment Apache Spark. As explained, it is a challenging framework. Efficient optimization needs deep understanding of the framework and the way data is stored.

On the other hand, data warehouses are almost exclusively based on SQL querying language. SQL language data modification workflow is problematic from the continuous deployment point of view. It concentrates on altering data in-place which makes processing lineage hard. Repairing bugs and creating new environments needs significantly more effort than with alternative approaches.

In future posts we will hopefully get some ideas on how to solve the issues. Moreover, we might be able to dig a bit deeper into the more product specific questions.

Data Insiders – Stay in the know

Data changes the world – does your company take full advantage of its benefits? Join Data Insiders, the #1 Nordic data community, a powerful network of top professionals and visionaries of data-driven business.

Data Insiders addresses the trends and phenomena around this hot topic in an understandable and interesting way. Together we share knowledge, offer collegial support and reveal the truth behind hype and buzzwords. We seek the answer to one particular question: how can data help us all to do better business?

Join Data Insiders today and stay at the forefront of the data revolution with access to quality podcasts, peer events and insights.

Timo Aho
Big Data Expert

Timo is a cloud big data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies of storing, organizing and querying data efficiently in cloud environments. He has worked in big data roles both as a consultant and in-house.

Share on Facebook Tweet Share on LinkedIn