I will share tips on how to solve some of the typical challenges associated with cloud data platforms presented in my previous blog.
In parts 1-4 of this blog, I discussed the paradigms of data lakes and data warehouses and compared the characteristics of key technologies such as Databricks, Snowflake and Azure Synapse Analytics.
Previous posts are:
This post digs deeper into technology than the previous ones and might need a bit more technical understanding. Moreover, technologies evolve fast and, thus, the information deprecates significantly faster. I base the learnings on my own and colleagues’ experience which also increases over time.
I have recently worked in Microsoft Azure cloud environment and concentrate mostly on that area. In addition to that, I have a few years of experience of Amazon Web Services.
Let’s start with some background discussion because my suggestions are based on them.
One of my personal principles is to take the minimal version of the ‘bus factor’ into account. We need to prepare for the possibility of a single or many team members getting COVID or winning a jackpot. In practice, it is important to reduce dependency on a single person with a bit of cross-functionality: at least two people should be able to solve any given project task.
This results also in better code quality and technological choices. You will probably get less unpleasant surprises. Moreover, it reduces the stress level by preventing an individual from burning out. It gives you freedom to enjoy vacation without being interrupted.
Some methods are technical and straightforward: documentation and storing everything as code in version control, for instance, Infrastructure as Code – IaC, and having multiple identical environments with continuous deployment DevOps pipelines. However, the more painful ones are nontechnical: code reviewing and pair programming, especially the notorious novice-expert pair programming. In the latter one, an expert monitors and mentors a novice working on code created by the expert.
It is obvious that this way of working is not plausible in all projects such as MVPs or pilots. However, the aim is good to keep in mind.
Infrastructure as Code (IaC) tools enable many things: reviewing, automatic deployment, and copying existing good practices including network, permissions, and security. They also often notify you about risky manual changes that bypass the process.
Nevertheless, IaC might not offer enough benefits in all cases as they are not compatible with all technological components. Moreover, duplicating environments might not be possible in all projects.
There are a few alternatives for IaC and cloud infrastructure management tools. Cloud providers typically offer ways to describe the infrastructure in semi-structured configuration files or even full IaC tools such as Azure Bicep. The tools are still limited to cloud native resources and do not fluently support many third-party solutions, Databricks, Snowflake, and others.
Nowadays, the most used third party IaC tool for cloud infrastructure is Terraform. It uses its own configuration language and is available for a variety of resources. Pulumi is a more recent alternative. It is a programming language-based tool supporting Python, Typescript and Java. If most of the team is comfortable with programming, you should give it a try.
Some non-native tool problems raised in the previous blog post were network configuration, authentication, and authorization.
A major networking problem with both Databricks and Snowflake is the multitude of different layers to handle. These could include the actual data storage, compute environment, API layer, and front-end environment for user access. However, in Azure cloud, there has recently been significant progress. Both Snowflake and Databricks offer Azure Private link service which should allow secure access in most scenarios.
In an Azure environment, we have at least three categories of authentication and authorization management:
In the Azure environment, the first two cases are straightforward: Both Databricks and Snowflake support Active Directory syncing. Thus, you can get Azure native users to the third-party tool and manage the permissions within.
The third question is tricky. The best option would be to use Azure native authentication and authorization for Databricks or Snowflake users. For Azure storage, Databricks offers multiple ways for access, but all of them have either usability or authorization issues. Personally, I prefer the deprecated AD passthrough for data storage or the creation of a separate workspace for each permission group.
In addition, you hopefully store the sensitive information in Azure Key vault. With Databricks you need to trust its internal security model and give it full data access with a common service principal.
With Snowflake you have less options, whether you like it or not. Access to Azure storage is possible only with a common service principal. Moreover, there is no direct Azure Key Vault Integration.
There are some general questions that need to be answered in a cloud data environment.
The obvious one is how to manage tools that are difficult to take into the version control, review, and pipeline deployment. This includes analytics tools such as Power BI or Tableau, and some ETL tools including Azure Data Factory. My suggestion is to use these tools in the most trivial way without adding too much business logic to them. For example, Azure Data Factory can be used to trigger Databricks notebooks or Snowflake SQL procedures.
Another question is how to make sure codebase and IaC configurations are up-to-date, and no manual changes are made. The answer is a standard software engineering solution - a full redeployment of all resources from scratch. This can be done either in a development environment or by creating a separate smoke test sandbox environment. The use of smoke environment also helps with another typical problem, testing. End-to-end data pipeline testing is a lot easier.
In the case of Databricks and Apache Spark, usage of DevOps practices is moderately easy. Git integration and code reviewing are straightforward with the new Repos functionality.
As mentioned in the previous post, a typical problem with Spark is the management of data lake files. Many small Parquet files are very ineffective for Spark processing. As a rule of thumb, try to keep the file size around 100 MB to 1 GB.
Parallel processing with Parquet or ORC files is also problematic. Fortunately, the current improvements to data lake file formats help you. There are a few alternatives, but for Databricks the open Delta file format is a natural choice. Delta offers ACID transactions for parallel processing and handles file size optimization neatly.
As mentioned in the previous blog post, SQL does not fit into DevOps mindset very well. SQL handles both infrastructure, code, and data state in parallel making version control and code reviewing exceptionally important in a multi-developer environment.
Flyway can be described as a simple versioning tool for SQL files. It takes care of SQL not being executed multiple times and allows simple review before migration.
On the other hand, dbt is more versatile. It is based on SQL but includes a macro language which reduces code redundancy and enables minimal testing. From DevOps perspective, it unfortunately misses a critical feature, dry run.
The third alternative, Liquibase, uses its own semi-structured configuration to generate SQL. It is specifically meant for DevOps type development cycle but lacks SQL readability.
Snowflake has some other practical questions to deal with. Accessing Azure storage is a multiple phase process with permissions and service principals. Moreover, it is easy to get lost in granular SQL permission control. Due to these, Snowflake is prone to manual changes. In both cases, one should carefully document these in IaC configuration or SQL. You should also consider redeploying the environment now and again.
Snowflake has some great features for, for instance, data ingestion. However, centralized Azure monitoring might cause problems. The errors might only be stored in a database table.
I have concentrated on Databricks, Snowflake and partially Synapse technologies in my blogs.
However, it is worth noting that the field between data warehouses and data lakes is bubbling. We get more and more hybrid solutions that have both data warehouse and data lake features meaning it is good to keep eyes open.
Here are some examples:
Firebolt, a direct competitor for Snowflake, is a data warehouse paradigm solution allegedly optimizing query performance. Apache Iceberg is a SQL query tool for data lake. Moreover, Apache Hudi and Apache Iceberg are alternatives to Databricks Delta file format. They all allow ACID transactions and incremental update for data lake files.
I have dug deeper into technical questions related to cloud data projects, especially Snowflake and Databricks in Azure cloud. To conclude, DevOps kind of development process with cross-functional teams, code version control, reviewing and continuous deployment are beneficial.
I have also discussed how to put these into use on technical level. This includes Infrastructure-as-Code tools, ways to review Spark or SQL code and so forth.
We live interesting times in cloud data environment. The sea of technology is windy, and waves are high. Have fun surfing!
Data changes the world – does your company take full advantage of its benefits? Join Data Insiders, the #1 Nordic data community, a powerful network of top professionals and visionaries of data-driven business.
Data Insiders addresses the trends and phenomena around this hot topic in an understandable and interesting way. Together we share knowledge, offer collegial support and reveal the truth behind hype and buzzwords. We seek the answer to one particular question: how can data help us all to do better business?
Join Data Insiders today and stay at the forefront of the data revolution with access to quality podcasts, peer events and insights.
Timo is a cloud data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies of storing, organizing and querying data efficiently in cloud environments. He has worked in big data roles both as a consultant and in-house.