Today, massive amounts of data are being created with ever-increasing speed. This data has a lot of value if it is understood. To analyze these enormous amounts of data, a new kind of Big Data infrastructure is needed. The infrastructure is best utilized in scalable cloud environments, and we call this modern version a Cloud Data Platform.
Complex and abstract things, such as vast computer systems, are often described by means of analogous real-world examples. For Cloud Data Platforms, the real world offers a nice analogy: a cloud data platform is essentially a (data) logistics network. Suppliers are the various information systems generating data.
Distributors, in turn, are the data pipelines consisting of integrations and automated data transformation tasks delivering the data to stores ready for consumption. Stores are the data warehouses or Data APIs that serve the (data) products. They can be browsed through a catalogue — a data catalogue.
Finally, the consumers are the data reporting tools and solutions, various applications or data analysts and scientists who use data to train machine learning models, for example.
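To make the analogy concrete, here is a minimal sketch of the four roles in Python. Everything in it is an assumption made for illustration: the names `DataStore`, `supplier_orders`, `distribute` and `consume_revenue_by_country` are hypothetical, and the sample order records are made up. A real platform would use managed integration, storage and catalogue services rather than in-memory dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class DataStore:
    """A 'store' in the analogy: holds data products ready for consumption."""
    products: dict = field(default_factory=dict)

    def catalogue(self):
        # The data catalogue: a browsable list of what the store offers.
        return sorted(self.products)


def supplier_orders():
    """A 'supplier': a source system emitting raw records (made-up sample data)."""
    return [
        {"order_id": 1, "amount": "19.90", "country": "fi"},
        {"order_id": 2, "amount": "5.10", "country": "se"},
        {"order_id": 3, "amount": "42.00", "country": "fi"},
    ]


def distribute(raw_records, store):
    """A 'distributor': a pipeline that cleans raw records and delivers them to a store."""
    cleaned = [
        {
            "order_id": record["order_id"],
            "amount": float(record["amount"]),      # text -> number
            "country": record["country"].upper(),   # normalize country codes
        }
        for record in raw_records
    ]
    store.products["orders"] = cleaned


def consume_revenue_by_country(store):
    """A 'consumer': a report that reads the finished data product from the store."""
    totals = {}
    for row in store.products["orders"]:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


store = DataStore()
distribute(supplier_orders(), store)
revenue = consume_revenue_by_country(store)
```

The point of the sketch is the separation of roles: the consumer never touches the raw supplier data, only the cleaned product it finds in the store.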
Ok, sounds understandable — hopefully. But what has happened in the industry to make cloud data platforms possible? What has happened that was not possible before? Well, let’s get into it.
First of all, we’re working in an industry where probably three of the biggest megatrends in the whole IT industry meet — that is cloud, big data and AI. It is an industry where the world’s largest companies fight fiercely and where 18-month-old technology may be deprecated.
The speed of change and development is just massive. It is also an industry where enormous open source platforms reside in symbiosis with the vendor-specific offering.
In fact, big companies commit to open source projects with hundreds of dedicated developers. A good example of this is Apache Spark. Due to this, the business logic is somewhat different from the traditional software product industry. It would be pure stupidity to compete with the largest players in the game, so the key is to be able to use the provided technology and continuously adapt to change.
The reason why these companies are interested in putting such a huge effort into big data and AI technology development and tools is, in the end, very clear. These companies eventually get a lot of revenue from the usage of their cloud. The best services and tools attract users, and really the only way to create those great services for big data is by scaling.
Managing and analyzing big data is all about scaling. Within this area massive leaps in frameworks have been taken during the last ten years. Distributed file systems have enabled us to manage massive amounts of unstructured data, Hadoop Distributed File System (HDFS) being the game changer initially.
The data lakes that lie at the center of data platforms are essentially distributed file systems. The management and control, as well as the interfaces to access the files, have been improved a lot, but the basic idea is the same: to provide a storage space where the user (the engineer) doesn’t need to understand and implement the tedious and complex problem of replicating data without losing it, and without having more than one view of it.
Another framework-level enabler has been the data processing frameworks, which again take away the need to understand the complexity of parallel processing. Or, to put it the other way around, they let the user (again the engineer) focus on data analytics or machine learning algorithm development. The framework can worry about compute cluster initialization, node-to-node communication, task scheduling and optimization, and the many other tasks needed to make a cluster of multiple compute nodes behave as if the algorithm were running on a single server.
In the early days of Hadoop, MapReduce was the framework, but quite soon the in-memory processing framework Spark took over. Since then, the usability of these tools has improved, and big data tools are now offered as readily usable services like Databricks.
The beauty of these frameworks is basically the same: the framework takes care of initializing and scaling a whole compute cluster for the time compute is needed, and shuts it down when it is no longer needed. For the people who are working in the industry, this is already everyday life. But seriously, this is pretty cool, and advanced.
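The division of labour described above can be illustrated with a toy map-and-reduce word count. This is a sketch under stated assumptions, not how Hadoop MapReduce or Spark is actually implemented: threads from Python's standard library stand in for worker nodes, and the function names `map_partition`, `merge_counts` and `word_count` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, num_partitions=3):
    # Split the input into partitions, as a framework would spread data across nodes.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    # Threads stand in for worker nodes here; a real cluster uses separate machines.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(map_partition, partitions))
    # Merge the partial counts into the final result.
    return reduce(merge_counts, partials, Counter())


lines = [
    "big data needs scaling",
    "scaling needs frameworks",
    "frameworks hide the cluster",
]
totals = word_count(lines)
```

The engineer writes only `map_partition` and `merge_counts`. Partitioning, scheduling and collecting the results live in `word_count`, which plays the role of the framework, and the result is the same whether the work runs in one partition or many.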
In addition to these beautiful frameworks, the cloud has genuine and tangible benefits as a platform. For example, Microsoft Azure has Databricks (as well as the rest of the needed cloud data platform services) available as a managed service, which means that the resources Databricks runs on are automatically kept up to date with the latest operating system versions and security patches. Keeping up with those updates is a major maintenance burden and cost for solutions relying on a self-hosted virtual machine infrastructure.
One of the biggest trends in organizations is to become data-driven. Becoming data-driven means being able to analyze data and base decisions on the analyses.
Again, being able to analyze the data means that data and analytics tools need to be available. A typical situation in organizations preventing this is that the data has gotten siloed in various systems, and it is not available for analytics. It might also be that the management of the data is done manually, and adding new, even publicly available data sources is slow and expensive.
This will often result in a situation where newly hired data analysts or data scientists lose valuable time looking for, cleaning and understanding the data. And in the worst-case scenario, if a common platform is missing, the machine learning model that is finally built is only deployed on the data scientist’s laptop without any possibility to leverage it in a wider scope, and with no one else updating it.
Here is where the cloud data platform comes into play. It is a system to orchestrate the data for analytics or third party applications. Typical logical parts of a cloud data platform are:
However, there are emerging technologies that are about to change this somewhat stable structure. One of them is the lake house concept, which we will be discussing in another blog article soon.
Ok, hopefully all of you who had no previous experience of cloud data platforms now understand that they are essentially data logistics networks, which orchestrate the data for analytics or applications.
And by the way, if the parties working in the data logistics network speak different languages, the work quickly becomes slow and frustrating. That’s why we need to have a data glossary.
If the suppliers are allowed to deliver their goods with or without a description of the content (metadata), if they use whatever names they wish for their products (master data), or if nobody has any control over the quality of the goods the suppliers deliver (data quality), it soon becomes apparent that the logistics network will grind to a halt. We obviously need someone to govern the whole thing, that is, to govern the data.
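As a rough illustration of what governing the data can mean in practice, the sketch below checks a supplier's delivery before it is accepted into the network. It is a minimal example under assumptions: the required metadata fields, the shape of a delivery and the function name `validate_delivery` are all hypothetical, and real platforms typically rely on dedicated data catalogue and data quality tooling.

```python
# Hypothetical rule: every delivery must describe its owner, content and columns.
REQUIRED_METADATA = {"owner", "description", "schema"}


def validate_delivery(delivery):
    """Return a list of governance problems; an empty list means the delivery is accepted."""
    problems = []
    metadata = delivery.get("metadata", {})

    # Metadata check: the goods must come with a description of the content.
    missing = REQUIRED_METADATA - set(metadata)
    if missing:
        problems.append(f"missing metadata: {sorted(missing)}")

    # Data quality check: every column named in the schema must have a value.
    for index, row in enumerate(delivery.get("rows", [])):
        for column in metadata.get("schema", []):
            if row.get(column) is None:
                problems.append(f"row {index}: missing value for '{column}'")
    return problems


good_delivery = {
    "metadata": {
        "owner": "sales",
        "description": "daily orders",
        "schema": ["order_id", "amount"],
    },
    "rows": [{"order_id": 1, "amount": 19.9}],
}
bad_delivery = {
    "metadata": {"schema": ["order_id", "amount"]},
    "rows": [{"order_id": 2, "amount": None}],
}

good_problems = validate_delivery(good_delivery)
bad_problems = validate_delivery(bad_delivery)
```

Rejecting or flagging a delivery at the point of entry is one possible design; the same checks could equally run later in the pipeline, as long as someone owns the rules.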
Teemu Ekola is the Head of Big Data Solutions at TietoEVRY, leading a team of experts in the areas of Big Data, AI, and Data Advisory. If you want to know more about cloud data platforms or other related topics, you can connect with him on LinkedIn to find out more.