Without taking relevant data under control, organizations will struggle to understand and harness business information, writes Tomi Mustonen in the third blog post of the Data as an Enabler series.
Say you want to search for information about a new TV show that a friend recommended. You just type the name into a search engine (or generative AI solution), and you instantly get more information and, most likely, a link to a service where you can watch the show.
Now, imagine that you have started a job at a new company and want to find financial information or other important facts. You would most likely search the intranet; some might try a specific IT system, and others would look for a colleague to ask. When you manage to find the information, the next question you might have is how the figures are calculated and what they mean. This leads to another round of searching and consumes more of your valuable time.
Similarly, data professionals face time-consuming tasks when they start creating new reports or other data products that use existing data from the organization's databases. Finding trustworthy data can be difficult and laborious, as you need to find the correct data sources, get access to them, and often clean up the data to ensure that it is accurate, consistent, and fit for use.
According to studies, data professionals still spend around 38% of their time cleaning and curating data (Source 1). Although this number has decreased from 45% in 2020 (Source 2), it is still a significant amount of time that data experts could use for more productive tasks. And this covers only the professionals who use data, not the ones creating data in their daily work. Users of operational business systems often don’t know what data should be entered in which system and in what format, and they do not know where the data they enter will be used. This can lead to careless data entry. We have seen that a data catalog is one of the key solutions for avoiding this.
Key information that organizations have includes:
- Structured (tabular) data, usually stored in traditional databases, produced and mainly used in operational business systems.
- Unstructured data, including documents, webpages, emails, social media content, mobile data, images, audio, and video.
In addition, the following can be seen as key information assets: reports, data visualizations, and dashboards; machine learning models; integrations between systems and databases; external data sources, publicly available data, and data purchased from 3rd parties (Source 3).
Why do we need a data catalog to gain control over these key information assets? One description of a data catalog is an organized inventory of the relevant data in the organization (Sources 4 & 5). One analogy is the card catalog that libraries use to register all bibliographic items in a central location. Each card in a library catalog contains key information about a single book, such as the author, style, and unique identifier of the book. The card catalog, whether web-based or physical, makes it easier to search for and find the book you are looking for.
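To make the card-catalog analogy concrete, here is a minimal sketch in Python of what one "card" and a simple search over the cards could look like. The class `CatalogEntry`, its fields, the `search` function, and the two sample entries are all hypothetical and chosen only for illustration; real data catalog products define their own, much richer models.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One 'card': metadata about a data asset, not the data itself (hypothetical model)."""
    name: str
    description: str
    owner: str
    location: str
    tags: list[str] = field(default_factory=list)


def search(entries: list[CatalogEntry], keyword: str) -> list[CatalogEntry]:
    """Return entries whose name, description, or tags mention the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        e for e in entries
        if needle in e.name.lower()
        or needle in e.description.lower()
        or any(needle in t.lower() for t in e.tags)
    ]


# Illustrative sample entries; names, owners, and locations are invented.
catalog = [
    CatalogEntry(
        name="monthly_revenue",
        description="Invoiced revenue per month, net of VAT",
        owner="Finance controller",
        location="warehouse.finance.monthly_revenue",
        tags=["finance", "report"],
    ),
    CatalogEntry(
        name="customer_master",
        description="One row per active customer",
        owner="Sales operations",
        location="crm.customers",
        tags=["sales", "master data"],
    ),
]

for entry in search(catalog, "finance"):
    print(f"{entry.name}: owned by {entry.owner}, stored at {entry.location}")
```

Like the library card, each entry only describes the asset and points to where it lives; the data itself stays in the source system.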
One should not try to include all data objects in the scope of a data catalog. Instead, the organization should slice the data elephant and concentrate on the most relevant data. Relevant data can be defined as the valuable information that an organization uses to support and operate its day-to-day business, and the information it uses to make decisions and forecasts.
In short, a data catalog collects metadata, “data about data”: information, stored in some form of IT solution, that improves the business and technical understanding of data. The following figure illustrates the key contents of a data catalog.
Let’s explain the key contents:
To summarize, a data catalog:
A data catalog answers questions and concerns around data and gives a comprehensive view of the most important data objects for business operations.
Figure 2. A data catalog answers questions and concerns about data.
Why should one invest in implementing such a solution then?
The key benefits of the data catalog can be summarized as follows:
Also, by providing a centralized place for documentation, a data catalog can decrease the lead time of data initiatives. Today, most development initiatives need, create, and/or consume data, and these initiatives will benefit from a documented and approved data landscape. When existing data, definitions, descriptions, owners, and locations are known, it’s easier to evaluate the gap between the current and future state and the development required for initiatives.
Once different data objects are documented and known, it is easier and faster to combine different data sets and implement integrations between systems. A data catalog also improves interoperability between organizations. When data is being shared outside the organization, whether sold or as open data, having well-defined definitions and proper documentation becomes crucial. Equally important is the understanding of data received from external sources.
A data catalog is an important enabler of efficient data discovery and trust in data. According to studies, the time spent locating data and reports can be reduced by 50% or even more (Sources 6 & 7).
However, implementing a data catalog incurs costs, as it requires initial investment in its development and ongoing attention for maintenance. It is a technical solution used by humans and not a silver bullet fixing all data issues. Proper data management practices, as well as processes to curate the catalog itself, are needed.
There are several mature and sophisticated data catalog solutions on the market, offering an array of features but potentially carrying a high price tag. Data executives should thoroughly assess the specific requirements of their organization. If the requirements are straightforward, then opting for a simple solution is recommended. The process of finding and using accurate data to address critical business challenges should be as effortless as using your preferred e-commerce or search platform to discover and purchase a product. A data catalog is a valuable tool that can assist in achieving this objective.
If you want to get your data in order and say goodbye to data chaos, do not hesitate to reach out. Our team is ready to help!
1: State of Data Science report, 2022 | Anaconda
2: State of Data Science report, 2020 | Anaconda
3: Data Catalog | IBM
4: What is a Data Catalog? | Alation
5: Data Catalog | Oracle
6: The Total Economic Impact Of The Alation Data Catalog | Alation
7: The Business Value of Collibra Data Intelligence Cloud | Collibra