Business analysts and data scientists cannot do their jobs effectively when they struggle to find and access accurate, complete and trustworthy data. As a result, they spend more time searching for data than using it to produce analyses and insights.
A Data Catalog inventories and classifies all the usable datasets in an organization. It enriches their metadata so that the right datasets can be selected and used in data projects. The sole aim of a Data Catalog is to enable the operational teams working with data to identify, understand and select the datasets they need in order to create value.
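To make the idea concrete, here is a minimal sketch of what "inventory, enrich, select" can look like in code. It is an illustration only, not how any particular product works: the `DatasetEntry` and `Catalog` names, their fields and the sample datasets are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One inventoried dataset and its descriptive metadata (hypothetical fields)."""
    name: str
    owner: str
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    """A toy in-memory catalog: register datasets, enrich them, search them."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Inventory: record the dataset under its name.
        self._entries[entry.name] = entry

    def enrich(self, name, *tags):
        # Enrichment: attach extra classification tags to a known dataset.
        self._entries[name].tags.update(tags)

    def search(self, keyword):
        # Selection: match the keyword against name, description and tags.
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or kw in {t.lower() for t in e.tags}
        )


catalog = Catalog()
catalog.register(DatasetEntry("sales_orders", "finance", "Daily order lines from the web shop"))
catalog.register(DatasetEntry("crm_contacts", "marketing", "Contact details per account"))
catalog.enrich("crm_contacts", "customer", "personal-data")

print(catalog.search("customer"))
```

In this sketch the second dataset is only found under "customer" because of the tag added during enrichment, which is the point of enriching metadata: it makes datasets discoverable in the vocabulary the teams actually use.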
Beware of these claims
When purchasing a Data Catalog solution, stop for a moment if you encounter claims like these:
A Data Catalog is a Data Governance solution. Having sound data governance is one of the pillars of an effective data strategy. Governance, however, has little to do with tooling.
A Data Catalog is a Data Quality Management (DQM) solution. Data quality needs to be assessed as early as possible in the pipelines that feed the data. The role of the Data Catalog is not to perform quality control but to share the results of these controls as widely as possible.
A Data Catalog is a compliance solution. Regulatory compliance is above all a matter of documentation and proof and has no place in a Data Catalog. However, the Data Catalog can help identify (more or less automatically) data that is subject to regulations.
A Data Catalog is a query solution. In a modern data architecture, the ability to execute queries from a Data Catalog isn't just unnecessary, it's also very risky (performance, cost, security, etc.). Data teams already have their own tools for querying data, and if they don't, it may be a good idea to equip them. Folding data access issues into the deployment of a catalog is the surest way to make it a long, costly, and disappointing project.
A Data Catalog is a business modeling solution. As useful and complete as they may be, business models are still just models: they are an imperfect reflection of the operational reality of the systems, and therefore they struggle to provide a useful Data Catalog. Traditional business models are too complex and too abstract to be adopted by data teams. However, Data Product Toolkit® is built for this specific purpose and is intended to act as a single source of truth for the whole team, including everyone in the data team.
A Data Catalog must not rely on automation. A Data Catalog handles millions of pieces of information in a constantly shifting landscape. Maintaining this information manually is virtually impossible, or at best extremely costly. Without automation, the content of the catalog will always be in doubt, and the data teams will not use it.
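As an illustration of the kind of automation meant here, and of the "more or less automatic" identification of regulated data mentioned earlier, the sketch below harvests table and column metadata from a database and flags columns whose names suggest personal data. It uses an in-memory SQLite database purely for convenience; the `harvest` function, the name pattern and the sample tables are assumptions made for this example, and a name-based heuristic like this only produces candidates for human review.

```python
import re
import sqlite3

# Hypothetical heuristic: column names that hint at personal data.
PERSONAL_DATA_PATTERN = re.compile(r"(email|phone|birth|ssn|address)", re.IGNORECASE)


def harvest(conn):
    """Read every table's columns and flag those that may be regulated."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [
            {
                "column": col[1],
                "type": col[2],
                "maybe_regulated": bool(PERSONAL_DATA_PATTERN.search(col[1])),
            }
            for col in columns
        ]
    return catalog


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

for table, columns in harvest(conn).items():
    flagged = [c["column"] for c in columns if c["maybe_regulated"]]
    print(table, "->", flagged)
```

Re-running a harvest like this on a schedule keeps the inventory in step with the systems it describes, which is what manual maintenance cannot do at scale.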