An oft-cited estimate by IBM calculated that the annual cost of data quality issues in the U.S. amounted to $3.1 trillion in 2016. In an article he wrote for the MIT Sloan Management Review in 2017, data quality consultant Thomas Redman estimated that correcting data errors and dealing with the business problems caused by bad data costs companies 15% to 25% of their annual revenue on average.
Why should I be worried about data health?
Some more numbers about the true costs of data breaking bad:
50–80 percent of a data practitioner’s time is spent collecting, preparing, and fixing “unruly” data. (The New York Times)
40 percent of a data analyst’s time is spent on vetting and validating analytics for data quality issues. (Forrester)
27 percent of a salesperson's time is spent dealing with inaccurate data. (ZoomInfo)
50 percent of a data practitioner’s time is spent on identifying, troubleshooting, and fixing data quality, integrity, and reliability issues. (Harvard Business Review)
Get your data fixed, or at least monitor it
Instead of continuing to debate data quality issues, we might want to focus on data health. We are most likely talking about the same aims but taking a different approach. Perhaps changing the vocabulary can open some of the locks we have created in the industry.
But what are the pillars of data health?
Barr Moses defines the five pillars as:
Freshness: Are there sudden, unexpected gaps in my data? Is my data up to date? Detected anomalies in freshness should trigger notifications and alerts to check the health of the system and the data value chain. Keep in mind that your data value chain is increasingly ecosystemic: you rely on other organizations' systems as well.
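A minimal sketch of such a freshness check (the function name, the metadata source, and the one-hour threshold are illustrative assumptions, not from the article; in practice the last-update timestamp would come from warehouse metadata):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated_at: datetime, expected_interval: timedelta) -> bool:
    """Return True when the data is older than its expected refresh interval."""
    return datetime.now(timezone.utc) - last_updated_at > expected_interval

# A table loaded 30 minutes ago, checked against a 1-hour refresh SLA:
fresh = datetime.now(timezone.utc) - timedelta(minutes=30)
print(is_stale(fresh, timedelta(hours=1)))  # False: still fresh
```

A stale result would be the signal to alert on, rather than waiting for a downstream consumer to notice the gap.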
Distribution: One way to measure distribution issues is to track null values in the data. Any change in the expected percentage is a sign of a distribution issue. For example, if your long-term average is 1–2% null values and suddenly some data contains 60% null values, it is worth exploring what might be causing it.
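The null-rate comparison above can be sketched roughly like this (the baseline and tolerance values are illustrative assumptions):

```python
def null_rate(values: list) -> float:
    """Percentage of None values in a column sample."""
    return 100.0 * sum(v is None for v in values) / len(values)

def distribution_alert(values: list, baseline_pct: float,
                       tolerance_pct: float = 5.0) -> bool:
    """True when the null rate drifts more than `tolerance_pct` from baseline."""
    return abs(null_rate(values) - baseline_pct) > tolerance_pct

# A batch with 60% nulls, against a long-term baseline of 1.5%:
batch = [1, None, 3, 4, None, None, None, None, None, 10]
print(distribution_alert(batch, baseline_pct=1.5))  # True: investigate
```

The same pattern generalizes to other distribution metrics (value ranges, category frequencies); nulls are just the easiest signal to start with.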
Volume: Tracking the behavior of data volume over time is another good measure of data health. This is usually easy to implement, for example with database analytics tools, to check whether volume drops or grows abnormally fast or in unusually large quantities.
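A simple version of such a volume check might compare the latest load against a trailing average (the ratio threshold here is an illustrative assumption):

```python
def volume_anomaly(daily_counts: list, max_ratio: float = 2.0) -> bool:
    """True when the latest daily row count deviates from the trailing
    average by more than `max_ratio` in either direction."""
    *history, latest = daily_counts
    avg = sum(history) / len(history)
    return latest > avg * max_ratio or latest < avg / max_ratio

print(volume_anomaly([1000, 1020, 980, 1010, 350]))  # True: sudden drop
print(volume_anomaly([1000, 1020, 980, 1010, 995]))  # False: normal range
```

In a warehouse this would typically run as a scheduled query over load metadata rather than in application code.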
Schema: Are there missing fields in the data stream? Did we encounter additional information? Good practice in software development has always been to never pass user-interface inputs (such as forms or URL parameters) directly to the database. If your system does not validate the incoming stream, you should be worried. In 80% of your data sources you should know what's coming in. Why? You should be using productized or servitized data, and in those cases the data stream format (schema) is fixed.
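Validating an incoming record against a fixed schema can be sketched as follows (the field names and types are hypothetical; real pipelines often delegate this to a schema registry or a validation library):

```python
# Expected fields and their types for an incoming record (illustrative).
EXPECTED_SCHEMA = {"id": int, "email": str, "created_at": str}

def validate_record(record: dict) -> list:
    """Return a list of schema violations; an empty list means the record passes."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors

print(validate_record({"id": 1, "email": "a@b.com", "created_at": "2023-01-01"}))  # []
print(validate_record({"id": "1", "email": "a@b.com"}))  # two violations
```

Rejecting or quarantining records that fail this check is what keeps an upstream schema change from silently corrupting tables downstream.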
Lineage: This is about the data value chain. Lineage is also where you combine the above four pillars into a story. Barr gives an example: "upstream there was a schema change that resulted in a table downstream that had a freshness problem that results in another table downstream that had a distribution problem that resulted in a wonky report the marketing team is using to make data-driven decisions about their product." That example, juiced up with nice graphs, would also make a great data story. Lineage is often easier and faster to grasp when it is visualized.