How To Measure Data Quality: Metrics And Scorecards

Tom Breur
July 2010

Introduction

Measuring data quality is still in its infancy. This is one of the reasons why making solid business cases is so hard. Because our mastery of fundamental concepts and principles is relatively meager, for the moment we are limited to local and idiosyncratic measures. Herzog et al. (2007): “… most current quantification approaches are created in an ad hoc fashion that is specific to a given database and its use.” And: “The construction of such metrics is a fertile area for future research.”

At present we lack an overarching theory to derive universal quality measures. Goethe: “the first sign we don’t know what we are doing is an obsession with numbers.” Given these limitations, what is attainable in terms of measurement? Most professionals would agree that it makes a lot of sense to measure data quality, preferably on a continuous basis. It’s hackneyed, but you can’t manage what you don’t measure.

When you define “quality” as value to some user, it inherently follows that quality needs to be measured relative to the benefits that user is getting from the data. To make numbers that represent “data quality” meaningful, we need a framework that relates the quality of decision making to the quality of the data on which it is based. Value to the data user (the decision maker) is then tied to information rather than data, and its output (decisions) drives the value.

What does the literature have to say?

In his landmark work, English (1999) places data quality in the total quality management (TQM) framework. English’s 2009 tome, “Information Quality Applied”, continues in this tradition. He goes to great lengths to tie breakdowns in primary business processes to data faults, and to monetize these. Examples might be sending duplicate (and hence wasted) mail packs, overstocking product that perishes or goes out of fashion, or missed sales as a result of stock-outs.

Given the historical roots of TQM in manufacturing, this bias towards primary process breakdown is quite natural. However, we live in a knowledge economy, and many areas that use lots of data have a difficult time linking the costs of primary processes to data errors. Sectors such as healthcare, government, and non-profit also have a hard time attaching financial numbers to errors. Yet for them, data quality is just as important as anywhere else.

Herzog et al (2007) provide three kinds of data quality metrics: completeness, proportion of duplicates, and the proportion of each data element that is missing. Completeness refers to the proportion of entities from the population that is represented on the database/list. There is always talk about the census coverage, for instance, which is supposed to be 100%, but isn’t really. Proportion of duplicates refers to overrepresentation of some entities that might appear more than once in a list. In particular when you acquire multiple lists for a marketing campaign, there is always a chance of drawing the same person more than once, which leads to waste if you mail the same person twice. This also doesn’t look very professional. Proportion of data elements missing is a measure that represents how many rows within each column are missing when they really should be there.
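The three metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from Herzog et al.: the function names, the sample mailing list, and the population size of 10 are all hypothetical, and it assumes duplicates can be recognized by an exact match on an `id` field (in practice that usually requires record linkage).

```python
from collections import Counter

def completeness(n_represented, population_size):
    """Share of the target population that appears on the list at least once."""
    return n_represented / population_size

def duplicate_proportion(records, key):
    """Share of records that repeat an entity already on the list."""
    counts = Counter(key(r) for r in records)
    surplus = sum(c - 1 for c in counts.values())
    return surplus / len(records)

def missing_proportion(records, fields):
    """Per field, the share of records where the value is absent (None or empty)."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) in (None, "")) / n for f in fields}

# Hypothetical mailing list: five records, one person listed twice.
mailing_list = [
    {"id": 1, "name": "A. Jansen", "postcode": "1011 AA"},
    {"id": 2, "name": "B. de Vries", "postcode": None},
    {"id": 3, "name": "C. Bakker", "postcode": "1012 AB"},
    {"id": 1, "name": "A. Jansen", "postcode": "1011 AA"},
    {"id": 4, "name": "", "postcode": "3511 CD"},
]

distinct_entities = len({r["id"] for r in mailing_list})  # 4 distinct people
# Assumed population of 10 people the list is supposed to cover.
print(completeness(distinct_entities, population_size=10))        # 0.4
print(duplicate_proportion(mailing_list, key=lambda r: r["id"]))  # 0.2
print(missing_proportion(mailing_list, ["name", "postcode"]))     # {'name': 0.2, 'postcode': 0.2}
```

Note that completeness needs an external estimate of the population size, which is exactly why census coverage is so hard to pin down.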

Maydanchik (2007) recommends monitoring data quality using scorecards. His recommended approach boils down to performing a post hoc data clean up that you subsequently use to derive business rules to which new, incoming data ought to conform.

Data quality rules can be derived at the field level, for example by counting unexpected missing values or values that fall outside an acceptable range (like ages > 150). Rules can also relate fields to each other: when gender is male, pregnancy can’t possibly be set to “yes”. The number of data elements involved in a rule, and the relations between them, can grow arbitrarily. The complexity of these business rules tends to grow as increasingly elaborate ways of validating results become available.
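As a rough sketch of how such rules might be encoded, the example below expresses two field-level rules and one cross-field rule as predicates and counts violations per rule. The rule names, field names, and sample records are hypothetical; this is not Maydanchik’s implementation, only one simple way to represent the idea.

```python
# Each rule is a (name, predicate) pair; the predicate returns True
# when a record violates the rule.
RULES = [
    # Field-level rules
    ("age_missing", lambda r: r.get("age") is None),
    ("age_out_of_range",
     lambda r: r.get("age") is not None and not 0 <= r["age"] <= 150),
    # Cross-field rule
    ("male_and_pregnant",
     lambda r: r.get("gender") == "male" and r.get("pregnant") == "yes"),
]

def count_violations(records, rules=RULES):
    """Return the number of records that violate each rule."""
    return {name: sum(1 for r in records if violates(r))
            for name, violates in rules}

# Hypothetical incoming records.
records = [
    {"age": 34, "gender": "female", "pregnant": "yes"},
    {"age": 172, "gender": "male", "pregnant": "no"},
    {"age": None, "gender": "male", "pregnant": "yes"},
]

print(count_violations(records))
# {'age_missing': 1, 'age_out_of_range': 1, 'male_and_pregnant': 1}
```

Keeping rules as data rather than hard-coded checks makes it easy to add new rules as the post hoc clean-up uncovers them.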

Data Quality Scorecards

Continuous measurement of data quality goes a long way towards raising awareness. As the Hawthorne studies have shown, and as you can see for instance in call centers, merely recording quality levels can in itself drive quality up. But how should you represent “quality”? What are valid and equitable measures?

The measures that Herzog et al. and Maydanchik have proposed all lend themselves to monitoring. Although they lack the firm theoretical framework that English’s approach has, they are a pragmatic first step. But don’t be fooled: any number you report in a scorecard is in itself arbitrary, unless you can directly relate it to your bottom line or to an observable improvement in decision making.

It is important to include a range of metrics in a scorecard, not only to address data quality at a wider scale, but also to avoid optimizing some narrow metrics at the expense of other quality drivers. The selection of your scorecard metrics should be driven by holistic insight into the drivers of business value. Don’t confuse determinants of success with their outcomes! There is simply no substitute for causal analysis here (e.g., structural equation modeling).
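To illustrate why a single narrow metric can mislead, here is a small hypothetical scorecard that tracks two error rates per monthly batch. The metric names, periods, and data are invented for the example: between the two months the missing-value rate improves while the out-of-range rate gets worse, which a scorecard that only reported missing values would hide.

```python
def missing_rate(field):
    """Build a metric: share of rows where `field` is None."""
    return lambda rows: sum(1 for r in rows if r.get(field) is None) / len(rows)

def out_of_range_rate(field, low, high):
    """Build a metric: share of rows whose `field` value lies outside [low, high]."""
    def rate(rows):
        present = [r[field] for r in rows if r.get(field) is not None]
        return sum(1 for v in present if not low <= v <= high) / len(rows)
    return rate

def scorecard(batches, metrics):
    """Compute every metric for every batch: {period: {metric_name: value}}."""
    return {period: {name: fn(rows) for name, fn in metrics.items()}
            for period, rows in batches.items()}

def worsened(card, earlier, later):
    """Names of metrics whose error rate rose between two periods."""
    return [m for m in card[later] if card[later][m] > card[earlier][m]]

METRICS = {
    "age_missing": missing_rate("age"),
    "age_out_of_range": out_of_range_rate("age", 0, 150),
}

# Hypothetical batches: in July the gaps were filled, one with a placeholder value.
batches = {
    "2010-06": [{"age": 41}, {"age": None}, {"age": None}, {"age": 29}],
    "2010-07": [{"age": 41}, {"age": 999}, {"age": 35}, {"age": 29}],
}

card = scorecard(batches, METRICS)
print(card["2010-06"])  # {'age_missing': 0.5, 'age_out_of_range': 0.0}
print(card["2010-07"])  # {'age_missing': 0.0, 'age_out_of_range': 0.25}
print(worsened(card, "2010-06", "2010-07"))  # ['age_out_of_range']
```

Even with both metrics side by side, the scorecard only describes the data; whether either number matters still depends on its link to business value.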

Conclusion

For want of an overarching data quality theory, we are limited for the moment to some idiosyncratic measures of data quality. There is nothing wrong with that; it reflects the current state of affairs. There is no point in suggesting scientific precision when you don’t really know what you should measure with scientific precision.

This isn’t necessarily a bad thing; it matches the current maturity of our profession. Measurement without the appropriate model to govern observations merely allows extrapolation. So although one could measure consecutive months, and reasonably infer the next (assuming linearity), it doesn’t help you imbue those observations with meaning.
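The kind of extrapolation meant here can be sketched as a plain least-squares line through consecutive monthly measurements. The error rates below are made up, and the linearity is an assumption, as noted above.

```python
def extrapolate_next(values):
    """Fit a least-squares line through (0, v0), (1, v1), ... and
    return the fitted value at the next index."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * n

# Hypothetical monthly error rates, falling by 0.006 per month.
monthly_error_rate = [0.080, 0.074, 0.068, 0.062]
print(round(extrapolate_next(monthly_error_rate), 3))  # 0.056
```

The forecast is easy to compute, but it says nothing about why the rate is falling or what a rate of 0.056 is worth to the business.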

When possible, tying data quality errors to primary process breakdown is an elegant way to measure and monetize data quality. However, in many settings this is far from straightforward. In our knowledge economy so many people make data-based decisions on a repetitive basis, yet a rock-solid business case can still be hard to make.

Some decision outcomes can only be assigned arbitrary values. What is it worth to improve decision making in support of choices about administering life-saving drugs? What is the economic value of human lives? Steven Levitt wrote about this controversial topic in Freakonomics (2006), but there are many other decisions where no undisputed relation to money seems possible.

Merely recording data quality on an ongoing basis has proven remarkably effective in raising awareness, and hence improves data quality without any further action being taken.

If you decide to build a data quality scorecard, ensure that improving scores coincides with integral improvement in performance. To draw a parallel with consumer research: customer satisfaction is not a KPI. It’s the result of doing things right for customers, and more particularly things that are of importance to them.

Isolated data quality metrics in and of themselves don’t “automatically” lead to better performance. Smart professionals understand what gets measured, and what gets measured gets rewarded. Ensure these rewards are primary drivers for your business goals – achieving that kind of business alignment is easier said than done.

References

Data Quality and Record Linkage Techniques.
Thomas Herzog, Fritz Scheuren & William Winkler (2007)
ISBN# 0387695028

Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits.
Larry English (1999)
ISBN# 0471253839

Information Quality Applied: Best Practices for Improving Business Information, Processes and Systems.
Larry English (2009)
ISBN# 047013447X

Data Quality Assessment.
Arkady Maydanchik (2007)
ISBN# 9780977140022

Software Quality Management, Volume 2 – First Order Measurement.
Gerald Weinberg (1993)
ISBN# 0932633242

Freakonomics.
Steven Levitt & Stephen Dubner (2006)
ISBN# 0061234001
