Resource Section Overview

Newsletter Archive
Whitepapers
Presentations
Links
Recommended reading

You will need Adobe Acrobat Reader to open documents with the pdf logo. For more information, see

Tom Breur combines Art & Science approach to data mining concepts. His abilities and visionary approach to data mining, the unique way of implementing models, are the most professional and advance we came across among many companies we are working with worldwide. He combines deep knowledge of customer behavior and has the ability to implement complex solutions and systems to get, keep and grow customers in a very competitive environment. Tom has the most pleasant personality and highest communicative skills.
With great respects
Dr. Ronit HaNegby
CEO
EASTAT ltd.

Data Quality: How to detect duplicate records

Tom Breur
April 2010

Introduction

Duplicate database entries are a pernicious problem. They come in at least two flavors. Sometimes merging of customer records in a data warehouse is compromised. What happens is that some assets are represented in one record, and part of the assets are covered in another record. Then both records (pertaining to the same customer) under represent this customer’s real value to the business, because they both only provide a partial view. The other flavor, and the kind of duplicates this paper will deal with, occurs when two records basically represent the exact same information. However, due to some error this information has populated two (or more) distinct rows.

Now if duplicate records were real duplicates, we wouldn’t have much of a problem. In reality, what we call “duplicate records” usually refers to the same entity in the real world, but in the database this entity shows up as two slightly different cases. These two records are suspiciously similar, however. Determining which records are a “true” match, and which are actually different is our challenge here.

Deduplicating always involves multiple fields because two people could have identical names, like John Smith, and people in the same household have the same address. There could be minor typographical errors in some of the fields, use of “special” characters, middle initials present or absent, or there could be missings. All this mishap leads to minor differences, yet records should still be identified as duplicates, when they really are.

Large lists are often the result of merging multiple files, each with their own peculiarities. Some, for instance, may contain a telephone number, and others may not. Some attributes are prone to change, like age or address, and the data might have been captured at different points in time. So the same customer could show up with two different addresses.

The way fields were captured may differ across supplying lists. If in one list first and last name were separate fields, and in another they were recorded together, often you will concatenate the two fields in the source that had separate fields before merging. Some attributes might not have been recorded in some sources, so they will need to stay missing, etc.

Typical problem fields are name and address. The reason these are problematic is for two reasons. Exactly these fields play a crucial role in positively identifying matches. The other reason is that address and name fields often contain multiple data elements like first and last name, initials, titles, etc. All of these could have been recorded together in the “name” field. Similar for address fields that are notoriously difficult to standardize. These problems are exacerbated considerably for international lists.

Detecting duplicates: the classification challenge

For sake of reasoning, we can ignore cases where two records are completely identical. This scenario is easy, but unfortunately rare. Another, slightly less trivial problem is where the database is indexed by a (supposedly) unique identifier, but when you do a count distinct on ID, it turns out that some of them aren’t unique after all. For the purpose of this paper, we can safely consider that a special case of the scenario we will be studying. For the duplicate ID’s, the question typically is: are these indeed matches, or are they different?

What we are interested in, are records that are suspiciously similar. Barring a “golden” match field, namely the id, you need to look at several fields, and typically in conjunction with each other. That is the interesting challenge we’ll be considering in this paper.

The task of determining match/no-match needs to be automated, because manual checking is too expensive (and error prone) for most practical applications. Additionally, we’d like to get some measure of “success” in finding duplicates. Ideally you want the process to output empirical estimates of the number of type I and type II errors. That means that we get a reliable (hopefully unbiased) estimate of the number of false matches and the number of undetected duplicates remaining.

Obviously when two records are not exactly identical, there is always the risk that our assessment of which records are duplicates turns out wrong. We have classical type I and type II errors here. Either you decide that two records are indeed duplicates when in reality they are not, or you fail to detect duplicates that should have been identified as such. Matching algorithms ideally should take both risks and associated costs into account when assessing trade-offs.

Detecting duplicates: the computational challenge

In a list with N rows, there are (N ´ (N – 1))/2 possible pairs of records to compare. Since this problem grows exponentially with the size of the list, we are looking at a computational challenge. When you identify a likely duplicate, some manual process will often be required, to assess a positive duplicate and decide on a permanent removal of the entry. In order to manage these processes, efficiency and reliable estimates are key.

To deal with the computational complexity we need to partition the list into mutually exclusive and exhaustive blocks. The purpose of this partitioning is to increase the number of duplicates within a block, and at the same time decrease the number of record pairs you need to compare. Of course the question then arises: how do you decide on “the best” blocking variables?

What are important attributes for a blocking variable, or set of variables (fields)? Bear in mind that we need to take into account the possibility that the value of the blocking field itself could be wrong. And since we only compare records within the subset of records with identical values for this field, that means we would fail to detect any duplicate pair if the value of the blocking field itself is wrong (or inadvertently missing). Records which disagree on the value for the blocking variable will always be classified as non-matches. So therefore one of the requirements for a potential blocking field is that it should be accurate.

A second requirement for a good blocking variable is that it should contain a large number of possible values that, preferably, should be fairly uniformly distributed as well. When a list has 10.000 records, we have 49.995.000 (10.000 ´ 9.999/2) comparisons to make. If half were men and half were women, and we’d use that as our blocking variable, there are still 2 ´ 12.497.500 (5.000 ´ 4.999/2) = 24.995.000 comparisons to make. If these 10.000 records were evenly distributed across 50 states, there are only 50 ´ 19.900 (200 ´ 199/2) = 995.000 comparisons to make. Even better would be a ZIP code, or some other variable (like telephone number, etc.) with many unique values.

Phonetic coding systems derive a standardized code based on the way a name is pronounced. This is quite useful in dealing with spelling errors. One of their most powerful uses is as blocking fields. After a phonetic code has been generated, you still retain the original name field, to verify (in conjunction with other fields) whether you have hit a true match.

Detecting duplicates: the statistical challenge

One of the practical ways to estimate the total number of duplicates in a list is by employing capture-recapture techniques. These are essentially adopted from practices that are used to estimate wildlife populations, for instance. You capture some species, earmark them before release, and when you do a second capture count the number of earmarked animals in the second sample. That number is then used to estimated the total population.

By the same token, you draw a sample and count the number of duplicates. Then you draw another sample and identify duplicates again. By matching these two lists you get a count, so to speak, of the number of earmarked animals. The Lincoln-Peterson capture-recapture estimator allows you to arrive at an estimate for the total number of duplicates in the list using two samples.

For cases where you have more than samples, a log-linear model can be employed. Note that the estimates for both the LP-estimator as well as log-linear models assume independent samples. This can be a bit tricky, as even “innocent” combinations like gender and age can turn out to be biased, for instance because women tend to grow older.

By employing an intelligent blocking strategy, you arrive at (much) higher proportions of duplicates than a random sample would provide. This has the advantage of being both very efficient in terms of using your resources as well as giving more reliable statistical estimators.

Iteratively removing duplicates

A benefit of using multiple iterations with several different blocking variables, is that you give yourself the best possible chance to acquire empirical data on where “true” duplicates are being identified, and where “false” matches occur. This helps you direct your (manual) resources accordingly.

With the Lincoln-Peterson formula, you get an estimate of the total number of duplicates in the list. Another benefit of iteratively removing duplicates is that once you have this total number of duplicates, you can decide after each pass how much more effort you are willing to put in to find the last N remaining duplicates. That effort tends to go up considerably for the last few duplicates, and often at the risk of identifying false matches as true duplicates.

This is where the power of an iterative blocking strategy has a chance to shine. Each pass through the data, using different (sets of) blocking fields, allows you focus on different fields to establish an automated definite match. Once duplicates are removed, you will switch blocking strategies again. For each set of blocking fields, you employ different sets of matching fields to decide on a positive match (see next section). Needless to say these choices are governed by the number of type I and type II errors encountered during (manual) checks.

A great place to start is the name field itself, but used as a blocking variable. To cope with possible misspellings, you derive a phonetic coding as blocking variable. Because there are so many different values for names, it often provides an excellent, highly efficient first pass. Having caught a great many “easy” duplicates, you can now concentrate your effort on the “hard” records.

Which variables to use for deduplicating?

Since most large lists are the result of merging multiple sources, it can be helpful to use the record source code. This will carry meaning, for instance, like telling which fields are expected to be missing, simply because they weren’t available in that particular list.

But even when the source of a record is given, it is non-trivial to decide which fields must provide a positive match to call two records a duplicate. Here the input from domain experts is of paramount importance (and so are good meta data), as they are most likely familiar with the procedures used for collecting the data and compiling the lists.

The choice of variables for deduplicating is closely related to the blocking strategy. When you use more fields, you will find less matches. If you can, try to standardize address fields, by parsing them into multiple columns, each with their own content. This will allow for a more flexible blocking strategy.

Another tactic is to use string comparator metrics like Jaro or others (typically in the 0-1 range), and to set a threshold, above which the text field (usually name or address) is considered a match. If you combine name with many fields, you can lower this threshold, if you combine name with few fields, you will need to be more restrictive in order to avoid calling two records a “match” when they really are distinct.

Conclusion

Managing (limiting) the number of duplicates in your lists is driven both by cost considerations (eliminating waste), as well as by managing reputation risks. Not only is it a shame to send the same (or worse: conflicting) message more than once to the same customer, but it also doesn’t look very professional. In some cases, contacting the “wrong” customer can even have legal or compliance implications.

Detecting (and erasing) duplicate records is both art and science. The “art” has to do with relating your blocking strategy to your choice of matching fields. Another part of the fine art is carefully managing where you decide to employ your manual resources. To efficiently manage large lists, you need to minimize manual labor. You will want to employ resources to get an efficient assessment of the number of type I and type II classification errors, as these are key numbers for the business to know.

Because the classification problem grows exponentially with the number of records, there is a computational challenge you need to manage. The statistical challenge has to do with estimating the total number of duplicates in your list, and doing this accurately yet efficiently. This will allow you to guide your resource allocation, because as always, an 80/20 rule applies. The first duplicates are easiest to detect, and as you progress it becomes more difficult and more expensive.

An iterative approach to finding your duplicates works so well because it allows you to assess at each pass how many additional records you found, what the classification risks are (removing records that weren’t real duplicates), and how much it will cost to go after the next ones.

Using a sound methodology, and relying on statistics it is possible to accompany your deduplication work with solid estimates of costs/benefits. Because as interesting and fascinating as this work is, in the end all the business wants to know is: “what’s in it for me?” It is up to us as data quality professionals, to answer that question with valid and accurate, business driven assessments.

References

Data Quality and Record Linkage Technique. Thomas Herzog, Fritz Scheuren & William Winkler (2007)
ISBN# 0387695028

Data Preparation for Data Mining.
Dorian Pyle (1999)
ISBN# 1558605290

Contact
XLNT Consulting
Tom Breur, Principal

E-mail
Email Tom Breur

Telephone
+31-6-463 468 75

Address
Langestraat 8-03
5038 SE Tilburg
the Netherlands