In Part 1 of this blog series on de-duplication, I wrote about the deterministic matching approach. This post looks into probabilistic matching. I will briefly summarize the key highlights of this matching technique and help you understand how it can benefit your implementation.
Probabilistic matching uses statistical algorithms to deduce the most reliable match. Every record in a data set is compared with every other record to yield a score that represents the confidence that the two records refer to the same customer. The key phrase here is the 'determination of the likelihood of a match'. It's not an exaggeration to say that this approach mimics the way a human brain thinks.
Internally, the matching engine applies fuzzy comparison functions, such as edit distance, to compare the fields. Scores are calculated based upon weights associated with the values of specific attributes. For example, a birth date is vulnerable to a mistake on a single digit, while a name is more likely to remain recognizable after a single error, for example, Caster and Castar. Probabilistic matching can detect these variations and assign appropriate weights. We can then aggregate these weights to derive what is often referred to as a confidence score.
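To make this concrete, here is a minimal sketch in Python of field-level fuzzy comparison with weighted aggregation. The attribute names and weight values are illustrative assumptions, not taken from any particular matching engine.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein edit distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Illustrative per-attribute weights: a single-character typo leaves a
# name recognizable, while one wrong digit changes a birth date entirely.
WEIGHTS = {"last_name": 0.5, "first_name": 0.3, "birth_date": 0.2}

def confidence(rec1: dict, rec2: dict) -> float:
    """Aggregate weighted field similarities into one confidence score."""
    return sum(w * similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

score = confidence(
    {"last_name": "Caster", "first_name": "Ann", "birth_date": "1980-04-12"},
    {"last_name": "Castar", "first_name": "Ann", "birth_date": "1980-04-12"},
)
```

With these example records, the single-letter difference between Caster and Castar only slightly reduces the last-name similarity, so the aggregated score stays high.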
Once the confidence score is calculated, it is compared against thresholds to determine what action to take. The matching results usually fall into three categories: automatic linking, manual review, and don't link. One of the key aspects of this matching exercise is analyzing the thresholds (a huge topic in itself, which I will cover in a later blog post) so that you link the records that are real duplicates and drop the ones that are true non-matches. You also want to keep your manual review tasks manageable, as they require human intervention to conclude a match or non-match.
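The three-band decision can be sketched as follows. The cutoff values are purely illustrative; in a real implementation they come out of the threshold analysis mentioned above.

```python
# Illustrative thresholds; real values are tuned per data set.
AUTO_LINK_THRESHOLD = 0.90   # at or above: link records automatically
REVIEW_THRESHOLD = 0.70      # in between: route to a human reviewer

def decide(confidence: float) -> str:
    """Map a confidence score to one of the three standard outcomes."""
    if confidence >= AUTO_LINK_THRESHOLD:
        return "automatic link"
    if confidence >= REVIEW_THRESHOLD:
        return "manual review"
    return "don't link"

print(decide(0.95))  # automatic link
print(decide(0.80))  # manual review
print(decide(0.40))  # don't link
```

Raising the review threshold shrinks the manual queue but risks dropping true matches; lowering it does the opposite, which is why tuning these cutoffs is such a large part of the exercise.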
One of the powerful features of the probabilistic technique is that it takes into consideration the frequency with which a data value occurs within a particular distribution (or data set). For example, in the United States, "Smith" is considered the most common last name; the Census 2000 page here shows around 2.4M people with the last name Smith. Given this, the engine should render a lower matching score for an agreement on Smith than for an agreement on a rare name, for example my last name, "Chandramohan". When you are dealing with North American data sets, two records agreeing on Smith are less likely to be a true match than two records agreeing on Chandramohan.
This statistical aspect adds tremendous value to fuzzy matching: the algorithm generates matching scores based on your data set and the distribution of values within it. As a result, instead of just fuzzy-comparing the information contained in two data elements, you can use this additional frequency information to say how much a given agreement contributes to the overall confidence.
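The frequency idea above can be sketched with a simple log-inverse-frequency weight, so that an agreement on a rare value counts for more than an agreement on a common one. The counts below are made-up illustrations, not real census figures.

```python
import math

# Hypothetical surname counts from a reference population (illustrative only)
SURNAME_COUNTS = {"smith": 2_400_000, "chandramohan": 500}
TOTAL_POPULATION = 280_000_000

def agreement_weight(surname: str) -> float:
    """Weight an agreement by the rarity of the value: the rarer the
    surname in the reference distribution, the stronger the evidence."""
    count = SURNAME_COUNTS.get(surname.lower(), 1)
    return math.log(TOTAL_POPULATION / count)

print(agreement_weight("Smith"))         # common name -> lower weight
print(agreement_weight("Chandramohan"))  # rare name -> higher weight
```

In a full engine this weight would scale the field's contribution to the confidence score, so two records agreeing on a rare surname are pushed toward automatic linking more strongly than two records agreeing on Smith.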
The probabilistic matching approach is usually preferred when you have a large number of records and many attributes involved in the matching process. It also makes more sense when no common identifiers are available to link records. While this technique has many positive aspects, it is also complex and requires a great deal of experience. And certainly, there are drawbacks and things you need to be cautious about, which I will talk about in my next blog post.
What is your experience with probabilistic matching? What works well, and what does not? Please share your opinion in the comments.