De-duplication Part 2: Probabilistic Matching Approach

In Part 1 of this blog series on de-duplication, I wrote about the Deterministic Matching approach. This post looks into Probabilistic Matching. I will briefly summarize the key highlights of this matching technique and explain how it can benefit your implementation.

Probabilistic Matching

Probabilistic Matching uses statistical algorithms to deduce the most reliable match. Every record in a data set is compared with every other record to yield a score that represents the confidence that the two records refer to the same customer. The key phrase here is 'determination of the likelihood of a match'. It is not an exaggeration to say that this approach mimics the way a human brain thinks.

Internally, the matching engine applies fuzzy logic, such as edit distance, to compare fields. Scores are calculated based upon weights associated with the values of specific attributes. For example, birth date information is prone to errors in a single digit, while a name usually remains recognizable when a single character is wrong (Caster vs. Castar, for example). Probabilistic matching can detect these variations and assign appropriate weights, which are then aggregated to derive what is often referred to as a confidence score.
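
To make the mechanics concrete, here is a minimal Python sketch of the idea: compare individual fields with an edit-distance-based similarity and aggregate them using per-attribute weights. The field names and weight values are hypothetical placeholders; a real matching engine uses far more sophisticated comparators and tuning.

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def field_similarity(a: str, b: str) -> float:
    # Normalize edit distance into a 0..1 similarity.
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / max(len(a), len(b))

# Hypothetical per-attribute weights; in practice these are tuned per data set.
WEIGHTS = {"last_name": 0.4, "first_name": 0.3, "birth_date": 0.3}

def confidence_score(rec1: dict, rec2: dict) -> float:
    # Aggregate weighted field similarities into a single confidence score.
    return sum(w * field_similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

print(confidence_score(
    {"last_name": "Caster", "first_name": "John", "birth_date": "1980-01-01"},
    {"last_name": "Castar", "first_name": "Jon",  "birth_date": "1980-01-01"},
))  # a near-match scores high despite the single-character differences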

Once the confidence score is calculated, it is compared against thresholds to determine what action to take. The matching results usually fall into three categories: automatic linking, manual review, and don't link. One of the key aspects of this matching exercise is analyzing the thresholds (a huge topic in itself, which I will cover in a later blog post) so that records which are real duplicates get linked and true non-matches get dropped. You also want to keep the manual review queue manageable, because those tasks require human intervention to conclude a match or non-match.
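
As an illustration, here is a sketch of how the three outcome buckets might be wired up. The 0.92 and 0.75 cutoffs are placeholder values; in practice they come out of the threshold analysis mentioned above.

# Placeholder thresholds; these must be tuned for each data set.
AUTO_LINK_THRESHOLD = 0.92
REVIEW_THRESHOLD = 0.75

def disposition(score: float) -> str:
    # Map a confidence score to one of the three standard outcomes.
    if score >= AUTO_LINK_THRESHOLD:
        return "automatic link"
    if score >= REVIEW_THRESHOLD:
        return "manual review"
    return "don't link"

for s in (0.97, 0.81, 0.40):
    print(s, "->", disposition(s))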

Statistical Approach

One of the powerful features of the probabilistic technique is that it takes into consideration how frequently a data value occurs within a particular distribution (or data set). For example, in the United States the last name "Smith" is the most common surname; the Census 2000 page here shows around 2.4M people with the last name Smith. Given this, the engine should render a lower matching score for Smith than for a rare name such as my last name, "Chandramohan". In other words, when you are dealing with North American data sets, the likelihood that an agreement on Smith is a true match is lower than the likelihood that an agreement on Chandramohan is a true match.

This statistical aspect adds tremendous value to fuzzy matching. The algorithm generates matching scores that depend on your data set and the distribution of values in it. As a result, instead of just fuzzy-comparing the information contained in two data elements, you can use this additional information to say how much the match contributes to the overall confidence.
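
Here is a minimal sketch of how such frequency-based weighting can work, using a toy list of last names and a simple negative-log weight. The log-based formula is just one common choice for illustration; real engines derive these weights far more rigorously from the full data set.

import math
from collections import Counter

# Toy data set: "Smith" is common, "Chandramohan" is rare.
last_names = ["Smith", "Smith", "Smith", "Jones", "Chandramohan"]
freq = Counter(last_names)
total = len(last_names)

def agreement_weight(value: str) -> float:
    # Rarer values get a higher weight: -log2 of the relative frequency.
    return -math.log2(freq[value] / total)

print(agreement_weight("Smith"))         # common -> lower weight (~0.74)
print(agreement_weight("Chandramohan"))  # rare   -> higher weight (~2.32)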

The probabilistic matching approach is usually preferred when you have a large number of records and many attributes are involved in the matching process. The statistical weighting makes even more sense when no common identifiers enable linking of records. While this matching technique has many positive aspects, it is also complex and requires a great deal of experience. And certainly, there are drawbacks and things you need to be cautious about, which I will talk about in my next blog post.

What is your experience with Probabilistic Matching? What works well, and what does not? Please share your opinion via comments.

COMMENTS

5 Thoughts on De-duplication Part 2: Probabilistic Matching Approach
    Bob
    2 Jul 2014
     2:19pm

Prashant, would you agree that probabilistic matching is more appropriate for material data (e.g. material descriptions) than deterministic matching? Deterministic matching, however, fits customer / vendor data better, because name and address details (postal code, address structure) are more standardized, at least at the country level.

    Rob
    10 Jul 2014
     4:29pm

    Prashant, would you agree that Probabilistic Matching is more suitable for Product / Material Data versus Deterministic Matching for Customer Data?

    Shashi
    12 Oct 2014
     12:06am

    Hi Prashant,
    Can you throw some light on which MDM products are available for probabilistic matching and who the market leaders are? I have been using Initiate for the last 6 years and have found the product quite efficient.

    Nidheesh N
    5 Oct 2016
     3:25am

    Hi Prashant,

    Is there any change in PME during the upgrade from MDM 11.0 to 11.5?
