De-duplication Part 1: Deterministic Matching Approach

Matching party data is at the heart of an MDM implementation. De-duplication is a key MDM activity and brings plenty of challenges. One of the major dilemmas is recognizing the correct matching technique to use for a given scenario.

One of my blog readers recently asked me: how do you choose between deterministic and probabilistic matching during data consolidation? While the question is fairly straightforward, the answer isn’t all that simple. There is no universal right or wrong answer, and the correct solution for you depends on the accuracy and completeness of your data. Standardized representation of key data elements such as name and address plays an important role and is a major factor influencing the method you pick.

In this series of blog posts, I will first focus on the deterministic matching approach. The next post will be centered on the probabilistic method. We will look into each of these matching techniques, understand them, and see how they can benefit your implementation. I will also touch on the topic Craig Milroy highlighted here.

Deterministic Matching

Deterministic matching is a rules-based process for determining an “exact match” between two records. Business rules are defined in advance, and the technique compares records to see whether they satisfy those rules. Records are evaluated on how precisely they adhere to the defined business rule.

We compare the values of all of a given party’s critical data elements with those of another party. Usually, equal weight is given to the different types of information a record holds. For example, we might place equal reliance on a match between the names on two records and a match between their birth dates.
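As a minimal sketch of such a rule, assuming a hypothetical Party record with three critical data elements (the field names are illustrative and not taken from any MDM product), an equal-weight exact-match check could look like this:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Party:
    # Hypothetical critical data elements, chosen for illustration only.
    first_name: Optional[str]
    last_name: Optional[str]
    birth_date: Optional[str]  # ISO date string, e.g. "1980-01-01"


CRITICAL_ELEMENTS: Tuple[str, ...] = ("first_name", "last_name", "birth_date")


def is_exact_match(a: Party, b: Party,
                   elements: Tuple[str, ...] = CRITICAL_ELEMENTS) -> bool:
    """Return True only if every critical element is present and equal.

    Every element counts the same: one mismatch or one missing value
    is enough to reject the pair.
    """
    for element in elements:
        left, right = getattr(a, element), getattr(b, element)
        if left is None or right is None or left != right:
            return False
    return True


john_1 = Party("JOHN", "SMITH", "1980-01-01")
john_2 = Party("JOHN", "SMITH", "1980-01-01")
jon = Party("JON", "SMITH", "1980-01-01")
```

Here is_exact_match(john_1, john_2) is true, while is_exact_match(john_1, jon) is false because a single element differs.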

Deterministic matching usually produces a matching score. The comparison takes into account the presence, absence, and content of the values. Values that are present in both suspect parties and match create a unique score referred to as the match relevancy score. Values that are present in both parties but do not match create the non-match relevancy part of the score. Taken together, the match and non-match relevancy scores define the type of suspect that has been found. When a value is missing in one or both parties for a particular critical data element, that element is left out of both the match and non-match relevancy scores.
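The scoring described above can be sketched roughly as follows. The post does not say how the scores are made unique, so this example assumes each critical data element gets a distinct power-of-two code, which makes every combination of elements add up to a different number. The codes and element names are hypothetical, and the codes identify elements rather than rank their importance.

```python
from typing import Mapping, Optional, Tuple

# Assumption: distinct powers of two, so each sum maps back to exactly
# one set of elements. These codes are illustrative only.
ELEMENT_CODES = {"last_name": 1, "first_name": 2, "birth_date": 4, "tax_id": 8}


def relevancy_scores(a: Mapping[str, Optional[str]],
                     b: Mapping[str, Optional[str]]) -> Tuple[int, int]:
    """Return (match_relevancy, non_match_relevancy) for two parties."""
    match_score = 0
    non_match_score = 0
    for element, code in ELEMENT_CODES.items():
        left, right = a.get(element), b.get(element)
        if not left or not right:
            # Missing in one or both parties: excluded from both scores.
            continue
        if left == right:
            match_score += code
        else:
            non_match_score += code
    return match_score, non_match_score


party_a = {"last_name": "SMITH", "first_name": "JOHN",
           "birth_date": "1980-01-01", "tax_id": None}
party_b = {"last_name": "SMITH", "first_name": "JON",
           "birth_date": "1980-01-01", "tax_id": "123-45-6789"}

print(relevancy_scores(party_a, party_b))  # (5, 2)
```

Last name and birth date match (1 + 4 = 5), first name is present in both but differs (2), and tax_id is ignored because one side is missing it. A rule table keyed on the pair (5, 2) could then decide what type of suspect this is.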

The deterministic approach is more suitable when a common unique identifier is available across the systems from which records are consolidated. It works best when comparing records where an exact match is anticipated and all the critical data elements are present. We have to ensure that the data is consistent and represented in a standardized format.
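To illustrate why standardization matters, here is a small sketch that normalizes free-text values before they are compared. The specific rules (strip accents, replace punctuation with spaces, collapse whitespace, upper-case) are assumptions made for this example; real standardization of names and addresses is usually far richer.

```python
import re
import unicodedata


def standardize(value: str) -> str:
    """Normalize a free-text value so equivalent spellings compare equal."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.upper()


print(standardize("  12, Main St.  ") == standardize("12 main st"))  # True
print(standardize("José"))  # JOSE
```

Without a step like this, an exact comparison treats "12, Main St." and "12 main st" as different values even though they describe the same address.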

Deterministic matching engines have a tendency to produce false negatives: duplicates that go unidentified, resulting in a false sense of security that “all is well” with the data.
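A toy illustration of such a false negative, using made-up records and a hypothetical match key of name plus birth date:

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_by_exact_key(records: Sequence[dict],
                       key_fields: Sequence[str]) -> Dict[Tuple, List[int]]:
    """Group record ids by an exact match on the given key fields."""
    groups: Dict[Tuple, List[int]] = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record["id"])
    return dict(groups)


# Made-up sample data: record 3 differs from the others by one letter.
records = [
    {"id": 1, "name": "JOHN SMITH", "birth_date": "1980-01-01"},
    {"id": 2, "name": "JOHN SMITH", "birth_date": "1980-01-01"},
    {"id": 3, "name": "JON SMITH", "birth_date": "1980-01-01"},
]

groups = group_by_exact_key(records, ("name", "birth_date"))
for key, ids in groups.items():
    print(key, ids)
```

Records 1 and 2 are grouped together, but record 3 ends up alone. If "JON" is simply a misspelling of "JOHN", that is a duplicate the exact rule never reports.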

What are your thoughts on the deterministic matching approach? What is good about it and what is bad? Do share your views…

COMMENTS

6 Thoughts on De-duplication Part 1: Deterministic Matching Approach

Shobhit Bagga

23 May 2014

11:36am

Good reference for clients trying to understand the difference between two methods.

Shweta

11 Jun 2014

1:46pm

Nice one. Probably we should add this content to MDM InfoCenter. 🙂

De-duplication Part 2: Probabilistic Matching Approach - MDM – A Geeks Point Of View

12 Jun 2014

7:32am

[…] Part 1 of this blog series on de-duplication, I wrote about Deterministic Matching approach. This blog post looks into Probabilistic Matching. I will briefly summarize the key highlights of […]

James Cookfair

17 Jul 2014

4:00pm

Exact match leaves a large percentage of duplicates remaining. Many match providers have improved this deterministic approach by leveraging their own databases, for example by recognizing that John, Jack and J. are the same first name. Corporate names and abbreviations are identified and placed in the correct party hierarchy. Is there a fuzzy line between the deterministic and probabilistic approaches?
Looking forward to your next blog.

MDM – A GEEK'S POINT OF VIEW