Matching party data is at the heart of an MDM implementation. De-duplication is a key activity in MDM, and it brings plenty of challenges. One of the major dilemmas is recognizing the correct matching technique to use for a given scenario.
One of my blog readers recently asked me: how do you choose between deterministic and probabilistic matching during data consolidation? While the question is fairly straightforward, the answer isn’t all that simple. There is no universal right or wrong answer, and the correct solution for you depends on the accuracy and completeness of your data. Standardized representation of key data elements such as name and address plays an important role and is a major factor influencing the method you pick.
In this series of blog posts, I will first focus on the deterministic matching approach. The next post will be centered on the probabilistic method. We will look into each of these matching techniques, understand them, and see how they can benefit your implementation. I will also touch on the topic Craig Milroy highlighted here.
Deterministic matching is a rules-based process to determine an “exact match” between two records. Business rules are pre-defined, and the technique compares records against those rules. Two records match only when they satisfy the defined business rule precisely.
We compare a set of values for all of a given party’s critical data elements with those of another. Usually, equal weight is given to each type of information a record holds. For example, we might place equal reliance on a match between the names on two records and a match between two birth dates.
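To make the idea concrete, here is a minimal sketch of an equal-weight, exact-match rule. The `Party` record, its fields, and the choice of critical data elements are my own assumptions for illustration; a real MDM hub would define these in its matching configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Party:
    # Hypothetical critical data elements, chosen only for this example.
    name: str
    birth_date: str
    postal_code: str


CRITICAL_ELEMENTS = ("name", "birth_date", "postal_code")


def agreement_count(a: Party, b: Party) -> int:
    """Count matching elements; every element carries the same weight (1)."""
    return sum(1 for e in CRITICAL_ELEMENTS if getattr(a, e) == getattr(b, e))


def is_exact_match(a: Party, b: Party) -> bool:
    """The business rule: all critical data elements must be identical."""
    return agreement_count(a, b) == len(CRITICAL_ELEMENTS)


p1 = Party("JOHN SMITH", "1980-01-15", "10001")
p2 = Party("JOHN SMITH", "1980-01-15", "10001")
p3 = Party("JOHN SMITH", "1980-01-16", "10001")

print(is_exact_match(p1, p2))   # identical on every element
print(agreement_count(p1, p3))  # name and postal code agree, birth date does not
```

Notice that a name match and a birth date match each add exactly one to the count, which is the "equal reliance" described above.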
Deterministic matching usually produces a matching score. The comparison takes into account the presence, absence, and content of the values. Values that are present in both suspect parties and match create a unique score referred to as the match relevancy score. Values that are present in both parties but do not match create the non-match relevancy part of the score. Taken together, the match and non-match relevancy scores define the type of suspect that has been found. When a value is missing in one or both parties for a particular critical data element, that element is excluded from both the match and non-match relevancy scores.
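The scoring described above can be sketched roughly as follows. I am assuming a bit-flag encoding (one power of two per critical data element) so that every combination of matching elements yields a unique score. The element names and bit values are hypothetical and not taken from any specific MDM product.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical encoding: one bit per critical data element.
ELEMENT_BITS = {"name": 1, "birth_date": 2, "address": 4, "tax_id": 8}

Record = Mapping[str, Optional[str]]


def relevancy_scores(a: Record, b: Record) -> Tuple[int, int]:
    """Return (match_relevancy, non_match_relevancy) for two suspect parties."""
    match, non_match = 0, 0
    for element, bit in ELEMENT_BITS.items():
        value_a, value_b = a.get(element), b.get(element)
        if not value_a or not value_b:
            # Missing in one or both parties: excluded from both scores.
            continue
        if value_a == value_b:
            match |= bit       # present in both and equal
        else:
            non_match |= bit   # present in both but different
    return match, non_match


party_a = {"name": "JOHN SMITH", "birth_date": "1980-01-15",
           "address": "12 MAIN ST", "tax_id": None}
party_b = {"name": "JOHN SMITH", "birth_date": "1980-01-15",
           "address": "98 OAK AVE", "tax_id": "123-45-6789"}

print(relevancy_scores(party_a, party_b))  # (3, 4)
```

Here name and birth date match (1 + 2 = 3), address is present in both but differs (4), and tax ID is ignored because one party has no value for it. A rules table could then map each score pair to a suspect type.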
The deterministic approach is more suitable when a common unique identifier is available across the systems from which records are consolidated. It works best when comparing records where an exact match is anticipated and all the critical data elements are present. We have to ensure that the data is consistent and represented in a standardized format.
Deterministic matching engines have a tendency to cause false negatives: a condition where duplicates go unidentified, resulting in a false sense of security that “all is well” with the data.
What are your thoughts on the deterministic matching approach? What is good about it and what is bad? Do share your views…
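A small sketch shows how such a false negative arises and how standardization helps. The `standardize` function below is a deliberately simple, hypothetical cleanser (uppercase, strip punctuation, collapse whitespace). Real name and address standardization is far more involved.

```python
import re


def standardize(value: str) -> str:
    """Hypothetical cleanser: drop punctuation, uppercase, collapse spaces."""
    cleaned = re.sub(r"[^\w\s]", "", value)
    return " ".join(cleaned.upper().split())


def exact_match(a: str, b: str) -> bool:
    return a == b


raw_a, raw_b = "John  Smith", "john smith."

# Same person, but the raw strings differ: a false negative.
print(exact_match(raw_a, raw_b))

# After standardization both become "JOHN SMITH" and the duplicate is found.
print(exact_match(standardize(raw_a), standardize(raw_b)))

# A genuine spelling variation still slips through an exact comparison.
print(exact_match(standardize("Jon Smith"), standardize("John Smith")))
```

The last comparison is the kind of miss that standardization alone cannot fix, which is why deterministic matching can leave duplicates behind even on clean-looking data.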