An odd request came in last week: a prospective customer asked us for a benchmark on the percentage of duplicates we can find for them using MDM.

In this blog, I want to touch on a few key reasons why this request is odd. I would also like to take this chance to explain the right questions to ask your vendor when it comes to MDM matching.

I have worked with dozens of customers directly in the last 12 years. In my current role, I talk to companies implementing master data management, practitioners, and thought leaders on a day-to-day basis. I can confidently say that every customer's requirements around mastering are unique.

When it comes to identifying duplicate data in your organization, the discussion quickly turns to the customer's specific requirements. The usage of the data (e.g., analytics for marketing segmentation, real-time access to trusted data across the company), the number of source and target systems, and the quality of data in those sources all differ, even for organizations within the same industry. Often the business requirements dictate what you need to do with the data, and there are instances, such as legal and compliance, where the requirements call for certain duplicates to survive.

On top of this, there are project timelines and the trade-offs each customer makes among accuracy, performance, and data quality. Think of adjusting several knobs on your stereo to get the sound that YOU like best.

Result? The percentage of duplication we can find using an MDM tool varies from customer to customer. It depends on several parameters, and you need to find what is right for you. Your tolerance level for false positives and false negatives dictates your configuration.
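
To make the tolerance trade-off concrete, here is a minimal sketch in Python. It is not how any particular MDM product scores matches: the labeled pairs are made up for illustration, and the `similarity` and `evaluate` helpers are hypothetical names that use the standard library's `difflib.SequenceMatcher` as a stand-in for a real matching engine.

```python
from difflib import SequenceMatcher

# Hypothetical labeled pairs for illustration: (record_a, record_b, is_true_duplicate)
LABELED_PAIRS = [
    ("Jon Smith", "John Smith", True),
    ("Acme Corp", "ACME Corporation", True),
    ("Maria Garcia", "Mario Garcia", False),
    ("Robert Brown", "Bob Brown", True),
    ("Ann Lee", "Anne Leigh", False),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (a stand-in for a real match score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evaluate(threshold: float) -> dict:
    """Count false positives and false negatives at a given match threshold."""
    false_pos = 0
    false_neg = 0
    for a, b, is_dup in LABELED_PAIRS:
        predicted_dup = similarity(a, b) >= threshold
        if predicted_dup and not is_dup:
            false_pos += 1  # flagged as a duplicate, but it is not one
        elif not predicted_dup and is_dup:
            false_neg += 1  # a real duplicate that was missed
    return {
        "threshold": threshold,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }


# Turning the "knob": a looser threshold catches more duplicates but
# lets more false matches through; a tighter one does the opposite.
for t in (0.6, 0.75, 0.9):
    print(evaluate(t))
```

Which threshold is "right" depends on which kind of error costs you more, and that is why a single benchmark percentage says so little.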

The real questions you should be asking your vendor are:

  • How sophisticated is your matching engine?
  • Can it support probabilistic (fuzzy) and deterministic (exact) matching styles?
  • Is it easy to configure the matching engine? Is it easy to understand for my data stewards, IT and business users, so they are all on the same page?
  • How easy or hard is it for us to change the tolerance level for missed and false matches?
  • Does the matching engine consider phonetic spellings, partial fields, the statistical distribution of records and more?
  • Does the vendor tool allow fine-grained tuning of the ranges to search, the tightness of match, and other parameters for balancing the degree of matching against the amount of processing (performance)?
  • How is a data set with international names and addresses handled?
  • Can the matching engine learn from stewards' past decisions and self-correct?
  • What about data survivorship? Does the vendor provide easy ways for us to configure survivorship rules?
  • Does the vendor take a configuration over coding approach for both matching and survivorship?
  • Can you provide attribute level survivorship?
  • Does the vendor offer scalability for matching large data sets and performing multiple matches with different criteria?
  • Can you do fast searches that leverage matching in real-time? Is the matching engine designed to handle bulk, near real-time and real-time modes?
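
A few of the terms in the questions above can be sketched in code. The following is only an illustration under stated assumptions: the record fields (`tax_id`, `name`, `email`, `source`), the source names (`crm`, `erp`), and the `SOURCE_PRIORITY` table are all hypothetical, and `difflib.SequenceMatcher` stands in for a real fuzzy-matching engine.

```python
from difflib import SequenceMatcher


def deterministic_match(a: dict, b: dict) -> bool:
    """Deterministic (exact) style: match only when a trusted key is present and identical."""
    return bool(a.get("tax_id")) and a.get("tax_id") == b.get("tax_id")


def probabilistic_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Probabilistic (fuzzy) style: match when name similarity clears a tunable threshold."""
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return score >= threshold


# Hypothetical attribute-level survivorship rules, expressed as configuration
# rather than code: for each attribute, the sources to trust, in order.
SOURCE_PRIORITY = {
    "name": ["crm", "erp"],
    "email": ["erp", "crm"],
}


def survive(records: list) -> dict:
    """Build a golden record by picking each attribute from the most trusted source that has it."""
    golden = {}
    for attr, priority in SOURCE_PRIORITY.items():
        for source in priority:
            value = next(
                (r.get(attr) for r in records if r["source"] == source and r.get(attr)),
                None,
            )
            if value:
                golden[attr] = value
                break  # the highest-priority non-empty value wins
    return golden


crm = {"source": "crm", "tax_id": "123", "name": "Jon Smith", "email": ""}
erp = {"source": "erp", "tax_id": "123", "name": "John Smith", "email": "john@example.com"}

if deterministic_match(crm, erp) or probabilistic_match(crm, erp):
    print(survive([crm, erp]))
```

In this sketch, changing the threshold or the priority table changes the outcome without touching the matching or merge logic, which is the spirit of the configuration-over-coding question.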

These are only a few of the questions that come to my mind. A thorough analysis along these lines can help you choose the best solution in the market. A correct decision here can save you months of person-hours spent on manual stewardship.

Back to the original question: the answer depends on what you are trying to achieve.

I would love to hear your views. Please leave your response in the comments section or reach out to me at @mdmgeek on Twitter.
