Recently, Henrik Liliendahl Sørensen (@hlsdk) wrote a blog post discussing the data matching challenges involved in dealing with small-scale business owners.

Unlike individual customers and business customers, these small-scale business owners fall into an intermediate category, causing a lot of confusion in our data matching rules.

We compare records of the same type to get accurate matching results: person to person, organization to organization, address to address, and so on. This not only simplifies the matching process but also helps in identifying duplicates. Failing to identify the type of a record, or placing a record in the wrong category, causes discrepancies in the data matching process.
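The same-type rule can be sketched as a type-gated dispatcher. This is a minimal illustration with hypothetical field names and a deliberately crude string-similarity comparator, not a production matching engine:

```python
from difflib import SequenceMatcher

def _name_sim(x: str, y: str) -> float:
    """Crude fuzzy similarity between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def person_similarity(a: dict, b: dict) -> float:
    return _name_sim(a.get("last_name", ""), b.get("last_name", ""))

def org_similarity(a: dict, b: dict) -> float:
    return _name_sim(a.get("legal_name", ""), b.get("legal_name", ""))

def match(a: dict, b: dict) -> float:
    """Return a similarity score; records of different types never match."""
    if a.get("type") != b.get("type"):
        return 0.0  # apples vs. oranges: never compare across types
    comparators = {"person": person_similarity, "organization": org_similarity}
    fn = comparators.get(a.get("type"))
    return fn(a, b) if fn else 0.0
```

The gate at the top is the point: however good the fuzzy logic is, it only ever runs on records of the same declared type, which is exactly why a misclassified record silently escapes matching.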

Imagine a scenario where the hub receives a person record from a channel with only the last name field populated. The front desk staff had no good way to capture organization customer details, so they put the organization name in the last name field because it was mandatory (believe me, this happens very often). If matching is configured to compare only person-person and organization-organization records, we surely have a problem.
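One way to catch this before matching runs is a simple pre-match check that flags person records whose last name looks like a business name. This is a heuristic sketch with assumed field names (`first_name`, `last_name`) and an illustrative, far-from-complete list of legal-entity markers:

```python
# Hypothetical legal-entity suffixes that suggest an organization name.
ORG_MARKERS = {"llc", "inc", "ltd", "corp", "gmbh", "plc", "pvt"}

def looks_like_organization(record: dict) -> bool:
    """Flag a 'person' record whose last name resembles a business name."""
    last = (record.get("last_name") or "").lower()
    tokens = {t.strip(".,") for t in last.split()}
    # Heuristic 1: a legal-entity suffix in the last name.
    if tokens & ORG_MARKERS:
        return True
    # Heuristic 2: no first name plus a long multi-word "last name".
    if not record.get("first_name") and len(last.split()) >= 3:
        return True
    return False
```

Records flagged this way could be routed to a steward queue or reclassified as organizations before the matching rules see them.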

So what do we need to do to address this kind of issue?

One would argue that our matching rules need to be more efficient. They should be able to detect what type of record it is before applying fuzzy matching logic to the critical data elements. The fundamental problem I see here is that we are trying to compare one type of data (apples) with another (oranges).


Categorizing similar records is as important as standardization and consistent representation of data.

My sure-fire answer to this issue is upfront, proactive data quality management. We can add simple validations, like making first name and last name mandatory for a person, and capturing the legal name and type of business for an organization record. I know the rules are sometimes hard to implement in customer-facing applications, but the data quality control mechanisms built into the solution should take care of these transformations.
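Those validations can be expressed as per-type rule checks at the point where records enter the hub. A minimal sketch, assuming hypothetical field names and record types; a real data quality layer would also handle transformations and routing, not just rejection:

```python
def validate(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    rtype = record.get("type")
    if rtype == "person":
        if not record.get("first_name"):
            errors.append("first_name is mandatory for a person")
        if not record.get("last_name"):
            errors.append("last_name is mandatory for a person")
    elif rtype == "organization":
        if not record.get("legal_name"):
            errors.append("legal_name is mandatory for an organization")
        if not record.get("business_type"):
            errors.append("business_type is mandatory for an organization")
    else:
        errors.append("record type must be classified before matching")
    return errors
```

A record like `{"type": "person", "last_name": "Acme Trading LLC"}` would be stopped here for its missing first name, forcing the classification question to be answered before the record ever reaches the matching rules.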

Upfront data quality management in your solution should handle this, thus boosting the matching process and helping improve the duplicate consolidation percentage.

Above all, the staff handling data need to be educated. You wouldn't want loopholes in the applications feeding data to allow wrong classification of master data records. Even if they do, the data quality control workflow should address this.