Among the several challenges we face when kick-starting an MDM implementation is deciding which sources to include in the initial phase of deployment. Alongside crucial activities such as data collection, data transformation, normalization, standardization and matching, source identification is a critical factor in realizing MDM benefits early on.
The proven way to implement MDM is to start with a small set of data sources and grow incrementally. Once we identify the sources that hold the correct entities, dependent domains and attributes, we can lay effective groundwork for data matching and cleansing/standardization rules.
Getting things right at the beginning is a critical aspect of an MDM project, as it lays the foundation for future source system integration plans. It also makes the enterprise-wide MDM rollout easier, since additional data sources can then be added to the MDM hub.
So the question is how to choose the sources that go into MDM during this inaugural phase, given that organizations typically have a huge application landscape and do not know which systems are responsible for which master data. The exercise is also very revealing for many customer representatives themselves, who find dozens of databases containing data they did not know existed.
Depending on the master domain you are implementing, you would usually start by listing the most trusted data sources the company currently uses for its customer-facing applications. For example, if you are implementing a customer master, you will ask: which system currently manages customer names, current addresses and contact information? It’s easier said than done, though, because you will find that the organization has multiple siloed applications, each holding this information for a specific line of business. Each division, department and business process has customer information that its business owners consider complete.
One strong belief in the MDM arena is that the larger the data set, the larger the data quality issues and the greater the number of duplicate records. In a nutshell, we would usually choose the data sources that hold the largest number of customer records. This lets us set up rules that are as accurate and generic as possible, so that a wider set of data issues can be addressed up front.
Also, remember that you will need as much information as possible to do adequate data matching. So put the emphasis on the completeness of the attributes used for matching: the source you choose should have them densely populated. To discover more about the source data, you will need a quick initial profiling phase before taking certain decisions. Data profiling tools help in scanning data for missing values, incorrect values and elements violating business rules. This allows you to estimate the required clean-up effort more accurately. Profiling also helps you weigh each source carefully and judge whether it is a reliable source of master data.
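As a rough illustration of the completeness check described above, here is a minimal Python sketch that computes the fill rate of each matching attribute per candidate source. The source names (crm, billing), the attribute list and the sample records are hypothetical; a real profiling tool would also check formats and business rules.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def fill_rate(records: List[Record], attributes: List[str]) -> Dict[str, float]:
    """Return the fraction of records with a non-empty value per attribute."""
    total = len(records)
    rates = {}
    for attr in attributes:
        # Treat None, empty strings and whitespace-only strings as missing.
        filled = sum(1 for r in records if (r.get(attr) or "").strip())
        rates[attr] = filled / total if total else 0.0
    return rates


# Hypothetical sample extracts from two candidate sources.
crm = [
    {"name": "Ann Lee", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Bob Ray", "address": "", "phone": None},
]
billing = [
    {"name": "Ann Lee", "address": "1 Main St", "phone": ""},
    {"name": "", "address": "9 Oak Ave", "phone": ""},
]

# Assumed set of attributes needed for matching.
MATCH_ATTRIBUTES = ["name", "address", "phone"]

for source_name, records in {"crm": crm, "billing": billing}.items():
    print(source_name, fill_rate(records, MATCH_ATTRIBUTES))
```

Comparing these rates side by side, together with the record count of each source, gives a first indication of which source is dense enough to start with.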
How do you analyze the data? And how do you determine the correct sources of master data? Please share your experience and opinions via comments. Thank you.







Prash, selecting the sources is indeed very essential.
One aspect I have been working with a lot is how to involve external sources as well.
In the customer data arena these will be things such as address directories (as we also discussed earlier here on the blog in relation to geocoding), business directories for B2B data and consumer/citizen directories for B2C, very much dependent on the countries and industry in question.
These sources may be very helpful for standardization and data matching, and including such sources in future data entry will have a great impact on data quality if you are able to build this into your business processes.
Hi Henrik, Thank you for commenting.
You brought up a very important aspect: involving external data sources. I can only imagine how difficult things can get when country- and industry-specific B2B and B2C directories need to be integrated with MDM.
While working with many organizations to set up a product master hub, I have seen significant challenges posed by the data flows coming from supplier, distributor and partner systems. A very hard nut to crack, I must say.
Hey Prashant,
Selecting the best source of data to start with is a recurring challenge for many of our customers. Henrik brings up a great point that some external sources can provide an initial source of truth, for example Dun and Bradstreet for B2B data. The bigger challenge is that regardless of the accuracy of the external source, that data will still need to match up with the potential mess you’ll be dealing with internally. For example, how easily can you link helpIT systems with helpIT, helpIT inc, helpIT systems inc, or HSI?
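To make the linking problem concrete, here is a minimal Python sketch of rule-based name normalization. It is only an illustration, not how any particular product works: the suffix list and the decision to treat "systems" as a noise word are assumptions, and that second rule in particular could wrongly merge unrelated companies.

```python
import re

# Assumed lists for illustration only; real projects tune these per data set.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "corp", "co"}
NOISE_WORDS = {"systems"}


def match_key(name: str) -> str:
    """Build a simplified key: lowercase, drop punctuation, suffixes and noise words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in LEGAL_SUFFIXES and t not in NOISE_WORDS]
    return " ".join(kept)


variants = ["helpIT systems", "helpIT", "helpIT inc", "helpIT systems inc", "HSI"]
for v in variants:
    print(f"{v!r} -> {match_key(v)!r}")
```

The first four variants collapse to the same key, while the abbreviation HSI does not. Cases like that need an alias table, reference data or fuzzy matching on other attributes such as address.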
I mentioned in a blog post I wrote about the Retail Single Customer View (http://www.helpit.com/cleandata/?p=138) that a customer of ours selected their website data as a starting point. Their reasoning was that customers care most about receiving their order, so it is in their own best interest to provide accurate information; otherwise they will have to deal with logistics headaches down the road.
It’s even possible with some software, including helpIT’s applications, to score the quality of the information within each record based on completeness and accuracy.
So maybe the answer is a combination of identifying the source that cares most about the accuracy of its information and applying a record quality scoring methodology.
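A record quality score of this kind could be sketched as follows in Python. This is a generic illustration and not the scoring used by any specific software: the attribute weights, the simple e-mail pattern and the sample records are all assumptions.

```python
import re
from typing import Dict, List, Optional

# Hypothetical weights expressing how much each attribute matters for matching.
WEIGHTS = {"name": 0.4, "address": 0.3, "email": 0.3}
# Deliberately simple pattern used as a stand-in for an accuracy check.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid(attr: str, value: str) -> bool:
    """An attribute counts only if it is filled and passes its format check."""
    if not value.strip():
        return False
    if attr == "email":
        return bool(EMAIL_PATTERN.match(value))
    return True


def record_score(record: Dict[str, Optional[str]]) -> float:
    """Weighted share of attributes that are both present and plausible."""
    score = sum(w for a, w in WEIGHTS.items() if is_valid(a, record.get(a) or ""))
    return round(score, 2)


def source_score(records: List[Dict[str, Optional[str]]]) -> float:
    """Average record score, usable for comparing candidate sources."""
    return round(sum(record_score(r) for r in records) / len(records), 2) if records else 0.0


web_orders = [
    {"name": "Ann Lee", "address": "1 Main St", "email": "ann@example.com"},
    {"name": "Bob Ray", "address": "", "email": "bob-at-example"},
]
print([record_score(r) for r in web_orders], source_score(web_orders))
```

Averaging the record scores per source gives one number for comparing sources, which can then be weighed against how much each source's owners care about accuracy.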
Hi Josh,
Thank you for commenting.
Good thoughts there. I second your opinion about identifying sources which care about accuracy of information along with use of quality scoring methodology.
Hi Prashant
Thanks for the post. However, I have several very serious concerns about the overall approach you are advocating.
The first of these is that you at no point mention the MOST CRITICAL element for Master Data Definition and Management, which is the LOGICAL DATA MODEL (LDM). If you have not got this you cannot be said to be managing your master data. In fact, it would be impossible. It would be like claiming that you could manage the electrics in a large building without having a wiring diagram.
Secondly, Master Data Elements must be DEFINED by senior management, they cannot be inferred from existing data. What an enterprise is currently categorising and grouping its data as may be right or it may be very wrong. What it OUGHT to be cannot be inferred from the data itself. It must be defined and this definition will be shown in the LDM.
Thirdly, normalising existing data is a laborious, archaic and error-prone activity that should be avoided at all costs. This is a thoroughly outdated exercise called Relational Data Analysis (RDA), which I used to lecture on 20 years ago and which has been totally superseded by the Relational Data Model.
If those practising Master Data Management within an enterprise are to be taken seriously then they must be seen to be operating at the highest level of quality, using all of the very best techniques. They cannot be seen as a center of excellence if they are leaving out vital elements, such as the LDM, and using flawed techniques such as RDA.
Regards
John
Hi John,
Thanks for reading and providing your valuable insights.
I completely agree with you about the criticality of the LDM in defining Master Data Management. In fact, I have written dedicated blog posts on this topic on this site. Also, in my post Key Master Data Management Functionalities I put the data model as the topmost feature to consider.
The primary focus of this particular blog post is source system identification. Let’s assume we have the LDM defined and know which master data elements we need to manage. My emphasis here is on selecting the correct data sources, those that can provide all the bits and pieces we need to lay effective groundwork for data matching and cleansing/standardization rules. This is a key aspect of master data management projects, as we would like to show some good results to the business in what is usually a long-term (and often frustrating) implementation cycle.
Best Regards
-Prashant
Hi Prashant
Thanks for the feedback and the context.
I agree that once the LDM is in place you can cross-check it against existing data to see if you have missed anything.
However, I would strongly suggest that you always normalise in the LDM and then map all of your existing data onto that.
A properly drawn LDM will be fully normalised to 5NF.
Once again, thanks for the feedback.
Kind regards
John
Interesting – thanks.
A good post about the importance of a single customer view on this site.
Thanks again,
Tom