Among the several challenges we face when kick-starting an MDM implementation is deciding which sources to include in the initial phase of deployment. Alongside crucial aspects such as data collection, data transformation, normalization, standardization and matching, this step of source identification is a critical factor in realizing MDM benefits early on.

The proven way to implement MDM is to start with a small set of data sources and grow incrementally. Once we identify the sources that hold the right entities, dependent domains and attributes, we can lay effective groundwork for

  • Creating a broad set of rules to cleanse the data
  • Building standardization engines applicable to all relevant data entities and
  • Constructing rules to identify suspects (potential duplicate records) so as to create a single version of truth (as discussed in my earlier post)
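To make the last two points concrete, here is a minimal sketch of a standardization step followed by a suspect-identification rule. The field names (`name`, `email`), the helper names and the exact-match rule are assumptions made for illustration; a real MDM hub would typically use richer, often fuzzy, matching rules.

```python
import re
from itertools import combinations


def standardize(record):
    """Trim, lower-case and collapse whitespace in every string field."""
    return {
        key: re.sub(r"\s+", " ", value).strip().lower()
        if isinstance(value, str) else value
        for key, value in record.items()
    }


def find_suspects(records, keys=("name", "email")):
    """Return index pairs whose standardized key fields are present and equal."""
    cleaned = [standardize(r) for r in records]
    suspects = []
    for i, j in combinations(range(len(cleaned)), 2):
        if all(
            cleaned[i].get(k) is not None
            and cleaned[i].get(k) == cleaned[j].get(k)
            for k in keys
        ):
            suspects.append((i, j))
    return suspects


# Hypothetical sample records, not taken from any real system.
records = [
    {"name": "John  Smith", "email": "JSmith@example.com"},
    {"name": "john smith ", "email": "jsmith@example.com"},
    {"name": "Jane Doe", "email": "jane@example.com"},
]
suspects = find_suspects(records)
```

Standardizing before matching is what lets the first two records be flagged as suspects even though their raw values differ in case and spacing.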

Getting things right at the beginning is a critical aspect of the MDM project, as it lays the foundation for future source system integration plans. It also makes an enterprise-wide MDM rollout easier, since additional data sources can then be added to the MDM hub.

So, the question is how to choose the sources that will go into MDM during this inaugural phase, given that organizations have a huge application landscape and often do not know which systems are responsible for which master data. The exercise is also very revealing for many customer representatives themselves, who find dozens of databases containing data they did not know existed.

Depending on the master domain you are implementing, you would usually start by listing the most trusted data sources the company currently uses for its customer-facing applications. For example, if you are implementing a customer master, you will ask: which system currently manages customer names, current addresses and contact information? It’s easier said than done, though, as you will find that the organization has multiple siloed applications, each holding this information for a specific line of business. Each division, department and business process has customer information that its own business owners consider complete.

One strong belief in the MDM arena is that the larger the data set, the larger the data quality issues and the larger the number of duplicate records. In a nutshell, we would usually choose the data sources that own the largest number of customer records. This lets us set up rules that are as accurate and generic as possible, so that a wider set of data issues can be addressed upfront.
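As a rough illustration of this selection rule, the sketch below ranks candidate sources by record count and also reports how many records share a key value with another record. The source names, the `email` key and the function name are hypothetical; it assumes each source has already been extracted into a list of dictionaries.

```python
from collections import Counter


def volume_and_duplicate_rate(records, key="email"):
    """Return (record count, share of records whose key value is repeated)."""
    values = [str(r.get(key) or "").strip().lower() for r in records]
    counts = Counter(v for v in values if v)
    repeated = sum(c for c in counts.values() if c > 1)
    total = len(records)
    return total, (repeated / total if total else 0.0)


# Hypothetical extracts from two candidate source systems.
sources = {
    "crm": [
        {"email": "a@example.com"},
        {"email": "A@example.com "},
        {"email": "b@example.com"},
        {"email": "c@example.com"},
    ],
    "billing": [
        {"email": "a@example.com"},
        {"email": "d@example.com"},
    ],
}

# Largest source first, as suggested by the rule of thumb above.
ranked = sorted(sources, key=lambda name: len(sources[name]), reverse=True)
stats = {name: volume_and_duplicate_rate(sources[name]) for name in ranked}
```

Record count alone is only a starting point; the duplicate share gives a first hint of how much matching work a large source will bring with it.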

Also, remember that you’ll need as much information as possible to do adequate data matching. So put the emphasis on the completeness of the matching attributes (for a customer master, the name, address and contact information): the source you choose should have these attributes densely filled. To discover more about the source data, you will need a quick initial profiling phase before making these decisions. Data profiling tools help in scanning data for missing values, incorrect values and elements violating business rules. This will allow you to better estimate the effort required for clean-up work. Profiling will also help you weigh each source carefully and judge whether it is a reliable source of master data.
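A quick completeness check of this kind can be sketched in a few lines. The example below computes, for one source, the share of records in which each matching attribute is filled; the attribute names and sample data are assumptions for illustration, and a dedicated profiling tool would go much further (value patterns, ranges, business rule violations).

```python
def fill_rates(records, attributes):
    """Share of records with a non-empty value for each attribute."""
    total = len(records)
    rates = {}
    for attr in attributes:
        filled = sum(
            1 for r in records
            if r.get(attr) is not None and str(r[attr]).strip() != ""
        )
        rates[attr] = filled / total if total else 0.0
    return rates


# Hypothetical extract from a candidate source system.
crm = [
    {"name": "John Smith", "address": "1 Main St", "phone": ""},
    {"name": "Jane Doe", "address": None, "phone": "555-0100"},
]
rates = fill_rates(crm, ["name", "address", "phone"])
```

Running the same check against every candidate source gives a simple, comparable view of which ones have the matching attributes densely filled.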

How do you analyze the data? And how do you determine the correct sources of master data? Please share your experience and opinions via comments. Thank you.