Originally published on Hub Designs Magazine May 2012
Over the last few years, Master Data has been recognized as one of the most important types of business information to be managed. Organizations are heading in the right direction by implementing Master Data Management (MDM) systems to take control of critical data like customers, products, employees, suppliers, materials, locations, etc.
There are different architectures and styles that an organization can adopt when building its MDM solution. Gartner defines Consolidated, Registry, Coexistence and Transaction hubs as the top implementation styles. The hub-and-spoke architecture is a very common approach and provides a single consolidated repository of different master data entities. The main advantage of this architecture is its ability to provide an enterprise-wide, complete and accurate view (often referred to as the “360 Degree View” or “Single View of the Truth”) of various master data domains.
Implementing MDM requires in-depth changes to the way organizations work, partly because the technologies adopted here are fairly new, and mainly due to the cultural challenges MDM poses to most organizations. When a company is implementing an MDM hub, by definition, it is building a system that will have a footprint across all the departments and lines of business in the organization.
Statistics show that MDM implementations either take a long time to complete or, in some cases, yield less than the expected return. While the key reasons are the failure to assess data quality and to establish data governance bodies, the maturity of the organization can also be a significant issue.
Over the past 8 years of providing MDM consulting across different industries, one other problem I keep stumbling upon is the absence of teams with an end-to-end vision. An important role during MDM inception is that of the solution architect: the one who should have the complete picture of the solution, including the applications and technologies for profiling, data cleansing, consolidation, enrichment, de-duplication and synchronization, which are key ingredients of the MDM recipe.
More than anyone else, the solution architect should know the complete information system landscape and which integration mechanisms to apply, and should be able to estimate the hardware and infrastructure requirements based on the volume of data. Technical constraints and lack of expertise are common pitfalls, and being aware of them early is a key part of the role.
One can’t build a house without a blueprint. The same holds true for MDM implementations. Setting the right foundation and architecting the solution while keeping long-term goals in mind is crucial.
The next section outlines the five key factors that a solution architect should look into during MDM hub design. A carefully designed architecture covering each of these points helps create a sustainable and scalable solution. This balanced, holistic approach also accelerates the implementation, helping to realize MDM benefits more rapidly.
- Data profiling to understand the current state of data quality
- Data integration mechanisms to consolidate the data
- Designing an extensible master data repository
- Robust data matching & survivorship functionality
- Seamless synchronization of master data
We’ll elaborate on each of these factors and note some of the key takeaways that help to architect a robust MDM solution.
Data profiling to understand the current state of data quality
The reason most often cited for implementing MDM is to reduce dirty data in the organization. Cleansing data is a difficult, repetitive and cumbersome process. The politics and the viral nature of data and the silos that exist in the organization can all add to the muddle to make it even more challenging.
Given that, we need something that can tell us how much of the data is dirty in the first place. Data profiling tools do that. Not only does data profiling provide extremely useful information about the quality of data, it also helps in discovering the underlying characteristics and discrepancies in the data.
Position the data profiling tool at the forefront of your MDM solution. The tool you use should be able to generate easy-to-interpret reports for pre- and post- data cleansing activities. As an architect, one has to put in place a well-designed, easy-to-use and effective data profiling tool.
The tool should support both the initial data migration and ongoing data integration tasks. For example, data profiling can provide useful insights about data coming from a specific source at the beginning of the project. The generated report may include information such as the percentage of customers whose birth date is defaulted to 01-01-1901, or Tax IDs set to 111-111-1111. These errors need to be fixed either in the source or on the way into the MDM repository.
Although profiling is a key aspect, don’t let it become a major hindrance to the overall architecture; be cautious about the time you spend here. The biggest benefit of profiling comes in the form of accurate estimates for the data integration effort, which is the next part of our solution.
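As a rough illustration of the kind of report described above, here is a minimal Python sketch that counts placeholder values in a batch of customer records. The field names (`birth_date`, `tax_id`), the check names and the `profile` function are assumptions made for this example; a real profiling tool would cover far more rules.

```python
# Placeholder values taken from the examples in the text.
DEFAULT_BIRTH_DATE = "01-01-1901"
PLACEHOLDER_TAX_ID = "111-111-1111"

# Hypothetical checks: name -> predicate that flags a suspicious record.
CHECKS = {
    "defaulted_birth_date": lambda r: r.get("birth_date") == DEFAULT_BIRTH_DATE,
    "placeholder_tax_id": lambda r: r.get("tax_id") == PLACEHOLDER_TAX_ID,
    "missing_tax_id": lambda r: not r.get("tax_id"),
}


def profile(records):
    """Return, for each check, the percentage of records that fail it."""
    total = len(records)
    report = {}
    for name, check in CHECKS.items():
        hits = sum(1 for record in records if check(record))
        report[name] = round(100.0 * hits / total, 1) if total else 0.0
    return report


sample = [
    {"birth_date": "01-01-1901", "tax_id": "111-111-1111"},
    {"birth_date": "07-23-1975", "tax_id": "123-456-7890"},
    {"birth_date": "01-01-1901", "tax_id": ""},
    {"birth_date": "11-02-1988", "tax_id": "987-654-3210"},
]
report = profile(sample)
```

Running the same checks before and after cleansing gives the pre- and post-cleansing comparison mentioned earlier.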
Data integration mechanisms to consolidate the data
While data profiling helps us determine the rules required, data integration tools actually transform the data. They help in applying numerous rules so that an intermediate clean state of data can be reached.
Historically, data integration tools have helped businesses to efficiently and effectively gather, transform and load data during mergers and acquisitions, or during the streamlining of different departments. When it comes to MDM implementations, these tools play a crucial role as they can pretty much “make or break” the establishment of an MDM hub.
A well-architected, sustainable and fast batch ETL solution may be appropriate for integrating the source and target systems with a Master Data hub. Careful consideration should be given to the timely availability of the data by designing the optimal number of staging areas. The architect should know exactly what goes through batch ETL versus real-time data integration. It’s very common to integrate source data via batch processing but as the hub evolves, there will be more need for real-time availability of data.
For real-time data integration, it is better to leverage integration tools based on an Enterprise Service Bus (ESB) and / or Service Oriented Architecture (SOA).
Define clear steps to prepare the data, the processing that needs to happen, and the volume of data flowing through the solution on a daily basis. Design the ETL solution as generically as possible so that new sources can be integrated easily, with minimal changes. If the groundwork is flawed, your future integration tasks will become difficult and time-consuming.
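One way to keep the ETL layer generic is to describe each source with a field mapping rather than with source-specific code. The sketch below assumes two hypothetical sources (`crm`, `billing`) and invented column names; onboarding a new source then means adding one mapping entry and nothing else.

```python
# Hypothetical per-source mappings: source column -> canonical hub field.
SOURCE_MAPPINGS = {
    "crm": {"cust_name": "name", "phone_no": "phone", "dob": "birth_date"},
    "billing": {"FullName": "name", "Telephone": "phone", "BirthDate": "birth_date"},
}


def to_canonical(source, row):
    """Translate one source row into the hub's canonical record layout."""
    mapping = SOURCE_MAPPINGS[source]
    record = {target: row.get(column) for column, target in mapping.items()}
    # Keep the origin so later steps (matching, survivorship) can use it.
    record["source_system"] = source
    return record


def stage(source, rows):
    """Apply the same transformation step to every row from a source."""
    return [to_canonical(source, row) for row in rows]
```

For example, `stage("crm", rows)` and `stage("billing", rows)` both produce records with the same `name`, `phone` and `birth_date` fields, which is what the downstream matching step needs.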
Designing an extensible master data repository
I listed some of the key master data management functionalities on my blog some time ago. The following points should be looked at closely by the architect while designing the master data tools and technologies:
Robust data matching and survivorship functionality
A powerful and effective fuzzy matching engine is a must-have to eliminate duplicate or redundant data. If you look at the matching capabilities of the MDM tools available in the market, most of them do a great job of identifying duplicate records. However, the tools are not simple to use because of the complex matching criteria used in every implementation.
Although certain commonalities exist among organizations, there are always variations to the rules used to compare data elements.
The first challenge is to determine if there are matches. Figure out the business definition of ‘a match’ in your organization. To define this, you will have to come up with a list of critical elements that are necessary for matching. Next, you have to assign weights to these elements. For example, a phone number match might be given more weight than a name match or a date of birth match. Many MDM products come with pre-defined matching algorithms, and your best bet is to start from those and adapt them to your specific scenario. Careful choice of algorithm, testing and sampling are the keys here.
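The weighting idea can be sketched in a few lines of Python. The weights, the threshold and the use of the standard library's `difflib.SequenceMatcher` for string similarity are all assumptions chosen for illustration; commercial MDM matching engines use their own, more sophisticated algorithms.

```python
from difflib import SequenceMatcher

# Illustrative weights only: phone outweighs name and date of birth,
# as in the example above. They sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {"phone": 0.5, "name": 0.3, "birth_date": 0.2}
MATCH_THRESHOLD = 0.85  # assumed cut-off; tune it by testing on sampled data


def similarity(a, b):
    """Similarity in [0, 1] between two field values; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a, rec_b):
    """Weighted sum of per-field similarities."""
    return sum(
        weight * similarity(rec_a.get(field), rec_b.get(field))
        for field, weight in WEIGHTS.items()
    )


def is_match(rec_a, rec_b):
    return match_score(rec_a, rec_b) >= MATCH_THRESHOLD
```

With these settings, two records that share a phone number and birth date but spell the name "Jon Smith" and "John Smith" still score above the threshold, while records that agree only on a few stray characters do not.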
One of the questions I often face is whether to use deterministic matching or probabilistic (fuzzy) matching. Believe it or not, many organizations feel more comfortable with deterministic matching, although it is not as efficient or realistic as probabilistic matching.
Choose the option that fits the customer’s requirements. For example, if the customer says “we want to make sure that Tax ID is exactly the same to ensure we have a duplicate record”, you need to design your matching rules to accommodate this.
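A deterministic rule like the Tax ID requirement is simple to express. In this hypothetical sketch, the `tax_id` field name and the decision to strip formatting characters before comparing are assumptions; a customer who wants a strict character-for-character comparison would skip the normalization step.

```python
import re


def normalize_tax_id(value):
    """Keep digits only, so '123-45-6789' and '123456789' compare equal (an assumption)."""
    return re.sub(r"\D", "", value or "")


def deterministic_match(rec_a, rec_b):
    """Two records are duplicates only if both have a Tax ID and the IDs are identical."""
    a = normalize_tax_id(rec_a.get("tax_id"))
    b = normalize_tax_id(rec_b.get("tax_id"))
    return bool(a) and a == b
```

Note that two records with missing Tax IDs are deliberately not treated as a match; an empty value is not evidence of a duplicate.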
Data survivorship is another major puzzle. You will have to find the best answers for the following questions when you merge two or more duplicate records:
Seamless synchronization of master data
Architects spend an enormous amount of time on system integration, and MDM is no exception. MDM must seamlessly integrate with a variety of applications. Whether it’s the system that’s responsible for data entry, the data warehouse or the business intelligence system, MDM should bring the right information to the right person at the right time.
Realistically speaking, no MDM solution will replace all the sources of enterprise information overnight. It’s a continuously evolving process. Keeping this in mind, the architecture you design should allow other sources of master data to leverage the features of MDM as soon as possible.
Here are some of the important architectural concerns that need to be nailed down:
There are many instances where solutions are architected poorly and thus add time and effort to realizing the full potential of MDM. I hope the points discussed here will help you architect a more robust solution.