The purpose of this white paper is to discuss how machine learning and deep learning techniques can impact the way companies implement metadata discovery and business term assignments using the IBM DataOps platform.
First, a set of definitions to ensure consistent understanding of the topic:
- Master Data: the consistent and uniform set of identifiers and extended attributes that describe the core entities of an enterprise, such as existing or prospective customers, products, services, employees, vendors, suppliers, hierarchies and the charts of accounts
- Machine Learning: an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed
- Data Governance: the overall management of data availability, relevancy, usability, integrity and security in an enterprise
- Regulatory Compliance: an organization’s adherence to laws, regulations, guidelines and specifications relevant to its business
Most organizations spend a great deal of time and energy wrestling with dirty or poorly integrated data. Their people either cannot find the right data or cannot trust the data they find. On top of that, they must deal with multiple industry regulations that are barriers to self-service and data democratization. Organizations therefore try to fix their data through a variety of labor-intensive tasks, from writing custom programs to running global replace functions. As a result, their data analysts and data scientists can find their productivity diminished.
This is particularly true within large organizations, where many years of mergers and acquisitions have resulted in an extremely complex data environment of diverse systems and databases. While organizations are busy maintaining these legacy data environments, they are constantly creating new data at unprecedented speed. Some try to solve this problem with master data management tools, unifying disparate data sources to achieve a single view of their critical business entities.
Several vendor tools have approached this problem with a rule-based engine that unites a variety of data sources. Rules are easy to implement and widely understood. However, rule-based engines do not scale very well. In large enterprises, which must deal with large amounts of data and a variety of disparate systems, machine learning technologies are now replacing rules engines.
Machine learning has proven remarkably powerful in accomplishing a wide variety of analytics objectives, such as predicting customer churn or detecting fraud in online credit card transactions. While identifying data similarities or unifying data may not be the most exciting application of machine learning, it is one of the most beneficial and financially valuable applications to IBM clients.
The Benefits of Managing Master Data
Building a data catalog can be very labor-intensive and time-consuming, which is why so many organizations give up on creating and updating a well-organized data catalog. They also face additional challenges, such as:
- Standardizing business definitions and creating a business glossary
- Cataloging all data sources and updating with clear business descriptions
- Linking business terms to data fields across all data sources
Time is not the only cost of building a robust data catalog. Hiring domain experts to do this work on an ongoing basis can also be extremely expensive. This is where artificial intelligence and machine learning technology can help. IBM DataOps uses machine learning and neural networks to identify probabilistic matches among data records that are likely to refer to the same entity, even if they look different. This enables analysis of master data for quality and business term relationships, a major pain point for IBM clients. Projects that used to take months can now be done in a few weeks.
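To make the idea of a probabilistic match concrete, the following is a minimal sketch and not the IBM DataOps implementation. It assumes a simple stand-in technique: per-field string similarity from the Python standard library, combined with hand-picked weights into a single score between 0 and 1. The field names, weights, and sample records are all hypothetical; a production system would learn the scoring from labeled record pairs.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (assumption for illustration only);
# a real system would learn these from labeled record pairs.
FIELD_WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity for two field values, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-field similarities, used as a rough match confidence."""
    total_weight = sum(FIELD_WEIGHTS.values())
    weighted = sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )
    return weighted / total_weight


# Hypothetical records from two source systems.
crm = {"name": "Jonathan Smith", "email": "jon.smith@example.com", "city": "Boston"}
billing = {"name": "Jon Smith", "email": "jon.smith@example.com", "city": "boston"}
other = {"name": "Maria Garcia", "email": "mgarcia@example.org", "city": "Austin"}

print(match_score(crm, billing))  # higher score: likely the same customer
print(match_score(crm, other))    # lower score: likely different customers
```

The two customer records differ in how the name and city are written, yet they score well above the unrelated pair, which is the behavior the paragraph above describes.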
While machine learning automates many of these tasks, the process still needs human intervention, as with any other artificial intelligence or machine learning application. Through feedback learning, if the confidence score of a match is below a certain threshold, the system refers the candidate data records to a human expert using the workflow. It is far more productive for those experts to deal with a small subset of weak matches than an entire dataset.
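The threshold-based routing described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the product's workflow API: the `CandidateMatch` type, the `route_matches` function, the 0.85 cutoff, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cutoff (assumption); in practice it is tuned per dataset.
REVIEW_THRESHOLD = 0.85


@dataclass
class CandidateMatch:
    record_a: str      # identifier of the first record
    record_b: str      # identifier of the second record
    confidence: float  # match confidence between 0 and 1


def route_matches(candidates, threshold=REVIEW_THRESHOLD):
    """Split candidates into auto-accepted matches and a human review queue."""
    auto_accepted, needs_review = [], []
    for candidate in candidates:
        if candidate.confidence >= threshold:
            auto_accepted.append(candidate)
        else:
            needs_review.append(candidate)
    return auto_accepted, needs_review


# Hypothetical candidate matches with made-up scores.
candidates = [
    CandidateMatch("crm:1001", "billing:77", 0.97),
    CandidateMatch("crm:1002", "billing:81", 0.62),
    CandidateMatch("crm:1003", "billing:90", 0.91),
]

auto_accepted, needs_review = route_matches(candidates)
print(len(auto_accepted), "accepted automatically")
print(len(needs_review), "sent to a human expert")
```

Only the weak match reaches the expert's queue. In a feedback-learning setup, the expert's decisions on that queue would then be used as additional training examples.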
The benefits of this activity are huge, both for data curators and for data scientists in any organization. Consider a new data scientist who is given the task of developing a machine learning model to detect customer churn for a specific product or service. While the data scientist has an idea of what needs to be accomplished, he or she has no idea which data sets are available to start the task. With IBM data governance technology enabled with machine learning, the data scientist can easily search for business terms such as “customer retention” to get a graph view of all connected entities, then drill down to get information about the quality and authenticity of the data.