Taking Responsibility for Data Quality through Data Governance
David Plotkin, Finance Data Quality Manager
Bank of America
Data Quality 2012 Asia Pacific Congress
Last revised: 01/21/2012
Agenda
Introduction
Understanding Data Governance and its impact (and value add) to the Enterprise
Data Governance and Data Stewardship
How to implement Data Governance:
  What does the organization look like
  Figuring out what you've got
  Adding DG to the Project Methodology
  The tools you'll need
Setting up a Communications Plan
Measuring Success
Data Challenges: Data Needs to be Managed
Collecting Data Definitions for isolated databases is not enough:
Definitions written in haste by project staff
Not rationalized across the Enterprise
Documentation gets lost
Formal Enterprise-Wide Data Governance
Treat data as an asset:
  Inventory
  Assign owner
  Publish inventory, owner, glossary
Ownership at a granular level of detail
Consistent names & definitions across all apps and databases
Data Governance involved in all aspects of Data Quality
Include data governance processes in software lifecycles.
Understanding Data Governance
Data Governance is the execution of authority over data management:
It's all about data ownership at the organizational level (Data Governance board)
And decision making at the data element level (data stewardship)
The exercise and enforcement of decision-making authority over the management of data assets and the performance of data functions.
(Robert Seiner, TDAN and KIK Consulting)
Ensuring that the enterprise's data assets are formally managed. Coordinating communication to achieve collective goals through collaboration.
(Steven Adler, IBM)
What is Data Governance (Practical)?
Represents the Enterprise in all things data and metadata
Metadata: Mandates capture of this information
Data Quality: Issues, fixes, rules, and projects
Data-related policies and procedures
Champions data quality improvement projects
Instigates methodology changes to ensure capture of data and metadata
Owns the data and metadata
Driven by relatively high-ranking individuals who can make decisions for the Enterprise.
Data Governance Value
Data Governance must tie back to the universal value drivers:
Increase revenue and value
Manage cost and complexity
Ensure survival through attention to risk, compliance, security, and privacy (Gwen Thomas)
And does it? Think about:
How much time is wasted arguing over ill-defined or undefined data elements.
How many bad decisions are made due to undefined elements and poor quality.
How lack of trust in data drives analysts to do strange things.
Enterprise Data Governance in a Nutshell
Data Governance ensures that data is treated as a valuable asset and that it is well defined, accurate, consistent, and meets business needs. Data Governance provides project support along with an evolving set of policies, procedures, and guidelines to achieve these goals.
Everyone:
  Inventory shared data, requirements, and issues
  Project Data Stewards work on projects to collect these.
  DG SharePoint site facilitates the work.
Data Governance Team:
  Publish policies, processes and organization
  Coordinate committees
  Publish definitions, valid values and rules in Business Glossary
  Work with project teams to align deliverables to definitions
  Publish data quality issues and resolution decisions
Data Stewardship Committee:
  Ownership is by Business Function
  Business Data Steward represents each Business Function
  Escalation path is to Data Governance Council
Data Stewards:
  Define Data Elements, Valid Values & Derivation Rules
  Perform data quality analysis
  Work with SMEs and Technical Data Stewards
  Choose DQ remediation plan
Identify Data Elements and Issues
Assign Owner
Communicate Process, Decisions, Results
Define, Assess, Make Decisions
The DGI Data Governance Framework
Enables overcoming challenges and achieving common goals
[Figure: inhibitors that undermine four business goals]
Goals:
  Revenue Generation
  Cost Reduction / Avoidance
  Compliance & Risk
  Strategy & Business Agility
Inhibitors:
  Can't easily customize product offerings and bundles
  Can't easily reduce data errors
  High infrastructure cost
  High potential remediation costs
  Non-compliant with state & federal regulations
  Difficult to meet demands of new business channels
  Can't easily identify high-value customers
  Ad-hoc data quality methods
  Tarnished brand reputation
  Higher than necessary probability of data misuse
  Can't easily identify key relationships & hierarchies
  Can't easily identify cross-sell, up-sell opportunities
  Lack of data retention policies
  Weak data security monitoring
  No single view of customer
  Exposure of Personally Identifiable Information in non-production
  Can't easily consolidate data from silos, integrate new systems quickly (M&A)
Courtesy of Steven Adler, IBM
Data Governance and Data Stewardship
A data stewardship program is a key part of an overall data governance program. It is the operational aspect of data governance. If (as we stated earlier) data governance is the execution of authority over the management of data, then data stewardship is formalized accountability for the management of that data.
(courtesy of Robert Seiner, KIK Consulting)
This is where the day-to-day work gets done.
What do we mean by a Data Steward?
A key representative in a specific business area that is accountable for quality and use of that data throughout the organization. The data stewards are the owners of the data and the decision-makers about the data (Sherry Michaels, Erie Insurance)
Data stewards are the ones who can reach into the organization and pull out the knowledge (and knowledgeable people) that are needed.
Data Stewardship is NOT a job; it is the formalizing of data responsibilities that are likely in place in an informal way.
Data Stewardship involves specific tasks for which the stewards must be trained.
Data Stewardship: Needed for Data Quality
A data quality initiative introduces new constraints on the ways that individuals create, access, use, modify, and retire data. To ensure that these constraints are not violated, the data governance and data quality staff must introduce stewardship. That means:
Data Quality policies: introduced and monitored
Enough metadata to support the data quality processes
Incorporation of data quality into system design by the developers.
Data Quality requirements must support enterprise usage of the data (not just what is needed for the source system).
Identifying important business impacts of poor quality.
Data Stewards are Accountable
Data Stewardship establishes accountability for:
Data definitions and derivations
Data quality rules and their enforcement
Key role in improving data quality
Data-related communications
Data element rationalization
Contributing to data-related policies and procedures.
Understanding the downstream uses of their data and how proposed changes impact those uses.
Data Stewards have authority
Their decisions are enforceable
They oversee all data-related work in their business function
They represent their business function as the single point of contact.
What Happens Without Data Governance?
Different parts of the organization:
Use their own definitions for data, so they may enter different values. Leads to bad decisions, numbers that don't match, etc.
Derive their numbers based on different calculations and the numbers don't match.
Make different determinations of the data quality, leading to different degrees of confidence in the numbers (or even a decision not to use certain data).
Long arguments about meaning and quality.
Master Data Management (MDM) is impossible!
Improving Data Quality is very hard except in limited silos.
The organization without Data Governance
Data Quality without Data Governance
Data quality deteriorates over time
Hard to correct because:
Data producers are incented to be fast, but not necessarily accurate. Stewards must champion changing the business priorities.
Data quality rules are not defined. Stewards can define the rules and required quality levels.
Individuals make their own corrections. Stewardship exposes this and the costs of these processes.
Poor quality data is not detected proactively. Stewards can demand (and demand funding for) enforcement of DQ rules during system loads.
Data Governance Organization
Business and IT view of the Data Governance Organization:
[Organization chart; PT = Part Time, FT = Full Time]
Business: Data Governance Business Sponsor (PT)
IT: Data Governance IT Sponsor (PT)
Other roles shown:
  Data Owners (PT)
  Chief Data Steward (FT)
  Enterprise Application Owner (Delivery Manager) (PT)
  Application Domain Owner (Business Partners) (PT)
  Enterprise Data Steward (FT)
  Project Data Stewards (FT)
  Business Data Stewards (PT)
  Data Domain Stewards (FT)
  Technical Data Stewards (PT)
Legend groups: Data Governance Committee, Data Stewardship Council
Chart annotation: Creates working group
The Stewardship Organization
[Diagram: Data Stewardship Council, led by the Enterprise Data Steward, with one member per business function]
Business Functions:
  Sales
  Membership
  Products
  Insurance Services
  Claims
  HR
  Underwriting
  Operations
  Call Center
  Marketing
  Financial Modeling
  IT
  Financial Transactions
  Travel
  Actuarial
Data Stewardship Committee
Functional body for data governance program
Apply data standards, policies, and principles.
Participate in and contribute to data governance processes.
Evaluate effectiveness of processes.
Approve and manage data-related information.
Contribute to and ensure completeness of data-related documentation (metadata).
Make decisions on ownership of data.
Communicate data governance vision & objectives to business function and data analyst community.
Shape data governance design and implementation; ensure alignment to the business.
Communicate decisions of the committee.
Why Add Data Governance to Project Methodology?
DG tasks benefit from scope limitations of a project.
Limited block of data
Limited number of source systems
Management of tasks and deliverables benefits from professionals (Project Managers).
PMs will bird-dog the deliverables and ensure they get done (that's the theory, anyway).
PMs will schedule the tasks and allocate the resources.
Projects have the business attention.
Subject matter experts are assigned.
Time is allocated to work on the project tasks.
What needs to be added to Project Methodology?
Integration with Project Management
Metadata components (definitions, derivations, data quality rules).
Data Quality Components
Solution Evaluation components
QA Components (including Data Quality Assurance)
Data Governance Value to a Project
Collection of data definitions
Building a body of stewarded and understood data definitions benefits all those in the enterprise who use the data, and alleviates confusion when discussing the data. This also helps with conversions.
Collection of data derivations
Building a body of stewarded and validated data derivations leads to a common way of calculating numbers. The result is not only that the project delivers results that match the official calculation method, but much less time is spent by data analysts across the company attempting to reconcile reports.
Identification and resolution of data quality issues
Poor data quality can keep a project from going into production. The quality risk to a project is lessened by early identification (and where possible, resolution) of data quality issues. Data profiling measures specifics of the data, and provides a comparison between what the data looks like and what the data quality rules say it should look like.
Adjust Project Methodology: Data Quality
Collect (during Analysis and Design):
Data Quality issues and rules for measuring quality (meet guidelines)
Data Quality rules: When the data goes bad, how do you know?
Information to verify the issues and quantify severity
Project resources
Guided by Project Data Steward, collected from business analysts/SMEs
Documented in Mapping document or DQ rule dictionary
Measure and validate rules against data using Data Profiling.
Quantifies the extent of the data quality problem.
Rules may need to be restated if fit to data is poor.
Data is examined and results reported back to the business.
Determination must be made as to fitness for use.
Metrics:
Total DQ rules stated and validated
Fit of data to stated rules
Change in quality of data over time
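The "fit of data to stated rules" metric can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the presentation: the function name rule_fit, the sample records, and the state_is_populated rule are all hypothetical.

```python
from typing import Callable, Iterable

# A rule is any function that takes one record and returns True when it passes.
Rule = Callable[[dict], bool]


def rule_fit(records: Iterable[dict], rule: Rule) -> float:
    """Return the fraction of records that satisfy the rule (1.0 if there are none)."""
    records = list(records)
    if not records:
        return 1.0
    passed = sum(1 for record in records if rule(record))
    return passed / len(records)


# Hypothetical sample data, for illustration only.
policies = [
    {"policy_id": 1, "state": "CA"},
    {"policy_id": 2, "state": ""},
    {"policy_id": 3, "state": "NV"},
    {"policy_id": 4, "state": None},
]


def state_is_populated(record: dict) -> bool:
    """Hypothetical rule: the state field must not be empty or missing."""
    return bool(record.get("state"))


fit = rule_fit(policies, state_is_populated)
print(f"Fit of data to rule: {fit:.0%}")
```

Recording the same figure after each load gives the "change in quality of data over time" metric as a simple difference between runs.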
Adjust Project Methodology: QA
QA test cases written using Data Quality rules
Test cases run as part of regular QA process
Data defects tracked in QA system and prioritized and worked just like any other defects.
Some business rules and relationships may show up as data defects (policies without drivers).
QA test cases written using metadata (definitions)
Do screens show data expected based on definitions?
Do valid value sets show values expected based on definitions and stated value sets?
Do screens show multiple fields that are actually the same thing (due to acronyms)?
Has the metadata been entered into the EMR and glossary?
Data Governance and Data Quality
A primary deliverable for Data Governance is improved data quality
This should go beyond just response to DQ issues (reactive) and include defining, finding, and fixing DQ issues before the customer does (proactive).
Should include Data Quality Analysis and Reconciliation
Needs to be driven by the Business Impacts of poor quality: some data may be bad, but if it doesn't stop important business processes, MOVE ON.
The Data Quality Improvement Cycle
[Cycle diagram with two phases: Analyze and Act]
(1) Identify and measure how poor data quality impedes business objectives
(2) Define business-related data quality rules & performance targets
(3) Design quality improvement processes that remediate process flaws
(4) Implement quality improvement methods and processes
(5) Monitor data quality against targets
Business Results Metrics Example
Cost of poor quality data to your business:
Calling/Mailing costs: How many times did we contact someone who already had a particular type of policy or who was not eligible for that type of policy? How much postage/time was wasted?
Loss of productivity/opportunity cost: How many policies could have been sold if agents had only contacted eligible policyholders? How much would those policies have been worth?
Loss of business cost: How many policyholders canceled their policies because we didn't understand their needs or didn't appear to value their business (survey can give you an idea)? What is the lost lifetime value of those customers?
Compliance cost: How much did we spend responding to regulatory or audit requests (demand!)? How much of that was attributable to poor data quality or information not available?
Steps to Data Quality Analysis and Reconciliation
Data Profiling
Data Profiling Results Review
Reviewing the data quality analysis with Data Stewards to determine acceptable ranges of data quality, associated risk, transformation guidelines, and recommendations on data cleansing.
Data Cleansing
The development of required ETL processing to cleanse the data. Only want to do this once after the process has been fixed. Or that's the theory, anyway.
Collecting the Data Quality Rules
Get the rules from the Data Stewards
Create a template to collect the quality rules:
  Mandatory, optional, valid values, valid range, data type, patterns
  Relationships between data elements
  Relationships between records in different tables
Guided conversations with stewards to gather rules
Helping the business help us define what we mean by good quality for a data element.
Can help to pre-profile the data (do a sample extract) to show the stewards what is actually present now.
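One way to picture such a rule template is as a small record per data element. The sketch below is an assumption about how the template fields named above (mandatory, valid values, valid range, data type, pattern) could be captured and checked; the class name ElementRule, its field names, and the sample rules are hypothetical and do not come from the presentation.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ElementRule:
    """Hypothetical template row: one data element and its stated quality rules."""
    element: str
    mandatory: bool = False
    data_type: Optional[type] = None
    valid_values: Optional[frozenset] = None
    valid_range: Optional[tuple] = None  # (low, high), inclusive
    pattern: Optional[str] = None        # regular expression that must match fully

    def violations(self, value: Any) -> List[str]:
        """Return a list of rule violations for one value (empty list means it passes)."""
        if value is None or value == "":
            return ["missing mandatory value"] if self.mandatory else []
        if self.data_type is not None and not isinstance(value, self.data_type):
            # Stop here: range and pattern checks are meaningless on the wrong type.
            return [f"expected {self.data_type.__name__}"]
        problems = []
        if self.valid_values is not None and value not in self.valid_values:
            problems.append("not in valid value set")
        if self.valid_range is not None:
            low, high = self.valid_range
            if not (low <= value <= high):
                problems.append("outside valid range")
        if self.pattern is not None and not re.fullmatch(self.pattern, str(value)):
            problems.append("does not match pattern")
        return problems


# Hypothetical rules as a steward might state them.
zip_rule = ElementRule("zip_code", mandatory=True, data_type=str, pattern=r"\d{5}")
age_rule = ElementRule("age", data_type=int, valid_range=(0, 120))

print(zip_rule.violations("9410"))
print(age_rule.violations(130))
```

Relationships between data elements or between records in different tables do not fit a single-value check like this and would need rules that see a whole record or several tables.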
What is Data Profiling?
Data Profiling is a process whereby one examines the data available in an existing database and collects statistics and information about that data.
Wikipedia, [Link]
Data Profiling is the use of analytical techniques to discover the structure, content, and quality of data.
Danette McGilvray, Granite Falls Consulting, Inc.
Data Profiling is a set of algorithms for statistically analyzing and assessing the quality of data values within a data set as well as exploring relationships that exist between data elements or across data sets.
David Loshin, Knowledge Integrity, Inc.
What is Data Profiling (continued)?
Uses both real data and metadata to determine the quality of data.
Identified source data requires both a detailed analysis of the raw data values currently stored in existing databases and files, and a review of the existing metadata, to determine the actual meaning, descriptions and relationships that should be found in the data.
Data profiling should be used whenever data is being converted, migrated, warehoused or mined.
Can help discover business rules embedded within data sets, which can be used for ongoing inspection and monitoring.
General Benefits from Data Profiling
Identify or validate availability of information.
Improve predictability of project timelines.
Lower the risk of design changes late in the project.
Data integration and migration testing support.
Support compliance and audit requirements.
Rapid assessment of which fields are consistently populated against model expectations.
Focus data quality efforts where they are really needed.
Improve visibility to quality of data that supports business decision making.
Compare source, target, and transitional data stores.
Identify transformation rules for migration and integration.
Danette McGilvray, Granite Falls Consulting, Inc.
Benefits: Saves the Programmers time and effort
Programmers already examine the data to make sure their work doesn't lead to code/load/explode.
If they believe what they are told about the data contents, it invariably leads to code failures.
They end up reviewing the results with the project team to decide whether to code around the bad data or fix it.
Profiling puts a rigorous process in place to prevent the need for this effort.
Real example: 24 defects, $556,000 in development time, $142,000 in QA time, 6 month delivery delay because of unexpected data in the feed.
Scope of the Data Profiling Process
Not just done on raw data elements:
Includes counts and aggregations
Other derived values
Can be run on:
Individual columns
Across columns in a table
Across tables
Across applications and databases
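To make the scope concrete, here is a minimal Python sketch of two of these levels: a single-column profile and an across-tables check. It is an illustration only and does not represent any particular profiling tool; the function names profile_column and orphan_keys and the sample customers and policies tables are hypothetical.

```python
from collections import Counter


def profile_column(rows, column):
    """Collect basic statistics for one column across a list of row dictionaries."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "populated": len(present),
        "distinct": len(counts),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "most_common": counts.most_common(3),
    }


def orphan_keys(child_rows, child_key, parent_rows, parent_key):
    """Across-tables check: child key values that have no matching parent record."""
    parent_values = {row[parent_key] for row in parent_rows}
    return sorted({row[child_key] for row in child_rows
                   if row[child_key] not in parent_values})


# Hypothetical sample tables, for illustration only.
customers = [{"customer_id": 1}, {"customer_id": 2}]
policies = [
    {"policy_id": 10, "customer_id": 1, "premium": 500},
    {"policy_id": 11, "customer_id": 2, "premium": None},
    {"policy_id": 12, "customer_id": 7, "premium": 750},
]

premium_profile = profile_column(policies, "premium")
orphans = orphan_keys(policies, "customer_id", customers, "customer_id")
print(premium_profile)
print("Policies pointing at unknown customers:", orphans)
```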
Using Data Profiling for DQ Assessment
1. Extract data to be profiled
2. Analysts profile the data using a profiling tool and review results
3. Potential anomalies are noted within the tool's repository. Record:
   The data element in question
   The potential issue
   Why it might be an issue
4. Reports are generated from the profiling tool and reviewed by business subject matter experts
5. Issues are reviewed and evaluated, e.g.,
   Red: definitely an issue
   Green: not an issue
   Yellow: requires additional review
   Gray: out of scope
6. Results reviewed for next steps.
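The anomaly record from step 3 and the color coding from step 5 can be sketched as a small data structure. This is a hypothetical illustration, not the repository format of any profiling tool: the names Status, Anomaly, summarize, and next_steps, the default status, and the sample anomalies are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RED = "definitely an issue"
    GREEN = "not an issue"
    YELLOW = "requires additional review"
    GRAY = "out of scope"


@dataclass
class Anomaly:
    element: str          # the data element in question
    potential_issue: str  # the potential issue
    why_it_matters: str   # why it might be an issue
    # Assumption: items not yet evaluated are treated as needing review.
    status: Status = Status.YELLOW


def summarize(anomalies):
    """Count anomalies by review status."""
    return Counter(a.status for a in anomalies)


def next_steps(anomalies):
    """Keep the anomalies that still need action: confirmed issues and open reviews."""
    return [a for a in anomalies if a.status in (Status.RED, Status.YELLOW)]


# Hypothetical anomalies, for illustration only.
anomalies = [
    Anomaly("birth_date", "spike on January 1", "may be a default value", Status.RED),
    Anomaly("state", "blank values", "needed by downstream reports"),
    Anomaly("fax_number", "mostly empty", "field is no longer used", Status.GRAY),
    Anomaly("policy_type", "all values valid", "none found", Status.GREEN),
]

summary = summarize(anomalies)
follow_up = next_steps(anomalies)
print({status.name: count for status, count in summary.items()})
print([a.element for a in follow_up])
```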
Data Profiling is also a process
[Process diagram]
1 Define Data Quality Rules
2 Profile the data using a Data Profiling tool
3 Review Data Findings
4 Analyze Data Quality Issues
Later steps shown in the diagram:
  Determine issues worth fixing
  Set and enforce Data Quality targets
  Monitor ongoing Data Quality
Impacts on Metadata
The data quality rules discovered via data profiling are metadata.
The results (quality of the data) are also metadata
Must be documented
Profiling results in a determination that either:
The interpretation of the data given by the metadata is correct and the data is wrong, or
The data is correct and the metadata (data quality rules) are wrong
Unless they are both wrong
Metadata needs to be recorded!
What Data Profiling Achieves
[Diagram]
Inputs: Metadata (accurate and inaccurate); Data (accurate and inaccurate)
Process: Data Profiling
Outputs: Accurate Metadata; Facts about Inaccurate Data; Data Quality Issues
Analysis: An example of birthdates
[Chart: counts of birthdates across the year]
Check out the beginning of the year... and the end of the year.
Looks too high
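A check like this can be sketched in Python by counting birthdates per calendar day and flagging days that stand far above a typical day. The original chart data is not recoverable, so everything here is an assumption for illustration: the function name suspicious_days, the threshold factor of 5, and the sample dates. Spikes of this kind are often caused by placeholder or default dates, although the slide does not state the cause.

```python
from collections import Counter
from datetime import date
from statistics import median


def suspicious_days(birthdates, factor=5.0):
    """Return (month, day) pairs whose count exceeds `factor` times the median day count."""
    counts = Counter((d.month, d.day) for d in birthdates)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(day for day, n in counts.items() if n > factor * typical)


# Hypothetical sample: heavy clusters on January 1 and December 31,
# plus a thin spread of ordinary birthdays.
birthdates = (
    [date(1970, 1, 1)] * 12
    + [date(1980, 12, 31)] * 8
    + [date(1975, month, 15) for month in range(1, 13)]
    + [date(1982, 6, 3), date(1990, 9, 21)]
)

flagged = suspicious_days(birthdates)
print("Days that look too high:", flagged)
```

The flagged days are candidates for review with the data stewards, not proof of bad data; some populations may legitimately cluster on certain dates.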
Finishing Up
Data Governance is a program that needs corporate support and an organization
Data is an asset that must be defined, managed, stewarded and governed.
Accountability and Communication are crucial.
Data Quality and Robust Metadata are benefits of a Data Governance program
Taking responsibility for Data Quality across the corporation is a primary goal of Data Governance
Thank you, and any questions?