Using Generative AI to Improve Data Quality
Mitch Kwiatkowski is a seasoned Data and AI Leader with more than 20 years of experience driving the digital transformation of hospitals, physician practices, and health plans.
One of the first data governance issues I ever worked on involved tracking the status of a patient across a health system. This was triggered by an incident where a birthday card was sent to a child who had passed away in one of the system’s hospitals several years earlier. We should have known that and avoided an embarrassing mistake. After a thorough investigation, we understood that our systems had at least a half dozen ways in which the patient could be marked as deceased, and we lacked consistent business processes to create a source of truth.
Unfortunately, data quality issues like this are common in healthcare and impact clinical care, patient outreach, and business operations.
Most of us would agree that the quality of our data is vital to digital and analytic endeavors in any industry. A 2022 study published by Great Expectations reported that 77% of respondents admitted to having data quality issues in their organization. Of those, 91% believed the issues were affecting company performance [1]. Despite being a key component of an exemplary enterprise data management program, data quality is easy to neglect. It can be daunting to solve data quality issues when they seem to occur in all data in all systems. The business might not trust data teams to fix the issues, and data teams may be unable to find a business owner to help make key decisions. Worse, the organization may not be willing or able to invest in the time or tools to measure, monitor, and resolve issues.
Good Quality Data is Critical for AI
The output of any system is only as good as the data that goes into it. The explosion of generative AI (GenAI) means billions of records will train and enhance tools like LLMs, and the results will depend on the quality of information poured into foundation models.
Consider the patient status example from above. If an LLM is trained on records that mark patient status inconsistently, it may draw false conclusions about which patients are alive and which are deceased.
Poor data quality can lead to a variety of scenarios in healthcare:
- Decisions made about the most appropriate treatment plan could be flawed if data are missing or inaccurate.
- Errors in calculating the correct medication dosage can lead to dangerous outcomes.
- Incorrect diagnoses could lead to ineffective or improper care pathways.
- A health insurer might deny coverage for a service inappropriately, resulting in high medical costs for a patient.
As with any data quality program, an organization should focus on what is most important. Start small with Critical Data Elements (CDEs) and build momentum from there. Engage with business areas to establish rules and make decisions about the data. If you run into challenges, share stories about the potential impact of bad data quality with relatable examples.
This means unlearning decades of bad habits and dedicating time to the cause. It is a collaborative effort across business and data teams, and it must come with quantifiable metrics that show the value of the work. Fortunately, we may finally be at a place where we can do this with less time, cost, and effort.
AI May Be the Key to Better Data Quality
There are some exciting clinical and operational opportunities for GenAI in healthcare, but one of the more interesting ones is in data quality. A scalable, automated GenAI system could inject much-needed life into data quality management programs that have struggled to get support over the years. It could not only eliminate the need for large, dedicated teams but also reduce the burden and disruption of issue resolution that often falls to business areas.
Here are just a few examples of how GenAI could help with data quality functions:
Data Profiling
AI could profile a data source, summarize results, and identify errors, missing values, inconsistencies, or redundancies. From there, a data management team could work with the business to prioritize and address resolutions.
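To make the profiling step concrete, here is a minimal sketch in plain Python of the kind of raw summary such a system might start from. It is not a GenAI implementation; it only computes basic facts (missing values, distinct values, duplicate rows) that a model or a data steward could then interpret. The sample records, field names, and the `is_missing` and `profile` helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative assumptions.
records = [
    {"patient_id": "P001", "status": "alive", "phone": "555-123-4567"},
    {"patient_id": "P002", "status": "Deceased", "phone": ""},
    {"patient_id": "P003", "status": "D", "phone": "5551234567"},
    {"patient_id": "P003", "status": "D", "phone": "5551234567"},
    {"patient_id": "P004", "status": None, "phone": "555.123.4567"},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile(rows):
    """Return per-column missing counts and distinct values, plus a duplicate-row count."""
    columns = sorted({key for row in rows for key in row})
    report = {"row_count": len(rows), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if not is_missing(v)]
        report["columns"][col] = {
            "missing": len(values) - len(present),
            "distinct": sorted(set(present), key=str),
        }
    # Count rows that exactly repeat an earlier row.
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    report["duplicate_rows"] = sum(n - 1 for n in row_counts.values())
    return report


report = profile(records)
for name, stats in report["columns"].items():
    print(name, stats)
print("duplicate rows:", report["duplicate_rows"])
```

In this made-up sample, the `status` column has three distinct values, two of which appear to mean the same thing. That is the same kind of inconsistency behind the birthday card incident.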
Data Classification and Tagging
Many of the existing classification and tagging tools are limited to what a person configures, enters, and approves. AI could automate much of this intelligently and tag nearly every piece of data an organization has in its catalog. The more metadata collected at the source, the more that can be shared across systems to support true interoperability.
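As a rough illustration of automated tagging, the sketch below suggests tags for a column from its name and from how many sample values match a pattern. It is a simple rule-based stand-in, not a GenAI model, and a real system would likely send its suggestions to a person for approval. The tag names, patterns, the 0.8 threshold, and the `suggest_tags` function are assumptions made for this example.

```python
import re

# Hypothetical rules: (tag, column-name pattern, value pattern). Illustrative only.
TAG_RULES = [
    ("phone", re.compile(r"phone|tel", re.I),
     re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")),
    ("date", re.compile(r"date|dob", re.I),
     re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("email", re.compile(r"email", re.I),
     re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
]


def suggest_tags(column_name, sample_values, min_match_ratio=0.8):
    """Suggest tags when the column name matches or enough sample values match."""
    values = [v for v in sample_values if v]  # ignore empty samples
    suggestions = []
    for tag, name_pattern, value_pattern in TAG_RULES:
        name_hit = bool(name_pattern.search(column_name))
        matches = sum(1 for v in values if value_pattern.match(v))
        ratio = matches / len(values) if values else 0.0
        if name_hit or ratio >= min_match_ratio:
            suggestions.append({
                "tag": tag,
                "name_match": name_hit,
                "value_match_ratio": round(ratio, 2),
            })
    return suggestions


# A well-named column and an opaque one whose values give it away.
print(suggest_tags("contact_phone", ["555-123-4567", "(555) 123-4567", "n/a"]))
print(suggest_tags("col_7", ["2021-03-04", "2020-12-31"]))
```

Keeping the name match and the value match ratio in each suggestion gives a reviewer some evidence for why a tag was proposed.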
Issue Resolution and Cleansing
AI can quickly identify data quality issues, and given good metadata and rules, it could be configured to fix them automatically, up to and including in the source system. AI could also standardize data to fit specific formats. For example, it could automatically repair a value whose format falls outside defined parameters (e.g., a 10-digit phone number).
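The phone number case can be sketched in a few lines of Python. This assumes US-style numbers and a target format of exactly 10 digits; the `normalize_us_phone` and `cleanse` functions are hypothetical names for this example, not part of any product. Values that cannot be repaired safely are set aside for a person instead of being guessed at.

```python
import re


def normalize_us_phone(raw):
    """Return a 10-digit string, or None if the value cannot be repaired safely."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", raw)  # keep digits only
    # Assumption: an 11-digit value starting with 1 carries a US country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def cleanse(values):
    """Split values into repaired numbers and originals that need human review."""
    repaired, needs_review = [], []
    for value in values:
        fixed = normalize_us_phone(value)
        if fixed is None:
            needs_review.append(value)
        else:
            repaired.append(fixed)
    return repaired, needs_review


repaired, needs_review = cleanse(["(555) 123-4567", "+1 555.123.4567", "123-4567"])
print("repaired:", repaired)
print("needs review:", needs_review)
```

A seven-digit value like the last one has no area code, so it goes to the review list, not into the cleaned data.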
Data Quality Summarization
An AI system could be configured to monitor and measure quality across data sources (at rest and in motion) and report out in a consumable, business-friendly format. It might offer suggestions for enhancements, or it could predict potential hot spots to watch in the coming days and weeks. If it doesn’t know how to resolve a quality issue automatically, AI can present some options for human follow-up.
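As one possible shape for that kind of report, the sketch below turns per-source record counts into short, readable lines and flags any source that falls under a threshold. It covers only the measuring and reporting part; suggesting enhancements or predicting hot spots would need more than this. The source names, counts, 95% threshold, and `summarize_quality` function are assumptions for illustration.

```python
def summarize_quality(metrics, threshold=0.95):
    """Turn per-source valid/total record counts into short, readable lines."""
    lines = []
    for source, counts in sorted(metrics.items()):
        total = counts["total_records"]
        valid_share = counts["valid_records"] / total if total else 0.0
        status = "OK" if valid_share >= threshold else "NEEDS REVIEW"
        lines.append(f"{source}: {valid_share:.1%} valid ({status})")
    return lines


# Hypothetical counts for two data sources.
metrics = {
    "registration": {"valid_records": 450, "total_records": 500},
    "claims": {"valid_records": 980, "total_records": 1000},
}

summary = summarize_quality(metrics)
for line in summary:
    print(line)
```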
Don’t Wait to Get Started
Can AI exist without good data quality? Sure, but it’s only a matter of time before bad data leads to a wrong decision, lost money, customer harm, or damaged reputation. We’re finally at a point where we can use AI to improve the data that will eventually feed into other AI systems. Although data quality isn’t the most exciting use case for GenAI, it’s a business opportunity that can deliver value early by reducing errors and building trust. It’s also a way to get data and ML teams engaged in GenAI technology so they are informed and prepared when the bigger clinical and operational use cases hit their queue.