TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

Artificial Intelligence and the Data Quality Conundrum

Garbage training data in, garbage model out. Here are four things to address to solve data quality problems.

By Brian J. Dooley
November 21, 2019

Machine learning (ML) and other forms of artificial intelligence are evolving quickly today and creating a powerful array of valuable new processes for business. Most experimentation has been geared to finding specific solutions to specific problems. However, data quality challenges are likely to become increasingly important. As with the old saying, "garbage in, garbage out," the nature of the input data can strongly influence the results that come from these systems.

For Further Reading:

Why the Key to AI Success is a Tidy Data House

CEO Q&A: Data Quality Problems Will Still Haunt Your Analytics Future

AI and BI Projects Are Bogged Down With Data Preparation Tasks

Data quality has always been an issue in database and data collection systems. Transactional databases have established procedures for data quality assurance, but ML raises a new range of concerns. The types of data errors and their potential consequences are different from those experienced with transaction-based systems. The use of very large data sources, streaming data, complex data, and unstructured data adds to quality issues, and modeling and training raise new concerns of their own.

Data with a Difference

ML uses very large data sets both to train its models and in practice when the models are run. This data can be subject to systemic bias that creates serious accuracy problems and may violate laws and social norms. Biases may not be immediately apparent, particularly when models use training data that is not obviously suspect. The algorithms, the data, and the results are conditioned by the definition of the problem and its solution. For example, if the data includes only male respondents, the model's results can be applied with any certainty only to males. The same is true for minorities and other significant differentiating characteristics that may be embedded in data.
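One way to make this kind of gap visible before training is to measure how each expected group is represented in the data. The following is a minimal sketch, not a complete bias audit: the function name `group_coverage`, the `expected_groups` list, and the 10 percent `min_share` threshold are all illustrative assumptions.

```python
from collections import Counter

def group_coverage(records, attribute, expected_groups, min_share=0.10):
    """Return each expected group's share of the records and the groups
    whose share falls below min_share (an assumed, illustrative threshold)."""
    counts = Counter(r.get(attribute) for r in records)
    total = len(records)
    # Groups absent from the data get a share of 0.0 rather than being skipped.
    shares = {g: (counts[g] / total if total else 0.0) for g in expected_groups}
    flagged = [g for g in expected_groups if shares[g] < min_share]
    return shares, flagged

# Hypothetical training sample that contains only male respondents.
sample = [{"sex": "male"}] * 20
shares, flagged = group_coverage(sample, "sex", ["male", "female"])
print(shares)   # share of the sample held by each expected group
print(flagged)  # groups the model has too little data to speak for
```

A check like this only catches imbalance in attributes that are recorded; bias embedded in proxies or in the problem definition still needs human review.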

The problem of bias is well recognized in ML circles, but it is only the tip of the iceberg. In ML, models and data quality are intrinsically linked through the use of training data. Algorithms may be viewed as a kind of scientific experiment; if the wrong data is selected, then the experiment can fail to produce an adequate result.

In addition to questions of bias, the need to use extremely large data sets brings more familiar problems such as noise, missing values, outliers, unbalanced distributions, inconsistency, redundancy, heterogeneity, poor timeliness, data duplication, and integration difficulties. Coding issues can creep in where preparation and attention to detail are lacking.
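Several of these problems can be counted with a simple profiling pass. The sketch below is one possible approach using only the Python standard library; the field names, the sample records, and the cutoff of five median absolute deviations for outliers are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median

def profile(rows, numeric_field, label_field, mad_cutoff=5):
    """Summarize missing values, exact duplicates, outliers, and label balance."""
    values = [r[numeric_field] for r in rows if r.get(numeric_field) is not None]
    missing = len(rows) - len(values)

    # Exact duplicates: identical records that appear more than once.
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in seen.values())

    # Outliers by median absolute deviation (MAD), which extreme values
    # distort far less than they distort the mean and standard deviation.
    outliers = []
    if values:
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad:
            outliers = [v for v in values if abs(v - med) / mad > mad_cutoff]

    return {
        "missing": missing,
        "duplicates": duplicates,
        "outliers": outliers,
        "label_counts": dict(Counter(r.get(label_field) for r in rows)),
    }

# Hypothetical records, invented for this example.
rows = [
    {"amount": 10.0, "label": "ok"},
    {"amount": 12.0, "label": "ok"},
    {"amount": 11.0, "label": "ok"},
    {"amount": 11.0, "label": "ok"},      # exact duplicate of the previous row
    {"amount": None, "label": "ok"},      # missing value
    {"amount": 950.0, "label": "fraud"},  # extreme value, rare label
]
report = profile(rows, "amount", "label")
print(report)
```

A flagged outlier is a prompt for review, not proof of an error: in this sample the extreme amount belongs to the only "fraud" record, which is exactly the case a model may need to learn.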

Huge data sets can be screened and wrangled through programmatic methods, some of which include ML or other AI-based methodologies. However, even in these cases it is difficult to ensure that systemic bias or incorrect problem definition does not occur. Checking algorithms and training them against diverse data is imperative for ensuring data quality. The algorithm and data need to be understood in terms of the desired result.

Quality Issues from Models

Another issue with data prepared for ML and AI is the need to create static models for real-time use after training has completed. Although AI provides considerable flexibility in discovering patterns and creating workable models for specific cases, changes in conditions reflected in the data stream can result in another kind of error. The data may be processed in real time, but the use of a static model means that even small changes in the data stream can produce incorrect results. For this reason, results need to be continuously monitored to ensure that new biases or wrong conclusions are not derived due to alterations in the data.
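As a rough illustration of such monitoring, the sketch below compares a recent window of one input feature against the training baseline using the population stability index (PSI), a commonly used drift measure. It is a sketch under stated assumptions, not a prescribed method: the bin count, the `eps` floor for empty bins, and the 0.25 alert level (a widely quoted rule of thumb) are all choices to tune per domain.

```python
import math

def psi(baseline, recent, bins=5, eps=1e-6):
    """Population stability index of `recent` relative to `baseline`.

    Bin edges come from the baseline's range; values outside that range
    are counted in the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(baseline), shares(recent)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Hypothetical feature values: the live stream has shifted upward by 40.
baseline = list(range(100))
recent = [v + 40 for v in baseline]

ALERT_LEVEL = 0.25  # assumed threshold, not taken from the article
score = psi(baseline, recent)
if score > ALERT_LEVEL:
    print(f"drift detected (PSI={score:.2f}); review or retrain the model")
```

A score of zero means the two distributions fall into the baseline bins in identical proportions. Input drift is only one signal; the model's outputs and error rates need the same kind of ongoing check.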

An additional cause for concern is the interaction of algorithm, training, data quality, and result. The algorithm itself can include data definitions that are inherently prejudicial, or data used in training may not reflect the global data against which the system is to be used. This problem is compounded where data is collected from an area entirely different from the domain of the training data and original use of the model.

Finding a Solution

To solve your data quality problem, you must ensure that both your training data and your working data repository have sufficiently high quality for the task at hand. This requires:

Data analysis including data characteristics, distribution, source, and relevance.

Review of outliers, exceptions, and anything that stands out as suspicious with respect to the business conditions being considered.

Domain expertise from subject matter experts to explain unexpected data patterns so that potentially valid information is not lost and potentially invalid information does not influence the result.

Documentation: the process used must be transparent and repeatable. A data quality reference store is a good way to maintain metadata and validity rules, and this should make the creation of new algorithms and adjustments easier.

Additionally, the processing pipeline needs to be continuously validated based on the rules and experience of previous analysis. Although the specifics might need to be adjusted as data changes, each business will have its own set of domain rules that need to be applied to determine validity.
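The idea of a reference store of validity rules that the pipeline applies on every run can be sketched as follows. This is a hypothetical minimal design: the `Rule` class, the two example rules, and the list of allowed country codes are invented for illustration, and a real store would typically live in a database or metadata catalog rather than in code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    name: str
    description: str               # documents the rule for later review
    check: Callable[[dict], bool]  # returns True when a record is valid

# Hypothetical domain rules; each business would maintain its own set.
RULES = [
    Rule(
        "amount_non_negative",
        "Order amount must be present and zero or more.",
        lambda r: r.get("amount") is not None and r["amount"] >= 0,
    ),
    Rule(
        "country_known",
        "Country must be one of the codes the business operates in.",
        lambda r: r.get("country") in {"US", "CA", "MX"},
    ),
]

def validate(batch, rules=RULES):
    """Return, for each rule, the positions of the records that fail it."""
    failures = {rule.name: [] for rule in rules}
    for position, record in enumerate(batch):
        for rule in rules:
            if not rule.check(record):
                failures[rule.name].append(position)
    return failures

# Hypothetical batch arriving in the pipeline.
batch = [
    {"amount": 25.0, "country": "US"},
    {"amount": -3.0, "country": "CA"},
    {"amount": 8.0, "country": "ZZ"},
]
failures = validate(batch)
print(failures)
```

Keeping a description next to each check means the same store serves as documentation, and adjusting a rule when the data changes is a single, reviewable edit.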

Doing all this requires a data quality team and a sufficient set of tools to operate on the data used in machine learning and AI programs. Given the complexity of data and the individuality of domains, each case is likely to be significantly different. In general, the greater the use of complex and unstructured data, the more careful the evaluation needs to be.

As digital transformation proceeds, more enterprises are rapidly jumping on the ML bandwagon and creating larger and more complex data streams with greater data quality difficulties. Quality tools will continue to evolve in response.

About the Author

Brian J. Dooley is an author, analyst, and journalist with more than 30 years' experience in analyzing and writing about trends in IT. He has written six books, numerous user manuals, hundreds of reports, and more than 1,000 magazine features. You can contact the author at bjdooley.query@yahoo.com.
