Overcome Data Shortages for ML Model Training with Synthetic Data
Recent innovations produce synthetic data that is richer, more varied, and closer to real data, making it more useful than ever in providing the missing data machine-learning models need.
- By Sigal Shaked
- June 14, 2021
There are many roadblocks to developing and deploying machine learning models: matching business objectives with technological capabilities, moving workloads between cloud and on-premises environments, finding experienced staff, and breaking down data silos. All of these challenges are complex and difficult to solve. However, another obstacle -- the shortage of data for machine-learning model training -- is closer to being overcome, thanks to recent innovations.
Models Starved for Data
The lack of data that reflects the full depth, granularity, and variety of real-life conditions is often the reason a machine-learning model performs poorly. An enormous amount of data is required to train an unbiased ML model that produces meaningful insights across all types of scenarios. Different model types have different data requirements, but finding enough data is always a challenge. Linear algorithms need hundreds of examples per class; more complex algorithms need tens of thousands (possibly millions) of data points. A common rule of thumb is that you need roughly ten times as many examples as there are degrees of freedom in your model.
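As a rough illustration of that rule of thumb, the short sketch below turns it into a back-of-the-envelope calculation. The factor of ten and the example parameter counts are assumptions for illustration, not a precise guarantee.

```python
def min_training_examples(degrees_of_freedom: int, factor: int = 10) -> int:
    """Estimate a lower bound on training-set size using the 10x rule of thumb."""
    return degrees_of_freedom * factor

# A simple linear model with 30 learnable coefficients:
print(min_training_examples(30))      # 300
# A more complex model with 5,000 parameters:
print(min_training_examples(5_000))   # 50000
```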
If there is insufficient data, a model is more prone to overfitting, leaving it unable to generalize to new data. If the data is missing specific populations, the model can be biased and fail to reflect the realities of the environment where it will run. Training data needs to include a proportionally accurate sample of each segment of the population, covering all types of instances and combinations. The problem is even more severe in anomaly detection, where the unusual pattern that needs to be detected may be underrepresented. Enterprises may also face incomplete data, where attribute values are missing within data sets.
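One simple way to see the imbalance problem concretely is to check class proportions before training. The snippet below is a minimal sketch using pandas; the column name, counts, and 5 percent threshold are hypothetical values chosen only to illustrate the check.

```python
import pandas as pd

# Hypothetical fraud-detection labels; the column name and counts are illustrative only.
df = pd.DataFrame({"is_fraud": [0] * 980 + [1] * 20})

# Share of each class in the training data.
class_share = df["is_fraud"].value_counts(normalize=True)
print(class_share)   # class 0 -> 0.98, class 1 -> 0.02

# Flag any class that falls below an (assumed) minimum share; in anomaly detection
# the positive class is often this badly underrepresented.
MIN_SHARE = 0.05
underrepresented = class_share[class_share < MIN_SHARE].index.tolist()
print(f"Underrepresented classes: {underrepresented}")
```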
Causes of Data Shortages
There are several reasons why there is insufficient data available for AI/ML models. The first is that data privacy laws prevent enterprises from using sensitive customer data without explicit permission, and there aren't enough customers, employees, or users who agree to have their data used for research purposes.
Another reason is that ML models might be designed to work with new trends or respond to new technologies, processes, or product features for which no historical data is yet available.
The nature of the data itself can also limit sample sizes. For example, a model that measures stock prices' sensitivity to the consumer price index is constrained by an index published once a month. Even 50 years of CPI history yields only 600 records (50 years x 12 monthly readings) -- a very small data set.
Sometimes the effort to label data is not timely or cost-effective. For example, a model predicting customer satisfaction might require an excessive number of hours to manually inspect hundreds of recordings of service calls, text messages, and emails to measure customer sentiment.
New Advances for Creating Synthetic Data
Because it can be generated in large volumes without exposing real records, synthetic data keeps enterprises compliant while providing the data that models need and filling the gaps that keep training sets balanced and complete. Recent innovations that improve the accuracy of synthetic data have made it even more useful in supplying the missing data machine-learning models need.
Generative adversarial networks (GANs), used successfully to improve the quality of generated images, are now being applied to improve the accuracy of synthesized tabular data. A GAN uses two neural networks: a generator that produces new plausible samples and a discriminator that tries to distinguish generated examples from real data. The two work against each other. The generator supplies samples to fool the discriminator, and as both networks improve through training, the generator produces synthetic data that is increasingly realistic.
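The following is a minimal sketch of that generator-versus-discriminator loop for numeric tabular data. It assumes PyTorch, uses random noise as a stand-in for a batch of real rows, and uses illustrative layer sizes and hyperparameters; it is not the architecture of any particular product.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: 8 numeric features per row, 16-dimensional noise vector.
n_features, noise_dim = 8, 16

generator = nn.Sequential(
    nn.Linear(noise_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 1),  # raw score; BCEWithLogitsLoss applies the sigmoid
)

loss_fn = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_data = torch.randn(256, n_features)  # stand-in for a batch of real tabular rows

for step in range(200):
    batch_size = real_data.size(0)

    # 1) Train the discriminator to separate real rows from generated rows.
    fake_data = generator(torch.randn(batch_size, noise_dim)).detach()
    d_loss = (loss_fn(discriminator(real_data), torch.ones(batch_size, 1)) +
              loss_fn(discriminator(fake_data), torch.zeros(batch_size, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to produce rows the discriminator scores as real.
    g_loss = loss_fn(discriminator(generator(torch.randn(batch_size, noise_dim))),
                     torch.ones(batch_size, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# Sampling noise through the trained generator yields new synthetic rows.
synthetic_rows = generator(torch.randn(1000, noise_dim)).detach()
print(synthetic_rows.shape)  # torch.Size([1000, 8])
```

Once trained, the generator alone is enough to produce as many synthetic rows as needed, as the last two lines show.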
An even more recent advancement is the Wasserstein GAN, or WGAN. Instead of using a discriminator to predict the probability that a generated sample is real, the WGAN uses a critic that scores how real a given sample looks. The critic's scores estimate the distance between the distribution observed in the training data set and the distribution of generated examples, and the generator is trained to minimize that distance, producing more realistic data.
Unlike a standard GAN, which seeks stability by finding an equilibrium between two opposing models, the WGAN seeks convergence between the models, resulting in synthetic data whose characteristics are more closely aligned with real life.
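To make the contrast concrete, the sketch below shows how WGAN-style losses differ from the cross-entropy loss in the GAN sketch above: there is no sigmoid, only the gap between the critic's average scores on real and generated batches. It assumes a generator/critic pair like the one above (with the critic emitting an unbounded score rather than a probability); the function names are illustrative, and the clipping constant follows the commonly cited value from the original WGAN formulation.

```python
import torch

def critic_loss(critic, real_batch, fake_batch):
    # The critic widens the gap between its average scores on real and generated rows,
    # which serves as an estimate of the Wasserstein distance between the distributions.
    return -(critic(real_batch).mean() - critic(fake_batch).mean())

def generator_loss(critic, fake_batch):
    # The generator shrinks that estimated distance by raising the critic's scores
    # on its own samples.
    return -critic(fake_batch).mean()

def clip_critic_weights(critic, clip_value=0.01):
    # The original WGAN keeps the critic constrained by clipping its weights after
    # every update; later variants use a gradient penalty instead.
    for p in critic.parameters():
        p.data.clamp_(-clip_value, clip_value)
```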
As technologies evolve to make synthetic data richer, more varied, and closer to real data, there is a high likelihood that synthetic data will become easy to generate and use. Adopted to solve the data shortage, synthetic data will protect privacy and enable enterprises to stay compliant while improving the speed and quality of ML models.
About the Author
Sigal Shaked is the co-founder and CTO at Datomize. Over her career, Sigal has gained extensive experience as a data scientist and researcher on industry and government projects. Her Ph.D. research investigated the use of machine learning techniques to protect data privacy. You can reach the author via LinkedIn.