Why Synthetic Data Could Be the Ultimate AI Disruptor
Synthetic data -- computer-generated data that mimics real-world phenomena -- is disrupting traditional data-to-insight pipelines, allowing organizations of all scales to test, tune, and optimize revolutionary AI models.
- By Yashar Behzadi
- June 28, 2019
The three pillars of AI are models, computing power, and data. Although all three are required to successfully complete an AI project, data collection and organization present the most difficult challenge for today's enterprises. Any business that deploys AI needs a large volume of relevant, well-organized data.
The trouble is that there is no clear-cut answer for how much data you need to initially train and refine a given AI model. In addition, accessing data that can be accurately labeled and organized for your purposes is not always easy or affordable.
Sourcing data is a major hurdle. Due to mounting privacy concerns, individuals (read: your customers or users) may be extremely suspicious of requests to share any of their personally identifiable information. There is also the delicate issue of offering users services valuable enough to warrant the exchange of sensitive data -- a feat only a small percentage of companies with consumer-facing apps achieve.
Labeling data is another burden. Outsourcing data collection and labeling to third parties can be extremely expensive and creates difficulties in terms of compliance and security for many industries. Crowdsourcing or building an in-house data labeling team can take years and is costly in terms of management time and other resources. You'll need significant expertise to ensure that you properly design and explain tasks, manage labeler incentives, select and apply tools appropriate to the task at hand, and monitor labeler reliability.
Furthermore, you must intelligently partition your data into training, validation, and holdout sets. You must check your data for unexpected or missing feature values, for training/serving skew, and for skew among the training, validation, and holdout sets. Finally, you'll need a sufficient number of examples relative to the number of trainable parameters in any given model.
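As a minimal sketch of what that partitioning and skew-checking might look like in practice (using pandas and scikit-learn; the 80/10/10 split ratios, the file name, and the "label" column are illustrative assumptions, not a prescription):

```python
# Minimal sketch: partition a labeled dataset into training, validation,
# and holdout sets, then run quick checks for missing values and for
# label-distribution skew across the splits. Ratios and names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("examples.csv")  # hypothetical labeled dataset

# Carve off a 10% holdout set first, then split the remainder 8:1
# into training and validation (roughly 80/10/10 overall).
train_val, holdout = train_test_split(df, test_size=0.10, random_state=42)
train, val = train_test_split(train_val, test_size=1 / 9, random_state=42)

splits = [("train", train), ("val", val), ("holdout", holdout)]

# Check each split for unexpected or missing feature values.
for name, split in splits:
    missing = split.isna().sum().sum()
    print(f"{name}: {len(split)} rows, {missing} missing values")

# Compare label distributions to catch skew among the splits.
for name, split in splits:
    print(name, split["label"].value_counts(normalize=True).to_dict())
```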
As a result of these difficulties, many industry players avoid rather than solve the "data problem." Rather than expose their customer databases to risk or invest in data collection infrastructure, many AI platform companies and AI transformation consultancies ingest and conduct inference only on easily managed types of existing or publicly available data.
Most significant AI platform offerings available today are organized around specific industry use cases for which their customer base likely already has data. Although this may obviate the need to collect, classify, and label new data, it also reduces AI to the role of just another analytics tool, operating on the same time-based data from enterprise systems of record that has fed such tools for decades.
Computer-Generated Data
Enterprises that want to use AI to create truly new types of products and open new markets need a more ambitious approach. Enter synthetic data -- computer-generated data that mimics real-world phenomena. Synthetic data is disrupting traditional data-to-insight pipelines, allowing organizations of all scales to test, tune, and optimize revolutionary AI models to create dramatically better business value. Furthermore, perfectly accurate labeling is inherent in the design of synthetic data, and large, well-balanced data sets can be created that cover hard-to-capture edge and corner cases and are free of bias.
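The "labels for free" property is easy to see in a toy example. The sketch below is a deliberately simple illustration, not a production pipeline: because the generator decides what to draw, every image is born with an exact label and bounding box, and the class balance is whatever you choose -- including oversampling a rare, hard-to-capture case.

```python
# Toy sketch of synthetic data generation: since we control the renderer,
# every image comes with a perfectly accurate label and bounding box by
# construction, and class balance is a free parameter. Illustrative only.
import random
from PIL import Image, ImageDraw

def make_example(shape, size=64):
    """Render one synthetic image and return (image, label, bounding_box)."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    s = random.randint(12, 28)           # shape size in pixels
    x = random.randint(0, size - s)      # top-left corner
    y = random.randint(0, size - s)
    bbox = (x, y, x + s, y + s)
    if shape == "circle":
        draw.ellipse(bbox, fill="red")
    else:
        draw.rectangle(bbox, fill="blue")
    return img, shape, bbox              # label and box are exact by construction

# Choose any class balance we like -- e.g., oversample "circle" as a
# stand-in for a hard-to-capture edge case.
shapes = random.choices(["circle", "square"], weights=[0.7, 0.3], k=1000)
dataset = [make_example(s) for s in shapes]
```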
Martin Casado and Peter Lauten of Andreessen Horowitz recently wrote of synthetic data: "We know of a startup that produced synthetic data to train their systems in the enterprise automation space; as a result, a team with only a handful of engineers was able to bootstrap their minimum viable corpus [and] beat two massive incumbents relying on their existing data corpuses collected over decades at global scale, neither of which was well-suited for the problem at hand."
Leaders in the autonomous driving space have also been early adopters of synthetic data, with Google's Waymo self-driving car AI said to complete over three million miles of driving in simulation each day. Synthetic data further allows Waymo's engineers to test any improvements in simulation before they are tested in the real world.
Nvidia recently announced that it has honed an iterative process that allows robots to perform more accurately in real-world environments after training on synthetic data.
New Products, New Markets
By helping solve the data problem in AI (rather than just ignoring it), synthetic data technology has the potential to inspire new product categories and open new markets rather than merely optimize existing business lines. For example, at Neuromation we are currently using synthetic data to help a major consumer electronics manufacturer rapidly prototype proposed new AI systems: engineers can test new sensor placements and modalities without the prolonged process of building representative hardware, acquiring data under various configurations, labeling the images, and building models.
We are also using synthetic data to correct for errors in real-world data sets stemming from geographic and demographic bias in data collection, which is frequently encountered by even the largest companies. In another project, we have used synthetic data to enable accurate robotic manipulation of unlabeled transparent bottles, allowing for new manufacturing and logistics workflows that were not previously possible.
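As a rough illustration of the rebalancing idea behind that bias-correction work (the generate_synthetic_example function below is a hypothetical placeholder for whatever generator or rendering pipeline a team actually uses, and the group labels are invented):

```python
# Sketch of bias correction by synthetic augmentation: count the real
# examples per demographic or geographic group, then top up each
# underrepresented group with synthetic examples until it matches the
# largest group. The generator call is a hypothetical placeholder.
from collections import Counter

def generate_synthetic_example(group):
    # Placeholder: a real system would render or simulate an example here.
    return {"group": group, "synthetic": True}

def rebalance(real_examples):
    counts = Counter(ex["group"] for ex in real_examples)
    target = max(counts.values())        # match the largest group's size
    augmented = list(real_examples)
    for group, n in counts.items():
        augmented += [generate_synthetic_example(group) for _ in range(target - n)]
    return augmented

# Example: a collection heavily skewed toward urban data points.
real = [{"group": "urban"}] * 900 + [{"group": "rural"}] * 100
balanced = rebalance(real)  # 900 urban + 100 real rural + 800 synthetic rural
```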
In addition to solving AI's data collection problem, businesses must also contend with intense competition. Thanks to open source initiatives across the globe, top-performing AI models (and computing power via cloud PaaS) have become more widely available. Rather than view AI as merely the latest analytics tool for optimizing existing business units, enterprises must consider how synthetic data technology can provide the flexibility and freedom to conceive of robust, unbiased, economical, and fundamentally new products and services that will truly move the needle for their next stage of growth.
About the Author
Yashar Behzadi is an experienced entrepreneur who has built transformative businesses in AI, medical technology, and IoT markets. He comes to Neuromation after spending the last 12 years in Silicon Valley building and scaling data-centric technology companies. His work at Proteus Digital Health was recognized by Wired as one of the top 10 technological breakthroughs of 2008 and as a Technology Pioneer by the World Economic Forum. Yashar has over 30 patents and patents pending and a Ph.D. in Bioengineering from UCSD. You can contact the Neuromation CEO via the company's website.