Diving Deep into Synthetic Data with Alex Watson of Gretel.ai

Machine Learning Engineered

Apr 20 2021 • 1 hr 19 mins

Alex Watson is the co-founder and CEO of Gretel.ai, a startup that offers APIs for creating anonymized and synthetic datasets. Previously he was the founder of Harvest.ai, whose product Macie, an analytics platform protecting against data breaches, was acquired by AWS.

Learn more about Alex and Gretel AI: http://gretel.ai

Every Thursday I send out the most useful things I’ve learned, curated specifically for the busy machine learning engineer. Sign up here: https://www.cyou.ai/newsletter

Follow Charlie on Twitter: https://twitter.com/CharlieYouAI

Subscribe to ML Engineered: https://mlengineered.com/listen

Comments? Questions? Submit them here: http://bit.ly/mle-survey

Take the Giving What We Can Pledge: https://www.givingwhatwecan.org/

Timestamps:
02:15 Introducing Alex Watson
03:45 How Alex was first exposed to programming
05:00 Alex's experience starting Harvest AI, getting acquired by AWS, and integrating their product at massive scale
21:20 How Alex first saw the opportunity for Gretel.ai
24:20 The most exciting use-cases for synthetic data
28:55 Theoretical guarantees of anonymized data with differential privacy
36:40 Combining pre-training with synthetic data
38:40 When to anonymize data and when to synthesize it
41:25 How Gretel's synthetic data engine works
44:50 Requirements of a dataset to create a synthetic version
49:25 Augmenting datasets with synthetic examples to address representation bias
52:45 How Alex recommends teams get started with Gretel.ai
59:00 Expected accuracy loss from training models on synthetic data
01:03:15 Biggest surprises from building Gretel.ai
01:05:25 Organizational patterns for protecting sensitive data
01:07:40 Alex's vision for Gretel's data catalog
01:11:15 Rapid fire questions

Links:
https://gretel.ai/blog (Gretel.ai Blog)
https://www.wired.com/2010/03/netflix-cancels-contest/ (NetFlix Cancels Recommendation Contest After Privacy Lawsuit)
https://greylock.com/portfolio-news/the-github-of-data/ (Greylock - The Github of Data)
https://gretel.ai/blog/improving-massively-imbalanced-datasets-in-machine-learning-with-synthetic-data (Improving massively imbalanced datasets in machine learning with synthetic data)
https://gretel.ai/blog/deep-dive-on-generating-synthetic-data-for-healthcare (Deep dive on generating synthetic data for Healthcare)
https://medium.com/gretel-ai/synthetic-data-performance-report-e5a3cd6b1e6d (Gretel’s New Synthetic Performance Report)
https://www.goodreads.com/book/show/18007564-the-martian (The Martian)
https://www.penguinrandomhouse.com/books/172832/snow-crash-by-neal-stephenson/ (Snow Crash)
https://us.macmillan.com/series/themurderbotdiaries/ (The MurderBot Diaries)