Practical AI – Episode #121

Cooking up synthetic data with Gretel

featuring John Myers


John Myers of Gretel puts on his apron and rolls up his sleeves to show Dan and Chris how to cook up some synthetic data for automated data labeling, differential privacy, and other purposes. His military and intelligence community background gives him an interesting perspective that piqued the interest of our intrepid hosts.


Sponsors

Code-ish by Heroku – A podcast from the team at Heroku, exploring code, technology, tools, tips, and the life of the developer. Check out episode 101 for a deep dive with Cornelia Davis (CTO of Weaveworks) on cloud native, cloud native patterns, and what it really means to be a cloud native application. Subscribe on Apple Podcasts and Spotify.

Knowable – Learn from the world’s best minds, anytime, anywhere, and at your own pace through audio. Get unlimited access to every Knowable audio course right now. Click here to check it out and use code CHANGELOG for 20% off!

The Brave Browser – Browse the web up to 8x faster than Chrome and Safari, block ads and trackers by default, and reward your favorite creators with the built-in Basic Attention Token. Download Brave for free and give tipping a try right here on changelog.com.

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.


Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another episode of Practical AI. This is Daniel Whitenack, I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a principal emerging technology strategist at Lockheed Martin. How are you doing, Chris?

I am doing very well. How’s it going today, Daniel?

It’s going good, it’s a nice, cold day here in Indiana… We’ll have a few more of those before this is all done, so…

Probably so… It’s not quite as cold down here in the sunny South… Although it’s not sunny today actually, so…

Yeah… And we had a new project funded this year related to some AI work for local languages, and trying to get all of that spun up, and infrastructure in place, and community website up for some shared AI tasks, and that sort of thing… It’s a lot of structuring and setup right now for me, I would say. I don’t know what big things you’ve got going, but yeah, it seems like that time of year for me.

Absolutely. From my standpoint, I’ve just been enjoying getting out there this weekend; I flew around a little bit. I had my first night flight… Pleasant to get back to work here this week, and it’s good.

Yeah, Chris is getting his pilot’s license, for any of those out there that are wondering what he’s talking about… But yeah, that’ll be exciting.

I didn’t run into anything last night. It was a good story.

Yeah, I’m close by Purdue University, and we have an airport at the university, so I told Chris as soon as he gets his license, he can fly up here and then we can do our recordings in-person, which will be nice.

There you go.

Today I’m really excited by the topic that we have going. Chris, a lot of times in our past conversations we’ve made reference to synthetic data, or augmented data, or data augmentation methods… We’ve also talked in various forms about privacy… But I don’t think we’ve really had an episode that has combined those in the great way that we’re about to… So today we have with us John Myers, who is CTO and co-founder at Gretel. Welcome, John!

Hey, good morning. Happy to be here.

Yeah. First, before we dive into all of that good data-related stuff and practical goodness, could you just give us a little bit of an idea about your background and how you ended up working with Gretel?

Yeah, absolutely. I think the way I ended up at Gretel is somewhat accidental, which I think is how a lot of folks who ended up in this field got here… So my background is computer science by education, and then I did about 14 years in the Air Force. When I joined the Air Force, I came in as a communications and information officer, which is kind of a fancy word for network IT leader. I did a couple years working in space launch communications out in California, at Vandenberg Air Force Base… And then I got an interesting opportunity to go to the National Security Agency and do some really cool, hands-on engineering and development work.

It sounded really awesome. At that point, my knowledge of the NSA was basically having seen Enemy of the State with Will Smith, and I was like “That sounds…” [laughter]

Wait, you mean it’s not like that?

I wouldn’t say it is a documentary of sorts… I was like “Well, it sounds really cool.”

Loosely based on reality.

Yeah, loosely based on reality. I didn’t meet Will Smith or anything, so that was a bummer, but… It was over in Maryland. I’m from Philly originally. I was like “It’s close to home. I can get out there and just experience something new.” When I got there, I got immersed into the intelligence community, and working at NSA, and I still wore the Air Force uniform, but I was kind of in an offshoot program there; I got to work on really cool stuff, working on low-level operating system engineering, building exploits, stuff like that.

Then I kind of pivoted into doing big data analysis, which was a kind of up-and-coming field at that time. Then I left there, did one more stint with the Air Force, doing a different set of things out of Las Vegas, and at that point I was at a critical point of whether I wanted to stay in full-time or do something else. At that point I was so hooked on building - I just wanted to build, and engineer, and build engineering teams - that I decided to leave active duty; I joined the reserves, and then when I got out, I did the complete opposite thing you can do when you're part of a 300,000-person organization, and I launched a startup in cyber-security with three other people.

[08:21] Then we did an enterprise security startup, and we did that for three years (it was called Efflux Systems), and then we were acquired by a company called NetScout, which is one of the leaders in network performance monitoring, and wanted to utilize some of what we had in cyber-security for some of their upcoming products.

When I got there, I was a principal architect, and among a lot of projects I worked on, I worked a lot on a lot of their cloud infrastructure. They build capabilities that also help enterprises and service providers detect and stop DDOS attacks. A lot of those devices collect a ton of telemetry from the customer environment and send it up to a cloud repository where it’s securely kept… And we wanted to look at how we can do analysis of that data. And you start [unintelligible 00:09:12.15] through that data and you start to realize there’s a lot of sensitive information in this data, and we should probably pre-process it, so we can kind of work on it safely. I think I spent a lot of time doing that; probably more time than I wanted. That was kind of like this pivotal transition point where I got into doing engineering to enable me to do engineering.

At the same time, in the back of my head it's kind of like one of those Shark Tank pitches; I'm like "Surely I'm not the only person with this problem - what can we do about it?" With some other close colleagues of mine, as we started talking about these stories, we all kind of shared very similar but different pain points; we kind of orbited around this idea of "What if we could just make data anonymization - being able to make data safe to use - kind of like a general-purpose thing that engineers everywhere can use?" and yadda-yadda-yadda. We launched Gretel.

So we launched Gretel in the fall of 2019, with myself and our CEO and my other co-founder, Alex Watson, who has a very heavy machine learning and data science background, and was previously the general manager at AWS Macie, which is another product that is very successful at AWS, for detecting sensitive information in S3 buckets.

So he has a whole different slew of stories - some will call them nightmares - about data anonymization. And yeah, we’ve been doing it ever since. And really, our mission is to make data anonymization and creating safe data generally available to engineers everywhere, not just the resourced organizations like the Facebooks and the Googles, who have massive resources to experiment with all the techniques to do it.

I am kind of struck – I was thinking while you were talking, especially with your background in Air Force and at the National Security Agency and other places… I was sort of remembering back to our conversations with the founders of Immuta, which is another company that does – I guess they’re more focused on the combination of law and data and governance and all of that sort of thing… But it seems like there’s this really strong - whatever it is about people coming from that sort of background; they had a sort of similar background, it sounded like. It really creates some deep thinking around these problems of data anonymization, privacy, governance… I don’t know, John, what’s your perspective from that side? How do you think your background with these sorts of agencies or the military has shaped how you think about data maybe differently than maybe someone like me, who’s just always sort of – started in startups and just sort of got what data I could, and have used it, and… Yeah.

And Daniel, that was exactly what I was gonna ask next, too. Just so that you know.

Yeah. Well, also - Chris has some experience in that world as well.

[12:02] Gotcha. Yeah, so I think one of the things that I learned a lot about doing intelligence work in the military and working with data is that I learned a lot about the chain of custody of data. And a lot of times when I meet folks that are like yourself, a data scientist, or they’re in the analytics space, a lot of times they are kind of just given data and given some task, and say “Here, go make magic happen.” And I don’t know how often the chain of custody or how that data was actually generated is thought about, but for me, I always think about “Where was that data born? At the moment that data was collected, something happened and it was written into a database.” I think that way a lot.

My other co-founders also have a background in the intelligence community, so it was something we were all aware of. So when we started talking about Gretel, really we wanted to make consumers safer, to make their personal data not be used the way it is today… Because often when you think about big companies like Google, Facebook - they build products, but they also look at their users as a product.

So we kind of backed into saying “What if we can enable engineers to make the data safe at the moment that it’s created? So right at that inception of the data.” That’s something that we are just really aware of, of where we came from, because that chain of custody of the data is so important.

And it's not so much a governance thing as it is an engineering problem… Because as a data engineer, when I'm writing my data into my production database, can I at the same time create a safe version of that data and write it into a staging database that anyone can access, but with privacy guarantees, so I don't have to go through this whole repetitive process of snapshotting my production database, combing over it, writing some bespoke script to sanitize it; can't we just make it part of the entire pipeline, at the point where the data is created?

I’d like to go back for one second… When you got to that moment where you realized “Am I the only one that’s dealing with this issue?” and you kind of had maybe an a-ha moment or something there, where you realized that - what got you to that? I’m curious about that moment of recognition, because I think other engineers and other data scientists wonder “Are they gonna have something similar as they’re out there creating?” What was it that made you suddenly realize “This is something that I’m recognizing not only impacts me, but probably impacts the broader community”, as well as “I have something to contribute toward that solution”? And as part of that, was any of the background – you know, we talked about your intelligence background there… Did any of that contribute to that moment and that recognition? If you hadn’t had any of those experiences, might you have missed that altogether?

I think there's two big things to answer that, and I'll start with the former. When I had the a-ha moment, I had a small team at my previous company, and we were kind of analyzing the data, and we realized that we needed the right ways to detect the sensitive information that's in it… And the sensitive information - it was things like names, company names, email addresses, IP addresses; things that can identify our customers…

PPI…

PPI, yeah… So we were like “Okay, let’s just write some detectors for it. We can use a lot of regex’es, we can write custom rules”, and then we were like “Okay, now we need a way to write a rule really quickly. Okay, now we need a framework to put the rule into.” And I was like “I can’t be the only person who’s trying to figure out if an email address slipped into a data stream.”

Some communities have data structures that are really specific to them, like in healthcare, and stuff. This was just things that are PPI, and to a degree, PII that identifies organizations and people… And there are generic ways to do that, where it's like "Can you just bring a regular expression to the table, and some framework kicks in and can scale for you?" That was one of the things I arrived at: what we were doing was a fairly repeatable process, and we assumed it was a repeatable process in many industries.
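
To make that idea concrete, here is a minimal sketch of the kind of regex-based PII detection being described; the patterns and field names are simplified illustrations written for this transcript, not Gretel's actual detectors.

```python
import re

# Illustrative regex patterns only; real detectors are more robust than these.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict:
    """Return every match of each pattern found in a text value."""
    return {label: pat.findall(text) for label, pat in PATTERNS.items()}

def scan_record(record: dict) -> dict:
    """Scan every string field of a record and keep only non-empty hits."""
    hits = {}
    for field, value in record.items():
        if isinstance(value, str):
            found = {k: v for k, v in detect_pii(value).items() if v}
            if found:
                hits[field] = found
    return hits

# Example: an email address slipping into a data stream
print(scan_record({"msg": "contact alice@example.com from 10.0.0.12"}))
```

The point of the framework John describes is that you only bring the pattern (or "predictor") to the table, and the detection and scaling machinery is shared.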

[15:54] And the second part of that was where to apply that detection and where to apply whatever type of transformation or synthesis we wanna do. It was kind of a no-brainer that you wanna do it as close as possible to where the source of the data is. We're talking about systems that can collect private information… Like, can you do it on the system before you even think about transmitting it to the cloud, so there's no risk there? Or can you do it on the edge, as people would say these days? That's not even a question for me, based off of our backgrounds and being so in tune with data custody.

So I’m curious, as you’ve built out this set of products, which - we’ll definitely get into the details of those in a bit, and talk a lot more about the practicalities of synthetic data, and all that… But you kind of mentioned that this was like doing engineering so that you could do engineering…

As you’ve engaged with various companies that are using your product, has that story been getting – they sort of immediately understand what you’re after? Because I remember when I was first getting into data science there wasn’t a lot of talk about this sort of rigor and the way we were treating data… And probably people might have seen something like this as maybe a little bit burdensome, like something they have to do before they actually get into the work that they really wanna do… But how have people been feeling that need in the industry, and been accepting this sort of solution, from your perspective?

Yeah, I think it's been received really well, and it's kind of a classic build vs. buy problem, and a lot of folks are just looking to buy… But what they don't wanna buy is some type of really difficult-to-install appliance or virtual appliance that kind of breaks their workflow… So the way that we're targeting this is making it so it's our end users, our developers, that can easily integrate it into what they're doing already… So it's just making another API call in the stack of what they're executing on, versus saying "Yeah, sure, we can do this, but we have to kind of come in and install a virtual appliance", and you have to re-route your entire data pipeline through it.

So as soon as we kind of explain that, it works right into their existing infrastructure, and we take care of the scale for them, where they can bring their predictors or bring what they wanna actually detect on to the table. They’d much rather just buy versus build it, because it eats up a ton of cycles for them to build this thing. It’s not a build once and deploy type of thing either; it’s not like they’re building a framework that they can deploy once. It requires care and feeding, because you’re constantly adjusting what type of information you’re processing and what types of things you wanna anonymize on… So we can kind of go on that journey with you and enable you to [unintelligible 00:18:39.15] a lot faster.

I’m kind of curious - I know in the beginning of the conversation, when Daniel was introducing you, John, he talked about synthetic data… Could you start off by telling us what is synthetic data and kind of give us a little bit of a background before we dive into the specifics of what Gretel does and how it gets there? Give us the terms that we need to know to be able to follow.

Sure. So we were at a happy hour, [unintelligible 00:21:19.06] give you that level of definition…

Perfect. [laughter] Our podcast is always the happiest of hours in our listeners’ week, I’m sure.

That was funny.

So I would say synthetic data is the ability to generate data that is almost indistinguishable from some type of source dataset. And it has all the granular elements of the original dataset that you would want; however, if you combine those granular elements, you don't have a one-for-one match to a record in the source dataset.

It heavily relies on machine learning and artificial intelligence to learn the semantics of the source dataset, and at that point, once you learn those semantics and that model is built, you could just continue to generate records that in aggregate tell the same story as the source data, which is kind of like one of the key elements that we always like to talk about - you could still run the same types of aggregate queries and get the same story. It’s not about just being able to use the individual records that you synthesize.

Now, we'll say, there are use cases for using those individual records, like if you have a development environment and you're building a system and you wanna look and see how the records fit into your layouts and stuff, but for the most part, the idea is to be able to use those records in some type of aggregate fashion.

There’s a whole lot of jargon, of course, in our industry, and you’ve already mentioned as well anonymizing data. How does synthetic data complement anonymization techniques, or maybe it’s an alternative to it? How do those two things fit together in terms of anonymization and synthetic data?

Yeah. I would say it all starts with the core use case, but it could either be a complement, it could be totally separate, or they could support each other. So we have two large buckets that we focus on at Gretel, and one of them is being able to detect PII, detect PPI, and then apply different transformation techniques to the data in place, so that your data is essentially the same, but there's typical redactions, or character replacements, or whatever. And that falls in line with a lot of the existing solutions that are out there, that fall under kind of like a data loss prevention capability… And you'll see a lot of the cloud providers, like Azure, AWS, Google - they all have a DLP set of APIs you can apply… Except that usually requires you to be bought into their ecosystem and already have your data sitting there. In our mind, that's table stakes, just to even have a conversation about privacy. We offer a set of APIs that allow you to detect and do those typical transforms.

[24:01] And synthetic data for us is a way to take the dataset, build a model, and let the model generate new records that you can just accumulate and use however… And it doesn't necessarily require you to funnel each record through a certain type of detector and look for PII, because we're just gonna learn the semantics of the entire dataset and generate new records… But those records should not be the original records that you had. And they play hand in hand.

For one example, let's say you have really sensitive PII, let's say social security numbers in the source dataset - if you can detect that a certain column is social security numbers, we might go ahead and recommend that you generate new randomized social security numbers, which is very deterministic… And then you can have that new column in that dataset, then send it into our synthetic capability, and that will just help guarantee that we don't memorize any of the tokens or replay any of those social security numbers… Because that is always a risk with synthetic data - you might memorize and replay some secrets. And that's where that whole field of differential privacy is coming in to address that situation as well.
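
As a rough illustration of that two-step idea - randomize the most sensitive column first, then hand the sanitized table to a synthesizer - here is a small Python sketch. The filename and the commented-out training calls are hypothetical placeholders, not Gretel's API.

```python
import random
import pandas as pd

def fake_ssn() -> str:
    """Generate a random, clearly non-original SSN-shaped token."""
    return (f"{random.randint(100, 899):03d}-"
            f"{random.randint(1, 99):02d}-"
            f"{random.randint(1, 9999):04d}")

df = pd.read_csv("customers.csv")  # hypothetical source table

# Step 1: replace the detected SSN column with randomized values, so a
# downstream synthetic model can never memorize or replay a real number.
df["ssn"] = [fake_ssn() for _ in range(len(df))]

# Step 2: hand the sanitized frame to a synthesizer (placeholder calls,
# not Gretel's actual API):
# model = train_synthetic_model(df)
# synthetic_df = model.generate(num_records=len(df))
```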

So the synthetic data that’s being generated - does it always start with being a replacement for the actual PII that you’re contending with at the time? Is it always kind of starting as a replacement factor, or is there ever a use case where you’re generating maybe – what if you’re starting with no data, and you wanted to generate it entirely synthetic, just because you don’t have something to start with. Is that within that, or would that be a separate type of use case, a separate product?

I would say that the synthetic data generation is not just based on doing anonymization, because you can kind of do that type of anonymization without the underlying need for machine learning and AI.

I think the issue that comes up is that you have a lot of different attacks, like [unintelligible 00:25:51.04] attacks, that are completely plausible and possible on data that's been just anonymized in place. So just because you're anonymizing names and addresses and phone numbers and email addresses - well, let's say just for argument's sake you have a bag of customer data and you have a bunch of records… You know, I live in Baltimore, and let's say I'm your only customer who is a male in his mid-thirties in Baltimore; even if you take all my personal information out, you might be able to join the fact that you have a customer like me in Baltimore and I'm the only one - well, now you've re-identified me.

So with synthetic data it’s “How can we actually generate a lot of those other risky fields that are really risky in aggregate?” So you look at categorical fields like ages, genders, locations… How do you actually generate those records, so that they can’t be recombined to really identify someone, but they’re still useful when you wanna look up the average amount of revenue you get from people in Baltimore, or some type of aggregate question like that.

And then on the second question, for us to generate synthetic data you do need some type of training input to learn the underlying semantics. And then once you have that model, you can generate any number of records. It doesn't have to be one-to-one. If I have 5,000 training records, you can generate five synthetic records, you can generate 20,000. But once you learn those semantics, and given the fact that you can generate any number, you can do a lot of interesting things. You can do enforcement on what you're generating.

Let’s say I wanna generate records, but I only wanna accept records that are of a certain category, a certain age or a gender group. You can use that to synthesize new records that help balance a dataset that might be otherwise biased, and not have enough samples of something that you’re trying to predict on, for example. Once you have that core model built, you can kind of generate records to meet a lot of those needs.

We've mostly been talking about use cases around private data, and privacy aspects, but is this synthetic data generation capability - does it also help people who are working in data-scarce scenarios, or imbalanced dataset scenarios? Let's say that we don't have any personally-identifying data in our dataset, at least to our knowledge we're not dealing with that issue… But we do either have an imbalanced dataset, or maybe we're just working in a sort of data-scarce domain, where we do have some data… Like you say, maybe we have 5,000 records, but we really need 25,000 records for our model… Is it viable to use synthetic data in that type of scenario?

[28:31] Yes, and that is actually one of the core use cases that we have experienced, where there might already be a situation where data is deemed safe to use. I'll use fraud as an example, because fraud is a really good one, where you have so many records that are not fraud. You're usually trying to predict the opposite, so the field you're trying to predict is an actual fraudulent event, but you might just not have enough records of that fraudulent event.

What we’re able to do is kind of guide you through how to synthesize more records that fit into the fraud category, so that when you go and you build your actual machine learning algorithm, there’s enough of those fraudulent records there that it could actually create a proper decision boundary, or whatever, so you have a net better model at the end.
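
A minimal sketch of that balancing workflow might look like the following, assuming a conditional generator with a hypothetical `generate(seed_values, n)` interface standing in for whatever synthetic model has been trained; nothing here is Gretel-specific.

```python
import pandas as pd

def balance_with_synthetics(df: pd.DataFrame, label: str, generate) -> pd.DataFrame:
    """Top up the minority class with synthetic records.

    `generate(seed_values, n)` is a stand-in for any conditional generator
    that returns n records whose label column matches seed_values.
    """
    counts = df[label].value_counts()
    majority, minority = counts.idxmax(), counts.idxmin()
    shortfall = counts[majority] - counts[minority]
    if shortfall <= 0:
        return df
    synthetic = generate({label: minority}, shortfall)
    return pd.concat([df, synthetic], ignore_index=True)

# Usage (with a hypothetical trained model):
# balanced = balance_with_synthetics(transactions, "is_fraud", my_model.generate)
```

The balanced frame then goes into the normal model-training step, where the classifier now has enough fraudulent examples to learn a proper decision boundary.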

I was just wondering, like – because I'm familiar with imbalanced datasets, where you do some type of maybe interpolation to get some extra points, or something like that… But you're talking about really training a neural network model or some type of machine learning model to actually generate this data. So maybe this is too much of a simplification; I'm sure your methods are pretty advanced… But is the basic idea that you would have some of the fields of a record be input to the model during inference, and then it's trying to predict another of the fields of a record? Similar to predicting the next word after a sequence of other words, or something like that. Is that some of the basic idea, or can you give us any intuition in terms of how the data is set up and how that works?

Yeah, that’s actually very close… So our core engine for training and generating synthetic data is completely open source; it’s on GitHub under GretelAI… And our initial implementation with the open source package - it’s kind of a wrapper framework, and you can have a bunch of pluggable backends… So right now, our first pluggable backend is an LSTM on TensorFlow. It is kind of a sequential model, which is a lot different than a lot of the other techniques that are out there… And what we do is we have the ability to focus on text input.

Then in the open source package we also have another module that kind of wraps that entire thing inside of a data frame… And then we infer different field delimiters, and essentially we’re able to reconstruct those records as a sequence, and then exactly [unintelligible 00:30:56.15] We can do it one of two ways. One, you can just say “Just keep generating records with no input”, or you could specify what we call a seed, where it’s like “Okay, I only want you to create records that start with this age group, this gender”, and then it’ll complete those records. That allows you to more efficiently increase a record of a certain type based on what your requirements are.

Then what we actually do as the product is we have a bunch of different layers that work on top of that to do data validation. We have separate models that we build to learn and enforce the semantics of individual fields to make sure that when records are generated, they still fit within the constraints that you had before, whether it's the right character sequences, the right structure of the fields if you have date times, making sure that categorical fields are always recreated… We don't wanna invent a new state. So if there's 50 states that are in a certain column, we'll make sure we're only generating valid states. Those are things that we provide inside of the product. But the open source package lets you kind of just jump right in to build and train on structured or unstructured data.
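
To give a rough sense of how seeding and validation can fit together, here is an illustrative sketch; the `model.generate_one(seed=...)` interface and the validation rules are hypothetical stand-ins, not the actual layers Gretel ships.

```python
# In practice the valid values would be the ones observed during training.
VALID_STATES = {"IN", "GA", "MD", "CA"}

def is_valid(record: dict) -> bool:
    """Reject records that invent a category the source data never had."""
    return record.get("state") in VALID_STATES and 0 <= record.get("age", -1) <= 120

def generate_valid(model, seed: dict, n: int, max_tries: int = 10_000) -> list:
    """Sample from a (hypothetical) seeded model until n records pass validation."""
    out = []
    for _ in range(max_tries):
        if len(out) >= n:
            break
        record = model.generate_one(seed=seed)  # hypothetical interface
        if is_valid(record):
            out.append(record)
    return out

# Usage: only accept records that start with a given age group and gender.
# records = generate_valid(model, seed={"age_group": "30-39", "gender": "M"}, n=500)
```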

John, I’m curious if some types of data are easier to synthesize than other types of data. You mentioned dates, categorical variables, categories, labels, numbers… But then we also have things like audio, and imagery, and other things like that. What’s the sort of current state of the art in terms of synthesized data, and what data types or domains of data maybe are the bread and butter right now, and maybe which ones have some challenges in terms of synthesizing data in certain scenarios?

Yeah, so right now what Gretel ships is really focused around structured and unstructured text. So I think about records from a database, or any type of text input. Audio, video, and imagery are next; we will probably see them in a future iteration of the product, and it's something that we're working on now… A lot of the state of the art around that is in our wheelhouse now, because we were able to just back into our customer problems via structured records. Right now we kind of have to pick our battles, and right now the main one that we're focused on is being able to enable people to synthesize new versions of database tables, or static datasets, so they can more safely share them.

I’m curious - we’ve talked a little bit about the product side of things, and also you’ve made reference to the open source as well… Could you differentiate a little bit between what each side of that has to offer, to kind of give people a framework in their head about what they would go to for each, and where do they maybe step up from open source to your paid products/services, that kind of thing?

Yeah, absolutely. Right now the open source packages - and there’s two of them; one of them allows you to get started with synthetic data, and the other one allows you to get started with our traditional transformers, to kind of mutate data in place. Those are Python libraries that are available to anyone, licensed under Apache 2.0. Obviously, you go in using those knowing that it’s Python, and that already is kind of a qualification for a lot of our customers… Which isn’t a problem; we have a ton of researchers and data scientists that live and breathe in Jupyter Notebooks, so they’re able to plug that right in.

Last August we launched a beta of more of our premium features, and that beta basically allows you to use our cloud service to test out our labeling capabilities. Then what we ended up doing was packaging up a lot of our premium capabilities, which include automatic data validation; it does a lot of analysis to make sure that correlations across all your data are held, and distributions across your data are held properly… We also made those available as an SDK that you can download through our authenticated API.

[35:52] We had a great several months of going through that beta, getting a ton of feedback from users, and then what we walked away from that knowing is that what we really wanna do is make this available to engineers everywhere. And engineers everywhere can't necessarily just download a Python SDK and incorporate it in their pipeline if, let's say, your entire backend is written in Java. And so how do we drastically simplify what these premium SDKs do? So what we're building now is the ability to launch Gretel services as kind of containerized capabilities that are backed by REST APIs. So now you can interact with our services purely through a REST API, which is completely language-agnostic. Every engineer at some point has gone through the process of making API calls to a remote service, and so now that is kind of the qualification factor.

We wouldn't have learned everything that we learned if we didn't have that granular-level capability out there through the beta… So now the entry point will kind of either be that you can run Gretel services in your environment, and we're also building a hosted service where we can run and scale these capabilities for you… But it should be as easy as taking your dataset, or taking some records, and a lightweight configuration, pushing it to an endpoint, and that endpoint will then trigger a whole bunch of backend work to learn, build a model, or generate data for you. And at that point we really just wanna be a bump in the line in your entire data workflow, to be able to call into these APIs. So that's what we're working on now: just really simplifying that down.
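
Purely as an illustration of what "just another API call" could look like, here is a hedged sketch using Python's requests library; the URL, payload shape, and auth header are invented for this example and are not Gretel's documented REST API.

```python
import requests

API = "https://api.example-synthetics-service.com/v1"  # hypothetical endpoint
headers = {"Authorization": "Bearer <API_KEY>"}

# Push a small dataset plus a lightweight configuration; the service trains a
# model and returns synthetic records. All field names here are illustrative.
payload = {
    "config": {"task": "synthetics", "privacy_level": "high"},
    "records": [{"age": 34, "city": "Baltimore", "plan": "pro"}],
}

resp = requests.post(f"{API}/models", json=payload, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json())
```

Because it is just HTTP, the same call can be made from Java, Go, or any other backend language, which is the language-agnostic point being made here.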

Based on what you’ve seen with your current users and customers – like, if I’m a data scientist working on a new project, getting into some new data, do you have any recommendations in terms of workflow with your tools? Like, you know, when I get the data, I’m profiling it, I’m doing some exploratory analysis, where and when should I be thinking about fitting in some of these REST calls, or Python SDK elements into my workflow, so that I can make sure that I’m dealing with maybe both sides of things, anonymity and creating synthetic examples? Maybe more specifically my question would be are you seeing that done in a workflow upfront, and they do this on data, and then they use that data moving forward always? Or are you seeing this as sort of an ongoing part of people’s workflow?

I would say that upfront is not the usual case that we recommend, and we recommend that there’s usually a little bit of data cleaning you wanna do; not down to the granularity of doing a ton of all the exact feature engineering you would do to build a model, but at a minimum – and we have blueprints that help folks go through this process as well… You wanna identify, for example, columns that you probably don’t need to worry about synthesizing, because they’re not something that your model is gonna grab onto.

So if you have records that have maybe people names in them, typically those people names aren’t gonna be correlated to a lot of your continuous variables and your other variables in the dataset, and if you can drop those columns first, you’re gonna save a lot of time on being able to train a synthetic model for that.

Other examples would be, you know, there’s a lot of datasets we get from customers that are highly dimensional - several hundred columns - and they’re trying to train a model, maybe like [unintelligible 00:39:08.03] model on that to predict something, and a lot of times what we recommend is “Look, if you can kind of train your model first, and then you identify what the algorithm deems are the most valuable columns, just drop a lot of the other columns”, because then you’re gonna get way better performance out of maintaining the correlations on different subsets of the dataset.

So we do recommend at that point - really right before you would actually think about actually training your model - once your data is pretty much in that good state, around that ballpark… But it completely varies based off of use case. We have some customers that the first stop is coming to Gretel, because they wanna immediately [unintelligible 00:39:45.25] PII that they can remove. So I’d say it definitely varies.
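
Here is an illustrative pre-processing sketch along the lines of that advice - dropping identifier-like columns and keeping only the columns a quick model deems valuable - using a scikit-learn random forest as a stand-in for whatever downstream model you care about; the filename and column names are assumptions made for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("wide_table.csv")  # hypothetical table with hundreds of columns

# 1. Drop identifier-like columns (e.g., people names) that carry no signal
#    for synthesis and only slow down training.
df = df.drop(columns=["first_name", "last_name"], errors="ignore")

# 2. Train a quick model, then keep only the columns it deems most valuable,
#    so the synthesizer can focus on maintaining the correlations that matter.
X = df.drop(columns=["target"]).select_dtypes("number").fillna(0)
y = df["target"]
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_cols = (
    pd.Series(forest.feature_importances_, index=X.columns)
    .nlargest(20)
    .index.tolist()
)

slim_df = df[top_cols + ["target"]]  # this slimmed-down frame goes to the synthesizer
```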

[39:51] Yeah, and I've found Gretel's Blueprints repo, which seems pretty interesting. I see a bunch of these examples [unintelligible 00:39:58.17] create synthetic data from a CSV or a data frame… All sorts of examples. So if our listeners are interested in that, it looks like there's some notebooks and things in there that they can look at. We'll link that in our show notes for sure, for people to take a look at.

Maybe one thing to kind of start us thinking about things into the future - where do you see the current challenges that are unsolved right now in terms of privacy, and maybe data augmentation, or synthetic data? What are some of those problems out there that you still see as open problems that need to be addressed?

Yeah, I'd say there's a couple of problems, and they're somewhat related. One of them is that it's still a very nascent field; there's a lot of tools out there, and there's no magic bullet. There's no way to just magically take a dataset and create a version of it that is perfect and doesn't violate privacy. There's always going to be a trade-off between utility and privacy, and helping people understand that I think is gonna be a really big challenge.

There’s a ton of great research out there into how to do that trade-off between utility and privacy, and that’s one of the things that we wanna figure out, is how to make that more obvious to engineers when they wanna anonymize data or make data safe to share; it’s like all these knobs you can tune. Ideally, you don’t wanna go to a software engineer who’s maybe a full-stack engineer and they have access to a production table and they want to make a safe version of that data - you don’t wanna ask them to tune a bunch of hyperparameters for a TensorFlow LSTM, because they’re gonna be like “Whoa, I don’t know what’s going on here.”

But you might wanna ask them to say like “Look, what is the trade-off in utility and privacy that we should have here? Are you sharing this externally, are you sharing this internally?” Ask them what those levels are, and then how can we infer what all those really nitty-gritty knobs are that need to be turned for the underlying model that needs to be built…

Which kind of segues into the second problem I see - making these tools generally available to software engineers everywhere is gonna be a massive challenge. You can’t ask every engineer to download a Python SDK and have a crash course in machine learning to ask them to build a safe version of their dataset… So how do we kind of bundle and package these capabilities in a way that engineers everywhere wanna use as part of their day-to-day workflow? If you look at companies that made things dead simple - Stripe made payments less scary, because they have a ton of language bindings, it’s really easy to integrate into your app; it’s just another API call that you make and you don’t think about it, and they’re doing all this heavy-lifting of processing payments, which is a very complex thing… How do we kind of generalize down to that level? And that’s definitely one of the big visions and missions that we have here at Gretel.

[42:55] I’m kind of curious - as you’re describing that, and going back to the beginning of that second challenge that you’re looking at in terms of… It really strikes me the scale of what needs to happen here… So kind of beyond the specific challenges that maybe need to be solved, and that maybe Gretel wants to address, the scale of this is definitely holding a lot of engineers back, that are contending with this and can’t get where they wanna go… And if you’re looking out over the next few years at where this has to go as an industry, and the need to broadly, at scale, be able to increase productivity in AI/ML in general, and this being such a core tenet of that - where do you see the industry going with that? What needs to happen in the large to enable ten times, a hundred times as many engineers to be able to overcome these problems and get productive with the problems they’re trying to solve? You really got me thinking as you were answering those last two about how to get there from here… How do you get there?

That is a great question. In my mind - and this is something that we even do inside of Gretel - I think one of the key things that has to get us there is that we just have more of a free form exchange of (I guess) ideas and talent among different types of developers and engineers that are out there. When you look at a lot of organizations, there’s still always a lot of segregation between your platform engineers, and your software engineers, and your data engineers, and you have your machine learning engineers, and your data scientists… And really, I think everyone needs to be able to do a little bit of everything, and it’s like, how do you build toolsets that allow a software engineer to easily take a look at the data - even though it’s using some complex machine learning capabilities - without having to go and request a machine learning engineer to spend tons of time doing it, when that MLE should be maybe researching other parts that are more vital to what the core mission is of that organization.

You see that there's been a lot of acceleration in micro-frameworks for building REST APIs, and that is a really good example of how that allowed a lot of people to operationalize things. Even as a data scientist, you could fire up model training and predictions and back it with a REST API and make it generally available… Like, what's the machine learning version of that micro-framework for a REST API that allows software engineers to quickly make use of all the capabilities that are out there, that are up and coming with synthetic data?

At our company now we have a complete blend of backgrounds; we don't want the whole sequential motion of like "Well, this person builds the model, and then hands the model over to this person, who builds this…" We just want everyone to be able to plug in and build… So how do organizations move to that methodology?

Tearing down the walls there, so to speak, of those distinctions.

Yeah, for sure. Well, John, I am super-excited about what Gretel is doing, and I really appreciate your detailed description of why these things are important, and how you’re solving some of these problems. I think it’s really important. We’ll make sure and link, like I mentioned, some of these links that we talked about in our show notes for people to check out. Please go and check these out, try to generate some synthetic data with their tools and check out their platform.

Thank you so much, John. I appreciate you taking time to talk to us.

Awesome. It’s been a pleasure to be on the show. Thanks for having me.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
