On Monday, AmeliorMate founder and independent researcher Katie Evanko-Douglas published AmeliorMate’s first substantial piece of original research answering the question:
Can OpenAI’s GPT-3 language model generate effective synthetic training data for text classification algorithms?
View or download the full report including links to the project’s full codebase and datasets here.
In the study, Katie attempted to answer the question by using three common toy datasets: IMDB movie reviews, Enron emails, and a spam text message dataset. She used a randomly selected 5% sample from each dataset to train comparison models.
She also used that 5% dataset as the seed for GPT-3-generated synthetic training data. She generated ten sets of synthetic data per original dataset to test the effects of model size and the number of examples given. She then trained three machine learning models on each set of synthetic data.
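For readers who want a feel for the setup, here is a minimal sketch of the sampling step and the size of the experiment grid. It is an illustration only, not the study's code: the function name `sample_fraction`, the toy `corpus`, and the fixed random seed are all assumptions made for this example.

```python
import random
from collections import Counter


def sample_fraction(rows, fraction=0.05, seed=0):
    """Return a simple random sample containing `fraction` of `rows`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Hypothetical toy corpus of (text, label) pairs standing in for a real dataset.
corpus = [(f"message {i}", "spam" if i % 7 == 0 else "ham") for i in range(1000)]

# The 5% sample serves both as the comparison training set and as the GPT-3 seed.
seed_set = sample_fraction(corpus)
label_counts = Counter(label for _, label in seed_set)

# Grid described above: 3 datasets x 10 synthetic sets x 3 model types.
n_models = 3 * 10 * 3

print(len(seed_set), dict(label_counts), n_models)
```

The grid arithmetic matches the 90 GPT-3-trained models reported in the results.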
Having faced a shortage of labeled data herself in the past, she wanted to know:
What if an engineer only has access to a small amount of labeled data?
For example, what if they want to create a text message spam filter but cannot access a large number of text messages due to cost constraints or privacy concerns?
Is it better to train on the tiny, real-world dataset you have or is it better to use that tiny dataset to generate larger amounts of synthetic training data with GPT-3?
On the whole, the study found that machine learning models trained on GPT-3 synthetic data tend to perform better than models trained on small amounts of real-world data. Over half (54.44%) of all models trained on GPT-3 data (N=90) reported higher accuracies than their 5% counterparts at the 0.01 level of significance or better.
However, there were notable differences in model performance based on dataset and model type. For example, every IMDB and SMS model produced a higher accuracy score than its real-world data counterpart, while the Enron email results were mixed. Regarding model type, all 30 logistic regression models produced higher accuracy scores, three of 30 random forest models produced lower accuracies, and five of 30 passive-aggressive classifier models produced lower accuracies.
Though more research is needed to codify best practices and to ascertain under which exact circumstances GPT-3 is most useful when generating synthetic training data, this study offers an interesting glimpse into the future.
It would be absolutely groundbreaking if one could create vast amounts of effective training data simply by giving an API a natural-language prompt and a few examples of the task.
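To make "a natural-language prompt and a few examples" concrete, here is a minimal sketch of how a few-shot prompt could be assembled from seed data. The prompt layout, the function name `build_few_shot_prompt`, and the sample messages are all hypothetical; the study's actual prompts are in the full report. The sketch only builds the prompt text and does not call any API.

```python
def build_few_shot_prompt(instruction, examples, label):
    """Assemble a few-shot prompt asking for one new example with `label`."""
    lines = [instruction, ""]
    for text, example_label in examples:
        lines.append(f"Label: {example_label}")
        lines.append(f"Message: {text}")
        lines.append("")
    # Leave the final message open so the language model completes it.
    lines.append(f"Label: {label}")
    lines.append("Message:")
    return "\n".join(lines)


# Hypothetical seed examples; in practice these would come from the 5% sample.
seed_examples = [
    ("WINNER! Claim your free prize now, reply YES", "spam"),
    ("Running late, see you at 7", "not spam"),
]

prompt = build_few_shot_prompt(
    "Write a text message that matches the given label.",
    seed_examples,
    "spam",
)
print(prompt)
```

Changing the number of entries in `seed_examples` is one way to vary how many examples the model is given, which is one of the two factors the study tested.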
For example, the SMS portion of the experiment began with a 5% dataset of only 161 non-spam messages and 26 spam messages (186 text messages total). Even in the most pessimistic case (the passive-aggressive classifier), models trained on GPT-3-generated synthetic data still showed modest gains in accuracy. More importantly, every model trained on GPT-3 data generated from at least two examples had recall gains between 17 and 22 points, all at the 0.001 level of significance.
Data such as personal text messages are not always easy to come by. Besides the obvious privacy concerns, they can be expensive or time-consuming to procure. Yet algorithms that can filter out spam, or at least alert people to potentially dangerous spam text messages, provide value to consumers.
The main takeaway from this study is that we may someday live in a world where, instead of needing thousands or tens of thousands of examples, one needs only a couple of hundred to train an effective model.
A machine learning engineer might be able to quickly and easily create useful algorithms they never dreamed possible, without violating people’s privacy or paying through the nose for additional data.
Such a future would be exciting indeed.
If you’re interested in experimenting with synthetic textual data for your machine learning projects, please contact Katie Evanko-Douglas at katie@ameliormate.com.