TL;DR — We can leverage the text generation power of large language models like GPT-3 to generate labeled data for supervised learning. We do so using prompting, in which we give the language model a description of the task, a few examples, and a new example to label. The data generated this way is, in general, noisier and of lower quality than human labels. However, the speed of data collection and the ability to keep humans in the loop make this an effective way to collect a significant amount of labeled data for tough-to-label tasks.
Data Labeling in NLP
It is well understood at this point that the quality and volume of labeled data are the single best predictor of success for a machine learning project. In industry, having this data or not often dictates whether a project is green-lit or left on the back burner.
However, collecting labeled data in many contexts can be extremely expensive and time-consuming. Labeling in specialized domains such as healthcare or recruiting may require an expert to assist in the annotation, which makes annotations more expensive and shrinks the pool of potential annotators.
In addition, some tasks don't lend themselves well to human annotation. Generating summaries, for example, is notoriously difficult: it is low throughput (each example requires reading, comprehending, and summarizing a potentially long document) and it suffers from high annotator disagreement (what is important to summarize can vary significantly from person to person).
Language model augmented data labeling
A growing trend in the ML literature is using large language models as a replacement for, or supplement to, traditional annotation pipelines. This lets models do the heavy lifting for the most labor-intensive parts of the annotation pipeline.

LM augmented data annotation works as follows (a toy sketch of the full loop appears after this list):
- Generate a noisily labeled dataset from the language model using prompting
- Human annotators either accept or reject a sample of the generated examples
- A critic model is trained to filter generated examples based on the annotators' decisions
- The critic model is applied across the generated dataset to filter out noisy examples
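The sketch below is a self-contained toy simulation of these four steps. Random numbers stand in for the language model's noisy labels and for the human reviewer; the concrete versions of each step appear in the worked example later.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Step 1: pretend an LM has produced noisy labels for 500 unlabeled examples
X = rng.normal(size=(500, 8))                                        # example features
true_y = (X[:, 0] > 0).astype(int)                                   # hidden ground truth
noisy_y = (X[:, 0] + rng.normal(0, 1.5, size=500) > 0).astype(int)   # "LM" labels

# Step 2: a human reviews a small random subset, accepting correct labels
review_idx = rng.choice(500, size=50, replace=False)
accepted = (noisy_y[review_idx] == true_y[review_idx])

# Step 3: train a critic to predict acceptance from the example features
critic = LogisticRegression().fit(X[review_idx], accepted)

# Step 4: keep only generated examples the critic considers likely correct
keep = critic.predict_proba(X)[:, 1] > 0.5
print(f"Critic keeps {keep.sum()} of 500 generated labels")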
Fast noisy labels > Slow clean labels
When working as an ML engineer in industry, quickly turning around effective POCs is often more important than finding the optimal approach. Engineering time is one of a company's most important resources, and it can be unacceptable to let ourselves be blocked, even by something as important as data. Here we will see that, while generated data is not always better, it is significantly faster and cheaper to produce, which can help answer the critical question every PM asks before an ML project: will this work?

In this review, I will:
- Give a brief overview of using language models to generate data
- Look at a case study using the end-to-end labeling pipeline
- Show a working example of bootstrapping a book title classification dataset
Generating Data From Language Models
Large language models
Over the past several years, language models have grown exponentially in size and performance. Contemporary language models such as GPT-3 are orders of magnitude larger than their predecessors in model size, amount of training data, and amount of compute used to create them. These models have become extremely powerful and have shown a remarkable ability to generate text for new tasks without task-specific training data.
Prompt-based generation
We can generate labels for unlabeled data by constructing text templates called prompts; when the language model is asked to continue the text, it will likely produce the label. Prompts normally contain three components:
- A description of the task we want to perform
- Examples of the task being performed (also called in-context demonstrations)
- A new example we want to label

As a working example from the OpenAI Playground, we could use this structure to label foods as vegetarian or not using the language model.
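A minimal version of such a prompt might look like the following (an illustrative reconstruction, not the exact text from the Playground example):
# Illustrative reconstruction of a vegetarian/not-vegetarian labeling prompt
# (hypothetical demonstrations, not the original Playground example).
prompt = (
    "Decide whether each food is vegetarian or not vegetarian:\n"   # task description
    "Lentil soup => vegetarian\n"                                   # in-context demonstrations
    "Beef stew => not vegetarian\n"
    "Margherita pizza => vegetarian\n"
    "Grilled salmon =>"                                             # new example to label
)
# Sent to a completion endpoint, the model is expected to continue with the label,
# e.g. " not vegetarian".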
Symbolic Knowledge Distillation — A Case Study on Human in the Loop and Data Quality
The Symbolic Knowledge Distillation: from General Language Models to Commonsense Models paper applies this idea to generating data for the commonsense reasoning task. Yannic Kilcher gives a very detailed description of the technical details of the paper, which I would highly recommend watching, but I will give a brief overview of the important findings here.
Common sense reasoning task
The task is to generate common sense reasoning triplets. These take the form of a subject, predicate, and result, as shown below:
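For a sense of the format, here are a few illustrative triplets in the spirit of the ATOMIC-style data the paper targets (made-up examples, not drawn from the actual generated corpus):
# Hypothetical (subject, predicate, result) triplets for commonsense reasoning,
# illustrative only and not taken from the paper's generated dataset.
triplets = [
    ("PersonX goes to the store",  "xNeed",   "to have money"),
    ("PersonX passes the exam",    "xEffect", "PersonX feels proud"),
    ("PersonX forgets their keys", "xWant",   "to go back and get them"),
]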

Generated dataset
The authors generate a dataset with GPT-3 using the approach described in the previous section. They are able to produce a dataset roughly 10x larger than the human-annotated one for about 15% of the cost.

Data generation quality
How much data we have doesn't matter if it's all garbage. The authors evaluate the quality of the data produced by human annotation and by GPT-3. Naturally, the raw generated data is of much lower quality, as reflected in its acceptance rate. However, once the critic filtering model is applied, the quality of the generated dataset increases and surpasses that of the human dataset.

Effect on downstream performance
This process produces more data, at higher quality and lower cost, than human annotation, but what is the effect on downstream models? The authors show that simply by training on the augmented dataset, the same model achieves a significant jump in performance.

Example — Book Genre Prediction
Let's look at how we can use this approach to predict the genre of a book given its title. We will use data from the Book Genre Prediction Kaggle dataset, which contains 4,657 books with titles, summaries, and genres. OpenAI's API charges by the token, so I will limit the exploration to book titles, but the same approach would apply to the longer summary text.
Loading data
First, we load the data downloaded from Kaggle and set our OpenAI API key, copied from https://beta.openai.com/
import pandas as pd
import os
import openai
from tqdm.auto import tqdm
openai.api_key = os.getenv("OPENAI_API_KEY")
books = pd.read_csv("./data.csv")
In-context demonstrations
We sample one example of each genre (11 total) to act as demonstrations for the model.
# Sample 1 labeled example of each class to serve as the seed for our generator model
few_shot_example_idx = []
for genre in books["genre"].value_counts().index:
    few_shot_example_idx.append(books[books["genre"] == genre].sample(1).index[0])
books.loc[few_shot_example_idx]
Prompt design
I write a prompt that gives a brief description of the problem and one random in-context demonstration of each genre. The title of the book we want to label is then appended to the end of this template:
# Create a template for the prompt with the examples
prompt_template = f"Classify the given book title as thriller, fantasy, science, history, horror, crime, romance, psychology, sports or travel:\n"
for idx in few_shot_example_idx:
    prompt_template += f'{books["title"].loc[idx]}=> {books["genre"].loc[idx]}\n'
Classify the given book title as thriller, fantasy, science, history, horror, crime, romance, psychology, sports or travel:
Deception Point=> thriller
Hounded=> fantasy
The Star Fraction=> science
Laura Blundy=> history
The Vampire Lestat=> horror
At Bertram's Hotel=> crime
City of Lost Souls=> romance
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life=> psychology
Long Shot=> sports
The Old Ways: A Journey on Foot=> travel
Drowned Wednesday=>
Generating the data
We make a call to the OpenAI API to generate a label for each example. Note: I would recommend testing on a smaller scale first to validate the prompt before processing the whole dataset. It took around 10 minutes to generate the remaining annotations from this seed of 11 examples.
# Initialize the column that will hold the generated labels
books["gpt3_annotations"] = ""
for i in tqdm(range(books.shape[0])):
    prompt = prompt_template + books.iloc[i]["title"] + "=>"
    response = openai.Completion.create(
        model="text-ada-001",
        prompt=prompt,
        temperature=0,
        max_tokens=60,
        top_p=1.0,
        frequency_penalty=0.5,
        presence_penalty=0.0,
    )
    # Store the generated genre for this title, stripped of surrounding whitespace
    books.loc[books.index[i], "gpt3_annotations"] = response.to_dict()["choices"][0]["text"].strip()
Feature representation
I represent the titles using the popular Sentence Transformers library, which builds embeddings for each title to use in a downstream logistic regression model (to keep things simple).
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sentence_transformers import SentenceTransformer
# Encode the string genre labels as integers
le = preprocessing.LabelEncoder()
true_labels = le.fit_transform(books["genre"])
# Note: this assumes every generated annotation matches one of the known genres
noisy_labels = le.transform(books["gpt3_annotations"])
# Embed each title with a sentence-transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(books["title"].tolist())
# Split once so the true and noisy labels share the same train/test partition
embeddings_train, embeddings_test, y_true_train, y_true_test, y_noisy_train, y_noisy_test = train_test_split(embeddings, true_labels, noisy_labels, test_size=0.2, random_state=42)
Critic model
The critic model learns to accept or reject generated examples based on human annotations. Here we simulate those human decisions for 100 examples. Note that the critic can be learned very efficiently because:
- It makes the annotation task easier for humans by making it a binary labeling task
- It allows for more sample-efficient learning as we don't need coverage of the whole label space.
import numpy as np
from sklearn.linear_model import LogisticRegression
# Simulate a human critic accepting or rejecting the label
critic_examples = np.random.randint(embeddings_train.shape[0], size=100)
# Train model to learn from these critic examples
critic_features = embeddings_train[critic_examples]
critic_labels = (y_noisy_train == y_true_train)[critic_examples]
critic_model = LogisticRegression()
critic_model.fit(critic_features, critic_labels)
# Score examples in the dataset and remove those that score in the lowest 30% for acceptance
critic_scores = critic_model.predict_proba(embeddings_train)[:,1]
filtered_training_input = embeddings_train[critic_scores > np.percentile(critic_scores, 30)]
filtered_training_label = y_noisy_train[critic_scores > np.percentile(critic_scores, 30)]
Model comparison
I train simple logistic regression models on each dataset (the small manually labeled seed dataset, the noisy labels without the critic, the critic-filtered data, and the fully labeled dataset) and compare their accuracy. I also report the number of human annotations each dataset required.
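The comparison itself is not shown in full above, so here is a rough sketch of how it could be done with the variables from the earlier snippets (the exact training code may differ in its details, e.g. in how the seed examples are held out):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate(train_X, train_y):
    # Fit a simple classifier and score it on the held-out test set with true labels
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_X, train_y)
    return accuracy_score(y_true_test, clf.predict(embeddings_test))

results = {
    "Only Labeled Examples": evaluate(embeddings[few_shot_example_idx], true_labels[few_shot_example_idx]),
    "Noisy Examples": evaluate(embeddings_train, y_noisy_train),
    "Critic Filtered": evaluate(filtered_training_input, filtered_training_label),
    "True Labels": evaluate(embeddings_train, y_true_train),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")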
While we certainly perform worse than full supervision, the model is pretty respectable in its performance, considering that it only has access to 11 human-labeled examples. I think that this gap can be narrowed further with more effort spent on tweaking the prompt and using a more expensive engine.
+----------------------------+----------+
| Dataset (human labels)     | Accuracy |
+----------------------------+----------+
| Only Labeled Examples (11) | 0.117    |
| Noisy Examples (11)        | 0.177    |
| Critic Filtered (111)      | 0.185    |
| True Labels (3725)         | 0.251    |
+----------------------------+----------+
Conclusion
Traditional human annotation can be costly in terms of both money and engineering time. A new human-in-the-loop, language model augmented data annotation paradigm aims to take much of the heavy lifting off humans by using GPT-3's vast generation capabilities. This technique can be very useful in tasks where:
- Specialist annotators are needed (healthcare)
- Annotations are labor intensive (abstractive summarization)
- Time to market is important (faster > more accurate for an MVP)
By incorporating humans as reviewers in the annotation process rather than as the actual annotators, we can still control the quality of the data produced and obtain higher-quality data in domains where inter-annotator disagreement may be high.