Hands-on Augmentation of Natural Language Dataset Using Paraphrase Database

A line-by-line coding tutorial of generating linguistic variants of source natural language data

Eileen Pangu

Published in Better Programming

5 min read · Jan 19, 2022

Even though natural language text is abundantly available on the internet nowadays, we often still face a shortage of data for domain-specific tasks. For example, say we want to build a natural language interface that takes in the user’s natural language input and outputs a domain-specific translation such as SQL, a graph query, and so on. In this case, the natural language input usually has a certain semantic structure that the random text in Wikipedia or Google News can’t easily represent.

Humans can quickly hand-craft a small set of input-output pairs, maybe on the order of hundreds or even thousands. But it would be too expensive to manually create millions of examples, which is commonly what it takes to train a good natural language model. So we have to resort to automatic data augmentation. This blog post provides a coding tutorial on natural language data augmentation using a paraphrase database.
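
To make the idea concrete before diving in, here is a minimal sketch of paraphrase-based augmentation. It is not the tutorial's actual code: the tiny `PARAPHRASES` table, the `augment` helper, and the seed pair are all hypothetical stand-ins for a real paraphrase database and dataset. Each variant rewrites one phrase in the input and leaves the target output untouched.

```python
from typing import Dict, List, Tuple

# Hypothetical toy paraphrase table (assumption for illustration only);
# a real paraphrase database would hold far more entries.
PARAPHRASES: Dict[str, List[str]] = {
    "show me": ["display", "list"],
    "employees": ["staff", "workers"],
}


def augment(pair: Tuple[str, str],
            table: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return new (input, output) pairs, swapping one phrase at a time."""
    text, target = pair
    variants = []
    for phrase, alternatives in table.items():
        # Naive substring match; a real pipeline would match on token boundaries.
        if phrase in text:
            for alt in alternatives:
                # Only the natural language side changes; the target stays fixed.
                variants.append((text.replace(phrase, alt), target))
    return variants


# Hypothetical hand-crafted seed pair: natural language input and its SQL output.
seed = ("show me all employees", "SELECT * FROM employees")
augmented = augment(seed, PARAPHRASES)

for sentence, query in augmented:
    print(sentence, "->", query)
```

One hand-written pair yields four variants here, which is how a few hundred seed pairs can grow into a much larger training set.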

Word Embedding as an Alternative First

Before we go into the paraphrasing idea, let’s explore an alternative to augment natural language data. To simplify the…
