Hands-on Augmentation of Natural Language Dataset Using Paraphrase Database
A line-by-line coding tutorial on generating linguistic variants of source natural language data
Even though natural language text is abundantly available on the internet nowadays, we often still face a shortage of data for domain-specific tasks. For example, say we want to build a natural language interface that takes in the user’s natural language input and outputs a domain-specific translation such as SQL or a graph query. In this case, the natural language input usually has a certain semantic structure that the random text in Wikipedia or Google News can’t easily represent.
Humans can quickly hand-craft a small set of input-output pairs, perhaps on the order of hundreds or even thousands. But it is too expensive to manually create millions of examples, which is commonly what it takes to train a good natural language model. So we have to resort to automatic data augmentation. This blog post provides a coding tutorial on natural language data augmentation using a paraphrase database.
Word Embedding as an Alternative First
Before we go into the paraphrasing idea, let’s explore an alternative way to augment natural language data. To simplify the…