Better Programming

Hands-on Augmentation of Natural Language Dataset Using Paraphrase Database

Eileen Pangu
Published in Better Programming
5 min read · Jan 19, 2022

Photo by Raj Rana on Unsplash

Even though natural language text is abundantly available on the internet nowadays, we often still face a shortage of data for domain-specific tasks. For example, say we want to build a natural language interface that takes in the user’s natural language input and outputs a domain-specific translation such as SQL, a graph query, and so on. In this case, the natural language input usually has a certain semantic structure that the random text in Wikipedia or Google News can’t easily represent.

Humans can quickly hand-craft a small set of input-output pairs, maybe on the order of hundreds or even thousands. But it would be too expensive to manually create the millions of examples that it commonly takes to train a good natural language model. So we have to resort to automatic data augmentation. This blog post provides a coding tutorial on natural language data augmentation using a paraphrase database.
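To make the idea concrete before diving in, here is a minimal sketch of paraphrase-based augmentation. It is not the code from this tutorial: the `PARAPHRASES` table, the `augment` helper, and the example sentence and SQL are all hypothetical, with a tiny hand-written dictionary standing in for a real paraphrase database. The sketch swaps known phrases in the input for their paraphrases and leaves the output side of the pair untouched.

```python
from itertools import product

# Hypothetical, hand-written paraphrase table standing in for a real
# paraphrase database; the entries are illustrative only.
PARAPHRASES = {
    "show me": ["list", "display"],
    "employees": ["staff", "workers"],
}


def augment(text, target, table=PARAPHRASES):
    """Return (text_variant, target) pairs; the target is never modified."""
    # Naive substring matching; a real pipeline would match on token boundaries.
    phrases = [p for p in table if p in text]
    # For each matched phrase, the options are the phrase itself plus its paraphrases.
    options = [[p] + table[p] for p in phrases]
    pairs = []
    for choice in product(*options):
        variant = text
        for original, replacement in zip(phrases, choice):
            variant = variant.replace(original, replacement)
        pairs.append((variant, target))
    return pairs


seed = (
    "show me all employees in sales",
    "SELECT * FROM employees WHERE dept = 'sales'",
)
for text, sql in augment(*seed):
    print(text, "->", sql)
```

Every combination of replacements becomes a new training pair, so a single hand-crafted seed with two replaceable phrases and two paraphrases each already yields nine inputs (including the original) that map to the same query.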

Word Embedding as an Alternative First

Before we go into the paraphrasing idea, let’s explore an alternative way to augment natural language data. To simplify the…

Written by Eileen Pangu

Manager and Tech Lead @ FANG. Enthusiastic tech generalist. Enjoy distilling wisdom from experiences. Believe that learning is a lifelong journey.
