Better Programming

Advice for programmers.

Extractive Text Summarization Using spaCy in Python

Ng Wai Foong
Published in Better Programming · 8 min read · Jan 30, 2020

Photo by Romain Vignes on Unsplash

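Traditionally, TF-IDF (Term Frequency-Inverse Document Frequency) is used in information retrieval and text mining to measure how important a term is to a document within a collection. For text summarization, these term weights can then be combined to estimate the importance of each sentence.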

The TF-IDF weight is composed of two terms:

  • TF: Term Frequency. Measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (the total number of terms in the document) as a form of normalization.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
  • IDF: Inverse Document Frequency. Measures how important a term is. When computing the term frequency, all terms are considered equally important. However, certain terms appear many times yet carry little meaning. These words are usually called stopwords, for example: is, are, they, and so on. IDF therefore weighs down terms that occur in many documents and scales up the rare ones.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
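
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a definitive implementation: the function names, the toy corpus, and the whitespace tokenization are all assumptions made for the example, and the final TF-IDF weight is taken as the product of the two terms.

```python
import math
from collections import Counter


def term_frequency(term, document):
    """TF(t): occurrences of `term` divided by the total number of terms."""
    if not document:
        return 0.0
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term, corpus):
    """IDF(t): natural log of (total documents / documents containing `term`)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        # Assumption: a term that appears in no document gets a weight of 0
        # instead of triggering a division by zero.
        return 0.0
    return math.log(len(corpus) / containing)


def tf_idf(term, document, corpus):
    """TF-IDF weight of `term` in `document`, relative to `corpus`."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


# Hypothetical toy corpus: each document is a list of lowercase tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus, "the" occurs in every document, so its IDF is log_e(3 / 3) = 0 and its TF-IDF weight is 0 no matter how often it appears. The rarer "mat" receives a higher weight than "cat", which occurs in two of the three documents.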

Written by Ng Wai Foong

Senior AI Engineer@Yoozoo | Content Writer #NLP #datascience #programming #machinelearning | Linkedin: https://www.linkedin.com/in/wai-foong-ng-694619185/
