Extractive Text Summarization Using spaCy in Python
Find the top sentences from an article based on keywords extracted using spaCy

TF-IDF (Term Frequency-Inverse Document Frequency) is a long-established weighting scheme in information retrieval and text mining. It scores how important each term is, and in text summarization those term scores are used to rank the sentences that contain them.
The TF-IDF weight is composed of two terms:
- TF (Term Frequency): measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. To normalize for this, the term count is usually divided by the document length, that is, the total number of terms in the document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
- IDF (Inverse Document Frequency): measures how important a term is across the whole collection of documents. Term frequency alone treats every term as equally important, yet some terms appear very often while carrying little meaning. These words are usually called stopwords, for example: is, are, they, and so on. IDF scales down the weight of terms that occur in many documents and scales up the weight of terms that occur in few.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
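To make the two formulas concrete, here is a minimal sketch in plain Python that computes TF, IDF, and their product for a toy corpus. The function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) and the three example sentences are made up for illustration, and returning 0 for a term that appears in no document is an assumption added to avoid dividing by zero.

```python
import math
from collections import Counter


def term_frequency(term, tokens):
    """TF(t) = (times t appears in the document) / (total terms in the document)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    """IDF(t) = log_e(total documents / documents containing t)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        # Assumption: a term found in no document gets weight 0
        # instead of raising a division-by-zero error.
        return 0.0
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """TF-IDF weight of a term in one document of the corpus."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Toy corpus: each document is a list of lowercase tokens (illustrative data only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "mat"):
    print(term, tf_idf(term, corpus[0], corpus))
```

In this toy corpus, "the" appears in every document, so its IDF is log_e(3/3) = 0 and its TF-IDF weight is 0 even though it is the most frequent word in the first document. "mat" appears in only one document, so it receives the highest weight of the three terms.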