Extractive Text Summarization Using spaCy in Python

Find the top sentences from an article based on keywords extracted using spaCy

Ng Wai Foong

Published in Better Programming

8 min read · Jan 30, 2020

TF-IDF (Term Frequency-Inverse Document Frequency) is traditionally used in information retrieval and text mining to measure how important a term is to a document. In text summarization, those term weights are used to score the importance of each sentence.

The TF-IDF weight is composed of two terms:

TF: Term Frequency. Measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often normalized by dividing it by the document length, that is, the total number of terms in the document.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
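As a minimal sketch of the TF formula above, the snippet below counts terms in a tokenized document and divides each count by the document length. The function name `term_frequency`, the sample sentence, and the whitespace tokenization are assumptions made for illustration; they are not taken from the article's own code.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of terms} for one tokenized document.

    Hypothetical helper for illustration only.
    """
    total = len(tokens)
    if total == 0:
        # An empty document has no terms, so there is nothing to normalize.
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Naive whitespace tokenization, chosen only to keep the example short.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)

# "the" appears 2 times out of 6 terms, "cat" appears 1 time out of 6.
print(tf["the"], tf["cat"])
```

Dividing by the document length keeps scores comparable between a short paragraph and a long article, which is the normalization the definition describes.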

IDF: Inverse Document Frequency. Measures how important a term is. When computing term frequency, all terms are treated as equally important. However, certain terms, such as is, are, and they, appear many times yet carry little meaning. These words are usually called stopwords. IDF scales down the weight of terms that appear in many documents and scales up the weight of rare ones.

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
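The IDF formula can be sketched in a few lines of plain Python. The example below assumes a tiny three-document corpus with whitespace tokenization, and the helper name `inverse_document_frequency` is hypothetical. The last step multiplies TF by IDF, which is the standard way the two terms are combined into a TF-IDF weight.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Return {term: ln(N / number of documents containing the term)}.

    Hypothetical helper for illustration only.
    """
    n_docs = len(documents)
    doc_counts = Counter()
    for tokens in documents:
        # Use a set so each document counts a term at most once.
        doc_counts.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_counts.items()}


# Toy corpus, assumed for this example.
documents = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
idf = inverse_document_frequency(documents)

# "the" occurs in every document, so its IDF is ln(3 / 3) = 0.
# "bird" occurs in only one document, so its IDF is ln(3 / 1), the largest here.
print(idf["the"], idf["cat"], idf["bird"])

# TF-IDF weight of "cat" in the first document: TF * IDF.
first_doc = documents[0]
tf_cat = first_doc.count("cat") / len(first_doc)
tfidf_cat = tf_cat * idf["cat"]
print(tfidf_cat)
```

A term such as "the" ends up with a weight of zero regardless of how often it appears, which is why stopwords contribute little to a TF-IDF based sentence score.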
