
Pandas: How to Process a Dataframe in Parallel

Konstantinos Patronas
Published in Towards Dev
3 min read · Feb 20, 2022


Photo by Stone Wang on Unsplash

Pandas is one of the most widely used libraries for data analysis, but it can be slow on large data sets because most of its operations run on a single CPU core. To speed things up and use all cores, we need to split the DataFrame into smaller ones, ideally into as many parts as there are available CPU cores.
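The splitting step can be sketched as follows. This is a minimal example, assuming row-wise chunks are acceptable for the operation at hand; the helper name split_dataframe and the toy "value" column are illustrative, not taken from the original script.

```python
import math
import os

import pandas as pd


def split_dataframe(df: pd.DataFrame, n_parts: int) -> list:
    """Split df row-wise into at most n_parts chunks of similar size.

    Hypothetical helper for illustration only.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Round up so that every row lands in some chunk.
    chunk_size = max(1, math.ceil(len(df) / n_parts))
    return [
        df.iloc[start:start + chunk_size]
        for start in range(0, len(df), chunk_size)
    ]


# Toy data standing in for a real DataFrame.
df = pd.DataFrame({"value": range(10)})

# os.cpu_count() can return None, so fall back to a single part.
n_cores = os.cpu_count() or 1
chunks = split_dataframe(df, n_cores)
print(n_cores, [len(chunk) for chunk in chunks])
```

Because the chunk size is rounded up, the last chunk may be shorter than the others, and fewer than n_parts chunks can come back when the row count does not divide evenly.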

Python's concurrent.futures module lets us easily create processes without having to worry about details such as joining them. Consider the following example (pandas_parallel.py):
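A minimal sketch of what such a script could look like, not the author's exact listing. The function process_in_parallel and the worker_pid column are illustrative assumptions, and a small in-memory DataFrame stands in for the CSV file so the sketch runs on its own.

```python
import os
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def do_something(chunk: pd.DataFrame) -> pd.DataFrame:
    """Runs inside a worker process and returns the processed chunk.

    As a placeholder operation, tag each row with the PID of the
    process that handled it (illustrative only).
    """
    return chunk.assign(worker_pid=os.getpid())


def process_in_parallel(df: pd.DataFrame, workers: int) -> pd.DataFrame:
    """Split df row-wise, process the chunks in parallel, and reassemble."""
    if df.empty:
        return do_something(df)
    size = max(1, -(-len(df) // workers))  # ceiling division
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    # The with-block waits for all workers and cleans them up,
    # so no manual joining of processes is needed.
    with ProcessPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(do_something, chunks))
    return pd.concat(results)


if __name__ == "__main__":
    # Toy data; the real script would read the CSV file instead.
    df = pd.DataFrame({"value": range(8)})
    workers = min(4, os.cpu_count() or 1)
    print(process_in_parallel(df, workers))
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by re-importing the main module, code outside the guard would run again in every worker.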

And the CSV file that we will use to create the DataFrame:

https://github.com/kpatronas/big_5m/raw/main/5m.csv

Explaining the Code

These are the libraries we need; concurrent.futures is the one that provides what we need to process the DataFrame in parallel.

The do_something function accepts a DataFrame as a parameter. It will be executed in parallel, in separate processes.

The functions below return the current process PID and the parent process PID, respectively:

os.getpid()
os.getppid()
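A short sketch of how the two calls can be used to see which process is running a piece of code. The helper name process_ids is an illustrative assumption.

```python
import os


def process_ids() -> dict:
    """Return the PID of the current process and of its parent."""
    return {"pid": os.getpid(), "parent_pid": os.getppid()}


ids = process_ids()
print(f"PID: {ids['pid']}, parent PID: {ids['parent_pid']}")
```

When a function like this is called inside a worker process, the reported PID differs from one worker to another, which is a quick way to confirm that the chunks really are handled by separate processes.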

The pandas operation we perform creates a new column named diff, which holds the time difference between the current date and the date in the “Order Date” column. After the operation, the function returns the processed DataFrame.
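The diff computation can be sketched like this. It is a minimal example under stated assumptions: the month/day/year date format, the helper name add_diff, and its optional now parameter (added so the result is reproducible) are not taken from the original script.

```python
import pandas as pd


def add_diff(df: pd.DataFrame, now=None) -> pd.DataFrame:
    """Return a copy of df with a 'diff' column: now minus 'Order Date'."""
    now = pd.Timestamp.now() if now is None else pd.Timestamp(now)
    out = df.copy()
    # Assumed date format; adjust it to match the actual CSV contents.
    order_dates = pd.to_datetime(out["Order Date"], format="%m/%d/%Y")
    out["diff"] = now - order_dates
    return out


# Toy rows standing in for the real data.
orders = pd.DataFrame({"Order Date": ["1/15/2022", "2/1/2022"]})
result = add_diff(orders, now="2022-02-20")
print(result)
```

The diff column holds timedelta values, so whole days can be read with result["diff"].dt.days.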

The part of the code below is the entry point of our script, where the processing is set up and started.
