Pandas: How to Process a Dataframe in Parallel
Make pandas lightning fast
Pandas is the most widely used library for data analysis, but it can be rather slow because most of its operations run on a single CPU core. To speed things up and use all cores, we need to split our DataFrame into smaller ones, ideally into as many parts as there are available CPU cores.
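As a rough sketch of the splitting step, the snippet below cuts a DataFrame row-wise into one chunk per core. The helper name `split_dataframe` and the `value` column are assumptions made for illustration; they are not part of pandas or of the script discussed in this article.

```python
import os

import numpy as np
import pandas as pd


def split_dataframe(df: pd.DataFrame, n_parts: int) -> list[pd.DataFrame]:
    """Hypothetical helper: split df row-wise into n_parts chunks of near-equal size."""
    # n_parts + 1 evenly spaced row boundaries, truncated to integers
    bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
    # Each chunk is a positional slice between two neighbouring boundaries
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


# os.cpu_count() can return None, so fall back to a single part
n_cores = os.cpu_count() or 1
df = pd.DataFrame({"value": range(100)})  # toy data, assumed for the example
chunks = split_dataframe(df, n_cores)
print(f"{len(chunks)} chunks for {n_cores} cores")
```

Slicing with `iloc` keeps the original index on every chunk, so the parts can later be put back together with `pd.concat` without losing the row order.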
Python's concurrent.futures module allows us to easily create processes without having to worry about things like joining them afterwards. Consider the following example (pandas_parallel.py):
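A minimal sketch of this pattern, under stated assumptions, could look like the following. The functions `process_chunk` and `parallel_apply`, the `value` column, and the squaring step are all hypothetical placeholders for whatever per-chunk work you actually need; only `ProcessPoolExecutor` and its `map` method come from concurrent.futures.

```python
import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-chunk work: add a column derived from "value"."""
    chunk = chunk.copy()  # work on a copy, not on a slice of the parent frame
    chunk["value_squared"] = chunk["value"] ** 2
    return chunk


def parallel_apply(df, func, n_workers=None):
    """Hypothetical helper: run func on row-wise chunks of df in separate processes."""
    n_workers = n_workers or os.cpu_count() or 1
    # One chunk per worker, split at evenly spaced row boundaries
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    # Leaving the with-block waits for the workers and shuts the pool down
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        results = list(executor.map(func, chunks))  # map keeps the input order
    return pd.concat(results)


if __name__ == "__main__":
    # The guard is required on platforms that start workers by re-importing this file
    df = pd.DataFrame({"value": range(1_000)})  # toy data, assumed for the example
    out = parallel_apply(df, process_chunk, n_workers=4)
    print(out.tail())
```

Every chunk is pickled and sent to a worker process, and every result is pickled on the way back, so this tends to pay off only when the per-chunk work is heavy enough to outweigh that overhead.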
This is the CSV file we will use to create the DataFrame:
https://github.com/kpatronas/big_5m/raw/main/5m.csv
Explaining the Code
These are the libraries we need. concurrent.futures is the one that provides what we need to process the DataFrame in parallel.