Better Programming

Advice for programmers.

Load Data Faster in Python With Compressed Pickles

Store any Python object faster and in a smaller file size

Carlos Valcarcel
4 min read · Jan 27, 2020

Photo by Michael Jasmund on Unsplash

Do you hate how long it takes to load data? Is your hard drive running low on available space? Here are four easy-to-implement functions that will help any Python programmer, from beginner to advanced, manage their projects.

Compressed Pickles

If you have been working in Python for a while, you may be familiar with the pickle library.

It serializes almost any Python object (including massive datasets) to bytes. In the benchmark further down, loading a pickle took a fraction of the time needed to load the same data from a .csv file. Depending on the object, pickling might save you some space as well. However, it often won’t be enough.

Enter Python’s bz2 library, which provides bzip2 compression for any file or bytes object. By sacrificing some of the speed gained by pickling your data, you can often compress it to a quarter of its original size or less, depending on the data.
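To make the idea concrete, here is a minimal in-memory sketch: it pickles an object to bytes, compresses those bytes with bz2, and reverses both steps. The records list is made-up sample data, and the size reduction you see will depend heavily on how repetitive your own data is.

```python
import bz2
import pickle

# Hypothetical sample data; repetitive records like these compress well.
records = [{"id": i, "label": "sample"} for i in range(10_000)]

raw = pickle.dumps(records)      # step 1: serialize the object to bytes
packed = bz2.compress(raw)       # step 2: compress the pickled bytes

print("pickled bytes:   ", len(raw))
print("compressed bytes:", len(packed))

# Reverse the two steps in the opposite order to get the object back.
restored = pickle.loads(bz2.decompress(packed))
assert restored == records
```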

The Four Functions

Below are four Python functions that make short work of saving and loading data. I include them in the utils.py file of any project I work on.

Imports

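import bz2
import pickle
# On Python 3, pickle already uses the C-accelerated _pickle when available;
# the cPickle alias is kept because the functions below refer to it.
import _pickle as cPickle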

1. Full pickle

The full_pickle function takes almost any object (list, dictionary, pandas.DataFrame, and more) and saves it as a .pickle file.

# Saves "data" to a file named "title" plus the .pickle extension
def full_pickle(title, data):
    # The with block closes the file even if pickling raises an error
    with open(title + '.pickle', 'wb') as pikd:
        pickle.dump(data, pikd)

Example usage:

full_pickle('filename', data) 
  • filename is the name of the file with no extension.
  • data is any object.

2. Loosen

Load the pickle files you or others have saved using the loosen function. Include the .pickle extension in the file argument.

# Loads and returns a pickled object
def loosen(file):
    # The with block closes the file even if unpickling raises an error
    with open(file, 'rb') as pikd:
        data = pickle.load(pikd)
    return data

Example usage:

data = loosen('example_pickle.pickle') 
  • file is the file name with the .pickle extension.
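As a quick sanity check, the two functions can be exercised together in a round trip. The sketch below redefines them so it runs on its own, and writes to a temporary directory; the dictionary and the file name are placeholders, not part of the original article.

```python
import os
import pickle
import tempfile

def full_pickle(title, data):
    # Save "data" to title + ".pickle"
    with open(title + ".pickle", "wb") as f:
        pickle.dump(data, f)

def loosen(file):
    # Load and return the object stored in "file"
    with open(file, "rb") as f:
        return pickle.load(f)

original = {"scores": [1, 2, 3], "name": "example"}  # placeholder data

with tempfile.TemporaryDirectory() as tmp:
    title = os.path.join(tmp, "example_pickle")  # no extension here
    full_pickle(title, original)
    restored = loosen(title + ".pickle")         # extension required here

print(restored == original)  # True
```

Note the asymmetry in the arguments: saving takes the name without the extension, loading takes it with the extension.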

3. Compressed pickle

The compressed_pickle function works just like full_pickle. It even takes the same arguments. It pickles the object and compresses the result using the bz2 library, adding the .pbz2 extension to the saved file automatically.

# Pickle the data and compress it into a file with the .pbz2 extension
def compressed_pickle(title, data):
    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        cPickle.dump(data, f)

Example usage:

compressed_pickle('filename', data) 
  • filename is the name of the file with no extension.
  • data is any object.

Note that this compresses the pickled data; it doesn’t work as well the other way around.

4. Decompress pickle

The decompress_pickle function works just like the loosen function. Include the .pbz2 extension in the file argument.

# Load any compressed pickle file
def decompress_pickle(file):
    # The with block closes the compressed file once loading is done
    with bz2.BZ2File(file, 'rb') as f:
        data = cPickle.load(f)
    return data

Example usage:

data = decompress_pickle('example_cp.pbz2') 
  • file is the file name with the .pbz2 extension.
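The compressed pair can be checked the same way. This sketch redefines both functions (using the standard pickle module instead of the cPickle alias, which is an assumption on my part that the two behave the same here), saves made-up data both compressed and uncompressed in a temporary directory, and compares the file sizes.

```python
import bz2
import os
import pickle
import tempfile

def compressed_pickle(title, data):
    # Pickle "data" straight into a bz2-compressed file
    with bz2.BZ2File(title + ".pbz2", "w") as f:
        pickle.dump(data, f)

def decompress_pickle(file):
    # Decompress and unpickle in one pass
    with bz2.BZ2File(file, "rb") as f:
        return pickle.load(f)

# Placeholder data with lots of repetition
data = [f"row-{i % 50}" for i in range(20_000)]

with tempfile.TemporaryDirectory() as tmp:
    title = os.path.join(tmp, "example_cp")
    compressed_pickle(title, data)

    # Plain pickle of the same data, for a size comparison
    with open(title + ".pickle", "wb") as f:
        pickle.dump(data, f)

    plain_size = os.path.getsize(title + ".pickle")
    packed_size = os.path.getsize(title + ".pbz2")
    restored = decompress_pickle(title + ".pbz2")

print(f"plain: {plain_size:,} bytes, compressed: {packed_size:,} bytes")
```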

Benchmarks

So, how much faster is pickling and how much space are we saving?

Here’s a benchmark test I performed on an AWS virtual machine for less than a penny ($0.01) using a module I created for cloud computing.

Save CSV File: 3.384 seconds
Load CSV File: 1.977 seconds
CSV File Size: 39,575,154 bytes
Save Pickle File: 3.422 seconds
Load Pickle File: 0.156 seconds
Pickle File Size: 40,759,166 bytes
Save Compressed Pickle: 4.837 seconds
Load Compressed Pickle: 1.139 seconds
Compressed Pickle File Size: 1,467,842 bytes

Saving the 39 MB pandas.DataFrame object as a .csv file took 3.4 seconds, about as long as saving the .pickle file and roughly 1.5 seconds faster than saving the compressed pickle.

The .pickle file and the .csv file took up about the same space, around 40 MB, but the compressed pickle file took up only 1.5 MB. That’s a lot of saved space.

Another big difference is in the load times. If you’re looking for faster loading, either pickle format will work; which one to choose depends on your space needs.

Loading the .csv file took 2 seconds and loading the compressed pickle .pbz2 file took 1.1 seconds, whereas loading the .pickle file took a mere 0.16 seconds.
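If you want to run a similar comparison on your own machine, a rough timing harness might look like the following. This is not the script behind the numbers above; it is a sketch that uses only the standard library, synthetic data, and the hypothetical helpers timed, save, and load. It also includes gzip and lzma next to bz2 so the trade-offs can be compared side by side.

```python
import bz2
import gzip
import lzma
import os
import pickle
import tempfile
import time

def timed(fn, *args):
    # Return (result, elapsed seconds) for a single call
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def save(opener, path, data):
    with opener(path, "wb") as f:
        pickle.dump(data, f)

def load(opener, path):
    with opener(path, "rb") as f:
        return pickle.load(f)

# Synthetic stand-in for a real dataset
data = [{"id": i, "value": i * 0.5, "tag": "abc"} for i in range(20_000)]

# Each opener accepts (path, mode) and returns a binary file object
openers = {"plain": open, "bz2": bz2.open, "gzip": gzip.open, "lzma": lzma.open}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, opener in openers.items():
        path = os.path.join(tmp, "bench." + name)
        _, save_s = timed(save, opener, path, data)
        restored, load_s = timed(load, opener, path)
        assert restored == data  # the round trip must not change the data
        results[name] = (save_s, load_s, os.path.getsize(path))

for name, (save_s, load_s, size) in results.items():
    print(f"{name:>5}: save {save_s:.3f}s  load {load_s:.3f}s  {size:,} bytes")
```

Single-run timings like these are noisy, so treat them as a rough guide and repeat the measurement on data that resembles your own.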

Things to Try or Look Out For

  • The order of pickling then compressing is tested and works without degrading data. Changing the order leads to worse performance.
  • You might want to try other compression methods that suit your needs better than bz2 compression.
  • Pickling or compressing certain class objects might not work. In these cases, try saving the instance’s attributes (usually accessible as a dictionary) and then creating another class instance and assigning it the attributes.
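The last point can be sketched as follows. Experiment is a made-up class, and dump_attributes and load_attributes are hypothetical helper names. The approach assumes the instance stores its state in __dict__ (no __slots__) and that the attribute values themselves can be pickled; if they cannot, this workaround won’t help.

```python
import pickle

class Experiment:
    # Made-up class used only for illustration
    def __init__(self, name, values):
        self.name = name
        self.values = values

    def total(self):
        return sum(self.values)

def dump_attributes(obj):
    # vars(obj) is the instance's attribute dictionary (obj.__dict__)
    return pickle.dumps(dict(vars(obj)))

def load_attributes(cls, payload):
    # Create an instance without calling __init__, then restore its state
    obj = cls.__new__(cls)
    obj.__dict__.update(pickle.loads(payload))
    return obj

exp = Experiment("run-1", [1, 2, 3])
payload = dump_attributes(exp)
clone = load_attributes(Experiment, payload)

print(clone.name, clone.total())  # run-1 6
```

Only the attribute dictionary is stored, so the class definition has to be available separately when loading.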

That’s it. In the future, I will be writing more articles about simple yet remarkably useful functions and classes that I often use in my projects. Some of them build on what we’ve seen here, others do not.

Thanks for reading.
