Analyzing Polars From a Data Science Perspective

Using polars to make more robust models

Walter Sperat
Better Programming



If you work in data science, you have surely heard of numpy and pandas. Together, these two libraries (though you can also find some scipy every now and then) account for the vast majority of data science experimentation and work.

Numpy (short for Numerical Python) is a library for scientific computing whose core functionality is mostly written in C; it's designed from the ground up to represent multi-dimensional numeric arrays efficiently in memory (something that Python's native lists don't do). The idea is that you can delegate the heavy computations to efficient C code through a convenient, Python-familiar API, without having to write that logic yourself.

One of the great things about numpy is that it allows for single-instruction-multiple-data (SIMD) operations, a very low-level way of executing an operation on multiple elements of an array at once. At the time of writing, however, numpy (mostly) doesn't parallelize over multiple processors.

For its part, pandas (the name comes from "panel data") can be understood as a wrapper over numpy with added functionality designed to deal with SQL-like tables. That isn't to say it's a small or simple library; it's just a rough picture of how it works under the hood. Just like numpy, it doesn't parallelize most operations, though the latest versions have been quietly adding some parallelization here and there.

Both numpy and pandas are the bread and butter of data science libraries for analysis or modeling. Most tools interact directly with them, and those that don't provide convenient utilities that take numpy/pandas data structures and turn them into their internal representation. Why would they do that? Well, on the one hand, most real-life data isn't just composed of numbers but can also include dates, categorical types, and text; also, when dealing with massive amounts of data (such as multi-gigabyte to terabyte-sized datasets), both numpy's and pandas' data structures start to have a disproportionately large memory footprint.

Enter polars: a library written in Rust (yeah, I know the base language doesn't make that much of a difference, but I just think it's cool; it's my article, sue me) designed from the ground up to be pandas-but-better with nothing but efficiency in mind. This efficiency comes in three forms:

  1. A different though pretty intuitive API to interact with its table and column objects.
  2. A really small memory footprint, which is especially noticeable with large tables.
  3. It's fast, like really fast. Polars not only vectorizes operations (like numpy and pandas do) but its implementation of data structures in memory is insanely optimized towards making everything as easy as possible on the CPU. Oh, and did I forget to mention that it also parallelizes across several cores? For a rather dated though very clear and in-depth explanation about how this works from the developer himself, I highly recommend his article on the subject.

With that out of the way, let's get started with some polars.

The Polars API

If you're familiar with SQL or Spark's dataframe API, polars will have a very short learning curve. It also has some pandas-like things:

import polars as pl

df = pl.read_csv(path_to_data)
df = pl.read_parquet(path_to_data)
.
.
.

Nothing particularly weird so far. What about selecting columns?

df.select(['a', 'b', 'c'])
df[['a', 'b', 'c']]

Both kinds of syntax are functionally equivalent, will give the same result, and take the same time.

Applying masks on the data, a.k.a. filtering, is done like so:

df.filter(pl.col('a')==5)

That's quite a departure from numpy and pandas. However, you can intuitively understand what's going on: we're calling the dataframe's filter method and passing it a type of object called a polars expression, constructed from the column named 'a' and a boolean comparison to the integer literal 5.

This is where we get to the most important thing we must understand to use polars effectively: polars expressions. In short, we use this domain-specific language to tell polars what we want done to the data, and the polars engine automatically optimizes our expressions into (theoretically) equivalent but faster ones.

So, instead of calling methods on polars Series or DataFrame objects (though we still can for several operations), it's better to express calculations in this expression language directly, building them up from pl.col objects.
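As a minimal sketch of the difference (the frame and column names here are made up for illustration):

import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Method style: operate on a Series directly
doubled_series = df['a'] * 2

# Expression style: describe the computation and let the engine run it
doubled_frame = df.select((pl.col('a') * 2).alias('a_doubled'))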

Creating a dataframe with new or modified columns can be done through a very pyspark-like method called with_columns (though notice the final "s" that makes it, confusingly, different from pyspark's withColumn):

df.with_columns(
    (pl.col('a') + pl.col('b')).alias('a+b'),
    pl.col('c').mean().alias('c_mean'),
    pl.col('b') / 2
)

This will create a new column called a+b, another called c_mean, and replace the original b column with its values divided by 2. It should be noted that with_columns does not modify the original dataframe; it returns a new one, so we have to explicitly assign the result back to the df variable if we want to keep the changes.
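To make that last point concrete, a tiny sketch (names are illustrative):

# with_columns returns a new dataframe and leaves the original untouched
df_halved = df.with_columns(pl.col('b') / 2)

# to keep the change under the same name, reassign explicitly
df = df.with_columns(pl.col('b') / 2)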

Group aggregations are also easy to understand:

df.groupby('d').agg(
    pl.col('a').sum().alias('a_sum'),
    pl.col('a').count().alias('a_count')
)

As you can see, it's very similar to how group-by operations are generally thought of and, I would argue, even more intuitive than how pandas implements them.
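For comparison, a roughly equivalent pandas version (assuming the same table lives in a pandas dataframe called df_pd) could look something like this:

# named aggregations: one sum and one count of column 'a' per group of 'd'
df_pd.groupby('d').agg(
    a_sum=('a', 'sum'),
    a_count=('a', 'count'),
).reset_index()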

What now?

The things we've covered so far by no means make us experts in polars, but they lay the groundwork to start stretching our wings with it. I recommend you read the documentation (it's pretty good) and try it out on your own problems.

Now, on to some data science stuff.

Building a Custom Scikit-Learn Estimator With Polars

Before jumping into polars, let's consider a hypothetical (though based on very real situations) problem: say that we run a monthly subscription service and need to predict which customers will unsubscribe during the month. The services are highly fractional, meaning that users pay for what they use (imagine you paid for Netflix according to the number of hours you watched), which means that most customers pay different amounts of money. Right now, the company's services are available in four countries, called 1, 2, 3, and 4. The Customer team knows cost to be the main driver of churn in all four countries.

We'll first generate data that follows this behavior:

import numpy as np
import pandas as pd
import polars as pl

N = 10_000_000

countries = np.repeat([*range(1, 5)], N)

cost = (
    np.array([np.random.randn(N) + country + 5 for country in np.unique(countries)])
).flatten()

y = cost > countries + 5

pd_data = pd.DataFrame({
    'country': countries,
    'cost': cost.ravel(),
    'y': y.ravel()
})

This will generate data that looks like this (though a couple of transformations were done for easier visualization):

Distribution of the target according to both cost and country code (image by author)

Yellow points correspond to customers that will churn, while purple ones correspond to customers that won't.

Given the two available variables, this particular problem is easily solvable without any advanced analytical technique: the two classes can be trivially separated by a line. However, this artificial problem is meant to stand in for real-world situations, which may (or may not…) be more complex and require some smart feature engineering.

Now, consider that payments are performed in the local currency (not in euros or US dollars). The vertical axis in the plot above therefore represents a number of local currency units, so we can't really assign it any real-world units. So how do we model this?

As expressed above, the problem is simple enough:

from sklearn.linear_model import LogisticRegression

ordinal_model = LogisticRegression()
ordinal_model.fit(pd_data[['country', 'cost']], pd_data['y'].values.ravel())

ordinal_predictions = ordinal_model.predict(pd_data[['country', 'cost']])

Which indeed gets the correct results:

Predictions of the first model (image by author)

The model is great; it captures the exact behavior we need it to. All right, so we go to the Customer team and tell them this is our deliverable, ready to go into deployment. However, a very data-savvy analyst on the Customer team starts digging into the model and calls us back. She tells us that, despite the model being great, it won't generalize, and shows us this:

Coefficients of the logistic regression (image by author)

Surprised, we tell her that the coefficients she observed are sound: given how the countries were encoded, they influence the churn probability negatively, while cost influences it positively. Furthermore, all our analyses (not shown here) contradict her claim: the learned behavior, i.e., that larger costs increase the probability of churn, will generalize just fine as long as nothing disrupts it (such as a pandemic…).
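(If you want to reproduce her check, the numbers come straight off the fitted model; the exact values will vary with the randomly generated data:)

# one coefficient per input column, in order: country, cost
print(ordinal_model.coef_, ordinal_model.intercept_)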

That's where the analyst tells us something we don't know yet. The company is planning to expand to other countries. This could potentially generate a couple of problems:

  1. New models will need to be trained as more countries are added.
  2. During the first few months in a new country, we'll be flying completely blind, having no way of predicting churn until we collect more data, leading to undesired customer loss.
  3. The relationship between country and churn will probably not fit the neat pattern shown in the plot above, meaning an appropriate encoding must be found. This may be time-consuming, again leading to excess churn.

But, she tells us, we need not worry, for she has recently started working with this exciting new library called polars and has written a custom estimator to deal with our problem:

from sklearn.base import BaseEstimator, TransformerMixin
import polars as pl

class PolarsGroupStandardizer(BaseEstimator, TransformerMixin):
    def __init__(self, group_column: str, columns: list[str]) -> None:
        self.group_column = group_column
        self.columns = columns

    def fit(self, X: pd.DataFrame, y=None) -> "PolarsGroupStandardizer":
        # nothing to learn: all statistics are computed at transform time
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        # convert to polars, standardize each requested column within its group,
        # and return only those columns as a numpy array
        return pl.DataFrame(X).with_columns(
            self._standardize(column) for column in self.columns
        ).select(self.columns).to_numpy()

    def fit_transform(self, X: pd.DataFrame, y=None) -> np.ndarray:
        return self.transform(X)

    def _standardize(self, column: str) -> pl.Expr:
        # window functions: subtract the group mean and divide by the group std
        return (
            (
                pl.col(column) - pl.col(column).mean().over(self.group_column)
            ) / (
                pl.col(column).std().over(self.group_column)
            )
        )

    def __repr__(self) -> str:
        return f'PolarsGroupStandardizer({self.group_column}, {self.columns})'

    def __str__(self) -> str:
        return self.__repr__()

Let's go through the class bit by bit:

  1. The PolarsGroupStandardizer inherits from both BaseEstimator and TransformerMixin, which will make it compatible with scikit-learn composable classes (i.e., Pipelines and ColumnTransformers).
  2. The __init__ takes two arguments, a string and a list of strings, and stores them. The string is the column we want to group by (in our case, the country), and the list contains the columns we want to standardize (in our case, just the cost). If you don't yet see where this is going, don't worry, we'll get there.
  3. Interestingly, the fit method does nothing. This means that this transformer won't learn anything from the data. Why would this be so? Be patient, grasshopper.
  4. The transform method is where things get interesting. First, the pandas dataframe is turned into a polars dataframe. Then, the with_columns method is called with a generator that applies the mysterious _standardize method to every one of the columns passed to __init__. Finally, a numpy array containing only those columns is returned, using the select and to_numpy methods.
  5. As expected, the fit_transform method only calls transform and returns its result, as fitting the estimator isn't needed.
  6. The _standardize method is where the magic happens, as it's what does the heavy lifting inside transform. Looking at the signature, we can see that it takes a string (the name of a column) and returns a polars expression; that makes a lot of sense, considering that with_columns takes polars expressions.
    The first part takes the passed column and subtracts something from it. That subtracted part is what SQL calls a window function: the expression calculates the mean of the column (in our case, the cost variable) over the group_column (in our case, the country), meaning the mean of each group is computed and subtracted from the column's value, row by row. The same windowing logic is then applied with the standard deviation instead of the mean (a small standalone example follows this list).
  7. The __repr__ and __str__ methods are just to make the class representation look pretty.
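To see the windowing behavior from item 6 in isolation, here is a tiny sketch with made-up numbers (not the article's dataset):

toy = pl.DataFrame({
    'country': [1, 1, 2, 2],
    'cost': [1.0, 3.0, 10.0, 30.0],
})

toy.with_columns(
    pl.col('cost').mean().over('country').alias('cost_mean_per_country')
)

# each row gets the mean of its own country: 2.0, 2.0, 20.0, 20.0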

Let's think about what the analyst did: she created a class that standardizes the cost variable for each country separately. This means that, after the transformation, the cost within each country will have zero mean and a standard deviation of one.
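We can sanity-check that claim directly; a quick sketch reusing the generated pd_data (the per-country means should come out near 0 and the standard deviations near 1):

standardizer = PolarsGroupStandardizer('country', ['cost'])
standardized_cost = standardizer.transform(pd_data[['country', 'cost']])

pl.DataFrame({
    'country': pd_data['country'].to_numpy(),
    'standardized_cost': standardized_cost.ravel()
}).groupby('country').agg(
    pl.col('standardized_cost').mean().alias('mean'),
    pl.col('standardized_cost').std().alias('std')
)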

Why is this so great? It will allow us to create a model that takes only one variable, namely the cost, and has the same performance as before. This will deal with all of our aforementioned problems with generalization to new countries:

from sklearn.pipeline import Pipeline

standardized_model = Pipeline([
    ('standardizer', PolarsGroupStandardizer('country', ['cost'])),
    ('glm', LogisticRegression())
])

standardized_model.fit(pd_data[['country', 'cost']], pd_data['y'])
standardized_predictions = standardized_model.predict(pd_data[['country', 'cost']])
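One quick, if crude, way to compare the two models is training-set accuracy, reusing the predictions from above (a sanity check rather than a proper evaluation):

from sklearn.metrics import accuracy_score

print('ordinal:     ', accuracy_score(pd_data['y'], ordinal_predictions))
print('standardized:', accuracy_score(pd_data['y'], standardized_predictions))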

First of all, performance appears to be essentially the same as before:

Predictions of the model containing the standardizer class (image by author)

It also depends only on cost:

Coefficients of the logistic regression (image by author)

By following this strategy, we now have a model that is sound from the business perspective: churn probability increases with cost. But because it standardizes the data points per country, it can also extend to other countries. Essentially, we are forgoing the specific cost: instead of learning the direct relationship between cost and churn probability, the model has learned the relationship between the relative cost a customer is paying and churn probability.
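To make that concrete, here is a sketch of how the very same fitted pipeline could score customers from a hypothetical fifth country, with synthetic numbers standing in for a few months of collected local data (no retraining needed, since the standardizer computes its per-country statistics at transform time):

new_country = pd.DataFrame({
    'country': np.repeat(5, 1_000),
    'cost': np.random.randn(1_000) * 3 + 42,  # arbitrary local-currency scale
})

new_country_predictions = standardized_model.predict(new_country[['country', 'cost']])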

Conclusions

Polars is great. It's easy to use, it's fast, and it's very memory-efficient. It also interfaces well with pandas and numpy, allowing easy conversions between those libraries' data structures and their polars equivalents. Building the PolarsGroupStandardizer was both easy and intuitive, and it greatly simplified the model while adding only a small compute footprint.

From the modeling side, this group-standardization strategy is extremely useful when dealing with categorical data and with variables whose distributions change over time. Regarding this last point, consider inflation and its impact on the absolute numbers behind monetary values (e.g., costs, savings, salaries, etc.). When dealing with datasets that contain money and a time component, using this kind of strategy and grouping by period has helped me stabilize models quite a bit.
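As a sketch of that idea (the month, salary, and savings columns here are hypothetical), the same estimator can be reused with a period as the grouping column:

# standardize monetary columns within each month, so inflation-driven drift
# in their absolute values doesn't leak into the model
period_standardizer = PolarsGroupStandardizer('month', ['salary', 'savings'])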

To again repeat the obvious, this is by no means a complete introduction to polars, but I hope it will be enough to get you up and running to experiment with it.

Thank you for reading!
