5 Python Scripts I Found Useful for Data Processing Operations

Some practical scripts I discovered in my archives

Published in Better Programming · 5 min read · Feb 1, 2023

Image by author: Long exposure of a moonrise (source)

I don’t know about you, but I keep going back to my old projects for certain scripts, even though I use them at least 3–4 times a month. My brain apparently refuses to memorize them and would rather have me dig through my ancient tasks or Stack Overflow.

I confess I first wanted to build a Jupyter notebook as a reference for my future work. Then I thought an article might also help someone else, so here you will find data processing scripts for filtering, aggregating, and sorting, along with some code performance optimization.

Introduction

Conditions, loops, and exception handling are fundamental concepts, and they are language-independent. Different languages may name them differently, but their role in the code logic is identical.

Wanting to say more about loops motivated me to write this article. In small-scale tasks, we rarely need to care about code performance, but if users consume our code as a library or a web service, response time and memory usage are crucial. In that setting, using loops to build new objects from existing data structures such as lists, arrays, or dictionaries may hurt performance.

The scripts below may help you remove loops, particularly in data processing operations. As a real example, during my master’s research, I studied text classification, and the experiments required plenty of text preprocessing. In the first draft of the code, the feature extraction stage took around two days (yes, 40–42 hours). After I started using the scripts below, the duration dropped to nearly two hours. In addition to replacing loops, they may help you increase readability by keeping the code simple without over-engineering.

Before starting, I want to note that for each example I will also share a randomly generated dataset, the script, and the result, to make the context easier to follow.

Groupby With Naming

Here we go! I want to start with the group-by operation (groupby in pandas). If you work with data, there is no doubt you spend a lot of time on analysis. Whether I am in SQL, Python, or R, group-by aggregations make up a substantial portion of my code. The issue I run into is applying several aggregations to a single column. The script below lets you give a distinct name to each aggregation of the same column:
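The original embedded script is not reproduced here, so the following is only a minimal sketch of the idea using pandas named aggregation. The DataFrame, the column names (city, salary), and the output names are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "city": ["Ankara", "Ankara", "Izmir", "Izmir", "Izmir"],
        "salary": [100, 300, 200, 400, 600],
    }
)

# Named aggregation: each keyword is the name of an output column,
# and each value is a (source column, aggregation) pair.
summary = df.groupby("city").agg(
    min_salary=("salary", "min"),
    max_salary=("salary", "max"),
    mean_salary=("salary", "mean"),
)
print(summary)
```

All three output columns come from the same salary column, yet each one gets its own readable name instead of a multi-level column header.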

Filtering Nested Dictionaries

Since I have worked with PostgreSQL and JSON columns, dictionaries have become one of my best friends. Using loops to handle filtering is a popular approach in Python. After I realized I could filter nested dictionaries with the built-in filter function, the filtering code shrank to a single line. In the code below, dictionary_name.items() supplies the key/value pairs to filter. In the lambda, item[0] is the key and item[1] is the value of each pair. So, a condition on item[1][desired_column_name] filters the dictionary by a nested value.
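As a minimal sketch of this pattern (not the author's original script), the example below filters a nested dictionary by a nested value. The people dictionary and its age and city fields are hypothetical.

```python
# Hypothetical nested dictionary; names and fields are assumptions.
people = {
    "alice": {"age": 34, "city": "Berlin"},
    "bob": {"age": 19, "city": "Paris"},
    "carol": {"age": 52, "city": "Berlin"},
}

# filter() keeps the (key, value) pairs for which the lambda is truthy.
# item[0] is the key; item[1] is the nested dictionary.
over_30 = dict(filter(lambda item: item[1]["age"] > 30, people.items()))

# An equivalent dictionary comprehension, for comparison.
over_30_comprehension = {
    name: info for name, info in people.items() if info["age"] > 30
}

print(over_30)
```

Wrapping the result in dict() matters because filter returns a lazy iterator of pairs rather than a dictionary.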

Sorting Nested Dictionaries

Continuing with dictionaries, it is sometimes necessary to sort them without converting to Pandas. You can handle sorting without loops as well. The built-in sorted function is used with a similar lambda. To sort by a nested value, you can use x[1]['age'] as the key. To sort by keys, x[0] works.
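A minimal sketch of both sort orders follows, again with a hypothetical people dictionary. It relies on dictionaries preserving insertion order, which Python guarantees from version 3.7 onward.

```python
# Hypothetical nested dictionary; names and fields are assumptions.
people = {
    "alice": {"age": 34},
    "bob": {"age": 19},
    "carol": {"age": 52},
}

# Sort by the nested "age" value: x[1] is the nested dictionary.
by_age = dict(sorted(people.items(), key=lambda x: x[1]["age"]))

# Sort by key in descending order: x[0] is the key.
by_name_desc = dict(sorted(people.items(), key=lambda x: x[0], reverse=True))

print(list(by_age))
print(list(by_name_desc))
```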

Generate Rolling Features

Working with time-series data is one of my favorites. In addition to data analysis, I also enjoy feature engineering, which gives me additional motivation and satisfaction. For each group, we can generate moving features, for example the average of the earlier N values of the same group. These features can be produced from indices and values using a loop, but Pandas can handle them in a single line.

The shift(n) method gives us the previous values of the group, and n determines how many rows back to look. It is also possible to compute basic aggregates of the preceding elements, such as the sum and the average. First, shift() moves the values so that the present value is excluded; then expanding() widens the window to cover all the previous values. Finally, mean() or sum() produces the desired value.

In addition to these expanding features, you can extract features based on only the previous n values. To do so, replace expanding() with rolling() and pass the window size to set how many preceding values enter the computation. Afterward, mean() or sum() calculates the desired outcome.
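The sketch below shows one way to combine these calls per group; it is not the author's original script. The store and sales columns are hypothetical, and the rows are assumed to be already ordered in time within each group.

```python
import pandas as pd

# Hypothetical data, assumed to be sorted by time within each store.
df = pd.DataFrame(
    {
        "store": ["A", "A", "A", "B", "B", "B"],
        "sales": [10, 20, 30, 5, 15, 25],
    }
)

# Previous value within each group (NaN for the first row of a group).
df["prev_sales"] = df.groupby("store")["sales"].shift(1)

# Mean of all earlier values in the group: shift(1) drops the current row,
# expanding() then covers everything that came before it.
df["mean_of_previous"] = df.groupby("store")["sales"].transform(
    lambda s: s.shift(1).expanding().mean()
)

print(df)
```

For store A the earlier-values mean is NaN, 10, 15: the first row has no history, the second sees only 10, and the third averages 10 and 20.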
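A comparable sketch with rolling() is shown below, under the same assumptions (hypothetical store and sales columns, rows already ordered in time). The window size of 2 and the use of min_periods are illustrative choices, not taken from the article.

```python
import pandas as pd

# Hypothetical data, assumed to be sorted by time within each store.
df = pd.DataFrame(
    {
        "store": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "sales": [10, 20, 30, 40, 5, 15, 25, 35],
    }
)

grouped_sales = df.groupby("store")["sales"]

# Mean of the previous 2 values; by default a full window is required,
# so the first two rows of each group are NaN.
df["mean_prev_2"] = grouped_sales.transform(
    lambda s: s.shift(1).rolling(window=2).mean()
)

# Sum of up to 2 previous values; min_periods=1 accepts a partial window.
df["sum_prev_2"] = grouped_sales.transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).sum()
)

print(df)
```

The min_periods argument controls how many non-missing values a window needs before a result is produced, which decides whether the early rows of each group stay empty.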

Pandas Vectorization

We have arrived at the final stop, Pandas vectorization. It makes up a large portion of the optimization story of my master’s study. In my honest opinion, explicit loops are the most damaging approach in data preprocessing. In the early days of my Python voyage, I dug through articles looking for significant code performance improvements. Those experiments indicate that vectorization with pandas or numpy can improve performance, except in some specific cases.

In the following example, the commented-out code is the traditional practice that handles the data item by item. The same logic can be expressed without the explicit loop. You can write a function that processes the values of a given column name, and the apply method then runs this function over the rows. To process row by row, you must pass axis=1 to apply because the default axis value is 0, which applies the function to each column.
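Here is a minimal sketch of that pattern with hypothetical price and quantity columns and a made-up line_total function. One caveat worth stating: apply with axis=1 still calls the function once per row, so the last line also shows plain column arithmetic, which is what is usually meant by vectorized pandas code.

```python
import pandas as pd

# Hypothetical data; column names are assumptions for illustration.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})


def line_total(row, price_col, quantity_col):
    """Compute one row's total from the given column names."""
    return row[price_col] * row[quantity_col]


# Traditional item-by-item loop, kept only for comparison:
# totals = []
# for _, row in df.iterrows():
#     totals.append(line_total(row, "price", "quantity"))
# df["total"] = totals

# apply with axis=1 passes each row to the function; args supplies the
# extra positional arguments after the row itself.
df["total_apply"] = df.apply(line_total, axis=1, args=("price", "quantity"))

# Column-wise arithmetic operates on whole columns at once.
df["total_vectorized"] = df["price"] * df["quantity"]

print(df)
```

When the logic can be written as arithmetic on whole columns, that form is typically the faster of the two; apply is the fallback for row logic that does not decompose that way.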

Final Words

I wanted to share five useful Python scripts with examples. You can also find the examples in the repository. I hope they will be a good reference for your own development. Note that I do not claim the examples above will always boost your code performance; there may be exceptional circumstances, for instance concerning memory.

If you have any feedback or recommendations for further advancement, please leave your ideas in the comments! :)

Want to Connect?

To say "hi" or ask me anything:

LinkedIn: https://www.linkedin.com/in/ktoprakucar/
GitHub: https://github.com/ktoprakucar

Written by Kemal Toprak Uçar
