5 Python Scripts I Found Useful for Data Processing Operations
Some practical scripts I discovered in my archives
I don’t know how it goes for you, but I still need to look up my old projects for some scripts, even though I used them at least 3–4 times a month for a while. Apparently, my brain refuses to memorize them and prefers that I keep digging through my old tasks or Stack Overflow.
I confess I first wanted to create a Jupyter notebook as a reference for my future work. Then I thought an article might be a helpful resource for someone else too, so in this article you will find scripts for data processing: filtering, aggregating, sorting, and optimizing code performance.
Introduction
Conditions, loops, and exception handling are fundamental concepts, and they are language-independent. Different languages may give them different names, but their contribution to the code logic is identical.
Wanting to say more about loops is what motivated me to write this article. In small-scale tasks, we may not need to care about code performance, but if users consume the code as a library or a web service, response time and memory usage are crucial. In that setting, using loops to generate new objects from existing data structures such as lists, arrays, or dictionaries may damage our code performance.
The scripts below may help you remove your loops, particularly in data processing operations. As a real example, during my master’s research, I studied text classification, and the experiments required plenty of text preprocessing. In the first draft of the code, the feature extraction stage took around two days (yes, 40–42 hours). After I started using the scripts below, the duration dropped to nearly two hours. In addition to handling loops, they may help you increase readability by keeping the code simple without over-engineering.
Before starting, I want to note that for each example I will also share a randomly generated dataset, the script for the process, and the result, to make the context easier to follow.
Groupby With Naming
Here we go! I want to start with the group_by method. If you work with data, there is no doubt you spend a lot of time on analysis. In SQL, Python, and R, group_by takes up a substantial portion of my code, since I use it to compute aggregations for inferences. The issue I encounter is using a single column for several aggregations. The script below lets you give different names to the different aggregations computed from a single column:
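As a minimal sketch of the idea, the example below uses the named aggregation syntax of the pandas groupby method. The dataset and the column names (city, price) are made up for illustration and are not the ones from my original script.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "city": ["Ankara", "Ankara", "Izmir", "Izmir", "Izmir"],
        "price": [10.0, 20.0, 5.0, 15.0, 25.0],
    }
)

# Named aggregation: each keyword is the name of an output column,
# each value is a (source column, aggregation function) pair.
summary = (
    df.groupby("city")
    .agg(
        min_price=("price", "min"),
        max_price=("price", "max"),
        mean_price=("price", "mean"),
    )
    .reset_index()
)

print(summary)
```

The single price column produces three differently named result columns, so no renaming step is needed afterward.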
Filtering Nested Dictionaries
Since I have worked with PostgreSQL and JSON columns, dictionaries have become one of my best friends. Using loops to handle filtering is a popular approach in Python. After I realized I could filter nested dictionaries directly, the number of lines decreased to one. In the code below, dictionary_name.items() returns key/value pairs that are passed to the built-in filter function. In the lambda, [0] represents the key and [1] represents the value of each item. So, if we use item[1][desired_column_name] in the condition, we can easily filter the dictionary by a nested value.
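Here is a small sketch of that one-liner. The people dictionary and its age field are hypothetical sample data, chosen only to make the example runnable.

```python
# Hypothetical nested dictionary: each key maps to an inner dictionary.
people = {
    "alice": {"age": 34, "city": "Izmir"},
    "bob": {"age": 19, "city": "Ankara"},
    "carol": {"age": 52, "city": "Izmir"},
}

# item[0] is the key, item[1] is the inner dictionary (the value).
# filter() yields the matching (key, value) pairs; dict() rebuilds a dictionary.
over_30 = dict(filter(lambda item: item[1]["age"] > 30, people.items()))

# The same filter written as a dictionary comprehension, for comparison.
over_30_comprehension = {k: v for k, v in people.items() if v["age"] > 30}

print(over_30)
```

Both forms keep only the entries whose nested age is above 30; which one reads better is mostly a matter of taste.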
Sorting Nested Dictionaries
Continuing with dictionaries, it is sometimes necessary to sort them without Pandas transformations. You can manage sorting operations without loops as well. The built-in sorted function is used with a similar lambda. If you want to sort by a nested value, you can use x[1]['age']. To sort by keys, x[0] works.
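A short sketch of both cases follows. The people dictionary is hypothetical sample data, and the example relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
# Hypothetical nested dictionary used only for illustration.
people = {
    "carol": {"age": 52},
    "alice": {"age": 34},
    "bob": {"age": 19},
}

# x[0] is the key and x[1] is the inner dictionary.
# Sort by the nested 'age' value, ascending.
by_age = dict(sorted(people.items(), key=lambda x: x[1]["age"]))

# Sort by key, descending.
by_key_desc = dict(sorted(people.items(), key=lambda x: x[0], reverse=True))

print(list(by_age))
print(list(by_key_desc))
```

sorted returns a list of (key, value) tuples, so wrapping it in dict() is only needed when you want a dictionary back.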
Generate Rolling Features
Working with time-series data is one of my favorites. In addition to data analysis, I also enjoy feature engineering, which gives me additional motivation and satisfaction. For each group, we can generate moving features, for example the average of the earlier N values of the same group. These features can be produced from indices and values using a loop, but Pandas can handle them in a single line.
The shift(n) method allows us to obtain the previous values of the group, and n determines how many rows back the value is taken from. Furthermore, it’s possible to compute basic aggregates of the preceding elements, such as the sum and the average. First, shift() moves the values so that the present value is excluded; then expanding() makes all the previous values available. Finally, the desired value is generated by the mean() or sum() method.
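The sketch below shows one way to combine these calls per group. The store and sales columns are hypothetical, and the rows are assumed to be already ordered in time within each group.

```python
import pandas as pd

# Hypothetical data, assumed to be sorted by time within each store.
df = pd.DataFrame(
    {
        "store": ["a", "a", "a", "b", "b", "b"],
        "sales": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)

# Previous value within the same group (NaN for the first row of a group).
df["prev_sales"] = df.groupby("store")["sales"].shift(1)

# Mean of all earlier values in the group: shift(1) drops the current row,
# expanding() then covers everything that came before it.
df["prev_mean"] = df.groupby("store")["sales"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# Same pattern with sum() instead of mean().
df["prev_sum"] = df.groupby("store")["sales"].transform(
    lambda s: s.shift(1).expanding().sum()
)

print(df)
```

The first row of each group has no earlier values, so its features are NaN; how to fill them depends on the task.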
In addition to features over all previous values, you can extract features based on only the previous n values. To do so, we replace expanding() with rolling() to calculate the rolling value. The window parameter of rolling() specifies how many preceding values are used in the computation. Afterward, mean() and sum() calculate the desired outcome.
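A minimal sketch of the rolling variant is shown below. The column names and the window size of 2 are arbitrary choices for illustration, and min_periods=1 is my assumption so that a partially filled window still returns a value.

```python
import pandas as pd

# Hypothetical data, assumed to be sorted by time within each store.
df = pd.DataFrame(
    {
        "store": ["a"] * 4 + ["b"] * 4,
        "sales": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    }
)

# Mean of the previous 2 values in the group: shift(1) excludes the
# current row, rolling(window=2) limits how far back we look.
df["rolling_mean_2"] = df.groupby("store")["sales"].transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).mean()
)

# Same pattern with sum() instead of mean().
df["rolling_sum_2"] = df.groupby("store")["sales"].transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).sum()
)

print(df)
```

Without min_periods=1, rolling() would return NaN until the window is completely filled, which may or may not be what you want.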
Pandas Vectorization
We have arrived at the final stop, Pandas vectorization. It makes up a large portion of the optimization story of my master’s study. In my honest opinion, looping over rows is the most damaging approach in data preprocessing. In the early days of my Python journey, I dug through articles looking for significant code performance improvements. The experiments indicate that vectorization with pandas or numpy can improve performance, apart from some specific cases.
In the following example, the commented method is the traditional practice that handles the process item by item. The same logic can be written without an explicit loop: you can write a function that processes the values of a row using the given column names, and the apply method then runs this function over the rows. To process row by row, you must pass axis=1 to the apply method because the default axis value is 0.
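To make the comparison concrete, here is a hedged sketch with hypothetical price and quantity columns and a made-up discount rule. It shows the explicit loop, the apply version described above, and a column-wise numpy expression. Note that apply with axis=1 still calls the function once per row, so the column-wise form is the one that is vectorized in the strict sense.

```python
import numpy as np
import pandas as pd

# Hypothetical data and business rule, for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 5, 10]})


def total_with_discount(row):
    """Return price * quantity, with 10% off when quantity is at least 5."""
    total = row["price"] * row["quantity"]
    return total * 0.9 if row["quantity"] >= 5 else total


# 1) Traditional practice: process the rows one by one in a loop.
totals = []
for _, row in df.iterrows():
    totals.append(total_with_discount(row))
df["total_loop"] = totals

# 2) apply with axis=1 passes each row to the function (default axis is 0).
df["total_apply"] = df.apply(total_with_discount, axis=1)

# 3) Column-wise version: the whole column is computed in one expression.
total = df["price"] * df["quantity"]
df["total_vectorized"] = np.where(df["quantity"] >= 5, total * 0.9, total)

print(df)
```

All three columns hold the same values; the difference is in how the work is done, which tends to matter more as the number of rows grows.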
Final Words
I wanted to share five useful Python scripts with examples. You can also find the examples in the repository. I hope they will be a good reference for your own work. I want to mention that I do not claim the examples above will always boost your code performance. There might be some exceptional circumstances concerning memory.
If you have any feedback or recommendations for further advancement, please leave your ideas in the comments! :)
Want to Connect?
To say "hi" or ask me anything:
LinkedIn: https://www.linkedin.com/in/ktoprakucar/
GitHub: https://github.com/ktoprakucar