How to Use Pandas to Consume Data and Perform Data Analysis

An introduction to data analysis with Pandas

Published in

Better Programming

6 min readApr 14, 2022

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal — source (https://pandas.pydata.org/)

Installation

To get started with pandas, let’s install it. Installing pandas as part of Anaconda is recommended, but we can still install it using pip.

pip install pandas

The pandas module is usually imported as an alias pd, and all operations are done using pd, as shown below.

import pandas as pd

Pandas support 2 types of data structures namely:

Series
DataFrame

A DataFrame is a 2-dimensional data structure (an array within an array) that can store different types of data in rows and columns. You can think of it as similar to a spreadsheet or SQL table.

screenshot taken from https://pandas.pydata.org/docs/

DataFrame

Let’s start by creating a simple dataframe with the following data.

Here we create a Python dictionary whose keys are used as column headers and the values as columns

The result will be:

      First_name Last_name                  Email
0       Matt  Mckinney     mattneey@gmail.com
1      Harry   ndarray  Harryndarry@gmail.com
2        ete       Wes      petewes@gmail.com

Our table has 3 columns namely First_name, Last_name, and Email.

To get the first 2 rows

df.head(2)

The result is:

      First_name Last_name                  Email
0       Matt  Mckinney     mattneey@gmail.com
1      Harry   ndarray  Harryndarry@gmail.com

Another way to create a dataframe is using a list as shown below

Read and Write Tabular Data

Let’s read data from the following CSV file. A CSV(Comma Separated Values) contains data separated by commas.

pandas provide the read_csv() function for reading csv files into pandas DataFrame. Other file formats supported b pandas include:

csv,
excel,
SQL,
JSON,
parquet

import pandas as pd
df = pd.read_csv('students.csv')print(df)

The result will be:

        student_id first_name last_name
0         1000       Matt    Leakey
1         1001      Nelly     Green
2         1002    Rachael   Johnson
3         1003      David      Wood
4         1004        Bob       Lee
5         1005        Tom     Smith
6         1006     Taylor    Miller
7         1007    Barbara     Stone
8         1008    Marlene      Cook
9         1009     Marlon    Marlon
10        1010     selene    martin

You can also load data to a CSV using the .csv function. Let's write our data to a CSV file.

The output will be:

,First_name,Last_name,Email
0,Matt,Mckinney,mattneey@gmail.com
1,Harry,ndarray,Harryndarry@gmail.com
2,ete,Wes,petewes@gmail.com

Series

A series is a one-dimensional array that can hold any data type. In simpler terms, a series is a column. Let’s look at how to work with columns.

Let's assume we are working on the following dataframe

          Name Math Calculus Statistics
0       Matt Harry    70       50         56
1  Mckinney ndarray   67       56         78
2        Cathy Matt   63       65         60
3        Leakey Wes   54       55         56

There are two ways of selecting a column, you can use the dictionary[key] format or the dictionary.key format.

If we wish to select the math column, we should have

print(df['Math'])
    #orprint(df.Math)

The result from both print statements should be the same as shown below.

0     70
1     67
2     63
3     54
Name: Math, dtype: object
0     70
1     67
2     63
3     54
Name: Math, dtype: object

The dictionary[key] format is used if the column contains spaces or special characters; for example, if we had a column named physical education, we would select it as shown

df['physical education']

Working with Columns

We’ve learned how to select data from a column, now let’s start working with that data; let's store the data from the math column in a variable called math_data

math = df.Math
print(math)

We can also select more than one column as shown below

df_data =df[['Math','Calculus']]print(df_data)

The result will be:

   Math Calculus
0   70       50
1   67       56
2   63       65
3   54       55

Since the data contains more than one column, it’s no longer a series but a dataframe.

print(type(df_data))

The output is

<class 'pandas.core.frame.DataFrame'>

Working with Rows

To select a row, we use the dataframe. iloc property and pass in the index of the row. For example, to get the row data of the last student in our dataframe, we would do this:

print(df.iloc[-1])

The result will be:

Name          Leakey Wes
Math                  54
Calculus              55
Statistics            56

The iloc property will raise an IndexError if you pass an index that doesn’t exist. For example, if you pass 6 as the index

print(df.iloc[6])

You will get the following error

raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

You can also select more than one row, just like we did with columns. But first, let’s add more data to our dataframe. It should now look like this:

Name Math Calculus Statistics
0       Matt Harry    70       50         56
1  Mckinney ndarray   67       56         78
2        Cathy Matt   63       65         60
3     Thomas Thomas   64       75         46
4      Richard Grey   74       65         56
5       Ching Chang   44       55         46
6          Jim Cary   64       65         76

For example, let’s select the data in the second, third and fourth row

print(df.iloc[2:5])

The result will be:

             Name Math Calculus Statistics
2     Cathy Matt   63       65         60
3  Thomas Thomas   64       75         46
4   Richard Grey   74       65         56

Adding and Modifying Rows and Columns

So far, we have been able to work with existing data, now let learn how to

add new data into rows and columns
rename columns

Add new columns

Suppose we want to add a new column for a new subject, let’s do that.

df['Quantitative Analysis'] = [60,56,78,45,64,78,69]print(df)

Our new data will be:

Name Math Calculus Statistics  Quantitative Analysis
0       Matt Harry    70       50         56                     60
1  Mckinney ndarray   67       56         78                     56
2        Cathy Matt   63       65         60                     78
3     Thomas Thomas   64       75         46                     45
4      Richard Grey   74       65         56                     64
5       Ching Chang   44       55         46                     78
6          Jim Cary   64       65         76                     69

Renaming columns

Consider the following dataframe:

       Product Name   Cost
0  Tony valencia 7047   36.5
1  Tony valencia 6171   28.5
2  Tony valencia 7050   36.5
3  Tony valencia 7056   28.5
4  Tony valencia 6170   28.5
5       Rayban Rb3025  200.0

Let's rename the cost column to cost of goods

df.columns = ['Product Name', 'Cost of goods']print(df)

The table now looks like this:

           Product Name  Cost of goods  
0  Tony valencia 7047           36.5    
1  Tony valencia 6171           28.5    
2  Tony valencia 7050           36.5    
3  Tony valencia 7056           28.5    
4  Tony valencia 6170           28.5    
5       Rayban Rb3025          200.0

Add a new Column using existing data

For the data above, add a column profit equal to Cost*0.4

df['Profit'] = df.Cost *0.4print(df)

Our table now looks like this:

         Product Name   Cost  Profit
0  Tony valencia 7047   36.5    14.6
1  Tony valencia 6171   28.5    11.4
2  Tony valencia 7050   36.5    14.6
3  Tony valencia 7056   28.5    11.4
4  Tony valencia 6170   28.5    11.4
5       Rayban Rb3025  200.0    80.0

Applying Lamba to Rows and Columns

Lambdas are used to perform complex operations with pandas in rows and columns. The syntax for a lambda function is:

lambda args : expression

Consider the dataframe below,

Suppose we want to create a column average showing the overall performance of each student. Create a lambda function average that takes a value and divides it by 3( total number of subjects)

average = lambda a : a/3

Now apply the lambda function to the Total_marks column

df["average"] = df.Total_marks.apply(average)

The output will be:

This tutorial has covered everything you need to know to get started with the pandas module. For more advanced examples, refer to the cookbook guide.

Better Programming

How to Use Pandas to Consume Data and Perform Data Analysis

An introduction to data analysis with Pandas

Installation

DataFrame

Read and Write Tabular Data

Series

Working with Columns

Working with Rows

Adding and Modifying Rows and Columns

Add new columns

Renaming columns

Add a new Column using existing data

Applying Lamba to Rows and Columns

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Published in Better Programming

Written by Esther Vaati

No responses yet