3.2. Data manipulation with Pandas#

3.2.1. What is Pandas?#

No, it is not the plural of panda the bear! Pandas is a Python package designed for data manipulation and analysis. Its name is derived from “panel data” (not the bear!) and also from “Python data analysis” 1. It is especially popular for its data structures, which help process tabular data in a fast and efficient way.

3.2.2. Importing Pandas#

To work with Pandas we first need to import the Pandas package into our Python code. Below is the community-agreed convention:

import pandas as pd

3.2.3. The dataset#

For this session we will be using data from the World Bank Data Catalog 2, specifically from the World Development Indicators (WDI) collection, which was transformed by Alexia Cardona for the purpose of teaching data manipulation and visualisation programming. The data is compiled from official international sources.

Download the dataset from here and save it into a data folder in your project.

This dataset contains official data for each country worldwide, with indicators on life expectancy, population, and CO2 emissions. Below is the metadata of the dataset:

Table 3.2 Dataset Metadata#

| Column name | Description |
| --- | --- |
| country | Name of country. |
| year | Year the data is representative of. |
| life_expectancy_f | Life expectancy at birth, female (years). Life expectancy at birth indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. |
| life_expectancy_m | Life expectancy at birth, male (years). |
| life_expectancy_t | Life expectancy at birth, total (years). |
| land_area | Land area (sq. km). |
| population_m | Male population count. |
| population_f | Female population count. |
| population_t | Total population count. |

3.2.4. Reading tabular data from a file#

Now that we have everything in place, let us start by reading our data into our Python code. Pandas has convenient functions that load files containing tabular data.

df = pd.read_csv("data/world-bank-1_data.csv")

The code above reads the file world-bank-1_data.csv into a DataFrame object and creates a variable df that points to it. There are other functions in Pandas that read other file formats. You can see a list of these here.
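For example, Pandas also provides readers for Excel and JSON files. The sketch below only illustrates the pattern; the file names are hypothetical, and reading Excel files requires an additional engine such as openpyxl to be installed:

# hypothetical file names, shown only to illustrate that every pd.read_*
# function returns a DataFrame in the same way as pd.read_csv
df_xlsx = pd.read_excel("data/world-bank-1_data.xlsx")
df_json = pd.read_json("data/world-bank-1_data.json")

# read_csv also takes optional parameters, e.g. sep for a different delimiter
df_tsv = pd.read_csv("data/world-bank-1_data.tsv", sep="\t")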

3.2.5. The DataFrame#

The popularity of Pandas mainly lies with its core data structure: the DataFrame. The DataFrame is suitable for data in tabular (2-dimensional) form; data with rows and columns. Since most data comes in this form, the DataFrame, and Pandas itself, are widely used.

Now that we have loaded the DataFrame in Python, we need to check that the data has loaded properly. Let us check this in the console by printing the first 5 rows of the DataFrame:

df.head(5)
       country  year  population_m  population_f  population_t  \
0  Afghanistan  2000    10689508.0    10090449.0    20779957.0   
1  Afghanistan  2001    11117754.0    10489238.0    21606992.0   
2  Afghanistan  2002    11642106.0    10958668.0    22600774.0   
3  Afghanistan  2003    12214634.0    11466237.0    23680871.0   
4  Afghanistan  2004    12763726.0    11962963.0    24726689.0   

   population_density  land_area  life_expectancy_f  life_expectancy_m  \
0           31.859861   652230.0             57.120             54.663   
1           33.127872   652230.0             57.596             55.119   
2           34.651540   652230.0             58.080             55.583   
3           36.307546   652230.0             58.578             56.056   
4           37.910996   652230.0             59.093             56.542   

   life_expectancy_t  co2_emissions_pc  
0             55.841          0.036574  
1             56.308          0.033785  
2             56.784          0.045574  
3             57.271          0.051518  
4             57.772          0.041655  

You can also get summary statistics for each column of the df DataFrame:

df.describe()
              year  population_m  population_f  population_t  \
count  3038.000000  2.708000e+03  2.708000e+03  3.030000e+03   
mean   2010.500000  1.810384e+07  1.781787e+07  3.211690e+07   
std       7.229606  6.969846e+07  6.568821e+07  1.284477e+08   
min    2000.000000  3.571000e+04  4.029700e+04  9.392000e+03   
25%    2003.000000  9.764312e+05  9.934100e+05  6.599435e+05   
50%    2013.500000  3.641720e+06  3.747408e+06  5.644552e+06   
75%    2017.000000  1.234708e+07  1.228930e+07  1.980094e+07   
max    2020.000000  7.237710e+08  6.873290e+08  1.411100e+09   

       population_density     land_area  life_expectancy_f  life_expectancy_m  \
count         2968.000000  2.998000e+03        2800.000000        2800.000000   
mean           410.489783  6.063802e+05          72.870253          67.979873   
std           1894.724659  1.755858e+06           9.485278           8.831021   
min              0.136492  2.027000e+00          40.005000          38.861000   
25%             32.530300  1.014750e+04          67.318000          62.900000   
50%             83.018459  9.428000e+04          75.450000          69.591000   
75%            207.487240  4.512200e+05          79.900000          74.555250   
max          21388.600000  1.638139e+07          88.100000          82.900000   

       life_expectancy_t  co2_emissions_pc  
count        2800.000000       2481.000000  
mean           70.379877          4.294507  
std             9.085448          5.431735  
min            39.441000          0.020367  
25%            65.076000          0.651679  
50%            72.409000          2.463436  
75%            77.124707          6.080600  
max            85.387805         50.954034  

You can also get a summary of the dtype of each column, together with the number of non-null values in each column:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3038 entries, 0 to 3037
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             3038 non-null   object 
 1   year                3038 non-null   int64  
 2   population_m        2708 non-null   float64
 3   population_f        2708 non-null   float64
 4   population_t        3030 non-null   float64
 5   population_density  2968 non-null   float64
 6   land_area           2998 non-null   float64
 7   life_expectancy_f   2800 non-null   float64
 8   life_expectancy_m   2800 non-null   float64
 9   life_expectancy_t   2800 non-null   float64
 10  co2_emissions_pc    2481 non-null   float64
dtypes: float64(9), int64(1), object(1)
memory usage: 261.2+ KB
None

Alternatively, you can get the dtype of each column by using the dtypes attribute of the DataFrame: df.dtypes. Note that columns which hold string data values are imported into Pandas as the object dtype.
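For example, df.dtypes returns a Series that maps each column name to its dtype:

df.dtypes
country                object
year                    int64
population_m          float64
population_f          float64
population_t          float64
population_density    float64
land_area             float64
life_expectancy_f     float64
life_expectancy_m     float64
life_expectancy_t     float64
co2_emissions_pc      float64
dtype: object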

To get the number of rows and columns that the DataFrame is made up of use the shape attribute:

df.shape
(3038, 11)

Note that shape is an attribute, not a method, so no parentheses are used after it.
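Since shape is a tuple of the form (rows, columns), you can also unpack it directly into two variables; a minimal sketch:

# unpack the shape tuple into the number of rows and columns
n_rows, n_cols = df.shape
print(f"The DataFrame has {n_rows} rows and {n_cols} columns")
The DataFrame has 3038 rows and 11 columns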

3.2.6. The Series#

Each column in a DataFrame is a Series. A Series is a data structure in Pandas and is essentially a 1-dimensional array. Fig. 3.2 shows the main components of the Series.

_images/series.png

Fig. 3.2 Components of the Series data structure.#
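To see these components in isolation, you can also build a small Series yourself; a minimal sketch (the values below are arbitrary examples, not taken from the dataset):

# a Series holds its data items, an index and (optionally) a name
s = pd.Series([57.1, 57.6, 58.1], name="example_series")
print(s.values)   # the data items as a NumPy array
print(s.index)    # the index; a RangeIndex by default
print(s.name)     # the name of the Series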

Exercise 3.4 (Exploring DataFrames)

Level:

In this exercise we will be using the dataset from the World Bank Data Catalog that is used in this course. Follow the instructions above to load the dataset into Python and make use of DataFrame methods to answer the following questions:

  1. What type is df?

  2. How many columns and how many rows does df have?

  3. What are the data types of each column?

3.2.7. Slicing DataFrames#

Fig. 3.3 shows the core components of the DataFrame data structure. The first column we see when we look at the contents of df is the index. The index is not a column of data items in the DataFrame; rather, it contains the index position of each record in the DataFrame. Like all the other data structures in Python we have looked at so far, it starts at 0.

_images/dataframe.png

Fig. 3.3 Components of the DataFrame data structure.#
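You can inspect the index of df directly. For the dataset loaded above it is a RangeIndex running from 0 to 3037:

df.index
RangeIndex(start=0, stop=3038, step=1)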

We can extract columns, rows or data items from DataFrames in different ways. The sections below go over the different ways we can slice DataFrames.

3.2.7.1. Extracting column data (label-based indexing)#

To extract data from a column in the DataFrame we use the following notation:

df["column_name"]

where df is the DataFrame object. If we want to extract multiple columns, we can provide a Python list of column names inside the square brackets as follows:

df[["column_name1", "column_name2"]]

#extract the column "life_expectancy_t" and save it in variable le_t
le_t = df["life_expectancy_t"]

#check what type is le_t
print(f"Type of le_t is: {type(le_t)}")

# get the length of the Series
print(f"Size of le_t is: {le_t.size}")

#print the first 5 items
le_t.head()
Type of le_t is: <class 'pandas.core.series.Series'>
Size of le_t is: 3038
0    55.841
1    56.308
2    56.784
3    57.271
4    57.772
Name: life_expectancy_t, dtype: float64

Note that in the example above, since we are extracting only one column from df, the result of the slicing is a Series object. The code below, on the other hand, extracts 3 columns from df, and therefore the result is another DataFrame object.

#extract the columns "life_expectancy_t", "life_expectancy_m", "life_expectancy_f" and save it in variable le
le = df[["life_expectancy_t", "life_expectancy_m", "life_expectancy_f"]]

# get type of object returned from slicing
print(f"Type of le is: {type(le)}")

#get size of DataFrame
print(f"Size of le is: {le.shape}")

# print the first 5 items
le.head()
Type of le is: <class 'pandas.core.frame.DataFrame'>
Size of le is: (3038, 3)
   life_expectancy_t  life_expectancy_m  life_expectancy_f
0             55.841             54.663             57.120
1             56.308             55.119             57.596
2             56.784             55.583             58.080
3             57.271             56.056             58.578
4             57.772             56.542             59.093

3.2.7.2. Extracting row data (boolean indexing)#

We can extract rows from our DataFrame that meet a specified condition by using conditional expressions inside the slice operator as follows:
df[conditional expression]

Let us try an example on our df DataFrame. We would like to get the records where life expectancy is more than 80:

le_top = df[df["life_expectancy_t"] > 80]
le_top.head()
country year population_m population_f population_t population_density land_area life_expectancy_f life_expectancy_m life_expectancy_t co2_emissions_pc
143 Australia 2003 9926226.0 9969174.0 19895400.0 2.589771 7682300.0 82.8 77.8 80.239024 17.721684
144 Australia 2004 10042276.0 10085124.0 20127400.0 2.619971 7682300.0 83.0 78.1 80.490244 18.174727
145 Australia 2005 10177992.0 10216808.0 20394800.0 2.654778 7682300.0 83.3 78.5 80.841463 18.146292
146 Australia 2013 11541758.0 11586371.0 23128129.0 3.010574 7682300.0 84.3 80.1 82.148780 16.442316
147 Australia 2014 11705167.0 11770519.0 23475686.0 3.055815 7682300.0 84.4 80.3 82.300000 15.830422

Similarly to NumPy, you can combine different comparison expressions using the & (and) and | (or) operators. You need to wrap each conditional expression in parentheses. See the example below:

# get all records that have life expectancy > 80, where year is either 2020 or 2019
le_top[(le_top["year"] == 2020) | (le_top["year"] == 2019)]
country year population_m population_f population_t population_density land_area life_expectancy_f life_expectancy_m life_expectancy_t co2_emissions_pc
152 Australia 2019 12632259.0 12733486.0 25365745.0 3.297670 7692020.000 85.0 80.9 82.900000 15.238267
153 Australia 2020 12794929.0 12898338.0 25693267.0 3.340250 7692020.000 85.3 81.2 83.200000 NaN
166 Austria 2019 4372358.0 4507562.0 8879920.0 107.609307 82520.000 84.2 79.7 81.895122 7.293984
167 Austria 2020 4395554.0 4521310.0 8916864.0 108.057004 82520.000 83.6 78.9 81.192683 NaN
264 Belgium 2019 5686560.0 5802420.0 11488980.0 379.424703 30280.000 84.3 79.8 81.995122 8.095584
... ... ... ... ... ... ... ... ... ... ... ...
2631 Sweden 2020 5186267.0 5167175.0 10353442.0 25.420723 407283.532 84.2 80.7 82.407317 NaN
2644 Switzerland 2019 4252686.0 4322594.0 8575280.0 217.007630 39516.030 85.8 82.1 83.904878 4.359041
2645 Switzerland 2020 4284690.0 4351871.0 8636561.0 218.558418 39516.030 85.2 81.1 83.100000 NaN
2882 United Kingdom 2019 33008768.0 33827559.0 66836327.0 276.263080 241930.000 83.1 79.4 81.204878 5.220514
2883 United Kingdom 2020 33144663.0 33936337.0 67081000.0 277.274418 241930.000 82.9 79.0 80.902439 NaN

78 rows × 11 columns
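The & operator is used in the same way. For example, the sketch below extracts the records where life expectancy is above 80 and the year is 2020:

# get all records that have life expectancy > 80 and where year is 2020
df[(df["life_expectancy_t"] > 80) & (df["year"] == 2020)]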

You can write the same code using the isin() function which, given a list of values, returns True for each row where one of the values is present. The example below returns the same result as the code above, but uses the isin() function to check whether each row in le_top has either 2020 or 2019 in the year column.

le_top[le_top["year"].isin([2020, 2019])]
country year population_m population_f population_t population_density land_area life_expectancy_f life_expectancy_m life_expectancy_t co2_emissions_pc
152 Australia 2019 12632259.0 12733486.0 25365745.0 3.297670 7692020.000 85.0 80.9 82.900000 15.238267
153 Australia 2020 12794929.0 12898338.0 25693267.0 3.340250 7692020.000 85.3 81.2 83.200000 NaN
166 Austria 2019 4372358.0 4507562.0 8879920.0 107.609307 82520.000 84.2 79.7 81.895122 7.293984
167 Austria 2020 4395554.0 4521310.0 8916864.0 108.057004 82520.000 83.6 78.9 81.192683 NaN
264 Belgium 2019 5686560.0 5802420.0 11488980.0 379.424703 30280.000 84.3 79.8 81.995122 8.095584
... ... ... ... ... ... ... ... ... ... ... ...
2631 Sweden 2020 5186267.0 5167175.0 10353442.0 25.420723 407283.532 84.2 80.7 82.407317 NaN
2644 Switzerland 2019 4252686.0 4322594.0 8575280.0 217.007630 39516.030 85.8 82.1 83.904878 4.359041
2645 Switzerland 2020 4284690.0 4351871.0 8636561.0 218.558418 39516.030 85.2 81.1 83.100000 NaN
2882 United Kingdom 2019 33008768.0 33827559.0 66836327.0 276.263080 241930.000 83.1 79.4 81.204878 5.220514
2883 United Kingdom 2020 33144663.0 33936337.0 67081000.0 277.274418 241930.000 82.9 79.0 80.902439 NaN

78 rows × 11 columns

3.2.7.3. Extracting rows and columns#

To extract rows and columns from a DataFrame we need to use the loc and iloc operators.

3.2.7.3.1. loc#

loc is mainly used with label-based indexing and boolean indexing. The general usage format is the following:
df.loc[row condition expression, column_name]
In the example below we extract the list of countries from our df DataFrame where life expectancy is greater than 80.

countries = df.loc[df["life_expectancy_t"] > 80, "country"]

print(f"Type of countries is: {type(countries)}")
print(f"Length of Series is: {countries.size}")

# use unique method of Series to remove duplicates
print(countries.unique())
Type of countries is: <class 'pandas.core.series.Series'>
Length of Series is: 333
['Australia' 'Austria' 'Belgium' 'Bermuda' 'Canada' 'Channel Islands'
 'Chile' 'Costa Rica' 'Cyprus' 'Denmark' 'Faroe Islands' 'Finland'
 'France' 'Germany' 'Greece' 'Guam' 'Hong Kong SAR, China' 'Iceland'
 'Ireland' 'Israel' 'Italy' 'Japan' 'Korea, Rep.' 'Liechtenstein'
 'Luxembourg' 'Macao SAR, China' 'Malta' 'Netherlands' 'New Zealand'
 'Norway' 'Portugal' 'Puerto Rico' 'Qatar' 'Singapore' 'Slovenia' 'Spain'
 'St. Martin (French part)' 'Sweden' 'Switzerland' 'United Kingdom']
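loc also accepts a list of column labels, in which case the result is a DataFrame rather than a Series; a minimal sketch:

# extract the country and total life expectancy columns for the matching rows
top = df.loc[df["life_expectancy_t"] > 80, ["country", "life_expectancy_t"]]
print(f"Type of top is: {type(top)}")
top.head()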

3.2.7.3.2. iloc (integer-based indexing)#

iloc is mainly used with integer-based indexing, which means that index positions are supplied to iloc in the following format:

df.iloc[R, C]

where R is the row index position and C is the column index position. In short, if you want to extract data items from a DataFrame based on index positions, then iloc is the way to do that.

#get the first 10 rows and the first 5 columns of the df Dataframe
df.iloc[0:10, 0:5]
       country  year  population_m  population_f  population_t
0  Afghanistan  2000    10689508.0    10090449.0    20779957.0
1  Afghanistan  2001    11117754.0    10489238.0    21606992.0
2  Afghanistan  2002    11642106.0    10958668.0    22600774.0
3  Afghanistan  2003    12214634.0    11466237.0    23680871.0
4  Afghanistan  2004    12763726.0    11962963.0    24726689.0
5  Afghanistan  2005    13239684.0    12414590.0    25654274.0
6  Afghanistan  2013    16554278.0    15715314.0    32269592.0
7  Afghanistan  2014    17138803.0    16232001.0    33370804.0
8  Afghanistan  2015    17686166.0    16727437.0    34413603.0
9  Afghanistan  2016    18186994.0    17196034.0    35383028.0
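iloc can also extract a single data item by supplying one row position and one column position. For example, the first row and first column of df hold the country name of the first record:

# get the data item in the first row and first column of df
df.iloc[0, 0]
'Afghanistan'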

Exercise 3.5 (Slicing DataFrames)

Level:

Using the df DataFrame created above, write code to perform the following tasks:

  1. Get the first 3 columns of df.

  2. Get the first 3 rows of df.

  3. Get the first row of df.

  4. What are the maximum and minimum values of life expectancy (use life_expectancy_t column)? Which countries have these?

  5. Extract all the data from the United Kingdom and save it into another DataFrame. How many records were returned?

  6. Extract all the records from the United Kingdom from the year 2020. How many records were returned?

  7. When extracting rows from a DataFrame, try omitting the parentheses around the two conditional expressions when using & or | and see what happens.


1

McKinney, Wes (2017). Python for Data Analysis, Second Edition. O’Reilly Media. ISBN 9781491957660.

2

The World Bank: World Development Indicators Collection. Terms of Use here. This dataset can also be viewed interactively on the World Data Bank website here.