3.4. More Pandas#

3.4.1. Create a new column#

Similar to NumPy, Pandas supports vectorised operations. This means, that we do not need to create loops to perform basic element-wise operations. In the example below, we create a new column population_p, that takes the values of column population_t and divides them by the values of column land_area.

import pandas as pd

# read data from file
df = pd.read_csv("data/world-bank-1_data.csv")

# check the size of the DataFrame
print(df.shape)

# calculate the population density
df["population_d"] = df["population_t"]/df["land_area"]

# check the size of the DataFrame again
print(df.shape)

# check the first few rows
df.head()

(3038, 11)
(3038, 12)

	country	year	population_m	population_f	population_t	population_density	land_area	life_expectancy_f	life_expectancy_m	life_expectancy_t	co2_emissions_pc	population_d
0	Afghanistan	2000	10689508.0	10090449.0	20779957.0	31.859861	652230.0	57.120	54.663	55.841	0.036574	31.859861
1	Afghanistan	2001	11117754.0	10489238.0	21606992.0	33.127872	652230.0	57.596	55.119	56.308	0.033785	33.127872
2	Afghanistan	2002	11642106.0	10958668.0	22600774.0	34.651540	652230.0	58.080	55.583	56.784	0.045574	34.651540
3	Afghanistan	2003	12214634.0	11466237.0	23680871.0	36.307546	652230.0	58.578	56.056	57.271	0.051518	36.307546
4	Afghanistan	2004	12763726.0	11962963.0	24726689.0	37.910996	652230.0	59.093	56.542	57.772	0.041655	37.910996

Similarly, other mathematical operators can be used to perform other arithmethic operations on columns.

Exercise 3.8 (Create columns)

Level:

Using the df DataFrame object created above, create the following new columns:

fraction of male population (population_m_f)
fraction of female population (population_f_f)

3.4.2. Writing data to files#

Now that we have made changes to our dataset, let us save it in a file so that we keep it safe. To save our dataset into a .csv format the to_csv() method is used.

df.to_csv("data/world-bank-1m_data.csv")

In the code above, the to_csv()method is saving the data that is currently present in df into a file called world-bank-1m_data.csv in the data folder. There are other file formats that DataFrames can be saved into. They normally have the format of to_*. Look into Pandas documentation for a whole list.

Exercise 3.9 (Saving data into files)

Level:

Perform the following:

Save the df DataFrame into a tab-delimited .txt file in the data folder with the name world-bank-1_data.txt.
Check the file has been created and open it to verify that it is tab-delimited.
Read the file back into Python into a new DataFrame object (df2) and check that the data has loaded well.

3.4.3. Handling missing data#

Missing data in Pandas is displayed as nan. In practice, if we are compiling a dataset ourselves, we should keep the respective data item empty so that the dataset can be used by different programming languages and systems.

One useful operation is to remove all rows that have an nan in them (even if it is just in one column). This can be done by using the dropna() method.

print(f"Size of df is: {df.shape}")

df_complete = df.dropna()

print(f"Size of df_complete is: {df_complete.shape}")

Size of df is: (3038, 12)
Size of df_complete is: (2346, 12)

When dealing with data something you might also want to detect missing values in a column rather than the whole dataset. The isna() function returns a Series of True and False depending on whether the data item in the Series is nan or not. The notna() function does the opposite of isna(), it returns False if the respective data item in the Series is nan, and True if a value is present.

# Aruba does not seems to have co2 emissions data
aruba_data = df[df["country"] == "Aruba"]

co2_emissions_isna = pd.isna(aruba_data["co2_emissions_pc"])
print(co2_emissions_isna)

co2_emissions_notna = pd.notna(aruba_data["co2_emissions_pc"])
print(co2_emissions_notna)

  True
  True
  True
  True
  True
  True
  True
  True
  True
  True
  True
  True
  True
  True
Name: co2_emissions_pc, dtype: bool
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
Name: co2_emissions_pc, dtype: bool

3.4.4. Select-Apply-Combine#

So far we have applied operations over all the DataFrame object. However, in data analysis, especially when dealing with big data, a common approach to data exploration is the split-apply-combine strategy. The idea behind this strategy is to split the data into more managable pieces, apply any operations required on the data independently on each piece and then combine the results together. The Fig. 3.4 below illustrates the approach that is done in the split-apply-combine approach.

_images/split-apply-combine.png — Fig. 3.4 An illustration on how the Split-Apply-Combine strategy works.#

The code below is the Python version of the Split-Apply-Combine approach. In this code, we are splitting the df DataFrame by year. For each year records, we are applying the mean() function on it. We are the combining the results from each year grouping and merging it into one DataFrame which is then stored in the res variable.

# for each year, get the average co2 emissions
res = df.groupby("year")["co2_emissions_pc"].mean()
print(res)

year
  4.244367
  4.323111
  4.349692
  4.466854
  4.505477
  4.525854
  4.326678
  4.226973
  4.184129
  4.195432
  4.199802
  4.164970
  4.115138
       NaN
Name: co2_emissions_pc, dtype: float64

Exercise 3.10 (Select-Apply-Combine)

Level:

Find the maximum value of CO2 emmissions for each country in the df DataFrame.

Exercise 3.11 (Becoming an independent programmer)

Level:

The purpose of this course is to teach you the core concepts well. It is up to you to then apply these concepts to solve your computational/mathematical problems.

One thing you should be aware of is that popular packages normally have a cheat sheet. A cheat sheet is a summarised but user-friendly version of all the functionality of a package. You can find the cheat sheet for pandas here and Matplotlib here.

Have a look at the cheat sheet and attempt using a new functionality.

Note

The cheat sheet alone is not enough. To maximise your potential as a programmer you would also need to use the reference documentation of each respective package. If you go in the documentation of a function and scroll down to the bottom of the page, you will also see exercises that will help you understand better (the sometimes complex) documentation.
Link to Matplotlib documentation.
Link to Pandas documentation

Use the links above to explore the documentation of Matplotlib and Pandas.

Python Programming

More Pandas

Contents

3.4. More Pandas#

3.4.1. Create a new column#

3.4.2. Writing data to files#

3.4.3. Handling missing data#

3.4.4. Select-Apply-Combine#