{ "cells": [ { "cell_type": "markdown", "id": "7aaf9f6f", "metadata": {}, "source": [ "# More Pandas\n", "\n", "## Create a new column\n", "Similar to NumPy, Pandas supports vectorised operations. This means, that we do not need to create loops to \n", "perform basic element-wise operations. In the example below, we create a new column `population_p`, that takes the values \n", "of column `population_t` and divides them by the values of column `land_area`." ] }, { "cell_type": "code", "execution_count": 1, "id": "a09b3478", "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3038, 11)\n", "(3038, 12)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearpopulation_mpopulation_fpopulation_tpopulation_densityland_arealife_expectancy_flife_expectancy_mlife_expectancy_tco2_emissions_pcpopulation_d
0Afghanistan200010689508.010090449.020779957.031.859861652230.057.12054.66355.8410.03657431.859861
1Afghanistan200111117754.010489238.021606992.033.127872652230.057.59655.11956.3080.03378533.127872
2Afghanistan200211642106.010958668.022600774.034.651540652230.058.08055.58356.7840.04557434.651540
3Afghanistan200312214634.011466237.023680871.036.307546652230.058.57856.05657.2710.05151836.307546
4Afghanistan200412763726.011962963.024726689.037.910996652230.059.09356.54257.7720.04165537.910996
\n", "
" ], "text/plain": [ " country year population_m population_f population_t \\\n", "0 Afghanistan 2000 10689508.0 10090449.0 20779957.0 \n", "1 Afghanistan 2001 11117754.0 10489238.0 21606992.0 \n", "2 Afghanistan 2002 11642106.0 10958668.0 22600774.0 \n", "3 Afghanistan 2003 12214634.0 11466237.0 23680871.0 \n", "4 Afghanistan 2004 12763726.0 11962963.0 24726689.0 \n", "\n", " population_density land_area life_expectancy_f life_expectancy_m \\\n", "0 31.859861 652230.0 57.120 54.663 \n", "1 33.127872 652230.0 57.596 55.119 \n", "2 34.651540 652230.0 58.080 55.583 \n", "3 36.307546 652230.0 58.578 56.056 \n", "4 37.910996 652230.0 59.093 56.542 \n", "\n", " life_expectancy_t co2_emissions_pc population_d \n", "0 55.841 0.036574 31.859861 \n", "1 56.308 0.033785 33.127872 \n", "2 56.784 0.045574 34.651540 \n", "3 57.271 0.051518 36.307546 \n", "4 57.772 0.041655 37.910996 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# read data from file\n", "df = pd.read_csv(\"data/world-bank-1_data.csv\")\n", "\n", "# check the size of the DataFrame\n", "print(df.shape)\n", "\n", "# calculate the population density\n", "df[\"population_d\"] = df[\"population_t\"]/df[\"land_area\"]\n", "\n", "# check the size of the DataFrame again\n", "print(df.shape)\n", "\n", "# check the first few rows\n", "df.head()" ] }, { "cell_type": "markdown", "id": "0bd3c6c5", "metadata": {}, "source": [ "Similarly, other mathematical operators can be used to perform other arithmethic operations on columns.\n", "\n", "```{exercise-start} Create columns\n", ":label: create-columns\n", "```\n", "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning`\n", "\n", "Using the `df` DataFrame object created above, create the following new columns:\n", "1. fraction of male population (population_m_f)\n", "2. fraction of female population (population_f_f)\n", "\n", "```{exercise-end}\n", "```\n", "\n", "\n", "[//]: # (## Renaming columns)\n", "\n", "\n", "## Writing data to files\n", "Now that we have made changes to our dataset, let us save it in a file so that we keep it safe. To save our dataset into a \n", ".csv format the [`to_csv()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv) is used." ] }, { "cell_type": "code", "execution_count": 2, "id": "f7aa87ce", "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"data/world-bank-1m_data.csv\")" ] }, { "cell_type": "markdown", "id": "61f04099", "metadata": {}, "source": [ "In the code above, the `to_csv()`method is saving the data that is currently present in `df` into a file called `world-bank-1m_data.csv` \n", "in the `data` folder. There are other file formats that DataFrames can be saved into. They normally have the format of `to_*`. Look \n", "into [Pandas documentation](https://pandas.pydata.org/docs/reference/frame.html#serialization-io-conversion) for a whole list.\n", "\n", "```{exercise-start} Saving data into files\n", ":label: save-df-file\n", "```\n", "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning`\n", "\n", "Perform the following:\n", "1. Save the `df` DataFrame into a tab-delimited .txt file in the data folder with the name world-bank-1_data.txt. \n", "2. Check the file has been created and open it to verify that it is tab-delimited.\n", "3. Read the file back into Python into a new DataFrame object (df2) and check that the data has loaded well.\n", "\n", "```{exercise-end}\n", "```\n", "\n", "## Handling missing data\n", "\n", "Missing data in Pandas is displayed as `nan`. In practice, if we are compiling a dataset ourselves, we should keep the \n", "respective data item empty so that the dataset can be used by different programming languages and systems. \n", "\n", "One useful operation is to remove all rows that have an `nan` in them (even if it is just in one column). This can be done \n", "by using the `dropna()` method." ] }, { "cell_type": "code", "execution_count": 3, "id": "48e70cb8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of df is: (3038, 12)\n", "Size of df_complete is: (2346, 12)\n" ] } ], "source": [ "print(f\"Size of df is: {df.shape}\")\n", "\n", "df_complete = df.dropna()\n", "\n", "print(f\"Size of df_complete is: {df_complete.shape}\")" ] }, { "cell_type": "markdown", "id": "5c3c1285", "metadata": {}, "source": [ "When dealing with data something you might also want to detect missing values in a column rather than the whole dataset. \n", "The `isna()` function returns a Series of `True` and `False` depending on whether the data item in the Series is `nan` or not.\n", "The `notna()` function does the opposite of `isna()`, it returns `False` if the respective data item in the Series is `nan`, and \n", "`True` if a value is present." ] }, { "cell_type": "code", "execution_count": 4, "id": "8157b256", "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "126 True\n", "127 True\n", "128 True\n", "129 True\n", "130 True\n", "131 True\n", "132 True\n", "133 True\n", "134 True\n", "135 True\n", "136 True\n", "137 True\n", "138 True\n", "139 True\n", "Name: co2_emissions_pc, dtype: bool\n", "126 False\n", "127 False\n", "128 False\n", "129 False\n", "130 False\n", "131 False\n", "132 False\n", "133 False\n", "134 False\n", "135 False\n", "136 False\n", "137 False\n", "138 False\n", "139 False\n", "Name: co2_emissions_pc, dtype: bool\n" ] } ], "source": [ "# Aruba does not seems to have co2 emissions data\n", "aruba_data = df[df[\"country\"] == \"Aruba\"]\n", "\n", "co2_emissions_isna = pd.isna(aruba_data[\"co2_emissions_pc\"])\n", "print(co2_emissions_isna)\n", "\n", "co2_emissions_notna = pd.notna(aruba_data[\"co2_emissions_pc\"])\n", "print(co2_emissions_notna)" ] }, { "cell_type": "markdown", "id": "b572b86c", "metadata": {}, "source": [ "## Select-Apply-Combine\n", "\n", "So far we have applied operations over all the DataFrame object. However, in data analysis, especially when dealing with\n", "big data, a common approach to data exploration is the split-apply-combine strategy. The idea behind this strategy is \n", "to split the data into more managable pieces, apply any operations required on the data independently on each piece and \n", "then combine the results together. The {numref}`split-apply-combine` below illustrates the approach that is done in the split-apply-combine approach.\n", "\n", "```{figure} images/split-apply-combine.png\n", "---\n", "name: split-apply-combine\n", "---\n", "An illustration on how the Split-Apply-Combine strategy works.\n", "```\n", "\n", "The code below is the Python version of the Split-Apply-Combine approach. In this code, we are splitting the `df` DataFrame \n", "by `year`. For each `year` records, we are applying the `mean()` function on it. We are the combining the results from each `year` grouping \n", "and merging it into one DataFrame which is then stored in the `res` variable." ] }, { "cell_type": "code", "execution_count": 5, "id": "31412303", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "year\n", "2000 4.244367\n", "2001 4.323111\n", "2002 4.349692\n", "2003 4.466854\n", "2004 4.505477\n", "2005 4.525854\n", "2013 4.326678\n", "2014 4.226973\n", "2015 4.184129\n", "2016 4.195432\n", "2017 4.199802\n", "2018 4.164970\n", "2019 4.115138\n", "2020 NaN\n", "Name: co2_emissions_pc, dtype: float64\n" ] } ], "source": [ "# for each year, get the average co2 emissions\n", "res = df.groupby(\"year\")[\"co2_emissions_pc\"].mean()\n", "print(res)" ] }, { "cell_type": "markdown", "id": "3f937e86", "metadata": {}, "source": [ "```{exercise-start} Select-Apply-Combine\n", ":label: split-apply-combine-ex\n", "```\n", "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning`\n", "\n", "Find the maximum value of CO2 emmissions for each country in the `df` DataFrame.\n", "\n", "```{exercise-end}\n", "```\n", "\n", "```{exercise-start} Becoming an independent programmer\n", ":label: independent-programmer\n", "```\n", "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning`\n", "\n", "The purpose of this course is to teach you the core concepts well. It is up to you to then apply these concepts to solve your \n", "computational/mathematical problems. \n", "\n", "One thing you should be aware of is that popular packages normally have a cheat sheet. A cheat sheet is a summarised but user-friendly \n", "version of all the functionality of a package. You can find the cheat sheet for [pandas here](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) and \n", "[Matplotlib here](https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png).\n", "\n", "Have a look at the cheat sheet and attempt using a new functionality.\n", "\n", "```{note}\n", "The cheat sheet alone is not enough. To maximise your potential as a programmer you would also need to use the reference documentation \n", "of each respective package. If you go in the documentation of a function and scroll down to the bottom of the page, you will also see \n", "exercises that will help you understand better (the sometimes complex) documentation. \n", "[Link to Matplotlib documentation](https://matplotlib.org/stable/api/index). \n", "[Link to Pandas documentation](https://pandas.pydata.org/docs/) \n", "\n", "Use the links above to explore the documentation of Matplotlib and Pandas.\n", "```\n", "\n", "```{exercise-end}\n", "```\n", "\n", "\n", "[//]: # (## joining data frames)\n", "\n", "[//]: # ()\n", "[//]: # (## pivot)\n", "\n", "[//]: # ()\n", "[//]: # (## sort data in a dataframe)\n", "\n", "[//]: # (which of the countries has the highest life expectancy)" ] } ], "metadata": { "jupytext": { "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" }, "source_map": [ 11, 20, 39, 63, 65, 92, 98, 105, 116, 136, 140 ] }, "nbformat": 4, "nbformat_minor": 5 }