{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7aaf9f6f",
   "metadata": {},
   "source": [
    "# More Pandas\n",
    "\n",
    "## Create a new column\n",
    "Similar to NumPy, Pandas supports vectorised operations.  This means, that we do not need to create loops to \n",
    "perform basic element-wise operations.  In the example below, we create a new column `population_p`, that takes the values \n",
    "of column `population_t` and divides them by the values of column `land_area`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a09b3478",
   "metadata": {
    "tags": [
     "output_scroll"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3038, 11)\n",
      "(3038, 12)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>year</th>\n",
       "      <th>population_m</th>\n",
       "      <th>population_f</th>\n",
       "      <th>population_t</th>\n",
       "      <th>population_density</th>\n",
       "      <th>land_area</th>\n",
       "      <th>life_expectancy_f</th>\n",
       "      <th>life_expectancy_m</th>\n",
       "      <th>life_expectancy_t</th>\n",
       "      <th>co2_emissions_pc</th>\n",
       "      <th>population_d</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>2000</td>\n",
       "      <td>10689508.0</td>\n",
       "      <td>10090449.0</td>\n",
       "      <td>20779957.0</td>\n",
       "      <td>31.859861</td>\n",
       "      <td>652230.0</td>\n",
       "      <td>57.120</td>\n",
       "      <td>54.663</td>\n",
       "      <td>55.841</td>\n",
       "      <td>0.036574</td>\n",
       "      <td>31.859861</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>2001</td>\n",
       "      <td>11117754.0</td>\n",
       "      <td>10489238.0</td>\n",
       "      <td>21606992.0</td>\n",
       "      <td>33.127872</td>\n",
       "      <td>652230.0</td>\n",
       "      <td>57.596</td>\n",
       "      <td>55.119</td>\n",
       "      <td>56.308</td>\n",
       "      <td>0.033785</td>\n",
       "      <td>33.127872</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>2002</td>\n",
       "      <td>11642106.0</td>\n",
       "      <td>10958668.0</td>\n",
       "      <td>22600774.0</td>\n",
       "      <td>34.651540</td>\n",
       "      <td>652230.0</td>\n",
       "      <td>58.080</td>\n",
       "      <td>55.583</td>\n",
       "      <td>56.784</td>\n",
       "      <td>0.045574</td>\n",
       "      <td>34.651540</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>2003</td>\n",
       "      <td>12214634.0</td>\n",
       "      <td>11466237.0</td>\n",
       "      <td>23680871.0</td>\n",
       "      <td>36.307546</td>\n",
       "      <td>652230.0</td>\n",
       "      <td>58.578</td>\n",
       "      <td>56.056</td>\n",
       "      <td>57.271</td>\n",
       "      <td>0.051518</td>\n",
       "      <td>36.307546</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>2004</td>\n",
       "      <td>12763726.0</td>\n",
       "      <td>11962963.0</td>\n",
       "      <td>24726689.0</td>\n",
       "      <td>37.910996</td>\n",
       "      <td>652230.0</td>\n",
       "      <td>59.093</td>\n",
       "      <td>56.542</td>\n",
       "      <td>57.772</td>\n",
       "      <td>0.041655</td>\n",
       "      <td>37.910996</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country  year  population_m  population_f  population_t  \\\n",
       "0  Afghanistan  2000    10689508.0    10090449.0    20779957.0   \n",
       "1  Afghanistan  2001    11117754.0    10489238.0    21606992.0   \n",
       "2  Afghanistan  2002    11642106.0    10958668.0    22600774.0   \n",
       "3  Afghanistan  2003    12214634.0    11466237.0    23680871.0   \n",
       "4  Afghanistan  2004    12763726.0    11962963.0    24726689.0   \n",
       "\n",
       "   population_density  land_area  life_expectancy_f  life_expectancy_m  \\\n",
       "0           31.859861   652230.0             57.120             54.663   \n",
       "1           33.127872   652230.0             57.596             55.119   \n",
       "2           34.651540   652230.0             58.080             55.583   \n",
       "3           36.307546   652230.0             58.578             56.056   \n",
       "4           37.910996   652230.0             59.093             56.542   \n",
       "\n",
       "   life_expectancy_t  co2_emissions_pc  population_d  \n",
       "0             55.841          0.036574     31.859861  \n",
       "1             56.308          0.033785     33.127872  \n",
       "2             56.784          0.045574     34.651540  \n",
       "3             57.271          0.051518     36.307546  \n",
       "4             57.772          0.041655     37.910996  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# read data from file\n",
    "df = pd.read_csv(\"data/world-bank-1_data.csv\")\n",
    "\n",
    "# check the size of the DataFrame\n",
    "print(df.shape)\n",
    "\n",
    "# calculate the population density\n",
    "df[\"population_d\"] = df[\"population_t\"]/df[\"land_area\"]\n",
    "\n",
    "# check the size of the DataFrame again\n",
    "print(df.shape)\n",
    "\n",
    "# check the first few rows\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bd3c6c5",
   "metadata": {},
   "source": [
    "Similarly, other mathematical operators can be used to perform other arithmethic operations on columns.\n",
    "\n",
    "```{exercise-start} Create columns\n",
    ":label: create-columns\n",
    "```\n",
    "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning`\n",
    "\n",
    "Using the `df` DataFrame object created above, create the following new columns:\n",
    "1. fraction of male population (population_m_f)\n",
    "2. fraction of female population (population_f_f)\n",
    "\n",
    "```{exercise-end}\n",
    "```\n",
    "\n",
    "\n",
    "[//]: # (## Renaming columns)\n",
    "\n",
    "\n",
    "## Writing data to files\n",
    "Now that we have made changes to our dataset, let us save it in a file so that we keep it safe.  To save our dataset into a \n",
    ".csv format the [`to_csv()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv) is used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f7aa87ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv(\"data/world-bank-1m_data.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61f04099",
   "metadata": {},
   "source": [
    "In the code above, the `to_csv()`method is saving the data that is currently present in `df` into a file called `world-bank-1m_data.csv` \n",
    "in the `data` folder.  There are other file formats that DataFrames can be saved into.  They normally have the format of `to_*`. Look \n",
    "into [Pandas documentation](https://pandas.pydata.org/docs/reference/frame.html#serialization-io-conversion) for a whole list.\n",
    "\n",
    "```{exercise-start} Saving data into files\n",
    ":label: save-df-file\n",
    "```\n",
    "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning`\n",
    "\n",
    "Perform the following:\n",
    "1. Save the `df` DataFrame into a tab-delimited .txt file in the data folder with the name world-bank-1_data.txt. \n",
    "2. Check the file has been created and open it to verify that it is tab-delimited.\n",
    "3. Read the file back into Python into a new DataFrame object (df2) and check that the data has loaded well.\n",
    "\n",
    "```{exercise-end}\n",
    "```\n",
    "\n",
    "## Handling missing data\n",
    "\n",
    "Missing data in Pandas is displayed as `nan`.  In practice, if we are compiling a dataset ourselves, we should keep the \n",
    "respective data item empty so that the dataset can be used by different programming languages and systems.  \n",
    "\n",
    "One useful operation is to remove all rows that have an `nan` in them (even if it is just in one column).  This can be done \n",
    "by using the `dropna()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "48e70cb8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Size of df is: (3038, 12)\n",
      "Size of df_complete is: (2346, 12)\n"
     ]
    }
   ],
   "source": [
    "print(f\"Size of df is: {df.shape}\")\n",
    "\n",
    "df_complete = df.dropna()\n",
    "\n",
    "print(f\"Size of df_complete is: {df_complete.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c3c1285",
   "metadata": {},
   "source": [
    "When dealing with data something you might also want to detect missing values in a column rather than the whole dataset. \n",
    "The `isna()` function returns a Series of `True` and `False` depending on whether the data item in the Series is `nan` or not.\n",
    "The `notna()` function does the opposite of `isna()`, it returns `False` if the respective data item in the Series is `nan`, and \n",
    "`True` if a value is present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8157b256",
   "metadata": {
    "tags": [
     "output_scroll"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "126    True\n",
      "127    True\n",
      "128    True\n",
      "129    True\n",
      "130    True\n",
      "131    True\n",
      "132    True\n",
      "133    True\n",
      "134    True\n",
      "135    True\n",
      "136    True\n",
      "137    True\n",
      "138    True\n",
      "139    True\n",
      "Name: co2_emissions_pc, dtype: bool\n",
      "126    False\n",
      "127    False\n",
      "128    False\n",
      "129    False\n",
      "130    False\n",
      "131    False\n",
      "132    False\n",
      "133    False\n",
      "134    False\n",
      "135    False\n",
      "136    False\n",
      "137    False\n",
      "138    False\n",
      "139    False\n",
      "Name: co2_emissions_pc, dtype: bool\n"
     ]
    }
   ],
   "source": [
    "# Aruba does not seems to have co2 emissions data\n",
    "aruba_data = df[df[\"country\"] == \"Aruba\"]\n",
    "\n",
    "co2_emissions_isna = pd.isna(aruba_data[\"co2_emissions_pc\"])\n",
    "print(co2_emissions_isna)\n",
    "\n",
    "co2_emissions_notna = pd.notna(aruba_data[\"co2_emissions_pc\"])\n",
    "print(co2_emissions_notna)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b572b86c",
   "metadata": {},
   "source": [
    "## Select-Apply-Combine\n",
    "\n",
    "So far we have applied operations over all the DataFrame object.  However, in data analysis, especially when dealing with\n",
    "big data, a common approach to data exploration is the split-apply-combine strategy. The idea behind this strategy is \n",
    "to split the data into more managable pieces, apply any operations required on the data independently on each piece and \n",
    "then combine the results together. The {numref}`split-apply-combine` below illustrates the approach that is done in the split-apply-combine approach.\n",
    "\n",
    "```{figure} images/split-apply-combine.png\n",
    "---\n",
    "name: split-apply-combine\n",
    "---\n",
    "An illustration on how the Split-Apply-Combine strategy works.\n",
    "```\n",
    "\n",
    "The code below is the Python version of the Split-Apply-Combine approach.  In this code, we are splitting the `df` DataFrame \n",
    "by `year`.  For each `year` records, we are applying the `mean()` function on it.  We are the combining the results from each `year` grouping \n",
    "and merging it into one DataFrame which is then stored in the `res` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "31412303",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "year\n",
      "2000    4.244367\n",
      "2001    4.323111\n",
      "2002    4.349692\n",
      "2003    4.466854\n",
      "2004    4.505477\n",
      "2005    4.525854\n",
      "2013    4.326678\n",
      "2014    4.226973\n",
      "2015    4.184129\n",
      "2016    4.195432\n",
      "2017    4.199802\n",
      "2018    4.164970\n",
      "2019    4.115138\n",
      "2020         NaN\n",
      "Name: co2_emissions_pc, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# for each year, get the average co2 emissions\n",
    "res = df.groupby(\"year\")[\"co2_emissions_pc\"].mean()\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f937e86",
   "metadata": {},
   "source": [
    "```{exercise-start} Select-Apply-Combine\n",
    ":label: split-apply-combine-ex\n",
    "```\n",
    "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning` {octicon}`star;1em;sd-text-warning`\n",
    "\n",
    "Find the maximum value of CO2 emmissions for each country in the `df` DataFrame.\n",
    "\n",
    "```{exercise-end}\n",
    "```\n",
    "\n",
    "```{exercise-start} Becoming an independent programmer\n",
    ":label: independent-programmer\n",
    "```\n",
    "**Level:** {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning` {octicon}`star-fill;1em;sd-text-warning`\n",
    "\n",
    "The purpose of this course is to teach you the core concepts well.  It is up to you to then apply these concepts to solve your \n",
    "computational/mathematical problems. \n",
    "\n",
    "One thing you should be aware of is that popular packages normally have a cheat sheet.  A cheat sheet is a summarised but user-friendly \n",
    "version of all the functionality of a package.  You can find the cheat sheet for [pandas here](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) and \n",
    "[Matplotlib here](https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png).\n",
    "\n",
    "Have a look at the cheat sheet and attempt using a new functionality.\n",
    "\n",
    "```{note}\n",
    "The cheat sheet alone is not enough.  To maximise your potential as a programmer you would also need to use the reference documentation \n",
    "of each respective package. If you go in the documentation of a function and scroll down to the bottom of the page, you will also see \n",
    "exercises that will help you understand better (the sometimes complex) documentation.  \n",
    "[Link to Matplotlib documentation](https://matplotlib.org/stable/api/index).    \n",
    "[Link to Pandas documentation](https://pandas.pydata.org/docs/)  \n",
    "\n",
    "Use the links above to explore the documentation of Matplotlib and Pandas.\n",
    "```\n",
    "\n",
    "```{exercise-end}\n",
    "```\n",
    "\n",
    "\n",
    "[//]: # (## joining data frames)\n",
    "\n",
    "[//]: # ()\n",
    "[//]: # (## pivot)\n",
    "\n",
    "[//]: # ()\n",
    "[//]: # (## sort data in a dataframe)\n",
    "\n",
    "[//]: # (which of the countries has the highest life expectancy)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "md:myst",
   "text_representation": {
    "extension": ".md",
    "format_name": "myst"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "source_map": [
   11,
   20,
   39,
   63,
   65,
   92,
   98,
   105,
   116,
   136,
   140
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}