{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data wrangling, grouping and aggregation\n", "\n", "Next, we will continue working with weather data, but expand our analysis to cover longer periods of data from Finland. In the following, you will learn various useful techniques in pandas to manipulate, group and aggregate the data in different ways that are useful when extracting insights from your data. In the end, you will learn how to create an automated data analysis workflow that can be repeated with multiple input files having a similar structure. As a case study, we will investigate whether January 2020 was the warmest month on record also in Finland, as the month was the warmest one on record globally [^noaanews]. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning data while reading\n", "\n", "In this section we are using weather observation data from Finland that was downloaded from NOAA (see `Datasets` chapter for further details). The input data is separated with varying number of spaces (i.e., fixed width). The first lines and columns of the data look like following:\n", "\n", "``` \n", " USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW ...\n", "029440 99999 190601010600 090 7 *** *** OVC * * * 0.0 ** ** ** ** ** ...\n", "029440 99999 190601011300 *** 0 *** *** OVC * * * 0.0 ** ** ** ** ** ...\n", "029440 99999 190601012000 *** 0 *** *** OVC * * * 0.0 ** ** ** ** ** ...\n", "029440 99999 190601020600 *** 0 *** *** CLR * * * 0.0 ** ** ** ** ** ...\n", "```\n", "\n", "By looking at the data, we can notice a few things that we need to consider when reading the data:\n", "\n", "1. **Delimiter:** The columns are separated with a varying amount of spaces which requires using some special tricks when reading the data with pandas `read_csv()` function\n", "2. **NoData values:** NaN values in the NOAA data are coded with varying number of `*` characters, hence, we need to be able to instruct pandas to interpret those as NaNs. \n", "3. **Many columns**: The input data contains many columns (altogether 33). Many of those do not contain any meaningful data for our needs. Hence, we should probably ignore the unnecessary columns already at this stage. \n", "\n", "Handling and cleaning heterogeneous input data (such as our example here) can be done after reading in the data. However, in many cases, it is actually useful to do some cleaning and preprocessing already when reading the data. In fact, that is often much easier to do. In our case, we can read the data with varying number of spaces between the columns (1) by using a parameter `delim_whitespace=True` (alternatively, specifying `sep='\\s+'` would work). For handling the NoData values (2), we can tell pandas to consider the `*` characters as NaNs by using a paramater `na_values` and specifying a list of characters that should be converted to NaNs. Hence, in this case we can specify `na_values=['*', '**', '***', '****', '*****', '******']` which will then convert the varying number of `*` characters into NaN values. Finally, we can limit the number of columns that we read (3) by using the `usecols` parameter, which we already used previously. In our case, we are interested in columns that might be somehow useful to our analysis (or at least meaningful to us), including e.g. the station name, timestamp, and data about the wind and temperature: `'USAF','YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN'`. 
Achieving all these things is pretty straightforward using the `read_csv()` function: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Define relative path to the file\n", "fp = \"data/029820.txt\"\n", "\n", "# Read data using a varying amount of spaces as separator,\n", "# specifying '*' characters as NoData values,\n", "# and selecting only specific columns from the data\n", "data = pd.read_csv(\n", " fp,\n", " delim_whitespace=True,\n", " na_values=[\"*\", \"**\", \"***\", \"****\", \"*****\", \"******\"],\n", " usecols=[\"USAF\", \"YR--MODAHRMN\", \"DIR\", \"SPD\", \"GUS\", \"TEMP\", \"MAX\", \"MIN\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now see how the data looks by printing the first five rows with the `head()` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
USAFYR--MODAHRMNDIRSPDGUSTEMPMAXMIN
029820190601010600180.02.0NaN34.0NaNNaN
12982019060101130020.06.0NaN32.0NaNNaN
229820190601012000320.06.0NaN30.0NaNNaN
329820190601020600320.010.0NaN33.0NaNNaN
429820190601021300230.015.0NaN35.0NaNNaN
\n", "
" ], "text/plain": [ " USAF YR--MODAHRMN DIR SPD GUS TEMP MAX MIN\n", "0 29820 190601010600 180.0 2.0 NaN 34.0 NaN NaN\n", "1 29820 190601011300 20.0 6.0 NaN 32.0 NaN NaN\n", "2 29820 190601012000 320.0 6.0 NaN 30.0 NaN NaN\n", "3 29820 190601020600 320.0 10.0 NaN 33.0 NaN NaN\n", "4 29820 190601021300 230.0 15.0 NaN 35.0 NaN NaN" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perfect, looks good. We have skipped a bunch of unnecessary columns and also the asterisk (\\*) characters have been correctly converted to NaN values. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Renaming columns\n", "\n", "Let's take a closer look at the column names of our DataFrame: " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Index(['USAF', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'TEMP', 'MAX', 'MIN'], dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we see, some of the column names are a bit awkward and difficult to interpret (a description for the columns is available in the metadata [data/3505doc.txt](data/3505doc.txt)). Luckily, it is easy to alter labels in a pandas DataFrame using the `rename()` function. In order to change the column names, we need to tell pandas how we want to rename the columns using a dictionary that converts the old names to new ones. As you probably remember from Chapter 1, a `dictionary` is a specific data structure in Python for storing key-value pairs. We can define the new column names using a dictionary where we list \"`key: value`\" pairs in following manner:\n", " \n", "- `USAF`: `STATION_NUMBER`\n", "- `YR--MODAHRMN`: `TIME`\n", "- `SPD`: `SPEED`\n", "- `GUS`: `GUST`\n", "- `TEMP`: `TEMP_F`\n", "\n", "Hence, the original column name (e.g. `YR--MODAHRMN`) is the dictionary `key` which will be converted to a new column name `TIME` (which is the `value`). The temperature values in our data file is again represented in Fahrenheit. We will soon convert these temperatures to Celsius. Hence, in order to avoid confusion with the columns, let's rename the column `TEMP` to `TEMP_F`. Also the station number `USAF` is much more intuitive if we call it `STATION_NUMBER`. 
Let's create a dictionary for the new column names:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "{'USAF': 'STATION_NUMBER',\n", " 'YR--MODAHRMN': 'TIME',\n", " 'SPD': 'SPEED',\n", " 'GUS': 'GUST',\n", " 'TEMP': 'TEMP_F'}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_names = {\n", " \"USAF\": \"STATION_NUMBER\",\n", " \"YR--MODAHRMN\": \"TIME\",\n", " \"SPD\": \"SPEED\",\n", " \"GUS\": \"GUST\",\n", " \"TEMP\": \"TEMP_F\",\n", "}\n", "new_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our dictionary looks correct, so now we can change the column names by passing that dictionary using the parameter `columns` in the `rename()` function:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Index(['STATION_NUMBER', 'TIME', 'DIR', 'SPEED', 'GUST', 'TEMP_F', 'MAX',\n", " 'MIN'],\n", " dtype='object')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data.rename(columns=new_names)\n", "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perfect, now our column names are easier to understand and use. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using functions with pandas\n", "\n", "Now it's time to convert those temperatures from Fahrenheit to Celsius. We have done this many times before, but this time we will learn how to apply our own functions to data in a pandas DataFrame. We will define a function for the temperature conversion, and apply this function to the Fahrenheit value on each row of the DataFrame. The output Celsius values should be stored in a new column called `TEMP_C`. But first, it is a good idea to check some basic properties of our new input data before proceeding with data analysis:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATION_NUMBERTIMEDIRSPEEDGUSTTEMP_FMAXMIN
029820190601010600180.02.0NaN34.0NaNNaN
12982019060101130020.06.0NaN32.0NaNNaN
\n", "
" ], "text/plain": [ " STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN\n", "0 29820 190601010600 180.0 2.0 NaN 34.0 NaN NaN\n", "1 29820 190601011300 20.0 6.0 NaN 32.0 NaN NaN" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First rows\n", "data.head(2)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATION_NUMBERTIMEDIRSPEEDGUSTTEMP_FMAXMIN
19833229820201910012200287.021.025.047.0NaNNaN
19833329820201910012300287.033.041.046.0NaNNaN
\n", "
" ], "text/plain": [ " STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN\n", "198332 29820 201910012200 287.0 21.0 25.0 47.0 NaN NaN\n", "198333 29820 201910012300 287.0 33.0 41.0 46.0 NaN NaN" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Last rows\n", "data.tail(2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 198334 entries, 0 to 198333\n", "Data columns (total 8 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 STATION_NUMBER 198334 non-null int64 \n", " 1 TIME 198334 non-null int64 \n", " 2 DIR 193660 non-null float64\n", " 3 SPEED 196436 non-null float64\n", " 4 GUST 26649 non-null float64\n", " 5 TEMP_F 197916 non-null float64\n", " 6 MAX 29868 non-null float64\n", " 7 MIN 29536 non-null float64\n", "dtypes: float64(6), int64(2)\n", "memory usage: 12.1 MB\n" ] } ], "source": [ "# Data types\n", "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nothing suspicous for the first and last rows, but here with `info()` we can see that the number of observations per column seem to be varying if you compare the `Non-Null Count` information to the number of entries in the data (N=198334). Only station number and time seem to have data on each row. All other columns seem to have some missing values. This is not necessarily anything dangerous, but good to keep in mind. Let's still look at the descriptive statistics:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATION_NUMBERTIMEDIRSPEEDGUSTTEMP_FMAXMIN
count198334.01.983340e+05193660.000000196436.00000026649.000000197916.00000029868.00000029536.000000
mean29820.01.990974e+11199.79803314.79446223.89560643.71784546.40585240.537446
std0.02.691914e+0995.2991908.4823089.23163714.30613814.44654714.350235
min29820.01.906010e+111.0000000.00000011.000000-28.000000-20.000000-28.000000
25%29820.01.979011e+11120.0000009.00000017.00000034.00000036.00000031.000000
50%29820.01.997061e+11212.00000014.00000022.00000043.00000045.00000040.000000
75%29820.02.013113e+11270.00000020.00000029.00000055.00000059.00000052.000000
max29820.02.019100e+11990.00000097.00000073.00000090.000000102.00000077.000000
\n", "
" ], "text/plain": [ " STATION_NUMBER TIME DIR SPEED \\\n", "count 198334.0 1.983340e+05 193660.000000 196436.000000 \n", "mean 29820.0 1.990974e+11 199.798033 14.794462 \n", "std 0.0 2.691914e+09 95.299190 8.482308 \n", "min 29820.0 1.906010e+11 1.000000 0.000000 \n", "25% 29820.0 1.979011e+11 120.000000 9.000000 \n", "50% 29820.0 1.997061e+11 212.000000 14.000000 \n", "75% 29820.0 2.013113e+11 270.000000 20.000000 \n", "max 29820.0 2.019100e+11 990.000000 97.000000 \n", "\n", " GUST TEMP_F MAX MIN \n", "count 26649.000000 197916.000000 29868.000000 29536.000000 \n", "mean 23.895606 43.717845 46.405852 40.537446 \n", "std 9.231637 14.306138 14.446547 14.350235 \n", "min 11.000000 -28.000000 -20.000000 -28.000000 \n", "25% 17.000000 34.000000 36.000000 31.000000 \n", "50% 22.000000 43.000000 45.000000 40.000000 \n", "75% 29.000000 55.000000 59.000000 52.000000 \n", "max 73.000000 90.000000 102.000000 77.000000 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Descriptive stats\n", "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By looking at the `TEMP_F` values (Fahrenheit temperatures), we can confirm that our measurements seems more or less valid because the value range of the temperatures makes sense, i.e. there are no outliers such as extremely high `MAX` values or low `MIN` values. It is always a good practice to critically check your data before doing any analysis, as it is possible that your data may include incorrect values, e.g. due to a sensor malfunction or human error. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining a function\n", "\n", "Now we are sure that our data looks okay, and we can start our temperature conversion process by first defining our temperature conversion function from Fahrenheit to Celsius. Pandas can use regular functions, hence you can define functions for pandas exactly in the same way as you would do normally (as we learned in Chapter 1). Hence, let's define a function that converts Fahrenheits to Celsius: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def fahr_to_celsius(temp_fahrenheit):\n", " \"\"\"Function to convert Fahrenheit temperature into Celsius.\n", "\n", " Parameters\n", " ----------\n", "\n", " temp_fahrenheit: int | float\n", " Input temperature in Fahrenheit (should be a number)\n", "\n", " Returns\n", " -------\n", "\n", " Temperature in Celsius (float)\n", " \"\"\"\n", "\n", " # Convert the Fahrenheit into Celsius\n", " converted_temp = (temp_fahrenheit - 32) / 1.8\n", "\n", " return converted_temp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the function defined and stored in memory. At this point it is good to test the function with some known value:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fahr_to_celsius(32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "32 Fahrenheits is indeed 0 Celsius, so our function seem to be working correctly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a function by iterating over rows\n", "\n", "Next we will learn how to use our function with data stored in pandas DataFrame. 
We will first apply the function row-by-row using a `for` loop and then we will learn a more efficient way of applying the function to all rows at once.\n", "\n", "Looping over rows in a DataFrame can be done in a couple of different ways. A common approach is to use the `iterrows()` method which loops over the rows as index-Series pairs. In other words, we can use the `iterrows()` method together with a `for` loop to repeat a process *for each row in a Pandas DataFrame*. Please note that iterating over rows this way is a rather inefficient approach, but it is still useful to understand the logic behind the iteration (we will learn a more efficient approach later). When using the `iterrows()` method it is important to understand that `iterrows()` accesses not only the values of one row, but also the `index` of the row as we mentioned. Let's start with a simple for loop that goes through each row in our DataFrame:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index: 0\n", "Temp F: 34.0 \n", "\n" ] } ], "source": [ "# Iterate over the rows\n", "for idx, row in data.iterrows():\n", " # Print the index value\n", " print(\"Index:\", idx)\n", "\n", " # Print the temperature from the row\n", " print(\"Temp F:\", row[\"TEMP_F\"], \"\\n\")\n", "\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the `idx` variable indeed contains the index value at position 0 (the first row) and the `row` variable contains all the data from that given row stored as a pandas `Series`. Notice that when developing a for loop, you don't always need to go through the entire loop if you just want to test things out. Using the `break` statement in Python terminates a loop whenever it is placed inside a loop. We used it here just to check the values on the first row. With large data, you might not want to print out thousands of values to the screen!\n", "\n", "Let's now create an empty column `TEMP_C` for the Celsius temperatures and update the values in that column using the `fahr_to_celsius()` function that we defined earlier. For updating the value, we can use `at` which we already used earlier in this chapter. This time, we will use the `itertuples()` method which works in a similar manner, except it only returns the row values without a separate `index`. When using `itertuples()`, accessing the row values needs to be done a bit differently, because the row is not a Series, but a `named tuple` (hence the name). A tuple is like a list (but immutable, i.e. you cannot change it) and a \"named tuple\" is a special kind of tuple object that adds the ability to access the values by name instead of position index. 
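If named tuples are new to you, here is a small standalone sketch using Python's built-in `collections` module (the `Observation` type and its values are invented for illustration):

```
from collections import namedtuple

# A named tuple type with two fields (illustrative only)
Observation = namedtuple("Observation", ["Index", "TEMP_F"])
obs = Observation(Index=0, TEMP_F=32.0)

print(obs.TEMP_F)  # access by name -> 32.0
print(obs[1])      # access by position -> 32.0
# obs.TEMP_F = 40.0 would raise an AttributeError: tuples are immutable
```
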
Hence, we will access the `TEMP_F` value by using `row.TEMP_F` (compare to how we accessed the value in the previous code block):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Create an empty column for the output values\n", "data[\"TEMP_C\"] = 0.0\n", "\n", "# Iterate over the rows\n", "for row in data.itertuples():\n", " # Convert the Fahrenheit to Celsius\n", " # Notice how we access the row value\n", " celsius = fahr_to_celsius(row.TEMP_F)\n", "\n", " # Update the value in the 'TEMP_C' column with the converted value\n", " # Notice how we can access the Index value\n", " data.at[row.Index, \"TEMP_C\"] = celsius" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATION_NUMBERTIMEDIRSPEEDGUSTTEMP_FMAXMINTEMP_C
029820190601010600180.02.0NaN34.0NaNNaN1.111111
12982019060101130020.06.0NaN32.0NaNNaN0.000000
229820190601012000320.06.0NaN30.0NaNNaN-1.111111
329820190601020600320.010.0NaN33.0NaNNaN0.555556
429820190601021300230.015.0NaN35.0NaNNaN1.666667
\n", "
" ], "text/plain": [ " STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \\\n", "0 29820 190601010600 180.0 2.0 NaN 34.0 NaN NaN \n", "1 29820 190601011300 20.0 6.0 NaN 32.0 NaN NaN \n", "2 29820 190601012000 320.0 6.0 NaN 30.0 NaN NaN \n", "3 29820 190601020600 320.0 10.0 NaN 33.0 NaN NaN \n", "4 29820 190601021300 230.0 15.0 NaN 35.0 NaN NaN \n", "\n", " TEMP_C \n", "0 1.111111 \n", "1 0.000000 \n", "2 -1.111111 \n", "3 0.555556 \n", "4 1.666667 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the result\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pandas(Index=198333, STATION_NUMBER=29820, TIME=201910012300, DIR=287.0, SPEED=33.0, GUST=41.0, TEMP_F=46.0, MAX=nan, MIN=nan, TEMP_C=0.0)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How does our row look like?\n", "row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, now we have iterated over our data and updated the temperatures in Celsius to `TEMP_C` column by using our `fahr_to_celsius()` function. The values look correct as 32 Fahrenheits indeed is 0 Celsius degrees, as can be seen on the second row. We also have here the last row of our DataFrame which is a named tuple. As you can see, it is a bit like a weird looking dictionary with values assigned to the names of our columns. Basically, it is an object with attributes that we can access in a similar manner as we have used to access some of the pandas DataFrame attributes, such as `data.shape`. \n", "\n", "A couple of notes about our appoaches. We used `itertuples()` method for looping over the values because it is significantly faster compared to `iterrows()` (can be ~100x faster). We used `.at` to assign the value to the DataFrame because it is designed to access single values more efficiently compared to `.loc`, which can access also groups of rows and columns. That said, you could have used `data.loc[idx, new_column] = celsius` to achieve the same result (it is just slower). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a function with apply\n", "\n", "Although using for loop with `itertuples()` can be fairly efficient, pandas DataFrames and Series have a dedicated method called `apply()` for applying functions on columns (or rows). `apply()` is typically faster than `itertuples()`, especially if you have large number of rows, such as in our case. When using `apply()`, we pass the function that we want to use as an argument. Let's start by applying the function to the `TEMP_F` column that contains the temperature values in Fahrenheit:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.111111\n", "1 0.000000\n", "2 -1.111111\n", "3 0.555556\n", "4 1.666667\n", " ... \n", "198329 8.333333\n", "198330 8.333333\n", "198331 8.333333\n", "198332 8.333333\n", "198333 7.777778\n", "Name: TEMP_F, Length: 198334, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"TEMP_F\"].apply(fahr_to_celsius)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results look logical. Notice how we passed the `fahr_to_celsius()` function without using the parentheses `()` after the name of the function. When using `apply`, you should always leave out the parentheses from the function that you use. 
That is, you should use `apply(fahr_to_celsius)` instead of `apply(fahr_to_celsius())`. Why? Because the `apply()` method will execute and use the function itself in the background when it operates with the data. If we passed our function with the parentheses, the `fahr_to_celsius()` function would actually be executed once before the loop with `apply()` starts (hence becoming unusable), and that is not what we want. Our previous command only returned the Series of temperatures to the screen, but naturally we can also store them permanently in the `TEMP_C` column (overwriting the old values):" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "data[\"TEMP_C\"] = data[\"TEMP_F\"].apply(fahr_to_celsius)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A nice thing with `apply()` is that we can also apply the function on several columns at once. Below, we also sort the values in descending order based on the values in the `MIN` column to see that applying our function really works:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TEMP_FMINMAX
19677525.00000025.00000029.444444
15453125.55555624.44444427.777778
18816725.00000024.44444427.777778
18840723.88888923.88888927.777778
18814324.44444423.88888928.888889
\n", "
" ], "text/plain": [ " TEMP_F MIN MAX\n", "196775 25.000000 25.000000 29.444444\n", "154531 25.555556 24.444444 27.777778\n", "188167 25.000000 24.444444 27.777778\n", "188407 23.888889 23.888889 27.777778\n", "188143 24.444444 23.888889 28.888889" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = [\"TEMP_F\", \"MIN\", \"MAX\"]\n", "result = data[cols].apply(fahr_to_celsius)\n", "result.sort_values(by=\"MIN\", ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also directly store the outputs to new columns `'TEMP_C'`, `'MIN_C'`, `'MAX_C'`:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATION_NUMBERTIMEDIRSPEEDGUSTTEMP_FMAXMINTEMP_C
029820190601010600180.02.0NaN1.111111NaNNaN1.111111
12982019060101130020.06.0NaN0.000000NaNNaN0.000000
229820190601012000320.06.0NaN-1.111111NaNNaN-1.111111
329820190601020600320.010.0NaN0.555556NaNNaN0.555556
429820190601021300230.015.0NaN1.666667NaNNaN1.666667
\n", "
" ], "text/plain": [ " STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \\\n", "0 29820 190601010600 180.0 2.0 NaN 1.111111 NaN NaN \n", "1 29820 190601011300 20.0 6.0 NaN 0.000000 NaN NaN \n", "2 29820 190601012000 320.0 6.0 NaN -1.111111 NaN NaN \n", "3 29820 190601020600 320.0 10.0 NaN 0.555556 NaN NaN \n", "4 29820 190601021300 230.0 15.0 NaN 1.666667 NaN NaN \n", "\n", " TEMP_C \n", "0 1.111111 \n", "1 0.000000 \n", "2 -1.111111 \n", "3 0.555556 \n", "4 1.666667 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = [\"TEMP_F\", \"MIN\", \"MAX\"]\n", "data[cols] = data[cols].apply(fahr_to_celsius)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we showed you a few different ways to iterate over rows in pandas and apply functions. The most important thing is that you understand the logic of how loops work and how you can use your own functions to modify the values in a pandas DataFrame. Whenever you need to loop over your data, we recommend using `.apply()` as it is typically the most efficient one in terms of execution time. However, remember that in most cases you do not actually need to use loops, but you can do calculations in a \"vectorized manner\" (which is the fastest way) as we learned previously when doing basic calculations in pandas. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String slicing\n", "\n", "We will eventually want to group our data based on month in order to see if the January temperatures in 2020 were higher than on average (which is the goal in our analysis as you might recall). Currently, the date and time information is stored in the column `TIME` that has a structure `yyyyMMddhhmm`. This is a typical timestamp format in which `yyyy` equals to year in four digit format, `MM` to month (two digits), `dd` days, `hh` hours and `mm` minutes. Let's have a closer look at the date and time information we have by checking the values in that column, and their data type:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 190601010600\n", "1 190601011300\n", "2 190601012000\n", "3 190601020600\n", "4 190601021300\n", "Name: TIME, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"TIME\"].head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "198329 201910011900\n", "198330 201910012000\n", "198331 201910012100\n", "198332 201910012200\n", "198333 201910012300\n", "Name: TIME, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"TIME\"].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `TIME` column contains several observations per day (and even several observations per hour). The timestamp for the first observation is `190601010600`, i.e. from 1st of January 1906 (way back!), and the timestamp for the latest observation is `201910012350`. As we can see, the data type (`dtype`) of our column seems to be `int64`, i.e. the information is stored as integer values. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to aggregate this data on a monthly level. In order to do so, we need to \"label\" each row of data based on the month when the record was observed. Hence, we need to somehow separate information about the year and month for each row. 
In practice, we can create a new column (or an index) containing information about the month (including the year, but excluding days, hours and minutes). There are different ways of achieving this, but here we will take advantage of `string slicing` which means that we convert the date and time information into character strings and \"cut\" the needed information from the string objects. The other option would be to convert the timestamp values into something called `datetime` objects, but we will learn about those a bit later. Before further processing, we first want to convert the `TIME` column to character strings for convenience, stored in a new column `TIME_STR`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "data[\"TIME_STR\"] = data[\"TIME\"].astype(str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we look at the latest time stamp in the data (`201910012300`), you can see that there is a systematic pattern `YEAR-MONTH-DAY-HOUR-MINUTE`. The first four characters represent the year, and the following two characters represent the month. Because we are interested in understanding monthly averages for different years, we want to slice the year and month values from the timestamp (the first 6 characters), like this:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'201910'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date = \"201910012300\"\n", "date[0:6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on this information, we can slice the correct range of characters from the `TIME_STR` column using a specific pandas function designed for Series, called `.str.slice()`. As parameters, the function has `start` and `stop` which you can use to specify the positions where the slicing should start and end:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATION_NUMBERTIMEDIRSPEEDGUSTTEMP_FMAXMINTEMP_CTIME_STRYEAR_MONTH
029820190601010600180.02.0NaN1.111111NaNNaN1.111111190601010600190601
12982019060101130020.06.0NaN0.000000NaNNaN0.000000190601011300190601
229820190601012000320.06.0NaN-1.111111NaNNaN-1.111111190601012000190601
329820190601020600320.010.0NaN0.555556NaNNaN0.555556190601020600190601
429820190601021300230.015.0NaN1.666667NaNNaN1.666667190601021300190601
\n", "
" ], "text/plain": [ " STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \\\n", "0 29820 190601010600 180.0 2.0 NaN 1.111111 NaN NaN \n", "1 29820 190601011300 20.0 6.0 NaN 0.000000 NaN NaN \n", "2 29820 190601012000 320.0 6.0 NaN -1.111111 NaN NaN \n", "3 29820 190601020600 320.0 10.0 NaN 0.555556 NaN NaN \n", "4 29820 190601021300 230.0 15.0 NaN 1.666667 NaN NaN \n", "\n", " TEMP_C TIME_STR YEAR_MONTH \n", "0 1.111111 190601010600 190601 \n", "1 0.000000 190601011300 190601 \n", "2 -1.111111 190601012000 190601 \n", "3 0.555556 190601020600 190601 \n", "4 1.666667 190601021300 190601 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"YEAR_MONTH\"] = data[\"TIME_STR\"].str.slice(start=0, stop=6)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice! Now we have \"labeled\" the rows based on information about day of the year and hour of the day.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 3.5\n", "\n", "Create a new column `'MONTH'` with information about the month without the year." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "# Use this cell to enter your solution." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "tags": [ "hide-cell", "remove_book_cell" ] }, "outputs": [], "source": [ "# Solution\n", "\n", "data[\"MONTH\"] = data[\"TIME_STR\"].str.slice(start=4, stop=6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping and aggregating data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic logic of grouping a DataFrame using `.groupby()`\n", "\n", "In the following sections, we want to calculate the average temperature for each month in our dataset. Here, we will learn how to use a `.groupby()` method which is a handy tool for compressing large amounts of data and computing statistics for subgroups. We will use the groupby method to calculate the average temperatures for each month trough these three main steps:\n", "\n", " 1. group the data based on year and month using `groupby()`\n", " 2. calculate the average temperature for each month (i.e. each group) \n", " 3. store the resulting rows into a DataFrame called `monthly_data`\n", " \n", "We have quite a few rows of weather data (N=198334), and several observations per day. Our goal is to create an aggreated DataFrame that would have only one row per month. The `.groupby()` takes as a parameter the name of the column (or a list of columns) that you want to use as basis for doing the grouping. Let's start by grouping our data based on unique year and month combination:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "grouped = data.groupby(\"YEAR_MONTH\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice, thas it would also be possible to create combinations of years and months \"on-the-fly\" if you have them in separate columns. In such case, grouping the data could be done as `grouped = data.groupby(['YEAR', 'MONTH'])`. Let's explore the new variable `grouped`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "826\n" ] } ], "source": [ "print(type(grouped))\n", "print(len(grouped))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a new object with type `DataFrameGroupBy` with 826 groups. 
In order to understand what just happened, let's also check the number of unique year and month combinations in our data:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "826" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"YEAR_MONTH\"].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The length of the grouped object should be the same as the number of unique values in the column we used for grouping (`YEAR_MONTH`). For each unique value, there is a group of data. Let's explore our grouped data further by checking the \"names\" of the groups (the first five). Here, we access the `keys` of the groups and convert them to a `list` so that we can slice and print only a few of those to the screen:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['190601', '190602', '190603', '190604', '190605']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(grouped.groups.keys())[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the contents for a group representing January 1906. We can get the values for that month from the grouped object using the `get_group()` method:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATION_NUMBERTIMEDIRSPEEDGUSTTEMP_FMAXMINTEMP_CTIME_STRYEAR_MONTHMONTH
029820190601010600180.02.0NaN1.111111NaNNaN1.11111119060101060019060101
12982019060101130020.06.0NaN0.000000NaNNaN0.00000019060101130019060101
229820190601012000320.06.0NaN-1.111111NaNNaN-1.11111119060101200019060101
329820190601020600320.010.0NaN0.555556NaNNaN0.55555619060102060019060101
429820190601021300230.015.0NaN1.666667NaNNaN1.66666719060102130019060101
.......................................
8829820190601301300360.06.0NaN-2.222222NaNNaN-2.22222219060130130019060101
8929820190601302000360.02.0NaN-6.111111NaNNaN-6.11111119060130200019060101
9029820190601310600340.06.0NaN-7.777778NaNNaN-7.77777819060131060019060101
9129820190601311300340.06.0NaN-1.111111NaNNaN-1.11111119060131130019060101
9229820190601312000360.06.0NaN-0.555556NaNNaN-0.55555619060131200019060101
\n", "

93 rows × 12 columns

\n", "
" ], "text/plain": [ " STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \\\n", "0 29820 190601010600 180.0 2.0 NaN 1.111111 NaN NaN \n", "1 29820 190601011300 20.0 6.0 NaN 0.000000 NaN NaN \n", "2 29820 190601012000 320.0 6.0 NaN -1.111111 NaN NaN \n", "3 29820 190601020600 320.0 10.0 NaN 0.555556 NaN NaN \n", "4 29820 190601021300 230.0 15.0 NaN 1.666667 NaN NaN \n", ".. ... ... ... ... ... ... ... ... \n", "88 29820 190601301300 360.0 6.0 NaN -2.222222 NaN NaN \n", "89 29820 190601302000 360.0 2.0 NaN -6.111111 NaN NaN \n", "90 29820 190601310600 340.0 6.0 NaN -7.777778 NaN NaN \n", "91 29820 190601311300 340.0 6.0 NaN -1.111111 NaN NaN \n", "92 29820 190601312000 360.0 6.0 NaN -0.555556 NaN NaN \n", "\n", " TEMP_C TIME_STR YEAR_MONTH MONTH \n", "0 1.111111 190601010600 190601 01 \n", "1 0.000000 190601011300 190601 01 \n", "2 -1.111111 190601012000 190601 01 \n", "3 0.555556 190601020600 190601 01 \n", "4 1.666667 190601021300 190601 01 \n", ".. ... ... ... ... \n", "88 -2.222222 190601301300 190601 01 \n", "89 -6.111111 190601302000 190601 01 \n", "90 -7.777778 190601310600 190601 01 \n", "91 -1.111111 190601311300 190601 01 \n", "92 -0.555556 190601312000 190601 01 \n", "\n", "[93 rows x 12 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Specify a month (as character string)\n", "month = \"190601\"\n", "\n", "# Select the group\n", "group1 = grouped.get_group(month)\n", "group1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, a single group contains a DataFrame with values only for that specific month. Let's check the DataType of this group:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(group1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, one group is a pandas DataFrame which is really useful, because it allows us to use all the familiar DataFrame methods for calculating statistics etc. for this specific group which we will see shortly. It is also possible to iterate over the groups in our `DataFrameGroupBy` object which can be useful if you need to conduct and apply some more complicated subtasks for each group. When doing so, it is important to understand that a single group in our `DataFrameGroupBy` actually contains not only the actual values, but also information about the `key` that was used to do the grouping. Hence, when iterating we need to assign the `key` and the values (i.e. the group) into separate variables. 
Let's see how we can iterate over the groups and print the key and the data from a single group (again using `break` to only see what is happening):" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Key:\n", " 190601\n", "\n", "First rows of data in this group:\n", " STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \\\n", "0 29820 190601010600 180.0 2.0 NaN 1.111111 NaN NaN \n", "1 29820 190601011300 20.0 6.0 NaN 0.000000 NaN NaN \n", "2 29820 190601012000 320.0 6.0 NaN -1.111111 NaN NaN \n", "3 29820 190601020600 320.0 10.0 NaN 0.555556 NaN NaN \n", "4 29820 190601021300 230.0 15.0 NaN 1.666667 NaN NaN \n", "\n", " TEMP_C TIME_STR YEAR_MONTH MONTH \n", "0 1.111111 190601010600 190601 01 \n", "1 0.000000 190601011300 190601 01 \n", "2 -1.111111 190601012000 190601 01 \n", "3 0.555556 190601020600 190601 01 \n", "4 1.666667 190601021300 190601 01 \n" ] } ], "source": [ "# Iterate over groups\n", "for key, group in grouped:\n", " # Print key and group\n", " print(\"Key:\\n\", key)\n", " print(\"\\nFirst rows of data in this group:\\n\", group.head())\n", "\n", " # Stop iteration with break command\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we can see that the `key` contains the name of the group (i.e. the unique value from `YEAR_MONTH`). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregating data with `groupby()`\n", "\n", "We can, for example, calculate the average values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median). To calculate the average temperature for each month, we can use the `mean()` function. Let's calculate the mean for all the weather-related data attributes in our group at once:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "DIR 230.322581\n", "SPEED 13.473118\n", "GUST NaN\n", "TEMP_F -0.537634\n", "TEMP_C -0.537634\n", "dtype: float64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Specify the columns that will be part of the calculation\n", "mean_cols = [\"DIR\", \"SPEED\", \"GUST\", \"TEMP_F\", \"TEMP_C\"]\n", "\n", "# Calculate the mean values all at one go\n", "mean_values = group1[mean_cols].mean()\n", "mean_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result, we get a pandas Series with mean values calculated for all columns in the group. Notice that if you want to convert this Series back into a DataFrame (which can be useful if you e.g. want to merge multiple groups), you can use the command `.to_frame().T`, which first converts the Series into a DataFrame and then transposes the order of the axes (the label names become the column names):" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIRSPEEDGUSTTEMP_FTEMP_C
0230.32258113.473118NaN-0.537634-0.537634
\n", "
" ], "text/plain": [ " DIR SPEED GUST TEMP_F TEMP_C\n", "0 230.322581 13.473118 NaN -0.537634 -0.537634" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to DataFrame\n", "mean_values.to_frame().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do a similar aggregation with all the groups in our data, we can actually combine the `groupby()` function with the aggregation step (such as taking the mean, median etc. of given columns), and finally restructure the resulting DataFrame a bit. This can be at first a bit harder to understand, but this is how you would do the grouping and aggregating the values as follows:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YEAR_MONTHDIRSPEEDGUSTTEMP_FTEMP_C
0190601230.32258113.473118NaN-0.537634-0.537634
1190602186.58227812.369048NaN-1.044974-1.044974
2190603244.83516510.645161NaN-2.485066-2.485066
3190604222.9545456.077778NaN2.7407412.740741
4190605167.5000005.847826NaN10.72282010.722820
.....................
821201906209.00000017.23217222.21172014.99071514.990715
822201907220.43064215.22774319.84653517.28876917.288769
823201908211.04447415.68328820.13174617.74708017.747080
824201909197.19300717.56923123.12377913.13237113.132371
825201910278.58333333.25000039.7916678.7500008.750000
\n", "

826 rows × 6 columns

\n", "
" ], "text/plain": [ " YEAR_MONTH DIR SPEED GUST TEMP_F TEMP_C\n", "0 190601 230.322581 13.473118 NaN -0.537634 -0.537634\n", "1 190602 186.582278 12.369048 NaN -1.044974 -1.044974\n", "2 190603 244.835165 10.645161 NaN -2.485066 -2.485066\n", "3 190604 222.954545 6.077778 NaN 2.740741 2.740741\n", "4 190605 167.500000 5.847826 NaN 10.722820 10.722820\n", ".. ... ... ... ... ... ...\n", "821 201906 209.000000 17.232172 22.211720 14.990715 14.990715\n", "822 201907 220.430642 15.227743 19.846535 17.288769 17.288769\n", "823 201908 211.044474 15.683288 20.131746 17.747080 17.747080\n", "824 201909 197.193007 17.569231 23.123779 13.132371 13.132371\n", "825 201910 278.583333 33.250000 39.791667 8.750000 8.750000\n", "\n", "[826 rows x 6 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The columns that we want to aggregate\n", "mean_cols = [\"DIR\", \"SPEED\", \"GUST\", \"TEMP_F\", \"TEMP_C\"]\n", "\n", "# Group and aggregate the data with one line\n", "monthly_data = data.groupby(\"YEAR_MONTH\")[mean_cols].mean().reset_index()\n", "monthly_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, aggregating the data in this way is fairly straightforward and fast process requiring merely a single command. So what did we actually do here? We i) grouped the data, ii) selected specific columns from the result (`mean_cols`), iii) calculated the mean for all of the selected columns of the groups, and finally 4) reset the index. Resetting the index at the end is not necessary, but by doing it, we turn the `YEAR_MONTH` values into a dedicated column in our data (which would be otherwise store as `index`) .\n", "\n", "What might not be obvious from this example is the fact that hidden in the background, each group is actually iterated over and the aggregation step is repeated for each group. For you to better understand what happens, we will next repeat the same process by iterating over groups and eventually creating a DataFrame that will contain the mean values for all those weather attributes that we were interested in. In this approach, we will first iterate over the groups, then calculate the mean values, store the result into a list, and finally merge the aggregated data into a DataFrame called `monthly_data`." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIRSPEEDGUSTTEMP_FTEMP_CYEAR_MONTH
0230.32258113.473118NaN-0.537634-0.537634190601
0186.58227812.369048NaN-1.044974-1.044974190602
0244.83516510.645161NaN-2.485066-2.485066190603
0222.9545456.077778NaN2.7407412.740741190604
0167.55.847826NaN10.7228210.72282190605
.....................
0209.017.23217222.2117214.99071514.990715201906
0220.43064215.22774319.84653517.28876917.288769201907
0211.04447415.68328820.13174617.7470817.74708201908
0197.19300717.56923123.12377913.13237113.132371201909
0278.58333333.2539.7916678.758.75201910
\n", "

826 rows × 6 columns

\n", "
" ], "text/plain": [ " DIR SPEED GUST TEMP_F TEMP_C YEAR_MONTH\n", "0 230.322581 13.473118 NaN -0.537634 -0.537634 190601\n", "0 186.582278 12.369048 NaN -1.044974 -1.044974 190602\n", "0 244.835165 10.645161 NaN -2.485066 -2.485066 190603\n", "0 222.954545 6.077778 NaN 2.740741 2.740741 190604\n", "0 167.5 5.847826 NaN 10.72282 10.72282 190605\n", ".. ... ... ... ... ... ...\n", "0 209.0 17.232172 22.21172 14.990715 14.990715 201906\n", "0 220.430642 15.227743 19.846535 17.288769 17.288769 201907\n", "0 211.044474 15.683288 20.131746 17.74708 17.74708 201908\n", "0 197.193007 17.569231 23.123779 13.132371 13.132371 201909\n", "0 278.583333 33.25 39.791667 8.75 8.75 201910\n", "\n", "[826 rows x 6 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create an empty list for storing the aggregated rows/DataFrames\n", "data_container = []\n", "\n", "# The columns that we want to aggregate\n", "mean_cols = [\"DIR\", \"SPEED\", \"GUST\", \"TEMP_F\", \"TEMP_C\"]\n", "\n", "# Iterate over the groups\n", "for key, group in grouped:\n", " # Calculate mean\n", " mean_values = group[mean_cols].mean()\n", "\n", " # Add the ´key´ (i.e. the date+time information) into the Series\n", " mean_values[\"YEAR_MONTH\"] = key\n", "\n", " # Convert the pd.Series into DataFrame and\n", " # append the aggregated values into a list as a DataFrame\n", " data_container.append(mean_values.to_frame().T)\n", "\n", "# After iterating all groups, merge the list of DataFrames\n", "monthly_data = pd.concat(data_container)\n", "monthly_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result, we get identical results as with the earlier approach that was done with a single line of code (except for the position of the `YEAR_MONTH` column).\n", "\n", "So which approach should you use? From the performance point of view, we recommend using the first approach (i.e. chaining) which does not require you to create a separate for loop, and is highly performant. However, this approach might be a bit more difficult to read and comprehend (the loop might be easier). Also sometimes you want to include additional processing steps inside the loop which can be hard accomplish by chaining everything into a single command. Hence, it is useful to know both of these approaches for doing aggregations with the data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case study: Detecting warm months\n", "\n", "Now, we have aggregated our data on monthly level and all we need to do is to check which years had the warmest January temperatures. A simple approach is to select all January values from the data and check which group(s) have the highest mean value. Before doing this, let's separate the month information from our timestamp following the same approach as previously we did when slicing the year-month combination:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIRSPEEDGUSTTEMP_FTEMP_CYEAR_MONTHMONTH
0230.32258113.473118NaN-0.537634-0.53763419060101
0186.58227812.369048NaN-1.044974-1.04497419060202
0244.83516510.645161NaN-2.485066-2.48506619060303
0222.9545456.077778NaN2.7407412.74074119060404
0167.55.847826NaN10.7228210.7228219060505
\n", "
" ], "text/plain": [ " DIR SPEED GUST TEMP_F TEMP_C YEAR_MONTH MONTH\n", "0 230.322581 13.473118 NaN -0.537634 -0.537634 190601 01\n", "0 186.582278 12.369048 NaN -1.044974 -1.044974 190602 02\n", "0 244.835165 10.645161 NaN -2.485066 -2.485066 190603 03\n", "0 222.954545 6.077778 NaN 2.740741 2.740741 190604 04\n", "0 167.5 5.847826 NaN 10.72282 10.72282 190605 05" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "monthly_data[\"MONTH\"] = monthly_data[\"YEAR_MONTH\"].str.slice(start=4, stop=6)\n", "monthly_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can select the values for January from our data and store it into a new variable `january_data`. We will also check the highest temperature values by sorting the DataFrame in a descending order:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIRSPEEDGUSTTEMP_FTEMP_CYEAR_MONTHMONTH
0213.76518227.497976NaN1.3022941.30229420080101
0213.7656921.138075NaN1.2939521.29395219750101
0245.45454518.259259NaN1.2734341.27343419830101
0266.51639316.827869NaN1.154911.1549119920101
0255.75221216.307359NaN1.0698691.06986919890101
\n", "
" ], "text/plain": [ " DIR SPEED GUST TEMP_F TEMP_C YEAR_MONTH MONTH\n", "0 213.765182 27.497976 NaN 1.302294 1.302294 200801 01\n", "0 213.76569 21.138075 NaN 1.293952 1.293952 197501 01\n", "0 245.454545 18.259259 NaN 1.273434 1.273434 198301 01\n", "0 266.516393 16.827869 NaN 1.15491 1.15491 199201 01\n", "0 255.752212 16.307359 NaN 1.069869 1.069869 198901 01" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "january_data = monthly_data.loc[monthly_data[\"MONTH\"] == \"01\"]\n", "january_data.sort_values(by=\"TEMP_C\", ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By looking at the order of `YEAR_MONTH` column, we can see that January 2020 indeed was on average the warmest month on record based on weather observations from Finland." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automating the analysis\n", "\n", "Now we have learned how to aggregate data using pandas. average temperatures for each month based on hourly weather observations. One of the most useful aspects of programming, is the ability to automate processes and repeat analyses such as these for any number of weather stations (assuming the data structure is the same). \n", "\n", "Hence, let's now see how we can repeat the previous data analysis steps for 15 weather stations located in different parts of Finland containing data for five years (2015-2019). The idea is that we will repeat the process for each input file using a (rather long) for loop. We will use the most efficient alternatives of the previously represented approaches, and finally will store the results in a single DataFrame for all stations. We will learn how to manipulate filepaths in Python using the `pathlib` module and see how we can list our input files in the data directory `data/finnish_stations`. We will store those paths to a variable `file_list`, so that we can use the file paths easily in the later steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Managing and listing filesystem paths\n", "\n", "In Python there are two commonly used approaches to manage and manipulate filepaths, namely `os.path` sub-module and a newer `pathlib` module (available since Python 3.4) which we will demonstrate here. The built-in module `pathlib` provides many useful functions for interacting and manipulating filepaths on your operating system. On the following, we have data in different sub-folders and we will learn how to use the `Path` class from the `pathlib` library to construct filepaths. Next, we will import and use the `Path` class and see how we can construct a filepath by joining a folder path and file name:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('data/finnish_stations/028360.txt')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pathlib import Path\n", "\n", "# Initialize the Path\n", "input_folder = Path(\"data/finnish_stations\")\n", "\n", "# Join folder path and filename\n", "fp = input_folder / \"028360.txt\"\n", "fp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we first initialized the `Path` object and stored it in variable `input_folder` by passing a relative path as a string to directory where all our files are located. 
Then we created a full filepath to the file `028360.txt` by adding a forward slash (`/`) character between the folder and the filename, which joins them together (easy!). In this case, our end result is something called a *`PosixPath`*, which is a filesystem path to a given file on Linux or Mac operating systems. If you ran the same commands on a Windows machine, the end result would be a *`WindowsPath`*. Hence, the output depends on which operating system you are using, but you do not need to worry about this, because both types of paths work exactly the same way.\n", "\n", "Both the `Path` object that we stored in the `input_folder` variable and the `PosixPath` object that we stored in the variable `fp` are actually quite versatile creatures, and we can do many useful things with them. For instance, we can find the parent folder where the file is located, extract the filename from the full path, test whether the file or directory actually exists, find various properties of the file (such as its size or creation time), and so on:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('data/finnish_stations')" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fp.parent" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'028360.txt'" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fp.name" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fp.exists()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size (bytes): 1347907\n", "Created (seconds since Epoch): 1649535015.1847472\n", "Modified (seconds since Epoch): 1649535015.1847472\n" ] } ], "source": [ "# File properties\n", "size_in_bytes = fp.stat().st_size\n", "creation_time = fp.stat().st_ctime\n", "modified_time = fp.stat().st_mtime\n", "print(\n", "    f\"Size (bytes): {size_in_bytes}\\nCreated (seconds since Epoch): {creation_time}\\nModified (seconds since Epoch): {modified_time}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also various other operations that you can carry out with `pathlib`, such as renaming files (`.rename()`) or creating folders (`.mkdir()`). You can see all available methods in the `pathlib` documentation [^pathlib]. One of the most useful tools in `pathlib` is the ability to list all files within a given folder using the `.glob()` method, which also allows you to add specific search criteria for listing only certain files from the directory:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "file_list = list(input_folder.glob(\"0*txt\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the result is stored in the variable `file_list` as a list. By default, `.glob()` produces something called a `generator`, which is a \"lazy iterator\", i.e. a special kind of object that allows you to iterate over items like a list, but without actually storing the data in memory. By enclosing the `.glob()` call in `list()`, we convert this generator into a normal Python list. 
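If you want the files in a predictable (alphabetical) order, you could use `sorted()` instead of `list()`; it also consumes the generator and returns a list. A small sketch, assuming the same `input_folder` as above:

```python
# sorted() materializes the generator and orders the paths alphabetically
file_list = sorted(input_folder.glob("0*txt"))
```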
Note that we're using the \* character as a wildcard, so any filename that starts with `0` and ends with `txt` will be added to the list of files. We specifically require the filename to start with `0` to avoid including metadata files in the list. Let's take a look at what we got as a result:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of files in the list: 15\n" ] }, { "data": { "text/plain": [ "[PosixPath('data/finnish_stations/029170.txt'),\n", " PosixPath('data/finnish_stations/028690.txt'),\n", " PosixPath('data/finnish_stations/029820.txt'),\n", " PosixPath('data/finnish_stations/029700.txt'),\n", " PosixPath('data/finnish_stations/028970.txt'),\n", " PosixPath('data/finnish_stations/029070.txt'),\n", " PosixPath('data/finnish_stations/029500.txt'),\n", " PosixPath('data/finnish_stations/029110.txt'),\n", " PosixPath('data/finnish_stations/028750.txt'),\n", " PosixPath('data/finnish_stations/029720.txt'),\n", " PosixPath('data/finnish_stations/029440.txt'),\n", " PosixPath('data/finnish_stations/028360.txt'),\n", " PosixPath('data/finnish_stations/029810.txt'),\n", " PosixPath('data/finnish_stations/029740.txt'),\n", " PosixPath('data/finnish_stations/029350.txt')]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Number of files in the list:\", len(file_list))\n", "file_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Iterate over input files and repeat the analysis\n", "\n", "Now we should have all the relevant file paths in `file_list`, and we can loop over the list using a for loop (again, we break the loop after the first iteration):" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data/finnish_stations/029170.txt\n" ] } ], "source": [ "for fp in file_list:\n", "    print(fp)\n", "    break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data that we have sampled is in regular CSV format, which we can read easily with the `pd.read_csv()` function: " ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
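Note that `pd.read_csv()` accepts `pathlib` `Path` objects directly, so the paths in `file_list` can be passed to it as-is, without converting them to strings. And because each input file is named after its station number, the `Path` API also makes it easy to pick that identifier out of the filename if you ever need it. A small sketch (`stem` is the filename without its suffix):

```python
# e.g. Path('data/finnish_stations/029170.txt') -> '029170'
station_id = fp.stem
```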
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
USAFYR--MODAHRMNDIRSPDGUSTEMPMAXMIN
029170201501010050240.07.0NaN34.0NaNNaN
129170201501010120260.05.0NaN36.0NaNNaN
229170201501010150250.08.0NaN34.0NaNNaN
329170201501010220250.08.0NaN36.0NaNNaN
429170201501010250240.08.0NaN36.0NaNNaN
\n", "
" ], "text/plain": [ " USAF YR--MODAHRMN DIR SPD GUS TEMP MAX MIN\n", "0 29170 201501010050 240.0 7.0 NaN 34.0 NaN NaN\n", "1 29170 201501010120 260.0 5.0 NaN 36.0 NaN NaN\n", "2 29170 201501010150 250.0 8.0 NaN 34.0 NaN NaN\n", "3 29170 201501010220 250.0 8.0 NaN 36.0 NaN NaN\n", "4 29170 201501010250 240.0 8.0 NaN 36.0 NaN NaN" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(fp)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have all the file paths to our weather observation datasets in a list, and we can start iterating over them and repeat the analysis steps for each file separately. We keep all the analytical steps inside a loop so that all of them are repeated to different stations. Finally, we will store the warmest January for each station in a list called `results` using a regular Python's `append()` method and merge the list of DataFrames into one by using `pd.concat()` function:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "STATION NUMBER: 29170\tNUMBER OF OBSERVATIONS: 120211\n", "STATION NUMBER: 28690\tNUMBER OF OBSERVATIONS: 119674\n", "STATION NUMBER: 29820\tNUMBER OF OBSERVATIONS: 40264\n", "STATION NUMBER: 29700\tNUMBER OF OBSERVATIONS: 120618\n", "STATION NUMBER: 28970\tNUMBER OF OBSERVATIONS: 120891\n", "STATION NUMBER: 29070\tNUMBER OF OBSERVATIONS: 40473\n", "STATION NUMBER: 29500\tNUMBER OF OBSERVATIONS: 40405\n", "STATION NUMBER: 29110\tNUMBER OF OBSERVATIONS: 81164\n", "STATION NUMBER: 28750\tNUMBER OF OBSERVATIONS: 81127\n", "STATION NUMBER: 29720\tNUMBER OF OBSERVATIONS: 81127\n", "STATION NUMBER: 29440\tNUMBER OF OBSERVATIONS: 120947\n", "STATION NUMBER: 28360\tNUMBER OF OBSERVATIONS: 35442\n", "STATION NUMBER: 29810\tNUMBER OF OBSERVATIONS: 35377\n", "STATION NUMBER: 29740\tNUMBER OF OBSERVATIONS: 121654\n", "STATION NUMBER: 29350\tNUMBER OF OBSERVATIONS: 116220\n" ] } ], "source": [ "# A list for storing the result\n", "results = []\n", "\n", "# Repeat the analysis steps for each input file:\n", "for fp in file_list:\n", " # Read the data from CSV file\n", " data = pd.read_csv(fp)\n", "\n", " # Rename the columns\n", " new_names = {\n", " \"USAF\": \"STATION_NUMBER\",\n", " \"YR--MODAHRMN\": \"TIME\",\n", " \"SPD\": \"SPEED\",\n", " \"GUS\": \"GUST\",\n", " \"TEMP\": \"TEMP_F\",\n", " }\n", " data = data.rename(columns=new_names)\n", "\n", " # Print info about the current input file (useful to understand how the process advances):\n", " print(\n", " f\"STATION NUMBER: {data.at[0,'STATION_NUMBER']}\\tNUMBER OF OBSERVATIONS: {len(data)}\"\n", " )\n", "\n", " # Create column\n", " col_name = \"TEMP_C\"\n", " data[col_name] = None\n", "\n", " # Convert tempetarues from Fahrenheits to Celsius\n", " data[\"TEMP_C\"] = data[\"TEMP_F\"].apply(fahr_to_celsius)\n", "\n", " # Convert TIME to string\n", " data[\"TIME_STR\"] = data[\"TIME\"].astype(str)\n", "\n", " # Parse year and month and convert them to numbers\n", " data[\"MONTH\"] = data[\"TIME_STR\"].str.slice(start=5, stop=6).astype(int)\n", " data[\"YEAR\"] = data[\"TIME_STR\"].str.slice(start=0, stop=4).astype(int)\n", "\n", " # Extract observations for the months of January\n", " january = data[data[\"MONTH\"] == 1]\n", "\n", " # Aggregate the data and get mean values\n", " columns = [\"TEMP_F\", \"TEMP_C\", \"STATION_NUMBER\"]\n", " monthly_mean = 
january.groupby(by=[\"YEAR\", \"MONTH\"])[columns].mean().reset_index()\n", "\n", "    # Sort the values and take the warmest January\n", "    warmest = monthly_mean.sort_values(by=\"TEMP_C\", ascending=False).head(1)\n", "\n", "    # Add to results\n", "    results.append(warmest)\n", "\n", "# Merge all the results into a single DataFrame\n", "results = pd.concat(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awesome! Now we have conducted the same analysis for 15 weather stations in Finland, and it did not take too many lines of code! We were able to follow how the process advances with the printed lines of information, i.e. we did some simple `logging` of the operations. Let's finally investigate our results:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
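A quick reminder: the loop above relies on the `fahr_to_celsius()` function that was defined earlier in this chapter. If you are running this part on its own, a minimal version of that helper could look like this (the standard conversion formula, which you can verify against the TEMP_F/TEMP_C pairs in the results below):

```python
def fahr_to_celsius(temp_fahrenheit):
    """Convert a temperature in degrees Fahrenheit to degrees Celsius."""
    return (temp_fahrenheit - 32) / 1.8
```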
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YEARMONTHTEMP_FTEMP_CSTATION_NUMBER
32018127.705512-2.38582729170.0
32018120.855503-6.19138728690.0
02015138.7087243.72706929820.0
02015138.0528153.36267529700.0
32018125.359090-3.68939528970.0
22017133.0143740.56354129070.0
02015139.1661853.98121429500.0
02015132.3067130.17039629110.0
02015127.514236-2.49209128750.0
02015134.9832541.65736329720.0
02015132.0623230.03462429440.0
32018119.644055-6.86441428360.0
02015139.8644764.36915429810.0
02015134.4055181.33639929740.0
02015128.617084-1.87939829350.0
\n", "
" ], "text/plain": [ " YEAR MONTH TEMP_F TEMP_C STATION_NUMBER\n", "3 2018 1 27.705512 -2.385827 29170.0\n", "3 2018 1 20.855503 -6.191387 28690.0\n", "0 2015 1 38.708724 3.727069 29820.0\n", "0 2015 1 38.052815 3.362675 29700.0\n", "3 2018 1 25.359090 -3.689395 28970.0\n", "2 2017 1 33.014374 0.563541 29070.0\n", "0 2015 1 39.166185 3.981214 29500.0\n", "0 2015 1 32.306713 0.170396 29110.0\n", "0 2015 1 27.514236 -2.492091 28750.0\n", "0 2015 1 34.983254 1.657363 29720.0\n", "0 2015 1 32.062323 0.034624 29440.0\n", "3 2018 1 19.644055 -6.864414 28360.0\n", "0 2015 1 39.864476 4.369154 29810.0\n", "0 2015 1 34.405518 1.336399 29740.0\n", "0 2015 1 28.617084 -1.879398 29350.0" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row in the results represents the warmest January at given `STATION_NUMBER` between the years 2015 and 2019. Based on the `YEAR` column, the warmest January in most of Finland's weather stations during this five-year period was in 2015. We can confirm this by checking the value counts of the `YEAR` column:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2015 10\n", "2018 4\n", "2017 1\n", "Name: YEAR, dtype: int64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[\"YEAR\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Footnotes\n", "\n", "[^noaanews]: \n", "[^pathlib]: " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }