{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started with data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is pandas?\n", "\n", "The [pandas Python library](https://pandas.pydata.org/) [^pandas] is a easy-to-use but powerful data analysis toolkit. It is a mature data analytics framework that is widely used among different fields of science. The development of pandas started in 2008 and it is now maintained by an active developer community ({cite}`McKinney2017`). \n", "\n", "Pandas is a \"high-level\" package, which means that it makes use of several other packages in the background. It combines the performance of powerful Python libraries such as [NumPy](https://numpy.org/) [^numpy], [matplotlib](https://matplotlib.org/) [^matplotlib] and [SciPy](https://scipy.org/) [^scipy].\n", "\n", "One of the most useful features of pandas is its ability to interact with numerous data formats. It supports reading and writing data e.g. from/to:\n", "\n", "- CSV\n", "- JSON\n", "- HTML\n", "- MS Excel\n", "- HDF5\n", "- Stata\n", "- SAS\n", "- Python Pickle format\n", "- SQL (Postgresql, MySQL, Oracle, MariaDB, etc.)\n", "\n", "For a full list of supported file formats and other features, see the official [pandas documentation and reference guide](https://pandas.pydata.org/pandas-docs/stable/) [^pandasdocs]." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Pandas data structures\n", "\n", "In pandas, table-like data are stored in two-dimensional `DataFrame` objects with labeled rows and columns. The pandas DataFrame was originally inspired by data frames that are in-built in the R programming language. You can think of the pandas DataFrame as a programmable spreadsheet. One-dimensional sequences of values are stored in pandas `Series`. One row or one column in a pandas DataFrame is actually a pandas Series. You can think of a pandas Series as a clever list. These pandas structures incorporate a number of things we've already encountered earlier in this book, such as indices, data stored in a collection, and data types.\n", "\n", "![_**Figure 3.1**. Illustration of pandas DaraFrame and pandas Series data structures. Pandas DataFrame is a 2-dimensional data structure used for storing and mainpulating table-like data (data with rows and columns). Pandas Series is a 1-dimensional data structure used for storing and manipulating an sequence of values._](./../img/pandas-structures-annotated.png)\n", "\n", "_**Figure 3.1**. Illustration of pandas DaraFrame and pandas Series data structures. Pandas DataFrame is a 2-dimensional data structure used for storing and mainpulating table-like data (data with rows and columns). Pandas Series is a 1-dimensional data structure used for storing and manipulating an sequence of values._\n", "\n", "As you can see from Figure 3.1, both the DataFrame and Series in pandas have an index that can be used to select values, but they also have column labels to identify columns in DataFrames. In the next sections, we will use many of these features to explore real-world data and learn some useful data analysis procedures. For a comprehensive overview of pandas data structures, we recommend you to have a look at pandas online documentation about data structures [^pandas_ds] as well as Chapter 5 in {cite}`McKinney2017`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading tabular data \n", "\n", "In the following sections, we will learn how to read data from a text file (\"Kumpula-June-2016-w-metadata.txt\") which contains weather observations from Kumpula, Helsinki (Finland). The data were retrieved from the [NOAA climate database](https://www.ncdc.noaa.gov/cdo-web/) [^noaa] and contain observed daily mean, minimum, and maximum temperatures from June 2016, recorded by a weather observation station in Kumpula. The file includes altogether 30 rows of observations (one per day). The first fifteen rows in the file look like the following:\n", "\n", "```\n", "# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki\n", "# for June 1-30, 2016\n", "# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND\n", "# Data processing: Extracted temperatures from raw data file, converted to\n", "# comma-separated format\n", "#\n", "# David Whipp - 02.10.2017\n", "\n", "YEARMODA,TEMP,MAX,MIN\n", "20160601,65.5,73.6,54.7\n", "20160602,65.8,80.8,55.0\n", "20160603,68.4,,55.6\n", "20160604,57.5,70.9,47.3\n", "20160605,51.4,58.3,43.2\n", "20160606,52.2,59.7,42.8\n", "```\n", "\n", "\n", "Now that we have familiarized ourselves with the data, we can continue and see how to read it using pandas. Let's start by importing the ``pandas`` module. It is customary to import pandas as `pd`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "deletable": true, "editable": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Next, we will read the input data file and store its contents in a variable called `data`, using the `read_csv()` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "data = pd.read_csv(\"data/Kumpula-June-2016-w-metadata.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our input file is a comma-delimited file, meaning that the columns in the data are separated by commas (`,`) on each row. The `read_csv()` function uses a comma as the default delimiter, hence we did not need to specify it separately in the command above. If all went as planned, you should now have a new variable `data` in memory that contains the input data. \n", "\n", "It is quite common to have some character other than a comma separating the columns from each other. `read_csv()` in pandas is a generic function for reading text-based data files, supporting files separated by commas, spaces, or other common separators. The `sep` parameter can be used to specify which character is used as a delimiter. For instance, `sep=';'` would assume that the columns in the data file are delimited by semicolons (`;`). For a full list of available parameters, please refer to the [pandas read_csv() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) [^readcsv], or run `help(pd.read_csv)`. \n", "\n", "Pandas also has several functions for parsing input data in other formats. For example, reading Excel files can easily be done using the `read_excel()` function. Another useful function is `read_pickle()`, which reads data stored in the Python pickle format. 
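As a quick sketch, these readers could be used as follows (the filenames here are hypothetical and not part of the book data):\n", "\n", "```python\n", "# Hypothetical example files, shown only to illustrate the reader functions\n", "excel_data = pd.read_excel(\"data/observations.xlsx\")\n", "pickle_data = pd.read_pickle(\"data/observations.pkl\")\n", "semicolon_data = pd.read_csv(\"data/observations.txt\", sep=\";\")\n", "```\n", "\n", "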
Check out the [pandas documentation about input and output tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#) [^pandas_io] for more details about reading data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the contents of our data. You can print the contents of a DataFrame (or a snippet of it) simply by calling `data`. However, quite often you only want to check the first (or last) n rows in your data. For doing that, we can use the `head()` method of the pandas DataFrame object, which will, by default, return the first 5 rows of the DataFrame. You can return any number of rows by adding a number inside the parentheses, such as 10, which will return the first ten rows of data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# Data file contents: Daily temperatures (meanminmax) for KumpulaHelsinki
0# for June 1-302016NaNNaN
1# Data source: https://www.ncdc.noaa.gov/cdo-w...NaNNaNNaN
2# Data processing: Extracted temperatures from...converted toNaNNaN
3# comma-separated formatNaNNaNNaN
4#NaNNaNNaN
5# David Whipp - 02.10.2017NaNNaNNaN
6YEARMODATEMPMAXMIN
72016060165.573.654.7
82016060265.880.855.0
92016060368.4NaN55.6
\n", "
" ], "text/plain": [ " # Data file contents: Daily temperatures (mean min \\\n", "0 # for June 1-30 2016 \n", "1 # Data source: https://www.ncdc.noaa.gov/cdo-w... NaN \n", "2 # Data processing: Extracted temperatures from... converted to \n", "3 # comma-separated format NaN \n", "4 # NaN \n", "5 # David Whipp - 02.10.2017 NaN \n", "6 YEARMODA TEMP \n", "7 20160601 65.5 \n", "8 20160602 65.8 \n", "9 20160603 68.4 \n", "\n", " max) for Kumpula Helsinki \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN NaN \n", "6 MAX MIN \n", "7 73.6 54.7 \n", "8 80.8 55.0 \n", "9 NaN 55.6 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, `tail()` will return the last 5 rows of the DataFrame." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# Data file contents: Daily temperatures (meanminmax) for KumpulaHelsinki
322016062669.677.760.3
332016062760.770.0NaN
342016062865.473.055.8
352016062965.873.2NaN
362016063065.772.759.2
\n", "
" ], "text/plain": [ " # Data file contents: Daily temperatures (mean min max) for Kumpula \\\n", "32 20160626 69.6 77.7 \n", "33 20160627 60.7 70.0 \n", "34 20160628 65.4 73.0 \n", "35 20160629 65.8 73.2 \n", "36 20160630 65.7 72.7 \n", "\n", " Helsinki \n", "32 60.3 \n", "33 NaN \n", "34 55.8 \n", "35 NaN \n", "36 59.2 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.tail()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The first lines of the dataframe look a bit weird. `NaN` stands for \"not a number\", and might indicate some problem with reading in the contents of the file. Plus, we expected about 30 lines of data, but the index values go up to 36 when we print the last rows of the `data` variable. Looks like we need to investigate this further." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "As we can observe, there are some metadata at the top of the file giving basic information about its contents and source. This isn't data we want to process, so we need to skip over that part of the file when we load it. Here are the 8 first rows of data in the text file (note that the 8th row is blank):" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "```\n", "# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki\n", "# for June 1-30, 2016\n", "# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND\n", "# Data processing: Extracted temperatures from raw data file, converted to\n", "# comma-separated format\n", "#\n", "# David Whipp - 02.10.2017\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Fortunately, skipping over rows is easy to do when reading data with pandas. We just need to add the `skiprows` parameter when we read the file, listing the number of rows to skip (8 in this case). Let's try reading the datafile again, and this time defining the `skiprows` parameter." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "data = pd.read_csv(\"data/Kumpula-June-2016-w-metadata.txt\", skiprows=8)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's check how the data looks like now:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YEARMODATEMPMAXMIN
02016060165.573.654.7
12016060265.880.855.0
22016060368.4NaN55.6
32016060457.570.947.3
42016060551.458.343.2
\n", "
" ], "text/plain": [ " YEARMODA TEMP MAX MIN\n", "0 20160601 65.5 73.6 54.7\n", "1 20160602 65.8 80.8 55.0\n", "2 20160603 68.4 NaN 55.6\n", "3 20160604 57.5 70.9 47.3\n", "4 20160605 51.4 58.3 43.2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YEARMODATEMPMAXMIN
252016062669.677.760.3
262016062760.770.0NaN
272016062865.473.055.8
282016062965.873.2NaN
292016063065.772.759.2
\n", "
" ], "text/plain": [ " YEARMODA TEMP MAX MIN\n", "25 20160626 69.6 77.7 60.3\n", "26 20160627 60.7 70.0 NaN\n", "27 20160628 65.4 73.0 55.8\n", "28 20160629 65.8 73.2 NaN\n", "29 20160630 65.7 72.7 59.2" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.tail()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Looks good! We seem to have 30 rows of data (index running from 0 to 30), and meanigful column names and values.\n", "\n", "Note that pandas DataFrames have labelled axes (rows and columns). In our sample data, the rows labeled with an index value (`0` to `29`), and columns labelled `YEARMODA`, `TEMP`, `MAX`, and `MIN`. Later on, we will learn how to use these labels for selecting and updating subsets of the data. " ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "It is also possible to read only specific columns from the data when using the `read_csv()` function. You can achieve this using the `usecols` parameter when reading the file which accepts a list of column names that will be included in the resulting DataFrame. Also positional indices (e.g. `[0, 1]`) corresponding the position of the column in the file can be used to specify which columns should be read. For additional details, see the [pandas read_csv() documention](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). Next, we will read the file `Kumpula-June-2016-w-metadata.txt` again and store its contents into a new variable called `temp_data`. In this case, we will only read the columns `YEARMODA` and `TEMP`, meaning that the new variable `temp_data` should have 30 rows and 2 columns:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YEARMODATEMP
02016060165.5
12016060265.8
22016060368.4
32016060457.5
42016060551.4
\n", "
" ], "text/plain": [ " YEARMODA TEMP\n", "0 20160601 65.5\n", "1 20160602 65.8\n", "2 20160603 68.4\n", "3 20160604 57.5\n", "4 20160605 51.4" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp_data = pd.read_csv(\n", " \"data/Kumpula-June-2016-w-metadata.txt\", skiprows=8, usecols=[\"YEARMODA\", \"TEMP\"]\n", ")\n", "\n", "temp_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result, we now have only two columns instead of the original four. Using the `usecols` function to limit the number of columns can be useful when having data files with possibly tens or even hundreds of columns. Typically you are not interested in all of them, but you want focus on only a few important ones which you can select already when reading the data. " ] }, { "cell_type": "markdown", "metadata": { "editable": true }, "source": [ "## Writing data to a file\n", "\n", "Naturally, it is also possible to write the data from pandas to a file. Pandas supports many common output formats such as CSV files, MS Excel, xml and others. For full details about supported file formats and writer functions, see the [pandas i/o documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5) [^pandas_io]. We will store our data back into a csv-file called `Kumpula_temp_results_June_2016.csv`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# define output filename\n", "output_fp = \"Kumpula_temps_June_2016.csv\"\n", "\n", "# Save dataframe to csv\n", "data.to_csv(output_fp, sep=\",\")" ] }, { "cell_type": "markdown", "metadata": { "editable": true }, "source": [ "![_**Figure 3.2**. The output CSV file opened in JupyterLab._](../img/pandas-save-file-1.png)\n", "\n", "_**Figure 3.2**. The output CSV file opened in JupyterLab._\n", "\n", "\n", "Figure 3.2 shows the output file. As we can see, the first column contains the index value without any column name. Also the temperature values are represented as floating point number with a presision of 1 decimal. \n", "\n", "Let's see how we can modify these aspects of the output file using pandas and save the data again without the index, and with zero decimals. Omitting the index can be achieved using the `index=False` parameter. Decimal precision can be specified using the `float_format` parameter. Format text `\"%.0f\"` will set decimal precision to zero (while `\"%.2f\"` would set it to 2 decimals and so on)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# define output filename\n", "output_fp2 = \"Kumpula_temperatures_integers_June_2016.csv\"\n", "\n", "# Save dataframe to csv\n", "data.to_csv(output_fp2, sep=\",\", index=False, float_format=\"%.0f\")" ] }, { "cell_type": "markdown", "metadata": { "editable": true }, "source": [ "\n", "\n", "![_**Figure 3.3**. The formatted output CSV file opened in JupyterLab._](../img/pandas-save-file-2.png)\n", "\n", "_**Figure 3.3**. The formatted output CSV file opened in JupyterLab._\n", "\n", "Figure 3.3 shows the \"cleaner\" output file that lacks the index column and temperature values are rounded to integers." 
] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## DataFrame properties\n", "\n", "Now we will continue exploring the full dataset that we have stored in the variable `data`. A normal first step when you load new data is to explore the dataset a bit to understand how the data are structured and what kind of values are stored in there. Let's start by checking the size of our DataFrame. We can use the `len()` function, similar to how we use it with lists, to check how many rows we have:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can also get a quick sense of the size of the dataset using the `shape` attribute. It returns a `tuple` with the number of rows as the first element and the number of columns as the second element:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "(30, 4)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.shape" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Here we see that our dataset has 30 rows and 4 columns, just as we saw above. Note that we do not use parentheses after the word `shape` when accessing attributes. `shape` is one of several ``attributes`` related to a pandas DataFrame object. Attributes are typically used to store useful information (or *metadata*) about the object at hand. We will see many more examples of these throughout the book. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also check the column names we have in our DataFrame (yes, the column names are stored in an attribute as well!). We already saw the column names when we checked the first 5 rows using `data.head()`, but often it is useful to access the column names directly. 
You can check the column names by calling `data.columns` (which returns an Index object containing the column labels) or `data.columns.values`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "array(['YEARMODA', 'TEMP', 'MAX', 'MIN'], dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns.values" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can also find information about the row identifiers using the `index` attribute:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=30, step=1)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Here we see how the data are indexed, starting at 0 and stopping at 30 (exclusive, so the last row has index 29), with an increment of 1 between each value. This is basically the same way in which Python lists are indexed; however, pandas also allows other ways of identifying the rows. DataFrame indices could, for example, be character strings or date objects. We will learn more about resetting the index later. What about the data types of each column in our DataFrame? We can check the data types of all columns at once using `pandas.DataFrame.dtypes`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "YEARMODA      int64\n", "TEMP        float64\n", "MAX         float64\n", "MIN         float64\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print data types\n", "data.dtypes" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true, "lines_to_next_cell": 2 }, "source": [ "Here we see that `YEARMODA` is an integer value (with 64-bit precision; ``int64``), while the other values are all decimal values with 64-bit precision (``float64``)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 3.1\n", "\n", "How would you print out the number of columns in our DataFrame?" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "# Use this cell to enter your solution." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [ "hide-cell", "remove_book_cell" ] }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution\n", "\n", "len(data.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When conducting data analysis, it is very common that you want to choose certain columns from the data for further processing. We can select specific columns based on the column names. The basic syntax is `dataframe[value]`, where value can be a single column name or a list of column names. 
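Note that selecting multiple columns requires passing a list of names inside the selection brackets, which is why you will see double square brackets below. A quick sketch of the two forms:\n", "\n", "```python\n", "data[\"TEMP\"]                 # a single column name -> returns a pandas Series\n", "data[[\"YEARMODA\", \"TEMP\"]]   # a list of column names -> returns a DataFrame\n", "```\n", "\n", "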
Let's start by selecting two columns, `'YEARMODA'` and `'TEMP'`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YEARMODATEMP
02016060165.5
12016060265.8
22016060368.4
32016060457.5
42016060551.4
\n", "
" ], "text/plain": [ " YEARMODA TEMP\n", "0 20160601 65.5\n", "1 20160602 65.8\n", "2 20160603 68.4\n", "3 20160604 57.5\n", "4 20160605 51.4" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selection = data[[\"YEARMODA\", \"TEMP\"]]\n", "selection.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also check the data type of this selection:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(selection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the subset is still a pandas DataFrame, and we are able to use all the methods and attributes related to a pandas DataFrame also with this subset. For example, we can check the shape:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(30, 2)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selection.shape" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can also access a single column of the data based on the column name:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "0 65.5\n", "1 65.8\n", "2 68.4\n", "3 57.5\n", "4 51.4\n", "Name: TEMP, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"TEMP\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about the type of the column itself?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(data[\"TEMP\"])" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Each column (and each row) in a pandas data frame is indeed a pandas Series. Notice that you can also retreive a column using a different syntax `data.TEMP`. This syntax works only if the column name is a valid name for a Python variable (e.g. the column name should not contain whitespace). The syntax `data[\"column\"]` works for all kinds of column names, so we recommend using this approach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unique values\n", "\n", "Sometimes it is useful to extract the unique values that you have in your column. We can do that by using `unique()` method:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([65.5, 65.8, 68.4, 57.5, 51.4, 52.2, 56.9, 54.2, 49.4, 49.5, 54. ,\n", " 55.4, 58.3, 59.7, 63.4, 57.8, 60.4, 57.3, 56.3, 59.3, 62.6, 61.7,\n", " 60.9, 61.1, 65.7, 69.6, 60.7, 65.4])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"TEMP\"].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result we get an array of unique values in that column. 
We can also directly access the number of unique values using the `nunique()` method:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There were 28 days with unique mean temperatures in June 2016.\n" ] } ], "source": [ "print(\n", "    \"There were\",\n", "    data[\"TEMP\"].nunique(),\n", "    \"days with unique mean temperatures in June 2016.\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Descriptive statistics" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Pandas DataFrames and Series contain useful methods for getting summary statistics. Available methods include `mean()`, `median()`, `min()`, `max()`, and `std()` (the standard deviation). The statistics can be calculated at the Series level (a single column) or for all columns of the DataFrame at once. To check e.g. the mean temperature in our input data, i.e. focusing on a single column (*Series*), we can do the following: " ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "59.730000000000004" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"TEMP\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output in this case will be a single floating point number representing the mean temperature, 59.73 degrees Fahrenheit. To get the mean statistics for all columns in the DataFrame, we can call `mean()` in a similar manner, but without specifying the column name:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "YEARMODA    2.016062e+07\n", "TEMP        5.973000e+01\n", "MAX         6.804074e+01\n", "MIN         5.125714e+01\n", "dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.mean()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Notice that in this case, the result is a Series showing the mean values for each column. For an overview of the basic statistics for all attributes in the data, we can use the `describe()` method:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YEARMODATEMPMAXMIN
count3.000000e+0130.00000027.00000028.000000
mean2.016062e+0759.73000068.04074151.257143
std8.803408e+005.4754726.5055755.498985
min2.016060e+0749.40000054.10000041.700000
25%2.016061e+0756.45000064.65000046.975000
50%2.016062e+0760.05000069.10000053.100000
75%2.016062e+0764.90000072.05000055.600000
max2.016063e+0769.60000080.80000060.300000
\n", "
" ], "text/plain": [ " YEARMODA TEMP MAX MIN\n", "count 3.000000e+01 30.000000 27.000000 28.000000\n", "mean 2.016062e+07 59.730000 68.040741 51.257143\n", "std 8.803408e+00 5.475472 6.505575 5.498985\n", "min 2.016060e+07 49.400000 54.100000 41.700000\n", "25% 2.016061e+07 56.450000 64.650000 46.975000\n", "50% 2.016062e+07 60.050000 69.100000 53.100000\n", "75% 2.016062e+07 64.900000 72.050000 55.600000\n", "max 2.016063e+07 69.600000 80.800000 60.300000" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result, we get the number of values that are not None for each column (count) as well as the basic statistics and quartiles (min, 25%, 50%, 75% and max). It is also possible to get other DataFrame specific information, such as the index dtype and columns, non-null values and memory usage by calling `info()`: " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 30 entries, 0 to 29\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 YEARMODA 30 non-null int64 \n", " 1 TEMP 30 non-null float64\n", " 2 MAX 27 non-null float64\n", " 3 MIN 28 non-null float64\n", "dtypes: float64(3), int64(1)\n", "memory usage: 1.1 KB\n" ] } ], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Different ways of constructing pandas objects\n", "\n", "Most often we create pandas objects by reading in data from an external source, such as a text file. Here, we will briefly show different ways how you can create pandas objects from Python objects (lists and dictionaries). If you have a very long lists of numbers, creating a pandas Series (or numpy.array) will allow you to interact with these values more efficiently in terms of computing time that using and working with a pure Python list. This is how you can create a pandas Series from a list:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 4.0\n", "1 5.0\n", "2 6.0\n", "3 7.0\n", "dtype: float64\n" ] } ], "source": [ "number_series = pd.Series([4, 5, 6, 7.0])\n", "print(number_series)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Note that Pandas is smart about the conversion, detecting a single floating point value (`7.0`) and assigning all values in the Series the data type float64. If needed, you can also set a custom index when creating the object:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 4.0\n", "b 5.0\n", "c 6.0\n", "d 7.0\n", "dtype: float64\n" ] } ], "source": [ "number_series = pd.Series([4, 5, 6, 7.0], index=[\"a\", \"b\", \"c\", \"d\"])\n", "print(number_series)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(number_series)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about combining several lists and construct a DataFrame from them? Certainly. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "How about combining several lists and constructing a DataFrame from them? Certainly. Let's first create a few lists, each having the same number of items:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "stations = [\"Hanko\", \"Heinola\", \"Kaisaniemi\", \"Malmi\"]\n", "latitudes = [59.77, 61.2, 60.18, 60.25]\n", "longitudes = [22.95, 26.05, 24.94, 25.05]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create a pandas DataFrame by using the `pandas.DataFrame` constructor and passing a *Python dictionary* `{\"column_1\": list_1, \"column_2\": list_2, ...}` to it, indicating the structure of our data:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stationlatlon
0Hanko59.7722.95
1Heinola61.2026.05
2Kaisaniemi60.1824.94
3Malmi60.2525.05
\n", "
" ], "text/plain": [ " station lat lon\n", "0 Hanko 59.77 22.95\n", "1 Heinola 61.20 26.05\n", "2 Kaisaniemi 60.18 24.94\n", "3 Malmi 60.25 25.05" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_data = pd.DataFrame(data={\"station\": stations, \"lat\": latitudes, \"lon\": longitudes})\n", "new_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to create pandas DataFrame from a list of dictionaries. Hence, this approach is quite similar as the previous example, but we organize the data a bit differently. Having a list of dictionaries is common if you happen to work with ``JSON`` files, which are commonly used when reading data from the web (e.g. via ``Application Programming Interface``):" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stationlatlon
0Hanko59.7722.95
1Heinola61.2026.05
2Kaisaniemi60.1824.94
3Malmi60.2525.05
\n", "
" ], "text/plain": [ " station lat lon\n", "0 Hanko 59.77 22.95\n", "1 Heinola 61.20 26.05\n", "2 Kaisaniemi 60.18 24.94\n", "3 Malmi 60.25 25.05" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dictionaries = [\n", " {\"station\": \"Hanko\", \"lat\": 59.77, \"lon\": 22.95},\n", " {\"station\": \"Heinola\", \"lat\": 61.2, \"lon\": 26.05},\n", " {\"station\": \"Kaisaniemi\", \"lat\": 60.18, \"lon\": 24.94},\n", " {\"station\": \"Malmi\", \"lat\": 60.25, \"lon\": 25.05},\n", "]\n", "\n", "# Pass the list into the DataFrame constructor\n", "new_data_2 = pd.DataFrame(dictionaries)\n", "new_data_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result, we got an identical DataFrame as in our first example above. Notice, that sometimes you might start working with an empty DataFrame and only later \"populate\" it with new columns:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Empty DataFrame\n", "Columns: []\n", "Index: []\n" ] } ], "source": [ "df = pd.DataFrame()\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have an empty DataFrame, and we can add new columns to it based on the lists we created earlier. Notice that the following approach requires that the length of all the lists that you want to add are equal: " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lonlat
022.9559.77
126.0561.20
224.9460.18
325.0560.25
\n", "
" ], "text/plain": [ " lon lat\n", "0 22.95 59.77\n", "1 26.05 61.20\n", "2 24.94 60.18\n", "3 25.05 60.25" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"lon\"] = longitudes\n", "df[\"lat\"] = latitudes\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Footnotes\n", "\n", "[^pandas]: \n", "[^numpy]: \n", "[^matplotlib]: \n", "[^scipy]: \n", "[^pandasdocs]: \n", "[^pandas_ds]: \n", "[^noaa]: \n", "[^readcsv]: \n", "[^pandas_io]: \n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 4 }