{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Plotting with pandas and matplotlib\n", "\n", "At this point we are familiar with some of the features of pandas and have explored some very basic data visualizations at the [end of Chapter 3](../../chapter-03/nb/03-temporal-data.ipynb). Now, we will wade into visualizing our data in more detail, starting by using the built-in plotting options available directly in pandas. Much like the case of pandas being built upon numpy, plotting in pandas takes advantage of plotting features from the `matplotlib` [^matplotlib] plotting library. Plotting in pandas provides a basic framework for quickly visualizing our data but, as you'll see, we will also need to use features from matplotlib for more advanced formatting and to enhance our plots. In particular, we will use features from the `pyplot` [^pyplot] module in matplotlib, which provides MATLAB-like [^matlab] plotting. We will also briefly explore creating interactive plots using the `hvplot` [^hvplot] plotting library, which allows us to produce plots similar to those available in the `bokeh` plotting library [^bokeh] using plotting syntax very similar to that in pandas." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Creating a basic x-y plot\n", "\n", "The first step for creating a basic x-y plot is to import pandas and read in the data we want to plot from a file. We will be using a datetime index for our weather observation data as we [learned in Chapter 3](../../chapter-03/nb/03-temporal-data.ipynb). In this case, however, we'll include a few additional parameters in order to *read the data* with a datetime index. We will read in the data first, and then discuss what happened.\n", "\n", "Let's start by importing the libraries we will need (pandas and Matplotlib), and then read in the data." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "fp = \"data/029740.txt\"\n", "\n", "data = pd.read_csv(\n", "    fp,\n", "    delim_whitespace=True,\n", "    na_values=[\"*\", \"**\", \"***\", \"****\", \"*****\", \"******\"],\n", "    usecols=[\"YR--MODAHRMN\", \"TEMP\", \"MAX\", \"MIN\"],\n", "    parse_dates=[\"YR--MODAHRMN\"],\n", "    index_col=\"YR--MODAHRMN\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, let us now examine what is different here compared to how we read files in Chapter 3. There are two significant changes in the form of two new parameters: `parse_dates` and `index_col`.\n", "\n", "- `parse_dates` takes a Python list of the name(s) of the data file column(s) that contain date data. pandas will parse the data in these column(s) and convert it to the *datetime* data type. For many common date formats, pandas will automatically recognize and convert the date data.\n", "- `index_col` is used to state a column that should be used to index the data in the DataFrame. In this case, we end up with our date data as the DataFrame index. This is a very useful feature in pandas, as we will see below.\n", "\n", "Having read in the data file, we can now have a quick look at what we have using `data.head()`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n",
"<tr><th></th><th>TEMP</th><th>MAX</th><th>MIN</th></tr>\n",
"<tr><th>YR--MODAHRMN</th><th></th><th></th><th></th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>1952-01-01 00:00:00</th><td>36.0</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>1952-01-01 06:00:00</th><td>37.0</td><td>NaN</td><td>34.0</td></tr>\n",
"<tr><th>1952-01-01 12:00:00</th><td>39.0</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>1952-01-01 18:00:00</th><td>36.0</td><td>39.0</td><td>NaN</td></tr>\n",
"<tr><th>1952-01-02 00:00:00</th><td>36.0</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n", "</table>\n", "