{ "cells": [ { "cell_type": "markdown", "id": "8ef7115f", "metadata": {}, "source": [ "# Preparing GeoDataFrames from geographic data\n", "\n", "Reading data into Python is usually the first step of an analysis workflow. There are many GIS data formats available, such as [Shapefile](https://en.wikipedia.org/wiki/Shapefile) [^shp], [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON) [^GeoJson], [KML](https://en.wikipedia.org/wiki/Keyhole_Markup_Language) [^KML], and [GeoPackage](https://en.wikipedia.org/wiki/GeoPackage) [^GPKG]. `geopandas` can read data from all of these formats (plus many more). \n", "\n", "This tutorial shows some typical examples of how to read (and write) data from different sources. The main point of this section is to demonstrate the basic syntax for reading and writing data using short code snippets. You can find the example data sets in the data folder. Most of the example databases do not exist, however, so you should modify the example syntax according to your own setup." ] }, { "cell_type": "markdown", "id": "9481338d", "metadata": {}, "source": [ "## Reading vector data\n", "\n", "In `geopandas`, we can use the generic function `.read_file()` to read various vector data formats. When reading files, `geopandas` passes the data on to the `fiona` library under the hood. This means that you can read and write all data formats supported by `fiona` with `geopandas`. 
" ] }, { "cell_type": "code", "execution_count": 1, "id": "386d2ddd-6f51-4ac6-bd03-4916f0bf51f7", "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "import os\n", "\n", "os.environ[\"USE_PYGEOS\"] = \"0\"\n", "import geopandas" ] }, { "cell_type": "code", "execution_count": 2, "id": "b3d20a76", "metadata": {}, "outputs": [], "source": [ "import geopandas as gpd\n", "import fiona" ] }, { "cell_type": "markdown", "id": "dd6d38a6", "metadata": {}, "source": [ "Let's check which drivers are supported by `fiona`." ] }, { "cell_type": "code", "execution_count": 3, "id": "02103a1a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'DXF': 'rw',\n", " 'CSV': 'raw',\n", " 'OpenFileGDB': 'raw',\n", " 'ESRIJSON': 'r',\n", " 'ESRI Shapefile': 'raw',\n", " 'FlatGeobuf': 'raw',\n", " 'GeoJSON': 'raw',\n", " 'GeoJSONSeq': 'raw',\n", " 'GPKG': 'raw',\n", " 'GML': 'rw',\n", " 'OGR_GMT': 'rw',\n", " 'GPX': 'rw',\n", " 'Idrisi': 'r',\n", " 'MapInfo File': 'raw',\n", " 'DGN': 'raw',\n", " 'PCIDSK': 'raw',\n", " 'OGR_PDS': 'r',\n", " 'S57': 'r',\n", " 'SQLite': 'raw',\n", " 'TopoJSON': 'r'}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fiona.supported_drivers" ] }, { "cell_type": "markdown", "id": "3a237aa1-cca8-4e0c-ac6e-86c28c45c846", "metadata": {}, "source": [ "In the list of supported drivers, `r` is for file formats `fiona` can read, and `w` is for file formats it can write. Letter `a` marks formats for which `fiona` can append new data to existing files." ] }, { "cell_type": "markdown", "id": "223208cc", "metadata": {}, "source": [ "Let's read in some sample data to see the basic syntax." ] }, { "cell_type": "code", "execution_count": 4, "id": "2ad209e1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>fid</th><th>pop2019</th><th>tract</th><th>geometry</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1.0</td><td>6070.0</td><td>002422</td><td>POLYGON ((615643.487 3338728.496, 615645.477 3...</td></tr>\n",
"    <tr><th>1</th><td>2.0</td><td>2203.0</td><td>001751</td><td>POLYGON ((618576.586 3359381.053, 618614.330 3...</td></tr>\n",
"    <tr><th>2</th><td>3.0</td><td>7419.0</td><td>002411</td><td>POLYGON ((619200.163 3341784.654, 619270.849 3...</td></tr>\n",
"    <tr><th>3</th><td>4.0</td><td>4229.0</td><td>000401</td><td>POLYGON ((621623.757 3350508.165, 621656.294 3...</td></tr>\n",
"    <tr><th>4</th><td>5.0</td><td>4589.0</td><td>002313</td><td>POLYGON ((621630.247 3345130.744, 621717.926 3...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " fid pop2019 tract geometry\n", "0 1.0 6070.0 002422 POLYGON ((615643.487 3338728.496, 615645.477 3...\n", "1 2.0 2203.0 001751 POLYGON ((618576.586 3359381.053, 618614.330 3...\n", "2 3.0 7419.0 002411 POLYGON ((619200.163 3341784.654, 619270.849 3...\n", "3 4.0 4229.0 000401 POLYGON ((621623.757 3350508.165, 621656.294 3...\n", "4 5.0 4589.0 002313 POLYGON ((621630.247 3345130.744, 621717.926 3..." ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read Esri Shapefile\n", "fp = \"data/Austin/austin_pop_2019.shp\"\n", "data = gpd.read_file(fp)\n", "data.head()" ] }, { "cell_type": "markdown", "id": "b1bb73dd-e272-4c59-b589-4cf7242f7b3e", "metadata": {}, "source": [ "The same syntax works for other common vector data formats. " ] }, { "cell_type": "code", "execution_count": 5, "id": "af1ac72f-c900-4fca-9cd8-7f78e813b700", "metadata": {}, "outputs": [], "source": [ "# Read file from GeoPackage\n", "fp = \"data/Austin/austin_pop_2019.gpkg\"\n", "data = gpd.read_file(fp)\n", "\n", "# Read file from GeoJSON\n", "fp = \"data/Austin/austin_pop_2019.geojson\"\n", "data = gpd.read_file(fp)\n", "\n", "# Read file from MapInfo Tab\n", "fp = \"data/Austin/austin_pop_2019.tab\"\n", "data = gpd.read_file(fp)" ] }, { "cell_type": "markdown", "id": "0bf98fe7-1ce6-4e45-86fe-a23dbed0cb1e", "metadata": {}, "source": [ "Some file formats, such as GeoPackage and File Geodatabase, may contain multiple layers with different names, which can be specified using the `layer` parameter. Our example GeoPackage file has only one layer, with the same name as the file, so we don't actually need to specify it to read in the data." 
] }, { "cell_type": "code", "execution_count": 6, "id": "a4daa219-0b52-4667-bfc7-33b6a34e4996", "metadata": {}, "outputs": [], "source": [ "# Read specific layer from GeoPackage\n", "fp = \"data/Austin/austin_pop_2019.gpkg\"\n", "data = gpd.read_file(fp, layer=\"austin_pop_2019\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "b90aa4fb-5f0c-4bf8-a995-2d769d1510f9", "metadata": {}, "outputs": [], "source": [ "# Read file from File Geodatabase\n", "# fp = \"data/Finland/finland.gdb\"\n", "# data = gpd.read_file(fp, driver=\"OpenFileGDB\", layer=\"municipalities\")" ] }, { "cell_type": "markdown", "id": "cec9d506", "metadata": {}, "source": [ "Some drivers, such as KML, are not included in the list of supported drivers shown above. Before reading a KML file, we need to enable the driver by adding it to the dictionary of supported drivers in `fiona`." ] }, { "cell_type": "code", "execution_count": 8, "id": "bf159391", "metadata": {}, "outputs": [], "source": [ "# Enable KML driver for reading and writing\n", "fiona.supported_drivers[\"KML\"] = \"rw\"\n", "\n", "# Read file from KML\n", "fp = \"data/Austin/austin_pop_2019.kml\"\n", "# data = gpd.read_file(fp)" ] }, { "cell_type": "markdown", "id": "dcccac68-ba0e-4c2f-83fb-159f477146e8", "metadata": {}, "source": [ "## Writing vector data\n", "\n", "We can save spatial data to various vector data formats using the `.to_file()` method in `geopandas`, which also relies on `fiona`. The output file format can be specified with the `driver` parameter. For most file formats this is not needed, however, because `geopandas` can infer the driver from the file extension. 
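
As a minimal sketch of what inferring the driver from the file extension means, the standalone snippet below maps a few extensions to driver names. The helper `guess_driver` and its lookup table are hypothetical and written only for this illustration; they are not the actual `geopandas` or `fiona` implementation, and the table covers only a few of the drivers listed earlier in this section.

```python
from pathlib import Path

# Hypothetical lookup table for illustration only. The driver names are
# taken from the fiona.supported_drivers output shown earlier.
EXTENSION_TO_DRIVER = {
    '.shp': 'ESRI Shapefile',
    '.gpkg': 'GPKG',
    '.geojson': 'GeoJSON',
    '.tab': 'MapInfo File',
}


def guess_driver(path):
    # Return a driver name for a known file extension, otherwise None.
    suffix = Path(path).suffix.lower()
    return EXTENSION_TO_DRIVER.get(suffix)


# Try the helper with a few of the file names used in this section
for name in ['austin_pop_2019.shp', 'austin_pop_2019.gpkg', 'austin_pop_2019.kml']:
    print(name, guess_driver(name))
```

In this sketch the `.kml` extension is not in the table, so the helper returns `None`. In a case like that, it is safer to pass the `driver` parameter to `.to_file()` explicitly.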
" ] }, { "cell_type": "code", "execution_count": 9, "id": "a03816fd-1c7c-44de-842a-4025527ecf01", "metadata": {}, "outputs": [], "source": [ "# Write to Shapefile (just make a copy)\n", "outfp = \"data/temp/austin_pop_2019.shp\"\n", "data.to_file(outfp)\n", "\n", "# Write to Geopackage (just make a copy)\n", "outfp = \"data/Temp/austin_pop_2019.gpkg\"\n", "data.to_file(outfp, driver=\"GPKG\")\n", "\n", "# Write to GeoJSON (just make a copy)\n", "outfp = \"data/Temp/austin_pop_2019.geojson\"\n", "data.to_file(outfp, driver=\"GeoJSON\")\n", "\n", "# Write to MapInfo Tab (just make a copy)\n", "outfp = \"data/Temp/austin_pop_2019.tab\"\n", "data.to_file(outfp)\n", "\n", "# Write to same FileGDB (just add a new layer) - requires additional package installations(?)\n", "# outfp = \"data/finland.gdb\"\n", "# data.to_file(outfp, driver=\"FileGDB\", layer=\"municipalities_copy\")\n", "\n", "# Write to KML (just make a copy)\n", "outfp = \"data/Temp/austin_pop_2019.kml\"\n", "# data.to_file(outfp, driver=\"KML\")" ] }, { "cell_type": "code", "execution_count": null, "id": "8c230915", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0fb88cd4", "metadata": {}, "source": [ "## Creating a GeoDataFrame from scratch\n", "\n", "It is possible to create spatial data from scratch by using `shapely`'s geometric objects and `geopandas`. This is useful as it makes it easy to convert, for example, a text file that contains coordinates into spatial data layers. Let's first try creating a simple `GeoDataFrame` based on coordinate information that represents the outlines of the [Senate square in Helsinki, Finland](https://fi.wikipedia.org/wiki/Senaatintori). Here are the coordinates based on which we can create a `Polygon` object using `shapely." 
] }, { "cell_type": "code", "execution_count": 10, "id": "aec1fbab-b4e0-4641-816a-cdc37ab56432", "metadata": {}, "outputs": [], "source": [ "from shapely.geometry import Polygon\n", "\n", "# Coordinates of the Helsinki Senate Square in decimal degrees\n", "coordinates = [\n", "    (24.950899, 60.169158),\n", "    (24.953492, 60.169158),\n", "    (24.953510, 60.170104),\n", "    (24.950958, 60.169990),\n", "]\n", "\n", "# Create a Shapely polygon from the coordinate-tuple list\n", "poly = Polygon(coordinates)" ] }, { "cell_type": "markdown", "id": "727717a5-e176-4d55-bd30-a444bfc6758c", "metadata": {}, "source": [ "Now we can use this polygon and `geopandas` to create a `GeoDataFrame` from scratch. The data can be passed in as a list-like object. In our case, we will have only one row and one column of data. We can pass the polygon inside a list and name the column `geometry` so that `geopandas` will use the contents of that column as the geometry column. Additionally, we could define the coordinate reference system for the data, but we will skip this step for now. For details of the syntax, see the online documentation for the `DataFrame` and `GeoDataFrame` constructors." ] }, { "cell_type": "code", "execution_count": 11, "id": "5d8178e8", "metadata": {}, "outputs": [], "source": [ "newdata = gpd.GeoDataFrame(data=[poly], columns=[\"geometry\"])" ] }, { "cell_type": "code", "execution_count": 12, "id": "a035fc6f-1149-436c-9f27-537b20ff95fe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>geometry</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>POLYGON ((24.95090 60.16916, 24.95349 60.16916...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " geometry\n", "0 POLYGON ((24.95090 60.16916, 24.95349 60.16916..." ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newdata" ] }, { "cell_type": "markdown", "id": "98400110-6519-4422-83fa-39d54d6740aa", "metadata": {}, "source": [ "We can also add additional attribute information to a new column. " ] }, { "cell_type": "code", "execution_count": 13, "id": "bff22229", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>geometry</th><th>name</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>POLYGON ((24.95090 60.16916, 24.95349 60.16916...</td><td>Senate Square</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " geometry name\n", "0 POLYGON ((24.95090 60.16916, 24.95349 60.16916... Senate Square" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add a new column and insert data\n", "newdata.at[0, \"name\"] = \"Senate Square\"\n", "\n", "# Check the contents\n", "newdata" ] }, { "cell_type": "markdown", "id": "237b04ca", "metadata": {}, "source": [ "There it is! Now we have two columns in our data: one representing the geometry and another with additional attribute information. From here, you could proceed to add more rows of data or write the data to a file. " ] }, { "cell_type": "markdown", "id": "3f988ec7-f19f-48e0-b184-bfed31ed19dc", "metadata": {}, "source": [ "## Creating a GeoDataFrame from a text file" ] }, { "cell_type": "markdown", "id": "62336285-f0ec-4779-b4bd-57bffc21bd0f", "metadata": {}, "source": [ "A common case is to have coordinates in a delimited text file that need to be converted into spatial data. We can use `pandas`, `geopandas` and `shapely` for this. \n", "\n", "The example data contains point coordinates of airports derived from [openflights.org](https://openflights.org/data.html) [^openflights]. Let's read in a few useful columns from the data for further processing." ] }, { "cell_type": "code", "execution_count": 14, "id": "d3196735-f0f6-4f08-a1e1-9fd86406a13b", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 15, "id": "afde321f-1755-4287-a679-b35c2844987f", "metadata": {}, "outputs": [], "source": [ "airports = pd.read_csv(\n", "    \"data/Airports/airports.txt\",\n", "    usecols=[\"Airport ID\", \"Name\", \"City\", \"Country\", \"Latitude\", \"Longitude\"],\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "id": "c0a8f3c4-0a2d-45e4-8f87-dfa0b42d3477", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Airport ID</th><th>Name</th><th>City</th><th>Country</th><th>Latitude</th><th>Longitude</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>Goroka Airport</td><td>Goroka</td><td>Papua New Guinea</td><td>-6.081690</td><td>145.391998</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>Madang Airport</td><td>Madang</td><td>Papua New Guinea</td><td>-5.207080</td><td>145.789001</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>Mount Hagen Kagamuga Airport</td><td>Mount Hagen</td><td>Papua New Guinea</td><td>-5.826790</td><td>144.296005</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>Nadzab Airport</td><td>Nadzab</td><td>Papua New Guinea</td><td>-6.569803</td><td>146.725977</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>Port Moresby Jacksons International Airport</td><td>Port Moresby</td><td>Papua New Guinea</td><td>-9.443380</td><td>147.220001</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " Airport ID Name City \\\n", "0 1 Goroka Airport Goroka \n", "1 2 Madang Airport Madang \n", "2 3 Mount Hagen Kagamuga Airport Mount Hagen \n", "3 4 Nadzab Airport Nadzab \n", "4 5 Port Moresby Jacksons International Airport Port Moresby \n", "\n", " Country Latitude Longitude \n", "0 Papua New Guinea -6.081690 145.391998 \n", "1 Papua New Guinea -5.207080 145.789001 \n", "2 Papua New Guinea -5.826790 144.296005 \n", "3 Papua New Guinea -6.569803 146.725977 \n", "4 Papua New Guinea -9.443380 147.220001 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airports.head()" ] }, { "cell_type": "code", "execution_count": 17, "id": "ef1181b0-7b45-4ad2-9105-851571e827c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7698" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(airports)" ] }, { "cell_type": "markdown", "id": "78930639-62ca-4bd8-bad8-f31bae1cf6d7", "metadata": {}, "source": [ "There are over 7000 airports in the data, and we can use the coordinate information in the `Latitude` and `Longitude` columns to visualize them on a map. The coordinates are stored as *{term}`Decimal degrees`*, meaning that the appropriate coordinate reference system for these data is WGS 84 (EPSG:4326). \n", "\n", "`geopandas` has a handy function called `.points_from_xy()` for generating an array of `Point` objects from x and y coordinates. The function expects the x coordinates to represent longitude and the y coordinates to represent latitude. " ] }, { "cell_type": "code", "execution_count": 18, "id": "bc3e9360-a8ff-4a5e-9cd5-5a6a27fd63c6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Airport ID</th><th>Name</th><th>City</th><th>Country</th><th>Latitude</th><th>Longitude</th><th>geometry</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>Goroka Airport</td><td>Goroka</td><td>Papua New Guinea</td><td>-6.081690</td><td>145.391998</td><td>POINT (145.39200 -6.08169)</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>Madang Airport</td><td>Madang</td><td>Papua New Guinea</td><td>-5.207080</td><td>145.789001</td><td>POINT (145.78900 -5.20708)</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>Mount Hagen Kagamuga Airport</td><td>Mount Hagen</td><td>Papua New Guinea</td><td>-5.826790</td><td>144.296005</td><td>POINT (144.29601 -5.82679)</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>Nadzab Airport</td><td>Nadzab</td><td>Papua New Guinea</td><td>-6.569803</td><td>146.725977</td><td>POINT (146.72598 -6.56980)</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>Port Moresby Jacksons International Airport</td><td>Port Moresby</td><td>Papua New Guinea</td><td>-9.443380</td><td>147.220001</td><td>POINT (147.22000 -9.44338)</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " Airport ID Name City \\\n", "0 1 Goroka Airport Goroka \n", "1 2 Madang Airport Madang \n", "2 3 Mount Hagen Kagamuga Airport Mount Hagen \n", "3 4 Nadzab Airport Nadzab \n", "4 5 Port Moresby Jacksons International Airport Port Moresby \n", "\n", " Country Latitude Longitude geometry \n", "0 Papua New Guinea -6.081690 145.391998 POINT (145.39200 -6.08169) \n", "1 Papua New Guinea -5.207080 145.789001 POINT (145.78900 -5.20708) \n", "2 Papua New Guinea -5.826790 144.296005 POINT (144.29601 -5.82679) \n", "3 Papua New Guinea -6.569803 146.725977 POINT (146.72598 -6.56980) \n", "4 Papua New Guinea -9.443380 147.220001 POINT (147.22000 -9.44338) " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airports[\"geometry\"] = gpd.points_from_xy(\n", "    x=airports[\"Longitude\"], y=airports[\"Latitude\"], crs=\"EPSG:4326\"\n", ")\n", "\n", "airports = gpd.GeoDataFrame(airports)\n", "airports.head()" ] }, { "cell_type": "markdown", "id": "ecb3737b-57de-436f-acaa-a903fed516a2", "metadata": {}, "source": [ "Now we have the point geometries as `shapely` objects in the `geometry` column, ready to be plotted on a map." ] }, { "cell_type": "code", "execution_count": 19, "id": "1de24cf8-d3a4-43dc-96f1-5576ae6780ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "airports.plot(markersize=0.1)" ] }, { "cell_type": "markdown", "id": "afa6079e-df84-4a95-b7d0-456d8917bb8c", "metadata": {}, "source": [ "_**Figure 6.12**. A basic plot showing the airports from openflights.org._" ] }, { "cell_type": "markdown", "id": "300856c0", "metadata": { "tags": [] }, "source": [ "## Footnotes\n", "\n", "[^GeoJson]: \n", "[^GPKG]: \n", "[^KML]: \n", "[^openflights]: \n", "[^shp]: " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 5 }