L10: Shapefiles, Coordinate Systems, and Basic Mapping

Bogdan G. Popescu

John Cabot University

Introduction

We previously covered the fundamentals of Geographic Information Systems (GIS)

We also learned about https://gadm.org

This is a good website to download shapefiles for countries and subdivisions

Introduction

Create a new folder for this week

Download again the shapefiles for the US and Italy and place them in a folder called data.

Mapping the US USA_0

library(sf)
#"s2" is an enhancement of the "sf" package
#It allows one to complete spatial operations with data that is in geographic coordinate system (the coordinates are in degrees)
#However, having it enables will take much longer computation time
sf_use_s2(FALSE)
library(ggplot2)
#Step1: Read country shape
usa_cntry <- st_read(dsn="./data/gadm41_USA_shp/gadm41_USA_0.shp", quiet = TRUE)

#Step2: Simplify lines
usa_cntry<-st_simplify(usa_cntry,  dTolerance = 0.05)

#Define zoom
min_lon_x_us<-(-130)
max_lon_x_us<-(-57)
min_lat_y_us<-(26)
max_lat_y_us<-(53)

#Map
fig<-ggplot()+
  geom_sf(data=usa_cntry, linewidth = 0.3, fill = NA)+
  theme_bw()+
  coord_sf(xlim = c(min_lon_x_us-3, max_lon_x_us+3), 
           ylim = c(min_lat_y_us-3, max_lat_y_us+3))

Mapping the US USA_0

fig

Mapping the US USA_0

Let us examine what the dataframe looks like

head(usa_cntry, 2)
Simple feature collection with 1 feature and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -178.9911 ymin: 18.91028 xmax: 179.7485 ymax: 71.33542
Geodetic CRS:  WGS 84
  GID_0       COUNTRY                       geometry
1   USA United States MULTIPOLYGON (((179.6461 52...

Mapping the US USA_0

This is how we turn the sf object into a simple dataframe

usa_cntry_df<-st_drop_geometry(usa_cntry)
head(usa_cntry_df, 2)
  GID_0       COUNTRY
1   USA United States

The two objects are very similar.

The difference is that usa_cntry has a geometry column. usa_cntry_df does not.

We can see the difference better if we do:

class(usa_cntry)
[1] "sf"         "data.frame"

or

class(usa_cntry_df)
[1] "data.frame"

Shapefiles for Subdivsions of the US USA_1

#Step1: Read country borders
usa_cntry1 <- st_read(dsn="./data/gadm41_USA_shp/gadm41_USA_1.shp", quiet = TRUE)

#Step2: Simplify lines
usa_cntry1<-st_simplify(usa_cntry1,  dTolerance = 0.05)

#Step3: Map
fig<-ggplot()+
  geom_sf(data=usa_cntry1, linewidth = 0.3, fill = NA)+
  theme_bw()+
  coord_sf(xlim = c(min_lon_x_us-3, max_lon_x_us+3), 
           ylim = c(min_lat_y_us-3, max_lat_y_us+3))

Shapefiles for Subdivsions of the US

fig

Shapefiles for Subdivsions of the US USA_2

#Step1: Read country borders
usa_cntry2 <- st_read(dsn="./data/gadm41_USA_shp/gadm41_USA_2.shp", quiet = TRUE)

#Step2: Simplify lines
usa_cntry2<-st_simplify(usa_cntry2,  dTolerance = 0.05)

#Step3: Map
fig<-ggplot()+
  geom_sf(data=usa_cntry2, linewidth = 0.3, fill = NA)+
  theme_bw()+
  coord_sf(xlim = c(min_lon_x_us-3, max_lon_x_us+3), 
           ylim = c(min_lat_y_us-3, max_lat_y_us+3))

Shapefiles for Subdivsions of the US USA_2

fig

Shapefiles for World

We can do the same for Italy

Shapefiles for World ITA_0

#Step1: Read country borders
it_cntry0 <- st_read(dsn="./data/gadm41_ITA_shp/gadm41_ITA_0.shp", quiet = TRUE)
it_cntry1 <- st_read(dsn="./data/gadm41_ITA_shp/gadm41_ITA_1.shp", quiet = TRUE)
it_cntry2 <- st_read(dsn="./data/gadm41_ITA_shp/gadm41_ITA_2.shp", quiet = TRUE)
it_cntry3 <- st_read(dsn="./data/gadm41_ITA_shp/gadm41_ITA_3.shp", quiet = TRUE)

#Step2: Simplify lines
it_cntry0<-st_simplify(it_cntry0,  dTolerance = 0.05)
it_cntry1<-st_simplify(it_cntry1,  dTolerance = 0.05)
it_cntry2<-st_simplify(it_cntry2,  dTolerance = 0.01)
it_cntry3<-st_simplify(it_cntry3,  dTolerance = 0.01)

#Step3: Map
fig<-ggplot()+
  geom_sf(data=it_cntry0, linewidth = 0.3, fill = NA)+
  theme_bw()

Shapefiles for World ITA_0

fig

Shapefiles for World ITA_1

#Step3: Map
fig<-ggplot()+
  geom_sf(data=it_cntry1, linewidth = 0.3, fill = NA)+
  theme_bw()
fig

Shapefiles for World ITA_1

#Step3: Map
fig

Shapefiles for World ITA_2

#Step3: Map
fig<-ggplot()+
  geom_sf(data=it_cntry2, linewidth = 0.3, fill = NA)+
  theme_bw()

Shapefiles for World ITA_2

fig

Shapefiles for World ITA_3

#Step3: Map
fig<-ggplot()+
  geom_sf(data=it_cntry3, linewidth = 0.05, fill = NA)+
  theme_bw()

Shapefiles for World ITA_3

fig

Vector File Formats

The most common vector files are:

Type Format File extension
Binary ESRI Shapefile .shp, .shx, .dbf, .prj
GeoPackage (GPKG) .gpkg
Plain Text GeoJSON .json or .geojson
GPS Exchange Format (GPX) .gpx
Keyhole Markup Language (KML) .gpx
Spatial Databases PostgreSQL / PostGIS

The sp package

  • The first R package to establish a uniform vector layer class system was sp (2005-2023).

  • Together with rgdal (2003-2023) and rgeos (2011-2023), the sp package dominated the landscape of spatial analysis in R

Spatial Data Structures in sp

Class Geometry type Attributes
SpatialPoints Points -
SpatialPointsDataFrame Points data.frame
SpatialLines Lines -
SpatialLinesDataFrame Lines data.frame
SpatialPolygons Polygons -
SpatialPolygonsDataFrame Polygons data.frame

Caveat

  • We will avoid the use of sp, rgdal, and rgeos and focus on the sf package.
  • The rgdal package is deprecated starting with 2023.
  • You should avoid using those packages (i.e. you will get errors and maybe conflicts with other libraries).
  • However, it is important to be aware of these packages, given how many forums, articles, and dependencies have been using them.

The sf package

sf is the standard package for working with vector data in R.

It relies on external software components: GDAL, GEOS, and PROJ.

It contains three classes:

  • sfg — a single geometry
  • sfc - a geometry column: a set of sfg geometries with CRS information
  • sf - a layer which is an sfc column inside a data frame with non-spatial attributes.

The sf class

Here is a polygon layer with three features and six non-spatial attributes: BIR74, SID74, NWBIR74, BIR79, SID79, NWBIR79

The sf class

Mapping the layer

The sf class

Geometry

sfg is a single geometry which can be of multiple types

  • st_point
  • st_multipoint
  • st_linestring
  • st_multilinestring
  • st_polygon
  • st_multipolygon
  • st_geometrycollection

Geometry

These are the seven most common Simple Feature geometry types

Geometry

These are the seven most common Simple Feature geometry types

Geometry

These are the seven most common Simple Feature geometry types

Geometry

These are the seven most common Simple Feature geometry types

Geometry

These are the seven most common Simple Feature geometry types

Geometry

These are the seven most common Simple Feature geometry types

Geometry

These are the seven most common Simple Feature geometry types

Geometry

The last type GEOMETRYCOLLECTION is rarely used

The Shapefile format does not support geometries of type GEOMETRYCOLLECTION

Geometry - Points

Let us create one point:

library(sf)
library(ggplot2)
pt1 = st_point(c(34.798443, 31.243288))

If we print the point, we get its WKT (Well-Known Text) representation

print(pt1)

If we examine the class definition, we get:

class(pt1)
[1] "XY"    "POINT" "sfg"  

Geometry - Points

class(pt1)
[1] "XY"    "POINT" "sfg"  
  • XY is the two-dimensional geometry of the point
  • POINT is the geometry type (it could be POINT, MULTIPOINT, etc.)
  • sfg is the general class (i.e. simple feature geometry)

Geometry - Points

This is what the point looks like:

plot(pt1)

Geometry - Polygon

This is how we create a polygon

a <- st_polygon(list(cbind(c(0,0,7.5,7.5,0),c(0,-1,-1,0,0))))

If we print the polygon, we get its WKT representation

print(a)
POLYGON ((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0))

If we examine the class definition, we get:

class(a)
[1] "XY"      "POLYGON" "sfg"    

Geometry - Polygon

This is what that polygon looks like:

plot(a)

Geometry - Polygon

We can of course use ggplot

a <- st_polygon(list(cbind(c(0,0,7.5,7.5,0),c(0,-1,-1,0,0))))
ggplot()+
  geom_sf(data=a)

Geometry - Polygon 2

We can also make a more complicated polygon:

b = st_polygon(list(cbind(c(0,1,2,3,4,5,6,7,7,0),c(1,0,0.5,0,0,0.5,-0.5,-0.5,1,1))))
print(b)
POLYGON ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1))

This is what the class looks like:

class(b)
[1] "XY"      "POLYGON" "sfg"    

Geometry - Polygon 2

This is what that polygon looks like:

plot(b)

Combining Geometries

We can combine geometries using the c function

This will result into a single MULTIPOLYGON geometry, composed of all the shapes in the input

ab <- c(a, b)
print(ab)
MULTIPOLYGON (((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0)), ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1)))

If we check the class of ab, we see that it is a MULTIPOLYGON

class(ab)
[1] "XY"           "MULTIPOLYGON" "sfg"         

Combining Geometries

This is what the combined polygon looks like:

plot(ab)

Intersecting Geometries

A new geometry can be calculated by applying various functions to existing ones.

i <- st_intersection(a, b)
print(i)
GEOMETRYCOLLECTION (POLYGON ((7 0, 7 -0.5, 6 -0.5, 5.5 0, 7 0)), LINESTRING (4 0, 3 0), POINT (1 0))

Notice the difference from the combined geometries: ab

ab <- c(a, b)
print(ab)
MULTIPOLYGON (((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0)), ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1)))

Also, notice the difference between classes:

class(i)
[1] "XY"                 "GEOMETRYCOLLECTION" "sfg"               
class(ab)
[1] "XY"           "MULTIPOLYGON" "sfg"         

Intersecting Geometries

This is what the intersected geometries look like:

i <- st_intersection(a, b)
plot(i)

Intersecting Geometries

This is how they differ from the combined geometries:

ab <- c(a, b)
plot(ab)

Geometry column - sfc

Let us create two more points: pt2 and pt3.

pt2 = st_point(c(34.812831, 31.260284))
pt3 = st_point(c(35.011635, 31.068616))

Geometry objects can be collected into a geometry column - sfc, which is done with st_sfc

A geometry column also contains a Coordinate Reference System

Geometry column - sfc

Four types of CRS definitions are acceptable

  • An EPSG code (e.g. 4326)
  • A crs object, as returned by the st_crs function
  • A PROJ4 string (e.g. '+proj=longlat +datum=WGS84 +no_defs')
  • A WKT string

If the crs argument is omitted, the default is to set an “undefined” CRS (i.e., NA).

EPSG Codes

When you need precise measurements for things like distances, you should use EPSG codes.

EPSG Geodetic Parameter Dataset is a public registry of geodetic datums, spatial reference systems, Earth ellipsoids, coordinate transformations and related units of measurement.

It came from a member of the European Petroleum Survey Group in 1985.

Different countries have different codes.

EPSG Codes

For example, the standard for the whole world is 4326, the equivalent of WGS84 (also used by Google)

The code for Italy is 6875.

This is from https://epsg.io

EPSG Geodetic Parameter Dataset is a public registry of geodetic datums, spatial reference systems.

EPSG Codes

EPSG Codes

EPSG Codes

EPSG Codes

Geometry column - sfc

We can combine pt1, pt2, and pt3.

Reminder: this is how we defined them:

pt1 = st_point(c(34.798443, 31.243288))
pt2 = st_point(c(34.812831, 31.260284))
pt3 = st_point(c(35.011635, 31.068616))

This is how we combine them:

combined <- st_sfc(pt1, pt2, pt3, crs = 4326)
combined
Geometry set for 3 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 34.79844 ymin: 31.06862 xmax: 35.01164 ymax: 31.26028
Geodetic CRS:  WGS 84
POINT (34.79844 31.24329)
POINT (34.81283 31.26028)
POINT (35.01163 31.06862)

Note the we added 4326 - the code for the whole world.

Geometry column - sfc

The combined points look like below

plot(combined)

Or

The combined points look like below

ggplot()+
  geom_sf(data=combined)

Layer sf

We can add attributes to the geometry in the following way:

name<-c("Point 1", "Point 2", "Point 3")
name_df<-data.frame(name)
name_df
     name
1 Point 1
2 Point 2
3 Point 3

Layer sf

We can combine the attribute table name_df and the geometry column geom (sfc) using st_sf, resulting in a layer

layer <- st_sf(name_df, combined)
layer
Simple feature collection with 3 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 34.79844 ymin: 31.06862 xmax: 35.01164 ymax: 31.26028
Geodetic CRS:  WGS 84
     name                  combined
1 Point 1 POINT (34.79844 31.24329)
2 Point 2 POINT (34.81283 31.26028)
3 Point 3 POINT (35.01163 31.06862)

Mapping the sf layer with plot

plot(layer)

Mapping the sf layer with ggplot

ggplot()+
  geom_sf(data=layer)+ theme_bw()

Mapping the sf layer with ggplot

We can also plot the polygons we made with ggplot

Here is what ab looks like:

ggplot()+
  geom_sf(data=ab)+ theme_bw()

Extracting Layer Components

Sometimes, we are interested in going the other way around and extract:

  • the geometry
  • the attribute table
  • the coordinates

Extracting Layer Components: the Geometry

The following command allows us to extract the geometry

st_geometry(layer)
Geometry set for 3 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 34.79844 ymin: 31.06862 xmax: 35.01164 ymax: 31.26028
Geodetic CRS:  WGS 84
POINT (34.79844 31.24329)
POINT (34.81283 31.26028)
POINT (35.01163 31.06862)

Extracting Layer Components: the Attribute Table

The following command allows us to drop the geometry

We are then left just with the attribute table:

st_drop_geometry(layer)
     name
1 Point 1
2 Point 2
3 Point 3

Extracting Layer Components: the Coordinates

The coordinates of sf, sfc or sfg objects can be obtained with the function st_coordinates.

The coordinates are a matrix:

st_coordinates(layer)
            X        Y
[1,] 34.79844 31.24329
[2,] 34.81283 31.26028
[3,] 35.01163 31.06862

We can turn the matrix into a dataframe in the following way

data.frame(st_coordinates(layer))
         X        Y
1 34.79844 31.24329
2 34.81283 31.26028
3 35.01163 31.06862

Creating a Point Layer from a table

We can create point layers from tables.

Download the Restaurants in Rome CSV file

Place it in your week’s folder

Creating a Point Layer from a table

We can create point layers from tables.

Here is an example below:

df <- read.table("./data/restaurants_rome.csv", header = TRUE, sep = ",")
head(df, n=3)
                           name                 addr.street      lat      lon
1             Pizzeria ai Marmi         Viale di Trastevere 41.88826 12.47379
2                 Sichuan Haozi Via di San Martino ai Monti 41.89580 12.49948
3 Dar filettaro a Santa Barbara           Largo dei Librari 41.89467 12.47370

The restaurants dataframe can be converted to a sf layer using st_as_sf:

df2 = st_as_sf(df, coords = c('lon', 'lat'))
st_crs(df2)<-st_crs(4326)
head(df2, n=2)
Simple feature collection with 2 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 12.47379 ymin: 41.88826 xmax: 12.49948 ymax: 41.8958
Geodetic CRS:  WGS 84
               name                 addr.street                  geometry
1 Pizzeria ai Marmi         Viale di Trastevere POINT (12.47379 41.88826)
2     Sichuan Haozi Via di San Martino ai Monti  POINT (12.49948 41.8958)

Creating a Point Layer from a table

This is what these points look like, if we map them out:

ggplot()+
  geom_sf(data=df2, size=0.3)

Creating a Point Layer from a table

Without some context, it is difficult to see what those points

You can see below an updated map with the streets in the background.

sf properties: rows and columns

Some important properties of sf objects include rows and columns.

Number of rows:

nrow(df2)
[1] 2811

Number of columns:

ncol(df2)
[1] 3

We can obtain both rows and column by typing:

dim(df2)
[1] 2811    3

sf properties: bounding box coordinates

The st_bbox function returns the bounding box coordinates

st_bbox(df2)
    xmin     ymin     xmax     ymax 
12.21167 41.70574 12.77428 42.06974 

This is helpful to find the extreme locations in our dataset: Northern-most, Southern-most, Eastern-most, Western-most

sf properties: crs

The st_crs function returns the Coordinate Reference System (CRS)

st_crs(df2)
Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

Subsetting based on attributes

Selecting only the restaurants that have an address:

rest_noaddress<-subset(df2, !is.na(addr.street))

Subsetting based on attributes

Selecting only the restaurants that have an address:

rest_noaddress<-subset(df2, !is.na(addr.street))
library(ggpubr)
f1<-ggplot()+
  geom_sf(data=df2, size=0.3)

Subsetting based on attributes

Selecting only the restaurants that have an address:

rest_noaddress<-subset(df2, !is.na(addr.street))
library(ggpubr)
f1<-ggplot()+
  geom_sf(data=df2, size=0.3)

f2<-ggplot()+
  geom_sf(data=rest_noaddress, size=0.3)

Subsetting based on attributes

Here is how the two dataframes compare:

ggarrange(f1, f2)

Subsetting based on attributes

Subsetting columns is very similar to subsetting a dataframe.

The difference is the geometry column that stays.

We turn the spatial dataframe into a regular dataframe.

df3<-st_drop_geometry(df2)

Subsetting based on attributes

This is how the two compare:

head(df2, n=3)
Simple feature collection with 3 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 12.4737 ymin: 41.88826 xmax: 12.49948 ymax: 41.8958
Geodetic CRS:  WGS 84
                           name                 addr.street
1             Pizzeria ai Marmi         Viale di Trastevere
2                 Sichuan Haozi Via di San Martino ai Monti
3 Dar filettaro a Santa Barbara           Largo dei Librari
                   geometry
1 POINT (12.47379 41.88826)
2  POINT (12.49948 41.8958)
3  POINT (12.4737 41.89467)
head(df3, n=3)
                           name                 addr.street
1             Pizzeria ai Marmi         Viale di Trastevere
2                 Sichuan Haozi Via di San Martino ai Monti
3 Dar filettaro a Santa Barbara           Largo dei Librari

Reading vector layers

As discussed, the most common vector files are:

Type Format File extension
Binary ESRI Shapefile .shp, .shx, .dbf, .prj
GeoPackage (GPKG) .gpkg
Plain Text GeoJSON .json or .geojson
GPS Exchange Format (GPX) .gpx
Keyhole Markup Language (KML) .gpx
Spatial Databases PostgreSQL / PostGIS

Reading vector layers

Please download the following US spatial files

Place them in your week’s folder.

Reading vector layers

Here is how we would read all these different files:

Reading a shape file

usa_cntry <- st_read(dsn="./data/gadm41_USA_shp/gadm41_USA_0.shp", quiet = TRUE)

Reading vector layers

Reading a geojson file

usa_cntry = st_read('data/usa_database/geojson/gadm41_USA_0.geojson', quiet = TRUE)

Reading vector layers

Reading a from a geodatabase

usa_cntry <- st_read(dsn="./data/usa_database/usa_geodatabase.gdb",layer="gadm41_USA_0", quiet = TRUE)

Mapping the files

Conclusion

  • Today we learned about vector layers: points, lines, polygons,
  • We also learned about common vector files
  • We learned about combining and intersecting geometries
  • We also learned about EPSG codes
  • We also learned about extracting layer components: geometry, attribute table, and coordinates.