L1: Introduction

Bogdan G. Popescu

John Cabot University

Introduction

Logistics

  • Hours: 10-11:15AM
  • Room: G.K.1.4-Guarini Campus, Kushlan Wing, First Floor, Room 4
  • Office Hours: By appointment

Learning Outcomes

Upon successful completion of this course the students will be able to:

  • execute basic programming tasks in R (e.g. loops, conditional statements, while statements, etc.)
  • understand basic GIS terms and concepts
  • utilize GIS for conducting spatial analyses.
  • appreciate the design and structure of a geographic information system (GIS) as a decision-making tool.
  • produce maps
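
As a taste of the first outcome, here is a minimal base R sketch of a for loop, a conditional statement, and a while loop. The temperature values and the thresholds are made up for illustration.

```r
# Hypothetical temperatures in degrees Celsius (illustrative values only)
temperatures <- c(12.5, 18.0, 25.3, 31.1)
labels <- character(length(temperatures))

# for loop + conditional statement: label each temperature
for (i in seq_along(temperatures)) {
  if (temperatures[i] < 15) {
    labels[i] <- "cold"
  } else if (temperatures[i] < 30) {
    labels[i] <- "mild"
  } else {
    labels[i] <- "hot"
  }
}
print(labels)

# while loop: keep doubling a value until it reaches at least 100
value <- 1
steps <- 0
while (value < 100) {
  value <- value * 2
  steps <- steps + 1
}
print(c(value = value, steps = steps))
```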

Jobs where these skills are valued

  • Data Scientist/Data Analyst

  • GIS Analyst/GIS Specialist

  • Environmental Scientist

  • Market Research Analyst

  • Remote Sensing Specialist

  • Transportation Planner

Grading

You will be graded on four problem sets during the semester (each 12.5% of your grade) and a final report and presentation (each 25% of your grade).

  • 4 problem sets: 50% of the final grade (12.5% each)
  • final presentation: 25% of the final grade
  • final project report: 25% of the final grade

Problem Sets

  • Initial Individual Submission
    • This component contributes 50% of the overall grade for the problem set.
    • When you complete the problem set independently and submit it to the instructor, your grade for this component will be calculated based on the quality of your independent work
  • Final Submission After Group Consultation
    • This component also contributes 50% of the overall grade for the problem sets.
    • After discussing the problem set with your group members and documenting the correct answers, you will submit this revised version.

Final Project

You will undertake a GIS project that emphasizes practical application of spatial analysis techniques using the R programming language

The project entails a few steps:

  • choose a topic that involves spatial data
  • acquire data either from the course materials or from external sources
  • employ at least five GIS procedures in R
  • craft a well-structured two-page report containing: intro to the problem, objectives, data sources, methodology, results, and conclusion
  • appendix with the R code for the GIS procedures.

Introduction

In the first part of the course, we will learn about the R programming language and its capabilities with respect to spatial data

The first lectures will be dedicated to acquiring the basic knowledge needed to work with spatial data and with R

We will then move on to working with spatial data in R, including how to process vectors and rasters and how to combine the two

In the final part, we will also learn how to deal with spatio-temporal data and point pattern analysis

What is R

R is a programming language originally designed for statistical computing

It is an open-source ecosystem (i.e. everyone can contribute and it’s free)

It is compatible with Windows, Mac, and Linux

A variety of libraries already exist that make it easy to do things like:

  • clean and process data
  • visualize data
  • create interactive web-apps
  • typeset: write visually appealing articles and presentations (this presentation was made with Quarto)

What is R

This is how R compares to other programming languages

Use of R

R is used in a variety of fields:

  • Finance
  • Academic research
  • Government
  • Retail
  • Data Journalism
  • Healthcare

Companies that use R

Examples of companies which use R include

  • Airbnb
  • Microsoft
  • Uber
  • Facebook
  • Google

Good resources for learning R

  • R for Data Science
    http://r4ds.had.co.nz/
    Introduction to data analysis using R, focused on the tidyverse packages
    Good substitute for Stata

What is GIS?

GIS stands for Geographic Information Systems

GIS is a system that creates, manages, analyzes, and maps all types of data

It helps us understand patterns, relationships, and geographic context

It can be used to:

  • Identify problems
  • Manage and respond to events
  • Set priorities
  • Monitor change
  • Perform forecasting
  • Understand trends

What is GIS?

Mapping focuses on the visual representation of data

Spatial analysis focuses on a variety of aspects:

  • data manipulation
  • data querying
  • statistical analysis of geographic patterns

GIS comprises both mapping (visualization) and geographic data manipulations and analysis

GIS Software

  • ArcGIS - ESRI product with a comprehensive set of GIS tools - works on Windows (costs a few thousand dollars)
  • QGIS - Open-source product - works on Windows, Mac, and Linux (free)

Books to Use: Data Analysis and Visualization

Books to Use: GIS

Other useful Sources

Automation and Reproducibility

  • Automation is important
  • Reduces mistakes
  • It is easy to implement in the long run
  • If your analysis is reproducible, other people can see what you have done and replicate your results.

Documentation

  • All your code should include comments
#Step1: Data Cleaning (drop the rows whose Code appears in weird_labels)
library(dplyr) # needed for left_join() in Step3
clean_countries <- subset(life_expectancy2, !(Code %in% weird_labels))
clean_countries_urbanization <- subset(urbanization2, !(Code %in% weird_labels))

#Step2: Further Data Cleaning
clean_countries <- subset(life_expectancy2, !(Code %in% weird_labels))
clean_countries_urbanization <- subset(urbanization2, !(Code %in% weird_labels))

#Step3: Left Join (keep every row of clean_countries, matching on Code)
new_data <- left_join(clean_countries, clean_countries_urbanization, by = "Code")
  • Comments are the what
  • Code is the how

Why R

R is good for:

  • Automation - performing repetitive tasks that would be unfeasible by hand
  • Reproducibility - using the same commands repeatedly and obtaining the same output
  • Visualization - making and presenting graphs and maps
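
A minimal base R sketch of the first two points. The function name summarise_sample and the sample sizes are hypothetical; the idea is that one function is applied automatically to several inputs, and fixing the random seed makes the output identical on every run.

```r
# Hypothetical helper: draw n random numbers and summarise them.
# Setting the seed makes the random draw, and so the result, repeatable.
summarise_sample <- function(n, seed) {
  set.seed(seed)
  x <- rnorm(n)
  c(mean = mean(x), sd = sd(x))
}

# Automation: apply the same function to several sample sizes in one call
sizes <- c(10, 100, 1000)
run1 <- sapply(sizes, summarise_sample, seed = 42)

# Reproducibility: running the same commands again gives the same output
run2 <- sapply(sizes, summarise_sample, seed = 42)
print(identical(run1, run2))
```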

GIS Inputs and Outputs in R

Reading and writing spatial data into R is done through external libraries

  • GDAL/OGR is used for reading/writing vector and raster files, with sf and stars
  • PROJ handles Coordinate Reference Systems (CRS), in both sf and stars
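
A minimal sketch of both points with sf, assuming the package is installed. The two points reuse coordinates of restaurants that appear later in these slides; the temporary file and the choice of EPSG:32633 (UTM zone 33N) as the target CRS are assumptions made for illustration.

```r
library(sf)

# Two points in longitude/latitude (EPSG:4326)
pts <- st_sf(
  name = c("Pizzeria ai Marmi", "Al Peperoncino"),
  geometry = st_sfc(
    st_point(c(12.47379, 41.88826)),
    st_point(c(12.47698, 41.85343)),
    crs = 4326
  )
)

# Writing and reading go through GDAL; the driver is guessed from the extension
path <- tempfile(fileext = ".geojson")
write_sf(pts, path)
back <- read_sf(path)
print(st_crs(back))

# Reprojecting goes through PROJ: here to UTM zone 33N, with units in metres
pts_utm <- st_transform(back, 32633)
print(st_crs(pts_utm)$epsg)
```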

Processing Vector Layers sf

sf will be the main library that we will work with

It will help us deal with:

  • Numerical Operations to calculate: Areas, Length, Distances, etc.
  • GIS Logical Operations: Overlaps, Equals, Intersects, etc.
  • Geometry Operations: Centroid, Buffer, Intersection, Union, Difference, etc.
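
A minimal sketch of one operation from each group, assuming sf is installed. The two overlapping squares, the helper make_square, and the projected CRS (EPSG:32633, units in metres) are hypothetical and chosen only to keep the numbers easy to check.

```r
library(sf)

# Hypothetical helper: a square polygon from its lower-left corner and side length
make_square <- function(x0, y0, side) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + side, y0), c(x0 + side, y0 + side),
    c(x0, y0 + side), c(x0, y0)
  )))
}

a <- st_sfc(make_square(0, 0, 2), crs = 32633)
b <- st_sfc(make_square(1, 1, 2), crs = 32633)

# Numerical operation: area of a (a 2 x 2 square)
print(st_area(a))

# Logical operation: do a and b intersect?
print(st_intersects(a, b, sparse = FALSE))

# Geometry operations: the overlapping part and the centroid of a
overlap <- st_intersection(a, b)
print(st_area(overlap))
print(st_centroid(a))
```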

Processing Vector Layers sf: buffer
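
A minimal buffer sketch, assuming sf is installed. The two points and the 1000 m buffer distance are hypothetical; the points use a projected CRS (EPSG:32633) so that the distance is expressed in metres.

```r
library(sf)

# Two hypothetical points in UTM zone 33N (coordinates in metres), 5 km apart
pts <- st_sfc(
  st_point(c(290000, 4640000)),
  st_point(c(293000, 4644000)),
  crs = 32633
)
print(st_distance(pts[1], pts[2]))

# Buffer: a polygon covering everything within 1000 m of each point
buffers <- st_buffer(pts, dist = 1000)

# Each buffer approximates a circle, so its area is close to pi * 1000^2
print(st_area(buffers))

# The points are further apart than two buffer radii, so the buffers do not touch
print(st_intersects(buffers[1], buffers[2], sparse = FALSE))
```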

Processing Raster Layers stars

We can perform operations on rasters (gridded data, similar to pictures) with the stars package

  • Accessing cell values - as a matrix or as a dataframe, extracting pixels to points
  • Performing raster algebra: raster arithmetic and logic
  • Changing the resolution and extent: cropping, mosaicking, resampling, and reprojecting
  • Transforming Rasters: to points and polygons

Processing Raster Layers stars

Temperature in 1901

Processing Raster Layers stars

Temperature in 2022

Processing Raster Layers stars

Temperature difference between 2022 and 1901 > 4
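
The kind of raster algebra behind such a map can be sketched as follows, assuming stars is installed. The two tiny 3 x 2 grids and their values are made up and stand in for the real temperature rasters, which are not reproduced here.

```r
library(stars)

# Hypothetical temperatures on a 3 x 2 grid for two years
t1901 <- matrix(c(10, 12, 14, 11, 13, 15), nrow = 3, ncol = 2)
t2022 <- matrix(c(12, 17, 15, 16, 14, 20), nrow = 3, ncol = 2)

# Naming the dimensions lets stars treat the matrices as x/y rasters
dim(t1901) <- c(x = 3, y = 2)
dim(t2022) <- c(x = 3, y = 2)
r1901 <- st_as_stars(t1901)
r2022 <- st_as_stars(t2022)

# Raster arithmetic: cell-by-cell difference between the two years
change <- r2022 - r1901

# Raster logic: TRUE where the difference exceeds 4 degrees
warmed <- change > 4

print(change[[1]])
print(warmed[[1]])
print(sum(warmed[[1]]))
```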

Data visualization

  • ggplot2 is the library that will allow us to visualize the results of data analysis and also to make maps
  • it has a well-designed and consistent syntax that supports visualization for both vectors and rasters
  • it produces highly customizable, publication-quality figures and maps
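
A minimal map sketch, assuming ggplot2 and sf are installed. The three points and their labels are hypothetical.

```r
library(sf)
library(ggplot2)

# Three hypothetical places in longitude/latitude (EPSG:4326)
places <- st_sf(
  name = c("A", "B", "C"),
  geometry = st_sfc(
    st_point(c(12.47, 41.89)),
    st_point(c(12.50, 41.90)),
    st_point(c(12.51, 41.88)),
    crs = 4326
  )
)

# geom_sf() draws sf geometries; the usual ggplot2 layers are added with +
p <- ggplot(places) +
  geom_sf(aes(colour = name), size = 3) +
  labs(title = "Three hypothetical places", colour = "Place") +
  theme_minimal()

print(p)
```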

Interactive Data visualization

  • leaflet is a library that allows us to make interactive maps
  • mapview is a wrapper around leaflet automating the addition of: labels, popups, color scales, and common basemaps
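
A minimal interactive map sketch with leaflet, assuming the package is installed. The marker uses the coordinates of one restaurant that appears later in these slides; the default OpenStreetMap basemap needs an internet connection to display.

```r
library(leaflet)

# An interactive map with a basemap and one marker with a popup
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng = 12.47379, lat = 41.88826, popup = "Pizzeria ai Marmi")

# Typing m in an interactive session (e.g. RStudio) displays the map
class(m)
```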

Programming

  • A programming language is a machine-readable artificial language designed to express computations that can be performed by a computer.

  • Programming allows us to edit code, re-use it in the future, and obtain the same results

Object-oriented programming

  • In object-oriented programming, the interaction with the computer takes place through objects

  • Each object belongs to a class: an abstract structure that has specific properties

Example:

  • All cars in the parking lot are instances of the “car” class

  • The “car” class has specific properties (make, color, year) and methods (start, drive, stop)
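
The car example can be sketched in base R with S3, one of several object systems in R. All names here (new_car, start_engine, the example car) are hypothetical, and only one method is shown.

```r
# Constructor: an object of class "car" with its properties stored in a list
new_car <- function(make, color, year) {
  structure(
    list(make = make, color = color, year = year, running = FALSE),
    class = "car"
  )
}

# A generic function and its method for the "car" class
start_engine <- function(x, ...) UseMethod("start_engine")
start_engine.car <- function(x, ...) {
  x$running <- TRUE
  x
}

# Each car in the parking lot is an instance of the class
my_car <- new_car("Fiat", "red", 2015)
print(class(my_car))

# R copies objects on modification, so the result has to be assigned back
my_car <- start_engine(my_car)
print(my_car$running)
```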

Object-oriented programming in R

We will see that everything that we work with in R is an object

For example, we can load up a geojson file in R.

#Step1: Loading the geojson file
library(sf)
restaurants <- read_sf("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week7/data/restaurant.geojson")
#Step2: Selecting only the relevant variables
restaurants <- subset(restaurants, select = c(name, `addr:street`))
#Step3: Keeping the restaurants that have a name or an address (dropping those with neither)
restaurants2 <- subset(restaurants, !is.na(name) | !is.na(`addr:street`))

R transforms the geojson file into an object of a class named sf data.frame

#Step4: Identifying the object class
class(restaurants2)
[1] "sf"         "tbl_df"     "tbl"        "data.frame"

Object-oriented programming in R

This type of object has numerous properties such as:

  • rows
nrow(restaurants2)
[1] 2811
  • columns
ncol(restaurants2)
[1] 3

Object-oriented programming in R

Once imported, the sf data.frame is saved in the computer memory

Printing the object displays some of its properties, including those specific to spatial data

restaurants2
Simple feature collection with 2811 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 12.21167 ymin: 41.70574 xmax: 12.77428 ymax: 42.06974
Geodetic CRS:  WGS 84
# A tibble: 2,811 × 3
   name                          `addr:street`                    geometry
   <chr>                         <chr>                         <POINT [°]>
 1 Pizzeria ai Marmi             Viale di Trastevere   (12.47379 41.88826)
 2 Sichuan Haozi                 Via di San Martino a…  (12.49948 41.8958)
 3 Dar filettaro a Santa Barbara Largo dei Librari      (12.4737 41.89467)
 4 Al Peperoncino                Via Ostiense          (12.47698 41.85343)
 5 Ai Tre Scalini                Via Panisperna        (12.49044 41.89628)
 6 Trattoria Ada e Mario         Circonvallazione App… (12.51433 41.87532)
 7 Gustosando                    <NA>                  (12.42743 41.89954)
 8 Sa Posada                     Via Elvia Recina       (12.5079 41.87995)
 9 Pizzeria Formula 1            Via degli Equi        (12.51268 41.89702)
10 Da Francesco                  Piazza del Fico        (12.4704 41.89932)
# ℹ 2,801 more rows

Object-oriented programming in R

By printing the object, we can see some of its properties including:

  • dimension
  • bounding box
  • crs

Inheritance

One of the characteristics of object oriented programming is inheritance

Inheritance is what makes it possible for one class to extend another class by adding new properties

Example:

  • A “taxi” is an extension of a “car” class, inheriting all of its properties and methods.

  • A taxi could have new properties like taxi company name
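
A minimal S3 sketch of the taxi example in base R. The names new_car, new_taxi, describe, and the company name are hypothetical. The class vector c("taxi", "car") works the same way as the c("sf", "tbl_df", "tbl", "data.frame") vector of an sf object.

```r
# Parent class: "car"
new_car <- function(make, color, year) {
  structure(list(make = make, color = color, year = year), class = "car")
}

# Child class: a taxi is a car with one extra property
new_taxi <- function(make, color, year, company) {
  obj <- new_car(make, color, year)
  obj$company <- company
  class(obj) <- c("taxi", class(obj))
  obj
}

# A generic with a method for each class
describe <- function(x, ...) UseMethod("describe")
describe.car <- function(x, ...) paste(x$color, x$make, "from", x$year)
# NextMethod() calls the "car" method, then the taxi adds its own detail
describe.taxi <- function(x, ...) paste(NextMethod(), "operated by", x$company)

cab <- new_taxi("Toyota", "white", 2020, "Roma Taxi")
print(class(cab))
print(inherits(cab, "car"))
print(describe(cab))
```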

Inheritance

In R, every complex object is a collection of smaller components such as properties

We can use str to examine the properties of the object

str(restaurants2)
sf [2,811 × 3] (S3: sf/tbl_df/tbl/data.frame)
 $ name       : chr [1:2811] "Pizzeria ai Marmi" "Sichuan Haozi" "Dar filettaro a Santa Barbara" "Al Peperoncino" ...
 $ addr:street: chr [1:2811] "Viale di Trastevere" "Via di San Martino ai Monti" "Largo dei Librari" "Via Ostiense" ...
 $ geometry   :sfc_POINT of length 2811; first list element:  'XY' num [1:2] 12.5 41.9
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA
  ..- attr(*, "names")= chr [1:2] "name" "addr:street"

For example, the names of the restaurants are stored as a string variable called name (second line of output)

The addresses of the restaurants are stored as a string variable called addr:street (third line of output)

Starting R

We will now familiarize ourselves with the R environment

We first need to install R: R-project

We will then install an R interface that allows us to interact with R in a more user-friendly manner: RStudio

R panels

Reproducible Workflows

  • R sessions and workspaces are disposable (they disappear when R is closed unless explicitly saved)
  • You should save your code and not your workspace: the workspace takes a lot of space
  • Always start R with a blank slate
  • Restart R once in a while to clear memory

Your workspace

  • This is what your workspace for this class should look like
  • Note every week has its associated folder

Good practices

  • Put all files related to a course or a project in their designated folders
  • Create a folder for every week: week1, week2, week3, assignment1, assignment2
  • Within each folder, you should have:
    1. data
    2. graphs
    3. tables (or output)
  • All paths will be relative to the project’s folder.
  • Always start R with a blank slate
  • Always save your R scripts
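
A minimal base R sketch of this folder layout with relative paths. A temporary directory stands in for the real week1 folder, and the file name example.csv is hypothetical. setwd() is used here only to mimic opening the project folder; an RStudio project would normally set the working directory for you.

```r
# A temporary stand-in for the project folder (e.g. week1)
project_dir <- file.path(tempdir(), "week1")
for (sub in c("data", "graphs", "tables")) {
  dir.create(file.path(project_dir, sub), recursive = TRUE, showWarnings = FALSE)
}

# Work from the project folder and remember the previous working directory
old_wd <- setwd(project_dir)

# Paths are written relative to the project folder, not to the whole computer
data_path <- file.path("data", "example.csv")
write.csv(data.frame(x = 1:3), data_path, row.names = FALSE)
d <- read.csv(data_path)
print(d)

# Restore the previous working directory
setwd(old_wd)
```

Because the script never mentions a user-specific location, it runs unchanged on another computer as long as the folder structure is the same.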

Solving problems on your own

  • Try to solve a problem for 15 mins
  • If you cannot find a solution, have another try in another 15 minutes after taking a break
  • Take another break and try again
  • Finally, go and ask for help

Resources

  • Google
  • Stack Overflow
  • ChatGPT
  • Fellow students
  • Me

Collaboration

  • This class is an opportunity to learn
  • You have two attempts for every assignment
  • The final grade for each assignment will be the average of the first and second attempts
  • For the second attempt, you should meet up with your assigned team to discuss solutions