We will extract and transform sociodemographic data of local areas in Paris. At the end of the class, we will produce maps of Paris, coloured by sociodemographic variables. For example, we will be able to see which areas of Paris have the highest number of qualified professionals, the highest number of immigrants, or the highest number of young people.
We aim to create two data frames, one by IRIS and the other by arrondissement, that look something like this:
We then aim to plot this data on maps of Paris.
There are five classes of objects:
logical (e.g., TRUE, FALSE)
integer (e.g., 213, -3)
numeric (real or decimal) (e.g., 2, 2.0, -4.89, pi)
complex (e.g., 1 + 0i, 1 + 4i)
character (e.g., "hello", "AA231@_:)")
You can find the class of an object by using the class() function, and you can change the class of an object by using the functions as.numeric(), as.logical() and as.character(). The numeric equivalent of FALSE is 0 and of TRUE is 1. In the other direction, 0 converts to FALSE and any other number converts to TRUE.
as.numeric(FALSE)
## [1] 0
as.logical(43)
## [1] TRUE
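As a minimal sketch of these conversions, the lines below check and change the class of a few made-up values (the object names x, y and z are only for illustration):

```r
# Hypothetical values chosen for illustration
x <- 3.7
class(x)                         # a decimal number is "numeric"
y <- as.character(x)             # the number becomes the string "3.7"
class(y)                         # now "character"
z <- as.numeric("12.5") + 1      # a string holding a number can be converted back
flags <- as.logical(c(0, 1, -2.5))  # zero is FALSE, every other number is TRUE
```

Note that as.numeric() applied to a string that does not look like a number, such as "hello", returns NA with a warning.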
You can do arithmetic operations on numbers:
addition (+)
6 + 2
## [1] 8
subtraction (-)
6 - 2
## [1] 4
division (/)
6 / 2
## [1] 3
multiplication (*)
6 * 2
## [1] 12
exponent (^)
6^2
## [1] 36
It is often useful to store the result of a computation in an object:
result <- (6^2) / 4 + 17.85
If you want to see the value of an object:
result
## [1] 26.85
Character strings need to be surrounded by quotation marks (' or "). They need not contain only letters.
MyMessage <- "Welcome to PPD! _@&"
MyMessage
## [1] "Welcome to PPD! _@&"
Without quotation marks, R will think that you refer to an object.
Hello
## Error in eval(expr, envir, enclos): object 'Hello' not found
You can combine several character objects
paste("My name", "is", "Léa", " ! :D")
## [1] "My name is Léa ! :D"
or extract part of one
substr("Bonjour", 2, 5)
## [1] "onjo"
In logical statements, e.g. “if A is equal to B, then apply function Y”, we use the following notation. Specifically, we use a double equals sign for ‘equal to’.
Statement | Meaning |
---|---|
== | equal to |
>= , <= | greater than or equal to, less than or equal to |
> , < | greater than, less than |
!= | not equal to |
& | and |
\| | or |
Keep in mind that parentheses matter!
1 == 1 | (2 == 2 & 1 == 2)
## [1] TRUE
(1 == 1 | 2 == 2) & 1 == 2
## [1] FALSE
Basic usage of logical values:
1==1
## [1] TRUE
is.numeric(MyMessage)
## [1] FALSE
123 > pi
## [1] TRUE
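A statement of the form "if A is equal to B, then do something" can be sketched as follows. The objects a, b and outcome are hypothetical names used only for this illustration:

```r
# Compare two values with == and branch on the result
a <- 4
b <- 2 + 2
if (a == b) {
  outcome <- "a and b are equal"
} else {
  outcome <- "a and b differ"
}
outcome
```

A single equals sign inside if() would be an assignment attempt rather than a comparison, which is why the double equals sign is needed.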
R has a number of basic data structures. A data structure is either homogeneous (all elements are of the same data type) or heterogeneous (elements can be of more than one data type).
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1 | Vector | List |
2 | Matrix | Data Frame |
3+ | Array | nested Lists |
In R, a vector is a sequence of objects that have the same class. To create a vector you should list its elements separated by commas inside c():
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
Vectors are ordered: you can recover elements of a vector using their position in the sequence:
days[4]
## [1] "Thursday"
Conversely, the function match() allows you to recover the position(s) of a specific element in a vector:
match("Friday", days)
## [1] 5
You can do basic computations with vectors:
4 + c(10, 20, 30)
## [1] 14 24 34
c(1, 2, 3) * 4
## [1] 4 8 12
4 ^ c(1, 2, 3)
## [1] 4 16 64
c(1, 2, 3) ^ 4
## [1] 1 16 81
In R, logical operators also work with vectors:
x = c(1, 3, 5, 7, 8, 9)
x > 3
## [1] FALSE FALSE TRUE TRUE TRUE TRUE
x == 3
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
They are also useful for subsetting:
x[x > 3]
## [1] 5 7 8 9
max(x)
## [1] 9
which(x == max(x))
## [1] 6
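Logical conditions can also be combined before subsetting. The sketch below reuses the same values of x and introduces three hypothetical object names to hold the results:

```r
x <- c(1, 3, 5, 7, 8, 9)
between <- x[x > 3 & x < 9]        # elements strictly between 3 and 9
members <- x[x %in% c(3, 8, 10)]   # elements of x that appear in another vector
n_above <- sum(x > 3)              # TRUE counts as 1, so this counts the matches
```

Here between contains 5, 7 and 8, members contains 3 and 8, and n_above is 4.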
The length() function gives you the number of elements in a vector:
length(days)
## [1] 7
The rep() function generates vectors by repeating things:
rep(c(1, 2, 3), 3)
## [1] 1 2 3 1 2 3 1 2 3
rep(c("a", "b", "c"), each = 2)
## [1] "a" "a" "b" "b" "c" "c"
The seq() function allows you to create vectors with sequences:
seq(0, 100, 5)
## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
## [20] 95 100
Sequences of consecutive integers can be easily produced using the “:” sign
1:8
## [1] 1 2 3 4 5 6 7 8
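The arguments of seq() can also be named, which is often clearer. This short sketch uses the by and length.out arguments, with hypothetical object names:

```r
quarters <- seq(0, 1, by = 0.25)             # step size given explicitly
five_points <- seq(0, 100, length.out = 5)   # number of points given instead of the step
countdown <- 5:1                             # the colon also works in decreasing order
```

The first call gives 0, 0.25, 0.5, 0.75, 1 and the second gives 0, 25, 50, 75, 100.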
You can append a string to each element of a vector with the function paste() (and the function paste0(), which is a shortcut for paste(..., sep = "")):
paste("A", 1:4)
## [1] "A 1" "A 2" "A 3" "A 4"
paste0("B", 1:4)
## [1] "B1" "B2" "B3" "B4"
You can also merge all the elements of a vector:
days
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
paste0(days, collapse=" ")
## [1] "Monday Tuesday Wednesday Thursday Friday Saturday Sunday"
Matrices have rows and columns containing a single data type. In a matrix, the order of rows and columns is important. (This is not the case for data frames, which we will see later.)
Matrices can be created using the matrix function.
x = 1:9
x
## [1] 1 2 3 4 5 6 7 8 9
X = matrix(x, nrow = 3, ncol = 3)
X
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
By default the matrix function fills your data into the matrix column by column. But we can also tell R to fill rows instead:
Y = matrix(x, nrow = 3, ncol = 3, byrow = TRUE)
We can also create a matrix of a specified dimension where every element is the same, in this case 0.
Z = matrix(0, 2, 5)
Z
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
Like vectors, matrices can be subsetted using square brackets, []. However, since matrices are two-dimensional, we need to specify both a row and a column when subsetting.
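Y[2][3] # chained single brackets do not index a row and a column: Y[2] is one value, so its third element is NA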
## [1] NA
X[1, ]
## [1] 1 4 7
Y[c(1,2), 2]
## [1] 2 5
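To make the difference between the two kinds of index explicit, here is a small sketch that rebuilds the matrix Y and stores a few subsets under hypothetical names:

```r
# Same matrix as Y above: filled row by row with 1 to 9
Y <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
cell <- Y[2, 3]        # row 2, column 3
third_col <- Y[, 3]    # every row of column 3
second_elem <- Y[2]    # a single index counts down the columns, so this is row 2 of column 1
```

With this Y, cell is 6, third_col is 3 6 9 and second_elem is 4.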
Matrices can also be created by combining vectors as columns, using cbind, or combining vectors as rows, using rbind.
x = 1:9
x
## [1] 1 2 3 4 5 6 7 8 9
rev(x)
## [1] 9 8 7 6 5 4 3 2 1
rep(1, 9)
## [1] 1 1 1 1 1 1 1 1 1
rbind(x, rev(x), rep(1, 9))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## x 1 2 3 4 5 6 7 8 9
## 9 8 7 6 5 4 3 2 1
## 1 1 1 1 1 1 1 1 1
cbind(col_1 = x, col_2 = rev(x), col_3 = rep(1, 9))
## col_1 col_2 col_3
## [1,] 1 9 1
## [2,] 2 8 1
## [3,] 3 7 1
## [4,] 4 6 1
## [5,] 5 5 1
## [6,] 6 4 1
## [7,] 7 3 1
## [8,] 8 2 1
## [9,] 9 1 1
The usual computations are done element by element:
X + Y
## [,1] [,2] [,3]
## [1,] 2 6 10
## [2,] 6 10 14
## [3,] 10 14 18
X - Y
## [,1] [,2] [,3]
## [1,] 0 2 4
## [2,] -2 0 2
## [3,] -4 -2 0
X * Y
## [,1] [,2] [,3]
## [1,] 1 8 21
## [2,] 8 25 48
## [3,] 21 48 81
X / Y
## [,1] [,2] [,3]
## [1,] 1.0000000 2.00 2.333333
## [2,] 0.5000000 1.00 1.333333
## [3,] 0.4285714 0.75 1.000000
Matrix multiplication uses %*%. Other matrix functions include t(), which gives the transpose of a matrix, and solve(), which returns the inverse of a square matrix if it is invertible.
X %*% Y
## [,1] [,2] [,3]
## [1,] 66 78 90
## [2,] 78 93 108
## [3,] 90 108 126
t(X)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
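The matrix X above is not invertible, so solve(X) would stop with an error. As a sketch of how solve() behaves on an invertible matrix, consider a hypothetical 2 by 2 matrix A:

```r
# A small invertible matrix, made up for illustration (determinant is 5)
A <- matrix(c(2, 1, 1, 3), nrow = 2)
A_inv <- solve(A)          # inverse of A
product <- A %*% A_inv     # should be the identity matrix, up to rounding error
```

Because of floating-point rounding, it is safer to compare product with diag(2) using all.equal() than with ==.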
A vector is a one-dimensional array. A matrix is a two-dimensional array. In R you can create arrays of arbitrary dimensionality N. Here is how:
d = 1:16
d1 = array(data = d,dim = c(4,2,2))
d2 = array(data = d,dim = c(4,2,2,3)) # will recycle 1:16
d1
## , , 1
##
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
##
## , , 2
##
## [,1] [,2]
## [1,] 9 13
## [2,] 10 14
## [3,] 11 15
## [4,] 12 16
d1 is simply two (4,2) matrices laid on top of each other, as if there were two pages. Similarly, d2 has two pages, repeated across 3 registers in a fourth dimension. And so on. You can subset an array like you would a vector or a matrix, taking care to index each dimension:
d1[ ,1,1] # all elements from col 1, page 1
## [1] 1 2 3 4
d1[2:3, , ] # rows 2:3 from all pages
## , , 1
##
## [,1] [,2]
## [1,] 2 6
## [2,] 3 7
##
## , , 2
##
## [,1] [,2]
## [1,] 10 14
## [2,] 11 15
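Two functions that are often handy with arrays are dim(), which returns the size of each dimension, and apply(), which applies a function along a chosen dimension. A minimal sketch, rebuilding d1 as above:

```r
d1 <- array(data = 1:16, dim = c(4, 2, 2))
shape <- dim(d1)                   # size of each dimension: 4 rows, 2 columns, 2 pages
page_totals <- apply(d1, 3, sum)   # sum of all elements within each page
last_cell <- d1[4, 2, 2]           # row 4, column 2, page 2
```

Here page_totals is 36 for the first page (1 to 8) and 100 for the second (9 to 16).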
A list is a one-dimensional heterogeneous data structure. So it is indexed like a vector with a single integer value (or with a name), but each element can contain an element of any type. Lists are extremely useful and versatile objects, so make sure you understand their usage:
# creation without fieldnames
list(10, "Bonjour", FALSE)
## [[1]]
## [1] 10
##
## [[2]]
## [1] "Bonjour"
##
## [[3]]
## [1] FALSE
# creation with fieldnames
ex_list = list(
a = c(1, 2, 3, 4),
b = TRUE,
c = "PPD Master",
d = function(arg = 42) {print("Hello everyone!")},
e = diag(3)
)
Lists can be subset using two syntaxes, the $ operator, and square brackets []. The $ operator returns a named element of a list. The [] syntax returns a list, while the [[]] syntax returns an element of a list.
ex_list$e
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
ex_list[1:2]
## $a
## [1] 1 2 3 4
##
## $b
## [1] TRUE
ex_list[1]
## $a
## [1] 1 2 3 4
ex_list[[1]]
## [1] 1 2 3 4
ex_list[c("e", "a")]
## $e
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $a
## [1] 1 2 3 4
ex_list["e"]
## $e
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
ex_list[["e"]]
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
ex_list$d(arg = 1)
## [1] "Hello everyone!"
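The difference between [] and [[]] matters as soon as you want to compute with an element. The following sketch uses a smaller hypothetical list named ex:

```r
ex <- list(a = c(1, 2, 3, 4), b = TRUE)
class_single <- class(ex["a"])     # single brackets keep the list wrapper
class_double <- class(ex[["a"]])   # double brackets return the vector itself
avg <- mean(ex[["a"]])             # so functions such as mean() work on it
ex$c <- "new element"              # the $ operator can also add a named element
names(ex)
```

Calling mean(ex["a"]) instead would return NA with a warning, because the argument is still a list.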
A data frame is the most common way that we store and interact with data in economics.
data = data.frame(x = 1:10,
y = c(rep("Hello", 9), "Goodbye"),
z = rep(c(TRUE, FALSE), 5))
Unlike a matrix, a data frame is not required to have the same data type for each element. A data frame is a list of vectors, and each vector has a name. So, each vector must contain the same data type, but the different vectors can store different data types. Note, however, that all vectors must have the same length (which is the main difference from a list).
Again, we access any given column with the $ operator (as a vector):
data
## x y z
## 1 1 Hello TRUE
## 2 2 Hello FALSE
## 3 3 Hello TRUE
## 4 4 Hello FALSE
## 5 5 Hello TRUE
## 6 6 Hello FALSE
## 7 7 Hello TRUE
## 8 8 Hello FALSE
## 9 9 Hello TRUE
## 10 10 Goodbye FALSE
data$y
## [1] "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello"
## [8] "Hello" "Hello" "Goodbye"
length(data$x) == length(data$y) &&
length(data$y) == length(data$z)
## [1] TRUE
nrow(data)
## [1] 10
ncol(data)
## [1] 3
names(data)
## [1] "x" "y" "z"
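Since a data frame is a list of equal-length vectors, the list tools seen earlier also apply to it. A minimal sketch on a hypothetical three-row data frame named toy:

```r
toy <- data.frame(x = 1:3,
                  y = c("a", "b", "c"),
                  z = c(TRUE, FALSE, TRUE),
                  stringsAsFactors = FALSE)  # keep y as character in every R version
is_a_list <- is.list(toy)                    # a data frame is a special kind of list
col_classes <- sapply(toy, class)            # class of each column
same_column <- identical(toy[["x"]], toy$x)  # [[ ]] and $ return the same vector
```

Each column keeps its own class: integer for x, character for y and logical for z.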
We can use different functions to get to know what is in a data frame:
head(), which displays the first n observations of a data frame:
head(data) # default: n = 6
## x y z
## 1 1 Hello TRUE
## 2 2 Hello FALSE
## 3 3 Hello TRUE
## 4 4 Hello FALSE
## 5 5 Hello TRUE
## 6 6 Hello FALSE
head(data, n=2)
## x y z
## 1 1 Hello TRUE
## 2 2 Hello FALSE
str(), which displays the structure of the data frame:
str(data)
## 'data.frame': 10 obs. of 3 variables:
## $ x: int 1 2 3 4 5 6 7 8 9 10
## $ y: chr "Hello" "Hello" "Hello" "Hello" ...
## $ z: logi TRUE FALSE TRUE FALSE TRUE FALSE ...
You can subset data frames like matrices using square brackets [ , ], or you can use the function subset().
data[data$z == FALSE, c("x", "y")] # [row condition, column condition]
## x y
## 2 2 Hello
## 4 4 Hello
## 6 6 Hello
## 8 8 Hello
## 10 10 Goodbye
subset(data, subset = y == "Hello", select = c("x", "z"))
## x z
## 1 1 TRUE
## 2 2 FALSE
## 3 3 TRUE
## 4 4 FALSE
## 5 5 TRUE
## 6 6 FALSE
## 7 7 TRUE
## 8 8 FALSE
## 9 9 TRUE
R projects are good for managing your data and scripts in a particular folder on your computer. Using RStudio, click File -> New project to create a new R project in a new or existing folder. A good name for a new folder is something like “Class1”, which you can save somewhere logical on your computer, such as in a folder called “Introduction_to_R”.
Within the folder “Class1”, create 3 subfolders, “Data”, “Scripts” and “Output”. We will save all R code in the folder “Scripts”. A key advantage of using R projects is that all paths leading to our input data and output files will be relative to the location of the R project (the folder “Class1”).
Create a new R script by clicking File -> New file -> R Script. This should be saved in the folder “Scripts”. You can call this script something like “cleaning_paris_data”.
It is always a good idea to comment lines of code. Use # at the start of a line to write a comment, or to deactivate the line so that it does not run.
R comes with a number of built-in functions and datasets, but one of the main strengths of R as an open-source project is its package system. Packages add additional functions and data. Often, if you want to do something in R that is not available by default, there probably exists a package that does it. You can find all packages listed on the Comprehensive R Archive Network (CRAN).
To install a package, use the install.packages("package_name") function. This requires an internet connection. Once a package is installed, it must be loaded into your current R session before being used, with the library("package_name") function. Once you close R, all packages are unloaded. The next time you open R, you do not have to install the package again, but you do have to load any packages you intend to use by invoking library(). Thus, the first lines of our script will be:
### Installing and loading packages
# install.packages("tidyverse")
# install.packages("sf")
library("tidyverse")
library("sf")
A useful piece of code to install packages only if they are not already installed, then load them, is:
### installs if necessary and loads tidyverse and sf, another package which we will be using today
list.of.packages <- c("tidyverse", "sf")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")
invisible(lapply(list.of.packages, library, character.only = TRUE))
All the data for today’s exercise can be downloaded from here. Although we provide sources, you do not need to download the data from a given source.
For this exercise, we use the French population data at the IRIS level (50k units in metropolitan France) that can be downloaded from the Insee website. The shapefiles for the IRIS can be downloaded from here. The shapefiles for the arrondissements in Paris can be downloaded from here.
This data is in the form of a .csv (comma-separated values) file. It can be read using the function read_delim from the tidyverse package readr. To the left of the <- sign is the new R object we wish to define; to the right is how we wish to define it.
df <- read_delim(file = "Data/base-ic-evol-struct-pop-2013.csv", delim = ",", col_names = TRUE, skip = 5, locale = locale(encoding = "UTF-8"))
The options for the function read_delim can be found by typing ?read_delim in the console. Here, we just present a few frequently used options.
Argument | Description |
---|---|
file (required) | path to file (relative to R project) |
delim (required) | delimiter |
col_names (TRUE by default) | TRUE if first line is column names, else FALSE or a vector of column names |
skip (0 by default) | the number of lines to skip at the start |
locale | control the regional options, importantly the encoding |
We can check that our data frame df is how we want it to be by typing View(df) in the console, or by clicking on the data frame in the “Environment” panel.
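To see how skip and the column types interact, here is a self-contained sketch that writes a tiny made-up file to a temporary location and reads it back. The file contents, the object name toy and the choice to read the codes as character are assumptions for illustration, not the settings required by the Insee file:

```r
library(readr)

# Write a small hypothetical file: one line of notes, then the header, then the data
tmp <- tempfile(fileext = ".csv")
writeLines(c("Source: made-up data for illustration",
             "IRIS,DEP,P13_POP",
             "751010101,75,1200",
             "010010000,01,800"),
           tmp)

# Skip the note line and read the codes as text so that leading zeros survive
toy <- read_delim(file = tmp, delim = ",", skip = 1,
                  col_types = cols(.default = col_character(),
                                   P13_POP = col_double()))
toy
```

Reading geographic codes as character is a common precaution: if they were guessed as numbers, a code such as "01" would lose its leading zero.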
Encoding matters
Other functions in the tidyverse family of packages can be used to read most other classic types of data, for example read_csv (package readr), read_xls (package readxl), and read_dta, read_sas and read_sav (package haven). These functions work similarly.
A tibble in R is a standard R object used to store data sets. It is the more modern version of a data frame. A tibble consists of rows and columns, where each column contains one of the five basic classes of data.
When our data set was imported from .csv, R recognized character and numeric columns. We will later learn how to change column types.
Each of the columns may be accessed by its name, e.g. df$IRIS, or by its number, e.g. df[,2].
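The pipe operator %>% comes from the package magrittr and is made available by dplyr in the tidyverse. It passes the object on its left as the first argument of the function on its right, which makes it simple to chain transformations of a tibble. A cheatsheet for the dplyr package can be found on the homepage of this course, here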
The pipe function
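As a sketch of what the pipe does, the two computations below give the same result on a hypothetical three-row tibble named toy: the first nests the function calls, the second chains them with %>%.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(DEP = c("75", "75", "92"), P13_POP = c(100, 200, 300))

# Nested calls: read from the inside out
nested <- summarise(filter(toy, DEP == "75"), total = sum(P13_POP))

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(DEP == "75") %>%
  summarise(total = sum(P13_POP))
```

Both objects hold a single row with total equal to 300. The piped version is usually easier to read once several steps are chained.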
We want to select only the columns IRIS, COM, TYP_IRIS, P13_POP and the age variables P13_POP0014 through to P13_POP75P. We want to select the rows that denote data from Paris only. To select columns, we use the function select. To select rows, we use the function filter.
iris <- df %>%
filter(DEP=="75") %>%
select(IRIS, COM, TYP_IRIS, P13_POP, P13_POP0014:P13_POP75P)
The order of these two lines matters: if we select the columns first, then we can no longer use the variable DEP to filter the rows. It is also possible to deselect variables by putting a minus sign before the variable, e.g. select(-COM).
Columns can be renamed by using rename(new_name=old_name), or by integrating the new names into the select function, e.g. select(new_name_1=old_name_1, new_name_2=old_name_2).
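The three variants can be compared on a small hypothetical tibble. The object names and the new column names below are made up for illustration:

```r
library(dplyr)

toy <- tibble(IRIS = c("751010101", "751010102"),
              COM = c("75101", "75101"),
              P13_POP = c(1200, 900))

picked  <- toy %>% select(iris_code = IRIS, population = P13_POP)  # select and rename at once
renamed <- toy %>% rename(commune = COM)                           # rename, keep all columns
dropped <- toy %>% select(-COM)                                    # keep everything except COM
```

select() keeps only the columns it is given, whereas rename() leaves the other columns in place.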
We now wish to convert all the population variables to shares of the total population (proportions between 0 and 1, stored in new columns with the suffix _pc), and the TYP_IRIS variable to a factor. To modify one, many or all columns, we use the functions mutate, mutate_at or mutate_all.
iris <- df %>%
filter(DEP=="75") %>%
select(IRIS, COM, TYP_IRIS, P13_POP, P13_POP0014:P13_POP75P) %>%
mutate(TYP_IRIS = as.factor(TYP_IRIS)) %>%
mutate_at(vars(P13_POP0014:P13_POP75P), funs(pc=./P13_POP))
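In recent versions of dplyr, funs() is deprecated and across() is the suggested way to modify several columns at once. The sketch below is an assumed equivalent of the mutate_at step, shown on a made-up tibble and requiring dplyr 1.0.0 or later; it is not the code used in the rest of this class.

```r
library(dplyr)

# Hypothetical data: the second row has a population of zero
toy <- tibble(P13_POP = c(100, 0),
              P13_POP0014 = c(20, 0),
              P13_POP75P = c(10, 0))

toy_pc <- toy %>%
  mutate(across(P13_POP0014:P13_POP75P,
                ~ .x / P13_POP,            # divide each selected column by the total
                .names = "{.col}_pc"))     # keep the originals, add columns ending in _pc
names(toy_pc)
```

As with funs(pc = ...), the original columns are kept and the new ones are named P13_POP0014_pc and P13_POP75P_pc. The row with zero population gives NaN.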
We notice that there are some IRIS for which the population is 0. In these cases, when we divide by 0, we obtain the result NaN (not a number). We wish to convert these values to 0. We can use mutate_if to mutate only the columns satisfying a particular condition, and we can use the ifelse function to replace NaN by 0. The three arguments of the ifelse function are the condition to test, the value to return where the condition is TRUE, and the value to return where it is FALSE:
iris <- df %>%
filter(DEP=="75") %>%
select(IRIS, COM, TYP_IRIS, P13_POP, P13_POP0014:P13_POP75P) %>%
mutate(TYP_IRIS = as.factor(TYP_IRIS)) %>%
mutate_at(vars(P13_POP0014:P13_POP75P), funs(pc=./P13_POP)) %>%
mutate_if(is.numeric, funs(ifelse(is.nan(.), 0, .)))
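Outside of a pipeline, the same replacement can be tried on a plain vector. This is a minimal sketch with made-up values:

```r
shares <- c(0.25, 0 / 0, 0.5, NaN)             # 0 / 0 evaluates to NaN
cleaned <- ifelse(is.nan(shares), 0, shares)   # test, value if TRUE, value if FALSE
cleaned
```

ifelse() works element by element, so each NaN is replaced by 0 while the other values are left unchanged.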
Here we learn two simple functions for string variables, substr and paste0.
Say we wish to convert the column COM into a more readable string, e.g. instead of “75114”, we wish to write “Paris 14”. We use the function substr to extract from the 4th to the 5th position of the string, and paste0 to concatenate strings.
iris <- df %>%
filter(DEP=="75") %>%
select(IRIS, COM, TYP_IRIS, P13_POP, P13_POP0014:P13_POP75P) %>%
mutate(TYP_IRIS = as.factor(TYP_IRIS)) %>%
mutate_at(vars(P13_POP0014:P13_POP75P), funs(pc=./P13_POP)) %>%
mutate_if(is.numeric, funs(ifelse(is.nan(.), 0, .))) %>%
mutate(name_arrd = substr(COM, 4, 5)) %>%
mutate(name_arrd = paste0("Paris ", name_arrd))
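On a plain vector of commune codes, the two functions behave as follows. The vector com is made up for illustration, and the conversion through as.integer() is an optional extra step rather than part of the pipeline above:

```r
com <- c("75101", "75114")
padded <- paste0("Paris ", substr(com, 4, 5))                 # keeps the leading zero
unpadded <- paste0("Paris ", as.integer(substr(com, 4, 5)))   # drops the leading zero
```

The first version gives "Paris 01" and "Paris 14", the second "Paris 1" and "Paris 14".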
We now wish to group the IRISes by arrondissement, in order to obtain aggregated statistics of the population by arrondissement. Using the function group_by, we can group the rows by COM, which indicates the arrondissement. We can use the function summarise_all, which works in the same way as mutate_all, to aggregate our data by group. After this aggregation, we need to ungroup our data frame.
arrd <- iris %>%
select(COM, P13_POP, P13_POP0014:P13_POP75P) %>%
group_by(COM) %>%
summarise_all(funs(sum(.))) %>%
ungroup %>%
mutate_at(vars(P13_POP0014:P13_POP75P), funs(pc=./P13_POP)) %>%
mutate_if(is.numeric, funs(ifelse(is.nan(.), 0, .)))
The final two lines are the same as before.
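The aggregation can be followed step by step on a made-up tibble with three IRISes in two arrondissements. This sketch uses across() inside summarise(), an assumed alternative to summarise_all that requires dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical counts for three IRISes
toy <- tibble(COM = c("75101", "75101", "75102"),
              P13_POP = c(100, 200, 50),
              P13_POP75P = c(10, 30, 5))

toy_arrd <- toy %>%
  group_by(COM) %>%
  summarise(across(everything(), sum)) %>%   # sum every non-grouping column within each COM
  ungroup() %>%
  mutate(P13_POP75P_pc = P13_POP75P / P13_POP)
toy_arrd
```

The shares are recomputed from the summed counts rather than averaged across IRISes, so that larger IRISes weigh more in the arrondissement figure.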
Our data is currently in wide format. To change it from wide to long format, we use the function gather, and to change it from long to wide format, we use the function spread.
long <- arrd %>%
gather(key = population_variable, value = value, -COM)
wide <- long %>%
spread(key = population_variable, value = value)
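In tidyr 1.0.0 and later, pivot_longer and pivot_wider are the suggested replacements for gather and spread. The following sketch shows an assumed equivalent on a made-up tibble:

```r
library(dplyr)
library(tidyr)

toy <- tibble(COM = c("75101", "75102"),
              P13_POP = c(300, 50),
              P13_POP75P = c(40, 5))

# Wide to long: one row per arrondissement and variable
toy_long <- toy %>%
  pivot_longer(cols = -COM,
               names_to = "population_variable",
               values_to = "value")

# Long to wide: back to one row per arrondissement
toy_wide <- toy_long %>%
  pivot_wider(names_from = population_variable, values_from = value)
```

The long version has four rows (two arrondissements times two variables), and converting back recovers the original table.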
We can write data in .csv format using write_csv. We can also use .rds format (R dataset) in order to preserve the tibble attributes, such as which variables are factor variables.
iris <- df %>%
filter(DEP=="75") %>%
select(IRIS, COM, TYP_IRIS, P13_POP, P13_POP0014:P13_POP75P) %>%
mutate(TYP_IRIS = as.factor(TYP_IRIS)) %>%
mutate_at(vars(P13_POP0014:P13_POP75P), funs(pc=./P13_POP)) %>%
mutate_if(is.numeric, funs(ifelse(is.nan(.), 0, .))) %>%
mutate(name_arrd = substr(COM, 4, 5)) %>%
mutate(name_arrd = paste0("Paris ", name_arrd)) %>%
write_csv("Output/iris.csv") %>%
write_rds("Output/iris.rds")
arrd <- iris %>%
select(COM, P13_POP, P13_POP0014:P13_POP75P) %>%
group_by(COM) %>%
summarise_all(funs(sum(.))) %>%
ungroup %>%
mutate_at(vars(P13_POP0014:P13_POP75P), funs(pc=./P13_POP)) %>%
mutate_if(is.numeric, funs(ifelse(is.nan(.), 0, .))) %>%
write_csv("Output/arrd.csv") %>%
write_rds("Output/arrd.rds")
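The difference between the two formats shows up when the files are read back. This sketch writes a made-up tibble to temporary files, so the paths and object names are hypothetical:

```r
library(dplyr)
library(readr)

toy <- tibble(IRIS = c("751010101", "751010102"),
              TYP_IRIS = factor(c("H", "A")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write_csv(toy, csv_path)
write_rds(toy, rds_path)

from_csv <- read_csv(csv_path, col_types = cols())  # column types are guessed again
from_rds <- read_rds(rds_path)                      # column types are restored as saved

class(from_csv$TYP_IRIS)
class(from_rds$TYP_IRIS)
```

After the round trip through .csv, TYP_IRIS comes back as a character column, whereas the .rds file returns it as a factor.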
There are four key types of joins.
Function | Meaning |
---|---|
left_join(a, b, by="x") | Join matching rows from b to a |
right_join(a, b, by="x") | Join matching rows from a to b |
inner_join(a, b, by="x") | Join data retaining rows in both sets |
full_join(a, b, by="x") | Join data retaining all rows |
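The four joins can be compared by counting the rows they return on two small made-up tibbles, a and b, that share the keys "B" and "C":

```r
library(dplyr)

a <- tibble(x = c("A", "B", "C"), pop = c(10, 20, 30))
b <- tibble(x = c("B", "C", "D"), area = c(1.5, 2.5, 3.5))

n_left  <- nrow(left_join(a, b, by = "x"))   # all rows of a: A, B, C
n_right <- nrow(right_join(a, b, by = "x"))  # all rows of b: B, C, D
n_inner <- nrow(inner_join(a, b, by = "x"))  # keys present in both: B, C
n_full  <- nrow(full_join(a, b, by = "x"))   # every key: A, B, C, D
```

Rows without a match in the other table are kept by the left, right and full joins, with NA in the columns that come from the other table.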
We will apply a join with geographical data, in order to display our variables on a map.
Shapefiles are a common format of geographical data. We can import them using the package sf, which is not part of the tidyverse, but follows the same syntax. We select only the variable corresponding to the IRIS code, and call this IRIS to match our other data set. We then apply a right_join to join the geographical data to the iris tibble that we have created.
irisshp <- read_sf(dsn = "Data/iris", layer = "CONTOURS-IRIS") %>%
select(IRIS=CODE_IRIS) %>%
right_join(iris, by="IRIS")
In the next class, we will plot data in a much nicer way using ggplot2. However, for now, we will simply use the plot function.
We wish to plot a demographic variable, such as the percentage of people over 75 years old, on a map of Paris. We select only the variable of interest then use the function plot.
iristoplot <- irisshp %>%
# mutate(P13_POP75P_pc=ifelse(TYP_IRIS=="H", P13_POP75P_pc, NA)) %>% ### optional line to exclude IRISes with no or few inhabitants
select(P13_POP75P_pc)
plot(iristoplot)
In order to save the plots, use the following code.
### to save plot use these two lines
# dev.copy(pdf, 'Output/age.pdf')
# dev.off()
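An alternative to copying the current plot is to open the pdf device before plotting and close it afterwards. The sketch below uses a temporary file and a placeholder plot of the numbers 1 to 10, both assumptions for illustration; in the exercise the call would be plot(iristoplot) and the path something like 'Output/age.pdf'.

```r
out_file <- tempfile(fileext = ".pdf")  # hypothetical output path
pdf(out_file)       # open the pdf device: plots now go to the file
plot(1:10)          # placeholder plot for illustration
dev.off()           # close the device so that the file is written
file.exists(out_file)
```

Nothing appears in the plot panel while the pdf device is open, since the drawing is sent directly to the file.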
The same plot by arrondissement is given by the following code.
arrdshp <- read_sf(dsn = "Data/arrondissements", layer = "arrondissements") %>%
select(COM=c_arinsee) %>%
mutate(COM=as.character(COM)) %>%
left_join(arrd, by="COM") %>%
select(P13_POP75P_pc)
plot(arrdshp)