This lesson covers some basic exposure to R functions and data objects.
Two major elements in R programming: functions and objects.
Functions perform operations (calculate a mean, build a table, create a graph, etc.)
Objects hold information (a collection of numbers, dates, words, model results, etc.)
Vectors are the fundamental data type in R. Most R functions are ‘vectorized’, meaning they’re optimized to work on vectors.
The “atomic” in atomic vectors means, “of or forming a single irreducible unit or component in a larger system.”
Logical vectors are handy because when we add their elements together, the total tells us how many TRUE values there are.
vec_logical <- c(TRUE, FALSE)
vec_logical
#> [1] TRUE FALSE
Integer vectors are created by following each number with a capital letter L.
vec_integer <- c(1L, 10L, 100L)
vec_integer
#> [1] 1 10 100
Double vectors can be entered as decimals
vec_double <- c(0.1, 1.0, 10.01)
vec_double
#> [1] 0.10 1.00 10.01
Note that character vectors need to be in quotes.
vec_character <- c("A", "B", "C")
vec_character
#> [1] "A" "B" "C"
Store and explore - create an object, perform an operation on the object, store the results, then explore the contents with another function.
typeof()
Explore all vectors with typeof()
typeof(vec_integer)
#> [1] "integer"
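As a small sketch of the store-and-explore pattern, we can run typeof() on the other atomic vectors as well. The vectors are re-created here only so the block runs on its own.

```r
# re-create three of the atomic vectors from above
vec_logical <- c(TRUE, FALSE)
vec_double <- c(0.1, 1.0, 10.01)
vec_character <- c("A", "B", "C")
# typeof() reports how each vector is stored
typeof(vec_logical)
#> [1] "logical"
typeof(vec_double)
#> [1] "double"
typeof(vec_character)
#> [1] "character"
```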
is.integer()
Integers have no decimals.
is.integer(vec_integer)
#> [1] TRUE
is.numeric()
Evaluate numeric vectors with is.numeric()
is.numeric(vec_double)
#> [1] TRUE
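The sketch below is our own illustration of how these checks relate: is.numeric() is TRUE for both integer and double vectors, while is.integer() checks how a value is stored, not whether it happens to be a whole number.

```r
vec_integer <- c(1L, 10L, 100L)
vec_double <- c(0.1, 1.0, 10.01)
# both integer and double vectors count as numeric
is.numeric(vec_integer)
#> [1] TRUE
is.numeric(vec_double)
#> [1] TRUE
# 1 without the L suffix is stored as a double
is.integer(1)
#> [1] FALSE
is.integer(1L)
#> [1] TRUE
```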
is.logical()
Use is.logical() to check whether a vector is logical.
is.logical(vec_logical)
#> [1] TRUE
Recall that you can sum logical vectors.
TRUE + TRUE + FALSE + TRUE
#> [1] 3
Great for subsetting too.
vec_integer > 5
#> [1] FALSE TRUE TRUE
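To make the subsetting idea concrete, here is a minimal sketch. The name vec_keep is ours and is not part of the lesson; it stores the logical result so we can count it and then use it inside square brackets.

```r
vec_integer <- c(1L, 10L, 100L)
# the comparison returns a logical vector
vec_keep <- vec_integer > 5
# sum() counts the TRUE values
sum(vec_keep)
#> [1] 2
# a logical vector inside [ ] keeps only the TRUE positions
vec_integer[vec_keep]
#> [1]  10 100
```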
is.character()
Check character vectors with is.character()
is.character(vec_character)
#> [1] TRUE
R is often referred to as a “vector-oriented”, “vectorized”, or “element-wise” language because of the way it deals with vectors. We will show an example of this behavior below:
The code below creates a sequence of ten values between 1.5 and 10.5.
vec_seq_01 <- 1.5:10.5
length(vec_seq_01)
#> [1] 10
vec_seq_01
#> [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5
Now we create vec_seq_02, which has a sequence of ten values between 0.2 and 2.0.
vec_seq_02 <- c(0.2, 0.4, 0.6, 0.8, 1.0,
                1.2, 1.4, 1.6, 1.8, 2.0)
vec_seq_02
#> [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0
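vec_seq_02 was typed out by hand. As a hedged alternative, seq() can generate the same sequence; vec_seq_02_alt is a name we introduce here for illustration. Because of floating-point arithmetic, the sketch compares the two vectors with all.equal() instead of testing for exact equality.

```r
vec_seq_02 <- c(0.2, 0.4, 0.6, 0.8, 1.0,
                1.2, 1.4, 1.6, 1.8, 2.0)
# generate the sequence instead of typing each value
vec_seq_02_alt <- seq(from = 0.2, to = 2.0, by = 0.2)
length(vec_seq_02_alt)
#> [1] 10
# all.equal() allows for tiny floating-point differences
all.equal(vec_seq_02, vec_seq_02_alt)
#> [1] TRUE
```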
Now we subtract vec_seq_02 from vec_seq_01:
vec_seq_01 - vec_seq_02
#> [1] 1.3 2.1 2.9 3.7 4.5 5.3 6.1 6.9 7.7 8.5
The result is a vector in which the first element of vec_seq_02 is subtracted from the first element of vec_seq_01, the second from the second, and so on:
vec_seq_01[1] - vec_seq_02[1]
#> [1] 1.3
vec_seq_01[2] - vec_seq_02[2]
#> [1] 2.1
vec_seq_01[3] - vec_seq_02[3]
#> [1] 2.9
Both vec_seq_01 and vec_seq_02 have equal lengths. What happens when we apply vectorized operations to objects of unequal length? R recycles the shorter vector, reusing its elements until it matches the length of the longer one. We will demonstrate this by creating vec_seq_03, which only has three numbers in it:
vec_seq_03 <- vec_seq_02[1:3]
vec_seq_03
#> [1] 0.2 0.4 0.6
When we subtract vec_seq_03 from vec_seq_01, R still returns a result, but it also issues a warning that the longer object length is not a multiple of the shorter object length:
vec_seq_01 - vec_seq_03
#> [1] 1.3 2.1 2.9 4.3 5.1 5.9 7.3 8.1 8.9 10.3
This is telling us R attempted to subtract each element by position, but ran out of numbers.
That’s why the first three numbers of vec_seq_01 - vec_seq_03 look identical to vec_seq_01 - vec_seq_02:
# compare first three elements
vec_seq_01[1:3] - vec_seq_02[1:3]
#> [1] 1.3 2.1 2.9
# compare first three elements
vec_seq_01[1:3] - vec_seq_03[1:3]
#> [1] 1.3 2.1 2.9
But when R goes looking for an element at position vec_seq_03[4], it finds nothing (NA):
vec_seq_03[4]
#> [1] NA
So it starts again at the beginning of the vector. Look at the code below to see how this behavior creates the values in vec_seq_01 - vec_seq_03.
vec_seq_01 - vec_seq_03
#> [1] 1.3 2.1 2.9 4.3 5.1 5.9 7.3 8.1 8.9 10.3
vec_seq_01[1:3] - vec_seq_03[1:3]
#> [1] 1.3 2.1 2.9
vec_seq_01[4:6] - vec_seq_03[1:3]
#> [1] 4.3 5.1 5.9
vec_seq_01[7:9] - vec_seq_03[1:3]
#> [1] 7.3 8.1 8.9
Where does the 10.3 come from? That is what’s left over (and why we get the warning message).
vec_seq_01[10] - vec_seq_03[1]
#> [1] 10.3
If the length of vec_seq_01 were a multiple of the length of the shorter vector, R would still perform recycling (but without a warning message). vec_seq_04 below has five elements, and ten is a multiple of five:
vec_seq_04 <- vec_seq_02[1:5]
vec_seq_01 - vec_seq_04
#> [1] 1.3 2.1 2.9 3.7 4.5 6.3 7.1 7.9 8.7 9.5
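One way to see recycling explicitly is sketched below. rep_len() is a base R function that repeats a vector until it reaches a given length; vec_recycled is a name we introduce for illustration. The last call captures the warning text that R raises when the lengths do not divide evenly.

```r
vec_seq_01 <- 1.5:10.5
vec_seq_03 <- c(0.2, 0.4, 0.6)
# repeat vec_seq_03 until it has as many elements as vec_seq_01
vec_recycled <- rep_len(vec_seq_03, length.out = length(vec_seq_01))
vec_recycled
#> [1] 0.2 0.4 0.6 0.2 0.4 0.6 0.2 0.4 0.6 0.2
# same values as vec_seq_01 - vec_seq_03, but no warning
vec_seq_01 - vec_recycled
#> [1]  1.3  2.1  2.9  4.3  5.1  5.9  7.3  8.1  8.9 10.3
# capture the warning from the unequal-length subtraction
tryCatch(vec_seq_01 - vec_seq_03,
         warning = function(w) conditionMessage(w))
#> [1] "longer object length is not a multiple of shorter object length"
```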
S3 vectors include factors, dates, date-times, and durations (difftimes).
Factors are categorical vectors with a given set of responses.
vec_factor <- factor(x = c("low", "medium", "high"))
vec_factor
#> [1] low medium high
#> Levels: high low medium
# Not character variables!
typeof(vec_factor)
#> [1] "integer"
We can manually assign the order of factor levels with the levels argument in factor().
vec_factor <- factor(x = c("medium", "high", "low"),
                     levels = c("low", "medium", "high"))
# check with:
levels(vec_factor)
#> [1] "low" "medium" "high"
unclass(vec_factor)
#> [1] 2 3 1
#> attr(,"levels")
#> [1] "low" "medium" "high"
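Setting the levels matters because functions such as sort() follow the level order instead of alphabetical order. The block below is a small sketch of our own that also shows how to move between the integer codes and the character labels.

```r
vec_factor <- factor(x = c("medium", "high", "low"),
                     levels = c("low", "medium", "high"))
# sort() follows the level order, not the alphabet
as.character(sort(vec_factor))
#> [1] "low"    "medium" "high"
# the underlying integer codes
as.integer(vec_factor)
#> [1] 2 3 1
# the labels as a plain character vector
as.character(vec_factor)
#> [1] "medium" "high"   "low"
```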
R also comes with a few functions that return the current date and date-time (Sys.Date() and Sys.time()).
vec_date <- c(Sys.Date(), Sys.Date() + 1, Sys.Date() + 2)
vec_date
#> [1] "2021-11-30" "2021-12-01" "2021-12-02"
Dates are stored as double vectors with a class attribute set to "Date".
is.double(vec_date)
#> [1] TRUE
attributes(vec_date)
#> $class
#> [1] "Date"
The number for each date is accessible using the unclass() function:
unclass(vec_date)
#> [1] 18961 18962 18963
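To check how these numbers line up with calendar dates, the sketch below uses fixed dates instead of Sys.Date(), so the results do not depend on when the code is run. The origin argument is supplied explicitly as an assumption that keeps the last call working on older versions of R.

```r
# day zero is the Unix epoch
unclass(as.Date("1970-01-01"))
#> [1] 0
# the first date printed above is 18961 days after the epoch
unclass(as.Date("2021-11-30"))
#> [1] 18961
# convert a day count back into a Date
as.Date(18961, origin = "1970-01-01")
#> [1] "2021-11-30"
```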
These numbers count the days since the Unix epoch, which is January 1, 1970 (00:00:00 UTC).
The Unix epoch serves as a point in time in which the computer can calculate dates from. The actual date and time is arbitrary, but without a fixed point in time, there’s no way to quantify or measure a ‘date.’
Date-times contain a bit more information than dates. In R, the POSIX date-time format is the time since January 1, 1970 (in the UTC time zone), measured to the nearest second. POSIXct represents this as a numeric vector, and POSIXlt is a list of named date/time vectors.
vec_datetime_ct <- as.POSIXct(vec_date)
vec_datetime_ct
#> [1] "2021-11-29 17:00:00 MST" "2021-11-30 17:00:00 MST"
#> [3] "2021-12-01 17:00:00 MST"
typeof(vec_datetime_ct)
#> [1] "double"
vec_datetime_lt <- as.POSIXlt(vec_date)
vec_datetime_lt
#> [1] "2021-11-30 UTC" "2021-12-01 UTC" "2021-12-02 UTC"
typeof(vec_datetime_lt)
#> [1] "list"
For vec_datetime_ct, we can access the epoch time with unclass():
unclass(vec_datetime_ct)
#> [1] 1638230400 1638316800 1638403200
In vec_datetime_lt, each date/time component is stored in a named vector:
unclass(vec_datetime_lt)
#> $sec
#> [1] 0 0 0
#>
#> $min
#> [1] 0 0 0
#>
#> $hour
#> [1] 0 0 0
#>
#> $mday
#> [1] 30 1 2
#>
#> $mon
#> [1] 10 11 11
#>
#> $year
#> [1] 121 121 121
#>
#> $wday
#> [1] 2 3 4
#>
#> $yday
#> [1] 333 334 335
#>
#> $isdst
#> [1] 0 0 0
#>
#> attr(,"tzone")
#> [1] "UTC"
For the "Date" class, the number represents the number of days since the epoch, and for the "POSIXct" class, it represents the number of seconds (to the nearest second) since the epoch.
The "POSIXct"/"POSIXlt" classes also both contain a time-zone (explore your system’s time-zone with Sys.timezone(location = TRUE)).
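The relationship between days, seconds, and time zones can be sketched with a fixed date-time. dt_utc is a name we introduce here, and "America/Denver" is only our guess at a zone that reproduces the MST output shown earlier; it assumes that zone is available in your system’s time-zone database.

```r
# midnight UTC on 2021-11-30 as a date-time
dt_utc <- as.POSIXct("2021-11-30 00:00:00", tz = "UTC")
# seconds since the epoch
as.numeric(dt_utc)
#> [1] 1638230400
# 18961 days since the epoch, times 86400 seconds per day
18961 * 86400
#> [1] 1638230400
# the same instant displayed in another time zone
format(dt_utc, tz = "America/Denver", usetz = TRUE)
#> [1] "2021-11-29 17:00:00 MST"
```

The stored number does not change with the time zone; only the printed representation does.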
We need two times to be able to calculate the difftime, but the output is fairly clear.
time_01 <- Sys.Date()
time_02 <- Sys.Date() + 10
vec_difftime <- difftime(time_01, time_02, units = "days")
vec_difftime
#> Time difference of -10 days
difftime objects have class and units attributes.
attributes(vec_difftime)
#> $class
#> [1] "difftime"
#>
#> $units
#> [1] "days"
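Since the units are stored as an attribute, the same duration can be expressed in other units. The sketch below uses a fixed date in place of Sys.Date() so it is reproducible; the choice of hours and weeks is ours.

```r
time_01 <- as.Date("2021-11-30")
time_02 <- time_01 + 10
vec_difftime <- difftime(time_01, time_02, units = "days")
# drop the class and keep the number
as.numeric(vec_difftime)
#> [1] -10
# express the same duration in hours
as.numeric(vec_difftime, units = "hours")
#> [1] -240
# change the units attribute in place
units(vec_difftime) <- "weeks"
vec_difftime
#> Time difference of -1.428571 weeks
```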
In the next sections we’ll cover matrices and arrays, two of R’s multidimensional objects.
Matrices are two-dimensional objects. We can create them with matrix().
We can make a matrix using our existing vectors.
mat_data <- matrix(data = c(vec_double, vec_integer),
                   nrow = 3, ncol = 2, byrow = FALSE)
mat_data
#> [,1] [,2]
#> [1,] 0.10 1
#> [2,] 1.00 10
#> [3,] 10.01 100
We can check the dimensions of mat_data with dim().
dim(mat_data)
#> [1] 3 2
We can subset using position.
mat_data[2, 2]
#> [1] 10
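Two details are worth a quick sketch here. A matrix holds a single type, so combining a double vector with an integer vector coerces everything to double, and leaving an index empty selects a whole row or column. The block re-creates mat_data so it runs on its own.

```r
vec_double <- c(0.1, 1.0, 10.01)
vec_integer <- c(1L, 10L, 100L)
mat_data <- matrix(data = c(vec_double, vec_integer),
                   nrow = 3, ncol = 2, byrow = FALSE)
# combining doubles and integers coerces everything to double
typeof(mat_data)
#> [1] "double"
# leave the column index empty to select the whole second row
mat_data[2, ]
#> [1]  1 10
# leave the row index empty to select the whole second column
mat_data[, 2]
#> [1]   1  10 100
```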
Arrays are like matrices, but with more dimensions.
The dim argument gives the size of each dimension. Here we use three dimensions (3 rows, 3 columns, and 2 layers).
dat_array <- array(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                            14, 15, 16, 17, 18), dim = c(3, 3, 2))
dat_array
#> , , 1
#>
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
#>
#> , , 2
#>
#> [,1] [,2] [,3]
#> [1,] 10 13 16
#> [2,] 11 14 17
#> [3,] 12 15 18
Every matrix is an array, but not every array is a matrix.
class(dat_array)
#> [1] "array"
class(mat_data)
#> [1] "matrix" "array"
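Subsetting an array works like subsetting a matrix, with one index per dimension. The sketch below is our own and fills the array with 1:18 (integers) as a shortcut for typing the eighteen values out.

```r
dat_array <- array(data = 1:18, dim = c(3, 3, 2))
dim(dat_array)
#> [1] 3 3 2
# row 2, column 3 of the second 3 x 3 layer
dat_array[2, 3, 2]
#> [1] 17
# a three-dimensional array is not a matrix
is.matrix(dat_array)
#> [1] FALSE
# but a single layer of it is
is.matrix(dat_array[, , 1])
#> [1] TRUE
```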
If you’re importing spreadsheets, most of the work you’ll do in R will be with rectangular data objects (i.e. data.frames and tibbles).
data.frames hold rectangular data with rows and columns.
We will create a data frame below using data.frame(). In the call, each column is supplied as its own vector, so the data are written transposed relative to how they print (i.e. each column’s values are written left-to-right).
DataFrame <- data.frame(character = c("A", "B", "C"),
                        integer = c(0.1, 1.0, 10.01),
                        logical = c(TRUE, FALSE, TRUE),
                        stringsAsFactors = FALSE)
DataFrame
NOTE: stringsAsFactors = FALSE is no longer required as of R version 4.0.0, where it became the default.
Check the structure of the data.frame with str().
str(DataFrame)
#> 'data.frame': 3 obs. of 3 variables:
#> $ character: chr "A" "B" "C"
#> $ integer : num 0.1 1 10
#> $ logical : logi TRUE FALSE TRUE
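Beyond str(), a few base functions help explore a data.frame. The sketch below re-creates DataFrame and shows that it is stored as a list of columns, and that the column named integer actually holds doubles, which matches the num label in the str() output.

```r
DataFrame <- data.frame(character = c("A", "B", "C"),
                        integer = c(0.1, 1.0, 10.01),
                        logical = c(TRUE, FALSE, TRUE),
                        stringsAsFactors = FALSE)
# a data.frame is stored as a list of equal-length columns
typeof(DataFrame)
#> [1] "list"
# pull out one column as a vector
DataFrame$logical
#> [1]  TRUE FALSE  TRUE
# the column named integer actually stores doubles
typeof(DataFrame$integer)
#> [1] "double"
# rows, then columns
dim(DataFrame)
#> [1] 3 3
```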
Tibbles are special kinds of data.frames (they print better to the console, and character vectors are never coerced into factors).
Creating tibbles with tibble::tribble() is not transposed: the values are typed row by row, the way they will print.
Tibble <- tibble::tribble(
  ~character, ~integer, ~logical,
  "A", 0.1, TRUE,
  "B", 1, FALSE,
  "C", 10.01, TRUE)
Tibble
tibbles are S3 objects with the classes tbl_df, tbl, and data.frame.
str(Tibble)
#> tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
#> $ character: chr [1:3] "A" "B" "C"
#> $ integer : num [1:3] 0.1 1 10
#> $ logical : logi [1:3] TRUE FALSE TRUE
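For comparison, tibbles can also be built column by column with tibble::tibble(), much like data.frame(). This is a hedged sketch: it assumes the tibble package is installed, skips itself if it is not, and Tibble_cols is a name we made up for illustration.

```r
# only run if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  # column-wise creation, one vector per column
  Tibble_cols <- tibble::tibble(
    character = c("A", "B", "C"),
    integer = c(0.1, 1, 10.01),
    logical = c(TRUE, FALSE, TRUE))
  # same three classes as the tribble() version
  print(class(Tibble_cols))
  #> [1] "tbl_df"     "tbl"        "data.frame"
  # so functions written for data.frames still accept it
  print(is.data.frame(Tibble_cols))
  #> [1] TRUE
}
```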
Lists are unique objects in R because they are recursive vectors. We’ve pointed out that atomic vectors can’t be broken down into smaller components, but a list can store objects of multiple types (character, numeric, logical, etc.), data.frames, tibbles, and even other lists.
Unlike data.frames and tibbles (which require each column to be of equal length), lists can store objects of different types and sizes.
We can put several of the objects we’ve created into dat_list.
dat_list <- list("integer" = vec_integer,
                 "array" = dat_array,
                 "matrix data" = mat_data,
                 "data frame" = DataFrame,
                 "tibble" = Tibble)
dat_list
#> $integer
#> [1] 1 10 100
#>
#> $array
#> , , 1
#>
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
#>
#> , , 2
#>
#> [,1] [,2] [,3]
#> [1,] 10 13 16
#> [2,] 11 14 17
#> [3,] 12 15 18
#>
#>
#> $`matrix data`
#> [,1] [,2]
#> [1,] 0.10 1
#> [2,] 1.00 10
#> [3,] 10.01 100
#>
#> $`data frame`
#> character integer logical
#> 1 A 0.10 TRUE
#> 2 B 1.00 FALSE
#> 3 C 10.01 TRUE
#>
#> $tibble
#> # A tibble: 3 × 3
#> character integer logical
#> <chr> <dbl> <lgl>
#> 1 A 0.1 TRUE
#> 2 B 1 FALSE
#> 3 C 10.0 TRUE
And we can see all of these elements are stored (with the appropriate name).
attributes(dat_list)
#> $names
#> [1] "integer" "array" "matrix data" "data frame" "tibble"
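Once elements are named, they can be pulled back out by name. The sketch below uses a smaller, self-contained stand-in for dat_list (the matrix filled with 1:6 is our own placeholder) to show the difference between the [[ ]], $, and [ ] operators.

```r
dat_list <- list("integer" = c(1L, 10L, 100L),
                 "matrix data" = matrix(1:6, nrow = 3, ncol = 2))
# [[ ]] returns the element itself
dat_list[["integer"]]
#> [1]   1  10 100
# $ does the same and can be chained with further subsetting
dat_list$integer[2]
#> [1] 10
# names with spaces need backticks when used with $
dim(dat_list$`matrix data`)
#> [1] 3 2
# single [ ] returns a shorter list, not the element
typeof(dat_list["integer"])
#> [1] "list"
```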
Think of data.frames and tibbles as special kinds of rectangular lists, made with different types of vectors, with each vector being of equal length.