In R you have multiple options when repeating calculations: vectorized operations, for loops, and apply functions.
This lesson is an extension of Analyzing Multiple Data Sets.
In that lesson, we introduced how to run a custom function, analyze, over multiple data files:
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
filenames <- list.files(pattern = "csv")
A key difference between R and many other languages is a topic known as vectorization.
When you wrote the total function, we mentioned that R already has sum to do this; sum is much faster than the interpreted for loop because sum is coded in C to work with a vector of numbers.
Many of R's functions work this way; the loop is hidden from you in C.
Learning to use vectorized operations is a key skill in R.
For example, to add pairs of numbers contained in two vectors
a <- 1:10
b <- 1:10
you could loop over the pairs adding each in turn, but that would be very inefficient in R.
res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
[1] 2 4 6 8 10 12 14 16 18 20
Instead, + is a vectorized function which can operate on entire vectors at once:
res2 <- a + b
all.equal(res, res2)
[1] TRUE
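To get a feel for the difference, here is a minimal sketch that times both approaches on longer vectors. The function names total and loop_add are illustrative assumptions (total is a re-creation of a loop-based sum, not necessarily the version you wrote earlier), and the timings you see will depend on your machine.

```r
# Hypothetical loop-based sum, written only for this comparison.
total <- function(vec) {
  vec_sum <- 0
  for (num in vec) {
    vec_sum <- vec_sum + num
  }
  vec_sum
}

# Hypothetical loop-based element-wise addition.
loop_add <- function(a, b) {
  res <- numeric(length = length(a))
  for (i in seq_along(a)) {
    res[i] <- a[i] + b[i]
  }
  res
}

x <- as.numeric(1:1e6)
y <- as.numeric(1:1e6)

# Element-wise addition: explicit loop versus the vectorized +
system.time(res_loop <- loop_add(x, y))
system.time(res_vec <- x + y)
all.equal(res_loop, res_vec)

# Summation: explicit loop versus the built-in sum
system.time(t_loop <- total(x))
system.time(t_sum <- sum(x))
all.equal(t_loop, t_sum)
```

Both pairs of results should agree; only the elapsed times differ.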
for or apply?
A for loop is used to apply the same function calls to a collection of objects.
R has a family of functions, the apply family, which can be used in much the same way.
You've already used one of the family, apply, in the first lesson.
The apply family members include:
- apply: apply over the margins of an array (e.g. the rows or columns of a matrix)
- lapply: apply over an object and return a list
- sapply: apply over an object and return a simplified object (an array) if possible
- vapply: similar to sapply, but you specify the type of object returned by the iterations
Each of these has an argument FUN which takes a function to apply to each element of the object.
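As a small illustration of how the family members differ in what they return, the sketch below applies sum and mean to made-up data. The object names (m, x, and the result variables) are assumptions chosen for this example only.

```r
# A small matrix: 2 rows, 3 columns
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)   # margin 1 = rows
col_sums <- apply(m, 2, sum)   # margin 2 = columns

# A named list of numeric vectors
x <- list(a = 1:3, b = 4:6)
as_list <- lapply(x, FUN = mean)     # always returns a list
as_vector <- sapply(x, FUN = mean)   # simplified to a named vector here
# vapply also checks that each result is a single number
checked <- vapply(x, FUN = mean, FUN.VALUE = numeric(1))

row_sums
col_sums
str(as_list)
as_vector
checked
```

vapply stops with an error if any result does not match FUN.VALUE, which can make it a safer choice inside functions.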
Instead of looping over filenames and calling analyze, as you did earlier, you could sapply over filenames with FUN = analyze:
sapply(filenames, FUN = analyze)
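Because analyze only draws plots, the value returned by sapply is not very informative. The sketch below shows the same pattern with a function that returns a number, so you can see how sapply collects the results. The helper file_max and the small CSV files written to a temporary directory are assumptions made up for this example; they are not the inflammation data.

```r
# Write three small CSV files of made-up data to a temporary directory.
tmp_dir <- file.path(tempdir(), "sapply_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  demo <- matrix(i * (1:6), nrow = 2)
  write.table(demo, file.path(tmp_dir, paste0("demo-0", i, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

# Hypothetical helper: return the largest value stored in one file.
file_max <- function(filename) {
  dat <- read.csv(file = filename, header = FALSE)
  max(dat)
}

demo_files <- list.files(tmp_dir, pattern = "csv$", full.names = TRUE)
file_maxes <- sapply(demo_files, FUN = file_max)

# sapply names the results after the inputs; shorten them for display.
names(file_maxes) <- basename(names(file_maxes))
file_maxes
```

Each element of the result corresponds to one file, in the order the file names were supplied.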
Deciding whether to use for or one of the apply family is really personal preference.
Using an apply family function forces you to encapsulate your operations as a function rather than as separate calls inside a for loop.
for loops are more natural in some circumstances; for several related operations, a for loop saves you from having to pass a lot of extra arguments to your function.
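To make the point about extra arguments concrete, here is a minimal sketch. The names rescale, multiplier, and offset are hypothetical and exist only for this illustration; the arithmetic itself could of course be vectorized directly.

```r
values <- c(1, 2, 3)
multiplier <- 10
offset <- 0.5

# With sapply the per-element work must live in a function, and every extra
# input has to be passed along (here through sapply's ... argument).
rescale <- function(x, multiplier, offset) {
  x * multiplier + offset
}
res_apply <- sapply(values, FUN = rescale,
                    multiplier = multiplier, offset = offset)

# With a for loop the body can use the surrounding variables directly.
res_loop <- numeric(length = length(values))
for (i in seq_along(values)) {
  res_loop[i] <- values[i] * multiplier + offset
}

all.equal(res_apply, res_loop)
```

Both versions give the same result; which one reads better is a matter of taste.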
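Are loops in R slow?
No, they are not, provided you follow some golden rules:
- Don't use a loop when a vectorized alternative exists.
- Don't grow objects (via c, cbind, etc.) during the loop. R has to create a new object and copy across the information just to add a new element or row/column.
- Allocate an object to hold the results and fill it in during the loop.
As an example, we'll create a new version of analyze that will return the mean inflammation per day (column) of each file.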
analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}
system.time(avg2 <- analyze2(filenames))
user system elapsed
0.044 0.000 0.045
Note how we add a new column to out at each iteration?
This is a cardinal sin of writing a for loop in R.
Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results.
Then we loop over the files, but this time we fill in the f-th column of our results matrix out.
This time there is no copying/growing for R to deal with.
analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) ## assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}
system.time(avg3 <- analyze3(filenames))
user system elapsed
0.056 0.004 0.057
In this simple example there is little difference in the compute time of analyze2 and analyze3.
This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations.
If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
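To see the penalty appear, you can simulate many more iterations without reading any files. This is only a sketch: fake_result is a hypothetical stand-in for reading a file and averaging its columns, and n_cols = 2000 is an arbitrary choice. Timings will vary from machine to machine.

```r
n_cols <- 2000   # number of simulated "files" (arbitrary)
n_rows <- 40     # same number of values per file as assumed in analyze3

# Hypothetical stand-in for one file's column means.
fake_result <- function(i) rep(as.numeric(i), n_rows)

# Grow the result with cbind, copying the whole object on every iteration.
grow <- function() {
  out <- NULL
  for (i in seq_len(n_cols)) {
    out <- cbind(out, fake_result(i))
  }
  out
}

# Allocate the full matrix first, then fill one column per iteration.
preallocate <- function() {
  out <- matrix(NA_real_, nrow = n_rows, ncol = n_cols)
  for (i in seq_len(n_cols)) {
    out[, i] <- fake_result(i)
  }
  out
}

system.time(grown <- grow())
system.time(filled <- preallocate())
all.equal(grown, filled, check.attributes = FALSE)
```

The two matrices hold the same values; try increasing n_cols to watch the gap between the two timings widen.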
Note that apply handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to apply.
At its heart, apply is just a for loop with extra convenience.
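As one possible illustration, the per-file means could also be collected with vapply instead of an explicit loop. This is a sketch rather than part of the original analysis: analyze_vapply and its n_days argument are hypothetical names, it assumes every file has the same number of columns, and the demo files are made-up data written to a temporary directory.

```r
# Assumption: every file has exactly n_days columns.
analyze_vapply <- function(filenames, n_days) {
  vapply(filenames,
         FUN = function(f) {
           fdata <- read.csv(f, header = FALSE)
           apply(fdata, 2, mean)
         },
         FUN.VALUE = numeric(n_days),
         USE.NAMES = FALSE)
}

# Three small made-up files, each with 2 rows and 4 columns.
tmp_dir <- file.path(tempdir(), "vapply_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  demo <- matrix(i * (1:8), nrow = 2)
  write.table(demo, file.path(tmp_dir, paste0("demo-0", i, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

demo_files <- list.files(tmp_dir, pattern = "csv$", full.names = TRUE)
avg_demo <- analyze_vapply(demo_files, n_days = 4)
dim(avg_demo)   # one row per column of the input, one column per file
```

Here vapply allocates the result matrix for you and checks that each file yields a numeric vector of the expected length.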