I’m excited to announce that peruse 0.3.0 is now available on CRAN!
Install peruse from CRAN with:
install.packages("peruse")
Alternatively, if you need the development version from GitHub, install it with:
devtools::install_github("jacgoldsm/peruse")
Release 0.3.0 has significant changes to the way that Iterators are implemented, but almost all existing code will still work. In addition, the new implementation will allow for more flexibility in the code that you can write. The two main changes are:
Iterators are now built over environments, not lists. This simplifies the implementation and allows for easier debugging. It also means that you have to be careful when copying an Iterator; see ?peruse::clone().
Iterators now have a formally defined search path, so you can reliably use environment variables with them.
You can now refer to the current iteration of yield_more(), yield_while(), move_more(), or move_while() with the .iter variable.
Formulas in set comprehension now work with any integer, not only with logical values, so f(x) works the same as f(x) != 0 in we_have().
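The copying caveat in the list above comes from how R treats environments: assigning a list makes an independent copy, while assigning an environment only creates a second name for the same object. The following is a minimal base-R sketch of that behavior. It does not use peruse, and all variable names are hypothetical.

```r
# Lists have copy semantics: modifying the copy leaves the original alone.
as_list <- list(n = 0)
list_copy <- as_list
list_copy$n <- 1
as_list$n   # still 0

# Environments have reference semantics: both names point at one object.
as_env <- new.env()
as_env$n <- 0
env_alias <- as_env   # an alias, not a copy
env_alias$n <- 1
as_env$n    # now 1

# An explicit (shallow) copy needs a fresh environment.
env_copy <- list2env(as.list(as_env, all.names = TRUE), envir = new.env())
env_copy$n <- 99
as_env$n    # still 1
```

This is presumably why an explicit clone step is needed for Iterators: plain assignment would leave two names sharing one state.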
Suppose we want to investigate the question of how many trials it takes for a random walk with drift to reach a given threshold. We know that this would follow a Negative Binomial distribution, but how could we use the Iterator to look at this empirically in a way that easily allows us to adjust the drift term and see how the result changes? We might do something like this:
p_success <- 0.5
threshold <- 100
expr <- "
set.seed(seeds[.iter])
n <- n + sample(c(1,-1), 1, prob = c(p_success, 1 - p_success))
"
iter <- Iterator(expr, list(n = 0, seeds = 1000:1e6), n)
sequence <- yield_while(iter, "n <= threshold")
plot(sequence, main = "How many iterations does it take to get to 100?")
How would we apply this same function to a grid of probabilities? We could do something like this:
probs <- seq(0.5,0.95, by = 0.01)
exprs <- rep(NA, length(probs))
num_iter <- rep(NA, length(probs))
threshold <- 20
seeds <- 1000:1e6
for (i in seq_along(probs)) {
  exprs[i] <- glue::glue(
    "
    set.seed(seeds[.iter])
    n <- n + sample(c(1,-1), 1, prob = c({probs[i]}, 1 - {probs[i]}))
    "
  )
  iter <- Iterator(exprs[i],
                   list(n = 0),
                   yield = n)
  num_iter[i] <- length(yield_while(iter, "n <= threshold"))
}
plot(x = probs,
     y = log(num_iter),
     main = "Probability of Success vs How long it takes to get to 20 (Log Scale)",
     xlab = "Probability of Success",
     ylab = "Log Number of Iterations")
This illustrates a few useful features of Iterators:
We can use environment variables in either our expression or our while condition to represent constants. In this case, threshold doesn’t change between iterations or between parameters. If you are creating many Iterators, it can be faster to use environment variables, since you don’t have to make a new object for each new Iterator.
We can use glue::glue() to generate a range of expressions that we can then fill in to create an Iterator with a range of parameters.
We can refer to the current iteration number in yield_while(), yield_more(), or their silent variants with the environment variable .iter.
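The first point relies on ordinary R scoping: a name that is not found in the environment holding the state is looked up in that environment's parent. Here is a minimal base-R sketch of the idea. It illustrates the lookup rule only and is not a description of peruse's internals; `state` and `step` are hypothetical names.

```r
threshold <- 3                              # a constant kept outside the state

# The state environment's parent is the environment where threshold lives.
state <- new.env(parent = environment())
state$n <- 0

step <- parse(text = "n <- n + 1")          # expression stored as text, then parsed
condition <- parse(text = "n <= threshold")

# n is found in state; threshold is found by walking up to the parent.
while (eval(condition, envir = state)) {
  eval(step, envir = state)
}
state$n
```

Because `threshold` is resolved through the parent environment, it never needs to be copied into the state itself.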
peruse has two main distinct capabilities, related by the idea that they ‘peruse’ a sequence:
Set or list comprehension, aimed at capturing sets that meet complex conditions
Iterators that make it easier to generate irregular sequences and sets that are difficult to generate with existing tools
In R, sequences are normally represented as atomic vectors. For example, here is how we might represent a weighted sequence of 50 1s and -1s, with 1 having 75% probability and -1 having 25% probability:
sample(c(-1L, 1L), size = 50L, prob = c(0.25, 0.75), replace = TRUE)
#> [1] 1 1 1 1 -1 1 1 1 1 1 -1 1 1 -1 1 1 1 -1 -1 -1 1 1 -1 1 1
#> [26] 1 1 -1 1 -1 1 1 1 1 1 1 1 -1 1 1 -1 1 -1 1 1 -1 1 -1 1 -1
From the perspective of the R user, all these values are generated at once. This brings up two issues:
How do we generate a recursive sequence, that is, a sequence in which each value determines subsequent values?
How do we generate a sequence that only generates until a condition is met if we do not know how long that will take in advance?
The Iterator object in peruse is made to solve these problems. For example, suppose we want to simulate a random walk with drift that has two end conditions: success is if/when it reaches 50, and failure is if/when it reaches -50. To be efficient, we want to stop the simulation when the sequence reaches either of the end conditions.
expr <- "
set.seed(seeds[.iter])
n <- n + sample(c(-1L, 1L), size = 1L, prob = c(0.25, 0.75))
"
rwd <- Iterator(result = expr,
                initial = list(n = 0, seeds = 1:1e3),
                yield = n)
Value <- yield_while(rwd, "n != 50L & n != -50L")
plot(Value, main = "The Value of the Iterator after a Given Number of Iterations")
This scenario illustrates the capabilities of the Iterator:
We defined the R expression for generating a new element in advance of the simulation
We defined initial values for all of the variables involved (in this case n, plus the vector of seeds)
We defined a variable to return each time yield_next() was called
We generated the sequence recursively, modifying the value of n each time we computed a new one
We made the process stop whenever n reached 50 or -50
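For comparison, here is roughly what the same kind of simulation looks like as a hand-written base-R loop. This is a sketch of the logic the Iterator packages up, not peruse's implementation; it uses a single seed rather than one seed per iteration, so it will not reproduce the exact path above.

```r
set.seed(1)

n <- 0L
path <- integer(0)

# Keep stepping until the walk hits either boundary.
while (n != 50L && n != -50L) {
  n <- n + sample(c(-1L, 1L), size = 1L, prob = c(0.25, 0.75))
  path <- c(path, n)   # record the value after each step
}

length(path)   # number of steps taken before stopping
```

The loop has to manage the state, the stopping rule, and the accumulation of results by hand, which is the bookkeeping that yield_while() handles in the Iterator version.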
peruse provides a simple API for set comprehension. R already makes it easy to build simple sets, like getting all the even numbers from 1 to 100:
(1:100)[which(1:100 %% 2 == 0)]
#> [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
#> [20] 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76
#> [39] 78 80 82 84 86 88 90 92 94 96 98 100
But more complex sets require comparing a set of elements to another set and only including an element if it matches a condition. For example, the set of prime numbers can be written as \(\{\, i \in \mathbb{N},\ i > 1 \mid \forall m \in \mathbb{N} \setminus \{1, i\},\ i \not\equiv 0 \pmod{m} \,\}\): no natural number other than 1 and \(i\) itself divides \(i\). How do we represent that in R? The set-builder API can help!
Here, we use set comprehension to generate the prime numbers from 2 to 100:
2:100 %>%
  that_for_all(range(2, .x)) %>%
  we_have(~.x %% .y != 0)
#> [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
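As a cross-check, the same condition can be written directly in base R. This is a minimal sketch that does not use peruse; `is_prime` and `primes_to_100` are hypothetical names. It tests each candidate against the divisors 2, ..., i - 1, which is enough because no number larger than i can divide it.

```r
is_prime <- function(i) {
  # Candidate divisors 2, ..., i - 1; this is empty when i == 2.
  divisors <- seq_len(i - 1L)[-1L]
  # all() of an empty logical vector is TRUE, so 2 counts as prime.
  all(i %% divisors != 0L)
}

primes_to_100 <- Filter(is_prime, 2:100)
primes_to_100
```

Note the use of seq_len() rather than 2:(i - 1), which would count down to 1 when i is 2.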
This doesn’t, however, help us if we want to generate a certain number of prime numbers, regardless of what interval they are in. Of course, we could generate a vector and then subset it, but that would be inefficient! We want to only generate what we need.
We can bring together the set-builder and Iterator capabilities to do that, for example with the first 100 primes:
# 10,000 is just a number that we can be pretty sure is sufficiently high
primes <- 2:10000 %>%
  that_for_all(range(2, .x)) %>%
  we_have(~.x %% .y != 0, "Iterator")
sequence <- yield_more(primes, 100)
sequence
#> [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61
#> [19] 67 71 73 79 83 89 97 101 103 107 109 113 127 131 137 139 149 151
#> [37] 157 163 167 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251
#> [55] 257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359
#> [73] 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463
#> [91] 467 479 487 491 499 503 509 521 523 541
This illustrates a few things:
We can generate an Iterator from a set by including the “Iterator” argument in we_have()
We can then generate, one element at a time, a sequence that would normally be generated all at once
We can then only generate the numbers we want and stop once we reach the hundredth element
We can use the helper function yield_more() to automate what would otherwise require a for loop
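To make the idea of generating only what we need concrete, here is a base-R sketch of a lazy prime generator built from a closure. It is an illustration of on-demand generation under simple assumptions (trial division, hypothetical names `make_prime_generator` and `next_prime`), not how peruse implements its Iterator.

```r
make_prime_generator <- function() {
  candidate <- 1L   # state kept in the enclosing environment
  function() {
    repeat {
      candidate <<- candidate + 1L
      divisors <- seq_len(candidate - 1L)[-1L]   # 2, ..., candidate - 1
      if (all(candidate %% divisors != 0L)) {
        return(candidate)
      }
    }
  }
}

next_prime <- make_prime_generator()

# Each call advances the generator by exactly one prime.
first_100 <- vapply(seq_len(100), function(i) next_prime(), integer(1))
first_100[100]
```

No upper bound has to be guessed here: the generator simply runs until the hundredth prime has been produced.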
peruse is a new package and needs help! If you run into a bug or think of a new feature that would work well in peruse, please open an issue.
Big thank you to Hadley Wickham, whose book Advanced R taught me the techniques I used in the package, and whose book R Packages was invaluable in getting peruse published on CRAN. Also, the piped set-builder workflow was made possible by the magrittr pipe, so thank you to the developers: Stefan Milton Bache and Hadley Wickham.