Over the last couple of weeks I’ve been noodling on the idea of strategies: what happens if your function contains a few different approaches to the problem, and you want to expose to the user select. I believe that it’s best to expose these explicitly like the ties.method
argument to rank()
, the method
argument to p.adjust()
, or the .keep
argument to dplyr::mutate()
.
In functions that expose a strategy, it’s common to see a character vector in the function interface. For example, rank()
looks like this:
rank(
x,
na.last = TRUE,
ties.method = c("average", "first", "last", "random", "max", "min")
)
I call this character vector an enumeration and discuss it in “Enumerate possible options”. This vector enumerates (itemises) the possible options (six here) and it also tells you the default value, the first value in the vector ("average"
). This type of default value is usually paired with either match.arg()
or rlang::arg_match()
to give an informative error if the user supplies an unsupported value. The chief difference between the base and rlang versions of this function is that the base version supports partial matching (e.g. you can write rank(x, ties.method = "r")
for short), which we believe is no longer a good idea.
One of the reasons it’s useful to understand this pattern is that you might want to apply it even when there are only two options. In such a case, it’s tempting to expose the option as a Boolean argument, accepting either TRUE
or FALSE
. But this has two problems:
You might later discover that there’s a third option. Now you’re going to need to make more radical changes to your function interface to allow this.
It’s often trickier to understand a negative. For example, I recently discovered the
cancel_on_error
argument in an httr2 function. I think it’s pretty clear whatcancel_on_error = TRUE
does (it cancels if there’s an error), but what doescancel_on_error = FALSE
do? I wrote this code and now I couldn’t tell you what it actually does.
I explore this idea in more detail in “Prefer a enum, even if only two choices”, including a deeper look at the decreasing
and na.last
arguments to sort()
.
A more complicated example of the strategy pattern comes about when different strategies require different arguments. The best example of this sort of pattern is stringr, which uses the functions regex()
, fixed()
, boundary()
, and coll()
to define the pattern matching engine:
library(stringr)
x <- "The quick brown fox jumped over the lazy dog."
str_view(x, regex("[aeiou]+", ignore_case = TRUE))
#> [1] │ Th<e> q<ui>ck br<o>wn f<o>x j<u>mp<e>d <o>v<e>r th<e> l<a>zy d<o>g.
str_view(x, fixed("."))
#> [1] │ The quick brown fox jumped over the lazy dog<.>
str_view(x, boundary("word"))
#> [1] │ <The> <quick> <brown> <fox> <jumped> <over> <the> <lazy> <dog>.
I explore this idea more in “Extract strategies into objects” (which really needs a catchier name), motivated by the perl
, fixed
, and ignore.case
arguments to grepl()
and friends.
Finally, sometimes exposing multiple strategies in one function isn’t the right move, and you’re better off creating more simpler functions. I think of this problem as “three functions in a trench coat” because it can feel like three or more functions crammed into one breaking apart at the seems. forcats::fct_lump()
is a good example of this problem: it started off simple and then gained new strategies over time. Eventually it got so hard to explain that we decided to split apart in to three simpler functions. Another good example of this problem is the rep()
function: I think it’s actually two functions in trench coat and it gets easier to understand if you pull them apart. See the rep() case study for a full exploration including some of my thoughts about what you might name the functions and arguments.
Do these patterns resonate with you? Are there other functions in the tidyverse that you think do a particularly good or bad job of exposing a strategy? Please let me know in the comments!
In general, this family of patterns really speaks to me. I feel like I run into it fairly often. That said, I think a lot of times I'm really running into the "three functions in a trench coat" problem. It might be good to emphasize when to split things earlier on (although I also see there's a whole chapter for that).
For the name: Maybe emphasize the helpers (both in 13 and 11), something like "Reduce argument clutter with an options helper" and "Extract strategies into options helpers"? I'm not sure that's catchier, but "an options object" sounds like it could just be a list (which is how I read it at first), and those just feel like hidden arguments (eg, the `params` argument in `xgboost::xgboost()`, which requires you to read a separate website to find out what you can set).
From 13: "You could also implement the same strategy using `if` or S7 generic functions, depending on your needs." When would you switch from `switch` to S7 generics?
You've mentioned a downside of the enumeration strategy being that it forces a default value, and you have another chapter about not giving required arguments a default value.
If I have a required argument that I'd like to use enumeration for but I don't want the argument to choose either option by default, is it poor design to set the default to NULL and handle the NULL value by raising an error?
EDIT: I have discovered that the NULL value in c(NULL, ...) default parameter is dropped upon function call. So I rephrase my question with an alternative:
In order to use the enumeration strategy but avoid providing a default choice, would it be poor design to set the default to some value that will be immediately handled by raising an error?