New this week is a new chapter on reducing argument clutter by adding an options object. Sometimes you have a set of “second class” arguments that you don’t expect people to use very commonly, so you don’t want them cluttering up the function specification. If you want to give the user the ability to control them when needed, you can lump them all together into an “options” object.
Not super important, but we say 'Februar' (as in Germany) in Austria.
It's 'Jänner' instead of 'Januar' (the form used in Germany).
'Feber' might be said in some parts, but it is not mainstream.
This reminded me of the `control` argument in `fit_resamples()`, as illustrated in TMWR:
https://www.tmwr.org/resampling#resampling-performance
Looks like an example of a well utilized (and documented!) options argument.
If you use this structure, you need really good function documentation with examples of how the options argument needs to be structured.
Do you have advice on whether it should only be a named list, or also support vector input (e.g., if all possible options have the same data type)?
I think the key thing here is to provide a helper function that has arguments that you can document. Then you get an error if the arguments are misspelled, and there's an obvious place to look for more details.
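A minimal sketch of that kind of helper (all names here are hypothetical, not from any real package):
```
# Each option is a real argument, so a misspelling errors immediately
# and every option gets its own entry in the documentation.
fit_control <- function(maxit = 25, epsilon = 1e-8, trace = FALSE) {
  stopifnot(is.numeric(maxit), is.numeric(epsilon), is.logical(trace))
  list(maxit = maxit, epsilon = epsilon, trace = trace)
}

fit_control(maxit = 50)        # works
try(fit_control(maxiter = 50)) # error: unused argument (maxiter = 50)
```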
In my experience, you need good documentation regardless of which approach you use. I specifically noticed the issue with a giant list of arguments when I was doing meta-analysis using the function meta::metagen(), although this is a somewhat extreme case: https://www.rdocumentation.org/packages/meta/versions/6.2-1/topics/metagen
Even when everything is documented, I don't know where to begin. It would definitely benefit from some kind of hierarchical structure to be digestible.
Another function with a VERY large number of arguments is ggplot2::theme(): https://ggplot2.tidyverse.org/reference/theme.html. I think the hierarchical organisation of the argument names helps here, although we might still be better off with multiple functions.
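For example, the hierarchical names in `theme()` let a parent setting flow down to its children:
```
library(ggplot2)
# axis.title is the parent of axis.title.x and axis.title.y, so the
# size set on the parent is inherited and only the face is overridden:
theme(
  axis.title   = element_text(size = 11),
  axis.title.x = element_text(face = "bold")
)
```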
Came here to say the same thing as Lisa... I often get frustrated by a lack of documentation on what an options-type argument is and how to use it. In the help page for glm, for example, it is not clear that you should specify control = glm.control(...), and there is no useful description of what kinds of behavior you can elicit by using the control option. I like the idea of hiding some of the details that people rarely need, but in practice I find this often means making things effectively inaccessible to anyone who isn't a creator of the package, or someone who has spent a LOT of time using it and digging through all of its documentation. Including some advice in this chapter on how to document options objects effectively would be really helpful.
I think this just requires a couple of small adjustments to `glm()`: it would be better if the default value for `control` were `glm.control()`, and the documentation could more clearly recommend that you also create the options with `glm.control()`.
`glm.control()` could also return a specially classed list so that `glm()` could give a better error if you passed in the wrong thing. This idea is mentioned briefly in https://design.tidyverse.org/argument-clutter.html#how-do-i-use-this-pattern.
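A rough sketch of that idea, using a hypothetical `glm_control2()` rather than the real function:
```
glm_control2 <- function(epsilon = 1e-8, maxit = 25, trace = FALSE) {
  structure(
    list(epsilon = epsilon, maxit = maxit, trace = trace),
    class = "glm_control"
  )
}

# Inside glm(), a check like this could produce a clearer error:
check_control <- function(control) {
  if (!inherits(control, "glm_control")) {
    stop("`control` must be created with `glm_control2()`", call. = FALSE)
  }
  control
}
```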
Yes, I agree!
This is the approach used for model fitting in the {caret} package. The train() function accepts a trainControl object for fine-tuning the training process. trainControl() hides complex details by encapsulating a large number of arguments:
```
trainControl(
  method = "boot",
  number = ifelse(grepl("cv", method), 10, 25),
  repeats = ifelse(grepl("[d_]cv$", method), 1, NA),
  p = 0.75,
  search = "grid",
  initialWindow = NULL,
  horizon = 1,
  fixedWindow = TRUE,
  skip = 0,
  verboseIter = FALSE,
  returnData = TRUE,
  returnResamp = "final",
  savePredictions = FALSE,
  classProbs = FALSE,
  summaryFunction = defaultSummary,
  selectionFunction = "best",
  preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5,
                        freqCut = 95/5, uniqueCut = 10, cutoff = 0.9),
  sampling = NULL,
  index = NULL,
  indexOut = NULL,
  indexFinal = NULL,
  timingSamps = 0,
  predictionBounds = rep(FALSE, 2),
  seeds = NA,
  adaptive = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE),
  trim = FALSE,
  allowParallel = TRUE
)
```
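A typical call looks something like this (assuming {caret} is installed; the data and model choice here are arbitrary):
```
library(caret)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
fit <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = ctrl)
```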
A nice bonus of this pattern is that you can re-use the options object/function call. In a modelling context, you often fit a series of nested or candidate models and compare them, and this pattern helps them agree on low-level details.
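A small illustration of the re-use with base R's `glm()` (the formulas are chosen just for the example):
```
library(MASS)  # for the anorexia data
ctrl <- glm.control(epsilon = 1e-10, maxit = 50)

# Both candidate models share the same fitting details:
m1 <- glm(Postwt ~ Prewt, data = anorexia, control = ctrl)
m2 <- glm(Postwt ~ Prewt + Treat, data = anorexia, control = ctrl)
anova(m1, m2, test = "F")
```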
Functions that wrap things written in other languages often cry out for this, in my experience. For example, `xgboost::xgboost()` has a huge list of names to use in its `params = list()` argument (which I don't THINK were documented until a year or two ago), and `yaml::yaml.load()` has you look in the Details for its `handlers` arg, where you'll find "where the names are the YAML types (i.e., 'int', 'float', 'seq', etc)." Even if the options arg is passing on to some outside thing that might change its offering, it'd be nice to have SOME guidance on the possible values!
I'm definitely a fan of this pattern!
In {tidymodels} we use this hierarchical structure for our control functions, so you can pass the result of `control_bayes()` to a function that expects `control_grid()`, since the arguments of `control_bayes()` are a proper superset of the arguments of `control_grid()`.
This is done in part because these control objects are passed around from function to function in some of the more complicated routines.
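One way to see the superset relationship (assuming the {tune} package, where these control functions live):
```
library(tune)
# Every argument of control_grid() should also be an argument of control_bayes():
setdiff(names(formals(control_grid)), names(formals(control_bayes)))
#> character(0)  (if the superset claim holds)
```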
BTW, the link in the first sentence is invalid. Seems the correct one is: https://design.tidyverse.org/argument-clutter.html
An alternative approach, used by scikit-learn for example, is to start with a class that contains the options and then use class methods to act on it. In R, I have implemented things like this using {R6}. The example above could be:
```
mod <- GeneralLinearModel$new(family = gaussian, config = glm_config_with_defaults(trace = TRUE))
mod$fit(Postwt ~ Prewt + Treat + offset(Prewt), data = anorexia)
```
It completely declutters the actual fit() method, and using OOP we can organize the options in some meaningful way, e.g.:
```
GLMConfig <- R6::R6Class("GLMConfig",
  public = list(
    family = NULL,             # e.g. gaussian
    fitting_parameters = NULL, # e.g. a GLMConfigFit class if there are too many (>5?) options
    parallel_execution = NULL  # e.g. a GLMConfigParallel class
  )
)
```
There is obviously a tradeoff between listing everything in one giant argument list and creating a bunch of classes, and I am not sure what the best middle ground is. 🤔
Thanks for pointing out the broken link, fixed now.
I'll definitely get to OOP patterns later in the book, but I'm not sure using an R6 object helps much here compared to a simple list. I think the primary advantage of R6 objects is their ability to modify-in-place, but that doesn't help here.
If you end up with multiple options objects (possibly nested), I think you're probably better off switching to a system where you provide one function to create the object and then a bunch of functions to modify it. I've made a note to write up this pattern at https://github.com/tidyverse/design/issues/162.
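Something along these lines (all names hypothetical):
```
# One constructor creates the object...
new_options <- function() {
  structure(list(fit = list(), parallel = list()), class = "my_options")
}

# ...and a family of small functions modifies it, one concern each.
opts_fit <- function(options, maxit = 25, epsilon = 1e-8) {
  options$fit <- list(maxit = maxit, epsilon = epsilon)
  options
}

opts_parallel <- function(options, workers = 1L) {
  options$parallel <- list(workers = workers)
  options
}

# Requires R >= 4.1 for the native pipe:
opts <- new_options() |> opts_fit(maxit = 50) |> opts_parallel(workers = 4)
```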