New this week is a new chapter on reducing argument clutter by adding an options object. Sometimes you have a set of “second class” arguments that you don’t expect people to use very commonly, so you don’t want them cluttering up the function specification. If you want to give the user the ability to control them when needed, you can lump them all together into an “options” object.
Not super important, but we say 'Februar' (as in Germany) in Austria.
It's 'Jänner' instead of 'Januar' (the form used in Germany).
'Feber' might be said in some parts, but it is not mainstream.
This reminded me of the `control` argument in `fit_resamples()`, as illustrated in TMWR:
https://www.tmwr.org/resampling#resampling-performance
Looks like an example of a well utilized (and documented!) options argument.
If you use this structure, you need really good function documentation with examples of how the options argument needs to be structured.
Do you have advice on whether it should only be a named list, or also support vector input (e.g., if all possible options have the same data type)?
I think the key thing here is to provide a helper function that has arguments that you can document. Then you get an error if the arguments are misspelled, and there's an obvious place to look for more details.
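A minimal sketch of that kind of helper (all names here are hypothetical, not from any real package):
```
# Each option is a real argument, so a misspelling errors immediately
# and every option gets its own entry in the documentation.
fit_control <- function(maxit = 25, epsilon = 1e-8, trace = FALSE) {
  stopifnot(is.numeric(maxit), is.numeric(epsilon), is.logical(trace))
  list(maxit = maxit, epsilon = epsilon, trace = trace)
}

fit_control(maxit = 50)        # works
try(fit_control(maxiter = 50)) # error: unused argument (maxiter = 50)
```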
In my experience, you need good documentation regardless of which approach you use. I specifically noticed the issue with a giant list of arguments when I was doing meta-analysis using the function meta::metagen(), although this is a somewhat extreme case: https://www.rdocumentation.org/packages/meta/versions/6.2-1/topics/metagen
Even when everything is documented, I don't know where to begin. It would definitely benefit from some kind of hierarchical structure to be digestible.
Another function with a VERY large number of arguments is ggplot2::theme(): https://ggplot2.tidyverse.org/reference/theme.html. I think the hierarchical organisation of the argument names helps here, although we might still be better off with multiple functions.
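For example, the hierarchical names in `theme()` let a parent setting flow down to its children:
```
library(ggplot2)
# axis.title is the parent of axis.title.x and axis.title.y, so the
# size set on the parent is inherited and only the face is overridden:
theme(
  axis.title   = element_text(size = 11),
  axis.title.x = element_text(face = "bold")
)
```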
Came here to say the same thing as Lisa... I often get frustrated by a lack of documentation on what an options-type argument is and how to use it. In the help page for glm, for example, it is not clear that you should specify control = glm.control(...), and there is no useful description of what kinds of behavior you can elicit by using the control option. I like the idea of hiding some of the details that people rarely need, but in practice I find this often means making things effectively inaccessible to anyone who isn't a creator of the package, or someone who has spent a LOT of time using it and digging through all of its documentation. Including some advice in this chapter on how to document options objects effectively would be really helpful.
I think this just requires a couple of small adjustments to `glm()`: it would be better if the default value for `control` were `glm.control()`, and the documentation could more clearly recommend that you also create the options with `glm.control()`.
`glm.control()` could also return a specially classed list so that `glm()` could give a better error if you passed in the wrong thing. This idea is mentioned briefly in https://design.tidyverse.org/argument-clutter.html#how-do-i-use-this-pattern.
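A rough sketch of that idea, using a hypothetical `glm_control2()` rather than the real function:
```
glm_control2 <- function(epsilon = 1e-8, maxit = 25, trace = FALSE) {
  structure(
    list(epsilon = epsilon, maxit = maxit, trace = trace),
    class = "glm_control"
  )
}

# Inside glm(), a check like this could produce a clearer error:
check_control <- function(control) {
  if (!inherits(control, "glm_control")) {
    stop("`control` must be created with `glm_control2()`", call. = FALSE)
  }
  control
}
```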
Yes, I agree!
This is the approach used for model fitting in the {caret} package. The train() function accepts a trainControl object for fine-tuning the training process. trainControl() hides complex details by encapsulating a large number of arguments:
```
trainControl(
  method = "boot",
  number = ifelse(grepl("cv", method), 10, 25),
  repeats = ifelse(grepl("[d_]cv$", method), 1, NA),
  p = 0.75,
  search = "grid",
  initialWindow = NULL,
  horizon = 1,
  fixedWindow = TRUE,
  skip = 0,
  verboseIter = FALSE,
  returnData = TRUE,
  returnResamp = "final",
  savePredictions = FALSE,
  classProbs = FALSE,
  summaryFunction = defaultSummary,
  selectionFunction = "best",
  preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5,
                        freqCut = 95/5, uniqueCut = 10, cutoff = 0.9),
  sampling = NULL,
  index = NULL,
  indexOut = NULL,
  indexFinal = NULL,
  timingSamps = 0,
  predictionBounds = rep(FALSE, 2),
  seeds = NA,
  adaptive = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE),
  trim = FALSE,
  allowParallel = TRUE
)
```
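A typical call looks something like this (assuming {caret} is installed; the data and model choice here are arbitrary):
```
library(caret)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
fit <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = ctrl)
```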
A nice bonus of this pattern is that you can re-use the options object/function call. In a modelling context, you often fit a series of nested or candidate models and compare them, and this pattern helps them agree on low-level details.
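A small illustration of the re-use with base R's `glm()` (the formulas are chosen just for the example):
```
library(MASS)  # for the anorexia data
ctrl <- glm.control(epsilon = 1e-10, maxit = 50)

# Both candidate models share the same fitting details:
m1 <- glm(Postwt ~ Prewt, data = anorexia, control = ctrl)
m2 <- glm(Postwt ~ Prewt + Treat, data = anorexia, control = ctrl)
anova(m1, m2, test = "F")
```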
Functions that wrap things written in other languages often cry out for this, in my experience. For example, `xgboost::xgboost()` has a huge list of names to use in its `params = list()` argument (which I don't THINK were documented until a year or two ago), and `yaml::yaml.load()` has you look in the Details for its `handlers` arg, where you'll find "where the names are the YAML types (i.e., 'int', 'float', 'seq', etc)." Even if the options arg is passing on to some outside thing that might change its offering, it'd be nice to have SOME guidance on the possible values!
I'm definitely a fan of this pattern!
In {tidymodels} we use this hierarchical structure for our control functions, so you can pass the result of `control_bayes()` to a function that expects `control_grid()`, since the arguments of `control_bayes()` are a proper superset of the arguments of `control_grid()`.
This is done in part because these control objects are passed around from function to function in some of the more complicated routines.
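One way to see the superset relationship (assuming the {tune} package, where these control functions live):
```
library(tune)
# Every argument of control_grid() should also be an argument of control_bayes():
setdiff(names(formals(control_grid)), names(formals(control_bayes)))
#> character(0)  (if the superset claim holds)
```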
BTW, the link in the first sentence is invalid. Seems the correct one is: https://design.tidyverse.org/argument-clutter.html
An alternative approach, used by scikit-learn for example, is to start with a class that contains the options and then use class methods to act on it. In R, I have implemented things like this using {R6}. The example above could be:
```
mod <- GeneralLinearModel$new(family = gaussian, config = glm_config_with_defaults(trace = TRUE))
mod$fit(Postwt ~ Prewt + Treat + offset(Prewt), data = anorexia)
```
It completely declutters the actual fit() method, and using OOP we can organize the options in some meaningful way, e.g.:
```
GLMConfig <- R6::R6Class("GLMConfig",
  public = list(
    family = NULL,             # e.g. gaussian
    fitting_parameters = NULL, # e.g. a GLMConfigFit class if there are too many (>5?) options
    parallel_execution = NULL  # e.g. a GLMConfigParallel class
  )
)
```
There is obviously a tradeoff between listing everything in one giant argument list and creating a bunch of classes, and I am not sure what the best middle ground is. 🤔
Thanks for pointing out the broken link, fixed now.
I'll definitely get to OOP patterns later in the book, but I'm not sure using an R6 object helps much here compared to a simple list. I think the primary advantage of R6 objects is their ability to modify-in-place, but that doesn't help here.
If you end up with multiple options objects (possibly nested), I think you're probably better off switching to a system where you provide one function to create the object and then a bunch of functions to modify it. I've made a note to write up this pattern at https://github.com/tidyverse/design/issues/162.
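Something along these lines (all names hypothetical):
```
# One constructor creates the object...
new_options <- function() {
  structure(list(fit = list(), parallel = list()), class = "my_options")
}

# ...and a family of small functions modifies it, one concern each.
opts_fit <- function(options, maxit = 25, epsilon = 1e-8) {
  options$fit <- list(maxit = maxit, epsilon = epsilon)
  options
}

opts_parallel <- function(options, workers = 1L) {
  options$parallel <- list(workers = workers)
  options
}

# Requires R >= 4.1 for the native pipe:
opts <- new_options() |> opts_fit(maxit = 50) |> opts_parallel(workers = 4)
```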