Figure out a way to avoid `models[[1]]` #43

joelnitta · 2024-07-11T12:03:07Z

Maybe we could use a dataframe for storing models instead?

joelnitta · 2024-07-12T01:11:38Z

This is tricky, because non-standard evaluation is used to define models for lm(). We can't just provide a character vector of model specifications like "bill_depth_mm ~ bill_length_mm" and map over those (well, we could with another custom function, but that is asking a lot of the learners).
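For reference, the "another custom function" route could look roughly like the sketch below. The helper name `fit_lm_from_string` is hypothetical, and `palmerpenguins::penguins` stands in for the cleaned `penguins_data` target (an assumption made only to keep the example self-contained).

```r
library(palmerpenguins)

# Hypothetical helper: turn a formula written as a string into a fitted lm()
fit_lm_from_string <- function(formula_string, data) {
  lm(as.formula(formula_string), data = data)
}

# Model specifications as a named character vector
model_specs <- c(
  combined_model = "bill_depth_mm ~ bill_length_mm",
  species_model = "bill_depth_mm ~ bill_length_mm + species",
  interaction_model = "bill_depth_mm ~ bill_length_mm * species"
)

# Map over the specifications; the result is a named list of lm objects
models <- lapply(model_specs, fit_lm_from_string, data = penguins)
```

This still returns a list, so it does not remove the need for `[[` on its own; it only moves the non-standard evaluation into the helper.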

Furthermore, the design of branching in {targets} nudges us to use dataframes (or tibbles) as targets. So when designing custom functions that will be used in branching, it helps to think of how the function will work on one row of input. We can write a custom function that looks clean in the final plan and produces clean output (a tidy dataframe), but the contents of the function are rather complicated since it has to work with a one-row dataframe as input. This will be tedious to explain to novices (and it still requires indexing with [[ anyways).

Finally, the approach of including models as a list-column in a dataframe is a rather advanced topic.

Anyways here is a sketch that builds models in a tibble, then branches over the rows of the tibble:

source("R/packages.R")
source("R/functions.R")

summarize_model <- function(model_tibble) {
  model_name <- model_tibble$model_name
  model <- model_tibble$model[[1]]
  glance(model) |>
    mutate(model_name = model_name) |>
    relocate(model_name, .before = 1)
}

tar_plan(
  # Load raw data
  tar_file_read(
    penguins_data_raw,
    path_to_file("penguins_raw.csv"),
    read_csv(!!.x, show_col_types = FALSE)
  ),
  # Clean data
  penguins_data = clean_penguin_data(penguins_data_raw),
  # Build models
  models = tibble(
    model_name = c("combined_model", "species_model", "interaction_model"),
    model = list(
      lm(bill_depth_mm ~ bill_length_mm, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
    )
  ),
  # Get model summaries
  tar_target(
    model_summaries,
    summarize_model(models),
    pattern = map(models)
  )
)

I now realize a more natural introduction to branching would be to branch over different sets of input instead of different models.

@multimeric keen to hear your thoughts!

joelnitta · 2024-10-31T04:18:31Z

Update: just taught this workshop again, and this part is very difficult to teach since the details are so complicated. We should definitely use a simpler example for branching. Maybe not even use the models at all.

joelnitta · 2024-11-01T02:51:53Z

NEW IDEA: instead of branching over the list of models, split up the original data set by species using tar_group(), then build a model for each species separately. It will then be much easier to reason about the subsequent steps of extracting model summaries and predictions using broom::glance() and broom::augment(). The downside of this approach is that it is technically not statistically sound (making a separate model for each species instead of a single model that includes species as a categorical predictor variable). But the point of the workshop is to teach how to use {targets}, not statistics, so I think that's OK.
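To make the splitting step concrete, here is a minimal sketch of what tar_group() does to a grouped dataframe outside of a pipeline. It assumes `palmerpenguins::penguins` as a stand-in for the cleaned data; the object name `penguins_grouped` is only for illustration.

```r
library(dplyr)
library(targets)
library(palmerpenguins)

# group_by() defines the groups; tar_group() records them in a
# special `tar_group` column that {targets} uses to split the data
penguins_grouped <- penguins |>
  group_by(species) |>
  tar_group()

# One distinct tar_group value per species: each value marks the rows
# that a single branch would receive
count(penguins_grouped, species, tar_group)
```

Inside a pipeline, the target holding this dataframe would also need `iteration = "group"` so that downstream targets using `pattern = map()` receive one species at a time.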

multimeric · 2024-11-01T03:05:10Z

I think that would be better. Anything that avoids using a list is good: even branching over a single vector would be better, because it avoids changing the data type.

joelnitta · 2024-11-01T03:13:07Z

Right... of course, the output of lm() is a list, so that makes it awkward to include directly in the pipeline. If we want to avoid branching over lists, we would have to build the model twice, once for broom::augment() and once for broom::glance(). Something like this (assuming penguins_data is coming in as a branch split up by species):

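augment_penguins <- function(penguins_data) {
  # Fit the model for this branch (one species)
  model <- lm(bill_length_mm ~ bill_depth_mm, data = penguins_data)
  # Per-observation fitted values and residuals, labeled by species
  augment(model) |>
    mutate(species = unique(penguins_data$species))
}

glance_penguins <- function(penguins_data) {
  # Fit the same model again for this branch (one species)
  model <- lm(bill_length_mm ~ bill_depth_mm, data = penguins_data)
  # One-row model summary, labeled by species
  glance(model) |>
    mutate(species = unique(penguins_data$species))
}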

That feels a little awkward because in a "production" situation you would only build the model once. But for teaching {targets} it's probably OK? It sure is easier to reason about with dataframe in and dataframe out.
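For discussion, the plan using these two functions might look something like the sketch below. The target names `penguins_by_species`, `model_predictions`, and `model_summaries` are placeholders, and `penguins_data` is assumed to be the cleaned data target from the earlier plan. I have not run this through tar_make(); it only shows the intended shape.

```r
library(targets)
library(tarchetypes)

pipeline_sketch <- tar_plan(
  # Split the cleaned data into one group per species
  tar_target(
    penguins_by_species,
    penguins_data |>
      dplyr::group_by(species) |>
      tar_group(),
    iteration = "group"
  ),
  # One branch per species: dataframe in, dataframe out
  tar_target(
    model_predictions,
    augment_penguins(penguins_by_species),
    pattern = map(penguins_by_species)
  ),
  tar_target(
    model_summaries,
    glance_penguins(penguins_by_species),
    pattern = map(penguins_by_species)
  )
)
```

Every target here is a dataframe, so learners never need to index into a list with `[[`.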

joelnitta mentioned this issue Jul 11, 2024

Workshop feedback #22

Open

joelnitta self-assigned this Jul 12, 2024
