Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Figure out a way to avoid models[[1]] #43

Open
joelnitta opened this issue Jul 11, 2024 · 5 comments
Open

Figure out a way to avoid models[[1]] #43

joelnitta opened this issue Jul 11, 2024 · 5 comments
Assignees

Comments

@joelnitta
Copy link
Collaborator

joelnitta commented Jul 11, 2024

maybe we could use a dataframe for storing modes instead?

@joelnitta joelnitta self-assigned this Jul 12, 2024
@joelnitta
Copy link
Collaborator Author

This is tricky, because non-standard evaluation is used to define models for lm(). We can't just provide a character vector of model specifications like "bill_depth_mm ~ bill_length_mm" and map over those (well, we could with another custom function, but that is asking a lot of the learners).

Furthermore, the design of branching in {targets} nudges us to use dataframes (or tibbles) as targets. So when designing custom functions that will be used in branching, it helps to think of how the function will work on one row of input. We can write a custom function that looks clean in the final plan and produces clean output (a tidy dataframe), but the contents of the function are rather complicated since it has to work with a one-row dataframe as input. This will be tedious to explain to novices (and it still requires indexing with [[ anyways).

Finally, the approach of including models as a list-column in a dataframe is a rather advanced topic.

Anyways here is a sketch that builds models in a tibble, then branches over the rows of the tibble:

source("R/packages.R")
source("R/functions.R")

summarize_model <- function(model_tibble) {
  model_name <- model_tibble$model_name
  model <- model_tibble$model[[1]]
  glance(model) |>
    mutate(model_name = model_name) |>
    relocate(model_name, .before = 1)
}

tar_plan(
  # Load raw data
  tar_file_read(
    penguins_data_raw,
    path_to_file("penguins_raw.csv"),
    read_csv(!!.x, show_col_types = FALSE)
  ),
  # Clean data
  penguins_data = clean_penguin_data(penguins_data_raw),
  # Build models
  models = tibble(
    model_name = c("combined_model", "species_model", "interaction_model"),
    model = list(
      lm(bill_depth_mm ~ bill_length_mm, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
      lm(bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
    )
  ),
  # Get model summaries
  tar_target(
    model_summaries,
    summarize_model(models),
    pattern = map(models)
  )
)

I now realize a more natural introduction to branching would be to branch over different sets of input instead of different models.

@multimeric keen to hear your thoughts!

@joelnitta
Copy link
Collaborator Author

Update: just taught this workshop again, and this part is very difficult to teach since the details are so complicated. We should definitely use a simpler example for branching. Maybe not even use the models at all.

@joelnitta
Copy link
Collaborator Author

NEW IDEA: instead of branching over the list of models, split up the original data set by species using tar_group(), then build a model for each separately. It will then be much easier to reason about the subsequent steps of extracting model parameters and predictions using broom::glimpse() and broom::augment(). The downside of this approach is that it is technically not statistically sound (making a separate model for each species instead of a single model that includes species as a categorical predictor variable). But the point of the workshop is to teach how to use {targets}, not statistics, so I think that's OK.

@multimeric
Copy link
Contributor

I think that would be better. Anything that avoids using a list is good: even if we have something that relates to branching over a single vector would be better because it avoids changing the data type.

@joelnitta
Copy link
Collaborator Author

Right... of course, the output of lm() is a list, so that makes it awkward to include directly in the pipeline. If we want to avoid branching over lists, we would have to build the model twice, once for broom::augment() and once for broom::glance(). Something like this (assuming penguins_data is coming in as a branch split up by species):

augment_penguins <- function(penguins_data) {
  model <- lm(bill_length_mm ~ bill_depth_mm, data = penuins_data)
  augment(model) |>
    mutate(species = unique(penguins_data$species)
}
glance_penguins <- function(penguins_data) {
  model <- lm(bill_length_mm ~ bill_depth_mm, data = penuins_data)
  glance(model) |>
    mutate(species = unique(penguins_data$species)
}

That feels a little awkward because in a "production" situation you would only build the model once. But for teaching {targets} it's probably OK? It sure is easier to reason about with dataframe in and dataframe out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants