26 Mar 12:07

simonpcouch

63aaaa0

infer 1.0.7 (Latest)

The aliases p_value() and conf_int(), first deprecated 6 years ago, now return an error (#530).
Addresses ggplot2 warnings when shading p-values for test statistics that are outside of the range of the generated distribution (#528).
Fixed bug in shade_p_value() and shade_confidence_interval() where fill = NULL was ignored when it was documented as preventing any shading (#525).
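
For code that still calls the removed aliases, a minimal migration sketch follows. It assumes the long-standing replacements get_p_value() and get_confidence_interval(); the seed, the number of replicates, and the object names are illustrative choices rather than anything prescribed by the release notes.

```r
library(infer)

set.seed(2024)  # illustrative seed so the resamples are reproducible

# observed mean hours worked per week in infer's bundled gss data
obs_mean <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# simulation-based null distribution for H0: mu = 40
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 200, type = "bootstrap") %>%
  calculate(stat = "mean")

# where older code called p_value(), call get_p_value()
p_val <- get_p_value(null_dist, obs_stat = obs_mean, direction = "two-sided")

# bootstrap distribution (no null hypothesis) for an interval estimate
boot_dist <- gss %>%
  specify(response = hours) %>%
  generate(reps = 200, type = "bootstrap") %>%
  calculate(stat = "mean")

# where older code called conf_int(), call get_confidence_interval()
ci <- get_confidence_interval(boot_dist, level = 0.95)
```

Both helpers return one-row tibbles, so downstream code that read the old aliases' output should mostly need only the function name changed.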


31 Jan 15:51

simonpcouch

v1.0.6

fdbcece

infer 1.0.6

Updated infrastructure for errors, warnings, and messages (#513). Most of these changes will not be visible to users, though:
- Many longer error messages are now broken up into several lines.
- For references to help-files, users can now click on the error message's text to navigate to the cited documentation.
Various improvements to documentation (#501, #504, #508, #512).
Fixed bug where get_confidence_interval() would error uninformatively when the supplied distribution of estimates contained missing values. The function will now warn and return a confidence interval calculated using the non-missing estimates (#521).
Fixed bug where generate() could not be used without first specify()ing variables, even in cases where that specification would not affect resampling/simulation (#448).


06 Sep 12:48

simonpcouch

v1.0.5

5ad0087

infer 1.0.5

Implemented support for permutation hypothesis tests for paired data via the argument value null = "paired independence" in hypothesize() (#487). The new vignette Tidy inference for paired data outlines these changes with an applied example.
The weight_by argument to rep_slice_sample() can now be passed either as a vector of numeric weights or an unquoted column name in .data (#480).
Newly accommodates variables with spaces in names in the wrapper functions t_test() and prop_test() (#472).
Fixed bug in two-sample prop_test() where the response and explanatory variable were passed in place of each other to prop.test(). This enables using prop_test() with explanatory variables with greater than 2 levels and, in the process, addresses a bug where prop_test() collapsed levels other than the success when the response variable had more than 2 levels.
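
The paired-data workflow described above might look roughly like the following sketch. The data frame paired and its columns before, after, and diff are hypothetical and simulated here only for illustration; the pipeline itself follows the null = "paired independence" interface named in the notes.

```r
library(infer)

set.seed(1)  # illustrative seed

# hypothetical paired measurements on 30 units (not from the infer package)
paired <- data.frame(before = rnorm(30, mean = 40, sd = 5))
paired$after <- paired$before + rnorm(30, mean = 1, sd = 2)
paired$diff <- paired$after - paired$before

# observed mean of the within-pair differences
obs_diff <- paired %>%
  specify(response = diff) %>%
  calculate(stat = "mean")

# null distribution: permuting under paired independence
null_dist <- paired %>%
  specify(response = diff) %>%
  hypothesize(null = "paired independence") %>%
  generate(reps = 200, type = "permute") %>%
  calculate(stat = "mean")

p_val <- get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")
```

The test is run on the single column of differences, so the response passed to specify() is diff rather than either of the original measurements.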


02 Dec 00:12

simonpcouch

v1.0.4

1da7f38

infer 1.0.4

Fixed bug in p-value shading where shaded regions no longer correctly overlaid histogram bars.
Addressed deprecation warning ahead of upcoming dplyr release.


22 Aug 19:01

simonpcouch

v1.0.3

3839dcb

infer 1.0.3

Fix R-devel HTML5 NOTEs.


12 Jun 20:07

simonpcouch

v1.0.2

06e2977

infer 1.0.2

infer v1.0.2 is a minor release containing several bug fixes and miscellaneous improvements.

Fix p-value shading when the calculated statistic falls exactly on the boundaries of a histogram bin (#424).
Fix generate() errors when columns are named x (#431).
Fix error from visualize() when passed generate()d infer_dist objects that had not been passed to hypothesize() (#432).
Update visual checks for visualize output to align with the R 4.1.0+ graphics engine (#438).
specify() and wrapper functions now appropriately handle ordered factors (#439).
Clarify error when incompatible statistics and hypotheses are supplied (#441).
Updated generate() unexpected type warnings to be more permissive—the warning will be raised less often when type = "bootstrap" (#425).
Allow passing additional arguments to stats::chisq.test via ... in calculate(). Ellipses are now always passed to the applicable base R hypothesis testing function (#414)!
The package will now set the levels of logical variables on conversion to factor so that the first level (regarded as success by default) is TRUE. Core verbs have warned without an explicit success value already, and this change makes behavior consistent with the functions being wrapped by shorthand test wrappers (#440).
Added new statistic stat = "ratio of means" (#452).
Simon Couch is now the CRAN-corresponding maintainer.
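
As a rough illustration of the new stat = "ratio of means", the sketch below computes the statistic on infer's gss data and cross-checks it with base R. It assumes that order = c("degree", "no degree") puts the first group's mean in the numerator; the object names are illustrative.

```r
library(infer)

# ratio of mean weekly hours: degree holders relative to non-degree holders
ratio <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "ratio of means", order = c("degree", "no degree"))

# the same quantity computed by hand, assuming the first level in `order`
# is the numerator
by_hand <- mean(gss$hours[gss$college == "degree"]) /
  mean(gss$hours[gss$college == "no degree"])
```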

This release also ships changes from v1.0.1, a GitHub-only release, off to CRAN for the first time. Notably, the package is now released with an MIT license.


13 Sep 23:38

simonpcouch

v1.0.1

be84547

JOSS paper

This is a GitHub-only release—changes in this release will be reflected on CRAN in the next release. This release reflects the infer version accepted to the Journal of Open Source Software.

Re-licensed the package from CC0 to MIT (#413). See the LICENSE and LICENSE.md files.
Various improvements to documentation (#417, #418).
Contributed a paper to the Journal of Open Source Software, a draft of which is available in /figs/paper (#401).


13 Aug 18:42

simonpcouch

v1.0.0

e375221

First major release

infer 1.0.0

v1.0.0 is the first major release of the {infer} package! By and large, the core verbs specify(), hypothesize(), generate(), and calculate() will interface as they did before. This release makes several improvements to behavioral consistency of the package and introduces support for theory-based inference as well as randomization-based inference with multiple explanatory variables.

Behavioral consistency

A major change to the package in this release is a set of standards for behavioral consistency of calculate() (#356). Namely, the package will now

supply a consistent error when the supplied stat argument isn't well-defined for the variables specify()d

gss %>%
  specify(response = hours) %>%
  calculate(stat = "diff in means")
#> Error: A difference in means is not well-defined for a 
#> numeric response variable (hours) and no explanatory variable.

gss %>%
  specify(college ~ partyid, success = "degree") %>%
  calculate(stat = "diff in props")
#> Error: A difference in proportions is not well-defined for a dichotomous categorical 
#> response variable (college) and a multinomial categorical explanatory variable (partyid).

supply a consistent message when the user supplies unneeded information via hypothesize() to calculate() an observed statistic

# supply mu = 40 when it's not needed
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "mean")
#> Message: The point null hypothesis `mu = 40` does not inform calculation of 
#> the observed statistic (a mean) and will be ignored.
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  41.4

and

supply a consistent warning and assume a reasonable null value when the user does not supply sufficient information to calculate an observed statistic

# don't hypothesize `p` when it's needed
gss %>%
  specify(response = sex, success = "female") %>%
  calculate(stat = "z")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1 -1.16
#> Warning message:
#> A z statistic requires a null hypothesis to calculate the observed statistic. 
#> Output assumes the following null value: `p = .5`.

# don't hypothesize `p` when it's needed
gss %>%
  specify(response = partyid) %>%
  calculate(stat = "Chisq")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  334.
#> Warning message:
#> A chi-square statistic requires a null hypothesis to calculate the observed statistic. 
#> Output assumes the following null values: `p = c(dem = 0.2, ind = 0.2, rep = 0.2, other = 0.2, DK = 0.2)`.

To accommodate this behavior, a number of new calculate methods were added or improved. Namely:

Implemented the standardized proportion $z$ statistic for one categorical variable
Extended calculate() with stat = "t" to accept a hypothesized mu, allowing calculation of t statistics for one numeric variable with a hypothesized mean
Extended calculate() to allow lowercase aliases for stat arguments (#373).
Fixed bugs in calculate() to allow for programmatic calculation of statistics
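
The lowercase aliases mentioned above can be sketched as follows. This assumes "chisq" is accepted as an alias for "Chisq"; the choice of variables (college and finrela from infer's gss data) is only illustrative, and base R may warn about small expected counts.

```r
library(infer)

# observed chi-squared statistic using the original capitalized name
upper <- gss %>%
  specify(college ~ finrela) %>%
  calculate(stat = "Chisq")

# the same statistic requested via the lowercase alias
lower <- gss %>%
  specify(college ~ finrela) %>%
  calculate(stat = "chisq")
```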

This behavioral consistency also allowed for the implementation of observe(), a wrapper function around specify(), hypothesize(), and calculate(), to calculate observed statistics. The function provides a shorthand alternative to calculating observed statistics from data:

# calculating the observed mean number of hours worked per week
gss %>%
  observe(hours ~ NULL, stat = "mean")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  41.4

# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  41.4

# calculating a t statistic for hypothesized mu = 40 hours worked/week
gss %>%
  observe(hours ~ NULL, stat = "t", null = "point", mu = 40)
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  2.09

# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  2.09

We don't anticipate that these changes are "breaking": code that previously worked will continue to work, though it may now message or warn in a way that it did not before, or error with a different (and hopefully more informative) message.

A framework for theoretical inference

This release also introduces a more complete and principled interface for theoretical inference. While the package previously supplied some methods for visualization of theory-based curves, the interface did not provide any object that was explicitly a "null distribution" that could be supplied to helper functions like get_p_value() and get_confidence_interval(). The new interface is based on a new verb, assume(), that returns a null distribution that can be interfaced with in the same way as simulation-based null distributions.

As an example, we'll work through a full infer pipeline for inference on a mean using infer's gss dataset. Suppose that we believe the true mean number of hours worked by Americans in the past week is 40.

First, calculating the observed t-statistic:

obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

obs_stat
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  2.09

The code to define the null distribution is very similar to that required to calculate a theorized observed statistic, switching out calculate() for assume() and replacing arguments as needed.

null_dist <- gss %>%
  specify(response = hours) %>%
  assume(distribution = "t")

null_dist 
#> A T distribution with 499 degrees of freedom.

This null distribution can now be interfaced with in the same way as a simulation-based null distribution elsewhere in the package. For example, calculating a p-value by juxtaposing the observed statistic and null distribution:

get_p_value(null_dist, obs_stat, direction = "both")
#> # A tibble: 1 x 1
#>   p_value
#>     <dbl>
#> 1  0.0376

…or visualizing the null distribution alone:

visualize(null_dist)

…or juxtaposing the two visually:

visualize(null_dist) + 
  shade_p_value(obs_stat, direction = "both")

Confidence intervals lie in data space rather than on the standardized scale of the theoretical distributions, so we calculate a mean rather than the standardized t-statistic:

obs_mean <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

The null distribution here just defines the spread for the standard error calculation.

ci <- 
  get_confidence_interval(
    null_dist,
    level = .95,
    point_estimate = obs_mean
  )

ci
#> # A tibble: 1 x 2
#>   lower_ci upper_ci
#>      <dbl>    <dbl>
#> 1     40.1     42.7
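
As a sanity check on the theory-based results in this section, the sketch below recomputes the t statistic, the two-sided p-value, and the 95% interval with base R only. It assumes the theoretical helpers use the usual one-sample t formulas, $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ and $\bar{x} \pm t^{*}_{n-1} \, s / \sqrt{n}$; if so, the results should agree with the values printed above up to rounding. The variable names are illustrative.

```r
library(infer)  # only needed for the bundled gss data

x <- gss$hours           # hours worked per week
n <- length(x)
se <- sd(x) / sqrt(n)    # estimated standard error of the mean

# t statistic and two-sided p-value for H0: mu = 40
t_stat <- (mean(x) - 40) / se
p_val <- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval on the scale of the data
ci_by_hand <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
```

Note how the hypothesized value 40 enters only the test statistic; the interval depends on the data alone, which is why the null distribution contributes nothing but the spread.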

Visualizing the confidence interval results in the theoretical distribution being recentered and rescaled to align with the scale of the observed data:

visualize(null_dist) + 
  shade_confidence_interval(ci)

Previous methods for interfacing with theoretical distributions are superseded—they will continue to be supported, though documentation will forefront the assume() interface.

Support for multiple regression

The 2016 "Guidelines for Assessment and Instruction in Statistics Education" [1] state that, in introductory statistics courses, "[s]tudents should gain experience with how statistical models, including multivariable models, are used." In line with this recommendation, we introduce support for randomization-based inference with multiple explanatory variables via a new fit.infer core verb.

If passed an infer object, the method will parse a formula out of the formula or response and explanatory arguments, and pass both it and data to a stats::glm call.

gss %>%
  specify(hours ~ age + college) %>%
  fit()
#> # A tibble: 3 x 2
#>   term          estimate
#>   <chr>            <dbl>
#> 1 intercept     40.6    
#> 2 age            0.00596
#> 3 collegedegree  1.53

Note that the function returns the model coefficients as estimate rather than their associated t-statistics as stat.
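
To make the connection to stats::glm concrete, the following sketch fits the same model both ways and compares the coefficients. It assumes a Gaussian family for the numeric response and that terms are returned in the same order as coef(); the object names are illustrative.

```r
library(infer)

# coefficients via the infer interface
infer_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()

# the same model fitted directly with base R (Gaussian family by default)
glm_fit <- stats::glm(hours ~ age + college, data = gss)

# estimates side by side; only the intercept's label differs
comparison <- data.frame(
  term = infer_fit$term,
  infer = infer_fit$estimate,
  glm = unname(coef(glm_fit))
)
```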

If passed a generate()d object, the model will be fitted to each replicate.

gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  fit()
#> # A tibble: 300 x 3
#> # Groups:   replicate [100]
#>    replicate term          estimate
#>        <int> <chr>            <dbl>
#>  1         1 intercept     44.4    
#>  2         1 age           -0.0767 
#>  3         1 collegedegree  0.121  
#>  4         2 intercept     41.8    
#>  5         2 age            0.00344
#>  6         2 collegedegree -1.59   
#>  7         3 intercept     38.3    
#>  8         3 age            0.0761 
#>  9         3 collegedegree  0.136  
#> 10         4 intercept     43.1    
#> # … with 290 more rows

If type = "permute", a set of unquoted column names in the data to permute (independently of each other) can be passed via the variables argument to generate(). It defaults to only the response variable.

gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute", variables = c(age, college)) %>%
  fit()
#> # A tibble: 300 x 3
#> # Groups:   replicate [100]
#>   ...


13 Jan 17:00

simonpcouch

v0.5.4

cffc082

Standardized proportion z statistic, improvements to various helpers

rep_sample_n() no longer errors when supplied a prob argument (#279)
Added rep_slice_sample(), a light wrapper around rep_sample_n(), that more closely resembles dplyr::slice_sample() (the function that supersedes dplyr::sample_n()) (#325)
Added success, correct, and z arguments to prop_test() (#343, #347, #353)
Implemented observed statistic calculation for the standardized proportion $z$ statistic (#351, #353)
Various bug fixes and improvements to documentation and errors.
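
A brief sketch of the helpers touched in this release, using infer's bundled gss data. The hypothesized proportion p = 0.2, the seed, and the sample sizes are illustrative assumptions, not values taken from the release notes.

```r
library(infer)

# one-sample proportion test with an explicit success level
chisq_res <- prop_test(gss, college ~ NULL, p = 0.2, success = "degree")

# the same test reporting the standardized z statistic instead
z_res <- prop_test(gss, college ~ NULL, p = 0.2, success = "degree", z = TRUE)

# three replicates of five rows each, in the style of dplyr::slice_sample()
set.seed(1)  # illustrative seed
samples <- rep_slice_sample(gss, n = 5, reps = 3)
```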


15 Jul 16:59

simonpcouch

v0.5.3

5b57562

Bias-corrected confidence intervals

get_confidence_interval() can now produce bias-corrected confidence intervals by setting type = "bias-corrected". Thanks to @davidbaniadam for the initial implementation (#237, #318)!
get_confidence_interval() now uses column names ('lower_ci' and 'upper_ci') in output that are consistent with other infer functionality (#317).
Fix CRAN check failures related to long double errors.
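
A minimal sketch of requesting a bias-corrected interval, assuming a bootstrap distribution of means from infer's gss data. The seed, the number of replicates, and the object names are illustrative; the point_estimate argument supplies the observed statistic that the correction is based on.

```r
library(infer)

set.seed(42)  # illustrative seed

# observed mean hours worked per week
obs_mean <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# bootstrap distribution of the mean
boot_dist <- gss %>%
  specify(response = hours) %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "mean")

# bias-corrected interval; output columns are lower_ci and upper_ci
ci_bc <- get_confidence_interval(
  boot_dist,
  level = 0.95,
  type = "bias-corrected",
  point_estimate = obs_mean
)
```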


Releases: tidymodels/infer
