
ENH: Generate var_names from the data and partial predict #98

Open · wants to merge 6 commits into master from varnames
Conversation

@thequackdaddy (Contributor) commented Dec 29, 2016

Hello,

I have a proposal that really came about because of the way I've been interacting with patsy.

My datasets are kind of long and kind of wide. I have lots of fields that I use for exploring stuff, but naturally most of them don't pan out.

I've been using bcolz because it stores the data in a columnar fashion, making it really easy to pull just the columns you need. Before, I'd been creating a list of the variables I wanted, defining all the transforms I needed in patsy, and then feeding that through. I can't load the entire dataset into memory because it's too wide and long, and I might only be looking at 20-30 columns for any one model.

So I propose having patsy attempt to figure out which columns it needs from the data, using a new var_names method available on DesignInfo, EvalFactor, and Term. In a nutshell, it gets a list of all the variables used, checks whether each one is defined in the EvalEnvironment, and if not, assumes it must come from the data.
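
To make that concrete, here's a rough sketch of the lookup (guess_data_names is a hypothetical stand-alone helper, not this PR's implementation) built on patsy's existing ast_names and EvalEnvironment:

from patsy.eval import EvalEnvironment, ast_names

def guess_data_names(code, depth=0):
    # Sketch of the proposed logic: any name appearing in the factor's
    # code that the captured environment cannot resolve is assumed to
    # be a data column.
    env = EvalEnvironment.capture(depth + 1)
    data_names = set()
    for name in ast_names(code):
        try:
            env.namespace[name]  # lookup only; KeyError means "not in env"
        except KeyError:
            data_names.add(name)
    return data_names

# e.g. with numpy imported as np, guess_data_names("np.log(integer)")
# returns {"integer"}: np resolves in the environment, integer does not.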

I've called this var_names for now, though arguably non_eval_var_names might be more accurate? Open to suggestions here.

One nice thing is that when using incr_dbuilder, it can automatically slice on the columns which makes the construction much faster (for me at least).
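
For example (hypothetical file name and column list, just to show the pattern):

import numpy as np
import pandas as pd
from patsy import incr_dbuilder

# With var_names known up front, each chunk can carry only the columns
# the formula actually needs instead of the full (wide) table.
needed = ["categorical", "integer"]  # e.g. the output of var_names

def data_chunks():
    # incr_dbuilder may iterate more than once, so return a fresh reader
    return pd.read_csv("wide.csv", usecols=needed, chunksize=100000)

design_info = incr_dbuilder("categorical * np.log(integer)", data_chunks)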

Here's a gist demo'ing this.

https://gist.github.com/thequackdaddy/2e601afff4fbbfe42ed31a9b2925967d

Let me know what you think.

@codecov-io commented Dec 29, 2016

Codecov Report

Merging #98 into master will increase coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
+ Coverage   98.96%   98.99%   +0.03%     
==========================================
  Files          30       30              
  Lines        5585     5760     +175     
  Branches      775      803      +28     
==========================================
+ Hits         5527     5702     +175     
  Misses         35       35              
  Partials       23       23
Impacted Files Coverage Δ
patsy/user_util.py 100% <100%> (ø) ⬆️
patsy/test_build.py 98.1% <100%> (+0.1%) ⬆️
patsy/desc.py 98.42% <100%> (+0.07%) ⬆️
patsy/design_info.py 99.68% <100%> (+0.06%) ⬆️
patsy/build.py 99.62% <100%> (ø) ⬆️
patsy/eval.py 99.16% <100%> (+0.04%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@thequackdaddy force-pushed the varnames branch 6 times, most recently from b0dc258 to 460a6f9 on March 4, 2017
@thequackdaddy (Contributor, Author)

I went ahead and built the partial function that I alluded to in #93. This makes it much easier to create design matrices for statsmodels that show the marginal differences when you only change the levels of one (or more) factors.

Here's a basic example:

In [1]: from patsy import dmatrix
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...: data = pd.DataFrame({'categorical': ['a', 'b', 'c', 'b', 'a'],
   ...:                      'integer': [1, 3, 7, 2, 1],
   ...:                      'flt': [1.5, 0.0, 3.2, 4.2, 0.7]})
   ...: dm = dmatrix('categorical * np.log(integer) + bs(flt, df=3, degree=3)',
   ...:  data)
   ...: dm.design_info.partial({'categorical': ['a', 'b', 'c']})
   ...:
Out[1]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [2]: dm.design_info.partial({'categorical': ['a', 'b'],
   ...:                         'integer': [1, 2, 3, 4]},
   ...:                        product=True)
Out[2]:
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.69314718,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.09861229,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.38629436,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.69314718,  0.69314718,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.09861229,  1.09861229,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.38629436,  1.38629436,
         0.        ,  0.        ,  0.        ,  0.        ]])

@thequackdaddy (Contributor, Author) commented Mar 4, 2017

@njsmith Also, it appears that Travis isn't kicking off for this all of a sudden. Any ideas why that would be?

I'm fairly certain this will pass. Here is the branch on my Travis.

@thequackdaddy changed the title from "ENH: Generate var_names from the data" to "ENH: Generate var_names from the data and partial predict" on Mar 4, 2017
@njsmith (Member) commented Mar 5, 2017

It seems like it would be simpler to query a ModelDesc for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because...

The even simpler option (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:

class LazyData(dict):
    def __missing__(self, key):
        try:
            # pseudocode: read a single column from on-disk storage
            # (bcolz.load / BcolzKeyNotFound are illustrative names)
            return bcolz.load(key, file)
        except BcolzKeyNotFound:
            raise KeyError(key)
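
Hypothetical usage: dmatrix("np.log(x1) + x2", LazyData()) would then pull only x1 and x2 from storage, with no up-front column list needed at all.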

Would this work for you?

Is the partial part somehow tied to the var_names part? They look like separate changes to me, so they should probably be in separate PRs?

This is also missing lots of tests, but let's not worry about that until after the high-level discussion...

@thequackdaddy (Contributor, Author)

> It seems like it would be simpler to query a ModelDesc for all the variables it uses, period? And then it'd be your job to ignore the ones that aren't present in your data set. This would also be more accurate, because...

Hmm... I hadn't thought of that. That should be relatively easy to add/change based on what I've done so far. The heart of this is that var_names on the EvalFactor class looks at all the objects needed to evaluate the factor, using the ast_names function. This is in turn used by the Term class (and that in turn by the DesignInfo class). ModelDesc has a list of terms (lhs_termlist and rhs_termlist), so adding this would be easy.

I presume you're implying that I shouldn't be worrying about the EvalEnvironment variables and should just return every dependent object, function and module alike? I was trying to return only "data"-ish things. Simply removing them from the output set manually seems easy enough...
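
For what it's worth, a rough sketch of that ModelDesc-level variant (all_var_names is a hypothetical name, not code from this PR) could just union over the term lists:

from patsy import ModelDesc
from patsy.eval import ast_names

def all_var_names(model_desc):
    # The "return everything, period" variant: every name in every
    # factor's code, functions and modules included, with no
    # EvalEnvironment filtering.
    names = set()
    for term in model_desc.lhs_termlist + model_desc.rhs_termlist:
        for factor in term.factors:
            names |= set(ast_names(factor.code))
    return names

desc = ModelDesc.from_formula("y ~ np.log(integer) + categorical")
# all_var_names(desc) == {"y", "np", "integer", "categorical"}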

> The even simpler option (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:

This is really clever, thanks! I'll try it. However, I don't think it solves the partial issue below.

> Is the partial part somehow tied to the var_names part? They look like separate changes to me, so they should probably be in separate PRs?

Yes. partial looks at each Term's var_names and decides whether the Term needs the variable or not. If it does, it uses subset to build the design matrix columns for just that Term from the variables specified; otherwise, it returns columns full of zeros. The end result is a design matrix with the same width and column alignment as the model's DesignMatrix, but with only as many rows as needed to evaluate the partial predictions, and zeros in the remaining columns.
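
For reference, here's a hand-rolled sketch of that mechanism against the existing subset()/build_design_matrices() API (the exact var_names() signature is an assumption, and this is not the PR's actual implementation):

import numpy as np
from patsy import build_design_matrices

def manual_partial(design_info, new_data):
    # term.var_names() is the API proposed in this PR; subset() and
    # build_design_matrices() already exist in patsy.
    keep = [name for name, term in zip(design_info.term_names,
                                       design_info.terms)
            if set(term.var_names()) & set(new_data)]
    sub_info = design_info.subset(keep)
    (sub,) = build_design_matrices([sub_info], new_data)
    # Pad back out to the full design's width; terms that were not
    # built (including the intercept) get columns of zeros.
    full = np.zeros((sub.shape[0], len(design_info.column_names)))
    for col in sub_info.column_names:
        full[:, design_info.column_names.index(col)] = \
            sub[:, sub_info.column_names.index(col)]
    return full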

> This is also missing lots of tests, but let's not worry about that until after the high-level discussion...

Sounds good. Writing tests is not something I've excelled at. This is somewhat tested, and I think there is coverage for most of the new lines, though I've likely missed a few. I added some asserts to some of the existing tests to exercise the new functionality.
