ENH: Generate var_names from the data and partial predict #98
Codecov Report
@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
+ Coverage   98.96%   98.99%   +0.03%
==========================================
  Files          30       30
  Lines        5585     5760     +175
  Branches      775      803      +28
==========================================
+ Hits         5527     5702     +175
  Misses         35       35
  Partials       23       23
I went ahead and built the Here's a basic example:
In [1]: from patsy import dmatrix
...: import pandas as pd
...: import numpy as np
...:
...: data = pd.DataFrame({'categorical': ['a', 'b', 'c', 'b', 'a'],
...: 'integer': [1, 3, 7, 2, 1],
...: 'flt': [1.5, 0.0, 3.2, 4.2, 0.7]})
...: dm = dmatrix('categorical * np.log(integer) + bs(flt, df=3, degree=3)',
...: data)
...: dm.design_info.partial({'categorical': ['a', 'b', 'c']})
...:
Out[1]:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0.]])
In [2]: dm.design_info.partial({'categorical': ['a', 'b'],
...: 'integer': [1, 2, 3, 4]},
...: product=True)
Out[2]:
array([[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0.69314718, 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 1.09861229, 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 1.38629436, 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0.69314718, 0.69314718,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 1.09861229, 1.09861229,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 1.38629436, 1.38629436,
0. , 0. , 0. , 0. ]])
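Reading Out[2], product=True appears to evaluate every combination of the supplied values, with the first key varying slowest. The sketch below is only an illustration of that expansion using itertools.product; the helper name product_grid is hypothetical and is not part of patsy or of this PR.

```python
import itertools
import math


def product_grid(values):
    """Expand {name: [values]} into one dict per combination.

    Hypothetical helper: the first key varies slowest, matching the
    row order seen in Out[2] above.
    """
    names = list(values)
    combos = itertools.product(*(values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


grid = product_grid({"categorical": ["a", "b"], "integer": [1, 2, 3, 4]})

# np.log(integer) for each row, rounded the way the array repr prints it.
log_integer = [round(math.log(row["integer"]), 8) for row in grid]
```

The eight rows of grid line up with the eight rows of Out[2], and log_integer reproduces the np.log(integer) column.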
It seems like it would be simpler to query a The even simpler (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:
class LazyData(dict):
    def __missing__(self, key):
        try:
            return bcolz.load(key, file)
        except BcolzKeyNotFound:
            raise KeyError(key)
Would this work for you? Is the This is also missing lots of tests, but let's not worry about that until after the high-level discussion...
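A self-contained variant of the lazy-dict idea, as a rough sketch: the loader callable, the in-memory _STORE, and the caching step are all assumptions made for illustration, standing in for whatever the columnar store actually provides.

```python
class LazyColumns(dict):
    """Dict that loads a column the first time it is looked up."""

    def __init__(self, load_column):
        super().__init__()
        # Hypothetical loader: takes a column name, returns the column,
        # and raises KeyError for names that do not exist.
        self._load_column = load_column

    def __missing__(self, key):
        # dict.__getitem__ calls this only when `key` is not stored yet.
        column = self._load_column(key)
        self[key] = column  # cache so each column is loaded at most once
        return column


# Stand-in for an on-disk columnar store (assumption for the demo).
_STORE = {
    "integer": [1, 3, 7, 2, 1],
    "flt": [1.5, 0.0, 3.2, 4.2, 0.7],
    "unused": [0, 0, 0, 0, 0],
}
loaded = []


def load_column(name):
    loaded.append(name)  # record which columns were actually read
    return _STORE[name]


data = LazyColumns(load_column)
first = data["integer"]
second = data["integer"]  # served from the cache, no second load
```

Only columns that are actually indexed get loaded. Note that dict.get and the in operator do not go through __missing__, so this only helps a consumer that looks names up with data[name].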
Hmm... I hadn't thought of that. That should be relatively easy to add/change based on what I've done so far. The heart of this is the I presume you're implying that I shouldn't be worrying about the
This is really clever, thanks! I'll try it. However, I don't think it solves the
Yes.
Sounds good. Writing tests is not something I've excelled at. This is somewhat tested, and I think there is coverage for most of the new lines, though I likely missed a few. I added some
Hello,
I have a proposal that really came about because of the way I've been interacting with patsy.
My datasets are kind of long and kind of wide. I have lots of fields that I use for exploring stuff, but naturally they just don't work out.
I've been using bcolz because it stores the data in a columnar fashion, making horizontal slices really easy. Before, I'd been creating a list of variables that I wanted, defining all the transforms that I needed in patsy, and then feeding that through. I can't load the entire dataset into memory because it's too wide and long, and I might only be looking at 20-30 columns for any one model.
So I propose having patsy attempt to figure out which columns it needs from the data using this new var_names method, which is available on DesignInfo, EvalFactor, and Term. In a nutshell, it gets a list of all the variables used, checks if each variable is defined in the EvalEnvironment, and if not, assumes it must be data.
I've called this var_names for now, but arguably non_eval_var_names might be more accurate? Open to suggestions here.
One nice thing is that when using incr_dbuilder, it can automatically slice on the columns, which makes the construction much faster (for me at least).
Here's a gist demo'ing this.
https://gist.github.com/thequackdaddy/2e601afff4fbbfe42ed31a9b2925967d
Let me know what you think.
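To illustrate the rule described above (collect every name a factor uses, drop the ones the evaluation environment already defines, treat the rest as data), here is a minimal sketch built on the standard ast module. It is not the PR's implementation: data_var_names is a hypothetical function, a plain dict stands in for the EvalEnvironment, and it handles a single factor expression rather than a full formula.

```python
import ast
import builtins
import math


def data_var_names(factor_code, env):
    """Return the names in `factor_code` that are not resolvable in `env`.

    Hypothetical stand-in for the proposed var_names: `env` is a plain
    dict playing the role of the evaluation environment.
    """
    tree = ast.parse(factor_code, mode="eval")
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    # Anything defined in the environment (or a Python builtin) is code,
    # so whatever is left over is assumed to be a data column.
    return {name for name in used if name not in env and not hasattr(builtins, name)}


# Stand-in environment: `np` and `bs` are "defined", the columns are not.
env = {"np": math, "bs": lambda *args, **kwargs: None}

log_vars = data_var_names("np.log(integer)", env)
spline_vars = data_var_names("bs(flt, df=3, degree=3)", env)
```

With the set of data names in hand, a caller could request just those columns from a columnar store before building the design matrix.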