-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patsy pickling #67
base: master
Are you sure you want to change the base?
Patsy pickling #67
Conversation
Hey sorry, I'll get to this soon but I'll need to spend some time thinking hard about the current organization of |
Any news? |
Right, thanks for the ping :-). I spent a few hours looking at this last night, and it's complicated! The easy part: there are a bunch of simpler, lower-level objects that model pickling will rely on. For the most part I think we just need to add some tests for these (pickle/unpickle, and ideally having some historical pickles stored in the test suite to test that new versions of patsy continue to be able to unpickle them):
Then there's the core design stuff, mostly in
This is the hard part, because the internal representations here are a mess -- weird encoding of information, redundant encodings across different objects, and so on. This really needs some substantive refactoring. And then there's a few high-level objects that need some attention:
So I'm going to keep thinking about the best way to handle |
It might also be worth doing a quick pass through the source to find any other classes that aren't on the above list, and add |
…ignInfo Notable changes: - DesignInfo now exposes lots more metadata about how exactly different factors and terms are coded. - In fact, it exposes enough more metadata that you can now reconstruct a design matrix entirely from a DesignInfo, so DesignMatrixBuilder becomes redundant and is removed in favor of DesignInfo. - DesignInfo's constructor is very different; in particular, removed the option of specifying terms as strings, which was only useful for interoperability with competing formula libraries. Four years later, no such competitors have appeared, so I can't be bothered to keep maintaining this. Will re-add later if someone actually wants to use it. This versions works and passes tests, but a bunch more tests need to be added. This fixes #61, and sets us up to implemented pickling support (#26, #67).
Okay, I sat down and refactored big chunks of I also went through and added raise-an-error So now I think the todo list is just:
|
That's great! I don't have time to look at this right now, but I should have more time around the next weekend. |
Cool, no worries. If I get some time I might go ahead and release 0.4.0, so people can at On Sun, Jun 14, 2015 at 5:03 PM, Christian Hudon [email protected]
Nathaniel J. Smith -- http://vorpus.org |
Working on this. Quick questions. Should I do a |
I guess we need a For tests, I'd generally suggest round-trip tests, plus some tests that involve stored pickle files (or strings inside the test .py, or whatever) that we attempt to load (since this is the only way to make sure we haven't accidentally broken both the save and load pathways in parallel, which is surprisingly easy to do with pickle). |
Getting back to working on patsy. My plan is to write the pickling support and tests for a simple class (I picked So, while I'm in Also, I assume we don't care about forward compatibility (i.e. it's okay if an older version of patsy can't read a pickle generated by a newer version), correct? |
And for the tests, here's actually what I'm thinking of doing right now. You can tell me what you think, and if it sounds realistic. We want to achieve two main things with the tests for pickling: 1) testing that pickling round-trips, and 2) testing that pickles created by old versions of patsy still work on newer version. I'd like not to repeat myself too much between these two, so here's what I'm thinking. I create a bunch of interesting patsy objects, all contained in (say) a dict. Then for each of these objects, I can test that Do you think this would work as a testing strategy? Or am I missing something? |
Forward compatibility: It'd be good if trying to load a new pickle in an old version at least gave an error? But we can always deal with that later at the time we decide to change something in a breaking fashion. (I guess if one wants to get really fancy one could put a version number field in every pickle so that old versions will not just error out, but also be able to display a comprehensible error message - "This version of patsy is too old..." -- but this is definitely an extra credit kind of thing.) The testing strategy also looks good to me. I guess the one subtlety will be that if we add new stuff in a future release then we'll need to have some way to add new test objects that are only present in newer pickle dumps, while still keeping the old pickle dumps. One possible approach: represent the test objects as little bits of constructor code -- |
Hi! I just came across this PR via this issue: #26 Is this still the PR that would solve that issue? Is there a way that I can help here? |
What's happened so far has happened here, yes :-). IIRC the general plan for how to handle this is clear, but stalled out with the work mostly not done. If you and @chrish42 want to work on it together, or if @chrish42 is busy and you want to just pick up his work and continue where he left off, then either way it would be great, and I'm happy to continue providing advice and review. |
I've updated the description at the top to actually talk about the work that remains to be done, after going through all the comments in the pull request and other sources. @njsmith, let me know if that approach makes at least some sense or not. :-) After thinking about it some more, I still think that giving each "pickle test" its own name and corresponding object in a Python file is a better approach overall than saving the code as text inside a pickle. This mirrors other tests more (they all have a name too). We can discuss this approach more once I've added code to show how it would work in practice. (Will be easier to talk about it then.) |
@chrish42: top level comment looks reasonable, and if you have a better idea I am extremely happy to defer to it :-) |
This is not quite done or working yet, but it's starting to show the general approach. |
@njsmith Now would be good time to have a look and give any comments about the general approach. If things look okay, we can start making more of the patsy objects pickeable, and add more pickling testcases. Comments and questions welcome! |
|
||
from patsy import EvalFactor | ||
|
||
PICKE_TESTCASES_ROOTDIR = os.path.join(os.path.dirname(__file__), '..', 'pickle_testcases') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/PICKE/PICKLE/, I assume?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, hmm, so it looks like you're storing the pickle tests outside the source tree. Right now we go to some effort to ship the test suite, so that we can test patsy-as-installed... and I think the current CI stuff tries hard to run the test suite without access to the source tree, to make sure that this works. So this is definitely a change. Maybe it's a good change if the pickle test suite is large... but we'll need to do some more work in that case to make it really work. OTOH, maybe the pickle test suite is small, in which case we might as well ship it, because testing is important.
Probably the simplest thing to do is to move it into a patsy/test_data
directory inside the package for now, and then we can always move it back out again if we decide that we have too many megabytes of test pickles. (This will probably require some annoying change to setup.py as well to include the test data -- example.)
I made a bunch of small comments, but the overall approach looks fine to me! Thanks for working on this, and sorry for being slow in reviewing. |
b07ba3f
to
48fd2e4
Compare
4f8a70c
to
691eb4e
Compare
Branch for the rest of the work required to make patsy pickles work well. At a high level, we need to implement
__getstate__
and__setstate__
for relevant classes (easy), and add a testing framework (that is easy to use) that makes sure that pickles created with one version of patsy continue to work with newer versions (harder).Pickling testsuite:
pickles/vX.y/test_name.pickle
(where "test_name" is the key in the dict for the corresponding object). This program would be run before every release (with the version number of the release as an argument), to create a new set of pickle files for that release. Said pickle files would be added to the source repository.Pickle support:
__getstate__
and__setstate__
methods forEvalFactor
__getstate__
and__setstate__
methods forLookupFactor
__getstate__
and__setstate__
methods forTerm
__getstate__
and__setstate__
methods forEvalEnvironment
__getstate__
and__setstate__
methods for the stateful transform classes__getstate__
and__setstate__
methods forContrastMatrix
__getstate__
and__setstate__
methods forDesignInfo
__getstate__
and__setstate__
methods forDesignMatrix