encode/decode data directly to/from binary file #2
Comments
This would require storing the codebook along with the file somehow. There are many ways to do that, so you'd have to determine how you want to strike a balance between a number of factors:
One idea: the codec currently supports dumping a pickled form to a file, so one option would be to store the codebook and payload together in a single file using a framed format like RIFF. That would provide extensibility and use a standard file format; the downside is that RIFF's 32-bit size fields would limit the compressed payload to about 4 GiB.
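The framing idea above can be sketched with nothing but the standard library. This is a minimal illustration, not something dahuffman provides: the form type `HUFF` and the chunk ids `cbok` and `data` are made-up names, and the codebook is treated as opaque bytes (for example, whatever a pickled codec file contains).

```python
import io
import struct


def write_chunk(out, chunk_id: bytes, data: bytes) -> None:
    """Write one RIFF-style chunk: 4-byte id, 32-bit little-endian size, data."""
    if len(chunk_id) != 4:
        raise ValueError("chunk id must be exactly 4 bytes")
    if len(data) > 0xFFFFFFFF:
        # This is where the roughly 4 GiB limit comes from.
        raise ValueError("chunk too large for a 32-bit size field")
    out.write(chunk_id)
    out.write(struct.pack("<I", len(data)))
    out.write(data)
    if len(data) % 2:
        out.write(b"\x00")  # RIFF pads every chunk to an even length


def pack_container(codebook: bytes, payload: bytes) -> bytes:
    """Wrap codebook and payload in a single RIFF container."""
    body = io.BytesIO()
    body.write(b"HUFF")  # hypothetical form type
    write_chunk(body, b"cbok", codebook)  # hypothetical chunk id
    write_chunk(body, b"data", payload)   # hypothetical chunk id
    out = io.BytesIO()
    write_chunk(out, b"RIFF", body.getvalue())
    return out.getvalue()


def unpack_container(blob: bytes) -> dict:
    """Return a dict mapping chunk id to chunk data."""
    if blob[:4] != b"RIFF" or blob[8:12] != b"HUFF":
        raise ValueError("not a HUFF container")
    (riff_size,) = struct.unpack("<I", blob[4:8])
    end = 8 + riff_size
    pos = 12  # skip "RIFF", size, and form type
    chunks = {}
    while pos + 8 <= end:
        chunk_id = blob[pos:pos + 4]
        (size,) = struct.unpack("<I", blob[pos + 4:pos + 8])
        chunks[chunk_id] = blob[pos + 8:pos + 8 + size]
        pos += 8 + size + (size % 2)  # step over the pad byte, if any
    return chunks
```

Unknown chunk ids are simply collected, so a reader written this way would tolerate chunks added later, which is the extensibility benefit mentioned above.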
I think this is an interesting idea. What I'm wondering is whether it's possible to simply append the codec as regular 8-bit bytes to the end of the encoded bytes file in static use cases (as with mine). I'm using dahuffman in another pure Python module and trying to keep my codebase as small as possible, so bear with me as I'm trying to think in very simplistic terms.
In this kind of static case, I always know the length of both my bytes and my codec, which means I can hardcode the length of the codec into my open() methods, then grab that many bytes from the end of the file and use them as my codec.
A more graceful solution may be to do the same bytes-plus-codec append but also append the codec length at EOF, given that the last characters of the regular codec would never be numeric.
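A minimal sketch of the "append the codec, then its length" layout, assuming the codec is already available as bytes (for instance, the contents of a saved codec file). The helper names are hypothetical. It uses a fixed-width binary footer rather than trailing digits, so it does not depend on what the last characters of the codec look like.

```python
import struct
from typing import Tuple

# Assumed layout: payload | codec bytes | 8-byte little-endian codec length
FOOTER = struct.Struct("<Q")


def append_codec(payload: bytes, codec_bytes: bytes) -> bytes:
    """Concatenate payload and codec, then record the codec length at EOF."""
    return payload + codec_bytes + FOOTER.pack(len(codec_bytes))


def split_codec(blob: bytes) -> Tuple[bytes, bytes]:
    """Read the footer, then slice the blob into (payload, codec_bytes)."""
    if len(blob) < FOOTER.size:
        raise ValueError("blob too short to contain a length footer")
    (codec_len,) = FOOTER.unpack(blob[-FOOTER.size:])
    codec_end = len(blob) - FOOTER.size
    codec_start = codec_end - codec_len
    if codec_start < 0:
        raise ValueError("footer length exceeds blob size")
    return blob[:codec_start], blob[codec_start:codec_end]
```

In the static case with a known codec size, the footer can be dropped and `codec_len` replaced by a constant; the slicing stays the same.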
So now I have
What "formatting issues" are you running into? Error messages or a stack trace would be helpful.
Sorry, I said formatting issues, but really I just don't know how to handle hexadecimal. Specifically, in the above example, I've mixed the hexadecimal component in bytes with an encoded string. What would be best is to keep the codec in hex as it appears as a file after a codec.save()... then append the length of the codec as a hexadecimal code as well. The best way I can think to retrieve it then is using regular expressions, something like
I'm also realizing though that when I said "a more graceful solution", I really meant a more dynamic solution. So while using regex makes sense to me, I might just choose to go for the simple:
My assertions all pass, so from now on I can import
To actually address the feature idea proposed by @soxofaan, one solution is incorporating the above into the huffmancodec's static method
Are we guaranteed the codec will always be 1418 bytes, though? If not, we'd need a simple way for someone to figure out the codec length ahead of time.
Not for any codec, no. In my specific case, yes. So if there are any other codecs that are guaranteed to stay the same size, hard-coding like this works. I think the simple way to find the length ahead of time is to produce the codec by using
Hi, thank you for your interest in dahuffman. It just started as a toy project; I didn't know it was used in real projects :) I'd like to raise some points in this discussion.
I somewhat disagree here. I don't think that the codebook must be stored along with the file (as a separate sidecar file or in a file header/footer).
Also note that it might not be trivial to encode the codebook to bytes in a generic way. Dahuffman is designed to encode a symbol stream that is not necessarily text. The symbols just have to be hashable and could be other things like integers, enums, tuples (e.g. 2D points in space) or user-defined classes. That's why I used pickle: it supports arbitrary Python structures, but it is indeed Python-specific, which is a limitation. An alternative could be something JSON-based, which is more widely supported than pickle and is probably flexible enough for most codebook use cases.
On a more general note: I'm not sure whether it's worth pushing dahuffman into the already very crowded space of file compression formats and tools (zip, tar, gzip, bzip, rar, ...). These formats are well established, and most of them are even supported by the Python stdlib. So it feels like there are already enough good tools for the job here.
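To illustrate the JSON-based alternative, here is a rough sketch under stated assumptions: the code table is taken to be a plain dict mapping each symbol to a (bit length, integer value) pair, which is an assumption made for this example and not a statement about dahuffman's internals, and the function names are hypothetical. Entries are stored as a list of triples instead of a JSON object, because JSON object keys must be strings and that would lose integer and tuple symbols.

```python
import json


def code_table_to_json(table: dict) -> str:
    """Serialize {symbol: (bits, value)} as a JSON list of [symbol, bits, value]."""
    entries = [[symbol, bits, value] for symbol, (bits, value) in table.items()]
    return json.dumps({"version": 1, "codes": entries})


def code_table_from_json(text: str) -> dict:
    """Rebuild the dict; top-level list symbols are turned back into tuples."""
    doc = json.loads(text)
    table = {}
    for symbol, bits, value in doc["codes"]:
        if isinstance(symbol, list):
            # JSON has no tuple type; lists are unhashable, so convert back.
            symbol = tuple(symbol)
        table[symbol] = (bits, value)
    return table
```

This round-trips strings, integers and flat tuples of those. Nested tuples, enums and user-defined classes would need an explicit, per-type encoding, which is the generality that pickle gives for free. In exchange, loading such a file only ever builds plain data.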
I agree. In reality, whether to combine the codebook with the file or keep it out of the file, as you've said, is in my opinion very much an aesthetic choice, not one of practicality or of making dahuffman a zip equivalent, since zip does extra tricks to be extra good at its job.
We have a problem in our quest in exterminating
TBH I don't think we should allow compression of data of non-primitive types. My main motivation is to eliminate the code execution vulnerability inherently present in everything that uses
feature idea: functionality to encode/decode directly to/from a binary file