
encode/decode data directly to/from binary file #2

Open
soxofaan opened this issue Jun 26, 2019 · 10 comments

Comments
@soxofaan
Owner

feature idea: functionality to encode/decode directly to/from a binary file

@dargueta

dargueta commented Sep 3, 2020

This would require storing the codebook along with the file somehow. There are many ways to do that, so you'd have to decide how to balance a number of factors:

  1. Do you want the file format you write to be usable by other software? If so, what restrictions would you have to impose to accomplish this? You wouldn't be able to serialize arbitrary Python values via pickling if this is the case.
  2. If you don't care about other software being able to read files this generates, what features do you want to be able to encode in the file?

One idea: the codec currently supports dumping a pickled form to a file, so one possibility would be to store the codebook and payload in a file using a framed format like RIFF. It'd provide extensibility and use a standard file format; the downside is it'd cap the compressed payload at four gigs.
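
To make that concrete, here's a rough sketch of the framing (the "HUFF" form type and "CBOK"/"DATA" chunk IDs are made up for illustration; none of this exists in dahuffman today):

import struct

def write_chunk(f, chunk_id: bytes, payload: bytes):
    # Each RIFF chunk: 4-byte ID, little-endian uint32 size, data, pad to even.
    f.write(chunk_id)
    f.write(struct.pack('<I', len(payload)))  # the uint32 is what caps chunks at ~4 GiB
    f.write(payload)
    if len(payload) % 2:
        f.write(b'\x00')

def save_framed(path, codebook_bytes: bytes, encoded: bytes):
    # The RIFF size field covers the form type plus all chunks (header + padded data).
    body_size = 4 + sum(8 + len(p) + (len(p) % 2) for p in (codebook_bytes, encoded))
    with open(path, 'wb') as f:
        f.write(b'RIFF')
        f.write(struct.pack('<I', body_size))
        f.write(b'HUFF')
        write_chunk(f, b'CBOK', codebook_bytes)  # pickled codebook
        write_chunk(f, b'DATA', encoded)         # compressed payload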

@GregCM

GregCM commented Oct 9, 2020

I think this is an interesting idea. What I'm wondering is whether it's possible to simply append the codec, as regular 8-bit bytes, to the end of the encoded bytes file in static use cases (such as mine). I'm integrating dahuffman into another pure-Python module and trying to keep my codebase as small as possible, so bear with me while I think in very simplistic terms.

In this kind of static case I always know the length of both my bytes and my codec, which means I can hardcode the codec length into my open() calls, grab that many bytes from the end of the bytes file, and redesignate them as my codec in the namespace.

A more graceful solution might be to do the same bytes-plus-codec append but then also append the codec length at EOF, given that the last characters of the codec itself would never be numeric.

# Producing a sample (bytes + codec) binary file
# -------------------------------
from dahuffman import HuffmanCodec

codec = HuffmanCodec.load('.codec')
ct = codec.get_code_table()

# Stringify the code table, then append its own length
sct = str(ct)
s = '%s%s' % (sct, len(sct))
# UTF-8
b = s.encode()

# copybytes is just a copy of my original bytes file
with open('copybytes', 'ab') as f:
    f.write(b)
# -------------------------------

# Reading from the sample file
# -------------------------------
with open('copybytes', 'rb') as f:
    cb = f.read()

So now I have cb, but I haven't worked with pickle before, so I'm running into simple formatting issues. Before I spend any more time thinking about it, do you think there's potential in the basic idea I'm getting at?

@dargueta

dargueta commented Oct 12, 2020

What "formatting issues" are you running into? Error messages or a stack trace would be helpful.

@GregCM

GregCM commented Oct 12, 2020

Sorry, I said formatting issues, but really I just don't know how to handle hexadecimal.

Specifically, in the above example I've mixed the hexadecimal component in bytes with an encoded string. It would be better to keep the codec in hex, as it appears as a file after codec.save(), and then also append the length of the codec in that form. The best way I can think of to retrieve it is with regular expressions, something like

codecsize_match = re.search(rb'[\x30-\x39]+$', cb)
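
Fleshing that out (a sketch, assuming the length was appended as ASCII digits right after the codec, as in my snippet above):

import re

with open('copybytes', 'rb') as f:
    cb = f.read()

# The appended length is the run of ASCII digits (\x30-\x39) at EOF;
# re.search with "$" finds it, where re.match would anchor at the start.
m = re.search(rb'[\x30-\x39]+$', cb)
codec_len = int(m.group())          # e.g. b'1418' -> 1418
ndigits = len(m.group())
codec_bytes = cb[-(ndigits + codec_len):-ndigits]
payload = cb[:-(ndigits + codec_len)]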

I'm also realizing though that when I said "a more graceful solution", I really meant a more dynamic solution. So while using regex makes sense to me, I might just choose to go for the simple:

# Reading from the sample bytes file
# -------------------------------
with open('copybytes', 'rb') as f:
    cb = f.read()

# Reading from the sample codec
# -------------------------------
with open('codec', 'rb') as f:
    c = f.read()

# Writing the codec to the bytes file
# -------------------------------
with open('copybytes', 'ab') as f:
    f.write(c)

# Test the results
# -------------------------------
with open('copybytes', 'rb') as f:
    # cb + c
    cbc = f.read()

assert cbc == cb + c
# Where my codec totals 1418 bytes
assert cbc[0:-1418] == cb
assert cbc[-1418:] == c

My assertions all pass, so from now on I can import cbc instead of cb and grab the last 1418 bytes of it to use as my codec.

To actually address the feature idea proposed by @soxofaan , one solution is to incorporate the above into HuffmanCodec's static load() method as an alternative to loading from a path (i.e. load from the namespace rather than from a file). The size of the codec could be passed in alongside the variable. For example:

HuffmanCodec.load(var=cbc, size=1418)
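
A hedged sketch of what that could look like from the outside; since the current load() only accepts a path, this hypothetical helper just round-trips through a temporary file:

import os
import tempfile

from dahuffman import HuffmanCodec

def load_codec_from_buffer(buf: bytes, size: int) -> HuffmanCodec:
    # "buf" is the combined (encoded data + codec) bytes, "size" the known
    # byte length of the codec appended at the end.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(buf[-size:])
        return HuffmanCodec.load(path)
    finally:
        os.remove(path)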

@dargueta

dargueta commented Oct 12, 2020

Are we guaranteed the codec will always be 1418 bytes, though? If not, we'd need a simple way for someone to figure out the codec length ahead of time.
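
One way to avoid that would be to append the codec and then a fixed-width binary length footer, so the reader never needs to know the size in advance. A sketch (pack/unpack are illustrative names, not proposed API):

import struct

def pack(encoded: bytes, codec_bytes: bytes) -> bytes:
    # Codec goes after the payload, followed by its length as a 4-byte
    # big-endian integer that is trivial to find at EOF.
    return encoded + codec_bytes + struct.pack('>I', len(codec_bytes))

def unpack(buf: bytes):
    (codec_size,) = struct.unpack('>I', buf[-4:])
    codec_bytes = buf[-4 - codec_size:-4]
    encoded = buf[:-4 - codec_size]
    return encoded, codec_bytes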

@GregCM

GregCM commented Oct 12, 2020

Not for any codec, no. In my specific case, yes. So if there are any other codecs that are guaranteed to stay the same size, hard-coding like this works.

I think that simple way is to produce the codec with the from_data() or from_frequencies() methods, then take its length and pass it into load(). Assuming the size of the file won't change while it's encoded, everything can be done in the namespace. An example that gets the job done:

from dahuffman import HuffmanCodec
import os


def bytes_from_data(data):
    # Creating a codec from my original data
    codec = HuffmanCodec.from_data(data)

    # Temporarily save the codec
    codec.save('codec')

    # For lack of a better method, I read the codec in as bytes
    with open('codec', 'rb') as f:
        codec_bytes = f.read()

    # No longer needed
    os.remove('codec')
    return codec, codec_bytes


def encode_combined(data):
    codec, codec_bytes = bytes_from_data(data)
    edata = codec.encode(data)
    # Now I'll combine my encoded data and its codec
    output = b''.join([edata, codec_bytes])
    with open('encoded', 'wb') as f:
        f.write(output)

    codec_size = len(codec_bytes)
    # "edata" is returned only for demonstration, we don't actually need it.
    return output, codec_bytes, codec_size, edata


def decode_combined(output, codec_size):
    codec_bytes = output[-codec_size:]
    edata = output[0:-codec_size]

    # This block would ideally be replaced by a modified load() method
    # ============================================
    with open('codec', 'wb') as f:
        f.write(codec_bytes)

    codec = HuffmanCodec.load('codec')
    os.remove('codec')  # the temporary file is no longer needed
    # ============================================
    data = codec.decode(edata)
    return data

# 10 sets of ['0', '1', ..., '9']
data = [str(i) for i in range(10)] * 10

output, codec_bytes, codec_size, edata = encode_combined(data)

assert output == edata + codec_bytes
assert output[0:-codec_size] == edata
assert output[-codec_size:] == codec_bytes
print(decode_combined(output, codec_size))

# Say my original data changes...
data.append('10')
# ... that's okay as long as it only changes in the Python namespace
output, codec_bytes, codec_size, edata = encode_combined(data)

assert output == edata + codec_bytes
assert output[0:-codec_size] == edata
assert output[-codec_size:] == codec_bytes
print(decode_combined(output, codec_size))

@soxofaan
Owner Author

Hi, thank you for your interest in dahuffman. It started as a toy project; I didn't know it was being used in real projects :)

I'd like to raise some points in this discussion.

This would require storing the codebook along with the file somehow

I somewhat disagree here. I don't think the codebook must be stored along with the file (as a separate sidecar file or in a file header/footer).
This is actually how I originally thought the module would be used: the codebook is hardcoded somewhere in your application code and you send around just the encoded bytes as "messages" (or files), without the codebook.
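
For example (the frequencies here are just made up for illustration):

from dahuffman import HuffmanCodec

# The codebook lives in the application code on both sides; only the
# encoded bytes travel between them.
FREQUENCIES = {'a': 8, 'e': 12, 'o': 7, 't': 9, ' ': 13}
codec = HuffmanCodec.from_frequencies(FREQUENCIES)

encoded = codec.encode('a tote')
assert ''.join(codec.decode(encoded)) == 'a tote'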

Also note that it might not be trivial to encode the codebook to bytes in a generic way. Dahuffman is designed to encode a symbol stream that is not necessarily text. The symbols just have to be hashable and could be other things, like integers, enums, tuples (e.g. 2D points in space) or user-defined classes. That's why I used pickle: it supports arbitrary Python structures. But indeed it is Python-specific, which is a limitation. An alternative could be something JSON-based, which is more widely supported than pickle and probably flexible enough for most codebook use cases.
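
To sketch the JSON idea (assuming get_code_table() maps each symbol to a (num_bits, value) pair; JSON object keys must be strings, which is exactly the kind of restriction this would impose):

import json
from dahuffman import HuffmanCodec

codec = HuffmanCodec.from_data('lorem ipsum dolor sit amet')
table = codec.get_code_table()

# Keep only string symbols; sentinel or user-defined symbols would need
# a serialization scheme of their own.
str_table = {s: list(code) for s, code in table.items() if isinstance(s, str)}
serialized = json.dumps(str_table)
restored = {s: tuple(code) for s, code in json.loads(serialized).items()}
assert restored == {s: tuple(c) for s, c in str_table.items()}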

On a more general note: I'm not sure whether it's worth it to push dahuffman into the already very crowded space of file compression formats and tools (zip, tar, gzip, bzip, rar, ...). These formats are well established, and most of them are even supported by the Python stdlib. So it feels there are already enough good/best tools for the job here.

@GregCM

GregCM commented Oct 24, 2020

it feels there are already enough good/best tools for the job

I agree. In reality, combining the codebook with the file, or omitting it as a file like you've said, is in my opinion very much an aesthetic choice, not one of practicality or of making dahuffman a zip equivalent, since zip does extra tricks to be extra good at its job.

@KOLANICH

KOLANICH commented Dec 1, 2021

We have a problem in our quest to exterminate pickle. Currently arbitrary functions are serialized into the format ("concat": self._concat, and it is a Callable). That is unacceptable and we should get rid of it.

@KOLANICH

KOLANICH commented Dec 1, 2021

TBH I don't think we should allow compression of data of non-primitive types. IntEnums are OK because they can be losslessly converted to plain ints and have the same hash.
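
A quick illustration of that point:

from enum import IntEnum

class Color(IntEnum):
    RED = 1
    GREEN = 2

# IntEnum members hash and compare equal to their plain-int values, so a
# code table keyed by Color behaves identically to one keyed by int.
assert Color.RED == 1 and hash(Color.RED) == hash(1)
assert {Color.RED: 'code'}[1] == 'code'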

My main motivation is to eliminate the code-execution vulnerability inherently present in everything that uses pickle. It was a big mistake to introduce pickle into the standard library.
