This repository has been archived by the owner on Dec 24, 2019. It is now read-only.
This is an attempt to fix #37.
It still needs more documentation and some tests, but before I do that, I wanted to show the approach I took. (So please don't merge this yet. I also need to write a better commit message.)
The problem was that some fix functions had `dst.write(fixed.decode())`, and later `fix_file` had `fd.write(fixed.encode())`. Both use the system's default encoding, which on my system seems to be ASCII. This also meant we could get multiple decode steps (one for each problem), and if a Jalopy run or similar happens after one of those decode steps, it becomes even less clear what is happening.

The basic idea of my fix is that each file type has one rule which indicates the encoding. The rule is named after one of the encodings supported by the `codecs` package. (In the default configuration, this is just `utf-8` for most file types and `ascii` for `.properties`; there should be some test cases for latin-1/iso-8859-1, at least, too.) The function `get_encoding_rule(rules)` simply iterates over the rule names and passes each one to `codecs.lookup` until one of them succeeds. For this reason I needed to reorder the rules for the `*.properties` entry in the default config.

That encoding is then used (apart from the check that the file can actually be decoded with this encoding) by those fix functions which work on the content of the file themselves and don't just call external programs (like Jalopy, pythontidy, etc.). For those functions (for now just `_fix_notabs`, `_fix_nocr` and `_fix_notrailingws`) I invented a decorator `@needs_unicode`, which wraps the function call in `codecs.EncodedFile`.

I needed to store the used encoding in a global dict `ENCODING_BY_FILE`, analogous to the `VALIDATION_ERRORS` list. Please propose a better way to do this if this is not the way to go.

I noted that for the added functionality it would be useful to have some automated tests (i.e. a bundle of config file, command-line arguments, input file and expected output/expected fixed file). Is there some standard Python way of doing this?
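
To make the encoding rule lookup and the decorator idea more concrete, here is a minimal, simplified sketch. It is not the code from this PR. The `(path, data)` signature of the fix function, the helper `example_fix_notabs`, and returning `None` when no rule names an encoding are all assumptions made for illustration. For brevity it also swaps in a plain `bytes.decode` / `str.encode` round trip instead of `codecs.EncodedFile`.

```python
import codecs
import functools

# Stand-in for the global dict described above: file path -> encoding name.
ENCODING_BY_FILE = {}


def get_encoding_rule(rules):
    """Return the first rule name that codecs accepts as an encoding.

    The first match wins, so the order of the rules matters.
    Returning None when nothing matches is an assumption of this sketch.
    """
    for name in rules:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # not an encoding name, e.g. "notabs"
        return name
    return None


def needs_unicode(fix_function):
    """Let a fix function work on decoded text instead of raw bytes.

    Assumed (hypothetical) signature: fix_function(path, text) -> fixed text.
    The wrapper decodes once, calls the function, and encodes once,
    always with the encoding recorded for that file.
    """
    @functools.wraps(fix_function)
    def wrapper(path, raw_bytes):
        encoding = ENCODING_BY_FILE[path]
        text = raw_bytes.decode(encoding)
        fixed = fix_function(path, text)
        return fixed.encode(encoding)
    return wrapper


@needs_unicode
def example_fix_notabs(path, text):
    # Hypothetical fix function: replace each tab with four spaces.
    return text.replace("\t", "    ")


# Usage: pick the encoding from the rule list, then run a fix on raw bytes.
rules = ["notabs", "latin-1", "notrailingws"]
ENCODING_BY_FILE["Example.java"] = get_encoding_rule(rules)
raw = "\tcaf\u00e9".encode("latin-1")
fixed = example_fix_notabs("Example.java", raw)
```

The point of the sketch is that the encoding is looked up once per file and every content-level fix goes through the same decode and encode step, rather than each function relying on the system default.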