Commit
work around bug in lxml's incremental parser
lxml applies basic UTF-8 decoding to each chunk, which fails when a chunk ends in the middle of a UTF-8 sequence:

```
Traceback (most recent call last):
  File "legi/tar2sqlite.py", line 510, in <module>
    main()
  File "legi/tar2sqlite.py", line 486, in main
    process_archive(db, args.directory + '/' + archive_name)
  File "legi/tar2sqlite.py", line 262, in process_archive
    xml.feed(block)
  File "src/lxml/parser.pxi", line 1217, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:114563)
  File "src/lxml/parser.pxi", line 1339, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:114436)
  File "src/lxml/parser.pxi", line 586, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:105777)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105896)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107604)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:106458)
  File "<string>", line 227
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 EOF, line 227, column 289
```
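One way to avoid this kind of failure is to hold back any trailing, incomplete UTF-8 sequence and prepend it to the next chunk, so that every chunk handed to the parser ends on a character boundary. The sketch below illustrates that idea under the assumption that the input really is UTF-8; it is not necessarily the change made in this commit, and the helper names `split_incomplete_utf8` and `utf8_safe_chunks` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def split_incomplete_utf8(data: bytes) -> Tuple[bytes, bytes]:
    """Split data into (complete, tail), where tail is a trailing
    UTF-8 sequence whose continuation bytes have not arrived yet."""
    n = len(data)
    # A UTF-8 sequence is at most 4 bytes long, so an incomplete one
    # has its lead byte within the last 3 bytes of the chunk.
    for back in range(1, min(3, n) + 1):
        byte = data[n - back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for the lead byte
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1  # plain ASCII
        if needed > back:
            return data[:n - back], data[n - back:]
        return data, b""
    return data, b""


def utf8_safe_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-chunk a byte stream so that no chunk ends mid-character."""
    pending = b""
    for chunk in chunks:
        complete, pending = split_incomplete_utf8(pending + chunk)
        if complete:
            yield complete
    if pending:
        # Truncated input: pass it on and let the parser report the error.
        yield pending
```

With a helper like this, the loop that currently calls `xml.feed(block)` could iterate over `utf8_safe_chunks(blocks)` instead, where `blocks` stands for whatever iterable produces the raw byte blocks.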