work around bug in lxml's incremental parser
lxml applies basic UTF-8 decoding to each chunk separately, which fails when a chunk ends in the middle of a multi-byte UTF-8 sequence:

```
Traceback (most recent call last):
  File "legi/tar2sqlite.py", line 510, in <module>
    main()
  File "legi/tar2sqlite.py", line 486, in main
    process_archive(db, args.directory + '/' + archive_name)
  File "legi/tar2sqlite.py", line 262, in process_archive
    xml.feed(block)
  File "src/lxml/parser.pxi", line 1217, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:114563)
  File "src/lxml/parser.pxi", line 1339, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:114436)
  File "src/lxml/parser.pxi", line 586, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:105777)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105896)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107604)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:106458)
  File "<string>", line 227
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 EOF, line 227, column 289
```
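The failure mode described above can be reproduced without lxml. The following is a minimal standard-library sketch, not lxml's actual code: the sample document, the split position, and the helper names `decode_each_chunk` and `decode_incrementally` are assumptions made for illustration. It shows that decoding each chunk on its own breaks when a chunk ends on `0xC3`, while joining the chunks first (the approach taken in this commit) or using a stateful incremental decoder does not.

```python
import codecs

# Hypothetical sample: "é" encodes to the two bytes 0xC3 0xA9.
text = "<a>é</a>"
data = text.encode("utf-8")

# Split so that the first chunk ends on 0xC3, the lead byte of the sequence.
chunks = [data[:4], data[4:]]


def decode_each_chunk(chunks):
    """Decode every chunk independently, with no state carried over."""
    return "".join(chunk.decode("utf-8") for chunk in chunks)


def decode_incrementally(chunks):
    """Decode with a stateful decoder that buffers incomplete sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


try:
    decode_each_chunk(chunks)
    naive_failed = False
except UnicodeDecodeError:
    # The first chunk ends with a truncated sequence, as in the traceback.
    naive_failed = True

joined = b"".join(chunks).decode("utf-8")
```

Here `naive_failed` ends up `True`, while both `joined` and `decode_incrementally(chunks)` recover the original text.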
Changaco committed Jul 16, 2017
1 parent 5f798b4 commit ea79f61
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions legi/tar2sqlite.py
@@ -258,8 +258,7 @@ def count_one(k):
skipped += 1
continue

-for block in entry.get_blocks():
-    xml.feed(block)
+xml.feed(b''.join(entry.get_blocks()))
root = xml.close()
tag = root.tag
meta = root.find('META')
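Joining all blocks means the whole entry is held in memory before parsing. If that ever became a concern, one possible alternative (not what this commit does) would be to re-chunk the blocks so that none ends in the middle of a UTF-8 sequence. The sketch below assumes the input is valid UTF-8, and the function name `utf8_safe_blocks` is hypothetical. Whether it would satisfy lxml's feed parser depends on the diagnosis in the commit message being complete.

```python
def utf8_safe_blocks(blocks):
    """Re-chunk an iterable of bytes so no chunk ends inside a UTF-8 sequence.

    Assumes the concatenated input is valid UTF-8. Trailing bytes of an
    incomplete sequence are carried over to the start of the next chunk.
    """
    carry = b""
    for block in blocks:
        buf = carry + block
        cut = len(buf)
        # An incomplete sequence can leave at most 3 bytes at the end.
        for back in range(1, min(3, len(buf)) + 1):
            byte = buf[-back]
            if byte & 0xC0 == 0x80:
                continue  # continuation byte: keep looking for the lead byte
            if byte >= 0xC0:
                # Lead byte: work out how long its sequence should be.
                need = 2 if byte < 0xE0 else 3 if byte < 0xF0 else 4
                if back < need:
                    cut = len(buf) - back  # sequence is truncated, hold it back
            break
        if cut:
            yield buf[:cut]
        carry = buf[cut:]
    if carry:
        # Only reached if the input itself ends with a truncated sequence.
        yield carry


# Usage with hypothetical data: split a mixed-width string one byte at a time.
sample = "aé€😀b".encode("utf-8")
one_byte_blocks = [sample[i:i + 1] for i in range(len(sample))]
safe = list(utf8_safe_blocks(one_byte_blocks))
```

Each element of `safe` decodes on its own, and concatenating them gives back `sample`. In the loop patched above, this would mean feeding `utf8_safe_blocks(entry.get_blocks())` chunk by chunk instead of joining.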
