lxml.etree.ParserError: Document is empty #207

lironesamoun · 2023-07-03T09:54:02Z

I made a small script in order to try the scrapping process.

I have a case when If I use extruct as the CLI, I get lots of information about the schema extracted.
extruct [url]

However, if I use for the same url schema = extruct.extract(html_content, base_url=url) , I get the error "lxml.etree.ParserError: Document is empty"
The url is valid and the content of html_content (response.text) is valid and full.

I tried also with a fresh python environment when I've installed only extruct and I still get the error.

import requests
import sys
import extruct

def get_html(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

# Check if URL is provided as a command-line argument
if len(sys.argv) < 2:
    print("Please provide a URL as a command-line argument.")
    sys.exit(1)

url = sys.argv[1]  # Get the URL from the command-line argument
html_content = get_html(url)
if html_content:
    #print(html_content)
    schema = extruct.extract(html_content, base_url=url)
    print(schema)
else:
    print("Failed to retrieve HTML.")

Any insights about why it failed by using the python code ?

The text was updated successfully, but these errors were encountered:

Vasniktel · 2023-08-09T10:56:39Z

I'm facing the same issue for cnn articles (e.g. https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html). It seems that lxml.etree.fromstring(resp.text, parser=lxml.html.HTMLParser()) returns None for some reason. Haven't investigated it any further but it seems that this is an issue with the lxml.

Vasniktel · 2023-08-09T11:23:13Z

I did some more analysis. It seems that for the same article goose3 works well while extruct crashes. Both libraries use lxml. The only difference is in goose3.utils.encoding.smart_str function being applied (https://github.com/goose3/goose3/blob/d3c404a79e0e15b7957355083bd5a7590d4103ba/goose3/parsers.py#L59). I've checked it manually and it seems to do the trick for me.

Additionally, there is a lxml.html.soupparser module that can also be used.

To summarize, either of the two worked for me:

from goose3.utils.encoding import smart_str

html = '...'
extruct.extract(smart_str(html), syntaxes=['json-ld'])

from lxml.html import soupparser

html = '...'
extruct.extract(soupparser.fromstring(html), syntaxes=['json-ld'])

lironesamoun · 2023-08-09T13:03:48Z

Interesting ! Thanks for the info. I'll definitely try that !

lopuhin · 2023-08-11T12:17:40Z

Another option is to parse the HTML on your end and pass an already parsed tree (in lxml.html format) to the extruct library, most syntaxes support that in the last release. For example we're internally using an HTML5 parser https://github.com/kovidgoyal/html5-parser/ passing treebuilder='lxml_html' (happy to share more details), which is more compatible compared to default lxml.html parser.

trifle · 2023-10-03T18:03:41Z

Hi everyone,
I believe this is fundamentally an encoding issue, as vasniktel suggested. Try to feed extruct directly with bytes instead of (mistakenly) utf-8 decoded strings to prevent it from happening.

Example:

import requests
import extruct

u = "https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html"
r = requests.get(u) # note that r.content is a bytes object

# crashes
extruct.extract(r.content.decode("utf-8"))
# works
extruct.extract(r.content)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lxml.etree.ParserError: Document is empty #207

lxml.etree.ParserError: Document is empty #207

lironesamoun commented Jul 3, 2023 •

edited

Loading

Vasniktel commented Aug 9, 2023

Vasniktel commented Aug 9, 2023 •

edited

Loading

lironesamoun commented Aug 9, 2023

lopuhin commented Aug 11, 2023

trifle commented Oct 3, 2023

lxml.etree.ParserError: Document is empty #207

lxml.etree.ParserError: Document is empty #207

Comments

lironesamoun commented Jul 3, 2023 • edited Loading

Vasniktel commented Aug 9, 2023

Vasniktel commented Aug 9, 2023 • edited Loading

lironesamoun commented Aug 9, 2023

lopuhin commented Aug 11, 2023

trifle commented Oct 3, 2023

lironesamoun commented Jul 3, 2023 •

edited

Loading

Vasniktel commented Aug 9, 2023 •

edited

Loading