WAT extractor: Overlong truncated HTTP request header line throws exception and loss of request record #32

sebastian-nagel · 2023-10-06T06:09:06Z

If a WARC request record contains an overlong and truncated HTTP request header line (GET /path HTTP/1.1), HttpRequestMessageParser throws an exception, so the request record is not transformed into a WAT record. If the exception is not handled in the calling code, even the entire WAT/WET extractor job (commoncrawl/ia-hadoop-tools) may fail.

The issue was observed on a couple of WARC files of CC-MAIN-2023-40:

1. The overlong HTTP request header lines were truncated after 8 kiB, so that the line was only GET /path-truncated, which caused HttpRequestMessageParser to fail (no HTTP version). Investigate in separate issues:
   a. why the truncation happened (commoncrawl/nutch: in the WARC writer, or at the protocol level when recording the HTTP communication between crawler and web server)?
   b. where these URLs stem from, and whether the URL filters need to be tightened to avoid similar errors.

2. Fix HttpRequestMessageParser: it should correctly recognize that the header line is truncated, but not fail with "Response Message too long":

$> java -cp target/webarchive-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -wat CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.gz >CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.wat.gz
org.archive.resource.ResourceParseException: org.archive.format.http.HttpParseException: Response Message too long
    at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:34)
    at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:40)
    at org.archive.extract.ResourceExtractor.run(ResourceExtractor.java:137)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
    at org.archive.extract.ResourceExtractor.main(ResourceExtractor.java:63)
Caused by: org.archive.format.http.HttpParseException: Response Message too long
    at org.archive.format.http.HttpRequestMessageParser.parse(HttpRequestMessageParser.java:43)
    at org.archive.format.http.HttpRequestParser.parse(HttpRequestParser.java:18)
    at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:27)
    ... 4 more

Also fix HttpResponseMessageParser (obviously, the code was copy-pasted).

If a WARC file containing only the single request record is parsed, the exception changes:

$> java -cp target/webarchive-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -wat CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.gz >CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.wat.gz
org.archive.resource.ResourceParseException: org.archive.format.http.HttpParseException: No spaces in message
    at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:34)
    at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:40)
    at org.archive.extract.ResourceExtractor.run(ResourceExtractor.java:137)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
    at org.archive.extract.ResourceExtractor.main(ResourceExtractor.java:63)
Caused by: org.archive.format.http.HttpParseException: No spaces in message
    at org.archive.format.http.HttpRequestMessageParser.parseLax(HttpRequestMessageParser.java:176)
    at org.archive.format.http.HttpRequestMessageParser.parse(HttpRequestMessageParser.java:49)
    at org.archive.format.http.HttpRequestMessageParser.parse(HttpRequestMessageParser.java:39)
    at org.archive.format.http.HttpRequestParser.parse(HttpRequestParser.java:18)
    at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:27)
    ... 4 more

CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.gz
CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.gz
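
One way the lenient behavior requested in item 2 could look is sketched below. This is only an illustration under stated assumptions: LenientRequestLine and its fields are hypothetical names, not part of webarchive-commons, and the sketch works on an already decoded line instead of the byte-level input that HttpRequestMessageParser reads. It treats a request line without an HTTP version token as truncated and returns what could be recovered, instead of throwing.

```java
/**
 * Hypothetical helper (not the webarchive-commons API): parses an HTTP
 * request line leniently and flags a missing HTTP version as truncation.
 */
final class LenientRequestLine {
    final String method;
    final String target;
    final String version;    // null if the line was cut off before the version
    final boolean truncated;

    private LenientRequestLine(String method, String target, String version, boolean truncated) {
        this.method = method;
        this.target = target;
        this.version = version;
        this.truncated = truncated;
    }

    /** Parses "METHOD SP request-target [SP HTTP-version]". */
    static LenientRequestLine parse(String line) {
        String trimmed = line.trim();
        int firstSpace = trimmed.indexOf(' ');
        if (firstSpace < 0) {
            // Without any space there is not even a method/target pair to recover.
            throw new IllegalArgumentException("No spaces in request line");
        }
        String method = trimmed.substring(0, firstSpace);
        String rest = trimmed.substring(firstSpace + 1).trim();
        int lastSpace = rest.lastIndexOf(' ');
        if (lastSpace >= 0 && rest.startsWith("HTTP/", lastSpace + 1)) {
            // Complete line: the last token is the HTTP version.
            String target = rest.substring(0, lastSpace).trim();
            return new LenientRequestLine(method, target, rest.substring(lastSpace + 1), false);
        }
        // No version token: assume the line was truncated and keep the partial target.
        return new LenientRequestLine(method, rest, null, true);
    }

    public static void main(String[] args) {
        LenientRequestLine r = parse("GET /path-truncated");
        System.out.println(r.method + " " + r.target + " truncated=" + r.truncated);
    }
}
```

With such an approach the WAT record could still be written for the request, with the truncation recorded as metadata; whether that is the right place to signal it is a design decision for the actual fix.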


jnioche · 2023-11-06T12:07:41Z

Regarding 1.b: I did some detective work and found where the URL came from.

zstdgrep -a "https://rosecollection.brandeis.edu/objects-1/portfolio?records=12&query" wat_seeds/wat.part-r-00383.zst | more

It was a link coming from a WAT file which did not get filtered out and ended up being selected.

There is currently no mechanism in Nutch to simply filter a URL by length. @sebastian-nagel had planned to do it with

<property><name>urlfilter.fast.url.path.max.length</name><value>1024</value></property>
<property><name>urlfilter.fast.url.pathquery.max.length</name><value>2048</value></property>

but this hasn't been implemented in urlfilter.fast.

During fetching, URLs are filtered based on urlfilter.fast alone, so it is best to do it there.

I will create a new issue for this. We need this before starting the next crawl.
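
The length check discussed above could be sketched roughly as follows. This is a standalone illustration, not the urlfilter.fast implementation: the class UrlLengthFilter is hypothetical, the two limits mirror the values of the proposed properties, and it is an assumption of this sketch that URLs which java.net.URI cannot parse are rejected. Returning null for a rejected URL follows the convention of Nutch's URLFilter interface.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch: rejects URLs whose path or path+query is too long. */
final class UrlLengthFilter {
    private final int maxPathLength;
    private final int maxPathQueryLength;

    UrlLengthFilter(int maxPathLength, int maxPathQueryLength) {
        this.maxPathLength = maxPathLength;
        this.maxPathQueryLength = maxPathQueryLength;
    }

    /** Returns the URL unchanged if it is accepted, or null if it is rejected. */
    String filter(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return null; // assumption: unparseable URLs are rejected
        }
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String query = uri.getRawQuery();
        // Count the '?' separator as part of the path+query length.
        int pathQueryLength = path.length() + (query == null ? 0 : 1 + query.length());
        if (path.length() > maxPathLength || pathQueryLength > maxPathQueryLength) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        UrlLengthFilter filter = new UrlLengthFilter(1024, 2048);
        System.out.println(filter.filter("https://example.com/objects?records=12"));
    }
}
```

Checking the path and the path plus query separately allows a tighter limit on paths while still tolerating moderately long query strings.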

jnioche · 2023-11-07T10:00:39Z

The culprit can be found below.

veryLargeURL.txt

We should run a dummy crawl with it and investigate why 1.a happened (what truncated it).

jnioche · 2023-11-07T17:10:45Z

@wumpus, please check whether something needs fixing in some Python library.

sebastian-nagel added a commit to commoncrawl/ia-hadoop-tools that referenced this issue Oct 6, 2023

WEATGenerator: catch exceptions thrown by ExtractingResourceProducer,

599eac6

log exception and continue - work-around to address commoncrawl/ia-web-commons#32

sebastian-nagel mentioned this issue Oct 6, 2023

WEATGenerator: catch exceptions thrown by ExtractingResourceProducer commoncrawl/ia-hadoop-tools#6

Merged

jnioche self-assigned this Nov 6, 2023

jnioche mentioned this issue Nov 6, 2023

Add param to fast.urlfilter to filter based on length of the URL commoncrawl/nutch#27

Closed
