issue with extraction, get_text_fromlayout_json function #356

red-sky17 · 2024-04-15T12:13:52Z

I have attached the part of the PDF that I am trying to extract.

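I am doing the extraction using:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json
textract_json = call_textract(input_document="s3:url",
                              features=[Textract_Features.LAYOUT,
                                        Textract_Features.TABLES])
layout = get_text_from_layout_json(textract_json=textract_json)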

The output I am getting is:

I analysed this document in the Textract console, where it detected two tables and analysed everything clearly.

I was also able to extract the content by loading the JSON into a Document (from textractor.entities.document import Document) and reading document.text, but the extracted tables are not bordered when I use this function.

I will try to resolve this on my end, but if I am missing something, or if anyone is already working on this, I would appreciate any help.

Thank you.

Belval · 2024-05-06T14:28:13Z

For markdown bordering you can use the MarkdownLinearizationConfig by calling .to_markdown() on your document object.

https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html#All-entities-can-be-linearized
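
To make "markdown bordering" concrete, here is a minimal sketch of the kind of output being asked for: rows of cell text rendered as a pipe-bordered Markdown table. It is deliberately independent of textractor. `rows_to_markdown` is a hypothetical helper written for illustration only, and the sample rows are made up. It is not how `.to_markdown()` is implemented; calling `.to_markdown()` on the document object remains the route suggested above.

```python
def rows_to_markdown(rows):
    """Render rows (lists of cell strings) as a bordered Markdown table.

    Hypothetical illustration only; not part of textractor or any
    Textract helper library. The first row is treated as the header.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    padded = [list(row) + [""] * (width - len(row)) for row in rows]
    # Escape pipes inside cells so they are not read as column borders.
    clean = [[str(cell).replace("|", "\\|").strip() for cell in row]
             for row in padded]
    header, body = clean[0], clean[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up sample data; the last row is short on purpose.
table = rows_to_markdown([["Item", "Qty"], ["Widget", "2"], ["Bolt"]])
print(table)
```

The separator row of dashes is what makes Markdown renderers draw the table with borders; plain linearized text has no such row, which is why tables taken from `document.text` appear unbordered.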

Belval added the "question" label (Further information is requested) on May 6, 2024

red-sky17 mentioned this issue Jun 24, 2024

issue regarding .to_markdown() method #380

Closed
