Issue with extraction, get_text_from_layout_json function #356

Open
red-sky17 opened this issue Apr 15, 2024 · 1 comment
Labels: question (Further information is requested)

Comments

@red-sky17

I have attached the part of the PDF that I am trying to extract.

[image: input_textext]

I am doing extraction using:
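from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json
# Run Textract on the S3 document with the LAYOUT and TABLES features
textract_json = call_textract(input_document="s3:url",
                              features=[Textract_Features.LAYOUT,
                                        Textract_Features.TABLES])
# Linearize the same response that call_textract returned
layout = get_text_from_layout_json(textract_json=textract_json)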

The output I am getting is:

[image]

I analysed this document in the Textract console, where it detected two tables and analysed everything clearly.
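As a quick sanity check before linearizing, it can help to confirm that the JSON returned by the API contains the same table blocks the console shows. The sketch below is plain Python and assumes the response is a dict with a top-level "Blocks" list whose items carry a "BlockType" field, which is the usual shape of a Textract AnalyzeDocument response. The helper name count_block_types is hypothetical and not part of any library.

```python
from collections import Counter


def count_block_types(textract_json: dict) -> Counter:
    """Count how many blocks of each BlockType a Textract response contains.

    Assumes the response has a top-level "Blocks" list; blocks without a
    "BlockType" key are counted under "UNKNOWN".
    """
    blocks = textract_json.get("Blocks", [])
    return Counter(block.get("BlockType", "UNKNOWN") for block in blocks)
```

If count_block_types(textract_json)["TABLE"] matches the number of tables seen in the console (two in this case), the tables were detected and the difference is more likely in the linearization step than in the analysis itself.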

I was also able to extract this when I loaded the JSON into a Document (from textractor.entities.document import Document) and read the result with document.text, but the extracted tables are not bordered when I use this function.

I will try to resolve this on my end, but if I am missing anything, or if anyone is already working on this, I would appreciate any help.

Thank you.

@Belval
Contributor

Belval commented May 6, 2024

For markdown bordering you can use the MarkdownLinearizationConfig by calling .to_markdown() on your document object.

https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html#All-entities-can-be-linearized
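A minimal sketch of the suggestion above, assuming the Textract response is already available as a Python dict and that amazon-textract-textractor is installed. Document.open and .to_markdown() are taken from textractor as referenced in this thread; the wrapper name textract_json_to_markdown is hypothetical, and the exact table styling depends on the library version and its default Markdown configuration.

```python
def textract_json_to_markdown(textract_json: dict) -> str:
    """Load a Textract response into a textractor Document and return Markdown.

    Assumption: Document.open accepts the response dict directly. The import
    is done inside the function so that defining it does not require
    textractor to be installed.
    """
    from textractor.entities.document import Document

    # Build the Document from the raw Textract response
    document = Document.open(textract_json)
    # Linearize the document as Markdown, which renders tables with borders
    return document.to_markdown()
```

Calling print(textract_json_to_markdown(textract_json)) with the response from call_textract should then show the tables as Markdown tables rather than as plain text.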

Belval added the question label on May 6, 2024