Exporting text+tables while maintaining layout #347

Open
austinmw opened this issue Apr 3, 2024 · 1 comment

Comments

austinmw commented Apr 3, 2024

Suppose I have a document like this:

<text>

<table>

<text>

Where a table is located between two chunks of text, and I'd like to parse the document and save the parsed information, in order, to a text file.

If I use the document_analysis functionality, I can successfully extract the text and tables, and print them separately:

document = extractor.start_document_analysis(
    file_source=LOCAL_DOCUMENT_PATH,
    s3_upload_path=S3_UPLOAD_PATH,
    features=[TextractFeatures.LAYOUT, TextractFeatures.SIGNATURES, TextractFeatures.FORMS],
    save_image=True
)

print(document.text)
print(document.tables)

However, this loses information about the layout (i.e., that in my example, the table is in between two pieces of text).

So how can I print the parsed text+tables in order? As in something like:

print(document.text_and_tables)

Is there any convenience functionality in this library to do this?

ucegbe commented Apr 10, 2024

print(document.get_text()) gives you the text and tables as plain text, in the order they appear in the document.
If you want the tables as CSV within the text, you would have to tag the tables using the linearization config and replace them with their CSV counterparts obtained from document.tables.
This notebook helps: https://github.com/aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation/blob/main/textract-api.ipynb
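
The tag-and-replace step described above could be sketched roughly as follows. This is only an illustration, not the library's own method: the helper name replace_tagged_tables and the "<table>"/"</table>" markers are hypothetical choices made for this example. The commented-out lines show how the inputs might be produced with textractor, assuming TextLinearizationConfig accepts table_prefix and table_suffix, that Table objects expose to_csv(), and that the analysis was run with TextractFeatures.TABLES so that document.tables is populated.

```python
import re

# Hypothetical markers; any strings unlikely to occur in the document would do.
TABLE_START = "<table>"
TABLE_END = "</table>"


def replace_tagged_tables(text, table_csvs, start=TABLE_START, end=TABLE_END):
    """Swap each tagged table region in `text` for the next CSV string.

    Regions are matched in order of appearance, so `table_csvs` is assumed
    to be in the same order as the tables in the linearized text.
    Regions with no remaining CSV are left unchanged.
    """
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    remaining = iter(table_csvs)

    def _swap(match):
        try:
            csv_text = next(remaining)
        except StopIteration:
            return match.group(0)  # more tagged regions than CSVs: keep as-is
        return f"{start}\n{csv_text.strip()}\n{end}"

    return pattern.sub(_swap, text)


# Possible usage with textractor (assumed API, not verified in this thread):
# from textractor.data.text_linearization_config import TextLinearizationConfig
# config = TextLinearizationConfig(table_prefix=TABLE_START, table_suffix=TABLE_END)
# tagged_text = document.get_text(config=config)
# csvs = [table.to_csv() for table in document.tables]
# print(replace_tagged_tables(tagged_text, csvs))
```

Matching by position keeps the helper simple, but it relies on document.tables being in the same order as the tables in the linearized text; for multi-page documents it may be worth checking that assumption before relying on the output.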
