issue with ordering in extractions, markdown and gettext methods #388

Open
red-sky17 opened this issue Aug 17, 2024 · 8 comments
Comments

@red-sky17

The attached input document contains text, then a table, followed by more text. We want the extracted text file to follow the same order as the input PDF file.

input_page

I tried extraction using different methods:

For 1.) and 2.), this is the code I am using:
textract_json = extractor.start_document_analysis(
    file_source="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    save_image=False,
)
response_textract_async = extractor.get_result(job_id=textract_json.job_id, api=Textract_API.ANALYZE)
markdown_text = response_textract_async.to_markdown()
1.) .to_markdown() method
using_markdown_method
The issue here is that the two tables end up at the bottom.

2.) .get_text() method
using_gettext_method
In this case as well, the two tables are at the bottom and, as expected, without a config parameter we won't get markdown output.

Now the third one is interesting.
The code used for this is:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

textract_json = call_textract(
    input_document="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
    features=[Textract_Features.LAYOUT, Textract_Features.TABLES],
)
3.) get_text_from_layout_json(textract_json=textract_json)
I also tried get_text_from_layout_json(textract_json=textract_json, generate_markdown=True). Both calls give the same output.
using_gettextfromlayout_1
using_gettextfromlayout_2

The issue with this method, as you can see, is that the data is repeated twice and there is no markdown formatting.

@Belval or anyone, can you please suggest whether there is anything we can do to prevent this and get the text in the correct order, as it appears in the PDF file?

Thanks.

@red-sky17 changed the title from "issue with ordering after extraction, in the final text file." to "issue with ordering in extractions, markdown and gettext methods" on Aug 17, 2024
@red-sky17
Author

Please also look at the output for the attached PDF. The same issue is observed here: on the 1st page the tables are printed at the bottom. The 2nd page is shown after the attachment.
Egypt_EG01_Credit Agricole.pdf

This is the output for the 2nd page:
second_pdf_usingmarkdown

complete text file:
Egypt_EG01_Credit Agricole_using_markdown.txt

@red-sky17
Author

In contrast, the ordering is correct in this text file, which was extracted using get_text_from_layout_json(textract_json=textract_json).
The remaining issue is the same as the one described under 3.) in my first comment.

text file for reference:

Egypt_EG01_Credit Agricole_using_gettextfromlayout_json.txt

I suspect this is a bug in the .to_markdown() and .get_text() methods, because get_text_from_layout_json() gives the output in the correct order.

Ultimately, the goal is to get an extraction like the one from get_text_from_layout_json(), but with markdown table borders and no duplication.

So I believe it would be better to get the extraction right using the .to_markdown() method alone, because it already produces markdown table borders and its only issue is ordering. That could probably be debugged by comparing how get_text_from_layout_json() and .to_markdown() traverse the JSON dict.

@Belval
Contributor

Belval commented Aug 20, 2024

I will test it first but this looks like a known issue that happens when the LAYOUT predictions do not match the TABLE predictions, causing the reading order to be wrong.
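To make the idea of a LAYOUT/TABLE mismatch concrete, here is a minimal sketch that measures how well two bounding boxes agree using intersection over union. The `Box` type and `iou` helper are hypothetical illustrations, not part of amazon-textract-textractor, and this thread does not show how the library actually matches the two kinds of predictions.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in normalised page coordinates (hypothetical helper)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones."""
    # Overlap along each axis, clamped at zero when the boxes do not intersect.
    overlap_x = max(0.0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    overlap_y = max(0.0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    intersection = overlap_x * overlap_y
    union = a.width * a.height + b.width * b.height - intersection
    return intersection / union if union > 0 else 0.0
```

A low score between a layout region labelled as a table and the nearest detected table would be one sign that the two predictions disagree on that page.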

@Belval
Contributor

Belval commented Aug 20, 2024

What version of amazon-textract-textractor are you using? With 1.8.2 I get:

Page 2 of 10


Schneider Electric South East Asia (HQ) Pte. Ltd. Schneider Electric Overseas Asia Pte Ltd Schneider Electric Singapore Pte. Ltd. Schneider Electric IT Singapore Pte. Ltd. (formerly known as MGE Asia Pte Ltd) Schneider Electric IT Logistics Asia Pacific Pte. Ltd. Schneider Electric Logistics Asia Pte Ltd Schneider Electric Systems Singapore Pte. Ltd. (formerly known as Invensys Process Systems (S) Pte. Ltd.) 1 March 2017 

Previous Facility Letters. In the event that this Facility Letter is not accepted or lapses and is not extended by the Bank, the terms and conditions in the Previous Facility Letters shall continue to apply, save for any revision or amendments to the Interest Rate and any reduction in the amount of the Lines of Credit as stated herein. 

## A. LINE(S) OF CREDIT 



| AMOUNT          | TYPE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|-----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| SGD20,000,000/- | Multi-currency Banker's Guarantee [including but not limited to Performance Guarantee or Payment Guarantee (for up to 60 months or such other tenor as may be agreed by the Bank from time to time) or to finance any other transactions acceptable to the Bank on a case-by-case subject to such conditions as may be determined by the Bank in its sole and absolute discretion] and/or Sight & Usance Letters of Credit (for up to 12 months) (with/ without control of goods) and/or Shipping Guarantee & Acceptance Under Usance Letters of Credit. |



## 1. PURPOSE 

The Facilities shall be used solely to finance the Borrower's working capital requirements. However, without prejudice to the Borrower's obligations, the Bank shall not be obliged to check that the Borrower does so or that the Facilities or any part thereof is utilized in such a manner. 

## 2. INTEREST RATE/COMMISSION/FEE 

(a) Commission on Banker's Guarantee shall be calculated on the face amount of the Banker's Guarantee for the period from the date of issuance upto the expiry date of the Banker's Guarantee, payable upfront as follows :- 
 
(b) Non-refundable Commission / Interest on the Trade Facilities shall be payable at the following rates and in the following manner:- 
(i) Letters of Credit 0.125% per month, minimum 2 months 



| Tenor                    | Commission    |
|--------------------------|---------------|
| Less than 3 years,       | 0.2%pa        |
| 3 years and upto 5 years | 0.25%pa       |

Which does not match what you are reporting.

@red-sky17
Author

red-sky17 commented Aug 21, 2024

@Belval, I am attaching the input PDF. When I test on the single page I attached in my first comment (the one you tested), I get the same output you got. The issue only appears when the whole PDF is processed.

I am using amazon-textract-textractor version 1.8.2.

this_pdf.pdf

@Belval
Contributor

Belval commented Aug 21, 2024

Thank you for clarifying and sharing the file, I will attempt to reproduce the issue.

@red-sky17
Author

Hello @Belval, were you able to reproduce this issue?

@Chuukwudi
Contributor

I have noticed this a few times myself.

If order is important, I usually get the bbox of each entity and sort by the x or y axis.

Combining page order with the entity bboxes ensures that order is maintained in the output.

Of course, you will need to know the format of your input PDF beforehand to do this.
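A minimal sketch of that approach, assuming a single-column page where reading order is simply top to bottom, then left to right. The `Block` class and `reading_order` function are hypothetical stand-ins for illustration, not textractor classes; with the real library you would first copy each entity's page number, bounding-box position, and text into a structure like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for one extracted entity (paragraph, table, ...)."""
    page: int    # 1-based page number
    x: float     # left edge of the bounding box
    y: float     # top edge of the bounding box
    text: str


def reading_order(blocks: List[Block]) -> List[Block]:
    """Sort by page first, then top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b.page, b.y, b.x))


if __name__ == "__main__":
    extracted = [
        Block(page=1, x=0.1, y=0.8, text="text after the table"),
        Block(page=2, x=0.1, y=0.1, text="page 2 heading"),
        Block(page=1, x=0.1, y=0.5, text="| table |"),
        Block(page=1, x=0.1, y=0.1, text="intro text"),
    ]
    for block in reading_order(extracted):
        print(block.page, block.text)
```

For multi-column layouts the sort key would need to change (for example, grouping by column before sorting by y), which is why the page format has to be known in advance.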
