Page headers should be identified as PageHeaderElement #65

Open
Elijas opened this issue Dec 27, 2023 · 4 comments
Assignees
Labels
contributions-welcome Intended for completion by you, the contributor feature:elements Parsing all the other elements correctly status:in-progress Work underway. Reach out if you're interested in helping!

Comments

@Elijas
Member

Elijas commented Dec 27, 2023

Context

MSFT accuracy-test (permalink at the time of posting)

Problem

The header "PART I" is identified as the top section title element, when it should be identified as a page header element. Because of this, the actual top section title element is incorrectly identified as a title element.

image

Ideas about a possible solution

One possible solution: identify page header elements first. The top section title classifier should then start working correctly, since it will no longer be confused by the header elements.

@Elijas Elijas added contributions-welcome Intended for completion by you, the contributor feature:elements Parsing all the other elements correctly labels Dec 27, 2023
@deenaawny-github-account
Contributor

deenaawny-github-account commented Dec 31, 2023

Hey @Elijas,

I started working on this issue.

This question came up:
Was the element identified incorrectly because of a bug in the top-level section classifier?
The same pattern occurs on other pages, and the page header elements there are identified correctly.

I am still analyzing, and I would like to be assigned this issue.

Thank you.

@Elijas Elijas added the status:in-progress Work underway. Reach out if you're interested in helping! label Jan 1, 2024
@Elijas
Member Author

Elijas commented Jan 1, 2024

Thanks for taking a look!

image

So in this picture let's take a look at the elements:

  • A. PART I (small letters)
  • B. PART I. FINANCIAL INFORMATION (big letters)

Element A should be identified as PageHeaderElement (because it's irrelevant) and Element B should be identified as TopLevelSectionTitle

Does it make sense?

@deenaawny-github-account
Contributor

deenaawny-github-account commented Jan 2, 2024

Hey @Elijas,

yes, it makes sense.

I looked more into the code on Sunday.

I first asked ChatGPT to refactor the _process_element function (of the top section title classifier) into smaller functions, keeping the same logic. After the refactor, the unit tests, snapshot verification, and accuracy tests all ran with no errors. In addition, unit test coverage for top_section_manager_for_10q increased from 63% to 76% (please see the attachment refactor-tests.txt).

I think small functions are easier to read (at least for me). Then I removed some redundancy between the functions. I asked ChatGPT to write specifications for the functions and then rewrote them based on my own understanding, so the function specifications are a mix of ChatGPT's and mine.

I then looked at the select_element function and extended it to:

  1. Check whether the given list of semantic elements contains at least 2 elements. If yes:
    a. Check whether the first element in the list is a potential page header element. If yes, choose the second element in the list as the top section title element.
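
The selection rule described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: `Element`, `looks_like_page_header`, and the short-label heuristic are hypothetical stand-ins, and this thread does not show how the real check decides that an element is a potential page header.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a semantic element."""

    text: str


def looks_like_page_header(element: Element) -> bool:
    # Assumption for illustration only: a bare label such as "PART I" or
    # "Item 3, 4" (at most three tokens, no period) counts as a potential
    # page header, while "PART I. FINANCIAL INFORMATION" does not.
    words = element.text.replace(",", " ").split()
    return len(words) <= 3 and "." not in element.text


def select_element(candidates: list[Element]) -> Element:
    """Pick the candidate to treat as the top section title.

    Assumes `candidates` is non-empty.
    """
    if len(candidates) >= 2 and looks_like_page_header(candidates[0]):
        # Skip the potential page header and take the next candidate.
        return candidates[1]
    return candidates[0]


candidates = [Element("PART I"), Element("PART I. FINANCIAL INFORMATION")]
print(select_element(candidates).text)
```

With the two elements from the screenshot discussion, the sketch skips the short "PART I" label and selects the longer title.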

With this small change to the method, all top section title elements are classified correctly for the report MSFT 0000950170-23-014423.
After this change, I looked at some of the elements that changed. These are listed below:

html_hash | html_preview | text_content | section_type before | cls_name before change | section_type after | cls_name after change
b5a04115 | style='background-co...[162]...fit-content;'>PART I | PART I | part 1 | TopSectionTitle | - | IntroductorySectionElement
9f043413 | style='background-co...[136]...fit-content;'>Item 1 | Item 1 | part 1 item 1 | TopSectionTitle | - | IntroductorySectionElement
0f6f4ac1 | style='background-co...[144]...nt;'>PART I. FINANCI | PART I. FINANCI | - | TitleElement | part 1 | TopSectionTitle
7d3a9a03 | style='background-co...[144]...nt;'>ITEM 1. FINANCI | ITEM 1. FINANCI | - | TitleElement | part 1 item 1 | TopSectionTitle
a5c9de95 | style='background-co...[136]...fit-content;'>Item 2 | Item 2 | part 1 item 2 | TopSectionTitle | - | PageHeaderElement
3ce7a5eb | style='background-co...[176]...SION AND ANALYSIS OF | ITEM 2. MANAGEMENT’S DISCUSSION AND ANALYSIS OF | - | TitleElement | part 1 item 2 | TopSectionTitle
c1eb19de | style='background-co...[139]...-content;'>Item 3, 4 | Item 3, 4 | part 1 item 3 | TopSectionTitle | - | TitleElement
51287d34 | style='background-co...[163]...TATIVE AND QUALITATI | ITEM 3. QUANTITATIVE AND QUALITATI | - | TitleElement | part 1 item 3 | TopSectionTitle
ddd63bb | style='background-co...[163]...it-content;'>PART II | PART II | part 2 | TopSectionTitle | - | PageHeaderElement
3cd753ab | style='background-co...[166]...content;'>Item 1, 1A | Item 1, 1A | part 2 item 1 | TopSectionTitle | - | TitleElement
159c6586 | style='background-co...[143]...ent;'>PART II. OTHER | PART II. OTHER | - | TitleElement | part 2 | TopSectionTitle
820f6022 | style='background-co...[142]...tent;'>ITEM 1. LEGAL | ITEM 1. LEGAL | - | TitleElement | part 2 item 1 | TopSectionTitle
8d5cae59 | style='background-co...[140]...ontent;'>ITEM 1A. RI | ITEM 1A. RI | part 2 item 1 a | TopSectionTitleElement | - | TitleElement
a5c9de95 | style='background-co...[136]...fit-content;'>Item 2 | Item 2 | part 2 item 2 | TopSectionTitleElement | - | PageHeaderElement
6db46951 | style='background-co...[163]...STERED SALES OF EQUI | ITEM 2. UNREGISTERED SALES OF EQUI | - | TitleElement | part 2 item 2 | TopSectionTitle
cf7940c0 | style='background-co...[136]...fit-content;'>Item 6" | Item 6 | part 2 item 6 | TopSectionTitle | - | TitleElement
97fc951c | style='background-co...[138]...-content;'>ITEM 6. E | ITEM 6. E | - | TitleElement | part 2 item 6 | TopSectionTitle

The above implementation fixes the classification of top section title elements. It makes sure they are not classified as page header elements.

Scores:

Accuracy tests (before change)

Filing | F1-Score | Recall | Precision | Missing | Unexpected | Total
AAPL | 100.00% | 100.00% | 100.00% | 0 | 0 | 202
MSFT | 95.00% | 97.30% | 92.80% | 10 | 28 | 389
META | 100.00% | 100.00% | 100.00% | 0 | 0 | 308
GOOG-70 | 100.00% | 100.00% | 100.00% | 0 | 0 | 387
GOOG-94 | 100.00% | 100.00% | 100.00% | 0 | 0 | 390

Accuracy tests (after change)

Filing | F1-Score | Recall | Precision | Missing | Unexpected | Total
AAPL | 100.00% | 100.00% | 100.00% | 0 | 0 | 202
MSFT | 95.00% | 97.30% | 93.77% | 10 | 24 | 385
META | 100.00% | 100.00% | 100.00% | 0 | 0 | 308
GOOG-70 | 99.74% | 99.74% | 99.74% | 1 | 1 | 387
GOOG-94 | 99.74% | 99.74% | 99.74% | 1 | 1 | 390
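
As a quick sanity check on the score columns, the sketch below recomputes F1 from the reported precision and recall. It assumes the accuracy tests use the standard harmonic-mean definition of F1, which this thread does not state explicitly; `f1_score` is a helper written for this illustration, not a function from the project.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MSFT rows from the two tables above (values in percent).
msft_before = f1_score(precision=92.80, recall=97.30)
msft_after = f1_score(precision=93.77, recall=97.30)
print(f"before: {msft_before:.2f}%  after: {msft_after:.2f}%")
```

Under that assumption, the before-change MSFT row comes out at 95.00%, matching the table, while the after-change row comes out at about 95.50% rather than the listed 95.00%, so that cell may be worth rechecking.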

Pull Request: #70

Next:

  • In case you don't want the code refactor, I will have to apply the change to the previous code.
  • I need to make sure the code follows code standards.

It makes more sense (as you stated above) to call the page header classifier first. I can implement that second solution too.

I appreciate your feedback.

@deenaawny-github-account
Contributor

deenaawny-github-account commented Jan 2, 2024

Hey @Elijas,

question from Elijas: for the GOOG filings, can you briefly mention what is the changed element?

GOOG-70
Element before
"cls_name": "TopSectionTitle",
"html_hash": "e6cef692",
"html_preview": "style="color:#000000...[68]...height:120%">ITEM 1.",
"level": 1,
"section_type": "part1item1",
"tag_name": "span",
"text_content": "ITEM 1."

Element after
"cls_name": "TitleElement",
"html_hash": "e6cef692",
"html_preview": "style="color:#000000...[68]...height:120%">ITEM 1.",
"level": 0,
"tag_name": "span",
"text_content": "ITEM 1."

GOOG-94
Element before
"cls_name": "TopSectionTitle",
"html_hash": "e6cef692",
"html_preview": "style="color:#000000...[68]...height:120%">ITEM 1.",
"level": 1,
"section_type": "part1item1",
"tag_name": "span",
"text_content": "ITEM 1."

Element after
"cls_name": "TitleElement",
"html_hash": "e6cef692",
"html_preview": "style="color:#000000...[68]...height:120%">ITEM 1.",
"level": 0,
"tag_name": "span",
"text_content": "ITEM 1."
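
To make the before/after difference easier to see, here is a small sketch that compares the two snapshot entries field by field. `changed_fields` is a hypothetical helper written for this comment, not part of the project; the dictionaries simply restate the GOOG values listed above.

```python
def changed_fields(before: dict, after: dict) -> dict:
    """Return {field: (old, new)} for every field whose value differs.

    A field missing on one side is reported with None on that side.
    """
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


before = {
    "cls_name": "TopSectionTitle",
    "html_hash": "e6cef692",
    "html_preview": 'style="color:#000000...[68]...height:120%">ITEM 1.',
    "level": 1,
    "section_type": "part1item1",
    "tag_name": "span",
    "text_content": "ITEM 1.",
}
after = {
    "cls_name": "TitleElement",
    "html_hash": "e6cef692",
    "html_preview": 'style="color:#000000...[68]...height:120%">ITEM 1.',
    "level": 0,
    "tag_name": "span",
    "text_content": "ITEM 1.",
}

for field, (old, new) in changed_fields(before, after).items():
    print(f"{field}: {old!r} -> {new!r}")
```

For both GOOG filings the only differences are `cls_name`, `level`, and the dropped `section_type`; the hash, tag, and text are unchanged.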

If solution a is chosen, I will need to open a separate issue for these changed elements.

For solution b - clean the irrelevant elements first
The solution implementation was simple - just moving the top section manager for 10q after the page header classifier:

            PageHeaderClassifier(
                types_to_process={TextElement, HighlightedTextElement},
            ),
            TopSectionManagerFor10Q(types_to_exclude={PageHeaderElement}),
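
The intent behind this ordering can be illustrated with a toy two-step pipeline. Everything in the sketch is hypothetical (`Element`, `mark_page_headers`, `mark_top_section_titles`, and the string-based rules) and it does not use the real parser API; it only shows why running the page header step first keeps a running header from being picked up as a section title.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a parsed element."""

    text: str
    kind: str = "text"


def mark_page_headers(elements, header_texts):
    # Assumption: the set of running-header strings is already known.
    return [
        Element(e.text, "page_header") if e.text in header_texts else e
        for e in elements
    ]


def mark_top_section_titles(elements, kinds_to_exclude):
    # Toy rule: any non-excluded element starting with "PART " is a title.
    return [
        Element(e.text, "top_section_title")
        if e.kind not in kinds_to_exclude and e.text.upper().startswith("PART ")
        else e
        for e in elements
    ]


elements = [Element("PART I"), Element("PART I. FINANCIAL INFORMATION")]
elements = mark_page_headers(elements, header_texts={"PART I"})
elements = mark_top_section_titles(elements, kinds_to_exclude={"page_header"})
print([(e.text, e.kind) for e in elements])
```

In this toy case the header keeps its `page_header` kind and only the full title is promoted. The accuracy numbers below show that the real pipeline has more dependencies on step order than this sketch captures.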

Accuracy tests (after change)

Filing | F1-Score | Recall | Precision | Missing | Unexpected | Total
AAPL | 58.20% | 62.38% | 54.55% | 76 | 105 | 234
MSFT | 62.80% | 64.15% | 61.50% | 133 | 149 | 406
META | 72.36% | 75.65% | 69.35% | 75 | 103 | 340
GOOG-70 | 60.58% | 62.53% | 58.74% | 145 | 170 | 416
GOOG-94 | 60.62% | 62.56% | 58.80% | 146 | 171 | 419

I remember now why I deviated from the originally proposed solution (even though it is the better one): the accuracy tests didn't look promising. I decided not to change the order of the classifiers, since that seems to break the whole parser, and to look for a solution that keeps the same order.
But I think the results shouldn't be this bad. It will take more work to fix all these breaking points.

Please let me know if I should continue with solution b and fix all the breaking points.
