Page headers should be identified as PageHeaderElement #65

Elijas · 2023-12-27T11:47:15Z

Context

MSFT accuracy-test (permalink at the time of posting)

Problem

The header "PART I" is identified as a top section title element, when it should be identified as a page header element. Because of this, the actual top section title element is incorrectly identified as a title element.

Ideas about a possible solution

One possible solution: identify page header elements first. The top section title classifier should then start working correctly, as it will no longer be confused by the header elements.

deenaawny-github-account · 2023-12-31T14:23:29Z

Hey @Elijas,

I started working on this issue.

This question came up:
Was the element identified incorrectly because there is a bug in the top-level section classifier?
The same pattern occurs on other pages and the page header elements are identified correctly.

I am still analyzing, and I would like to be assigned this issue.

Thank you.

Elijas · 2024-01-01T13:20:05Z

Thanks for taking a look!

So in this picture let's take a look at the elements:

A. PART I (small letters)
B. PART I. FINANCIAL INFORMATION (big letters)

Element A should be identified as PageHeaderElement (because it's irrelevant), and Element B should be identified as TopLevelSectionTitle.

Does it make sense?

deenaawny-github-account · 2024-01-02T11:52:54Z

Hey @Elijas,

Yes, it makes sense.

I looked more into the code on Sunday.

I first asked ChatGPT to refactor the _process_element function (of the top section title classifier) into smaller functions, keeping the same logic. After the refactor, the unit tests, snapshot verify and accuracy tests all ran with no errors. In addition, unit test coverage for top_section_manager_for_10q increased from 63% to 76% (please see the attachment refactor-tests.txt).

I think small functions are easier to read (at least for me). I then removed some redundancy between the functions, asked ChatGPT to write specifications for them, and rewrote those specifications based on my own understanding. The resulting function specifications are therefore a mix of ChatGPT's and mine.

I then looked at the select_element function and extended it to:

Check whether the given list of semantic elements contains two or more elements. If it does:
a. Check whether the first element in the list is a potential page header element. If it is, choose the second element in the list as the top section title element.
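
The extended selection step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the `Candidate` class, the `looks_like_page_header` helper, and its font-size and length thresholds are all hypothetical stand-ins for whatever "potential page header" check the real classifier uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for a semantic element; not the sec-parser class."""
    text: str
    font_size: float


def looks_like_page_header(candidate: Candidate, body_font_size: float) -> bool:
    # Assumed heuristic for illustration only: page headers are short and
    # printed no larger than the body text.
    return len(candidate.text) <= 10 and candidate.font_size <= body_font_size


def select_element(
    candidates: List[Candidate], body_font_size: float = 10.0
) -> Optional[Candidate]:
    """Pick the candidate to treat as the top section title."""
    if not candidates:
        return None
    # With at least two candidates, skip a leading page-header lookalike
    # and take the second candidate instead.
    if len(candidates) >= 2 and looks_like_page_header(candidates[0], body_font_size):
        return candidates[1]
    return candidates[0]


header = Candidate("PART I", font_size=8.0)
title = Candidate("PART I. FINANCIAL INFORMATION", font_size=14.0)
print(select_element([header, title]).text)  # PART I. FINANCIAL INFORMATION
```

Note that this only looks at the first element of the list, so a page header appearing later in the list is not skipped.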

This small change ensures that all top section title elements are classified correctly for the report MSFT 0000950170-23-014423.
After making it, I looked at some of the elements whose classification changed. They are listed below:

html_hash	html_preview	text_content	section_type before	cls_name before change	section_type after	cls_name after change
b5a04115	style='background-co...[162]...fit-content;'>PART I	PART I	part 1	TopSectionTitle	-	IntroductorySectionElement
9f043413	style='background-co...[136]...fit-content;'>Item 1	Item 1	part 1 item 1	TopSectionTitle	-	IntroductorySectionElement
0f6f4ac1	style='background-co...[144]...nt;'>PART I. FINANCI	PART I. FINANCI	-	TitleElement	part 1	TopSectionTitle
7d3a9a03	style='background-co...[144]...nt;'>ITEM 1. FINANCI	ITEM 1. FINANCI	-	TitleElement	part 1 item 1	TopSectionTitle
a5c9de95	style='background-co...[136]...fit-content;'>Item 2	Item 2	part 1 item 2	TopSectionTitle	-	PageHeaderElement
3ce7a5eb	style='background-co...[176]...SION AND ANALYSIS OF	ITEM 2. MANAGEMENT’S DISCUSSION AND ANALYSIS OF	-	TitleElement	part 1 item 2	TopSectionTitle
c1eb19de	style='background-co...[139]...-content;'>Item 3, 4	Item 3, 4	part 1 item 3	TopSectionTitle	-	TitleElement
51287d34	style='background-co...[163]...TATIVE AND QUALITATI	ITEM 3. QUANTITATIVE AND QUALITATI	-	TitleElement	part 1 item 3	TopSectionTitle
ddd63bb	style='background-co...[163]...it-content;'>PART II	PART II	part 2	TopSectionTitle	-	PageHeaderElement
3cd753ab	style='background-co...[166]...content;'>Item 1, 1A	Item 1, 1A	part 2 item 1	TopSectionTitle	-	TitleElement
159c6586	style='background-co...[143]...ent;'>PART II. OTHER	PART II. OTHER	-	TitleElement	part 2	TopSectionTitle
820f6022	style='background-co...[142]...tent;'>ITEM 1. LEGAL	ITEM 1. LEGAL	-	TitleElement	part 2 item 1	TopSectionTitle
8d5cae59	style='background-co...[140]...ontent;'>ITEM 1A. RI	ITEM 1A. RI	part 2 item 1 a	TopSectionTitleElement	-	TitleElement
a5c9de95	style='background-co...[136]...fit-content;'>Item 2	Item 2	part 2 item 2	TopSectionTitleElement	-	PageHeaderElement
6db46951	style='background-co...[163]...STERED SALES OF EQUI	ITEM 2. UNREGISTERED SALES OF EQUI	-	TitleElement	part 2 item 2	TopSectionTitle
cf7940c0	style='background-co...[136]...fit-content;'>Item 6	Item 6	part 2 item 6	TopSectionTitle	-	TitleElement
97fc951c	style='background-co...[138]...-content;'>ITEM 6. E	ITEM 6. E	-	TitleElement	part 2 item 6	TopSectionTitle

The above implementation fixes the classification of top section title elements. It makes sure they are not classified as page header elements.

Scores:

Accuracy tests (before change)

Filing	F1-Score	Recall	Precision	Missing	Unexpected	Total
AAPL	100.00%	100.00%	100.00%	0	0	202
MSFT	95.00%	97.30%	92.80%	10	28	389
META	100.00%	100.00%	100.00%	0	0	308
GOOG-70	100.00%	100.00%	100.00%	0	0	387
GOOG-94	100.00%	100.00%	100.00%	0	0	390

Accuracy tests (after change)

Filing	F1-Score	Recall	Precision	Missing	Unexpected	Total
AAPL	100.00%	100.00%	100.00%	0	0	202
MSFT	95.00%	97.30%	93.77%	10	24	385
META	100.00%	100.00%	100.00%	0	0	308
GOOG-70	99.74%	99.74%	99.74%	1	1	387
GOOG-94	99.74%	99.74%	99.74%	1	1	390
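
For readers comparing the columns, here is a small sketch of how an F1 score relates to precision and recall. It assumes the accuracy tests use the standard harmonic-mean definition, which this thread does not confirm. The input values are the MSFT row of the "before change" table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MSFT row from the "before change" table, in percent.
msft_f1 = f1_score(precision=92.80, recall=97.30)
print(f"{msft_f1:.2f}%")  # 95.00%
```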

Pull Request: #70

In case you don't want the refactor, I will need to apply the change to the previous code.
I also need to make sure the code follows the coding standards.

It makes more sense (as you stated above) to call the page header classifier first. I can implement that second solution too.

I appreciate your feedback.

deenaawny-github-account · 2024-01-02T17:00:35Z

Hey @Elijas,

Question from Elijas: for the GOOG filings, can you briefly mention which element changed?

GOOG-70
Element before
{
  "cls_name": "TopSectionTitle",
  "html_hash": "e6cef692",
  "html_preview": "style=\"color:#000000...[68]...height:120%\">ITEM 1.",
  "level": 1,
  "section_type": "part1item1",
  "tag_name": "span",
  "text_content": "ITEM 1."
}

Element after
{
  "cls_name": "TitleElement",
  "html_hash": "e6cef692",
  "html_preview": "style=\"color:#000000...[68]...height:120%\">ITEM 1.",
  "level": 0,
  "tag_name": "span",
  "text_content": "ITEM 1."
}

GOOG-94
Element before
{
  "cls_name": "TopSectionTitle",
  "html_hash": "e6cef692",
  "html_preview": "style=\"color:#000000...[68]...height:120%\">ITEM 1.",
  "level": 1,
  "section_type": "part1item1",
  "tag_name": "span",
  "text_content": "ITEM 1."
}

Element after
{
  "cls_name": "TitleElement",
  "html_hash": "e6cef692",
  "html_preview": "style=\"color:#000000...[68]...height:120%\">ITEM 1.",
  "level": 0,
  "tag_name": "span",
  "text_content": "ITEM 1."
}
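
A before/after comparison like the one above could be automated with a small helper along these lines. This is only a sketch: the `changed_classes` function is hypothetical, and it assumes each `html_hash` appears once per snapshot, which may not hold for real filings where the same header repeats.

```python
from typing import Dict, List, Tuple

Element = Dict[str, object]


def changed_classes(
    before: List[Element], after: List[Element]
) -> List[Tuple[object, object, object]]:
    """Return (html_hash, cls_name before, cls_name after) for changed elements."""
    # Assumption: html_hash is unique within a snapshot.
    after_by_hash = {e["html_hash"]: e for e in after}
    changes = []
    for element in before:
        match = after_by_hash.get(element["html_hash"])
        if match is not None and match["cls_name"] != element["cls_name"]:
            changes.append(
                (element["html_hash"], element["cls_name"], match["cls_name"])
            )
    return changes


before = [{"cls_name": "TopSectionTitle", "html_hash": "e6cef692",
           "level": 1, "section_type": "part1item1", "text_content": "ITEM 1."}]
after = [{"cls_name": "TitleElement", "html_hash": "e6cef692",
          "level": 0, "text_content": "ITEM 1."}]
print(changed_classes(before, after))
# [('e6cef692', 'TopSectionTitle', 'TitleElement')]
```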

If solution a is chosen, a separate issue will need to be opened for these changed elements.

For solution b (clean the irrelevant elements first):
The implementation was simple; it only required moving the top section manager for 10q after the page header classifier:

            PageHeaderClassifier(
                types_to_process={TextElement, HighlightedTextElement},
            ),
            TopSectionManagerFor10Q(types_to_exclude={PageHeaderElement}),
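
To illustrate why the position of a step in the list matters, here is a toy pipeline in which each step only touches still-unclassified elements. Everything in it is an assumption made for illustration: the `Element` class, both classification rules, and the `run` helper are hypothetical and far simpler than the real classifiers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """Toy element; `kind` plays the role of the element class name."""
    text: str
    kind: str = "TextElement"


Step = Callable[[List[Element]], List[Element]]


def mark_page_headers(elements: List[Element]) -> List[Element]:
    # Assumed rule for illustration: an unclassified text repeated
    # verbatim across the document is a page header.
    texts = [e.text for e in elements]
    return [
        Element(e.text, "PageHeaderElement")
        if e.kind == "TextElement" and texts.count(e.text) > 1
        else e
        for e in elements
    ]


def mark_top_section_titles(elements: List[Element]) -> List[Element]:
    # Assumed rule: the first unclassified text starting with "PART"
    # becomes the top section title.
    result, found = [], False
    for e in elements:
        if not found and e.kind == "TextElement" and e.text.startswith("PART"):
            result.append(Element(e.text, "TopSectionTitle"))
            found = True
        else:
            result.append(e)
    return result


def run(steps: List[Step], elements: List[Element]) -> List[str]:
    """Apply the steps in order and return the resulting kinds."""
    for step in steps:
        elements = step(elements)
    return [e.kind for e in elements]


doc = [Element("PART I"), Element("PART I. FINANCIAL INFORMATION"), Element("PART I")]

titles_first = run([mark_top_section_titles, mark_page_headers], doc)
headers_first = run([mark_page_headers, mark_top_section_titles], doc)
print(titles_first)   # ['TopSectionTitle', 'TextElement', 'PageHeaderElement']
print(headers_first)  # ['PageHeaderElement', 'TopSectionTitle', 'PageHeaderElement']
```

The toy only shows that order changes the outcome. It does not model the interactions between the real steps, which is where the accuracy results reported in this comment come from.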

Accuracy tests (after change)

Filing	F1-Score	Recall	Precision	Missing	Unexpected	Total
AAPL	58.20%	62.38%	54.55%	76	105	234
MSFT	62.80%	64.15%	61.50%	133	149	406
META	72.36%	75.65%	69.35%	75	103	340
GOOG-70	60.58%	62.53%	58.74%	145	170	416
GOOG-94	60.62%	62.56%	58.80%	146	171	419

I remember now why I deviated from the first solution proposed in this issue (identifying page headers first), even though it is the better one: the accuracy tests didn't look promising. I decided not to change the order of the classifiers, since doing so seemed to break the whole parser, and to look for a solution that keeps the same order.
Still, I think the results shouldn't be this bad. It will take more work to fix all these breaking points.

Please let me know if I should continue with solution b and fix all the breaking points.

Elijas added the contributions-welcome and feature:elements labels Dec 27, 2023

Elijas assigned Elijas and deenaawny-github-account and unassigned Elijas Jan 1, 2024

Elijas added the status:in-progress label Jan 1, 2024

deenaawny-github-account mentioned this issue Jan 2, 2024

Issue # 65 pull request #70

Closed

This was referenced Jan 7, 2024

Issue # 65 pull request #73

Closed

Issue number 65 #77

Closed

Issue # 65 updated test data alphanome-ai/sec-parser-test-data#4

Closed

test data update for issue # 65 alphanome-ai/sec-parser-test-data#5

Open
