-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Page headers should be identified as PageHeaderElement #65
Comments
Hey @Elijas, I started working on this issue. This question came up: I am still analyzing. And want to be assigned this issue. Thank you. |
Thanks for taking a look! So in this picture let's take a look at the elements:
Element A should be identified as PageHeaderElement (because it's irrelevant) and Element B should be identified as TopLevelSectionTitle Does it make sense? |
Hey @Elijas, yes, it makes sense. I looked more into the code on Sunday. I first asked ChatGPT to refactor the _process_element function (of the top section title classifier) the into smaller functions (keeping the same logic). I ran unit tests, snapshot verify and accuracy tests and they were running with no errors, after code refractor. In addition, the unit tests cover for top_section_manager_for_10q increased from 63% to 76% (please see attachment I think when functions are small, they are easier to read (at least for me). Then, I fixed some redundancy with the functions. I then asked ChatGPT to give the functions specifications and then I re-wrote these specifications based on my understanding and ChatGPT's specification. So, these function specifications are both ChatGPT's specification and mine. I then looked at the select_element function and extended it to:
With this small method change, it ensures that top section title elements are all classified correctly for the report MSFT 0000950170-23-014423.
The above implementation fixes the classification of top section title elements. It makes sure they are not classified as page header elements. Scores: Accuracy tests (before change)
Accuracy tests (after change)
Pull Request: #70 Next:
It makes more sense (as you stated above) to call page header classifier first. I can do the implementation of the second solution too. I appreciate your feedback. |
Hey @Elijas, question from Elijas: for the GOOG filings, can you briefly mention what is the changed element? GOOG-70 Element after GOOG-94 Element after If solution a is chosen, will need to open an issue for these changed elements. For solution b - clean the irrelevant elements first
Accuracy tests (after change)
I remember now, why I deviated from the first solution (even though it is a better one). The accuracy tests didn't look promising. I then said I won't move the order of the classifiers, it seems to break the whole parser. And I will try to find a solution while keeping the same order. Please let me know if I should continue with solution b and fix all the breaking points. |
Context
MSFT accuracy-test (permalink at the time of posting)
Problem
The header "PART I" is identified as
top section title element
, when it should be identified aspage header element
. And because of this, the actualtop section title element
is incorrectly identified astitle element
Ideas about a possible solution
One possible solution: Identify page header elements first, and then the
top section title element
will start working correctly, as it will not be confused by the header elementsThe text was updated successfully, but these errors were encountered: