Integrate second pass ex 21 model improvements and move to notebook based training #95

Open
zschira wants to merge 140 commits into base: main

Conversation

@zschira (Member) commented Oct 11, 2024

Overview

Closes #78.

This PR adds improvements to the Exhibit 21 extractor and moves the model implementation and training to a notebook that can be managed by Dagster.

What did you change in this PR?

  • Move the Exhibit 21 model to a notebook (see the dagstermill sketch after this list)
  • Pull model improvements from @katie-lamb's branch second-pass-ex21-improvements
  • Add an Exhibit 21 layout classifier model
  • Integrate the layout classifier into the production pipeline so paragraph-layout docs can be filtered out of record linkage
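For context, a minimal sketch of how a notebook can be wired into Dagster as an asset via dagstermill; the asset name, notebook path, and input wiring below are illustrative assumptions, not the PR's exact code.

```python
from dagster import AssetIn, file_relative_path
from dagstermill import define_dagstermill_asset

# Hypothetical wiring: run the extraction notebook as a Dagster asset, taking an
# upstream asset as a notebook input. Names and the path are assumptions.
ex21_extraction_notebook = define_dagstermill_asset(
    name="ex21_extraction_notebook",
    notebook_path=file_relative_path(__file__, "notebooks/ex21_extraction.ipynb"),
    ins={"ex21_inference_dataset": AssetIn("ex21_inference_dataset")},
)
```

Dagster executes the notebook with papermill and stores the executed copy as the asset output, which is what makes notebook runs manageable from the Dagster UI.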


@@ -19,3 +21,19 @@ def load_from_path(self, context: InputContext, path: UPath) -> pd.DataFrame:
"""Read parquet."""
with path.open("rb") as file:
return pd.read_parquet(file)


class PickleUPathIOManager(UPathIOManager):
@zschira (Member, Author) commented:

I added this to save pickled asset outputs to GCS. This is needed because I separated the ex21 inference dataset creation from actually running the model, but the datasets take up too much space if they're saved locally.
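For reference, a minimal sketch of what a pickle-backed UPathIOManager can look like, following Dagster's dump_to_path/load_from_path hooks; this is an illustration of the idea, not necessarily the exact implementation in this PR.

```python
import pickle

from dagster import InputContext, OutputContext, UPathIOManager
from upath import UPath


class PickleUPathIOManager(UPathIOManager):
    """Persist asset outputs as pickles at any UPath, e.g. a gs:// bucket."""

    extension: str = ".pickle"

    def dump_to_path(self, context: OutputContext, obj, path: UPath) -> None:
        # Serialize the asset output wherever the UPath points (local or GCS).
        with path.open("wb") as file:
            pickle.dump(obj, file)

    def load_from_path(self, context: InputContext, path: UPath) -> object:
        # Read the pickled object back in for downstream assets.
        with path.open("rb") as file:
            return pickle.load(file)
```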

@@ -12,6 +16,22 @@
)


def pyfunc_model_asset_factory(name: str, mlflow_run_uri: str):
@zschira (Member, Author) commented:

This function creates an asset that loads a model from MLflow. Using create_model is a slightly odd way to provide configuration to the asset, but it ensures that the default value for mlflow_run_uri shows up in the Dagster UI.
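A rough sketch of the pattern described here, assuming pydantic's create_model is used to build a Dagster Config subclass so the default mlflow_run_uri is visible in the launchpad; the names and details are assumptions rather than the PR's exact code.

```python
import mlflow
from dagster import Config, asset
from pydantic import create_model


def pyfunc_model_asset_factory(name: str, mlflow_run_uri: str):
    """Return an asset that loads an MLflow pyfunc model from a configurable URI."""
    # Build the Config class dynamically so the default URI shows up as an
    # editable field in the Dagster UI. (Assumed approach based on the comment.)
    ModelConfig = create_model(
        f"{name}_config", __base__=Config, mlflow_run_uri=(str, mlflow_run_uri)
    )

    @asset(name=name)
    def _model_asset(config: ModelConfig):
        return mlflow.pyfunc.load_model(config.mlflow_run_uri)

    return _model_asset
```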

ex21_validation_job = model_jobs.create_validation_model_job(
    "ex21_extraction_validation",
    ex_21.validation_assets,

@zschira (Member, Author) commented:

At some point I'd like to clean up all this asset/job creation stuff, but I don't think this is the best time for that.

@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

At a high level, I think I'm confused about whether this notebook should be used to run only validation, or also inference on the whole dataset. If it's only used for validation, then I think the job name and/or the name of the notebook should be changed to include "validation". It seems like this notebook is used for the ex21_training job and not ex21_extraction, but it's called ex21_extraction.ipynb.

Also, maybe this should read "This notebook implements a model built on top of layoutlmv3 to extract tables from Exhibit 21 attachments to SEC 10-K filings." The current wording doesn't quite make sense to me.



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

I think these upstream assets are hard to differentiate. Took a stab at making it a little more verbose:

  • ex21_training_data: Dataset of labeled Ex. 21 documents produced in Label Studio and used to train layoutlm. Each word in a document has an IOB-format entity tag indicating one of these classes: subsidiary, location of incorporation, ownership percentage, or other.
  • ex21_validation_set: Transcribed tables from Ex. 21 documents describing the expected inference output on a validation set of filings.
  • ex21_failed_parsing_metadata: Metadata for any validation filings that couldn't be parsed and included in the inference dataset (usually empty).
  • ex21_inference_dataset: Parsed filings prepped for the inference model. If running validation, this should be the validation set of filings.

If running inference on all the docs, do you still need the validation set in ex21_validation_set to be materialized? If you are running validation, should ex21_inference_dataset just contain the validation set of filings?



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Line #6.        "layoutlm_training_run": "layoutlm-labeledv0.2",

Does setting a different value in the Dagster config overrule what's set in this cell?

What does this string actually represent? Is it the model path from MLflow, something from Dagster, or something local? I see exhibit21_extractor and layoutlm_extractor among the MLflow registered models, but neither seems right.

Should this be layoutlm_training_run or layoutlm_uri? The documentation cell above says layoutlm_uri, but maybe it just needs to be updated?



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Maybe add something like: "Model finetuning will only be run if configured to do so (layoutlm_training_run = None); otherwise a pretrained version will be used from the MLflow tracking server."

Also, I think step 2 should say Named Entity Recognition (NER), since it's the first place we use NER.
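If it helps readers, the behavior could be sketched roughly like this; finetune_layoutlm is a hypothetical helper, and the runs:/ URI form is an assumption about how the configured run maps to an MLflow model.

```python
import mlflow


def get_layoutlm(layoutlm_training_run, training_dataset):
    """Sketch of the configured behavior described above (names assumed)."""
    if layoutlm_training_run is None:
        # No pretrained run configured: finetune LayoutLM on the labeled data.
        return finetune_layoutlm(training_dataset)  # hypothetical helper
    # Otherwise load the already-finetuned model from the MLflow tracking server.
    return mlflow.pyfunc.load_model(f"runs:/{layoutlm_training_run}/model")
```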



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Maybe "labeled validation data" should be "manually transcribed validation tables"? Just to differentiate between this and the Label Studio training data?

Also, if this inference section is used in the ex21_extraction job, then maybe mention that it's what actually runs to extract tables, not just for validation.



@@ -0,0 +1,337 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Line #11.        def predict(self, context, model_input: pd.DataFrame):

I assume we use predict when we pull down the classifier during an inference run and actually classify documents?
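For context, an MLflow pyfunc wrapper with that predict signature typically looks something like this sketch; the class name, artifact key, and joblib loading are assumptions, not necessarily this repo's code.

```python
import joblib
import mlflow.pyfunc
import pandas as pd


class Ex21LayoutClassifier(mlflow.pyfunc.PythonModel):
    """Hypothetical pyfunc wrapper around the layout classifier."""

    def load_context(self, context):
        # Load the fitted classifier logged as a run artifact alongside the model.
        # "classifier" is an assumed artifact key.
        self.clf = joblib.load(context.artifacts["classifier"])

    def predict(self, context, model_input: pd.DataFrame):
        # Called at inference time: returns one layout label per document row.
        return self.clf.predict(model_input)
```

A model pulled down with mlflow.pyfunc.load_model exposes this predict method, so an inference run would call it on the prepped document features to classify documents.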



@@ -0,0 +1,337 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Line #25.    for classifier, model in classifiers.items():

Do you think we should run cross-validation of the models every time? Or should we just choose one (SVM performed the best)? I guess it's fine that we just log all of them in MLflow and then choose one, but we could potentially use the wrong one in production.
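A minimal sketch of the compare-and-log pattern this loop suggests; the classifier set, feature matrix, and metric are placeholders, with make_classification standing in for the real layout features.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder features/labels standing in for the real document-layout features
# and paragraph-vs-table labels prepared elsewhere in the pipeline.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate classifiers (assumed set; the comment notes SVM performed best).
classifiers = {
    "svm": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(),
}

for name, model in classifiers.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    with mlflow.start_run(run_name=f"layout-classifier-{name}"):
        # Log every candidate so the best one can be deliberately promoted,
        # rather than accidentally shipping the wrong one to production.
        mlflow.log_metric("cv_f1_macro_mean", scores.mean())
```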



@katie-lamb (Member) commented Oct 22, 2024

Cool that this works! Seems like a great structure for splink notebooks and record linkage validation.

I changed the feature creation of the paragraph layout classifier slightly and left some comments on the Ex. 21 extraction notebook.

I think what I'm most unclear about in the structure is the difference between the ex21_training and ex21_extraction jobs. The ex21_training job has the option of fine-tuning LayoutLM and then running inference on the validation set, whereas the ex21_extraction job just loads a model from a run and performs inference on the whole dataset? I was confused about where to set the config in Dagster and where to set it in the notebook. Also, I wasn't sure where model paths were coming from, since they no longer seem to come from the MLflow model registry.

Maybe we need to include a summary of the different jobs in the README?

@katie-lamb (Member) left a review comment:

I think it might be worth getting on the phone to talk about how the Dagster config is now set up. I know we already talked through it, but I think I forgot something key, because the config was not laid out exactly how I remembered. I integrated a few more changes to the model pipeline from my second-pass improvements branch, but otherwise I think all the changes from there are integrated in this PR.

) -> pd.DataFrame:
"""Format Label Studio output JSONs into dataframe."""
labeled_df = pd.DataFrame()
tracking_df = validation_helpers.load_training_data("ex21_labels.csv")
@katie-lamb (Member) commented:

This is a nit, but I think this CSV shouldn't be labeled ex21_labels.csv, because that's the same name as the manually transcribed tables. Maybe this one should be ex21_labeled_filings.csv and the manually transcribed tables should be ex21_transcriptions.csv?

@katie-lamb (Member) commented Oct 22, 2024:

Also, I see that the version of ex21_labels.csv in this branch has paragraph-layout docs included. I'm a little worried that means we trained on paragraph-layout docs in the latest run, or did you not retrain LayoutLM in that run?

The validation data in this branch doesn't include paragraph-layout docs, which is good.

@katie-lamb (Member) commented:

I renamed this file to ex21_labeled_filings.csv but feel free to change. Also, I pulled over the version from my other branch which has paragraph layout filings commented out.

def clean_ex21_validation_set(validation_df: pd.DataFrame):
"""Clean Ex. 21 validation data to match extracted format."""
validation_df = remove_paragraph_tables_from_validation_data(validation_df)
@katie-lamb (Member) commented:

I added this to automatically remove paragraph-layout docs from the validation data (even though the current version of the validation data already has them removed). I figured it couldn't hurt, but you can take it out if you want.
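As a rough illustration, such a filter can look like the sketch below; the lookup file name, column names, and the "paragraph" label are assumptions, not necessarily what this branch does.

```python
import pandas as pd


def remove_paragraph_tables_from_validation_data(
    validation_df: pd.DataFrame,
) -> pd.DataFrame:
    """Drop validation rows whose filing is flagged as paragraph layout."""
    # Assumed: a labels file with one row per filing and a "layout" column.
    layouts = pd.read_csv("ex21_labeled_filings.csv")
    paragraph_filings = layouts.loc[layouts["layout"] == "paragraph", "filename"]
    return validation_df[~validation_df["filename"].isin(paragraph_filings)]
```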

Development

Successfully merging this pull request may close these issues.

Second Pass Ex. 21 Model Improvements
2 participants