Integrate second pass ex 21 model improvements and move to notebook based training #95

Open
zschira wants to merge 140 commits into base: main

Conversation

@zschira (Member) commented Oct 11, 2024

Overview

Closes #78.

This PR adds improvements to the Exhibit 21 extractor and moves the model implementation and training to a notebook that can be managed by Dagster.

What did you change in this PR?

  • Move the Exhibit 21 model to a notebook (see the dagstermill sketch after this list)
  • Pull model improvements from @katie-lamb's branch second-pass-ex21-improvements
  • Add an Exhibit 21 layout classifier model
  • Integrate the layout classifier into the production pipeline so paragraph-layout docs can be filtered out of record linkage
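For context, a minimal sketch of how a notebook can be wired into Dagster as an asset via dagstermill; the asset name, notebook path, and input wiring below are illustrative assumptions, not the PR's exact code.

```python
from dagster import AssetIn, file_relative_path
from dagstermill import define_dagstermill_asset

# Hypothetical wiring: run the extraction notebook as a Dagster asset, taking an
# upstream asset as a notebook input. Names and the path are assumptions.
ex21_extraction_notebook = define_dagstermill_asset(
    name="ex21_extraction_notebook",
    notebook_path=file_relative_path(__file__, "notebooks/ex21_extraction.ipynb"),
    ins={"ex21_inference_dataset": AssetIn("ex21_inference_dataset")},
)
```

Dagster executes the notebook with papermill and stores the executed copy as the asset output, which is what makes notebook runs manageable from the Dagster UI.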


@@ -19,3 +21,19 @@ def load_from_path(self, context: InputContext, path: UPath) -> pd.DataFrame:
"""Read parquet."""
with path.open("rb") as file:
return pd.read_parquet(file)


class PickleUPathIOManager(UPathIOManager):
@zschira (Member, Author) commented:

I added this to save pickled asset outputs to GCS. This is needed because I separated the ex21 inference dataset creation from actually running the model, but the datasets take up too much space if they're saved locally.
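For reference, a minimal sketch of what a pickle-backed UPathIOManager can look like, following Dagster's dump_to_path/load_from_path hooks; this is an illustration of the idea, not necessarily the exact implementation in this PR.

```python
import pickle

from dagster import InputContext, OutputContext, UPathIOManager
from upath import UPath


class PickleUPathIOManager(UPathIOManager):
    """Persist asset outputs as pickles at any UPath, e.g. a gs:// bucket."""

    extension: str = ".pickle"

    def dump_to_path(self, context: OutputContext, obj, path: UPath) -> None:
        # Serialize the asset output wherever the UPath points (local or GCS).
        with path.open("wb") as file:
            pickle.dump(obj, file)

    def load_from_path(self, context: InputContext, path: UPath) -> object:
        # Read the pickled object back in for downstream assets.
        with path.open("rb") as file:
            return pickle.load(file)
```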

@@ -12,6 +16,22 @@
)


def pyfunc_model_asset_factory(name: str, mlflow_run_uri: str):
@zschira (Member, Author) commented:

This function creates an asset that loads a model from MLflow. Using create_model is a slightly odd way to provide configuration to the asset, but it ensures that the default value for mlflow_run_uri shows up in the Dagster UI.
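A rough sketch of the pattern described here, assuming pydantic's create_model is used to build a Dagster Config subclass so the default mlflow_run_uri is visible in the launchpad; the names and details are assumptions rather than the PR's exact code.

```python
import mlflow
from dagster import Config, asset
from pydantic import create_model


def pyfunc_model_asset_factory(name: str, mlflow_run_uri: str):
    """Return an asset that loads an MLflow pyfunc model from a configurable URI."""
    # Build the Config class dynamically so the default URI shows up as an
    # editable field in the Dagster UI. (Assumed approach based on the comment.)
    ModelConfig = create_model(
        f"{name}_config", __base__=Config, mlflow_run_uri=(str, mlflow_run_uri)
    )

    @asset(name=name)
    def _model_asset(config: ModelConfig):
        return mlflow.pyfunc.load_model(config.mlflow_run_uri)

    return _model_asset
```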

ex21_validation_job = model_jobs.create_validation_model_job(
    "ex21_extraction_validation",
    ex_21.validation_assets,

@zschira (Member, Author) commented:

At some point I'd like to clean up all this asset/job creation stuff, but I don't think this is the best time for that.

@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

At a high level, I think I'm confused about whether this notebook should be used to run only validation, or also inference on the whole dataset. If it's only used for validation, then I think the job name and/or the name of the notebook should be changed to include "validation". It seems like this notebook is used for the ex21_training job and not ex21_extraction, but it's called ex21_extraction.ipynb.

Also, maybe this should read "This notebook implements a model built on top of layoutlmv3 to extract tables from Exhibit 21 attachments to SEC 10-K filings." The current wording doesn't quite make sense to me.



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

I think these upstream assets are hard to differentiate. Took a stab at making it a little more verbose:

  • ex21_training_data: Dataset of labeled Ex. 21 documents produced in Label Studio and used to train layoutlm. Each word in a document has an IOB-format entity tag indicating one of these classes: subsidiary, location of incorporation, ownership percentage, or other.
  • ex21_validation_set: Transcribed tables from Ex. 21 documents describing the expected inference output on a validation set of filings.
  • ex21_failed_parsing_metadata: Metadata for any validation filings that couldn't be parsed and included in the inference dataset (usually empty).
  • ex21_inference_dataset: Parsed filings prepped for the inference model. If running validation, this should be the validation set of filings.

If running inference on all the docs, do you still need the validation set in ex21_validation_set to be materialized? If you are running validation, should ex21_inference_dataset just contain the validation set of filings?



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Line #6.        "layoutlm_training_run": "layoutlm-labeledv0.2",

Does setting a different value in the Dagster config overrule what's set in this cell?

What does this string actually represent? Is it the model path from MLflow, something from Dagster, or something local? I see exhibit21_extractor and layoutlm_extractor among the MLflow registered models, but neither seems right.

Should this be layoutlm_training_run or layoutlm_uri? The documentation cell above says layoutlm_uri, but maybe it just needs to be updated?



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Maybe add something like: "Model finetuning will only be run if configured to do so (layoutlm_training_run = None); otherwise a pretrained version will be used from the MLflow tracking server."

Also, I think step 2 should say Named Entity Recognition (NER), since it's the first place we use NER.
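If it helps readers, the behavior could be sketched roughly like this; finetune_layoutlm is a hypothetical helper, and the runs:/ URI form is an assumption about how the configured run maps to an MLflow model.

```python
import mlflow


def get_layoutlm(layoutlm_training_run, training_dataset):
    """Sketch of the configured behavior described above (names assumed)."""
    if layoutlm_training_run is None:
        # No pretrained run configured: finetune LayoutLM on the labeled data.
        return finetune_layoutlm(training_dataset)  # hypothetical helper
    # Otherwise load the already-finetuned model from the MLflow tracking server.
    return mlflow.pyfunc.load_model(f"runs:/{layoutlm_training_run}/model")
```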



@@ -0,0 +1,1144 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Maybe "labeled validation data" should be "manually transcribed validation tables"? Just to differentiate between this and the Label Studio training data?

Also, if this inference section is used in the ex21_extraction job, then maybe mention that it's what actually runs to extract tables, not just for validation.



@@ -0,0 +1,337 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Line #11.        def predict(self, context, model_input: pd.DataFrame):

I assume we use predict when we pull down the classifier during an inference run and actually classify documents?
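For context, an MLflow pyfunc wrapper with that predict signature typically looks something like this sketch; the class name, artifact key, and joblib loading are assumptions, not necessarily this repo's code.

```python
import joblib
import mlflow.pyfunc
import pandas as pd


class Ex21LayoutClassifier(mlflow.pyfunc.PythonModel):
    """Hypothetical pyfunc wrapper around the layout classifier."""

    def load_context(self, context):
        # Load the fitted classifier logged as a run artifact alongside the model.
        # "classifier" is an assumed artifact key.
        self.clf = joblib.load(context.artifacts["classifier"])

    def predict(self, context, model_input: pd.DataFrame):
        # Called at inference time: returns one layout label per document row.
        return self.clf.predict(model_input)
```

A model pulled down with mlflow.pyfunc.load_model exposes this predict method, so an inference run would call it on the prepped document features to classify documents.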



@@ -0,0 +1,337 @@
{
@katie-lamb (Member) commented Oct 22, 2024:

Line #25.    for classifier, model in classifiers.items():

Do you think we should run cross-validation of the models every time? Or should we just choose one (SVM performed the best)? I guess it's fine that we just log all of them in MLflow and then choose one, but we could potentially use the wrong one in production.
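A minimal sketch of the compare-and-log pattern this loop suggests; the classifier set, feature matrix, and metric are placeholders, with make_classification standing in for the real layout features.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder features/labels standing in for the real document-layout features
# and paragraph-vs-table labels prepared elsewhere in the pipeline.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate classifiers (assumed set; the comment notes SVM performed best).
classifiers = {
    "svm": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(),
}

for name, model in classifiers.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    with mlflow.start_run(run_name=f"layout-classifier-{name}"):
        # Log every candidate so the best one can be deliberately promoted,
        # rather than accidentally shipping the wrong one to production.
        mlflow.log_metric("cv_f1_macro_mean", scores.mean())
```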



@katie-lamb (Member) commented Oct 22, 2024

Cool that this works! Seems like a great structure for splink notebooks and record linkage validation.

I changed the feature creation of the paragraph layout classifier slightly and left some comments on the Ex. 21 extraction notebook.

I think what I'm most unclear about in the structure is the difference between the ex21_training and ex21_extraction jobs. The ex21_training job has the option of fine-tuning LayoutLM and then running inference on the validation set, whereas the ex21_extraction job just loads a model from a run and performs inference on the whole dataset? I was confused about where to set the config in Dagster and where to set it in the notebook. Also, I wasn't sure where model paths were coming from, since they no longer seem to come from the MLflow model registry.

Maybe we need to include a summary of the different jobs in the README?

@katie-lamb (Member) left a review comment:

I think it might be worth getting on the phone to talk about how the Dagster config is now set up. I know we already talked through it, but I think I forgot something key, because the config was not laid out exactly how I remembered. I integrated a few more changes to the model pipeline from my second-pass improvements branch, but otherwise I think all the changes from there are integrated in this PR.

) -> pd.DataFrame:
"""Format Label Studio output JSONs into dataframe."""
labeled_df = pd.DataFrame()
tracking_df = validation_helpers.load_training_data("ex21_labels.csv")
@katie-lamb (Member) commented:

This is a nit, but I think this CSV shouldn't be labeled ex21_labels.csv, because that's the same name as the manually transcribed tables. Maybe this one should be ex21_labeled_filings.csv and the manually transcribed tables should be ex21_transcriptions.csv?

@katie-lamb (Member) commented Oct 22, 2024:

Also, I see that the version of ex21_labels.csv in this branch has paragraph-layout docs included. I'm a little worried that means we trained on paragraph-layout docs in the latest run, or did you not retrain LayoutLM in that run?

The validation data in this branch doesn't include paragraph-layout docs, which is good.

@katie-lamb (Member) commented:

I renamed this file to ex21_labeled_filings.csv but feel free to change. Also, I pulled over the version from my other branch which has paragraph layout filings commented out.

def clean_ex21_validation_set(validation_df: pd.DataFrame):
"""Clean Ex. 21 validation data to match extracted format."""
validation_df = remove_paragraph_tables_from_validation_data(validation_df)
@katie-lamb (Member) commented:

I added this to automatically remove paragraph-layout docs from the validation data (even though the current version of the validation data already has them removed). I figured it couldn't hurt, but you can take it out if you want.
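As a rough illustration, such a filter can look like the sketch below; the lookup file name, column names, and the "paragraph" label are assumptions, not necessarily what this branch does.

```python
import pandas as pd


def remove_paragraph_tables_from_validation_data(
    validation_df: pd.DataFrame,
) -> pd.DataFrame:
    """Drop validation rows whose filing is flagged as paragraph layout."""
    # Assumed: a labels file with one row per filing and a "layout" column.
    layouts = pd.read_csv("ex21_labeled_filings.csv")
    paragraph_filings = layouts.loc[layouts["layout"] == "paragraph", "filename"]
    return validation_df[~validation_df["filename"].isin(paragraph_filings)]
```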

Development

Successfully merging this pull request may close these issues.

Second Pass Ex. 21 Model Improvements
2 participants