Integrate second pass ex 21 model improvements and move to notebook based training #95
base: main
Conversation
@@ -19,3 +21,19 @@ def load_from_path(self, context: InputContext, path: UPath) -> pd.DataFrame:
        """Read parquet."""
        with path.open("rb") as file:
            return pd.read_parquet(file)


class PickleUPathIOManager(UPathIOManager):
I added this to save pickled asset outputs to GCS. This is needed because I separated the ex21 inference dataset creation from actually running the model, but the datasets take up too much space if they're saved locally.
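For context, the core of such an IO manager is just a pickle round-trip over a file-like path. Here is a minimal sketch of the two methods a `PickleUPathIOManager` would need, using stdlib `pathlib.Path` in place of `upath.UPath` and omitting the Dagster context arguments (those are assumptions; the real class subclasses Dagster's `UPathIOManager`, which handles GCS paths for you):

```python
import pickle
from pathlib import Path


def dump_to_path(obj: object, path: Path) -> None:
    # Serialize the asset output and write the bytes to the target path.
    # In the real IO manager this is PickleUPathIOManager.dump_to_path.
    with path.open("wb") as file:
        pickle.dump(obj, file)


def load_from_path(path: Path) -> object:
    # Counterpart read: load the bytes back and unpickle.
    with path.open("rb") as file:
        return pickle.load(file)
```

Because `UPath` exposes the same `open()` interface for remote filesystems, the identical method bodies work against a `gs://` bucket once dagster and universal-pathlib are in play.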
@@ -12,6 +16,22 @@
)


def pyfunc_model_asset_factory(name: str, mlflow_run_uri: str):
This function will create an asset to load a model from mlflow. Using `create_model` is a little bit of a weird way to provide configuration to the asset, but this ensures that the default value for `mlflow_run_uri` will show up in the Dagster UI.
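The closure pattern behind such a factory can be sketched without the Dagster and mlflow machinery. The names below are illustrative stand-ins, not the real implementation; the point is that the default URI is baked in at factory time, which is what lets a config system surface it as a default value:

```python
from typing import Callable


def pyfunc_model_loader_factory(name: str, default_uri: str) -> Callable[..., dict]:
    """Illustrative sketch: build a named loader with a baked-in default URI.

    In the real asset factory, the default mlflow_run_uri would surface in
    the Dagster UI via the asset's config; here a plain function stands in.
    """

    def load_model(mlflow_run_uri: str = default_uri) -> dict:
        # Stand-in for something like mlflow.pyfunc.load_model(mlflow_run_uri).
        return {"asset": name, "uri": mlflow_run_uri}

    # Give each generated loader the asset's name, as an asset factory would.
    load_model.__name__ = name
    return load_model
```

Calling the factory twice with different names yields independent loaders, each advertising its own default URI.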
ex21_validation_job = model_jobs.create_validation_model_job(
    "ex21_extraction_validation",
    ex_21.validation_assets,
At some point I'd like to clean up all this asset/job creation stuff, but I don't think this is the best time for that.
@@ -0,0 +1,1144 @@
At a high level, I think I'm confused about whether this notebook should be used to run only validation, or also inference on the whole dataset. If it's only used for validation, then I think the job name and/or the name of the notebook should be changed to include "validation". It seems like this notebook is used for the `ex21_training` job and not `ex21_extraction`, but it's called `ex21_extraction.ipynb`.
Also, maybe this should read: "This notebook implements a model built on top of layoutlmv3 to extract tables from Exhibit 21 attachments to SEC-10k filings." The current wording doesn't quite make sense to me.
@@ -0,0 +1,1144 @@
I think these upstream assets are hard to differentiate. Took a stab at making it a little more verbose:

- `ex21_training_data`: Dataset of labeled Ex. 21 documents produced in Label Studio and used to train `layoutlm`. Each word in a document has an IOB-format entity tag indicating one of these classes: subsidiary, location of incorporation, ownership percentage, or other.
- `ex21_validation_set`: Transcribed tables from Ex. 21 documents describing the expected inference output on a validation set of filings.
- `ex21_failed_parsing_metadata`: Metadata for any validation filings that couldn't be parsed and included in the inference dataset (usually empty).
- `ex21_inference_dataset`: Parsed filings prepped for the inference model. If running validation, this should be the validation set of filings.
If running inference on all the docs, do you still need the validation set in `ex21_validation_set` to be materialized? If you are running validation, should `ex21_inference_dataset` just contain the validation set of filings?
@@ -0,0 +1,1144 @@
Line #6. `"layoutlm_training_run": "layoutlm-labeledv0.2",`
Does setting a different value in the Dagster config overrule what's set in this cell?
What does this string actually represent? Is it the model path from MLflow, something from Dagster, or something local? I see `exhibit21_extractor` and `layoutlm_extractor` in the MLflow registered models, but neither seems right.
Should this be `layoutlm_training_run` or `layoutlm_uri`? The documentation cell above says `layoutlm_uri`, but maybe it just needs to be updated?
@@ -0,0 +1,1144 @@
Maybe add this: "Model finetuning will only be run if configured to do so (`layoutlm_training_run = None`); otherwise a pretrained version will be used from the `mlflow` tracking server."
Also, I think in step 2 we should say Named Entity Recognition (NER), since it's the first place we use NER.
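The configuration semantics being proposed can be made concrete with a small sketch. The function and the two callables are hypothetical stand-ins for the notebook's training and mlflow-loading steps, just to pin down the branching on `layoutlm_training_run`:

```python
from typing import Callable, Optional


def get_layoutlm(
    layoutlm_training_run: Optional[str],
    finetune_model: Callable[[], object],
    load_pretrained: Callable[[str], object],
) -> object:
    """Sketch of the config branch: finetune only when no run is configured."""
    if layoutlm_training_run is None:
        # No pretrained training run configured: finetune from scratch.
        return finetune_model()
    # Otherwise reuse the tracked run from the mlflow server.
    return load_pretrained(layoutlm_training_run)
```

Under this reading, setting `layoutlm_training_run` to a run name like `"layoutlm-labeledv0.2"` always skips finetuning.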
@@ -0,0 +1,1144 @@
Maybe "labeled validation data" should be "manually transcribed validation tables"? Just to differentiate between this and the Label Studio training data.
Also, if this inference section is used in the ex21_extraction job, then maybe mention that it's what's run to actually extract tables, not just for validation.
@@ -0,0 +1,337 @@
Line #11. `def predict(self, context, model_input: pd.DataFrame):`
I assume we use `predict` when we pull down the classifier during an inference run and actually classify documents?
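That's how mlflow's pyfunc wrappers generally work: the tracked model object is loaded and its `predict(context, model_input)` is called on a dataframe of inputs. A minimal sketch of the call shape, with a trivial rule standing in for the fitted classifier and a hypothetical `word_count` feature column (both assumptions, not the real feature set):

```python
import pandas as pd


class ParagraphClassifierSketch:
    """Stand-in for an mlflow.pyfunc.PythonModel wrapper.

    The real class would hold a fitted sklearn classifier; a threshold rule
    stands in here so the predict() interface is clear.
    """

    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        # During an inference run, the loaded model's predict() is called
        # once per batch to classify each document row.
        return (model_input["word_count"] > 100).map(
            {True: "paragraph", False: "table"}
        )
```

The `context` argument is part of the pyfunc signature (it carries model artifacts) and is unused in this sketch.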
@@ -0,0 +1,337 @@
Line #25. `for classifier, model in classifiers.items():`
Do you think we should run cross-validation of the models every time? Or should we just choose one (SVM performed the best)? I guess it's fine that we log all of them in MLflow and then choose one, but we could potentially use the wrong one in production.
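For reference, the compare-then-pick pattern under discussion might look like the sketch below: cross-validate each candidate, keep the mean score, and select the best. The toy data and the two candidate models are placeholders, not the notebook's actual feature set or classifier list:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the paragraph-layout features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

classifiers = {
    "svm": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Cross-validate each candidate, mirroring the loop that logs every
# model's scores (in the real pipeline, each would also go to MLflow).
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in classifiers.items()
}

# Pick the best-scoring candidate by mean CV accuracy.
best_name = max(scores, key=scores.get)
```

Logging all candidates but promoting only `best_name` would address the "wrong one in production" concern, since the selection is then recorded rather than implicit.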
Cool that this works! Seems like a great structure for splink notebooks and record linkage validation. I changed the feature creation of the paragraph layout classifier slightly and left some comments on the Ex. 21 extraction notebook. I think what I'm most unclear about in the structure is the difference between the jobs. Maybe we need to include a summary of the different jobs in the README?
I think it might be worth getting on the phone to talk about how the Dagster config stuff is now set up. I know we already talked through it but I think I forgot something key because the config was not laid out exactly how I remembered. I integrated a few more changes to the model pipeline from my second pass improvements branch, but otherwise I think all the changes from there are integrated in this PR.
) -> pd.DataFrame:
    """Format Label Studio output JSONs into dataframe."""
    labeled_df = pd.DataFrame()
    tracking_df = validation_helpers.load_training_data("ex21_labels.csv")
This is a nit, but I think this CSV shouldn't be named `ex21_labels.csv`, because that's the same name as the manually transcribed tables. Maybe this one should be `ex21_labeled_filings.csv` and the manually transcribed tables should be `ex21_transcriptions.csv`?
Also, I see that the version of `ex21_labels.csv` in this branch has paragraph layout docs included. I'm a little worried that means we trained on paragraph layout docs in the latest run, or did you not retrain LayoutLM in that run? The validation data in this branch doesn't include paragraph layout docs, which is good.
I renamed this file to `ex21_labeled_filings.csv`, but feel free to change it. Also, I pulled over the version from my other branch, which has paragraph layout filings commented out.
def clean_ex21_validation_set(validation_df: pd.DataFrame):
    """Clean Ex. 21 validation data to match extracted format."""
    validation_df = remove_paragraph_tables_from_validation_data(validation_df)
I added this to automatically remove paragraph layout docs from the validation data (even though the current version of the validation data already has them removed). I figured it couldn't hurt, but you can take it out if you want.
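A defensive filter like this can be sketched as a one-liner over the validation dataframe. The boolean `paragraph_layout` column is a hypothetical marker; the real helper may instead key off a list of known paragraph-layout filenames:

```python
import pandas as pd


def remove_paragraph_layout_docs(validation_df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: drop paragraph-layout docs from the validation set.

    Assumes a hypothetical boolean `paragraph_layout` column flagging the
    docs to exclude; a no-op when none are flagged, which is why it can't
    hurt to run it every time.
    """
    keep = ~validation_df["paragraph_layout"]
    return validation_df[keep].reset_index(drop=True)
```

Running it on already-clean data just returns the same rows, so applying it unconditionally is harmless.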
Overview
Closes #78.
This PR adds improvements to the Exhibit 21 extractor and moves the model implementation and training to a notebook that can be managed by Dagster.
What did you change in this PR?