Integrate second pass ex 21 model improvements and move to notebook based training #95

Open — wants to merge 140 commits into base: main (showing changes from 132 commits)

Commits (140)
5767035
Initial dagster integration
zschira Aug 13, 2024
9d9fbfd
Update validate integration test to dagster infra
zschira Aug 14, 2024
3da9659
Merge branch 'main' into dagster_integration
zschira Aug 20, 2024
ee77e7a
Generalize mltools
zschira Aug 27, 2024
53d3354
Reorg repo to move towards generalized modelling repo
zschira Aug 28, 2024
014bcb1
Change library module structure
zschira Aug 28, 2024
5404148
Turn experiment_tracking into sub-package
zschira Aug 28, 2024
886614f
Remove unused function
zschira Aug 28, 2024
dec80b8
Gracefully handle mlflow run on failure
zschira Aug 28, 2024
e725f3d
Fix variable name
zschira Aug 28, 2024
df44ed5
Change experiment tracker resource names
zschira Aug 28, 2024
93da052
Add mlflow artifact io-manager
zschira Aug 28, 2024
07713e9
Simplify pudl_models decorator
zschira Aug 29, 2024
5d89ec6
Split extraction logging into two funcs
zschira Aug 29, 2024
c57818a
Add mlflow metrics io-manager
zschira Aug 29, 2024
625783b
Change pudl_model to pudl_pipeline
zschira Aug 29, 2024
4f50a7b
Add validation pipeline
zschira Aug 30, 2024
f6ab22c
Streamline construction of dagster jobs for running/testing pudl models
zschira Sep 2, 2024
f20fb7d
Remove old comment
zschira Sep 2, 2024
92e2e00
Add ex21 to dagster jobs
zschira Sep 3, 2024
520e6d1
Prep for multiple code locations
zschira Sep 3, 2024
e99ee1a
Add top-level workspace file
zschira Sep 3, 2024
559c0e6
Restructure docs
zschira Sep 3, 2024
93d02f3
Add train model job
zschira Sep 3, 2024
5190bf9
Log mlflow artifacts as parquet until csv is fixed
zschira Sep 3, 2024
ca9599e
Fix ex21 extraction
zschira Sep 4, 2024
7e7a503
Add development section to docs
zschira Sep 4, 2024
61f48c3
Fix integration tests
zschira Sep 4, 2024
0fd8ffc
Don't run ruff on notebooks
zschira Sep 4, 2024
97d5587
xfail ex21 integration test
zschira Sep 4, 2024
ace268b
Add parquet upath io-manager
zschira Sep 5, 2024
fb1feeb
Remove nb-output clear
zschira Sep 5, 2024
294ec72
Test docker deployment
zschira Sep 5, 2024
4de51b3
Chunk ex 21 extraction
zschira Sep 5, 2024
214e28f
Fix assign copy
zschira Sep 6, 2024
c5736e0
Add job for testing ex21 resource usage
zschira Sep 6, 2024
4a81e88
Merge branch 'test_parquet_logging' into dagster_integration
zschira Sep 6, 2024
ec39633
Remove test docker files
zschira Sep 6, 2024
101ccf1
Remove complex asset factory
zschira Sep 6, 2024
7e0c5a5
Parallelize ex21 extraction
zschira Sep 6, 2024
080d790
Don't chunk in inference module
zschira Sep 6, 2024
44dfc52
Handle failures in converting to pdf
zschira Sep 6, 2024
6e24157
Delete cached pdfs early
zschira Sep 6, 2024
cd06d07
Add metadata to chunk_filings
zschira Sep 9, 2024
e3e8c45
Catch oom errors while extracting ex21
zschira Sep 9, 2024
350defb
Fix ex21 gcs io-manager
zschira Sep 9, 2024
3c80b72
Fix partitions for basic 10k extraction.
zschira Sep 9, 2024
31971b7
Cache layoutlm locally
zschira Sep 9, 2024
634a050
Fix caching model
zschira Sep 9, 2024
69ee4c0
Remove bad call
zschira Sep 9, 2024
63d6600
Test own_per conversion
zschira Sep 10, 2024
c8490d4
Add pandera types for output tables
zschira Sep 10, 2024
fa4f57d
Add missing entities module
zschira Sep 10, 2024
35e917d
Don't cache model, load with io manager
zschira Sep 10, 2024
a7b1c7f
Remove float conversion
zschira Sep 10, 2024
f019117
Add hypothesis to deps
zschira Sep 10, 2024
d7d13d8
Make own_per str
zschira Sep 10, 2024
70f5293
Remove astype
zschira Sep 10, 2024
e406092
Validate ex21 return types
zschira Sep 10, 2024
f3835d9
Clean model download temp dir
zschira Sep 11, 2024
3c995cd
Fix model return type
zschira Sep 11, 2024
ef55e4b
Catch errors in creating ex 21 dataset
zschira Sep 11, 2024
b37450a
Fix column name
zschira Sep 11, 2024
06b18ed
Try to catch empty pdf errors
zschira Sep 12, 2024
abfc006
Print traceback in caught exception
zschira Sep 12, 2024
ff92a55
Fix empty pdf check
zschira Sep 12, 2024
8aa8c95
Actually fix empty pdf check?
zschira Sep 12, 2024
43600bc
Use UPath in GCSArchive
zschira Sep 18, 2024
05ad82c
Make _configure_mlflow a standalone function
zschira Sep 18, 2024
fddc3b2
Merge branch 'main' into error_handling_improvements
zschira Sep 18, 2024
99fc7ed
Try to skip notebooks in ruff check
zschira Sep 18, 2024
b135500
Pull integration test fixes from main
zschira Sep 19, 2024
6e868f2
Fix typos in README.rst
zschira Sep 19, 2024
df4fd09
Cache downloaded layoutlm in dagster home
zschira Sep 19, 2024
74d237d
Merge branch 'error_handling_improvements' of github.com:catalyst-coo…
zschira Sep 19, 2024
3642765
Fix broken test
zschira Sep 19, 2024
830bd74
fix rename filings
katie-lamb Sep 20, 2024
2cd1fe6
fix paths to cache training data
katie-lamb Sep 20, 2024
64dc8c5
update root dir path
katie-lamb Sep 20, 2024
226d91c
Fix UPath initialization
zschira Sep 20, 2024
3c17d33
Fix path in test
zschira Sep 20, 2024
df69f42
Create huggingface dataset outside model execution
zschira Sep 20, 2024
2d3345c
small fixes to path handling
katie-lamb Sep 20, 2024
46e7b40
Merge branch 'error_handling_improvements' into second-pass-ex21-impr…
katie-lamb Sep 20, 2024
6f9d34a
Minor fixes
zschira Sep 23, 2024
07d500a
Start migrating model training to notebook
zschira Sep 23, 2024
81813a7
Create dataset as dataframe for logging
zschira Sep 24, 2024
5174ed7
Modify dataset return type
zschira Sep 24, 2024
7a572c0
Fix dataset types for model signature
zschira Sep 24, 2024
5728026
Migrate ex 21 model training to a notebook
zschira Sep 25, 2024
5fbbfff
Merge initial notebook migration (broken)
zschira Oct 3, 2024
37edd50
Split dataset loading into separate assets
zschira Oct 4, 2024
d6889e3
Minor notebook fixes
zschira Oct 4, 2024
d5e013a
Fix import in notebook
zschira Oct 4, 2024
f9810db
add device to pipeline
zschira Oct 4, 2024
2760881
Fix signature inference
zschira Oct 4, 2024
1dcacfa
Fix notebook dagster config
zschira Oct 4, 2024
39bb45b
Fix config param name
zschira Oct 4, 2024
cb83862
Partition training data
zschira Oct 5, 2024
c71593c
Add partitions to notebook asset
zschira Oct 5, 2024
4efa515
Update ex21 labels
zschira Oct 6, 2024
581b2e3
Use run name for specifying training runs
zschira Oct 6, 2024
c67a1be
Rework how notebook is configured
zschira Oct 6, 2024
b8a5b24
Finetune configuration
zschira Oct 6, 2024
45d5cf8
separate inference dataset creation from model prediction
zschira Oct 7, 2024
3e15b1f
Remove deprecated inference module
zschira Oct 7, 2024
60a1260
Add notebook for training ex21 classifier
zschira Oct 8, 2024
4105110
Pull in model updates
zschira Oct 8, 2024
4d29037
Update classifier model
zschira Oct 8, 2024
85c44ff
Fix set on copy pandas issue
zschira Oct 9, 2024
52e3580
Fix model uri's
zschira Oct 9, 2024
b709053
Fix indices in extraction model
zschira Oct 9, 2024
b8dad3c
Fix typo
zschira Oct 9, 2024
e6b29ff
Add asset factory for loading models
zschira Oct 10, 2024
3d11777
Catch layout classification NaN exception
zschira Oct 10, 2024
df5fe0d
Use GCS pickle io-manager
zschira Oct 10, 2024
d6c41a2
Switch gcs pickle io manager to upath based
zschira Oct 11, 2024
ddd2263
Remove duplicate logger
zschira Oct 11, 2024
93bffcb
Fix config warnings
zschira Oct 11, 2024
d717caa
Test pin sphinx
zschira Oct 11, 2024
15be127
Catch errors while normalizing bounding boxes
zschira Oct 14, 2024
4117d0a
Fix call to pandera example
zschira Oct 14, 2024
8c8dd60
Fix handle failures in converting to pdf
zschira Oct 14, 2024
ff821b5
Actually fix handle failures in converting to pdf
zschira Oct 14, 2024
a8eb359
Add model documentation to sec10k readme
zschira Oct 16, 2024
dc160ac
Fix ex 21 validation integration test
zschira Oct 16, 2024
10b24a9
Improve classifier error handling
zschira Oct 16, 2024
ad54979
Fully broaden classifier errors
zschira Oct 16, 2024
672e123
add more docs on running the notebooks
zschira Oct 16, 2024
2dbdcaa
clean up feature creation in paragraph classifier
katie-lamb Oct 22, 2024
cda3225
fix feature creation function
katie-lamb Oct 22, 2024
509b7a0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
8855e5e
small fixes to read in comments in tracking dataframe
katie-lamb Oct 22, 2024
f806ad8
Merge branch 'prep_paragraph_classifier' of https://github.com/cataly…
katie-lamb Oct 22, 2024
590ba60
updates to model pipeline
katie-lamb Oct 23, 2024
3db47d4
take out logging messages
katie-lamb Oct 23, 2024
5f23e1c
update to exclude paragraph layout docs in labeled data tracking
katie-lamb Oct 23, 2024
e166c3d
Remove paragraph filenames from validation data
katie-lamb Oct 23, 2024
f529ca9
Split layoutlm training from inference model/validation
zschira Oct 25, 2024
7dcf5ac
Merge branch 'prep_paragraph_classifier' of github.com:catalyst-coope…
zschira Oct 25, 2024
20 changes: 10 additions & 10 deletions pyproject.toml
@@ -92,7 +92,7 @@ dev = [
docs = [
"doc8>=1,<2", # Ensures clean documentation formatting
"furo>=2022.4.7",
"sphinx>=6,<9", # The default Python documentation engine
"sphinx>=6,<8.1", # The default Python documentation engine
"sphinx-autoapi>=2,<4", # Generates documentation from docstrings
"sphinx-issues>=1.2,<5", # Allows references to GitHub issues

@@ -157,7 +157,7 @@ doctest_optionflags = [

[tool.ruff]
exclude = ["notebooks/*"]
select = [
lint.select = [
"A", # flake8-builtins
# "ARG", # unused arguments
# "B", # flake8-bugbear
@@ -185,7 +185,7 @@ select = [
"UP", # pyupgrade (use modern python syntax)
"W", # pycodestyle warnings
]
ignore = [
lint.ignore = [
"D401", # Require imperative mood in docstrings.
"D417",
"E501", # Overlong lines.
@@ -205,26 +205,26 @@ target-version = "py311"
line-length = 88

# Don't automatically concatenate strings -- sometimes we forget a comma!
unfixable = ["ISC"]
lint.unfixable = ["ISC"]

[tool.ruff.per-file-ignores]
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["F401"] # Ignore unused imports
"tests/*" = ["D"]

[tool.ruff.pep8-naming]
[tool.ruff.lint.pep8-naming]
# Allow Pydantic's `@validator` decorator to trigger class method treatment.
classmethod-decorators = ["pydantic.validator", "pydantic.root_validator"]

[tool.ruff.isort]
[tool.ruff.lint.isort]
known-first-party = ["pudl"]

[tool.ruff.pydocstyle]
[tool.ruff.lint.pydocstyle]
convention = "google"

[tool.ruff.mccabe]
[tool.ruff.lint.mccabe]
max-complexity = 10

[tool.ruff.flake8-quotes]
[tool.ruff.lint.flake8-quotes]
docstring-quotes = "double"
inline-quotes = "double"
multiline-quotes = "double"
18 changes: 18 additions & 0 deletions src/mozilla_sec_eia/library/generic_io_managers.py
@@ -1,5 +1,7 @@
"""Implement useful generic io-managers."""

import pickle

import pandas as pd
from dagster import InputContext, OutputContext, UPathIOManager
from upath import UPath
@@ -19,3 +21,19 @@ def load_from_path(self, context: InputContext, path: UPath) -> pd.DataFrame:
"""Read parquet."""
with path.open("rb") as file:
return pd.read_parquet(file)


class PickleUPathIOManager(UPathIOManager):
Comment from the PR author:
I added this to save pickled asset outputs to GCS. This is needed because I separated the ex21 inference dataset creation from actually running the model, but the datasets take up too much space if they're saved locally.

"""Read and write arbitrary Python objects as pickle files on a local or remote filesystem."""

extension: str = ".pickle"

def dump_to_path(self, context: OutputContext, obj, path: UPath):
"""Write pickle."""
with path.open("wb") as file:
pickle.dump(obj, file)

def load_from_path(self, context: InputContext, path: UPath):
"""Read pickle."""
with path.open("rb") as file:
return pickle.load(file) # noqa: S301
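
The IO manager above boils down to a pickle round trip over a `UPath`. A minimal sketch of that behavior outside Dagster — `pathlib.Path` stands in for `UPath`, and the standalone function names are illustrative:

```python
import pickle
import tempfile
from pathlib import Path


def dump_to_path(obj, path: Path) -> None:
    """Serialize any picklable object to the given path."""
    with path.open("wb") as file:
        pickle.dump(obj, file)


def load_from_path(path: Path):
    """Deserialize a pickled object from the given path."""
    with path.open("rb") as file:
        return pickle.load(file)  # noqa: S301


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "asset.pickle"
    dump_to_path({"subsidiary": "Acme Corp", "own_per": "100"}, path)
    assert load_from_path(path) == {"subsidiary": "Acme Corp", "own_per": "100"}
```

Because `UPath.open` works the same way against GCS, the real IO manager gets remote storage for free from the `UPathIOManager` base class.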
20 changes: 20 additions & 0 deletions src/mozilla_sec_eia/library/mlflow/__init__.py
@@ -1,9 +1,13 @@
"""Implement tooling to interface with mlflow experiment tracking."""

from dagster import Config, asset
from pydantic import create_model

from .mlflow_io_managers import (
MlflowBaseIOManager,
MlflowMetricsIOManager,
MlflowPandasArtifactIOManager,
MlflowPyfuncModelIOManager,
)
from .mlflow_resource import (
MlflowInterface,
@@ -12,6 +16,22 @@
)


def pyfunc_model_asset_factory(name: str, mlflow_run_uri: str):
Comment from the PR author:
This function will create an asset to load a model from mlflow. Using create_model is a little bit of a weird way to provide configuration to the asset, but this ensures that the default value for mlflow_run_uri will show up in the dagster UI.

"""Create asset for loading a model logged to mlflow."""
PyfuncConfig = create_model( # NOQA: N806
f"PyfuncConfig{name}", mlflow_run_uri=(str, mlflow_run_uri), __base__=Config
)

@asset(
name=name,
io_manager_key="pyfunc_model_io_manager",
)
def _model_asset(config: PyfuncConfig):
return config.mlflow_run_uri

return _model_asset


def get_mlflow_io_manager(
key: str,
mlflow_interface: MlflowInterface | None = None,
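
The `create_model` trick the comment describes can be demonstrated with plain pydantic — here `BaseModel` stands in for dagster's `Config`, and the run URI is a placeholder. The dynamically built class carries a default for `mlflow_run_uri`, which is what lets the default surface in the Dagster UI:

```python
from pydantic import BaseModel, create_model


def make_config(name: str, mlflow_run_uri: str):
    """Dynamically build a config class with a defaulted mlflow_run_uri field."""
    return create_model(
        f"PyfuncConfig{name}",
        mlflow_run_uri=(str, mlflow_run_uri),
        __base__=BaseModel,
    )


LayoutlmConfig = make_config("Layoutlm", "runs:/example-run-id/model")
# The default is baked into the generated class...
assert LayoutlmConfig().mlflow_run_uri == "runs:/example-run-id/model"
# ...but can still be overridden per-instance.
assert LayoutlmConfig(mlflow_run_uri="runs:/other/model").mlflow_run_uri == "runs:/other/model"
```

A hard-coded `Config` subclass would work too, but then every model asset made by the factory would share one class and one default rather than getting its own.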
26 changes: 26 additions & 0 deletions src/mozilla_sec_eia/library/mlflow/mlflow_io_managers.py
@@ -26,6 +26,32 @@ def _get_run_info(self) -> Run:
return mlflow.get_run(self.mlflow_interface.mlflow_run_id)


class MlflowPyfuncModelIOManager(MlflowBaseIOManager):
"""IO Manager to load pyfunc models from tracking server."""

uri: str | None = None

def handle_output(self, context: OutputContext, model_uri: str):
"""Takes model uri as a string and caches the model locally for future use."""
cache_path = self.mlflow_interface.dagster_home_path / "model_cache"
cache_path.mkdir(exist_ok=True, parents=True)

logger.info(f"Caching {context.name} model at {cache_path}")
mlflow.pyfunc.load_model(
model_uri,
dst_path=cache_path,
)

def load_input(self, context: InputContext):
"""Load pyfunc model with mlflow server."""
cache_path = (
self.mlflow_interface.dagster_home_path / "model_cache" / context.name
)
logger.info(f"Loading {context.name} model from {cache_path}")

return mlflow.pyfunc.load_model(cache_path)


class MlflowPandasArtifactIOManager(MlflowBaseIOManager):
"""Implement IO manager for logging/loading dataframes as mlflow artifacts."""

15 changes: 11 additions & 4 deletions src/mozilla_sec_eia/library/model_jobs.py
@@ -25,6 +25,7 @@ def create_production_model_job(
job_name: str,
assets: list[AssetsDefinition],
concurrency_limit: int | None = None,
tag_concurrency_limits: list[dict] | None = None,
**kwargs,
) -> JobDefinition:
"""Construct a dagster job and supply Definitions with assets and resources."""
@@ -39,10 +40,16 @@
}
},
}
if concurrency_limit is not None:
config["execution"] = {
"config": {"multiprocess": {"max_concurrent": concurrency_limit}}
}
if (concurrency_limit is not None) or (tag_concurrency_limits is not None):
config["execution"] = {"config": {"multiprocess": {}}}
if concurrency_limit is not None:
config["execution"]["config"]["multiprocess"][
"max_concurrent"
] = concurrency_limit
else:
config["execution"]["config"]["multiprocess"][
"tag_concurrency_limits"
] = tag_concurrency_limits

return define_asset_job(
job_name,
80 changes: 80 additions & 0 deletions src/mozilla_sec_eia/models/sec10k/README.rst
@@ -3,6 +3,86 @@ sec10k: Extracting company ownership data from sec10k documents

This repo contains exploratory development for an SEC-EIA linkage.

Models
------
Basic 10k
^^^^^^^^^
The extraction model for basic 10k company information is simple and requires no
training. It is a rules-based parser that finds key-value pairs of company
information embedded in the header of every 10k filing.
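
A minimal sketch of that style of rules-based key-value parsing — the header format shown here is illustrative, not the exact SEC layout:

```python
import re


def parse_header(header: str) -> dict[str, str]:
    """Extract 'KEY: value' pairs from an uppercase-keyed filing header."""
    pairs = {}
    for line in header.splitlines():
        match = re.match(r"\s*([A-Z][A-Z .]+?):\s+(.+?)\s*$", line)
        if match:
            pairs[match.group(1).strip()] = match.group(2)
    return pairs


header = """
COMPANY CONFORMED NAME: ACME POWER CORP
CENTRAL INDEX KEY: 0000123456
STATE OF INCORPORATION: DE
"""
info = parse_header(header)
assert info["COMPANY CONFORMED NAME"] == "ACME POWER CORP"
assert info["CENTRAL INDEX KEY"] == "0000123456"
```

Because the header format is standardized across filings, this kind of parser needs no training data, which is why the basic 10k model sits apart from the learned exhibit 21 models below.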

Exhibit 21
^^^^^^^^^^
Exhibit 21 extraction is much more complicated and requires pretrained models that are
cached with our mlflow tracking server. Currently, there are two models, implemented
in the ``notebooks/`` directory. These notebooks use
`Dagstermill <https://docs.dagster.io/integrations/dagstermill/using-notebooks-with-dagster>`_,
so they can be run interactively like any normal Jupyter notebook, or run in a Dagster
job.

Extraction
""""""""""
The primary extraction model is implemented in ``notebooks/exhibit21_extractor.ipynb``.
This model is based on
`layoutlm <https://huggingface.co/microsoft/layoutlmv3-base>`_ with custom inference logic
to construct a table of ownership information from an exhibit 21 document. The
layoutlm model and the inference model are logged separately with mlflow. This
separation allows testing minor modifications to the inference portion against the
same pretrained layoutlm model.

There are currently two configuration parameters used by the extraction model
notebook:

* ``layoutlm_training_run``: The name of an existing mlflow run that was used to
  train layoutlm and has a logged model associated with it. If ``None``, layoutlm
  will be trained when the notebook is run, and the new training run will be used
  for inference and validation.
* ``training_data_version``: A GCS folder containing training data to use with
  layoutlm. If ``layoutlm_training_run`` is set, this parameter is ignored, as
  layoutlm will not be re-trained when the notebook is executed.
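
In the Dagster launchpad, these parameters end up in run config along these lines. This is a hedged sketch: the asset name and exact config nesting are assumptions, and the run name and data version are placeholders.

```python
# Illustrative run config for the extraction notebook asset.
run_config = {
    "ops": {
        "exhibit21_extractor": {
            "config": {
                # Reuse an existing layoutlm training run by name...
                "layoutlm_training_run": "existing-run-name",
                # ...in which case this is ignored; set the run to None to
                # retrain layoutlm from this training data instead.
                "training_data_version": "v1",
            }
        }
    }
}
```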

The notebook also depends on several upstream dagster assets, which produce training and
validation datasets. Using upstream assets allows these datasets, which are relatively
expensive to produce, to be easily cached and reused while iterating on the model.
These upstream assets must be materialized before the notebook can be run. Re-materialize
them if you want to modify the training or validation data; otherwise, the notebook can
be re-run as many times as desired with the existing data.

Layout Classification
"""""""""""""""""""""
The second model is a classifier that labels filings as either having a 'paragraph'
layout or not. This is needed because the extraction model performs poorly on documents
formatted as paragraphs rather than tables. For now we will likely just filter out these
results, but we could also develop a separate extraction model that handles such
documents better.

This model is located in ``notebooks/exhibit21_layout_classifier.ipynb``, and it also
depends on upstream assets that produce training data, which need to be materialized
before running the notebook.

Training the Models
"""""""""""""""""""
The models are trained by running the notebooks, either interactively like a normal
notebook or through dagster directly.

Either way, you will first need to produce the upstream data assets:

1. Launch dagster from the repo root with the ``dagster dev`` command
2. Locate the training job in question using the web UI
3. Select the upstream assets by holding shift and clicking each asset, excluding
   the notebook asset
4. Click ``Materialize all`` in the UI

Once this is complete, you can launch Jupyter and run the notebooks interactively
as you would any other notebook. The first cell loads the upstream assets and sets
configuration, which you can modify directly in the notebook as normal.

To run a notebook in dagster, execute it like any other asset. You can first set
configuration in the dagster launchpad if desired; when execution completes, click
on the asset to view the fully rendered notebook.

Usage
-----
