Improve forecast etl performance #28

amehta-scottlogic · 2024-05-20T16:11:31Z

Description

Performance was particularly slow because xarray loads data from a dataset lazily. We access each dataset once per city, which is quite slow without eager loading.

By explicitly calling load, we use more memory but performance improves. Introducing threads also significantly improves the runtime.
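As a rough illustration of the eager-load idea (not the code from this PR), the sketch below calls `Dataset.load()` once and then does a nearest-point lookup per city. The function name `nearest_values`, the variable name `pm2p5`, the coordinate names, and the synthetic grid are all assumptions made for the example.

```python
import numpy as np
import xarray as xr


def nearest_values(
    dataset: xr.Dataset, cities: dict[str, tuple[float, float]]
) -> dict[str, float]:
    """Return the grid value nearest to each city (hypothetical 'pm2p5' variable)."""
    # Read everything into memory once. On a file-backed dataset this replaces
    # a separate lazy read for every city lookup, at the cost of more memory.
    loaded = dataset.load()
    return {
        name: float(loaded["pm2p5"].sel(latitude=lat, longitude=lon, method="nearest"))
        for name, (lat, lon) in cities.items()
    }


# Synthetic stand-in for the GRIB data: a 3x3 grid holding the values 0..8.
grid = xr.Dataset(
    {"pm2p5": (("latitude", "longitude"), np.arange(9.0).reshape(3, 3))},
    coords={"latitude": [50.0, 51.0, 52.0], "longitude": [-1.0, 0.0, 1.0]},
)
values = nearest_values(grid, {"London": (51.4, -0.1), "Paris": (48.9, 2.3)})
```

The synthetic grid already lives in memory, so `load()` changes nothing here; the benefit only appears when the dataset is backed by a file, as with the GRIB files in this ETL.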

We now process all 153 cities in around 2 minutes instead of 10.

Default

15 cities per minute

With eager load

25 cities per minute

With eager load and thread pool

76 cities per minute
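The thread pool step could look something like the minimal sketch below, which uses only the standard library. `transform_city`, `transform_all`, the document shape, and the worker count are hypothetical stand-ins, not the functions from this PR.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_city(city: dict) -> dict:
    """Hypothetical per-city transform; the real one would read forecast values."""
    return {"name": city["name"], "location": [city["longitude"], city["latitude"]]}


def transform_all(cities: list[dict], max_workers: int = 8) -> list[dict]:
    """Run the per-city transform on a pool of worker threads."""
    # executor.map preserves input order, so results line up with `cities`.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(transform_city, cities))


cities = [
    {"name": "London", "latitude": 51.5, "longitude": -0.1},
    {"name": "Paris", "latitude": 48.9, "longitude": 2.3},
]
documents = transform_all(cities)
```

Threads mainly help when the per-city work spends its time in I/O or in native code that releases the GIL, so the speed-up depends on what the real transform does.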

Output

2024-05-20 17:07:46,450 - INFO - Finding data for 153 cities
2024-05-20 17:07:46,450 - INFO - Extracting pollutant forecast data
2024-05-20 17:07:46,462 - INFO - Loading data from CAMS to file single_level_2024-05-20_00.grib
2024-05-20 17:07:47,403 - INFO - Loading data from CAMS to file multi_level_2024-05-20_00.grib
2024-05-20 17:07:49,678 - INFO - Transforming forecast data
2024-05-20 17:09:00,256 - INFO - Persisting forecast data
2024-05-20 17:09:12,220 - INFO - 5049 documents upserted, 0 modified
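For reference, the phase durations can be read off the timestamps above. The short sketch below does that arithmetic; the helper name `seconds_between` is an assumption, and the per-city figure assumes every city produced the same number of documents.

```python
from datetime import datetime

# Timestamp layout used by the log lines above (milliseconds after the comma).
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)
    return delta.total_seconds()


# "Transforming forecast data" to "Persisting forecast data"
transform_seconds = seconds_between("2024-05-20 17:07:49,678", "2024-05-20 17:09:00,256")
# "Persisting forecast data" to the final upsert message
persist_seconds = seconds_between("2024-05-20 17:09:00,256", "2024-05-20 17:09:12,220")
# First log line to last log line
total_seconds = seconds_between("2024-05-20 17:07:46,450", "2024-05-20 17:09:12,220")
# Upserted documents divided by the number of cities
documents_per_city = 5049 / 153
```

On this run the transform step takes about 70.6 seconds, persisting about 12 seconds, and the whole run about 85.8 seconds, with 33 documents per city.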

github-actions · 2024-05-20T16:15:05Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
266	200	75%	0%	🟢

New Files

File	Coverage	Status
air-quality-backend/src/database/location.py	100%	🟢
TOTAL	100%	🟢

Modified Files

File	Coverage	Status
air-quality-backend/src/database/air_quality_dashboard_dao.py	0%	🟢
air-quality-backend/src/etl/forecast/forecast_adapter.py	100%	🟢
air-quality-backend/src/etl/forecast/forecast_dao.py	100%	🟢
air-quality-backend/src/etl/forecast/forecast_data.py	93%	🟢
TOTAL	73%	🟢

updated for commit: 69e77c2 by action🐍

mwalker-scottlogic

looks good to me, probably worth a dev review too

mwalker-scottlogic · 2024-05-21T08:01:48Z

air-quality-backend/src/etl/forecast/forecast_dao.py

+            f"single_level_{model_base_date_str}_{model_base_time}.grib",
+        ),
+        (
+            get_multi_level_request_body(model_base_date_str, model_base_time),


Is there a possibility that creating new files each time this is run could cause issues for users of the application?

amehta-scottlogic added 5 commits May 20, 2024 12:37

Use valid time instead of step, refactor

79bce7a

Fetch CAMS data concurrently

4d507ec

Add thread pool for data transform and load dataset eagerly

fb4ed60

Rename fields to be non city specific

20dfe60

Fix lint

526ad25

Remove redundant method

69e77c2

mwalker-scottlogic approved these changes May 21, 2024

amehta-scottlogic merged commit 2ba3a37 into main May 21, 2024
1 check passed

amehta-scottlogic deleted the feature/forecast-performance branch May 21, 2024 08:09
