Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crusher Pipeline Status #33

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from
Open

Conversation

cameronrutherford
Copy link
Contributor

@cameronrutherford cameronrutherford commented Oct 10, 2023

Closes #9

@cameronrutherford
Copy link
Contributor Author

PR is in good shape, but for some reason we aren't able to trigger Crusher pipelines through a pipeline trigger at the moment https://code.olcf.ornl.gov/ci/csc359/dev/exago/-/issues/5

@cameronrutherford
Copy link
Contributor Author

Seems like we will only be able to run on develop on a schedule, and then for PRs you can manually trigger in the web UI.

This also means that spack modules will have to be maintained by hand by @nkoukpaizan and others for the meantime.

@cameronrutherford
Copy link
Contributor Author

Fun bug here was pre-commit can't auto-fix GitHub workflow pipelines without more permissions. Leaving this as a feature, not a bug for now unless someone objects.

@cameronrutherford
Copy link
Contributor Author

cameronrutherford commented Oct 10, 2023

This is almost ready for review. Thanks @nkoukpaizan for getting the updated modules in place to make this relatively trivial :)

@cameronrutherford
Copy link
Contributor Author

Sbatch account needs updating, and for some reason the success pipeline is triggering as well as the failure one.

The needs relationship between pipelines should also be addressed to run in parallel / without blocking.

@cameronrutherford
Copy link
Contributor Author

cameronrutherford commented Oct 11, 2023

Getting some test failures on Crusher. TODO:

  • Check for identical failures on other CI platforms
  • Move relevant GitHub issues over from OPFLOW/SOPFLOW/SCOPFLOW
  • Try updating HiOp
  • Add CMake label to let tests fail
  • Document in GitHub issues

@cameronrutherford
Copy link
Contributor Author

cameronrutherford commented Oct 11, 2023

The following tests FAILED:
	 20 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)
	 33 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_MPI_TESTSUITE (Failed)
	 34 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_SERIAL_TESTSUITE (Failed)
	 40 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_TOML (Failed)

20 and 40 are new failures to me. 33 and 34 are already skipped on other testing platforms. This will be re-visited after P6 and JIRA have been closed out, and will likely come in next release unless someone has some time to take a look.

cc @jaelynlitz since you were dealing with the latest relevant GitLab issue https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/issues/459

Since we likely need to update HiOp, lets wait on things like #20 and #30 to be merged before re-visiting this IMO

@cameronrutherford
Copy link
Contributor Author

@abhyshr Since the pipelines are functional, another option that I did not initially consider is just merging and having failing pipelines on Crusher for the moment. The issues may be resolved in the end through hiop updates and other bugfixes, so we can always re-visit and see if there are platform specific bugs...

@@ -102,6 +102,7 @@
- rm -rf "$WORKDIR"

Ascent Build:
needs: []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am struggling to understand the difference between dependencies and needs in gitlab terminology but other than my lack of knowledge these changes look reasonable. Great work.

PS:
Dependencies just inherit artifacts form previous jobs.
Needs lets you run out of order of staging .
At least that my understanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is correct. I had to google to understand this. needs: [] just lets a job run ASAP when the workflow starts. We probably only needs dependencies for reporting jobs on PNNL where partition matters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might just cherry-pick this fix into #20

@jaelynlitz
Copy link
Contributor

The following tests FAILED:
	 20 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)
	 33 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_MPI_TESTSUITE (Failed)
	 34 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_SERIAL_TESTSUITE (Failed)
	 40 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_TOML (Failed)

20 and 40 are new failures to me. 33 and 34 are already skipped on other testing platforms. This will be re-visited after P6 and JIRA have been closed out, and will likely come in next release unless someone has some time to take a look.

cc @jaelynlitz since you were dealing with the latest relevant GitLab issue https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/issues/459

Since we likely need to update HiOp, lets wait on things like #20 and #30 to be merged before re-visiting this IMO

The 4 test failures are either num iters or objective value differences which is to be expected with changing the version of hiop right? I suppose a question for @abhyshr if the differences are within a tolerance we feel okay with allowing under a warning

For ref:

  • Test 20 - "expected 454 num iters, got 292", "expected 462 num iters, got 484"
  • Test 33 - "expected objective value=25957.56699172251 actual objective value=27557.524961865744 tol=0.001 err=0.061635065242754646", "expected 1 num iters, got 2"
  • Test 34 - "expected objective value=25957.56759147379 actual objective value=27557.524963990003 tol=0.001 err=0.06163504079638527", "expected objective value=4072.4623362595667 actual objective value=4144.460442911344 tol=0.001 err=0.01767491649815754"
  • Test 40 - "expected 129 num iters, got 157"

@abhyshr
Copy link
Collaborator

abhyshr commented Oct 11, 2023

The following tests FAILED:
	 20 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)
	 33 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_MPI_TESTSUITE (Failed)
	 34 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_SERIAL_TESTSUITE (Failed)
	 40 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_TOML (Failed)

20 and 40 are new failures to me. 33 and 34 are already skipped on other testing platforms. This will be re-visited after P6 and JIRA have been closed out, and will likely come in next release unless someone has some time to take a look.
cc @jaelynlitz since you were dealing with the latest relevant GitLab issue https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/issues/459
Since we likely need to update HiOp, lets wait on things like #20 and #30 to be merged before re-visiting this IMO

The 4 test failures are either num iters or objective value differences which is to be expected with changing the version of hiop right? I suppose a question for @abhyshr if the differences are within a tolerance we feel okay with allowing under a warning

For ref:

  • Test 20 - "expected 454 num iters, got 292", "expected 462 num iters, got 484"
  • Test 33 - "expected objective value=25957.56699172251 actual objective value=27557.524961865744 tol=0.001 err=0.061635065242754646", "expected 1 num iters, got 2"
  • Test 34 - "expected objective value=25957.56759147379 actual objective value=27557.524963990003 tol=0.001 err=0.06163504079638527", "expected objective value=4072.4623362595667 actual objective value=4144.460442911344 tol=0.001 err=0.01767491649815754"
  • Test 40 - "expected 129 num iters, got 157"

It is very much possible that the solution changes a bit (objective value and the number of iterations) when there is an update to HiOp (or any of its dependencies). I would check the solution on other platforms and see if you are getting the same test failures as well. In that case, we will need to update the reference/expected solution for the failing tests.

@cameronrutherford
Copy link
Contributor Author

Moving to 1.6.1 for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Crusher CI pipeline status
4 participants