
Investigate CI failures #56

Closed
wants to merge 4 commits

Conversation

@inducer (Owner) commented Feb 4, 2021

@inducer (Owner, Author) commented Feb 4, 2021

Huh. Works like a charm. I'll rerun to see if it's a fluke.

@alexfikl (Collaborator) commented Feb 4, 2021

Seems to have failed again. Hm, so it's not numpy, and it's not related to anything in #55.

@inducer changed the title from "Test numpy < 1.20 from conda-forge" to "Investigate CI failures" on Feb 4, 2021
@inducer (Owner, Author) commented Feb 4, 2021

Thanks @isuruf for checking for sumpy influence!

I just added a run without

cc @kaushikcfd

@kaushikcfd commented

Hmm. CI/examples passed when an older loopy is pinned. I don't have a fix yet, but:

  1. Locally, laplace-dirichlet-3d.py passes for me (for both loopy versions).
  2. The iname tags of the kernels generated by the two commits with the discrepancy are identical.

@inducer (Owner, Author) commented Feb 4, 2021

So the first go passed. Now running a second round to see if it's repeatable.

@inducer (Owner, Author) commented Feb 5, 2021

For perspective, the examples failure is distinct; the full story is at #57. This run did not exhibit the failure I am concerned about here, which is the main Linux pytest failure. I'll rerun again to see if it holds up.

@alexfikl (Collaborator) commented Feb 6, 2021

Did the Linux tests fail after numpy was bumped down? I only remember the examples failing for a while now.

Those failures really looked like some sort of out of memory issue. Did https://gitlab.tiker.net/inducer/pytential/-/issues/131 ever improve?

@inducer (Owner, Author) commented Feb 8, 2021

The only way I know to curb this misery going forward is running downstream CI along with upstream projects, in this case loopy, as I propose here: inducer/loopy#220. If that works out, I'll probably apply the same idea to meshmode (inducer/meshmode#113).
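The downstream-CI idea can be sketched as a hypothetical GitHub Actions job fragment. (The job name, step layout, and install order here are made up for illustration; the actual setup is whatever lands in inducer/loopy#220.)

```yaml
# Hypothetical job in loopy's own CI: run pytential's test suite
# against the loopy branch under test, so breakage surfaces upstream.
downstream-pytential:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2   # checks out the loopy branch being tested
    - name: Run downstream pytential tests
      run: |
        pip install .             # install this loopy checkout, not a pinned release
        git clone https://github.com/inducer/pytential
        cd pytential
        pip install -e .
        python -m pytest test/
```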

@inducer (Owner, Author) commented Feb 8, 2021

> Did the Linux tests fail after numpy was bumped down?

I don't recall such an instance. I'll bump numpy back up here, to check. But I don't expect it to fail.

> Those failures really looked like some sort of out of memory issue.

I agree, though I don't see (yet?) how the loopy PRs would inflate memory usage in a substantial fashion.

> Did https://gitlab.tiker.net/inducer/pytential/-/issues/131 ever improve?

No, it didn't. It just wasn't bad enough to be a problem. In addition, there's a similar-looking mystery (illinois-ceesd/mirgecom#212) being chased down in mirgecom.

@alexfikl (Collaborator) commented Feb 8, 2021

> I don't recall such an instance. I'll bump numpy back up here, to check. But I don't expect it to fail.

It seems to have failed, rerun?

> No, it didn't. It just wasn't bad enough to be a problem. In addition, there's a similar-looking mystery (illinois-ceesd/mirgecom#212) being chased down in mirgecom.

Maybe it's worth adding a memory pool already in `pytest_generate_tests_for_pyopencl_array_context`? Although, yeah, that would just hide the issue.
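For context, this is roughly what a memory pool changes about allocation behavior: freed buffers are retained and handed back out for later allocations instead of being returned to the driver. The sketch below is a pure-Python toy standing in for the real machinery (with pyopencl that would be `pyopencl.tools.MemoryPool` wrapped around an allocator), so treat `ToyMemoryPool` and its behavior as illustrative only.

```python
from collections import defaultdict

class ToyMemoryPool:
    """Toy stand-in for a device memory pool: reuse freed buffers by size."""

    def __init__(self):
        self._free = defaultdict(list)   # size -> list of reusable buffers
        self.fresh_allocations = 0       # how often we had to "ask the driver"

    def allocate(self, size):
        if self._free[size]:
            return self._free[size].pop()   # reuse a previously freed buffer
        self.fresh_allocations += 1
        return bytearray(size)              # stands in for a device allocation

    def free(self, buf):
        # Keep the buffer for reuse instead of releasing it to the driver.
        self._free[len(buf)].append(buf)

pool = ToyMemoryPool()
a = pool.allocate(1024)
pool.free(a)
b = pool.allocate(1024)          # served from the pool, no fresh allocation
assert pool.fresh_allocations == 1
```

The upshot for CI is that allocation churn (and, depending on the allocator, fragmentation) goes down, which is exactly why it would also mask a genuine leak.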

@inducer (Owner, Author) commented Feb 8, 2021

> It seems to have failed, rerun?

Wha? 🤯

Sure, I'll rerun, but now I don't know what to believe. Is this something that's brought about by numpy, by loopy, or by both?

@inducer (Owner, Author) commented Feb 8, 2021

> Maybe it's worth adding a memory pool already in `pytest_generate_tests_for_pyopencl_array_context`? Although, yeah, that would just hide the issue.

Ugh, no. Not a fan of sweeping stuff under the rug.

@alexfikl (Collaborator) commented Feb 8, 2021

> Sure, I'll rerun, but now I don't know what to believe. Is this something that's brought about by numpy and the loopy change together?

Just to add another variable: looking at the CI history, last scheduled run on Ubuntu 18.04 passed just fine, but then the next ones on Ubuntu 20.04 started failing. Can we pin it to Ubuntu 18.04 to see if that passes reliably?
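Pinning the runner would be a one-line workflow change, sketched below. (This assumes a standard GitHub Actions setup where the job currently uses `ubuntu-latest`, which by early 2021 resolved to 20.04; the actual workflow file may be laid out differently.)

```yaml
jobs:
  pytest:
    # Pin to 18.04 instead of ubuntu-latest (now 20.04) to test the OS theory
    runs-on: ubuntu-18.04
```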

Besides that, no idea, since it seems to pass intermittently.

@inducer (Owner, Author) commented Feb 8, 2021

Hmm, so possibly the common theme among all these changes (newer ubuntu, loopy PRs, numpy 1.20) is just that they each ever so slightly increase memory usage...

@inducer (Owner, Author) commented Feb 8, 2021

Passed this time around, FWIW.

@inducer (Owner, Author) commented Feb 9, 2021

Alright, I'm now super confused. Reverting the Loopy PRs that we suspected caused problems actually did exactly nothing to help inducer/loopy#220 pass. So that theory is pretty dead in the water to me.

@inducer (Owner, Author) commented Feb 9, 2021

My next best plan is to go hunt this stupid memory leak. Grrr.

@inducer (Owner, Author) commented Feb 9, 2021

illinois-ceesd/mirgecom#212 if you'd like to follow the saga.

@inducer (Owner, Author) commented Feb 10, 2021

Using jemalloc for the CI (#58) seems to help. See illinois-ceesd/mirgecom#212 for more details. Closing here.
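For reference, swapping in jemalloc on a CI runner typically looks something like the fragment below. (The package and library path shown are for Ubuntu 20.04's `libjemalloc2`; treat the exact filename as an assumption, and see #58 for the actual change.)

```sh
# Install jemalloc and preload it so the test process allocates through it
sudo apt-get install -y libjemalloc2
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
python -m pytest test/
```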

@inducer inducer closed this Feb 10, 2021
@inducer inducer deleted the is-numpy-1.20-busted branch February 10, 2021 18:53