
[CI] Tests timing out occasionally #4209

Closed
IAlibay opened this issue Jul 25, 2023 · 33 comments · Fixed by #4707 · May be fixed by #4584

Comments

@IAlibay
Member

IAlibay commented Jul 25, 2023

For some reason, some of the tests time out every so often. It has mostly been the Ubuntu full-deps py3.10 GitHub Actions jobs, but we've seen it on the macOS and Azure runners too.

It's rather hard to debug, and I'm currently cycling CI with pytest-timeout in #4197.

@IAlibay
Member Author

IAlibay commented Jul 25, 2023

From a look at the top 50 slowest, there seems to be some kind of systematic increase in test run times in the failing runners.

Note: test_streamplot_2D is > 20s even on a good day!

@IAlibay
Member Author

IAlibay commented Jul 28, 2023

@MDAnalysis/coredevs this is actually a pretty big issue, if anyone has any ideas here that'd be appreciated.

@IAlibay
Member Author

IAlibay commented Jul 28, 2023

My initial suspicion is that this might be happening: pytest-dev/pytest#11174

Need to gather a bit more evidence though (which annoyingly means re-running the tests until we can gather more random failures).

A potential short-term solution is to just stick pytest-timeout on, but it really only shifts the problem from a job not finishing to one failing (so it counts as a failure either way and CI remains red).
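For illustration, a minimal sketch of what sticking pytest-timeout on could look like (a hedged example: the 200 s value and the test name are placeholders, not our actual CI configuration):

import pytest

# Placeholder test and limit: with the pytest-timeout plugin installed, a
# per-test limit turns a hang into an ordinary failure after the given
# number of seconds.
@pytest.mark.timeout(200)
def test_that_sometimes_hangs():
    ...

# The same limit can also be applied globally on the command line, e.g.:
#   pytest --timeout=200 MDAnalysisTests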

@IAlibay
Member Author

IAlibay commented Jul 30, 2023

Unfortunately I've been unable to narrow down where this is happening. From runs on #4197 my current conclusions are:

  1. The frequency at which these stalling runs occur is somewhat low. It affects nearly every PR at some point, but we're looking at maybe a ~1/20 chance.
  2. There is no specific test which times out (once it was a multiprocessing one, another time nsgrid...).
  3. It is not isolated to github actions (we saw it happen on azure pipelines).

My current, completely random guess is that it has something to do with the latest pytest 7.4.0 release and maybe coverage (I haven't seen it happen on a runner that isn't using pytest-cov functionality).

I've run out of ideas for now and I don't think I have the time to debug any further, so I'm going to propose the following:

  1. We add pytest-timeout to our CI runs. At least runners will complete even though they will fail. That will save a few trees and allow us to just ignore timeout issues if we don't think they are relevant.
  2. We keep this issue open and re-assess over time.

@hmacdope
Member

Sounds good to me @IAlibay, thanks for putting in a great effort trying to figure it out.

@RMeli
Member

RMeli commented Jul 30, 2023

Sounds very sensible. Thanks for all the work!

tylerjereddy added a commit to tylerjereddy/mdanalysis that referenced this issue Aug 1, 2023
* speed up `test_streamplot_2D()` because it has been
reported to take ~20 seconds on a regular basis
in CI in MDAnalysisgh-4209

* we don't really need to plot the output, which
was taking most of the time, and instead we can
just check the data structures that MDAnalysis
returns (this may be a better test by some definitions
anyway...); I suppose we could also spot check
a few values in the arrays if we wanted as well

* locally, that single test seems to run in 0.39 s
on this branch vs. 4.7 s on `develop`
@drew-parsons

drew-parsons commented Aug 3, 2023

I'm seeing a consistent timeout (after 2.5 hours) in mdanalysis 2.5.0 on i386 after updating gsd to v3, see https://ci.debian.net/packages/m/mdanalysis/testing/i386/

The timeout occurs in testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py, so I suspect it might be a race condition in multiprocessing and hence a separate issue from the timeout issue discussed here.

@IAlibay
Member Author

IAlibay commented Aug 3, 2023

Interesting, I completely forgot that the latest version of GSD had a very strange memory buffer thing going on.

We did see tests time out in other places though, so I'm not fully convinced it's the sole issue here, but it is definitely worth investigating.

@IAlibay
Member Author

IAlibay commented Aug 3, 2023

Although @drew-parsons, note that MDAnalysis 2.5.0 doesn't support GSD 3.0+. The GSD release came after the 2.5.0 release, so we didn't get to retrospectively pin it in our setup.py, but it is reflected in our conda recipe.

We're pretty swamped with just the two package indexes (PyPI and conda-forge), so this kind of stuff tends to completely fall off our radar. Is there any way we can better sync up / report on this?

@drew-parsons

True, I patched MDAnalysis 2.5.0 with PR #4153 and PR #4174.

@IAlibay
Member Author

IAlibay commented Aug 4, 2023

@drew-parsons, I would very much appreciate it if you didn't have to do that. Ideally the source code which we provide for releases here should match what folks encounter in the wild.

If you need us to do a bugfix release then I am more than happy to do that.

Would you maybe be interested in having a quick call to discuss your needs here? I don't fully understand the Debian packaging ecosystem and I would like to make sure we are providing the right things where possible.

If this is of interest, you could ping us an email at [email protected], or maybe we can discuss this more on our Discord (https://discord.gg/fXTSfDJyxE)?

@drew-parsons

In a sense it's not a huge issue. Only i386 is failing here; the other arches are happy. As far as the Debian distribution goes, the issue is in Debian unstable, which is "supposed" to break from time to time. We'll fix it. The reason Debian needed gsd 3 is that gsd 2.7 was failing to build with the latest Python tools, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1042136. It was simplest to upgrade gsd to v3, where this was fixed. gsd is being relegated to optional anyway from the perspective of mdanalysis, so a temporary disruption on i386 is not a big problem (we can just skip the test if necessary). But a bugfix release would also be welcome. It's good not to skip tests if we can run them successfully.

@drew-parsons

Worth noting, the i386 test passes now.
https://ci.debian.net/data/autopkgtest/testing/i386/m/mdanalysis/36427124/log.gz

IAlibay pushed a commit that referenced this issue Aug 10, 2023
* TST: test_streamplot_2D faster

* speed up `test_streamplot_2D()` because it has been
reported to take ~20 seconds on a regular basis
in CI in gh-4209

* we don't really need to plot the output, which
was taking most of the time, and instead we can
just check the data structures that MDAnalysis
returns (this may be a better test by some definitions
anyway...); I suppose we could also spot check
a few values in the arrays if we wanted as well

* locally, that single test seems to run in 0.39 s
on this branch vs. 4.7 s on `develop`

* TST: PR 4221 revisions

* use `pytest.approx` for single value comparisons

* `u1` and `v1` are now checked more thoroughly for
their actual floating point values
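To illustrate the pattern described in the commit messages above (assertions on the returned arrays instead of plotting), here is a hedged sketch; u1 and v1 stand for the arrays named in the commit, and the shape and spot-checked value are made up for illustration:

import numpy as np
import pytest

def check_streamline_arrays(u1, v1):
    # Structural checks on the returned displacement arrays instead of
    # rendering a figure (the expected values below are illustrative only).
    assert u1.shape == v1.shape
    assert np.isfinite(u1).all() and np.isfinite(v1).all()
    # Spot-check a single value with a tolerance rather than exact equality.
    assert u1[0, 0] == pytest.approx(0.0, abs=1e-6)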
@IAlibay
Member Author

IAlibay commented Sep 2, 2023

This is still an issue; we are still getting a ton of failures with test_creating_multiple_universe_without_offset. Can one of the @MDAnalysis/coredevs please step up and actively try to fix this?

@IAlibay IAlibay removed the downstream label Sep 2, 2023
@orbeckst
Member

orbeckst commented Sep 2, 2023

test_creating_multiple_universe_without_offset

def test_creating_multiple_universe_without_offset(temp_xtc, ncopies=3):
is a bit of an odd test: it was supposed to test that we fixed a race condition but the comments say that it doesn't actually test the situation:

    #  test if they can be created without generating
    #  the offset simultaneously.
    #  The tested XTC file is way too short to induce a race scenario but the
    #  test is included as documentation for the scenario that used to create
    #  a problem (see PR #3375 and issues #3230, #1988)

@yuxuanzhuang you wrote the test. Do you have some more insight into what might be happening here? Are we using some sort of file-locking that might block everything?

I have seen this test come up (always?) when I restarted a timed-out CI runner, so it makes sense to figure this one out and then see if anything else is odd.
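For context, a rough sketch of the scenario the test documents, i.e. several processes building a Universe from the same XTC at once (this is not the actual test code; file names and pool size are placeholders):

import multiprocessing as mp
import MDAnalysis as mda

# Placeholder topology/trajectory paths; the real test uses a temporary XTC fixture.
TOP = "topology.gro"
XTC = "trajectory.xtc"

def load(_):
    # Each worker builds its own Universe from the same XTC; reading the
    # trajectory may (re)generate the offset file, which is where a race
    # between processes could in principle occur.
    u = mda.Universe(TOP, XTC)
    return len(u.trajectory)

if __name__ == "__main__":
    with mp.Pool(processes=3) as pool:
        print(pool.map(load, range(3)))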

@yuxuanzhuang
Contributor

yuxuanzhuang commented Sep 7, 2023

I have been battling with it over the past couple of days, but none of my attempts seem to be working. The branch I am currently testing on is the one with #4162 merged, so it has a much higher failure rate :) (yuxuanzhuang#6).

Here are the attempts that have failed:

  • Running pytest without multiple workers.
  • Removing the fasteners (file-locking) code.
  • Disabling test_creating_multiple_universe_without_offset (the run can then fail elsewhere when a multiprocessing pool is initiated). As a side note, I have no idea why the tests passed last time we disabled it in @marinegor's PR; apparently, it's a stochastic thing :)
  • Adding mp.set_start_method("spawn", force=True) as suggested in the comments above (see the sketch at the end of this comment).
  • Using --dist loadgroup to force all tests that use parallel code to run in the same worker (since running pytest with only one worker also failed, this obviously didn't work either).
  • Adding p.join() to wait for all child processes to terminate.

The only solution seems to be disabling all tests that involve multiprocessing (dask, on the other hand, seems fine).

I believe the test timeout is related to the multiprocessing pool not being able to terminate when a lot of jobs are present. Given that it's almost a 100% failure rate if parallel analysis codes are merged, compared to a 1/20 chance (?), I would assume that the more multiprocessing pools there are, the higher the chance of failure.

The other problem is that it never failed on my local workstation or on the compute cluster nodes—so I need a 30-minute testing cycle to rule out any possible cases on GitHub, which is really frustrating. Moreover, sometimes the timeout appears as a test failure, and sometimes it simply doesn't show up except in the duration (e.g. here).
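The spawn start-method attempt from the list above was presumably along these lines (a sketch under that assumption, not the actual change that was tested):

import multiprocessing as mp

if __name__ == "__main__":
    # Force the "spawn" start method so that workers start from a fresh
    # interpreter instead of forking the parent process and its state.
    mp.set_start_method("spawn", force=True)

    with mp.Pool(processes=4) as pool:
        print(pool.map(abs, range(4)))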

@IAlibay
Member Author

IAlibay commented Sep 7, 2023

@yuxuanzhuang that last test result you point to is a codecov failure, that's a different problem?

@yuxuanzhuang
Contributor

that last test result you point to is a codecov failure, that's a different problem?

I think the codecov failure is just because I hacked the repository to run the test on my own repo?

@IAlibay
Member Author

IAlibay commented Sep 7, 2023

Are you talking about this failed runner? https://github.com/yuxuanzhuang/mdanalysis/actions/runs/6074912996/job/16479999506

@yuxuanzhuang
Contributor

yes

[2023-09-04T15:14:35.507Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 400 - [ErrorDetail(string='This repository has been deactivated. To resume uploading to it, please activate the repository in the codecov UI: https://app.codecov.io/github/yuxuanzhuang/mdanalysis/settings', code='invalid')]

@IAlibay
Member Author

IAlibay commented Mar 10, 2024

@hmacdope I'm assigning this to you since you had done so in #4475

@orbeckst
Member

Primarily I now see test_creating_multiple_universe_without_offset() failing across multiple Python versions and architectures. This is getting annoying. I prefer to merge PRs when everything is green but the last three merges that I did all had to override failures due to timing out on test_creating_multiple_universe_without_offset().

@hmacdope
Member

hmacdope commented Apr 1, 2024

@orbeckst I will try and dig into this one ASAP.

@yuxuanzhuang
Contributor

I believe I’ve finally identified the culprit behind the timeout: the SIGTERM handler introduced by GSD: https://github.com/glotzerlab/gsd/blob/trunk-patch/gsd/__init__.py.

I’m not very familiar with Python's signal handling, so I welcome any insights or suggestions from others on whether we can address it downstream or need to contact the GSD people.

Below is a minimal example that reproduces the hanging issue on my laptop (Mac M3 Pro):

for i in {1..100}; do python test_sig.py; done
...
Exception ignored in: <function _get_module_lock.<locals>.cb at 0x1016ae160>
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 451, in cb
  File "/Users/scottzhuang/jupyter_ground/sig.py", line 5, in <lambda>
    signal.signal(signal.SIGTERM, lambda n, f: sys.exit(1))
                                               ^^^^^^^^^^^
SystemExit: 1

test_sig.py

#import logging
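# NOTE: importing sig installs the SIGTERM handler at import time, mirroring what `import gsd` does.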
import sig

#logger = logging.getLogger("mp_test")

if __name__ == "__main__":
    from multiprocessing import Pool, log_to_stderr

    # Uncomment for logging
    # logger = log_to_stderr()
    # logger.setLevel(logging.DEBUG)

    with Pool(processes=12) as pool:
        results = pool.map(sig.get_number, range(12))
    print(results)

sig.py

import signal
import sys

# Introduced in GSD (https://github.com/glotzerlab/gsd/blob/trunk-patch/gsd/__init__.py)
try:
    signal.signal(signal.SIGTERM, lambda n, f: sys.exit(1))
except ValueError:
    pass

def get_number(x):
    return x + 1

@orbeckst
Member

orbeckst commented Sep 8, 2024

FYI: There was some concern about this code raised by @IAlibay in glotzerlab/gsd#255 .

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

Ahhhh this makes a lot of sense. I had completely forgotten about that one - indeed the way GSD is doing this seems like it would be prone to issues.

Thanks so much for all your work identifying this, @yuxuanzhuang!

My take is that we should just raise this upstream and remove GSD from the relevant multiprocessing tests; it's not something that can be fixed without a patch upstream.

@hmacdope
Member

hmacdope commented Sep 8, 2024

Amazing work @yuxuanzhuang! Thanks so much for diving deep on this, much appreciated.

@yuxuanzhuang
Contributor

A few additional thoughts:

  • Since gsd is already imported in our initial import, simply removing GSD files from multiprocessing testing won’t help, as the job will still hang when the worker is just running import MDAnalysis, even though it’s not doing anything GSD-related.
  • One approach could be to lazily import gsd only when it is actually needed (see the sketch at the end of this comment).
  • Another option is to reset the signal handler using signal.signal(signal.SIGTERM, signal.SIG_DFL) right before starting any multiprocessing tasks on our end.

Since the second approach could potentially cause unforeseen interference with other packages, I would prefer the first approach if it’s feasible.
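A minimal sketch of the lazy-import idea mentioned above (a hypothetical reader class, not the actual MDAnalysis GSD reader code):

class GSDReaderSketch:
    """Hypothetical reader: defer the gsd import until a GSD file is opened."""

    def __init__(self, filename):
        # Lazy import: gsd (and the SIGTERM handler its __init__ installs) is
        # only pulled in when a GSD trajectory is actually used, so a plain
        # `import MDAnalysis` in a worker process is unaffected.
        import gsd.hoomd
        self._gsd = gsd.hoomd
        self.filename = filename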

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

One approach could be to lazily import gsd only when it is actually needed.

GSD is an optional dependency; this is something we should be doing as much as possible.

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

My suggested approach:

  1. Report the issue upstream and see what they say.
  2. If it doesn't look like it will be fixed any time soon, then we look into lazy importing where necessary.
  3. If that looks like it would take more than a day of work, then we remove GSD from our optional test dependencies for the standard CI. We can keep GSD on our cron jobs, which is less disruptive / easier to just restart until it passes, and we can have an optional xfail if GSD is installed (see the sketch below).
  4. If a fix doesn't look like it would arrive by our next release +1, we deprecate GSD.
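A hedged sketch of the optional xfail from point 3 (the marker name and reason text are made up for illustration):

import importlib.util

import pytest

HAS_GSD = importlib.util.find_spec("gsd") is not None

# Illustrative marker: expect these tests to fail when gsd is installed,
# without treating the whole run as broken.
xfail_if_gsd = pytest.mark.xfail(
    HAS_GSD, reason="gsd installs a SIGTERM handler that can hang multiprocessing tests"
)

@xfail_if_gsd
def test_multiprocessing_related():
    ...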

@IAlibay
Member Author

IAlibay commented Sep 8, 2024

Actually, I would even recommend we move towards option 3 either way; we can always revert the CI inclusion later.

@orbeckst
Member

orbeckst commented Sep 9, 2024

Really, anything to get the CI to a state where we can just get a green pass for normal PRs... Option 3 sounds sensible.

If we xfail GSD tests in standard CI, should we reduce the timeout duration from 200s to, say, 20s, so that these failures happen more quickly and don't hold up runners?
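For illustration, shortening the limit per test could look something like this with pytest-timeout (the 20 s value follows the suggestion above; the test name is a placeholder):

import pytest

# A much shorter per-test limit for GSD-related tests, so a hang fails after
# roughly 20 s instead of occupying the runner for the full 200 s.
@pytest.mark.timeout(20)
def test_gsd_related_case():
    ...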

@orbeckst orbeckst assigned yuxuanzhuang and unassigned hmacdope Sep 9, 2024
@orbeckst
Member

orbeckst commented Sep 9, 2024

I assigned @yuxuanzhuang, primarily to indicate who's been doing the expert code sleuthing here ;-) 🔎
