Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Timeout issues #2852

Open
Arqu opened this issue Oct 28, 2024 · 4 comments
Open

CI Timeout issues #2852

Arqu opened this issue Oct 28, 2024 · 4 comments
Labels
bug Something isn't working ci continous integration

Comments

@Arqu
Copy link
Collaborator

Arqu commented Oct 28, 2024

Last ~20 days or so we have started seeing CI jobs fail due to stuff seemingly timing out.
The ones affected I've seen so far are: netsim, cross tests, docker.
It's concerning because netsim and docker tests run a plain binary and exhibit this behavior. It's become more prominent in the last 2 weeks and presumably is also affecting users.

Sample links:
https://github.com/n0-computer/iroh/actions/runs/11342921218/job/31544406620?pr=2804
https://github.com/n0-computer/iroh/actions/runs/11354934395/job/31583241218?pr=2807
https://github.com/n0-computer/iroh/actions/runs/11370155916/job/31629269957?pr=2808

Some logs from netsim:

cmd: ./bins/iroh start --add data/1G.bin

2024-10-15T09:52:54.736754Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:52:56.737453Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:53:24.303831Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:53:26.304470Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:53:54.739896Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:53:56.740046Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:54:20.265772Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:54:22.265803Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:54:50.253689Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:54:52.253677Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:55:16.653801Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:55:18.654037Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:55:46.172940Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:55:48.172981Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:56:14.136309Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:56:16.136141Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:56:41.597054Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:56:43.597030Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:57:06.907840Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:57:08.907963Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Some context from discord:

Those are the logs from the failing node running in netsim
Ie the chain of events is:
- flaky start with ./bins/iroh start --add data/1G.bin
- gets in an infinite loop of tick: probes timed out
- netsim waits for 10 sec and keeps checking output to see if it has generated a ticket
- ticket never comes, netsim nukes the process and exits with 1
- netsim job fails


in the past 3 days this has happened:
- many times in netsim
- couple times in docker tests
- now in cross
- unsure but think I've seen one or two elsewhere but at the time didn't document the links

netsim is the most likely to hang as it runs many nodes & sims with the full binary
docker less likely but also happened enough times I caught wind of it
cross is the first time I've seen it but just tells me it might be worse than expected / deeper than just cli hangs
@dignifiedquire dignifiedquire added bug Something isn't working ci continous integration labels Oct 29, 2024
@dignifiedquire dignifiedquire added this to the v0.28.0 milestone Oct 29, 2024
@Arqu
Copy link
Collaborator Author

Arqu commented Oct 29, 2024

Seems like this is also one of those shrodinger bugs, if you look at it, you cant reproduce. Running like 15 tests with debug logging and suddenly it just works.

Seems like its a timing/race somewhere. Or the setup just got significantly slower somehow.

@flub
Copy link
Contributor

flub commented Oct 29, 2024

@Arqu where do i find those logs that you print extracts from? Following the links just gives me opaque failure messages without a debug log of iroh.

@Arqu
Copy link
Collaborator Author

Arqu commented Oct 29, 2024

I've been trying to get a fresh set of these, when you run the netsim job with debug logs you get that output but I had trouble getting it to fail with debug logs enabled again. The original ones I got by sshing into the box directly.

@dignifiedquire dignifiedquire removed this from the v0.28.0 milestone Nov 4, 2024
@dignifiedquire
Copy link
Contributor

One contributing bug was #2876

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ci continous integration
Projects
Status: No status
Development

No branches or pull requests

3 participants