
[🐛 BUG]: execTTL timeout elapsed #1992

Open · 1 task done
cv65kr opened this issue Aug 21, 2024 · 4 comments
Labels: B-bug (Bug: bug, exception), F-need-verification
Milestone: v2024.3

cv65kr (Member) commented Aug 21, 2024

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

[Screenshots: metrics dashboards showing request latency and the request queue]

On the metrics screen you can see that the average request time is 5 ms; randomly, requests start going into the queue, and the logs show messages like:

{"level":"error","ts":1724229813047911732,"logger":"grpc","msg":"method call was finished with error","error":"rpc error: code = Internal desc = worker_exec_with_timeout: ExecTTL: context deadline exceeded","method":"/api.test","start":1724229804042745317,"elapsed":9005}

My config:

  pool:
    num_workers: 4
    destroy_timeout: 30s
    supervisor:
      exec_ttl: 9s # Client timeout
      max_worker_memory: 70

It's pretty interesting because the elapsed time for the request is 9 s, matching the configured exec_ttl. I checked the endpoint logic: it's a simple query that doesn't appear in any database slow log.
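For readers hitting the same error: judging by the error string and the elapsed time matching exec_ttl, worker_exec_with_timeout appears to wrap the whole exchange with the worker in a deadline derived from exec_ttl, so the caller can see "ExecTTL: context deadline exceeded" after exactly exec_ttl even when the application logic itself is fast, as long as the result does not come back in time. Below is a minimal Go sketch of that pattern, assuming a context.WithTimeout around the full exchange; it is illustrative only, not RoadRunner's actual implementation, and execWithTimeout is a made-up name.

```go
// Illustrative sketch only: a TTL wrapped around the full exchange with a
// worker. If the result does not arrive before the deadline, the caller gets
// the same error shape as in the log above, regardless of how fast the
// underlying handler usually is.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errExecTTL = errors.New("ExecTTL: context deadline exceeded")

// execWithTimeout stands in for "send a payload to a worker and wait for the
// reply, but no longer than ttl".
func execWithTimeout(ctx context.Context, ttl time.Duration, work func() (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, ttl)
	defer cancel()

	type result struct {
		out string
		err error
	}
	ch := make(chan result, 1)
	go func() {
		out, err := work()
		ch <- result{out, err}
	}()

	select {
	case <-ctx.Done():
		// The branch the reported log lines correspond to.
		return "", fmt.Errorf("worker_exec_with_timeout: %w", errExecTTL)
	case r := <-ch:
		return r.out, r.err
	}
}

func main() {
	// The handler itself is trivial, but the reply is stalled for longer than
	// the 9 s TTL (standing in for a stuck or unresponsive worker).
	_, err := execWithTimeout(context.Background(), 9*time.Second, func() (string, error) {
		time.Sleep(10 * time.Second)
		return "ok", nil
	})
	fmt.Println(err) // worker_exec_with_timeout: ExecTTL: context deadline exceeded
}
```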

Version (rr --version)

v2024.1.5

How to reproduce the issue?

Didn't find a way; it happens randomly.

Relevant log output

No response

@cv65kr cv65kr added B-bug Bug: bug, exception F-need-verification labels Aug 21, 2024
@rustatian rustatian added this to the v2024.3 milestone Aug 21, 2024
@rustatian rustatian moved this to 📋 Backlog in Jira 😄 Aug 21, 2024
@rustatian rustatian removed the status in Jira 😄 Aug 21, 2024
@rustatian rustatian moved this to 🔖 Ready in Jira 😄 Aug 21, 2024
rustatian (Member) commented

Hey @cv65kr 👋 Thanks for the report 👍

Note to self: load test with exec_ttl set. Build RR with the -race flag.
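For anyone who wants to try reproducing this locally, here is a rough sketch of what such a client-side load generator could look like, assuming the server exposes the standard grpc.health.v1 health service (the services in this thread, /api.test and /my.Service/HealthCheck, would need their own generated clients instead); the address and concurrency numbers are arbitrary. The -race flag mentioned above is the standard Go race detector flag passed to go build.

```go
// Rough load-generator sketch: many concurrent gRPC calls against a small
// worker pool, so exec_ttl and worker hand-off are exercised under contention.
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.NewClient("127.0.0.1:5002",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent callers, arbitrary choice
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
				if _, err := client.Check(ctx, &healthpb.HealthCheckRequest{}); err != nil {
					log.Printf("call failed: %v", err) // watch for ExecTTL / NoFreeWorkers here
				}
				cancel()
			}
		}()
	}
	wg.Wait()
}
```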

genhoi commented Aug 23, 2024

I am experiencing a similar problem on gRPC with exec_ttl.

rr version 2024.1.5 (build time: 2024-06-20T19:10:34+0000, go1.22.4), OS: linux, arch: amd64

route33 commented Aug 29, 2024

@rustatian
I experience the same problem on the latest RR:
[INFO] RoadRunner server started; version: 2024.2.0, buildtime: 2024-07-25T13:25:10+0000
[INFO] sdnotify: not notified

{"level":"info","ts":1724864808550491636,"logger":"grpc","msg":"grpc server was started","address":"tcp://0.0.0.0:5002"}

{"level":"warn","ts":1724869469478675017,"logger":"server","msg":"worker stopped, and will be restarted","reason":"execTTL timeout elapsed","pid":148,"internal_event_name":"EventExecTTL","error":"worker_exec_with_timeout: ExecTTL: context canceled"}

{"level":"error","ts":1724869469478755068,"logger":"grpc","msg":"method call was finished with error","error":"rpc error: code = Internal desc = worker_exec_with_timeout: ExecTTL: context canceled","method":"/my.Service/HealthCheck","start":1724869468475914858,"elapsed":1002}

{"level":"warn","ts":1724890028555570695,"logger":"server","msg":"worker doesn't respond on stop command, killing process","PID":742}

{"level":"error","ts":1724890028555645505,"logger":"server","msg":"no free workers in the pool, wait timeout exceed","reason":"no free workers","internal_event_name":"EventNoFreeWorkers","error":"worker_watcher_get_free_worker: NoFreeWorkers:\n\tcontext canceled"}

{"level":"error","ts":1724890028555717208,"logger":"grpc","msg":"method call was finished with error","error":"rpc error: code = Internal desc = static_pool_exec: NoFreeWorkers:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker:\n\tcontext canceled","method":"/my.Service/HealthCheck","start":1724890018477624974,"elapsed":10078}

That's the log from a pod with no traffic, just health checks every 5 seconds (a sketch of the resulting NoFreeWorkers path follows the configuration below).
Configuration:

grpc:
  listen: tcp://0.0.0.0:5002
  max_concurrent_streams: 10
  max_connection_age: 0s
  max_connection_age_grace: 0s8h
  max_connection_idle: 0s
  max_recv_msg_size: 50
  max_send_msg_size: 50
  ping_time: 2h
  pool:
    allocate_timeout: 30s
    destroy_timeout: 60s
    max_jobs: 10000
    num_workers: 8
    supervisor:
      exec_ttl: 1m
      idle_ttl: 2m
      max_worker_memory: 100
      ttl: 1h
      watch_tick: 1s
  proto: ../my.service/service.proto
  timeout: 1h
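A note on the second half of that log: once a worker is stopped for execTTL and then killed for not responding to the stop command, incoming calls have to wait for a free worker, and a wait that ends early surfaces as NoFreeWorkers. The sketch below shows that pattern in isolation, assuming a pool that hands out idle workers over a channel and bounds the wait with the caller's context; it is an illustration, not RoadRunner's worker_watcher code.

```go
// Minimal sketch of the NoFreeWorkers path: all workers are busy or being
// replaced, a caller waits for one, and the wait's context is canceled.
package main

import (
	"context"
	"fmt"
	"time"
)

type pool struct {
	free chan struct{} // one token per idle worker
}

func (p *pool) getFreeWorker(ctx context.Context) error {
	select {
	case <-p.free:
		return nil // got a worker
	case <-ctx.Done():
		return fmt.Errorf("worker_watcher_get_free_worker: NoFreeWorkers:\n\t%w", ctx.Err())
	}
}

func main() {
	// Empty channel: no idle workers, mimicking the state right after
	// "worker doesn't respond on stop command, killing process".
	p := &pool{free: make(chan struct{})}

	// The waiting request is canceled (e.g. the client gave up), matching the
	// "context canceled" wrapped into NoFreeWorkers in the log above.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(100 * time.Millisecond)
		cancel()
	}()

	fmt.Println(p.getFreeWorker(ctx))
	// worker_watcher_get_free_worker: NoFreeWorkers:
	//	context canceled
}
```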

rustatian (Member) commented

Thank you, guys, for the comments. I hope to dig into this problem next week. I'll keep you all posted.
