Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Callback duration metric #439

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft

Conversation

benbierens
Copy link

@benbierens benbierens commented Aug 8, 2023

All right... this doesn't currently work and I need some help.

Here's the goal:
As a chronos user, I want to be able to measure callbacks durations (when using debug builds), so that I can be informed about long callbacks that will cause my app to block/not respond.

I've added a compile-flag and some code to asyncloop.nim to try and get this going, only to realize that chronos doesn't use nim-metrics. (And can't because nim-metrics uses chronos?) So I'm writing this message to ask help from you, experienced chronos developers! Do you see a use for a feature of this sort? How can this best be accomplished?

@arnetheduck
Copy link
Member

importing metrics here would create a circular dep that would need to be broken - but other than that, metrics aren't really appropriate for the task: async calls generally get broken up into multiple callbacks while gauges in metrics only record the latest value, and only the time it takes - if you have one request taking 1s and a 1ms request just after, the gauge will say 1ms and you will find out nothing - or it will say 1s but you won't know which callback it was because chronos typically processes thousands of them, specially when building up combined future flows.

the way to do this would be to introduce a pre/post callback which gets enabled by a config parameter a bit like future tracking - in that pre/post, you could take whatever measurements you want (including capturing a call stack of the request and for example saving all callbacks that take >N ms or whatever other policy you might want for the specific problem you're looking at. You'd also need to more extensively track the origin of the callback, ie was it a user-added callback, a timer, some internal chronos mechanism or a future being woken up to continue its work.

@dryajov
Copy link
Member

dryajov commented Aug 9, 2023

Agree, metrics aren't the right way to measure this. I like @arnetheduck idea about pre/post callbacks, makes sense.

@elcritch
Copy link

the way to do this would be to introduce a pre/post callback which gets enabled by a config parameter a bit like future tracking - in that pre/post, you could take whatever measurements you want (including capturing a call stack of the request and for example saving all callbacks that take >N ms or whatever other policy you might want for the specific problem you're looking at. You'd also need to more extensively track the origin of the callback, ie was it a user-added callback, a timer, some internal chronos mechanism or a future being woken up to continue its work.

I was thinking about that. It seems like it'd be possible to allow generic pre/post calls. It's not quite clear if it'd be more useful to track all callbacks or Future's or closure-iterators.

I took a stab at it using a more concrete timing for each closure iterator for each async proc: #442

That approach doesn't allow for getting stacktraces, or only recording long running ones.

@cheatfate
Copy link
Collaborator

the way to do this would be to introduce a pre/post callback which gets enabled by a config parameter a bit like future tracking - in that pre/post, you could take whatever measurements you want (including capturing a call stack of the request and for example saving all callbacks that take >N ms or whatever other policy you might want for the specific problem you're looking at. You'd also need to more extensively track the origin of the callback, ie was it a user-added callback, a timer, some internal chronos mechanism or a future being woken up to continue its work.

If there no re-entrant calls to poll callstack will be always the same and looks like this:

NimMain
...
poll()
callback()

And ... is some starting code your binary executes before it starts executing poll() in loop.

Calculating callbacks duration is pretty uninformative and intrusive for chronos scheduler. If you {.async.} procedure is slow - it is slow, if you need to get time of execution of this procedure - establish time measurement around this procedure. Metrics you trying to meter will help you measure parts of your procedures. For example:

proc someFoo() {.async.} =
  # Callback 1 measurement start
  ...
  # Callback 1 measurement finish
  await someOtherFooCall()
  # Callback 2 measurement start
  ...
  # Callback 2 measurement finish
  await someNewFooCall()
  # Callback 3 measurement start
  ...
  # Callback 3 measurement finish

So you got 3 measurement of code between calls to await. Is it what you need? I dont see any reasons why chronos should have facility to make time measurements of this places, and this become even more confusing if you have something like that:

proc someFoo() {.async.} =
  for item in iter():
    # Callback time measurement ends
    await someNewFooCall()
    # Callback time measurement starts

@arnetheduck
Copy link
Member

arnetheduck commented Aug 10, 2023

I took a stab at it using a more concrete timing for each closure iterator for each async proc: #442

I think this is being approached from the wrong end: ie I'd recommend first building a useful end-to-end solution for something concrete - ie something that provides value - if, in the end, it so happens that there exists a feature of sufficient genericity that it indeed would be easier to create a generic / shared solution in chronos, we can look at it from the perspective of a specific useful thing - like this, we're just throwing mud at a wall to see if something will stick and what shape it'll have.

Notably, the above can trivially be developed in a branch of chronos and showcased using that branch.

@dryajov
Copy link
Member

dryajov commented Aug 10, 2023

Notably, the above can trivially be developed in a branch of chronos and showcased using that branch.

I think your original proposal with pre/post hooks makes more sense. It allows building this types of features on top of chronos without bundling them with chronos unless they prove useful...

Of course, even the simple callback approach has to be carefully thought out - do we want it as a generic future that or only something that sits behind a flag and only allowed in debug builds?

@elcritch
Copy link

So you got 3 measurement of code between calls to await. Is it what you need? I dont see any reasons why chronos should have facility to make time measurements of this places, and this become even more confusing if you have something like that:

That's a useful question, and one I'm trying to understand.

Measuring the non-async sections can be useful since these sections are often the parts that disrupt overall tail latencies in a project. It's often useful to tracking down incorrect usage of blocking calls (disk IO, compute, blocking network calls, unbroken for-loops, etc).

Meanwhile, the time of the async portions can be somewhat gauged by proxy using total request time.

If I understand it, the callback approach described above wouldn't provide the granularity to measure the actual time it takes to execute the async proc time, only the total duration.

@elcritch
Copy link

I think this is being approached from the wrong end: ie I'd recommend first building a useful end-to-end solution for something concrete - ie something that provides value -

I disagree; this is a useful concrete solution to measure timings which would otherwise not be readily feasible to us. Also, the feedback from @cheatfate was helpful.

if, in the end, it so happens that there exists a feature of sufficient genericity that it indeed would be easier to create a generic / shared solution in chronos, we can look at it from the perspective of a specific useful thing - like this, we're just throwing mud at a wall to see if something will stick and what shape it'll have.

I'm onboard with creating a more generic solution if possible, but wanted to spitball something concrete and see how well it worked. These sort of timings could provide our project value, and possibly others, since non-async portions can easily cause issues.

Notably, the above can trivially be developed in a branch of chronos and showcased using that branch.

Isn't that what a draft PR is for? ;) Still that was my first approach, but I don't have access to create a branch and things get lost on a fork. I am also new to Chronos as a project and unsure the best way to get useful feedback. A draft PR gives a chance for a broader audience to chime in.

@elcritch
Copy link

elcritch commented Aug 11, 2023

I took my PR from yesterday and used it to make a more generic solution based on callbacks in FutureBase. It's not ready for upstreaming so I left it on my fork, but if you want to see it: elcritch#2

I'm not 100% sure if it's where you were thinking of adding the callbacks @arnetheduck. Future's made the most sense for a generic API as it could cover both iterators and callbacks. It'd need more testing to see how useful it'd be.

Note, I also needed a third callback onFuturePause to enable collecting execution time and not just the total elapsed time from once the Future starts. As @cheatfate mentioned that wouldn't include "child futures". That might be possible by combining it with the FutureTracking feature. All the data should in theory be able to be collected downstream.

I'd say this PR could probably be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants