feat: opportunity to realize graceful stopping IJobWorker instances #624

dimasm61 · 2023-12-14T16:16:13Z

Problem:

Under heavy load and a long time of the worker (4..30 sec), when the service ends, the following situation occurs: the worker has already successfully completed its work and begin to report zeebe about it, but zeebeClient has already been disposed.
System.ObjectDisposedException: Cannot access a disposed object.\nObject name: 'ZeebeClient was already disposed.'.\n
at AsyncUnaryCall GatewayProtocol.ClosedGatewayClient.CompleteJobAsync(CompleteJobRequest request, Metadata headers, DateTime? deadline, CancellationToken cancellationToken)\n
at async Task Zeebe.Client.Impl.Commands.CompleteJobCommand.Send(TimeSpan? timeout, CancellationToken token) x 2\n

Really, the worker's task has already been done, but zeebe thinks that it is not correct and after a few time transfers it to another worker. There will be problems:

if the worker’s task is not idempotent;
time is wasted to transfers it to another worker, zeebe processes take longer to complete;
more errors in logs that need spend time to be analyzed.

When the service shuts down, the current worker handlers are terminated, zeebe will wait for a timeout before transferring the task to other pods. Waste of time.

If it is possible to give a few seconds for the worker to finish the current work and report it to zeebe, there will be fewer time delays required to transfer tasks to other pods.

To implement such a strategy with zeebe-client, it is necessary possibility to stop pushing new tasks, while it must be able to report of complete on current ones; zeebeClient should not be disposed of before reporting.

A simplified implementation example is here
https://github.com/dimasm61/zeebe-client-csharp/tree/feat2_example/Client.GracefulStopping.Example

Zelldon

Hey @dimasm61 sorry I might not fully understand your problem, and also how your PR resolves this issue.

Is this related to #245 and others?

dimasm61 · 2023-12-20T10:01:07Z

Is this related to #245 and others?

Probably something else )

I might not fully understand your problem, and also how your PR resolves this issue.

To launch a worker, the following construction is usually used

            // open job worker
            using (var signal = new EventWaitHandle(false, EventResetMode.AutoReset))
            {
                client.NewWorker()
                    .JobType(JobType)
                    .Handler(HandleJob)
                    .MaxJobsActive(120)
                    .Name(WorkerName)
                    .AutoCompletion()
                    .PollInterval(TimeSpan.FromMilliseconds(100))
                    .Timeout(TimeSpan.FromSeconds(10))
                    .PollingTimeout(TimeSpan.FromSeconds(30))
                    .Open();

                // blocks main thread, so that worker can run
                signal.WaitOne();
            }

Worker handler is something like this

       private static void HandleJob(IJobClient jobClient, IJob job)
        {
            Logger.Debug("Handle job {job}", job.Key);
            jobClient.NewCompleteJobCommand(job).Send();
        }

When I use this in ASP.NET Core service, I wrap it in Background service (IHostedService, example)

To control IHostedService the asp.net core engine use StartAsync & StopAsync methods.

Respectively, in StartAsync method I place code block "client...Open()....WaitHandler.WaitOne", in StopAsync method I reset WaitHandler.

Let's back to code

1            // open job worker
2            using (var signal = new EventWaitHandle(false, EventResetMode.AutoReset))
3            {
4                client.NewWorker()
5                    .JobType(JobType)
6                    .Handler(HandleJob)
..
10                    .Open();
11
12               // blocks main thread, so that worker can run
13                signal.WaitOne();
15            }
16

1       private static void HandleJob(IJobClient jobClient, IJob job)
2       {
3            Logger.Debug("Handle job {job}", job.Key);
4
5            // some worker job, 1..10 seconds duration
6
7            jobClient.NewCompleteJobCommand(job).Send();
8        }

Imagine a situation:

HandleJob method have done primary job, code execution is in line 6, next command will be line 7;
at this time, you (or kuber) stopping this service, asp.net core call a StopAsync method, signal.WaitOne is reset;
code execution move to line 15 (the end of using scope), after that a grpc client is disposed;
after that a HandlerJob code execution move to line 7, and try report complete job command to zeebe;
a grpc client is null, handler will throw exception: ..Cannot access a disposed object.\nObject name: 'ZeebeClient was already disposed.'. at AsyncUnaryCall GatewayProtocol.ClosedGatewayClient.CompleteJobAsync(...

It dose not allow to realize to graceful stop service.

In my case, the workload is interaction with external service. Not every operation is idempotent. The worker completed his work but did not have time to report to zeebe, zebbe redirect this job to another worker service and it call extrenal service again.

The first solution was wait when all handlers complete before worker dispose.

1            // open job worker
2            using (var signal = new EventWaitHandle(false, EventResetMode.AutoReset))
3            {
4                client.NewWorker()
5                    .JobType(JobType)
6                    .Handler(HandleJob)
..
10                    .Open();
11
12               // blocks main thread, so that worker can run
13                signal.WaitOne();
14
15                while(handlerCount > 0) Thread.Sleep(10);
16            }
17

1       private static void HandleJob(IJobClient jobClient, IJob job)
2       {
3            handlerCount++;
4            Logger.Debug("Handle job {job}", job.Key);
5
6            // some worker job, 1..10 seconds duration
7
8            jobClient.NewCompleteJobCommand(job).Send();
9            handlerCount--;
10       }

But problem in this solution that while code wait active handlers ending (line 15), worker continues pooling new task form zeebe and start new handlers.

Next step, we fork (locally) zeebe-client-csharp and add method StopPooling() to worker

        public void StopPooling() => source.Cancel();

It's stop new job pooling

        private async Task PollJobs(ITargetBlock<IJob> input, CancellationToken cancellationToken)
        {
            while (!source.IsCancellationRequested)
            {

Now I can do this

1            // open job worker
2            using (var signal = new EventWaitHandle(false, EventResetMode.AutoReset))
3            {
4                var worker = client.NewWorker()
5                    .JobType(JobType)
6                    .Handler(HandleJob)
..
10                    .Open();
11
12                // blocks main thread, so that worker can run
13                signal.WaitOne();
14                worker.StopPooling();
15                while(handlerCount > 0) Thread.Sleep(10);
16            }
17

Before wait for active handlers stopping I call StopPoling().

As result, when the service is stopping:

active handlers is comleted;
active handlers report to zeebe - OK (grpc client is not disposed);
and new handlers is no started.

It's work well.

Furthermore, I can extend service stop timeout and wait for not only report to zeebe, I can wait for worker comlete interaction with external service.

xlegalles · 2023-12-20T11:42:53Z

Sorry to intervene in this discussion but why do you use a WaitEventHandle? We also use hosted services for our workers and create them in the Start method, but do not block anything because it's not required: once opened, the worker will poll for jobs until disposed.
We keep the list of workers, so that we can monitor them and check their status (it participates in the liveness probe of the complete pod)
And finally, when we stop the hosted service, we dispose all workers (btw, it stops the polling), before disposing the client.

Now about the problem you mention, I'm afraid that the main question is about the idempotency. It's not because you try to shutdown gracefully that you have any guaranty that you will be able to report the completion of the job to Zeebe. What I mean is that if your hosted service is stopping, it may be because your node is down and you may have already lost any communication means. Just my 2 cents.

dimasm61 · 2023-12-22T11:05:47Z

Sorry to intervene in this discussion but why do you use a WaitEventHandle?

Good question :)
I can assume that it happened historically, my previous co-workes use own multi task engine from previos project and peek at zeebe-client-csharp examples and got such a mix, with WaitEventHandler :)
In general I agree, it looks like an excess layer of abstraction.

We also use hosted services for our workers and create them in the Start method, but do not block anything because it's not required: once opened, the worker will poll for jobs until disposed. We keep the list of workers, so that we can monitor them and check their status (it participates in the liveness probe of the complete pod) And finally, when we stop the hosted service, we dispose all workers (btw, it stops the polling), before disposing the client.

Hm... I have not looked at this problem from this point of view.
If I understand you correctly - example

It works well. The graceful stoping works.

    public Task StartAsync(CancellationToken cancellationToken)
    {
        workerInstance = _workerManager.StartWorker(_jobType, WorkTaskAsync, cancellationToken);
        return Task.CompletedTask;
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        workerInstance.JobWorker.Dispose();
        await _workerHandlerCounter.WaitForActiveHandlersAsync();
    }

    public WorkerInstance StartWorker(
        string jobType,
        Func<IJob, ICompleteJobCommandStep1, CancellationToken, Task> handleJobAsync,
        CancellationToken cancellationToken)
    {
        var result = new WorkerInstance();

        var scope = _serviceProvider.CreateScope();

        result.ZeebeClient = scope.ServiceProvider.GetRequiredService<IZeebeClient>();

        result.JobWorker = result.ZeebeClient
            .NewWorker()
            .JobType(jobType)
            .Handler((client, job) => HandleJobAsync(client, job, handleJobAsync, cancellationToken))
            .MaxJobsActive(StaticSettings.MaxJobsActive)
            .Name($"{jobType}[{Environment.MachineName}][{Environment.CurrentManagedThreadId}]")
            .AutoCompletion()
            .PollingTimeout(TimeSpan.FromSeconds(StaticSettings.PollingTimeoutSec))
            .PollInterval(TimeSpan.FromSeconds(StaticSettings.PollIntervalSec))
            .Timeout(TimeSpan.FromSeconds(StaticSettings.TimeoutSec))
            .HandlerThreads(StaticSettings.HandlerThreads)
            .Open();
        return result;
    }

The fact that it works looks like using an undocumented feature :)
Worker handler still work after Worker disposed... m... imho it is not intuitive, not obvious.

But if @Zelldon, as inspirer of the project, thinks this option is better then add JobWorker.StopPulling() method then ok.

the liveness probe

This is cool idea!
Please clarify, are you hosting one worker in one service or many workers in one service?

Now about the problem you mention, I'm afraid that the main question is about the idempotency. It's not because you try to shutdown gracefully that you have any guaranty that you will be able to report the completion of the job to Zeebe. What I mean is that if your hosted service is stopping, it may be because your node is down and you may have already lost any communication means.

Loss of connection is a rare event. But service restart for updates happen frequently. Even if this covers 20...30% of such errors - the effort will be worth it.

xlegalles · 2023-12-22T13:40:44Z

We host multiple workers in a single application using hosted services. But they all share the same Zeebe client.
All workers are registered in a single service (using dependency injection) so that we can start everyone, stop and also do the health check: we send a TopologyRequest() to validate the client communication and check IsOpen() for every worker.

In a distributed system, anything can happen and you must be ready for any kind of failure (that includes remote ITs doing crazy things as we have already seen): so I agree that your use case is right but is not less likely than a loss of communication while you were trying to ack Zeebe about the completion of the job. This is why, I personnaly think that you must be ready to restart in any kind of situation and recover. That being said, it's useless trying to shutdown gracefully, just be ready to restart gracefully.

feat: opportunity to realize graceful stopping IJobWorker instances

7d53f1b

dimasm61 requested a review from Zelldon as a code owner December 14, 2023 16:16

Zelldon reviewed Dec 15, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: opportunity to realize graceful stopping IJobWorker instances #624

feat: opportunity to realize graceful stopping IJobWorker instances #624

dimasm61 commented Dec 14, 2023

Zelldon left a comment

dimasm61 commented Dec 20, 2023 •

edited

Loading

xlegalles commented Dec 20, 2023

dimasm61 commented Dec 22, 2023

xlegalles commented Dec 22, 2023

feat: opportunity to realize graceful stopping IJobWorker instances #624

Are you sure you want to change the base?

feat: opportunity to realize graceful stopping IJobWorker instances #624

Conversation

dimasm61 commented Dec 14, 2023

Zelldon left a comment

Choose a reason for hiding this comment

dimasm61 commented Dec 20, 2023 • edited Loading

xlegalles commented Dec 20, 2023

dimasm61 commented Dec 22, 2023

xlegalles commented Dec 22, 2023

dimasm61 commented Dec 20, 2023 •

edited

Loading