Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock when calling Wait and Go from different goroutines #6

Open
horgh opened this issue Apr 20, 2021 · 4 comments
Open

Deadlock when calling Wait and Go from different goroutines #6

horgh opened this issue Apr 20, 2021 · 4 comments

Comments

@horgh
Copy link

horgh commented Apr 20, 2021

Hi!

I'm using 0.1.5 with Go 1.16.3.

First, thank you for making the package! It's a great idea!

As to the issue: I've encountered what I believe is a deadlock due to my use of this package.

My program (a daemon) does this, among other things:

  • Creates an errgroup
  • Calls Go() on it
  • This goroutine (goroutine A) periodically receives some work to do and then starts worker goroutines, also with Go() on the same errgroup

Periodically my program shuts down. As part of that, it calls Wait() on the errgroup.

The situation I have observed is this:

Stack traces:

goroutine 1 [semacquire, 68 minutes]:                                                                                                                                   
sync.runtime_Semacquire(0xc00055ed30)                                                                                                                                   
        /usr/local/go/src/runtime/sema.go:56 +0x45                                                                                                                      
sync.(*WaitGroup).Wait(0xc00055ed28)                                                                                                                                    
        /usr/local/go/src/sync/waitgroup.go:130 +0x65                                                                                                                   
github.com/neilotoole/errgroup.(*Group).Wait(0xc00055ed20, 0xc000136008, 0xc000609d18)                                                                                  
        /home/wstorey_maxmind_com/go/pkg/mod/github.com/neilotoole/[email protected]/errgroup.go:99 +0x65                                                                 
github.maxmind.com/maxmind/mm_website/go/shared/pubsubworker.Run(0x129a780, 0xc000543e40, 0xc0005977e0, 0x129d0f8, 0xc000560320, 0x1, 0x12869c0, 0xc000136670, 0x0, 0x0)
        /home/wstorey_maxmind_com/mm_website/go/shared/pubsubworker/worker.go:121 +0x465                                                                                
github.maxmind.com/maxmind/mm_website/go/process-job-pubsub/worker.DaemonFromCLI(0x129a780, 0xc00011a000)                                                               
        /home/wstorey_maxmind_com/mm_website/go/process-job-pubsub/worker/worker.go:158 +0x7a5                                                                          
main.main()                                                                                                                                                             
        /home/wstorey_maxmind_com/mm_website/go/process-job-pubsub/main.go:10 +0x39                                                                                     
                                                                                                                                                                        
                                                                                                                                                                        
goroutine 41 [semacquire, 68 minutes]:                                                                                                                                  
sync.runtime_SemacquireMutex(0xc00055ed6c, 0x0, 0x1)                                                                                                                    
        /usr/local/go/src/runtime/sema.go:71 +0x47                                                                                                                      
sync.(*Mutex).lockSlow(0xc00055ed68)                                                                                                                                    
        /usr/local/go/src/sync/mutex.go:138 +0x105                                                                                                                      
sync.(*Mutex).Lock(...)                                                                                                                                                 
        /usr/local/go/src/sync/mutex.go:81                                                                                                                              
github.com/neilotoole/errgroup.(*Group).Go(0xc00055ed20, 0xc000769880)                                                                                                  
        /home/wstorey_maxmind_com/go/pkg/mod/github.com/neilotoole/[email protected]/errgroup.go:122 +0x190                                                               
github.maxmind.com/maxmind/mm_website/go/shared/pubsubworker.(*Worker).process(0xc000547b30, 0x129a748, 0xc000543e40, 0xc00055ed20)                                     
        /home/wstorey_maxmind_com/mm_website/go/shared/pubsubworker/worker.go:190 +0x155

What do you think?

@neilotoole
Copy link
Owner

@horgh Thanks for the kind words, much appreciated. It's great to see code in action.

I'll freely admit I had not considered (or tested for) this scenario. It's very interesting.

Just running through your example in my head, yes, I see the deadlock.

It seems like it should be possible to distill this situation down to a test case, and easily arrive at that deadlock (I haven't done that yet, basically due to lack of time).

I suppose the question is: if we had that test case, would sync/errgroup fail at the same deadlock as neilotoole/errgroup? I think this package should still emulate the behavior of sync/errgroup (even in the case of deadlock), as the goal is to be a "drop-in" replacement for sync/errgroup. On the other hand, if sync/errgroup does not experience deadlock, but this package does, then this package has failed as a "drop-in".

Ideally, could you provide a test case that demonstrates the deadlock? Either which way, I want to get to the bottom of this scenario. It's a real-world situation that at a minimum needs to be documented, but hopefully solved.

@horgh
Copy link
Author

horgh commented Apr 22, 2021

Thank you for looking at it!

This reproduces the issue:

package main

import (
	"context"
	"log"
	"time"

	"github.com/neilotoole/errgroup"
	// "golang.org/x/sync/errgroup"
)

func main() {
	ctx := context.Background()

	eg, _ := errgroup.WithContext(ctx)
	eg.Go(func() error { return g1(eg) })

	if err := eg.Wait(); err != nil {
		log.Fatal(err)
	}
}

func g1(eg *errgroup.Group) error {
	time.Sleep(100 * time.Millisecond)
	eg.Go(func() error { return nil })
	return nil
}

When run this shows: fatal error: all goroutines are asleep - deadlock!

It does not happen with golang.org/x/sync/errgroup. It looks like that implementation doesn't use a mutex directly so it is avoiding the situation.

@horgh
Copy link
Author

horgh commented Apr 22, 2021

I added a test here in the test package too - https://github.com/neilotoole/errgroup/compare/master...horgh:horgh/deadlock-test?expand=1

@horgh
Copy link
Author

horgh commented May 6, 2021

I'm kinda wondering now if golang.org/x/sync/errgroup is not designed for this either. It won't deadlock, but I don't know that it is valid to call Go() and Wait() on a Group from it concurrently. I think we could end up with a race where we still start a goroutine despite Wait() returning. It won't deadlock so its failure mode is better, but it's not correct either. Perhaps this should simply be documented or something. This comment made me think about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants