[BUG] Race condition in JobScheduler#schedule and JobScheduler#deschedule #275

xiaoyuan0821 · 2022-11-29T07:28:17Z

What is the bug?
We found a managed ISM index that had been stuck in the initializing status for several weeks.

How can one reproduce the bug?
There is a race condition between JobScheduler#schedule and JobScheduler#deschedule. The following unit test fails (add it to JobSchedulerTests):

public void testRaceCondition() throws InterruptedException {

        String indexName = ".opendistro-ism-config";
        String docId = "test-doc-id";
        ScheduledJobParameter jobParameter = buildScheduledJobParameter(docId, "dummy job name",
                Instant.now(), Instant.now(), new IntervalSchedule(Instant.now(), 5, ChronoUnit.MINUTES), true);
        ScheduledJobRunner runner = Mockito.mock(ScheduledJobRunner.class);
        Scheduler.ScheduledCancellable cancellable = Mockito.mock(Scheduler.ScheduledCancellable.class);
        Mockito.when(this.threadPool.schedule(Mockito.any(), Mockito.any(), Mockito.anyString())).thenReturn(cancellable);
        Mockito.when(cancellable.cancel()).thenReturn(true);

        for (int i = 0; i < 10000; i++) {
            logger.info("start iteration {}", i);
            // schedule thread
            Thread scheduleThread = new Thread(() -> scheduler.schedule(indexName, docId, jobParameter, runner, dummyVersion, jitterLimit));
            // deschedule thread
            Thread descheduleThread = new Thread(() -> scheduler.deschedule(indexName, docId));
            // start them
            scheduleThread.start();
            descheduleThread.start();
            // wait for them to end
            scheduleThread.join();
            descheduleThread.join();
            // deschedule again to make sure the job is removed from scheduler#scheduledJobInfo
            scheduler.deschedule(indexName, docId);
            // after deschedule, scheduledJobInfo should no longer contain the job
            assertNull(scheduler.getScheduledJobInfo().getJobInfo(indexName, docId));
            logger.info("end iteration {}", i);
        }
    }

On my desktop, the test fails after 1200+ iterations:

// ... many log lines omitted
[2022-11-29T07:36:23,512][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] start iteration 1265
[2022-11-29T07:36:23,512][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2535]] Scheduling job id test-doc-id for index .opendistro-ism-config .
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobScheduler     ] [testRaceCondition] Descheduling jobId: test-doc-id
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] end iteration 1265
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] start iteration 1266
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2537]] Scheduling job id test-doc-id for index .opendistro-ism-config .
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2538]] Descheduling jobId: test-doc-id
[2022-11-29T07:36:23,514][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2537]] not scheduled because already removed
[2022-11-29T07:36:23,514][INFO ][o.o.j.s.JobScheduler     ] [testRaceCondition] Descheduling jobId: test-doc-id
[2022-11-29T07:36:23,546][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] after test
REPRODUCE WITH: gradlew ':test' --tests "org.opensearch.jobscheduler.scheduler.JobSchedulerTests.testRaceCondition" -Dtests.seed=E33377BC38A3CD99 -Dtests.security.manager=false -Dtests.locale=ar-JO -Dtests.timezone=Africa/Nouakchott -Druntime.java=12

expected null, but was:<org.opensearch.jobscheduler.scheduler.JobSchedulingInfo@2a85bf7a>
java.lang.AssertionError: expected null, but was:<org.opensearch.jobscheduler.scheduler.JobSchedulingInfo@2a85bf7a>
	at __randomizedtesting.SeedInfo.seed([E33377BC38A3CD99:4CEC8FE92BF01C19]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotNull(Assert.java:756)
	at org.junit.Assert.assertNull(Assert.java:738)
	at org.junit.Assert.assertNull(Assert.java:748)

In ISM, if a user adds a policy to an index and removes it immediately, there is a chance of triggering this bug.
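
One interleaving that would be consistent with the log above (the schedule thread prints "not scheduled because already removed", yet the entry is still present afterwards) can be replayed deterministically on a single thread. This is only a sketch under stated assumptions: `ToyScheduler`, `JobInfo`, `register`, and `submit` are hypothetical names made up for illustration, and the model assumes that deschedule removes the entry only when it finds something to cancel. It is not the real JobScheduler code and may not match the actual cause.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ScheduleDescheduleRaceSketch {

    /** Hypothetical stand-in for JobSchedulingInfo: only the two fields this sketch needs. */
    static final class JobInfo {
        volatile boolean descheduled;
        volatile Object cancellable; // null until the job has been handed to the thread pool
    }

    /** Hypothetical, simplified model; NOT the real JobScheduler implementation. */
    static final class ToyScheduler {
        final Map<String, JobInfo> jobs = new ConcurrentHashMap<>();

        // Step 1 of schedule: register the job before it has a cancellable.
        JobInfo register(String id) {
            return jobs.computeIfAbsent(id, k -> new JobInfo());
        }

        // Step 2 of schedule: hand the job over unless it was descheduled in the meantime.
        boolean submit(JobInfo info) {
            if (info.descheduled) {
                return false; // models "not scheduled because already removed"
            }
            info.cancellable = new Object();
            return true;
        }

        // Assumption: the entry is removed only when there is something to cancel.
        void deschedule(String id) {
            JobInfo info = jobs.get(id);
            if (info == null) {
                return;
            }
            info.descheduled = true;
            if (info.cancellable != null) {
                info.cancellable = null;
                jobs.remove(id);
            }
        }
    }

    /** Replays the suspected interleaving and reports whether a stale entry remains. */
    static boolean leavesStaleEntry() {
        ToyScheduler s = new ToyScheduler();
        JobInfo info = s.register("test-doc-id"); // schedule thread, step 1
        s.deschedule("test-doc-id");              // deschedule thread runs in between
        s.submit(info);                           // schedule thread, step 2: skipped
        s.deschedule("test-doc-id");              // the extra deschedule from the unit test
        return s.jobs.containsKey("test-doc-id");
    }

    public static void main(String[] args) {
        System.out.println("stale entry left behind: " + leavesStaleEntry());
    }
}
```

In this model the entry stays marked as descheduled with no cancellable, so neither a later schedule nor a later deschedule call ever clears it. If the real code behaves similarly, that would explain an index staying in the initializing status indefinitely.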

What is your host/environment?

All versions are affected, including both the Open Distro JobScheduler and the OpenSearch JobScheduler.


saratvemulapalli · 2023-01-20T21:26:58Z

@joshpalis @vibrantvarun tagging you along; this looks like a genuine bug. Let's take care of this?

xiaoyuan0821 added the bug and untriaged labels Nov 29, 2022

saratvemulapalli removed the untriaged label Jan 20, 2023

saratvemulapalli assigned joshpalis and vibrantvarun Jan 20, 2023

peterzhuamazon added this to Engineering Effectiveness Board Jul 11, 2024

github-project-automation bot moved this to 🆕 New in Engineering Effectiveness Board Jul 11, 2024

getsaurabh02 moved this from 🆕 New to Backlog in Engineering Effectiveness Board Jul 18, 2024
