
[BUG] Race condition on restoring from snapshot #396

Open
heemin32 opened this issue Jun 1, 2023 · 3 comments
@heemin32 (Contributor) commented on Jun 1, 2023

What is the bug?
This is just my thought process. If an extension of Job Scheduler with a short interval acquires a lock, it will create the index .opendistro-job-scheduler-lock.

After taking a snapshot and then restoring it, the index belonging to the Job Scheduler extension can be restored first, which triggers the scheduled task and recreates the .opendistro-job-scheduler-lock index. If the restore of .opendistro-job-scheduler-lock happens after that, it fails due to an index name conflict.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create a Job Scheduler extension with a short run interval that also acquires a lock.
  2. Take a snapshot.
  3. Restore from the snapshot (a sketch of these calls follows the list).
  4. The restore of .opendistro-job-scheduler-lock fails.
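
A minimal sketch of steps 2-4 over the REST API, using Python's requests client; the cluster address, repository name, snapshot name, and filesystem location below are illustrative assumptions, not values from the report:

```python
# Hypothetical reproduction sketch; cluster address, repository, snapshot name,
# and repository location are placeholders.
import requests

BASE = "http://localhost:9200"

# Register a filesystem snapshot repository (the location must be allowed by path.repo).
requests.put(f"{BASE}/_snapshot/my-repo",
             json={"type": "fs", "settings": {"location": "/mnt/snapshots"}})

# Step 2: take a snapshot while the short-interval, lock-acquiring job is running,
# so .opendistro-job-scheduler-lock is captured in the snapshot.
requests.put(f"{BASE}/_snapshot/my-repo/snap-1",
             params={"wait_for_completion": "true"})

# Step 3: restore everything. The extension's own index may come back first; its job
# then fires and recreates .opendistro-job-scheduler-lock before that index's restore starts.
resp = requests.post(f"{BASE}/_snapshot/my-repo/snap-1/_restore",
                     json={"include_global_state": False})

# Step 4: the restore of .opendistro-job-scheduler-lock then fails with an
# index name conflict (the race described above).
print(resp.status_code, resp.text)
```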

What is the expected behavior?
Perhaps .opendistro-job-scheduler-lock should be excluded from the snapshot, or restoring it should be blocked.

What is your host/environment?
N/A

Do you have any screenshots?
N/A

Do you have any additional context?
opensearch-project/OpenSearch#7778

@heemin32 added the bug and untriaged labels on Jun 1, 2023
@andrross (Member) commented:

If the data in .opendistro-job-scheduler-lock is truly ephemeral state that should never survive a snapshot->restore cycle, then it might be appropriate to block it from being snapshotted. However, it does raise the question of whether an index is the right place to store such ephemeral state.

If, however, there are cases where you would want the data in .opendistro-job-scheduler-lock to survive a snapshot->restore cycle, then the current behavior seems appropriate where either the snapshotter or the restorer can choose whether to exclude the index at snapshot or restore time. There may indeed be a race, but if a new .opendistro-job-scheduler-lock has been created and might have data in it, then the operator needs to explicitly make a choice as to whether to use the new data versus the data in the snapshot.
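
For illustration, the operator can already express that choice with the restore request's index list; the hedged sketch below excludes the lock index at restore time (cluster address, repository, and snapshot names are assumptions, and exact wildcard handling of hidden indices may vary by version):

```python
# Hedged sketch: explicitly exclude .opendistro-job-scheduler-lock at restore time,
# keeping whatever lock state the running cluster has already rebuilt.
import requests

BASE = "http://localhost:9200"

resp = requests.post(
    f"{BASE}/_snapshot/my-repo/snap-1/_restore",
    json={
        # Restore every index from the snapshot except the lock index.
        "indices": "*,-.opendistro-job-scheduler-lock",
        "include_global_state": False,
    },
)
print(resp.status_code, resp.json())
```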

@joshpalis self-assigned this on Aug 11, 2023
@prudhvigodithi (Member) commented:

[Triage]
Hey, just following up on this. I assume this issue still persists; adding @joshpalis @cwperks @dbwiddis to provide some insights. I agree with @andrross that there has to be a mechanism to choose whether .opendistro-job-scheduler-lock should be part of the snapshot->restore cycle.

We could also have the user first rename the index to a new name during restoration, delete the original index, and then reindex the data from the renamed index into the original one (a sketch of that workaround follows).
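
A minimal sketch of that manual workaround, assuming a repository named my-repo and a snapshot named snap-1 (both hypothetical); note that deleting a system index may require extra permissions depending on cluster configuration:

```python
# Hedged sketch of the rename -> delete -> reindex workaround described above.
# Cluster address, repository, and snapshot names are placeholders.
import requests

BASE = "http://localhost:9200"
LOCK = ".opendistro-job-scheduler-lock"

# 1. Restore the lock index under a temporary name so it cannot collide with the
#    copy that running jobs have already recreated.
requests.post(f"{BASE}/_snapshot/my-repo/snap-1/_restore", json={
    "indices": LOCK,
    "rename_pattern": LOCK,
    "rename_replacement": f"restored{LOCK}",
    "include_global_state": False,
})

# 2. Delete the live lock index that the extension recreated after the restore began.
requests.delete(f"{BASE}/{LOCK}")

# 3. Copy the restored lock documents back into the original index name.
requests.post(f"{BASE}/_reindex", json={
    "source": {"index": f"restored{LOCK}"},
    "dest": {"index": LOCK},
})
```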

Adding @bbarani

@andrross (Member) commented on Apr 2, 2024

We could also have the user first rename the index to a new name during restoration, delete the original index, and then reindex the data from the renamed index into the original one.

We should really try to avoid user interaction here, I think. I'm not super familiar with the low-level details of Job Scheduler, but I suspect .opendistro-job-scheduler-lock is an implementation detail. It really raises the question of whether a system index is the right place for this data (versus, say, cluster state or some other mechanism).
