Revisit uniqueness of Datum and Resource. #28
I don't understand what happens to the "new" document in (2) on a `datum_id` collision for identical documents.
An example of duplicates at RSOXS:
Suppose two Runs in a row try to insert identical Datum documents, as would happen if the first Run took a dark frame and the next Run reused it. In the first case, the database won't have seen that `datum_id` before and the insert succeeds; in the second case, the insert collides with the document already stored. Something like:

```python
from pymongo.errors import DuplicateKeyError

def insert_datum(doc):
    try:
        mongo_datum_collection.insert_one(doc)
    except DuplicateKeyError:
        # The database already has a Datum with this datum_id.
        # It has probably been inserted before. But just to be sure that something
        # hasn't gone terribly wrong (a different document but a colliding datum_id)
        # let's compare the existing document to the one we tried to insert.
        # If they are the same, all is well.
        # (Exclude the Mongo-assigned _id so the comparison is content-only.)
        prior_doc = mongo_datum_collection.find_one(
            {'datum_id': doc['datum_id']}, {'_id': False})
        if prior_doc != doc:
            # If this happens, something has gone *very* wrong with the code that
            # generates the documents.
            raise ValueError(
                "Attempted to insert a Datum with a datum_id that already exists "
                "but different content!")
```
So the "new" document is not saved, since it is identical to the "prior" document which is already in mongo? |
Exactly.
I think we can do a `mongoClient.db.collection.update(key, doc, upsert=True)` to insert only if the doc is unique.
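For concreteness, here is a minimal sketch of that upsert-based insert; current pymongo spells the older `update(..., upsert=true)` idiom as `replace_one(..., upsert=True)`. The database and collection names are assumptions for illustration:

```python
from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
mongo_datum_collection = client["assets"]["datum"]  # hypothetical database/collection names

def upsert_datum(doc):
    # Insert if no Datum with this datum_id exists; otherwise replace the
    # existing document. Caveat: this silently overwrites a prior document
    # with *different* contents, exactly the failure mode discussed below.
    mongo_datum_collection.replace_one(
        {"datum_id": doc["datum_id"]},
        doc,
        upsert=True,
    )
```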
Thoughts about duplicate resources:
- Fill an event:
- Get all the documents for a run:
👍 for (2) keeping uniqueness.
@gwbischof I agree the reverse-lookup issue is confusing. If we're going from Event -> Datum -> Resource without any additional context, we could end up with any Resource that has the right `uid`. This feels convoluted, but I don't see any terrible problems with it yet.
If we take that approach, we would never notice the situation where Datum 2 has different content than Datum 1 and thus blows away Datum 1 on insert. That should never happen, but if it did it would be very bad. Maybe we should implement both (2) and this and provide a switch: if the overhead of validation ever becomes a serious problem, we can elect to switch to the upsert-based, un-validated mode.
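A minimal sketch of what that switch might look like, building on the `insert_datum` and `upsert_datum` helpers sketched earlier in this thread; the `save_datum` name and the `validate` flag are hypothetical:

```python
def save_datum(doc, *, validate=True):
    if validate:
        # Option (2): insert, and on a datum_id collision verify that the
        # existing document is identical before passing silently.
        insert_datum(doc)
    else:
        # Upsert-based, un-validated mode: faster on insert, but a
        # conflicting Datum would silently blow away the prior one.
        upsert_datum(doc)
```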
My memory was to make them unique going forward (and tolerate duplicate resource and datum bodies), but to tolerate the current non-uniqueness to avoid having to re-write events en masse (though we probably should eventually).

Allowing the datum id to be non-unique leaves us open to eventually emitting inconsistent datums. In the case of mongo we have a story for how we can check that, but in a more streaming / decoupled deployment we may not be able to do that check, or the exception will happen someplace far downstream that cannot easily reach back up and tell the RE that something is very wrong.

We want to insert the run id into the resource document [1], which we now do in the RE as the documents go out. As we are doing this, we are now in the situation where there will be resources with the same uid but different content. While you can say the "important" content is all the same, spreading through the entire code base the knowledge of which keys are allowed to differ while two documents are still considered "the same" is asking for trouble. If we are changing the resource uids then we also need to change the datum ids (as we want them to be unique as well).

This is going to require a bunch more complexity be pushed down into AD / the darkframe code, but it will save us a whole lot of complexity on the consumer side.

[1] We did not initially do this because the resource / datum went out entirely side-band and we did not want to inject the run id into ophyd objects (we could do it in stage, but then you could not stage outside of a run nor leave a device staged across runs). We could add another required method to the BlueskyInterface to support this.
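As a rough illustration of that re-minting idea (a sketch, not anything implemented here): given a cached dark-frame Resource and its Datum documents, a new Run would issue a fresh resource uid and re-derive the datum ids, assuming the common `'{resource_uid}/{counter}'` convention for `datum_id`. All names below are hypothetical:

```python
import uuid

def remint(resource, datums, run_start_uid):
    # Issue a fresh Resource tied to the new run.
    new_resource = dict(resource, uid=str(uuid.uuid4()), run_start=run_start_uid)
    new_datums = []
    for datum in datums:
        # Re-derive each datum_id from the new resource uid, keeping the
        # trailing counter, so the Datum points at the re-minted Resource.
        counter = datum["datum_id"].rsplit("/", 1)[-1]
        new_datums.append(dict(
            datum,
            resource=new_resource["uid"],
            datum_id=f"{new_resource['uid']}/{counter}",
        ))
    return new_resource, new_datums
```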
That's very compelling. I think we did have a conversation about this path forward, and I had just forgotten.
The way mongo embedded works now is that the start, stop, resource, and descriptor documents go in the header document. This may be an issue if we don't want to duplicate resource documents.
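For readers who haven't looked at that backend, a rough illustration of the embedded layout; the exact field names here are assumptions:

```python
# One "header" document per Run, with the run-level documents embedded.
header = {
    "start": {},        # RunStart document
    "stop": {},         # RunStop document
    "descriptors": [],  # EventDescriptor documents
    "resources": [],    # Resource documents: embedding them per-Run means a
                        # shared Resource (e.g. a dark frame) gets duplicated
                        # in every header that references it
}
```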
It might be nice to know which stream a resource belongs to so that if you want to do
I am recording, belatedly, some phone and in-person conversations with @tacaswell and @gwbischof.
It is possible for the same exposure to be useful across multiple Runs. One prominent example is a dark frame, which might be taken during one Run and then referenced (reused) by another Run. In these situations, we re-emit the same Datum and Resource documents. This breaks assumptions that had leaked in:
- `datum_id` is no longer unique
- `uid` is no longer unique, though the pair `(run_start, uid)` is if `run_start` is defined. (It was a late addition to the schema and is still optional.)

Rejected Alternatives
We could issue new Datum and Resource documents with fresh unique IDs but the same underlying contents. I'm not sure I recall all the reasons that this was rejected, but it seems it could lead to bloat with many identical copies of Datum documents being stored.
Implementation
Currently, an index on the Datum collection expects `datum_id` to be unique. Should we:

1. Remove the uniqueness requirement and allow duplicates to be inserted freely?
2. Keep the uniqueness requirement. When a `datum_id` collision occurs, catch the error and verify that the full contents (not just the `datum_id`) of the existing document and the would-be new document are identical. If they are, pass. If they are not, raise an error ("Two Datum with the same datum_id must have identical contents!").

(2) would be slower on insert but seems better to me because it does the validation up front.
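For reference, a minimal sketch of declaring that unique index with pymongo, so that option (2)'s collision branch is actually triggered; the database and collection names are assumptions:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient()  # assumes a local MongoDB instance
mongo_datum_collection = client["assets"]["datum"]  # hypothetical names

# With unique=True, inserting a second Datum with the same datum_id raises
# DuplicateKeyError, which insert_datum (in the comments above) catches
# and validates.
mongo_datum_collection.create_index([("datum_id", ASCENDING)], unique=True)
```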