
Revisit uniqueness of Datum and Resource. #28

Open
danielballan opened this issue Jan 8, 2020 · 14 comments

Comments

@danielballan
Member

danielballan commented Jan 8, 2020

I am recording---belatedly---some phone and in-person conversations with @tacaswell and @gwbischof.

It is possible for the same exposure to be useful across multiple Runs. One prominent example is a dark frame, which might be taken during one Run and then referenced (reused) by another Run. In these situations, we re-emit the same Datum and Resource documents. This breaks assumptions that had leaked in:

  • The Datum field datum_id is no longer unique
  • The Resource field uid is no longer unique, though the pair (run_start, uid) is unique when run_start is defined. (run_start was a late addition to the schema and is still optional.)

Rejected Alternatives

We could issue new Datum and Resource documents with fresh unique IDs but the same underlying contents. I'm not sure I recall all the reasons that this was rejected, but it seems it could lead to bloat with many identical copies of Datum documents being stored.

Implementation

Currently, an index on the Datum collection expects datum_id to be unique. Should we:

  1. Remove the uniqueness constraint and accept duplicates, potentially adding a pruning tool in the future to remove them.
  2. Continue to require uniqueness. When a datum_id collision occurs, catch the error and verify that the full contents (not just the datum_id) of the existing document and the would-be new document are identical. If they are, pass. If they are not, raise an error ("Two Datum with the same datum_id must have identical contents!").

(2) would be slower on insert but seems better to me because it does the validation up front.
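
To make the two options concrete, here is a rough sketch of the corresponding pymongo index definition; the helper and collection names are illustrative, not the actual databroker internals:

import pymongo

def create_datum_index(collection, require_unique):
    # Option 1 (require_unique=False): no uniqueness constraint; duplicate
    # Datum documents are accepted and could be pruned later.
    # Option 2 (require_unique=True): inserting a colliding datum_id raises
    # pymongo.errors.DuplicateKeyError, which the caller catches and validates.
    collection.create_index([('datum_id', pymongo.ASCENDING)], unique=require_unique)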

@jklynch
Contributor

jklynch commented Jan 8, 2020

I don't understand what happens to the "new" document in option 2 when a datum_id collision occurs for identical documents.

@gwbischof
Contributor

An example of duplicates at RSOXS:


In [2]:     oldclient = pymongo.MongoClient("mongodb://xf07id1-ws18:27017/")             
   ...:     old_assets_db = oldclient["rsoxs-assets-store"]                              
   ...:     old_meta_db = oldclient["rsoxs-metadata-store"]                                                                                   

In [3]: list(old_meta_db.resource.find({'uid': '1c43af30-27db-437e-83c0-38b1cc528f06'}))                                                      
Out[3]: []

In [4]: list(old_assets_db.resource.find({'uid': '1c43af30-27db-437e-83c0-38b1cc528f06'}))                                                    
Out[4]: 
[{'_id': ObjectId('5d540d4cfb414042494a3a0e'),
  'spec': 'AD_TIFF',
  'resource_path': 'data/2019/08/14',
  'root': '/DATA/images',
  'resource_kwargs': {'template': '%s%s_%6.6d.tiff',
   'filename': 'e1f402f1-10bd-4685-a4d4',
   'frame_per_point': 1},
  'path_semantics': 'posix',
  'uid': '1c43af30-27db-437e-83c0-38b1cc528f06',
  'run_start': '4fa42813-0b56-4eee-a577-2125cd9f6c24'},
 {'_id': ObjectId('5d540d7bfb414042494a3a5e'),
  'spec': 'AD_TIFF',
  'resource_path': 'data/2019/08/14',
  'root': '/DATA/images',
  'resource_kwargs': {'template': '%s%s_%6.6d.tiff',
   'filename': 'e1f402f1-10bd-4685-a4d4',
   'frame_per_point': 1},
  'path_semantics': 'posix',
  'uid': '1c43af30-27db-437e-83c0-38b1cc528f06',
  'run_start': '4fa42813-0b56-4eee-a577-2125cd9f6c24'},
 {'_id': ObjectId('5d540db9fb414042494a3aa2'),
  'spec': 'AD_TIFF',
  'resource_path': 'data/2019/08/14',
  'root': '/DATA/images',
  'resource_kwargs': {'template': '%s%s_%6.6d.tiff',
   'filename': 'e1f402f1-10bd-4685-a4d4',
   'frame_per_point': 1},
  'path_semantics': 'posix',
  'uid': '1c43af30-27db-437e-83c0-38b1cc528f06',
  'run_start': 'd1fc0b67-0fcc-4239-b1c5-cdbc15a27e2b'}]

@danielballan
Member Author

danielballan commented Jan 8, 2020

Suppose two Runs in a row try to insert identical Datum documents, as would happen if the first Run took a dark frame and the next Run reused it.

In the first case, the database won't have seen that datum_id before, so it will go in. In the second case, we'll try to insert it and pymongo will raise an error complaining about the duplicate because datum_id is supposed to be unique. We can do something like this:

from pymongo.errors import DuplicateKeyError

def insert_datum(doc):
    try:
        mongo_datum_collection.insert_one(doc)
    except DuplicateKeyError:
        # The database already has a Datum with this datum_id.
        # It has probably been inserted before. But just to be sure that something
        # hasn't gone terribly wrong (a different document but a colliding datum_id)
        # let's compare the existing document to the one we tried to insert.
        # If they are the same, all is well.
        prior_doc = mongo_datum_collection.find_one(
            {'datum_id': doc['datum_id']}, {'_id': False})
        # insert_one may have added an '_id' key to doc; ignore it in the comparison.
        if prior_doc != {k: v for k, v in doc.items() if k != '_id'}:
            # If this happens, something has gone *very* wrong with the code that
            # generates the documents.
            raise ValueError(
                "Attempted to insert Datum with a datum_id that already exists "
                "but different content!")

[Edited to fix formatting of pseudocode]

@jklynch
Contributor

jklynch commented Jan 8, 2020

So the "new" document is not saved, since it is identical to the "prior" document which is already in mongo?

@danielballan
Member Author

Exactly.

@gwbischof
Contributor

gwbischof commented Jan 8, 2020

I think we can do a mongoClient.db.collection.update(key, doc, upsert=true) to insert only if the doc is unique.
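
For reference, a minimal sketch of that upsert idea with current pymongo, using replace_one (update() is deprecated) and reusing the illustrative collection name from the sketch above:

def upsert_datum(doc):
    # Insert the document if no Datum with this datum_id exists yet;
    # otherwise silently replace the existing one.
    mongo_datum_collection.replace_one(
        {'datum_id': doc['datum_id']}, doc, upsert=True)

Note that this never compares the contents of the existing document, which is the trade-off discussed further down the thread.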

@gwbischof
Contributor

gwbischof commented Jan 8, 2020

Thoughts about duplicate resources:
For reading data I think we want to do two things: fill an event, and get all of the documents for a run.

Fill an event:
If we have multiple resources with the same uid, when we fill an event we won't be sure which resource document to use, because we look up the resource by uid. Unique resource uids would be best here.

Get all the documents for a run:
I think that we want to look up resource documents by run_start, so that we don't have to iterate through all of the events to find all of the resources for the run. For this situation I think we want duplicate resource documents that differ only by run_start.
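
Hypothetical queries illustrating those two access patterns; the database handle and the variables some_datum_id and run_start_uid are made up, and the Datum's 'resource' field is taken to point at its Resource uid as in event-model:

# Fill an event: resolve a datum_id from an Event back to its Resource.
datum = assets_db.datum.find_one({'datum_id': some_datum_id})
resource = assets_db.resource.find_one({'uid': datum['resource']})

# Get all the documents for a run: look Resource documents up by run_start
# directly, without walking every Event in the run.
resources = list(assets_db.resource.find({'run_start': run_start_uid}))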

@mrakitin
Member

mrakitin commented Jan 8, 2020

👍 for (2) keeping uniqueness.

@danielballan
Member Author

@gwbischof I agree the reverse-lookup issue is confusing. If we're going from Event -> Datum -> Resource without any additional context, we could end up with any Resource that has the right uid (and the right content) but any run_start, perhaps not the one belonging to the Event. Not ideal. But, if we're starting from an Event, that means we also have (or can easily get) the EventDescriptor and from there the run_start UID.

This feels convoluted but I don't see any terrible problems with it yet.

@danielballan
Member Author

I think we can do a mongoClient.db.collection.update(key, doc, upsert=true) to insert only if the doc is unique.

If we take that approach, we would never notice the situation where Datum 2 has different content than Datum 1 and thus blows away Datum 1 on insert. That shouldn't ever happen, but if it did it would be very bad.

Maybe we should implement both (2) and this and provide a switch. If the overhead of validation ever becomes a serious problem, we can elect to switch to the upsert-based un-validated mode.

@tacaswell
Contributor

My memory was that we would make them unique going forward (and tolerate duplicate Resource and Datum bodies), but tolerate the current non-uniqueness to avoid having to re-write events en masse (though we probably should eventually).

Allowing the datum id to be non-unique leaves us open to eventually emitting inconsistent datums. In the case of mongo we have a story for how we can check that, but in a more streaming / decoupled deployment we may not be able to do that check, or the exception will happen someplace far downstream that cannot easily reach back up and tell the RE that something is very wrong.

We want to insert the run id into the resource document [1], which we now do in the RE as the documents go out. As we are doing this, we are now in the situation where there will be resources with the same uid but different content. While you can say the "important" content is all the same, spreading knowledge through the entire code base of which keys are allowed to differ while two documents are still considered "the same" is fragile. If we are changing the resource uids then we also need to change the datum ids (as we want them to be 'resource_id/N' to save us a round of database lookups and to lay the path towards being able to do reasonable bulk datum processing).

This is going to require a bunch more complexity be pushed down into AD / the darkframe code, but it will save us a whole lot of complexity on the consumer side.

[1] We did not initially do this because the resource / datum went out entirely side-band and we did not want to inject the run id into ophyd objects (we could do it in stage, but then you could not stage outside of a run nor leave a device staged across runs). We could add another required method to the bluesky interface, e.g. inform_device_of_current_run(uid: str), but that would fail now that we allow there to be more than one open run. I think we would have to add an (optional) kwarg to every method to allow the RE to pass in the uid of the run under which it is currently operating the device.
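
For illustration, the 'resource_id/N' naming convention mentioned above lets a consumer recover the Resource uid from a datum_id with string manipulation alone; the values below reuse the uid from the RSOXS example earlier in the thread and a made-up counter:

# Compose a datum_id as '<resource uid>/<counter>'.
resource_uid = '1c43af30-27db-437e-83c0-38b1cc528f06'
datum_id = f'{resource_uid}/{3}'

# A consumer can recover the Resource uid without a database round trip.
assert datum_id.split('/')[0] == resource_uid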

@danielballan
Member Author

in a more streaming / decoupled deployment we may not be able to do that check or the exception will happen someplace far down stream that can not easily reach back up and tell the RE that something is very wrong.

That's very compelling. I think we did have a conversation about this path forward, and I had just forgotten.

@gwbischof
Contributor

The way mongo-embedded works now is that start, stop, resource, and descriptor documents go in the header document. This may be an issue if we don't want to duplicate resource documents.

@gwbischof
Contributor

It might be nice to know which stream a resource belongs to, so that if you want to do documents_to_xarray you can put only the Datum documents associated with that stream into the filler up front. Right now we are putting all of the Datum documents for the run into the filler.
