
Revisit uniqueness of Datum and Resource. #28

Open
danielballan opened this issue Jan 8, 2020 · 14 comments

Comments

@danielballan
Member

danielballan commented Jan 8, 2020

I am recording---belatedly---some phone and in-person conversations with @tacaswell and @gwbischof.

It is possible for the same exposure to be useful across multiple Runs. One prominent example is a dark frame, which might be taken during one Run and then referenced (reused) by another Run. In these situations, we re-emit the same Datum and Resource documents. This breaks assumptions that had leaked in:

  • The Datum field datum_id is no longer unique
  • The Resource field uid is no longer unique, though the pair (run_start, uid) is unique when run_start is defined. (run_start was a late addition to the schema and is still optional.)

Rejected Alternatives

We could issue new Datum and Resource documents with fresh unique IDs but the same underlying contents. I'm not sure I recall all the reasons that this was rejected, but it seems it could lead to bloat with many identical copies of Datum documents being stored.

Implementation

Currently, an index on the Datum collection expects datum_id to be unique. Should we:

  1. Remove the uniqueness constraint and accept duplicates, potentially adding a pruning tool in the future to remove them.
  2. Continue to require uniqueness. When a datum_id collision occurs, catch the error and verify that the full contents (not just the datum_id) of the existing document and the would-be new document are identical. If they are, pass. If they are not, raise an error ("Two Datum with the same datum_id must have identical contents!").

(2) would be slower on insert but seems better to me because it does the validation up front.
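
To make the two options concrete, here is a rough sketch of the corresponding pymongo index definition; the helper and collection names are illustrative, not the actual databroker internals:

import pymongo

def create_datum_index(collection, require_unique):
    # Option 1 (require_unique=False): no uniqueness constraint; duplicate
    # Datum documents are accepted and could be pruned later.
    # Option 2 (require_unique=True): inserting a colliding datum_id raises
    # pymongo.errors.DuplicateKeyError, which the caller catches and validates.
    collection.create_index([('datum_id', pymongo.ASCENDING)], unique=require_unique)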

@jklynch
Contributor

jklynch commented Jan 8, 2020

I don't understand what happens to the "new" document in option 2 when a datum_id collision occurs for identical documents.

@gwbischof
Contributor

An example of duplicates at RSOXS:


In [2]:     oldclient = pymongo.MongoClient("mongodb://xf07id1-ws18:27017/")             
   ...:     old_assets_db = oldclient["rsoxs-assets-store"]                              
   ...:     old_meta_db = oldclient["rsoxs-metadata-store"]                                                                                   

In [3]: list(old_meta_db.resource.find({'uid': '1c43af30-27db-437e-83c0-38b1cc528f06'}))                                                      
Out[3]: []

In [4]: list(old_assets_db.resource.find({'uid': '1c43af30-27db-437e-83c0-38b1cc528f06'}))                                                    
Out[4]: 
[{'_id': ObjectId('5d540d4cfb414042494a3a0e'),
  'spec': 'AD_TIFF',
  'resource_path': 'data/2019/08/14',
  'root': '/DATA/images',
  'resource_kwargs': {'template': '%s%s_%6.6d.tiff',
   'filename': 'e1f402f1-10bd-4685-a4d4',
   'frame_per_point': 1},
  'path_semantics': 'posix',
  'uid': '1c43af30-27db-437e-83c0-38b1cc528f06',
  'run_start': '4fa42813-0b56-4eee-a577-2125cd9f6c24'},
 {'_id': ObjectId('5d540d7bfb414042494a3a5e'),
  'spec': 'AD_TIFF',
  'resource_path': 'data/2019/08/14',
  'root': '/DATA/images',
  'resource_kwargs': {'template': '%s%s_%6.6d.tiff',
   'filename': 'e1f402f1-10bd-4685-a4d4',
   'frame_per_point': 1},
  'path_semantics': 'posix',
  'uid': '1c43af30-27db-437e-83c0-38b1cc528f06',
  'run_start': '4fa42813-0b56-4eee-a577-2125cd9f6c24'},
 {'_id': ObjectId('5d540db9fb414042494a3aa2'),
  'spec': 'AD_TIFF',
  'resource_path': 'data/2019/08/14',
  'root': '/DATA/images',
  'resource_kwargs': {'template': '%s%s_%6.6d.tiff',
   'filename': 'e1f402f1-10bd-4685-a4d4',
   'frame_per_point': 1},
  'path_semantics': 'posix',
  'uid': '1c43af30-27db-437e-83c0-38b1cc528f06',
  'run_start': 'd1fc0b67-0fcc-4239-b1c5-cdbc15a27e2b'}]

@danielballan
Member Author

danielballan commented Jan 8, 2020

Suppose two Runs in a row try to insert identical Datum documents, as would happen if the first Run took a dark frame and the next Run reused it.

In the first case, the database won't have seen that datum_id before, so it will go in. In the second case, we'll try to insert it and pymongo will raise an error complaining about the duplicate because datum_id is supposed to be unique. We can do something like this:

from pymongo.errors import DuplicateKeyError

def insert_datum(doc):
    try:
        mongo_datum_collection.insert_one(doc)
    except DuplicateKeyError:
        # The database already has a Datum with this datum_id.
        # It has probably been inserted before. But just to be sure that something
        # hasn't gone terribly wrong (a different document but a colliding datum_id)
        # let's compare the existing document to the one we tried to insert.
        # If they are the same, all is well.
        prior_doc = mongo_datum_collection.find_one(
            {'datum_id': doc['datum_id']}, {'_id': False})
        # insert_one may have added an '_id' key to doc; ignore it in the comparison.
        if prior_doc != {k: v for k, v in doc.items() if k != '_id'}:
            # If this happens, something has gone *very* wrong with the code that
            # generates the documents.
            raise ValueError(
                "Attempted to insert Datum with a datum_id that already exists "
                "but different content!")

[Edited to fix formatting of pseudocode]

@jklynch
Contributor

jklynch commented Jan 8, 2020

So the "new" document is not saved, since it is identical to the "prior" document which is already in mongo?

@danielballan
Member Author

Exactly.

@gwbischof
Contributor

gwbischof commented Jan 8, 2020

I think we can do a mongoClient.db.collection.update(key, doc, upsert=true) to insert only if the doc is unique.
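
For reference, a minimal sketch of that upsert idea with current pymongo, using replace_one (update() is deprecated) and reusing the illustrative collection name from the sketch above:

def upsert_datum(doc):
    # Insert the document if no Datum with this datum_id exists yet;
    # otherwise silently replace the existing one.
    mongo_datum_collection.replace_one(
        {'datum_id': doc['datum_id']}, doc, upsert=True)

Note that this never compares the contents of the existing document, which is the trade-off discussed further down the thread.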

@gwbischof
Contributor

gwbischof commented Jan 8, 2020

Thoughts about duplicate resources:
For reading data I think we want to do two things: fill an event, and get all of the documents for a run.

Fill an event:
If we have multiple resources with the same uid, when we fill an event we won't be sure which resource document to use, because we look up the resource by uid. Unique resource uids would be best here.

Get all the documents for a run:
I think that we want to look up resource documents by run_start, so that we don't have to iterate through all of the events to find all of the resources for the run. For this situation I think we want duplicate resource documents that differ only by run_start.
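
Hypothetical queries illustrating those two access patterns; the database handle and the variables some_datum_id and run_start_uid are made up, and the Datum's 'resource' field is taken to point at its Resource uid as in event-model:

# Fill an event: resolve a datum_id from an Event back to its Resource.
datum = assets_db.datum.find_one({'datum_id': some_datum_id})
resource = assets_db.resource.find_one({'uid': datum['resource']})

# Get all the documents for a run: look Resource documents up by run_start
# directly, without walking every Event in the run.
resources = list(assets_db.resource.find({'run_start': run_start_uid}))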

@mrakitin
Member

mrakitin commented Jan 8, 2020

👍 for (2) keeping uniqueness.

@danielballan
Member Author

@gwbischof I agree the reverse-lookup issue is confusing. If we're going from Event -> Datum -> Resource without any additional context, we could end up with any Resource that has the right uid (and the right content) but any run_start, perhaps not the one belonging to the Event. Not ideal. But, if we're starting from an Event, that means we also have (or can easily get) the EventDescriptor and from there the run_start UID.

This feels convoluted but I don't see any terrible problems with it yet.

@danielballan
Member Author

I think we can do a mongoClient.db.collection.update(key, doc, upsert=true) to insert only if the doc is unique.

If we take that approach, we would never notice the situation where Datum 2 has different content than Datum 1 and thus blows away Datum 1 on insert. That shouldn't ever happen, but if it did it would be very bad.

Maybe we should implement both (2) and this and provide a switch. If the overhead of validation ever becomes a serious problem, we can elect to switch to the upsert-based un-validated mode.

@tacaswell
Contributor

My memory was that we would make them unique going forward (and tolerate duplicate Resource and Datum bodies), but tolerate the current non-uniqueness to avoid having to re-write events en masse (though we probably should eventually).

Allowing the datum id to be non-unique leaves us open to eventually emitting inconsistent datums. In the case of mongo we have a story for how we can check that, but in a more streaming / decoupled deployment we may not be able to do that check, or the exception will happen someplace far downstream that cannot easily reach back up and tell the RE that something is very wrong.

We want to insert the run id into the resource document [1], which we now do in the RE as the documents go out. As we are doing this, we are now in the situation where there will be resources with the same uid but different content. While you can say the "important" content is all the same, spreading knowledge through the entire code base of which keys are allowed to differ while two documents are still considered "the same" is fragile. If we are changing the resource uids then we also need to change the datum ids (as we want them to be 'resource_id/N' to save us a round of database lookups and to lay the path towards being able to do reasonable bulk datum processing).

This is going to require a bunch more complexity be pushed down into AD / the darkframe code, but it will save us a whole lot of complexity on the consumer side.

[1] We did not initially do this because the resource / datum went out entirely side-band and we did not want to inject the run id into ophyd objects (we could do it in stage, but then you could not stage outside of a run nor leave a device staged across runs). We could add another required method to the bluesky interface, e.g. inform_device_of_current_run(uid: str), but that would fail now that we allow there to be more than one open run. I think we would have to add an (optional) kwarg to every method to allow the RE to pass in the uid of the run under which it is currently operating the device.
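
For illustration, the 'resource_id/N' naming convention mentioned above lets a consumer recover the Resource uid from a datum_id with string manipulation alone; the values below reuse the uid from the RSOXS example earlier in the thread and a made-up counter:

# Compose a datum_id as '<resource uid>/<counter>'.
resource_uid = '1c43af30-27db-437e-83c0-38b1cc528f06'
datum_id = f'{resource_uid}/{3}'

# A consumer can recover the Resource uid without a database round trip.
assert datum_id.split('/')[0] == resource_uid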

@danielballan
Member Author

in a more streaming / decoupled deployment we may not be able to do that check or the exception will happen someplace far down stream that can not easily reach back up and tell the RE that something is very wrong.

That's very compelling. I think we did have a conversation about this path forward, and I had just forgotten.

@gwbischof
Contributor

The way mongo-embedded works now is that start, stop, resource, and descriptor documents go in the header document. This may be an issue if we don't want to duplicate resource documents.

@gwbischof
Contributor

It might be nice to know which stream a resource belongs to, so that if you want to do documents_to_xarray you can put only the Datum documents associated with that stream into the filler up front. Right now we are putting all of the Datum documents for the run into the filler.
