RF "files.py" validation/metadata-loading to support BIDS #1044
@yarikoptic I'm thinking something like this: …

Note that this assumes that, if a Dandiset contains one or more BIDS datasets that aren't at the root, then we want any non-NWB, non-Zarr files outside of the BIDS datasets to not be uploaded as assets by default.
Sounds good for a start! I dislike a little the implicit … My main concern is the combinatoric … NB: I would prefer consistent capitalization of …
@yarikoptic I'd prefer multiple inheritance to delegating to an attribute. For one thing, given an instance of an NWB in a BIDS dataset, I would expect …

Are you suggesting changing the name of the file that denotes a BIDS dataset? I don't think that's up to us.
no, I am just suggesting adding BIDS to our class/option names, as e.g. in …
hm... ok... but then let's still have the logic of the behavior centralized; maybe it should be a mixin.
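For illustration, a minimal sketch of the mixin idea being discussed; all class names and method bodies here are hypothetical stand-ins, not the actual dandi-cli classes:

```python
class LocalAsset:
    """Stand-in for the existing base class of local files."""

    def get_metadata(self) -> dict:
        return {}


class NWBAsset(LocalAsset):
    """Stand-in for the existing NWB asset class."""

    def get_metadata(self) -> dict:
        return {"subject_id": "from-nwb"}  # placeholder for real NWB extraction


class BIDSAssetMixin:
    """Centralizes BIDS behavior (dataset-root lookup, path metadata, ...)."""

    def get_bids_metadata(self) -> dict:
        return {"subject_id": "from-bids-path", "session_id": "01"}  # placeholder


class NWBBIDSAsset(BIDSAssetMixin, NWBAsset):
    """An NWB file inside a BIDS dataset: NWB metadata, complemented by BIDS."""

    def get_metadata(self) -> dict:
        meta = super().get_metadata()  # NWB extraction, reached via the MRO
        for key, value in self.get_bids_metadata().items():
            meta.setdefault(key, value)  # BIDS only fills fields NWB left unset
        return meta
```

With multiple inheritance, `isinstance(asset, NWBAsset)` still holds for an NWB file inside a BIDS dataset, which is what the comment above asks for, while the BIDS-specific logic stays centralized in one mixin.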
@jwodder will work on RFing the code per above; @TheChymera please point him to the interfaces for BIDS validation and metadata extraction he should use.
@yarikoptic @TheChymera Question: What happens if a directory inside a BIDS dataset contains its own `dataset_description.json`?
treat it as a "start of a new BIDS dataset". It will be for the validator to decide if that is a permitted location to have such a nested BIDS dataset (some, like …).
@jwodder from the point of view of the pilot DANDI support we have just merged, both would be detected as candidates, and the validation would be run for both. This is a repercussion of top-level matches being necessarily unique, i.e. unlike entity files there should only ever be one file matching one pattern... Ultimately this is a shortcoming of the validator path selection and not of our DANDI wrapper.
Fix incoming: bids-standard/bids-specification#1145
@TheChymera So, if I have a …
@jwodder if the …

```console
chymera@decohost ~/src/bids-examples $ dandi validate-bids asl003/
2022-07-19 16:53:10,234 [ WARNING] A newer version (0.45.1) of dandi/dandi-cli is available. You are using 0.44.1+45.g2ed8929.dirty
2022-07-19 16:53:10,384 [    INFO] Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-07-19 16:53:10,384 [    INFO] NumExpr defaulting to 8 threads.
2022-07-19 16:53:11,608 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
All filenames are BIDS-valid and no mandatory files are missing.
2022-07-19 16:53:12,824 [    INFO] Logs saved in /home/chymera/.cache/dandi-cli/log/20220719205309Z-9230.log
```

If you do not have the directory at hand and just a list of paths which do not exist on your system, this will only be possible after we have merged …
Currently it differs in that the keys are not fully homogenized with the DANDI nomenclature. I did a bit of standardizing here, but there are a few more keys coming in a new PR soon.
currently that can't be done on :master because we need the actual paths; however, as soon as the … (see lines 116 to 131 in 1ae0a3b)
The validator returns a list of dictionaries for all matches found, where the keys are BIDS-standard (but not DANDI-standard) metadata fields; that is the standardization I previously mentioned (see line 134 in 1ae0a3b).
AFAIK @jwodder would have access to actual paths for any purpose of validation or upload.
Then it's as simple as:

```console
chymera@decohost ~ $ cat lala.py
from dandi.validate import validate_bids

result = validate_bids("~/src/bids-examples/asl003")
for i in result["match_listing"]:
    print(i["path"])
    try:
        print(i["subject"])
        print(i["session"])
    except KeyError:
        # non-entity files (CHANGES, README, ...) carry no subject/session keys
        pass
chymera@decohost ~ $ python lala.py
2022-07-19 17:48:55,647 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Sub1
None
```
@yarikoptic Should the validation errors for a BIDS dataset be attached to the …
@yarikoptic @satra @TheChymera Also, the …
@jwodder - there isn't, since those are two distinct schemas. just like NWB has a version, i would keep the bids version as described in the dataset_description.json. and if you are referring to the bids schema as implemented in dandi, one could add validation information, including the dandi-specific bids schema version, in the provenance section of the metadata (…).
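For concreteness, one hedged reading of that suggestion, using plain dicts rather than actual dandischema models; `wasGeneratedBy` is where dandischema records provenance activities, but the nested structure below is only a sketch, not a confirmed schema:

```python
# Hypothetical shape of asset metadata carrying BIDS-validation provenance.
asset_metadata = {
    "path": "sub-Sub1/perf/sub-Sub1_asl.nii.gz",
    "wasGeneratedBy": [
        {
            "schemaKey": "Activity",
            "name": "BIDS validation",
            "wasAssociatedWith": [
                {
                    "schemaKey": "Software",
                    "name": "dandi-cli BIDS validator",
                    # the dandi-specific BIDS schema version satra mentions:
                    "version": "1.7.0+369",
                }
            ],
        }
    ],
}
```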
Short overall answer before I outline a possible design I see: I think errors associated with paths should be available (assigned or query-able) for those specific paths, and it should be possible to get all errors across all paths of the dataset (#943 relates, as it might need a tune-up depending on what we come up with here or there ;)). ATM, …

Extracted as a separate https://github.com//issues/1082 now.

I thought I had described it here or elsewhere but I failed to find it, so I am writing up my thinking here. I think we overall have to work with the following structure. I was not thinking yet on how and whether it should relate to existing Dandiset interfaces; maybe consider them all below …
So, I am thinking that for upload/validation, given the optional paths (limiting what we can find), we first figure out the Dandiset (location of …). Relating to WiP in #1076, … Something like that. WDYT?
don't worry about BIDS version here! And looking at how we "use" …
@yarikoptic I'm thinking that such a restructuring of the classes should happen in a separate issue/PR. For right now, if we want individual BIDS assets to have their own validation errors (with BIDS dataset-wide errors just associated with the …
done - #1082
For consistency with how NWBs are handled, and since ATM it is allowed -- easiest is to do it at the point of getting to that file during upload. It would cause some avoidable duplicate "find the root of the BIDS dataset" work, but that could (if not already) be cached, so hopefully it shouldn't add too much of a penalty. In the course of the #1082 refactoring, as I have mentioned, for BIDS such validation might need to happen at the level of the BIDS dataset first, before any file from it gets considered for upload. We might even later introduce a policy/behavior to disallow upload of DANDI or BIDS datasets altogether by default if any file fails validation (it is not the current behavior).
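A minimal sketch of that cached lookup (hypothetical helper, not the actual dandi-cli code), assuming a BIDS dataset root is any ancestor directory holding a `dataset_description.json`:

```python
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=None)
def find_bids_dataset_root(directory: Path) -> Optional[Path]:
    """Return the nearest directory at or above `directory` that contains
    a dataset_description.json, or None if there is no enclosing BIDS dataset."""
    if (directory / "dataset_description.json").is_file():
        return directory
    if directory.parent == directory:  # reached the filesystem root
        return None
    return find_bids_dataset_root(directory.parent)
```

Because the recursion caches every intermediate directory, sibling files in the same dataset resolve through near-free cache hits rather than repeated stat calls.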
@TheChymera Does the structure returned by …
ping @TheChymera. After all current BIDS work is merged, we really need to work out / establish a more "accessible" interface to BIDS validation (proper "error records", #943) -- the currently returned …

@jwodder - consider the following keys in the returned dict (@TheChymera please expand/fix if I am incomplete/wrong on any of that): …

The deeper meaning of those keys can be gathered by looking at the reporting code at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/bids_validator_xs.py#L513
@TheChymera @yarikoptic Should files listed in …
Yes AFAIK, treat as error.
the description is correct. Sorry if it seemed convoluted; what it's doing is eroding the "tracking" list with each match.
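To make the erosion idea concrete, a toy re-implementation of that logic (not the actual validator code; the pattern and path names below are made up):

```python
import re


def erode_tracking_list(paths, regexes):
    """Remove every path accounted for by some regex; leftovers are unmatched."""
    tracking = list(paths)  # the "tracking" list of not-yet-matched paths
    for pattern in regexes:
        rx = re.compile(pattern)
        tracking = [p for p in tracking if not rx.search(p)]  # erode on match
    return tracking


leftover = erode_tracking_list(
    ["sub-01/anat/sub-01_T1w.nii.gz", "scratch/notes.txt"],
    [r"sub-[^/]+/anat/sub-[^/]+_T1w\.nii\.gz$"],
)
assert leftover == ["scratch/notes.txt"]  # i.e. a validation-error candidate
```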
@jwodder you can pass a list of files to … (see dandi/support/bids/validator.py, line 730 in 468e39c).

The get-file logic will be run, but it will bypass most of the checks (see dandi/support/bids/validator.py, lines 61 to 63 in 468e39c).

If you want to not run the get-file logic at all (in which case you can also use paths which don't exist locally), you can use the …
I believe I've properly integrated validation in b83ba3f. @TheChymera @satra @yarikoptic Could someone please explain exactly how to acquire BIDS metadata and integrate it into Dandi metadata?
@jwodder I had some basic support for that here (see lines 124 to 128 in 1c94736). Basically the keys in the …
@TheChymera That involves running …
@jwodder to do that you need a set of keys, the values of which unambiguously identify the file:

```console
chymera@decohost ~ $ cat /tmp/lala.py
from dandi.validate import validate_bids

result = validate_bids("~/src/bids-examples/asl003")
print("These are the valid paths:")
for i in result["match_listing"]:
    print(i["path"])
selection = "/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz"
print(f"Now let's take one and get the info: {selection}")
print([i for i in result["match_listing"] if i["path"] == selection])
chymera@decohost ~ $ python /tmp/lala.py
2022-07-27 16:15:57,799 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
These are the valid paths:
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Now let's take one and get the info: /home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
[{'subject': 'Sub1', 'session': None, 'acquisition': None, 'reconstruction': None, 'direction': None, 'run': None, 'path': '/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz'}]
```
@TheChymera @yarikoptic @satra The code linked by Horea above only applies BIDS metadata to NWB files, and it replaces the normal NWB metadata. Are either of those things that we want to do/keep doing?
the normal metadata for NWB should not be replaced, nor should the bids metadata be applied to NWB files. if it's bids, then there is the possibility of an NWB file inside it, so it should also do the related NWB processing for those files. however, setting that aside, any bids path should result in metadata that augments all the standard metadata we have for an asset. the original use of the metadata dictionary was to support organize, and that would be irrelevant for a bids dandiset.
From my local testing with …
Yes, this is true. It doesn't replace it, though; it just doesn't go on to source the NWB data at all if it's a BIDS dataset. Perhaps this isn't such a bad idea? i.e. if it's a BIDS dataset we source the information via the BIDS standard.
@satra So: …
nwb files have had two notions of metadata: one that is extracted from the file, and then the Asset structure (which is what i would call metadata). are you talking about the former or both?

for the current flow: nwb --> metadata --> Asset metadata

for bids it would be: …

perhaps the refactoring can simply create one metadata structure, the Asset structure, which can then be used for other things (ls, organize, validate, etc.).
I'm talking about the final …

Describe these steps in detail.
i would leave that to @TheChymera and can review it, but describing those steps would be almost equivalent to writing the code. essentially one has to follow the details of the bids-specification here and map each field or piece of information into the bareasset structure.
This would be for @TheChymera to clarify.

Given the current (yet absent overall) agreement in bids-standard/bids-specification#761 (comment) about overloading of metadata in sidecar files (and thus in the file path), I would say: for extraction (not validation), take the metadata contained in the .nwb file, and only complement it (add fields which have no non-None value) with the metadata from the BIDS filename (we aren't even getting anything from the sidecar yet, right @TheChymera?).
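A small sketch of that complement rule (the helper name is hypothetical), applied to plain metadata dicts:

```python
def complement_metadata(nwb_meta: dict, bids_meta: dict) -> dict:
    """NWB-extracted values win; BIDS path values only fill absent/None fields."""
    merged = dict(nwb_meta)
    for key, value in bids_meta.items():
        if merged.get(key) is None:  # field missing or None -> take the BIDS value
            merged[key] = value
    return merged


# e.g. subject_id stays as the NWB value, session_id comes from the BIDS path:
meta = complement_metadata(
    {"subject_id": "mouse001", "session_id": None},
    {"subject_id": "Sub1", "session_id": "01"},
)
assert meta == {"subject_id": "mouse001", "session_id": "01"}
```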
@yarikoptic @TheChymera What about things like BIDS' "subject" field, which the code currently maps to our "subject_id" field? Should the metadata for a BIDS NWB asset use the subject_id from the NWB, from BIDS, or what?
per my comment above, let's take it from the .nwb for now for consistency (unless empty there). It would be for (our) validator to complain if there is an inconsistency, and it will be for .nwb overloading (WiP) to be used to "fix it" if needed.
@yarikoptic @TheChymera Based on what Horea's posted so far, the BIDS metadata contains the following fields: subject, session, acquisition, reconstruction, direction, run, and path.

Just where in a …
yes, no files are being opened. I don't have any strong opinions either way, really, but the reason why I suggested going BIDS-first (or rather path-first) is for what I thought was consistency/usability. We don't get any metadata from the NIfTI headers either, so in addition to privileging NWB over BIDS we'd be privileging NWB over NIfTI. Not least of all, path metadata can be easily sourced without downloading any files... But if you disagree, don't mind me :) just explaining why I suggested BIDS first.
Those are the fields for a file with that specific suffix. In total, BIDS files can have any of the following entities, though they might be …

```console
chymera@darkhost ~/src/bids-specification/src/schema $ ag "^[a-z]*?:" objects/entities.yaml
6:acquisition:
28:atlas:
38:ceagent:
49:chunk:
58:density:
69:description:
79:direction:
88:echo:
101:flip:
113:hemisphere:
125:inversion:
137:label:
148:modality:
157:mtransfer:
171:part:
195:processing:
209:reconstruction:
218:recording:
228:resolution:
239:run:
256:sample:
268:session:
289:space:
307:split:
329:stain:
341:subject:
348:task:
358:tracer:
```
@jwodder I do not think we map much from BIDS to our metadata ATM. Just use what we have at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/metadata.py#L124:

```python
_meta = validate_bids(path)
meta = _meta["match_listing"][0]
meta["bids_schema_version"] = _meta["bids_schema_version"]
meta = _rename_bids_keys(meta)
```

Maybe RF it into a function.
@yarikoptic - for mapping to bareasset the function should map subject, session, and modality info. i would suggest a bids2asset function similar to the nwb2asset function.
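A hedged sketch of such a `bids2asset` helper, by analogy with `nwb2asset` and reusing the `_rename_bids_keys` sketch above; it builds a plain dict rather than a real `BareAsset`, since the exact schema mapping is the open question here:

```python
def bids2asset(bids_meta: dict, **extra) -> dict:
    """Assemble asset-metadata fields from a single validator match record."""
    asset = _rename_bids_keys(bids_meta)  # subject/session -> DANDI names
    # drop entity keys the validator reports as absent (None) for this file
    asset = {key: value for key, value in asset.items() if value is not None}
    asset.update(extra)  # caller-supplied fields such as path, size, digest
    return asset
```

e.g. `bids2asset(match, path="sub-Sub1/perf/sub-Sub1_asl.nii.gz")` for one of the `match_listing` records shown earlier.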
@yarikoptic Using that code will result in only subject and session being set, as those are the only keys that are mapped to something that …
Where/how is modality represented in the Dandi schema?
Assuming this function returns a …
Let's start with that. We will improve upon this later. I will leave it to @satra to clarify about modality.
@TheChymera @satra So now I know how to determine metadata for BIDS assets, but what should the Dandi metadata be for the `dataset_description.json` file itself?
@jwodder I actually don't think it would have any; it's a metadata file, so perhaps it should be handled in the same way as …
i would say this should apply: `bareasset_dict.update(**nwb_asset_dict)`

for now treat dataset_description.json as just another asset. we could help populate it from dandiset metadata, but that's a separate feature in my opinion. we are also not doing anything for all other json and tsv files at the moment. modality, sessions/samples/participants.tsv can come in a separate PR.
#1011 is adding basic metadata extraction for BIDS datasets. Introduced in #1011 is more of a "hack" than a proper addition of support for BIDS datasets. It is 'ad-hoc' in part due to the clear separation of "asset types" in https://github.com/dandi/dandi-cli/blob/HEAD/dandi/files.py, e.g. from `NWBAsset` (with custom metadata extraction and validation) vs `VideoAsset` (nothing special ATM) to `GenericAsset` (really nothing special ;)). With the introduction of support for BIDS datasets it gets tricky:

- a BIDS dataset is identified by a `dataset_description.json` with `BIDSVersion` in it, so a dandiset can contain multiple BIDS (sub)datasets
- for … `sub-1_slice-1.nwb` it would likely come from `sub-1_slice-1.overwrite.json` (if present)
- for … `sub-1_slice-1.nwb` it could come from `sub-1_slice-1.json`
- it is for the `validator` to complain whenever there is incongruence between different sources of metadata

@jwodder -- how `files.py` and anything else needed should be refactored so we support such multiple sources of metadata: file format based + BIDS.