
RF "files.py" validation/metadata-loading to support BIDS #1044

Closed
yarikoptic opened this issue Jul 5, 2022 · 55 comments · Fixed by #1076

@yarikoptic (Member)

#1011 adds basic metadata extraction for BIDS datasets. What it introduces is more of a "hack" than proper support for BIDS datasets. It is ad hoc in part due to the clear separation of "asset types" in https://github.com/dandi/dandi-cli/blob/HEAD/dandi/files.py, e.g. NWBAsset (with custom metadata extraction and validation) vs. VideoAsset (nothing special ATM) vs. GenericAsset (really nothing special ;)). With the introduction of support for BIDS datasets it gets tricky:

  • we need to upload pretty much every file (not just .nwb) if it is found to be part of a BIDS dataset
    • we decide that it is a BIDS dataset if there is a dataset_description.json with BIDSVersion in it (see the sketch at the end of this comment)
  • we might have "super-BIDS datasets" like https://dandiarchive.org/dandiset/000026/draft/files?location= where we have the following hierarchy within a dandiset:

        derivatives/<some subdatasets, some of which are BIDS>/
        rawdata/ - BIDS dataset

    so a dandiset can contain multiple BIDS (sub)datasets

  • There are multiple "files" from which metadata could be loaded. Below I outline 3 possible ways, but most likely we would offload to both 1 - file-format specific (nwb and its .overwrite.json) + 2 - BIDS specific (using the BIDS library), with BIDS overloading what the prior one provided. But here are the details:
    • metadata-precedence-1: metadata will/can come from the filename in addition to being extracted from the data file. .nwb files are legit within BIDS datasets, so NWBAsset by itself does not describe the entirety of the case. And for an NWBAsset belonging to a BIDS dataset we would want filename-based metadata to overload what is in the file.
    • metadata-precedence-2: NWB folks are working on introducing overlays support (WiP, not yet finalized). So for sub-1_slice-1.nwb it would likely come from sub-1_slice-1.overwrite.json (if present).
    • metadata-precedence-3: metadata can come from a BIDS sidecar file, e.g. for sub-1_slice-1.nwb it could come from sub-1_slice-1.json
    • note: it will be up to the validator to complain whenever there is incongruence between different sources of metadata

@jwodder -- how should files.py and anything else needed be refactored so that we support such multiple sources of metadata: file-format based + BIDS?
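
A minimal sketch of the detection rule mentioned above (hypothetical helper, not existing dandi-cli code; the rule itself -- dataset_description.json with a BIDSVersion key -- is from this comment):

import json
from pathlib import Path

def is_bids_dataset(directory: Path) -> bool:
    """A directory starts a BIDS dataset iff it holds a
    dataset_description.json declaring a BIDSVersion."""
    dd = directory / "dataset_description.json"
    if not dd.is_file():
        return False
    try:
        return "BIDSVersion" in json.loads(dd.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False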

@jwodder (Member) commented Jul 7, 2022

@yarikoptic I'm thinking something like this (a code sketch follows at the end of this comment):

  • Add a DatasetDescriptionAsset(LocalFileAsset) class for representing dataset_description.json files
    • Any BIDS dataset-wide operations like metadata generation and validation are performed as methods of this class
  • Add a BIDSFileAsset(LocalFileAsset) class for representing any other file underneath a directory with a dataset_description.json; instances will have an attribute storing the corresponding DatasetDescriptionAsset
  • Add a NWBBidsAsset(NWBAsset, BIDSFileAsset) class for representing NWBs in BIDS datasets, with any necessary custom behavior
  • Modify find_dandi_files() to check whether each directory contains a dataset_description.json and, if it does, produce the appropriate asset classes for the files within
  • Give dandi_file a dataset_description: Optional[DatasetDescriptionAsset] = None argument for use when the file is known to reside within a BIDS dataset; if this argument is None, it is assumed to not be in one

Note that this assumes that, if a Dandiset contains one or more BIDS datasets that aren't at the root, then we want any non-NWB, non-Zarr files outside of the BIDS datasets to not be uploaded as assets by default.
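
A minimal sketch of that hierarchy (DatasetDescriptionAsset, BIDSFileAsset, and NWBBidsAsset are the names proposed above, not existing dandi-cli API; LocalFileAsset and NWBAsset exist in dandi/files.py; method bodies elided):

from dandi.files import LocalFileAsset, NWBAsset  # existing classes in files.py

class DatasetDescriptionAsset(LocalFileAsset):
    """Represents dataset_description.json; owns BIDS dataset-wide
    operations such as metadata generation and validation."""
    def get_validation_errors(self) -> list[str]:
        ...  # run the BIDS validator over the whole dataset

class BIDSFileAsset(LocalFileAsset):
    """Any other file under a directory with a dataset_description.json."""
    dataset_description: DatasetDescriptionAsset  # the owning BIDS dataset

class NWBBidsAsset(NWBAsset, BIDSFileAsset):
    """An NWB file inside a BIDS dataset; custom combined behavior goes here."""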

@yarikoptic (Member Author)

Sounds good for a start! I dislike a little that an implicit DatasetDescriptionAsset implies having a BIDS dataset. I guess we can start with that since I don't have a better idea ATM. Let's just make it more explicit: BIDSDatasetDescriptionAsset and bids_dataset_description.

My main concern is the combinatoric NWBBidsAsset(NWBAsset, BIDSFileAsset), since we might then need to also do it for any other asset type, such as VideoAsset, and get VideoBidsAsset. I guess we can keep doing this until we run into some real explosion that would instead require establishing some more flexible construct. But what would you say about having CompositeBIDSFileAsset(BIDSFileAsset), which would just be parametrized at instantiation with an instance of the underlying NWBAsset (or VideoAsset) that could be asked for its metadata/validation (see the sketch at the end of this comment)? This way it would delegate to some "underlying" data type consistently across all such combinations without requiring multiple inheritance. WDYT?

NB I would prefer consistent capitalization of BIDS so to have NWBBIDSAsset even though parsing between two acronyms becomes impossible.
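
A minimal sketch of that delegation alternative (CompositeBIDSFileAsset is the name proposed above; BIDSFileAsset/LocalFileAsset as sketched in the earlier comment, and the get_metadata()/get_bids_metadata() signatures are assumptions):

class CompositeBIDSFileAsset(BIDSFileAsset):
    """Wraps an underlying format-specific asset instead of inheriting from it."""
    def __init__(self, inner: LocalFileAsset, **kwargs) -> None:
        super().__init__(**kwargs)
        self.inner = inner  # e.g. an NWBAsset or VideoAsset instance

    def get_metadata(self) -> dict:
        metadata = self.inner.get_metadata()       # format-specific extraction
        metadata.update(self.get_bids_metadata())  # BIDS-derived fields overload it
        return metadata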

@jwodder (Member) commented Jul 12, 2022

@yarikoptic I'd prefer multiple inheritance to delegating to an attribute. For one thing, given an instance of an NWB in a BIDS dataset, I would expect isinstance(nwb, NWBAsset) and isinstance(nwb, BIDSFileAsset) to both be true.

Let's just make it more explicit ... bids_dataset_description.

Are you suggesting changing the name of the file that denotes a BIDS dataset? I don't think that's up to us.

@yarikoptic (Member Author)

Let's just make it more explicit ... bids_dataset_description.

Are you suggesting changing the name of the file that denotes a BIDS dataset? I don't think that's up to us.

no, I am just suggesting adding "BIDS" to our class/option names, e.g. in the dataset_description: Optional[DatasetDescriptionAsset] = None argument you suggested.

@yarikoptic I'd prefer multiple inheritance to delegating to an attribute. For one thing, given an instance of an NWB in a BIDS dataset, I would expect isinstance(nwb, NWBAsset) and isinstance(nwb, BIDSFileAsset) to both be true.

hm... ok... but then let's still have the logic of the behavior centralized; maybe it should be a mixin, NWBBidsAsset(NWBAsset, BIDSFileAsset, CompositeBIDSFileAssetMixIn), where the CompositeBIDSFileAssetMixIn centralizes the logic of preferring BIDSFileAsset-extracted metadata and doing the validation for consistency? (See the sketch below.)
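
A minimal sketch of that mixin idea (names are from this comment; note the mixin is listed first here so that its super() call reaches NWBAsset in the MRO, a small deviation from the ordering written above):

class CompositeBIDSFileAssetMixIn:
    """Centralizes the combine-BIDS-and-format-metadata logic."""
    def get_metadata(self) -> dict:
        fmt_md = super().get_metadata()                  # e.g. NWBAsset extraction
        fmt_md.update(BIDSFileAsset.get_metadata(self))  # prefer BIDS-extracted fields
        return fmt_md

class NWBBidsAsset(CompositeBIDSFileAssetMixIn, NWBAsset, BIDSFileAsset):
    pass  # VideoBidsAsset etc. would be equally thin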

@yarikoptic (Member Author)

@jwodder will work on RFing the code per the above; @TheChymera, please point him to the interfaces for BIDS validation and metadata extraction that he should use.

@jwodder (Member) commented Jul 13, 2022

@yarikoptic @TheChymera Question: What happens if a directory contains a dataset_description.json file and one of its (possibly nested) subdirectories also contains a dataset_description.json file? Does the inner file start a new BIDS dataset, or is it treated as just another file of the outer dataset?

@yarikoptic (Member Author)

Treat it as the start of a new BIDS dataset. It will be for the validator to decide whether that is a permitted location for such a nested BIDS dataset (some, like rawdata/ or any directory under a derivatives/ folder, are OK to be nested BIDS datasets, and may in turn have other nested BIDS datasets). A sketch of that discovery rule follows.
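
A minimal sketch of that rule (hypothetical helper, not existing dandi-cli code): every directory containing a dataset_description.json starts its own dataset, and whether the nesting is permitted is left to the validator.

import os

def find_bids_dataset_roots(root: str) -> list[str]:
    """Return every directory under `root` (inclusive) that starts a BIDS
    dataset, i.e. contains a dataset_description.json."""
    return [
        dirpath
        for dirpath, _dirnames, filenames in os.walk(root)
        if "dataset_description.json" in filenames
    ]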

@TheChymera (Contributor) commented Jul 14, 2022

@jwodder from the point of view of the pilot DANDI support we have just merged, both would be detected as candidates, and the validation would be run for both.
However, running the validator on the parent directory will also index the child directory and will flag any duplicate “top level” files (such as README) as invalid.

This is a repercussion of top level matches being necessarily unique, i.e. unlike entity files there should only ever be one file matching one pattern...

Ultimately this is a shortcoming of the validator path selection and not of our DANDI wrapper.
I'll work on fixing it in the validator code.

@TheChymera (Contributor)

Fix incoming: bids-standard/bids-specification#1145

@jwodder (Member) commented Jul 19, 2022

@TheChymera So, if I have a dataset_description.json file and a list of all files in a BIDS dataset:

  • How do I validate the dataset/assets?
  • How does the DANDI metadata for the assets differ from non-BIDS assets?

@TheChymera (Contributor)

@jwodder if the dataset_description.json file is inside the BIDS directory, you would just point the validator to the BIDS directory, e.g.:

chymera@decohost ~/src/bids-examples $ dandi validate-bids asl003/
2022-07-19 16:53:10,234 [ WARNING] A newer version (0.45.1) of dandi/dandi-cli is available. You are using 0.44.1+45.g2ed8929.dirty
2022-07-19 16:53:10,384 [    INFO] Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-07-19 16:53:10,384 [    INFO] NumExpr defaulting to 8 threads.
2022-07-19 16:53:11,608 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
All filenames are BIDS-valid and no mandatory files are missing.
2022-07-19 16:53:12,824 [    INFO] Logs saved in /home/chymera/.cache/dandi-cli/log/20220719205309Z-9230.log

If you do not have the directory at hand, and just a list of paths which do not exist on your system, this will only be possible after dummy_path support is merged.

How does the DANDI metadata for the assets differ from non-BIDS assets?

Currently it differs in that the keys are not fully homogenized with the DANDI nomenclature. I did a bit of standardizing here, but there are a few more keys coming in a new PR soon.

@jwodder (Member) commented Jul 19, 2022

@TheChymera

  • I meant, how would I do the validation using the Python API, not the CLI?
  • How should I generate the metadata for BIDS assets?

@TheChymera (Contributor) commented Jul 19, 2022

@jwodder

I meant, how would I do the validation using the Python API, not the CLI?

currently that can't be done on :master because we need the actual paths; however, as soon as the dummy_paths parameter is in, you could do it as I have in this PR (it's the same PR which implements dummy path support ahead of upstream):

from dandi.dandiapi import DandiAPIClient
from dandi.validate import validate_bids

# `dandi_id` (the Dandiset identifier) and the logger `lgr` are defined
# elsewhere in the PR this snippet is quoted from.
with DandiAPIClient.for_dandi_instance("dandi") as client:
    dandiset = client.get_dandiset(dandi_id)
    path_list = []
    for asset in dandiset.get_assets():
        i = f"dummy/{asset.path}"
        if "sub-MITU01h3" in i and "sub-MITU01" in i:
            lgr.warning("Fixing subject field inconsistencies:")
            lgr.warning(" - Pre-repair: %s", i)
            i = i.replace("sub-MITU01h3", "sub-MITU01")
            lgr.warning(" + Post-repair: %s", i)
        # ome.zarr support pending:
        # https://github.com/dandi/dandi-cli/pull/1050
        if "ome.zarr" not in i:
            path_list.append(i)
    result = validate_bids(path_list, dummy_paths=True)

How should I generate the metadata for BIDS assets?

The validator returns a list of dictionaries for all matches found, where the keys are BIDS-standard (but not DANDI-standard) metadata fields; that is the standardization I previously mentioned:

match_listing = result["match_listing"]

@yarikoptic (Member Author)

I meant, how would I do the validation using the Python API, not the CLI?

currently that can't be done on :master because we need the actual paths

AFAIK @jwodder would have access to actual paths for any purpose of validation or upload

@TheChymera (Contributor) commented Jul 19, 2022

AFAIK @jwodder would have access to actual paths for any purpose of validation or upload

Then it's as simple as:

chymera@decohost ~ $ cat lala.py
from dandi.validate import validate_bids
result = validate_bids("~/src/bids-examples/asl003")
for i in result["match_listing"]:
    print(i["path"])
    try:
        print(i["subject"])
        print(i["session"])
    except KeyError:
        pass
chymera@decohost ~ $ python lala.py
2022-07-19 17:48:55,647 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
Sub1
None
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Sub1
None

@jwodder (Member) commented Jul 21, 2022

@yarikoptic Should the validation errors for a BIDS dataset be attached to the dataset_description.json object or to the relevant assets (for those that are asset-specific)? Note that the latter will be much harder to do when processing upload assets in parallel.

@jwodder (Member) commented Jul 21, 2022

@yarikoptic @satra @TheChymera Also, the get_validation_errors() method of asset objects takes a schema_version argument denoting the version of the Dandi metadata schema to use. Is there a way to map this version to a BIDS schema version?

@satra (Member) commented Jul 21, 2022

Is there a way to map this version to a BIDS schema version?

@jwodder - there isn't, since those are two distinct schemas. just like NWB has a version, i would keep the bids version as described in dataset_description.json. and if you are referring to the bids schema as implemented in dandi, one could add validation information, including the dandi-specific bids schema version, in the provenance section of the metadata (wasGeneratedBy), since that is a list of activities.

@yarikoptic (Member Author) commented Jul 22, 2022

@yarikoptic Should the validation errors for a BIDS dataset be attached to the dataset_description.json object or to the relevant assets (for those that are asset-specific)? Note that the latter will be much harder to do when processing upload assets in parallel.

Short overall answer before I outline a possible design: I think errors associated with paths should be available (assigned or query-able) for those specific paths, and it should be possible to get all errors across all paths of the dataset (#943 relates, as it might need tune-up depending on what we come up with here or there ;)).

ATM, upload allows skipping "invalid" files while uploading "valid" files. Originally it was done this way because validation was done "in-line" with the upload -- it was taking time to validate, so we did not pre-validate all files to say that the entire dataset is bad, but discovered/reported bad assets in the course of the upload. Moreover, validation of dandiset.yaml was done independently of all other files, although it is more of a "Dandiset-level validation". For BIDS we still can do the same, since the current validation can take an individual path at a time, I believe. But overall we might want to step away from that practice.
With that in mind, even though in the short term we could associate with some superficial DatasetDescriptionAsset to signal that it is a BIDS dataset, we might better provide a more explicit structure.

Extracted as a separate issue now: #1082

I thought I had described it here or elsewhere, but I failed to find it, so I am writing up my thinking here. I think we overall have to work with the following structure. I have not yet thought about how and whether it should relate to existing Dandiset interfaces; maybe consider them all below Local as in files.py:

  • Dandiset
    • A Dandiset must have dandiset.yaml (DandisetMetadataFile), which we know how to validate; it represents metadata about the entire Dandiset
    • A Dandiset contains either a DANDIDataset or a BIDSDataset (AFAIK it cannot be both; @satra please confirm our thinking is aligned). I might even say we must not have any "abstract" Dataset -- they should all be DANDIDataset or BIDSDataset(s). The default (if not BIDS) is DANDIDataset.
      • I decided on "contains" instead of just subclassing from Dandiset because a nested BIDSDataset might not "contain" a dandiset.yaml. But maybe those specific ones could be subclasses (thus DANDIDandiset and BIDSDandiset), with a PlainBIDSDataset implementing Dandiset-independent logic (such as validation), and then BIDSDandiset(Dandiset, PlainBIDSDataset) to mix in. For now I will continue with the Dataset-delegated flavors below:
      • DANDIDataset is the one dandi organize renames into; we do not yet validate its layout (directory/filenames) AFAIK, although we should (I noted that in my reply in "investigate --validation skip and --allow-any-path decision tree" #1074 (comment)). Errors will be associated with specific paths.
      • BIDSDataset (like 000026) allows for nested BIDSDatasets underneath (e.g. under derivatives/{name} or rawdata/ or somewhere under sourcedata/). Currently the code finds them anywhere (not necessarily in those locations) and validates them independently. That it is a BIDSDataset (and not a DANDIDataset) is decided by the presence of dataset_description.json.
    • Each asset should be associated with some specific BIDSDataset or DANDIDataset (not just an abstract Dataset), even if that asset is not "valid".
      • A BIDSDataset might be part of another BIDSDataset; BIDSDataset.subdatasets: dict could provide the mapping.
      • Each Dataset could have some validation errors associated purely with it, which do not relate to specific asset(s). An example: an error about a missing required asset (e.g. the README, with some extension, required in BIDS, is missing).
      • Some validation errors might relate to groups of assets (e.g. some inconsistency across files). I am not sure yet how/where we should keep those; probably at the Dataset level, as a record containing all those paths.
      • We should be able to tell which errors pertain to a given asset.
      • Given all the above, I feel that we should collect errors at the level of the Dataset and allow querying for errors associated with a specific sub-path (which might be within a subdataset in BIDS).

So I am thinking that for upload/validation, given the optional paths (limiting what we can find), we first figure out the Dandiset (location of dandiset.yaml) and its Dataset (BIDS or DANDI), and subdatasets (if paths are provided, for those paths). If there is no Dataset -- nothing to upload. If no paths are given, it will be up to the type of the dataset to figure out which paths to pick up (like we get .nwb and video files for DANDIDataset).
Then, before upload, Dandiset.validate would do its self.DandisetMetadataFile-instance.validate() followed by self.Dataset-instance.validate() (which for DANDIDataset would do nothing ATM; for BIDSDataset it might do something in the future, like maybe full dataset validation). Some method like .discover_local_assets(paths: list[str] = None) should be available to associate/discover optionally provided paths with the corresponding Datasets (it might be BIDS- vs DANDI-dependent), thus building up the tree. Dandiset.validate_asset(path: str) could lazily (unless full validation already ran and that path was marked "ok" or already has results associated) initiate the corresponding Dataset (if needed, e.g. for a nested BIDSDataset), trigger the corresponding (depending on Dataset type and type of the asset, e.g. NWBAsset) LocalAsset.validate(), store the result of the validation within the corresponding Dataset, and return the validity status for upload to use.

Relating to the WiP in #1076, find_dandi_files and dandi_file should become methods of that Dandiset (they already take an, albeit optional, dandiset_path; if we figure out the dandiset, which must exist, first, they can become methods). There they should figure out the Dataset corresponding to the path and associate it with that Asset. To contain all validation results in the corresponding Dataset, maybe LocalAsset.validate should request to update its record within its Dataset.

Something like that. WDYT? (A rough skeleton of this structure is sketched below.)
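
A rough skeleton of the structure described above (all class and method names are the proposals from this comment, not existing dandi-cli API):

from typing import Optional


class Dataset:
    """Base for DANDIDataset/BIDSDataset; collects validation errors keyed
    by asset path (None = errors for the dataset as a whole)."""

    def __init__(self) -> None:
        self.errors: dict[Optional[str], list[str]] = {}

    def errors_for(self, path: Optional[str] = None) -> list[str]:
        return self.errors.get(path, [])


class DANDIDataset(Dataset):
    """Layout produced by `dandi organize`; layout validation still TODO."""


class BIDSDataset(Dataset):
    """Marked by dataset_description.json; may contain nested BIDS datasets."""

    def __init__(self) -> None:
        super().__init__()
        self.subdatasets: dict[str, "BIDSDataset"] = {}


class Dandiset:
    """Has dandiset.yaml (DandisetMetadataFile) and exactly one Dataset."""

    def __init__(self, dataset: Dataset) -> None:
        self.dataset = dataset

    def validate_asset(self, path: str) -> list[str]:
        ...  # lazily validate, record in the owning Dataset, return errors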

@yarikoptic (Member Author)

@yarikoptic @satra @TheChymera Also, the get_validation_errors() method of asset objects takes a schema_version argument denoting the version of the Dandi metadata schema to use. Is there a way to map this version to a BIDS schema version?

don't worry about the BIDS version here! And looking at how we "use" schema_version in get_validation_errors, it seems that the ability to provide a version is "superficial", since we simply fail to validate if the specified version is not current... And in general we rely on dandischema to do a possible upgrade to the current dandischema version if the version in the record is outdated. So I feel that maybe the ability to specify a version is "legacy" (@satra -- do you remember anything on that?).
For validate-bids, I believe we provided the ability to specify a BIDS schema version because we keep working on the BIDS spec to fit our needs, so we sometimes "cheat" by providing another (our own) version of the BIDS schema to use.

@jwodder (Member) commented Jul 25, 2022

@yarikoptic I'm thinking that such a restructuring of the classes should happen in a separate issue/PR. For right now, if we want individual BIDS assets to have their own validation errors (with BIDS dataset-wide errors just associated with the dataset_description.json asset for now), at what point in the upload process should this validation occur? Should all BIDS datasets be validated before processing any individual file uploads, with asset-specific errors attached to the respective assets for reporting once the upload code gets around to them, or should the individual BIDS assets be validated in the course of the upload by calling some method (that is both cached and thread-locked) of the dataset_description.json asset?

@yarikoptic (Member Author)

@yarikoptic I'm thinking that such a restructuring of the classes should happen in a separate issue/PR.

done - #1082

For right now, if we want individual BIDS assets to have their own validation errors (with BIDS dataset-wide errors just associated with the dataset_description.json asset for now), at what point in the upload process should this validation occur? Should all BIDS datasets be validated before processing any individual file uploads, with asset-specific errors attached to the respective assets for reporting once the upload code gets around to them, or should the individual BIDS assets be validated in the course of the upload by calling some method (that is both cached and thread-locked) of the dataset_description.json asset?

For consistency with how NWBs are handled, and since ATM it is allowed, the easiest is to do it at the point of getting to that file during upload. It would cause some duplicate, avoidable "find the root of the BIDS dataset" work, but that could be (if it is not already) cached, so hopefully it shouldn't add too much penalty. In the course of the #1082 refactoring, as I have mentioned, for BIDS such validation might need to happen at the level of the BIDS dataset first, before any file from it gets considered for upload. We might even later introduce a policy/behavior to disallow upload of DANDI or BIDS datasets altogether by default if any file fails validation (that is not the current behavior).

@jwodder (Member) commented Jul 25, 2022

@TheChymera Does the structure returned by validate_bids() currently contain any information on asset-specific validation errors, or are all the errors dataset-wide things like required files being missing? If the former, how do I extract the relevant information from the return value?

@yarikoptic (Member Author) commented Jul 26, 2022

ping @TheChymera. After all the current BIDS work is merged, we really need to work out a more "accessible" interface to BIDS validation (proper "error records", #943) -- the currently returned dict is hard to "understand".

@jwodder -- consider the following keys in the returned dict (@TheChymera please expand/fix if I am incomplete/wrong on any of this):

  • path_listing -- holds the paths which were "tested". So if you gave a folder for validation, this lists all paths which were tested
  • path_tracking -- if a path is listed in this list, it is invalid. So it is a per-asset error
  • schema_tracking -- dataset-level errors for missing required assets among those which were given to be validated.

The deeper meaning of those keys can be gathered by looking at the reporting code at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/bids_validator_xs.py#L513 (a small interpretation sketch follows below).
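
A minimal sketch of interpreting those keys, assuming the result shape described above (the print formatting is illustrative, not the actual dandi-cli reporting code):

from dandi.validate import validate_bids

result = validate_bids("/path/to/bids-dataset")

# Per-asset errors: any path still left in path_tracking matched no BIDS pattern.
for path in result["path_tracking"]:
    print("INVALID:", path)

# Dataset-level errors: required files that were never matched.
for entry in result["schema_tracking"]:
    print("MISSING required file:", entry)

# Everything that was tested:
print(len(result["path_listing"]), "paths tested")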

@jwodder (Member) commented Jul 26, 2022

@TheChymera @yarikoptic Should files listed in path_tracking be treated as failing validation? The code here implies that such a state is just a warning rather than an error.

@yarikoptic (Member Author)

Yes, AFAIK -- treat it as an error.

@TheChymera (Contributor)

@TheChymera please expand/fix if I am incomplete/wrong in any of that

the description is correct. Sorry if it seemed convoluted; what is happening is that the "tracking" lists are eroded with each match.

@TheChymera (Contributor)

@jwodder you can pass a list of files to validate_bids() under the bids_paths parameter:

The get-file logic will be run, but it will bypass most of the checks:

if os.path.isfile(bids_path):
    path_list.append(Path(bids_path).as_posix())
    continue

If you want to not run the get-file logic at all (in which case you can also use paths which don't exist locally), you can use the dummy_path mechanic I described earlier.

@jwodder (Member) commented Jul 27, 2022

I believe I've properly integrated validation in b83ba3f.

@TheChymera @satra @yarikoptic Could someone please explain exactly how to acquire BIDS metadata and integrate it into Dandi metadata?

@TheChymera (Contributor) commented Jul 27, 2022

@jwodder I had some basic support for that here:

dandi-cli/dandi/metadata.py

Lines 124 to 128 in 1c94736

_meta = validate_bids(path)
meta = _meta["match_listing"][0]
meta["bids_schema_version"] = _meta["bids_schema_version"]
meta = _rename_bids_keys(meta)
return meta

Basically, the keys in the result["match_listing"] dictionaries returned by the bundled module are just changed to their DANDI equivalents.

@jwodder (Member) commented Jul 27, 2022

@TheChymera That involves running validate_bids() on just the individual path. If I've already run it on all the assets in the BIDS dataset at once, how do I get an individual asset's metadata out of the result?

@TheChymera (Contributor) commented Jul 27, 2022

@jwodder to do that you need a set of keys whose values unambiguously identify the file; path is of course the obvious one, though it can be a larger set as well.

chymera@decohost ~ $ cat /tmp/lala.py 
from dandi.validate import validate_bids
result = validate_bids("~/src/bids-examples/asl003")

print("These are the valid paths:")
for i in result["match_listing"]:
    print(i["path"])

selection = "/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz"
print(f"Now let's take one and get the info: {selection}")
print([i for i in result["match_listing"] if i["path"] == selection])
chymera@decohost ~ $ python /tmp/lala.py 
2022-07-27 16:15:57,799 [ WARNING] BIDSVersion 1.5.0 is less than the minimal working 1.7.0+369. Falling back to 1.7.0+369. To force the usage of earlier versions specify them explicitly when calling the validator.
These are the valid paths:
/home/chymera/src/bids-examples/asl003/CHANGES
/home/chymera/src/bids-examples/asl003/dataset_description.json
/home/chymera/src/bids-examples/asl003/README
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_aslcontext.tsv
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_m0scan.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asllabeling.jpg
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.json
/home/chymera/src/bids-examples/asl003/sub-Sub1/anat/sub-Sub1_T1w.nii.gz
Now let's take one and get the info: /home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz
[{'subject': 'Sub1', 'session': None, 'acquisition': None, 'reconstruction': None, 'direction': None, 'run': None, 'path': '/home/chymera/src/bids-examples/asl003/sub-Sub1/perf/sub-Sub1_asl.nii.gz'}]

@jwodder (Member) commented Jul 28, 2022

@TheChymera @yarikoptic @satra The code linked by Horea above only applies BIDS metadata to NWB files, and it replaces the normal NWB metadata. Are either of those things that we want to do/keep doing?

@satra (Member) commented Jul 28, 2022

the normal metadata for NWB should not be replaced nor should the bids metadata be applied to NWB files.

if it's bids, then there is the possibility of an NWB file inside it, so it should also do the related NWB processing for those files.

however, setting that aside, any bids path should result in metadata that augments all the standard metadata we have for an asset. so, similar to nwb2asset, we will want a bidspath2asset. this latter function also needs to handle a participants.tsv file and, if present, sessions.tsv and samples.tsv files to fill in the relevant pieces of the asset metadata.

the original use of the metadata dictionary was to support organize, and that would be irrelevant for a bids dandiset.

@TheChymera (Contributor)

@jwodder

only applies BIDS metadata to NWB files,

From my local testing with ls it seemed to source the data of other files as well, e.g. .nii.gz.

it replaces the normal NWB metadata

This is true in effect. It doesn't replace it, though; it just doesn't go on to source the NWB data at all if it's a BIDS dataset. Perhaps this isn't such a bad idea? i.e., if it's a BIDS dataset we source the information via the BIDS standard.

@jwodder (Member) commented Jul 28, 2022

@satra So:

  • Given a BIDS asset that is not an NWB file, how should I determine the metadata?
  • Given a BIDS asset that is also an NWB file, how should I determine the metadata?

@jwodder (Member) commented Jul 28, 2022

@TheChymera

From my local testing with ls it seemed to source the data of other files as well, e.g. .nii.gz.

dandi ls does not accurately reflect metadata for non-NWB files, as it uses the get_metadata() function, which the upload routines only use for NWBs.

@satra (Member) commented Jul 28, 2022

nwb files have had two notions of metadata: one that is extracted from the file, and then the Asset structure (which is what i would call metadata). are you talking about the former or both?

for current flow: nwb --> metadata --> Asset metadata

for bids it would be:
bidspath --> metadata --> asset metadata
bidspath + nwb --> metadata (joint info from both processors) --> asset metadata

perhaps the refactoring can simply create one metadata structure, the Asset structure, which can then be used for other things (ls, organize, validate, etc.).

@jwodder (Member) commented Jul 28, 2022

@satra

nwb file have had two notions of metadata, one that is extracted from the file and then the Asset structure (which is what i would call metadata). are you talking about the former or both?

I'm talking about the final dandischema.BareAsset value.

for bids it would be:
bidspath --> metadata --> asset metadata
bidspath + nwb --> metadata (joint info from both processors) --> asset metadata

Describe these steps in detail.

@satra (Member) commented Jul 28, 2022

Describe these steps in detail.

i would leave that to @TheChymera and can review it, but describing those steps would be almost equivalent to writing the code. essentially one has to follow the details of the bids-specification here and map each field or piece of information into the bareasset structure.

@yarikoptic (Member Author)

@satra So:

  • Given a BIDS asset that is not an NWB file, how should I determine the metadata?

This would be for the @TheChymera to clarify.

  • Given a BIDS asset that is also an NWB file, how should I determine the metadata?

Given the currently still-absent overall agreement in bids-standard/bids-specification#761 (comment) about overloading of metadata in sidecar files (and thus in the filepath), I would say: for extraction (not validation), take the metadata contained in the .nwb file, and only complement it (add fields which have no non-None value) with the metadata from the BIDS filename (we aren't even getting anything from the sidecar yet, right @TheChymera?). A sketch of that merge follows.
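
A minimal sketch of that "complement" merge (hypothetical helper; the field names in the usage example are illustrative):

def complement_metadata(nwb_meta: dict, bids_meta: dict) -> dict:
    """NWB-extracted values win; BIDS filename-derived values only fill
    fields that are missing or None on the NWB side."""
    merged = dict(nwb_meta)
    for key, value in bids_meta.items():
        if value is not None and merged.get(key) is None:
            merged[key] = value
    return merged

# complement_metadata({"subject_id": "mouse1", "session_id": None},
#                     {"subject_id": "Sub1", "session_id": "20220728"})
# -> {"subject_id": "mouse1", "session_id": "20220728"}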

@jwodder (Member) commented Jul 28, 2022

@yarikoptic @TheChymera What about things like BIDS' "subject" field, which the code currently maps to our "subject_id" field? Should the metadata for a BIDS NWB asset use the subject_id from the NWB, from BIDS, or what?

@yarikoptic (Member Author)

@yarikoptic @TheChymera What about things like BIDS' "subject" field, which the code currently maps to our "subject_id" field? Should the metadata for a BIDS NWB asset use the subject_id from the NWB, from BIDS, or what?

per my comment above, let's take it from the .nwb for now, for consistency (unless it is empty there). It will be for (our) validator to complain if there is an inconsistency, and it will be for .nwb overloading (WiP) to be used to "fix it" if needed.

@jwodder (Member) commented Jul 28, 2022

@yarikoptic @TheChymera Based on what Horea's posted so far, the BIDS metadata contains the following fields:

  • subject (currently renamed to subject_id)
  • session (currently renamed to session_id)
  • path
  • acquisition
  • reconstruction
  • direction
  • run
  • bids_schema_version

Just where in a dandischema.BareAsset should these fields be mapped to? Aside from the first three fields, none of these seem to have any presence in the Dandi schema, and I believe they're currently discarded when uploading.

@TheChymera (Contributor)

@yarikoptic

we aren't even getting anything from sidecar yet,

yes, no files are being opened.

I don't have any strong opinions either way, really, but the reason why I suggested going BIDS-first (or rather path-first) is what I thought was consistency/usability. We don't get any metadata from the NIfTI headers either, so in addition to privileging NWB over BIDS we'd be privileging NWB over NIfTI. Not least of all, path metadata can be easily sourced without downloading any files... But if you disagree, don't mind me :) just explaining why I suggested BIDS first.

@TheChymera (Contributor) commented Jul 28, 2022

@yarikoptic @TheChymera Based on what Horea's posted so far, the BIDS metadata contains the following fields:
...

Those are the fields for a file with that specific suffix.

In total, BIDS files can have any of the following entities, though an entity might be None if none is given (most are optional), or not be present as a key in the output if it would be invalid for the specific suffix.

chymera@darkhost ~/src/bids-specification/src/schema $ ag "^[a-z]*?:" objects/entities.yaml 
6:acquisition:
28:atlas:
38:ceagent:
49:chunk:
58:density:
69:description:
79:direction:
88:echo:
101:flip:
113:hemisphere:
125:inversion:
137:label:
148:modality:
157:mtransfer:
171:part:
195:processing:
209:reconstruction:
218:recording:
228:resolution:
239:run:
256:sample:
268:session:
289:space:
307:split:
329:stain:
341:subject:
348:task:
358:tracer:

@yarikoptic (Member Author)

@jwodder I do not think we map much from BIDS to our metadata ATM. Just use what we have at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/metadata.py#L124

        _meta = validate_bids(path)
        meta = _meta["match_listing"][0]
        meta["bids_schema_version"] = _meta["bids_schema_version"]
        meta = _rename_bids_keys(meta)

Maybe RF it into a function get_bids_raw_metadata (sketched below). Then eventually we would add more key mappings into that _rename_bids_keys, I guess. Also, extraction of sex is not there yet at all.
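
A minimal sketch of that refactoring (get_bids_raw_metadata is the name proposed above; validate_bids and _rename_bids_keys are the existing functions from dandi/validate.py and dandi/metadata.py quoted earlier in this thread):

from dandi.validate import validate_bids

def get_bids_raw_metadata(path: str) -> dict:
    """Extract BIDS filename-derived metadata for a single path, with the
    keys renamed to their DANDI equivalents (e.g. subject -> subject_id)."""
    _meta = validate_bids(path)
    meta = _meta["match_listing"][0]
    meta["bids_schema_version"] = _meta["bids_schema_version"]
    return _rename_bids_keys(meta)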

@satra (Member) commented Aug 1, 2022

@yarikoptic - for mapping to bareasset the function should map subject, session, modality info. i would suggest a bids2asset function similar to the nwb2asset function.

@jwodder (Member) commented Aug 2, 2022

@yarikoptic Using that code will result in only subject and session being set, as those are the only keys that are mapped to something that prepare_metadata() (née metadata2asset()) recognizes. Are you sure that's what you want?

@satra

for mapping to bareasset the function should map subject, session, modality info.

Where/how is modality represented in the Dandi schema?

i would suggest a bids2asset function similar to the nwb2asset function.

Assuming this function returns a BareAsset instance, do you have a recommendation for how to combine a BIDS BareAsset and an NWB BareAsset in accordance with this comment of Yarik's?

@yarikoptic (Member Author)

@yarikoptic Using that code will result in only subject and session being set, as those are the only keys that are mapped to something that prepare_metadata() (née metadata2asset()) recognizes. Are you sure that's what you want?

Let's start with that. We will improve upon this later. I will leave it to @satra to clarify about modality.

@yarikoptic (Member Author)

I will leave to @satra to clarify about modality

@jwodder, let's proceed without mapping modality for now; we can always improve that later.

@jwodder (Member) commented Aug 3, 2022

@TheChymera @satra So now I know how to determine metadata for BIDS assets, but what should the Dandi metadata be for the dataset_description.json file itself?

@TheChymera (Contributor)

@jwodder I actually don't think it would have any; it's a metadata file, so perhaps it should be handled the same way as dandiset.yaml? Regarding the information it can contain, it also overlaps with dandiset.yaml, though currently the validator does not parse any of its contents other than BIDSVersion, which applies to the entire dataset.

@satra (Member) commented Aug 3, 2022

Assuming this function returns a BareAsset instance, do you have a recommendation for how to combine a BIDS BareAsset and an NWB BareAsset

i would say this should apply: bareasset_dict.update(**nwb_asset_dict) (see the sketch at the end of this comment)

for now treat dataset_description.json as just another asset. we could help populate it from dandiset metadata, but that's a separate feature in my opinion. we are also not doing anything for all other json and tsv files at the moment.

modality and sessions/samples/participants.tsv handling can come in a separate PR.
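
A minimal sketch of that update-based merge on serialized models (dandischema.models.BareAsset is real; the helper is hypothetical, and .dict() is the pydantic-v1-style API dandischema used at the time; note that unlike the None-aware complement sketched earlier, this makes NWB-derived values win wholesale):

from dandischema.models import BareAsset

def merge_bids_nwb(bids_asset: BareAsset, nwb_asset: BareAsset) -> BareAsset:
    bids_dict = bids_asset.dict(exclude_none=True)
    nwb_dict = nwb_asset.dict(exclude_none=True)
    bids_dict.update(nwb_dict)  # NWB-derived fields take precedence
    return BareAsset(**bids_dict)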

yarikoptic added a commit that referenced this issue Aug 9, 2022: "Add classes for BIDS assets"