Swin UNETR Pretraining: HNSCC Data Extraction #189

coxjoseph opened this issue Feb 15, 2023 · 4 comments


When trying to pretrain the Swin Transformer model found in research-contributions/SwinUNETR/Pretrain/, I noticed a discrepancy between the HNSCC JSON and the TCIA Colonography JSON.

The two JSON files downloaded from the links in the README (dataset_HNSCC_0.json and dataset_TCIAcolon_v2_0.json) are named correctly, but both reference images in the images directory. At first I assumed this just meant I had to rename the files from one dataset or renumber them based on some ordering. On closer inspection, however, the two files reference 602 identical image paths in the same directory. Reading through the code, these images do not seem to be handled any differently, which leads me to believe that either one of the JSON files is linked incorrectly or the code is loading the same images multiple times as if they came from different datasets. If the JSON files are correct, could you please advise on how to rename or reorder the image files to pretrain the model correctly?

Here's a short Python script to verify that the two files do reference the same images (place both JSON files in a subdirectory named jsons relative to your working directory):

import json

def get_image_paths(json_file: dict) -> set:
    """Return the set of image paths listed under 'training' and 'validation'."""
    # Each entry is a dict with a single key, 'image'.
    training_paths = [entry['image'] for entry in json_file['training']]
    validation_paths = [entry['image'] for entry in json_file['validation']]

    return set(training_paths).union(validation_paths)

if __name__ == '__main__':
    with open('./jsons/dataset_HNSCC_0.json', 'r') as hnscc, \
         open('./jsons/dataset_TCIAcolon_v2_0.json', 'r') as colon:
        hnscc_json = json.load(hnscc)
        colon_json = json.load(colon)

    hnscc_paths = get_image_paths(hnscc_json)
    colon_paths = get_image_paths(colon_json)

    paths_in_common = hnscc_paths.intersection(colon_paths)
    
    print(f'Found {len(paths_in_common)} paths in common.')

> Found 602 paths in common.


coxjoseph commented Feb 15, 2023

I see now that each dataset is placed in its own directory (dataset/datset1, dataset/datset2, dataset/datset3, dataset/datset4, and dataset/datset8); perhaps the README could be a bit clearer about that. But this leads to a slightly different issue: only 609 subjects are available in the HNSCC data, yet the JSON file in question references images beyond img_1000.nii.gz. How are the images extracted and processed from the HNSCC dataset?
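
A minimal sketch of why the overlap reported above is harmless once each dataset has its own directory: the JSON entries are relative paths, so they only collide if both lists are resolved against the same base directory. The helper name `resolve_paths`, the toy datalists, and the pairing of datasets to directories are assumptions for illustration, not the repository's loader.

```python
import os


def resolve_paths(datalist: dict, base_dir: str) -> set:
    """Join every 'image' entry under 'training'/'validation' with base_dir."""
    entries = datalist.get('training', []) + datalist.get('validation', [])
    return {os.path.normpath(os.path.join(base_dir, e['image'])) for e in entries}


# Toy datalists standing in for the two downloaded JSON files (hypothetical content).
hnscc_json = {'training': [{'image': 'images/img_1.nii.gz'}], 'validation': []}
colon_json = {'training': [{'image': 'images/img_1.nii.gz'}], 'validation': []}

# Directory names are copied from the comment above; which dataset lives in
# which directory is an assumption made for this example.
hnscc_full = resolve_paths(hnscc_json, 'dataset/datset3')
colon_full = resolve_paths(colon_json, 'dataset/datset4')

# Identical relative paths, but no shared files once the base directory is applied.
print(len(hnscc_full & colon_full))
```

Resolved against a single shared directory instead, the same two lists would overlap completely, which is what the original script measured.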

coxjoseph changed the title from "JSON Discrepancy in Swin UNETR Pretraining Code" to "Swin UNETR Pretraining: HNSCC Data Extraction" on Feb 15, 2023
tangy5 commented Feb 23, 2023

Hi @coxjoseph, thanks so much for the question.
The raw HNSCC dataset should contain more than 609 volumes; there are ~1300 CT volumes. I guess there might be an inconsistency when converting the DICOM images to NIfTI format.
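
One way to check whether the gap between 609 subjects and ~1300 volumes comes from subjects with several series is to count series folders per subject before conversion. This is a rough sketch that assumes the download is laid out as `<root>/<subject>/.../<series>/*.dcm`; that layout, the function name `count_series_per_subject`, and the example path are assumptions, not something confirmed in this thread.

```python
import os
from collections import Counter


def count_series_per_subject(root: str) -> Counter:
    """Count directories that directly contain .dcm files, grouped by top-level subject folder."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.lower().endswith('.dcm') for name in filenames):
            # The first path component below root is treated as the subject ID.
            subject = os.path.relpath(dirpath, root).split(os.sep)[0]
            counts[subject] += 1
    return counts


if __name__ == '__main__':
    counts = count_series_per_subject('path/to/HNSCC')  # hypothetical path
    print(len(counts), 'subjects,', sum(counts.values()), 'series')
```

The series total is only an upper bound on CT volumes, since a series folder may hold a non-CT modality.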

How about this: we have a copy that has already been converted to NIfTI, QA'd, and had outliers removed.
You can refer to this link to download the HNSCC dataset.
https://drive.google.com/file/d/1KU5cq6O1ToN0D7_0YkkV6gZoSSPChSjO/view?usp=share_link

Thanks.

@JakobDexl

Thanks for your great contribution. Hi @tangy5, could you please provide more information on the TCIAcolon dataset as well, for example the mapping.json? I'm also having trouble finding the correct correspondence.


GLARKI commented Aug 22, 2024

I have a similar difficulty regarding reproducibility, and unfortunately, the link from tangy5 no longer works.
Would someone be able to help me?

I've downloaded the HNSCC dataset.
Looking at the JSON file, I assume the image IDs correspond to the folder IDs, e.g., 'HNSCC-01-0001'.
However, the IDs in the JSON file go up to 1100+, while the IDs in my downloaded cases only go up to 630.
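
To put a number on that gap, the sketch below pulls the numeric index out of each file name in a datalist and reports how many distinct indices there are and the largest one. It assumes entries look like `img_1000.nii.gz` (the pattern mentioned earlier in this thread); the function name `index_summary` and the toy datalist are made up for illustration.

```python
import re


def index_summary(datalist: dict) -> tuple:
    """Return (number of distinct indices, largest index) over training and validation entries."""
    entries = datalist.get('training', []) + datalist.get('validation', [])
    indices = set()
    for entry in entries:
        # Assumed naming pattern: img_<number>.nii or img_<number>.nii.gz
        match = re.search(r'img_(\d+)\.nii', entry['image'])
        if match:
            indices.add(int(match.group(1)))
    return len(indices), max(indices, default=0)


# Toy datalist (hypothetical content) with a gap in the numbering.
toy = {
    'training': [{'image': 'images/img_3.nii.gz'}, {'image': 'images/img_1000.nii.gz'}],
    'validation': [{'image': 'images/img_7.nii.gz'}],
}
print(index_summary(toy))  # (3, 1000)
```

For the real file, the dict from `json.load` on dataset_HNSCC_0.json can be passed in directly. A count well below the largest index would point to gaps in the numbering rather than to one image per index.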

Is this because the database was updated at the request of the PI?
See quote on the database website "Version 4: Updated 2024/05/15
Replaced Head-Neck-CT-Atlas clinical data file per PI request. The old version is no longer available.".

Or did I miss something?
Maybe if a person has multiple CTs, the count still goes up?

Thank you very much in advance!
