Swin UNETR Pretraining: HNSCC Data Extraction #189

coxjoseph · 2023-02-15T16:12:40Z

When trying to pretrain the Swin Transformer model found in research-contributions/SwinUNETR/Pretrain/, I became aware of a discrepancy between the HNSCC json and the TCIA Colonography json.

The two JSON files downloaded from the links in the README (dataset_HNSCC_0.json and dataset_TCIAcolon_v2_0.json) are named correctly, but both reference images in the same images directory. At first I assumed this meant I had to rename the files from one dataset or renumber them based on some ordering. On further inspection, however, the two files reference 602 identical image paths. Reading through the code, these images do not appear to be handled any differently, which leads me to believe that either one of the JSON files is linked incorrectly or the code loads the same images more than once, treating them as if they came from different datasets. If the JSON files are correct, could you please advise on how to rename or reorder the image files to pretrain the model correctly?

Here's a short Python script to verify that the two files do reference the same image paths (place both JSON files in a subdirectory named jsons relative to your working directory):

import json

def get_image_paths(json_file: dict) -> set: 
    training_images = json_file['training']  # List of dicts with only one key
    training_paths = [training_image['image'] for training_image in training_images]

    validation_images = json_file['validation']
    validation_paths = [validation_image['image'] for validation_image in validation_images]
    
    return set(training_paths).union(validation_paths)

if __name__ == '__main__':
    with open('./jsons/dataset_HNSCC_0.json', 'r') as hnscc, \
         open('./jsons/dataset_TCIAcolon_v2_0.json', 'r') as colon:
        hnscc_json = json.load(hnscc)
        colon_json = json.load(colon)

    hnscc_paths = get_image_paths(hnscc_json)
    colon_paths = get_image_paths(colon_json)

    paths_in_common = hnscc_paths.intersection(colon_paths)
    
    print(f'Found {len(paths_in_common)} paths in common.')

> Found 602 paths in common.


coxjoseph · 2023-02-15T16:39:47Z

I see now that each dataset is placed in its own directory (dataset/datset1, dataset/datset2, dataset/datset3, dataset/datset4, and dataset/datset8); perhaps the README could be a bit clearer on that. But this leads to a slightly different issue: the HNSCC data contains only 609 subjects, yet the JSON file in question references images numbered beyond img_1000.nii.gz. How are the images extracted and processed from the HNSCC dataset?
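For anyone hitting the same mismatch, a minimal sketch of how one might list the images a JSON file references that are not present on disk. It assumes the JSON layout used by the script above ('training' and 'validation' lists of dicts with an 'image' key) and that each image path is relative to that dataset's own directory. The function name find_missing_images and the data_root parameter are hypothetical and not part of the pretraining code.

```python
import json
from pathlib import Path


def find_missing_images(json_path, data_root):
    """Return image paths listed in a datalist JSON that are absent under data_root.

    Assumption: each entry's 'image' value is a path relative to data_root.
    """
    with open(json_path, "r") as f:
        datalist = json.load(f)

    missing = []
    # 'training' and 'validation' are the keys read by the script earlier in this thread.
    for split in ("training", "validation"):
        for entry in datalist.get(split, []):
            if not (Path(data_root) / entry["image"]).is_file():
                missing.append(entry["image"])
    return missing
```

Calling it with the path to dataset_HNSCC_0.json and whichever directory holds the converted HNSCC volumes returns the referenced files that are missing, which gives a quick measure of how far the local conversion diverges from the JSON.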

tangy5 · 2023-02-23T06:27:30Z

Hi @coxjoseph , thanks so much for the question.
The raw HNSCC dataset should contain more than 609 volumes; there are ~1300 CT volumes. I suspect an inconsistency arose when converting the DICOM images to NIfTI format.

How about this: we have a copy that has already been converted to NIfTI, QA'd, and had outliers removed.
You can download the HNSCC dataset from this link:
https://drive.google.com/file/d/1KU5cq6O1ToN0D7_0YkkV6gZoSSPChSjO/view?usp=share_link

Thanks.

JakobDexl · 2023-04-13T08:03:29Z

Thanks for your great contribution. Hi @tangy5, could you please provide more information on the TCIAcolon dataset as well, for example the mapping.json? I'm also having trouble finding the correct mapping.

GLARKI · 2024-08-22T14:00:33Z

I have a similar difficulty regarding reproducibility, and unfortunately, the link from tangy5 no longer works.
Would someone be able to help me?

I've downloaded the HNSCC dataset.
Looking at the JSON file, I assume the image IDs correspond to the folder IDs, e.g., 'HNSCC-01-0001'.
However, the IDs in the JSON file go up to 1100+, while the IDs in my downloaded cases only go up to 630.

Is this because the database was updated at the request of the PI?
See this quote on the database website: "Version 4: Updated 2024/05/15
Replaced Head-Neck-CT-Atlas clinical data file per PI request. The old version is no longer available."

Or did I miss something?
Maybe if a person has multiple CTs, the count still goes up?

Thank you very much in advance!
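One way to test the multiple-CTs-per-person idea locally is to count how many DICOM series each patient folder contains. This is only a rough sketch: it assumes the download has one top-level folder per patient ID (e.g., 'HNSCC-01-0001') with each series stored in its own subdirectory of .dcm files, and the name count_series_per_patient is hypothetical. A series is not necessarily a CT volume (it could be another modality or a structure set), so the totals are an upper bound.

```python
import os
from collections import Counter


def count_series_per_patient(collection_root):
    """Count directories that directly contain .dcm files under each patient folder.

    Assumption: collection_root holds one folder per patient, and every
    DICOM series sits in its own subdirectory somewhere below that folder.
    """
    counts = Counter()
    for patient in sorted(os.listdir(collection_root)):
        patient_dir = os.path.join(collection_root, patient)
        if not os.path.isdir(patient_dir):
            continue
        for _dirpath, _dirnames, filenames in os.walk(patient_dir):
            # Treat any directory holding at least one .dcm file as one series.
            if any(name.lower().endswith(".dcm") for name in filenames):
                counts[patient] += 1
    return counts
```

If sum(counts.values()) is well above the number of patient folders, multiple series per patient would be a plausible explanation for image indices running past the patient count.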

coxjoseph changed the title ~~JSON Discrepancy in Swin UNETR Pretraining Code~~ Swin UNETR Pretraining: HNSCC Data Extraction Feb 15, 2023
