From 01a312e93b9546302e13f1feb995164a91bb16e7 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Thu, 22 Aug 2024 18:56:35 -0400 Subject: [PATCH] update to simplify to use clinical_index from idc-index --- .../exploring_clinical_data.ipynb | 2171 +++++++++++++++++ 1 file changed, 2171 insertions(+) create mode 100644 notebooks/getting_started/exploring_clinical_data.ipynb diff --git a/notebooks/getting_started/exploring_clinical_data.ipynb b/notebooks/getting_started/exploring_clinical_data.ipynb new file mode 100644 index 00000000..65f1eb0f --- /dev/null +++ b/notebooks/getting_started/exploring_clinical_data.ipynb @@ -0,0 +1,2171 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Working with IDC clinical data without BigQuery\n", + "\n", + "In this notebook we cover the basics of how you can access and search IDC clinical data without depending on Google BigQuery.\n", + "\n", + "In addition to maintaining clinical data in Google BigQuery tables, we also export those in Parquet format into a public cloud-based storage bucket. 
Those files are free to download, and are rather small (as of IDC v18, less than 65MB altogether).\n", + "\n", + "Once downloaded, you can search the content using pandas syntax or SQL.\n", + "\n", + "This brief notebook will guide you through those steps.\n", + "\n", + "If you have never worked with IDC before, we recommend you first complete the getting started tutorial [here](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part2_searching_basics.ipynb).\n", + "\n", + "---\n", + "Initial version: Jul 2024\n", + "\n", + "Updated: Aug 2024" + ], + "metadata": { + "id": "RVHEoPZJgVbl" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Prerequisites\n", + "\n", + "The only prerequisite is [`idc-index`](https://github.com/ImagingDataCommons/idc-index), a Python package that contains various utilities to simplify access to IDC data. Installing this package also installs several other packages that we will use later:\n", + "* `s5cmd` for very efficient download of data from cloud buckets using the S3 API\n", + "* `pandas` for dataframe operations\n", + "* `duckdb` for querying pandas dataframes using SQL syntax" + ], + "metadata": { + "id": "t-tzpK4DakEP" + } + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "k8B1uiZkYlHu" + }, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install --upgrade idc-index" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Fetch clinical data index\n", + "\n", + "`idc-index` packages various tables with the key metadata. We refer to those as _indices_. The main index that supports API calls related to download and search is installed by default. To support search of the clinical data accompanying IDC images, you will need the `clinical_index` table, which contains the list of all columns and all tables across all of the IDC collections that are available." 
+ ], + "metadata": { + "id": "neM2YGeQamIu" + } + }, + { + "cell_type": "code", + "source": [ + "from idc_index import index\n", + "\n", + "c = index.IDCClient()\n", + "\n", + "c.fetch_index('clinical_index')\n", + "\n", + "print('Columns available in clinical_index:\\n'+'\\n'.join(c.clinical_index.keys()))" + ], + "metadata": { + "id": "ehb4MeuPYy4c", + "outputId": "75fbb6b4-edb1-4598-8aa9-1753300ac193", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": 2, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Columns available in clinical_index:\n", + "collection_id\n", + "case_col\n", + "table_name\n", + "column\n", + "column_label\n", + "data_type\n", + "original_column_headers\n", + "values\n", + "values_source\n", + "files\n", + "sheet_names\n", + "batch\n", + "column_numbers\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Find all metadata available for a specific collection\n", + "\n", + "A common use case is to find all clinical data available for a specific IDC collection.\n", + "\n", + "The key columns of this dataframe are:\n", + "* `collection_id`: the collection that a given metadata attribute belongs to\n", + "* `table_name`: the name of the table where this metadata attribute is located\n", + "* `column`: the name of the column (attribute)\n", + "\n", + "Depending on the specific attribute and how it was provided/documented by the submitter, you may find more information about it in the `column_label` column.\n", + "\n", + "Let's assume we are interested in the clinical data accompanying the `rms_mutation_prediction` collection. We can select all clinical data attributes that are available for this collection as shown next." 
+ ], + "metadata": { + "id": "DvIHHAg7ao8F" + } + }, + { + "cell_type": "code", + "source": [ + "# define the query that selects all rows where collection_id is 'rms_mutation_prediction'\n", + "# note that we can refer to clinical_index table in the query\n", + "query = \"\"\"\n", + "SELECT *\n", + "FROM clinical_index\n", + "WHERE collection_id = 'rms_mutation_prediction'\n", + "\"\"\"\n", + "\n", + "# execute the query\n", + "matching_items = c.sql_query(query)\n" + ], + "metadata": { + "id": "L6rMuMEHjiML" + }, + "execution_count": 5, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "matching_items" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "X5jfORxlaQ-Q", + "outputId": "e10b214d-0550-413b-fdbb-991074482644" + }, + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " collection_id case_col \\\n", + "0 rms_mutation_prediction False \n", + "1 rms_mutation_prediction False \n", + "2 rms_mutation_prediction True \n", + "3 rms_mutation_prediction False \n", + "4 rms_mutation_prediction False \n", + "5 rms_mutation_prediction False \n", + "6 rms_mutation_prediction False \n", + "7 rms_mutation_prediction False \n", + "8 rms_mutation_prediction False \n", + "9 rms_mutation_prediction False \n", + "10 rms_mutation_prediction True \n", + "11 rms_mutation_prediction True \n", + "12 rms_mutation_prediction False \n", + "13 rms_mutation_prediction False \n", + "14 rms_mutation_prediction False \n", + "15 rms_mutation_prediction False \n", + "16 rms_mutation_prediction False \n", + "17 rms_mutation_prediction False \n", + "18 rms_mutation_prediction False \n", + "19 rms_mutation_prediction False \n", + "20 rms_mutation_prediction False \n", + "21 rms_mutation_prediction False \n", + "22 rms_mutation_prediction False \n", + "23 rms_mutation_prediction False \n", + "24 rms_mutation_prediction False \n", + "25 rms_mutation_prediction False \n", 
+ "26 rms_mutation_prediction False \n", + "27 rms_mutation_prediction False \n", + "28 rms_mutation_prediction False \n", + "29 rms_mutation_prediction False \n", + "30 rms_mutation_prediction False \n", + "31 rms_mutation_prediction False \n", + "32 rms_mutation_prediction False \n", + "33 rms_mutation_prediction False \n", + "34 rms_mutation_prediction False \n", + "35 rms_mutation_prediction False \n", + "36 rms_mutation_prediction False \n", + "37 rms_mutation_prediction False \n", + "38 rms_mutation_prediction False \n", + "\n", + " table_name \\\n", + "0 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "1 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "2 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "3 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "4 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "5 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "6 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "7 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "8 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "9 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "10 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "11 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "12 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "13 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "14 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "15 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "16 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "17 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "18 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "19 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "20 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "21 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "22 bigquery-public-data.idc_v18_clinical.rms_muta... 
\n", + "23 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "24 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "25 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "26 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "27 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "28 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "29 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "30 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "31 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "32 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "33 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "34 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "35 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "36 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "37 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "38 bigquery-public-data.idc_v18_clinical.rms_muta... \n", + "\n", + " column column_label \\\n", + "0 sample_id sample_id \n", + "1 primary_site primary_site \n", + "2 participant_id participant_id \n", + "3 age_at_diagnosis age_at_diagnosis \n", + "4 dicom_patient_id idc_provenance_dicom_patient_id \n", + "5 dicom_patient_id idc_provenance_dicom_patient_id \n", + "6 dicom_patient_id idc_provenance_dicom_patient_id \n", + "7 days_to_recurrence days_to_recurrence \n", + "8 sample_anatomic_site sample_anatomic_site \n", + "9 days_to_last_followup days_to_last_followup \n", + "10 participantparticipant_id participant.participant_id \n", + "11 participantparticipant_id participant.participant_id \n", + "12 participant_age_at_collection participant_age_at_collection \n", + "13 tumor_grade tumor_grade \n", + "14 diagnosis_id diagnosis_id \n", + "15 tumor_morphology tumor_morphology \n", + "16 sample_description sample_description \n", + "17 tumor_incidence_type tumor_incidence_type \n", + "18 tumor_stage_clinical_m tumor_stage_clinical_m \n", + "19 
tumor_stage_clinical_n tumor_stage_clinical_n \n", + "20 tumor_stage_clinical_t tumor_stage_clinical_t \n", + "21 progression_or_recurrence progression_or_recurrence \n", + "22 tissue_or_organ_of_origin tissue_or_organ_of_origin \n", + "23 site_of_resection_or_biopsy site_of_resection_or_biopsy \n", + "24 days_to_last_known_disease_status days_to_last_known_disease_status \n", + "25 primary_diagnosis_reference_source primary_diagnosis_reference_source \n", + "26 stage Stage \n", + "27 last_known_disease_status last_known_disease_status \n", + "28 metastasis_at_diagnosis Metastasis_at_diagnosis \n", + "29 histological_classification Histological_Classification \n", + "30 race race \n", + "31 source_batch idc_provenance_source_batch \n", + "32 source_batch idc_provenance_source_batch \n", + "33 source_batch idc_provenance_source_batch \n", + "34 sample_type sample_type \n", + "35 sample_tumor_status sample_tumor_status \n", + "36 gender gender \n", + "37 primary_diagnosis primary_diagnosis \n", + "38 disease_type disease_type \n", + "\n", + " data_type original_column_headers \\\n", + "0 String [['sample_id']] \n", + "1 String [['primary_site']] \n", + "2 String [['participant_id']] \n", + "3 float64 [['age_at_diagnosis']] \n", + "4 String [['idc_provenance_dicom_patient_id']] \n", + "5 String [['idc_provenance_dicom_patient_id']] \n", + "6 String [['idc_provenance_dicom_patient_id']] \n", + "7 String [['days_to_recurrence']] \n", + "8 String [['sample_anatomic_site']] \n", + "9 String [['days_to_last_followup']] \n", + "10 String [['participant.participant_id']] \n", + "11 String [['participant.participant_id']] \n", + "12 float64 [['participant_age_at_collection']] \n", + "13 String [['tumor_grade']] \n", + "14 String [['diagnosis_id']] \n", + "15 String [['tumor_morphology']] \n", + "16 String [['sample_description']] \n", + "17 String [['tumor_incidence_type']] \n", + "18 String [['tumor_stage_clinical_m']] \n", + "19 String [['tumor_stage_clinical_n']] \n", + 
"20 String [['tumor_stage_clinical_t']] \n", + "21 String [['progression_or_recurrence']] \n", + "22 String [['tissue_or_organ_of_origin']] \n", + "23 String [['site_of_resection_or_biopsy']] \n", + "24 String [['days_to_last_known_disease_status']] \n", + "25 String [['primary_diagnosis_reference_source']] \n", + "26 String [['Stage']] \n", + "27 String [['last_known_disease_status']] \n", + "28 String [['Metastasis_at_diagnosis']] \n", + "29 String [['Histological_Classification']] \n", + "30 String [['race']] \n", + "31 int64 [['idc_provenance_source_batch']] \n", + "32 int64 [['idc_provenance_source_batch']] \n", + "33 int64 [['idc_provenance_source_batch']] \n", + "34 String [['sample_type']] \n", + "35 String [['sample_tumor_status']] \n", + "36 String [['gender']] \n", + "37 String [['primary_diagnosis']] \n", + "38 String [['disease_type']] \n", + "\n", + " values \\\n", + "0 [] \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] \n", + "7 [] \n", + "8 [] \n", + "9 [] \n", + "10 [] \n", + "11 [] \n", + "12 [] \n", + "13 [{'option_code': '', 'option_description': None}] \n", + "14 [{'option_code': '', 'option_description': None}] \n", + "15 [{'option_code': '', 'option_description': None}] \n", + "16 [{'option_code': '', 'option_description': None}] \n", + "17 [{'option_code': '', 'option_description': None}] \n", + "18 [{'option_code': '', 'option_description': None}] \n", + "19 [{'option_code': '', 'option_description': None}] \n", + "20 [{'option_code': '', 'option_description': None}] \n", + "21 [{'option_code': '', 'option_description': None}] \n", + "22 [{'option_code': '', 'option_description': None}] \n", + "23 [{'option_code': '', 'option_description': None}] \n", + "24 [{'option_code': '', 'option_description': None}] \n", + "25 [{'option_code': '', 'option_description': None}] \n", + "26 [{'option_code': '', 'option_description': Non... \n", + "27 [{'option_code': '', 'option_description': Non... 
\n", + "28 [{'option_code': '', 'option_description': Non... \n", + "29 [{'option_code': '', 'option_description': Non... \n", + "30 [{'option_code': '', 'option_description': Non... \n", + "31 [{'option_code': '0', 'option_description': No... \n", + "32 [{'option_code': '0', 'option_description': No... \n", + "33 [{'option_code': '0', 'option_description': No... \n", + "34 [{'option_code': 'Tumor', 'option_description'... \n", + "35 [{'option_code': 'Tumor', 'option_description'... \n", + "36 [{'option_code': 'Female', 'option_description... \n", + "37 [{'option_code': 'Rhabdomyosarcoma', 'option_d... \n", + "38 [{'option_code': 'Soft Tissue Tumors and Sarco... \n", + "\n", + " values_source \\\n", + "0 None \n", + "1 None \n", + "2 None \n", + "3 None \n", + "4 None \n", + "5 None \n", + "6 None \n", + "7 None \n", + "8 None \n", + "9 None \n", + "10 None \n", + "11 None \n", + "12 None \n", + "13 derived from inspection of values \n", + "14 derived from inspection of values \n", + "15 derived from inspection of values \n", + "16 derived from inspection of values \n", + "17 derived from inspection of values \n", + "18 derived from inspection of values \n", + "19 derived from inspection of values \n", + "20 derived from inspection of values \n", + "21 derived from inspection of values \n", + "22 derived from inspection of values \n", + "23 derived from inspection of values \n", + "24 derived from inspection of values \n", + "25 derived from inspection of values \n", + "26 derived from inspection of values \n", + "27 derived from inspection of values \n", + "28 derived from inspection of values \n", + "29 derived from inspection of values \n", + "30 derived from inspection of values \n", + "31 derived from inspection of values \n", + "32 derived from inspection of values \n", + "33 derived from inspection of values \n", + "34 derived from inspection of values \n", + "35 derived from inspection of values \n", + "36 derived from inspection of values \n", + "37 
derived from inspection of values \n", + "38 derived from inspection of values \n", + "\n", + " files sheet_names batch \\\n", + "0 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "1 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "2 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [participant] [0] \n", + "3 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "4 [] [] [] \n", + "5 [] [] [] \n", + "6 [] [] [] \n", + "7 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "8 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "9 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "10 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "11 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "12 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "13 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "14 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "15 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "16 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "17 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "18 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "19 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "20 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "21 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "22 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "23 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "24 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "25 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "26 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... 
[diagnosis] [0] \n", + "27 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "28 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "29 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "30 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [participant] [0] \n", + "31 [] [] [] \n", + "32 [] [] [] \n", + "33 [] [] [] \n", + "34 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "35 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [sample] [0] \n", + "36 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [participant] [0] \n", + "37 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "38 [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... [diagnosis] [0] \n", + "\n", + " column_numbers \n", + "0 [1] \n", + "1 [5] \n", + "2 [0] \n", + "3 [6] \n", + "4 [] \n", + "5 [] \n", + "6 [] \n", + "7 [9] \n", + "8 [3] \n", + "9 [13] \n", + "10 [0] \n", + "11 [0] \n", + "12 [4] \n", + "13 [6] \n", + "14 [1] \n", + "15 [10] \n", + "16 [12] \n", + "17 [11] \n", + "18 [9] \n", + "19 [8] \n", + "20 [7] \n", + "21 [14] \n", + "22 [12] \n", + "23 [15] \n", + "24 [11] \n", + "25 [4] \n", + "26 [8] \n", + "27 [10] \n", + "28 [7] \n", + "29 [5] \n", + "30 [1] \n", + "31 [] \n", + "32 [] \n", + "33 [] \n", + "34 [2] \n", + "35 [13] \n", + "36 [2] \n", + "37 [3] \n", + "38 [2] " + ], + "text/html": [ + "\n", + "
collection_idcase_coltable_namecolumncolumn_labeldata_typeoriginal_column_headersvaluesvalues_sourcefilessheet_namesbatchcolumn_numbers
0rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...sample_idsample_idString[['sample_id']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[sample][0][1]
1rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...primary_siteprimary_siteString[['primary_site']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[diagnosis][0][5]
2rms_mutation_predictionTruebigquery-public-data.idc_v18_clinical.rms_muta...participant_idparticipant_idString[['participant_id']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[participant][0][0]
3rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...age_at_diagnosisage_at_diagnosisfloat64[['age_at_diagnosis']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[diagnosis][0][6]
4rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...dicom_patient_ididc_provenance_dicom_patient_idString[['idc_provenance_dicom_patient_id']][]None[][][][]
5rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...dicom_patient_ididc_provenance_dicom_patient_idString[['idc_provenance_dicom_patient_id']][]None[][][][]
6rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...dicom_patient_ididc_provenance_dicom_patient_idString[['idc_provenance_dicom_patient_id']][]None[][][][]
7rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...days_to_recurrencedays_to_recurrenceString[['days_to_recurrence']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[diagnosis][0][9]
8rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...sample_anatomic_sitesample_anatomic_siteString[['sample_anatomic_site']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[sample][0][3]
9rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...days_to_last_followupdays_to_last_followupString[['days_to_last_followup']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[diagnosis][0][13]
10rms_mutation_predictionTruebigquery-public-data.idc_v18_clinical.rms_muta...participantparticipant_idparticipant.participant_idString[['participant.participant_id']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[diagnosis][0][0]
11rms_mutation_predictionTruebigquery-public-data.idc_v18_clinical.rms_muta...participantparticipant_idparticipant.participant_idString[['participant.participant_id']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[sample][0][0]
12rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...participant_age_at_collectionparticipant_age_at_collectionfloat64[['participant_age_at_collection']][]None[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[sample][0][4]
13rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...tumor_gradetumor_gradeString[['tumor_grade']][{'option_code': '', 'option_description': None}]derived from inspection of values[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[sample][0][6]
14rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...diagnosis_iddiagnosis_idString[['diagnosis_id']][{'option_code': '', 'option_description': None}]derived from inspection of values[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[diagnosis][0][1]
15rms_mutation_predictionFalsebigquery-public-data.idc_v18_clinical.rms_muta...tumor_morphologytumor_morphologyString[['tumor_morphology']][{'option_code': '', 'option_description': None}]derived from inspection of values[CCDI_Submission_Template_v1.0.1_DM_v2.2023-02...[sample][0][10]
| 16 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | sample_description | sample_description | String | [['sample_description']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [12] |
| 17 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | tumor_incidence_type | tumor_incidence_type | String | [['tumor_incidence_type']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [11] |
| 18 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | tumor_stage_clinical_m | tumor_stage_clinical_m | String | [['tumor_stage_clinical_m']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [9] |
| 19 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | tumor_stage_clinical_n | tumor_stage_clinical_n | String | [['tumor_stage_clinical_n']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [8] |
| 20 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | tumor_stage_clinical_t | tumor_stage_clinical_t | String | [['tumor_stage_clinical_t']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [7] |
| 21 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | progression_or_recurrence | progression_or_recurrence | String | [['progression_or_recurrence']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [14] |
| 22 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | tissue_or_organ_of_origin | tissue_or_organ_of_origin | String | [['tissue_or_organ_of_origin']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [12] |
| 23 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | site_of_resection_or_biopsy | site_of_resection_or_biopsy | String | [['site_of_resection_or_biopsy']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [15] |
| 24 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | days_to_last_known_disease_status | days_to_last_known_disease_status | String | [['days_to_last_known_disease_status']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [11] |
| 25 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | primary_diagnosis_reference_source | primary_diagnosis_reference_source | String | [['primary_diagnosis_reference_source']] | [{'option_code': '', 'option_description': None}] | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [4] |
| 26 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | stage | Stage | String | [['Stage']] | [{'option_code': '', 'option_description': Non... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [8] |
| 27 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | last_known_disease_status | last_known_disease_status | String | [['last_known_disease_status']] | [{'option_code': '', 'option_description': Non... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [10] |
| 28 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | metastasis_at_diagnosis | Metastasis_at_diagnosis | String | [['Metastasis_at_diagnosis']] | [{'option_code': '', 'option_description': Non... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [7] |
| 29 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | histological_classification | Histological_Classification | String | [['Histological_Classification']] | [{'option_code': '', 'option_description': Non... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [5] |
| 30 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | race | race | String | [['race']] | [{'option_code': '', 'option_description': Non... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [participant] | [0] | [1] |
| 31 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | source_batch | idc_provenance_source_batch | int64 | [['idc_provenance_source_batch']] | [{'option_code': '0', 'option_description': No... | derived from inspection of values | [] | [] | [] | [] |
| 32 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | source_batch | idc_provenance_source_batch | int64 | [['idc_provenance_source_batch']] | [{'option_code': '0', 'option_description': No... | derived from inspection of values | [] | [] | [] | [] |
| 33 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | source_batch | idc_provenance_source_batch | int64 | [['idc_provenance_source_batch']] | [{'option_code': '0', 'option_description': No... | derived from inspection of values | [] | [] | [] | [] |
| 34 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | sample_type | sample_type | String | [['sample_type']] | [{'option_code': 'Tumor', 'option_description'... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [2] |
| 35 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | sample_tumor_status | sample_tumor_status | String | [['sample_tumor_status']] | [{'option_code': 'Tumor', 'option_description'... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [sample] | [0] | [13] |
| 36 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | gender | gender | String | [['gender']] | [{'option_code': 'Female', 'option_description... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [participant] | [0] | [2] |
| 37 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | primary_diagnosis | primary_diagnosis | String | [['primary_diagnosis']] | [{'option_code': 'Rhabdomyosarcoma', 'option_d... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [3] |
| 38 | rms_mutation_prediction | False | bigquery-public-data.idc_v18_clinical.rms_muta... | disease_type | disease_type | String | [['disease_type']] | [{'option_code': 'Soft Tissue Tumors and Sarco... | derived from inspection of values | [CCDI_Submission_Template_v1.0.1_DM_v2.2023-02... | [diagnosis] | [0] | [2] |
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "matching_items", + "summary": "{\n \"name\": \"matching_items\",\n \"rows\": 39,\n \"fields\": [\n {\n \"column\": \"collection_id\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"rms_mutation_prediction\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"case_col\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"table_name\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"bigquery-public-data.idc_v18_clinical.rms_mutation_prediction_sample\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"column\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 34,\n \"samples\": [\n \"tumor_stage_clinical_m\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"column_label\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 34,\n \"samples\": [\n \"tumor_stage_clinical_m\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"data_type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"String\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"original_column_headers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"values\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"values_source\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"derived from inspection of values\"\n ],\n 
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"files\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sheet_names\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"batch\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"column_numbers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Downloading the tables with clinical data\n", + "\n", + "The `table_name` column in `clinical_index` refers to the actual tables that contain the listed columns. These tables are relatively small and are maintained by IDC both in Google BigQuery and as Parquet files available for download from a public AWS bucket.\n", + "\n", + "Because their total size is small, it is easiest to download all of them, although you can also download individual tables." + ], + "metadata": { + "id": "dFqOIXInR0N0" + } + }, + { + "cell_type": "code", + "source": [ + "%%capture\n", + "idc_version = c.get_idc_version()\n", + "\n", + "clinical_data_aws_path = f\"s3://idc-open-metadata/bigquery_export/idc_{idc_version}_clinical/*\"\n", + "\n", + "!mkdir -p idc_clinical_data\n", + "!s5cmd --no-sign-request cp {clinical_data_aws_path} ./idc_clinical_data" + ], + "metadata": { + "id": "BksJlKqwSPPw" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Accessing the table that contains a specific metadata attribute\n", + "\n", + "Let's assume we are interested in the `tumor_grade` attribute (row 13 in the table above). 
The `table_name` column above tells us that it is contained in the table `bigquery-public-data.idc_v18_clinical.rms_mutation_prediction_sample`. This is the fully qualified table name in BigQuery. To locate the downloaded copy of the table, we need only the last component of the name: `rms_mutation_prediction_sample`.\n", + "\n", + "We can access and search this table next." + ], + "metadata": { + "id": "3dbLIUCea0hd" + } + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "\n", + "a_table_df = pd.read_parquet('./idc_clinical_data/rms_mutation_prediction_sample')\n", + "\n", + "a_table_df" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 583 + }, + "id": "geX2KhQqcWHc", + "outputId": "4a0531e3-a93f-4b60-9059-a1dc74a7043f" + }, + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " dicom_patient_id source_batch participantparticipant_id sample_id \\\n", + "0 RMS2325 0 RMS2325 PAWDLM \n", + "1 RMS2124 0 RMS2124 PATMDI \n", + "2 RMS2137 0 RMS2137 PATVPL \n", + "3 RMS2140 0 RMS2140 PATYYW \n", + "4 RMS2145 0 RMS2145 PAUKHP \n", + ".. ... ... ... ... \n", + "398 RMS2374 0 RMS2374 PAUPVA \n", + "399 RMS2352 0 RMS2352 PASGZC \n", + "400 RMS2205 0 RMS2205 PAMSJL \n", + "401 RMS2267 0 RMS2267 PALWAA \n", + "402 RMS2459 0 RMS2459 PAPRFM \n", + "\n", + " sample_type sample_anatomic_site participant_age_at_collection \\\n", + "0 Tumor Leg 44.56 \n", + "1 Tumor 0.90 \n", + "2 Tumor 0.83 \n", + "3 Tumor 1.07 \n", + "4 Tumor 2.72 \n", + ".. ... ... ... \n", + "398 Tumor Paratesticular, left 2.46 \n", + "399 Tumor Paratesticular, right 0.68 \n", + "400 Tumor Pelvis 2.76 \n", + "401 Tumor Soft tissue, abdomen 17.96 \n", + "402 Tumor Prostate 8.39 \n", + "\n", + " histological_classification tumor_grade \\\n", + "0 \n", + "1 BOTRYOID \n", + "2 BOTRYOID \n", + "3 BOTRYOID \n", + "4 BOTRYOID \n", + ".. ... ... 
\n", + "398 SPINDLE CELL RHABDOMYOSARCOMA \n", + "399 SPINDLE CELL RHABDOMYOSARCOMA \n", + "400 MIXED ALVEOLAR AND EMBRYONAL RHABDOMYOSARCOMA \n", + "401 MIXED ALVEOLAR AND EMBRYONAL RHABDOMYOSARCOMA \n", + "402 EMBRYONAL RHABDOMYOSARCOMA WITH DIFFUSE ANAPLASIA \n", + "\n", + " tumor_stage_clinical_t tumor_stage_clinical_n tumor_stage_clinical_m \\\n", + "0 \n", + "1 \n", + "2 \n", + "3 \n", + "4 \n", + ".. ... ... ... \n", + "398 \n", + "399 \n", + "400 \n", + "401 \n", + "402 \n", + "\n", + " tumor_morphology tumor_incidence_type sample_description \\\n", + "0 \n", + "1 \n", + "2 \n", + "3 \n", + "4 \n", + ".. ... ... ... \n", + "398 \n", + "399 \n", + "400 \n", + "401 \n", + "402 \n", + "\n", + " sample_tumor_status \n", + "0 Tumor \n", + "1 Tumor \n", + "2 Tumor \n", + "3 Tumor \n", + "4 Tumor \n", + ".. ... \n", + "398 Tumor \n", + "399 Tumor \n", + "400 Tumor \n", + "401 Tumor \n", + "402 Tumor \n", + "\n", + "[403 rows x 16 columns]" + ], + "text/html": [ + "\n", + "
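| | dicom_patient_id | source_batch | participantparticipant_id | sample_id | sample_type | sample_anatomic_site | participant_age_at_collection | histological_classification | tumor_grade | tumor_stage_clinical_t | tumor_stage_clinical_n | tumor_stage_clinical_m | tumor_morphology | tumor_incidence_type | sample_description | sample_tumor_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RMS2325 | 0 | RMS2325 | PAWDLM | Tumor | Leg | 44.56 | | | | | | | | | Tumor |
| 1 | RMS2124 | 0 | RMS2124 | PATMDI | Tumor | | 0.90 | BOTRYOID | | | | | | | | Tumor |
| 2 | RMS2137 | 0 | RMS2137 | PATVPL | Tumor | | 0.83 | BOTRYOID | | | | | | | | Tumor |
| 3 | RMS2140 | 0 | RMS2140 | PATYYW | Tumor | | 1.07 | BOTRYOID | | | | | | | | Tumor |
| 4 | RMS2145 | 0 | RMS2145 | PAUKHP | Tumor | | 2.72 | BOTRYOID | | | | | | | | Tumor |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 398 | RMS2374 | 0 | RMS2374 | PAUPVA | Tumor | Paratesticular, left | 2.46 | SPINDLE CELL RHABDOMYOSARCOMA | | | | | | | | Tumor |
| 399 | RMS2352 | 0 | RMS2352 | PASGZC | Tumor | Paratesticular, right | 0.68 | SPINDLE CELL RHABDOMYOSARCOMA | | | | | | | | Tumor |
| 400 | RMS2205 | 0 | RMS2205 | PAMSJL | Tumor | Pelvis | 2.76 | MIXED ALVEOLAR AND EMBRYONAL RHABDOMYOSARCOMA | | | | | | | | Tumor |
| 401 | RMS2267 | 0 | RMS2267 | PALWAA | Tumor | Soft tissue, abdomen | 17.96 | MIXED ALVEOLAR AND EMBRYONAL RHABDOMYOSARCOMA | | | | | | | | Tumor |
| 402 | RMS2459 | 0 | RMS2459 | PAPRFM | Tumor | Prostate | 8.39 | EMBRYONAL RHABDOMYOSARCOMA WITH DIFFUSE ANAPLASIA | | | | | | | | Tumor |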
\n", + "

403 rows × 16 columns

\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "a_table_df", + "summary": "{\n \"name\": \"a_table_df\",\n \"rows\": 403,\n \"fields\": [\n {\n \"column\": \"dicom_patient_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 403,\n \"samples\": [\n \"RMS2394\",\n \"RMS2429\",\n \"RMS2435\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"source_batch\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"participantparticipant_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 403,\n \"samples\": [\n \"RMS2394\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sample_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 403,\n \"samples\": [\n \"PAKXHZ\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sample_type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Tumor\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sample_anatomic_site\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 179,\n \"samples\": [\n \"Skull Base, Right\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"participant_age_at_collection\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.371507327410722,\n \"min\": 0.02,\n \"max\": 45.25,\n \"num_unique_values\": 360,\n \"samples\": [\n 12.75\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"histological_classification\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"BOTRYOID\"\n ],\n \"semantic_type\": 
\"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tumor_grade\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tumor_stage_clinical_t\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tumor_stage_clinical_n\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tumor_stage_clinical_m\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tumor_morphology\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tumor_incidence_type\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sample_description\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sample_tumor_status\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Tumor\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now that this table is loaded, you can search it as you would any pandas dataframe (or you can use SQL with duckdb, as shown earlier!).\n", + "\n", + "Note that the `dicom_patient_id` column, 
which you will find in **every** clinical data table, can be used to link clinical metadata attributes to the DICOM image metadata!" + ], + "metadata": { + "id": "NXVWCXJTk5Dk" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Summary\n", + "\n", + "We hope you enjoyed this tutorial! If something didn't work as expected, or if you have any feedback or suggestions for what should be added to this tutorial, please contact IDC support by sending an email to support@canceridc.dev or posting your question on the [IDC User forum](https://discourse.canceridc.dev)." + ], + "metadata": { + "id": "C1T3OjYBleUW" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Acknowledgments\n", + "\n", + "Imaging Data Commons has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.\n", + "\n", + "If you use IDC in your research, please cite the following publication:\n", + "\n", + "> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. _National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence_. RadioGraphics (2023). https://doi.org/10.1148/rg.230180" + ], + "metadata": { + "id": "EldMrJ_llh5B" + } + } + ] +} \ No newline at end of file
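
The notebook explains that a downloaded table is located by the last component of its fully qualified BigQuery name. That step could be sketched roughly as follows. This is only an illustration: `local_table_path` is a hypothetical helper (it is not part of `idc-index`), and the default directory assumes the `idc_clinical_data` folder created in the download cell.

```python
from pathlib import Path


def local_table_path(bq_table_name: str, data_dir: str = "./idc_clinical_data") -> Path:
    """Map a fully qualified BigQuery table name to its downloaded location.

    Hypothetical helper for illustration; not provided by idc-index.
    """
    # "project.dataset.table" -> "table" (keep only the last dot-separated part)
    short_name = bq_table_name.rsplit(".", 1)[-1]
    return Path(data_dir) / short_name


# Value taken from the `table_name` column of the clinical index shown in the notebook
table_name = "bigquery-public-data.idc_v18_clinical.rms_mutation_prediction_sample"
table_path = local_table_path(table_name)
print(table_path)
```

The resulting path can then be passed to `pd.read_parquet`, in the same way the notebook loads `a_table_df`.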