Duplicate entries in qualitative measurements table #70

Open
fedorov opened this issue Feb 27, 2023 · 1 comment

fedorov commented Feb 27, 2023

The issues below were reported by @deepakri201 via Discord. Need to investigate.

Using the query in #69, I think the same kind of issue as before is occurring: duplicates appear when a slice has more than one body part region or more than one landmark assigned to it. For example:

  1. For the case of multiple regions per slice -- If we take PatientID="LUNG1-002", and check where trackingIdentifier="Annotations group 14", we should only get 2 rows corresponding to Abdomen and Chest regions, but we get 4 rows.

  2. For the case of multiple landmarks per slice -- If we take PatientID="LUNG1-001", and check where trackingIdentifier="Annotations group landmarks 1" , we should only get 2 rows corresponding to Kidney + Bottom, and L2 vertebra + Center, but we get 4 rows.

However, using the query here: https://github.com/vkt1414/etl_flow/blob/dde527d1e3ad85fcabe3571a66468f69c387a033/bq/derived_table_creation/BQ_Table_Building/derived_data_views/sql/qualitative_measurements.sql, the regions and landmarks are correct. Andrey, I think you may have worked from a slightly older version of Vamsi's query, from before he fixed these problems.
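
The duplication described in the two examples above can be confirmed by counting identical rows in a query result. The following is a minimal sketch, not the actual check used here: the rows are made up to imitate the reported symptom, `find_duplicate_rows` is a hypothetical helper, and placing the region in a `findingSite` column is an assumption about the query output.

```python
from collections import Counter


def find_duplicate_rows(rows, key_columns):
    """Return {key_tuple: count} for every key that occurs more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows imitating the reported symptom: 2 rows expected, 4 returned.
# Only the PatientID and trackingIdentifier values come from the report.
rows = [
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Abdomen"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Chest"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Abdomen"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Chest"},
]

duplicates = find_duplicate_rows(rows, ["PatientID", "trackingIdentifier", "findingSite"])
for key, n in duplicates.items():
    print(key, "appears", n, "times")
```

Running the same helper over the real query output, exported as a list of dictionaries, would show which combinations are repeated.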

@fedorov fedorov self-assigned this Feb 27, 2023
fedorov added a commit to fedorov/etl_flow that referenced this issue Feb 28, 2023
This reverts changes introduced in ImagingDataCommons#64
and following PRs.

Following the investigation of the issues identified in
ImagingDataCommons#70, I decided not
to hold the release of v13 any longer. These revisions will require
more work and time.

fedorov commented Feb 28, 2023

TL;DR: After reviewing this, thinking about possible solutions, and discussing with @vkt1414, I decided to revert the updates to the queries done in #64 and proceed with v13.


The original queries that correspond to bigquery-public-data.idc_current.qualitative_measurements and bigquery-public-data.idc_current.quantitative_measurements were written to flatten the content of the TID 1500 SRs that we had at the time. In all of those SRs, both quantitative and qualitative measurements accompanied image regions defined by segmentations, as in the example below (section of the output of dsrdump for the file in gs://idc-dev-open/c5dd463f-7740-47da-80d3-e6114904e5c3.dcm):

 <contains CONTAINER:(,,"Imaging Measurements")=SEPARATE>
    <contains CONTAINER:(,,"Measurement Group")=SEPARATE>
      <has obs context TEXT:(,,"Activity Session")="1">
      <has obs context TEXT:(,,"Tracking Identifier")="Nodule 1">
      <has obs context UIDREF:(,,"Tracking Unique Identifier")="2.25.84572801268285922663419591960434030454640929448094786485074">
      <contains CODE:(,,"Finding")=(M-03010,SRT,"Nodule")>
      <has obs context TEXT:(,,"Time Point")="1">
      <contains IMAGE:(,,"Referenced Segment")=(SG image,,1)>
      <contains UIDREF:(,,"Source series for segmentation")="1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192">
      <has concept mod CODE:(,,"Finding Site")=(T-28000,SRT,"Lung")>
      <contains NUM:(,,"Volume")="6.594475E+03" (mm3,UCUM,"cubic millimeter")>
        <has concept mod TEXT:(,,"Algorithm Name")="pylidc">
        <has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
      <contains NUM:(,,"Diameter")="3.195933E+01" (mm,UCUM,"millimeter")>
        <has concept mod TEXT:(,,"Algorithm Name")="pylidc">
        <has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
      <contains NUM:(,,"Surface area of mesh")="2.392704E+03" (mm2,UCUM,"square millimeter")>
        <has concept mod TEXT:(,,"Algorithm Name")="pylidc">
        <has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
      <contains CODE:(,,"Subtlety score")=(105,99LIDCQIICR,"5 out of 5 (Obvious)")>
      <contains CODE:(,,"Internal structure")=(C12471,NCIt,"Soft tissue")>

For all of those SRs, the following assumptions were valid:

  1. each measurement group contains exactly one "Finding" and exactly one "Finding site" concept
  2. each measurement group contains one or more NUM content items with the quantitative measurements
  3. each measurement group contains one or more CODE content items with the qualitative measurements/assessments, none of which uses the "Finding" or "Finding site" concepts, so they can be distinguished from the items in 1 above.
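
The three assumptions above can be expressed as a small check. This is only a sketch: representing a measurement group as a list of `(value_type, concept_name)` pairs is a simplification introduced here for illustration, and `check_group_assumptions` is a hypothetical helper, not part of the existing queries.

```python
def check_group_assumptions(items):
    """Return the list of violated assumptions for one measurement group.

    `items` is a list of (value_type, concept_name) pairs, a simplified
    stand-in for the content items of a TID 1500 measurement group.
    """
    names = [name.lower() for _, name in items]
    problems = []
    # Assumption 1: exactly one "Finding" and exactly one "Finding site".
    if names.count("finding") != 1 or names.count("finding site") != 1:
        problems.append("expected exactly one Finding and one Finding site")
    # Assumption 2: at least one NUM item (quantitative measurement).
    if not any(vtype == "NUM" for vtype, _ in items):
        problems.append("expected at least one NUM item")
    # Assumption 3: at least one CODE item other than Finding / Finding site.
    qualitative = [
        name for vtype, name in items
        if vtype == "CODE" and name.lower() not in ("finding", "finding site")
    ]
    if not qualitative:
        problems.append("expected at least one qualitative CODE item")
    return problems


# Items taken from the "Nodule 1" group shown above (context items omitted).
segmentation_group = [
    ("CODE", "Finding"), ("IMAGE", "Referenced Segment"), ("CODE", "Finding Site"),
    ("NUM", "Volume"), ("NUM", "Diameter"), ("NUM", "Surface area of mesh"),
    ("CODE", "Subtlety score"), ("CODE", "Internal structure"),
]
# A group holding only a "Finding Site" item and a source image reference.
annotation_group = [("CODE", "Finding Site"), ("IMAGE", "Source")]

print(check_group_assumptions(segmentation_group))
print(check_group_assumptions(annotation_group))
```

A group that carries only a "Finding Site" item and an image reference fails all three checks, which is the situation the rest of this comment discusses.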

With those assumptions, the result of "flattening" was the following table schema (for the qualitative measurements):

[image: schema of the qualitative measurements table]

Now, the new dataset has SRs that:

  1. Contain only qualitative measurements.
  2. Use "Finding site" concept to describe the actual qualitative assessment, and not the location of the segmented region, with multiple "Finding site" content items allowable within the same measurement group.
    <contains CONTAINER:(,,"Measurement Group")=CONTINUOUS>
      <has obs context TEXT:(,,"Tracking Identifier")="Annotations group 162">
      <has obs context UIDREF:(,,"Tracking Unique Identifier")="1.2.826.0.1.3680043.8.498.20536057271431471310083689490807745912">
      <has concept mod CODE:(,,"Finding Site")=(45048000,SCT,"Neck")>
      <contains IMAGE:(,,"Source")=(CT image,)>

It is not clear to me what the expected behavior should be when flattening the measurement groups above into the table schema we established, or whether doing so would make any sense at all.

We could, arguably, put "Finding site" into the findingSite column and have one row for each "Finding site" content item. In my opinion this would be confusing, since the Quantity/Value columns would have to either replicate the findingSite column or be left blank. The query would also get quite complex, since we would probably need to detect measurement groups that do not accompany segmentations and process them differently. Yet another idea would be to use a concept other than "Finding site" for those annotations (I was reluctant to use that concept from the start, as noted in https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1218#issuecomment-1372254742, anticipating problems due to this clash in the meaning of the concept).
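
To make the first option concrete, here is a rough sketch of what one row per "Finding site" content item could look like. `flatten_finding_sites` is a hypothetical helper and the row layout is reduced to the columns mentioned in this thread; it is not the actual table schema or query.

```python
def flatten_finding_sites(tracking_identifier, finding_sites):
    """Emit one row per "Finding site" item of a group without a segmentation."""
    return [
        {
            "trackingIdentifier": tracking_identifier,
            "findingSite": site,
            # Nothing natural to put here: the site itself is the assessment,
            # so these columns are left blank (or would repeat findingSite).
            "Quantity": None,
            "Value": None,
        }
        for site in finding_sites
    ]


# Two regions assigned to the same slice, as in the report at the top.
for row in flatten_finding_sites("Annotations group 14", ["Abdomen", "Chest"]):
    print(row)
```

Every row ends up with empty Quantity/Value columns, which is the confusing outcome described above.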

Alternatively, we could have a completely separate query that handles evaluations that are not derived from segmentations. I think this would be easier for the user to understand. For v13 we should do just that, and use that query in the notebooks and other materials accompanying the new nnU-Net-BPR-annotations collection.
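
As a sketch of how the two kinds of measurement groups might be told apart before being routed to separate queries: the code below treats the presence of a "Referenced Segment" content item as the discriminator. That choice is an assumption based only on the two dsrdump excerpts above, and both `split_by_segmentation` and the dictionary layout are hypothetical.

```python
def split_by_segmentation(groups):
    """Partition measurement groups by whether they reference a segment."""
    with_segmentation, without_segmentation = [], []
    for group in groups:
        # Assumed discriminator: a "Referenced Segment" content item.
        has_segment = "Referenced Segment" in group["concepts"]
        target = with_segmentation if has_segment else without_segmentation
        target.append(group)
    return with_segmentation, without_segmentation


groups = [
    {"trackingIdentifier": "Nodule 1",
     "concepts": ["Finding", "Referenced Segment", "Finding Site", "Volume"]},
    {"trackingIdentifier": "Annotations group 162",
     "concepts": ["Finding Site", "Source"]},
]

segmentation_based, annotation_only = split_by_segmentation(groups)
print([g["trackingIdentifier"] for g in segmentation_based])
print([g["trackingIdentifier"] for g in annotation_only])
```

The first list would feed the existing flattening queries and the second the separate query proposed above.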
