Add ability to request specific dimension data. #35

aaronweeden · 2024-10-30T14:57:01Z

WORK IN PROGRESS

Description

This PR adds the ability to request specific dimension data. For example, in the following call to get_data() (which uses the renamed date_range and group_by parameters), note the fields specified for the group_by and filters parameters:

with dw:
    df = get_data(
        date_range=('2024-01-01', '2024-01-01'),
        realm='Jobs',
        metric='Number of Jobs Ended',
        group_by={
            'User': (
                'Label',
                'ACCESS ID',
                'ORCID',
                'Globus ID',
            )
        },
        dataset_type='timeseries',
        aggregation_unit='Day',
        filters={
            'User': {
                'ACCESS ID': (
                    'access-example1',
                    'access-example2',
                )
            },
        },
    )

The df variable will be assigned a data frame like the following (in the long format from #32). Note the extra columns for ACCESS ID, ORCID, and Globus ID:

	Date	Metric	User ID	User Label	User ACCESS ID	User ORCID	User Globus ID	Value
0	2024-01-01	Number of Jobs Ended	12345	Example User 1	`access-example1`	`1234-5678-1234-5678`	`globus-example1`	3
1	2024-01-01	Number of Jobs Ended	12346	Example User 2	`access-example2`	`1234-5678-1234-5679`	`globus-example2`	2
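Because the result is in long format, it can be reshaped with ordinary pandas calls. As a rough sketch, the following rebuilds the example rows above by hand (the values are the placeholder values from the table, and no request to the data warehouse is made) and pivots them so that each ACCESS ID becomes its own column:

```python
import pandas as pd

# Rows copied from the example table above (placeholder values, not real users).
df = pd.DataFrame(
    {
        'Date': ['2024-01-01', '2024-01-01'],
        'Metric': ['Number of Jobs Ended', 'Number of Jobs Ended'],
        'User ID': [12345, 12346],
        'User Label': ['Example User 1', 'Example User 2'],
        'User ACCESS ID': ['access-example1', 'access-example2'],
        'User ORCID': ['1234-5678-1234-5678', '1234-5678-1234-5679'],
        'User Globus ID': ['globus-example1', 'globus-example2'],
        'Value': [3, 2],
    }
)

# Pivot from long to wide: one row per date, one column per ACCESS ID.
wide = df.pivot(index='Date', columns='User ACCESS ID', values='Value')
print(wide)
```

Any of the other dimension columns (for example User ORCID) could be used for the columns argument in the same way.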

If the group_by parameter is just given a single string, e.g.:

group_by='User'

Then the only dimension columns in the data frame will be ID and Label.

Similarly, if the filters parameter does not have a field specified, e.g.:

filters={
    'User': ('Example User 1', 'Example User 2')
}

Then the Label field will be used for filtering. Otherwise, it will use the specified field for filtering, as in the first example above.
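The two accepted shapes of filters can be reduced to one internal shape. The sketch below is only an illustration of the rule described above, not the code in this PR: normalize_filters is a hypothetical helper, and its handling of a single string value is an assumption.

```python
def normalize_filters(filters, default_field='Label'):
    """Return {dimension: {field: tuple_of_values}}.

    Hypothetical helper, not part of the PR.
    """
    normalized = {}
    for dimension, spec in filters.items():
        if isinstance(spec, dict):
            # A field was given explicitly, e.g. {'ACCESS ID': (...)}.
            fields = spec
        else:
            # No field was given, so the values are matched against the Label.
            fields = {default_field: spec}
        normalized[dimension] = {
            # Assumption: a bare string is treated as a single value.
            field: (values,) if isinstance(values, str) else tuple(values)
            for field, values in fields.items()
        }
    return normalized


print(normalize_filters({'User': ('Example User 1', 'Example User 2')}))
print(normalize_filters({'User': {'ACCESS ID': ('access-example1',)}}))
```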

Alternatively, group_by can take a dictionary with a single key, where the key is the dimension's ID or label. The value can be a collection (as in the first example above) or a single string, e.g.:

group_by={
    'User': 'ORCID'
}

In which case the only dimension columns in the data frame will be ID and ORCID.
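To summarize the accepted shapes of group_by, here is a minimal sketch of how they map to dimension columns. The helpers normalize_group_by and dimension_columns are hypothetical, and the sketch assumes that the ID column always comes first and that the key is already the dimension's label:

```python
def normalize_group_by(group_by):
    """Return (dimension, fields) for a string or single-key dict.

    Hypothetical helper, not part of the PR.
    """
    if isinstance(group_by, str):
        # Plain string: only the ID and Label columns are returned.
        return group_by, ('ID', 'Label')
    if len(group_by) != 1:
        raise ValueError('group_by must have exactly one key')
    dimension, fields = next(iter(group_by.items()))
    if isinstance(fields, str):
        fields = (fields,)
    # Assumption: the ID column is always present and listed first.
    return dimension, ('ID',) + tuple(f for f in fields if f != 'ID')


def dimension_columns(group_by):
    """Return the dimension column names expected in the data frame."""
    dimension, fields = normalize_group_by(group_by)
    return [f'{dimension} {field}' for field in fields]


print(dimension_columns('User'))             # ['User ID', 'User Label']
print(dimension_columns({'User': 'ORCID'}))  # ['User ID', 'User ORCID']
```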

The data frame returned by the get_dimension_metadata() method will now include a column listing the additional dimension fields that can be used for grouping or filtering, e.g.:

ID	Label	Description	Fields
person	User	A person who is on a PIs allocation, hence able to run jobs on resources.	ID, Label, ACCESS ID, ORCID, Globus ID
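The Fields column could be used to check a request before it is sent. The sketch below assumes the cell is a comma-separated string, as it is displayed above (the actual representation may differ), and unknown_fields is a hypothetical helper:

```python
def unknown_fields(requested, fields_cell):
    """Return the requested fields that are not listed in a Fields cell."""
    # Assumption: the cell is a comma-separated string of field names.
    available = {field.strip() for field in fields_cell.split(',')}
    return [field for field in requested if field not in available]


fields_cell = 'ID, Label, ACCESS ID, ORCID, Globus ID'
print(unknown_fields(('ACCESS ID', 'ORCID'), fields_cell))  # []
print(unknown_fields(('ORCID', 'Email'), fields_cell))      # ['Email']
```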

The get_dimension_data() method will now have an additional fields parameter that specifies which fields to include in the resulting data frame, as in:

df = get_dimension_data('Jobs', 'User', ('ACCESS ID', 'ORCID'))

The resulting df will have this structure:

User ID	User ACCESS ID	User ORCID
12345	`access-example1`	`1234-5678-1234-5678`
12346	`access-example2`	`1234-5678-1234-5679`

If the fields parameter is not given, then only the ID and Label fields will be included.
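One possible use of such a data frame is to attach extra identifiers to data that only carries the User ID. The sketch below builds both frames by hand from the placeholder values above (no data warehouse call is made) and joins them with pandas:

```python
import pandas as pd

# Shaped like the get_dimension_data() result shown above.
dimension_df = pd.DataFrame(
    {
        'User ID': [12345, 12346],
        'User ACCESS ID': ['access-example1', 'access-example2'],
        'User ORCID': ['1234-5678-1234-5678', '1234-5678-1234-5679'],
    }
)

# A hand-built frame that only has the User ID (placeholder values).
data_df = pd.DataFrame(
    {
        'Date': ['2024-01-01', '2024-01-01'],
        'User ID': [12345, 12346],
        'Value': [3, 2],
    }
)

# Left join on User ID so every data row keeps its place.
merged = data_df.merge(dimension_df, on='User ID', how='left')
print(merged)
```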

Motivation and Context

Some entities have multiple IDs associated with them depending on the context. This PR makes it easier to work with such entities in the Data Analytics Framework.

Tests performed

Types of changes

Refactoring / documentation update (non-breaking change which does not change functionality)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Release preparation

Checklist:

CHANGELOG.md has been updated
The milestone is set correctly on the pull request
The appropriate labels have been added to the pull request
Running the automated tests (see docs/developing.md) produces no errors
Updates have been made to the xdmod-notebooks repository as necessary, and the notebooks all run successfully

aaronweeden added 2 commits October 30, 2024 09:47

Add changelog entry.

942ba4d

Update PR title.

5e043dd

aaronweeden added the enhancement label Oct 30, 2024

aaronweeden added this to the 2.0.0 milestone Oct 30, 2024

This was referenced Oct 30, 2024

Rename methods. #36

Draft

Change get_data() to always return a long form data frame with dimension IDs. #32

Draft
