Add ability to request specific dimension data. #35

Draft
wants to merge 2 commits into
base: main

@aaronweeden aaronweeden commented Oct 30, 2024

WORK IN PROGRESS

Description

This PR adds the ability to request specific dimension data. For example, in this call to get_data() (which uses the renamed date_range and group_by parameters), note the fields specified for the group_by and filters parameters:

with dw:
    df = dw.get_data(
        date_range=('2024-01-01', '2024-01-01'),
        realm='Jobs',
        metric='Number of Jobs Ended',
        group_by={
            'User': (
                'Label',
                'ACCESS ID',
                'ORCID',
                'Globus ID',
            )
        },
        dataset_type='timeseries',
        aggregation_unit='Day',
        filters={
            'User': {
                'ACCESS ID': (
                    'access-example1',
                    'access-example2',
                )
            },
        },
    )

The df variable will be assigned a data frame like the following, which is in the long format from #32. Note the extra columns for ACCESS ID, ORCID, and Globus ID:

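|   | Date | Metric | User ID | User Label | User ACCESS ID | User ORCID | User Globus ID | Value |
|---|------|--------|---------|------------|----------------|------------|----------------|-------|
| 0 | 2024-01-01 | Number of Jobs Ended | 12345 | Example User 1 | access-example1 | 1234-5678-1234-5678 | globus-example1 | 3 |
| 1 | 2024-01-01 | Number of Jobs Ended | 12346 | Example User 2 | access-example2 | 1234-5678-1234-5679 | globus-example2 | 2 |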

If the group_by parameter is just given a single string, e.g.:

group_by='User'

then the only dimension columns in the data frame will be ID and Label.

Similarly, if the filters parameter does not have a field specified, e.g.:

filters={
    'User': ('Example User 1', 'Example User 2')
}

then the Label field will be used for filtering. Otherwise, the specified field will be used for filtering, as in the first example above.
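One way to picture the two accepted shapes of the filters parameter is to normalize both into the same form. The sketch below is only an illustration under assumptions: normalize_filters is a hypothetical helper name, not something this PR defines, and treating a single string as a one-element collection is also an assumption.

```python
def normalize_filters(filters):
    """Map each dimension to a {field: tuple_of_values} dictionary.

    Hypothetical helper for illustration only; not part of this PR.
    """
    normalized = {}
    for dimension, spec in filters.items():
        if isinstance(spec, dict):
            # A field was given explicitly, e.g. {'ACCESS ID': (...)}.
            fields = spec
        else:
            # No field was given, so fall back to the Label field.
            fields = {'Label': spec}
        normalized[dimension] = {
            # Assumption: a single string means a one-element collection.
            field: (values,) if isinstance(values, str) else tuple(values)
            for field, values in fields.items()
        }
    return normalized


print(normalize_filters({'User': ('Example User 1', 'Example User 2')}))
print(normalize_filters(
    {'User': {'ACCESS ID': ('access-example1', 'access-example2')}}
))
```

Both calls produce the same nested shape, which makes the "Label by default" rule explicit.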

The group_by parameter can also take a dictionary with a single key, where the key is the dimension's ID or label. The value can be a collection of field names (as in the first example above) or a single string, e.g.:

group_by={
    'User': 'ORCID'
}

In that case, the only dimension columns in the data frame will be ID and ORCID.
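The three group_by forms described above (a plain string, a dictionary with a single string value, and a dictionary with a collection) can be summarized in a small sketch. The names resolve_group_by and column_names are hypothetical, and the column naming (dimension key followed by the field name) is inferred from the example data frame rather than taken from the implementation. How a dimension ID such as 'person' would be mapped to its label is not covered here.

```python
def resolve_group_by(group_by):
    """Return (dimension, fields) for a group_by value.

    Hypothetical helper for illustration only; not part of this PR.
    """
    if isinstance(group_by, str):
        # A plain string yields only the ID and Label fields.
        return group_by, ('ID', 'Label')
    if len(group_by) != 1:
        raise ValueError('group_by dictionary must have exactly one key')
    ((dimension, fields),) = group_by.items()
    if isinstance(fields, str):
        fields = (fields,)
    # The ID field always comes first; skip it if the caller listed it too.
    return dimension, ('ID',) + tuple(f for f in fields if f != 'ID')


def column_names(group_by):
    """Build the expected dimension column names (naming is assumed)."""
    dimension, fields = resolve_group_by(group_by)
    return [f'{dimension} {field}' for field in fields]


print(column_names('User'))
print(column_names({'User': 'ORCID'}))
print(column_names({'User': ('Label', 'ACCESS ID', 'ORCID', 'Globus ID')}))
```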

The data frame returned by the get_dimension_metadata() method will now include a column listing the additional dimension fields that can be used for grouping or filtering, e.g.:

| ID | Label | Description | Fields |
|----|-------|-------------|--------|
| person | User | A person who is on a PIs allocation, hence able to run jobs on resources. | ID, Label, ACCESS ID, ORCID, Globus ID |
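The Fields column could be used to check a requested field before making a query. The following is a rough sketch: check_fields is a hypothetical helper, and it assumes the Fields value is a comma-separated string as displayed above, which may not match the actual column type.

```python
def check_fields(requested, available):
    """Return the requested fields, or raise if any are not available.

    Hypothetical helper for illustration only; not part of this PR.
    """
    # Assumption: Fields is a comma-separated string, as displayed above.
    known = [name.strip() for name in available.split(',')]
    unknown = [name for name in requested if name not in known]
    if unknown:
        raise KeyError(f'Unknown field(s) {unknown}; expected one of {known}')
    return tuple(requested)


user_fields = 'ID, Label, ACCESS ID, ORCID, Globus ID'
print(check_fields(('ACCESS ID', 'ORCID'), user_fields))
```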

The get_dimension_data() method will now have an additional fields parameter that allows specifying which fields to include in the resulting data frame, as in:

df = get_dimension_data('Jobs', 'User', ('ACCESS ID', 'ORCID'))

The resulting df will have this structure:

| User ID | User ACCESS ID | User ORCID |
|---------|----------------|------------|
| 12345 | access-example1 | 1234-5678-1234-5678 |
| 12346 | access-example2 | 1234-5678-1234-5679 |

If the fields parameter is not given, then only the ID and Label fields will be included.
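A data frame with this structure can serve as a lookup table for attaching extra identifiers to results that were grouped with only the default fields. The sketch below uses hand-built stand-in data frames with the example values from this description, since it does not call the data warehouse. The 'User ID' and 'User Label' column names for a group_by='User' result are an assumption based on the first example.

```python
import pandas as pd

# Stand-in for get_dimension_data('Jobs', 'User', ('ACCESS ID', 'ORCID')).
users = pd.DataFrame({
    'User ID': [12345, 12346],
    'User ACCESS ID': ['access-example1', 'access-example2'],
    'User ORCID': ['1234-5678-1234-5678', '1234-5678-1234-5679'],
})

# Stand-in for a get_data() result with group_by='User' (column names assumed).
jobs = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01'],
    'Metric': ['Number of Jobs Ended', 'Number of Jobs Ended'],
    'User ID': [12345, 12346],
    'User Label': ['Example User 1', 'Example User 2'],
    'Value': [3, 2],
})

# Attach the extra identifiers by joining on the shared ID column.
enriched = jobs.merge(users, on='User ID', how='left')
print(enriched[['User Label', 'User ORCID', 'Value']])
```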

Motivation and Context

Some entities have multiple IDs associated with them depending on the context. This PR makes it easier to work with such entities in the Data Analytics Framework.

Tests performed

Types of changes

  • Refactoring / documentation update (non-breaking change which does not change functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Release preparation

Checklist:

  • CHANGELOG.md has been updated
  • The milestone is set correctly on the pull request
  • The appropriate labels have been added to the pull request
  • Running the automated tests (see docs/developing.md) produces no errors
  • Updates have been made to the xdmod-notebooks repository as necessary, and the notebooks all run successfully

@aaronweeden aaronweeden added the enhancement New feature or request label Oct 30, 2024
@aaronweeden aaronweeden added this to the 2.0.0 milestone Oct 30, 2024