The PI_NAME field is current free-text and unconstrained #6

matdon17 · 2020-11-24T11:05:55Z

PI_NAME is a free-text field with no links to an external resource, and as a result it may not identify a person uniquely.

PI_NAME is populated in all sorts of variant ways, differing in letter case, use of initials, honorifics, etc. Examples include:
Virginie THIERRY
B. Klein
Dr. Birgit Klein
BRECK OWENS, STEVEN JAYNE, P.E. ROBBINS
Pierre-Marie Poulain
M Ravichandran
DEAN ROEMMICH
GREGORY C. JOHNSON

We have options, e.g. ORCiD, which we could reference: http://www.orcid.org/

tcarval · 2021-09-23T14:56:17Z

We should probably provide guidance on how to fill the PI_NAME variable:

FirstName LastName
Upper case for the initial letter and lower case for the rest
Example : Anna Klein
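
For illustration, the capitalization rule above could be sketched as follows. This is a minimal Python sketch, not an agreed Argo tool: the function name `normalize_pi_name` is hypothetical, and it only adjusts letter case and spacing. It does not remove honorifics, expand initials or reorder name parts.

```python
import re


def normalize_pi_name(raw):
    """Capitalize each alphabetic run: first letter upper case, rest lower case.

    Hyphenated or dotted parts (e.g. "PIERRE-MARIE", "P.E.") are handled
    run by run, because the regex below only matches letters.
    """
    def capitalize(match):
        word = match.group(0)
        return word[0].upper() + word[1:].lower()

    # Collapse repeated whitespace and trim both ends.
    collapsed = " ".join(raw.split())
    # [^\W\d_]+ matches a run of letters (Unicode-aware), excluding digits and "_".
    return re.sub(r"[^\W\d_]+", capitalize, collapsed)
```

Applied to the variants listed at the top of this thread, "DEAN ROEMMICH" and "Virginie THIERRY" would both come out in the FirstName LastName form, while "Dr. Birgit Klein" would keep its honorific.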

Should Argo manage a vocabulary for PI_NAME? Probably not.

We could add an additional variable: PI_NAME_ID
This variable would contain an ORCID, a ResearcherID or another ID
PI_NAME_ID can be a list of blank-separated IDs
Example: http://orcid.org/0000-0001-2345-6789
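
If a variable like PI_NAME_ID were adopted, its content could be sanity-checked offline. The following is a sketch in Python under stated assumptions: the function names are hypothetical, IDs are assumed to be given in the URL form shown in the example, and the check digit is computed with the ISO 7064 MOD 11-2 scheme that ORCID uses for its identifiers. It does not check that the ORCID actually exists or belongs to the named PI.

```python
import re

# Accept both http and https; the 16th character may be a digit or "X".
ORCID_URL = re.compile(r"https?://orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])")


def orcid_check_digit(base15):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid_url(url):
    """True if the URL has the ORCID shape and a consistent check character."""
    match = ORCID_URL.fullmatch(url)
    if match is None:
        return False
    digits = match.group(1).replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def split_pi_ids(value):
    """Split a blank-separated list of IDs into individual strings."""
    return value.split()
```

A caller could run `is_valid_orcid_url` over each element returned by `split_pi_ids` and report the ones that fail.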

mscanderbeg · 2021-10-04T21:08:23Z

I agree with the guidance on how to fill the PI_NAME variable. This is simple and easily readable.

Would the PI_NAME_ID also be 'unrestricted' in the sense that any ID could be used? Is this more useful because each person should have only one ID rather than multiple spellings of a name? In that case, we should probably discourage PIs from changing the ID over time.

I imagine we would not manage a vocab of PI_NAME_ID if we don't manage a vocab of PI_NAME.

I'm just wondering about the usefulness of adding yet another uncontrolled PI variable, even if it is supposed to be unique.

tcarval · 2021-10-05T10:16:09Z

The PI_NAME_ID could be restricted to accepted IDs.
I know of two types of researcher ID: ORCID and ResearcherID
Accepted IDs could be:

ORCID : https://orcid.org/*
ResearcherID : https://www.researcherid.com/rid/*
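
As a sketch of how such a restriction might be checked, the two wildcard patterns above can be matched with Python's standard `fnmatch` module. The dictionary and function names are hypothetical, and only the URL prefix is tested, not the ID itself.

```python
from fnmatch import fnmatchcase

# The two accepted patterns proposed above, keyed by scheme name.
ACCEPTED_ID_PATTERNS = {
    "ORCID": "https://orcid.org/*",
    "ResearcherID": "https://www.researcherid.com/rid/*",
}


def id_scheme(identifier):
    """Return the name of the first accepted scheme that matches, else None."""
    for scheme, pattern in ACCEPTED_ID_PATTERNS.items():
        if fnmatchcase(identifier, pattern):
            return scheme
    return None
```

Note that with these exact patterns an `http://orcid.org/...` URL would be rejected, because both patterns require `https`.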

apswong · 2021-10-05T21:45:49Z

I agree guidance on how to fill PI_NAME is good. This can be provided quite easily under the "Comment" column in 2.2.4, 2.3.4, 2.4.4, etc, in the Argo Users Manual.

I'm not convinced that a new variable, such as PI_NAME_ID, is needed. I would like to see a stronger argument about why it's needed, aside from the general need for a controlled vocabulary. Without knowing why it's needed specifically, it's difficult to design a new variable. Also, the variable name "PI_NAME_ID" does not make sense, since it implies a duplication of information (name + id).

If such a new variable is genuinely needed, I would advocate for the use of ORCID, which is already being used to uniquely identify delayed-mode operators in the global attributes of the D and BD files. I would also suggest the variable name "PI_ORCID", and that it be a new variable in the meta files only. This does not need to be in every profile file.

RomainCancouet · 2021-10-06T12:40:24Z

In the review of this metadata field that my colleague @lucarduini did this summer (see e.g. PI_NAME_2021.xlsx, which lists the existing PI_NAME values on the GDAC and a possible new value for each, in order to 1. harmonize the different existing spellings of the same PI_NAME and 2. harmonize lower/upper case), we chose the guidance to be FirstName LASTNAME, but FirstName LastName is equally fine.

If we do not manage a vocabulary of the PI names, how can we prevent new entries that do not follow the guidance, or that do not match the existing PI_NAME value for a specific PI?

Could the presence of a PI_(ORC)ID in the meta file automatically update the corresponding PI_NAME value?

Other points I see are:

How do we manage multiple existing names in one PI_NAME value (e.g. BRECK OWENS, STEVEN JAYNE, P.E. ROBBINS)? Should we keep one name, have an N_LEVEL, etc.?
In the guidance, should we discourage the use of accents and symbols (ë, ., -, ö, ', /, _, Ú)?

mscanderbeg · 2021-10-06T19:18:36Z

I don't have a strong feeling between FirstName LASTNAME vs. FirstName LastName.

It is tricky to prevent new entries of PI_NAME that do not follow this guidance if we do not manage a list. Would this be improved at all by using an ORCID? Would we ask the FileChecker to somehow check that the URL is resolvable? Or would we manage a list of acceptable ORCIDs?

I agree with the suggestion of PI_ORCID. I'm not sure I understand what Romain means by 'the presence of PI_ORCID automatically updating the PI_NAME value'. Does that mean that the PI_NAME entry would be replaced with the PI_ORCID text? Would this be done at the DAC level?

As to Romain's first bullet point, I agree that it could be useful to track floats from a certain institution and sometimes the PI_NAME changes over time. Is there another place in the files where the institution is recorded? I think it is important to keep the various PI_NAMEs even if they refer to floats from the same group/institution.

Perhaps it would be good to discourage the use of accents and symbols as they are not always handled well.

apswong · 2021-10-07T04:27:47Z

The point of this discussion is to explore how to have a controlled vocabulary for the variable PI_NAME, so that the person can be identified uniquely. @tcarval did not think we should manage a vocabulary for PI_NAME, and so suggested a new parallel ID variable. @RomainCancouet suggested that we should, in fact, manage a controlled vocabulary for PI_NAME.

In my opinion, PI_NAME is an existing variable, and should never be replaced. Therefore we should manage a controlled vocabulary for it to make sure it is useful. I suggest a new reference table for PI_NAME. Its entries will be char strings of the form "Firstname LASTNAME". Multiple entries for the same person are not allowed, thereby ensuring uniqueness. Multiple unique PIs can be concatenated into one char string separated by commas, e.g. "Firstname1 LASTNAME1, Firstname2 LASTNAME2, ...".
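
To make the proposed string form concrete, here is a hedged Python sketch that splits a comma-separated PI_NAME value and flags entries that do not look like "Firstname LASTNAME". The regular expression is an assumption of this sketch (ASCII letters only, optional middle initials, surname of at least two upper-case letters) and is not part of any Argo specification; the function names are hypothetical.

```python
import re

# Given name(s) capitalized, optional middle initials, then an upper-case surname.
PI_ENTRY = re.compile(
    r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*(?: [A-Z]\.?)* [A-Z]{2,}(?:[-'][A-Z]+)*"
)


def split_pi_names(pi_name):
    """Split a PI_NAME string on commas and trim each entry."""
    return [part.strip() for part in pi_name.split(",") if part.strip()]


def malformed_entries(pi_name):
    """Return the entries that do not match the 'Firstname LASTNAME' shape."""
    return [name for name in split_pi_names(pi_name)
            if PI_ENTRY.fullmatch(name) is None]
```

A check of this kind only tests the shape of each name. Uniqueness of a person would still have to come from the reference table itself.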

In that case, a new parallel ID variable, such as PI_ORCID, is not needed. If ORCID is somehow necessary, then it can be an extra column in the PI_NAME reference table. But we still won't have to create a new parallel ID variable. It is undesirable to create new variables unnecessarily, especially when an existing variable can do the job.

gwemon · 2021-10-07T07:21:24Z

Dear All, a couple of practical points regarding this issue. If you decide that creating a new vocab for the Argo variable PI_NAME is the most efficient way forward, then do you have a feel for how homonyms will be managed? Also bear in mind that ORCIDs have unique URIs. These can either be referenced in a field in the data files or linked to the PI_NAMES in the new vocabulary. So the relationship between the two could be managed as mappings.

tcarval · 2021-10-07T14:00:13Z

Gwen, I did not envisage a reference table for individual names, having in mind the GDPR sensitivity of personal data.
However, if we think that this is acceptable, it is certainly the simplest solution:

a recommendation in the manual to use "First-name NAME" or "First-name Name" (as with ORCID)
a new table with mapping to ORCID

In case of homonyms, well... let's hope we won't have any, as the table would contain hundreds, not thousands, of entries.

roswri · 2021-11-12T12:20:26Z

After a meeting with Vi and reading the discussion on this ticket, we have decided a controlled vocabulary (ideally with a mapping to ORCID) is the best solution.

GDPR
I have consulted the GDPR expert at BODC and they have said that, since the names are already in the public domain and there is a business need to capture them, there shouldn't be any GDPR issues with creating a controlled vocabulary. They have recommended that we document the following:

preference to use work-based identification (e.g. name, work email address and/or ORCID) rather than private emails; that way we are just using what's already in the public domain
there is a business need for us to capture (and retain indefinitely) this info, so that there is an audit trail of who submitted the data and who can be contacted in case of queries about it
requirement for this data to be kept securely - this should be in an Oracle table, or similar, which has various cybersecurity features, i.e. not stuffed in a spreadsheet on a drive somewhere
consent of the PI - not mandatory, but good practice - this should be covered by the statements in the submissions form or submissions tool

Format
@tcarval @mscanderbeg @nvs-vocabs/oceanops @RomainCancouet @lucarduini @apswong
I have created a draft version of what the controlled vocab might look like in the NVS based on Romain and Luca's work: PI_NAME_2021_NVS.xlsx
Does anyone have any preferences/ideas/opinions on what the Identifier could be? I have gone for a concatenation of First initial and Surname to make sure it is unique, but I think there might be a length limit, so any other ideas are welcome.
The 'Preferred label' and 'Alternative label' could be made the same or different (as in the spreadsheet), any preferences?
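
One way to test whether a "first initial plus surname" identifier is actually unique is to generate it for every label and look for collisions. The following Python sketch assumes labels of the form "Firstname LASTNAME"; the function names are hypothetical and the exact identifier format used in the draft spreadsheet may differ.

```python
from collections import defaultdict


def make_identifier(preferred_label):
    """Build an identifier from the first initial and the last name part."""
    parts = preferred_label.split()
    if not parts:
        raise ValueError("empty label")
    # A single-word label has no separate first name, so use it as is.
    raw = parts[0] if len(parts) == 1 else parts[0][0] + parts[-1]
    # Keep letters and digits only, in upper case.
    return "".join(ch for ch in raw.upper() if ch.isalnum())


def find_collisions(labels):
    """Return {identifier: [labels]} for identifiers shared by several labels."""
    by_identifier = defaultdict(list)
    for label in labels:
        by_identifier[make_identifier(label)].append(label)
    return {ident: names for ident, names in by_identifier.items()
            if len(names) > 1}
```

Two PIs who share a first initial and a surname would map to the same identifier, so this scheme alone does not guarantee uniqueness. A collision report like the one above would show where a suffix or a different scheme is needed.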

Many thanks,
Roseanna

apswong · 2021-11-14T21:09:54Z

@roswri I agree that we need to control the entries in the PI_NAME variable. This variable currently has a length limit of 64 characters, so there is sufficient space for all the names.

Regarding the PI_NAME xlsx - thanks for drafting it. However, I don't understand the purpose of the 3 columns. Could you explain them please? Is 'Identifier' what you propose we use to fill PI_NAME? And why is there a 'Preferred label' and an 'Alternative label'? If the difference is only in setting the last name to upper case or lower case, then we should simply decide on one, and not leave both in the xlsx. Lastly, some PI names are missing in the xlsx.

Thanks, Annie

roswri · 2021-11-15T09:18:37Z

Hi @apswong ,

The columns in the excel file relate to the standard columns on the NVS (e.g. https://vocab.nerc.ac.uk/search_nvs/R04/).

'Identifier' is a unique code for the term that will become part of the URI for the term e.g. BO for BODC in the R04 vocabulary: http://vocab.nerc.ac.uk/collection/R04/current/BO/
With regard to 'Preferred label' and 'Alternative label', I have included both because both columns will be shown on the page for the vocab, but Alternative label could be left blank or made the same as the Preferred label. Usually Preferred label is the full title of the term and Alternative label is a short title or another name that the term might be known by, to help users searching for the term, but this is not really relevant for this use case.
There are several instances where the Preferred label or Alternative label have been used to fill Argo metadata fields instead of the identifier, so I would recommend we use the preferred label to fill PI_NAME.
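
The relationship between an identifier and its term URI can be written down as a small helper. This Python sketch simply follows the URL layout of the R04 example above; the function name is hypothetical and no request is made to the NVS.

```python
NVS_COLLECTION_BASE = "http://vocab.nerc.ac.uk/collection"


def nvs_concept_uri(collection, identifier, version="current"):
    """Build a term URI of the form <base>/<collection>/<version>/<identifier>/."""
    return f"{NVS_COLLECTION_BASE}/{collection}/{version}/{identifier}/"
```

For the example above, `nvs_concept_uri("R04", "BO")` reproduces the BODC term URI.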

Thanks for letting me know that some PI names are missing, I will try to get a more complete list! Could you give me an example of a PI that is missing from the list so I can check something?

Thanks,
Roseanna

apswong · 2021-11-15T16:09:57Z

Hello @roswri. Examples of some PIs who are missing from the xlsx are 'KENNETH JOHNSON' and 'STEVEN JAYNE'. They are in PI_NAME with multiple PIs concatenated into one char string, separated by commas.

RomainCancouet · 2021-11-15T16:30:56Z

I reckon the issue might come from the version of the file provided following Luca's analysis. Only one name was kept for multiple PIs, whereas indeed it is better to keep the different PI names, separated by commas.

roswri · 2021-11-18T10:06:00Z

Thanks @apswong and @RomainCancouet! I have updated the list with the missing names: PI_NAME_2021_NVS_v2.xlsx
There were a couple of names that were a bit vague:

'P.E. ROBBINS' - I found a Paul E Robbins and a Pelle E Robbins both from WHOI associated with Argo related papers, any ideas which one P.E. Robbins is most likely to be?
'NICHOLSON' - listed alongside WIJFFELS, so I think this is 'David Nicholson' based on this article, do you agree?
'JAYNCE' I think this is a spelling mistake and is actually 'Steven JAYNE'

For the purposes of the controlled vocabulary I think each name should be a separate entry, but the guidance should state that multiple terms can be given if separated by commas.

Many thanks,
Roseanna

apswong · 2021-11-18T11:04:31Z

Hi @roswri.

Paul E Robbins and Pelle E Robbins are the same person. 'Pelle E Robbins' is the current name and so should be the name to use here.
'David Nicholson' is correct.
'JAYNCE' corrected to 'Steven JAYNE' is correct.

I agree each name should be a separate entry, and that multiple names can be used to fill PI_NAME if separated by commas.

roswri · 2021-11-18T11:15:13Z

Hi @apswong,

Thanks for clearing that up, I should have worked out they were the same person since they have the same email address! I will add Pelle E Robbins and David Nicholson to the list.

Many thanks,
Roseanna

RomainCancouet · 2021-11-18T14:38:10Z

Hello @roswri
Thanks!
Did you parse all the recent GDAC data to build the list of PI names? I did not check all the content of your file, but I could not find some PI names (e.g. Nicolas BARRE https://data-argo.ifremer.fr/dac/coriolis/3900396/ or CHUNSHENG JING https://data-argo.ifremer.fr/dac/csio/2902873/).

Once the vocab is in place, do you think we can update the content of the meta files on the GDAC with the updated values for PI names?

roswri · 2021-11-26T09:18:01Z

Hi @RomainCancouet,

I intend to check for any additional missing PI names in the recent GDAC metadata files. I'm having some technical difficulties at the moment, but I will do that when I can and add any further PI names to the list.

Regarding your question about updating the content of the meta files on the GDAC with the values from the new controlled vocabulary, do you mean from the individual DAC perspective or if it's something the GDAC can do for all meta files across the board? I'm not sure how the process for updating the meta files works, but I imagine as long as there's a mapping between the current terms in use and the new controlled vocabulary terms the files can be updated with the new terms.
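
As a sketch of what such a mapping-based update could look like, the following Python function rewrites a comma-separated PI_NAME string using a lookup from legacy spellings to controlled terms. The mapping entries are illustrative only, taken from the corrections discussed earlier in this thread, and the function name is hypothetical. The real preferred labels and their letter case would come from the vocabulary.

```python
# Illustrative mapping only; real entries would be taken from the vocabulary.
LEGACY_TO_CONTROLLED = {
    "JAYNCE": "Steven JAYNE",
    "STEVEN JAYNE": "Steven JAYNE",
    "P.E. ROBBINS": "Pelle E Robbins",
    "NICHOLSON": "David Nicholson",
}


def remap_pi_name(pi_name, mapping):
    """Return (new_value, unmapped) for a comma-separated PI_NAME string.

    Names found in the mapping are replaced; names not found are kept
    unchanged and also listed in `unmapped` so they can be reviewed by hand.
    """
    new_names, unmapped = [], []
    for name in (part.strip() for part in pi_name.split(",")):
        if not name:
            continue
        if name in mapping:
            new_names.append(mapping[name])
        else:
            new_names.append(name)
            unmapped.append(name)
    return ", ".join(new_names), unmapped
```

Returning the unmapped names separately keeps the update conservative: nothing is silently dropped, and anything without a known mapping can be checked before a file is rewritten.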

The next thing to make a decision on is the table metadata e.g.
Table ID: RPI
TITLE: Names of Argo principal investigators
SHORT_NAME: PI_NAME
DEFINITION: Names of principal investigators in charge of Argo floats used in Argo metadata NetCDF files.

I'm wondering if it should be PI specific, or if there's potential for the table to be used for other fields as well like FLOAT_OWNER, in which case we may want to re-think the table metadata to make it more generic. Any thoughts on this?

Thanks,
Roseanna

RomainCancouet · 2021-11-26T10:12:58Z

Hi @roswri ,

OK, thanks for the future check of missing PI names.

Yes, my question regarding the meta files was: are we willing, and are DACs able, to update the content of the meta files once some entries (PI_NAME, etc.) are constrained and new values suggested? I acknowledge that this will require effort, and maybe it could be done only once other metadata fields have been revisited by this task team, e.g. #5, #2, etc.

For your question about FLOAT_OWNER, as it is presently populated with a mixture of people, institution or DAC names, I do not know.
Romain

roswri · 2021-12-03T09:17:16Z

@tcarval @nvs-vocabs/oceanops @RomainCancouet
Hi all,

Who should be granted editing permissions for this vocabulary for PI_NAME? I can add extra people if anyone else decides they want or need editing permissions later on.

Thanks,
Roseanna

roswri · 2022-01-10T13:50:17Z

Hi @tcarval and @nvs-vocabs/oceanops ,

It seems we were able to reach a decision on the specifics of the PI_NAME NVS table at ADMT, which is great.
However, we still need to decide who will oversee this collection (i.e. who will have the ability to request/add new terms). I have tagged Thierry and OceanOps as the best candidates that I am aware of, but please let me know if you object to overseeing the collection, or if there is anyone else that should have permission to edit the collection.

Many thanks,
Roseanna

tcarval · 2022-01-10T14:12:11Z

Hello Roseanna - @roswri , I agree to be editor of this new collection

RomainCancouet · 2023-10-06T07:18:50Z

Thanks to BODC, the (constrained) list of PI names is now available in the NVS (https://vocab.nerc.ac.uk/search_nvs/R40/).

It is now a matter of updating the list if entries are missing (people can contact the Vocabs Editors as described on the ADMT webpage), and of using this table with the FileChecker (nvs-vocabs/ArgoVocabs_Meetings#1) to constrain the allowed entries in netCDF meta files.

apswong · 2023-10-09T17:03:04Z

Thanks to all for working on PI_NAME. I would like to add that when a float has multiple PIs, the unique entries in R40 can be concatenated into one character string to fill the PI_NAME variable in the various Argo data files. The current practice is to use commas to separate the multiple unique names. This should be explained clearly in the Users Manual and implemented in the GDAC file checker.
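
A check of this kind could look roughly like the following Python sketch. It is not the GDAC FileChecker implementation: the function name is hypothetical, the set of allowed labels is assumed to have been loaded from R40 beforehand, and the 64-character limit is the one quoted for PI_NAME earlier in this thread.

```python
PI_NAME_MAX_LEN = 64  # length limit for PI_NAME quoted earlier in this thread


def check_pi_name(pi_name, allowed_labels):
    """Return a list of problems found in a comma-separated PI_NAME value."""
    problems = []
    if len(pi_name) > PI_NAME_MAX_LEN:
        problems.append(f"PI_NAME longer than {PI_NAME_MAX_LEN} characters")
    for name in (part.strip() for part in pi_name.split(",")):
        if not name:
            problems.append("empty entry between commas")
        elif name not in allowed_labels:
            problems.append(f"not in vocabulary: {name!r}")
    return problems


# Placeholder labels for the example only; the real list lives in R40.
EXAMPLE_LABELS = {"Breck OWENS", "Steven JAYNE"}
```

An empty list means that every comma-separated entry was found in the supplied label set and that the whole string fits in the variable.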

tcarval · 2023-10-26T06:14:17Z

At ADMT-24, there was agreement on having a controlled list of PIs.
There are privacy concerns about exposing PI_NAME in a public list on the Internet.
Can NVS restrict access to this list? Should the list be managed outside the NVS?

matdon17 · 2023-10-26T06:28:55Z

Hi Thierry, Aren't these same PI names going to appear in open access data in the same form anyway? If so, the same privacy concerns surely apply to the dataset entries as well as a vocab entry. Matt

gwemon · 2023-10-26T07:15:04Z

@tcarval we cannot hide the content of a collection published on the NVS. Have you considered the use of ORCIDs for the PIs? You could capture people's ORCIDs instead of their names if privacy is required.

tcarval · 2023-10-26T08:13:25Z

@matdon17, hello Matt,
The remaining concern was "consent of the PI - not mandatory, but good practice - this should be covered by the statements in the submissions form or submissions tool".
A note will be sent to the AST team, and we should get a green light.

tcarval · 2023-10-26T08:16:30Z

@gwemon Yes, ORCID was proposed, but it did not receive strong support. As a next step, we may manage the link to DOI from R40.

vpaba · 2023-10-27T11:57:54Z

@matdon17 has a point, in that once information is in an open access data file, there is nothing stopping any tool from extracting the information, collating it and re-exposing it under a category. For example, OceanOps has a list of people's names under its 'Contacts' filter, and the Euro Argo fleet monitoring tool has a 'PI_NAME' tick box that can be selected in its search results filter so that names are displayed next to the corresponding float numbers.

I thus think that removing or replacing people's names from the NVS R40 collection will only partly address the underlying concern - though it would also be the quickest way to start.

vpaba · 2024-05-07T11:57:54Z

Issue resolved post ADMT-24. PI_NAME collection is live: https://vocab.nerc.ac.uk/search_nvs/R40/ or https://vocab.nerc.ac.uk/collection/R40/current/

apswong · 2024-10-11T15:16:45Z

@tcarval Now that PI_NAME entries are officially a controlled vocabulary in R40, we should update the "Comment" column in the Users Manual for PI_NAME with this new information, and explain that multiple names can be concatenated, separated by commas, to fill PI_NAME.

tcarval · 2024-10-11T15:47:02Z

I updated the "comment" column for PI_NAME
I opened this "FileChecker" ticket to implement the check: OneArgo/ArgoFormatChecker#25

https://docs.google.com/document/d/1vFLDlk4paPPUGUFVgTBn0QadrqVsmWy1fFL6eb2w6gE/edit?tab=t.0#bookmark=id.3fnk8nvardpx

apswong · 2024-10-11T16:27:01Z

@tcarval Thank you, Thierry. We also need to update the PI_NAME comment column in Sections 2.3.3 and 2.4.4.

tcarval · 2024-10-16T15:40:00Z

Oops, yes, I just did it; this is noted in the history section of the Users Manual.

vpaba added the avtt Argo Vocabulary Task Team label Jun 3, 2021

vpaba changed the title ~~The PI_NAME field is current free-text and uncontrained~~ The PI_NAME field is current free-text and unconstrained Sep 27, 2021

vpaba assigned roswri Nov 10, 2021

tcarval mentioned this issue Dec 9, 2022

Improve consistency with controlled vocabularies #44

Open

tcarval added admt approval requested This ticket is waiting for ADMT approval and removed avtt Argo Vocabulary Task Team labels Dec 9, 2022

tcarval removed the admt approval requested This ticket is waiting for ADMT approval label Oct 26, 2023

vpaba closed this as completed May 7, 2024

github-project-automation bot added this to AVTT issues management Aug 29, 2024

github-project-automation bot moved this to Done in AVTT issues management Aug 29, 2024

tcarval reopened this Oct 11, 2024

tcarval closed this as completed Oct 11, 2024

tcarval mentioned this issue Oct 11, 2024

check pi_name content OneArgo/ArgoFormatChecker#25

Open
