The PI_NAME field is current free-text and unconstrained #6

Closed
matdon17 opened this issue Nov 24, 2020 · 36 comments

@matdon17

PI_NAME is a free-text field with no links to an external resource, and as a result may not uniquely identify a person.

PI_NAME is populated in many variant forms, with differences in case, initials, honorifics, etc. Examples include:
Virginie THIERRY
B. Klein
Dr. Birgit Klein
BRECK OWENS, STEVEN JAYNE, P.E. ROBBINS
Pierre-Marie Poulain
M Ravichandran
DEAN ROEMMICH
GREGORY C. JOHNSON

There are external identifiers we could reference, e.g. ORCiD: http://www.orcid.org/

@vpaba vpaba added the avtt Argo Vocabulary Task Team label Jun 3, 2021
@tcarval
Contributor

tcarval commented Sep 23, 2021

We should probably provide guidance to fill the PI_NAME variable:

  • FirstName LastName
  • Upper case for the initial letter and lower case for the remaining letters
  • Example: Anna Klein
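
A minimal sketch of how this guidance could be applied to existing values, assuming plain Python string handling is acceptable. The function names `to_guidance_form` and `follows_guidance` are hypothetical and not part of any Argo tool.

```python
def to_guidance_form(raw_name: str) -> str:
    """Collapse whitespace and capitalise each name part, e.g. "Anna Klein"."""
    # split() with no argument drops leading, trailing and repeated whitespace.
    collapsed = " ".join(raw_name.split())
    # str.title() upper-cases the first letter of each word and lower-cases
    # the rest; it also capitalises after a hyphen ("Pierre-Marie").
    return collapsed.title()


def follows_guidance(name: str) -> bool:
    """True if the name has at least two parts and is already in guided form."""
    return len(name.split()) >= 2 and name == to_guidance_form(name)


if __name__ == "__main__":
    for raw in ["DEAN ROEMMICH", "  Virginie   THIERRY ", "Pierre-Marie Poulain"]:
        print(repr(raw), "->", to_guidance_form(raw))
```

This only harmonises case and spacing. It cannot expand initials ("B. Klein"), strip honorifics ("Dr."), or preserve internal capitals such as "McDonald", so those cases would still need a human decision.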

Should Argo manage a vocabulary for PI_NAME? Probably not.

We could add an additional variable: PI_NAME_ID
This variable would contain an ORCID, a ResearcherID or another ID
PI_NAME_ID can be a list of blank-separated IDs
Example: http://orcid.org/0000-0001-2345-6789
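
A sketch of how a blank-separated PI_NAME_ID value could be sanity-checked, assuming only ORCID iDs are used. The check digit follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers; the function names are hypothetical, and ResearcherID is not handled here.

```python
import re

# Accepts a bare iD or one prefixed with http(s)://orcid.org/
ORCID_PATTERN = re.compile(
    r"^(?:https?://orcid\.org/)?([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X])$"
)


def orcid_checksum_ok(orcid: str) -> bool:
    """Return True if the string looks like an ORCID iD with a valid check digit."""
    match = ORCID_PATTERN.match(orcid)
    if match is None:
        return False
    chars = match.group(1).replace("-", "")
    # ISO 7064 MOD 11-2 over the first 15 digits.
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15] == expected


def invalid_ids(pi_name_id: str) -> list[str]:
    """Split a blank-separated PI_NAME_ID value and list entries that fail the check."""
    return [item for item in pi_name_id.split() if not orcid_checksum_ok(item)]


if __name__ == "__main__":
    print(invalid_ids("http://orcid.org/0000-0001-2345-6789 0000-0001-2345-6788"))
```

A valid check digit only shows that the iD is well formed; it does not confirm that the iD is registered or that it belongs to the named PI.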

@vpaba vpaba changed the title The PI_NAME field is current free-text and uncontrained The PI_NAME field is current free-text and unconstrained Sep 27, 2021
@mscanderbeg

I agree with the guidance on how to fill the PI_NAME variable. This is simple and easily readable.

Would the PI_NAME_ID also be 'unrestricted' in the sense that any ID could be used? Is this more useful because each person should have only one ID rather than multiple spellings of a name? In that case, we should probably discourage PIs from changing the ID over time.

I imagine we would not manage a vocab of PI_NAME_ID if we don't manage a vocab of PI_NAME.

I'm just wondering about the usefulness of adding yet another uncontrolled PI variable, even if it is supposed to be unique.

@tcarval
Contributor

tcarval commented Oct 5, 2021

The PI_NAME_ID could be restricted to accepted IDs.
I know of two types of researcher ID: ORCID and ResearcherID
Accepted IDs could be:

@apswong

apswong commented Oct 5, 2021

I agree guidance on how to fill PI_NAME is good. This can be provided quite easily under the "Comment" column in 2.2.4, 2.3.4, 2.4.4, etc, in the Argo Users Manual.

I'm not convinced that a new variable, such as PI_NAME_ID, is needed. I would like to see a stronger argument about why it's needed, aside from the general need for a controlled vocabulary. Without knowing why it's needed specifically, it's difficult to design a new variable. Also, the variable name "PI_NAME_ID" does not make sense, since it implies a duplication of information (name + id).

If such a new variable is genuinely needed, I would advocate for the use of ORCID, which is already being used to uniquely identify delayed-mode operators in the global attributes of the D and BD files. I would also suggest the variable name "PI_ORCID", and that it be a new variable in the meta files only. This does not need to be in every profile file.

@RomainCancouet

In the review of this metadata field that my colleague @lucarduini did this summer (e.g. PI_NAME_2021.xlsx, with the existing PI_NAME values on the GDAC and possible new values to 1. harmonize different existing spellings of the same PI_NAME and 2. harmonize lower/upper case), we chose the guidance to be FirstName LASTNAME, but FirstName LastName is equally fine.

If we do not manage a vocabulary of the PI names, how can we prevent new entries that do not follow the guidance, or that do not match the existing PI_NAME value for a specific PI?

Could the presence of a PI_(ORC)ID in the meta file automatically update the corresponding PI_NAME value?

Other points I see are:

  • how do we manage multiple existing names in the PI_NAME value (BRECK OWENS, STEVEN JAYNE, P.E. ROBBINS)? Should we keep one name, have a N_LEVEL, etc.?
  • in the guidance should we discourage the use of accents and symbols (ë, ., -, ö, ', /, _, Ú)?

@mscanderbeg

I don't have a strong feeling between FirstName LASTNAME vs. FirstName LastName.

It is tricky to prevent new entries of PI_NAME that do not follow this guidance if we do not manage a list. Would this be improved at all with using an ORCID? Would we ask the FileChecker to somehow check that the URL is resolvable? Or would we manage a list of acceptable ORCIDs?

I agree with the suggestion of PI_ORCID. I'm not sure I understand what Romain means by 'the presence of PI_ORCID automatically updating the PI_NAME value'. Does that mean that the PI_NAME entry would be replaced with the PI_ORCID text? Would this be done at the DAC level?

As to Romain's first bullet point, I agree that it could be useful to track floats from a certain institution and sometimes the PI_NAME changes over time. Is there another place in the files where the institution is recorded? I think it is important to keep the various PI_NAMEs even if they refer to floats from the same group/institution.

Perhaps it would be good to discourage the use of accents and symbols as they are not always handled well.

@apswong

apswong commented Oct 7, 2021

The point of this discussion is to explore how to have a controlled vocabulary for the variable PI_NAME, so that the person can be identified uniquely. @tcarval did not think we should manage a vocabulary for PI_NAME, and so suggested a new parallel ID variable. @RomainCancouet suggested that we should, in fact, manage a controlled vocabulary for PI_NAME.

In my opinion, PI_NAME is an existing variable, and should never be replaced. Therefore we should manage a controlled vocabulary for it to make sure it is useful. I suggest a new reference table for PI_NAME. Its entries will be char strings of the form "Firstname LASTNAME". Multiple entries for the same person are not allowed, thereby ensuring uniqueness. Multiple unique PIs can be concatenated into one char string separated by commas, e.g. "Firstname1 LASTNAME1, Firstname2 LASTNAME2, ...".

In that case, a new parallel ID variable, such as PI_ORCID, is not needed. If ORCID is somehow necessary, then it can be an extra column in the PI_NAME reference table. But we still won't have to create a new parallel ID variable. It is undesirable to create new variables unnecessarily, especially when an existing variable can do the job.

@gwemon
Contributor

gwemon commented Oct 7, 2021

Dear All, a couple of practical points regarding this issue. If you decide that creating a new vocab for the Argo variable PI_NAME is the most efficient way forward, then do you have a feel for how homonyms will be managed? Also bear in mind that ORCIDs have unique URIs. These can either be referenced in a field in the data files or linked to the PI_NAMES in the new vocabulary. So the relationship between the two could be managed as mappings.

@tcarval
Contributor

tcarval commented Oct 7, 2021

Gwen, I did not envisage a reference table for individual names, having in mind the GDPR sensitivity around personal data.
However, if we think that this is acceptable, it is certainly the simplest solution:

  • a recommendation in the manual to use "First-name NAME" or "First-name Name" (as with ORCID)
  • a new table with mapping to ORCID

In case of homonyms, well... let's hope we won't have any, as the table would contain hundreds, not thousands, of entries

@roswri

roswri commented Nov 12, 2021

After a meeting with Vi and reading the discussion on this ticket, we have decided a controlled vocabulary (ideally with a mapping to ORCID) is the best solution.

GDPR
I have consulted the GDPR expert at BODC and they have said that since the names are already in the public domain and there is a business need to capture them, there shouldn't be any GDPR issues with creating a controlled vocabulary. They have recommended that we document the following:

  • preference to use work-based identification (eg name, work email address and/or orcid) rather than private emails; that way we are just using what's already in the public domain
  • there is a business need for us to capture (and retain indefinitely) this info: so there is an audit trail of who submitted the data and can be contacted in case of queries about it
  • requirement for this data to be kept securely - this should be in an Oracle table, or similar which has various cybersecurity features, ie not stuffed in a spreadsheet on a drive somewhere
  • consent of the PI - not mandatory, but good practice - this should be covered by the statements in the submissions form or submissions tool

Format
@tcarval @mscanderbeg @nvs-vocabs/oceanops @RomainCancouet @lucarduini @apswong
I have created a draft version of what the controlled vocab might look like in the NVS based on Romain and Luca's work: PI_NAME_2021_NVS.xlsx
Does anyone have any preferences/ideas/opinions on what the Identifier could be? I have gone for a concatenation of First initial and Surname to make sure it is unique, but I think there might be a length limit, so any other ideas are welcome.
The 'Preferred label' and 'Alternative label' could be made the same or different (as in the spreadsheet), any preferences?

Many thanks,
Roseanna
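
One way the "first initial plus surname" identifier could be generated is sketched below. The upper-case ASCII form and the numeric suffix used when two people would collide are assumptions for illustration, not NVS rules, and `make_identifier` is a hypothetical helper. No length limit is enforced because the thread does not state one.

```python
import re
import unicodedata


def make_identifier(full_name: str, taken: set[str]) -> str:
    """Build an identifier from first initial + surname; add a suffix on clashes."""
    # Strip accents so the code is plain ASCII (assumption, not an NVS rule).
    ascii_name = (
        unicodedata.normalize("NFKD", full_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    parts = ascii_name.split()
    if len(parts) < 2:
        raise ValueError(f"expected first name and surname, got {full_name!r}")
    # First letter of the first part + the whole last part, letters only.
    base = re.sub(r"[^A-Z]", "", (parts[0][0] + parts[-1]).upper())
    candidate, suffix = base, 2
    while candidate in taken:
        # Homonym or same initial and surname: BKLEIN, BKLEIN2, BKLEIN3, ...
        candidate = f"{base}{suffix}"
        suffix += 1
    taken.add(candidate)
    return candidate


if __name__ == "__main__":
    used: set[str] = set()
    # "Bernd Klein" is a made-up name, used only to force a collision.
    for name in ["Birgit Klein", "Bernd Klein", "Pierre-Marie Poulain"]:
        print(name, "->", make_identifier(name, used))
```

The suffix keeps identifiers unique but says nothing about who is who, so homonyms would still need a human-readable distinction in the label or an ORCID mapping.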

@apswong

apswong commented Nov 14, 2021

@roswri I agree that we need to control the entries in the PI_NAME variable. This variable currently has a length limit of 64 characters, so there is sufficient space for all the names.

Regarding the PI_NAME xlsx - thanks for drafting it. However, I don't understand the purpose of the 3 columns. Could you explain them please? Is 'Identifier' what you propose we use to fill PI_NAME? And why is there a 'Preferred label' and an 'Alternative label'? If the difference is only in setting the last name to upper case or lower case, then we should simply decide on one, and not leave both in the xlsx. Lastly, some PI names are missing in the xlsx.

Thanks, Annie

@roswri

roswri commented Nov 15, 2021

Hi @apswong ,

The columns in the Excel file relate to the standard columns on the NVS (e.g. https://vocab.nerc.ac.uk/search_nvs/R04/).

  • 'Identifier' is a unique code for the term that will become part of the URI for the term e.g. BO for BODC in the R04 vocabulary: http://vocab.nerc.ac.uk/collection/R04/current/BO/
  • With regard to 'Preferred label' and 'Alternative label', I have included both because both columns will be shown on the page for the vocab, but Alternative label could be left blank or made the same as the Preferred label. Usually Preferred label is the full title of the term and Alternative label is a short title or something else that the term might be known as, to help users searching for the term, but this is not really relevant for this use case.
  • There are several instances where the Preferred label or Alternative label have been used to fill Argo metadata fields instead of the identifier, so I would recommend we use the preferred label to fill PI_NAME.

Thanks for letting me know that some PI names are missing, I will try to get a more complete list! Could you give me an example of a PI that is missing from the list so I can check something?

Thanks,
Roseanna

@apswong

apswong commented Nov 15, 2021

Hello @roswri. Examples of some PIs who are missing from the xlsx are 'KENNETH JOHNSON' and 'STEVEN JAYNE'. They are in PI_NAME with multiple PIs concatenated into one char string, separated by commas.

@RomainCancouet

I reckon the issue might come from the version of the file provided following Luca's analysis. Only one name was kept for multiple PIs, whereas indeed it is better to keep the different PI names, separated by commas.

@roswri

roswri commented Nov 18, 2021

Thanks @apswong and @RomainCancouet! I have updated the list with the missing names: PI_NAME_2021_NVS_v2.xlsx
There were a couple of names that were a bit vague:

  • 'P.E. ROBBINS' - I found a Paul E Robbins and a Pelle E Robbins both from WHOI associated with Argo related papers, any ideas which one P.E. Robbins is most likely to be?
  • 'NICHOLSON' - listed alongside WIJFFELS, so I think this is 'David Nicholson' based on this article, do you agree?
  • 'JAYNCE' I think this is a spelling mistake and is actually 'Steven JAYNE'

For the purposes of the controlled vocabulary I think each name should be a separate entry, but the guidance should state that multiple terms can be given if separated by commas.

Many thanks,
Roseanna

@apswong

apswong commented Nov 18, 2021

Hi @roswri.

  • Paul E Robbins and Pelle E Robbins are the same person. 'Pelle E Robbins' is the current name and so should be the name to use here.
  • 'David Nicholson' is correct.
  • 'JAYNCE' corrected to 'Steven JAYNE' is correct.

I agree each name should be a separate entry, and that multiple names can be used to fill PI_NAME if separated by commas.

@roswri

roswri commented Nov 18, 2021

Hi @apswong,

Thanks for clearing that up, I should have worked out they were the same person since they have the same email address! I will add Pelle E Robbins and David Nicholson to the list.

Many thanks,
Roseanna

@RomainCancouet

Hello @roswri
Thanks!
Did you parse all the recent GDAC data to build the list of PI names? I did not check all the content of your file but I could not find some PI names (e.g. Nicolas BARRE https://data-argo.ifremer.fr/dac/coriolis/3900396/ or CHUNSHENG JING https://data-argo.ifremer.fr/dac/csio/2902873/).

Once the vocab is in place, do you think we can update the content of the meta files on the GDAC with the updated values for PI names?

@roswri

roswri commented Nov 26, 2021

Hi @RomainCancouet,

I intend to check for any additional missing PI names in the recent GDAC metadata files. I'm having some technical difficulties at the moment, but I will do that when I can and add any additional PI names to the list.

Regarding your question about updating the content of the meta files on the GDAC with the values from the new controlled vocabulary, do you mean from the individual DAC perspective or if it's something the GDAC can do for all meta files across the board? I'm not sure how the process for updating the meta files works, but I imagine as long as there's a mapping between the current terms in use and the new controlled vocabulary terms the files can be updated with the new terms.
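
A sketch of what such a mapping-driven update could look like for a single PI_NAME value. The dictionary entries reuse corrections agreed earlier in this thread and are illustrative only; they are not taken from the published vocabulary, and `remap_pi_name` is a hypothetical helper rather than a DAC or GDAC tool.

```python
# Illustrative mapping from spellings currently in use to vocabulary terms.
LEGACY_TO_PREFERRED = {
    "JAYNCE": "Steven JAYNE",
    "STEVEN JAYNE": "Steven JAYNE",
    "P.E. ROBBINS": "Pelle E Robbins",
}


def remap_pi_name(current_value: str, mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Replace each comma-separated name via the mapping.

    Returns the rewritten value and the names that had no mapping, so that
    they can be reviewed by hand instead of being changed silently.
    """
    remapped, unmapped = [], []
    for raw in current_value.split(","):
        name = " ".join(raw.split())  # trim and collapse whitespace
        if not name:
            continue
        if name in mapping:
            remapped.append(mapping[name])
        else:
            remapped.append(name)  # left untouched
            unmapped.append(name)
    return ", ".join(remapped), unmapped


if __name__ == "__main__":
    print(remap_pi_name("BRECK OWENS, STEVEN JAYNE, P.E. ROBBINS", LEGACY_TO_PREFERRED))
```

Returning the unmapped names separately keeps the update conservative: nothing is rewritten unless the mapping explicitly covers it.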

The next thing to make a decision on is the table metadata e.g.
Table ID: RPI
TITLE: Names of Argo principal investigators
SHORT_NAME: PI_NAME
DEFINITION: Names of principal investigators in charge of Argo floats used in Argo metadata NetCDF files.

I'm wondering if it should be PI specific, or if there's potential for the table to be used for other fields as well like FLOAT_OWNER, in which case we may want to re-think the table metadata to make it more generic. Any thoughts on this?

Thanks,
Roseanna

@RomainCancouet

Hi @roswri ,

OK, thanks for the future check of missing PI names.

Yes, my question regarding the meta files was: are we willing, and are DACs able, to update the content of the meta files once some entries (PI_NAME, etc.) are constrained and new values suggested? I acknowledge that this will require effort, and maybe it could be done only once other metadata fields have been revisited by this task team, e.g. #5, #2, etc.

For your question about FLOAT_OWNER, as it is presently populated with a mixture of people, institution or DAC names, I do not know.
Romain

@roswri

roswri commented Dec 3, 2021

@tcarval @nvs-vocabs/oceanops @RomainCancouet
Hi all,

Who should be granted editing permissions for this vocabulary for PI_NAME? I can add extra people if anyone else decides they want or need editing permissions later on.

Thanks,
Roseanna

@roswri

roswri commented Jan 10, 2022

Hi @tcarval and @nvs-vocabs/oceanops ,

It seems we were able to reach a decision on the specifics of the PI_NAME NVS table at ADMT, which is great.
However, we still need to decide who will oversee this collection (i.e. who will have the ability to request/add new terms). I have tagged Thierry and OceanOps as the best candidates that I am aware of, but please let me know if you object to overseeing the collection, or if there is anyone else that should have permission to edit the collection.

Many thanks,
Roseanna

@tcarval
Contributor

tcarval commented Jan 10, 2022

Hello Roseanna - @roswri , I agree to be editor of this new collection

@tcarval tcarval added admt approval requested This ticket is waiting for ADMT approval and removed avtt Argo Vocabulary Task Team labels Dec 9, 2022
@RomainCancouet

Thanks to BODC, the (constrained) list of PI names is now available in the NVS (https://vocab.nerc.ac.uk/search_nvs/R40/).

It is now a matter of updating the list if entries are missing (people could contact the Vocabs Editors as described on the ADMT webpage), and of using this table with the FileChecker (nvs-vocabs/ArgoVocabs_Meetings#1) to constrain the allowed entries in netCDF meta files.

@apswong

apswong commented Oct 9, 2023

Thanks to all for working on PI_NAME. I would like to add that when a float has multiple PIs, the unique entries in R40 can be concatenated into one character string to fill the PI_NAME variable in the various Argo data files. The current practice is to use commas to separate the multiple unique names. This should be explained clearly in the Users Manual and enforced by the GDAC file checker.
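
A rough sketch of the kind of check this implies, assuming the vocabulary labels have already been loaded into a Python set. It is not the actual file checker logic: `check_pi_name` is a hypothetical function, the 64-character limit is the one quoted earlier in this thread, and the sample vocabulary below is illustrative rather than copied from R40.

```python
MAX_PI_NAME_LENGTH = 64  # length limit for PI_NAME quoted earlier in this thread


def check_pi_name(value: str, vocabulary: set[str]) -> list[str]:
    """Return a list of problems found in a comma-separated PI_NAME value."""
    problems = []
    # Assumption: character variables are blank-padded, so padding is ignored.
    value = value.strip()
    if len(value) > MAX_PI_NAME_LENGTH:
        problems.append(
            f"PI_NAME is {len(value)} characters, limit is {MAX_PI_NAME_LENGTH}"
        )
    seen = set()
    for name in (part.strip() for part in value.split(",")):
        if not name:
            problems.append("empty entry between commas")
        elif name not in vocabulary:
            problems.append(f"{name!r} is not in the vocabulary")
        elif name in seen:
            problems.append(f"{name!r} is listed more than once")
        seen.add(name)
    return problems


if __name__ == "__main__":
    sample_vocabulary = {"Steven JAYNE", "Pelle E Robbins"}  # illustrative only
    print(check_pi_name("Steven JAYNE, Pelle E Robbins", sample_vocabulary))
    print(check_pi_name("Steven JAYNE, JAYNCE", sample_vocabulary))
```

An exact string match is deliberately strict, so case or spacing variants of a valid name are reported rather than silently accepted.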

@tcarval tcarval removed the admt approval requested This ticket is waiting for ADMT approval label Oct 26, 2023
@tcarval
Contributor

tcarval commented Oct 26, 2023

From ADMT-24, there is agreement on having a controlled list of PIs.
There are privacy concerns about exposing a PI_NAME in a public list on the Internet.
Can NVS restrict access to this list? Should the list be managed outside the NVS?

@matdon17
Author

matdon17 commented Oct 26, 2023 via email

@gwemon
Contributor

gwemon commented Oct 26, 2023

@tcarval we cannot hide the content of a collection published on the NVS. Have you considered the use of ORCIDS for the PIs? You could capture people's ORCIDS instead of their name if privacy is required.

@tcarval
Contributor

tcarval commented Oct 26, 2023

Hi Thierry, Aren't these same PI names going to appear in open access data in the same form anyway? If so, the same privacy concerns surely apply to the dataset entries as well as a vocab entry. Matt


@matdon17 , Hello Matt
The remaining concern was "consent of the PI - not mandatory, but good practice - this should be covered by the statements in the submissions form or submissions tool".
A notice will be sent to the AST, and we should then get the green light.

@tcarval
Contributor

tcarval commented Oct 26, 2023

@tcarval we cannot hide the content of a collection published on the NVS. Have you considered the use of ORCIDS for the PIs? You could capture people's ORCIDS instead of their name if privacy is required.

Yes, ORCID was proposed, but did not receive strong support. In a next step, we may manage the link to DOI from R40

@vpaba
Contributor

vpaba commented Oct 27, 2023

@matdon17 has a point, in that once information is in an open access data file, there is nothing stopping any tool from extracting the information, collating it and re-exposing it under a category. For example, OceanOps has a list of people's names under its 'Contacts' filter, and the Euro Argo fleet monitoring tool has a 'PI_NAME' tick box that can be selected in its search results filter so that names are displayed next to the corresponding float numbers.

I thus think that removing or replacing people's names from the NVS R40 collection will only partly address the underlying concern - though it would also be the quickest way to start.

@vpaba
Contributor

vpaba commented May 7, 2024

Issue resolved post ADMT-24. PI_NAME collection is live: https://vocab.nerc.ac.uk/search_nvs/R40/ or https://vocab.nerc.ac.uk/collection/R40/current/

@apswong

apswong commented Oct 11, 2024

@tcarval Now that PI_NAME entries are officially a controlled vocabulary in R40, we should update the "Comment" column in the Users Manual for PI_NAME with this new information, and explain that multiple names can be concatenated, separated by commas, to fill PI_NAME.

@tcarval
Contributor

tcarval commented Oct 11, 2024

@tcarval Now that PI_NAME entries are officially a controlled vocabulary in R40, we should update the "Comment" column in the Users Manual for PI_NAME with this new information, and explain that multiple names can be concatenated, separated by commas, to fill PI_NAME.

I updated the "Comment" column for PI_NAME.
I opened this "FileChecker" ticket to implement the check: OneArgo/ArgoFormatChecker#25


https://docs.google.com/document/d/1vFLDlk4paPPUGUFVgTBn0QadrqVsmWy1fFL6eb2w6gE/edit?tab=t.0#bookmark=id.3fnk8nvardpx

@apswong

apswong commented Oct 11, 2024

@tcarval Thank you, Thierry. We also need to update the PI_NAME comment column in Sections 2.3.3 and 2.4.4.

@tcarval
Contributor

tcarval commented Oct 16, 2024

@tcarval Thank you, Thierry. We also need to update the PI_NAME comment column in Sections 2.3.3 and 2.4.4.

Oops, yes, I have just done it; it is noted in the history section of the Users Manual.
