Metadata requirements #14

Open
arcadiafalcone opened this issue Aug 30, 2024 · 10 comments

Comments

@arcadiafalcone

arcadiafalcone commented Aug 30, 2024

I met with Amy yesterday and we determined that the easy deposit metadata should for the most part follow H2. The main record description should describe the SDR resource, with the SDR as publisher and deposit date as publication date. If a DOI is supplied for the version of record, that version is described as a related resource with the journal publication information (if applicable). The DOI is an attribute of the related resource, not the deposited file; per Amy, the deposited file will not have its own DOI.

User uploads file, does not provide DOI for version of record; metadata is extracted from document
Title
Authors
-first name
-last name
-ORCID
-affiliations - organization, department
-type = person (not editable by user)
-role = author (not editable by user)
Publication date = date of deposit (not editable by user)
Publisher = Stanford Digital Repository (not editable by user)
Form (following H2 mapping, not editable by user except possible exception below)
-H2 type = Text
-H2 subtype = Article (maybe option to add Preprint also - check with Amy)
-MODS resource type - text
-DataCite type - Text
-if Preprint, also include "grey literature" as AAT genre
Abstract
Keywords
Preferred citation (same model as H2)
Purl
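
As a rough illustration of the no-DOI case above, here is a minimal Python sketch that merges extracted fields with the fixed defaults. The function name `build_main_record` and all dictionary keys are hypothetical and are not H2 or Cocina field names; the preferred citation and purl are left out because they depend on the H2 model and the assigned identifier. Treating Preprint as an additional subtype is also an assumption, since that option is still to be confirmed.

```python
from datetime import date


def build_main_record(extracted, deposit_date, preprint=False):
    """Merge extracted metadata with the fixed, non-editable defaults.

    `extracted` is a hypothetical dict produced by the extraction step.
    """
    authors = [
        {
            "first_name": a.get("first_name"),
            "last_name": a.get("last_name"),
            "orcid": a.get("orcid"),
            "affiliations": a.get("affiliations", []),
            "type": "person",  # fixed, not editable by user
            "role": "author",  # fixed, not editable by user
        }
        for a in extracted.get("authors", [])
    ]

    form = {
        "h2_type": "Text",
        "h2_subtypes": ["Article"],
        "mods_resource_type": "text",
        "datacite_type": "Text",
    }
    if preprint:
        # Assumption: Preprint is added alongside Article, not instead of it.
        form["h2_subtypes"].append("Preprint")
        form["aat_genre"] = "grey literature"

    return {
        "title": extracted.get("title"),
        "authors": authors,
        # Publication date is the deposit date, not the journal date.
        "publication_date": deposit_date.isoformat(),
        "publisher": "Stanford Digital Repository",
        "form": form,
        "abstract": extracted.get("abstract"),
        "keywords": extracted.get("keywords", []),
    }
```

Calling `build_main_record({"title": "T"}, date(2024, 8, 30))` would give a record whose publisher and publication date are filled in regardless of what the document itself says.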

User uploads file, provides DOI for version of record
Main record: same as above (may fall back on DOI as metadata source for user-editable fields)
Related resource (metadata derived from DOI), type = has version of record (need to add to cocina type list)
Preferred citation (check with Amy on format) - title, contributors, publisher/journal info, publication date
DOI

To be discussed further
-Whether this is the best way to represent the version of record - if it makes sense to capture structured metadata for related resource instead of citation, or alternatively to just have the title and link to DOI
-Including form space with additional optional fields for user to fill in, such as links to other related resources like datasets (not automatically derived information)
-ETA: look up ROR for organizational affiliation

@justinlittman
Contributor

Looking at some real data, I'm wondering about "department" for affiliations.

Department doesn't map neatly to many of the organizational structures that we see in articles. This is likely to lead to many errors in guessing at the department. (I also suspect that users will be frustrated if they're not able to figure out what is meant by "department".)

  • What is this data used for?
  • Would it be good enough to just extract the organization?

@justinlittman
Contributor

In the above, is the preferred citation for the SDR resource (e.g., it has a link to the purl)? Can the user replace it with their own citation, e.g., for the related resource?

@justinlittman
Contributor

What is the preferred format / style for the related resource citation? APA? What appears in the paper?

@arcadiafalcone
Author

The preferred citation in the main record should be for the SDR resource. I think that it shouldn't be editable, only the one for the related resource/version of record, but Amy and I didn't discuss that explicitly.

Amy, what is the preferred format for the citation?

justinlittman added a commit that referenced this issue Sep 4, 2024
@jcoyne

jcoyne commented Sep 6, 2024

We see all sorts of affiliations in the Open Access dataset. How do we mark things like this:
"Group for project name", "Division of specialty", "Donor Center for Specialty", "Department of broader specialty", "Donor Name Laboratory", "Donor Name School of Something", "Stanford University".

@arcadiafalcone
Author

arcadiafalcone commented Sep 16, 2024

Of these, the metadata to extract from the document (or look up from DOI/Crossref/ROR/etc) if available is:
Title
Author first name, last name, institutional affiliation and ROR, ORCID
Abstract
Keywords

Abstract and keywords should be extracted if provided as part of the document or DOI metadata. They do not need to be generated locally if not available from either source.

If DOI is provided, also construct a preferred citation from the DOI-sourced metadata and include as relatedResource (i.e., link to version of record).

(per meeting with @amyhodge, @vivnwong, and @RochelleLundy)

@jcoyne

jcoyne commented Sep 16, 2024

@arcadiafalcone we often see multiple affiliations per author. Presumably we want to capture them all, right?

Does anyone have examples of ROR and/or ORCIDs in a preprint?

@arcadiafalcone
Author

@jcoyne That is correct. And any duplicates from "rounding up" to the parent institution should be removed so that the institution name appears only once.
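
A minimal Python sketch of the de-duplication step, assuming each affiliation has already been rounded up to its parent institution name (for example via a ROR lookup, which is not shown). The function name `dedupe_affiliations` is hypothetical, and matching on case-folded, whitespace-normalized names is an assumption; matching on ROR IDs would be more reliable where they are available.

```python
def dedupe_affiliations(affiliations):
    """Return institution names with repeats removed, in first-seen order."""
    seen = set()
    unique = []
    for name in affiliations:
        cleaned = " ".join(name.split())  # collapse stray whitespace
        key = cleaned.casefold()          # compare case-insensitively
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

For an author whose lab, department, and school all round up to "Stanford University", the list passed in would contain that name three times and the result would contain it once.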

@arcadiafalcone
Author

arcadiafalcone commented Sep 18, 2024

Clarification on DOI handling after consultation with Amy:

For the purpose of AI tool evaluation: success is extracting a DOI from the PDF that identifies a version of the deposited document.

For the purpose of creating specific metadata for the PDF: success is extracting the title, author first and last names, abstract (if present in PDF/DOI), and keywords (if present in PDF/DOI). Either the PDF itself or DOI may be the source of any or all of this information.

For the purpose of creating a Cocina record: success is integrating specific document information with default values and representing it in the Cocina schema. This includes the extracted DOI being mapped to relatedResource1.identifier1.value. Additional information from DOI lookup may also be used in the relatedResource1 description.

For the purpose of a user interface: the DOI is presented to the user as a related resource with type "is version of." The user may change this type to "is version of record of".

ETA: If a DOI is not extracted from the PDF, no related resource is created.
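
The DOI handling above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the project's implementation: the regular expression is a common heuristic for modern DOIs rather than a full validator, the function names are hypothetical, and the dictionary only approximates the `relatedResource` / `identifier` / `value` path mentioned above without being checked against the Cocina schema.

```python
import re

# Heuristic pattern for modern DOIs: "10.", a 4-9 digit registrant code,
# a slash, then a suffix made of common DOI characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(text):
    """Return the first DOI-like string in the text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing sentence punctuation is usually not part of the DOI.
    return match.group(0).rstrip(".,;")


def related_resources(text):
    """Build the related-resource list; empty when no DOI is found."""
    doi = extract_doi(text)
    if doi is None:
        return []  # no DOI extracted, so no related resource is created
    return [
        {
            # Default type from this thread; the user may change it.
            "type": "is version of",
            "identifier": [{"value": doi, "type": "DOI"}],
        }
    ]
```

Text pulled from the PDF would be passed to `related_resources`, and the returned list attached to the record's description. Metadata from a DOI lookup could be added to the same dictionary later.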

@amyehodge

> The preferred citation in the main record should be for the SDR resource. I think that it shouldn't be editable, only the one for the related resource/version of record, but Amy and I didn't discuss that explicitly.
>
> Amy, what is the preferred format for the citation?

Sorry, I missed this question earlier.

I have no preference for the format of the citation.

The preferred citation is a touch tricky. We currently do allow users to edit this in H2, including for OA articles. Many folks want all citations to go to a single DOI (eventually the version of record) and not to the open access version, so that their citation counts aren't getting diluted. This is not how the system is intended to work, i.e. if someone read the OA version they really should cite that version. It's a balance between what the users want and what we believe to be the "correct" thing to do from a library/schol comms perspective.
