Crossref schema update (draft for public comment)

The feedback period is over. If you do have additional feedback to share, please contact Patricia (pfeeney@crossref.org).

We’ll be gathering feedback through January 15, 2020.

Our metadata input schema was originally created to capture basic bibliographic information and facilitate matching DOIs to citations. Over the past 20 years the bibliographic metadata we collect has deepened, and we’ve expanded our schema to include funding information, license, updates, relations, and other metadata that can be used to find, cite, link and assess scholarly content. Updates have been made with some input from our membership, but going forward we will be requesting public feedback for all new versions.

This document contains proposed changes to Crossref’s metadata input schema, and is our first fully public request for feedback for our metadata input schema. The proposed changes are intended to address some long-standing issues with the schema (weak affiliation support) as well as add support for newer initiatives (CRediT, data citation). These changes don’t address all of the requests we’ve had, and we’re saving some big updates for when we’re able to do them well (Open Access indicators, anyone?). We’re looking forward to your feedback! Please leave feedback, ask questions, and make suggestions in this document, or if you prefer send feedback via email to feedback@crossref.org.

Backwards compatibility

This proposed schema breaks backwards compatibility within the contributors section (roles and affiliations specifically), but otherwise the changes are to collect additional metadata only. I’m presenting this as a single schema update but that may change as we move closer to implementation.

Proposed changes

1. Identifiers in our metadata

What will it look like?

3. Citation markup

Support for data citations

Proposed changes to support data citation:

Proposed @cited_item_type values

What will it look like?

4. Books

5. Provenance

6. Conference IDs

7. New types of dates

8. Additional smallish updates

9. Other updates to consider but will probably have to wait for later

1. Identifiers in our metadata

We are proposing to expand support for external identifiers - read this for details. Note that this change currently impacts only contributors and organizations (publisher names, for example).

2. Contributors

The ‘contributors’ section of our metadata deposit schema currently supports very basic contributor metadata with the intention of capturing basic bibliographic information and matching DOIs to citations. In line with our goals of enabling machine-readable metadata that both describes and connects scholarly content, we are revisiting how we handle contributors overall.

To fully and elegantly support affiliation identifiers and multiple author roles we need to break backwards compatibility. It makes sense to take this opportunity to address other outstanding contributor-related issues as well. This change allows us to align more with JATS contributor recommendations, making it easy for the many JATS-supporting publishers to provide full contributor metadata.

The goals of the next contributors update are to:

Support expanded author roles (CRediT, allow multiple roles)
Expand affiliation support
Add author and organization identifiers beyond ORCID
Expand support for corporate/collaborative/organizational authors (authors who are not people)
Support author name versions
Update name conventions to be less confusing and Western-centric

Current structure

<contributors>

<person_name sequence="first" contributor_role="chair">

<given_name>Minerva</given_name>

<surname>Housecat</surname>

<affiliation>Crossref University</affiliation>

<ORCID authenticated="true">https://orcid.org/0000-0002-4011-3590</ORCID>

</person_name>

<organization sequence="first" contributor_role="author">Crossref</organization>

</contributors>

Person names

We currently include an alt-name section to capture alternate author names. This option was never promoted, isn’t indexed or included in our JSON output, and needs some attention. I propose we replace it with a repeating string-based alternate_name element.

given_name - no change
surname - change to family_name
alternate_name - allow multiple, string
ORCID - unchanged for now as a few services use this, may fold into identifier tags in the future
name-style - replace with alternate_name
add alternate_name and include name-style and xml:lang attribute
require given_name or surname instead of just surname as not all names have surnames
add identifier support

Group authors

A group (or corporate) author is currently collected as a string in the organization element. As we are adding organization identifier support, the proposal is to:

change organization to collab (term aligns with JATS) - ‘organization’ confuses people and we often get affiliation info in this field; we’re making this backwards-incompatible anyway so it’s an opportunity to clean up this vocabulary issue
add collab as a container - (organization had no children)
add collab-name to capture name only
add role hierarchy (as with person_name)
add identifier support

Question: should we add affiliations to collab? We don’t currently collect affiliations for group authors, but presumably there may be some?

Roles

retire @contributor_role and replace with new repeatable role element - this allows multiple roles to be included
add @role-type to capture role type, and expand list of roles to include CRediT-specific roles.
add @vocab to capture vocabulary

Example:

Current roles:

author
editor
chair
reviewer
review-assistant
stats-reviewer
reviewer-external
reader
translator

CRediT Roles to add:

The CRediT taxonomy terms contain spaces and capitalization. For our implementation we’ll lowercase the terms and replace spaces with underscores to make the outputs more machine-friendly and consistent with the values we use elsewhere.

Conceptualization - conceptualization
Data curation - data_curation
Formal analysis - formal_analysis
Funding acquisition - funding_acquisition
Investigation - investigation
Methodology - methodology
Project Administration - project_administration
Resources - resources
Software - software
Supervision - supervision
Validation - validation
Visualization - visualization
Writing - original draft - writing-original_draft
Writing - reviewing & editing - writing-reviewing_and_editing
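The mapping above can be reproduced mechanically. The following is a minimal sketch, not part of the proposed schema: the function name credit_role_value is hypothetical, and the rules (lowercase, "&" becomes "and", " - " becomes a bare hyphen, remaining spaces become underscores) are inferred from the list above rather than stated anywhere as a specification.

```python
import re

def credit_role_value(term: str) -> str:
    """Convert a CRediT display term into the machine-friendly value.

    Hypothetical helper: the rules are inferred from the mapping list in
    this document, not taken from a published specification.
    """
    value = term.strip().lower()
    # "&" is spelled out so the value contains only letters, "-" and "_"
    value = value.replace("&", "and")
    # " - " separates the two halves of the "Writing" terms; keep a bare hyphen
    value = re.sub(r"\s+-\s+", "-", value)
    # any remaining whitespace becomes an underscore
    return re.sub(r"\s+", "_", value)

CREDIT_TERMS = [
    "Conceptualization", "Data curation", "Formal analysis",
    "Funding acquisition", "Investigation", "Methodology",
    "Project Administration", "Resources", "Software", "Supervision",
    "Validation", "Visualization", "Writing - original draft",
    "Writing - reviewing & editing",
]

for term in CREDIT_TERMS:
    print(f"{term} -> {credit_role_value(term)}")
```

Deriving the values from the display terms, rather than maintaining a second hand-typed list, keeps the two from drifting apart if terms are added later.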

Also adding:

other

Final list:

author
editor
chair
reviewer
review-assistant
stats-reviewer
reviewer-external
reader
translator
conceptualization
data_curation
formal_analysis
funding_acquisition
investigation
methodology
other
project_administration
resources
software
supervision
validation
visualization
writing-original_draft
writing-reviewing_and_editing

Note that there is some overlap between the existing Crossref roles and the CRediT roles (‘author’ = ‘Writing - original draft’ and ‘editor’ = ‘Writing - reviewing & editing’), but we’ll keep the existing roles to support members who do not use CRediT.

Affiliations

Affiliations are currently a single repeatable tag, affiliation, intended to contain an organization name and possibly a location. affiliation will be replaced with the affiliations container tag, with the following changes:

add an existing element - institution - to capture info of the affiliated institution. institution has the child elements:

institution_name (required)
institution_acronym (optional)
institution_place (optional)

add optional @country attribute to institution_place to allow sorting of affiliations by country (ISO 3166-1 alpha-2 code)

institution_department (optional)

add support for institution_id (optional)

Example:

<affiliations>

<institution>

<institution_id institution_id_type="ror">https://ror.org/02twcfp32</institution_id>

<institution_id institution_id_type="isni">0000000405062673</institution_id>

<institution_id institution_id_type="wikidata">Q5188229</institution_id>

<institution_name>Crossref</institution_name>

<institution_acronym>CR</institution_acronym>

<institution_place country="us">Lynnfield, MA</institution_place>

<institution_department>Feline Outreach</institution_department>

</institution>

<institution>

<institution_name>University of Somewhere Awesome</institution_name>

<institution_acronym>USA</institution_acronym>

<institution_id institution_id_type="ror">https://ror.org/02twcfxyz</institution_id>

<institution_id institution_id_type="isni">0000000401234567</institution_id>

<institution_id institution_id_type="wikidata">Q11111111</institution_id>

<institution_place country="ca">Winnipeg</institution_place>

<institution_department>Feline Research</institution_department>

</institution>

</affiliations>
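As an illustration of what the @country attribute enables, here is a minimal sketch that sorts a contributor's affiliations by country code. It assumes the affiliations / institution / institution_place nesting described above; the function name institutions_by_country and the trimmed sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed sample using the proposed element names; values are illustrative.
AFFILIATIONS_XML = """
<affiliations>
  <institution>
    <institution_name>Crossref</institution_name>
    <institution_place country="us">Lynnfield, MA</institution_place>
  </institution>
  <institution>
    <institution_name>University of Somewhere Awesome</institution_name>
    <institution_place country="ca">Winnipeg</institution_place>
  </institution>
</affiliations>
"""

def institutions_by_country(affiliations: ET.Element) -> list:
    """Return (country, institution_name) pairs sorted by country code.

    institution_place and @country are optional, so institutions without
    a country get an empty string and sort first.
    """
    rows = []
    for inst in affiliations.findall("institution"):
        place = inst.find("institution_place")
        country = place.get("country", "") if place is not None else ""
        rows.append((country, inst.findtext("institution_name", default="")))
    return sorted(rows)

root = ET.fromstring(AFFILIATIONS_XML.strip())
for country, name in institutions_by_country(root):
    print(country, name)
```

Because the country lives in an attribute rather than in the free-text place name, consumers can sort or filter without parsing strings like "Lynnfield, MA".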

What will it look like?

new or changed tagging in green; tags that break backwards-compatibility in red

<contributors>

<person_name sequence="first">

<given_name>Minerva</given_name>

<family_name>Housecat</family_name>

<affiliations>

<institution>

<institution_id institution_id_type="ror">https://ror.org/02twcfp32</institution_id>

<institution_id institution_id_type="isni">0000000405062673</institution_id>

<institution_id institution_id_type="wikidata">Q5188229</institution_id>

<institution_name>Crossref</institution_name>

<institution_acronym>CR</institution_acronym>

<institution_place country="us">Lynnfield, MA</institution_place>

<institution_department>Feline Outreach</institution_department>

</institution>

<institution>

<institution_name>University of Somewhere Awesome</institution_name>

<institution_acronym>USA</institution_acronym>

<institution_id institution_id_type="ror">https://ror.org/02twcfxyz</institution_id>

<institution_id institution_id_type="isni">0000000401234567</institution_id>

<institution_id institution_id_type="wikidata">Q11111111</institution_id>

<institution_place country="ca">Winnipeg</institution_place>

<institution_department>Feline Research</institution_department>

</institution>

</affiliations>

<contrib_id contrib_id_type="isni">0000000121032683</contrib_id>

<contrib_id contrib_id_type="orcid" authenticated="true">https://orcid.org/0000-0002-4011-3590</contrib_id>

<alternate_name name_style="western" xml:lang="en">Minnie H</alternate_name>

<alternate_name name_style="eastern" xml:lang="ja">ミネルバハウスキャット</alternate_name>

</person_name>

<collab>

<collab_name>Crossref</collab_name>

<contrib_id contrib_id_type="ror">https://ror.org/02twcfp32</contrib_id>

<contrib_id contrib_id_type="isni">0000000405062673</contrib_id>

<contrib_id contrib_id_type="wikidata">Q5188229</contrib_id>

</collab>

</contributors>
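Because this change breaks backwards compatibility, members with existing deposits would need to convert contributor records. The following is a rough sketch of such a conversion for a single person_name, not a definitive migration tool. Several details are assumptions: the function name migrate_person_name is hypothetical, the role element is assumed to carry the old @contributor_role value in a role-type attribute and to sit directly after family_name, and ORCID is written as a contrib_id as in the example above (the list of person name changes says ORCID may stay unchanged for now).

```python
import xml.etree.ElementTree as ET

OLD_XML = """
<person_name sequence="first" contributor_role="chair">
  <given_name>Minerva</given_name>
  <surname>Housecat</surname>
  <affiliation>Crossref University</affiliation>
  <ORCID authenticated="true">https://orcid.org/0000-0002-4011-3590</ORCID>
</person_name>
"""

def migrate_person_name(old: ET.Element) -> ET.Element:
    """Convert a current-schema person_name into the proposed structure."""
    new = ET.Element("person_name")
    if old.get("sequence"):
        new.set("sequence", old.get("sequence"))

    given = old.findtext("given_name")
    if given:
        ET.SubElement(new, "given_name").text = given  # no change
    surname = old.findtext("surname")
    if surname:
        ET.SubElement(new, "family_name").text = surname  # surname -> family_name

    # ASSUMPTION: @contributor_role becomes a repeatable role element with
    # a role-type attribute; exact placement is not settled in the proposal.
    role = old.get("contributor_role")
    if role:
        ET.SubElement(new, "role", {"role-type": role})

    # Each free-text affiliation becomes an institution with only a name.
    old_affiliations = old.findall("affiliation")
    if old_affiliations:
        container = ET.SubElement(new, "affiliations")
        for aff in old_affiliations:
            inst = ET.SubElement(container, "institution")
            ET.SubElement(inst, "institution_name").text = aff.text

    # ASSUMPTION: ORCID is carried over as contrib_id, as in the example.
    orcid = old.find("ORCID")
    if orcid is not None:
        attrs = {"contrib_id_type": "orcid"}
        if orcid.get("authenticated"):
            attrs["authenticated"] = orcid.get("authenticated")
        ET.SubElement(new, "contrib_id", attrs).text = orcid.text
    return new

converted = migrate_person_name(ET.fromstring(OLD_XML.strip()))
print(ET.tostring(converted, encoding="unicode"))
```

Note that a conversion like this cannot recover information the old structure never held: a string such as "Crossref University" maps only to institution_name, and identifiers, place, and department would have to be added separately.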

3. Citation markup

Individual citations can be submitted as a string of text (an unstructured_citation) or as a set of tags containing basic citation metadata. These tags support journals and books fairly well, but do not support other types of content, particularly data citations. We’ll be expanding our support for tagged metadata in citations. This will include support for identifying a publication type for each citation. This means a data citation will clearly be a data citation, and a journal article will clearly be a journal article.

Support for data citations

I suggest we loosely follow the jats4r recommendations for capturing data citations (https://jats4r.org/data-citations). Our citation markup differs enough that to do so exactly would require a greater overhaul, but we can adapt some of the concepts if not the exact tags.

Current elements:

issn
isbn
journal_title
author
volume
issue
first_page
cYear
e_location_id
series_title
volume_title
edition_number
component_number
article_title
std_designator
standards_body
unstructured_citation
doi

Of the above, only author, doi, and cYear neatly apply to data citations. This is not enough to identify data citations as such without a DataCite DOI. It’s also a challenge to clearly identify how to mark up citations of software, videos and other media, blogs, and other content.

Proposed changes to support data citation:

Add @cited_item_type to the citation tag to flag the type of citation. cited_item_type values will come from a controlled list; proposed types are listed below.

example: <citation cited_item_type="dataset">

Add item_title element to capture the title of a piece of content (such as a dataset)
Source: Jats4R recommends ‘source’ to capture data repository. We don’t have a similar concept, recommend institution to match input
add identifier and @identifier-type to collect non-DOI identifier (similar to e_location_id), will have a controlled list of types (TBD)
add version to capture version info

The set of elements / attributes appropriate for data citations will be:

author - contributor associated with dataset
cYear - year dataset was deposited in repository
item_title* - title of dataset
institution - name of data repository
identifier* - identifier other than DOI
version* - version number of the dataset
unstructured_citation
DOI

* new elements

Note that all elements for citations are optional.

Proposed @cited_item_type values

journal
journal_article
book
book_chapter
conference_proceeding
conference_paper
standard
dataset
software
report
dissertation
website
preprint
other
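To show how a controlled list like this might be enforced by a member's own tooling before deposit, here is a minimal sketch. The function name build_citation is hypothetical, and the set of values simply copies the proposed list above, which may change before implementation.

```python
import xml.etree.ElementTree as ET

# Copied from the proposed list in this document; subject to change.
CITED_ITEM_TYPES = {
    "journal", "journal_article", "book", "book_chapter",
    "conference_proceeding", "conference_paper", "standard", "dataset",
    "software", "report", "dissertation", "website", "preprint", "other",
}

def build_citation(cited_item_type: str, fields: dict) -> ET.Element:
    """Build a citation element, rejecting types outside the proposed list.

    `fields` maps child element names (author, item_title, doi, ...) to
    text. All citation elements are optional, so no field is required.
    """
    if cited_item_type not in CITED_ITEM_TYPES:
        raise ValueError(f"unknown cited_item_type: {cited_item_type!r}")
    citation = ET.Element("citation", {"cited_item_type": cited_item_type})
    for tag, text in fields.items():
        ET.SubElement(citation, tag).text = text
    return citation

citation = build_citation("dataset", {
    "author": "Morinha F",
    "institution": "Dryad Digital Repository",
})
print(ET.tostring(citation, encoding="unicode"))
```

Checking the value before building the element catches near-misses early, which matters because downstream consumers would rely on the type to tell a data citation from a journal article.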

What will it look like?

new or changed tagging in green; tags that break backwards-compatibility in red

a data citation:

<citation cited_item_type="dataset">

<author>Morinha F</author>

<item_title>Extreme genetic structure in a social bird species despite high dispersal capacity</item_title>

<institution>Dryad Digital Repository</institution>

<identifier type="uri">http://www.example.org/boohooidonthaveadoi</identifier>

<doi>10.3201/nowihaveadoi</doi>

</citation>

a journal citation:

<citation cited_item_type="journal_article">

<journal_title>Information Technology and Libraries</journal_title>

<first_page>104</first_page>

<article_title>Metadata creation practices in digital repositories and collections: Schemata, selection criteria, and interoperability</article_title>

</citation>

4. Books

We support a number of book types that can be applied at the title and chapter (etc.) level. The types need some refining - while we don’t want to be overly granular, we want to capture

Current book types:

edited-book
monograph
reference
other

Book content-item types (typically chapters):

chapter
section
part
track
reference-entry
other

A more complete revamping of books metadata is forthcoming. For this update I plan to eliminate the ‘book-track’ option and provide best practices for the other types.

5. Provenance

We currently do not collect publisher metadata for anything other than books but there is a need to distinguish the publisher of a registered item from the prefix and member info. We can do this by adding the existing publisher element to all content types and adding identifier support and country code. We can also make publisher_name repeatable and add an @xml:lang attribute.

Example:

<publisher>

<publisher_name xml:lang="en">Pumpkin Spice League of Hatred</publisher_name>

<publisher_name xml:lang="ja">パンプキンスパイスリーグオブ憎しみ</publisher_name>

<publisher_place country="us">Burlington, VT</publisher_place>

</publisher>
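A minimal sketch of generating such a publisher block follows, mainly to show how the xml:lang attribute is written with Python's standard library. The function name build_publisher is hypothetical, and the structure assumes the repeatable publisher_name and the @country attribute proposed above.

```python
import xml.etree.ElementTree as ET

# ElementTree addresses xml:lang through the reserved XML namespace URI
# and serializes it back with the "xml:" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_publisher(names: dict, place: str, country: str) -> ET.Element:
    """Build a publisher element with one publisher_name per language.

    `names` maps a language code to the publisher name in that language.
    """
    publisher = ET.Element("publisher")
    for lang, name in names.items():
        ET.SubElement(publisher, "publisher_name", {XML_LANG: lang}).text = name
    ET.SubElement(publisher, "publisher_place", {"country": country}).text = place
    return publisher

publisher = build_publisher(
    {"en": "Pumpkin Spice League of Hatred"},
    place="Burlington, VT",
    country="us",
)
print(ET.tostring(publisher, encoding="unicode"))
```

Keying the names by language code makes it straightforward for a consumer to pick the name matching a reader's locale and fall back to another when it is absent.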

6. Conference IDs

Ideally I’d like this update to contain the Conference ID updates as well. This will depend on timing, as we don’t want one change to hold up the other.

7. New types of dates

We have had requests to expand the dates we collect and would like feedback on adding the following types of dates:

citation date - to capture a member’s preferred citation date (this can vary greatly between print and online across publishers)
submitted - per member and internal requests
copyright - to capture the specific copyright date as it may differ from a publication date

8. Additional smallish updates

add e-location_id as a metadata element - we currently accept article IDs / e-location IDs as a subset of publisher_item, but they’re common enough to rate their own tag. This may make inclusion more consistent and common.
add citations to peer reviews - because reviews cite things
Expand list of archive locations to include (per gitlab issue):

NAA (National Archives of Australia)
US_NARA (US National Archives and Records Administration)
UK_NA (National Archives UK)
ADS (Archaeology Data Service)
LOC (Library of Congress)
SP (Scholars Portal)
HathiTrust
PKP_PN (PKP Preservation Network)
BL (British Library)
Cariniana (Cariniana Network)
NDPP_China (National Digital Preservation Program, China)
SNL (Swiss National Library) (removing per SNL request)

9. Other updates to consider but will probably have to wait for later

We want to limit the scope of this update to allow us to roll out some much-needed changes, but are aware that more metadata is always needed. Things that we want to change or add include:

changes to how we handle titles - repeatable groups (title, subtitle) with ‘lang’ attribute, no ‘translated title’ element, do we need a ‘translation’ option as an attribute
new content types

capturing version info 😱
open access indicators 😱 - we get a lot of requests for open access information beyond the license info we currently provide. To support this well, we need an OA taxonomy that is widely adopted.
refresh of preprint metadata
contributors: allow numbering of author sequence, we currently just support ‘first’ and ‘additional’
ISSN - we flag ISSN as ‘print’ and ‘electronic’, should support ISSN-L as well
relations

need to add new relations to support grant and conf. IDs,
need to reconcile current list of relations with relations used by Event Data
we currently do not version the relations schema but should

consider remodeling funding data to (a) align more closely with JATS and (b) collect more funding-related metadata