Crossref schema update (draft for public comment)

The feedback period is over. If you do have additional feedback to share, please contact Patricia (pfeeney@crossref.org).

We’ll be gathering feedback through January 15, 2020.

Our metadata input schema was originally created to capture basic bibliographic information and facilitate matching DOIs to citations. Over the past 20 years the bibliographic metadata we collect has deepened, and we’ve expanded our schema to include funding information, license, updates, relations, and other metadata that can be used to find, cite, link and assess scholarly content. Updates have been made with some input from our membership, but going forward we will be requesting public feedback for all new versions.

This document contains proposed changes to Crossref’s metadata input schema, and is our first fully public request for feedback on it. The proposed changes are intended to address some long-standing issues with the schema (weak affiliation support) as well as add support for newer initiatives (CRediT, data citation). These changes don’t address all of the requests we’ve had, and we’re saving some big updates for when we’re able to do them well (Open Access indicators, anyone?). We’re looking forward to your feedback! Please leave feedback, ask questions, and make suggestions in this document or, if you prefer, send feedback via email to feedback@crossref.org.

Backwards compatibility

This proposed schema breaks backwards compatibility within the contributors section (roles and affiliations specifically), but otherwise the changes are to collect additional metadata only. I’m presenting this as a single schema update but that may change as we move closer to implementation.

Proposed changes

1. Identifiers in our metadata

2. Contributors

Current structure

Person names

Group authors

Roles

Affiliations

What will it look like?

3. Citation markup

Support for data citations

Proposed changes to support data citation:

Proposed @cited_item_type values

What will it look like?

4. Books

5. Provenance

6. Conference IDs

7. New types of dates

8. Additional smallish updates

9. Other updates to consider but will probably have to wait for later

1. Identifiers in our metadata

We are proposing to expand support for external identifiers - read this for details. Note that this change currently impacts only contributors and organizations (publisher names, for example).

2. Contributors

The ‘contributors’ section of our metadata deposit schema currently supports very basic contributor metadata with the intention of capturing basic bibliographic information and matching DOIs to citations.  In line with our goals of enabling machine-readable metadata that both describes and connects scholarly content, we are revisiting how we handle contributors overall.

To fully and elegantly support affiliation identifiers and multiple author roles we need to break backwards compatibility. It makes sense to take this opportunity to address other outstanding contributor-related issues as well.  This change allows us to align more with JATS contributor recommendations, making it easy for the many JATS-supporting publishers to provide full contributor metadata.

The goals of the next contributors update are to:

  • Support expanded author roles (CRediT, allow multiple roles)
  • Expand affiliation support
  • Add author and organization identifiers beyond ORCID
  • Expand support for corporate/collaborative/organizational authors (authors who are not people)
  • Support author name versions
  • Update name conventions to be less confusing and less Western-centric

Current structure

<contributors>
  <person_name sequence="first" contributor_role="chair">
    <given_name>Minerva</given_name>
    <surname>Housecat</surname>
    <affiliation>Crossref University</affiliation>
    <ORCID authenticated="true">https://orcid.org/0000-0002-4011-3590</ORCID>
  </person_name>
  <organization sequence="first" contributor_role="author">Crossref</organization>
</contributors>

Person names

We currently include an alt-name section to capture alternate author names. This option was never promoted, isn’t indexed or included in our JSON output, and needs some attention. I propose we replace this with a repeating string-based alternate_name element.

  • given_name - no change
  • surname - change to family_name
  • alternate_name - allow multiple, string
  • ORCID - unchanged for now as a few services use this, may fold into identifier tags in the future
  • name-style - replace with alternate_name
  • add alternate_name and include name-style and xml:lang attribute
  • require either given_name or family_name (currently surname) instead of just the surname, as not all names have surnames
  • add identifier support 

Group authors

A group (or corporate) author is currently collected as a string in the organization element. As we are adding organization identifier support, the proposal is to:

  • change organization to collab (term aligns with JATS) - ‘organization’ confuses people and we often get affiliation info in this field; we’re making this backwards-incompatible anyway so it’s an opportunity to clean up this vocabulary issue
  • add collab as a container -  (organization had no children)
  • add collab_name to capture name only
  • add role hierarchy (as with person_name)
  • add identifier support 

Question: should we add affiliations to collab? We don’t currently collect affiliations for group authors, but presumably there may be some?

Roles

  • retire @contributor_role and replace with new repeatable role element - this allows multiple roles to be included
  • add @role-type to capture role type, and expand list of roles to include CRediT-specific roles.  
  • add @vocab to capture vocabulary

Example:

<role role-type="conceptualization"/>

Current roles:

  • author
  • editor
  • chair
  • reviewer
  • review-assistant
  • stats-reviewer
  • reviewer-external
  • reader
  • translator

CRediT Roles to add:

The CRediT taxonomy terms contain spaces and capitalization. For our implementation we’ll lowercase the terms and replace spaces with underscores (‘ - ’ becomes a hyphen and ‘&’ becomes ‘and’) to make the outputs more machine-friendly and consistent with the values we use elsewhere.

  • Conceptualization -  conceptualization
  • Data curation - data_curation
  • Formal analysis - formal_analysis
  • Funding acquisition - funding_acquisition
  • Investigation - investigation
  • Methodology - methodology
  • Project Administration - project_administration
  • Resources - resources
  • Software - software
  • Supervision - supervision
  • Validation - validation
  • Visualization - visualization
  • Writing - original draft - writing-original_draft
  • Writing - reviewing & editing - writing-reviewing_and_editing
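The mapping above is regular enough to express as a small normalization routine. The following is a minimal sketch rather than Crossref's implementation: the function name credit_role_value is hypothetical, and the rules are inferred from the values listed above.

```python
import re


def credit_role_value(term: str) -> str:
    """Convert a CRediT display term to a proposed role-type value.

    Hypothetical helper; the rules are inferred from the list in this document.
    """
    value = term.strip().lower()
    value = value.replace("&", "and")
    # A spaced hyphen (or en dash) between the two parts of a term becomes a bare hyphen.
    value = re.sub(r"\s*[-\u2013]\s*", "-", value)
    # Any remaining run of whitespace becomes a single underscore.
    return re.sub(r"\s+", "_", value)


for term in ("Project Administration", "Writing - reviewing & editing"):
    print(term, "->", credit_role_value(term))
```

Deriving the values this way, instead of typing them by hand, would keep the list consistent if CRediT adds terms later.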

Also adding:

  • other

Final list:

  • author
  • editor
  • chair
  • reviewer
  • review-assistant
  • stats-reviewer
  • reviewer-external
  • reader
  • translator
  • conceptualization
  • data_curation
  • formal_analysis
  • funding_acquisition
  • investigation
  • methodology
  • other
  • project_administration
  • resources
  • software
  • supervision
  • validation
  • visualization
  • writing-original_draft
  • writing-reviewing_and_editing

Note that there is some overlap between the existing Crossref roles and the CRediT roles (‘author’ = ‘Writing - original draft’ and ‘editor’ = ‘Writing - reviewing & editing’), but we’ll keep the existing roles to support members who do not use CRediT.

Affiliations

Affiliations are currently just a single repeatable tag, affiliation, intended to contain an organization name and possibly a location. affiliation will be replaced with the affiliations container tag:

  • add an existing element - institution - to capture information about the affiliated institution. institution has the child elements:
      • institution_name (required)
      • institution_acronym (optional)
      • institution_place (optional)
          • add optional @country attribute to institution_place to allow sorting of affiliations by country (ISO 3166-1 alpha-2 code)
      • institution_department (optional)
  • add support for institution_id (optional)

Example:

<affiliations>
  <institution>
    <institution_id institution_id_type="ror">https://ror.org/02twcfp32</institution_id>
    <institution_id institution_id_type="isni">0000000405062673</institution_id>
    <institution_id institution_id_type="wikidata">Q5188229</institution_id>
    <institution_name>Crossref</institution_name>
    <institution_acronym>CR</institution_acronym>
    <institution_place country="us">Lynnfield, MA</institution_place>
    <institution_department>Feline Outreach</institution_department>
  </institution>
  <institution>
    <institution_name>University of Somewhere Awesome</institution_name>
    <institution_acronym>USA</institution_acronym>
    <institution_id institution_id_type="ror">https://ror.org/02twcfxyz</institution_id>
    <institution_id institution_id_type="isni">0000000401234567</institution_id>
    <institution_id institution_id_type="wikidata">Q11111111</institution_id>
    <institution_place country="ca">Winnipeg</institution_place>
    <institution_department>Feline Research</institution_department>
  </institution>
</affiliations>
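To make the required/optional rules above concrete, here is a minimal sketch of a check a depositor might run before submitting. It assumes the proposed element names exactly as written in this draft. The function name check_affiliations is hypothetical, and the country test only checks the shape of the code (two letters), not whether the code is actually assigned in ISO 3166-1.

```python
import re
import xml.etree.ElementTree as ET

# Shape check only: two ASCII letters (assumption; not a lookup against ISO 3166-1).
COUNTRY_CODE = re.compile(r"[A-Za-z]{2}")


def check_affiliations(affiliations):
    """Return a list of human-readable problems found in an <affiliations> element."""
    problems = []
    for index, institution in enumerate(affiliations.findall("institution"), start=1):
        # institution_name is the only required child in the proposal.
        name = institution.findtext("institution_name", default="")
        if not name.strip():
            problems.append(f"institution {index}: institution_name is required")
        place = institution.find("institution_place")
        if place is not None:
            country = place.get("country")  # optional attribute
            if country is not None and not COUNTRY_CODE.fullmatch(country):
                problems.append(
                    f"institution {index}: country {country!r} is not a two-letter code"
                )
    return problems


sample = ET.fromstring(
    "<affiliations><institution>"
    "<institution_name>Crossref</institution_name>"
    '<institution_place country="us">Lynnfield, MA</institution_place>'
    "</institution></affiliations>"
)
print(check_affiliations(sample))
```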

What will it look like?

new or changed tagging in green; tags that break backwards-compatibility in red

<contributors>
  <person_name sequence="first">
    <given_name>Minerva</given_name>
    <family_name>Housecat</family_name>
    <role role-type="conceptualization"/>
    <role role-type="author"/>
    <affiliations>
      <institution>
        <institution_id institution_id_type="ror">https://ror.org/02twcfp32</institution_id>
        <institution_id institution_id_type="isni">0000000405062673</institution_id>
        <institution_id institution_id_type="wikidata">Q5188229</institution_id>
        <institution_name>Crossref</institution_name>
        <institution_acronym>CR</institution_acronym>
        <institution_place country="us">Lynnfield, MA</institution_place>
        <institution_department>Feline Outreach</institution_department>
      </institution>
      <institution>
        <institution_name>University of Somewhere Awesome</institution_name>
        <institution_acronym>USA</institution_acronym>
        <institution_id institution_id_type="ror">https://ror.org/02twcfxyz</institution_id>
        <institution_id institution_id_type="isni">0000000401234567</institution_id>
        <institution_id institution_id_type="wikidata">Q11111111</institution_id>
        <institution_place country="ca">Winnipeg</institution_place>
        <institution_department>Feline Research</institution_department>
      </institution>
    </affiliations>
    <contrib_id contrib_id_type="isni">0000000121032683</contrib_id>
    <contrib_id contrib_id_type="orcid" authenticated="true">https://orcid.org/0000-0002-4011-3590</contrib_id>
    <alternate_name name_style="western" xml:lang="en">Minnie H</alternate_name>
    <alternate_name name_style="eastern" xml:lang="ja">ミネルバハウスキャット</alternate_name>
  </person_name>
  <collab sequence="additional">
    <collab_name>Crossref</collab_name>
    <role role-type="data_curation"/>
    <contrib_id contrib_id_type="ror">https://ror.org/02twcfp32</contrib_id>
    <contrib_id contrib_id_type="isni">0000000405062673</contrib_id>
    <contrib_id contrib_id_type="wikidata">Q5188229</contrib_id>
  </collab>
</contributors>
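Since the contributor changes break backwards compatibility, members with existing XML will need some kind of conversion step. The following is a rough sketch of what that could look like for a single person_name, using only the renames described in this section (surname to family_name, @contributor_role to a role element, affiliation to affiliations/institution/institution_name). The function name migrate_person is hypothetical. ORCID is carried over unchanged, which is one reading of the "unchanged for now" note above, and nothing here is an official migration tool.

```python
import xml.etree.ElementTree as ET


def migrate_person(old):
    """Build a proposed-style person_name from a current-style one (illustrative only)."""
    new = ET.Element("person_name")
    if old.get("sequence") is not None:
        new.set("sequence", old.get("sequence"))

    given = old.findtext("given_name")
    if given is not None:
        ET.SubElement(new, "given_name").text = given

    # surname is renamed to family_name in the proposal.
    surname = old.findtext("surname")
    if surname is not None:
        ET.SubElement(new, "family_name").text = surname

    # The single @contributor_role attribute becomes a repeatable role element.
    role = old.get("contributor_role")
    if role is not None:
        ET.SubElement(new, "role", {"role-type": role})

    # Each free-text affiliation becomes an institution with only a name.
    old_affiliations = old.findall("affiliation")
    if old_affiliations:
        container = ET.SubElement(new, "affiliations")
        for affiliation in old_affiliations:
            institution = ET.SubElement(container, "institution")
            ET.SubElement(institution, "institution_name").text = affiliation.text

    # Assumption: ORCID is kept as-is rather than converted to contrib_id.
    orcid = old.find("ORCID")
    if orcid is not None:
        ET.SubElement(new, "ORCID", dict(orcid.attrib)).text = orcid.text
    return new


current = ET.fromstring(
    '<person_name sequence="first" contributor_role="chair">'
    "<given_name>Minerva</given_name><surname>Housecat</surname>"
    "<affiliation>Crossref University</affiliation>"
    "</person_name>"
)
print(ET.tostring(migrate_person(current), encoding="unicode"))
```

A free-text affiliation cannot be split reliably into name, place, and department, so the sketch puts the whole string in institution_name and leaves identifiers for the member to add.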

3. Citation markup

Individual citations can be submitted as a string of text (as an unstructured_citation) or as a set of tags containing basic citation metadata. These tags support journals and books fairly well, but do not support other types of content, particularly data citations. We’ll be expanding our support for tagged metadata in citations. This will include support for identifying a publication type for each citation. This means a data citation will clearly be a data citation, and a journal article will clearly be a journal article.

Support for data citations

I suggest we loosely follow the jats4r recommendations for capturing data citations (https://jats4r.org/data-citations). Our citation markup differs enough that to do so exactly would require a greater overhaul, but we can adapt some of the concepts if not the exact tags.

Current elements: 

  • issn
  • isbn
  • journal_title
  • author
  • volume
  • issue
  • first_page
  • cYear
  • e_location_id
  • series_title
  • volume_title
  • edition_number
  • component_number
  • article_title
  • std_designator
  • standards_body
  • unstructured_citation
  • doi

Of the above, only author, doi, and cYear neatly apply to data citations. This is not enough to identify data citations as such without a DataCite DOI. It’s also a challenge to clearly identify how to mark up citations of software, videos and other media, blogs, and other content.

Proposed changes to support data citation:

  • Add @cited_item_type to the citation tag to flag the type of citation. cited_item_type values will come from a controlled list; proposed types are listed below
      • example: <citation cited_item_type="dataset">
  • Add item_title element to capture the title of a piece of content (such as a dataset)
  • Source: JATS4R recommends ‘source’ to capture the data repository. We don’t have a similar concept, so I recommend using institution to match our input schema
  • add identifier and @identifier-type to collect a non-DOI identifier (similar to e_location_id); this will have a controlled list of types (TBD)
  • add version to capture version info

The set of elements / attributes appropriate for data citations will be:

  • author - contributor associated with dataset
  • cYear  -  year dataset was deposited in repository
  • item_title* - title of dataset
  • institution - name of data repository
  • identifier* - identifier other than DOI
  • version* - version number of the dataset
  • unstructured_citation 
  • doi

* new elements

Note that all elements for citations are optional.

Proposed @cited_item_type values

  • journal
  • journal_article
  • book
  • book_chapter
  • conference_proceeding
  • conference_paper
  • standard
  • dataset
  • software
  • report
  • dissertation
  • website
  • preprint
  • other
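Because the values above form a closed list, a depositor could check citations against it before submitting. The following is a minimal sketch under the assumptions of this draft: the attribute name and values are taken from the list above, the function name unknown_cited_item_types is hypothetical, and a citation with no cited_item_type at all is treated as acceptable (the draft does not say whether the attribute is required).

```python
import xml.etree.ElementTree as ET

# Proposed values, copied from the list in this draft.
CITED_ITEM_TYPES = frozenset({
    "journal", "journal_article", "book", "book_chapter",
    "conference_proceeding", "conference_paper", "standard", "dataset",
    "software", "report", "dissertation", "website", "preprint", "other",
})


def unknown_cited_item_types(root):
    """Return (key, value) pairs for citations whose cited_item_type is not in the list."""
    unknown = []
    for citation in root.iter("citation"):
        value = citation.get("cited_item_type")
        # Assumption: a missing attribute is allowed, so only unknown values are reported.
        if value is not None and value not in CITED_ITEM_TYPES:
            unknown.append((citation.get("key"), value))
    return unknown


citations = ET.fromstring(
    "<citation_list>"
    '<citation cited_item_type="dataset" key="ref1"/>'
    '<citation cited_item_type="blog" key="ref2"/>'
    '<citation key="ref3"/>'
    "</citation_list>"
)
print(unknown_cited_item_types(citations))
```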

What will it look like?

new or changed tagging in green; tags that break backwards-compatibility in red

a data citation:

<citation cited_item_type="dataset" key="ref4">
  <author>Morinha F</author>
  <cYear>2017</cYear>
  <item_title>Extreme genetic structure in a social bird species despite high dispersal capacity</item_title>
  <institution>Dryad Digital Repository</institution>
  <identifier type="uri">http://www.example.org/boohooidonthaveadoi</identifier>
  <doi>10.3201/nowihaveadoi</doi>
</citation>

a journal citation:

<citation cited_item_type="journal_article" key="ref5">
  <journal_title>Information Technology and Libraries</journal_title>
  <author>Park</author>
  <volume>29</volume>
  <issue>3</issue>
  <first_page>104</first_page>
  <cYear>2010</cYear>
  <doi>10.6017/ital.v29i3.3136</doi>
  <article_title>Metadata creation practices in digital repositories and collections: Schemata, selection criteria, and interoperability</article_title>
</citation>

4. Books

We support a number of book types that can be applied at the title and chapter (etc.) level. The types need some refining - while we don’t want to be overly granular, we want to capture

Current book types:

  • edited-book
  • monograph
  • reference
  • other

Book content-item types (typically chapters):

  • chapter
  • section
  • part
  • track
  • reference-entry
  • other

A more complete revamping of books metadata is forthcoming. For this update I plan to eliminate the ‘book-track’ option and provide best practices for the other types.

5. Provenance

We currently do not collect publisher metadata for anything other than books but there is a need to distinguish the publisher of a registered item from the prefix and member info. We can do this by adding the existing publisher element to all content types and adding identifier support and country code.  We can also make publisher_name repeatable and add an @xml:lang attribute.

Example:

<publisher>
  <publisher_name xml:lang="en">Pumpkin Spice League of Hatred</publisher_name>
  <publisher_name xml:lang="ja">パンプキンスパイスリーグオブ憎しみ</publisher_name>
  <publisher_place country="us">Burlington, VT</publisher_place>
</publisher>
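For anyone consuming this metadata, a repeatable publisher_name with @xml:lang means a name can be picked by language. Here is a minimal sketch of reading it with Python's standard library, assuming the structure proposed above. The function name publisher_names_by_lang is hypothetical. Note that ElementTree exposes xml:lang under the reserved XML namespace, so the attribute is not looked up as the literal string "xml:lang".

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in "xml" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def publisher_names_by_lang(publisher):
    """Map each xml:lang value to the publisher_name text (None key if no language is given)."""
    return {
        name.get(XML_LANG): name.text
        for name in publisher.findall("publisher_name")
    }


publisher = ET.fromstring(
    "<publisher>"
    '<publisher_name xml:lang="en">Pumpkin Spice League of Hatred</publisher_name>'
    '<publisher_place country="us">Burlington, VT</publisher_place>'
    "</publisher>"
)
names = publisher_names_by_lang(publisher)
print(names.get("en"))
```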

6. Conference IDs

Ideally I’d like this update to contain the Conference ID updates as well, but this will depend on timing as we don’t want one change to hold up the other.

7. New types of dates

We have had requests to expand the dates we collect and would like feedback on adding the following types of dates:

  • citation date - to capture a member’s preferred citation date (this can vary greatly between print and online across publishers)
  • submitted - per member and internal requests
  • copyright - to capture the specific copyright date as it may differ from a publication date

8. Additional smallish updates

  • add e_location_id as a metadata element - we currently accept article IDs / e-location IDs as a subset of publisher_item, but they’re common enough to rate their own tag, and a dedicated tag may make inclusion more consistent and common.
  • add citations to peer reviews - because reviews cite things
  • Expand list of archive locations to include (per GitLab issue):
      • NAA (National Archives of Australia)
      • US_NARA (US National Archives and Records Administration)
      • UK_NA (National Archives UK)
      • ADS (Archaeology Data Service)
      • LOC (Library of Congress)
      • SP (Scholars Portal)
      • HathiTrust
      • PKP_PN (PKP Preservation Network)
      • BL (British Library)
      • Cariniana (Cariniana Network)
      • NDPP_China (National Digital Preservation Program, China)
      • SNL (Swiss National Library) (removing per SNL request)

9. Other updates to consider but will probably have to wait for later

We want to limit the scope of this update to allow us to roll out some much-needed changes, but are aware that more metadata is always needed.  Things that we want to change or add include:

  • changes to how we handle titles - repeatable groups (title, subtitle) with a ‘lang’ attribute and no ‘translated title’ element; do we need a ‘translation’ option as an attribute?
  • new content types
  • capturing version info 😱
  • open access indicators 😱 - we get a lot of requests for open access information beyond the license info we currently provide. To support this well, we need an OA taxonomy that is widely adopted.
  • refresh of preprint metadata
  • contributors: allow numbering of author sequence; we currently support only ‘first’ and ‘additional’
  • ISSN - we flag ISSN as ‘print’ and ‘electronic’; we should support ISSN-L as well
  • relations
      • need to add new relations to support grant and conference IDs
      • need to reconcile current list of relations with relations used by Event Data
      • we currently do not version the relations schema but should
  • consider remodeling funding data to (a) align more closely with JATS and (b) collect more funding-related metadata