Crossref as a source of open bibliographic metadata for preprints
In most scientific disciplines, the standard way to share new research findings is to publish a research article in a peer-reviewed journal. Increasingly, however, the standard approach to scientific publishing is complemented by alternative approaches, the most significant one being the publication of non-peer-reviewed research articles on preprint servers. As shown in recent work by one of us, preprinting is especially popular in physics and mathematics, but is also gaining traction in other disciplines, such as computer science, geoscience, chemistry, biology, and psychology. Some biomedical research funders have even started to mandate preprinting. Moreover, preprinting is increasingly seen not just as a complement to traditional approaches to scientific publishing, but as a key building block for innovative new approaches to scientific publishing, such as the publish-review-curate model.
Most bibliographic databases still focus primarily on indexing journal articles, and the availability of bibliographic metadata for preprints in these databases tends to be limited. Crossref offers a powerful infrastructure to improve the availability of preprint metadata. Like many journal publishers, many preprint servers work with Crossref to register a Digital Object Identifier (DOI) for each preprint. As part of DOI registration, preprint servers also deposit bibliographic metadata with Crossref. Depositing the title and the publication date of a preprint is mandatory, while depositing other metadata elements, such as the reference list, the abstract, the authors, and the author affiliations, is optional. Bibliographic metadata deposited with Crossref is made openly available, in line with the growing recognition of the importance of openness of research information, promoted for instance through the Barcelona Declaration on Open Research Information. Bibliographic databases and other tools can ingest preprint metadata from Crossref to make it available to their users.
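As a rough illustration of how a tool might request preprint metadata from Crossref, the sketch below builds a query URL for the public Crossref REST API. It assumes the `/works` endpoint and its `type`, `from-pub-date`, `until-pub-date`, and `has-abstract` filters. The helper name `build_preprint_query` is hypothetical, and the query selects all posted content, so the results would still need to be narrowed to the preprint subtype.

```python
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_preprint_query(from_date, until_date, extra_filters=(), rows=0):
    """Build a Crossref REST API URL for posted content in a date range.

    rows=0 asks for summary information (such as the total number of
    matching records) without returning the records themselves.
    """
    filters = [
        "type:posted-content",
        f"from-pub-date:{from_date}",
        f"until-pub-date:{until_date}",
        *extra_filters,
    ]
    params = {"filter": ",".join(filters), "rows": rows}
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


# Posted content from 2023 and 2024 that has an abstract in its metadata.
url = build_preprint_query("2023-01-01", "2024-12-31", ["has-abstract:true"])
print(url)
```

This only constructs the URL. Sending the request, paging through results, and respecting Crossref's usage guidelines are left out.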
Completeness of Crossref metadata for preprints
The value of preprint metadata made available through the Crossref infrastructure depends on the completeness of the metadata. Previously, we reported detailed statistics on the completeness of Crossref metadata for journal articles, showing how the completeness of this metadata has improved over time, even though many journal publishers still do not submit complete metadata to Crossref. Below we report similar statistics for preprint servers, focusing on six metadata elements: reference lists, abstracts, ORCIDs, author affiliations, funding information, and license information.
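To make the six metadata elements concrete, here is a minimal sketch of a presence check for a single record. It assumes records are dictionaries shaped like the JSON returned by the Crossref REST API (keys `reference`, `abstract`, `author`, `funder`, `license`), which is not the XML snapshot format used for the statistics in this post. The function name `completeness_flags` and the example record, including its placeholder ORCID, are hypothetical.

```python
def completeness_flags(record):
    """Return, for one record, whether each metadata element is present."""
    authors = record.get("author", [])
    return {
        "references": bool(record.get("reference")),
        "abstract": bool(record.get("abstract")),
        # Present if at least one author has an ORCID / an affiliation.
        "orcid": any(a.get("ORCID") for a in authors),
        "affiliation": any(a.get("affiliation") for a in authors),
        "funding": bool(record.get("funder")),
        "license": bool(record.get("license")),
    }


# Made-up record: abstract, one ORCID, and a license, but nothing else.
example = {
    "abstract": "<jats:p>Example abstract.</jats:p>",
    "author": [
        {"family": "Doe", "ORCID": "https://orcid.org/0000-0000-0000-0000",
         "affiliation": []},
        {"family": "Roe", "affiliation": []},
    ],
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
}
flags = completeness_flags(example)
print(flags)
```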
The statistics presented below are based on the Crossref XML Metadata Plus Snapshot from January 2025. Only records classified as preprint (or more formally, records of the type 'posted-content' and the subtype 'preprint') in Crossref are considered. Our statistics focus on the 763,951 preprint records in Crossref with publication year 2023 or 2024. There may be multiple versions of a preprint on a preprint server that each have their own DOI. In that case, each version was treated as a separate preprint.
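The selection described above can be sketched as follows. This is an illustration under assumptions, not the code behind our statistics: it assumes records shaped like Crossref REST API JSON rather than the XML snapshot, and it assumes the publication year is read from the `issued` date. The function names and the example DOIs are placeholders.

```python
def publication_year(record):
    """Return the year from the 'issued' date parts, or None if absent."""
    date_parts = record.get("issued", {}).get("date-parts", [])
    if date_parts and date_parts[0] and date_parts[0][0] is not None:
        return date_parts[0][0]
    return None


def is_recent_preprint(record, years=(2023, 2024)):
    """True for records of type 'posted-content' and subtype 'preprint'
    whose publication year is one of the given years."""
    return (
        record.get("type") == "posted-content"
        and record.get("subtype") == "preprint"
        and publication_year(record) in years
    )


# Made-up records. Each DOI counts separately, so two versions of the
# same preprint with their own DOIs are treated as two preprints.
records = [
    {"DOI": "10.5555/example.v1", "type": "posted-content",
     "subtype": "preprint", "issued": {"date-parts": [[2023, 5, 1]]}},
    {"DOI": "10.5555/example.v2", "type": "posted-content",
     "subtype": "preprint", "issued": {"date-parts": [[2024, 1, 15]]}},
    {"DOI": "10.5555/old", "type": "posted-content",
     "subtype": "preprint", "issued": {"date-parts": [[2021]]}},
    {"DOI": "10.5555/article", "type": "journal-article",
     "issued": {"date-parts": [[2024]]}},
]
selected = [r["DOI"] for r in records if is_recent_preprint(r)]
print(selected)
```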
Our statistics do not include preprints posted on arXiv, the world's largest preprint server, used primarily by researchers in physics, mathematics, and computer science. arXiv registers DOIs at DataCite rather than at Crossref. The same applies to preprints posted on platforms such as Zenodo and Figshare, which also register DOIs at DataCite.
Interactive versions of the scatter plots presented below are available here. The statistics are also available in Zenodo.
Reference lists
For each preprint server with at least 2,000 preprints in 2023 and 2024, Figure 1 shows the number of preprints posted on the server in 2023 and 2024 (vertical axis; logarithmic scale) and the percentage of these preprints with at least one reference in their Crossref metadata (horizontal axis).
Consider for instance bioRxiv and medRxiv, major preprint servers in the life sciences and the medical sciences, respectively, operated by openRxiv, a recently established non-profit organization. As can be seen in Figure 1, a reference list is included in the Crossref metadata of almost all preprints on bioRxiv and medRxiv. The same applies to JMIR Preprints, a smaller preprint server operated by the medical publisher JMIR Publications. The two largest preprint servers in our analysis, SSRN (owned by Elsevier) and Research Square (owned by Springer Nature), also include reference lists in the Crossref metadata of preprints, although in the case of SSRN reference lists are missing for more than half of the preprints.
Other preprint servers do not include reference lists in the Crossref metadata of preprints. Reference lists are for instance missing in the metadata of preprints posted on Preprints.org, Authorea, and TechRxiv, preprint servers owned by MDPI, Wiley, and IEEE, respectively. The same applies to preprints posted on ChemRxiv, a preprint server in chemistry owned by a number of chemical societies. Preprint servers that use the OSF platform operated by the Center for Open Science also do not include reference lists in the Crossref metadata of preprints. This is for instance the case for PsyArXiv in the psychological sciences and SocArXiv in the social sciences.
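The two quantities plotted in Figure 1, the number of preprints per server and the percentage with at least one reference, could be computed along the following lines. This is a sketch under assumptions: the `server` key is a hypothetical field standing in for whatever identifies the preprint server in a record, the `reference` key follows Crossref REST API JSON, and the toy data uses a lower threshold than the 2,000 preprints used in our analysis.

```python
from collections import defaultdict


def reference_coverage(records, min_preprints=2000):
    """Per server: (number of preprints, % with at least one reference).

    Servers with fewer than min_preprints records are left out.
    """
    totals = defaultdict(int)
    with_refs = defaultdict(int)
    for record in records:
        server = record["server"]  # hypothetical field, see lead-in
        totals[server] += 1
        if record.get("reference"):
            with_refs[server] += 1
    return {
        server: (n, 100.0 * with_refs[server] / n)
        for server, n in totals.items()
        if n >= min_preprints
    }


# Toy data: Server A has 4 preprints (3 with references), Server B has 2.
toy = (
    [{"server": "Server A", "reference": [{"key": "ref1"}]}] * 3
    + [{"server": "Server A"}]
    + [{"server": "Server B"}] * 2
)
coverage = reference_coverage(toy, min_preprints=3)
print(coverage)
```

The same aggregation applies to the other metadata elements by swapping the presence test on `reference` for a test on the element of interest.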

Abstracts
Almost all preprint servers include abstracts in the metadata they submit to Crossref (see Figure 2), although for some servers, such as Authorea and TechRxiv, abstracts are missing for a significant share of the preprints. An important exception is SSRN, the largest preprint server in our analysis. Abstracts are not included in the Crossref metadata of SSRN preprints.

ORCIDs
As can be seen in Figure 3, for most preprint servers a large majority of the preprints include ORCIDs in their Crossref metadata. Importantly, Figure 3 focuses on the presence of an ORCID for at least one author of a preprint, so ORCIDs may not be available for all authors. While most preprint servers include ORCIDs in the metadata they submit to Crossref, there are exceptions. SSRN is the most significant one. No ORCIDs are available in the Crossref metadata of SSRN preprints. In the case of Research Square, ORCIDs are available only for a small share of the preprints.
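The difference between an ORCID for at least one author and ORCIDs for all authors can be made explicit with a small sketch. It assumes author entries shaped like Crossref REST API JSON, with an optional `ORCID` key. The function name `orcid_coverage` and the placeholder ORCID are hypothetical.

```python
def orcid_coverage(record):
    """Summarize ORCID presence among the authors of one record."""
    authors = record.get("author", [])
    with_orcid = sum(1 for a in authors if a.get("ORCID"))
    return {
        # The criterion used in Figure 3: at least one author has an ORCID.
        "any": with_orcid > 0,
        # A stricter criterion: every listed author has an ORCID.
        "all": bool(authors) and with_orcid == len(authors),
        "share": with_orcid / len(authors) if authors else 0.0,
    }


# Made-up record with two authors, one of whom has a placeholder ORCID.
record = {
    "author": [
        {"family": "Doe", "ORCID": "https://orcid.org/0000-0000-0000-0000"},
        {"family": "Roe"},
    ]
}
summary = orcid_coverage(record)
print(summary)
```

A record like this one counts as having ORCID metadata under the criterion of Figure 3, even though half of its authors lack an ORCID.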

Author affiliations
Most preprint servers do not include author affiliations in the metadata they submit to Crossref (see Figure 4), or they include author affiliations only for a small share of their preprints. There are a few positive exceptions. Most notably, almost all preprints on Research Square and Authorea include at least one author affiliation in their Crossref metadata. This is also the case for preprints on ChemRxiv.

Funding information
Only two preprint servers include funding information in the metadata they submit to Crossref (see Figure 5). Funding information is available in the Crossref metadata of about 40% of the preprints on ChemRxiv and about 75% of the preprints on EGUsphere, a preprint server in geoscience owned by the European Geosciences Union. Preprints on other preprint servers do not include funding information in their Crossref metadata. However, bioRxiv recently started to collect funding information and to include it in the metadata it submits to Crossref.

License information
License information is included in the metadata most preprint servers submit to Crossref (see Figure 6). SSRN and JMIR Preprints are exceptions. They do not include license information in the Crossref metadata of preprints. Authorea includes license information only for a small share of its preprints. Likewise, in the case of bioRxiv and medRxiv, license information is available only for a minority of the preprints. However, the team at bioRxiv and medRxiv informed us they recently started to make license information available for new preprints and they are also working on making license information available for older preprints.

Implications for the preprint ecosystem
Preprints are playing an increasingly important role in scientific publishing. As the reliance on preprints increases, preprint servers need to become more mature. Depositing preprint metadata to infrastructures such as Crossref is an important requirement in a mature preprint ecosystem.
At the same time, preprint servers do not charge authors and readers for their services and typically operate with very limited resources. Consequently, it may not be realistic to expect preprint servers to provide a service level comparable to the service level offered by well-resourced journals. Perhaps preprint metadata submitted to Crossref therefore cannot be expected to reach the same level of completeness as journal article metadata.
On balance, we believe preprint servers should at least submit basic metadata elements to Crossref. This includes the abstract and the license information of a preprint as well as the ORCID and the affiliation of the corresponding author. Ideally, preprint servers also submit reference lists and funding information to Crossref as well as ORCIDs and affiliations of all authors. We realize this is more resource-intensive and may currently not be feasible for some preprint servers. However, as technology advances, new less costly approaches for processing preprint metadata may be developed, making it easier for preprint servers to deposit complete metadata to infrastructures such as Crossref.
We hope this blog post will help to improve the completeness of Crossref metadata for preprints. In parallel with other efforts to improve metadata practices for preprints, we see this as a crucial step toward a mature preprint ecosystem!
We thank Katie Corker and Bianca Kramer for helpful feedback on an earlier version of this blog post. A draft of this blog post was also shared with some of the preprint servers. Responses from SSRN and OSF can be found below.
Response from SSRN
SSRN's mission is to support the open sharing of scientific research. SSRN currently sends basic metadata to Crossref. We appreciate the benefit of richer metadata to the broader research ecosystem. We are currently exploring ways to expand the range and quality of data we share with Crossref in a sustainable and efficient manner that makes the best use of our limited resources and can handle SSRN's large number of preprints.
Response from OSF
Metadata is key to making open research FAIR (Findable, Accessible, Interoperable, and Reusable), and the Center for Open Science is aligned with the importance of improving preprint metadata. At OSF, we are committed to supporting richer, more standardized metadata so that preprints are easier to discover, cite, and reuse. As part of the OSF roadmap, we are migrating from FundRef (Open Funder Registry) to Research Organisation Registry (ROR), which we already use for institutional affiliations, to strengthen and standardize affiliation metadata across all OSF content.
Additional details
Identifiers
- UUID: 04844ac2-70e8-4260-b7b9-ccf2fd729765
- GUID: https://www.leidenmadtrics.nl/articles/crossref-as-a-source-of-open-bibliographic-metadata-for-preprints
- URL: https://www.leidenmadtrics.nl/articles/crossref-as-a-source-of-open-bibliographic-metadata-for-preprints
Dates
- Issued: 2025-10-16T08:56:00
- Updated: 2025-10-16T09:12:48