Published February 9, 2020 | https://doi.org/10.59350/7k4cn-ab012

Three research dataset related updates - Dimensions, Google Dataset search and DataCite search

Creators & Contributors

  • 1. Singapore Management University

1. Dimensions (including free) now includes Research Datasets

2. Google dataset search comes out of beta

3. DataCite adds citation display showing links between articles and datasets


The release, storage, management, and discovery of research datasets is an area that has been advancing over the last few years.

Here are three new updates that caught my eye in the last few months.

Dimensions (including free) now includes Research Datasets

Dimensions by Digital Science is a new academic search engine that is shaping up to be a challenger to citation indexes like Scopus and Web of Science.

One of the main differentiating points of Dimensions, besides being more inclusive than Scopus, is that it searches and links not just to documents but also to clinical trials, grants, patents and policy documents.

On 28 Jan 2020, they added datasets to the mix. Unlike clinical trials, grants, patents and policy documents, which are not available in the free version, datasets are available in the free version.

Dimensions claims 1.4 million datasets and, going by frequency, its top sources include PLOS, Pangaea, ACS, Dryad, Figshare, Springer Nature, Zenodo etc.

Is 1.4 million a lot? Not really. Mendeley Data, a research data aggregator, claims 20.5 million items, while in its latest Jan 2020 blog post, Google claims "almost 25 million" for Google Dataset Search.

Mendeley Data claims 20.5 million datasets


Accordingly, when I run a search for "coronavirus", Dimensions gets me 332 results and Mendeley Data 2,447 results. Who can tell with Google Dataset Search, since it doesn't show the actual number of results and just displays "100+ results". But this isn't the whole story.

Dataset or Research Data?

One of the more interesting things that occurred to me recently is that we use "research data" and "datasets" somewhat interchangeably, but of course they aren't the same. (At a higher level I often think of all items with DataCite DOIs as research datasets, which is very wrong, since DataCite DOIs are often assigned to preprints, white papers etc.)

It is important to note that unlike Mendeley Data or Google Dataset Search, which use the broader definition of research data to include images, software, preprints etc., Dimensions only indexes "datasets" as defined by the repositories it harvests from, such as Figshare and Dryad.

Note this isn't foolproof: see this image, indexed by Dimensions because it was wrongly labelled as a dataset in Figshare.

Mendeley Data includes more than datasets: software, video, text etc.


It's hard to compare Mendeley Data and Dimensions because of this. If we assume Mendeley Data's Dataset (78 results) and Tabular Data (1,370 results) types are roughly equivalent to a Dimensions dataset, then Mendeley Data's lead narrows to 1,448 results against 332 for Dimensions.
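To make that comparison concrete, here is a back-of-the-envelope calculation using only the result counts quoted above. It assumes all the counts come from the same "coronavirus" search and that Dataset plus Tabular Data is a fair stand-in for what Dimensions calls a dataset, which is a rough assumption rather than an established mapping. The variable names are mine.

```python
# Result counts quoted in the post for the "coronavirus" search.
dimensions_datasets = 332
mendeley_all_types = 2447
mendeley_dataset = 78
mendeley_tabular = 1370

# Assumption: "Dataset" + "Tabular Data" roughly matches a Dimensions dataset.
mendeley_comparable = mendeley_dataset + mendeley_tabular

# How many times more results Mendeley Data returns than Dimensions.
ratio_all = mendeley_all_types / dimensions_datasets
ratio_comparable = mendeley_comparable / dimensions_datasets

print(f"Mendeley Data, comparable types only: {mendeley_comparable}")
print(f"Ratio using all Mendeley Data types: {ratio_all:.1f}x")
print(f"Ratio using comparable types only: {ratio_comparable:.1f}x")
```

On these numbers Mendeley Data still returns several times more results than Dimensions, but the gap is noticeably smaller once the non-dataset types are set aside.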

In terms of features, when searching by dataset, Dimensions (free) allows me to filter by the usual

  • Publication Year

  • Researcher

  • Field of Research

  • Source Title

  • Repository

and as expected does not allow you to filter by institution, which is reserved for the paid version.

However, I notice it does not allow you to filter further by research data type, unlike Mendeley Data, probably because it has only one type of research data?

One of the nicer features Dimensions has when searching datasets, which isn't in Mendeley Data or Google Dataset Search, is that it allows you to filter by "fields of research". This lets you look at datasets in domains outside the usual STEM.

Below, for example, I look at datasets assigned to Philosophy and Religious Studies.

Datasets in Dimensions on Philosophy and Religious Studies



All in all, an interesting development to watch.

Google Dataset search comes out of beta

I blogged about the excitement over Google Dataset Search in Oct 2018, and it came as somewhat of a surprise to me when Google announced in Jan 2020 that Google Dataset Search would be coming out of beta.

Given how long many Google services like Gmail and Google Scholar were in beta, it is striking that Google dropped the beta tag on this in barely over a year!

However, those expecting a big leap in features are likely to be disappointed.

There are at best three major improvements.

Firstly, you can now filter by data type (Table, Document, Image, Text, Archive, Other) and by whether the dataset is free.

Secondly, if a dataset is about a geographic area, you can see the map.

Thirdly, it is mobile responsive now.

Overall, it is still quite barebones, even for a Google product.

For instance, taking Google Scholar as a standard, my wish list would be for it to eventually include things like

1. Email alerts for saved searches

2. A citation function

3. Allow custom ranges for filtering years

Another frustration for me is that Google Dataset Search only displays result counts that say "100+ datasets found". I find this extremely odd. While it is acceptable for Google or Google Scholar to display an estimate like "about 250,000 results", always showing 100+ gives you zero sense of how many results are really out there.

Further work remains to be done to study which parts of the Google search syntax apply to dataset search.

DataCite adds citation display showing links between articles and datasets

As I have blogged in past posts, while dataset discovery by keyword search is important, another important use case will be users finding datasets by going from papers to the datasets or vice versa.

People who are watching this space carefully will be aware that one can use the Crossref/DataCite Event Data API to query for "events", or relationships, around content deposited with DOIs at Crossref and DataCite. This of course includes data citations, usually between articles published with Crossref DOIs and datasets deposited with DataCite DOIs.
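As a rough illustration of what such a query might look like, the sketch below only builds a request URL and does not call the service. The base URL and the parameter names (`obj-id`, `source`, `rows`, `mailto`) reflect my understanding of the Crossref Event Data query API and should be checked against its current documentation. The DOI is a made-up placeholder and `build_event_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: base URL of the Crossref Event Data query API.
EVENT_DATA_URL = "https://api.eventdata.crossref.org/v1/events"


def build_event_query(doi, source="datacite", rows=100, mailto=None):
    """Build a URL asking for events whose object is the given DOI.

    Parameter names are assumptions based on the Event Data docs.
    """
    params = {"obj-id": doi, "source": source, "rows": rows}
    if mailto:
        # Optional contact address, following the polite-usage convention.
        params["mailto"] = mailto
    return f"{EVENT_DATA_URL}?{urlencode(params)}"


# Placeholder article DOI, not a real record.
example_url = build_event_query("10.1234/example-article")
print(example_url)
```

Fetching that URL with any HTTP client should return JSON describing the matching events, if the assumptions above hold.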





There are two ways to relate articles (with Crossref DOIs) to datasets (with DataCite DOIs). One can put in the relationship from the article side (if you are the publisher + author), but most libraries will be adding the relationship from the opposite direction, from the dataset deposited in our repository (usually with a DataCite DOI) to the article DOI.

The cliff notes are below.

Figure 1. Steps to share data citations using the Scholix framework.



Essentially, when you deposit a dataset via DataCite, you can choose to relate it to another identifier, typically an article, using one of the following relationships


  • IsReferencedBy

  • IsCitedBy

  • IsSupplementTo
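To show roughly where such a relationship lives, here is a sketch of a dataset record with one related identifier. The field names (`relatedIdentifiers`, `relatedIdentifier`, `relatedIdentifierType`, `relationType`) and the capitalised relation type values follow my understanding of the DataCite metadata schema. Both DOIs are placeholders and `related_identifier` is a hypothetical helper, not part of any DataCite client.

```python
import json

# The three relation types mentioned above, spelled as in the DataCite schema.
DATASET_TO_ARTICLE_RELATIONS = {"IsReferencedBy", "IsCitedBy", "IsSupplementTo"}


def related_identifier(article_doi, relation_type="IsSupplementTo"):
    """Return one relatedIdentifier entry linking a dataset to an article."""
    if relation_type not in DATASET_TO_ARTICLE_RELATIONS:
        raise ValueError(f"unexpected relation type: {relation_type}")
    return {
        "relatedIdentifier": article_doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }


# Placeholder DOIs, for illustration only.
dataset_metadata = {
    "doi": "10.5555/example-dataset",
    "relatedIdentifiers": [related_identifier("10.1234/example-article")],
}

print(json.dumps(dataset_metadata, indent=2))
```

A real deposit would carry many more fields (creators, titles, publisher and so on); only the linking part is shown here.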



The cool thing is that you can now look at some of these relationships directly in DataCite Search, as noted in the blog post.

Figure 3. Citations Display




But as you can see above, why are there "citations" and "references" tabs?

Things get confusing because they classify links as "Citations" if they are inward looking, that is, pointing towards the dataset, and as "References" if they are outward looking.
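My reading of that inward/outward rule can be sketched as below. This is only an interpretation of the wording, treating every link as a simple source-to-target pair. It is not how DataCite actually derives direction from its relation types. `Link` and `classify` are hypothetical names and the DOIs are placeholders.

```python
from typing import NamedTuple


class Link(NamedTuple):
    source: str  # DOI the link points from
    target: str  # DOI the link points to


def classify(link, focal_doi):
    """Label a link relative to the record being displayed (the focal DOI)."""
    focal = focal_doi.lower()  # DOIs are compared case-insensitively
    if link.target.lower() == focal:
        return "citation"  # inward: something points towards the focal DOI
    if link.source.lower() == focal:
        return "reference"  # outward: the focal DOI points at something else
    return "unrelated"


dataset = "10.5555/example-dataset"
inward = Link(source="10.1234/example-article", target=dataset)
outward = Link(source=dataset, target="10.1234/another-article")

print(classify(inward, dataset))
print(classify(outward, dataset))
```

Seen this way, the same link is a "citation" on one record's page and a "reference" on the other's, which is part of why the labels feel slippery.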

Figure 2. Data citations criteria proposal


Honestly, this just feels confusing to me.

Leaving aside the fact that "references" and "citations" are mostly used interchangeably in lay usage, and that opencitations.net defines references and citations differently, DataCite already uses relationships like "IsReferencedBy" and "IsCitedBy", and yet it differentiates between inward and outward relationships using pretty much the same terms, "Citations" and "References".

I'm also confused by the idea of "inward" meaning pointing towards the dataset.

The blog post says these criteria apply "across the board to all resources with a Datacite DOI. This means that if you have a resource of e.g. the type text, software or any other you will get their citations display shown as well, and the citations will be classified in the same way."

So technically, citations are not so much inward looking to datasets as inward looking to DataCite DOIs.

In any case, this feature seems somewhat unstable, as it was rather slow when I tried it, and I'm looking forward to improvements.

Conclusion

I recently attended an Ex Libris Primo 2020 webinar, and it was mentioned that Primo would eventually be linking research data to articles (expected 2H 2020). This seems possible since they already index DataCite content in the central index.

All in all, data discovery seems to be on the way up, though things are still at an early stage of development. It will be interesting to see, as we progress, how much of what we know about article discovery applies to dataset discovery and how much does not.

Additional details

Identifiers

UUID
bb411d55-5992-4c7a-9b59-e8bf043b62f1
GUID
164998280
URL
https://aarontay.substack.com/p/three-research-dataset-related-updates

Dates

Issued
2020-02-09T20:19:00
Updated
2020-02-09T20:19:00