Rogue Scholar Posts

Published in rOpenSci - open tools for open science
Author David Winter

I am happy to say that the latest issue of The R Journal includes a paper describing rentrez, the rOpenSci package for retrieving data from the National Center for Biotechnology Information (NCBI). The NCBI is one of the most important sources of biological data. The centre provides access to information on 28 million scholarly articles through PubMed and 250 million DNA sequences through GenBank.

Published in iPhylo

In a recent Twitter conversation including David Shorthouse and myself (and other poor souls who got dragged in) we discussed how to demonstrate that adopting JSON-LD as a simple linked-data friendly format might help bootstrap the long-awaited "biodiversity knowledge graph" (see below for some suggestions for keeping JSON-LD simple). David suggests partnering with "Three small, early adopting projects". I disagree.

Published in rOpenSci - open tools for open science
Author David Winter

A new version of rentrez, our package for the NCBI’s EUtils API, is making its way around the CRAN mirrors. This release represents a substantial improvement to rentrez, including a new vignette that documents the whole package. This post describes some of the new things in rentrez, and gives us a chance to thank some of the people who have contributed to this package’s development.

Published in iPhylo

If we view biodiversity data as part of the "biodiversity knowledge graph" then specimens are a fairly central feature of that graph. I'm looking at ways to link specimens to sequences, taxa, publications, etc., and doing this across multiple data providers. Here are some rough notes on trying to model this in a simple way.

Published in iPhylo

Scott Federhen told me about a nice new feature in GenBank that he's described in a piece for NCBI News. The NCBI taxonomy database now shows a list of type material (where known), and the GenBank sequence database "knows" about types. Here's the summary: You can query for sequences from type using the query "sequence from type"[filter]. This could lead to some nice automated tools.
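As a rough illustration of the kind of automated tool this filter could support, here is a minimal Python sketch that builds an NCBI E-utilities `esearch` URL using the "sequence from type"[filter] query quoted above. The function name `type_sequence_search_url`, the choice of the `nuccore` database, the optional `[Organism]` restriction, and the default `retmax` are assumptions made for this example; they are not taken from the post. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base endpoint of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The filter quoted in the post.
TYPE_FILTER = '"sequence from type"[filter]'


def type_sequence_search_url(organism=None, retmax=20):
    """Build an esearch URL for sequences derived from type material.

    `organism` (hypothetical convenience argument) optionally restricts
    the search to one taxon; `retmax` caps the number of IDs returned.
    """
    term = TYPE_FILTER
    if organism:
        term = f'"{organism}"[Organism] AND {TYPE_FILTER}'
    params = {
        "db": "nuccore",      # assumed target: the nucleotide database
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL; fetching it is left to the reader.
    print(type_sequence_search_url("Aspergillus", retmax=5))
```

Fetching the resulting URL should return a list of matching sequence identifiers, which could then be passed to other E-utilities calls.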

Published in iPhylo

In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228.

Published in iPhylo

Dark taxa have become even darker. NCBI has pulled the plug on large numbers of DNA barcode sequences that lack scientific names. For example, taxon Cyclopoida sp. BOLD:AAG9771 (tax_id 818059) now has a sparse page that has no associated sequences. From an earlier download of EMBL I know that this taxon is associated with at least 5 sequences, such as GU679674. But if you go to that sequence you get this: So the sequence is hidden.

Published in iPhylo

Last week I was at the NSF "Assembling, Visualising and Analysing the Tree of Life" Ideas Lab, run by KnowInnovation.com/. It was an interesting experience, essentially a structured week of brainstorming ideas. One thing I came away with is the feeling that our notions of the "tree of life" are fuzzy, contradictory, and often probably unobtainable.

Published in iPhylo

In an earlier post (Are names really the key to the big new biology?), I questioned Patterson et al.'s assertion in a recent TREE article (doi:10.1016/j.tree.2010.09.004) that names are key to the new biology. In this post I'm going to revisit this idea by doing a quick analysis of how many species in GenBank have "proper" scientific names, and whether the number of named species has changed over time.
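To make the idea of a "proper" scientific name concrete, here is a small Python sketch of one possible heuristic: treat a name as proper if it looks like a plain Latin binomial (capitalised genus, lowercase epithet) and is not an open-nomenclature placeholder such as "sp". This rule, the function names `looks_like_binomial` and `fraction_named`, and the placeholder list are all assumptions for illustration; the post does not state which criteria its analysis uses.

```python
import re

# Capitalised genus, one space, lowercase epithet (hyphens allowed).
_BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+$")

# Placeholder "epithets" that signal an unidentified or uncertain taxon
# (an assumed, non-exhaustive list).
_PLACEHOLDERS = {"sp", "cf", "aff", "nr"}


def looks_like_binomial(name):
    """Return True if `name` resembles a formal Latin binomial."""
    name = name.strip()
    if not _BINOMIAL.match(name):
        return False
    epithet = name.split(" ")[1]
    return epithet not in _PLACEHOLDERS


def fraction_named(names):
    """Fraction of names in `names` that pass the binomial heuristic."""
    names = list(names)
    if not names:
        return 0.0
    return sum(looks_like_binomial(n) for n in names) / len(names)


if __name__ == "__main__":
    sample = ["Homo sapiens", "Cyclopoida sp. BOLD:AAG9771"]
    for n in sample:
        print(n, looks_like_binomial(n))
```

Applied to taxon names grouped by the year they entered the database, `fraction_named` would give a simple time series of how the share of formally named species changes.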