Rogue Scholar

Deduplicating bibliographic data

Published 3 February 2022 in iPhylo

Author: Roderic Page

There are several instances where I have a collection of references that I want to deduplicate and merge.

Bats, Classification, Cluster Maps, Data Cleaning, GBIF, Computer and Information Sciences, English

Cluster maps, papaya plots, and the trouble with GBIF taxonomy

https://doi.org/10.59350/dq1cv-szd96

Published 14 August 2013 in iPhylo

Author: Roderic Page

Continuing the theme of the failings of the GBIF classification, I've been playing further with cluster maps to visualise the problem (see this earlier post for an introduction). Browsing through bats in GBIF I keep finding the same species appearing more than once, albeit in different genera.

Data Cleaning, RDF, SPARQL, Taxonomy, Uniprot, Computer and Information Sciences, English

A use case for RDF in taxonomy

https://doi.org/10.59350/ch96r-vgg12

Published 1 August 2013 in iPhylo

Author: Roderic Page

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases. The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant.

BioNames, Data Cleaning, Matching, Reconciliation, Computer and Information Sciences, English

BioNames update - reconciliation strategies

https://doi.org/10.59350/9jnjr-qvp20

Published 22 April 2013 in iPhylo

Author: Roderic Page

Over on Google Plus (yeah, me neither) Donat Agosti is giving me a hard time regarding the quality of some data that I am using.

Catalogue Of Life, Chresonym, Data Cleaning, Errors, Homonym, Computer and Information Sciences, English

More fictional taxa and the myth of the expert taxonomic database

https://doi.org/10.59350/regph-e8w09

Published 25 June 2012 in iPhylo

Author: Roderic Page

I know I'm starting to sound like a broken record, but the more I look, the more taxonomic databases seem to be full of garbage. Databases such as the Catalogue of Life, which states that it is a "quality-assured checklist", have records that are patently wrong.

BioStor, Classification, Data Cleaning, Error, GBIF, Computer and Information Sciences, English

The GBIF classification is broken — how do we fix it?

https://doi.org/10.59350/5a5re-kp839

Published 30 May 2012 in iPhylo

Author: Roderic Page

This post arose from an ongoing email conversation with Tony Rees about extracting and annotating taxonomic names. In BioStor I use the GBIF classification to display the taxonomic names found in the OCR text in the form of a tree.

Clustering, Data Cleaning, Graphviz, Taxonomy, Computer and Information Sciences, English

Clustering strings

https://doi.org/10.59350/wfhyy-qt220

Published 22 February 2012 in iPhylo

Author: Roderic Page

Revisiting an old idea (Clustering taxonomic names) I've added code to cluster strings into sets of similar strings to the phyloinformatics course site. This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters.
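To make the input and output of such a service concrete, here is a minimal Python sketch of clustering a list of strings into sets of similar strings. It is an illustration only: the similarity measure (`difflib.SequenceMatcher`), the `threshold` value, and the function name `cluster_strings` are all assumptions, and the sketch is not necessarily how the service above is implemented.

```python
from difflib import SequenceMatcher


def cluster_strings(lines, threshold=0.8):
    """Group similar strings into clusters (hypothetical helper).

    Two strings are linked when their similarity ratio is at least
    `threshold` (an assumed cut-off); clusters are the connected
    components of the resulting graph.
    """
    strings = [s.strip() for s in lines if s.strip()]
    parent = list(range(len(strings)))  # union-find forest, one node per string

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and merge the components of similar strings.
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            a, b = strings[i].lower(), strings[j].lower()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(j)] = find(i)

    # Collect strings by the root of their component, in input order.
    clusters = {}
    for i, s in enumerate(strings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())


if __name__ == "__main__":
    text = "Rhinolophus ferrumequinum\nRhinolophus ferrum-equinum\nMyotis myotis"
    for cluster in cluster_strings(text.splitlines()):
        print(cluster)
```

The pairwise comparison is quadratic in the number of strings, which is fine for short lists but would need an index (for example on n-grams) for larger ones.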

Data Cleaning, Google Refine, Taxonomic Name, Computer and Information Sciences, English

Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data

https://doi.org/10.59350/jyyjb-ppf17

Published 6 February 2012 in iPhylo

Author: Roderic Page

Google Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use Freebase reconciliation services, but you can also add external services.

BioStor, Challenge, Data Cleaning, Duplicates, Mendeley, Computer and Information Sciences, English

Mendeley mangles my references: phantom documents and the problem of duplicate references

https://doi.org/10.59350/925mf-6fq39

Published 10 November 2010 in iPhylo

Author: Roderic Page

One issue I'm running into with Mendeley is that it can create spurious documents, mangling my references in the process. This appears to be due to some over-zealous attempts to de-duplicate documents.

BHL, Data Cleaning, Index, Matching, MySQL, Computer and Information Sciences, English

n-gram fulltext indexing in MySQL

https://doi.org/10.59350/26ame-4a164

Published 23 October 2009 in iPhylo

Author: Roderic Page

Continuing with my exploration of the Biodiversity Heritage Library, one obstacle to linking BHL content with nomenclature databases is the lack of a consistent way to refer to the same bibliographic item (e.g., book or journal). For example, the Amphibia Species of the World (ASW) page for Gastrotheca aureomaculata gives the first reference for this name as: Gastrotheca aureomaculata Cochran and Goin, 1970, Bull. U.S. Natl.
