Catalogs and Indices for Finding (Scientific) Software
The problem
I've been thinking about software catalogs for a while, mostly since the program I lead at NSF (Software Infrastructure for Sustained Innovation, SI2, see the second half of http://nsf.gov/si2) has funded a fairly large number of projects, around 100. Many of these projects are producing and supporting a piece of software, but some contribute to multiple pieces of software, and in some cases, multiple projects contribute to the same piece of software. In addition, some of the projects were funded long enough ago that they have ended or will end soon.
The issue of catalogs or indices is important in multiple ways:
- As a funder, I would like to have a list of the products that we are supporting or have supported, tied to the projects that we have funded.
- As a user, I would like to know what software is likely to be around for a while, and NSF funding (particularly with the period of funding shown) would be a useful thing to know.
- As a developer, I want others to be able to easily find (and then both use and cite) my software.
What happens in other areas?
In the publishing and web content worlds, this is somewhat similar to the roles of a publisher/aggregator/distributor, a content consumer, and a content creator.
Important artifacts are:
- The identifiers of each item of content that enable the items to be uniquely and consistently identified: e.g., DOI, ISBN, ISSN, URL
- The catalogs and indices that store content identifiers and related metadata (e.g., creator, publication date, format), and allow consumers to find them: e.g., publisher catalog, journal table of contents, curated pages of links, best-seller lists, Google, Google Scholar
- The services and tools that generate such catalogs and indices, whether public or private: e.g., Google Scholar, Mendeley, university profiles and knowledge systems
What some government programs are doing
We (the SI2 program) have talked about this a lot, and what we decided to do, partly as a compromise between effort and reward, is a Google Sites page: http://bit.ly/sw-ci
NIH, as part of its Big Data to Knowledge (BD2K) initiative, held a "Software Discovery Index" workshop in May 2014, with a report available.
The Department of Energy's Office of Science's X-Stack program maintains a wiki to track its current projects.
DARPA has an open catalog, and other DARPA programs may join this catalog. NASA's code.nasa.gov is based on the same idea.
All of these have value, but none is fully satisfying. Both the SI2 page and the X-Stack wiki have problems with old projects, which either need to be removed manually or are forcibly removed (listing is tied to funding, not to the ongoing value of the project). Centralized maintenance of these pages seems to be part of the problem. The DARPA and NASA catalogs link to software repositories, but not to DOIs, and they require metadata JSON files to be placed in a particular GitHub repository, from which the catalogs are built.
While not a government program, freshmeat/freecode is an interesting example of a catalog failing due to low utilization and, perhaps, impractical maintenance. [pointer from James Howison]
Proposed solution
Perhaps we can build an infrastructure that allows both content creators and users to contribute? Here's a set of ideas that might make this work:
- Repositories provide a simple option to publish a release, like one can do from GitHub and Zenodo, but in fewer steps and with fewer changes of servers. (Note: I assume that publication of software is a conscious decision that is made at particular times — releases. An alternate model that could also be explored is that publication happens by default as software is checked in to repositories.)
- The person who publishes a release fills in metadata about the release (as is done on the Zenodo page now). These releases would then be assigned DOIs, and they would become catalog entries through the work of a crawler or a future system like Google Scholar (let's call it OSC – open software catalog – for now). Additionally, they could also be crawled or curated for discipline-specific purposes.
- Generating metadata for repositories is a larger challenge than just for software. The idea of user-generated metadata, as suggested here and by others, is a potential answer.
- In order for user-generated metadata to work, we as a community need to agree on the standards for such metadata (equivalent to work going on for data in the Research Data Alliance). Ideally, there would be a minimal set of required metadata, with optional extensions, perhaps even discipline-specific options.
- A related question is how much of the needed metadata can be autogenerated. For example, commits could be used to generate author lists, or at least a starting point for a person to use in defining the authors (a small sketch of this idea appears after this list). (Note that this kind of autogeneration is useful for credit in general; see Implementing Transitive Credit with JSON-LD and Project CRediT as examples of where it could be used.)
- A DOAP RDF file could be used to store this information (a minimal example appears after this list).
- The person who publishes a release can optionally indicate the funding source (with the funding end date), which is added to the metadata and made visible, perhaps as a badge/flag, once the funder affirms that it is correct. When the end date passes, the badge/flag would change to indicate prior funding. If the project gets new funding, a new badge/flag can be added.
- Quality metrics could also be generated, either by peers or by automated testing.
- Users could give ratings and discuss successes and failures with the software (as is done on nanoHUB and in app stores). Citations to the software from papers could also be gathered automatically.
- Quality metrics could be automatically generated from developer-provided test suites (see the test-suite sketch after this list). Note that this gets back to the question of standard practices, as in the DARPA open catalog; it would be better if this were directly supported within the repositories rather than by the catalogs. This would also potentially address the issue of bit rot (somewhat related to an idea from James Howison).
- People looking for software to use or to further develop can search the catalog. It's unclear how much this would happen vs. just talking to colleagues to see what they use, but perhaps these badges/flags/scores would make this more common.
- A related idea is that the catalog could be partially automatically generated and partially curated. This builds on the idea of a software Wikipedia that lets developers know who is doing what, so that they can collaborate rather than compete (even unintentionally), as was suggested by one of the attendees of the 2015 SI2 PI workshop. (Sorry, I don't remember who suggested this.)
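To illustrate the autogeneration idea mentioned in the list above, here is a minimal sketch of deriving a draft author list from version-control history. It assumes a local Git checkout and uses only the Python standard library; a real system would still need a person to merge duplicate identities and decide on ordering and roles.

```python
import subprocess
from collections import Counter

def draft_author_list(repo_dir="."):
    """Derive a draft author list from Git history, ordered by commit count.

    This is only a starting point for a person to edit: it cannot merge
    multiple identities for one contributor or assign contributor roles.
    """
    log = subprocess.run(
        ["git", "log", "--format=%aN <%aE>"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return [author for author, _count in Counter(log).most_common()]

if __name__ == "__main__":
    for author in draft_author_list():
        print(author)
```

Ordering by commit count is just one placeholder policy; lines changed, or an explicit AUTHORS file, might serve better as the starting point.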
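For the DOAP suggestion above, here is a minimal sketch of what such a file might contain, written in Turtle. The project, URLs, and release details are invented purely for illustration, and anything beyond the core DOAP vocabulary, such as funding information, would need an agreed extension.

```turtle
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<https://example.org/mysimcode> a doap:Project ;
    doap:name "MySimCode" ;
    doap:shortdesc "A hypothetical simulation package, used here only as an example" ;
    doap:homepage <https://example.org/mysimcode> ;
    doap:programming-language "C++" ;
    doap:repository [ a doap:GitRepository ;
        doap:location <https://github.com/example/mysimcode> ] ;
    doap:release [ a doap:Version ;
        doap:revision "1.2.0" ;
        doap:created "2015-02-23" ] ;
    doap:maintainer [ a foaf:Person ; foaf:name "A. Developer" ] .
```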
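Finally, as a sketch of the automated quality metric and bit-rot point above: a catalog (or, better, the repository or a CI service) could periodically run whatever test command the developers declare and record the outcome. The default command and the output fields here are assumptions made only for illustration.

```python
import datetime
import json
import subprocess

def check_test_suite(repo_dir, command=("pytest", "-q")):
    """Run a declared test command and record a simple pass/fail quality metric.

    The default command is a placeholder; in practice the test entry point
    would come from the catalog entry's metadata.
    """
    result = subprocess.run(list(command), cwd=repo_dir,
                            capture_output=True, text=True)
    return {
        "checked": datetime.datetime.utcnow().isoformat() + "Z",
        "command": " ".join(command),
        "passed": result.returncode == 0,
    }

# Run periodically; a string of failures is one concrete signal of bit rot.
print(json.dumps(check_test_suite("."), indent=2))
```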
Others discussing (or even working on) this
- Alice Allen: Astrophysics Source Code Library (ASCL), with manual entries and manual curation
- Geoffrey Bilder, Jennifer Lin, Cameron Neylon: Principles for Open Scholarly Infrastructures
- Jason Duley: Code.NASA.gov and Empowering the Open Source Community
- Martin Fenner: Metrics for scientific software
- Paul Groth: Trip report from Supporting Scientific Discovery through Norms and Practices for Software and Data Citation and Attribution workshop
- Arfon Smith: JSON-LD for software discovery, reuse and credit
- Jure Triglav: Discovery of scientific software and ScienceToolbox
(I'm happy to add more links here – please email me)
Acknowledgements
Thanks to Amy Friedlander, Martin Fenner, David Proctor, James Howison, Chris Mattmann, and Arfon Smith for useful comments on an earlier draft of this blog post. The suggestions have been very helpful, and any errors are solely mine.
Disclaimer
Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.
Comments are welcome.