Rogue Scholar Posts

Published in Donny Winston

A lot of extract-transform-load (ETL) work requires unloading and un-transforming first. Rather than \(ETL\), it’s \(L_A^{-1}T_A^{-1}ET_BL_B\). What the data provider did is \(A\). What you want to do is \(B\). The data provider gave you a “dump” of their data. You don’t know what it means. If you did, you could extract (subset) from it according to your needs – filter entities by some meaningful criteria and collect selected attributes.
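The composition can be sketched with toy functions. Everything here is a hypothetical stand-in: the provider's steps (`transform_a`, `load_a`), their inverses, the field codes, and the sample records are assumptions made up for illustration, not anything from a real provider.

```python
import json

# Hypothetical provider schema (A): terse field codes, serialized as JSON.
CODES_A = {"name": "n", "temperature_kelvin": "t", "operator": "o"}

def transform_a(records):
    # T_A: the provider renames fields to its own codes.
    return [{CODES_A[key]: value for key, value in record.items()} for record in records]

def load_a(records):
    # L_A: the provider serializes the records into a "dump".
    return json.dumps(records)

def unload_a(dump):
    # L_A^{-1}: parse the dump back into records.
    return json.loads(dump)

def untransform_a(records):
    # T_A^{-1}: map the provider's codes back to meaningful names.
    decode = {code: name for name, code in CODES_A.items()}
    return [{decode[key]: value for key, value in record.items()} for record in records]

def extract(records):
    # E: filter entities by a criterion and collect selected attributes.
    return [
        {"name": r["name"], "temperature_kelvin": r["temperature_kelvin"]}
        for r in records
        if r["temperature_kelvin"] > 300
    ]

def transform_b(records):
    # T_B: reshape into the consumer's own (hypothetical) schema.
    return [{"sample": r["name"], "temp_k": r["temperature_kelvin"]} for r in records]

def load_b(records):
    # L_B: write the consumer's records as comma-separated lines.
    return "\n".join(f"{r['sample']},{r['temp_k']}" for r in records)

source = [
    {"name": "sample-1", "temperature_kelvin": 295, "operator": "x"},
    {"name": "sample-2", "temperature_kelvin": 310, "operator": "y"},
]
dump = load_a(transform_a(source))          # what the provider hands over
recovered = untransform_a(unload_a(dump))   # undo A before extracting
result = load_b(transform_b(extract(recovered)))
print(result)  # sample-2,310
```

The extract step only becomes possible once `untransform_a` has restored names that mean something; without the `CODES_A` table, the dump's `n`, `t`, and `o` keys cannot be filtered on with any confidence.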

Published in Donny Winston

Wikidata uses opaque identifiers for its catalogued information resources. For example, the statement “wd:Q42 wdt:P69 wd:Q691283” may map to the label sequence “‘Douglas Adams’ ‘educated at’ ‘St John’s College’” with a language-locale preference of English-US. Opaque naming is wise for internationalization.
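A minimal sketch of that mapping, assuming a small in-memory label table rather than the Wikidata service itself. The `LABELS` table, the `label_for` and `render` helpers, and the language codes in the preference lists are illustrative assumptions; only the three identifiers and their English labels come from the example above.

```python
# Hypothetical label table keyed by opaque identifier, then by language code.
LABELS = {
    "wd:Q42": {"en": "Douglas Adams"},
    "wdt:P69": {"en": "educated at"},
    "wd:Q691283": {"en": "St John's College"},
}

def label_for(identifier, preferences):
    """Return the first available label in preference order, else the opaque ID."""
    labels = LABELS.get(identifier, {})
    for language in preferences:
        if language in labels:
            return labels[language]
    return identifier

def render(statement, preferences):
    """Map each identifier in a space-separated statement to a label."""
    return [label_for(token, preferences) for token in statement.split()]

statement = "wd:Q42 wdt:P69 wd:Q691283"
english = render(statement, ["en-us", "en"])  # falls back from en-us to en
unlabeled = render(statement, ["fr"])         # no French labels in this toy table
print(english)
print(unlabeled)
```

Because the identifiers carry no language of their own, adding another language is a matter of adding labels to the table; the statement itself never changes.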

Published in Donny Winston

Imagine a data system modeled as three parts: an interface, a processor, and a repository. The repository “contains” information. The processor receives symbol streams to alter or retrieve information from the repository, and the processor outputs symbol streams. The interface is the medium, the opaque surface, of symbol-stream exchange between you and the processor. What information is “in” the repository?

Published in Donny Winston

After my last note on identifiers, Leo Talirz pointed me to a great riff on Tim Berners-Lee’s classic note on “cool URIs”. In the “Cool DOIs” article, Fenner breaks down a DOI into three parts: proxy, prefix, and suffix. A proxy is a server that maintains a map from prefixes to registrants. Example proxies are https://doi.org/ and https://hdl.handle.net/. An example prefix is 10.5281.
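The three-part breakdown can be sketched as a small parser. The helper name `split_doi_url` and the suffix `example-suffix` are hypothetical; only the proxy and prefix values come from the text above.

```python
from urllib.parse import urlsplit

def split_doi_url(url):
    """Split a DOI URL into (proxy, prefix, suffix)."""
    parts = urlsplit(url)
    proxy = f"{parts.scheme}://{parts.netloc}/"
    # The prefix ends at the first slash; the suffix is everything after it
    # and may itself contain further slashes.
    prefix, _, suffix = parts.path.lstrip("/").partition("/")
    return proxy, prefix, suffix

# "example-suffix" is a made-up suffix used only for illustration.
proxy, prefix, suffix = split_doi_url("https://doi.org/10.5281/example-suffix")
print(proxy, prefix, suffix)
```

Swapping the proxy for https://hdl.handle.net/ leaves the prefix and suffix untouched, which is the point of separating the three parts.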

Published in Konrad Hinsen's blog

A few days ago, Google announced its experimental project Open Source Insights, which permits the exploration of the dependency graph of Open Source software. My first look at it ended with a disappointment: in its initial stage, the site considers only the package universes of Java, JavaScript, Go, and Rust. That excludes most of the software I know and use, which tends to be written mainly in C, C++, Fortran, and Python.

Published in Donny Winston

Data protocols vary over project lifetimes, and many projects involve parameter sweeps. You might see filesystem directory structures evolve naming schemes like the following:

    # let's not overthink this at first.
    concentration_A_0.25/
    # hierarchy is good, right?
    concentration_A/0.25/
    # paths are getting too long.
    conc_A/0.25/
    # change to percentages. clever!

Published in Donny Winston

Approaches to data citation may span classes of big-O complexity, for both space (memory/storage) and time (compute/transfer). Dataset revisions may be minted and persisted without any delta encoding / structural sharing. The main mechanism of reproduction for citations in this case is restoration. Space complexity is high, as storage needs are high.
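A rough sketch of the full-copy case, assuming an in-memory store; the `SnapshotStore` class, its method names, and the sample records are hypothetical. Each minted revision is persisted whole, so a citation is reproduced by restoring that revision, and stored bytes grow with the number of revisions times the dataset size.

```python
import hashlib
import json

class SnapshotStore:
    """Persist every revision in full: no delta encoding, no structural sharing."""

    def __init__(self):
        self._revisions = {}

    def mint(self, records):
        # Serialize the whole dataset and derive an identifier from its content.
        content = json.dumps(records, sort_keys=True).encode("utf-8")
        revision_id = hashlib.sha256(content).hexdigest()[:12]
        self._revisions[revision_id] = content
        return revision_id

    def restore(self, revision_id):
        # Reproducing a citation means restoring the full stored copy.
        return json.loads(self._revisions[revision_id])

    def stored_bytes(self):
        return sum(len(content) for content in self._revisions.values())

revision_1 = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
revision_2 = [{"id": 1, "value": "a"}, {"id": 2, "value": "c"}]  # one record changed

store = SnapshotStore()
id_1 = store.mint(revision_1)
id_2 = store.mint(revision_2)
print(store.restore(id_1) == revision_1)
print(store.stored_bytes())
```

Although only one record differs between the two revisions, the unchanged record is stored twice; that duplication is the space cost described above.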