Source of truth

The slow but inexorable migration from files to dependencies

2019-10-29 work

Today I read Computer Files Are Going Extinct by Simon Pitt and I felt seen:

I’d go through my collection and painstakingly add IDv1 and IDv2 music tags. (…) Sometimes I’d even listen to the damn things, though I suspect the time spent organizing and validating them vastly outweighed the time spent listening.

Source of truth in code

In software development, we use the term “source of truth” to designate where the “master copy” of the source code lives. A bit like the master recording of a song, one git server (or other SCM) holds the pristine copy. In 2019, that often means GitHub for open source projects, but there are exceptions.

A view of the source of truth

For Linux, the source of truth is generally considered to be Linus’ repository. Don’t be fooled: github.com/torvalds/linux is merely a copy. The real truth is git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, which you can browse at git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/, itself a mere view of the real thing.

Similarly, the Go programming language has its source of truth at go.googlesource.com/go, but a live copy is available at github.com/golang.

Why would projects create these views of the real source of truth? A major part is for UX purposes. GitHub acts as a form of registry: users are trained to look for the official copy there and expect to find one.

Projects as utilities

In the late 2000s and early 2010s, we saw the rise of language-specific project registries. Some languages had competing registries for a while, but slowly and surely they became monopolies, one per language.

If you, as a project author, want to publish a Python project, you manually upload it to pypi.org. JavaScript? That’s npmjs.com. Ruby? Everyone knows rubygems.org. That’s the new-wave languages.

Older languages have a harder time, likely due to longer history and different engineering culture. C/C++ is still without any real project registry. Same for Java as far as I know. That’s the dinosaur languages.

On new-wave languages, as a project owner, you upload your files to the registry. You delegate the source of truth to the registry.

In the good old days, if you wanted to use a JavaScript framework, you’d copy the files onto your web server and import them directly in your HTML page.

Then most people eventually offloaded the responsibility to a content delivery network, as a caching layer. There are many reasons for this: the file has a higher chance of being already cached on the client, the CDN likely has more bandwidth than your web server, and you save on bandwidth costs. Win-win.

That’s still a file-centric view of the world. People are shoveling files around.

Dependencies as a service

Nowadays, nobody goes to github.com/facebook/react to download whatever the current version of React is by hand. Instead, people create their React application with something like:

npx create-react-app mon-app
cd mon-app
npm start

That’s a service-centric view of the world. The implicit dependency here is the npm registry: you don’t keep a local copy of React, you keep a cache of the server’s copy. You become dependent on npmjs.com/package/react, which becomes your source of truth in place of its upstream at github.com/facebook/react.

There’s a real trade-off here. In the file-centric world, you need to shovel the files around. It’s work, but when your project fails to build, it’s your own responsibility. In the service-centric world, it can fail because GitHub is down.
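To make the trade-off concrete, here is a minimal sketch in Go of a resolver that keeps a cache of the server’s copy. Everything in it (Registry, Resolver, ErrRegistryDown) is a hypothetical name invented for illustration, and it is not how npm actually works. It only shows the failure mode: a warm cache hides an outage, a fresh machine does not.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRegistryDown is a hypothetical error for an unreachable registry.
var ErrRegistryDown = errors.New("registry unreachable")

// Registry is a hypothetical stand-in for a remote package registry.
type Registry struct {
	Up       bool
	Packages map[string]string // package name -> contents
}

// Fetch returns the registry's copy, or an error if the service is down.
func (r *Registry) Fetch(name string) (string, error) {
	if !r.Up {
		return "", ErrRegistryDown
	}
	pkg, ok := r.Packages[name]
	if !ok {
		return "", fmt.Errorf("package %q not found", name)
	}
	return pkg, nil
}

// Resolver keeps a local cache of the registry's copy, not a copy it owns.
type Resolver struct {
	Remote *Registry
	Cache  map[string]string
}

// Resolve serves from the cache when it can and asks the registry otherwise.
func (r *Resolver) Resolve(name string) (string, error) {
	if pkg, ok := r.Cache[name]; ok {
		return pkg, nil
	}
	pkg, err := r.Remote.Fetch(name)
	if err != nil {
		return "", err
	}
	r.Cache[name] = pkg
	return pkg, nil
}

func main() {
	reg := &Registry{Up: true, Packages: map[string]string{"react": "react sources"}}
	res := &Resolver{Remote: reg, Cache: map[string]string{}}

	_, err := res.Resolve("react") // fetched from the registry, then cached
	fmt.Println("first build, error:", err)

	reg.Up = false
	_, err = res.Resolve("react") // the warm cache hides the outage
	fmt.Println("cached build, error:", err)

	res.Cache = map[string]string{} // for example, a fresh CI machine
	_, err = res.Resolve("react")   // nothing local to fall back on
	fmt.Println("fresh build, error:", err)
}
```

The last call is the one that hurts: nothing changed in your project, yet the build fails.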

Alex Danco put it amazingly in Everything is Amazing, But Nothing is Ours:

Worlds of scarcity are made out of things. Worlds of abundance are made out of dependencies. That’s the software playbook: find a system made of costly, redundant objects; and rearrange it into a fast, frictionless system made of logical dependencies. The delta in performance is irresistible, and dependencies are a compelling building block: they seem like just a piece of logic, with no cost and no friction. But they absolutely have a cost: the cost is complexity, outsourced agency, and brittleness. The cost of ownership is up front and visible; the cost of access is back-dated and hidden.

Registries in languages

To my knowledge, Java was the first language to try to encode ownership in the package name. People know the com.google.gwt form and the unwieldy long paths it creates on the file system. In theory, this could have been a canonical URL, but it wasn’t done that way initially. Later we finally saw projects use canonical URLs, like org.springframework and org.hibernate, but it took a while.

20 years later, Go went the other way and required projects to have a canonical URL, encoded directly in the source tree rather than in metadata. Unlike predecessor languages like Python, Ruby, C/C++, JavaScript and others, Go used an internet-first way to identify dependencies.

You don’t write from flask import Flask, which loses the fact that flask is hosted at github.com/pallets/flask; you write import "github.com/gorilla/mux", which is exactly its source of truth.

This is because new-wave languages use product-first registries. These registries contain a build product, not raw sources. In some ways, the Docker Hub registry is a new-wave registry, hosting build products. This puts Go into its own distinct category, a source-first one.

This is an important distinction between Go and Rust. While both are recent languages, Rust’s crates.io is a new-wave registry, hosting build products, not raw source code. Go did not need a registry, because the package’s URL is the unique identifier.
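As a small illustration of how much a Go import path gives away, here is a sketch that splits one into host, owner and repository. ParseImportPath and Origin are hypothetical helpers, not part of the Go toolchain, and the host/owner/repo layout is an assumption that fits GitHub-style paths but not every real import path.

```go
package main

import (
	"fmt"
	"strings"
)

// Origin describes where an import path says the code lives.
type Origin struct {
	Host  string // e.g. "github.com"
	Owner string // e.g. "gorilla"
	Repo  string // e.g. "mux"
}

// ParseImportPath splits a URL-style import path into host, owner and repo.
// It reports false for paths with no host component, such as the standard
// library's "os". Assumption: the path follows a host/owner/repo layout.
func ParseImportPath(importPath string) (Origin, bool) {
	parts := strings.Split(importPath, "/")
	// A hosted path needs host/owner/repo, and a host name contains a dot.
	if len(parts) < 3 || !strings.Contains(parts[0], ".") {
		return Origin{}, false
	}
	return Origin{Host: parts[0], Owner: parts[1], Repo: parts[2]}, true
}

func main() {
	for _, p := range []string{"github.com/gorilla/mux", "os"} {
		if o, ok := ParseImportPath(p); ok {
			fmt.Printf("%q points at https://%s/%s/%s\n", p, o.Host, o.Owner, o.Repo)
		} else {
			fmt.Printf("%q carries no hosting information\n", p)
		}
	}
}
```

The Python import has nothing equivalent to parse: the name flask alone does not say where the code comes from.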

Shielding from services

In large companies, the source of truth is often an internal, delayed view of the external world. The company will vendor external dependencies and use this internal copy as the source of truth. You can guess that Google engineers are not running npm install and expecting to hit npmjs.com.

Employees won’t fetch from GitHub. They will manually copy external dependencies, sometimes even make small internal adjustments to the code, and use that copy.

This is essentially maintaining a file-first view of the world, cutting off dependencies. The ongoing cost of a file-first view of the world is significant.
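A rough sketch of what that shielding looks like, assuming a hypothetical InternalMirror type and a third_party directory name that I made up for the example. The point is only that lookups resolve inside the company’s tree and never reach out to the network.

```go
package main

import (
	"fmt"
	"path"
)

// InternalMirror is a hypothetical model of a company's vendored tree.
type InternalMirror struct {
	Root     string          // hypothetical directory holding the internal copies
	Vendored map[string]bool // import paths that someone copied in by hand
}

// Locate returns where the internal copy lives. It never falls back to the
// external service: a missing dependency is an error, not a download.
func (m InternalMirror) Locate(importPath string) (string, error) {
	if !m.Vendored[importPath] {
		return "", fmt.Errorf("%s is not vendored; nothing is fetched from outside", importPath)
	}
	return path.Join(m.Root, importPath), nil
}

func main() {
	m := InternalMirror{
		Root:     "third_party",
		Vendored: map[string]bool{"github.com/gorilla/mux": true},
	}
	for _, p := range []string{"github.com/gorilla/mux", "github.com/pallets/flask"} {
		dir, err := m.Locate(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(p, "->", dir)
	}
}
```

Every entry in Vendored is a file somebody has to copy, review and keep up to date, which is where the ongoing cost comes from.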

The file has been replaced with the ecosystem

That’s why I think goproxy.io is great. It is not a source of truth per se; it’s a pure audit service. In my opinion, it is closer to the certificate transparency project than to crates.io. Unlike previous generations like PyPI, the npm registry or Docker Hub, which acted both as the source of truth (as an author, you actively upload there) and a. It’s the audit trail, closer analogy of (TODO).

A stronger stance is a single source of truth, where there’s only one real master copy.

My 2¢: this calls for the standard library to use the long form for consistency, so that importing os would become import "golang.org/pkg/os" or something like that.
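To show what I mean by an audit service, here is a toy sketch: an in-memory log that remembers the digest first seen for a module version and flags any later copy that differs. AuditLog and the example.com/mod key are hypothetical, and this is not a description of how goproxy.io or certificate transparency is implemented.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AuditLog is a hypothetical record mapping "module@version" to the digest
// of the content that was first observed for it.
type AuditLog map[string]string

// digest returns the hex-encoded SHA-256 of the content.
func digest(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// Verify records the digest on first sight. On later calls it reports
// whether the content still matches what was recorded.
func (l AuditLog) Verify(key string, content []byte) bool {
	d := digest(content)
	recorded, seen := l[key]
	if !seen {
		l[key] = d
		return true
	}
	return recorded == d
}

func main() {
	log := AuditLog{}
	key := "example.com/mod@v1.0.0" // hypothetical module and version

	fmt.Println("first sight accepted:", log.Verify(key, []byte("package mod")))
	fmt.Println("same content accepted:", log.Verify(key, []byte("package mod")))
	fmt.Println("changed content accepted:", log.Verify(key, []byte("package mod // tampered")))
}
```

The log never hosts the code and nobody uploads to it. It only lets everyone check that they were served the same bytes.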