May 09, 2018

Antoine Amarilli a.k.a a3nm

Migrating from cgit to stagit

I serve my git repositories over HTTP for people who want to browse them without having to clone them. I used to do this with cgit, which is a server-side dynamic solution written in C. It worked nicely, but lately some bots have been busy crawling these git repositories, and I regularly ran into trouble where the cgit.cgi processes ended up in a busy loop, eating 100% of CPU for unclear reasons. More generally, I had always been anxious about using a dynamic solution to serve these repositories: all the rest of my website is static, which I think is more elegant and more reassuring in terms of security.

The natural approach would be to turn cgit into a static solution by precompiling all pages whenever a git repository is updated. However, this is not reasonable: cgit allows you, e.g., to see the status of every file at every commit, or to diff any pair of commits, which would be too expensive to precompute. These features are not very useful, so I considered precompiling everything anyway and tweaking cgit's output to suppress the useless parts, but this would have been tedious.

Fortunately, there is a better way: the stagit tool is a minimalistic variant of cgit, also written in C, which is designed to be static. So I have just removed cgit from my server and installed stagit instead. Obviously it's too early for me to say whether stagit is a perfect solution, but I'm happy with what I have seen so far. Here are some quick and messy notes about how I did it and what surprised me, in case you are considering doing the same.

Stagit is not packaged for Debian yet, but it's easy to compile and install (and the source code is rather short if you want to hack it). You will need libgit2-dev, which is packaged by Debian. I edited the source a bit to suit my needs (cf. my local fork): I tweaked the HTML, fixed the CSS to work better on mobile displays, renamed some files, etc. It's a bit ugly to have HTML boilerplate hardcoded in the C code, but it works, and if it starts misbehaving it will be easier for me to investigate.

Stagit provides one command stagit to generate the HTML for a repository, and one command stagit-index to generate an index of the various repositories. The README is rather clear (you can also look at the manpages in the repo). Of course, you need to re-run stagit whenever a git repository is updated, so you'll need a post-receive hook like the one they provide, which I adapted to my needs. One concern is that running stagit is synchronous, i.e., when doing a git push, you must wait for stagit to complete. However, it seems to run instantly on my repositories, so that's no big deal.
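
For illustration, here is a minimal sketch of what such a hook boils down to (the paths are placeholders; the example hook shipped with stagit is more complete):

#!/bin/sh
# post-receive hook sketch: regenerate the static pages after every push.
# /srv/www/git/myrepo and /srv/git/myrepo.git are hypothetical paths;
# stagit writes its HTML output to the current directory.
dest=/srv/www/git/myrepo
repo=/srv/git/myrepo.git
mkdir -p "$dest"
cd "$dest" && stagit "$repo"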

To get a nice index of the repositories, you need to change your git repositories to fill the description file with a description and the url file with the clone URL. There is also support for an owner field, but I removed this from the generated HTML as I'm the owner of all the repos I host. As the setup of a new git repository had become a bit tedious, I wrote a script for that, too.
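
Concretely, these are just plain files at the top level of the bare repository; for instance (the paths and URL are placeholders):

echo "a short description of my project" > /srv/git/myrepo.git/description
echo "https://example.com/git/myrepo" > /srv/git/myrepo.git/url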

About the url: you should know that stagit does not take care of allowing people to clone your repository. One solution is to run a git server for that (which the official stagit repository seems to do), but I didn't want it because it's not static. Instead, I intend people to clone my repositories using the dumb HTTP protocol: it only requires you to serve your git repositories with your Web server, and to run git update-server-info, as can be done easily using the post-update.sample hook (a minimal version is sketched after the rewrite rules below). So for each repository you will have the stagit version and the bare repository. However, this means that the git clone URL will be different from the stagit URL, which is a bit jarring. So I cheated using some lighttpd mod_rewrite rules to transparently do the redirection. (Note that git clone will still point out the existence of this redirect when doing the cloning, so it's not completely transparent.) Here are the rules, following this page (thanks to immae for suggesting an improvement):

  "^/git/([^/.]*)/HEAD$" => "/git/$1.git/HEAD",
  "^/git/([^/.]*)/info/(.*)$" => "/git/$1.git/info/$2",
  "^/git/([^/.]*)/objects/(.*)$" => "/git/$1.git/objects/$2",
  "^/git/([^/.]*)/git-upload-pack$" => "/git/$1.git/git-upload-pack",
  "^/git/([^/.]*)/git-receive-pack$" => "/git/$1.git/git-receive-pack",

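As an aside, enabling the dumb HTTP transport mentioned above really just amounts to keeping git's auxiliary index files up to date after each push; a minimal hook for this (essentially what post-update.sample already contains) is:

#!/bin/sh
# post-update hook: refresh info/refs and objects/info/packs so that
# dumb HTTP clients can clone and fetch from the repository.
exec git update-server-info
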
One last thing about the migration to stagit is that I didn't want to break all the cgit URLs that used to work before. Of course, not all cgit pages have a stagit counterpart, but most of the important ones do; however, their names are a bit different. Again, not very robust, but here goes:

  "^/git/([^/.]*)/commit/\?id=(.*)$" => "/git/$1/commit/$2.html",
  "^/git/([^/.]*)/about(/.*)?$" => "/git/$1/file/README.html",
  "^/git/([^/.]*)/log(/.*)?$" => "/git/$1/index.html",
  "^/git/([^/.]*)/refs(/.*)?$" => "/git/$1/refs.html",
  "^/git/([^/.]*)/tree/?(\?.*)?$" => "/git/$1/files.html",
  "^/git/([^/.]*)/tree/([^?]*)(\?.*)?$" => "/git/$1/file/$2.html",
  "^/git/([^/.]*)/plain/([^?]*)(\?.*)?$" => "/git/$1/file/$2.html",
  "^/git/([^.?]*)\?.*$" => "/git/$1",
  "^/git/([^/.]*)/([^?]*)\?.*$" => "/git/$1",

So there you have it: a completely static web version of my git repositories that can also be used to clone them with the dumb HTTP transport, a hook to update the web version, a script to create a new repository, and no more problems or possible security vulnerabilities with cgit!

by a3nm at May 09, 2018 11:18 PM

April 13, 2018

Nicolas Dandrimont a.k.a olasd

docker and 127.0.0.1 in /etc/resolv.conf

I’ve been doing increasingly more stuff with docker these days, and I’ve bumped into an issue: all my systems use a local DNS resolver, one way or another: as I’m roaming and have some VPNs with split-horizon DNS, my laptop uses dnsmasq as configured by NetworkManager; as I want DNSSEC validation, my servers use a local unbound instance. (I’d prefer to use unbound everywhere, but I roam on some networks where the only DNS recursors that aren’t firewalled don’t support DNSSEC, and unbound is pretty bad at handling that unfortunately)

Docker handles 127.0.0.1 in /etc/resolv.conf the following way: it ignores the entry (upstream discussion). When there are no DNS servers left, it falls back to using 8.8.8.8 and 8.8.4.4.

This is all fine and dandy, if that’s the sort of thing you like (I don’t kinkshame), but if you don’t trust the owners of those DNS recursors, or if your network helpfully firewalls outbound DNS requests, you’ll likely want to use your own resolver instead.

The upstream docker FAQ just tells people to disable the local resolver and/or hardcode some DNS entries in the docker config. That’s not very helpful when you roam and you really want to use your local resolver… To do things properly, you’ll have to set two bits of configuration: tell docker to use the local resolver, and tell your local resolver to listen on the docker interface.

Docker DNS configuration

First, you need to configure the docker daemon to use the host as DNS server: in the /etc/docker/daemon.json file, set the dns key to ["172.17.0.1"] (or whatever IP address your docker host is set to use).
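
A minimal sketch of that file, assuming the default docker0 bridge address, written from a root shell (the daemon must be restarted for the change to take effect):

cat > /etc/docker/daemon.json <<'EOF'
{
    "dns": ["172.17.0.1"]
}
EOF
systemctl restart docker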

Local resolver: dnsmasq in NetworkManager

When NetworkManager is set up to use dnsmasq, it runs with a configuration that’s built dynamically, and updated when the network settings change (e.g. to switch upstream resolvers according to DHCP or VPN settings).

You can add drop-in configurations in the /etc/NetworkManager/dnsmasq.d directory. I have set the following configuration variables in a /etc/NetworkManager/dnsmasq.d/docker.conf file:

# Makes dnsmasq bind to all interfaces (but still only accept queries on localhost as per NetworkManager configuration)
bind-interfaces

# Makes dnsmasq accept queries on 172.17.0.0/16
listen-address=172.17.0.1

Restarting NetworkManager brings up a new instance of dnsmasq, which will let Docker do its thing.
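
To check that the new dnsmasq instance answers on the bridge address, a query from the host should work (dig is in the dnsutils package; 172.17.0.1 is the default docker0 address):

dig +short @172.17.0.1 debian.org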

Local resolver: unbound

The same sort of configuration is done with unbound. We tell unbound to listen to all interfaces and to only accept recursion from localhost and the docker subnet. In /etc/unbound/unbound.conf.d/docker.conf:

interface: 0.0.0.0
interface: ::
access-control: 127.0.0.0/8 allow
access-control: ::1/128 allow
access-control: 172.17.0.0/16 allow

Restart unbound (reload doesn’t reopen the sockets) and your docker container will have access to your local resolver.

Both these local resolver configurations should work even when the docker interface comes up after the resolver: we tell the resolver to listen to all interfaces, then only let it answer to clients on the relevant networks.

Result

Before:

$ docker run busybox nslookup pypi.python.org
Server: 8.8.8.8
Address 1: 8.8.8.8 google-public-dns-a.google.com

Name: pypi.python.org
Address 1: 2a04:4e42::223
Address 2: 2a04:4e42:200::223
Address 3: 2a04:4e42:400::223
Address 4: 2a04:4e42:600::223
Address 5: 151.101.0.223
Address 6: 151.101.64.223
Address 7: 151.101.128.223
Address 8: 151.101.192.223

After:

$ docker run busybox nslookup pypi.python.org
Server: 172.17.0.1
Address 1: 172.17.0.1

Name: pypi.python.org
Address 1: 2a04:4e42:1d::223
Address 2: 151.101.16.223

by olasd at April 13, 2018 10:04 PM

March 11, 2018

Antoine Amarilli a.k.a a3nm

An update on CalDAV and CardDAV with Radicale

This is a quick update to a previous post where I explained how to self-host your calendar and contacts using the Radicale CalDAV and CardDAV server, and how to access them on Android devices with DAVdroid.

Three years later, I am still using this setup. I only use my Android phone to access the calendar and contacts, so the Radicale server is essentially a way to back the contacts and calendars up; although I have also tried accessing them, e.g., with Evolution. Over these three years, DAVdroid has evolved and gotten a bit more user-friendly and stable, though I have had a few problems (e.g., duplicated calendar events). Radicale has evolved too; I'm currently at version 1.1.1, which is the one provided by Debian even though it is really outdated. (Also, as of this writing, Radicale is not available in the Debian testing repos, see here, but it can be installed from Debian stable.)

The main change that I did is on the server. In the old guide, I explained how to set up Radicale so that it listens on port 5232, manages authentication and encryption, and DAVdroid connects to it directly. I have changed this setup so that DAVdroid now connects to Apache2, which manages authentication and encryption, and talks to Radicale using WSGI. This has a number of advantages:

  • You can encrypt the connection with SSL managed by Apache, e.g., using Let's Encrypt, without self-signed certificates or other ad-hoc setup; and you don't need to trust Radicale to do the encryption correctly.
  • The server listens on the standard HTTPS port (443) rather than the custom Radicale port (5232) so the connections aren't blocked on unfriendly networks.
  • You can use vhosts, e.g., to host it on a subdomain.
  • Authentication is managed by Apache, not Radicale. This is somewhat reassuring: even if Radicale has a massive security flaw, only users that correctly authenticated with Apache can talk to it at all.
  • The most important point: with the old setup, Radicale would inexplicably hang every now and then, presumably when the phone disconnected messily from it. (I think it is this bug). With the new setup, this does not happen. (Maybe the bug has been fixed in more recent Radicale versions anyway, I don't know.)

Of course, the downside of this new setup is that you need Apache just to route requests to Radicale. As I needed Apache for other purposes, though, I didn't mind.

The setup

I didn't document this setup while I was doing it, so here is a hopefully complete description of what I currently have.

You need to install Apache, and enable the SSL, WSGI, and auth_basic modules (as root, run a2enmod ssl, a2enmod wsgi, a2enmod auth_basic, and then service apache2 restart; the WSGI module comes from the libapache2-mod-wsgi package). Of course, basic HTTP authentication may sound insecure, but we will only be doing it over HTTPS.

You should set up Let's Encrypt certificates (e.g., with certbot), something I mentioned in this previous guide.
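
For instance, something like the following should obtain a certificate for the (hypothetical) dav.example.com subdomain used below, assuming certbot's Apache plugin is installed:

certbot --apache -d dav.example.com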

You also need to install Radicale itself. We are going to put all Radicale-related files in /srv/radicale, but of course this can be changed. The files in this directory should be readable and writable by the Web server.
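
A minimal sketch of that directory layout, matching the paths used in the rest of this guide (www-data being Debian's Apache user):

mkdir -p /srv/radicale/collections /srv/radicale/logs
chown -R www-data:www-data /srv/radicale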

You then need to create a file in /etc/apache2/sites-enabled whose contents look as follows:

<IfModule mod_ssl.c>
<VirtualHost *:443>
        ServerName dav.example.com

        ServerAdmin youremail@example.com
        DocumentRoot /var/www/html/

        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

        WSGIDaemonProcess radicale user=www-data group=www-data threads=1
        WSGIScriptAlias / /srv/radicale/radicale.wsgi

        <Directory /srv/radicale/>
            WSGIProcessGroup radicale
            WSGIApplicationGroup %{GLOBAL}
            AllowOverride None
            AuthType basic
            AuthName "dav.example.com"
            AuthUserFile /srv/radicale/passwd
            Require user youruser
            SSLRequireSSL
        </Directory>

SSLCertificateFile /etc/letsencrypt/live/example.com/fullchain.pem
SSLCertificateKeyFile /etc/letsencrypt/live/example.com/privkey.pem
Include /etc/letsencrypt/options-ssl-apache.conf
</VirtualHost>
</IfModule>

The file /srv/radicale/passwd contains the usernames and passwords of the users who can access the server, managed as usual with the htpasswd utility.
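
For instance, a minimal sketch of creating that file with a single user (htpasswd is shipped in Debian's apache2-utils package; youruser matches the Require user directive above):

htpasswd -c /srv/radicale/passwd youruser

The file /srv/radicale/radicale.wsgi contains the invocation to run Radicale and points to the config file, as follows: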

import radicale
configuration = radicale.config.read(["/srv/radicale/config"])
radicale.log.start()
application = radicale.Application()

To create the config file, you can, e.g., write the following in /srv/radicale/config:

[encoding]
request = utf-8
stock = utf-8

[rights]
type = owner_only

[storage]
type = filesystem
filesystem_folder = /srv/radicale/collections

[logging]
config = /srv/radicale/logging

In this file, /srv/radicale/collections contains the Radicale collections as in the old guide. The file /srv/radicale/logging contains the radicale logging configuration. Here is mine:

# inspired by https://github.com/Kozea/Radicale/issues/266#issuecomment-121170414
[loggers]
keys = root

[handlers]
keys = file

[formatters]
keys = full

[logger_root]
level = DEBUG
handlers = file

[handler_file]
args = ('/srv/radicale/logs/radicale.log','a',32768,3)
level = INFO
class = handlers.RotatingFileHandler
formatter = full

[formatter_full]
format = %(asctime)s - %(levelname)s: %(message)s

In the above, /srv/radicale/logs is where you want radicale to write its log files. You probably need to specify it manually, because radicale is run by the Web server, which may not have the right to log, e.g., in /var/log/radicale as the default configuration would do.

by a3nm at March 11, 2018 07:32 PM

February 27, 2018

Antoine Amarilli a.k.a a3nm

SWERC 2017 and 2018

I just realized I hadn't mentioned here something that had kept me busy over the autumn months. With my university, Télécom ParisTech, and with my colleagues Bertrand Meyer and Pierre Senellart, we have been organizing the SWERC programming contest in November 2017, and will do so again in December 2018. SWERC is the South-Western Europe Regional Contest for ACM ICPC, which is the most famous competitive programming competition for university students. You can read more about the contest here. We have welcomed 76 teams of three contestants each, from 48 institutions in France, Israel, Italy, Portugal, Spain, and Switzerland. The top three teams in the rankings are from ENS Paris, ETH Zürich, and SNS Pisa: they will compete in the ICPC world finals in Beijing.

The Télécom student association Comète has made a very nice video covering SWERC'17, which went out recently, and gives a good idea of what the contest was like. You can watch it on Youtube or in the iframe below, or download it directly if you prefer.

If you like competitive programming, you can have a look at the SWERC'17 problems on our website, or on UVa Online Judge or ACM-ICPC Live Archive. And if you are from a university in South-Western Europe and are eligible to participate, then we'd be glad to see you compete at SWERC'18! Registrations will open here in early September 2018.

by a3nm at February 27, 2018 10:38 PM

February 25, 2018

Nicolas Dandrimont a.k.a olasd

Report from Debian SnowCamp: day 3

[Previously: day 1, day 2]

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

As a starter, and on request from Valhalla, please enjoy an attempt at a group picture (unfortunately, missing a few people). Yes, the sun even showed itself for a few moments today!

One of the numerous SnowCamp group pictures

As for today's activities… I've cheated a bit by doing stuff after sending yesterday's report and before sleep: I reviewed some of Stefano's dc18 pull requests; I also fixed (well, papered over) the debexpo uscan bug.

After keeping eyes closed for a few hours, the day was then spent tickling the python-gitlab module, packaged by Federico, in an attempt to resolve https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890594 in a generic way.

The features I intend to implement are mostly inspired from jcowgill’s multimedia-cli:

  • per-team yaml configuration of “expected project state” (access level, hooks and other integrations, enablement of issues, merge requests, CI, …)
  • new repository creation (according to a team config or a personal config, e.g. for collab-maint, the Debian group)
  • audit of project configurations
  • mass-configuration changes for projects

There could also be some use for bits of group management, e.g. to handle the access control of the DebConf group and its subgroups, although I hear Ganneff prefers shell scripts.

My personal end goal is to (finally) do the 3D printer team repository migration, but e.g. the Python team would like to update configuration of all repos to use the new KGB hook instead of irker, so some generic interest in the tool exists.

As the tool has a few dependencies (because I really have better things to do than reimplement another wrapper over the GitLab API) I’m not convinced devscripts is the right place for it to live… We’ll see when I have something that does more than print a list of projects to show!

In the meantime, I have the feeling Stefano has lined up a new batch of DebConf website pull requests for me, so I guess that’s what I’m eating for breakfast “tomorrow”… Stay tuned!

My attendance to SnowCamp is in part made possible by donations to the Debian project. If you want to keep the project going, please consider donating, joining the Debian partners program, or sponsoring the upcoming Debian Conference.

by olasd at February 25, 2018 12:49 AM

February 24, 2018

Nicolas Dandrimont a.k.a olasd

Report from Debian SnowCamp: day 2

[Previously: day 1]

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

Today’s pièce de résistance was the long overdue upgrade of the machine hosting mentors.debian.net to (jessie then) stretch. We’ve spent most of the afternoon doing the upgrades with Mattia.

The first upgrade to jessie was a bit tricky because we had to clean up a lot of cruft that accumulated over the years. I even managed to force an unexpected database restore test 😇. After a few code fixes, and getting annoyed at apache2.4 for ignoring VirtualHost configs that don't end with .conf (and losing an hour of debugging time in the process…), we managed to restore the functionality of the website.

We then did the stretch upgrade, which was somewhat smooth sailing in comparison… We had to remove some functionality which depended on packages that didn’t make it to stretch: fedmsg, and the SOAP interface. We also noticed that the gpg2 transition completely broke the… “interesting” GPG handling of mentors… An install of gnupg1 later everything should be working as it was before.

We’ve also tried to tackle our current need for a patched FTP daemon. To do so, we’re switching the default upload queue directory from / to /pub/UploadQueue/. Mattia has submitted bugs for dput and dupload, and will upload an updated dput-ng to switch the default. Hopefully we can do the full transition by the next time we need to upgrade the machine.

Known bugs: the uscan plugin now fails to parse the uscan output… But at least it “supports” version=4 now 🙃

Of course, we’re still sorely lacking volunteers who would really care about mentors.debian.net; the codebase is a pile of hacks upon hacks upon hacks, all relying on an old version of a deprecated Python web framework. A few attempts have been made at a smooth transition to a more recent framework, without really panning out, mostly for lack of time on the part of the people running the service. I’m still convinced things should restart from scratch, but I don’t currently have the energy or time to drive it… Ugh.

More stuff will happen tomorrow, but probably not on mentors.debian.net. See you then!

My attendance to SnowCamp is in part made possible by donations to the Debian project. If you want to keep the project going, please consider donating, joining the Debian partners program, or sponsoring the upcoming Debian Conference.

by olasd at February 24, 2018 12:15 AM

February 22, 2018

Nicolas Dandrimont a.k.a olasd

Report from Debian SnowCamp: day 1

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

This morning, I arrived in Milan at “omfg way too early” (5:30AM, thanks to a 30 minute early (!) night train), and used the opportunity to walk the empty streets around the Duomo while the Milanese .oO(mapreri) were waking up. This gave me the opportunity to take very nice pictures of monuments without people, which is always appreciated!


After a short train ride to Laveno, we arrived at the Hostel at around 10:30. Some people had already arrived the day before, so there already was a hacking kind of mood in the air.  I’d post a panorama but apparently my phone generated a corrupt JPEG 🙄

After rearranging the tables in the common spaces to handle power distribution correctly (♥ Gaffer Tape), we could start hacking!

Today's efforts were focused on the DebConf website: there were a bunch of pull requests made by Stefano that I reviewed and merged.

I’ve also written a modicum of code.

Finally, I have created the Debian 3D printing team on salsa in preparation for migrating our packages to git. But now is time to do the sleep thing. See you tomorrow?

My attendance to SnowCamp is in part made possible by donations to the Debian project. If you want to keep the project going, please consider donating, joining the Debian partners program, or sponsoring the upcoming Debian Conference.

by olasd at February 22, 2018 10:40 PM

February 20, 2018

Nicolas Dandrimont a.k.a olasd

Listing and loading of Debian repositories: now live on Software Heritage

Software Heritage is the project for which I’ve been working during the past two and a half years now. The grand vision of the project is to build the universal software archive, which will collect, preserve and share the Software Commons.

Today, we’ve announced that Software Heritage is archiving the contents of Debian daily. I’m reposting this article on my blog as it will probably be of interest to readers of Planet Debian.

TL;DR: Software Heritage now archives all source packages of Debian as well as its security archive daily. Everything is ready for archival of other Debian derivatives as well. Keep on reading to get details of the work that made this possible.

History

When we first announced Software Heritage, back in 2016, we had archived the historical contents of Debian as present on the snapshot.debian.org service, as a one-shot proof of concept import.

This code was then left in a drawer and never touched again, until last summer when Sushant came to do an internship with us. We've had the opportunity to rework the code that was originally written, and to make it more generic: instead of being tied to the specifics of snapshot.debian.org, the code can now work with any Debian repository. Which means that we could now archive any of the numerous Debian derivatives that are available out there.

This has been live for a few months, and you can find Debian package origins in the Software Heritage archive now.

Mapping a Debian repository to Software Heritage

The main challenge in listing and saving Debian source packages in Software Heritage is mapping the content of the repository to the generic source history data model we use for our archive.

Organization of a Debian repository

Before we start looking at a bunch of unpacked Debian source packages, we need to know how a Debian repository is actually organized.

At the top level of a Debian repository lays a set of suites, representing versions of the distribution, that is to say a set of packages that have been tested and are known to work together. For instance, Debian currently has 6 active suites, from wheezy (“old old stable” version), all the way up to experimental; Ubuntu has 8, from precise (12.04 LTS), up to bionic (the future 18.04 release), as well as a devel suite. Each of those suites also has a bunch of “overlay” suites, such as backports, which are made available in the archive alongside full suites.

Under the suites, there’s another level of subdivision, which Debian calls components, and Ubuntu calls areas. Debian uses its components to segregate packages along licensing terms (main, contrib and non-free), while Ubuntu uses its areas to denote the level of support of the packages (main, universe, multiverse, …).

Finally, components contain source packages, which merge upstream sources with distribution-specific patches, as well as machine-readable instructions on how to build the package.

Organization of the Software Heritage archive

The Software Heritage archive is project-centric rather than version-centric. What this means is that we are interested in keeping the history of what was available in software origins, which can be thought of as a URL of a repository containing software artifacts, tagged with a type representing the means of access to the repository.

For instance, the origin for the GitHub mirror of the Linux kernel repository has the following data:

  • type: git
  • url: https://github.com/torvalds/linux

For each visit of an origin, we take a snapshot of all the branches (and tagged versions) of the project that were visible during that visit, complete with their full history. See for instance one of the latest visits of the Linux kernel. For the specific case of GitHub, pull requests are also visible as virtual branches, so we fetch those as well (as branches named refs/pull/<pull request number>/head).

Bringing them together

As we’ve seen, Debian archives (just as well as archives for other “traditional” Linux distributions) are release-centric rather than package-centric. Mapping distributions to the Software Heritage archive therefore takes a little bit of gymnastics, to transpose the list of source packages available in each suite to a list of available versions per source package. We do this step by step:

  1. Download the Sources indices for all the suites and components known in the Debian repository
  2. Parse the Sources indices, listing all source packages inside
  3. For each source package, tell the Debian loader to load all the available versions (grouped by name), generating a complete snapshot of the state of the source package across the Debian repository

The source packages are mapped to origins using the following format:

  • type: deb
  • url: deb://<repository name>/packages/<source package name> (e.g. deb://Debian/packages/linux)

We use a repository name rather than the actual URL to a repository so that links can persist even if a given mirror disappears.

Loading Debian source packages

To load Debian source packages into the Software Heritage archive, we have to convert them: Debian-based distributions distribute source packages as a set of files, a dsc (Debian Source Control) and a set of tarballs (usually, an upstream tarball and a Debian-specific overlay). On the other hand, Software Heritage only stores version-control information such as revisions, directories, files.

Unpacking the source packages

Our philosophy at Software Heritage is to store the source code of software in the precise form that allows a developer to start working on it. For Debian source packages, this is the unpacked source code tree, with all patches applied. After checking that the files we have downloaded match the checksums published in the index files, we simply use dpkg-source -x to extract the source package, with patches applied, ready to build. This also means that we currently fail to import packages that don’t extract with the version of dpkg-source available in Debian Stretch.
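
As an illustration, the extraction step boils down to the following (the package name and version are placeholders; the .dsc file and its tarballs are assumed to have been downloaded and checked already):

# unpack the source tree, with all Debian patches applied, ready to build
dpkg-source -x hello_2.10-1.dsc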

Generating a synthetic revision

After walking the extracted source package tree and computing identifiers for all its contents, we get the identifier of the top-level tree, which we will reference in the synthetic revision.

The synthetic revision contains the “reproducible” metadata that is completely intrinsic to the Debian source package. With the current implementation, this means:

  • the author of the package, and the date of modification, as referenced in the last entry of the source package changelog (referenced as author and committer)
  • the original artifact (i.e. the information about the original source package)
  • basic information about the history of the package (using the parsed changelog)

However, we never set parent revisions in the synthetic commits, for two reasons:

  • there is no guarantee that packages referenced in the changelog have been uploaded to the distribution, or imported by Software Heritage (our update frequency is lower than that of the Debian archive)
  • even if this guarantee existed, and all versions of all packages were available in Software Heritage, there would be no guarantee that the version referenced in the changelog is indeed the version we imported in the first place

This makes the information stored in the synthetic revision fully intrinsic to the source package, and reproducible. In turn, this allows us to keep a cache, mapping the original artifacts to synthetic revision ids, to avoid loading packages again once we have loaded them once.

Storing the snapshot

Finally, we can generate the top-level object in the Software Heritage archive, the snapshot. For instance, you can see the snapshot for the latest visit of the glibc package.

To do so, we generate a list of branches by concatenating the suite, the component, and the version number of each detected source package (e.g. stretch/main/2.24-10 for version 2.24-10 of the glibc package available in stretch/main). We then point each branch to the synthetic revision that was generated when loading the package version.

In case a version of a package fails to load (for instance, if the package version disappeared from the mirror between the moment we listed the distribution, and the moment we could load the package), we still register the branch name, but we make it a “null” pointer.

There are still some improvements to make to the lister specific to Debian repositories: it currently hardcodes the list of components/areas in the distribution, as the repository format provides no programmatic way of eliciting them. Currently, only Debian and its security repository are listed.

Looking forward

We believe that the model we developed for the Debian use case is generic enough to capture not only Debian-based distributions, but also RPM-based ones such as Fedora, Mageia, etc. With some extra work, it should also be possible to adapt it for language-centric package repositories such as CPAN, PyPI or Crates.

Software Heritage is now well on its way to providing the foundations for a generic and unified source browser for the history of traditional package-based distributions.

We’ll be delighted to welcome contributors that want to lend a hand to get there.

by olasd at February 20, 2018 01:52 PM

January 21, 2018

Antoine Amarilli a.k.a a3nm

Modern blockbusters: a dining metaphor

This is just a text I wrote to explain how I felt about most blockbuster movies. I didn't know what to do with it, so here it is.

There's this new restaurant in town that has posters and ads everywhere. Everyone's talking about it, and they all seem to have a pretty strong opinion, so you go with some friends to see what it's like.

The first impression is outstanding. The restaurant is lavishly decorated. The room, furniture, atmosphere, music, are all spectacular and have obviously been painstakingly designed for your enjoyment. The waiters have fancy, colorful, creative dresses, and they usher you to a comfortable seat on a magnificent table adorned with the finest dining ware.

You spend some time admiring the setting: the paintings on the wall, the patterns of the wallpaper, the carefully engineered lighting, and the complex ballet of the waiters. Soon enough, the first dish is served. It consists of various kinds of canapés, neatly arranged on a splendid plate. They look wonderful even if not particularly original. You have a bite, and the taste is good, not exceptional compared to your expectations, but certainly not bad either; just not especially remarkable.

You finish the plate, and a different waiter comes to the table after a while, with a new plate of other kinds of canapés. How formal, you think, how unbelievably fancy to have two rounds of appetizers before the meal has even started! The ingredients are different, but your opinion is essentially the same: excellent visual impression, classical recipes, enjoyable yet somewhat unsurprising taste.

A third plate of canapés comes in, and now you start to suspect that something is off. Why are they only serving such cocktail food? Worse, you can't figure out any logic in the contents of the plates: now some of the bites are sweet, but on the next plate everything is savory again. And the meal continues like this, with a series of plates of different kinds of hors d'oeuvres brought by various waiters.

It's not that the experience is really unpleasant. You can appreciate the setting, the lighting, and the subtle changes in atmosphere and music throughout the evening. You can also wonder about the seemingly random assortment of tastes, plates, and waiters, that comes to the table every now and then. You can also enjoy the food, which is acceptable even if not strikingly good. But after one hour and a half of this, your expectations have been building up to something more. Surely all of this has been leading to a proper dish of some kind? Alas, no: the series of appetizers continues for one more hour, you progressively realize that it's getting too late for your hopes to materialize, and then the check comes and confirms what you had feared. You feel somewhat queasy as you get up and leave the table, like when you have too many snacks in a row: you're no longer hungry, but you don't feel like you had a proper meal either. In fact, it's a bit as if you had been robbed of the opportunity of having one.

As it turns out, your friends are all thrilled about this incredible dinner experience, but it seems that you haven't been paying attention to the same things as them. For one thing, they really enjoyed the beauty of the setting, the music, and how everything was pleasing to the eye and ears. You readily concede that all of this was perfect, but you try to bring the discussion back to the food. "But wasn't the food pretty too", they ask? "Didn't it perfectly match the plates, the table and the decoration of the room?"

Your friends also loved that the restaurant staff was so varied. This is something that you had essentially missed, although you do remember that the plates were brought by many different waiters, with interesting costumes and ties and hairstyles. To your friends, the main point of the various canapés was the story that they were telling about the lives of the waiters and the relationships between them. They can discuss it for ages: "Did you understand why the short bearded guy brought the foie gras plate, although the chunks of duck magret had all been delivered by the tall blonde waitress until then?" "Oh, my interpretation is that the bearded guy has a secret duck side in him, but he's conflicted about his relationship with the bald guy who brought the veal liver."

You ask: "But why the hell did they serve chocolate mousse verrines between the foie gras and veal liver?" Of course, they reply, the reason why the skinny old waitress brought the chocolate mousse was to appease the tension between bearded guy and bald guy. "But what good did it do to the meal", you ask? And they answer: "It brings forward the side of the old waitress's character that feels guilty for the bearded guy's struggle."

You try to explain how you would have liked the meal to have a certain structure, with recognizable dishes arranged in a consistent order. Your friends pounce on this, and question you: why are you so attached to this traditional structure of a formal meal? Why should a good meal necessarily consist of a starter, a main course, and a dessert? "But the point is not the specific structure," you reply, "so much as having any kind of understandable connection between the successive dishes." Some of your friends then ask: "But don't you see how subversive it is to have served an anchovy paste toast just after a chocolate parfait? Don't you like this sort of strong political statement?" You still fail to see the radical appeal of this, given that the setting was otherwise rather consensual, and the food consisted of perfectly standard Western fare. To you, the meal didn't look like a satire of anything in particular, except maybe itself.

They ask, "but didn't you like how the meal was surprising and unpredictable?" And indeed, you have to agree that you couldn't anticipate anything, given that it appeared to be completely random. You explain how the lack of structure makes it impossible for you to summarize, or indeed to remember, the sequence of foods that you had. They disagree: to them, the meal was rich and complex, and anyway the main questions to examine are character-related, e.g., how the blonde waitress's disappearance at the middle of the meal could be linked to the increasingly important role of the bald guy in connection to the sweet and especially fruit-flavored foods.

Your friends are all eager to return to this place when they will start serving their new menu next year. To them, this meal has been building up to the great surprises that the next dinner will surely bring. "Think of all the new kinds of food that we will discover! And in particular I wonder whether we will see the blonde waitress again? I wonder whether she might bring us some scallops in a green plate, because remember that the only seafood so far had been brought by the old waitress, also in a green plate, so this could be some hint of a family relationship between them?" And when you express your lack of enthusiasm, they don't understand you: if you complained so much about the food, why aren't you hungry for more?

by a3nm at January 21, 2018 05:13 PM