August 14, 2018

Antoine Amarilli a.k.a a3nm

Finding the members of the theoretical database community with DBLP

The DBLP service is a great bibliographical tool for computer science research. In this post, I explain how to use it to prepare the list of members of a research community. I will be using the theoretical database community, whose two conferences are PODS and ICDT.

The list of publications for one edition of a conference can be found on DBLP as XML, e.g., for ICDT'18. It is then easy to use xmlstarlet to find the list of people who have published at that conference:

curl -s 'https://dblp.uni-trier.de/db/conf/icdt/icdt2018.xml' |
  xmlstarlet sel -T -t -m "//inproceedings/author" -m . -c '.' -n |
  sort | uniq

For each person in the list, we can obtain detailed XML information, including their homepage, ORCID, etc., using the DBLP API again. (This also gives us a canonical form for the name, which may appear in different ways in various inproceedings entries.) This is just a bit more complicated than it should be, because of a limitation of the DBLP search API: when queried with a name, the API sometimes inexplicably favors non-exact matches even in cases where an exact match exists. So we must filter the matches ourselves, using an exact match if one exists and a non-exact match otherwise. Of course, independently of this problem, you may be getting the wrong author, in particular because of homonyms, so these results should be taken with a grain of salt.

NAME="Antoine Amarilli"
ENAME=$(echo "$NAME" | sed 's/ /%20/g')
curl -s "https://dblp.org/search/author/api?h=1000&q=$ENAME" > matches.xml
URL=$(xmlstarlet sel -T -t -m "/result/hits/hit/info[author='$NAME']" \
    -c url -n < matches.xml | head -1)
if [[ -z "$URL" ]]
then
  URL=$(xmlstarlet sel -T -t -m /result/hits/hit/info/url \
      -c . -n < matches.xml | head -1)
fi
curl -L "${URL}.xml"

From there, we can use this to prepare a list of community members. Of course, any criterion for inclusion is completely arbitrary... My criterion to get a list of "active community members" is to select the people who have published in three different years, with at least one publication in 2015 or later. Which gives:

Click to see the list...
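For reference, here is a minimal sketch of how this criterion could be computed, assuming the author–year pairs of all PODS and ICDT editions have been collected in a (hypothetical) file author_years.tsv, with one tab-separated "author, year" line per paper, e.g., using the xmlstarlet query above:

# Keep authors with at least three distinct publication years,
# the most recent one being 2015 or later.
sort -u author_years.tsv |
  awk -F'\t' '{ years[$1]++; if ($2 > last[$1]) last[$1] = $2 }
    END { for (a in years) if (years[a] >= 3 && last[a] >= 2015) print a }' |
  sort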

Another inclusion criterion, for a "historical" list, would be to take the people who are not necessarily still active but have published over a long period, say, 10 different (not necessarily contiguous) years. Here is the resulting list, sorted by the year in which the person last published at ICDT or PODS.

Click to see the list...

Another kind of statistic that can be computed in this way is the "neighboring" conferences, i.e., the other conferences where members of the community have published. Here is the list of the top neighboring conferences of PODS and ICDT, sorted by the number of active community members who have published there at least once since 2015 (with hyperlinks and descriptions added manually; a rough sketch of how these counts can be computed is given after the list):

  • 38: SIGMOD Conference, the practical database conference held jointly with PODS
  • 34: AMW, the database theory workshop held in honor of Alberto O. Mendelzon (whom you may remember from the previous list)
  • 29: IJCAI, an AI conference
  • 28: ICALP, a theoretical CS conference on automata, languages, and programming
  • 26: LICS, another theoretical CS conference about logics
  • 20: SODA, a theoretical CS conference on algorithms
  • 19: AAAI, another AI conference
  • 19: EDBT, the practical database conference held jointly with ICDT
  • 18: WWW, a conference about the World Wide Web
  • 15: Description Logics, the workshop on description logics
  • 15: ICDE, a practical data management conference
  • 13: CIKM, an information and knowledge management conference
  • 12: SEBD, the Italian conference on databases
  • 12: STOC, a general-purpose theoretical computer science conference
  • 11: KR, a conference on knowledge representation and reasoning
  • 11: FOCS, another general-purpose theoretical computer science conference
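Here is the promised sketch of how these counts could be obtained, assuming that the DBLP XML page of each active member (as fetched with the API snippet above) has been saved in a directory authors/ (the directory name, and my reading of the person XML format, are assumptions):

# For each saved author page, list once each venue with an inproceedings
# entry in 2015 or later, then count how many authors each venue covers.
# (PODS and ICDT themselves will show up and can be filtered out.)
for f in authors/*.xml; do
  xmlstarlet sel -T -t -m "//inproceedings[year >= 2015]" -v booktitle -n "$f" |
    sort -u
done | sort | uniq -c | sort -rn | head -n 20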

It would be interesting to visualize this data differently, e.g., visualize a world map with the community members, but sadly the affiliation information in DBLP is too sparse for this to work.

by a3nm at August 14, 2018 11:23 PM

July 07, 2018

Antoine Amarilli a.k.a a3nm

Indexing encrypted email with notmuch

Since version 0.26, notmuch, the mail indexing tool that I use, makes it easy to index encrypted mail.

The original behavior was that notmuch did not index the contents of encrypted emails, as they were encrypted and it couldn't access them. This meant that you couldn't search inside encrypted emails (except for headers, e.g., the subject, recipient, etc.).

Now, notmuch is able to use gpg (and gpg-agent) to read and index the cleartext of encrypted emails. Of course, this means that notmuch's index can now be used to reconstruct encrypted emails; in particular, as notmuch stores the session keys for messages in its index, any attacker who can access the index can decrypt the messages [1]. For my use case, I think that this security risk is acceptable: I essentially see GPG as a tool to ensure that messages are not altered between the sender and recipient, my notmuch index is stored on an encrypted partition anyway, and my GPG passphrase is usually cached by gpg-agent, so an attacker who has control over my machine would be able to access the plaintext of encrypted messages quite easily.

So, if you also use notmuch, if you also have your passphrase cached by gpg-agent at least part of the time, and if you want notmuch to index the cleartext of your encrypted emails, here is what you should do. First, you should make sure that you have notmuch 0.26 or a more recent version. Second, you should tell notmuch that you want it to index the cleartext of encrypted email:

notmuch config set index.decrypt true

Beware, this configuration flag lives only in the database, not in the config file; hence, e.g., it will not be synchronized across multiple machines if you synchronize your config files from one machine to another.

Then you should reindex all encrypted email that notmuch knows about but hasn't indexed yet (this took around 15 mins in my case):

notmuch reindex tag:encrypted and not property:index.decryption=success

Of course, you will be prompted for your GPG passphrase if it isn't cached (and also possibly for the passphrase of other keys that you have used in the past). Once this has completed, you should check the encrypted messages that notmuch was still unable to index:

notmuch search tag:encrypted and not property:index.decryption=success

In my case, there were only a few that couldn't be indexed -- and usually it was because the sender had made some mistake and the messages hadn't been encrypted for my key.

From now on, notmuch new should automatically index the cleartext of incoming messages when your GPG passphrase is cached by gpg-agent. The last step is the following: if your passphrase is not cached all the time, then you should arrange for the notmuch reindex command above to be executed regularly, so that encrypted messages will eventually be indexed.

The setup described in this post led to unpleasant side effects where GPG invocations would hang, probably because notmuch tried to ask for a passphrase. To avoid this, I had to ensure that the notmuch reindex command, when run regularly, never tried to ask for a passphrase if it wasn't currently stored by the agent. I did this by setting PINENTRY_USER_DATA=none and modifying my custom pinentry script to handle this value properly. (Of course, this means that encrypted messages will not be correctly indexed when the GPG agent hasn't cached the passphrase, but the hope is that they will eventually be indexed.)
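For what it's worth, here is a rough sketch of what such a pinentry wrapper could look like (this is not my actual script, just a minimal illustration of the idea; paths and details will differ):

#!/bin/sh
# When PINENTRY_USER_DATA is "none", speak just enough of the Assuan
# protocol to cancel every prompt; otherwise defer to the real pinentry.
if [ "$PINENTRY_USER_DATA" = "none" ]; then
  echo "OK Pleased to meet you"
  while read -r cmd rest; do
    case "$cmd" in
      GETPIN) echo "ERR 83886179 Operation cancelled <Pinentry>" ;;
      BYE) echo "OK"; exit 0 ;;
      *) echo "OK" ;;
    esac
  done
else
  exec /usr/bin/pinentry "$@"
fi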

Another problem that I had is that notmuch reindex would waste CPU time by trying, on each run, to reindex the emails where it had previously failed. To avoid this, I reviewed manually the mails that couldn't be indexed, tagged them with a special tag, and excluded mails with that tag from the notmuch reindex command. I also added a crontab entry to review periodically the emails where indexing failed, so that I can tag them appropriately if the failure is expected. A more elaborate idea would be to exclude from the notmuch reindex command the emails that are too old; or maybe to script things so that, when all GPG keys are available in the agent and notmuch still cannot index a message, it tags the message so as not to try again.
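Concretely, the periodic reindexing can be a simple crontab entry along the following lines (the schedule and the index-failed tag are just examples; adapt them to your own setup):

# Retry indexing encrypted mail every hour, without ever prompting for a
# passphrase, and skipping the mails tagged as known failures.
0 * * * * PINENTRY_USER_DATA=none notmuch reindex tag:encrypted and not property:index.decryption=success and not tag:index-failed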


  1. In fact, the historical workaround to index encrypted email with notmuch was simply to arrange for it to be decrypted when it arrives. I would also be OK with the security implications of this, but I have never set it up, because it's complicated to do right, especially as my GPG passphrase isn't always available in gpg-agent's cache. Besides, I prefer to keep an original copy of the email that I receive, so I think it's cleaner to keep the encrypted messages as-is and have notmuch store in its index the additional information that it needs.

by a3nm at July 07, 2018 05:53 PM

May 09, 2018

Antoine Amarilli a.k.a a3nm

Migrating from cgit to stagit

I serve my git repositories over HTTP for people who want to browse them without having to clone them. I used to do this with cgit, which is a server-side dynamic solution written in C. It worked nicely, but lately some bots have been busy crawling these git repositories, and I regularly ran into trouble where the cgit.cgi processes ended up in a busy loop, eating 100% of CPU for unclear reasons. More generally, I had always been anxious about using a dynamic solution to serve these repositories: all the rest of my website is static, which I think is more elegant and more reassuring in terms of security.

The natural approach would be to turn cgit into a static solution by precompiling all pages whenever a git repository is updated. However, this is not reasonable: cgit allows you, e.g., to see the status of every file at every commit, or to diff any pair of commits, which would be too expensive to precompute. These features are not very useful, so I considered precompiling the pages anyway and tweaking cgit's output to suppress the useless parts, but this would have been tedious.

Fortunately, there is a better way: the stagit tool is a minimalistic variant of cgit, also written in C, which is designed to be static. So I have just removed cgit from my server and installed stagit instead. Obviously it's too early for me to say whether stagit is a perfect solution, but I'm happy with what I have seen so far. Here are some quick and messy notes about how I did it and what surprised me, in case you are considering doing the same.

Stagit is not packaged for Debian yet, but it's easy to compile and install (and the source code is rather short if you want to hack it). You will need libgit2-dev, which is packaged in Debian. I edited the source a bit to suit my needs; cf. my local fork: I changed the HTML a bit, fixed the CSS to work better on mobile displays, renamed some files, etc. It's a bit ugly to have HTML boilerplate hardcoded in the C code, but it works, and if it starts misbehaving it will be easier for me to investigate.

Stagit provides one command stagit to generate the HTML for a repository, and one command stagit-index to generate an index of the various repositories. The README is rather clear (you can also look at the manpages in the repo). Of course, you need to re-run stagit whenever a git repository is updated, so you'll need a post-receive hook like the one they provide, which I adapted to my needs. One concern is that running stagit is synchronous, i.e., when doing a git push, you must wait for stagit to complete. However, it seems to run instantly on my repositories, so that's no big deal.
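For illustration, a minimal post-receive hook could look as follows (the destination directory and the index.html symlink reflect my setup and are otherwise arbitrary):

#!/bin/sh
# Regenerate the static pages of this repository after each push.
# (A complete hook would also refresh the top-level index with stagit-index.)
repo="$PWD"
name=$(basename "$repo" .git)
dest="/var/www/git/$name"
mkdir -p "$dest" && cd "$dest" || exit 1
stagit "$repo"
ln -sf log.html index.html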

To get a nice index of the repositories, you need to edit, in each git repository, the description file with a description and the url file with the clone URL. There is also support for an owner field, but I removed this from the generated HTML as I'm the owner of all the repos I host. As the setup of a new git repository had become a bit tedious, I wrote a script for that, too.

About the url: you should know that stagit does not take care of allowing people to clone your repository. One solution is to run a git server for that (which the official stagit repository seems to do), but I didn't want it because it's not static. Instead, I intend people to clone my repositories using the dumb HTTP protocol: it only requires you to serve your git repositories with your Web server and to run git update-server-info, as can be done easily using the post-update.sample hook. So for each repository you will have the stagit version and the bare repository. However, this means that the git clone URL will be different from the stagit URL, which is a bit jarring. So I cheated using some lighttpd mod_rewrite rules to transparently do the redirection. (Note that git clone will still point out the existence of this redirect when doing the cloning, so it's not completely transparent.) Here are the rules, following this page (thanks to immae for suggesting an improvement):

  "^/git/([^/.]*)/HEAD$" => "/git/$1.git/HEAD",
  "^/git/([^/.]*)/info/(.*)$" => "/git/$1.git/info/$2",
  "^/git/([^/.]*)/objects/(.*)$" => "/git/$1.git/objects/$2",
  "^/git/([^/.]*)/git-upload-pack$" => "/git/$1.git/git-upload-pack",
  "^/git/([^/.]*)/git-receive-pack$" => "/git/$1.git/git-receive-pack",

One last thing about the migration to stagit is that I didn't want to break all the cgit URLs that used to work before. Of course, not all cgit pages have a stagit counterpart, but most of the important ones do; however, their names are a bit different. Again, this is not very robust, but here goes:

  "^/git/([^/.]*)/commit/\?id=(.*)$" => "/git/$1/commit/$2.html",
  "^/git/([^/.]*)/about(/.*)?$" => "/git/$1/file/README.html",
  "^/git/([^/.]*)/log(/.*)?$" => "/git/$1/index.html",
  "^/git/([^/.]*)/refs(/.*)?$" => "/git/$1/refs.html",
  "^/git/([^/.]*)/tree/?(\?.*)?$" => "/git/$1/files.html",
  "^/git/([^/.]*)/tree/([^?]*)(\?.*)?$" => "/git/$1/file/$2.html",
  "^/git/([^/.]*)/plain/([^?]*)(\?.*)?$" => "/git/$1/file/$2.html",
  "^/git/([^.?]*)\?.*$" => "/git/$1",
  "^/git/([^/.]*)/([^?]*)\?.*$" => "/git/$1",

So there you have it: a completely static web version of my git repositories that can also be used to clone them with the dumb HTTP transport, a hook to update the web version, a script to create a new repository, and no more problems or possible security vulnerabilities with cgit!

by a3nm at May 09, 2018 11:18 PM

April 13, 2018

Nicolas Dandrimont a.k.a olasd

docker and 127.0.0.1 in /etc/resolv.conf

I’ve been doing increasingly more stuff with docker these days, and I’ve bumped into an issue: all my systems use a local DNS resolver, one way or another: as I’m roaming and have some VPNs with split-horizon DNS, my laptop uses dnsmasq as configured by NetworkManager; as I want DNSSEC validation, my servers use a local unbound instance. (I’d prefer to use unbound everywhere, but I roam on some networks where the only DNS recursors that aren’t firewalled don’t support DNSSEC, and unbound is pretty bad at handling that unfortunately)

Docker handles 127.0.0.1 in /etc/resolv.conf the following way: it ignores the entry (upstream discussion). When there are no DNS servers left, it falls back to using 8.8.8.8 and 8.8.4.4.

This is all fine and dandy, if that’s the sort of thing you like (I don’t kinkshame), but if you don’t trust the owners of those DNS recursors, or if your network helpfully firewalls outbound DNS requests, you’ll likely want to use your own resolver instead.

The upstream docker FAQ just tells people to disable the local resolver and/or hardcode some DNS entries in the docker config. That’s not very helpful when you roam and you really want to use your local resolver… To do things properly, you’ll have to set two bits of configuration: tell docker to use the local resolver, and tell your local resolver to listen on the docker interface.

Docker DNS configuration

First, you need to configure the docker daemon to use the host as DNS server: in the /etc/docker/daemon.json file, set the dns key to ["172.17.0.1"] (or whatever IP address your docker host is set to use).
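Concretely, assuming the default docker0 address, the file would contain something like:

{
    "dns": ["172.17.0.1"]
}

The docker daemon then needs to be restarted for this setting to take effect.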

Local resolver: dnsmasq in NetworkManager

When NetworkManager is set up to use dnsmasq, it runs with a configuration that’s built dynamically, and updated when the network settings change (e.g. to switch upstream resolvers according to DHCP or VPN settings).

You can add drop-in configurations in the /etc/NetworkManager/dnsmasq.d directory. I have set the following configuration variables in a /etc/NetworkManager/dnsmasq.d/docker.conf file:

# Makes dnsmasq bind to all interfaces (but still only accept queries on localhost as per NetworkManager configuration)
bind-interfaces

# Makes dnsmasq accept queries on 172.17.0.0/16
listen-address=172.17.0.1

Restarting NetworkManager brings up a new instance of dnsmasq, which will let Docker do its thing.

Local resolver: unbound

The same sort of configuration is done with unbound. We tell unbound to listen to all interfaces and to only accept recursion from localhost and the docker subnet. In /etc/unbound/unbound.conf.d/docker.conf:

interface: 0.0.0.0
interface: ::
access-control: 127.0.0.0/8 allow
access-control: ::1/128 allow
access-control: 172.17.0.0/16 allow

Restart unbound (reload doesn’t reopen the sockets) and your docker container will have access to your local resolver.

Both these local resolver configurations should work even when the docker interface comes up after the resolver: we tell the resolver to listen to all interfaces, then only let it answer to clients on the relevant networks.

Result

Before:

$ docker run busybox nslookup pypi.python.org
Server: 8.8.8.8
Address 1: 8.8.8.8 google-public-dns-a.google.com

Name: pypi.python.org
Address 1: 2a04:4e42::223
Address 2: 2a04:4e42:200::223
Address 3: 2a04:4e42:400::223
Address 4: 2a04:4e42:600::223
Address 5: 151.101.0.223
Address 6: 151.101.64.223
Address 7: 151.101.128.223
Address 8: 151.101.192.223

After:

$ docker run busybox nslookup pypi.python.org
Server: 172.17.0.1
Address 1: 172.17.0.1

Name: pypi.python.org
Address 1: 2a04:4e42:1d::223
Address 2: 151.101.16.223

by olasd at April 13, 2018 10:04 PM

March 11, 2018

Antoine Amarilli a.k.a a3nm

An update on CalDAV and CardDAV with Radicale

This is a quick update to a previous post where I explained how to self-host your calendar and contacts using the Radicale CalDAV and CardDAV server, and how to access them on Android devices with DAVdroid.

Three years later, I am still using this setup. I only use my Android phone to access the calendar and contacts, so the Radicale server is essentially a way to back the contacts and calendars up, although I have also tried accessing them, e.g., with Evolution. Over these three years, DAVdroid has evolved and gotten a bit more user-friendly and stable, though I have had a few problems (e.g., duplicated calendar events). Radicale has evolved too; I'm currently at version 1.1.1, which is the one provided by Debian, even though it is really outdated. (Also, as of this writing, Radicale is not available in the Debian testing repos, see here, but it can be installed from Debian stable.)

The main change that I did is on the server. In the old guide, I explained how to set up Radicale so that it listens on port 5232, manages authentication and encryption, and DAVdroid connects to it directly. I have changed this setup so that DAVdroid now connects to Apache2, which manages authentication and encryption, and talks to Radicale using WSGI. This has a number of advantages:

  • You can encrypt the connection with SSL managed by Apache, e.g., using Let's Encrypt, without self-signed certificates or other ad-hoc setup; and you don't need to trust Radicale to do the encryption correctly.
  • The server listens on the standard HTTPS port (443) rather than the custom Radicale port (5232) so the connections aren't blocked on unfriendly networks.
  • You can use vhosts, e.g., to host it on a subdomain.
  • Authentication is managed by Apache, not Radicale. This is somewhat reassuring: even if Radicale has a massive security flaw, only users that correctly authenticated with Apache can talk to it at all.
  • The most important point: with the old setup, Radicale would inexplicably hang every now and then, presumably when the phone disconnected messily from it. (I think it is this bug). With the new setup, this does not happen. (Maybe the bug has been fixed in more recent Radicale versions anyway, I don't know.)

Of course, the downside of this new setup is that you need Apache just to route requests to Radicale. As I needed Apache for other purposes, though, I didn't mind.

The setup

I didn't document this setup while I was doing it, so here is a hopefully complete description of what I currently have.

You need to install Apache, and enable the SSL and WSGI and auth_basic modules (run as root a2enmod ssl and a2enmod wsgi and a2enmod auth_basic and service apache2 restart). Of course, basic HTTP authentication may sound insecure, but we will only be doing it over HTTPS.
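Concretely, the module setup amounts to something like this (run as root):

# Enable the required Apache modules and restart the server.
a2enmod ssl wsgi auth_basic
service apache2 restart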

You should set up Let's Encrypt certificates (e.g., with certbot), something I mentioned in this previous guide.

Of course you need to install radicale. We are going to put all radicale-related stuff in /srv/radicale, but of course this can be changed. The files in this directory should be readable and writable by the Web server.

You then need to create a file in /etc/apache2/sites-enabled whose contents look as follows:

<IfModule mod_ssl.c>
<VirtualHost *:443>
        ServerName dav.example.com

        ServerAdmin youremail@example.com
        DocumentRoot /var/www/html/

        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

        WSGIDaemonProcess radicale user=www-data group=www-data threads=1
        WSGIScriptAlias / /srv/radicale/radicale.wsgi

        <Directory /srv/radicale/>
            WSGIProcessGroup radicale
            WSGIApplicationGroup %{GLOBAL}
            AllowOverride None
            AuthType basic
            AuthName "dav.example.com"
            AuthUserFile /srv/radicale/passwd
            Require user youruser
            SSLRequireSSL
        </Directory>

SSLCertificateFile /etc/letsencrypt/live/example.com/fullchain.pem
SSLCertificateKeyFile /etc/letsencrypt/live/example.com/privkey.pem
Include /etc/letsencrypt/options-ssl-apache.conf
</VirtualHost>
</IfModule>

The file /srv/radicale/passwd contains the usernames and passwords of the users who can access the server, managed as usual with the htpasswd utility. The file /srv/radicale/radicale.wsgi contains the invocation to run Radicale and points to the config file, as follows:

import radicale
configuration = radicale.config.read(["/srv/radicale/config"])
radicale.log.start()
application = radicale.Application()
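For instance, the password file can be created with the htpasswd utility (from the apache2-utils package) as follows:

# -c creates the file, so only pass it the first time.
htpasswd -c /srv/radicale/passwd youruser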

To create the config file, you can, e.g., write the following in /srv/radicale/config:

[encoding]
request = utf-8
stock = utf-8

[rights]
type = owner_only

[storage]
type = filesystem
filesystem_folder = /srv/radicale/collections

[logging]
config = /srv/radicale/logging

In this file, /srv/radicale/collections contains the Radicale collections as in the old guide. The file /srv/radicale/logging contains the radicale logging configuration. Here is mine:

# inspired by https://github.com/Kozea/Radicale/issues/266#issuecomment-121170414
[loggers]
keys = root

[handlers]
keys = file

[formatters]
keys = full

[logger_root]
level = DEBUG
handlers = file

[handler_file]
args = ('/srv/radicale/logs/radicale.log','a',32768,3)
level = INFO
class = handlers.RotatingFileHandler
formatter = full

[formatter_full]
format = %(asctime)s - %(levelname)s: %(message)s

In the above, /srv/radicale/logs is where you want radicale to write its log files. You probably need to specify it manually, because radicale is run by the Web server, which may not have the right to write, e.g., to /var/log/radicale as the default configuration would do.

by a3nm at March 11, 2018 07:32 PM

February 27, 2018

Antoine Amarilli a.k.a a3nm

SWERC 2017 and 2018

I just realized I hadn't mentioned here something that kept me busy over the autumn months. With my university, Télécom ParisTech, and my colleagues Bertrand Meyer and Pierre Senellart, we organized the SWERC programming contest in November 2017, and will do so again in December 2018. SWERC is the South-Western Europe Regional Contest of the ACM ICPC, the most famous competitive programming competition for university students. You can read more about the contest here. We welcomed 76 teams of three contestants each, from 48 institutions in France, Israel, Italy, Portugal, Spain, and Switzerland. The top three teams in the ranking, from ENS Paris, ETH Zürich, and SNS Pisa, will compete in the ICPC world finals in Beijing.

The Télécom student association Comète has made a very nice video covering SWERC'17, which came out recently and gives a good idea of what the contest was like. You can watch it on YouTube, or download it directly if you prefer.

If you like competitive programming, you can have a look at the SWERC'17 problems on our website, or on UVa Online Judge or ACM-ICPC Live Archive. And if you are from a university in South-Western Europe and are eligible to participate, then we'd be glad to see you compete at SWERC'18! Registrations will open here in early September 2018.

by a3nm at February 27, 2018 10:38 PM

February 25, 2018

Nicolas Dandrimont a.k.a olasd

Report from Debian SnowCamp: day 3

[Previously: day 1, day 2]

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

As a starter, and on request from Valhalla, please enjoy an attempt at a group picture (unfortunately, missing a few people). Yes, the sun even showed itself for a few moments today!

One of the numerous SnowCamp group pictures

As for today’s activities… I’ve cheated a bit by doing stuff after sending yesterday’s report and before sleep: I reviewed some of Stefano’s dc18 pull requests; I also papered over (rather than fixed) the debexpo uscan bug.

After keeping eyes closed for a few hours, the day was then spent tickling the python-gitlab module, packaged by Federico, in an attempt to resolve https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890594 in a generic way.

The features I intend to implement are mostly inspired from jcowgill’s multimedia-cli:

  • per-team yaml configuration of “expected project state” (access level, hooks and other integrations, enablement of issues, merge requests, CI, …)
  • new repository creation (according to a team config or a personal config, e.g. for collab-maint, the Debian group)
  • audit of project configurations
  • mass-configuration changes for projects

There could also be some use for bits of group management, e.g. to handle the access control of the DebConf group and its subgroups, although I hear Ganneff prefers shell scripts.

My personal end goal is to (finally) do the 3D printer team repository migration, but e.g. the Python team would like to update configuration of all repos to use the new KGB hook instead of irker, so some generic interest in the tool exists.

As the tool has a few dependencies (because I really have better things to do than reimplement another wrapper over the GitLab API) I’m not convinced devscripts is the right place for it to live… We’ll see when I have something that does more than print a list of projects to show!

In the meantime, I have the feeling Stefano has lined up a new batch of DebConf website pull requests for me, so I guess that’s what I’m eating for breakfast “tomorrow”… Stay tuned!

My attendance to SnowCamp is in part made possible by donations to the Debian project. If you want to keep the project going, please consider donating, joining the Debian partners program, or sponsoring the upcoming Debian Conference.

by olasd at February 25, 2018 12:49 AM

February 24, 2018

Nicolas Dandrimont a.k.a olasd

Report from Debian SnowCamp: day 2

[Previously: day 1]

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

Today’s pièce de résistance was the long overdue upgrade of the machine hosting mentors.debian.net to (jessie then) stretch. We’ve spent most of the afternoon doing the upgrades with Mattia.

The first upgrade to jessie was a bit tricky because we had to clean up a lot of cruft that had accumulated over the years. I even managed to force an unexpected database restore test 😇. After a few code fixes, and getting annoyed at apache2.4 for ignoring VirtualHost configs that don’t end with .conf (and losing an hour of debugging time in the process…), we managed to restore the functionality of the website.

We then did the stretch upgrade, which was somewhat smooth sailing in comparison… We had to remove some functionality which depended on packages that didn’t make it to stretch: fedmsg, and the SOAP interface. We also noticed that the gpg2 transition completely broke the… “interesting” GPG handling of mentors… An install of gnupg1 later everything should be working as it was before.

We’ve also tried to tackle our current need for a patched FTP daemon. To do so, we’re switching the default upload queue directory from / to /pub/UploadQueue/. Mattia has submitted bugs for dput and dupload, and will upload an updated dput-ng to switch the default. Hopefully we can do the full transition by the next time we need to upgrade the machine.

Known bugs: the uscan plugin now fails to parse the uscan output… But at least it “supports” version=4 now 🙃

Of course, we’re still sorely lacking volunteers who would really care about mentors.debian.net; the codebase is a pile of hacks upon hacks upon hacks, all relying on an old version of a deprecated Python web framework. A few attempts have been made at a smooth transition to a more recent framework, without really panning out, mostly for lack of time on the part of the people running the service. I’m still convinced things should restart from scratch, but I don’t currently have the energy or time to drive it… Ugh.

More stuff will happen tomorrow, but probably not on mentors.debian.net. See you then!

My attendance to SnowCamp is in part made possible by donations to the Debian project. If you want to keep the project going, please consider donating, joining the Debian partners program, or sponsoring the upcoming Debian Conference.

by olasd at February 24, 2018 12:15 AM

February 22, 2018

Nicolas Dandrimont a.k.a olasd

Report from Debian SnowCamp: day 1

Thanks to Valhalla and other members of LIFO, a bunch of fine Debian folks have convened in Laveno, on the shores of Lake Maggiore, for a nice weekend of relaxing and sprinting on various topics, a SnowCamp.

This morning, I arrived in Milan at “omfg way too early” (5:30AM, thanks to a 30 minute early (!) night train), and used the opportunity to walk the empty streets around the Duomo while the Milanese .oO(mapreri) were waking up. This gave me the opportunity to take very nice pictures of monuments without people, which is always appreciated!

 

After a short train ride to Laveno, we arrived at the Hostel at around 10:30. Some people had already arrived the day before, so there already was a hacking kind of mood in the air.  I’d post a panorama but apparently my phone generated a corrupt JPEG 🙄

After rearranging the tables in the common spaces to handle power distribution correctly (♥ Gaffer Tape), we could start hacking!

Today’s efforts were focused on the DebConf website: there were a bunch of pull requests made by Stefano that I reviewed and merged.

I’ve also written a modicum of code.

Finally, I have created the Debian 3D printing team on salsa in preparation for migrating our packages to git. But now is time to do the sleep thing. See you tomorrow?

My attendance to SnowCamp is in part made possible by donations to the Debian project. If you want to keep the project going, please consider donating, joining the Debian partners program, or sponsoring the upcoming Debian Conference.

by olasd at February 22, 2018 10:40 PM

February 20, 2018

Nicolas Dandrimont a.k.a olasd

Listing and loading of Debian repositories: now live on Software Heritage

Software Heritage is the project for which I’ve been working during the past two and a half years now. The grand vision of the project is to build the universal software archive, which will collect, preserve and share the Software Commons.

Today, we’ve announced that Software Heritage is archiving the contents of Debian daily. I’m reposting this article on my blog as it will probably be of interest to readers of Planet Debian.

TL;DR: Software Heritage now archives all source packages of Debian as well as its security archive daily. Everything is ready for archival of other Debian derivatives as well. Keep on reading to get details of the work that made this possible.

History

When we first announced Software Heritage, back in 2016, we had archived the historical contents of Debian as present on the snapshot.debian.org service, as a one-shot proof of concept import.

This code was then left in a drawer and never touched again, until last summer when Sushant came to do an internship with us. We’ve had the opportunity to rework the code that was originally written, and to make it more generic: instead of being tied to the specifics of snapshot.debian.org, the code can now work with any Debian repository. This means that we could now archive any of the numerous Debian derivatives that are available out there.

This has been live for a few months, and you can find Debian package origins in the Software Heritage archive now.

Mapping a Debian repository to Software Heritage

The main challenge in listing and saving Debian source packages in Software Heritage is mapping the content of the repository to the generic source history data model we use for our archive.

Organization of a Debian repository

Before we start looking at a bunch of unpacked Debian source packages, we need to know how a Debian repository is actually organized.

At the top level of a Debian repository lies a set of suites, representing versions of the distribution, that is to say a set of packages that have been tested and are known to work together. For instance, Debian currently has 6 active suites, from wheezy (the “old old stable” version), all the way up to experimental; Ubuntu has 8, from precise (12.04 LTS), up to bionic (the future 18.04 release), as well as a devel suite. Each of those suites also has a bunch of “overlay” suites, such as backports, which are made available in the archive alongside full suites.

Under the suites, there’s another level of subdivision, which Debian calls components, and Ubuntu calls areas. Debian uses its components to segregate packages along licensing terms (main, contrib and non-free), while Ubuntu uses its areas to denote the level of support of the packages (main, universe, multiverse, …).

Finally, components contain source packages, which merge upstream sources with distribution-specific patches, as well as machine-readable instructions on how to build the package.

Organization of the Software Heritage archive

The Software Heritage archive is project-centric rather than version-centric. What this means is that we are interested in keeping the history of what was available in software origins, which can be thought of as a URL of a repository containing software artifacts, tagged with a type representing the means of access to the repository.

For instance, the origin for the GitHub mirror of the Linux kernel repository has the following data:

  • type: git
  • url: https://github.com/torvalds/linux

For each visit of an origin, we take a snapshot of all the branches (and tagged versions) of the project that were visible during that visit, complete with their full history. See for instance one of the latest visits of the Linux kernel. For the specific case of GitHub, pull requests are also visible as virtual branches, so we fetch those as well (as branches named refs/pull/<pull request number>/head).

Bringing them together

As we’ve seen, Debian archives (just as well as archives for other “traditional” Linux distributions) are release-centric rather than package-centric. Mapping distributions to the Software Heritage archive therefore takes a little bit of gymnastics, to transpose the list of source packages available in each suite to a list of available versions per source package. We do this step by step:

  1. Download the Sources indices for all the suites and components known in the Debian repository
  2. Parse the Sources indices, listing all source packages inside
  3. For each source package, tell the Debian loader to load all the available versions (grouped by name), generating a complete snapshot of the state of the source package across the Debian repository

The source packages are mapped to origins using the following format:

  • type: deb
  • url: deb://<repository name>/packages/<source package name> (e.g. deb://Debian/packages/linux)

We use a repository name rather than the actual URL to a repository so that links can persist even if a given mirror disappears.

Loading Debian source packages

To load Debian source packages into the Software Heritage archive, we have to convert them: Debian-based distributions distribute source packages as a set of files, namely a dsc (Debian Source Control) file and a set of tarballs (usually, an upstream tarball and a Debian-specific overlay). On the other hand, Software Heritage only stores version-control information such as revisions, directories, and files.

Unpacking the source packages

Our philosophy at Software Heritage is to store the source code of software in the precise form that allows a developer to start working on it. For Debian source packages, this is the unpacked source code tree, with all patches applied. After checking that the files we have downloaded match the checksums published in the index files, we simply use dpkg-source -x to extract the source package, with patches applied, ready to build. This also means that we currently fail to import packages that don’t extract with the version of dpkg-source available in Debian Stretch.
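For illustration, the extraction step amounts to something like the following (package name and version are placeholders; the .dsc and the tarballs it references must be in the same directory):

# Extract the source package, with Debian patches applied, into hello-2.10/
dpkg-source -x hello_2.10-1.dsc hello-2.10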

Generating a synthetic revision

After walking the extracted source package tree, computing identifiers for all its contents, we get the identifier of the top-level tree, which we will reference in the synthetic revision.

The synthetic revision contains the “reproducible” metadata that is completely intrinsic to the Debian source package. With the current implementation, this means:

  • the author of the package, and the date of modification, as referenced in the last entry of the source package changelog (referenced as author and committer)
  • the original artifact (i.e. the information about the original source package)
  • basic information about the history of the package (using the parsed changelog)

However, we never set parent revisions in the synthetic commits, for two reasons:

  • there is no guarantee that packages referenced in the changelog have been uploaded to the distribution, or imported by Software Heritage (our update frequency is lower than that of the Debian archive)
  • even if this guarantee existed, and all versions of all packages were available in Software Heritage, there would be no guarantee that the version referenced in the changelog is indeed the version we imported in the first place

This makes the information stored in the synthetic revision fully intrinsic to the source package, and reproducible. In turn, this allows us to keep a cache, mapping the original artifacts to synthetic revision ids, to avoid loading packages again once we have loaded them once.

Storing the snapshot

Finally, we can generate the top-level object in the Software Heritage archive, the snapshot. For instance, you can see the snapshot for the latest visit of the glibc package.

To do so, we generate a list of branches by concatenating the suite, the component, and the version number of each detected source package (e.g. stretch/main/2.24-10 for version 2.24-10 of the glibc package available in stretch/main). We then point each branch to the synthetic revision that was generated when loading the package version.

In case a version of a package fails to load (for instance, if the package version disappeared from the mirror between the moment we listed the distribution, and the moment we could load the package), we still register the branch name, but we make it a “null” pointer.

There are still some improvements to make to the lister specific to Debian repositories: it currently hardcodes the list of components/areas in the distribution, as the repository format provides no programmatic way of eliciting them. Currently, only Debian and its security repository are listed.

Looking forward

We believe that the model we developed for the Debian use case is generic enough to capture not only Debian-based distributions, but also RPM-based ones such as Fedora, Mageia, etc. With some extra work, it should also be possible to adapt it for language-centric package repositories such as CPAN, PyPI or Crates.

Software Heritage is now well on the way to providing the foundations for a generic and unified source browser for the history of traditional package-based distributions.

We’ll be delighted to welcome contributors that want to lend a hand to get there.

by olasd at February 20, 2018 01:52 PM