November 08, 2018

Nicolas Dandrimont a.k.a olasd

Record number of uploads of a Debian package in an arbitrary 24-hour window

Since Dimitri has given me the SQL virus I have a hard time avoiding opportunities for twisting my brain.

Seeing the latest post from Chris Lamb made me wonder: how hard would it be to do better? Splitting by date is rather arbitrary (the split may even depend on the timezone you’re using when running the query), so let’s try to find the maximum number of uploads that happened for each package in any 24-hour window.

First, for each upload, we get how many uploads of the same package happened in the subsequent 24 hours.

SELECT
  source,
  date,
  (
    SELECT
      count(*)
    FROM
      upload_history AS other_upload
    WHERE
      other_upload.source = first_upload.source
      AND other_upload.date >= first_upload.date
      AND other_upload.date < first_upload.date + interval '24 hours'
  ) AS count
FROM
  upload_history AS first_upload

For each source package, we want the maximum count of uploads in a 24-hour window.

SELECT
  source,
  max(count)
FROM
  upload_counts
GROUP BY
  source

We can then join both queries together to get the 24-hour window in which the most uploads of a given source package have happened.

WITH upload_counts AS (
  SELECT
    source,
    date,
    (
      SELECT
        count(*)
      FROM
        upload_history AS other_upload
      WHERE
        other_upload.source = first_upload.source
        AND other_upload.date >= first_upload.date
        AND other_upload.date < first_upload.date + interval '24 hours'
    ) AS count
  FROM
    upload_history AS first_upload
)
SELECT
  source,
  date,
  count
FROM
  upload_counts
INNER JOIN (
  SELECT
    source,
    max(count) AS max_uploads
  FROM
    upload_counts
  GROUP BY
    source
  ) AS m
  USING (source)
WHERE
  count = max_uploads
  AND max_uploads >= 9
ORDER BY
  max_uploads DESC,
  date ASC;

The results are almost the same as the ones Chris found, but cl-sql and live-config now have one more upload than live-boot.

       source       |          date          | count 
--------------------+------------------------+-------
 cl-sql             | 2004-04-17 03:34:52+00 |    14
 live-config        | 2010-07-15 17:19:11+00 |    14
 live-boot          | 2010-07-15 17:17:07+00 |    13
 zutils             | 2010-12-30 17:33:45+00 |    11
 belocs-locales-bin | 2005-03-20 21:05:44+00 |    10
 openerp-web        | 2010-12-30 17:32:07+00 |    10
 debconf            | 1999-09-25 18:52:37+00 |     9
 gretl              | 2000-06-16 18:53:11+00 |     9
 posh               | 2002-07-24 17:04:46+00 |     9
 module-assistant   | 2003-09-11 05:53:18+00 |     9
 live-helper        | 2007-04-20 18:16:38+00 |     9
 dxvk               | 2018-11-06 00:04:02+00 |     9
(12 lines)

Thanks to Adrian and Chris for the involuntary challenge!

by olasd at November 08, 2018 10:56 PM

August 24, 2018

Antoine Amarilli a.k.a a3nm

What's wrong with academia?

I have just finished writing up a long document that tries to give a comprehensive list of problems affecting academic research.

Writing this document is something I had been meaning to do for a very long time, almost since I got started in academia in 2012 with my master's internship. Many academic practices already made no sense to me at the time, e.g., hiding research articles behind paywalls rather than simply hosting them online. I tried to ignore these concerns for a while and did my PhD without questioning the order of things too much: for some practices I eventually saw a justification, but for many others I did not, and they made me more and more uneasy. So I thought I should eventually come back to these problems and re-examine my beliefs about the way academia works. Hence this long list of all the problems that annoy me, which I will try to keep up to date as time passes and my experience evolves.

Of course, complaining is always easy, so I have also tried to give some thought in the document about ways to fix these problems. The document does not mention my own initiatives in this direction (e.g., refusing to review for closed-access venues), which I will eventually write up separately.

So I encourage you to have a look at the document, What's wrong with academia?, and share with me any feedback that you may have!

by a3nm at August 24, 2018 09:51 AM

August 14, 2018

Antoine Amarilli a.k.a a3nm

Finding the members of the theoretical database community with DBLP

The DBLP service is a great bibliographical tool for computer science research. In this post, I explain how to use it to prepare the list of members of a research community. I will be using the theoretical database community, whose two conferences are PODS and ICDT.

The list of publications for one edition of a conference can be found on DBLP as XML, e.g., for ICDT'18. It is then easy to use xmlstarlet to find the list of people who have published at that conference:

curl -s 'https://dblp.uni-trier.de/db/conf/icdt/icdt2018.xml' |
  xmlstarlet sel -T -t -m "//inproceedings/author" -m . -c '.' -n |
  sort | uniq

For each person in the list, we can obtain detailed XML information, including their homepage, ORCID, etc., using the DBLP API again. (This also gives us a canonical form for the name, which may appear in different ways in various inproceedings entries.) This is just a bit more complicated than it should be, because of a limitation of the DBLP search API: when queried with a name, the API sometimes inexplicably favors non-exact matches even in cases where an exact match exists. So we must filter the matches ourselves, using an exact match if one exists and a non-exact match otherwise. Of course, independently of this problem, you may be getting the wrong author, in particular because of homonyms, so these results should be taken with a grain of salt.

NAME="Antoine Amarilli"
ENAME=$(echo "$NAME" | sed 's/ /%20/g')
curl -s "https://dblp.org/search/author/api?h=1000&q=$ENAME" > matches.xml
URL=$(xmlstarlet sel -T -t -m "/result/hits/hit/info[author='$NAME']" \
    -c url -n < matches.xml | head -1)
if [[ -z "$URL" ]]
then
  URL=$(xmlstarlet sel -T -t -m /result/hits/hit/info/url \
      -c . -n < matches.xml | head -1)
fi
curl -L "${URL}.xml"

From there, we can use this to prepare a list of community members. Of course, any criterion for inclusion is completely arbitrary... My criterion to get a list of "active community members" is to select those who have published in three different years, with at least one publication in 2015 or later (a sketch of how to check this is given below). Which gives:

Click to see the list...
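
For reference, here is a minimal sketch of how this criterion can be checked for one author, assuming that their record has been fetched by the previous snippet into a file person.xml (the file name is just for the example, and the NAME variable is reused from above), and that the record follows the usual structure of DBLP author pages, with one r element per publication:

# list the distinct years of the author's PODS/ICDT papers
YEARS=$(xmlstarlet sel -T -t \
    -m "//r/inproceedings[booktitle='PODS' or booktitle='ICDT']" \
    -v year -n < person.xml | sort -n | uniq)
NYEARS=$(echo "$YEARS" | grep -c .)
LAST=$(echo "$YEARS" | tail -1)
# active: at least three distinct years, the latest being 2015 or later
if [ "$NYEARS" -ge 3 ] && [ "$LAST" -ge 2015 ]
then
  echo "$NAME is an active community member"
fi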

Another inclusion criterion, for a "historical" list, would be the set of people who are not necessarily still active but have published over a long period, say, 10 different (not necessarily contiguous) years. Here is the resulting list, sorted by the year in which the person last published at ICDT or PODS.

Click to see the list...

Another kind of statistic that can be computed in this way is the "neighboring" conferences, i.e., the other conferences where members of the community have published. Here is the list of the top neighboring conferences of PODS and ICDT, sorted by the number of active community members who have published there at least once since 2015, with hyperlinks and descriptions added manually (a sketch of the computation follows the list):

  • 38: SIGMOD Conference, the practical database conference held jointly with PODS
  • 34: AMW, the database theory workshop held in honor of Alberto O. Mendelzon (whom you may remember from the previous list)
  • 29: IJCAI, an AI conference
  • 28: ICALP, a theoretical CS conference on logics and automata
  • 26: LICS, another theoretical CS conference about logics
  • 20: SODA, a theoretical CS conference on algorithms
  • 19: AAAI, another AI conference
  • 19: EDBT, the practical database conference held jointly with ICDT
  • 18: WWW, a conference about the World Wide Web
  • 15: Description Logics, the workshop on description logics
  • 15: ICDE, a practical data management conference
  • 13: CIKM, an information and knowledge management conference
  • 12: SEBD, the Italian conference on databases
  • 12: STOC, a general-purpose theoretical computer science conference
  • 11: KR, a conference on knowledge representation and reasoning
  • 11: FOCS, another general-purpose theoretical computer science conference
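
Here is a rough sketch of how these counts can be obtained, assuming that the records of the active members have been saved as one XML file per person under a records/ directory (this layout is just an assumption for the example):

# for each member, the venues other than PODS/ICDT where they published
# since 2015 (each venue counted once per member), then count members per venue
for f in records/*.xml
do
  xmlstarlet sel -T -t \
      -m "//r/inproceedings[year>=2015][booktitle!='PODS'][booktitle!='ICDT']" \
      -v booktitle -n < "$f" | sort | uniq
done | sort | uniq -c | sort -rn | head -n 20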

It would be interesting to visualize this data differently, e.g., visualize a world map with the community members, but sadly the affiliation information in DBLP is too sparse for this to work.

by a3nm at August 14, 2018 11:23 PM

July 07, 2018

Antoine Amarilli a.k.a a3nm

Indexing encrypted email with notmuch

Since version 0.26, notmuch, the mail indexing tool that I use, makes it easy to index encrypted mail.

The original behavior was that notmuch did not index the contents of encrypted emails, as they were encrypted and it couldn't access them. This meant that you couldn't search inside encrypted emails (except for headers, e.g., the subject, recipient, etc.).

Now, notmuch is able to use gpg (and gpg-agent) to read and index the cleartext of encrypted emails. Of course, this means that notmuch's index can now be used to reconstruct encrypted emails; in particular, as notmuch stores the session keys for messages in its index, any attacker who can access the index can decrypt the messages [1]. For my use case, I think this security risk is acceptable: I essentially see GPG as a tool to ensure that messages are not altered between the sender and the recipient, my notmuch index is stored on an encrypted partition anyway, and my GPG passphrase is usually cached by gpg-agent, so an attacker with control over my machine could access the plaintext of encrypted messages quite easily in any case.

So, if you also use notmuch, if you also have your passphrase cached by gpg-agent at least part of the time, and if you want notmuch to index the cleartext of your encrypted emails, here is what you should do. First, you should make sure that you have notmuch 0.26 or a more recent version. Second, you should tell notmuch that you want it to index the cleartext of encrypted email:

notmuch config set index.decrypt true

Beware: this configuration flag lives only in the database, not in the config file; hence it will not, for instance, be synchronized across machines if you synchronize your config files from one machine to another.
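
You can check the value that is currently stored with:

notmuch config get index.decrypt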

Then you should reindex all encrypted email that notmuch knows about but hasn't indexed yet (this took around 15 mins in my case):

notmuch reindex tag:encrypted and not property:index.decryption=success

Of course, you will be prompted for your GPG passphrase if it isn't cached (and possibly also for the passphrases of other keys that you have used in the past). Once this has completed, you should check the encrypted messages that notmuch was still unable to index:

notmuch search tag:encrypted and not property:index.decryption=success

In my case, there were only a few that couldn't be indexed -- usually because they hadn't been encrypted to my key, the sender having made some mistake.

From now on, notmuch new should automatically index the cleartext of incoming messages when your GPG passphrase is cached by gpg-agent. The last step is the following: if your passphrase is not cached all the time, then you should arrange for the notmuch reindex command above to be executed regularly, so that encrypted messages will eventually be indexed.

The setup described in this post led to an unpleasant side effect where GPG invocations would hang, probably because notmuch was trying to ask for a passphrase. To avoid this, I had to ensure that the notmuch reindex command, when run regularly, never tried to ask for a passphrase if it wasn't currently cached by the agent. I did this by setting PINENTRY_USER_DATA=none and modifying my custom pinentry script to handle this value properly. (Of course, this means that encrypted messages will not be correctly indexed when the GPG agent hasn't cached the passphrase, but the hope is that they will eventually be indexed.)
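
To make this concrete, the recurring job could be a crontab entry along these lines (the schedule is arbitrary, and my real setup also excludes known-bad messages, as explained below):

# try to reindex not-yet-decrypted messages once an hour, without ever prompting
17 * * * * PINENTRY_USER_DATA=none notmuch reindex tag:encrypted and not property:index.decryption=success

Schematically, the pinentry side can then dispatch on that variable, as in the following simplified sketch (the fallback pinentry program is just an example):

#!/bin/sh
# refuse to prompt when PINENTRY_USER_DATA=none, otherwise run a real pinentry
if [ "$PINENTRY_USER_DATA" = "none" ]
then
  # speak just enough of the pinentry protocol to fail cleanly
  echo "OK Pleased to meet you"
  while read cmd rest
  do
    case "$cmd" in
      GETPIN) echo "ERR 83886179 Operation cancelled" ;;
      BYE) echo "OK"; exit 0 ;;
      *) echo "OK" ;;
    esac
  done
else
  exec /usr/bin/pinentry-gnome3 "$@"
fi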

Another problem I had is that notmuch reindex would waste CPU time by trying, on every run, to reindex the emails where it had previously failed. To avoid this, I manually reviewed the mails that couldn't be indexed, tagged them with a special tag, and then excluded mails with that tag from the notmuch reindex command. I also added a crontab entry to periodically review the emails where indexing failed, so that I can tag them appropriately when the failure is expected. A more elaborate idea would be to exclude from the notmuch reindex command the emails that are too old; or maybe to script things so that, when all GPG keys are available in the agent but notmuch still cannot index a message, the message gets tagged so as not to be tried again.
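
For instance, with a hypothetical tag called index-failed on the messages that are known not to be indexable, the recurring command becomes:

notmuch reindex tag:encrypted and not property:index.decryption=success and not tag:index-failed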


[1] In fact, the historical workaround to index encrypted email with notmuch was simply to arrange for it to be decrypted when it arrives. I would also be OK with the security implications of this, but I have never set it up, because it's complicated to do right, especially as my GPG passphrase isn't always available in gpg-agent's cache. Besides, I prefer to keep an original copy of the email that I receive, so I think it's cleaner to keep the encrypted messages as-is and have notmuch store in its index the additional information that it needs.

by a3nm at July 07, 2018 05:53 PM