Scraping GNU Mailman Pipermail Email List Archives

Sunday September 21, 2014

I worked with Code for Progress fellow Casidy at a recent Code for DC civic hacknight on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman with its Pipermail web archives for several email lists such as commotion-discuss.

We used Python's lxml for the first pass: scraping the archive file URLs from the list's index page. The process was then made more interesting by the fact that most of the monthly archives are gzip'ed. Instead of saving the gzip'ed files to disk and then gunzip'ing them, we used Python's gzip and StringIO modules to decompress them in memory. The result is the full text history of a specified email list, ready for further processing. Here's the code we came up with:

#!/usr/bin/env python
# Python 2: the StringIO import and print statement below won't run under Python 3.

import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = 'https://lists.chambana.net/pipermail/' + listname + '/'

response = requests.get(url)
tree = html.fromstring(response.text)

# The third column of the Pipermail index table holds the "Downloadable
# version" links: one .txt or .txt.gz file per month.
filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    print filename
    response = requests.get(url + filename)
    if filename.endswith('.gz'):
        # Decompress in memory rather than writing the .gz to disk first.
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents

# Pipermail lists the newest month first; reverse for chronological order.
contents = [emails_from_filename(filename) for filename in filenames]
contents.reverse()

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)
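If you're on Python 3, the StringIO module is gone and `response.content` is bytes, so the decompression step changes slightly. Here's a sketch of just that step (the helper name and the sample archive body are mine, not from the original script), using the standard library's `gzip.decompress` on an in-memory payload:

```python
import gzip


def decompress_if_gzipped(filename, payload):
    """Return the archive text as bytes, inflating in memory when gzip'ed."""
    if filename.endswith('.gz'):
        # Equivalent to gzip.GzipFile(fileobj=io.BytesIO(payload)).read()
        return gzip.decompress(payload)
    return payload


# Round-trip check with a synthetic mbox-style body:
body = b'From someone at example.com  Sun Sep 21 12:00:00 2014\nHello, list!\n'
assert decompress_if_gzipped('2014-September.txt.gz', gzip.compress(body)) == body
assert decompress_if_gzipped('2014-September.txt', body) == body
```

The rest of the script only needs `print(filename)` and joining the month contents with `b"\n\n\n\n"` (or decoding each month to str) to work the same way.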

Christian asked about a license for the above, which I hadn't considered. So let's consider it CC0 (Public Domain).

This post was originally hosted elsewhere.