Scraping GNU Mailman Pipermail Email List Archives
Sunday September 21, 2014
At a recent Code for DC civic hacknight, I worked with Code for Progress fellow Casidy on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman, with its Pipermail web archives, for several email lists such as commotion-discuss.
We used Python's lxml for a first-pass scrape of all the archive file URLs. Things got more interesting because most of the monthly archives are gzip'ed. Instead of saving the gzip'ed files to disk and then gunzip'ing them, we decompressed them in memory with Python's gzip and StringIO modules. The result is the full text history of a specified email list, ready for further processing. Here's the code we came up with:
#!/usr/bin/env python
import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = 'https://lists.chambana.net/pipermail/' + listname + '/'

response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td/a/@href')

def emails_from_filename(filename):
    print filename
    response = requests.get(url + filename)
    if filename[-3:] == '.gz':
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents

contents = [emails_from_filename(filename) for filename in filenames]
contents.reverse()

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)
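The script above is written for Python 2 (the `print` statement and the `StringIO` module). For anyone running Python 3, here is a minimal sketch of just the in-memory gunzip and join steps, using `io.BytesIO` in place of `StringIO`. The function names `decode_archive` and `join_archives` are hypothetical and not part of the original script, and the sketch assumes the downloaded bodies are handled as bytes (for example `response.content` from requests) and that the list of archives arrives newest-first, as the `reverse()` call in the original suggests.

```python
import gzip
import io


def decode_archive(filename, payload):
    """Return the bytes of one archive file, gunzipping in memory if needed.

    `payload` is the raw body of the HTTP response as bytes.
    Nothing is written to disk. (Hypothetical helper, for illustration.)
    """
    if filename.endswith('.gz'):
        # GzipFile needs a binary file-like object, so wrap the bytes in BytesIO.
        with gzip.GzipFile(fileobj=io.BytesIO(payload)) as gz:
            return gz.read()
    return payload


def join_archives(chunks, separator=b"\n\n\n\n"):
    """Join archives oldest-first, assuming `chunks` is ordered newest-first."""
    return separator.join(reversed(chunks))
```

With these in place, the download loop would call `decode_archive(filename, response.content)` for each filename, and the output file would need to be opened in binary mode (`'wb'`), since the joined result is bytes rather than text.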
Christian asked about a license for the above, which I hadn't considered. So let's consider it CC0 (Public Domain).
This post was originally hosted elsewhere.