Scraping a phpBB3 board with Python

Published on January 15, 2018

Intro

Recently I was asked to help archive an old phpBB3 forum. At work and at home I am fortunate enough to be able to work with Python. Aside from the language being accessible and free, those involved with Python often publish their code, which can generally be accessed under an open licence.

In this particular case I was already familiar with the requests module, and while I had heard of the Beautiful Soup module I hadn't yet used it. There's nothing new in this approach: see this gist (from six years ago) and this more recent review of a similar approach (from 2017). Thanks to both authors for adding the information to the Internet.

Getting started - handling the login to the board

Having downloaded and installed requests and Beautiful Soup via pip, the first job was to get access to the forum programmatically. phpBB3 forums have a login page, ucp.php (the user control panel), and a call to it via the requests module can include a username and password. The concept is explained well in this StackOverflow post.

The only change I had to make was to include the headers again when making the second call on the requests session.

# Log in; the redirect specified in the payload sends us on to index.php
r = session.post(forum + "ucp.php?mode=login",
                 headers=headers,
                 data=payload)
# Get the 32-character sid for passing with other href requests
sidStart = r.text.find("sid=") + 4
sid = r.text[sidStart:sidStart + 32]
parameters = {'mode': 'login', 'sid': sid}
# Second call: the headers have to be passed again here
r = session.post(forum, headers=headers, params=parameters)
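The snippet above relies on session, forum, headers and payload already being defined. As a rough sketch of how those pieces might fit together: the form field names below (username, password, redirect, login) are assumed from the usual phpBB3 login form and should be checked against the board's own HTML, and the helper names are hypothetical. The sid is picked out with a regular expression rather than by string offset, which is a swap from the approach above.

```python
import re

# Assumption: a phpBB3 sid is a 32-character lowercase hex string
SID_PATTERN = re.compile(r"sid=([0-9a-f]{32})")


def build_login_payload(username, password, redirect="index.php"):
    """Form fields for ucp.php?mode=login (field names are assumptions)."""
    return {
        "username": username,
        "password": password,
        "redirect": redirect,
        "login": "Login",
    }


def extract_sid(html):
    """Return the first sid found in the page text, or None if there is none."""
    match = SID_PATTERN.search(html)
    return match.group(1) if match else None


def login(session, forum, username, password, headers):
    """Post the login form, then repeat the call with the sid and the headers.

    `session` is expected to behave like requests.Session().
    """
    r = session.post(forum + "ucp.php?mode=login",
                     headers=headers,
                     data=build_login_payload(username, password))
    sid = extract_sid(r.text)
    if sid is None:
        raise RuntimeError("no sid found in the login response")
    r = session.post(forum, headers=headers,
                     params={"mode": "login", "sid": sid})
    return sid, r
```

With requests installed this could be called as login(requests.Session(), forum, "user", "secret", {"User-Agent": "Mozilla/5.0"}), where forum is the board's base URL ending in a slash.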

Processing the forum data

The redirect value provided in the payload sends the second call on to the index page of the forum. In our case the bulletin board's index page shows a list of forums.

Passing the text returned by the session request to Beautiful Soup makes it easy to identify all the links in the index page. In this case the phpBB3 index page included class values on the forum links:

<a href="./viewforum.php?f=19&amp;sid=b857a462b80a43778623ff5f39bfafce" class="forumtitle">General Discussion &amp; Resources</a><br/>

Once these links are identified they can be requested. In the case of this particular board the page was divided into two separate categories: additional forums (archives and other categorisation, for example), whose links carry the same forumtitle class, and a section that includes the actual topic links. These latter links have the class topictitle.

Using recursion it's possible to drill down into further forums or process a topic. Topics themselves have the option to be rendered as printable pages simply by adding a parameter to the URL:
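Picking out the two kinds of link by class might look something like the sketch below. The helper name links_by_class is hypothetical, and the second anchor in the sample (t=42, "Welcome") is made up for illustration; only the forumtitle line comes from the board's HTML shown above.

```python
from bs4 import BeautifulSoup


def links_by_class(html, css_class):
    """Return (link text, href) pairs for every <a> carrying the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", class_=css_class)
        if a.has_attr("href")
    ]


sample = """
<a href="./viewforum.php?f=19&amp;sid=b857a462b80a43778623ff5f39bfafce" class="forumtitle">General Discussion &amp; Resources</a><br/>
<a href="./viewtopic.php?f=19&amp;t=42&amp;sid=b857a462b80a43778623ff5f39bfafce" class="topictitle">Welcome</a>
"""

forums = links_by_class(sample, "forumtitle")  # links to further forums
topics = links_by_class(sample, "topictitle")  # links to individual topics
```

Beautiful Soup decodes the &amp;amp; entities, so the href values come back with plain & separators and can be passed on to requests as they are.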

&view=print
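The recursion and the print parameter could be combined roughly as follows. This is only a sketch under stated assumptions: get_links and save_topic are hypothetical callables supplied by the caller (the first would fetch a page and return its forum and topic hrefs, the second would download and store a topic), and forum pagination is not handled.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def printable_url(base, href):
    """Resolve href against base and set view=print in the query string."""
    parts = urlsplit(urljoin(base, href))
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "view"]
    query.append(("view", "print"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def crawl(url, get_links, save_topic, seen=None):
    """Walk a forum page recursively, saving every topic exactly once.

    get_links(url) must return (forum_hrefs, topic_hrefs) for that page.
    save_topic(url) is called with the printable URL of each topic.
    """
    seen = set() if seen is None else seen
    if url in seen:  # guard against forums that link back to each other
        return seen
    seen.add(url)
    forum_hrefs, topic_hrefs = get_links(url)
    for href in topic_hrefs:
        topic_url = printable_url(url, href)
        if topic_url not in seen:
            seen.add(topic_url)
            save_topic(topic_url)
    for href in forum_hrefs:  # drill down into sub-forums
        crawl(urljoin(url, href), get_links, save_topic, seen)
    return seen
```

The seen set matters because a sub-forum page usually links back to the index, and without it the recursion would not terminate.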

The resulting HTML can then simply be parsed by Beautiful Soup and copied to a file. The file name comprised the topic title and the date the topic was first posted. The nicely structured HTML of the phpBB board made it straightforward to obtain this information. For example, obtaining dates:

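from bs4 import BeautifulSoup

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(r.text, "html.parser")
dates = soup.find_all("div", {"class": "date"})
# We just use the date of the first posting
# but could use the last one if it was a popular post
date = dates[0].text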

The HTML of the parsed topic page is then simply saved to a file which, when opened from a storage service such as Google Drive as a Google Doc, is rendered correctly as a page.
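Topic titles and dates contain spaces and punctuation that do not sit well in file names, so something along these lines may be needed before writing the file. The function names and the date format in the example are illustrative assumptions, not taken from the board.

```python
import re
from pathlib import Path


def topic_filename(title, date_text, suffix=".html"):
    """Build a file name from the topic title and its first-post date."""
    raw = f"{title} {date_text}"
    # Collapse every run of characters outside a conservative safe set
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "untitled") + suffix


def write_topic(html, title, date_text, folder="."):
    """Write the topic HTML as UTF-8 and return the path that was written."""
    path = Path(folder) / topic_filename(title, date_text)
    path.write_text(html, encoding="utf-8")
    return path
```

For instance, topic_filename("General Discussion & Resources", "Mon Jan 15, 2018 9:30 am") gives General_Discussion_Resources_Mon_Jan_15_2018_9_30_am.html.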


An entry posted on January 15, 2018.

Categories: Data and Python

Tags: phpBB3 and python