Datafest-2013

It’s creeping towards the summer and this is festival time. I’ve only been to one music festival of note and that was Latitude in 2010. I failed to pick up a programme in advance and so spent some time during the weekend without sufficient data to decide which of the many stages to go to. I find if you’re not careful this can lead to smoking too many cigarettes and drinking too much beer while listening to Nick Cave being Grinderman. To avoid that here in HS2 land I’ve been putting together a programme of Hansard activity – I wanted to get a feel for how much debate had taken place about this high speed railway.

It’s apparently not very easy to search for synonyms (e.g. HS2 and High Speed Rail) or to filter out specific types of references (e.g. news releases from debates) on the parliament.uk website: the “phrase and boolean functionality” has been removed. In what could be read as an abrogation of sovereignty, the webmaster recommends the use of Google to perform advanced searches. But when I followed the guidelines offered I found that the search giant wasn’t really much help. I constrained the search to the parliament website, I limited the URL to Hansard, I used curly quotes around my search term “high speed rail” and I filtered the dates to 1989-1990 in the hope of tracing some early developments. Nothing. I extended the date range progressively up to 2010-2014 and switched to “HS2”. Still nothing.

I went back to the basic search on the parliament website, tried throwing a few of the Google tips into the search box there and, after some experimentation, came up with what seemed like a reasonable way to proceed. A search for HS2 and “cmhansrd” (the string the website uses in its URLs to mark Commons Hansard records, as distinct from Lords ones) produces 592 results (a screenshot is below).

[Screenshot: hansardresults]

A similar search for “high speed rail” gives 908 results. Repeat these two searches for the House of Lords (using “ldhansrd”) and the total number of results generated across all four is 1768.
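Those two URL markers are also enough to sort any Hansard link by chamber after the event. What follows is only a minimal Python sketch: it assumes the markers appear as plain substrings of the URL, and the function name and sample URLs are made up for illustration rather than taken from the parliament.uk site.

```python
def house_from_url(url: str) -> str:
    """Guess the chamber from the Hansard marker found in a URL."""
    lowered = url.lower()
    if "cmhansrd" in lowered:
        return "Commons"
    if "ldhansrd" in lowered:
        return "Lords"
    return "Unknown"


# Hypothetical URLs, shaped only to show the two markers.
examples = [
    "http://example.org/pa/cm201314/cmhansrd/page1.htm",
    "http://example.org/pa/ld201314/ldhansrd/page2.htm",
    "http://example.org/news/release.htm",
]
for url in examples:
    print(house_from_url(url), url)
```

Anything that carries neither marker, such as a news release, falls through to "Unknown", which is one rough way of separating debates from everything else.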

But four separate searches, each displaying a maximum of 100 results at a time, aren’t conducive to the “getting a feel for the debate” that I was hoping for. Well, it feels pretty big, but there is a popular notion that size isn’t everything. The next stage, then, was to consider how best to collate all of the results into one place; at the very least it would be handy to know how big it is. This was necessary not only for the sake of a good overview of the data but also to work out how to weed out the inevitable duplicates that were appearing across the HS2 and high speed rail results.

Here is the eventual method in its entirety.
1: Identify relevant terms
2: Perform search and view the source text of the results in the web browser
3: Copy the displayed results section of source text into a text editor
4: Repeat for each page of results and each separate search
5: Use regular expression matching to remove extraneous HTML code and then to add structure to the relevant data points (HS2/HSR, Commons/Lords, Debate/Text/Written, URL, date, summary text)
6: Copy structured text into Excel
7: Use a formula to detect and remove identical URLs
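
For anyone who would rather script steps 5 to 7 than work through a text editor and Excel, the same idea can be sketched in Python. This is not the method used above, just an illustration of it. The HTML in `SAMPLE_SOURCE` is invented, because the real markup of the results page is not reproduced here, so the pattern would need adjusting to the actual page source. All function and field names are hypothetical.

```python
import csv
import io
import re

# Invented markup: the real results page will differ, so the pattern
# below would need adapting to whatever the page source looks like.
SAMPLE_SOURCE = """
<div class="result"><a href="http://example.org/cmhansrd/a.htm">Debate A</a>
<span class="date">12 Mar 2013</span><p>Summary of A</p></div>
<div class="result"><a href="http://example.org/ldhansrd/b.htm">Debate B</a>
<span class="date">3 Jun 2011</span><p>Summary of B</p></div>
"""

RESULT_PATTERN = re.compile(
    r'<a href="(?P<url>[^"]+)">(?P<title>[^<]*)</a>\s*'
    r'<span class="date">(?P<date>[^<]*)</span>\s*'
    r'<p>(?P<summary>[^<]*)</p>'
)

FIELDS = ["term", "url", "date", "title", "summary"]


def extract_rows(source: str, term: str) -> list[dict]:
    """Step 5: discard the HTML and keep only the named data points."""
    rows = []
    for match in RESULT_PATTERN.finditer(source):
        row = match.groupdict()
        row["term"] = term  # which search (HS2 or HSR) produced the row
        rows.append(row)
    return rows


def drop_duplicate_urls(rows: list[dict]) -> list[dict]:
    """Step 7: keep the first row seen for each URL."""
    seen = set()
    unique = []
    for row in rows:
        if row["url"] not in seen:
            seen.add(row["url"])
            unique.append(row)
    return unique


def to_tsv(rows: list[dict]) -> str:
    """Step 6: tab-separated text that can be pasted into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# The same page is fed in under both search terms to mimic the overlap.
all_rows = extract_rows(SAMPLE_SOURCE, "HS2") + extract_rows(SAMPLE_SOURCE, "HSR")
unique_rows = drop_duplicate_urls(all_rows)
print(len(all_rows), "rows before,", len(unique_rows), "after")
print(to_tsv(unique_rows))
```

One design choice to be aware of: when a URL turns up under both search terms, this sketch keeps only the first term's label. Recording both labels against the URL would be a small change if the HS2/HSR overlap is itself of interest.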

This process converts 20 visually styled (and next to useless) web pages (as reproduced in the image above) into one spreadsheet. This spreadsheet consists of 1432 individual references to HS2 and High Speed Rail that have occurred in either the House of Commons or House of Lords debating chambers since 1989.

Now what was I wanting them for? WordPress doesn’t like that many lines of code all in one go, at least not in the visual editor, and the web browsers are not keen either. If you’re still reading this you probably are keen, and if you want to test out your browser the full list is here. Actually this list is what I wanted: a simple way to access every instance from a single location.

As a very brief quantitative summary of what’s here, the following is interesting in a trainspotterly kind of way. There are no surprises. A background noise of references from 1989 onwards matches the on/off debate around high speed rail that has been documented elsewhere; the House of Commons Library do a very good job of providing background papers on this kind of thing. The debate starts to pick up in 2003, when the Channel Tunnel Rail Link falteringly becomes High Speed 1, leading up to the Section 2 St Pancras link opening in 2007. The Labour government set up HS2 Ltd in 2009 but failed to retain power at the next general election. The subsequent review of HS2 by the newly formed coalition took place in 2010. Ongoing consultations since then have kept up the level of interest. The inclusion of the High Speed Rail (Preparation) Bill in the parliamentary business schedule for 2013 ensured that year saw more references than the previous two years added together.

[Image: hansardresults]

Although I started this piece in a field in Suffolk surrounded by music fans I don’t think the main purpose of the exercise is to create a top of the pops of parliamentary business. But the debate takes shape in a particular way when viewed like this and there are various ways that this shape reflects events across the wider discourse. This is seen in the potted history above but would also make sense when positioned against parallel developments such as road projects – the M1 is a particularly good example.

There are also multiple perspectives from within the debates themselves. The relationship between the use of the terms HS2 and HSR, and the relationship between the House of Commons and the House of Lords, are two structural points of possible interest. If I want to stay out of Marlboro country I need to get back to that list of links and start putting some qualitative flesh onto these quantitative bones.