My local newspaper, The Chronicle Journal in Thunder Bay, Ontario, does not publish an RSS feed. I find it quite odd for any newspaper in the 21st century not to have one, considering that everyone from the major news outlets down to the long-tail blogger does. I personally blame the son-in-law of the circulation director for this lack of feed.
Anyway, a while ago I was listening to an interview on a Google Code Podcast where they were discussing how people build mashups. Essentially, a “mashup” is a web application that a developer builds by combining other web services and content into something new. The popular example is to take Google Maps and combine it with data, say from Craigslist, to show the location of nearby houses for sale on a map. It’s common for developers to use RSS feeds to grab content from a website. Where RSS feeds do not exist, however, developers have two choices: they can pull data off of a webpage by using regular expressions, or they can take advantage of the DOM, the document object model, to grab the content they’re interested in. It’s the latter technique that caught my attention.
BeautifulSoup is a Python module for parsing HTML and XML that can search and navigate the DOM. A nice benefit of BeautifulSoup is that it tolerates bad markup, which is very common in HTML documents. A quick examination of the HTML code of my local newspaper’s website shows the following:
<td width=287><a href="/stories_national.php?id=84364"><font class="headlines">Former coroner Larry Campbell says he resents racist allegations in man's death</font></a></td>
You will notice that the headline text is wrapped in a font tag with the class headlines. It just so happens that an unintentional advantage of Cascading Style Sheets is that they make our data extraction lives much easier. You will find this pattern on many websites, where data is differentiated by CSS classes or identifiers. Using this knowledge and BeautifulSoup we can construct the following:
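# Collect [url, text] pairs for every headline on the page.
# self.soup is the BeautifulSoup object built from the page's HTML.
self.headlines = []
headline = self.soup.find('font', {"class" : "headlines"})
while headline:
    # The enclosing <a> tag carries the story URL in its href attribute.
    url = headline.findParent('a')['href']
    self.headlines.append([url, headline.string])
    # Move on to the next headline; findNext returns None when there are no more.
    headline = headline.findNext('font', {"class" : "headlines"})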
This little snippet of code searches the DOM for font tags of the class headlines. For each one, it extracts the text contained in the tag, then looks up the enclosing a tag, which holds the URL in its href attribute. The results are stored in a list.
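For anyone who wants to try the idea without installing anything, here is a rough, self-contained sketch of the same extraction. It is not the code I run: it swaps BeautifulSoup for the standard library's html.parser, which is event-based rather than a full DOM, and the names HeadlineParser, extract_headlines and SAMPLE_HTML are made up for illustration. The second headline in the sample markup is invented as well.

```python
from html.parser import HTMLParser

# Sample markup modelled on the newspaper's page. The second row is invented.
SAMPLE_HTML = """
<table><tr>
<td width=287><a href="/stories_national.php?id=84364"><font class="headlines">Former coroner Larry Campbell says he resents racist allegations in man's death</font></a></td>
<td width=287><a href="/stories_local.php?id=1"><font class="headlines">Hypothetical second headline</font></a></td>
</tr></table>
"""


class HeadlineParser(HTMLParser):
    """Collects [url, text] pairs for <font class="headlines"> inside <a>."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._href = None  # href of the <a> tag we are currently inside
        self._text = None  # text pieces gathered while inside a headline font

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href = attrs.get("href")
        elif tag == "font" and attrs.get("class") == "headlines":
            # Assumes the class attribute is exactly "headlines".
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "font" and self._text is not None:
            self.headlines.append([self._href, "".join(self._text).strip()])
            self._text = None
        elif tag == "a":
            self._href = None


def extract_headlines(html):
    """Return a list of [url, headline text] pairs found in the given HTML."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


if __name__ == "__main__":
    for url, text in extract_headlines(SAMPLE_HTML):
        print(url, "|", text)
```

The trade-off is that the parser has to remember which a tag it is inside, whereas BeautifulSoup lets you start at the font tag and simply ask for its parent.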
We could then stop here and use the list in our application, or we could go one step further and build an RSS feed. I used PyRSS2Gen, which provides a simple interface for constructing an RSS 2.0 compliant XML file. And now, armed with an RSS feed of my local newspaper, I can enjoy local Thunder Bay news on my iGoogle page. I have also made the feed public, in case you happen to have an interest in Thunder Bay; you can find it at http://feeds.feedburner.com/thechroniclejournal.
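To give a feel for the feed-building step, here is a minimal sketch that turns a list of [url, text] pairs into an RSS 2.0 document. It is only an illustration: it uses the standard library's xml.etree.ElementTree rather than PyRSS2Gen, the function name build_rss is my own, and the base URL, channel title and description are placeholders. Because the links on the page are relative, the sketch resolves them against the base URL with urljoin.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def build_rss(headlines, base_url, title, description):
    """Return an RSS 2.0 document (as a string) for [url, text] pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # RSS 2.0 requires a title, link and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = description

    for href, text in headlines:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        # Turn a relative href such as /stories_national.php?id=84364
        # into an absolute link that a feed reader can follow.
        ET.SubElement(item, "link").text = urljoin(base_url, href)

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values: the base URL and channel text are assumptions.
    sample = [["/stories_national.php?id=84364", "Sample headline"]]
    print(build_rss(sample, "http://www.example.com/",
                    "Example newspaper headlines",
                    "Headlines scraped from the front page"))
```

A real feed would probably also want a publication date and a guid on each item, which this sketch leaves out because the page snippet above does not expose them.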
This technique of extracting information from an HTML file using the document object model is a simple and structured approach compared to extracting data using regular expressions. Since the document we’re working with is a structured document in the first place, it only makes sense to treat it as one. And with CSS classes marking the data, it only gets easier.