beautifulsoup

BeautifulSoup available for Python 3

Python 3 can now strip the hell out of webpages just as well as Python 2.

A Python3-compatible version of BeautifulSoup is now bundled with the Python2 BeautifulSoup tarball. It's actually been available since 27 December, but the most recent version, 3.1.0.1, addresses a bug in attribute handling.

It's a bit fiddly to get it working: you need patch, plus both python3 and 2to3 on the command line (and 2to3 has to be callable as 2to3-3.0). But once it does work, that ol' BS magic is pretty clear. While there are still lots of good reasons not to convert all your Python2 code to Python3, there's now one less reason not to begin your next big project in Python3.

(BeautifulSoup has an active user group on Google Groups, so you can report any bugs there.)

The TimeToLead.eu technical stack: Django and Flex

Move over LAMP: here comes LAPD.

As discussed previously, at Torchbox we recently built TimeToLead.eu, an advocacy site set up by the four major environmental NGOs to prompt MEPs into passing the sort of legislation the world's climate desperately needs. The project itself needed a fast turnaround time, and its pan-European audience demanded strong i18n and l10n. This had to be in place at all layers, including an embedded Flex application.

In a parallel with the acronym LAMP, the overall stack is probably best described as "LAPD":

  • Linux
  • Apache
  • PostgreSQL
  • Django

although there are refinements at almost every level, which I'll go into below.

I18n with Django was a joy to implement. We had to have the site translated into six languages, but this was practically a doddle with Django's core internationalization behaviour. Before we'd put the translations in, I clicked on "POLSKI" and was sure the l10n wasn't working, until I spotted that the "ENGLISH" link had magically changed to "ANGIELSKI".

The only issue we've had---which we're hoping to work around---was deciding on an initial translation when the visitor arrived for the first time. Some parts of the system seemed to need l10n---the plumping for a specific version of the site---before Django's LocaleMiddleware had done that for us, so we've had to force English until the user states otherwise. I'm hoping that a version 2 rewrite will fix that.

PostgreSQL might seem an odd choice of database to the LAMP community, who are used to MySQL's minuscule overhead and often actively work around its deficiencies to keep that low overhead. But PostgreSQL's stability and maturity outweigh the performance issues which---with a little care and attention---it's possible to at least partly mitigate.

Compared to MySQL or Oracle, PostgreSQL also has the distinction of being supported by Django while having no outstanding database issues. Whether that's because nobody else uses it, or because the integration is tight and relatively bug-free, I daren't comment, but we've had nothing but seamless, transparent behaviour thus far.

Our Django-oriented hosting is with Webfaction, who are geared up for one-click mod_python Django deployments. The Django app sits in its own Apache process, while static files are served by the Linux server's main Apache process: differential serving of content lets us take advantage of differential hosting fees. The hosting is a pretty good package, although TimeToLead.eu (despite being a pretty small application) is already finding the maximum memory package a bit restrictive, so we'll need to keep an eye on that.

On top of this, Django has to a greater or lesser extent enabled us to use a whole host of other neat little technologies to integrate the site and improve both the user and the developer's experience:

  • The main Flash widget is written in Flex: I'm no particular fan of Flash myself, but it fulfills the remit admirably here. Because of text flow issues, there are actually six versions of the Flash file, one for each language, but the majority of the explanatory text is then picked up from an XML feed provided by Django, so it can be retranslated by the client without recompiling Flex.
  • Swfobject serves up our main widget and the YouTube file. It degrades well, but bear in mind that the API to it has changed considerably in the most recent major version.
  • Unit tests in Django cover a number of stress points in the code (although not all, owing to the timescale). Towards the end, we were testing any major functionality changes by writing the tests first, which I was really pleased with.
  • BeautifulSoup, which I mentioned a few days ago, is employed in the unit tests to check that certain content is coming through on the front end. It also managed to help us proof against a browser-dependent FOUC we encountered during development.
  • Translations are managed by the Django application Rosetta: the most recent stable version is for Django 0.96, but the bleeding-edge repository copy seems to work OK on 1.0b1.
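
To give a flavour of the BeautifulSoup-in-unit-tests idea from the list above, here is a minimal sketch rather than our actual test code. It assumes the current bs4 package (not the 3.x release discussed in these posts), and the markup, the "languages" id and the language_links helper are all made up for illustration. In a real Django test the HTML would come from the test client's response rather than a string constant.

```python
import unittest

from bs4 import BeautifulSoup

# Hypothetical rendered page: in a real test this would be the body of a
# response fetched through Django's test client.
SAMPLE_PAGE = """
<html><body>
  <ul id="languages">
    <li><a href="/en/">ANGIELSKI</a></li>
    <li><a href="/pl/">POLSKI</a></li>
  </ul>
</body></html>
"""


def language_links(html):
    """Map each language link's href to its visible label."""
    soup = BeautifulSoup(html, "html.parser")
    menu = soup.find(id="languages")  # the id is an assumption of this sketch
    return {a["href"]: a.get_text().strip() for a in menu.find_all("a")}


class LanguageMenuTest(unittest.TestCase):
    def test_english_link_is_localised(self):
        # With Polish active, the English link should carry its Polish label.
        links = language_links(SAMPLE_PAGE)
        self.assertEqual(links["/en/"], "ANGIELSKI")
```

The point of going through a parsed tree rather than a substring search is that the test keeps passing when whitespace or attribute order in the template changes.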

Django has been the saviour here, living up to its promise as a framework for rapid application development. No framework is ever perfect, and there'll always be arguments over the right way to proceed, reconciling what the framework developers want you to do with what the end developer wants to do, but this particular brief sojourn alongside Holovaty, Kaplan-Moss, Willison et al. has been tremendous fun.

Spliticket running again with BeautifulSoup

Or, how I learned to stop parsing and love the soup

Ages ago Matthew Somerville emailed me to say that spliticket had fallen over. It's my hacky interface to his wiki page documenting split tickets, and ultimately it found the vagaries of even wiki-generated HTML a bit too hard to cope with.

At the time I built the HTML parser using core SAX-based HTML parsing, and it was horrible. SAX works in a basic sense, but you have to build your own internal state engine, track which elements have gone past while working out what to do with the current context, and even write rules for what to do when the underlying dumb parser encounters HTML entities: no mean feat when the document is peppered with &ndash; en dashes.
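
For anyone who hasn't had the pleasure, this is roughly what that event-driven style looks like. It is a sketch, not the original spliticket code: it uses the standard library's html.parser as a stand-in for the SAX-style parser, and the CellCollector class and the table-shaped input are assumptions made for illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, one list of cells per <tr>."""

    def __init__(self):
        # convert_charrefs=False makes the parser hand entities to us
        # undecoded, which is the situation described above.
        super().__init__(convert_charrefs=False)
        self.rows = []        # finished rows
        self.in_cell = False  # hand-rolled state: are we inside a <td>?
        self.buffer = []      # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # a <td> with no enclosing <tr>
                self.rows.append([])
            self.in_cell = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.rows[-1].append("".join(self.buffer).strip())
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.buffer.append(data)

    def handle_entityref(self, name):
        # Named entities such as &ndash; arrive here, not in handle_data.
        if self.in_cell:
            self.buffer.append(chr(name2codepoint.get(name, 0xFFFD)))


def parse_rows(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Even this toy version needs a state flag, a buffer and a separate entity rule, and it still quietly drops any cell whose closing tag is missing.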

Not only was writing the rules initially a pain in the rear, but adding new rules and bugfixing the existing ones was even worse. But I lived with SAX, because I was deploying on shared hosting: I presumed that this was the best option available if I couldn't install any new shared libraries.

Not true! I've just rebuilt the entire parsing layer with Beautiful Soup, a Python HTML/XML parser library which (a) is available as a single file and (b) works out a decent HTML DOM tree from pretty much anything you throw at it.
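
By way of comparison, pulling table cells out of a page with Beautiful Soup can be sketched in a handful of lines. This assumes the current bs4 package and its bundled "html.parser" backend rather than the single-file release I actually deployed, and rows_with_soup is a name invented for this example.

```python
from bs4 import BeautifulSoup


def rows_with_soup(html):
    """Return the text of every <td>, grouped by <tr>."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        # Entities are already decoded by the time we ask for the text.
        [cell.get_text().strip() for cell in row.find_all("td")]
        for row in soup.find_all("tr")
    ]
```

There is no state flag and no entity rule: you ask the tree for what you want and it hands it over.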

Try it yourself, if you have to do any HTML parsing. It's astonishing; beautiful, in fact. I will never write another SAX parser ever again, which I'm sure I've said before.
