Now that's magic (quotes)

If your web application ensures that all your incoming CGI variables are free of the most common source of malicious site damage, can you stop worrying?

I wondered this as I got far enough into a PHP publishing system that I had to start thinking about adding new content through the system (rather than just jamming it into the database by hand, which is why the previous incarnation has sadly fallen into disuse). As it’s typically configured, PHP will add backslashes to the quote characters in anything it doesn’t trust: hence the comment “it’s a great site you’ve got here” will, when submitted by a POST request, become “it\’s a great site you\’ve got here”. Whether or not your server does this automatically can be checked by calling the function get_magic_quotes_gpc() (I realised only the other day that “gpc” stood for “GET, POST and cookies”: I probably have some catching up to do). In performing this blanket adding of slashes, PHP prevents the unwary coder from leaving his site open to both unintentional database hiccups and intentional malevolent attacks such as SQL injection.
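As a minimal sketch of the behaviour described above: the escaping that magic quotes applies matches what addslashes() does, so the effect can be reproduced by hand. The variable names are my own, and the function_exists() guard is an assumption added so the snippet also runs on PHP versions where get_magic_quotes_gpc() is no longer available.

```php
<?php
// Reproduce by hand what magic quotes does to an incoming value.
$submitted = "it's a great site you've got here";
$escaped   = addslashes($submitted);
echo $escaped, "\n"; // it\'s a great site you\'ve got here

// Ask the server whether it is doing this automatically.
// The guard (an assumption) copes with PHP builds that lack the function.
$magicQuotesOn = function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc();
echo $magicQuotesOn ? "magic quotes: on\n" : "magic quotes: off\n";
```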

All well and good, but my application is heavily object-oriented. Such objects store whatever content you give them, as well as optionally writing it to the database. If I want these objects to persist (even for the course of a single request) then any access to their internal storage must yield sensible data: those slashes have to disappear before the articles appear in an RSS feed, or on the website itself. So when the CGI environment gives slash-added content to an object, the object needs to know both to add it to the database verbatim and to produce it for viewing with the slashes removed. It can do the latter either by storing the content in a slash-removed state or by placing a filter on its outputs.

There’s a further complication, in that content can also be written to an object by the PHP application itself: the publishing of all my unpublished articles, for example, would change the status of their accompanying objects without reference to any CGI variable. If I assumed all of this content had already had slashes added, then this article, for example, would lose the backslash from all of its \’ text, because the object would assume they’d been added by PHP’s internals: in my second paragraph, the “after” string would look like the “before” string, and the “before” string would instead break the database insertion. In addition, what if the server is reconfigured? Can I trust my hosting company to never change the configuration of PHP, even accidentally during an upgrade?
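One way to sidestep both worries is to strip slashes only at the CGI boundary, and only when the server reports that it added them. The sketch below assumes a hypothetical helper, raw_request_value(), and assumes each request value is a plain string rather than an array; content written by the application itself never passes through it.

```php
<?php
// Hypothetical helper: return a request value in raw (un-slashed) form,
// whatever the server's magic-quotes setting happens to be.
function raw_request_value(array $source, $key, $magicQuotesOn)
{
    if (!isset($source[$key])) {
        return null;
    }
    $value = $source[$key];
    // Only undo slashes that PHP itself added.
    return $magicQuotesOn ? stripslashes($value) : $value;
}

// Stand-in for $_POST as it would arrive with magic quotes switched on.
$fakePost = array('comment' => "it\\'s a great site");

echo raw_request_value($fakePost, 'comment', true), "\n";  // it's a great site
echo raw_request_value($fakePost, 'comment', false), "\n"; // it\'s a great site
```

In a real request the third argument would come from the server check rather than a literal, so a change in the hosting configuration changes the behaviour without any change to the code.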

I found myself lost in a maze of adding, removing and then re-adding slashes, with no clear way of choosing between the options. Then it struck me: why not use one of PHP’s major downsides, namely that it doesn’t support persistence of objects from one request to the next very well and hence each action is fighting against the overhead of constantly recreating and recompiling code, to ascertain which input/output processes were the most frequent (and most public) and hence needed to be the fastest? I drew a flowchart of a typical object’s behaviour and, by identifying which channels could be safely bottlenecked, arrived at a reasonable solution to the problem.

From my phrasing it’s clear that it was a foregone conclusion: I wanted, more than anything else, for content to flow straight from the database (through the object if applicable) to the user. This content needed to stay in any object in a simple, de-slashed form, so it could flow and flow as long as the object was in existence. That meant that incoming CGI content could not be stored with its added slashes intact. Counter-intuitively, then, my solution was to undo PHP’s default safety mechanisms, unescaping the CGI content and storing it raw, and then without fail adding slashes to anything that CGI or my application wanted to add to the database. This would be my bottleneck: everything else would be as fast as it could be.
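A rough sketch of that arrangement follows. The Article class, its method names and the articles table are all hypothetical, and addslashes() merely stands in for whatever escaping the database layer actually requires (a driver’s own escaping function or prepared statements are generally the safer choice).

```php
<?php
// Hypothetical class: holds raw text and escapes only on the way to the database.
class Article
{
    private $body;

    public function __construct($rawBody)
    {
        // Content is stored de-slashed, whether it came from CGI or the application.
        $this->body = $rawBody;
    }

    // Fast path: output to the website or an RSS feed needs no slash handling.
    public function body()
    {
        return $this->body;
    }

    // The single bottleneck: slashes are added at the moment the SQL is built.
    public function insertSql()
    {
        return "INSERT INTO articles (body) VALUES ('" . addslashes($this->body) . "')";
    }
}

$article = new Article("it's a great site you've got here");
echo $article->body(), "\n";
echo $article->insertSql(), "\n";
```

Reading from the object costs nothing extra, however many times it happens in a request; only the comparatively rare write to the database pays for the escaping.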

Exit gracefully: ensuring all incoming content can be added to the database safely is not necessarily the most efficient or desirable long-term solution. By examining the likely workflows for content, it’s possible to make pragmatic decisions on where content should be pre-processed and where it should be left alone. Consider all your overheads, including that of short-term programming and long-term cumulative processing time: this will vary depending on your environment. Also, if you’re aware of a safety net, over the presence of which you have minimal control, account for the possibility that someone might one day remove it.