[PHP-DEV] Run-time taint support proposal

One of the largest discussions on the PHP internals list began last Friday with a lengthy proposal by Wietse Venema. At the time of writing, the proposal has generated over 100 responses. Everyone on the list has had their say. Now that the two Jesters and the King of Spades (Andi Gutmans, Zeev Suraski and Rasmus Lerdorf) have given their opinion, it might be time for a recap. At this point, it’s almost certain an implementation will be written; it’s far less certain that it will be incorporated.

So, what is taint mode all about? It’s about automatically marking data from external sources as unsafe and making sure it’s sanitized before being used. Here is a sample explaining what it does in detail. For now, Beep! will indicate an error. A hundred responses were not enough to decide how these errors should actually be handled.

$result = mysql_query( "SELECT * FROM foo
    WHERE bar = '{$_GET['baz']}'" ); // Beep!
$result = mysql_query( "SELECT * FROM foo
    WHERE bar = '" .
    mysql_real_escape_string($_GET['baz']) . "'" ); // OK

while( $row = mysql_fetch_row( $result ) )
{
    echo $row[0]; // Beep!
    echo htmlentities( $row[0] ); // OK
    echo (untainted) $row[0]; // OK, but no decision on syntax for this one
}

$stmt = $pdo_dbh->prepare( "INSERT INTO foo (bar, baz)
    VALUES(:a, :b)" );
$stmt->execute(
    array( ':a' => $_GET['baz'], ':b' => 123 ) ); // OK

Basically, taint mode only ensures that an appropriate escaping mechanism is used before sending information to an external destination. It does not replace input filtering, and the sample above still contains many flaws.

The opposition to taint mode, led by the Jack of Hearts (Ilia Alshanetsky), fears that taint mode will just become another safe_mode, causing more harm than good. In a worst-case scenario, ISPs would enable taint mode thinking it makes their servers safe from everything, which, as everyone agrees, is not true. Fooling taint mode is fairly easy.

exec( mysql_real_escape_string( $_GET['command'] ) ); // OK

To prevent this, taint mode would need to be aware of the context in which a value is used, which would require more than the single bit per value that the current proposal calls for. A bulletproof solution would require too much overhead to be acceptable; early in the discussion, even the overhead of a single check was a concern.
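To make the "more than 1 bit" point concrete, here is a rough userland sketch of context-aware tainting. It is not the proposal's design and nothing in it exists in PHP: the Tainted class, its constants and its methods are hypothetical names, and addslashes() only stands in for mysql_real_escape_string() so the sample runs without a database connection. Each value carries one bit per context, and escaping for one context does not make it safe for another.

```php
<?php
// Hypothetical sketch: one "safe" bit per output context instead of a single taint bit.
final class Tainted
{
    const SQL   = 1; // safe to embed in an SQL string
    const HTML  = 2; // safe to echo into a page
    const SHELL = 4; // safe to hand to a shell

    private $value;
    private $safeFor;

    public function __construct($value, $safeFor = 0)
    {
        $this->value = $value;
        $this->safeFor = $safeFor; // 0 means tainted for every context
    }

    // Escape the value and mark the result safe for that one context only.
    public function escapedFor($context, callable $escaper)
    {
        return new self($escaper($this->value), $context);
    }

    // Return the raw string, or "Beep!" if it was not escaped for this context.
    public function useIn($context)
    {
        if (($this->safeFor & $context) === 0) {
            throw new RuntimeException('Beep! value is tainted for this context');
        }
        return $this->value;
    }
}

$input    = new Tainted('ls; rm -rf /');  // pretend this came from $_GET['command']
$forSql   = $input->escapedFor(Tainted::SQL, 'addslashes');       // stand-in SQL escaper
$forShell = $input->escapedFor(Tainted::SHELL, 'escapeshellarg');
```

With this bookkeeping, $forSql->useIn(Tainted::SHELL) throws, which is exactly the exec() case a single bit lets through. The price is extra state on every string, which is the overhead the list was worried about.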

Those supporting the idea (supporting the proposal is not quite the same thing) think it would be a good tool for security-conscious developers who want to improve the security of their applications. Those who want to ignore security will be able to do so anyway (PHP has always allowed anyone to shoot themselves in the foot). Even if taint mode does not catch every security issue, being able to catch 90% is better than nothing. Plus, it could actually be a good educational tool for those willing to learn. Public perception of the new language feature is important, and appropriate communication will be required to avoid problems with ISPs.

Enabling, disabling and error reporting

One thing is certain: it will be disabled by default to keep backwards compatibility. Even the best applications out there would probably fail the taint mode test. Take this safe example:

$foo = 123; // tainted, from external source
if( is_numeric( $foo ) )
    echo $foo;

The example is perfectly safe, but taint mode will not understand this. Most applications will not run under taint mode unless major modifications are made. This is why some suggested it should only be used for new development. Some even suggested that, just like error reporting, it should be disabled in production. It was proposed that the checks could be enabled or disabled at compile time to avoid any overhead in a production environment, but that idea did not get much attention. The other possibility is of course to make it a setting in php.ini, which would make things easier for Windows users who are not used to compiling PHP.

Three modes were proposed.

  1. Disabled.
  2. Audit mode. Application would still run, but problems would be logged to a file.
  3. Enforcement mode. Kill the script on taint error.

The last one, now known as mode 3, was loudly rejected by both sides for giving a false sense of security. It would encourage hosts to enable it and only accept taint mode compliant applications on their servers. Passing the taint mode test does not mean the application is secure, and that is not the message anyone wants to send. Plus, it would encourage application writers to patch their applications blindly just to get them to run on shared hosts.

One proposed option is to add an E_TAINT reporting level, which would make it easy to enable or disable the reporting and would take advantage of all the logging mechanisms already available.

After all, this is the internals list

Of course, there were discussions about implementation details, the impact on source code to be modified, how it would be integrated in the ZVAL. The conclusion to this was that it should not be too complicated, but a base implementation is required before giving more details.

No matter what, the workload required can’t be worse than converting 3000 functions for Unicode compliance in PHP 6. As for when taint mode would be included: no word yet. Obviously PHP 6 if it’s accepted, but who knows, it might just be included in 5.3.

Some off topic issues

On the topic of perception by ISPs and clear communication of the purpose of taint mode, it was proposed to rename php.ini-dist to php.ini-development and php.ini-recommended to php.ini-production in the distribution, for clarity. The former would include E_ALL error reporting and probably taint mode; the latter would disable error output and taint mode. Unlike most issues on the thread, most people seemed to agree on the rename, although the content of the files is not likely to change.

A proposal to kill $_GET/$_POST/$_ENV/$_COOKIE/$_SESSION/$_REQUEST altogether was made. The point was that the filter extension provides a better solution and those legacy superglobals now cause more harm than good. I think this one was rejected without any discussion. Can you imagine how many applications would break? But the point is still valid: ext/filter is a better option.
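For readers who have not tried ext/filter, a minimal sketch of the idea follows. It uses filter_var() on literal strings so it can run from the command line; in a real request the equivalent call would be filter_input(INPUT_GET, 'baz', FILTER_VALIDATE_INT), which reads the request data directly instead of going through $_GET. The variable names are made up for the example.

```php
<?php
// Validate instead of trusting: FILTER_VALIDATE_INT returns an int or false.
$good = filter_var('123', FILTER_VALIDATE_INT);       // a real integer
$bad  = filter_var('1 OR 1=1', FILTER_VALIDATE_INT);  // rejected

// Only build the query when validation succeeded; the value is now an int,
// so there is nothing left to escape.
$sql = ($good !== false)
    ? 'SELECT * FROM foo WHERE id = ' . $good
    : null;
```

Note the strict comparison against false: a valid input of 0 is falsy, so a plain if ($good) check would wrongly reject it.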

Where do I stand?

I think taint mode combined with the filter extension could change the way PHP applications are being written in a drastic way. Ever since PHP 5, PHP has only become more elegant as a language. This new proposal would actually enforce good practises and place a focus on security. I think taint mode is only a tool for developers. Some PHP developers out there don’t care about security. Fine, they don’t have to use it. But some do, some make efforts to make their applications secure. Having taint mode would help these people catch that one place they forgot to escape a value. While it won’t assure that the application is completely secure, it will at least give a certain confidence level that most of it is safe.

Zend Framework 0.6.0, yet another preview

The Zend Framework has a strict roadmap. Releases arrive exactly as planned. Mid-December, 0.6.0 was released. I had to wait a few days before looking into it, but I took some time. The first thing I had in mind was to get my sample application to run with the new version. For the MVC aspect, not much changed since 0.2.0, except that the new stuff is now out of the incubator. The Zend_Controller_Front is now a singleton again, as it was in 0.1.0 and the patch with setRequest() is no longer required. On that front, the release is a good one. On the other hand, getting the ACL code to work was not as easy, but that was to be expected as it was never marked as stable. They went through some serious refactoring. One positive aspect: they got rid of the ARO and ACO acronyms and decided to use Role and Resource, which makes the whole thing more readable. The downside is that they broke everything I liked about it.

From a flexible tool you could play around with and extend on the fly, it became a rock-solid Java-like structure. You can no longer create resources on the fly that simply extend the parent one. The fluent interface using the __get() magic method is gone. Worst of all, you actually have to create instances yourself and build the tree manually. Resource names are now effectively global, so there is no way to use the same word twice in different contexts. I did not do any extensive testing before, so maybe that was never possible at all, but now they have made it obvious.

Seriously, if I am to give a name to something, and be able to use that name afterwards, why would I even care about the class name used internally? Having to write Zend_Acl_Resource and Zend_Acl_Role is simply annoying since there is no value to the instance it creates. Check out the samples from the documentation.

  1. Introduction
  2. Refining
  3. Advanced

Hopefully, they will refactor this one again before the final release scheduled for May.

Of course, there are a lot of good aspects to the new release. Overall, I noticed a great improvement in the documentation. Improving the documentation is still part of the roadmap, but what is available already is more than anything I’ve seen for a class library. The framework is not even at a final release stage and translations are already in progress. Other than English, 10 translations are available (most of them partial so far, but still impressive).

Classes for authentication and session handling were also added. The only authentication adapter available at the moment is Digest, but I can imagine more coming in the future, like LDAP and other ‘standard’ protocols. Of course, you can create your own adapters. I don’t really get the point of the session handling classes, but they appear to offer additional features like namespaces for values and some options to simplify session security.

The class I find the most interesting is definitely Zend_Measure. It was available in the last release, but no documentation was available at the time other than the API documentation, and that does not really tell you what it’s all about. Over a year ago, I had to perform unit conversions in order to translate values in technical publications. At that time, I searched around for a library that could handle the conversion for me. Not that multiplications are hard, it’s just that I hate having to type in the conversion factors. I couldn’t find anything, so I had to type in the conversion factors. Most of it was a hack. I had no intention of releasing it in any way. Now I see this package coming. What it does is simply amazing. Not only does it do the conversions for a gazillion types of units, but it will also find the units to convert in strings and handle the locales.

I still need to apply these in a real world context to see how good it really is, but it all looks very promising.

XML Schema vs RelaxNG

When I first read this quote by Tim Bray yesterday (yes, I know it was posted on Slashdot last week, I just happen to have a huge feed backlog and not much time to read), I was a little surprised. I have been using XML Schema for a while now and never had any problems with it. Of course, there is quite a lot of vocabulary to learn to be able to write it, but it’s not that hard to read. Once you have written one schema, you can simply start from that one for the others, and half of the burden imposed by the syntax is gone.

W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs.

I never really bothered looking into Relax NG. I saw the name a couple of times, saw an XML sample once and figured it did the same thing as XML Schema, so I had no reason to bother. When I read this quote, I knew there was something I missed about it. Really, the XML syntax of Relax NG is not more readable. For some reason, I would even say I prefer the XML Schema xs:element tag with the minOccurs and maxOccurs. I find it more readable than those “oneOrMore” tags.

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <ref name="library"/>
  </start>
  <define name="library">
    <element name="library">
      <zeroOrMore>
        <ref name="book"/>
      </zeroOrMore>
    </element>
  </define>
  <define name="book">
    <element name="book">
      <element name="title">
        <data type="string">
          <param name="minLength">1</param>
        </data>
      </element>
      <oneOrMore>
        <element name="author">
          <text/>
        </element>
      </oneOrMore>
    </element>
  </define>
</grammar>

Then I figured out that this XML blurb could be generated from a much simpler syntax called Relax NG Compact. Seriously, I don’t even see why they made a non-compact form. The syntax is a lot like the DTD syntax, without all the brackets and with more capabilities. It can even use the datatypes from XML Schema (which are in a separate specification) to perform some additional validation. I’ll let you guess what the following piece does. (Note: the XML above was generated from this sample.)

grammar {
	start = library

	library = element library { book* }
	book = element book {
		element title { xsd:string {minLength = "1"} },
		element author { text }+
	}
}

Not only is the syntax a lot simpler than anything else out there for schema definition, it also has better documentation. The RELAX NG Compact Syntax Tutorial is a very good place to start. I don’t know if it can do more than XML Schema, but it sure can do about as much in a more efficient fashion. There are two attributes I couldn’t find an equivalent for: minOccurs and maxOccurs. Relax NG only supports “zero or one”, “zero or more” and “one or more”, so unless you write out the minimum as required elements and pad up to the maximum with optional ones, there is no way to obtain the same behaviour.
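The "required up to the minimum, optional up to the maximum" workaround is mechanical enough to generate. The helper below is a hypothetical sketch (rncOccurs is not part of any tool) that expands a pattern into the equivalent Relax NG Compact sequence; it assumes 0 <= min <= max and a finite maximum.

```php
<?php
// Hypothetical helper: emulate minOccurs/maxOccurs in RELAX NG Compact by
// repeating the pattern $min times as required and ($max - $min) times as optional.
function rncOccurs($pattern, $min, $max)
{
    $parts = array();
    for ($i = 0; $i < $max; $i++) {
        $parts[] = ($i < $min) ? $pattern : $pattern . '?';
    }
    return implode(', ', $parts);
}

// Between one and three authors per book:
$authors = rncOccurs('element author { text }', 1, 3);
// $authors is now:
// element author { text }, element author { text }?, element author { text }?
```

The result can be pasted in place of the "element author { text }+" line of the book definition. It gets verbose quickly for large bounds, which is the real cost of the missing attributes.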

All validation tools I came across use the XML syntax to validate, so you need a way to convert. I used Trang, which was available from my distribution’s repository. Usage is very simple:

trang -Irnc -Orng in.rnc out.rng
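Once the XML form exists, PHP's DOM extension can validate against it. The sketch below assumes ext/dom is available and embeds the library schema from above as a string so it is self-contained; with a real file you would call relaxNGValidate('out.rng') instead of relaxNGValidateSource(). The isValidLibrary function and the sample documents are made up for the example.

```php
<?php
// The library schema from above, in its XML form (XML declaration omitted).
$rng = <<<'RNG'
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start><ref name="library"/></start>
  <define name="library">
    <element name="library">
      <zeroOrMore><ref name="book"/></zeroOrMore>
    </element>
  </define>
  <define name="book">
    <element name="book">
      <element name="title">
        <data type="string"><param name="minLength">1</param></data>
      </element>
      <oneOrMore>
        <element name="author"><text/></element>
      </oneOrMore>
    </element>
  </define>
</grammar>
RNG;

// Returns true when $xml is well-formed and matches the schema.
function isValidLibrary($xml, $rng)
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of raising warnings
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml) && $doc->relaxNGValidateSource($rng);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

$valid   = '<library><book><title>Dune</title><author>Frank Herbert</author></book></library>';
$invalid = '<library><book><title>Dune</title></book></library>'; // author is missing

$validResult   = isValidLibrary($valid, $rng);
$invalidResult = isValidLibrary($invalid, $rng);
```

To see why a document was rejected, read libxml_get_errors() before clearing the error buffer.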

Have fun.

Why can’t things be simple

This is getting totally absurd. A few weeks ago, I decided to open a bank account in a different bank because I considered my current one was not able to offer the service quality I expected. As in, not able to answer simple questions about my finances. The whole transition just gives me the feeling I should really run away from them, and fast. For the moment, this is quite hard because they still have my money.

In the whole process, I made one big error: I expected things to be simple. I have been doing all my transactions electronically for the past few years, so I expected that making a transfer would be a few clicks away. They have been running an ad campaign about online person-to-person transfers for the past year or so. According to the ads, all you need to do is enable the service. Sounds good to me. I liked the idea of keeping things flexible and keeping the account open for a little while, just in case (you never know, bankers in the other bank might just be worse). So I went to the bank to enable the service. I had to pull out my investments, so I had to be there physically anyway. Pulling out the investments actually worked. As for their service, it turned out to be available only between their own clients. Smart. But they had another service for “inter-institution” transfers. So I asked to get that enabled, and they told me it could take a few days. Fine.

A few days later, I checked online to transfer some funds. The service was not enabled yet. It seems it can take up to 7 days; if it is not enabled after that period, I need to call them. I waited a few more days and called them. Apparently I had not sent in the required paperwork. No one ever asked me to fill in more paperwork! I signed an agreement and gave the account information to the person in front of me. How could anything be missing? I didn’t do anything.

So I go back to the bank, again. Explain the situation with the request stuck pending and the missing paperwork. They must have asked everyone in the office to know what was going on. No one knew, so they ended up calling some remote office. The problem was that the person who served me forgot to fax the cheque sample for activation. This sounded weird to me. Why a cheque sample? The cheque number was typed in. Why would you need any more information? I suspected they had some cheque number verification authority to make sure I was not making transfers to the wrong account. They told me it would take 24 to 48 hours to activate. Better than a week, but over two weeks elapsed already since the original request. No other option there. Their procedures say so.

A few hours later, I received a call at home. They were still missing information. I was pissed. Seriously. I started asking why they needed all this. I mean, all I want is to transfer money from account A to account B. Why do they need to validate that account B is really mine? If I decide to push money into someone else’s account, it’s my problem (or generosity). Then I got the most stunning answer I have ever had. The service I had been waiting for all that time only allows transfers from B to A, so they need to verify my identity to make sure they don’t take money from a random account. Why would I want such a service? Why would I ask my bank to pull money from my account in the other bank?

All this time wasted. They don’t even know the services they offer. And as if it wasn’t enough, they called me a few minutes ago to sell insurance.

As a software developer, I have to ask myself how a system could end up being so useless. Why would such a feature be requested? It’s simply backwards. I’ve been trying to find a use case for it. There is none for a client. Well, not unless there is such a set-up on both ends and you can actually think of going to the opposite interface to request a transfer. The service works the way banks work internally to handle cheques. They receive a piece of paper and call the other institution to request a transfer. It makes sense when dealing with paper you carry around, but we’re talking about software here. I bet it was easy to implement.

I will have to visit them again tomorrow to try to solve this problem. Soon, I’ll have to bill them for wasting my time.