Who should care about database transactions?

This is something I have always found annoying. When dividing your application into a proper object model, or even a procedural model for that matter, you end up with situations where functions and methods perform multiple queries on the database, or make other calls that modify the data. Transactions exist to make sure all those modifications are cancelled if one query happens to fail. There are two ways to deal with transactions: rely on the parent call to handle the transaction, or handle it yourself. The first case breaks one of the basic purposes of the design by asking the parent to understand the details of the operation, and the second one will break as soon as the function is called from a context where a transaction is already active.

My solution to this is to let each function believe it is in control of the transaction, which may or may not be true. The idea is very simple: every call begins the transaction, but only the first one in the call tree really begins it. The following class does exactly this. Each scope willing to isolate a transaction instantiates a Transaction object. If the function terminates in an expected way, a call to commit() must be made. Otherwise, failure is assumed and the transaction rolls back, but only if the scope actually owns the transaction. Even commit() is a no-op if the scope does not own the transaction. As some of you might have noticed, this is strongly inspired by C++'s auto_ptr.

class Transaction
{
    /** True while any scope in the request holds the transaction. */
    private static $transactionActive = false;

    /** True only for the scope that actually began the transaction. */
    private $transactionOwner = false;

    function __construct()
    {
        // Only the first instance in the call tree really begins the
        // transaction; nested instances are passive.
        if( !self::$transactionActive )
        {
            self::$transactionActive = true;
            $this->transactionOwner = true;
            get_db()->beginTransaction();
        }
    }

    function __destruct()
    {
        // Reaching the destructor without a commit() means failure:
        // roll back, but only if this scope owns the transaction.
        if( self::$transactionActive && $this->transactionOwner )
        {
            self::$transactionActive = false;
            get_db()->rollBack();
        }
    }

    function commit()
    {
        // A no-op unless this scope owns the transaction.
        if( self::$transactionActive && $this->transactionOwner )
        {
            self::$transactionActive = false;
            get_db()->commit();
        }
    }
}

The get_db() function is a simple function to return the PDO object and lazy-load it as required.

I use this technique by throwing an exception whenever an error occurs. The controller level is responsible for catching all exceptions and logging or reporting them as required. Of course, some higher levels also catch exceptions and handle them, or re-throw them with a more specific type and message (such as converting PDO duplicate-entry errors to a readable message).

With this simple class, the entire system can ensure transaction isolation without having to worry about the context. I find it very helpful.
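For illustration, here is how a function might use it (a sketch: transfer_funds() and the queries are made up, and PDO is assumed to be in exception mode):

function transfer_funds( $from, $to, $amount )
{
    $transaction = new Transaction();

    // With PDO in ERRMODE_EXCEPTION, any failure below throws,
    // commit() is skipped and the destructor rolls back.
    $stmt = get_db()->prepare(
        "UPDATE account SET balance = balance + ? WHERE id = ?" );
    $stmt->execute( array( -$amount, $from ) );
    $stmt->execute( array( $amount, $to ) );

    // Reached only on success; a no-op when a parent scope owns
    // the transaction.
    $transaction->commit();
}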

Of course, it only works in PHP 5 as destructors are required.

Zend Framework 0.6.0, yet another preview

The Zend Framework has a strict roadmap and releases arrive exactly as planned. Mid-December, 0.6.0 was released. I had to wait a few days before looking into it, but I finally took some time. The first thing I had in mind was to get my sample application to run with the new version. For the MVC aspect, not much has changed since 0.2.0, except that the new stuff is now out of the incubator. The Zend_Controller_Front is a singleton again, as it was in 0.1.0, and the patch with setRequest() is no longer required. On that front, the release is a good one. On the other hand, getting the ACL code to work was not as easy, but that was to be expected, as it was never marked as stable. It went through some serious refactoring. One positive aspect: they got rid of the ARO and ACO acronyms in favour of Role and Resource, which makes the whole thing more readable. The downside is that they broke everything I liked about it.

From a flexible tool you could play around with and extend on the fly, it became a rock-solid, Java-like structure. You can no longer create resources along the way that simply extend the parent one. The fluent interface using the __get() magic method is gone. Worst of all, you actually have to create the instances yourself and build the tree manually. Resource names are now global, so there is no way to use the same word twice in different contexts. I did not do any extensive testing before, so maybe it was never possible, but now they made it obvious.

Seriously, if I am to give a name to something and be able to use that name afterwards, why would I even care about the class name used internally? Having to write Zend_Acl_Resource and Zend_Acl_Role is simply annoying, since there is no value in the instances they create. Check out the samples from the documentation.

  1. Introduction
  2. Refining
  3. Advanced

Hopefully, they will refactor this one again before the final release scheduled for May.

Of course, there are a lot of good aspects to the new release. Overall, I noticed a great improvement on the documentation. Improving the documentation is still part of the roadmap, but what is available already is more than anything I’ve seen for a class library. The framework is not even in a final release stage and translations are already in progress. Other than English, 10 translations are available (most of them partial so far, but still impressive).

Classes for authentication and session handling were also added. The only authentication adapter available at the moment is Digest, but I can imagine more coming in the future, like LDAP and other ‘standard’ protocols. Of course, you can create your own adapters. I don’t really get the point of the session handling classes, but they appear to offer additional features, like namespaces for values, and some options to simplify session security.

The class I find the most interesting is definitely Zend_Measure. It was available in the last release, but no documentation existed at the time other than the API documentation, and that does not really tell you what it’s all about. Over a year ago, I had to perform unit conversions in order to translate values in technical publications. At the time, I searched around for a library that could handle the conversions for me. Not that multiplications are hard, it’s just that I hate having to type in the conversion factors. I couldn’t find anything, so I had to type them in anyway. Most of it was a hack and I had no intention of ever releasing it. Now I see this package coming, and what it does is simply amazing. Not only does it handle conversions for a gazillion types of units, it will also find the units to convert inside strings and handle the locales.
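I have not run it yet, but from the documentation, a conversion looks something like this (a sketch; the exact class constants are my reading of the docs, so treat them as assumptions):

// Convert 100 meters to feet; the result comes back formatted
// for the current locale.
$length = new Zend_Measure_Length( 100, Zend_Measure_Length::METER );
echo $length->convertTo( Zend_Measure_Length::FEET );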

I still need to apply these in a real world context to see how good it really is, but it all looks very promising.

XML Schema vs RelaxNG

When I first read this quote by Tim Bray yesterday (yes, I know it was posted on Slashdot last week, I just happen to have a huge feed backlog and not much time to read), I was a little surprised. I have been using XML Schema for a while now and never had any problems with it. Of course, there is quite a lot of vocabulary to learn before you can write it, but it’s not that hard to read. After you have written one schema, you can simply start from it for the next ones, and half of the burden imposed by the syntax is gone.

W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs.

I never really bothered looking into Relax NG. I saw the name a couple of times, saw an XML sample once and figured it did the same thing as XML Schema, so I had no reason to bother. When I read this quote, I knew there was something I missed about it. Really, the XML syntax of Relax NG is not more readable. For some reason, I would even say I prefer the XML Schema xs:element tag with the minOccurs and maxOccurs. I find it more readable than those “oneOrMore” tags.

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <ref name="library"/>
  </start>
  <define name="library">
    <element name="library">
      <zeroOrMore>
        <ref name="book"/>
      </zeroOrMore>
    </element>
  </define>
  <define name="book">
    <element name="book">
      <element name="title">
        <data type="string">
          <param name="minLength">1</param>
        </data>
      </element>
      <oneOrMore>
        <element name="author">
          <text/>
        </element>
      </oneOrMore>
    </element>
  </define>
</grammar>

Then I figured out that this XML blurb could be generated from a much simpler syntax called Relax NG Compact. Seriously, I don’t even see why they made a non-compact form in the first place. The syntax is a lot like DTD syntax, without all the brackets and with more capabilities. It can even use the datatypes from XML Schema (which are in a separate specification) to perform some additional validation. I’ll let you guess what the following piece does. (Note: the XML above was generated from this sample.)

grammar {
	start = library

	library = element library { book* }
	book = element book {
		element title { xsd:string {minLength = "1"} },
		element author { text }+
	}
}

Not only is the syntax a lot simpler than anything else out there for schema definition, it also has better documentation. The RELAX NG Compact Syntax Tutorial is a very good place to start. I don’t know if it can do more than XML Schema, but it sure can do about as much in a more efficient fashion. There are two elements I couldn’t find an equivalent for: minOccurs and maxOccurs. Relax NG only supports “zero or one”, “zero or more” and “one or more”, so unless you define the minimum as required elements and fill up to the maximum with optional ones, there is no way to obtain the same behaviour.
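For example, requiring at least one and at most three authors (minOccurs="1" maxOccurs="3" in XML Schema terms) has to be spelled out with optional elements:

book = element book {
	element title { text },
	element author { text },
	element author { text }?,
	element author { text }?
}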

All the validation tools I came across use the XML syntax, so you need a way to convert. I used Trang, which was available from my distribution’s repository. Usage is very simple:

trang -Irnc -Orng in.rnc out.rng

Have fun.

Discovering the Zend Framework (0.2.0 Preview)

At ZendCon, I heard there was a new version of the Zend Framework being released. The version number is 0.2.0. That does not sound very stable, but I decided to give it a try anyway, something I did not do back when 0.1.0 was released. This was quite an adventure. The new release incorporates major changes in the MVC framework and quite a few other libraries. Most of the changes in the MVC section are meant to improve testability, or so I was told. Since the new changes may break a few things, they were placed in the incubator folder until the next major release. This seemed all right to me.

I downloaded the source code for the framework, added the incubator libs and the normal libs to the PHP include path (with the incubator first, so it takes priority) and started playing with it. Actually, I did not get to play for long before I ran into problems. It seems the new version relies on some PHP 5.2-only features. There was no warning about this anywhere, or if there was, I did not run into it. Still, this is acceptable. The Zend Framework is meant to be PHP 5 only and should switch to bleeding-edge technology before it gets too large a user base. Anyway, as a user of *ubuntu, the bundled version of PHP is 5.1.x, so I removed the incubator libs and decided to play with the old versions instead. I was told the changes were mostly internal, so I should be able to live without them and simply switch later.

MVC

I simply followed the documentation to build a skeleton to start from. Everything went very well. Documentation is one of the very strong aspects of the framework. The API documentation is a little poor (as in auto-generated from source code whose function headers were themselves auto-generated), but the end-user documentation is great for the most part. Some libraries are under-documented, but I guess this will be resolved in the near future.

Once I had the MVC framework set up, I decided to build a small form and handle some input, just to play with views and such. The Zend_View class is very simple, and yet very powerful. I never really liked template engines like Smarty and others. I think PHP does HTML display well enough, and this is the default behaviour of the View class. It does not do much more than call a PHP script and catch the output. It lets you assign parameters and read them from the template, but it really forces you to separate display from logic, and suggests good coding practices.


<?php if ($this->books): ?>
    
    <!-- A table of some books. -->
    <table>
        <tr>
            <th>Author</th>
            <th>Title</th>
        </tr>
        
        <?php foreach ($this->books as $key => $val): ?>
        <tr>
            <td><?php echo $this->escape($val['author']) ?></td>
            <td><?php echo $this->escape($val['title']) ?></td>
        </tr>
        <?php endforeach; ?>
        
    </table>
    
<?php else: ?>
    
    <p>There are no books to display.</p>
    
<?php endif; ?>

I took this code sample straight from the documentation. I always liked this alternate syntax when writing conditions or loops in HTML. It simply looks clean. But the very nice part about this template is the special escape() method used to escape output. No need to ask yourself which function to use to escape the information. Actually, you can still ask yourself that question, because the function used internally by escape() can be chosen. I think this is a very good decision in the long run. With all the security alerts that come up, if one ever affects whatever function is being used, that function can be redefined to handle the specific case, and the entire code base is fixed at once.
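Swapping the escaping function is a one-liner. A sketch, with the caveat that setEscape() is my best guess at the method name from the API documentation:

// Replace the default escaping function with any callback.
$view->setEscape( 'htmlentities' );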

You probably noticed that the values were accessed as attributes. The Zend_View class allows you to define properties on the fly. I think it simply looks more elegant than calling an assign() method everywhere with a key and a value. The view actually does even more: it manages display helper functions for you. All you have to do is indicate the path where you store your helpers, and you will be able to call your own functions in a fashion similar to the call to escape().

Defining the paths to controllers, views and helper functions makes for quite a lot of configuration. It does not matter at all, because all those things can be initialized in your index page, objects can be stored for later access using the Zend::registry() function, and you no longer need to touch any configuration. This is a very nice aspect of the Zend Framework: everything is independent. Once you’ve set up the __autoload() function, you no longer need to worry about including files, paths to scripts or anything. The right controller will be called, the view is managed and everything in between is magic.

Input filtering

After playing with the views, I decided to handle some input. Naturally, I looked at the Zend_Filter_Input class. This was probably the worst decision I made that day. I wasted a whole lot of time before figuring out that the testEmail() method was not implemented, which was totally undocumented. I decided to fall back to the ext/filter extension, which was not installed by default with 5.1.x and which I could not get installed using pecl. I had to install 5.2.0, so I did.

0.2.0 incubator

With the incubator issue resolved, I reactivated the include path for it. Bad surprises there. Very bad. Nothing was working any more. Even if most of the changes were supposedly internal, it seemed quite a few changes were required in the initialization phase. The worst part was that none of the samples in the incubator documentation had been updated to reflect those changes. I somehow had the feeling the release was pushed out early for ZendCon. So I searched the web for some documentation. The 0.2.0 version had been out for quite a few days by then; someone, somewhere must have figured out how to solve this. I simply didn’t feel like searching for samples in the unit tests or reading too much API documentation to find the changes.

I found this piece of code on some website (I removed some information that could help identify it, because really, it’s a shame it was ever published). The article actually carried the mention ‘updated for 0.2.0’, and they seemed to be proud of it. Actually, they might have used the 0.2.0 release without the incubator packages, which really makes the MVC the same as in 0.1.0.

<?php

require_once 'Zend.php';

Zend::loadClass('Zend_Controller_Front');
Zend::loadClass('Zend_Controller_RewriteRouter');

$controller = Zend_Controller_Front::getInstance();
$router = new Zend_Controller_RewriteRouter();
$router->setRewriteBase('/');
$controller->setRouter($router);
$controller->setControllerDirectory('../app/controllers');

// View init
Zend::loadClass('Zend_View');
$view = new Zend_View;
$view->setScriptPath('../app/views');
Zend::register('view', $view);

$controller->dispatch();
?>

So, what was wrong with it? Well, it looked pretty much like the code I previously had. I can tell you, it had nothing to do with 0.2.0. The author surely did not try to run the samples. First, Zend_Controller_Front::getInstance() no longer existed: the front controller was now a simple class that had to be instantiated like any other object. Same for setRewriteBase(), it’s gone.

The other major change in 0.2.0 is that path routing is now independent from the controller/action mechanism. In the previous release, a path like http://www.example.com/foo/bar would have called barAction() on the FooController class. Additional parameters could be passed in the URI using /foo/bar/key1/value1/key2/value2/, which is quite ugly. The new routing mechanism allows you to define custom paths using a syntax like ‘article/:year/:id’, where elements starting with a colon are variables. Using the previous example, http://www.example.com/article/2006/10 would be a valid URL and would send the ‘year’ parameter as 2006 and ‘id’ as 10 to the selected controller/action. So extra steps are actually required to bring a compatibility layer for the previous behaviour. This is what the code should look like:

<?php

include 'Zend.php';

function __autoload($class)
{
    Zend::loadClass($class);
}

$view = new Zend_View;
$view->setScriptPath('../app/views');

Zend::register('view', $view);

$router = new Zend_Controller_RewriteRouter();
$router->addRoute('compat', new Zend_Controller_Router_Route(
	':controller/:action',
	array( 'controller' => 'index', 'action' => 'index' ) ) );

$front = new Zend_Controller_Front;
$front->setControllerDirectory('../app/controllers');
$front->setRequest( new Zend_Controller_Request_Http(
	'http://www.example.com' . $_SERVER['REQUEST_URI'] ) );
$front->setRouter($router);

$front->dispatch();

?>
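For completeness, the article route from the earlier example would presumably be registered through the same API as the compatibility route above (a sketch; the controller and action names are made up):

$router->addRoute('article', new Zend_Controller_Router_Route(
	'article/:year/:id',
	array( 'controller' => 'article', 'action' => 'view' ) ) );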

I had to spend quite a while reading API documentation, unit tests and bug reports to figure out a routing problem. The line with the setRequest() call really should not be required. The framework should figure this out by itself, just like it did in the previous version, but without it, there is no way to get anything routed to anything other than the default controller/action.

Other than this problem with the preview 0.2.0 release, the Zend Framework is really promising. The MVC pattern is highly inspired by Ruby on Rails, there is no doubt about it. In fact, the framework documentation refers to the RoR documentation in the section about routing. There is a whole lot more to the framework than MVC though; I will get to the other components another day.

Alternative PHP Cache

APC has been around for quite a while. Mostly known as an opcode cache for PHP, it can do quite a few other things, provided you have a certain level of control over the web server’s configuration. The extension makes it easy to store and retrieve values from shared memory: a simple call to apc_store() and the value is ready to be read by another request. Even storing results in a file is more complex than using the APC extension.

if( !$result = apc_fetch( 'my_result' ) )
{
    $result = $dbh->query( "SELECT ..." )
        ->fetchAll( PDO::FETCH_ASSOC );

    apc_store( 'my_result', $result, 3600 );
}

The piece of code above transparently fetches the result from the cache, or generates it if required and stores it for an hour. Sure, it takes a few more lines than a simple query, but it’s not that complicated. The results are terrific. I used it on a query that takes one or two seconds to execute, which is not that long, but the delay is just enough to be annoying on an intranet where users expect page loads to be instant. With the cache, the page load time goes back to instant, except the very first time the query is executed.

It can just as easily be used to cache queries with a parameter.

$key = "my_result_$id";
if( !$result = apc_fetch( $key ) )
{
    $result = $dbh->query( "SELECT ... $id ..." )
        ->fetchAll( PDO::FETCH_ASSOC );

    apc_store( $key, $result, 3600 );
}

I actually use 0 as my cache duration, which means a value is destroyed only when the web server is restarted (never, right?). Some of these results rarely change. A five-minute cache would keep the data up to date when it changes, but it removes the whole advantage for an application that only receives a few actions per minute. A larger value, such as one hour, does not make sense either, because changes would not appear fast enough. So I decided on an infinite cache, cleared manually when needed.

The cache can be cleared manually using apc_clear_cache(), but I don’t really like removing all values at once. If I perform an operation that requires the cache to be cleared for a given set of results, I don’t want the entire cache of the application to be destroyed. I use a function that looks like this one:

function clear_cache( $prefix )
{
    // Walk the user cache entries and delete every key that
    // starts with the given prefix.
    $list = apc_cache_info( 'user' );
    foreach( $list['cache_list'] as $row )
        if( strpos( $row['info'], $prefix ) === 0 )
            apc_delete( $row['info'] );
}

clear_cache( 'my_result' );

Even if you don’t need an opcode cache, it’s worth taking a look at APC.

Nested sets * Large amount of data = Need more RAM


When I say more, I mean a lot more. A terabyte would probably do it. I got to play with nested sets quite a lot recently and they are really amazing. The relations you can pull out of them are simply incredible. For those who have no idea what a nested set is, I highly recommend reading Mike Hillyer’s article on the topic.

I was working with fairly large sets, a little over 100 nodes. Queries were so fast I could barely notice the page reloading, and those queries used an average of 10 joins. The problems really came when I got the great idea of importing a truck’s spare part list (yes, it’s large) into the nested set. Going from hundreds to thousands of nodes really makes nested sets SLOW. A simple listing with depth values would take around 40 seconds, and that’s only a 2-table join.
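For reference, the depth listing is the classic self-join from Hillyer’s article, sketched here with his nested_category table and the get_db() helper mentioned earlier:

// Every node joins against all of its ancestors, so the intermediate
// result grows quickly with the size of the tree.
$sql = "SELECT node.name, COUNT(parent.name) - 1 AS depth
        FROM nested_category AS node
        JOIN nested_category AS parent
          ON node.lft BETWEEN parent.lft AND parent.rgt
        GROUP BY node.name
        ORDER BY node.lft";

foreach( get_db()->query( $sql ) as $row )
    echo str_repeat( '  ', $row['depth'] ), $row['name'], "\n";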

I made an attempt to run one of the more complex queries overnight to see how long it actually took to process. I never found out. When I got back to work this morning, the computer’s fans were still at full speed, filling the room with a loud noise, and the display on the monitor had gone crazy. The display problems turned out to be a loose connector; my guess is that all the vibrations caused some trouble. I double-checked the indexes and ran EXPLAIN to verify that all the conditions were processed correctly. Everything was in order. I then made some quick calculations and found out that the joins generated a few billion rows, each containing quite a few columns. At that point, I realized I had to find another solution.

Luckily enough, I didn’t really need the hierarchical data about the components.

Efficient use of Statements


PHP 5 and MySQL 4.1 bring prepared statements to the masses. Various database extensions supported prepared statements already, but those databases were not quite as popular as MySQL and the feature remained obscure. PHP 5 arrives with the new MySQLi extension for MySQL 4.1 and above, which supports prepared statements. In addition, PDO offers a unified API for databases and has PDOStatement objects (created by PDO->prepare()).

Using statements has the advantage of reducing the amount of data transferred to the server: the query is sent once and only the data follows afterwards. In the process, any work required to escape the values is handled by the bind process (which is a huge time saver). Of course, the basic rule is to reduce the number of queries to the database as much as possible, but in the real world, maintenance tasks often require a lot of processing, and UPDATE queries are not very flexible when multiple rows need to be updated at once.
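The basic pattern with PDO looks like this (a sketch; the table and the $new_prices array are made up):

// Prepare once, execute many times; the driver handles the escaping.
$stmt = $dbh->prepare(
    'UPDATE parts SET price = :price WHERE id = :id' );

foreach( $new_prices as $id => $price )
    $stmt->execute( array( ':price' => $price, ':id' => $id ) );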

Continue reading “Efficient use of Statements”

Dealing with data

I have often had to deal with converting data from various formats to others. Most of the time, the task is trivial and only requires writing a single script that does the job. It might be converting data from an old system to a new one, allowing different tools to perform analysis on it, or simply importing data from an external entity. In most cases, errors can have huge implications. Since conversions don’t occur very often, it’s usually wise to waste a few CPU cycles making sure the incoming data is right and the output makes sense.

In a perfect world, you decide what the input format is and what the content is. In the real world, that’s not always possible due to restrictions in the capabilities of the exporting application. The worst cases cause information to be missing or inaccessible due to bad references. It’s not uncommon to see data scattered across multiple applications, and some tasks might require access to multiple sources. Accessing multiple databases or gathering information from multiple files should not be a technical problem. The problems come with inconsistencies in the data. I have seen systems where the same entry is not referenced by the same number across systems, or where the numbering conventions simply changed with time and were never documented, when the task was precisely to make the user’s work more efficient when accessing the various systems.

In those cases where the conversion can’t be 100% automated due to the lack of information, user input is required to perform parts of the conversion and resolve ambiguities. Users hate it, and users usually hate the developers for it. The only way to make it better is to make the interface as friendly as possible. Reducing the amount of clicking and thinking required is a must. Using color highlights, suggesting possibilities and setting the most likely option as the default value usually helps the process (in the case of a web interface, JavaScript for validation or common tasks can be added). Sitting down with an expert user for a few hours to figure out what the common rules are and coding them into the system can work miracles. When 100% automation can’t be reached, the goal should remain to get as close to it as possible.

A good trick to avoid further problems is to write good user documentation (yet another task most developers hate). Documentation is the only protection a developer has against user errors and fake bug reports. The documentation shouldn’t only explain the different interface details of the application, but also its scope: what it can do, what it can’t do and why it exists. A simple RTFM will then solve most problems.

Even more features!


I have been using VIM as my primary editor for as long as I can remember. Still, I keep finding new features that make it even more efficient. I actually had enough time today to dig into the documentation. I guess I now have an argument against all those who claim GUI editors are better than VIM simply because they have class navigation.

With folding, VIM already offered a great overview of classes. Today, I found out there is also a hidden “go to definition” feature. Using ctags (exuberant-ctags, actually), a simple Ctrl-] while over a token such as a function or class name teleports you to its definition. Using Ctrl-W Ctrl-] (as a sequence) opens a split buffer with the destination in it. Once you’re done looking at the actual code, Ctrl-T brings you back exactly where you were. The process works as a stack: you can navigate through multiple definitions and Ctrl-T takes you back to the previous one until the stack is empty. As always, everything is lightning fast. The search uses a ctags file located in the current directory, generated with your favorite ctags implementation (`exuberant-ctags -R` worked just fine for me).
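Summed up, the whole workflow takes a handful of keystrokes (a quick cheat sheet of my setup, with the tags file at the project root):

:!exuberant-ctags -R .    generate the tags file from within VIM
Ctrl-]                    jump to the definition under the cursor
Ctrl-W Ctrl-]             open the definition in a split buffer
Ctrl-T                    pop back to where you came from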

Exuberant-ctags supports 33 languages (including all those you really need) and is integrated with multiple open source editors out there (including vim, emacs, nedit, …). From what I understand, it seems to have started as a sub-project of VIM.

As if that was not enough, ‘gf’ over a file name loads it, and :find file_name searches for file_name in all known paths (which can be changed using :set path). These can be useful, but I still prefer switching files using buffer numbers.

Ready for XForms


I just stumbled on a page dedicated to XForms on php.net. It’s pretty much only information, as nothing special is required, but the method is still interesting. XForms sends its data as raw XML in the POST body of the HTTP request. If always_populate_raw_post_data is set to on (which is not the default), the data can be accessed from $HTTP_RAW_POST_DATA. However, this setting causes PHP to always populate a variable with data that could be large, which can slow things down. Instead, you can read the data from the input stream when required. Once you have the raw data, you can simply use your XML parser of choice to read the content.

<?php
// php://input exposes the raw POST body on demand; php://stdin
// only works from the command line, so it is the wrong stream here.
$fp = fopen( "php://input", "r" );
$data = '';
while( !feof( $fp ) )
    $data .= fgets( $fp );
fclose( $fp );
?>
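From there, any parser will do. For instance, with SimpleXML:

$xml = simplexml_load_string( $data );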

Even if XForms is not currently supported in browsers, it’s already used by multiple applications and should soon become an important standard in application communication. Of course, the traditional URL-encoded POST data does the job, but XML simply opens more possibilities, such as validation against an XML Schema. It’s also a lot easier to implement for a common application: all languages have XML libraries ready to use.

As a side note, I also noticed a section on security which gathers the most important concepts to be aware of during development. It seems like a response to the multiple articles on security that have been written recently. Having all that information centralized can’t be a bad thing.