Upcoming events

This summer, I will have my largest event line-up around a single theme. None of which will be technical! It will begin on June 25th with RecentChangesCamp (RoCoCo, to give it a French flavor) in Montreal. I first attended that event the last time it was in Montreal and again last year in Portland. It’s the gathering of wiki enthusiasts, developers, and pretty much anyone who cares to attend (it’s free). The entire event is based around the concept of Open Space, which means you cannot really know what to expect. Both times I attended, it had a strong local feel, even though the event moves around.

Next in line is WikiSym, which will be held in Gdańsk (Poland) on July 7-9th. I also attended it twice (Montreal in 2007, Porto 2008). I missed last year’s in Orlando due to a schedule conflict. WikiSym is an ACM conference, making it the most expensive wiki conference in the world (still fair, by other standards). Unlike the other ones which are more community-driven, this one is from the academic world (you know it when they refer to you as a practitioner). Most of the presentations are actually paper presentations. Because of that, attending the actual presentations is not so valuable as the entire content is provided as you get there. It’s much better to spend time chatting with everyone in the now-tradition Open Space. It really is a once per year opportunity to get everyone who spent years studying various topics around wikis from all over the world. Local audience is almost absent, except for the fact that the event tends to go to places where there is a non-null scientific wiki community.

Final stop will be WikiMania, at the exact same location as the previous one until July 11th. I really don’t know what to expect there. I never attended the official WikiMedia conference. However, it has a fantastic website with tons of relevant information for attendees. It probably has something to do with it being an open wiki and being attended by Wikipedia contributors.

I will next head toward Barcelona for a mandatory TikiFest. However, I don’t really consider this to be in the line-up as it’s mostly about meeting with friends.

That is three events on wikis and collaboration. Wikis being the simplest database that could possibly work, what could require 8-9 days on a single topic? It turns out the technology does not really matter. Just like software, writing is not hard. Getting many people to do it together is a much bigger challenge. Organizing the content alone to suit the needs of a community is challenging. Because the structure is so simple, it puts a lot of pressure on humans to link it all together, navigate the content and find the information they are looking for.

Categories: General, Wiki Tags:

More information overload please

A few months back, Microsoft made a presentation on OData at PHP Quebec. While I found the format interesting at first, with the way you can easily navigate and explore the dataset, I must admit I was a bit skeptical. After all, public organizations handing out their data to Microsoft does sound like a terrible idea. While a lot of that data will be hosted on Microsoft technologies, the format remains open, and it appears to be picking up.

I guess what was missing originally for me was a real use case for it. The example at the presentation used a sample database with products and inventory. Completely boring stuff. Today, at Make Web Not War, I had a conversation with Cory Fowler (with Jenna Hoffman sitting close by) who has been promoting OData for the city of Guelph. I got convinced right away that this was the right way to go. Not necessarily, OData as a format, but opening up information for citizens to explore and improve the city. If OData can emerge as a widespread standard to do it, it’s fine by me. The objective is far away from technology. In fact, when I look at it, I barely see it. It’s about providing open access to information for anyone to use. How they will use it is up to them.

The conference had a competition attached to it. Two projects among the finalists were using OData. One of them created a driving game with checkpoints in the city of Vancouver. They simply used the map data to build the streets and position buildings. That is a fairly ludicrous use of publicly available information, but still impressive that a small team could build a reasonable game environment in a short amount of time. The other project used data from Edmonton to rank houses based on the availability of nearby services, basically helping people seeking new properties to evaluate the neighborhood without actually getting away from their computer.

This is only the tip of the iceberg. The data made available at this time is mostly geographical. Cities expose the location of the various services they offer. The uses you can make out of it are quite limited. I’ve seen other applications helping you locate nearby parks or libraries. Sure, knowing there is a police station nearby is good, but there could be so much more. What we need is just more data: crime locations, incident reports, power usage, water consumption. Once relevant information is out there, small organizations and even businesses will be able to use it to find useful information and track it over time. At this time, a lot of the data is collected but only accessible by a few people. Effort duplication occurs when others attempt to collect it. Waste. Decisions are made based on poor evidence.

So there is information out there for Vancouver, Edmonton, even Guelph. Nothing about Montreal. Nothing in the province of Quebec that I could find. I think this is just sad.

Actually, if there is anything out there, it might be very hard to find. Even if there are these great data sources available openly, it remains hard to find them. There is no central index at this time. Even if there were, the question remains of what should go in it. Official sources? Collaborative sources? Not that there is anything like that, but consider people flagging potholes on streets with their mobile phones as they walk around. Of course, accuracy would vary, but it would serve as a great tool for the city employees to figure out which areas should become a priority. There are so many opportunities and so many challenges related to open data access. I don’t think we are fully prepared for the shift yet.

Categories: General Tags:

Only the words change

Amazon has brought me back to 1975 and the Mythical Man-Month. It had been on my reading list for quite a while, but at some point around two years ago, it became unavailable. After that, it sat on a shelf for a few months until the stack got down to it. I must say, skip a few technical details and this book could very well have been written last year. After all, in 1975, structured programming (that is, conditions and loops) was a recent concept and not widely adopted. Surprisingly, Brooks knew a whole lot about software development, testing and management. I have the feeling we have learned nothing since it was written. Concepts were only refined, renamed and spread out. As far as I can tell, just a few paragraphs in chapter 13 lays out the founding concepts of TDD.

Build plenty of scaffolding. By scaffolding, I mean all programs and data built for debugging purposes but never intended to be in the final product. It is not unreasonable for the to be half as much code in scaffolding as there is in product.

One form of scaffolding is the dummy component, which consists only of interfaces and perhaps some faked data or some small test cases. For example, a system may include a sort program which isn’t finished yet. Its neighbors can be tested by using a dummy program that merely reads and tests the format of input data, and spews out a set of well-formatted meaningless but ordered data.

Another form is the miniature file. A very common form of system but is misunderstanding of formats for tape and disk files. So it is worthwhile to build some little files that have only a few typical records, but all the descriptions, pointers, etc.

[...]

Yet another form of scaffolding are auxiliary programs. Generators for test data, special analysis printouts, cross-reference table analyzers, are all examples of the special-purpose jigs and fixtures one may want to build.

[...]

Add one component at a time. This precept, too, is obvious, but optimism and laziness tempt us to violate it. To do it requires dummies and other scaffolding, and that takes work. And after all, perhaps all that work won’t be needed? Perhaps there are no bugs?

No! Resist the temptation! That is what systematic system testing is all about. One must assume that there will be lots of bugs, and plan an orderly procedure for snaking them out.

Note that one must have thorough test cases, testing the partial systems after each new piece is added. And the old ones, run successfully on the last partial sum, must be rerun on the new one to test for system regression.

Does it sound familiar? I see test cases, test data, mock objects, fuzzing and quite a lot of things we hear about these days. Certainly, it was different. They had different constraints at the time, like having to schedule to get access to a batch-processing machine. There is some discussion about interactive programming and how it would speed up the code and test cycles.

I find it impressive given that they had so little to work with. I wasn’t even born when they figured that out.

Because the experience is based on system programming for an operating system to be ran on a machine built in parallel, there is a strong emphasis on top-down design, which is the most important new programming formalization of the decade, and requirements. To me, the word requirement is a scary one. I don’t do system programming and for what I do, prototyping and communication does a much better job. However, I found the take interesting.

Designing the Bugs Out

Bug-proofing the definition. The most pernicious and subtle bugs are system bugs arising from mismatched assumptions made by the authors of various components. The approach to conceptual integrity discussed above in Chapters 4, 5 and 6 addresses these problems directly. In short, conceptual integrity of the product not only makes it easier to use, it also makes it easier to build and less subject to bugs.

So does the detailed, painstaking architectural effort implied by that approach, V. A. Vyssotsky, of Bell Telephone Laboratories’ Safeguard Project, says, “The crucial task is to get the product defined. Many, many failures concern exactly those aspects that were never quite specified.” Careful function definition, careful specification, and the disciplined exorcism of frills of function and flights of technique all reduce the number of system bugs that have to be found.

Testing the specification

Long before any code exists, the specification must be handed to an outside testing group to be scrutinized for completeness and clarity. As Vyssotsky says, the developers themselves cannot do this: “They won’t tell you they don’t understand it; they will happily invent their way through the gaps and obscurities.”

Beyond the punch line, this does call for very detailed specifications. It felt to me that those were in-retrospect comments. I don’t think it was ever made that the specification were fully detailed enough for bugs to be driven out. If it had been, you would end up with an issue introduced in the previous chapter: “There are those who would argue that the OS/360 six-foot shelf of manuals represents verbal diarrhea, that the very voluminosity introduces a new kind of incomprehensibility. And there is some truth in that.”

Very detailed specifications, just like exhaustive documentation, will reach a point where it does not bring value because no one can get through all of it in a reasonable time frame. Attempting to find inconsistencies in a few pages of requirements is possible. Not in thousands of pages. The effort required is just surrealist. Sadly, following the advice of strong and detailed requirements, the world of software development sunk in waterfall in the years that followed. For all the great insight in the book we collectively realized decades later, this one got too influential.

Imagine if instead, the industry had used the advice of smaller, more productive teams when possible, higher-level programming languages and extensive scaffolding as the primary advice of the book where software development would be today.

Categories: Uncategorized Tags:

Branching, the cost is still too high

Everyone’s motivation to move to distributed version control systems (DVCS) was that the cost of branching was too high with Subversion. Part of it is true, but even with DVCS, I find the cost of branching to be too high for my taste. I can create feature branches for branches of a decent size, but I think traceability needs even even more granularity.

Let’s begin by listing my typical process to handle feature branches these days.

  1. Branch trunk from the repository to my local copy
  2. Copy configuration files from an other branch
  3. Make minor changes
  4. Run scripts to initialize the environment.
  5. Develop, commit, pull, merge – all of this is great
  6. Push to trunk

My problem is that dealing with those configuration files takes too much time and that is still troublesome. However, there is no real way around it. The application needs to connect to MySQL, Gearman, Sphinx and Memcached. On development setups, they are all on the same machine. Still because I am way too lazy to create new database instances and I often don’t change my prefixes as much as I should, I end up with multiple branches sitting there with only one really usable at any time. Of course, it would all be solved if I were more disciplined, but if it annoys me, it prevents me from doing it right. Just having to do the configuration part encourages me to re-use branches.

The goal of fine-grained branches is to represent the decision-making process as part of the revision control. The way I see it, top level branches represent a goal. It could be implementing a new feature, enhancing a piece of the user interface or anything. However, to reach those top level objectives, it may be required to perform some refactoring or upgrade a library. If those changes are made atomically through a branch and merged as a single commit, there would be ways to look at the hierarchy of commits to understand the flow of intentions. Bazaar can generate graphs from forks and merges. I can imagine tools to help traceability if the decision making is organized in the branch structure.

Why traceability you might ask. For many things that don’t seem to make sense in code, there is a good historical reason (unless it’s due to accidental complexity). Even in my own code written a few months prior, I find places that need refactoring. Most of the time, it’s simply because I was trying to look too far ahead at the time. I was anticipating the final shape of the software, but by the time it got there, new and better ways to achieve the same result had been implemented, leaving legacy behind. When this happens to be in my own code, I can think about the process that led to it, figure out what the original intention was and decide how the design should be adapted to the new reality. When the code is written by someone else, the original intention can only be guessed. I hope creating a hierarchy of branches can provide an outline of the thought process that would explain the decisions made.

My Subversion reflexes pointed me towards bzr switch. It brings a change to the way I got used to work with a DVCS.  My transition was to switch the concept of working copy to branching. Check-outs simply had no use. I was wrong. They can actually fix my issue of configuration burden. If I keep a single check-out of the code that is configured for my local environment, I can then switch it from one branch to an other. Because we are in the distributed world, those other branches can be kept locally, just not in the working copy. The process then changes.

  1. Create a new branch locally
  2. Switch the check-out to the new branch
  3. Develop, commit, pull, merge
  4. Switch check-out to parent branch
  5. Merge local branch

Of course, if changes happen in the configuration files outside of what was locally configured or the schema changes, this has to be dealt with, but I expect this to be much less frequent.

The next step will be to rebuild my development environment in a smarter way. Right now, I have way too many services running locally. I want to move all of those to a virtual machine, which I will fire up when I need them. For this step, I am waiting for the final release of Ubuntu 10.04, and probably a few more weeks. In the past, I had terrible experiences with pre-release OS and learned to stay away, no matter how fun and attractive new features are. It also means re-installing my entire machine, so I don’t look so much towards that. It should be easier now that almost everything is web-based, as long as I don’t loose those precious passwords.

Using virtual machine to keep your primary host clean of any excess is nothing new. I guess I did not do it before I though my disk space was more limited than it is. My laptop has a 64G SSD drive. It was a downscale from my previous laptop’s drive, which was continuously getting full. Too many check-outs, database dumps, log files. They just keep piling up over the years. It turns out the overhead of having an extra operating system isn’t that bad after all.

The good thing about virtual machines is that they are completely disposable. You can build it with the software you need, take a snapshot and move on from there. Simply reverting back to the snapshot will clean up all the mess created. Only one detail to keep in mind: no permanent data can be stored in there. I will keep my local branches on the main host and the check-out in the virtual machine. Having a shell on a virtual machine won’t make much of a difference than a shell locally.

Categories: Programming Tags:

Improving rendering speed

Speed is a matter of perception. We’d like to believe it’s all due to computational power or the execution speed of queries. There are barriers that should not be crossed, but in most case, getting your application to behave correctly while the user is waiting will improve the perspective. Improving the rendering speed is a good step and tweaking a few settings will improve perception more than trimming off milliseconds from an SQL query.

A now classic example of the effects of perception is the one of progress bars. When moving forward at different rates, even though the total time remains the same, will give the impression of being shorter or longer.

Fiddling with HTTP headers is actually very simple and will help lower the load on your server too. A hit you don’t get is so much faster to serve. Both Yahoo! and Google turned this optimization pain into a game by providing scores to increase. If you are not familiar with them, consider installing YSlow and Page Speed right away.  Now, if you’ve never used them before, chances are running it on your own website will provide terrible scores. Actually, running it on most of the websites out there provides terrible scores.

Both of them will complain about a few items:

  • Too many HTTP requests.
  • Missing expires headers
  • Uncompressed streams
  • Unminified CSS and Javascript
  • Recommend use of CDN

Fewer files

Now, the too many HTTP requests are likely caused by those multiple JavaScript and CSS files you include. The JavaScript part is very simple. All you have to do is concatenate the scripts in the appropriate order, minify them and deliver it all as a single file. There are good tools out there to do it. Depending on how you deploy the application, some may be better than others. I’ve used a PHP implementation to do it just in time and cached the result as a static file, and used a Java implementation as part of a build process. I find the later to be a better option if it is possible.

This is easy enough for production environments, but it really makes development a pain. Debugging a minified script is not quite pleasant. In Tikiwiki, this simple became an other option. In a typical Zend Framework environment, APPLICATION_ENV is a good binding point for the behavior. Basically, you need to know the individual files that need to be served. If in a development environment, serve them individually. In a production or staging environment, serve the compiled file (or build it JIT if building is not an option).

Unless you live with an application that has been shielded from the real world for a decade, it’s very likely that most of the code you use was not written by you. It comes from a framework. You can skip those altogether by not distributing them at all. Google provides a content delivery network (CDN) for those. Why is this faster? You don’t have to serve it, and your users likely won’t have to download it. Because the files are referenced by multiple websites, it’s very likely that they downloaded it and cached it locally in the past. They also serve the standard CSS files for JQuery UI (see bottom right corner), although that’s not quite as well indicated (you should be able to find the pattern).

Both of the minify libraries mentioned above also do the CSS minification. However, this is a bit more tricky as you will need to worry about the relative paths to images and imports of other CSS files.

The final step is to make sure all the CSS is in the header and the JavaScript at the bottom of the page.

Web server tuning

Now that the amount of files is reduced, your scores already improved significantly, an other class of issues will take over. Namely compression, expiry dates and improper ETags. The easiest to set-up is the compression. You will need to make sure mod_gzip or mod_deflate is installed in Apache. It almost always is. Everything is done transparently. All you need to do is make sure the right types are set. It can be done in the .htaccess file. Here is an example for mod deflate.
<IfModule deflate_module>
AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css application/javascript
</IfModule>

Use firebug to see the content type of all files YSlow is still complaining about and add them to the list.

An other easy target is the ETag declaration. In most installs, Apache will generate an ETag for static files. ETags are a good idea. The browser remembers the last ETag it received for a given URI and requests it back asking if it changed. The server compares it and either sends 304 to indicate it was not modified or the new version. The problem is that your server still gets a hit. You’re better off not having them at all.
<FilesMatch "\.(js|png|gif|jpg|css)$">
FileEtag None
</FilesMatch>

Expiry headers are a bit more tricky. When those occur in your scripts, you have to deal with them. Setting an expiry date means accepting that your users might not see the most recent version of the content because they won’t query your server to check. These may not be easy decisions to make.

However, static files are much easier to handle. You will need mod_expires in Apache, which is not quite as common as the compression counterpart. The goal is just to set an arbitrary date in the future. Page Speed likes dates further than a month away. YSlow seems to settle for 2 weeks. The documentation uses 10 years. It should be far enough.
<FilesMatch "\.(js|png|gif|jpg|css|ico)$">
ExpiresActive on
ExpiresDefault "access plus 10 years"
</FilesMatch>

Cookies

Your website most likely uses a cookie to track the session. They are great for your PHP scripts that need them to track who’s visiting, but they also happen to be sent to static files as well because the browser does not know it makes no difference. Cookies alter the request and cause confusion to intermediate caches or whenever the cookies change, like when you change the session id to avoid session hijacks.

The easiest way avoid those cookies from being sent to the static files is to place them on a different server. Luckily, browsers don’t really know how things are organized on the other hand, so just using a different domain or sub-domain pointing to the exact same application will do the trick. If you have more load, you might want to serve them with a different HTTP server altogether, but that requires more infrastructure. It should be easy to push JavaScript and CSS to the other domain. Reaching the images will depend on the structure of your application. You will thank those view helpers if you have any.

If you serve some semi-dynamic files through that domain, make sure PHP does not start the session, otherwise, all this was futile.

You can then configure YSlow’s CDN list to include that other domain and the google CDN, and observe blazing scores. To modify the configuration, you need to edit Firefox preferences. Type about:config in the URL bar, say you will be careful, search for yslow and modify the cdnHostnames property to contain a comma separated list of domains.

One more

By default, PHP sends a ridiculous Cache-Control header. It basically asks the browser to verify for a new version of the script on every request. When you user presses back, you get a new request, and he will likely loose local modifications in the form. Not really nice, and one too many hit on your server. Setting the header to something like max-age=3600, must-revalidate, will resolve that issue and make navigation on your site look so much faster.

These items should cover most of the frequent issues. Both tools will report a few minor issues, some which may be easy to fix, some not so much. Make those verifications part of the release procedure. A new type may get introduced in the application and cause less than optimal behaviors due to the lack of a few headers. It may not be possible to get a perfect score on all pages of a site, but if you can cover the most important one, your users may believe your site is fast, even though you use a heavy framework.

Categories: Programming Tags:

A bit too much

Well, Confoo is now over. That is quite a lot of stress off my shoulders. Overall, I think the conference was a large success and opened up nice opportunities for the future. Over the years, PHP Quebec had evolved to include more and more topics related to PHP and web development. This year was the natural extension to shift the focus away from PHP and towards web, including other programming languages such as Python and sharing common tracks for web standard, testing, project management and security. Most of the conference was still centered around PHP and that was made very clear on Thursday morning during Rasmus Lerdorf’s presentation (which had to be moved to the ballroom with 250-300 attendees, including some speakers who faced an empty audience), but hopefully, the other user groups will be able to grow in the next year.

Having 8 tracks in parallel was a bit too much. It made session selection hard, especially since I always keep some time for hallway sessions. I feel that “track” lost quite a lot of participants this year compared to the previous ones.

For my own sessions, I learned a big lesson this year. I should not bite more than I can chew. It turns out some topics are much, much, harder to approach than others. A session on refactoring legacy software seemed like a great idea, until I actually had to piece together the content for it. I had to attempt multiple ways to approach the topic and ended up with one way that made some sense to me, but very little to the audience it seems. I spent so much time distilling and organizing the content that I had very little time to prepare the actual presentation for it. What came out was mostly a terrible performance on my part. I am truly sorry for that.

Lesson of the year: Never submit topics that involve abstract complexity.

The plan I ended up with was a little like this:

  • Explain why rewriting the software from scratch is not an option. Primarily because management will never accept, but also because we don’t know what the application does in details and the maintenance effort won’t stop during the rewrite.
  • Bringing a codebase back to life requires a break from the past. Developers must sit down and determine long term objectives and directions to take, figure out what aspects of the software must be kept and those that must change, and find a few concrete steps to be taken.
  • The effort is futile if the same practices that caused degradation are kept. Unit testing should be part of the strategy and coding standards must be brought higher.
  • The rest of the presentation was meant to be a bit more practical on how to gradually improve code quality by removing duplication, breaking dependencies to APIs, improving readability and removing complexity by breaking down very large functions in more manageable units.

As I was presenting, my feeling was that I was on one side preaching to converts that had done this before and knew it worked, and the rest of the crowd who did not one to hear it would take a while and thought I was an idiot (emphasized by my poor performance, which I was aware of).

An other factor that came in the mix was that I actually had two presentations. Both of which I had never given before, so both had to be prepared. Luckily, the second one on unit testing was a much easier topic and I find that one went better. It was in a smaller room with fewer people. Everyone was close by, so it was a lot closer to a conversation. I accepted questions at any time. Surprisingly, they came in pretty much the same order I had prepared the content in for the most part. The objective of this session was to bootstrap with unit testing. My intuition told me that the main thing that prevented people from writing unit tests was that they never know where to start. My plan was:

  • Explain quickly how unit testing fits in the development cycle and why test-first is really more effective if you want to write tests. I went over it quickly because I know everyone had that sermon before. I rather placed the emphasis on getting started with easy problems first as writing tests requires some habits. It’s perfectly fine to get started with new developments before going back to older code and test it.
  • Jump in a VM and go through the installation process for PHPUnit, setting up phpunit.xml and the bootstrap script, writing a first test to show it runs and can generate code coverage. I did it using TDD, showing how you first write the test, see it fail, then do what’s required to get it to pass.
  • Keeping it hands on, go through various assertions that help writing more expressive tests, using setUp, tearDown and data providers to shorten tests.
  • Move on to more advanced topics such as testing code that uses a database or other external dependency. I ran out of time on this one, so I could not make any live example of it.

I was quite satisfied with the type of interaction I had with the audience during the presentation and the feedback was quite positive too. It was a small room organized in a way that I was surrounded by the audience close by rather than in a long room barely seeing who I was speaking to. Although there were only 15 attendees, I am confident they got something they can work with.

I could have used a dry run before the presentation. I had done one two weeks prior, but that wasn’t quite fresh in my mind, so it was not quite fluid, but some of it was desired to show where to find the information.

During the other sessions I attended, I made two nice discoveries: Doctrine 2 which came up with a very nice structure that I find very compatible with the PHP way and MongoDB, a document-based database with a very nice way to manipulate data and that has nice performance attributes for most web applications out there.

Categories: General Tags:

Bad numbers

February 15th, 2010 1 comment

The most frequently quoted numbers in software engineering are probably those of the Standish Chaos report starting in 1995. To cut the story short, they are the ones that claim that only 16% of software projects succeed. Personally, I never really believed that. If it were that bad, I wouldn’t be making a living from writing software today, 15 years later. The industry would have been shut down long before they even published the report. As I was reading Practical Software Estimation, which had a mandatory citation of the above, I came to ask myself why there was such a difference. I don’t have numbers on this, but I would very surprised to hear any vendor claiming less than 90% success rate on projects. Seriously, with 16% odds of winning, you’re better off betting your company’s future on casino tables than on software projects.

The big question here is: what is a successful project? On what basis do they claim such a high failure rate?

Well, I probably don’t have the same definition of success and failure. I don’t think a project is a failure even if it’s a little behind schedule or a little over budget. From an economic standpoint, as long as it’s profitable, and above the opportunity cost, it was worth doing. Sometimes, projects are canceled for external factors. Even though we’d like to see the project development cycle as a black box in which you throw a fixed amount of money and get a product in the end, that’s not the way it is. The world keeps moving and if something comes out and makes your project obsolete, canceling it and changing direction is the right thing to do. Starting the project was also the right thing to do based on the information available at the time. Sure some money was lost, but if there are now cheaper ways to achieve the objective, you’re still winning. As Kent Beck reminds us, sunk costs are a trap.

When a project gets canceled, the executives might be really happy because they see the course change. The people that spent evenings on the project might not. Perspective matters when evaluating success or failure. However, when numbers after the fact are your only measure for success, that may be forgotten. Looking back, it won’t matter if a hockey team gave a good fight in the third period. On the scoreboard, they lost, and the scoreboard is what everyone will look back to.

Luckily, I wasn’t the only one to question what Standish was smoking when they drew those alarming figures. In the latest issue of IEEE Software, I came across a very interesting title: The Rise and Fall of the Chaos Report Figures. In the article, J. Laurenz Eveleens and Chris Verhoef explain how the analysis made inevitably leads to this kind of number. The data used by Standish is not publicly available, so analyzing it correctly is not possible (how many times do we have to ask for open data?). However, they were able to tap into other sources of project data to perform the analysis and come up with a very clear explanation of their results.

First off, my question was answered. The definition of success for Standish is pretty much arriving under budget based on the initial estimate of the project with all requirements met. Failure is being canceled. Everything else goes into the Challenged bucket, including projects completing slightly over budget and projects with changed scope. Considering that bucket contains half of the projects, questioning the numbers is fairly valid.

I remember a presentation by a QA director (not certain of the exact title) from Tata Consulting at the Montreal Software Process Improvement Network a few years ago in which they were explaining their quality process. They presented a graph where they had some data collected from projects during the post-mortem and asking to explain the causes of some slippage or other types of failures (it was a long time ago, my memory is not that great), but there was a big column labeled Miscellaneous. At the time, I did not notice anything wrong with it. All survey graphs I had ever seen contained a big miscellaneous section. However, the presented highlighted the fact that this was unacceptable as it provided them with absolutely no information. In the next editions of the project survey, they replaced Miscellaneous with Oversight, a word which no manager in their right mind would use to describe the cause for their failures. Turns out the following results were more accurate for the causes. When information is unclear, you can’t just accept that it is unclear. You need to dig deeper and ask why.

The authors then explain how they used Barry Boehm‘s long understood Cone of Uncertainty and and Tom DeMarco‘s Estimation Quality Factor (published in 1982, long before the reports) to identify organizational bias in the estimates and explain how aggregating data without considering it leads to absolutely nothing of value. As an example, they point our a graph containing hundreds of estimations at various moments in the projects within an organization of forecast over actual ratio. The graphic is striking as apparently no dots exist above the 1.0 line (nearly none, there are a few very close by). All the dots indicate that the company only occasionally overestimates a project. However, the cone is very visible on the graph and there is a very significant correlation. They asked the right questions and asked why that was. The company simply, and deliberately, used the lowest possible outcome as their estimate, leading them to a 6% success rate based on the Standish definition.

I would be interested to see numbers on how many companies go for this approach rather than providing realistic (50-50) estimates.

Now, imagine companies buffering their estimates added to the data collection. You get random correlations at best because you’re comparing apples to oranges.

Actually, the article presents one of those cases of abusive (fraudulent?) padding. The company had used the Standish definition internally to judge performance. Those estimates were padded so badly, they barely reflected reality. Even at 80% completion some of the estimates were off by two orders of magnitude. Yes, that means 10000%. How can any sane decision be made out of those numbers? I have no idea. In fact, with such random numbers, you’re probably better off not wasting time on estimation at all. If anything, this is a great example of dysfunction.

The article concludes like this:

We communicated our findings to the Standish Group, and Chairman Johnson replied: “All data and information in the Chaos reports and all Standish reports should be considered Standish opinion and the reader bears all risk in the use of this opinion.”

We fully support this disclaimer, which to our knowledge was never stated in the Chaos reports.

It covers the general tone of the article. Above the entertainment value (yes, it’s the first time I ever associate entertainment with scientific reading) brought by tearing apart the Chaos report, what I found the most interesting was the well vulgarized use of theory to analyze the data. I highly recommend reading if you are a subscriber to the magazine or have access to the IEEE Digital Library. However, scientific publications remain restricted in access.

Categories: Books, General Tags:

CUSEC 2010

This year, I attended CUSEC for the 4th time, two of which I was an organizer. Even though the target audience is students and I graduated what feels a really long time ago now, I still wait avidly for the event every year. The conference isn’t really ever technically in depth, I see it as an opportunity to see some trends. Every single time, it seems to have a perfect mix. It tends to hook onto the new technologies and hypes. After all, the program is made up by students.

This year, I think everyone would agree that the highlight was Greg Wilson with a very strong invitation to raise the bar and ask for higher standards from software research. Very few quoted studies are actually statistically relevant in any way. I had seen the session before at Dev Days in Toronto (it was around 90% identical). I would see it again. Perhaps, there would be even fewer FIXME notes in the slides. Greg is currently in the process of publishing a book on evidence in software engineering practice to be published as part of the O’Reilly Beautiful * series. The book does not yet have a name, so I can’t pre-order on amazon and that’s truly disappointing.

One of the lower visibility session I found interesting was IBM’s David Turek on Blue Gene and scientific processing. Many discarded the session because it was given by a VP. Now, I don’t really care about scientific calculations. I believe it’s important, but it’s not where my interests lie. I am almost certain I will never use Blue Gene. However, what I found interesting was to see how they tackle extremely large problems. Basically, the objective is to have supercomputers with 1000 times the computational capacity we have today by the end of the decade. Using current technologies, you would need a nuclear power plant to provide it.

Finally, Thomas Ptacek’s session on security was mostly entertaining. It was one of those 3 hour session compressed into one hour. I don’t think I could catch everything, but he went over common developer flaws and how simple omissions can take down the entire security strategies. He concluded with a very useful decision making process: if your encryption strategy involved something else than GPG and SSL, refactor. It’s one of the problems I always had with cryptography APIs. There are too many options. Many of which are plain wrong and irresponsible to use. On the other hand, he was quite a pessimist during the question period, saying there is no hope to create secure software using the current tools and technologies. All software ever made eventually had flaws found in them.

One of the most troubling moments of the conference for me was to see how much some people can be disconnected. I actually came across a software engineering student (not a freshmen) who did not know what Twitter was. Not only did he not know, he had never heard of it. How is that even possible? I don’t use Twitter. I use an open alternative, and I’m not that much into microblogging. However, I do believe it somewhat reached mainstream. You can hear the word while watching news on TV. I really need to lower my assumptions about what people know.

Next conference for me will be Confoo.ca, where I will be presenting two sessions and struggling to choose which of 8 sessions to attend every hour for 3 days.

Categories: General Tags:

Decision making

As part of the day to day work of a software developer, decisions have to be taken every single day. Some have a minor impact and can be reverted at nearly no cost. Others have a significant impact on the project and reverting it would be a fundamental change. I have found that, in most cases, not making a decision at all is a much better solution. A lot of time is wasted evaluating technology. Out of all the options available out there, there is a natural tendency to do everything possible to pick the best of the crop, the one that will offer the most to the project and provides the largest amount of features for future developments. While the reasoning sounds valid, it’s an attempt to predict the future and will most likely be wrong.

Of course, the project you are working on is great and you truly believe it will be revolutionary. However, you’re not alone. Every day, thousands of other teams work on their own projects. Chances are they are not competitors, but most likely a complement to yours and will likely make the package you spent so much time selecting completely obsolete before you’ve used all those advanced features.

Too often, I see a failure to classify the type of decision that has to be made in projects. They are not all equal. Some deserve more time. In the end, it’s all about managing risks and contingencies. The very first step is to identify the real need and what the boundaries are with your system. No one needs Sphinx. People need to search for content in their system. Sphinx is one option. You could also use Lucene or even external engines if your data is public. What matters when integrating in this case is how the data will be indexed and how searching will be made. When trying out new technology, most will begin with a prototype, which then evolves into production code. At that point, a critical error was made. Your application became dependent on the API.

If you begin by making clear that the objective is to index the content in your system, you can design boundaries that isolate the engine-specific aspects and leave a cohesive — neutral — language in your application.

Effectively, having such a division allows you not to choose between Sphinx or Lucene or something else. You can implement one that makes sense for you today and be certain that required changes to move to something else will be localized. With your application logic to extract the data to be indexed and the logic for fetching results and displaying them left independent, the decision-making step becomes irrelevant.

Certainly, there is some overhead. You need to convert the data to a neutral format rather than simply fetching what the API wants and then convert it to the appropriate format. Some look at the additional layer and see a performance loss. In most cases when integrating with other systems, the additional level of indirection really does not matter. You are about to call a remote system performing a complex operation over a network stack. If that wasn’t complex, you would have written it yourself.

A common pitfall is to create an abstraction that is too closely bound to the implementation rather than the real needs of the system. The abstraction must speak your system’s language and completely hide the implementation, otherwise, the layer serves no purpose. It’s a good idea to look at multiple packages and see how they work conceptually when designing the abstraction. While you’re not going to implement all of them, looking at other options gives a different perspective and helps in adjusting the level of abstraction.

Once the abstraction is in place. the integration problem is out of the critical path. You can implement the simplest solution, knowing that it won’t scale to the appropriate level down the road, but the simplest solution now will allow to focus on more important aspects until the limit is reached. When it will be, you will be able to re-asses the situation and select a better option knowing that changes will be localized to an adapter.

Abstracting away is a good design practice and it can be applied to almost any situation. It allows your code to remain clean, breaks dependencies to external systems that would otherwise make it hard to set-up the environment and decrease testability. Because the code is isolated, it leaves room for experimentation with a safety net. If the chosen technology proves to be unstable or a poor performer, you can always switch to something else.

While it works in most cases, it certainly does not work for some fundamental decisions, like the implementation language, unless you plan on writing your own language that would compile in other languages. Some abstractions just don’t make sense.

When you can’t defer decision making, stick with what you know. Sure you might want to try one this new framework in the cool new language. The core of your project, if you expect it to live, is no place to experiment. I have been using PHP for nearly a decade now. I’ve learned all the subtleties of the language. It is a better choice for me. I’ve used the Zend Framework on a few projects and know my way around it well enough. It’s a good solution for me. Both together are a much safer path than Python/Django or any alternative, no matter what Slashdot may say.

It might not sound like a good thing to say, but experimenting as part of projects is important. You can’t test a technology well enough unless it’s done part of a real project and a project is unlikely to be real unless it’s part of your job. It’s just important to isolate experiments to less critical aspects. It’s the responsible thing to do.

It’s all about risk management. Make sure all decisions you make are either irrelevant because they can be reverted at a low cost or use technologies you trust based on past experience and you will avoid bad surprises.

Categories: Programming Tags:

New favorite toy

It certainly ain’t cutting edge, but I recently started using Sphinx. I heard presentations about it, read some, but never had the occasion to use it. It’s very impressive as a search engine. The main downside that kept me away from it for so long is that it pretty much requires a dedicated server to run it. As I primarily work on open source software where I can make no assumption about the environment, sphinx never was an option. For those environments, the PHP implementation of Lucene in the Zend Framework is a better candidate.

In most cases, I tend to stick with what I know. When I need to deliver software, I much prefer avoiding new problems and sticking with what I know is good enough. Granted the option, a few details made me go for Sphinx rather than Lucene (always referring to the PHP project, not the Java one).

  • No built-in stemmer, and could only find for English. If you’ve never tried, not having a stemmer in a search engine is a good way to only occasionally have search results and it makes everything very frustrating.
  • Pagination has to be handled manually. Because it runs in PHP and all memory is gone by the end of the execution, the only way it can handle pagination decently is to let you handle it yourself.

However, it’s a matter of trade-off. Sphinx has a few inconvenients.

  • Runs as a separate server application and requires an additional PHP extension to connect to it (although recent versions support the MySQL protocol and lets you query it from SQL).
  • No incremental update of the index. The indexer runs from command line and can only build indexes efficiently from scratch. Different configurations can be used to lower the importance of this issue. Some delay on the search index update has to be tolerated.

If you can get past those issues, Sphinx really shines.

  • It handles pagination for you. Running it a daemon, it can keep buffers open and keep the data internally and manage it’s memory properly. In fact, you don’t need to know and that’s just perfect.
  • It can store additional attributes and filter on them, including multi-valued fields.
  • It’s distributed, so you can scale the search capacity independently. It requires to modify the configuration file, but entirely transparent to your application.
  • Result sorting options, including one based on time segments, giving higher ranking for recent results depending on which segment they are part of (hour, day, week, month). Ideal when searching for recent changes.

Within just a few hours, it allowed me to solve one of the long lasting issues in all CMS software I’ve came across: respecting access rights in search results efficiently. Typically, whichever search technique you use will provide you with a raw list of results. You then need to filter those results to hide those that cannot be seen by the user. If you can accept that not all pages have the same amount of results (or none at all), this can work pretty efficiently. Otherwise, it adds either a lot of complexity or a lot of inefficiency.

An other option is to just display the results anyway to preserve the aesthetic and let the user be faced with a 403-type error later on. It may be an acceptable solution in some cases. However, you need to hope the excerpt generated or the title does not contain sensitive information, like Should we fire John Doe?. This can also happen if the page contains portions of restricted information.

First, the pagination issue. I could solve this one by adding an attribute to each indexed document and a filter in all queries. The attribute contains the list of all roles who can view the page. The filter contains the list of all roles the user has. Magically, Sphinx paginates the results with only the pages that are visible to the current user.

Of course, this required a bit of up-front design. The permission system allows me to obtain the list of all roles having an impact on the permissions. Visibility can then be verified for each role without having to scan for every (potentially hundreds or thousands) role in the system.

Sphinx can build the index either directly from the database by providing it with the credentials and a query or through an XML pipe. Because a lot happens from the logic in the code, I chose the second approach, providing me with a lot more flexibility. All you have to do is write a PHP script that (ideally using XMLWriter) gathers the data to be indexed and writes it to the output buffer.

The second part of the problem, about exposing sensitive information in the result, was resolved as a side effect. The system allows to grant or deny access to portions of the content. When building the index, absolutely all content is indexed. However, sphinx does not generate the excerpts automatically when generating the search results. One reason is that you may not need them, but the main reason is more likely to be that it does not preserve the original text. It only indexes it. Doing so avoids having to keep yet an other copy of your data. Your database already contains it.

To generate the excerpt, you need to get the content and throw it back to the server with the search words. The trick here is that you don’t really need to send back the same content. While I send the full content during the indexing phase, I only send the filtered content when time comes to generate the excerpt.

Sure, there may be false positives. Someone may see the search result and get nothing meaningful to them. John Doe might find out that the page mentions his name, but the content will not be exposed in any case. Quite a good compromise.

So many possibilities. What is your favorite feature?

Categories: Programming Tags: