New favorite toy

It certainly ain’t cutting edge, but I recently started using Sphinx. I had heard presentations about it and read some, but never had the occasion to use it. It’s very impressive as a search engine. The main downside that kept me away from it for so long is that it pretty much requires a dedicated server to run it. As I primarily work on open source software where I can make no assumptions about the environment, Sphinx never was an option. For those environments, the PHP implementation of Lucene in the Zend Framework is a better candidate.

In most cases, I tend to stick with what I know. When I need to deliver software, I much prefer avoiding new problems and sticking with what I know is good enough. Given the option, a few details made me go for Sphinx rather than Lucene (always referring to the PHP project, not the Java one).

  • No built-in stemmer, and the only one I could find was for English. If you’ve never tried it, not having a stemmer in a search engine is a good way to get search results only occasionally, and it makes everything very frustrating.
  • Pagination has to be handled manually. Because it runs in PHP and all memory is gone by the end of the execution, the only way it can handle pagination decently is to let you handle it yourself.

However, it’s a matter of trade-offs. Sphinx has a few inconveniences of its own.

  • Runs as a separate server application and requires an additional PHP extension to connect to it (although recent versions support the MySQL protocol and let you query it with SQL).
  • No incremental update of the index. The indexer runs from the command line and can only build indexes efficiently from scratch. Different configurations can be used to lower the importance of this issue, but some delay on search index updates has to be tolerated.

If you can get past those issues, Sphinx really shines.

  • It handles pagination for you. Running as a daemon, it can keep buffers open, keep the data internally and manage its own memory properly. In fact, you don’t need to know how, and that’s just perfect.
  • It can store additional attributes and filter on them, including multi-valued fields.
  • It’s distributed, so you can scale the search capacity independently. It requires modifying the configuration file, but it’s entirely transparent to your application.
  • Result sorting options, including one based on time segments, which gives a higher ranking to recent results depending on which segment they fall in (hour, day, week, month). Ideal when searching for recent changes.

Within just a few hours, it allowed me to solve one of the long-lasting issues in every CMS I’ve come across: respecting access rights in search results efficiently. Typically, whichever search technique you use will provide you with a raw list of results. You then need to filter those results to hide the ones that cannot be seen by the user. If you can accept that not all pages have the same number of results (or none at all), this can work fairly efficiently. Otherwise, it adds either a lot of complexity or a lot of inefficiency.

Another option is to just display the results anyway to preserve the aesthetics and let the user face a 403-type error later on. It may be an acceptable solution in some cases. However, you need to hope the generated excerpt or the title does not contain sensitive information, like “Should we fire John Doe?”. This can also happen if the page contains portions of restricted information.

First, the pagination issue. I could solve this one by adding an attribute to each indexed document and a filter to all queries. The attribute contains the list of all roles that can view the page. The filter contains the list of all roles the user has. Magically, Sphinx paginates the results with only the pages that are visible to the current user.
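A minimal sketch of the query side, using the Sphinx PHP API; the attribute name, index name and role IDs here are invented for illustration, not the actual TikiWiki ones.

<?php

// Requires the Sphinx PHP API (sphinxapi.php or the PECL extension).
$cl = new SphinxClient();
$cl->SetServer('localhost', 9312); // port depends on the searchd configuration

// Role IDs held by the current user, resolved by the permission system.
$userRoleIds = array(3, 7, 12);

// 'allowed_roles' is a hypothetical multi-valued attribute on each document;
// a page matches if any of its allowed roles is in the user's list.
$cl->SetFilter('allowed_roles', $userRoleIds);

// Sphinx handles the pagination itself: second page, 20 results per page.
$cl->SetLimits(20, 20);

$result = $cl->Query('search terms', 'pages');

?>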

Of course, this required a bit of up-front design. The permission system allows me to obtain the list of all roles having an impact on the permissions. Visibility can then be verified for each of those roles without having to scan every role in the system (potentially hundreds or thousands).

Sphinx can build the index either directly from the database, by providing it with the credentials and a query, or through an XML pipe. Because a lot of the logic lives in the code, I chose the second approach, which gives me a lot more flexibility. All you have to do is write a PHP script that gathers the data to be indexed and writes it to the output buffer (ideally using XMLWriter).
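A rough sketch of such an xmlpipe2 script; the field and attribute names are invented, and the sample data stands in for what the application’s own libraries would provide.

<?php

// Sample data standing in for what the application libraries would provide.
$pages = array(
	1 => array('content' => 'First page text',  'roles' => array(3, 7)),
	2 => array('content' => 'Second page text', 'roles' => array(3)),
);

// The indexer runs this script and reads the xmlpipe2 stream from stdout.
$w = new XMLWriter();
$w->openURI('php://output');
$w->startDocument('1.0', 'UTF-8');

$w->startElement('sphinx:docset');

$w->startElement('sphinx:schema');
$w->startElement('sphinx:field');
$w->writeAttribute('name', 'content');
$w->endElement();
$w->startElement('sphinx:attr');
$w->writeAttribute('name', 'allowed_roles');
$w->writeAttribute('type', 'multi');
$w->endElement();
$w->endElement(); // schema

foreach ($pages as $id => $page) {
	$w->startElement('sphinx:document');
	$w->writeAttribute('id', $id);
	$w->writeElement('content', $page['content']);
	$w->writeElement('allowed_roles', implode(',', $page['roles']));
	$w->endElement();
}

$w->endElement(); // docset
$w->endDocument();
$w->flush();

?>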

The second part of the problem, exposing sensitive information in the results, was resolved as a side effect. The system allows granting or denying access to portions of the content. When building the index, absolutely all content is indexed. However, Sphinx does not generate the excerpts automatically when returning search results. One reason is that you may not need them, but the main reason is more likely that it does not preserve the original text; it only indexes it. Doing so avoids having to keep yet another copy of your data. Your database already contains it.

To generate the excerpt, you need to get the content and throw it back to the server along with the search words. The trick here is that you don’t really need to send back the same content. While I send the full content during the indexing phase, I only send the filtered content when the time comes to generate the excerpt.
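The call itself is a single round trip. A sketch, continuing with the invented names from above; the helper that loads the filtered content is hypothetical, and the documents passed in are the filtered versions, not what was indexed.

<?php

// $cl is the SphinxClient from the query above; $matches are the page IDs it
// returned. The content sent here has the restricted portions stripped out.
$filteredTexts = array();
foreach ($matches as $pageId) {
	$filteredTexts[] = load_filtered_content($pageId); // hypothetical helper
}

$excerpts = $cl->BuildExcerpts(
	$filteredTexts,
	'pages',           // index name, used for charset and tokenizing settings
	'search terms',    // the words to highlight
	array('around' => 5, 'limit' => 200)
);

?>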

Sure, there may be false positives. Someone may see the search result and get nothing meaningful to them. John Doe might find out that the page mentions his name, but the content will not be exposed in any case. Quite a good compromise.

So many possibilities. What is your favorite feature?

Always more to do

It’s fascinating to see how little time passes between the moment code is written and the moment it gets mentally flagged as needing improvement, how procrastination then kicks in, and finally how things get worse because they are left untouched. Of course, there are higher priorities and hundreds of reasons why it was left behind. The end result is the same. It will have to be refactored at some point, and those hours will be incredibly boring. Recently, I have been working on a fairly large project. After several hundred hours, I came to the conclusion that I had a decent proof of concept, but it was far from usable. I made a list of the issues and found several places that needed improvement. Turns out I had known about the root causes for a long time. I simply had not attended to them.

So began the refactoring process, filled with incredibly dull tasks a monkey could do, if only I wouldn’t need to spend more time explaining them than it actually takes me to do them.

Certainly, those issues would have been much faster to resolve if less had been built on them, meaning if I had attended to them earlier. However, I strongly question whether the solution I would have found back then would have lasted any longer. In fact, what I implement now follows patterns that have been deployed in other areas in the meantime. It builds on top of recent ideas. Dull refactoring may just be unavoidable. It will have to be done again in the future.

Constant refactoring and reaching for perfection is a trap. I’ve learned about ROI and have almost turned going for the highest-value objective into an instinct. With limited resources, no time can be wasted on gold-plating an unfinished product. Refactoring, as a means to improve user interaction and speed up development in this case, just happened to land at the top of the priority list, and I now have to suffer through brain-dead activities. Luckily, refactoring still beats pixel alignment and fixing browser issues any day.

Trying to avoid falling asleep, I have been keeping Chad Fowler’s new book close by, which turns out to be really good for my condition. Today, I came across this passage.

For most techies, the boring work is boring for two primary reasons. The work we love lets us flex our creative muscles. Software development is a creative act, and many of us are drawn to it for this reason. The work we don’t like is seldom work that we consider to be creative in nature. Think about it for a moment. Think about what you have on your to-do list for the next week at work. The tasks that you’d love to let slip are probably not tasks that leave much to the imagination. They’re just-do-’em tasks that you wish you could just get someone else to do.

It goes on to recommend diverting the mind to a different challenge while performing the task, with, as an example, keeping a 100% code coverage target when writing unit tests. I’ve been doing a lot of that in the project. It influenced the design a lot. Ironically, what makes the refactoring so boring is that all the tests now have to be updated. The code update itself takes just a few minutes. Updating the dozens of failing tests because the interface changed takes hours, however. They are quite a good guarantee that nothing broke, but they do increase my daily caffeine intake to unsafe levels.

From junior to senior

It came to me as a realization in the last few days: there really is a huge gap between junior and senior programmers. It’s commonly said that there is a 1:20 variation in our profession, but how much of it is really associated with experience? That alone can account for an order of magnitude. It’s not a matter of not being smart. Juniors just waste time and don’t realize it. There are multiple reasons for it, but most of them are related to not asking for help.

The specs are unclear, but they don’t ask questions. Specs are always unclear. Just discussing the issue with anyone can help clarify what has to be done. A junior will either sit idle and wait until he figures it out on his own, or pick a random track and waste hours on it. In most cases, there is no right answer. It’s just a judgment call that can only be made by someone who knows the application inside out, understands the domain and the use cases. This takes years to learn, so just ask someone who knows.

They don’t have the knowledge necessary to get over a hump. The natural response when they don’t understand a piece of code is to search Google for an entire day, even though someone a shout away could have provided the answer within 45 seconds. Sure, interruptions are bad and learning by yourself is a good skill to have, but wasting that amount of time is never a good deal. It takes a while to become comfortable with all the technologies involved in a non-trivial project.

They spend hours on a problem while they could have solved more important ones in less time. Prioritizing work is probably the most important aspect of being productive. Especially when you work on an old application that has countless problems, at the end of the day, you still need to get your objectives done. At some point as a programmer, you need to trust “management” or “technical direction” that the tasks that were assigned are probably the ones that bring the most value to the project, regardless of what you stumble across along the way.

All of this can be solved fairly easily. Before you begin a task, figure out what the real objectives are, how much time you think it’s going to take, and how much time those who assigned it think it’s going to take. Unless they are technical, most managers have no clue how long something will take, but aligning expectations is a key to successful projects. If you think it’s a one-week effort and they thought you would only need to spend two hours on it, it’s probably better to set the clocks straight and not begin at all.

Even when money is not involved, whether you are working for a governmental organization or volunteering on an open source project, time remains valuable. Not all bugs are equal. Not everything is worth spending time on, and what should really be judged is the impact of the effort.

Even though most HR departments turned the concept of a time sheet into a joke by forcing all employees to report 40 hours of work per week, a real, detailed time sheet with tasks and how long they took to perform is a great tool for developers to improve their efficiency. Was all that time really worth spending?

At the end of each day, it’s a good thing to look back at what you worked on and ask yourself if it was really what you set out to do.

Once you’re halfway through the allocated time, ask yourself if you’re still working on the real objectives. If you’re not, the solution is obvious: get back to them. If you’re still on the objective but feel you are circling around it, how about asking for help?

Once you’re past the allocated time, consider cutting your losses. Ask around. Maybe the request was just a nice-to-have that is really not worth spending too much time on. It may be more economical to assign it to someone else. Just report on progress and expectations; it allows the direction to re-align and offer support. There is nothing wrong with admitting failure. It just saves time and money in most cases.

State your intent

It’s quite surprising how simply stating the intent allows for improvements. In the last few days, I have been rewriting the permission system for TikiWiki. One of the goals was to make the way it handles category permissions consistent with object and global permissions; the other was to improve performance.

Typically, listing elements would contain a condition for each element to verify whether the user is allowed to view the object. This seems perfectly reasonable and innocent. However, the implementation of that function grew to a few hundred lines over time, including function calls that all seemed innocent. They were not. The number of database queries performed to filter the list was unreasonable.

Was it lack of design, careless implementation or the desire to keep things simple? Certainly, a lot of re-use was made. The function to verify a permission was so convenient that it could be used anywhere in the code. Everywhere a list had to be filtered, two lines were added in the loop to filter the data. Simple, innocent. Why would you bother placing that in a function? It’s just two lines of code. Plus! Placing it in a function would harm performance because it would require an additional loop.

Wrong.

The cost of the loop is irrelevant. Stating the intent allows for something much more powerful than just simplifying those loops all around: it allows for creating a better way to filter the list when that is what needs to be done, like bulk-loading the information required to resolve the permissions, reducing the number of queries by an order of magnitude or two, and offering a sane point to cache results and avoid queries altogether.

Even a naive implementation with the terrible overhead of an extra loop would have been useful. Certainly, writing code to load large amounts of information in a single batch is harder. It requires understanding the relationships and the entire flow. However, if the right abstraction exists and states the appropriate intent, the filtering routine can be optimized in a single place rather than requiring all the code to be updated.
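To make the point concrete, here is a sketch of the two shapes; every name in it is invented for illustration and is not the actual TikiWiki API. Even the naive filter below already states the intent, so bulk loading, caching or delegation to the search engine can later happen in that one place.

<?php

// Invented names for illustration; not the actual TikiWiki API.
class Perms
{
	// The per-item check everyone reaches for. Innocent, but each call
	// may hide several database queries in a real system.
	public function canView(array $user, array $page)
	{
		return count(array_intersect($user['roles'], $page['allowed_roles'])) > 0;
	}

	// Stating the intent: "filter this list for viewing". Even this naive
	// version is useful, because bulk loading, caching or delegation to
	// the search engine can later happen here, in a single place.
	public function filterVisible(array $user, array $pages)
	{
		$visible = array();
		foreach ($pages as $page) {
			if ($this->canView($user, $page)) {
				$visible[] = $page;
			}
		}
		return $visible;
	}
}

$perms = new Perms();
$user  = array('roles' => array('registered', 'editor'));
$pages = array(
	array('name' => 'Public page',  'allowed_roles' => array('anonymous', 'registered')),
	array('name' => 'Editor notes', 'allowed_roles' => array('editor')),
	array('name' => 'Admin notes',  'allowed_roles' => array('admin')),
);

// The listing code now states what it wants, not how to get it.
$visible = $perms->filterVisible($user, $pages);

?>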

This is one of the pitfalls of object oriented programming, and even procedural programming for that matter. It’s very easy to create elegant abstractions that hide the implementation details. They are very easy to use, but when used in the wrong context, they create slow and bloated applications. The abstractions need to reflect the tasks to be performed, not be regrouped around what holds the tasks together.

Focus is too often placed on the implementation details and local optimizations, while the big differences are made at a much higher level by correctly sequencing the operations to be made. Once the right abstractions are in place, when the interfaces are defined, the actual implementation is irrelevant. If it turns out to be slow due to the database design, it can be changed without affecting the rest of the application. The intent of the code remains unchanged.

It also makes the code more readable. It becomes like reading an executive summary. It tells you the outline and the important pieces to remember, but it does not bury you in details. It provides a high-level view that, in most cases, is what people need to understand. Sure, understanding abstractions is a different kind of gymnastics for the brain when you need to debug a piece of code, but most of the time, you can just read past and ignore the details.

Finding the appropriate words

Programming is easy. Once you know the constructs and the patterns, it really ends up being a mechanical task. Add a little discipline to catch your own mistakes and you’re on the right track to deliver working software. I find that one of the hardest tasks is to find the appropriate words to use. Writing code is mostly about creating abstractions and metaphors. Using the right words leads to code that is easy to read, understand and predict. Using the wrong ones leads to confusion and a whole stack of problems.

I don’t usually start writing code until I have found suitable words to use. There is something magic about nailing the different concepts involved: responsibilities become clear and all the issues untangle themselves. Code is so much simpler and faster to write when you don’t have to deal with exceptional conditions caused by mixed logic. Good class names set expectations for what the class actually does.

The analogy brought by the word is very important. Words are not context-exclusive. The other concepts they relate to in a different context are just as important. They provide the analogy and the expectations. The vocabulary surrounding the concept will organize the software. Will it make for the most efficient software? It might not, but it will be understandable. Last time I checked, code was meant to be read by humans, or we’d still be writing assembly.

Think about database, knowledge base, data storage and data warehouse. They all pretty much serve the same purpose, except that I don’t expect the same operations to be made on a database or a data storage.

This in fact has been one of my pet peeves for a long time. Business people use the word database for everything. If their grocery list was in a word document, it would probably be a database too. Talking to them is absurdly confusing. However, I figured out that the reason why they keep using it is that it’s really all the same thing to them and they understand each other around this word. Go figure how.

To software engineers, the word database has a much more precise meaning. The terminology surrounding data storage and manipulation is much richer. We simply don’t speak the same language. It may seem worse because we consume a lot of concepts, but I learned that all professions have similar issues. Telling an oncologist you have cancer is probably just as vague and meaningless.

This brings us back to word selection and speaking to the appropriate audience. The basic object oriented courses teach Domain-Driven Design: take the concepts in your domain and model your software around them. It makes a lot of sense at a high level and helps communication between the development team and the stakeholders. However, I have the feeling that doing so restricts the software and prevents you from making generic solutions. If you focus too much on a single use case, you don’t see the generic aspects and probably add complexity to the software to push in concepts that just don’t fit.

I see nothing wrong with using different words internally and externally. Code is to be understood by programmers, the front-end by the users. If you build a generic solution with words understood by programmers internally, adapting the front-end to a particular context is mostly about changing strings. If your application is domain-specific all the way through, adapting to a different context means either rewriting, or asking someone to translate between two different domains that don’t match at all. I bet large consulting firms love domain-driven design because it allows them to rent out a lot of staff for extended periods.

Anyway, good words don’t magically come up on a late evening when you need them in time. Good design requires time to draw itself, for the appropriate words to be found. If you aim for elegance in your software, you will very likely need to drop the deadlines, at least until all the crucial blocks are in place. The only way I know to find the right words is to switch between contexts. Focusing too much on trying to find the right word is a bit like chasing your own tail: you won’t get it because the focus is only on what you see. The right word is out there. I always keep books with me when I need to design software. They bring me back to a larger picture, and they contain many words too.

Too much focus on static views

When it comes to software design, university programs put a lot of emphasis on UML. UML itself was a great effort to standardise the notation and representation of software concepts at a time that much needed it. If you pick up software design books from 1995 and earlier, you will notice instantly how needed unification was. Today, UML is omnipresent and that is a good thing. It’s a great way to document and explain software. Including all diagram types and special syntaxes, it’s a great tool for any software developer to master.

My largest problem with UML is not UML itself, it’s the people who swear by it and confuse diagrams with design. They build pretty class diagrams and believe they now have an architecture for their system. The belief is that all that needs to be done is generate the stubs, throw them at any code monkey to fill in the loosely defined methods, compile the app, and you’re done. This is certainly better than the MDD belief that you can compile your diagram and the app will work. However, it’s still wrong.

From everything I have seen, this sort of Diagram Driven Design simply fails. The fact that the diagram is pretty, shows little coupling and high cohesion does not mean that the software does what it has to do in the first place. A static view will never be enough to represent a program, because programs are not about static aspects; their meaning comes from execution. The difference between a good design and a bad one is how it executes: how it initializes, binds with the problem to solve and interacts with the other components. Class diagrams can’t represent this. It has nothing to do with how it looks when printed on paper or how the boxes are arranged so that lines don’t cross.

Static views focus on conceptual purity. Methods get added because they make sense, not because they are needed. Or maybe they were needed in the head of the designer, but not in the head of the one implementing. Different people have different preferences in how they implement and get their code to work, and if the designer did not communicate his ideas fully through dynamic views, the static view is just a piece of junk. On the other hand, if he were to build sequence diagrams for everything, he might as well write the code himself. I have yet to see a diagram tool that is faster to use than vim.

There are better ways to use a system architect than to ask him to draw diagrams. Get him to write the concepts and general directions in the most concise manner possible and let the implementer handle the low level design the way he prefers it. Results will come in faster, less waste will be generated and your architect will be able to do more, maybe even write code himself to make sure he stays grounded.

From the moment the important concepts and responsibilities involved are identified, a test-first approach will produce much better code. In fact, the tests themselves are not that important: any implementation approach that places the focus on the facade of the components will produce better results. It’s just that by doing this, tests are almost free, so why not write them? A well-encapsulated component will be easy to integrate in a larger application, so developing it in isolation with tests is not much more trouble and there are additional benefits. Focus on how the code will be used and encapsulated, not on how pretty it looks in a diagram. It will very likely be shorter, more efficient and contain less dead code.

At the other end of the spectrum, some people focus too much on the low-level details. A little while back, I saw a presentation on meta-programming. I must say that from a geek point of view, writing programs to rewrite your programs is just cool. However, I can’t find a valuable use for it. I have serious doubts about the type of changes you can make from a low-level static representation of your application. Examples included removing dead code. Sure, it sounds like a good idea, but dead code has little to do with static aspects, especially in dynamic languages. Another use was rewriting code in a different language. Sure, it can work if what you’re working on is a sort algorithm, but if you’re using a sort algorithm, it would be pure luck if the one you’re using is called the same way in two different languages. This simply does not scale beyond the academic level. Code is rarely independent from the platform.

Sure, the static views remain important. I for one spend a lot of time formatting my code and reorganizing it to be more readable. This goes well beyond just following coding standards; it’s part of making the code self-documenting by making sure details don’t get tangled with business logic. I occasionally use diagrams to document my work (however I rarely, if ever, use class diagrams). In the end, it’s all about knowing whom the artifact you’re building is aimed at. Code is always targeted at humans. If it were not, we’d still be writing assembly and be happy with it. Then there is a distinction between writing the front-facing interfaces of a component and its internals. A poor design can be recognized by the amount of internal detail bleeding through the facade, the same way poor user documentation spends too much time explaining technical details.

Adding collaboration and durability to code reviews

The idea first came to me over the summer while I was in Strasbourg discussing the future changes to TikiWiki. It all started because we decided to release more often. A lot more often. As it was not something done frequently before, the tools to do it were hard and tedious to use. Some manual steps are long and annoying. Packaging the releases was one of them. All those scripts were rewritten to be easier to use and to regroup multiple others. That part is good now. One of the other issues was the changelog. TikiWiki is a very active project. Having over 1000 commits per month is not uncommon. Many of them are really small, but it’s still 1000 entries in the changelog. Ideally, we’d like the changelog to be meaningful to people updating, but we just had no way to go through the whole changelog before the release.

One solution that came up was to use a wiki page to hold the changelog, get new items appended to it, and then people could go through it, translate the commit messages from developer English to user English, filter out the irrelevant ones and come up with something useful to someone. Well, we had no direct way to append the commit messages to the end of a page, so this was not done over the six-month period and we’re now getting dangerously close to the next release. Looks like the next changelog will be in developer English.

Now, what does this have to do with code review? Not so much. I was reading “Best Kept Secrets of Peer Code Review” (which they mail you for free because it’s publicity for their software, but which is still worth reading because it contains valid information) and figured TikiWiki could be a good platform to do code reviews on, if only we could link it with Subversion in some way. After all, TikiWiki is already a good platform to collaborate over software development, as it was developed for its own internal purposes for so long. When you dogfood a CMS for a long time, it tends to become good for software development (it also tends to grow complex UIs intuitive only to developers, though). It contains all you need to write quality documentation, track issues, and much more, just by enabling features and setting them up.

Moreover, code review is done by the community over the Subversion mailing list. The only issue is that we don’t really know what was reviewed and what was not. I personally think a mail client is far from being the best environment to review changes in. Too often, I’d like to see just a couple of lines more above to fully check the change, or verify something in another related file. The alternative at this time is to use Subversion commands afterwards and open up files in vi. I wish it required fewer steps.

Wiki pages are great, because they allow you to do what you need and solve unanticipated issues on the spot. Specialized tools, on the other hand, are great at doing what they were meant to do, but they are often draconian in their ways, and forcing someone to do something they don’t want to never really works. I’ve made those errors in the past designing applications, and I felt Code Collaborator made the same ones. The book mentioned above contains a chapter on a very large case study where no code could be committed unless reviewed first. Result: a few pages were spent explaining how they had to filter out the data for the cases where the code was not really reviewed and people only validated it within seconds. I’d rather know it was not reviewed than get false positives.

Anyway, I started thinking of the many ways TikiWiki could allow to perform code reviews. The most simple layout I could think of was this one:

  • One wiki page per commit, grouping all the changes together. Reviewers can then simply edit the page to add their comments (or use the page comments feature, but I think adding to the page is more flexible) and add links to other relevant information to help others in the review.
  • A custom plugin to create links to other files at a given revision, just to make it easier to type. This could actually be a plugin alias to something else. No need for additional code.
  • A custom plugin to display a diff, allow displaying the full diff instead, and link to the full file versions on the left and right.
  • An on-commit plugin for Subversion to make the link.

With commits linked to other related commits, to related documentation and discussions, other side features like the wiki mind map and semantic links will surely prove insightful.

Then I went a little further and figured trackers could be used to log issues and run stats on them in the future. Checklists could be maintained in the wiki as well and displayed as modules on the side, always visible during the review. If issues are also maintained in a tracker, they could be closed as part of the commit process by analyzing the commit message. However, this is mostly an extra, as I feel there is enough value in just having the review information publicly available. The great part of using a vast system is that all the features are already there. The solution can be adapted and improved as required without requiring completely new developments.

Now, the only real showstopper in this was that there is no direct way of creating all this from a Subversion plugin. TikiWiki does not have a webservice-accessible API to create these things and is unlikely to have one any time soon. The script could load the internal libraries and call them if they were on the same server, but that’s unlikely to be the case. A custom script could be written to receive the call, but then it would not be generic, and so hard to include in the core. As we don’t like to maintain things outside the core (part of the fundamental philosophy making the project what it is), it’s not a good solution. Between this and the changelog idea before, I felt there was still a need for something like it. I’m a software engineer, so I think with the tools I use, but I’m certain there are other cases out there that could use a back door to create various things.

To keep the story short, I started finding too many resemblances to the profiles feature. Basically, profiles are YAML files containing descriptions of items to create. They are used from the administration panel to install configurations from remote repositories, to configure the application faster based on common usage profiles. Most of the time, they are only used to create trackers, set preferences and such. However, they contain a few handles to create pages, tracker items and some other elements, to create sample data and instruction pages. If they could be run multiple times, be called by non-administrator users, and contain a few more options to handle data a little better (like being able to append to pages), they could pretty much do anything that’s required here, and a lot more.

Another problem was that profile repositories are also TikiWiki instances. They require quite a few special configurations, like opening up some export features, using categories and such. I wouldn’t want all this stuff just to receive a commit notification, and I wouldn’t want to execute a remote configuration without supervision from the administrator. More changes were required to better handle access to local pages.

Still, those were minor changes. A few hours later, Data Channels were born. What are they? They’re so simple it’s almost stupid: a name, a profile URI and a list of user groups who can execute it from an HTTP call.
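As an illustration of how a Subversion hook could feed such a channel, here is a sketch of a post-commit script. The channel name, the parameters and the URL are all invented; the real entry point and fields would be defined by the profile and the TikiWiki setup.

<?php

// Hypothetical post-commit hook, called as: php post-commit.php /path/to/repos REV
list(, $repos, $rev) = $_SERVER['argv'];

// Gather the commit data with svnlook.
$message = shell_exec('svnlook log -r ' . escapeshellarg($rev) . ' ' . escapeshellarg($repos));
$changed = shell_exec('svnlook changed -r ' . escapeshellarg($rev) . ' ' . escapeshellarg($repos));

// The data channel is just a named profile reachable over a plain HTTP call.
$ch = curl_init('http://example.org/tiki-channel.php'); // invented URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
	'channel'  => 'svn_commit', // invented channel name
	'revision' => $rev,
	'message'  => trim($message),
	'changed'  => trim($changed),
));
curl_exec($ch);
curl_close($ch);

?>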

Next steps toward using TikiWiki as a code review tool:

  • Discuss this with the community
  • Write a profile to create the base setup and the base data channel profiles
  • Write a subversion plugin to call the channels
  • Write the plugins to display code differences

Interested in something like this for your project or company? Just drop me a line.

Software Maintenance

Software maintenance is probably the most unglamorous task to be done in software. It gets no respect at all. In fact, it’s just a bit higher than software testing down the pit. Yet, this is what most of us do. This is what I do most of the time. It’s about working on code you have not written, or wrote so long ago you forgot you even did it. Starting a brand new project does not happen so often. In most cases, starting a brand new project is not even a good idea.

When writing any piece of code, maintainability should be the number one priority. No exceptions. Code lives for a very long time, often longer than its contributors. Unless the project is very small and handled by a single person over its lifetime, it’s very likely that, at some point, the original authors will be gone and the problems will be handed to someone else. Let’s face it, the inner workings of software are complex, and a ton of the tacit knowledge required to understand them is never written down, which is a good thing, because if everything were written in Word documents, any significant piece of information would be buried under a big pile of information overload.

Here are some tips on maintaining code and writing maintainable code:

1. Never modify code you don’t understand

No, this does not mean you should not make the modification. People patching blindly into existing code is the number one cause of degradation. Degraded code is harder to maintain for you the next time, and even harder for the ones after you. Unless you understand the implications of what you are about to do, go back and study a bit more. Print out the code and read it. Annotate it. Understand it. It takes time, it kills trees, but it’s efficient.

Try to understand the intent of the original author. If the code is old and you know others touched it before you, identify their modifications. Most of the time, you will find defects along the way. Problems that lived in the application for a long time. Too much code is never reviewed.

I stand with those who believe schools should teach students to read code before teaching them to write it. Reading code teaches you a lot more about writing maintainable software. You learn what you don’t like to see. If you have the least bit of sympathy, you will avoid doing it yourself afterwards.

2. Review your own code and improve it

Good code is elegant. Good code reads like a book. Does your code read like a book? When reading it, do you stumble across a complex condition and ask yourself what it does? Create more functions, even if they are one-liners. A named function reads a lot better than a three-line conditional. The intent is clear.
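A made-up illustration of the point; the names and the rule itself are invented:

<?php

$item = (object) array('expires' => time() - 3600, 'pinned' => false);

// Before: the reader has to decode the condition every time.
if ($item->expires !== null && $item->expires < time() && !$item->pinned) {
	// hide the item
}

// After: a one-liner function, but the intent is stated once, by name.
function isExpired($item)
{
	return $item->expires !== null && $item->expires < time() && !$item->pinned;
}

if (isExpired($item)) {
	// hide the item
}

?>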

Remove commented code. Remove debug information. Make sure indentation is perfect. Use blank lines to divide unrelated statements. Space helps reading. Books have paragraphs to separate ideas, so should your code. Long sentences are hard to read, so are long statements.

Some give hard rules to these guidelines. Lines should not be longer than 80 characters. Functions should not be longer than 20 lines. Variable names should be between 4 and 15 characters. Files should not be longer than 200 lines. Nesting level should be at most 3. The McCabe cyclomatic complexity should be no higher than 7. All these rules have good intentions, but they really only help reviewers in using static analysis tools rather than reading the code. The problem with these rules is that they always have exceptions.

I have no trouble reading a 100-line switch case. As long as a function has a readable pattern and a coherent view of the world, there is no problem. The real problem with long functions is that they often do too much and work at different levels. Make sure the abstraction level of the concepts manipulated in a function is consistent.

3. Don’t write useless comments

Organizations that enforce rules about writing API documentation on every single class, function or member variable tend to have the worst API documentation. Programmers are smart. They have tools to generate those comments, but machine-generated comments are simply meaningless most of the time. I’d rather have no comment at all than a comment that does not provide the information I need. In the end, I will have to read the code, so it’s better that I don’t waste time reading the comment on top of it. What good is a comment telling you that a function returns the time if it does not tell you in which format and which timezone?

Good function and variable names often remove the need for comments in the first place.

Comments are code too. A useless comment sitting there will have to be maintained when the code changes. Comments that do not match the code are misleading.
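Taking the time example above, the difference is simply whether the comment answers the questions the signature cannot; both functions here are made up for illustration.

<?php

/** Returns the time. */
function getTime()
{
	return time(); // the comment restates the name and adds nothing
}

/**
 * Returns the current request time as a UNIX timestamp (seconds, UTC).
 */
function getRequestTime()
{
	return time(); // the comment answers "which format, which timezone"
}

?>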

4. Thinking too far ahead is your enemy

Less code is better. Don’t write code just because it could be needed in the future. If the urge is too strong, write a comment explaining how the code could be expanded later. Most of the time, the person who will have to do it will face constraints you had not anticipated. Running into code that is supposed to do something, but does not really do it, is a complete waste of time. The future developer will first try to understand the code, then figure out why it does not work, attempt to fix it, adapt it to fit the needs, and spend hours debugging your untested design. Writing it based on their own design would have been much faster, but most people tend to trust the code in place. If you write something you never even executed and leave it there, the next one won’t know it does not work until they have a lot invested.

Your ideas for the future may be great, but keep them to yourself, because others will have better ideas in the future, and those ideas will match reality.

5. Don’t be clever, especially when it comes to optimization

In most cases, performance does not matter. Especially on the web, there are so many things surrounding the piece of code you are working on that the relative importance of your piece is negligible. As programmers, we tend to focus our world around our current task. It’s normal. It’s the only way code can be written; it requires deep attention. But don’t be clever. Clever code, no matter how smart it makes you feel when you write it, is just harder to understand for the next one. The next one would have to be as clever as you are to maintain it (or as mentally twisted).

Performance always seems to be one of those qualities we want to have, but obtaining it requires a lot of clever tricks, and those hide the original intent of the code. Unless you have evidence from a profiler that a piece of code has to be optimized, keep it readable; it might even end up faster. Optimizing without evidence tends to reinforce myths about performance, and like all myths, most of them are false.

6. Refactor as you go

A lot of changes can be made to improve code readability without altering the behavior. If you have to modify code and it’s unreadable, start cutting it apart right away. If you do it well, it won’t hurt and it will likely help you make your modification later. Code rarely kills people. If you’re scared to modify code, it’s probably because there is something terribly wrong with it, so start making innocent changes to clarify intent. Most of the time, the base design is not all bad. It was just degraded by years of blind patching.

Bad code becomes better through small, continuous changes, not by rewriting it all. Rewriting is mostly about wasting time and remaking the same errors that were made in the past. A lot of knowledge is hidden in code. If, after understanding the code and understanding its value, you come to the conclusion that it’s fundamentally wrong, then you are ready to rewrite it. But don’t rewrite because you are too lazy to understand.

Spend less time complaining about code quality and more time improving it. If you’re not happy with the state of things, make sure your own work at least respects your own standards. It should encourage others to raise the bar too.


Maintaining code forces you to read code. Reading code opens you to new paradigms and new ideas, and makes you a better programmer. That is much harder to achieve by writing code alone.

Read ahead

PHP 5.3 is going to be fun. I personally didn’t really follow the news on this one. I heard about lambdas/closures and that pleased me. I’ve just always hated defining functions for callbacks used only once. Not that it takes time, but I always felt forced to create a ridiculously long name to avoid any possible conflicts. Those times will be over… at least from the moment 5.3 is deployed widely enough to justify forcing it as a requirement.

One fact that I find really interesting about the implementation is that it adds functors to PHP. Basically, you can now write functions that take callbacks as parameters, with that callback being an object or just a function, without too much trouble. Just like the STL does.

<?php

class Foo {
	function __invoke() {
		return 42;
	}
}

$f = new Foo;

echo $f(); // prints 42

?>

However, this is just a side effect.

One concept that was new to me was traits, from Sebastian’s presentation. This one won’t be included in 5.3, so it will be for 5.4 or 6.0. The goal of traits is to share code between classes without using inheritance, hence avoiding multiple inheritance. While I can see it’s a cleaner way to do it, I really don’t find the argument compelling.

The implementation details for these are not fixed yet, so don’t worry so much about the syntax.

The main argument against multiple inheritance is the diamond effect. Consider an interface A which is implemented by B and C. If a class D inherits from both B and C, problems occur because of an ambiguity: there is no way to tell which parent’s method should be called. Typically, this is resolved by redefining the method in class D and choosing either implementation, or both. Human decision is required.

I’m personally neither really for multiple inheritance, nor against it. I just don’t care that much about the issue. I don’t mind if a language allows you to shoot yourself in the foot. You should be smart enough to make the right decisions, or be able to learn from your mistakes. I’m mostly against people being against it for that kind of argument. There are solutions for it and it’s not that complex to resolve. The fact that multiple languages out there have different ways of handling it is not an issue.

I think the problems you can have with a less complex structure than that diamond are worse. The diamond problem really occurs because both classes inherit from the same interface most of the time, which means both implementations really serve the same purpose. If you take away that common interface, the conflicts that occur are purely name conflicts, caused by a lack of namespacing. While it’s likely that the two methods do something similar, it may not be the case. If A is taken out of the picture and some function foo() in B returns a collection while foo() in C returns an integer, the problem you have is a whole lot worse. You can no longer just define a wrapper that calls either or both implementations.

Traits fully allow that, and they allow it in a much more subtle way. I’m not saying traits are bad. I think they’re a really nice way to share code between different classes, but the argument against multiple inheritance just does not fit.

<?php

trait B {
	function foo() {
		return array( 1, 2, 3 );
	}

	function bar() {
		return 2;
	}
}

trait C {
	function foo() {
		return 5;
	}

	function baz() {
		return 4;
	}
}

class D {
	use B, C;
}

?>

In this kind of situation, you get the exact same conflict. The same function name exists in the traits. Which one do you use? In fact, you can get a conflict with an even simpler model:

<?php
trait B {
	function foo() {
		return 5;
	}

	function bar() {
		return 2;
	}
}

class D {
	use B;

	function foo() {
		return array( 1, 2, 3 );
	}
}
?>

This is another kind of conflict. Logic would say that the class has precedence, but because traits are really a compile-time hack, it’s not that obvious. The solution to both kinds of conflicts is to have a syntax to alias the functions as they get copied into the new class. It works, but you can do that with multiple inheritance too, by just redefining the functions and calling the appropriate parent.
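For illustration, resolving the first conflict could look something like this, assuming an aliasing syntax along these lines; nothing about the syntax was final at the time.

<?php

class D {
	use B, C {
		B::foo insteadof C; // keep B's foo() as the method on D
		C::foo as scalarFoo; // keep C's version reachable under another name
	}
}

?>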

Of course, those kinds of conflicts are not likely to occur at design time. When you make your design, everything is beautiful and works. Things break when they evolve. If you have control over all the code, it’s really easy to manage, because you have unit tests, don’t you? Well, as part of a language, these things will be used by frameworks. If the framework decides an additional method is required in a trait, it may break code all around. Oops. Not quite better than multiple inheritance.

I still think traits are a good thing, but you need the right argument for them. The reason they make more sense is that they allow you to keep a coherent whole. Multiple inheritance is used to avoid copying code into multiple classes. It’s a matter of laziness. The relationships between the concepts don’t really have anything to do with inheritance. Traits allow you to represent that relationship for what it really is. But don’t tell me it’s a perfect solution.

From what I understood, there were also some open questions about how to handle state from traits. Basically, the trait is stateless, but if it can’t access the object’s data, there is no way you can do anything more useful than you would from a static function. One of the current proposals seemed to be to allow the trait to define abstract functions, which would mean that the trait would effectively be both an implementation and an interface. An abstract trait just seems confusing.

<?php

trait X {
	function foo() {
		return $this->bar() * 2;
	}

	abstract function bar();
}

?>

I think it could be made a lot cleaner, and allow more reuse, if the trait instead had requirements:

<?php

interface Barable {
	function bar();
}

trait X requires Barable {
	function foo() {
		return $this->bar() * 2;
	}
}

class D implements Barable {
	use X;

	function bar() {
		return 5;
	}
}

?>

It might seem like a small difference, but it could avoid duplicating interfaces in traits, or needing traits to implement interfaces.

I think it’s really great to see PHP still evolving, and with the transition to PHP 5 getting faster every month, it means we will truly be able to use these features soon.

Security in the wild

Wikis are open in nature and that is what brought them to success. Anyone can visit a page, edit it and see the changes live. The concept is really simple and became natural very fast. It’s all around. However, wiki applications evolved over time. The average wiki is no longer just text editable by anyone. Wikis became the heart of complete content management systems with access rights and many other features. Wiki purists cringe when they hear of a wiki that is not editable by anyone. The corporate world does the same when they hear that their intranet could be modified by any employee.

In a standard wiki, the worst thing that can happen is that someone gets offended by false information (or simply offensive spam). Undo the last change. The world goes on.

As wikis evolved, usage called for higher-level functionality. Pages are no longer only textual information; they tend to become full-blown applications. They can generate dynamic lists and interact with external systems. This is done mostly through a syntax extension often called a plugin. The concept is very simple. A unified syntax contains a name and some arguments. When the parser runs into it, it calls a custom function and displays the result. In most cases, these will perform harmless operations and cannot cause any damage. All they do is display text, only text a little more complex.

The problem is that they can be used for a whole lot of things, and harmless really is context-dependent. Consider a situation where content must be displayed from another web application, probably a legacy intranet application. One way to do it is to get the server to fetch the HTML page, filter out some of the tags so it fits nicely in the page, and display it. This technique is very vulnerable to content format changes and is quite hard for normal people to configure. An easier way would be to use an iframe and just load the page from wherever it is.

In a corporate setting, this probably works great because you can trust the people you work with not to screw up and load something they shouldn’t on the intranet’s home page.

If you want to use it on a public website where all edit rights are restricted, everything is fine. However, if you have a single page that allows public edits, you have just opened up a very wide security gap that could allow a sub-script-kiddy (the kind of people who “hack” pages on Wikipedia) to hijack sessions through XSS.

The main issue is that these extensions are either installed or not. You could use one at some point in a completely safe environment, stop using it, and then change the context that made it safe. The extension is still active and you forgot about it. It’s installed site-wide. There is no way to enable it only on specific pages that are controlled. Because the plugin instantiation is part of the page’s content, you can’t prevent anyone with edit rights on a page from using it.

In implementing remote plugins, this was a major issue. Not only was it a plugin that could potentially do harm, it was about plugins I don’t even know about. I had this vague idea of requiring input validation on the remote plugins before letting them run, so that nothing could be called unless an administrator granted permission. All of it was fairly complicated because of implementation issues. During a discussion on IRC with sylvieg and ricks99, I realized that the problem existed beyond the remote service problem. So far, I had really considered that if the context wasn’t safe, some extensions should simply not be installed. Rick was asking if there was a way to let admins add a plugin, but not anyone else. This got me to realize that the only reason it was hard to implement is that I was taking the problem from the wrong level. Applying the validation at the plugin-wide level made it much easier to deal with than doing it specifically for the remote ones. It also added a whole lot more value.

The final implementation is very simple in the end. When an extension can be dangerous, it declares it as part of its definition by identifying which parts require validation (body, arguments or both). When the wiki parser encounters a plugin that requires validation, it generates a fingerprint of the plugin and verifies whether that fingerprint is known. If it is, it goes on; otherwise it displays controls on the page for authorized users to perform the audit (non-authorized ones get an error message). The fingerprint is nothing more than the name of the plugin, a hash of the serialized arguments, a hash of the body and the size of both inputs to avoid collisions. Some arguments marked as safe can be excluded from the hash to allow some flexibility.
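A sketch of the idea only; the exact fields, separators and hashing used by TikiWiki may differ, and the function name here is invented.

<?php

// Arguments marked as safe are dropped before hashing so they can change
// without invalidating a previous approval.
function plugin_fingerprint($name, array $args, $body, array $safeArgs = array())
{
	foreach ($safeArgs as $key) {
		unset($args[$key]);
	}

	$argString = serialize($args);

	// Name, two hashes and both sizes: a collision on one hash alone is not
	// enough to slip an unapproved plugin call through.
	return implode('-', array(
		$name,
		md5($argString),
		strlen($argString),
		md5($body),
		strlen($body),
	));
}

$known = array(); // fingerprints already approved by an auditor
$print = plugin_fingerprint('iframe', array('src' => 'http://intranet/page'), '');

if (!in_array($print, $known, true)) {
	// display the plugin as pending validation instead of executing it
}

?>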

The end result is that any plugin can be enabled on any host in any context and the site’s users are still safe from XSS attacks. More capabilities for public, open wikis. Of course, in line with TikiWiki policies, validation can be disabled, which is useful if you have one of those safe contexts.

It does have a downside though. Validation is required whenever changes are made to the plugin call, which means the page is not fully enabled until an auditor has visited it, and that may take some time. Notifications, tracking, … There are solutions, but viewing the changes is no longer possible as soon as you click save. The white-list verification is a pessimistic approach to the problem, but it’s still better than letting a few identities be stolen until it’s caught.

The implementation is available in TikiWiki SVN and will be released as part of 3.0 in April 2009.