Recently I have been reading Programming Collective Intelligence by Toby Segaran. I love the subject. It’s all about handling large data sets and extracting useful information from them. Finally, an algorithm book that covers useful algorithms. I don’t read code-centric books very often because I think they are boring, but this one has a great variety of examples that keep it interesting as the chapters advance. There are also real-world examples using web services to fetch realistic data.

My only problem with the book is that there are way too many code samples. It may just be my training, but there are some situations where just writing the formula would have been a lot better. Code is good, but when there is a strong mathematical foundation to it, the formula should be provided. Unlike computer languages, mathematics as a language has been developed for hundreds of years and it provides a concise, unambiguous syntax. I like the author’s effort to write the code as a proof of concept, but I think it belongs in an appendix or on the web rather than between paragraphs.

Which one do you prefer?

def rbf(v1,v2,gamma=20):
    dv=[v1[i]-v2[i] for i in range(len(v1))]
    l=veclength(dv)
    return math.e**(-gamma*l)

or
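The same function written as a formula (a reconstruction from the code above, assuming `veclength` returns the Euclidean length of the difference vector):

```latex
\mathrm{rbf}(v_1, v_2) = e^{-\gamma \, \lVert v_1 - v_2 \rVert}, \qquad \gamma = 20
```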

For that kind of code, I vote #2 any time. I’m not a Python programmer. I can read it without any problem, but that vector subtraction did seem a little arcane at first and it took me a few seconds to figure out, and I’m fairly certain that even a seasoned Python programmer would have stopped on that one. It’s not that it takes really long to figure out, but it keeps you away from what is really important about the function. What is important is that you want to give points that are far away from each other a lower score than those that are close by. Anyone who has done math could figure it out from the formula because it’s a common pattern. From the code, would you even bother to read it?

This is a very short code sample. In fact, it’s small enough that every single detail of it can fit into your short term memory. Here is an example that probably does not. In fact, I made your life a lot easier here because this code was scattered across 4 different pages in the book.

import math

def euclidean(p,q):
    sumSq=0.0
    for i in range(len(p)):
        sumSq+=(p[i]-q[i])**2
    return (sumSq**0.5)

def getdistances(data,vec1):
    distancelist=[]
    for i in range(len(data)):
        vec2=data[i]['input']
        distancelist.append((euclidean(vec1,vec2),i))
    distancelist.sort()
    return distancelist

def gaussian(dist,sigma=10.0):
    exp=math.e**(-dist**2/(2*sigma**2))
    return (1/(sigma*(2*math.pi)**0.5))*exp

def weightedknn(data,vec1,k=5,weightf=gaussian):
    dlist=getdistances(data,vec1)
    avg=0.0
    totalweight=0.0
    for i in range(k):
        dist=dlist[i][0]
        idx=dlist[i][1]
        weight=weightf(dist)
        avg+=weight*data[idx]['result']
        totalweight+=weight
    avg=avg/totalweight
    return avg

or
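The same computation written as formulas (a reconstruction from the code above; the neighborhood symbol N_k(v), meaning the k items of the data set closest to v, is notation assumed here for illustration):

```latex
g(d) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-d^2 / (2\sigma^2)}

\mathrm{weightedknn}(v) =
  \frac{\sum_{x \in N_k(v)} g\big(\lVert \mathrm{input}(x) - v \rVert\big) \, \mathrm{result}(x)}
       {\sum_{x \in N_k(v)} g\big(\lVert \mathrm{input}(x) - v \rVert\big)}
```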

The formula is insanely shorter, and the notation could certainly be improved. What’s the trick? It relies on well-documented language features like vector operations and trims out all the Python-specific code. I actually wrote more than I had to because gaussian itself is well defined in math. Because all operations used are well defined, whichever language you use will probably support them and you can use the best possible tool for your platform. The odds that I use Python when I get to use those algorithms are low, so why should I have to bother with the language specifics?

The author actually included the formulas for some functions in the appendix. I just think it should be the other way around.

The book is called “Programming Collective Intelligence” (not “collaborative”).

I totally agree, give me the math so I have the freedom to decide how to realize it in code in whatever language.

Given that PCI is a Python-centric book with an ML flavor, perhaps providing both would have been a better idea, with one or the other in an appendix.

But the Python version is the “working” version. I don’t know of any language where you can input such complex GRAPHICAL formulas (or another representation) and have them just “work”.

I think the problem is not code samples in general, but bad code samples.

The code samples displayed reimplement well-known mathematical operations every time they’re used, and they’re basically Fortran code in disguise. Idiomatic Python versions would be something along the lines of:

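from math import e, pi, sqrt

# Assumes v1, v2, x and mu are vector objects that support subtraction and .length().
def rbf(v1, v2, gamma=20):
    return e ** (-gamma * (v1 - v2).length())

def gaussian(mu, x, sigma):
    r = 1 / (sigma * sqrt(2 * pi))
    return r * e ** (-(x - mu).length()**2 / (2 * sigma**2))

# Assumes data.input and data.result are already sorted by distance to v.
def weightedknn(data, v, k=5, sigma=10.0):
    weights = [gaussian(v, input, sigma) for input in data.input[:k]]
    weighted = [w * r for (w, r) in zip(weights, data.result[:k])]
    return sum(weighted) / sum(weights)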

(I could be a bit off, but the point should be clear. Also, I don’t know how to format comments in WordPress.)

For me the code samples are easier to read, but I’m a programmer, not a mathematician.

I agree, but I also think that not only samples should be written in mathematical form but also entire programs.

Think of it as the next generation of syntax highlighting. Anyone who has tried Mathematica knows what I mean.

Lurker: Thank you. Can you believe I made that mistake twice within 30 seconds?

markus: I don’t think you should write the programs in math, that would not make any sense. Not all problems can be represented in math in a useful manner. The problem is that even if the Python code is executable, paper does not execute. I can’t quickly play with it and get a feel for the values without running a full trace when I’m sitting on public transit. Math allows me to get that feel much more easily.

shp: I have the feeling you are right, but I don’t know enough Python to tell if that would be possible. Operator overloading is a crucial aspect of a language when dealing with vectors and matrices. It removes a lot of burden on the brain when reading the code.
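For what it is worth, Python does support operator overloading. A minimal sketch of how the subtraction in shp’s version could be made to work, assuming a small hand-written `Vector` class (hypothetical, not from the book) whose `length()` is the Euclidean norm:

```python
import math


class Vector:
    """Tiny illustrative vector type (hypothetical, not from the book)."""

    def __init__(self, *components):
        self.components = tuple(float(c) for c in components)

    def __sub__(self, other):
        # Element-wise subtraction, so `v1 - v2` reads like the math.
        return Vector(*(a - b for a, b in zip(self.components, other.components)))

    def length(self):
        # Euclidean norm of the vector.
        return math.sqrt(sum(c * c for c in self.components))


def rbf(v1, v2, gamma=20):
    # Same shape as the formula: e^(-gamma * |v1 - v2|).
    return math.exp(-gamma * (v1 - v2).length())


print(rbf(Vector(0, 0), Vector(3, 4), gamma=0.1))  # exp(-0.5), roughly 0.607
```

With the arithmetic hidden behind `__sub__`, the body of `rbf` is one line and can be compared with the formula term by term.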

ki: I am also a programmer first, but I have enough of a math background.

Crazy Ivan: I really don’t think it applies to all problems. If you are in the scientific field, it certainly makes sense. However, there are many situations where computer languages are a lot better than math to solve some problems. Networking is not something math can express at all. System architecture would be painful. You still need those classes, inheritance and all those other concepts that were developed specially for writing larger programs. From my experience, math can only be used to represent some localized operations.

I find it far far far easier just to read the code. Taking equations and turning them into code is a pain if you don’t know the mathematical notation used.

I am a working developer and working on my bachelor’s degree. I have had enough math classes to want to see the math first, then the code example if I need it. A year ago I may have wanted to just see the code. Viewing the math, you can see relationships faster than viewing the code. And when you have been exposed to certain patterns and then see them in other algorithms, light bulbs go off. I chose the math.

Now, I don’t know much about this myself, but here’s a guy who might take issue with your statement regarding mathematical notation’s unambiguity (amongst other properties):

“Mathematical notation fundamentally sucks. I will demonstrate this in an instant: we all know that cos^2 x = (cos x)(cos x), but cos^-1 x =/= 1/cos x. Can you understand how horrible this is to a little kid in high school who’s logical and smart? The dumb ones will understand this, but the smart ones won’t … it’s not because [mathematicians] are dumb, it’s because mathematics is a natural language to them.”

http://video.google.com/videoplay?docid=-2726904509434151616

(watch it, it’s fun!)

What do you think?

What I meant before wasn’t actually that the ENTIRE program should be written as mathematical equations, but that certain rows of the program that execute a mathematical operation should be “highlighted” as a mathematical equation.

Say for example that the actual codefile contains something like Math.Pow(5,5), then the IDE should render this as…well, like you do on a paper.

And of course you should also be able to fall back (with one click) to render the plain text instead of the “highlighted” equations.

Anyone who has tried to create a 4×4 matrix with plain code has to agree. You can get it nicely aligned using tabs, but if you have a matrix with only digits and then realize that you have to throw in a Math.Cos(someAngle) somewhere in the middle of the matrix, the alignment will be completely messed up and the readability drops below 0… unless you want to spend time aligning it again.

I think that the way IDEs work today, with fixed row height etc., is outdated. Another example is the { } that surround a function in most languages: why does it have to be as high as the rest of the text? It’s just in the way.

I still want to have it there because I wouldn’t trust myself with indent-based logic like Python’s, but it should be small so it doesn’t obscure the actual code.

I think it is highly unfair of you to put the code samples in such a small font that they’ve become unreadable, while providing math equations in nice large text. Not to mention, programming samples should be shown in a programming book. Sure, perhaps some math examples may complement the existing content nicely, but code should _never_ be replaced with math equations (in books for coders).

When writing a book or article geared towards programming, use code samples. Why? Because you can look at code samples and understand what they do. Mathematical notation is laden with implied prerequisite knowledge. If someone has that prereq knowledge, they can easily write out the formulas themselves. If they do not, however, does this mean that this person isn’t smart or educated enough to deal with the material at hand?

I can say from personal experience, this is not the case. My mathematics only went as far as DiffEQ’s, Linear Algebra and Advanced Numerical Analysis. I have had to implement algorithms for things I honestly did not understand before… and that’s ok. I eventually understood it *through the code*. If you gave me a mathematical formula, though… it’s a black box of knowledge. It has input, it has output, but you have no idea what’s going on to produce that output.

The point is, when writing a book geared towards any form of programming, it is better (albeit less efficient for those already in-the-know) to include code samples over formulas. I’m sure it frustrates you, but you must realize it’s for the greater good, not you.

umm. maybe you should have read “Formulating Collective Intelligence”… someone reading all the way through a book whose title begins with the verb “Programming” does NOT have a right to complain that there’s too much programming in it..

Also, your second example formula has no meaning until “result” and “input” are defined. Would you rather spread those definitions across 4 pages (which you seem to disparage) or ramble on for 4 pages with no examples and plop your “formula” at the end of it all? I think that would be a lot less readable no matter what your notation preferences are.

Authors have termed this as domain specific examples.

When you write for software developers, write examples in software.

I feel comfortable reading the Python code, or C for that matter, although I’d probably feel fine reading JavaScript, C#, or even Java.

The constructs used to illustrate the point should be familiar to the reader to lower the barrier to understanding. Iterations, ranges, constrictions, conditionals should be understandable and clear to even novice programmers.

It’d be like teaching math geeks law in the Klingon language. Learning the underlying structure to get to another problem is not a productive use of time, especially when you don’t care about Klingon.

Bob: Nice video, quite entertaining. The notation certainly has flaws because it was meant to be concise and was developed over very long periods by people who often never really spoke together. However, there are conventions. I totally agree that you can very well pass any math course without understanding anything and just playing with symbols. The main issue is that pure math is abstract, and unless you model all the constraints of the problem you are trying to solve, you can end up with a wrong solution.

Sure, there are a lot of symbols to remember, but once you’ve done enough, it’s all natural… just like reading or writing any language.

Even if math is not perfect, I still think it’s better than code to represent numerical transformations.

Crazy Ivan: I use vim to write code, so I don’t think that will happen any time soon, but it could be nice. Probably a very hard problem to solve if you want it to work for all languages and not just Java.

Jacob Sheehy: I fixed that now. I just didn’t think about it.

As for the programming book argument, I don’t think programming is only about code. Sure, you can copy code snippets from a book, but understanding its fundamentals is more important. In the case of this book, it demonstrates different solutions to understanding data. It’s not restricted to the examples in the book. If I am to apply the solutions to different problems, I would much rather understand the theory in depth.

I’m not saying code samples should be entirely removed. They are important, especially in the early introduction chapters, but as the book moves along, I like to get the noise away and focus on the problem to solve rather than the code details.

I don’t see Programming Collective Intelligence as a book for coders. Sure, it explains how to code these solutions, but the important part of it is to learn how to work around those types of problems. It’s much more about information management and data correlation than about code. Code is just a medium.

Collin Cusce: I get the point that not everyone has the math background to understand the formulas as I wrote them, but the book had an introduction to Python notation to bring people up to speed. It could just as well have an introduction to the mathematical notations used through the book. There wouldn’t be so many of them. In fact, the learning would be incremental, as not all notations are used at the same time and early ones are fairly easy.

“I’m sure it frustrates you, but you must realize it’s for the greater good, not you.”… So the greater good means keeping people away from notations that could help them understand better?

dut: As I mentioned, I don’t think programming means code. Programming is mostly what I do for a living. I probably spend as much time, if not more, scratching paper to figure out the solution than writing code. Code is the final product, but there are steps to take before reaching code. Among those steps, there is building a mental model of the problem to solve. If you deal with math and numerical transformations, I think using math will help you understand the problem more than code, just because it’s more visual and concise. Plus, there are many tools available to plot graphics and get a quick feel of the function.

Yet, I consider all of this to be programming.

You are right about result and input not being defined. In fact, I did not like these names too well, but I wanted to stick as close as possible to the original code. To me, this would be a valid definition:

input(x): Obtain coordinates for given object x.

result(x): Obtain numerical value for given object x.

Simple and platform independent. The function does not have to worry if the data is stored in an array, a database, on a remote machine, if it’s all loaded at once or fetched by chunks. All these details are provided to me by the code, but I don’t care about them, because I won’t be using Python anyway.

BTW, the formulas were actually in different chapters as some were introduced before.

Wade Mealing: I always had this feeling that computer science and engineering were fairly close to math, and thus people should have a good feel of the basic notation. I’m pretty certain that the book would not be very meaningful to someone without math background anyway.

Sure, I can read about any standard language. They all have pretty much the same structures and can be understood, even by novice programmers, but code is not easy to read. It really takes an effort. Math is also not so easy to read, but it will be much easier to see patterns and similarities. You don’t have to parse the entire code to see that a term was added, which is the case in the sample I used. The function defined before was knn. The weightedknn really only adds a weight multiplier in the average. Everyone with some statistics background knows what a weighted average is. I could have easily skipped 5 pages just by looking at the formula. Instead, I had to stop to understand the code.
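Side by side, the added term is easy to spot. A sketch of the two averages, assuming x_1 … x_k are the k nearest neighbors of v and w_i is the weight given to neighbor x_i (symbols chosen here for illustration):

```latex
\mathrm{knn}(v) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{result}(x_i)
\qquad
\mathrm{weightedknn}(v) = \frac{\sum_{i=1}^{k} w_i \, \mathrm{result}(x_i)}{\sum_{i=1}^{k} w_i}
```

Setting every w_i to 1 turns the second formula back into the first.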

I think it’s just a matter of using the right language for the situation. The same way you would not write a complete web application in C because it would be way too complicated, I don’t think writing Python code to explain a mathematical formula is a good thing.

—

I’m really surprised by the response to this post. Opinions are really divided, which is fun.

Well, again, mathematics is a black box. I do think code is far more expressive than mathematics as code defines pragmatic procedure, whereas mathematics implies procedure.

For instance:

Dx(F(x));

In a computer there are many many ways to implement differentiation. Symbolic, brute force, computationally, greedy algorithms….etc.

Depending on what heuristic you use, you will get different results.

When in the context of writing a book about programming, simply leaving such heuristics up to the reader will probably, nay, definitely confuse someone. If you write a large program which handles symbolic differentiation to get a fairly accurate result whilst your reader implements a greedy algorithm to find the nearest fit tangent line, your answers will differ. Sometimes by a little, sometimes by A LOT… and I mean a lot. That’s how loss of significance and compound error works.
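A small sketch of how much the method can matter, using a forward finite difference as a stand-in for the numerical approaches mentioned above (the function name `forward_difference` and the step sizes are illustrative assumptions, not something from the book):

```python
import math


def forward_difference(f, x, h):
    # Numerical estimate of f'(x); accuracy depends heavily on the step h.
    return (f(x + h) - f(x)) / h


exact = math.cos(1.0)  # analytic derivative of sin at x = 1

for h in (1e-1, 1e-6, 1e-15):
    estimate = forward_difference(math.sin, 1.0, h)
    print(f"h={h:g}  estimate={estimate:.6f}  error={abs(estimate - exact):.2e}")
```

A large step suffers from truncation error and a tiny step from loss of significance, so the same formula gives visibly different answers depending on how it is realized.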

So yes, when dealing within the context of a programming book, mathematical formulas are the very definition of insufficient. If you desire to use formulas, write them yourself.

Seems like you are making a point of misinterpreting my idea. The formulas I mentioned are not anywhere near calculus or any kind of numeric approximation, and even if that were the case, maybe a less computation intensive method sacrificing some precision would suit my use domain, so I should have the right to choose. Certainly it’s meaningful to discuss the error in some cases. In this book, the topic is about interpreting data, not the actual calculation, and the calculations used are quite simple.

I’m surprised nobody has mentioned Fortress. They’re trying to dramatically lessen this distinction so you don’t have to choose between formulae and code. Take a look at the NAS CG benchmark in Fortress. I hope your keyboard has plenty of modifier keys!

I’m decent at math, but even the first example is much clearer in Python code, as non-idiomatic as it may be. The second one doesn’t even compare.

Sorry but what is your complaint again? Is the weight of the paper used in the python samples causing you back pains? The code makes this book -for programmers- more readable -for programmers- and you are telling me this is a bad thing because…?

You start by claiming nobody reads code samples (the article’s title implies so). People come and state they do. You are proven wrong in one single step. Why are you still arguing?

Time would be better spent discussing what language should have been used. If they are aiming at the widest audience, JavaScript would have been perfect.

rgz: If you’re not happy with my opinion, feel free not to read it. This is not slashdot, no need to flame.

When half the pages consist of code samples, some of which are not relevant because they are merely incremental, I feel I waste time reading.

At this point, no one said they were reading all code samples, they only said they preferred it over maths.

1) Nowhere did I ask you to keep your opinions to yourself, let alone on your own blog. All I did was ask why you insist on keeping this position, which is a valid thing to ask.

As far as I understand this is your position, that it doesn’t matter how many people come and tell you that code is better for them and helps them understand the book, you still insist that traditional notation is best for everybody and so the book still should not include code samples, or at least not beyond the introductory chapters.

2) Thanks, this is the actual answer to my question: you feel you waste your time reading the same thing twice. I would also feel I’m wasting my time reading the same thing twice. When I find a multilingual instruction manual, I don’t read it in as many languages as I can, just the one I read best. This is only a suggestion: why don’t you do the same? (This one is not actually a question. (This is not a sarcasm disclaimer either.))

3) And let me quote “At this point, no one said they were reading all code samples, they only said they preferred it over maths.”

Oh come on! This is petty rhetoric. I guess you are having a fun time dissecting the second paragraph in this comment, counting in how many ways it is not exactly and literally what you said.

You are clearly not interested in a constructive dialog…

Well, it’s just that I don’t really have the impression that those defending code understand what I mean at all. Hard to have a dialogue when both parties talk about different issues. Plus, my words seem to have been taken as an extreme, which is not the case.

I’ll try to illustrate better. When I see the gaussian formula, this is what I see (or I can plot it really easily):

When I look at code, I actually have to think for a while before I get to that picture. Without that picture in mind, you can read the code and understand how it executes, but understanding the execution and understanding the purpose are very far apart. Certainly there are cases where you need code to show the execution, but when you try to correlate data, the purpose is much more important. It does not really matter how the code executes; what it does is important.

There are not so many common formula forms, but when you know them, you can get a much quicker grasp of the visual solution. It’s much easier to see that the point of that weight multiplier is to give significant weight to nearby points and decrease quite fast, but never reach 0. I could decide to change this for a log and favor large distances. Can you actually get that kind of picture reading code? I think that if you do, you probably have enough background to understand the formula in the first place and spare yourself the code interpretation and transformation back to the formula.
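As a rough illustration of that shape, here is a small sketch that evaluates the gaussian weight from the post’s code at a few distances (the sample distances are arbitrary choices, not from the book):

```python
import math


def gaussian(dist, sigma=10.0):
    # Same weighting function as in the post's code sample.
    exp = math.exp(-dist ** 2 / (2 * sigma ** 2))
    return (1 / (sigma * math.sqrt(2 * math.pi))) * exp


for dist in (0, 5, 10, 20, 40):
    print(f"dist={dist:>2}  weight={gaussian(dist):.6f}")
```

The printed weights shrink quickly as the distance grows but stay positive, which is the picture the formula is meant to suggest.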

This really does not apply to just any kind of code.

Actually I don’t get that picture at all unless I know that x = range(40) (or x = 0..40 for Rubyists), but that’s beside the point.

But really, the expression doesn’t look at all like the plot, regardless of the language used to express it. You give the impression that you think the formula inherently looks like the plot. This is not exactly what you are saying, because you realize the only reason the formula suggests this shape to you is that you have a background in mathematics, where you have seen this plot shape beside this formula shape many times with many variations.

What you are saying, if I’m getting you right, is that if you are capable of understanding this graphic, it is because you have studied it in a math class where you learned the formula, and thus you can spare the code.

The implication being that if you didn’t study (or can’t remember) math, you have no business reading this book because you won’t understand what it says. Besides being rude, it’s a weird claim to make about a book that tries to explain math to non-mathematicians.

That is the reason your words are taken to an extreme: you are offending many people right now.

(Communicating this to you is the purpose of this comment by the way, and of course you are free to offend as many people as you want.)

Can you conceive of the possibility of there being other math notations besides the one you learned in school? (Or are you suggesting that the traditional notation is indeed perfect and absolute instead of arbitrary?)

If you can conceive of the possibility of alternate notation systems, then you can conceive of the possibility of people learning math in those notations, so let them be.

Slightly changing the topic, would you be any less anal about it if the author used Haskell instead of Python in the code samples?

My point is that the representation highlights patterns better and allows you to see the relationships much faster. You can then mix your own improvements and extend beyond the examples.

Being rusty in math is nothing insulting. When I go back to C, I am rusty too. Doesn’t mean I shouldn’t make the effort to understand it, even if it means getting a refresher. If you never did math, well, the best you can do with those problems is copy the code anyway, so you could be saved a step by getting it online. This has nothing to do with notations. The theory is complex and you need some background to fully get the details. Sorry. I first started programming when I was a teenager. The great thing about programming is that anyone can learn it by themselves because there is plenty of documentation available. But face it, if you did not spend 7 years in school learning some of the theory, some things may be out of your reach until you spend some time learning on your own, the same way some things remain out of your reach when you get out of school. Some problems need more learning than others.

I don’t get the point that the book should be trying to explain something to non-mathematicians. I am learning a lot reading it. I knew the functions before. I just never would have thought of using them in that way. This is about applying theory to a different context.

It’s not only about the math part of it. By chapter 7, you should have figured out how to load data in an array. No need to explain that code again for the specific problem. Put it online if you wish to save some time when trying it out, but no need to explain the details. No need to spend two pages explaining how to authenticate to eBay or Facebook web services. The code is written and static. Put it in a library and tell people to use it for the sake of the example. Focus on the important topics, things that matter. This kind of redundancy annoys me. The same applies to math functions in code. If you add a single term, you don’t need to rewrite the entire code again. With the math formula visible, you can quickly see the difference and it does not take half a page to display and a full page to explain.

Oh, and I don’t know anything about Haskell, but I guess with a sufficient primer, it would do just as well. It’s not against Python, it’s against the abusive amounts of code on pages that numb my brain and keep me away from focusing on what I want to learn.

Any clearer?

Seems to me that when your complaint gets down to specifics, it’s more about poorly-edited code examples than code examples per se. I agree with you that unless you’re writing a book about authenticating to eBay or Facebook, you shouldn’t have to wade through code explaining how to do that. You’re absolutely right; that sort of thing is for (well-commented) libraries.

While I agree a good programmer ought to be able to read mathematical notation if he’s working in a mathematical arena, I consider that just the basic domain knowledge required for any specialized programming. For me, I often find it easier to see patterns and relationships in code than in some formulae. I don’t find your assertions to the contrary very convincing.

Math notation does not, as was claimed above, show the purpose of a function. It shows a sequence of calculations, just like what you might enter into a computer, except with markings that can’t be produced with a keyboard, in a syntax where fonts and formatting are even more significant than Python’s significant whitespace.

The apparent advantage of math notation comes from the fact that mathematicians MEMORIZE the formulae they use on a regular basis. This is what gives you the ability to “see what the formula does”. You don’t really see what it does, you simply remember the formula, just as I remember the word “gaussian”. Programmers let the computer memorize the formulae they use regularly, so their brains are free to think about more important things.

Well, we have comments and documentation strings, you know? ;). Actually, I prefer this:

or even better, use NumPy or a similar module/library available for your preferred language that provides linear algebra and, assuming v1 and v2 are already NumPy arrays:
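A minimal sketch of what such a NumPy version might look like, with the formula kept in the docstring in TeX syntax (this assumes the vector length in the original `rbf` is the Euclidean norm; it is an illustration rather than the code the comment originally showed):

```python
import numpy as np


def rbf(v1, v2, gamma=20):
    r"""Radial basis function: $e^{-\gamma \, \lVert v_1 - v_2 \rVert}$.

    Assumes v1 and v2 are NumPy arrays of the same shape.
    """
    # np.linalg.norm defaults to the Euclidean (2-)norm for 1-D arrays.
    return np.exp(-gamma * np.linalg.norm(v1 - v2))
```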

Of course, then we need to understand TeX syntax or have it handy, but who doesn’t? O:)