Recently I have been reading Programming Collective Intelligence by Toby Seragan. I love the subject. It’s all about handling large data sets and finding useful information out of it. Finally an algorithm book that covers useful algorithms. I don’t read code-centric books very often because I think they are boring, but this one has a great variety of examples that keep it interesting as the chapters advance. There are also real world
examples using web services to fetch realistic data.
My only problem with the book is that there are way too many code samples. It may just be my training, but there are some situations where just writing the formula would have been a lot better. Code is good, but when there is a strong mathematical foundation to it, the formula should be provided. Unlike computer languages, mathematics as a language has been developed for hundreds of years and it provides a concise, unambiguous syntax. I like the author’s effort to write the code as a proof of concept, but I think it belongs in an appendix or on the web rather than between paragraphs.
Which one do you prefer?
def rbf(v1,v2,gamma=20):
dv=[v1[i]-v2[i] for i in range(len(v1))]
l=veclength(dv)
return math.e**(-gamma*l)
or
For that kind of code, I vote #2 any time. I’m not a Python programmer. I can read it without any problem, but that vector substraction did seem a little arcane at first and it took me a few seconds to figure out, and I’m about certain that even a seasoned Python programmer would have stopped on that one. It’s not that it takes really long to figure it out, but it really keeps you away from what is really important about the function. What was important was that you want to score points that are far away from each other a lower value than those that are close by. Anyone who has done math could figure it out from the formula because it’s a common pattern. From the code, would you even bother to read it?
This is a very short code sample. In fact, it’s small enough that every single detail of it can fit into your short term memory. Here is an example that probably does not. In fact, I made your life a lot easier here because this code was scattered across 4 different pages in the book.
def euclidean(p,q):
sumSq=0.0
for i in range(len(p)):
sumSq+=(p[i]-q[i])**2
return (sumSq**0.5)
def getdistances(data,vec1):
distancelist=[]
for i in range(len(data)):
vec2=data[i]['input']
distancelist.append((euclidean(vec1,vec2),i)
distancelist.sort()
return distancelist
def gaussian(dist,sigma=10.0):
exp=math.e**(-dist**2/(2*sigma**2))
return (1/(sigma*(2*math.pi)**0.5))*exp
def weightedknn(data,vec1,k=5,weightf=gaussian):
dlist=getdistances(data,vec1)
avg=0.0
totalweight=0.0
for i in range(k):
dist=dlist[i][0]
idx=dlist[i][1]
weight=weightf(dist)
avg+=weight*data[idx]['result']
totalweight+=weight
avg=avg/totalweight
return avg
or
The formula is insanely shorter, and the notation could certainly be improved. What’s the trick? It relies on well documented language features like vector operations and trims out all the python-specific code. I actually wrote more than I had to because gaussian itself is well defined in math. Because all operations used are well defined, whichever language you use will probably support them and you can use the best possible tool for your platform. The odds that I use Python when I get to use those algorithms is low, so why should I have to bother with the language specifics?
The author actually included the formula for some function in the appendix. I just think it should be the other way around.