After Advanced Computer Architecture and Parallel Processing, I’m going to review another book from the same series. As the title suggests, the goal of this book is to introduce the tools that may be used in parallel, grid and distributed computing. This is the layer above the architecture presented in the previous book.
This is my first review. I read this book some time ago but I still want to write about it because the topic is very interesting.
In its March 2008 issue, IEEE Computers published a case study on large-scale parallel scientific code development. I’d like to comment on this article, which I think is a very good one.
Five research centers were analyzed, or more precisely their development tools and processes. Each center did research in a particular domain, but they all seem to share some Computational Fluid Dynamics basis.
Some of the widely used methods are based on a similarity graph built from the local structure. For instance, LLE uses relative distances, which are related to similarities. Using similarities allows the use of sparse techniques: most pairs of points are not similar, so the similarity matrix is sparse. This also means that a lot of manifolds can be reduced with these techniques, but not with Isomap or the other geodesic-based techniques.
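To give an idea of how sparse such a similarity matrix is, here is a minimal sketch. It assumes a k-nearest-neighbor graph with Gaussian weights; the function name `knn_similarity` and the parameters `n_neighbors` and `sigma` are illustrative choices of mine, not something taken from a specific method. The matrix is stored densely here only to count its non-zero entries.

```python
import numpy as np

def knn_similarity(points, n_neighbors=5, sigma=1.0):
    """Gaussian similarities restricted to each point's nearest neighbors."""
    # Squared Euclidean distances between all pairs of points.
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    n = len(points)
    similarity = np.zeros((n, n))
    for i in range(n):
        # The first sorted index is the point itself (distance 0), so skip it.
        neighbors = np.argsort(sq_dist[i])[1:n_neighbors + 1]
        similarity[i, neighbors] = np.exp(-sq_dist[i, neighbors] / (2 * sigma ** 2))
    # Symmetrize so that the graph is undirected.
    return np.maximum(similarity, similarity.T)

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3))  # synthetic data, for illustration only
similarity = knn_similarity(points)
fill_ratio = np.count_nonzero(similarity) / similarity.size
print(f"non-zero entries: {fill_ratio:.1%}")
```

With 5 neighbors and 200 points, at most 5% of the entries can be non-zero, which is why a sparse storage format pays off as the number of points grows.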
It is worth mentioning that I only implemented Laplacian Eigenmaps with a sparse matrix, due to the lack of a generalized eigensolver for sparse matrices, but I hope it will be available shortly.
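To show where the generalized eigenproblem comes from, here is a small dense sketch of Laplacian Eigenmaps, not my sparse implementation. It assumes a symmetric weight matrix for a connected graph, and it rewrites the generalized problem L y = λ D y as a standard symmetric one so that plain NumPy is enough. The function name and the toy path graph are only there for illustration.

```python
import numpy as np

def laplacian_eigenmaps(weights, n_components=2):
    """Embed a graph by solving L y = lambda D y (dense sketch)."""
    degrees = weights.sum(axis=1)
    laplacian = np.diag(degrees) - weights
    # Substituting y = D^{-1/2} v turns the generalized problem into a
    # standard symmetric one: D^{-1/2} L D^{-1/2} v = lambda v.
    inv_sqrt_deg = 1.0 / np.sqrt(degrees)
    normalized = inv_sqrt_deg[:, None] * laplacian * inv_sqrt_deg[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(normalized)  # ascending order
    # Map back to y and skip the first eigenvector, which is constant.
    embedding = inv_sqrt_deg[:, None] * eigenvectors[:, 1:n_components + 1]
    return eigenvalues[1:n_components + 1], embedding

# Toy example: a path graph with 10 nodes and unit weights.
n = 10
weights = np.zeros((n, n))
idx = np.arange(n - 1)
weights[idx, idx + 1] = 1.0
weights[idx + 1, idx] = 1.0
eigenvalues, embedding = laplacian_eigenmaps(weights, n_components=1)
print(embedding[:, 0])
```

On this path graph, the one-dimensional embedding orders the nodes along the path, which is the expected behaviour. A sparse version would keep the same structure but needs an eigensolver that accepts the degree matrix as the second operand.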
Analytical solutions to the dimensionality reduction problem are only possible for quadratic cost functions, as in Isomap, LLE, Laplacian Eigenmaps, … All these solutions are sensitive to outliers. The issue with the quadratic hypothesis is that it assumes there are no outliers, but on real manifolds, noise is always present.
Some other cost functions have been proposed, also known as stress functions, as they measure the difference between the estimated geodesic distance and the computed Euclidean distance in the “feature” space. Every metric MDS cost function can be used as a stress function; here are some of them.
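As a minimal sketch of what such a stress function computes, here are two classical metric MDS costs, the raw stress and Sammon's stress. I assume the estimated geodesic distances are already available as a square matrix, and the function names are mine. These two are shown as well-known examples only.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all pairs of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def raw_stress(geodesic, embedding):
    """Sum of squared differences between geodesic and embedded distances."""
    i, j = np.triu_indices(len(embedding), k=1)  # each pair counted once
    residual = geodesic[i, j] - pairwise_distances(embedding)[i, j]
    return np.sum(residual ** 2)

def sammon_stress(geodesic, embedding):
    """Sammon's stress: errors on small distances weigh more."""
    i, j = np.triu_indices(len(embedding), k=1)
    target = geodesic[i, j]
    residual = target - pairwise_distances(embedding)[i, j]
    return np.sum(residual ** 2 / target) / np.sum(target)

# Toy example: three points on a line, so geodesic = Euclidean distance.
reference = np.array([[0.0], [1.0], [3.0]])
geodesic = pairwise_distances(reference)
distorted = np.array([[0.0], [1.0], [4.0]])  # last point moved by 1

print(raw_stress(geodesic, reference), raw_stress(geodesic, distorted))
print(sammon_stress(geodesic, distorted))
```

A perfect embedding gives a stress of zero. Moving one point changes two pairwise distances by 1 each, so the raw stress of the distorted embedding is 2.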
This week, I’ve updated my blog engine and I’m using WordPress now.
Why? Because it is a real blog engine with a huge community that writes a lot of useful plugins (syntax highlighters, sitemaps, …), but mainly because it allows me to specify multiple categories when I’m posting, which was needed for some blog aggregators (O’Reilly, planet.scipy.org, …).
The theme is almost the same as the last one. Not everything is available yet (like everything that was on my blogroll), but it will be back soon 🙂
In my lab, we frequently process huge amounts of data, and each process can take hours or days. The problem is that we don’t have a usable tool to do this.
Our legacy software is in C and we plan on moving to Python in the coming weeks. We could use some commercial software, but it is not optimal.
This is where P2P comes into the game. We have a lot of unused computers or dual cores that are not used even at 50% because we are not trained in parallel computing (and we won’t be in the near future). By “we”, I mainly mean PhD students. Our background is signal or image processing, not Computer Science, and even less parallel computing. Those unused computers could be used for our computations, but this implies that a computer is only used if nobody works on it, that we only use what is available at a precise moment, and that some computers may be reclaimed by their users during the computations. That’s why P2P seems like an elegant idea, as a grid computing tool.
P2P computation is not new in the lab (we developed P2P-MPI in Java for instance), but for our team, it is. For the time being, I have not found much about the tools that we could use, but the JXTA protocol seems like a good start. I hope I will be able to talk more about this subject in the near future.
I got the word today that my paper was accepted, so I can now focus on delivering the code.
I’m in the process of refactoring it so that it depends less on some of our libraries here. In two weeks, I will attend a nipy sprint in Paris, and machine learning is one of the topics we will discuss, so this may indicate where and how I’ll contribute the code. I will keep showing some results next week.
One of the most cited algorithms in nonlinear manifold learning, along with Isomap, is LLE. Contrary to Isomap, LLE tries to retain the local data structure of the sampled manifold. Whereas Isomap preserves absolute distances, LLE preserves local relative distances (it preserves barycenter weights).
This means that LLE is not suitable for every dimensionality reduction problem. For visualization purposes, it can lead to very different solutions if the manifold is noisy.
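To make the barycenter weights mentioned above more concrete, here is a minimal sketch of how they can be computed for one point and its neighbors. The function name and the regularization value `reg` are assumptions of mine; the regularization is only there to keep the small linear system well conditioned.

```python
import numpy as np

def barycenter_weights(point, neighbors, reg=1e-3):
    """Weights summing to 1 that best reconstruct `point` from `neighbors`."""
    centered = neighbors - point          # neighbors relative to the point
    gram = centered @ centered.T          # local Gram matrix
    # Assumed regularization, proportional to the trace of the Gram matrix.
    trace = np.trace(gram)
    gram = gram + reg * (trace if trace > 0 else 1.0) * np.eye(len(neighbors))
    weights = np.linalg.solve(gram, np.ones(len(neighbors)))
    return weights / weights.sum()        # enforce the sum-to-one constraint

point = np.array([0.2, 0.3])
neighbors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
weights = barycenter_weights(point, neighbors)

# Rotate, scale and translate the whole neighborhood: the weights stay the same.
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
shift = np.array([5.0, -2.0])
moved_weights = barycenter_weights(3.0 * point @ rotation.T + shift,
                                   3.0 * neighbors @ rotation.T + shift)
print(weights, moved_weights)
```

The weights only depend on the position of the point relative to its neighbors, not on absolute distances, which is what "local relative distances" means here.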
Before going into more details about nonlinear manifold learning, I’ll present the linear description that is used in most of the applications.
PCA, for Principal Component Analysis, is another name for the Karhunen-Loève transform. It aims at describing the data with a single linear model. The reduced space is the space spanned by the linear model; it is possible to project a new point onto the manifold and thus test whether the point belongs to the manifold.
The problem with PCA is that it cannot tackle nonlinear manifolds, such as the SwissRoll that was presented in my last post.
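The projection and the membership test described above can be sketched with a few lines of NumPy. This is a minimal illustration using an SVD of the centered data; the function names and the synthetic plane are mine, and the distance to the linear model is used as a simple membership measure.

```python
import numpy as np

def fit_pca(data, n_components):
    """Return the mean and the first principal directions of `data`."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(point, mean, components):
    """Coordinates of `point` in the reduced space."""
    return (point - mean) @ components.T

def distance_to_model(point, mean, components):
    """Distance between `point` and its reconstruction on the linear model."""
    reconstruction = mean + project(point, mean, components) @ components
    return np.linalg.norm(point - reconstruction)

# Synthetic data: points lying exactly on a plane in 3D.
rng = np.random.default_rng(0)
basis = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
offset = np.array([2.0, -1.0, 0.5])
data = rng.normal(size=(100, 2)) @ basis + offset
mean, components = fit_pca(data, n_components=2)

normal = np.cross(basis[0], basis[1])
normal /= np.linalg.norm(normal)
on_plane = offset + 0.3 * basis[0] - 1.2 * basis[1]
off_plane = on_plane + 2.0 * normal  # moved 2 units away from the plane
print(distance_to_model(on_plane, mean, components))
print(distance_to_model(off_plane, mean, components))
```

A point on the plane has a distance close to zero, while the shifted point is at distance 2 from the model. On a SwissRoll, no single plane gives small distances for all the points, which is the limitation mentioned above.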