A new blog engine

This week, I’ve updated my blog engine and I’m using WordPress now.

Why? Because it is a real blog engine: there is a huge community that writes a lot of useful plugins (syntax highlighters, sitemaps, …), but mainly, it allows me to specify multiple categories when I post, which was needed for some blog aggregators (O’Reilly, planet.scipy.org, …).

The theme is almost the same as the last one. Not everything is available yet (like my blogroll), but it will be back soon 🙂

Grid computing for Python

In my lab, we frequently process huge amounts of data; each job can take hours or days. The problem is that we don’t have a usable tool for this.

Our legacy software is in C, and we plan on moving to Python in the coming weeks. We could use some commercial software, but that would not be optimal.

This is where P2P comes into the game. We have a lot of unused computers, or dual-core machines that are not even used at 50%, because we are not trained in parallel computing (and we won’t be in the near future). By “we”, I mainly mean PhD students: our background is signal or image processing, not Computer Science, let alone parallel computing. Those unused computers could be used for our computations, but this implies that a computer is only used if nobody works on it, that we only use what is available at a given moment, and that some computers may be reclaimed by their users during the computations. That’s why P2P seems an elegant idea as a grid computing tool.

P2P computation is not new in the lab (we developed P2P-MPI in Java, for instance), but for our team, it is. For the time being, I haven’t found much about the tools we could use, but the JXTA protocol seems like a good start. I hope I will be able to say more about this subject in the near future.

Some news about the manifold learning scikit

I got word today that my paper was accepted, so I can now focus on delivering the code.

I’m in the process of refactoring it so that it depends less on some of our in-house libraries. In two weeks, there is a nipy sprint in Paris that I will attend, and machine learning is one of the topics we will discuss, so this may indicate where and how I’ll contribute the code. I will keep showing some results next week.


Dimensionality reduction: Locally Linear Embedding

One of the most cited algorithms in nonlinear manifold learning, together with Isomap, is LLE. Contrary to Isomap, LLE tries to retain the local structure of the sampled manifold: whereas Isomap preserves absolute distances, LLE preserves local relative distances (it preserves barycenter weights).

This means that LLE is not suitable for every dimensionality reduction task. For visualization purposes, it can lead to very different solutions if the manifold is noisy.
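
To make the barycenter weights concrete, here is a minimal numpy sketch (the function name is mine, and this is not the scikit code) that computes the weights reconstructing a point from its K nearest neighbors:

    import numpy as np

    def barycenter_weights(x, neighbors, reg=1e-3):
        # x: a point of shape (d,); neighbors: its K nearest neighbors, shape (K, d)
        G = neighbors - x                  # shift the neighborhood so that x is the origin
        C = np.dot(G, G.T)                 # local Gram matrix, shape (K, K)
        # Regularization: C is singular when K > d, a common situation
        C = C + reg * np.trace(C) * np.eye(len(neighbors))
        w = np.linalg.solve(C, np.ones(len(neighbors)))
        return w / w.sum()                 # barycentric weights sum to one

LLE then looks for low-dimensional coordinates that preserve these weights for every point.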

Dimensionality reduction: Principal Components Analysis

Before going into more detail about nonlinear manifold learning, I’ll present the linear technique that is used in most applications.

PCA, for Principal Components Analysis, is another name for the Karhunen-Loève transform. It aims at describing the data with a single linear model. The reduced space is the space spanned by the linear model; it is possible to project a new point onto it and thus test whether the point belongs to the manifold.

The problem with PCA is that it cannot tackle nonlinear manifolds, such as the Swiss Roll that was presented in my last post.
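
As an illustration of the reduction and of the projection test mentioned above, here is a minimal numpy sketch (the names are mine):

    import numpy as np

    def pca(X, dim):
        # X has one sample per row; center it and extract the principal axes
        mean = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        axes = Vt[:dim]                        # the linear model
        reduced = np.dot(X - mean, axes.T)     # coordinates in the reduced space
        return mean, axes, reduced

    def project(x, mean, axes):
        # Project a new point onto the linear model and reconstruct it; the residual
        # between x and the reconstruction tests whether x belongs to the model
        coords = np.dot(x - mean, axes.T)
        return mean + np.dot(coords, axes)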

Dimensionality reduction: Isomap

Isomap is one of the “oldest” tools for dimensionality reduction. It aims at reproducing geodesic distances (geodesic distances are a property of Riemannian manifolds) on the manifold in a Euclidean space.

To approximate the geodesic distances, a graph is first created, with an edge linking two close points (K-neighbors or Parzen windows can be used to choose the closest points) and the Euclidean distance between them as its weight. Then, a square matrix of the shortest paths between every pair of points is computed with Dijkstra’s or the Floyd-Warshall algorithm. This follows from some properties of distances and of Riemannian manifolds. The number of neighbors is generally chosen based on the estimated distance on the manifold.

Finally, a classical MDS procedure is performed to get a set of coordinates.
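
Putting the three steps together, here is a minimal sketch with numpy and today’s scipy helpers (the function name is mine, and it assumes the neighborhood graph is connected):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import shortest_path

    def isomap(X, n_neighbors, dim):
        D = squareform(pdist(X))                 # Euclidean distances between samples
        # Neighborhood graph: keep only the K nearest neighbors of each point
        graph = np.zeros_like(D)
        for i, row in enumerate(D):
            nearest = np.argsort(row)[1:n_neighbors + 1]   # skip the point itself
            graph[i, nearest] = row[nearest]
        # Approximate geodesic distances with shortest paths (Dijkstra)
        geodesic = shortest_path(graph, method='D', directed=False)
        # Classical MDS on the geodesic distances
        n = len(X)
        J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
        B = -0.5 * np.dot(np.dot(J, geodesic ** 2), J)
        vals, vecs = np.linalg.eigh(B)
        order = np.argsort(vals)[::-1][:dim]     # keep the largest eigenvalues
        return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))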

More on manifold learning

I hope to present some results here in February, but here is what I’ve implemented so far:

  • Isomap
  • LLE
  • Laplacian Eigenmaps
  • Hessian Eigenmaps
  • Diffusion Maps (in fact a variation of Laplacian Eigenmaps)
  • Curvilinear Component Analysis (the reduction part)
  • NonLinear Mapping (Sammon)
  • My own technique (reduction, regression and projection)
  • PCA (usual reduction, but robust projection with an a priori term)

The results I will show here are mainly reduction comparisons between the techniques, knowing that each technique has a specific field of application: LLE is not made to respect geodesic distances; Isomap, NLM and my technique are.


A new French book on scientific computing with Python

My first book on Python for scientists ships today. Although IT people can learn a lot of Python from it (mainly if they work in labs or research centers), scientists will be more interested, as it presents a viable alternative to Matlab: fast, efficient, a real language with a large standard library.

After an introduction, the Python language is presented, as well as some of the main modules. The three central chapters are dedicated to Numpy, Scipy and Matplotlib; each library tackles a specific problem: storing data, processing it, or displaying it. Finally, the last chapter presents ways of speeding up Python with the use of C or C++.

The link to my publisher: here

Transforming a C++ vector into a Numpy array

This question was asked on the Scipy mailing-list last year (well, one week ago). Nathan Bell proposed a skeleton that I used to create an out typemap for SWIG.

    %typemap(out) std::vector<double> {
        int length = $1.size();
        $result = PyArray_FromDims(1, &length, NPY_DOUBLE);
        memcpy(PyArray_DATA((PyArrayObject*)$result), &((*(&$1))[0]), sizeof(double)*length);
    }

This typemap obviously uses Numpy, so don’t forget to initialize the module (import_array() must be called in the %init section of the interface) and to import it. Then there is a strange expression in the memcpy call: &((*(&$1))[0]) takes the address of the vector’s internal array, but as the vector is wrapped by SWIG, one has to get to the std::vector by dereferencing the SWIG wrapper; then one can get the first element of the vector and take its address.
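
On the Python side, the conversion is then transparent (the module and function names here are hypothetical):

    import numpy
    import mymodule   # a SWIG-generated module using the typemap above

    values = mymodule.get_values()              # a C++ function returning std::vector<double>
    print(isinstance(values, numpy.ndarray))    # True: the typemap did the conversion
    print(values.dtype)                         # float64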

Edit, May 2017: here is my most recent version of this typemap.

    %typemap(out) std::vector<float> {
        npy_intp length = $1.size();
        $result = PyArray_SimpleNew(1, &length, NPY_FLOAT);
        memcpy(PyArray_DATA((PyArrayObject*)$result), $1.data(), sizeof(float)*length);
    }

Creating a Python module with Scons and SWIG

Some time ago, I proposed an optional build for SWIG if the SWIG binary was not found on the system. Here I propose an enhancement: a new library builder that will be registered in the environment env as PythonModule. It takes the same arguments as a classical SharedLibrary, but it performs some additional steps (a sketch follows the list below):

  • It forces SWIG to create a Python wrapper (flag -python)
  • It checks if SWIG is present at all
  • It suppresses any prefix that the system might add (like lib on Linux)
  • On Windows and for Python >= 2.5, it changes the extension to pyd
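
As a rough sketch of what such a builder could look like with a recent SCons (the SWIG presence check is omitted here, and the real builder handles more corner cases):

    # In an SConstruct file
    import sys
    import distutils.sysconfig

    def PythonModule(env, target, source, **kwargs):
        module_env = env.Clone()
        module_env.Append(SWIGFLAGS=['-python'])    # force SWIG to create a Python wrapper
        module_env.Append(CPPPATH=[distutils.sysconfig.get_python_inc()])
        module_env['SHLIBPREFIX'] = ''              # no 'lib' prefix: Python wants the bare module name
        if sys.platform == 'win32':
            module_env['SHLIBSUFFIX'] = '.pyd'      # Python >= 2.5 on Windows loads .pyd files
        return module_env.SharedLibrary(target, source, **kwargs)

    env = Environment(tools=['default', 'swig'])
    env.AddMethod(PythonModule)
    # env.PythonModule('mymodule', ['mymodule.i', 'mymodule.cpp'])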
