Book review: Data Analysis with Open Source Tools

When faced with a new dataset, the first question is how it should be analyzed. Many books address the theory, but this one gives practical advice on how to do it. Moreover, it isn’t based on commercial tools like MATLAB, but on open source tools that can be freely downloaded from the Internet.

Content and opinions

The book is split into four parts. The difficulty increases with each part, and one can stop at whatever point suits one’s purpose. In almost every chapter the author spends some pages applying his explanations with the help of open source tools.

The first part is dedicated to graphics. It’s IMHO one of the best ways to understand data, and it seems that the author shares this opinion, at least to some extent. The different aspects of graphics are explained, from histograms to complex 2D graphs like parallel coordinates plots. The last chapter is a complete (and efficient) walkthrough of using graphics to explain and understand a dataset, instead of using complex analytical tools. I have to say I was surprised to see how close the estimated signal was to the original at the end.

The second part tackles the difficult issue of models. It picks up where the last chapter of the previous part left off and goes one step further. It is mainly a matter of stats, with the application of mean-field theory (instead of averaging the values of a function over your data, you evaluate the function at the mean of your data), and then the tricky subject of probability and its counterpart, statistics. I was delighted by the author’s approach to this last subject: Bayesian theory is nicely introduced, and you can’t disagree with Bayesian statistics after this chapter. The last chapter of this part is a small application of the different theories, perhaps a little less practical than the last chapter of the first part, but nonetheless useful.
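To make the mean-field idea concrete, here is a minimal Python sketch of my own (it is not taken from the book, and the data and function names are made up for illustration). It compares the average of a function over the data with the function evaluated at the average of the data. For a linear function the two agree; for a nonlinear one such as the square they do not, which is why this is only an approximation.

```python
from statistics import fmean


def mean_of_function(f, data):
    """Exact quantity: average of f over every data point."""
    return fmean([f(x) for x in data])


def function_of_mean(f, data):
    """Mean-field style shortcut: f evaluated at the average data point."""
    return f(fmean(data))


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
data = [1.0, 2.0, 3.0, 4.0, 5.0]


def square(x):
    return x * x


def linear(x):
    return 2.0 * x + 1.0


exact_square = mean_of_function(square, data)    # (1 + 4 + 9 + 16 + 25) / 5
approx_square = function_of_mean(square, data)   # 3 ** 2
exact_linear = mean_of_function(linear, data)
approx_linear = function_of_mean(linear, data)

print("square:", exact_square, "vs", approx_square)
print("linear:", exact_linear, "vs", approx_linear)
```

For the square, the gap between the two numbers is exactly the variance of the data, so the shortcut works best when the data are tightly grouped around their mean.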

The third part handles discovering what is behind the data. It starts with a useful technique, Monte Carlo simulation. It may be (very) costly, but it lets you leverage the knowledge you have to answer a statistical question (is this result possible given this experimental setup?). The next step is splitting data into clusters and using dimensionality reduction tools to find an intrinsic embedded space. As the latter was part of my PhD thesis, I was disappointed to see only simple or almost outdated techniques, but you can’t tackle complex algorithms in such a book. There are too many scientific topics where you could go further (although the author went rather far in the graphics part).
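As an illustration of the Monte Carlo question above, here is a small Python sketch of my own, not an example from the book. The setup is hypothetical: a coin gives 60 heads in 100 flips, and we simulate many runs of a fair coin to see how often a result at least that extreme shows up. The function name, the numbers, and the fixed seed are all assumptions made for the example.

```python
import random


def simulate_tail_probability(n_flips, observed_heads, n_trials, seed=0):
    """Estimate how often a fair coin yields at least `observed_heads`
    heads in `n_flips` flips, by repeating the experiment `n_trials` times."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(n_trials):
        # One simulated experiment: count heads among n_flips fair flips.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            hits += 1
    return hits / n_trials


# Hypothetical experiment: 60 heads observed out of 100 flips.
estimate = simulate_tail_probability(n_flips=100, observed_heads=60, n_trials=10_000)
print(f"Estimated probability under a fair coin: {estimate:.3f}")
```

A small estimate suggests the observed result would be unusual for a fair coin. The cost shows up in `n_trials`: the estimate only gets more precise as the number of simulated experiments grows.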

The last part tackles the actual use of data. Some sections are not really “data analysis with open source tools” (but there were also such pages in the second and third parts, and they were incredibly simple and useful for someone who wants to jump on the data analysis bandwagon). The last chapter handles complex tools such as Support Vector Machines or Bayesian Networks. As I’ve said before, the book is not the place to explain everything about those tools, but they have to be mentioned after the first chapters of this part so that one can understand how they can be applied to actual problems.

Conclusion

Data analysis is a difficult topic, and explanations are needed to understand the different theoretical and open source tools. This is what the book achieves efficiently.

P.S.: never ever write an article when you are too tired, the number of errors increases dramatically. I may do some stats on this phenomenon one day, now that I know how to do it with open source tools!
