For each algorithm and program, some architectures are better suited than others. A computation may need a lot of FLOPS, but FLOPS are not the only thing to consider. Communication and memory bandwidth and latency matter just as much as computational power, especially since memory speed and CPU speed have become decoupled.
Raw computational power needs
With the raw computational power we have, it is possible to simulate more and more complicated models. So the newest processors will never be enough: more precise models will always arrive to use them (http://www.ddj.com/cpp/205900309)
Still, more complicated models mean more memory, more communication and more I/O. For some applications, this is not a problem. For instance, Folding@Home sends a small data set to a computer, which processes it and sends an answer back to the server. In this model, there is not much communication, as each data set can be processed independently.
Memory bandwidth and latency needs
For other applications, memory bandwidth and latency are the real bottleneck: the application spends more time waiting for data than actually computing. And this bottleneck has several levels.
Sometimes, all the data fits inside the L2 cache, so the CPU can be almost fully used. Then there are cases where the data fits inside RAM, with many exchanges between RAM and the L2 cache (for instance in finite difference schemes). Here the CPU regularly waits for data. There are strategies to mitigate this, but the CPU will always be idle at some point.
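A back-of-envelope calculation shows why a finite difference scheme ends up waiting on RAM. The counts and hardware numbers below are illustrative assumptions, not measurements; the point is that the kernel performs very few floating-point operations per byte it moves.

```python
# 1D three-point stencil: u_new[i] = u[i] + c * (u[i-1] - 2*u[i] + u[i+1])
# Per grid point: roughly 4 flops; memory traffic: one 8-byte read and one
# 8-byte write (assuming the neighbours are reused from cache).
flops_per_point = 4
bytes_per_point = 16

intensity = flops_per_point / bytes_per_point  # flops per byte

# Hypothetical machine (assumed numbers):
peak_flops = 50e9        # 50 GFLOP/s
memory_bandwidth = 20e9  # 20 GB/s

# Roofline-style bound: performance is capped either by the FPU or by
# how fast memory can feed it.
attainable = min(peak_flops, intensity * memory_bandwidth)
print(f"arithmetic intensity: {intensity} flop/byte")
print(f"attainable: {attainable / 1e9:.1f} GFLOP/s of {peak_flops / 1e9:.0f} peak")
```

With these numbers the stencil is limited to a tenth of the machine's peak: no matter how many FLOPS the processor offers, the memory bus sets the speed.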
I won’t even talk about cases where the model cannot fit inside memory, yet is needed at every stage of the computation.
When the computation cannot fit in one computer or one cluster node, applications may face another bottleneck. This is where you have to choose between a grid, a cluster of workstations, or a real cluster. A massive grid, like the one used by Folding@Home, solves problems that a cluster of workstations could not solve in an acceptable time. A cluster of workstations can handle, for instance, some medical image processing problems (like detecting differences between 3D brain images of different populations): the work for one step (like the normalization) can be done on one node, but the amount of communication is too big for the Internet (sending hundreds of megabytes for each result). Then there are medical imaging tasks, like following the evolution of an artery during a cardiac cycle, that need a real cluster with low-latency communications: during each iteration, data must be transmitted to different nodes, and this can only be achieved through fast, low-latency network interfaces.
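The divide between these three tiers is mostly a transfer-time calculation. Here is a rough sketch, with assumed (not measured) link characteristics and a hypothetical 500 MB intermediate result, of why "hundreds of megabytes per result" rules out the Internet but not a cluster interconnect.

```python
# Assumed link characteristics (illustrative only):
internet_bw = 1e6         # ~1 MB/s upstream on a home connection
cluster_bw = 1e9          # ~1 GB/s on a cluster interconnect
internet_latency = 50e-3  # ~50 ms round-trip start-up cost
cluster_latency = 5e-6    # ~5 microseconds

result_bytes = 500e6      # hypothetical 500 MB intermediate result

def transfer_time(nbytes, bandwidth, latency):
    # Simple model: one latency term plus serialization time.
    return latency + nbytes / bandwidth

t_internet = transfer_time(result_bytes, internet_bw, internet_latency)
t_cluster = transfer_time(result_bytes, cluster_bw, cluster_latency)
print(f"Internet: {t_internet:.1f} s, cluster: {t_cluster:.2f} s")
```

For large payloads, bandwidth dominates; for the small, frequent exchanges of an iterative solver, the latency term dominates instead, which is why the artery example needs a low-latency interconnect and not just a fat pipe.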
What is really needed?
In the end, before you choose an architecture, you have to write down the actual problem you want to solve. Buying a cluster when your program won’t benefit from it is a waste of money. The same goes for the processor you will use: a processor with a high FLOPS count but low memory bandwidth can be worse than a slower processor with high memory bandwidth. It all depends on your problem.