Today I visited the most powerful supercomputer in the world: ORNL’s Titan.
My trip to ORNL was for other purposes, so the visit to Titan was geek tourism. Nevertheless, it’s inspiring to see such a large facility dedicated to science. Amazon and Google undoubtedly have more servers, but nothing tops the highly connected, GPU-accelerated nature of Titan.
The difficulty is in achieving full performance across such a large number of systems simultaneously. Investigate “noisy neighbors” and you’ll learn that Amazon and Google do not guarantee consistent performance. Each instance has its own performance characteristics. Today’s servers (particularly in “the cloud”) are complex enough that they are subject to the butterfly effect. There are simply too many factors involved to predict anything with certainty.
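That run-to-run variability is easy to observe yourself. As a minimal sketch (the workload and sample count here are arbitrary, chosen only for illustration), you can time a fixed CPU-bound task repeatedly and look at the spread: on a quiet dedicated machine the coefficient of variation stays small, while on a busy shared host it is typically much larger.

```python
import statistics
import time

def time_workload(n: int = 200_000) -> float:
    """Time one run of a fixed CPU-bound workload (summing squares)."""
    start = time.perf_counter()
    sum(i * i for i in range(n))
    return time.perf_counter() - start

def run_to_run_variation(samples: int = 50) -> float:
    """Return the coefficient of variation (stddev / mean) of run times.

    A rough proxy for how "noisy" the host is: identical work,
    different wall-clock times on each run.
    """
    times = [time_workload() for _ in range(samples)]
    return statistics.stdev(times) / statistics.mean(times)

if __name__ == "__main__":
    print(f"run-to-run variation: {run_to_run_variation():.1%}")
```

Repeat the same measurement across several cloud instances of the same type and you will generally see different baselines and different jitter on each, which is exactly the inconsistency described above.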
In fact, Titan was not in production mode during my visit. Cray had to retrofit the entire system after issues were discovered during acceptance testing, and now it must run through those tests again to demonstrate proper operation. Thus, there is a serious question within the HPC community whether a cluster 50 times as large could ever be built in less than a decade. Both the hardware and the software of such a system would require more sophistication than is currently available.
Overall, I am inspired to consider what methods might make computation less complicated and more reliable for those in the scientific community. As Ross Walker of UCSD is fond of saying, “Scientists want science first. Technology is the enabler, NOT the science.”