I just finished watching Particle Fever, which describes the ~30-year path that physicists endured before the confirmation of the Higgs boson particle. Thousands of people spent years of excruciatingly painstaking effort to confirm one aspect of our reality. Yet there were setbacks (some taking years), and the collider won’t even be operating at full power until 2015 (although the original schedule called for full-power operation in 2008)…
I know (from both colleagues and personal experience) that the efforts of the IT and computational folks supporting these experiments are no less painstaking and mundane. Keeping a single computer operating correctly can be a pain. Keeping hundreds or thousands operating correctly (along with an incredible diversity of dodgy scientific software packages) is basically impossible.
It’s been a year since my last post on this blog because I’ve spent nearly every waking moment trying to tackle some of these problems. I accomplished a lot, and I built some fairly easy-to-use HPC systems. For some researchers, we can make it as easy to use a compute cluster as it is to use their local workstation.
But…it’s just the tip of the iceberg. Those systems are still fragile. They run into unexpected problems when something changes (like running MATLAB on a 48-core, 4-socket server instead of a 20-core server; why does MATLAB launch 1000+ threads!?). Or the fact that HOOMD-blue freaks out when your GPUs are set to Default compute mode, but LAMMPS freaks out if your GPUs are set to anything other than Default compute mode!
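To give a flavor of the per-application juggling involved, here is a sketch of the kinds of knobs an admin or user ends up turning for exactly these two problems. The script name `sim.m` is hypothetical, setting compute modes requires root, and the right settings depend entirely on which code is running:

```shell
# MATLAB on a big multi-socket box: cap its thread pool instead of
# letting it spawn one thread per (hyperthreaded) core and beyond.
matlab -nodisplay -nosplash -singleCompThread -r "run('sim.m'); exit"
# ...or cap it at a chosen count from within the session:
#   maxNumCompThreads(20);

# Check what compute mode each GPU is currently in:
nvidia-smi --query-gpu=index,compute_mode --format=csv

# LAMMPS (as described above) wants Default (shared) mode:
nvidia-smi -i 0 -c DEFAULT

# HOOMD-blue (as described above) wants exclusive access instead:
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
```

And of course the two settings are mutually exclusive per GPU, so a shared cluster ends up flipping modes per job or partitioning nodes by application.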
It’s not all disaster. Computer interfaces and APIs are continuing to improve; we are slowly abstracting away the most troublesome details. However, I’m thinking back to how hard it was for users to get their jobs running properly on HPC clusters 10 years ago, in 2004. I’m not so sure that much has changed over those 10 years. Some of the software tools have improved, but a lot of things look surprisingly similar.
There have been many improvements to low-level, performance-critical components and libraries (NUMA, InfiniBand, virtualization, RDMA, etc.). I don’t think there have been nearly as many changes to the way users interact with the cluster.
These are complex systems – making improvements requires initiative, perseverance, creativity, cross-domain expertise, an insistence on clean solutions (for once!), and then still more perseverance. If the scientists/researchers are using their energy to get all those frustrating scientific challenges sorted out, and all the computer geeks are keeping the computers from crashing, who’s left to work on HPC papercuts?
A paper cut bug is defined as “a trivially fixable usability bug that the average user would encounter on his/her first day of using a brand new installation […]”. The analogy is with a paper cut: small, not seriously damaging, but surprisingly painful.
It’s been good – I’ll be pressing on throughout 2015. But the end of the road is not in sight…
Also, to return to the Higgs boson: the estimated mass of the Higgs (~126 GeV) is apparently smack in the middle between the two theorized values (one of which might confirm SuperSymmetry; the other of which might confirm Multiverses). Just another painful detail with universe-altering impact!