Computers are complex systems, which makes them difficult to predict. The hardware layers alone are fairly sophisticated, and the software adds even more factors – many more than a person can fit in their head. That’s why unit tests, integration tests, compatibility tests, performance tests, etc. are so important. It’s also why leadership compute facilities (e.g., ORNL Titan, TACC Stampede) have such onerous acceptance tests. Until you’ve verified that an installed HPC system is fully functioning (compute, communication, I/O, reliability, …), it’s pretty likely something isn’t.
The Stampede cluster at TACC contains over 320 56 Gbps FDR InfiniBand switches. Counting both the node-to-switch and switch-to-switch links, over 11,520 cables are installed. How much testing would you perform before declaring that “everything is working”?
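A back-of-envelope sketch shows why exhaustive validation is daunting. The switch and cable counts come from the figures above; the node count (6,400) and the one-cable-per-node assumption are illustrative guesses, not official Stampede specs:

```python
# Rough fabric arithmetic for a Stampede-like cluster.
# From the post: >320 FDR switches, >11,520 total cables.
# The node count below is an assumption for illustration only.
switches = 320
total_cables = 11520
nodes = 6400  # assumed

node_to_switch = nodes                      # assume one FDR cable per node
switch_to_switch = total_cables - node_to_switch
print(switch_to_switch)                     # inter-switch cables left to validate

# Naively testing bandwidth between every pair of nodes grows quadratically:
pairs = nodes * (nodes - 1) // 2
print(pairs)                                # tens of millions of node pairs
```

Even under these generous assumptions, thousands of inter-switch links and millions of node pairs remain, which is why acceptance tests sample and stress the fabric rather than enumerate it.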
“Software Defined HPC”
It has become increasingly clear to me that software is the primary focus going forward. Sure, the Mellanox managed InfiniBand switches can provide fabric health reports, but that capability doesn’t really live in the hardware. It’s software running on whatever version of Linux they’ve embedded within their switches. The same goes for other health reports: any low-level inspection of the hardware is ultimately performed in software.
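At its core, that kind of fabric health report is just software comparing per-port counters against thresholds. Here is a minimal sketch of the idea; the threshold values, port names, and counter data are made up for illustration (real tools such as `ibqueryerrors` read the actual counters from the fabric):

```python
# Hypothetical sketch of a fabric health check: flag any port whose
# error counters exceed a threshold. All data here is illustrative.
THRESHOLDS = {"symbol_errors": 10, "link_downed": 1, "port_rcv_errors": 100}

def flag_unhealthy_ports(counters):
    """Return (port, counter, value) for every counter over its threshold."""
    problems = []
    for port, values in counters.items():
        for name, limit in THRESHOLDS.items():
            if values.get(name, 0) > limit:
                problems.append((port, name, values[name]))
    return problems

sample = {
    "switch17/port3": {"symbol_errors": 4812, "link_downed": 0},
    "switch17/port4": {"symbol_errors": 2, "link_downed": 0},
}
print(flag_unhealthy_ports(sample))
# [('switch17/port3', 'symbol_errors', 4812)]
```

The hardware merely exposes the counters; deciding what “healthy” means, and reporting it usefully, is entirely a software problem.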
Anything higher-level is certainly also going to be in software. The integration of the various hardware components has to be proven stable, but proper integration of the various software components is arguably more important. If your software tools don’t allow the various pieces of hardware to run smoothly in concert with one another, you’re not going to have an HPC system. If they don’t help you keep it running smoothly, you’re going to have a lot of headaches.
In short, if you’re making HPC plans and not putting a lot of thought into the software components, you’re doing it wrong.
The exascale discussions frequently address potential methods for keeping large systems running despite hardware failures. Hardware errors occur on systems big and small, but I’ve seen plenty of jobs get stuck even when the hardware is running properly. Keeping jobs running across failures is a lofty goal, but it would also be good to have rock-solid results when the hardware is fine. I don’t think that goal is always met – a lot of HPC software is rough around the edges.
Software Needs Love
From what I’ve seen, projects are rarely beautiful unless they’re a labor of love. HPC is a niche, so I think perhaps we have fewer beautiful projects. Historically, these are not projects or products that the general server market uses.
Thus, we end up in a situation where “it runs” is good enough for a lot of projects. Bugs are fixed as they come up, but progress is slow.
This discussion doesn’t even delve into the user-level applications that do the real research/science. These packages are often the worst as far as clean design, complete functionality, thorough documentation, etc. For researchers, their field of study is more likely to be their labor of love than their software. Once the software is working, that’s good enough – back to the science!
The most popular scientific packages are certainly better, but even there we see fragmentation and a lack of standards. Everything is written differently; each package has its own method for packaging, compilation, and extensibility. These criticisms could be leveled at many open-source efforts, but I think the fact that “science comes first” makes it worse here.
It’s not reasonable to indict researchers for this state of affairs. They deal with very real, very strict limitations (time, money, manpower, politics, …). I have a huge amount of respect for their accomplishments. Talk with them about how much time they have to spend dealing with mundane details and you’ll have even more respect for all they manage to achieve.
These things are iterative. I think we’ll see increasing outreach and efforts to make it easier to build good software. Nothing is immediate, but I hope a larger awareness of the challenges will lead to improvements over time. HPC is defined by the software, and the software is getting better…