Computers are complex systems, which makes them difficult to predict. The hardware layers are often sophisticated on their own, and the software adds even more factors, far more than a person can hold in their head. That’s why unit tests, integration tests, compatibility tests, performance tests, etc. are so important. It’s also why leadership compute facilities (e.g., ORNL Titan, TACC Stampede) have such onerous acceptance tests. Until you’ve verified that an installed HPC system is fully functioning (compute, communication, I/O, reliability, …), it’s pretty likely that something isn’t.
The Stampede cluster at TACC contains over 320 FDR InfiniBand (56 Gbps) switches. Counting both the node-to-switch and switch-to-switch links, over 11,520 cables are installed. How much testing would you perform before you said “everything is working”?
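To get a feel for why "something isn’t functioning" is the safe default at this scale, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every cable fails independently with the same small probability; the per-cable fault rates below are hypothetical and not measured values from Stampede, and the function name `prob_any_fault` is made up for this example.

```python
def prob_any_fault(n_components: int, p_fault: float) -> float:
    """Probability that at least one of n independent components is faulty.

    Assumes every component has the same fault probability p_fault and that
    faults are independent, which is a simplification.
    """
    return 1.0 - (1.0 - p_fault) ** n_components


N_CABLES = 11_520  # cable count quoted above (the text says "over", so a lower bound)

if __name__ == "__main__":
    # Hypothetical per-cable fault rates, chosen only to show the trend.
    for p in (1e-5, 1e-4, 1e-3):
        chance = prob_any_fault(N_CABLES, p)
        print(f"per-cable fault rate {p:g}: P(at least one bad cable) = {chance:.3f}")
```

Under these assumptions, even a one-in-ten-thousand fault rate per cable gives roughly a two-in-three chance that at least one cable in the fabric is bad, and that is before counting switches, adapters, or firmware.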