Mar 31, 2014
 

I’ve been thinking a lot about the ways we automate tasks and abstract away difficult/complicated aspects of our lives. That’s what all the “progress” of the last 100 years has been – better ways to save labor and still get the same tasks accomplished. Our species is growing increasingly efficient.

I’m sure the HPC industry has also seen improvements in personal efficiency. Certainly we can get the compute portions of our work done much more quickly. But do you still find yourself fighting many of the same systems/software issues you faced years ago? I feel our industry still has a long way to go when it comes to making the everyday user’s life simpler.

There are certainly many cross-domain experts out there, and I speak with them every day: people who have mastered their systems and software as well as their scientific field. I know it keeps them busy, but they are exceptional. The systems they design and the software they write are typically used by many others in their field. They are the leaders.

There are others who are not technical masters, but do manage to make the systems do their bidding. They focus more on the science than on the software, and most of their effort goes into their research. For them, HPC is a means to an end (even if it’s a bumpy one).

Finally, there are groups who focus almost exclusively on the research, although they often have significant HPC resources. Someone has to manage these resources, and given the state of just about every budget these days, the person in charge of them has a bit of a struggle on their hands. They are not experts, but they have to keep things working.

Obviously, any one of these types could have a frustrating experience (with either software or hardware), but the experts are more likely to get it sorted out. Those who know their way around will look at Linux dmesg output and see:

...
...
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: failed command: SMART
ata1.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
...
...

A more casual user is probably just going to see white noise, with the same warning buried in the middle. If their system wasn’t built with proper redundancy, their data may be lost some day soon:

ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: ATA-9: Samsung SSD 840 PRO Series, DXM05B0Q, max UDMA/133
ata1.00: 250069680 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 840  DXM0 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 < sda5 >
sd 0:0:0:0: [sda] Attached SCSI disk
usb 2-1: New USB device found, idVendor=174f, idProduct=112b
usb 2-1: New USB device strings: Mfr=2, Product=3, SerialNumber=0
usb 2-1: Product: HP Webcam-101
usb 2-1: Manufacturer: GenesysLogic Technology Co., Ltd.
usb 2-2: new high-speed USB device number 3 using ehci-pci
usb 2-2: New USB device found, idVendor=0bda, idProduct=0138
usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 2-2: Product: USB2.0-CRW
usb 2-2: Manufacturer: Generic
usb 2-2: SerialNumber: 20090516388200000
usbcore: registered new interface driver usb-storage
ums-realtek 2-2:1.0: USB Mass Storage device detected
usb 2-2: USB disconnect, device number 3
scsi1 : usb-storage 2-2:1.0
hda-intel 0000:00:01.1: Using LPIB position fix
snd_hda_intel 0000:00:01.1: irq 42 for MSI/MSI-X
hda-intel 0000:00:01.1: Enable sync_write for stable communication
hda-intel 0000:00:14.2: Using LPIB position fix
snd_hda_intel 0000:00:14.2: setting latency timer to 64
hda-intel 0000:00:14.2: Enable sync_write for stable communication
input: HDA ATI SB Headphone as /devices/pci0000:00/0000:00:14.2/sound/card1/input9
input: HDA ATI SB Mic as /devices/pci0000:00/0000:00:14.2/sound/card1/input10
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: failed command: SMART
ata1.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
usb 6-1: new full-speed USB device number 2 using ohci-pci
usb 6-1: New USB device found, idVendor=148f, idProduct=2000
usb 6-1: New USB device strings: Mfr=0, Product=2, SerialNumber=0
usb 6-1: Product: CSR BS8510
usbcore: registered new interface driver btusb
r8169 0000:01:00.0 eth0: link down
usb 6-1: USB disconnect, device number 2
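
To show how little it takes to pull that signal out of the noise, here is a minimal sketch in Python. It only looks for the three message fragments that appear in the output above ("exception Emask", "failed command", "hard resetting link"); that list is an assumption drawn from this one sample, not a complete catalogue of kernel ATA messages, and the function name find_ata_warnings is made up for illustration.

```python
import re

# Fragments copied from the sample dmesg output above. This is an
# illustrative list, not an exhaustive set of kernel ATA error messages.
ATA_WARNING = re.compile(
    r"\b(ata\d+(?:\.\d+)?): .*?"
    r"(exception Emask|failed command|hard resetting link)"
)


def find_ata_warnings(lines):
    """Return (device, line) pairs for lines that look like ATA error handling.

    `lines` is any iterable of strings, e.g. the lines of saved dmesg output.
    re.search is used so a leading kernel timestamp does not prevent a match.
    """
    hits = []
    for raw in lines:
        line = raw.strip()
        match = ATA_WARNING.search(line)
        if match:
            hits.append((match.group(1), line))
    return hits


# A few lines taken from the output above, mixed with unrelated messages.
SAMPLE = [
    "usb 2-1: Product: HP Webcam-101",
    "ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "ata1.00: failed command: SMART",
    "ata1.00: status: { DRDY }",
    "ata1: hard resetting link",
    "ata1.00: configured for UDMA/133",
    "r8169 0000:01:00.0 eth0: link down",
]

for device, line in find_ata_warnings(SAMPLE):
    print(f"{device}: possible disk trouble -> {line}")
```

A match here does not prove a disk is failing; it only flags the lines an experienced admin would stop and read. The same function could be fed the lines of real dmesg output saved to a file.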

I’m working on these issues because I’m at the opposite end of the spectrum described above. I’m someone who understands these systems at a deep level. I have a profound appreciation for the sciences, but you won’t see my name on any papers. I want to make it easier for scientists to do better work in less time.

It’s not too hard to make sure no one loses their data, but I’d like to see how far we can take it and how far we can simplify. Can we make hard choices up front, during design, that save users a lot of time?
