The ParaStation HealthChecker is a tool designed to check the consistency and health of compute and storage nodes. Each node in a cluster has to be checked for software consistency and has to be free of hardware errors to make a reliable MPI run possible. Failure to account for the reduced mean time between failures on large cluster systems leads to dramatically reduced system utilization, because high numbers of failing jobs waste CPU cycles. ParaStation's HealthChecker seeks to address these issues by limiting failures before they occur. Techniques such as checkpoint/restart are important, but ParTec believes it is far more important to limit the likelihood of failures than to rectify them once they have occurred.
In ParaStationV5, the HealthChecker runs on all nodes selected by the scheduler for a user job before that job executes. The batch system starts the job only if the check of hardware and software consistency is 100% error-free. Nodes found to be faulty are taken out of the host list and other nodes are assigned to replace them. The entire process is hidden from the user.
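The pre-job flow described above can be illustrated with a minimal sketch. This is not ParaStation code: the function name `select_healthy_nodes`, the `check` callback, and the spare-node pool are all hypothetical, introduced only to show the idea of checking every selected node, dropping faulty ones from the host list, and substituting replacements before the job starts.

```python
from typing import Callable, Iterable, List, Tuple


def select_healthy_nodes(
    selected: Iterable[str],
    spares: Iterable[str],
    check: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Return (host_list, faulty_nodes).

    Hypothetical illustration: `check` stands in for a per-node health
    check that returns True only when the node passes every test.
    """
    spare_pool = list(spares)
    host_list: List[str] = []
    faulty: List[str] = []

    for node in selected:
        candidate = node
        # Keep substituting spares until one passes the check.
        while not check(candidate):
            faulty.append(candidate)
            if not spare_pool:
                # Without a fully healthy host list the job must not start.
                raise RuntimeError("not enough healthy nodes to start the job")
            candidate = spare_pool.pop(0)
        host_list.append(candidate)

    return host_list, faulty


if __name__ == "__main__":
    # Assumed example data: node02 and spare01 are marked as faulty.
    bad_nodes = {"node02", "spare01"}
    hosts, removed = select_healthy_nodes(
        ["node01", "node02", "node03"],
        ["spare01", "spare02"],
        lambda name: name not in bad_nodes,
    )
    print(hosts)    # ['node01', 'spare02', 'node03']
    print(removed)  # ['node02', 'spare01']
```

In this sketch the job is only allowed to proceed once every slot in the host list holds a node that passed the check, which mirrors the "100% error-free" requirement. Replacement nodes are themselves checked before being accepted.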