05.03.2010

ParaStationV5 & Mellanox ConnectX-2


Proven reliability and performance

ParaStationV5 software attained global recognition in 2009 with a flagship installation on the JuRoPA cluster at the Jülich Supercomputing Centre in Germany (see top500.org #10, June 2009). The JuRoPA machine, running ParaStation MPI, demonstrated scalability in excess of 25,000 MPI tasks over more than 3,000 compute nodes (without MPI threading libraries) and attained a parallel efficiency in excess of 91.6% when running HPL benchmarks. ParaStationV5 cluster middleware and Mellanox hardware were key components in JuRoPA’s success. Mellanox ConnectX InfiniBand adapters and InfiniScale switches are state-of-the-art InfiniBand solutions designed specifically for today’s HPC clusters. They provide the most advanced off-loading capabilities, the fastest InfiniBand technology and the highest server utilization.

JuRoPA system overview

  • 3288 compute nodes (26,304 cores)
  • 274.8 Teraflops Linpack performance (June 2009)
  • Mellanox InfiniBand Silicon used throughout.
  • ParaStationV5
  • ParaStation MPI
  • SLES11 Linux
  • Intel Compiler and Development Tools
  • MOAB Scheduler
  • Lustre Filesystem
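The headline figures in this overview can be cross-checked with a few lines of arithmetic. The Python sketch below derives the cores per node from the list and shows how HPL parallel efficiency is conventionally computed, as the ratio of measured Linpack performance (Rmax) to theoretical peak (Rpeak). The function names are illustrative, and the peak value is a hypothetical placeholder that does not come from this article.

```python
# Figures taken from the JuRoPA system overview above.
COMPUTE_NODES = 3288
TOTAL_CORES = 26_304
LINPACK_TFLOPS = 274.8  # measured HPL performance (Rmax), June 2009


def cores_per_node(total_cores: int, nodes: int) -> int:
    """Return the number of cores per node, requiring an even split."""
    if nodes <= 0 or total_cores % nodes != 0:
        raise ValueError("cores must divide evenly across a positive node count")
    return total_cores // nodes


def hpl_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Return HPL parallel efficiency as Rmax / Rpeak (a fraction, not a percent)."""
    if rpeak_tflops <= 0:
        raise ValueError("peak performance must be positive")
    return rmax_tflops / rpeak_tflops


# ASSUMPTION: hypothetical peak, used only to illustrate the formula.
HYPOTHETICAL_PEAK_TFLOPS = 300.0

if __name__ == "__main__":
    print(cores_per_node(TOTAL_CORES, COMPUTE_NODES))  # prints 8
    print(f"{hpl_efficiency(LINPACK_TFLOPS, HYPOTHETICAL_PEAK_TFLOPS):.1%}")
```

With eight cores per node, one MPI task per core corresponds to the "more than 25,000 MPI tasks" quoted earlier.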

Mellanox ConnectX-2 40Gb/s Adapters

Mellanox ConnectX-2 adapters provide the world’s most advanced performance and off-loading capabilities, enabling systems of any scale to achieve the highest efficiency and performance. ConnectX-2 offers a wire speed of 40Gb/s per port and an application latency of less than 1 μs between server nodes. Moreover, it is the only adapter that provides zero scalable latency, meaning that the lowest latency is maintained for each application process, regardless of how many processes or CPU cores are actively sending messages to the network.

HPC CPU and system efficiency depends on the network’s off-loading capabilities, both transport and application off-load. Growing application complexity and cluster size demand advanced application function off-loading capabilities from the underlying interconnect. Additionally, achieving the highest system efficiency and application scalability requires the careful construction of a balanced cluster. The CPU, memory and interconnect must match each other’s capabilities, and none of these components may impose artificial limitations on the others.

Mellanox adapters are the only adapters that provide full transport off-load, hardware-based RDMA and application (MPI) off-load to ensure maximum CPU efficiency. Other InfiniBand solutions can depend on the CPU for network protocol processing (i.e. creating network packets, dealing with transport checks and error handling). Also, adapters from other vendors often lack native support for RDMA, which places a heavy overhead on the CPU cores and leads to system noise and jitter that dramatically reduce efficiency.

The Message Passing Interface (MPI) is the most widely used communication library for scientific applications. It defines fifteen different blocking collective operations. Mellanox ConnectX-2 CORE-Direct is the only InfiniBand technology that addresses collective communication scalability by off-loading all fifteen operations to the Host Channel Adapter (HCA), thus enabling communication to progress asynchronously. Moreover, ConnectX-2 includes a floating point operation unit that enables the off-load of the data manipulation part of collective operations. In all, Mellanox has delivered the most comprehensive off-loading capability ever seen in high performance interconnects.
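To illustrate why asynchronous progress matters, the Python sketch below uses a deliberately idealized timing model. With a blocking collective the CPU waits for the operation to finish. With a fully off-loaded collective the adapter progresses the operation while the CPU continues computing. The function names and timings are hypothetical, and the model ignores posting overhead and data dependencies. It does not describe how CORE-Direct is implemented.

```python
def blocking_step_time(compute_s: float, collective_s: float) -> float:
    """CPU computes, then waits for the collective: the two costs add up."""
    return compute_s + collective_s


def offloaded_step_time(compute_s: float, collective_s: float) -> float:
    """Collective progresses on the adapter while the CPU computes, so the
    step takes as long as the slower of the two (ideal overlap)."""
    return max(compute_s, collective_s)


def overlap_saving(compute_s: float, collective_s: float) -> float:
    """Fraction of the blocking step time saved by ideal overlap."""
    blocking = blocking_step_time(compute_s, collective_s)
    if blocking <= 0:
        raise ValueError("step must take a positive amount of time")
    return 1.0 - offloaded_step_time(compute_s, collective_s) / blocking


if __name__ == "__main__":
    # ASSUMPTION: hypothetical per-iteration costs in seconds.
    compute_s, collective_s = 0.008, 0.002
    print(blocking_step_time(compute_s, collective_s))
    print(offloaded_step_time(compute_s, collective_s))
    print(f"{overlap_saving(compute_s, collective_s):.0%}")
```

In this model the saving is largest when compute and collective times are similar, and it vanishes when one of them dominates the step.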

To meet the growing demand for higher compute capability and network throughput, Mellanox has introduced the world’s fastest InfiniBand switches. The current generation of Mellanox switches delivers 120Gb/s of throughput per logical link, 50% higher than any other InfiniBand solution. These switches can aggregate networking throughput into a single 120Gb/s logical link, dramatically reducing network congestion and increasing network efficiency.

Mellanox switches are also the fastest networking switches for HPC systems, with an end-to-end switch latency of 100 ns at 100% load. This latency is 30% to 40% lower than that of any other InfiniBand switch, which ensures the fastest message delivery between application processes and rapid data delivery between compute servers and storage appliances. Compared to Ethernet switches, Mellanox switch latency is 500% faster, and the gap increases with the size of the compute cluster.

To deliver exascale-ready solutions, Mellanox has architected the most advanced network congestion control and avoidance and adaptive network routing. These unique capabilities combine the adapter cards and the switches into a single “self learning” infrastructure that adapts itself to real system conditions, reacts to and prevents congestion scenarios, and maximizes overall network efficiency.

ParaStationV5

ParaStationV5 is cluster middleware specifically designed to meet the demands of today’s large-scale distributed-memory clusters. ParaStation is a coherent, integrated series of libraries, communication layers, tools and diagnostic utilities that provide the foundations of a cluster operating system. ParaStation MPI is a key component of ParaStation. It has been tuned to deliver ‘Enhanced Application Performance’ and has been shown to have ‘Industry Leading Scalability’ when used with Mellanox ConnectX-2 technology.

Serviceability is another crucial requirement of the cluster environment. The ability to identify and resolve resource and hardware issues before they cause job failures is fundamental to the smooth day-to-day operation of a cluster. ParaStation GridMonitor, together with the automated HealthChecker, provides administrators with the early warning mechanism required to maintain cluster availability and minimize job failures. Both components actively monitor and maintain the InfiniBand fabric, and they are key to speeding up the commissioning phase of large InfiniBand clusters. Additionally, these tools are invaluable in assisting with routine administration of the cluster in the production environment.

ParaStationV5 Message Passing – MPI(2)

ParaStationV5 provides a standard interface for parallel applications using both MPI & MPI-2. Intra-node message passing communication is handled via a shared memory model (shmem), while inter-node communication is handled using Mellanox InfiniBand technology.
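The split between intra-node and inter-node communication can be pictured with a small Python sketch. It assumes a simple block placement of ranks onto nodes and picks a transport label for each pair of ranks. The names `block_placement` and `select_transport` are hypothetical and are not part of any ParaStation interface. In practice the MPI library makes this choice internally.

```python
from typing import Dict


def block_placement(num_ranks: int, ranks_per_node: int) -> Dict[int, int]:
    """Map each MPI rank to a node index, filling one node before the next."""
    if ranks_per_node <= 0:
        raise ValueError("ranks_per_node must be positive")
    return {rank: rank // ranks_per_node for rank in range(num_ranks)}


def select_transport(src: int, dst: int, node_of_rank: Dict[int, int]) -> str:
    """Return 'shmem' for ranks on the same node, otherwise 'infiniband'."""
    if node_of_rank[src] == node_of_rank[dst]:
        return "shmem"
    return "infiniband"


if __name__ == "__main__":
    # ASSUMPTION: 16 ranks placed 8 per node, purely for illustration.
    placement = block_placement(num_ranks=16, ranks_per_node=8)
    print(select_transport(0, 7, placement))  # prints shmem
    print(select_transport(7, 8, placement))  # prints infiniband
```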

ParaStation Process Communication

Job control, launch and termination are handled by ParaStation’s pscom layer. This communication layer facilitates the automated clean-up of job processes that are no longer progressing because the MPI job has terminated following a hardware failure. ParaStation’s pscom is a daemon infrastructure that runs on each node. No user authentication is required on compute nodes; authentication is required only on login nodes, which greatly simplifies the administration of large clusters. The pscom layer provides sophisticated process management to start, monitor and close processes. ParaStationV5 supports cluster partitioning, or the exclusive allocation of processors or nodes, enabling specific jobs or users to access dedicated resources.

ParaStation GridMonitor

The GridMonitor is a concentrator for the multitude of telemetry information available from every component within a cluster. Its browser-based user interface provides a single monitoring station for entire clusters.

ParaStation HealthChecker

ParaStation’s HealthChecker is designed to check the consistency and health of compute and storage nodes during production. Each node has to be checked for software consistency and confirmed free of hardware errors to ensure reliable cluster operation. Failure to account for the reduced mean time between failures on large clusters leads to dramatically reduced system utilization. ParaStation’s HealthChecker seeks to address the issue by catching problems before they cause failures. Techniques such as checkpoint/restart are important, but ParTec believes it is far more important to limit the likelihood of failures than to seek to rectify them once they have occurred. ParaStation is supported by a team of HPC engineers with the skills needed to provision and maintain some of the largest HPC clusters in the world. ParaStation engineers are experienced in the diagnostics and optimization of InfiniBand fabrics that extend to many thousands of nodes.
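The remark about reduced mean time between failures on large clusters can be made concrete with a short Python sketch. It assumes independent node failures with a constant failure rate (an exponential model), which is a common simplification and not something the article states. The per-node MTBF of five years is a hypothetical value chosen only for illustration. The node count is the JuRoPA figure quoted earlier.

```python
import math

HOURS_PER_YEAR = 8760


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of the whole system when any single node failure counts.

    ASSUMPTION: independent nodes with constant failure rates, so the
    failure rates add and the system MTBF is node MTBF divided by N.
    """
    if node_mtbf_hours <= 0 or nodes <= 0:
        raise ValueError("MTBF and node count must be positive")
    return node_mtbf_hours / nodes


def job_survival_probability(node_mtbf_hours: float, nodes: int,
                             job_hours: float) -> float:
    """Probability that none of the job's nodes fails during the run."""
    return math.exp(-job_hours / system_mtbf_hours(node_mtbf_hours, nodes))


if __name__ == "__main__":
    node_mtbf = 5 * HOURS_PER_YEAR  # hypothetical: 5 years per node
    nodes = 3288                    # JuRoPA compute nodes
    print(round(system_mtbf_hours(node_mtbf, nodes), 1))  # prints 13.3
    print(f"{job_survival_probability(node_mtbf, nodes, 24.0):.0%}")
```

Under these assumptions a node that fails once in five years still yields a full-system failure roughly every 13 hours, which is why checking node health before failures reach running jobs matters at this scale.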

ParaStation and ConnectX-2: a proven solution for today’s high-performance throughput clusters.