LXC³ Cluster Command and Control

Overview

LXC³ is NEC's cluster command and control stack for the LX series of high performance Linux clusters. It combines more than a decade of NEC's own cluster administration experience at HPC data centers of all sizes and know-how from using and actively developing open source software with new ideas from research and development activities.

LXC³ contains all components that are necessary to make the administration and operation of a complex HPC cluster as easy as possible:

  • Deployment / provisioning: supports stateless, stateful and hybrid compute node deployments.
  • Monitoring: monitors cluster performance and health and alerts the administrators in case of problems.
  • Resource management: job scheduler and batch system.
  • Cluster management tools: various tools that ease the administration and usage of the cluster system.

The management tools are built on well-proven and thoroughly tested open source components, carefully selected and configured for high scalability. They are integrated by a framework developed by NEC that uses a single data source to describe the cluster and generates a sensible configuration matching NEC's best practices for cluster administration.
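The idea of generating several configurations from a single data source can be illustrated with a minimal sketch. The dictionary layout, node names, addresses and the two output formats below are assumptions chosen for illustration; they do not reflect the actual LXC³ schema or generators.

```python
from ipaddress import ip_address

# Hypothetical single data source describing the cluster.
# Field names and values are illustrative, not the LXC³ schema.
CLUSTER = {
    "domain": "cluster.local",
    "nodes": [
        {"name": "node002", "ip": "10.0.0.2", "mac": "00:11:22:33:44:02"},
        {"name": "node001", "ip": "10.0.0.1", "mac": "00:11:22:33:44:01"},
    ],
}


def render_hosts(cluster):
    """Render /etc/hosts-style lines, sorted by IP address."""
    lines = []
    for node in sorted(cluster["nodes"], key=lambda n: ip_address(n["ip"])):
        fqdn = f'{node["name"]}.{cluster["domain"]}'
        lines.append(f'{node["ip"]}\t{fqdn}\t{node["name"]}')
    return "\n".join(lines)


def render_dhcp(cluster):
    """Render ISC-dhcpd-style host entries from the same description."""
    entries = []
    for node in cluster["nodes"]:
        entries.append(
            f'host {node["name"]} {{ hardware ethernet {node["mac"]}; '
            f'fixed-address {node["ip"]}; }}'
        )
    return "\n".join(entries)


if __name__ == "__main__":
    print(render_hosts(CLUSTER))
    print(render_dhcp(CLUSTER))
```

Because both outputs are derived from the same description, adding or changing a node in one place keeps name resolution and address assignment consistent.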

For critical customer systems, highly available setups are offered using redundant management nodes: a standby node takes over in case of a failure, minimizing overall downtime.

NEC LXC³ clusters are Intel Cluster Ready (ICR) certified and are set up to maximize customer usability by maintaining commonly used Linux installation schemes. ICR is an architecture and a program that makes it easier to gain the performance advantages of HPC clusters and eases application development and administration.

Provisioning System

The LXC³ provisioning system is image-based and supports stateless and stateful deployments of cluster nodes, allowing for completely diskless compute nodes, hybrid nodes with parts of the system on a disk and parts in RAM or on NFS, or full on-disk installations. It builds on a forked version of the Perceus open source project, which NEC has decided to maintain, improve and extend.

The image-based provisioning system suggests, but does not enforce, an administration methodology for clusters. The main synchronization point for cluster nodes is the central Virtual Node File System (VNFS), which makes it possible to maintain a single system image for many cluster nodes. Node-specific settings such as IP addresses and hostnames are auto-configured during deployment of the system. The central administration paradigm is complemented by a procedural administration method in which cluster nodes regularly check for and pull configuration modules that can adjust their setup. Changes within a VNFS image or the configuration modules can easily be tracked with the integrated versioning features.

LXC³ provides scripts to create VNFS images for common enterprise-grade Linux operating systems. Other Linux-based distributions can be integrated by using a regularly installed “golden client” node as the source for the VNFS image.

Monitoring and Alerting System

LXC³ managed clusters come configured with performance and health monitoring systems based on the well-known and widely used open source solutions Ganglia and Nagios. These systems are shipped fully configured with a variety of custom and in-house agents and metrics, and are complemented by self-developed tools. The monitoring systems are integrated with each other and report various sensor and system data for analysis and alerting purposes.

When the cluster reaches a critical state that requires operator intervention, automatic console alerts or e-mail notifications are sent. Automatic shutdown of the affected cluster nodes can also be configured. Preset conditions for these actions are:

  • Reaching a certain CPU temperature or fan speed
  • Reaching a critical number of memory errors
  • Failure of a system component such as an HDD or power supply
  • Complete node hardware failure
  • Cluster overtemperature
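
How preset conditions can map sensor readings to actions is sketched below. This is a minimal illustration, not the Nagios or LXC³ configuration: the metric names, threshold values and action labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str        # name of the reported sensor value (hypothetical)
    threshold: float   # value at or above which the rule triggers
    action: str        # "notify" or "shutdown" (illustrative labels)


# Hypothetical thresholds; real limits depend on hardware and site policy.
RULES = [
    Rule("cpu_temp_celsius", 85.0, "notify"),
    Rule("cpu_temp_celsius", 95.0, "shutdown"),
    Rule("memory_errors", 100, "notify"),
]


def evaluate(node, readings, rules=RULES):
    """Return (node, metric, action) for every rule whose threshold is reached."""
    triggered = []
    for rule in rules:
        value = readings.get(rule.metric)
        if value is not None and value >= rule.threshold:
            triggered.append((node, rule.metric, rule.action))
    return triggered


if __name__ == "__main__":
    for event in evaluate("node001", {"cpu_temp_celsius": 96.0, "memory_errors": 3}):
        print(event)
```

Keeping notification and shutdown as separate rules on the same metric lets a node warn the operators first and power down only if the value keeps rising.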

Monitoring and resource management systems are integrated.

Resource Management

HPC clusters are shared among multiple users, and resources are coordinated by a batch or queueing system that includes a resource scheduler. LXC³ clusters come pre-configured and ready to use with a resource management system based on either Torque and Maui or SLURM, all of which are free and open source programs.

The resource management system is set up with one default queue. The user interface consists of a set of command line utilities that give full control over jobs and their resource definitions.
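
As an illustration of a job's resource definition, the sketch below generates a batch script for the SLURM variant. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--time`, `--dependency=afterok:<jobid>`) are standard SLURM options; the helper function, job name, job ID and command are hypothetical, and a Torque/Maui setup would use different directives.

```python
def build_batch_script(name, nodes, walltime, command, after_ok=None):
    """Build the text of a SLURM batch script.

    after_ok: optional ID of a job that must finish successfully first.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={walltime}",
    ]
    if after_ok is not None:
        # Start this job only after the given job has completed successfully.
        lines.append(f"#SBATCH --dependency=afterok:{after_ok}")
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: post-processing that waits for job 1234.
    print(build_batch_script("postprocess", 2, "01:00:00",
                             "srun ./postprocess", after_ok=1234))
```

The resulting text would be saved to a file and submitted with `sbatch`; since only the default queue is assumed, no partition is specified.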

Key features of the job scheduler:

  • Backfilling
  • Fair Share Scheduling
  • Topology awareness
  • Job reservation
  • Interactive jobs
  • Job dependencies
  • Job accounting
  • User-defined attributes

Other resource management systems (PBS-Pro, LSF, Grid Engine) can be integrated upon request.

Cluster Management

LXC³ offers centralized administration of cluster nodes comprising:

  • Fully automated deployment of stateful and stateless operating systems on cluster nodes
  • Parallel execution of administrative tasks on many or all cluster nodes
  • Distribution and collection of files to and from cluster nodes
  • Power management (on, off, reset) of cluster nodes
  • Graphical or text-based console access to cluster nodes
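
The parallel execution of an administrative task on many nodes can be sketched in a few lines. This is a generic illustration using `ssh` and a thread pool, not the LXC³ tool itself; the node names are hypothetical and password-less SSH access to the nodes is assumed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_over_ssh(node, command):
    """Run a command on one node via ssh; returns (exit code, stdout)."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, command],
        capture_output=True, text=True, timeout=60,
    )
    return result.returncode, result.stdout.strip()


def run_on_nodes(nodes, command, runner=run_over_ssh, max_workers=32):
    """Run a command on all nodes concurrently; returns {node: (rc, output)}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {node: pool.submit(runner, node, command) for node in nodes}
        return {node: future.result() for node, future in futures.items()}


if __name__ == "__main__":
    # Hypothetical node names; requires key-based SSH access.
    for node, (rc, out) in run_on_nodes(["node001", "node002"], "uptime").items():
        print(node, rc, out)
```

The `runner` parameter makes the transport replaceable, so the same fan-out logic can be tried without touching real nodes.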

In addition, LXC³ features tools for cluster users that facilitate using different environments for different compiler and library versions. With these tools, users can easily switch between different versions of applications (e.g. ISV applications, MPI) or switch the development tools from one version to another.

Graphical Administration Tool

LXC³ includes a graphical user interface that can be used with a regular Web browser. It eases daily administration tasks and provides an overview at a glance.

( Screenshot from LXC³ provisioning GUI )

Supported Operating Systems

The following operating systems are supported on the master node, with full integration into the LXC³ framework when they are also deployed on the cluster nodes:

  • Red Hat Enterprise Linux 5
  • CentOS 5
  • Scientific Linux 5
  • Red Hat Enterprise Linux 6
  • CentOS 6
  • Scientific Linux 6
  • SUSE Linux Enterprise Server 11

The provisioning system can deploy any Linux operating system on the cluster nodes.

Services

Installation & Integration

LXC³ offers fully automated installation of all cluster nodes and easy adaptation to the customer's network environment and software infrastructure.

Documentation and Training

NEC provides a detailed system description and documentation covering LXC³ administration. In-depth training is also available.

Support

NEC provides a Web-accessible ticketing system (RT: Request Tracker) that can be used to report problems and track incidents. NEC implements ITIL-compliant procedures to standardize problem handling and provide solutions in a timely manner.