- Objective
- Scope of Work
- Maintain a HPC cluster (hardware, image management, local networking, scheduler, backups).
- Troubleshoot the environment when an incident occurs to ensure a quick return to normal operations.
- Meet with scientists and evaluate their requirements for HPC support.
- Develop a task plan to meet scientists' needs and consult the technical authority for approval.
- Application builds and installs, runtime troubleshooting (GNU, Intel, Fortran, Nvidia).
- Support for open-source and commercial off-the-shelf (COTS) software, including:
- Python and Anaconda installs.
- Bash scripts, build/make tools, EasyBuild, and Spack.
- MPI implementations (MPICH, OpenMPI, IntelMPI, HPMPI).
- Assist with in-house developed applications (compilation and runtime).
- Operating system (patching schedule, reliability for Linux distributions).
- Accounts (creation, deletion).
- Configuration via Git, MS DevOps, Ansible Playbooks.
- RPM/DEB Packages.
- Environment modules.
- ThinLinc troubleshooting.
- Troubleshooting jobs on schedulers (PBS Pro/Torque, SLURM, SGE).
- Ensure reliable CUDA installs, troubleshoot GPU failures and other CUDA software/driver issues.
- Hardware support (memory upgrades, storage arrays, power and network cabling, ILO).
- Document each process for every task to ensure enterprise knowledge continuity.
Mandatory Requirements
- The proposed resource has five (5) years' experience within the last ten (10) years in administrating HPC (High Performance Computing) systems and performing HPC analyst tasks, as per Annex – A Statement of Work.
- The proposed resource has worked for more than twelves (12) months. Each reference provided must have been in a role of supervision of the proposed resource