System Administrator

Renaud Consulting

Ottawa, ON

Posted today

Job Details:

Full-time

Management

Objective

The objective of this work is to manage a high performance computing (HPC) cluster (HPC administrator role) and to support users (HPC analyst role) with the installation, execution and debugging of research applications and code on HPC clusters. This requires troubleshooting and ensuring client satisfaction so that clients (scientists) can devote their time to NRC research priorities rather than to resolving IT-related issues.

Scope of Work

HPC Administrator Tasks

Maintain an HPC cluster (hardware, image management, local networking, scheduler, backups).
Troubleshoot the environment when an incident occurs to ensure a quick return to normal operations.

HPC Analyst Tasks

Meet with scientists and evaluate their requirements for HPC support.
Develop a task plan to meet scientists' needs and consult the technical authority for approval.
Application builds and installs, runtime troubleshooting (GNU, Intel, Fortran, Nvidia).
Support for open-source and commercial off-the-shelf (COTS) software, including:
- Python and Anaconda installs.
- Bash scripts, build/make tools, EasyBuild, and Spack.
- MPI implementations (MPICH, OpenMPI, IntelMPI, HPMPI).
Assist with in-house developed applications (compilation and runtime).

Other General Tasks

Management of:

Operating system (patching schedule, reliability for Linux distributions).
Accounts (creation, deletion).
Configuration via Git, MS DevOps, Ansible Playbooks.
RPM/DEB Packages.
Environment modules.
ThinLinc troubleshooting.

Troubleshooting & Hardware

Troubleshooting jobs on schedulers (PBS Pro/Torque, SLURM, SGE).
Ensure reliable CUDA installs, troubleshoot GPU failures and other CUDA software/driver issues.
Hardware support (memory upgrades, storage arrays, power and network cabling, iLO).

Documentation

Document each process for every task to ensure enterprise knowledge continuity.

Mandatory Requirements

The proposed resource has five (5) years' experience within the last ten (10) years in administering HPC (High Performance Computing) systems and performing HPC analyst tasks, as per Annex – A Statement of Work.
The proposed resource has worked for more than twelve (12) months. Each reference provided must have held a supervisory role over the proposed resource.

