Senior HPC Cluster Administrator - Deep Learning Frameworks Infrastructure
Tasks
- Automate infrastructure with IaC
- Build CI CD infrastructure pipelines
- Design and scale storage solutions
- Evaluate and introduce new networking and storage technologies
- Maintain observability stacks
- Manage and optimize job scheduling with Slurm
- Mentor engineers and define engineering standards
- Own GPU compute cluster lifecycle
- Tune cluster configuration for distributed training
Perks/Benefits
- N/A
Skills/Tech-stack
Ansible | Apptainer | Bash | Cgroup | DCGM | Docker | EFA | GitLab | Grafana | IPMI | Infiniband | Kubernetes | Linux | Lustre | NFS | NVIDIA MIG | NVIDIA NVSwitch | NVLink | NVSwitch diagnostics | NVSwtich | Prometheus | Python | RDMA | Redfish | Singularity | Slurm | Terraform | WekaFS
Education
Related jobs
-
Automation | Backup and Recovery | High Availability | Linux | MariaDBCareer growth opportunities | Collaborative environment | Fully remote work | Participation in open source projectsSenior-level Full TimePoland R1mo ago