Senior HPC Cluster Administrator - Deep Learning Frameworks Infrastructure
Tasks
- Automate infrastructure with IaC
- Build CI/CD infrastructure pipelines
- Design and scale storage solutions
- Evaluate and introduce new networking and storage technologies
- Maintain observability stacks
- Manage and optimize job scheduling with Slurm
- Mentor engineers and define engineering standards
- Own GPU compute cluster lifecycle
- Tune cluster configuration for distributed training
Perks/Benefits
- N/A
Skills/Tech-stack
Ansible | Apptainer | Bash | cgroups | DCGM | Docker | EFA | GitLab | Grafana | IPMI | InfiniBand | Kubernetes | Linux | Lustre | NFS | NVIDIA MIG | NVIDIA NVSwitch | NVLink | NVSwitch diagnostics | Prometheus | Python | RDMA | Redfish | Singularity | Slurm | Terraform | WekaFS
Education