Position: AI Infra SRE Engineer – DGX
Location: Remote
Duration: Full-time
Must-have
- NVIDIA DGX or equivalent high-performance computing (HPC) clusters (e.g., Cray, HPE, IBM)
- Cisco UCS C885A
- Docker
Good to have
- DevOps Automation
- CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
- Terraform, Ansible, Jenkins
- Python
- Go (Golang), C/C++
- Enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos
- Full software development lifecycle (design, development, testing, packaging, and deployment) using Go
Roles & Responsibilities
- Technical knowledge of high-performance computing, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).
- Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
- Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
- Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
- Deliver automation through CI/CD pipelines, chatbots, and similar channels.
- Implement metrics-driven processes to ensure service quality targets are met.