Job Summary:
AMD is a company dedicated to building innovative products that drive next-generation computing experiences. It is seeking a Systems Design Engineer focused on AI infrastructure who will create reference architectures and technical documentation to support internal teams and customers in their hardware and software decisions.
Responsibilities:
• Apply your expertise to shape AI infrastructure by creating reference architectures, configuration guides, and deployment blueprints that help internal teams and customers make informed hardware and software decisions.
• Perform deep technical evaluations of AI stacks across compute, storage, networking, and observability layers, documenting how they work, where they fit, and the tradeoffs involved.
• Design and execute reproducible experiments and benchmarking harnesses to compare technologies such as schedulers, distributed training libraries, and observability stacks.
• Develop small reference implementations and tools to validate performance hypotheses and analyze system behavior.
• Build a library of technical artifacts, including presentations, design documents, and “how it works” guides, to support pre-sales engineers and enable others to skill up from an HPC perspective.
• Present findings through demos, documentation, and internal talks, and create templates and checklists to support repeatable evaluations and cluster designs.
Qualifications:
Required:
• Hands-on experience with rack- and row-scale high-performance infrastructure.
• Ability to explore how AI workloads like inference and training fit into large-scale AI infrastructure.
• Self-directed, proactive, and comfortable navigating ambiguity to solve complex problems.
• Clear communication skills and enjoyment in writing technical artifacts.
• Ability to collaborate naturally with internal teams and customers.
• Curiosity, initiative, and a drive to create.
• Bachelor's or Master's degree in electrical or computer engineering.
Preferred:
• Engineering mindset: evidence of end-to-end systems thinking, debugging, and tradeoff decisions.
• AI/HPC cluster background: hands-on familiarity with at least two schedulers and/or orchestration systems (e.g., Slurm, Kubernetes), MPI/OpenMP, distributed storage patterns, or performance analysis.
• Comparative analysis: experience writing evaluation docs/RFCs with clear criteria, benchmarks, risks, and recommendations.
• Strong Linux fundamentals: operating system, networking, filesystems, containers, and performance tooling (perf, flame graphs, nvprof/rocprof, basic eBPF).
• Clear communication: ability to turn complex systems into accessible, structured documentation with diagrams and reproducible steps.
• AMD ecosystem experience: ROCm, RCCL, Instinct GPUs, EPYC platforms, compiler/toolchain impacts, and performance tuning.
• Distributed training internals: DDP, collective communications, sharded/stateful optimizers; NCCL/RCCL behavior and transport considerations (PCIe, NVLink, Infinity Fabric).
• Orchestration models: Slurm configuration patterns, Kubernetes for HPC/AI (GPU operators, device plugins), Apptainer/Singularity.
• Storage/data: parallel filesystems (Lustre, BeeGFS), object stores, RDMA, data pipeline throughput and caching strategies.
• IaC literacy: Terraform/Ansible for reproducible blueprints, with a focus on design and sample configs rather than running production clusters.
• Documentation tooling: reproducible docs/workbooks, literate programming notebooks, CI for benchmarks.
Company:
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions. Founded in 1969, the company is headquartered in Santa Clara, California, USA, and has more than 10,000 employees. It is currently a late-stage company.