Job Summary:
KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem. The company is seeking a highly skilled Senior AI Ops Engineer to architect and deliver automation layers for scalable model development, focusing on end-to-end experiment management and model fine-tuning pipelines.
Responsibilities:
• Implement and operate experiment tracking, lineage, and reproducibility standards (datasets, code, configs, artifacts, metrics) using MLflow/W&B or equivalents.
• Build CI/CD for ML: tests (unit/integration), packaging, reproducibility checks, policy gates, automated deployment and rollback strategies.
• Design workflow orchestration for large-scale ML jobs (scheduled runs, triggered retrains, parameter sweeps, gated releases) using tools such as Airflow/Kubeflow/Argo or equivalents.
• Architect, build, and own automated pipelines for model training, fine-tuning (e.g., PEFT/LoRA), evaluation, and promotion across environments (dev → staging → production).
• Establish standardized training “recipes” (configs, templates, golden paths) to reduce time-to-first-experiment and improve consistency across teams.
• Enable and optimize distributed GPU training (throughput, reliability, and cost), including checkpointing, mixed precision, fault tolerance, and spot/preemptible handling where applicable.
• Develop evaluation harnesses and automated benchmark suites (quality, safety, latency, and cost) with clear, repeatable reporting to compare runs and releases.
Qualifications:
Required:
• Strong proficiency in Python and experience building robust automation frameworks and production-grade services for ML workloads.
• Hands-on experience with experiment tracking and model lifecycle tooling (e.g., MLflow, Weights & Biases) and reproducible ML workflows.
• Practical experience fine-tuning modern deep learning models (e.g., Transformers) and familiarity with parameter-efficient approaches (LoRA/PEFT).
• Working knowledge of RLHF concepts and pipelines (preference data, reward models, policy optimization) and how to operationalize human-in-the-loop workflows.
• Experience with containerization (Docker), orchestration (Kubernetes), and operating GPU workloads reliably at scale.
• Experience with CI/CD, version control (Git), and Infrastructure-as-Code (Terraform/Bicep or equivalent).
• Excellent problem-solving skills across distributed systems (training jobs, pipelines, compute infrastructure) and strong communication to partner with research and engineering teams.
• Prior experience in a similar industry and/or operating ML platforms with stringent IP/security requirements is a plus.
• Bachelor’s degree in Computer Science, Software Engineering, or related field.
• 5+ years of experience in MLOps/Platform Engineering/DevOps/ML Engineering (or demonstrated equivalent impact), including owning production systems and leading cross-team initiatives.
• Master’s degree with 6 years of related work experience; or Bachelor’s degree with 8 years of related work experience; or equivalent work experience.
Company:
KLA creates tools and services that promote innovation in the electronics industry. Founded in 1975, the company is headquartered in Milpitas, California, USA, and has more than 10,000 employees. The company is currently late stage.