Job Title: Senior ML Infrastructure & MLOps Engineer (Core Platform)
Role Overview
We are seeking an ownership-driven Senior ML Infrastructure Engineer to join our core Platform team. In this role, you will bridge the gap between production engineering and bleeding-edge AI research by building the foundational shared systems that power our machine learning lifecycle. Your primary focus will be designing, scaling, and automating robust ML infrastructure, ranging from standardized training frameworks and feature-generation platforms to high-performance model-serving clusters. The ideal candidate is a pragmatic, fast-moving systems engineer who values reliability, cost optimization, and high developer velocity.
Key Responsibilities
ML Infrastructure & Container Orchestration
• Distributed Clusters: Architect and maintain high-performance training and serving infrastructure utilizing Google Kubernetes Engine (GKE).
• Model Optimization: Design and implement high-efficiency optimization pipelines, including advanced knowledge distillation and foundational training tooling.
• Platform Scaling: Build, monitor, and optimize shared ML systems to ensure maximum infrastructure uptime, pipeline reliability, and cloud cost-efficiency.
Data Engineering & Pipeline Automation
• Workflow Automation: Build robust, automated pipelines for standardized model training, validation, and continuous deployment (CI/CD for ML).
• Feature Platforms: Develop scalable data sampling and feature-generation platforms to accelerate research experimentation cycles.
• Onboarding & Usability: Drive high platform adoption by building intuitive, standardized deployment tools that reduce onboarding time for research and engineering teams.
Collaboration & Governance
• Cross-Functional Bridge: Collaborate closely with ML researchers and core software engineers to translate theoretical models into highly scalable production systems.
• Methodical Execution: Apply a disciplined, data-backed approach to identify infrastructure bottlenecks, reduce time-to-market, and stabilize complex deployments.
Qualifications & Requirements
• Experience:
o 5 to 10+ years of hands-on experience designing and operating large-scale distributed ML platforms.
o Proven track record of supporting production-grade ML workflows in cloud environments.
• Technical Mastery:
o Deep expertise in container orchestration, specifically GKE (Google Kubernetes Engine) or equivalent enterprise Kubernetes environments.
o Hands-on experience building scalable ML pipelines (e.g., Kubeflow, Airflow, TFX).
o Strong proficiency in distributed training strategies, feature store management, and model serving infrastructure.
• Soft Skills & Attributes:
o Pragmatic Mindset: Strong ownership-driven work style focused on consistency, system reliability, and cost-awareness.
o Effective Communicator: Ability to collaborate seamlessly with highly technical researchers and platform engineers alike.
Preferred Qualifications
• Prior experience working within dedicated, tier-1 enterprise ML/AI platform teams.
• Deep knowledge of backend optimization for distributed systems and of infrastructure-as-code (IaC).
Equal Opportunity Employer / Disabled / Protected Veterans
The Know Your Rights poster is available here:
https://www.eeoc.gov/sites/default/files/2023-06/22-088_EEOC_KnowYourRights6.12.pdf
The pay transparency policy is available here:
https://www.dol.gov/sites/dolgov/files/ofccp/pdf/pay-transp_%20English_formattedESQA508c.pdf
For temporary assignments lasting 13 weeks or longer, AllSTEM Connections is pleased to offer major medical, dental, vision, 401k and any statutory sick pay where required.
We are committed to working with and providing reasonable accommodations to individuals with disabilities. If you need a reasonable accommodation for any part of the employment process, please contact your staffing representative who will reach out to our HR team.
AllSTEM Connections participates in the E-Verify program in certain locations as required by law. Learn more about the E-Verify program.
https://e-verify.uscis.gov/web/media/resourcesContents/E-Verify_Participation_Poster_ES.pdf
We also consider for employment qualified applicants regardless of criminal histories, consistent with legal requirements, including, if applicable, the City of Los Angeles' Fair Chance Initiative for Hiring Ordinance. Pursuant to applicable state and municipal Fair Chance Laws and Ordinances, we will consider for employment qualified applicants with arrest and conviction records, including, if applicable, the San Francisco Fair Chance Ordinance. For Los Angeles, CA applicants: Qualified applicants with arrest or conviction records will be considered for employment in accordance with the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act.
AllSTEM Representative Contact Info
Account Executive:
Nichols
Branch Phone:
(909) 244-1777
Location:
Ontario, CA