Title: Site Reliability Engineer (SRE) – ML Platform
Location: Austin, TX OR Sunnyvale, CA
Responsibilities:
- Build and maintain continuous deployment using GitHub Actions, Flux, and Kustomize
- Design and implement cloud solutions and build MLOps pipelines on AWS
- Containerize and deploy data science models using Docker, vLLM, and Kubernetes
- Communicate with a team of data scientists, data engineers, and architects, and document the processes
- Develop and deploy scalable tools and services for our clients to handle machine learning training and inference.
- Knowledge of ML models and LLMs
Qualifications:
- 6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS.
- Good understanding of Apache Solr.
- Proficient with Linux administration.
- Knowledge of ML models and LLMs.
- Ability to understand tools used by data scientists, plus experience with software development and test automation
- Ability to design and implement cloud solutions and build MLOps pipelines on AWS
- Experience working with cloud computing and database systems
- Experience building custom integrations between cloud-based systems using APIs
- Experience developing and maintaining ML systems built with open-source tools
- Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, as well as Docker and Kubernetes
- Experience developing containers and working with Kubernetes in cloud computing environments
- Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
- Ability to translate business needs into technical requirements
- Strong understanding of software testing, benchmarking, and continuous integration
- Exposure to machine learning methodology and best practices
- Good communication skills and ability to work in a team
Note: The focus of the role is 60% SRE and 40% MLOps…