About the Role
We are seeking a Senior Machine Learning Engineer to lead the fine-tuning, optimization, and deployment of AI models for diverse tasks, with a strong emphasis on on-device inference. You will work on cutting-edge applications such as orchestration, planning, multi-agent coordination, and other intelligent decision-making systems.
You will be responsible for adapting foundation models (LLMs, multimodal models) to specialized domains, making them fast, accurate, and efficient for resource-constrained environments while ensuring robustness and safety.
What You Might Do
- Model Fine-Tuning & Adaptation
- Fine-tune large language models, multimodal models, and task-specific models for orchestration, planning, and other workflows as defined.
- Design and run experiments to improve task accuracy, robustness, and generalization.
- Explore and apply methods such as full fine-tuning, LoRA, QLoRA, and other parameter-efficient fine-tuning (PEFT) techniques.
- Employ advanced techniques such as QAT, DPO, and GRPO to further improve model quality.
- On-Device Optimization
- Prune, quantize, and compress models (e.g., INT8, INT4, mixed-precision) for CPU, GPU, NPU, and edge accelerators.
- Optimize models for low-latency inference using frameworks such as OpenVINO, ONNX Runtime, and QNN.
- Data Pipeline & Deployment
- Build robust data pipelines for domain-specific datasets, including synthetic data generation and annotation.
- Define evaluation metrics, run evaluations, and analyze results.
- Establish best practices for versioning, reproducibility, and continuous improvement of model performance.
- AI Orchestration & Planning
- Develop and refine models to support multi-step reasoning, tool orchestration, and decision planning.
- Work with stakeholders on orchestrator architecture.
- Collaborate with product and research teams to design intelligent, context-aware assistant capabilities.
Essential Qualifications
- 7+ years of experience in applied machine learning, including at least 3 years in LLM fine-tuning.
- Proficiency in Python and the ML framework ecosystem (Hugging Face, PyTorch).
- Strong understanding of transformer architectures, attention mechanisms, and PEFT techniques.
- Experience with on-device inference optimization (OpenVINO, ONNX Runtime, QNN).
- Familiarity with orchestration/planning architectures and techniques for AI assistants.
- Track record of delivering production-ready ML solutions in latency-sensitive environments.
Preferred Qualifications
- Experience with multi-agent systems or AI assistant orchestration.
- Familiarity with advanced inference optimization techniques such as KV cache paging and flash attention.
- Knowledge of common inference engines, including but not limited to llama.cpp and vLLM.
Salary Range: $120,000 - $215,000