Job Summary:
CIM Group is a community-focused real estate and infrastructure company seeking a Senior ML Ops Engineer to lead the design and maintenance of scalable infrastructure for ML model deployment and lifecycle management. The role involves collaborating with various teams to enhance ML-driven insights while ensuring compliance and governance of ML and generative AI initiatives.
Responsibilities:
• Lead the design, implementation, and ongoing maintenance of scalable ML infrastructure on Databricks, including MLflow for experiment tracking, model registry, and model serving endpoints
• Oversee the development of the ML Ops platform and automated pipelines for deploying, monitoring, and maintaining models within production environments
• Implement robust solutions for model versioning, systematic retraining, and comprehensive artifact management using Databricks Unity Catalog for ML governance
• Design and manage Databricks Feature Store for consistent feature engineering across training and inference pipelines
• Architect and implement Retrieval-Augmented Generation (RAG) systems for document Q&A, enabling business teams to query fund documents, investor letters, and market research
• Design, deploy, and manage vector database solutions (Databricks Vector Search, Pinecone, or similar) for semantic search and retrieval across enterprise documents
• Lead LLM fine-tuning and customization initiatives, training models like Claude or open-source alternatives with CIM proprietary data while ensuring data privacy and compliance
• Develop and optimize document processing pipelines including PDF parsing, chunking strategies, and embedding generation for RAG applications
• Implement prompt engineering best practices and LLM evaluation frameworks to ensure output quality, relevance, and factual accuracy
• Build guardrails and safety measures for GenAI applications, including hallucination detection, output validation, and source attribution
• Design and implement extensive automation across the ML workflow, covering model training, testing, validation, and deployment using Databricks Workflows and Asset Bundles
• Set up robust CI/CD pipelines for both traditional ML models and GenAI applications, leveraging GitHub Actions, Azure DevOps, or similar tools
• Automate complex data and model workflows utilizing orchestration tools such as Airflow, Prefect, or Databricks Workflows
• Implement comprehensive monitoring and alerting systems for real-time tracking of model performance, data quality, and GenAI output quality
• Utilize specialized tools (Evidently AI, WhyLabs, Prometheus/Grafana) to proactively detect model drift, data quality anomalies, and RAG retrieval degradation
• Develop evaluation frameworks for GenAI applications including relevance scoring, faithfulness metrics, and human feedback loops
• Troubleshoot issues within production environments, including debugging model deployment failures, RAG retrieval issues, and LLM response quality problems
• Build and maintain sophisticated feature stores on Databricks, ensuring precise alignment between training and inference data pipelines
• Collaborate with data engineers and information architects to build robust ETL pipelines that feed into the Databricks Lakehouse
• Design embedding pipelines and vector index management strategies for RAG applications, including incremental updates and versioning
• Integrate robust security measures directly into ML Ops and GenAI pipelines, including access controls via Unity Catalog and data encryption
• Implement Trustworthy AI guardrails addressing bias detection, explainability, prompt injection prevention, and responsible AI practices
• Ensure GenAI applications handling sensitive fund and investor data comply with regulatory requirements and internal policies
• Collaborate with Legal and Compliance to establish AI governance policies and audit trails for model decisions
• Engage in extensive collaboration with data scientists, platform engineers, information architects, and DevOps teams to ensure seamless ML/AI integration
• Partner with business teams (Fund Accounting, FP&A, Investor Relations, Sales, Investments) to identify high-value AI use cases and translate business needs into technical solutions
• Communicate complex AI concepts in business terms, managing expectations and demonstrating ROI of ML/GenAI initiatives
• Provide technical mentorship to team members, including refactoring data scientist code for production readiness
Qualifications:
Required:
• Bachelor's or Master's degree in Computer Science, Engineering, Information Systems, or a related field
• 7+ years of experience as an ML Ops Engineer, ML Engineer, or similar role with production deployment responsibility
• Expert-level proficiency in Python, complemented by strong skills in Bash scripting
• Extensive experience designing and implementing cloud solutions on Azure (required) or GCP
• Deep expertise with Docker and Kubernetes for containerizing and orchestrating ML workloads
• Hands-on experience with CI/CD tools such as GitHub Actions, Jenkins, GitLab CI, or Azure DevOps
• Strong SQL proficiency and practical experience with the Databricks platform
• Experience with workflow orchestration tools (Airflow, Prefect, or Databricks Workflows) and monitoring tools (Prometheus, Grafana, Evidently AI)
• Demonstrated experience building and deploying RAG (Retrieval-Augmented Generation) systems in production environments
• Hands-on experience with vector databases (Databricks Vector Search, Pinecone, Weaviate, Chroma, or Milvus)
• Experience with LLM APIs and frameworks (OpenAI, Anthropic Claude, LangChain, LlamaIndex)
• Understanding of embedding models, chunking strategies, and retrieval optimization techniques
• Knowledge of prompt engineering best practices and LLM evaluation methodologies
• Experience with MLflow for experiment tracking, model registry, and model serving
• Familiarity with Databricks Feature Store and Unity Catalog for ML governance
• Understanding of Delta Lake and Lakehouse architecture for ML data pipelines
• Experience with Databricks Model Serving endpoints and inference optimization
Preferred:
• Experience with LLM fine-tuning techniques (LoRA, QLoRA, full fine-tuning) on proprietary data
• Familiarity with ML frameworks including TensorFlow, PyTorch, Scikit-learn, XGBoost
• Experience with model serialization (ONNX) and inference optimization
• Prior experience within financial services, fintech, or private equity sectors
• Experience building ML/AI infrastructure from scratch in entrepreneurial environments
• Relevant certifications: Azure AI Engineer Associate, Databricks ML Professional, Google Cloud ML Engineer
Company:
CIM is a community-focused real estate and infrastructure owner, operator, lender, and developer. Founded in 1994, the company is headquartered in Los Angeles, USA, and has 501-1000 employees. It is currently a late-stage company.