Job Summary:
CIM Group is a community-focused real estate and infrastructure owner, operator, lender, and developer. They are seeking a Senior Machine Learning Ops Engineer to lead the design and maintenance of scalable infrastructure for ML model deployment and lifecycle management, with a focus on productionizing Generative AI solutions.
Responsibilities:
• Lead the design, implementation, and ongoing maintenance of scalable ML infrastructure on Databricks, including MLflow for experiment tracking, model registry, and model serving endpoints.
• Oversee the development of the ML Ops platform and automated pipelines for deploying, monitoring, and maintaining models within production environments.
• Implement robust solutions for model versioning, systematic retraining, and comprehensive artifact management using Databricks Unity Catalog for ML governance.
• Design and manage Databricks Feature Store for consistent feature engineering across training and inference pipelines.
• Architect and implement Retrieval-Augmented Generation (RAG) systems for document Q&A, enabling business teams to query fund documents, investor letters, and market research.
• Design, deploy, and manage vector database solutions (Databricks Vector Search, Pinecone, or similar) for semantic search and retrieval across enterprise documents.
• Lead LLM fine-tuning and customization initiatives, training models like Claude or open-source alternatives with CIM proprietary data while ensuring data privacy and compliance.
• Develop and optimize document processing pipelines including PDF parsing, chunking strategies, and embedding generation for RAG applications.
• Implement prompt engineering best practices and LLM evaluation frameworks to ensure output quality, relevance, and factual accuracy.
• Build guardrails and safety measures for GenAI applications, including hallucination detection, output validation, and source attribution.
• Design and implement extensive automation across the ML workflow, covering model training, testing, validation, and deployment using Databricks Workflows and Asset Bundles.
• Set up robust CI/CD pipelines for both traditional ML models and GenAI applications, leveraging GitHub Actions, Azure DevOps, or similar tools.
• Automate complex data and model workflows utilizing orchestration tools such as Airflow, Prefect, or Databricks Workflows.
• Implement comprehensive monitoring and alerting systems for real-time tracking of model performance, data quality, and GenAI output quality.
• Utilize specialized tools (Evidently AI, WhyLabs, Prometheus/Grafana) to proactively detect model drift, data quality anomalies, and RAG retrieval degradation.
• Develop evaluation frameworks for GenAI applications including relevance scoring, faithfulness metrics, and human feedback loops.
• Troubleshoot issues within production environments, including debugging model deployment failures, RAG retrieval issues, and LLM response quality problems.
• Build and maintain sophisticated feature stores on Databricks, ensuring precise alignment between training and inference data pipelines.
• Collaborate with data engineers and information architects to build robust ETL pipelines that feed into the Databricks Lakehouse.
• Design embedding pipelines and vector index management strategies for RAG applications, including incremental updates and versioning.
• Integrate robust security measures directly into ML Ops and GenAI pipelines, including access controls via Unity Catalog and data encryption.
• Implement Trustworthy AI guardrails addressing bias detection, explainability, prompt injection prevention, and responsible AI practices.
• Ensure GenAI applications handling sensitive fund and investor data comply with regulatory requirements and internal policies.
• Collaborate with Legal and Compliance to establish AI governance policies and audit trails for model decisions.
• Engage in extensive collaboration with data scientists, platform engineers, information architects, and DevOps teams to ensure seamless ML/AI integration.
• Partner with business teams (Fund Accounting, FP&A, Investor Relations, Sales, Investments) to identify high-value AI use cases and translate business needs into technical solutions.
• Communicate complex AI concepts in business terms, managing expectations and demonstrating ROI of ML/GenAI initiatives.
• Provide technical mentorship to team members, including refactoring data scientist code for production readiness.
Qualifications:
Required:
• Bachelor's or Master's degree in Computer Science, Engineering, Information Systems, or a related field.
• 7+ years of experience as an ML Ops Engineer, ML Engineer, or in a similar role with production deployment responsibility.
• Expert-level proficiency in Python, complemented by strong skills in Bash scripting.
• Extensive experience designing and implementing cloud solutions on Azure (required) or GCP.
• Deep expertise with Docker and Kubernetes for containerizing and orchestrating ML workloads.
• Hands-on experience with CI/CD tools such as GitHub Actions, Jenkins, GitLab CI, or Azure DevOps.
• Strong SQL proficiency and practical experience with the Databricks platform.
• Experience with workflow orchestration tools (Airflow, Prefect, or Databricks Workflows) and monitoring tools (Prometheus, Grafana, Evidently AI).
• Demonstrated experience building and deploying RAG (Retrieval-Augmented Generation) systems in production environments.
• Hands-on experience with vector databases (Databricks Vector Search, Pinecone, Weaviate, Chroma, or Milvus).
• Experience with LLM APIs and frameworks (OpenAI, Anthropic Claude, LangChain, LlamaIndex).
• Understanding of embedding models, chunking strategies, and retrieval optimization techniques.
• Knowledge of prompt engineering best practices and LLM evaluation methodologies.
• Experience with MLflow for experiment tracking, model registry, and model serving.
• Familiarity with Databricks Feature Store and Unity Catalog for ML governance.
• Understanding of Delta Lake and Lakehouse architecture for ML data pipelines.
• Experience with Databricks Model Serving endpoints and inference optimization.
Preferred:
• Experience with LLM fine-tuning techniques (LoRA, QLoRA, full fine-tuning) on proprietary data.
• Familiarity with ML frameworks including TensorFlow, PyTorch, Scikit-learn, XGBoost.
• Experience with model serialization (ONNX) and inference optimization.
• Prior experience within financial services, fintech, or private equity sectors.
• Experience building ML/AI infrastructure from scratch in entrepreneurial environments.
• Relevant certifications: Azure AI Engineer Associate, Databricks ML Professional, Google Cloud ML Engineer.
Company:
CIM is a community-focused real estate and infrastructure owner, operator, lender, and developer. Founded in 1994, the company is headquartered in Los Angeles, USA, and has 501-1000 employees. It is currently a late-stage company.