Job Summary:
Mindlance is a company focused on advanced data engineering solutions. The Senior AI Data Engineer will own end-to-end data pipelines that integrate traditional transformations with ML models, managing complex systems for data enrichment and ensuring high-quality data for generative AI models.
Responsibilities:
• AI-Augmented Data Pipelines: Design and maintain AI-augmented, large-scale data pipelines (billions of images) integrating traditional transformations with ML models (classifiers, embeddings, LLMs) for cleaning and annotation.
• Remote Inference Orchestration: Own the systems for remote ML model inference orchestration within pipelines, managing batching, retries, and async jobs, and ensuring graceful degradation.
• Feature Pipelines: Build and maintain scalable pipelines for generating, storing, and serving vector embeddings, including nearest-neighbor index management and quality validation.
• Data Curation at Scale: Source, filter, and curate training datasets using a combination of SQL and model-derived signals (e.g., aesthetic scores, NSFW classifiers), owning the end-to-end data flow and maintaining governance, quality, and compliance.
• LLM-Assisted Annotation: Design and operate pipelines that use LLMs and vision models for automated annotation of training data, including auditing workflows to measure and improve annotation model performance.
• Tooling & Frameworks: Contribute to shared tooling and frameworks that make it easier for the broader team to build AI-augmented data pipelines, such as reusable operators for model invocation and standard patterns for async job management.
Qualifications:
Required:
• Advanced SQL & data pipeline expertise. Complex queries, query optimization, pipeline orchestration frameworks (Airflow, Dataswarm, or equivalent).
• Experience integrating ML models into data pipelines. Calling inference endpoints, managing model versions, batching requests, handling inference failures at scale.
• Proficiency with AI-assisted coding agents (e.g., Copilot, Cursor, Codex). Expected to leverage AI tools as a force multiplier for writing, debugging, and reviewing code, building pipelines faster, and accelerating day-to-day engineering workflows.
• Strong verbal and written communication skills, problem-solving ability, and cross-functional collaboration.
• Bachelor's degree or higher in Computer Science, Data Engineering, Machine Learning, or a related STEM field.
• 5+ years of industry experience in data engineering, ML engineering, or a hybrid role involving both data pipelines and model serving/inference.
• Demonstrated track record of building and operating production data pipelines that invoke ML models at scale.
• Ability to work onsite in MPK, collaborating closely with engineers and researchers.
Preferred:
• Working knowledge of embeddings and vector representations: generating, storing, indexing, and querying embeddings (FAISS, Milvus, or equivalent).
• Familiarity with content-understanding models such as image classifiers, object detection, OCR, NSFW detection, and aesthetic scoring.
• Experience using LLMs for data tasks such as annotation, data cleaning, or evaluation, including prompt engineering with LLM APIs.
• Knowledge of generative AI: diffusion models, image generation, and evaluation metrics (FID, CLIP score, etc.).
• Previous experience at Meta.
Company:
Mindlance is a staffing and recruiting company that provides multi-vertical staffing services. Founded in 1999, the company is headquartered in Union, USA, and has 1001-5000 employees. The company is currently late-stage.