Our client is a leading global biotechnology and pharmaceutical organization driven by a mission to innovate, continuously advance science, and ensure everyone has access to the healthcare they need.
Title: AI Data Engineer - Scientific Data Platforms
Location: Remote (must work PST hours)
Pay rate: $35-38/hr (depending on experience level)
Schedule: Full-time (40 hours/week)
Duration: 1-year contract (plus benefits)
Position Overview
This role addresses a critical need in scaling our AI models for drug discovery by building largely automated, scalable, agent-driven data ingestion and curation pipelines for genomics data. This includes metadata inference, constructing performant query architectures, and transforming high-dimensional datasets (e.g., single-cell omics, clinical trials) into AI-ready training formats.
Key Responsibilities
- Build an agentic data ingestion pipeline and move beyond bespoke steps toward agents that teams can reliably use as a shared, deployed service.
- Triage and prioritize incoming requests to ingest specific datasets. Clean and organize data, building the first-pass cleaning and organization steps into the agentic flow.
- Validate cross-modal linkage. Add automated checks that catch when ingested data does not connect correctly and flag low-quality or mismatched records.
- Version every dataset, retaining prior versions and keeping them addressable. Preserve raw data and provenance, ensuring agent workflows log validation and transformation steps so lineage is fully traceable.
- Partner with AI, software engineering, and computational biology groups to co-define data standards and conventions.
Qualifications & Requirements
- Demonstrated experience building multi-agent workflows or LLM workflows using tools/frameworks such as LangGraph or LlamaIndex, including tool/function calling and asynchronous task execution.
- Strong Python skills for data manipulation, working with APIs and databases, and handling heterogeneous data formats.
- Familiarity with dataset versioning approaches (e.g., DVC, lakeFS, or equivalent).
- Comfortable with, or strongly willing to learn, common omics data formats and storage tools such as AnnData/H5AD and TileDB.
- No deep bioinformatics expertise required; just a basic conceptual understanding of different modalities (e.g., RNA-seq vs. scRNA-seq vs. WES; genomics vs. transcriptomics vs. proteomics vs. metabolomics).
- Comfortable writing unit and functional tests to ensure data processing workflows are reliable and reproducible.
- Degree in a technical field or equivalent practical experience.
- Must be authorized to work in the United States without sponsorship.
Nice to Have
- Experience deploying agent workflows as a shared service (e.g., FastAPI or MCP endpoints).
- Exposure to cloud platforms (AWS, GCP) and containerization (Docker).
- Familiarity with scientific workflow managers such as Nextflow or Snakemake.