Remote Data Curation Jobs in California (NOW HIRING)

Data Engineer III

Menlo Park, CA · On-site +1

$134K - $162K/yr

Our team develops comprehensive data curation and evaluation solutions for image generation models ... Remote Inference Orchestration: Own the systems for remote ML model inference orchestration within ...

Senior Data Modeler I

Redwood City, CA · On-site +1

$90K - $130K/yr

Experience working with real world data from various sources (e.g., curation workflows, EHRs, lab ... Additionally, for remote roles open to individuals in unincorporated Los Angeles - including remote ...

Remote Data Curation information

What is remote data curation?

Remote data curation involves organizing, managing, and maintaining data sets from a remote location, ensuring their accuracy, quality, and usability for various purposes. Data curators work with digital information, often cleaning, annotating, and structuring data to make it more accessible and valuable for organizations or research. This role can include tasks like verifying data integrity, standardizing formats, and documenting metadata, all performed using online tools and collaboration platforms. Remote data curation allows professionals to work from anywhere, supporting projects in fields such as science, business, or technology.

What are some common challenges faced in a remote data curation role, and how can they be addressed?

One common challenge in remote data curation is ensuring consistent data quality and integrity across distributed teams. Without in-person collaboration, communication gaps can arise, making it important to use clear documentation practices and regular virtual check-ins. Additionally, remote curators often need to be proactive in seeking clarification and feedback to avoid misinterpretation of data standards. Leveraging collaborative tools and maintaining open communication channels can significantly help in overcoming these hurdles and maintaining workflow efficiency.

What is the difference between Remote Data Curation and Remote Data Entry?

Aspect | Remote Data Curation | Remote Data Entry
Primary Focus | Analyzing, organizing, and maintaining data quality and relevance | Inputting and updating data into systems
Required Skills | Data analysis, critical thinking, attention to detail | Typing accuracy, basic computer skills
Work Environment | Collaborative, often involves research and validation | Individual, repetitive tasks
Common Certifications | Data management, database certifications | None typically required

Remote Data Curation involves managing and improving data quality through analysis and organization, while Remote Data Entry focuses on inputting data accurately into systems. Both roles require attention to detail, but data curation demands analytical skills and a deeper understanding of data relevance, making it more complex and strategic compared to the straightforward nature of data entry tasks.

What are the key skills and qualifications needed to thrive as a Remote Data Curator, and why are they important?

To thrive as a Remote Data Curator, you need strong analytical abilities, attention to detail, and a background in information science, data management, or a related field. Familiarity with database systems, data cleaning tools (such as OpenRefine), and metadata standards is typically required, along with experience using collaborative platforms. Excellent communication, organizational skills, and the ability to work independently are crucial soft skills in this remote role. These skills ensure the accuracy, accessibility, and integrity of large datasets, supporting effective data-driven decision-making for organizations.
Data Engineer III

Akidev Corporation

Menlo Park, CA • On-site, Remote

$134K - $162K/yr

Other

Posted 7 days ago


Job description

Start/End Dates: 7/13/2026 - 12/31/2026
Tax Work Location: US - CA - Menlo Park (105201)
Job Title: Data Analytics & Engineering - Data Engineer III
Job Description: Summary
Generative AI models are only as good as the data they consume. Unlike traditional data engineering, building data pipelines for generative AI requires orchestrating ML model invocations (content understanding classifiers, embedding models, LLM-based cleaners) alongside standard SQL-based transformations, all at billion-row scale.
This role sits at the intersection of Data Engineering and ML Systems. The Senior AI Data Engineer will own end-to-end data pipelines that don't just move and transform data, but enrich it through remote model inference, managing the systems complexity of async execution, capacity allocation, retry/fallback logic, and throughput optimization that comes with it. This is not a pure ETL-with-SQL role; it demands hands-on systems experience with distributed inference infrastructure.
Our team develops comprehensive data curation and evaluation solutions for image generation models across quality dimensions including visual quality, prompt adherence, identity preservation, naturalness, and visual text generation.
Job Responsibilities
AI-Augmented Data Pipelines: Design and maintain AI-augmented, large-scale data pipelines (billions of images) integrating traditional transformations with ML models (classifiers, embeddings, LLMs) for cleaning and annotation.
Remote Inference Orchestration: Own the systems for remote ML model inference orchestration within pipelines, managing batching, retries, async jobs, and ensuring graceful degradation.
Feature Pipelines: Build and maintain scalable pipelines for generating, storing, and serving vector embeddings, including nearest-neighbor index management and quality validation.
Data Curation at Scale: Source, filter, and curate training datasets using a combination of SQL and model-derived signals (e.g., aesthetic scores, NSFW classifiers), owning the end-to-end data flow and maintaining governance, quality, and compliance.
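As a rough illustration of what the remote inference orchestration responsibility involves (batching, retries, async jobs, graceful degradation), here is a minimal Python sketch. It is not the team's actual framework: the `infer` callable, the function names, the batch sizes, and the backoff values are all hypothetical stand-ins for a real inference endpoint and its settings.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical type: an async function that scores a batch of item IDs.
InferFn = Callable[[List[str]], Awaitable[List[float]]]


def make_batches(items: List[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def infer_with_retry(
    infer: InferFn,
    batch: List[str],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> List[Optional[float]]:
    """Call infer on one batch, retrying with exponential backoff.

    Graceful degradation: if every attempt fails, return None for each
    item so downstream steps can skip or re-queue those rows.
    """
    for attempt in range(max_attempts):
        try:
            return list(await infer(batch))
        except Exception:
            if attempt + 1 < max_attempts:
                await asyncio.sleep(base_delay * 2 ** attempt)
    return [None] * len(batch)


async def run_pipeline(
    infer: InferFn,
    items: List[str],
    batch_size: int = 4,
    max_concurrency: int = 2,
) -> List[Optional[float]]:
    """Score all items with bounded concurrency, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(batch: List[str]) -> List[Optional[float]]:
        async with semaphore:
            return await infer_with_retry(infer, batch)

    per_batch = await asyncio.gather(
        *(guarded(b) for b in make_batches(items, batch_size))
    )
    return [score for scores in per_batch for score in scores]
```

In this sketch a permanently failing endpoint yields `None` scores instead of aborting the whole run, which is one simple way to read "graceful degradation"; a production system would likely also record the failures for later replay.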
Additional Responsibilities
LLM-Assisted Annotation: Design and operate pipelines that use LLMs and vision models for automated annotation of training data, including auditing workflows to measure and improve annotation model performance.
Tooling & Frameworks: Contribute to shared tooling and frameworks that make it easier for the broader team to build AI-augmented data pipelines (e.g., reusable operators for model invocation, standard patterns for async job management).
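The auditing workflow mentioned under LLM-Assisted Annotation can be pictured as comparing model-produced labels against a human-reviewed sample. The sketch below is only an assumption about what such an audit might compute; the function name `audit_annotations`, the label values, and the choice of metrics (agreement, precision, recall) are illustrative and not taken from the posting.

```python
from typing import Dict, List


def audit_annotations(
    model_labels: List[str],
    human_labels: List[str],
    positive: str,
) -> Dict[str, float]:
    """Compare model labels with human audit labels for the same items.

    Returns overall agreement plus precision and recall for the
    `positive` label, treating the human labels as ground truth.
    """
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(model_labels, human_labels))
    tp = sum(1 for m, h in pairs if m == positive and h == positive)
    fp = sum(1 for m, h in pairs if m == positive and h != positive)
    fn = sum(1 for m, h in pairs if m != positive and h == positive)
    agree = sum(1 for m, h in pairs if m == h)

    return {
        "agreement": agree / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical audit sample: four items reviewed by a human.
report = audit_annotations(
    model_labels=["nsfw", "safe", "nsfw", "safe"],
    human_labels=["nsfw", "safe", "safe", "nsfw"],
    positive="nsfw",
)
```

Tracking these numbers over successive audit samples is one way to measure whether changes to prompts or models actually improve annotation quality.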
Skills Required
Advanced SQL & data pipeline expertise. Complex queries, query optimization, pipeline orchestration frameworks (Airflow, Dataswarm, or equivalent).
Experience integrating ML models into data pipelines. Calling inference endpoints, managing model versions, batching requests, handling inference failures at scale.
Proficiency with AI-assisted coding agents (e.g., Copilot, Cursor, Codex). Expected to leverage AI tools as a force multiplier for writing, debugging, and reviewing code, building pipelines faster, and accelerating day-to-day engineering workflows.
Strong verbal and written communication skills, problem-solving ability, and cross-functional collaboration.
Preferred
Working knowledge of embeddings and vector representations: generating, storing, indexing, and querying embeddings (FAISS, Milvus, or equivalent).
Familiarity with content-understanding models such as image classifiers, object detection, OCR, NSFW detection, and aesthetic scoring.
Experience with LLMs for data tasks such as prompt engineering for annotation, data cleaning, or evaluation using LLM APIs.
Knowledge of generative AI, including diffusion models, image generation, and evaluation metrics (FID, CLIP score, etc.).
Education / Experience
Bachelor's degree or higher in Computer Science, Data Engineering, Machine Learning, or a related STEM field.
5+ years of industry experience in data engineering, ML engineering, or a hybrid role involving both data pipelines and model serving/inference.
Demonstrated track record of building and operating production data pipelines that invoke ML models at scale.