Job Summary
We are seeking a highly skilled AI Engineer to design, build, and deploy agentic systems using Google Gemini ADK on GCP. The role involves implementing AI agent workflows for production use cases, integrating with developer ecosystems, monitoring runtime performance, and conducting offline/online evaluations to improve model quality. The ideal candidate will have hands-on experience with LLM/agentic applications, strong Python skills, GCP cloud services, observability tooling, and CI/CD processes.

Key Responsibilities
- Design and build agentic systems using Google Gemini ADK (tool use, planning, memory, orchestration) and deploy them on GCP (Cloud Run, GKE, Vertex AI endpoints).
- Monitor runtime performance of AI agents and related metrics.
- Productionize agent workflows for SRE use cases including incident triage, log/trace summarization, anomaly detection, remediation suggestions, runbook generation, and post-incident reviews.
- Integrate with developer ecosystem tools: GitHub, CI/CD pipelines, policy checks, secrets management, feature flags, and progressive delivery (canary/blue-green) for agent versions and prompts.
- Evaluate and improve agent workflows through offline/online evaluations, driving model/agent quality and closing the loop with feedback from SREs.
- Ensure MTTA/MTTR targets are met for critical incidents handled with agent assistance.
- Maintain observability coverage of agent tool calls with correlated trace spans and structured logs.

Required Skills & Experience
- 1+ years building LLM/agentic applications with strong Python expertise.
- Hands-on experience with Google Gemini (prompting, tool use, function calling, structured outputs) and ADK (planning, orchestration, memory).
- GCP fluency: IAM, Cloud Run/GKE, Vertex AI, Cloud Storage, Cloud Logging/Monitoring, Secret Manager, Cloud Build, Cloud Deploy.
- CI/CD & Infrastructure as Code: Terraform, Jenkins, AWS CodePipeline.
- Experience with storage systems: AlloyDB, Spanner, Neo4j.
- Observability expertise: OpenTelemetry (traces/metrics/logs), Dynatrace, Splunk.
- Data & evaluation skills: dataset curation, prompt/version management, offline evaluations (accuracy, hallucination, safety), online metrics (latency, cost, win rate).

Competencies
- Strong analytical and problem-solving skills.
- Ability to work independently and manage priorities in complex environments.
- Excellent communication and collaboration with cross-functional teams.
- Ability to measure and optimize SLO compliance for agent endpoints and orchestrations.

Preferred / Desirable Skills
- Experience with productionizing LLM agents for SRE workflows or similar use cases.
- Knowledge of progressive delivery strategies (canary, blue-green).
- Familiarity with advanced runtime observability and monitoring practices for AI systems.