Job Summary:
PepsiCo, a leading global food and beverage company, is seeking an AI Engineer specializing in Agentic AI enablement. The role involves designing and delivering production-grade agent capabilities, managing the end-to-end delivery of agent modules, and driving adoption through collaboration with engineering, product, security, and data teams.
Responsibilities:
• Lead design and productionization of high-leverage agent modules and reusable patterns (tool-use orchestration, policies/guardrails, memory, RAG where it adds measurable value), built as composable components and reference implementations.
• Translate ambiguous product/problem statements into concrete agent behaviors and system designs: state models, failure modes, tool contracts, latency budgets, and acceptance criteria that engineering + product can execute against.
• Deliver quickly without sacrificing quality: create thin vertical slices, iterate with evidence, and converge on robust behavior under real-world constraints.
• Drive meaningful performance gains via systematic optimization: latency, token efficiency, tool-call success, retrieval quality, and cost per successful task, including remediation of long-tail failure modes.
• Proactively identify platformizable opportunities: refactor one-off implementations into shared frameworks/SDKs that reduce build time for others.
• Define and implement evaluation strategies for assigned workflows: golden sets, scenario coverage maps, regression suites, online/offline metrics, and release gating thresholds aligned to real business outcomes.
• Build repeatable evaluation systems (templates, labeling guidance, dataset/versioning conventions, dashboards/reports) so evaluation becomes a productized capability, not ad hoc testing.
• Implement robust automated testing across layers: unit tests for prompt/tool wrappers, contract tests for tool schemas, integration tests for toolchains, and agent simulation tests for multi-step flows.
• Lead root-cause analysis of quality failures (hallucinations, tool misuse, retrieval misses, routing errors): isolate causes (prompt/tool/data/model), implement corrective actions, and prevent regressions.
• Champion evidence-first iteration: decisions and releases are backed by eval results, not gut feel.
• Contribute to router design and task-to-model mapping through routing rules/classifiers, prompt strategies, and model selection policies; validate decisions using evaluation data and runtime telemetry.
• Propose and implement routing improvements when constraints change (pricing, latency, throughput, new model capabilities), with governance-aware rollouts and rollback plans.
• Identify and mitigate routing failure modes (over-escalation to expensive models, under-routing causing quality loss, brittle heuristics) and improve robustness using lightweight ML or rules where appropriate.
• Lead implementation of MCP connectors/clients for enterprise apps and internal data products with strong engineering hygiene: schema/versioning discipline, typed contracts, scopes/permissions, auditability, and integration test strategy.
• Build reusable integration patterns: standardized tool metadata, error normalization, retries/timeouts, idempotency, pagination handling, and consistent auth patterns to accelerate onboarding of new tools.
• Collaborate with security/data owners to ensure secure-by-design tool access (least privilege, logging, PII handling, policy enforcement).
• Ensure production readiness for owned components: telemetry coverage, structured logging, traceability for tool calls, SLI/SLO alignment (latency, success rate, cost), and participation in incident response and postmortems.
• Proactively identify delivery risks (dependencies, rate limits, data quality, security scopes, vendor constraints) and drive resolution with clear tradeoffs and recommendations.
• Mentor peers through technical leadership: raise code quality, share patterns, review PRs for correctness/performance/security, and contribute to internal playbooks.
Qualifications:
Required:
• Bachelor’s in CS/AI/ML or equivalent experience required
• 6-8 years of experience across the software development life cycle
• Expertise in ML (structured and unstructured data) development and engineering
• Proven experience shipping LLM/agent solutions to production with measurable quality and operational practices.
• Advanced Software Engineering: Python (and Java) mastery with distributed systems expertise; performance optimization (profiling, parallelization); architecture patterns (e.g., FastAPI, asyncio, Pydantic)
• LLM & Agent Systems: Multi-agent orchestration (LangChain, LangGraph, CrewAI); advanced prompt engineering; custom agent memory architectures; model optimization techniques
• Evaluation Framework Development: Statistical evaluation design (confidence intervals, power analysis); benchmark creation; instrumentation frameworks (e.g., MLflow, Arize); regression testing systems
• ML Operations: Production deployment pipelines (Docker, Kubernetes, Ray); model registry management; scaled inference optimization; GPU utilization optimization
• Enterprise Integration: Enterprise connector development; scalable API architectures; data pipeline engineering (Kafka, gRPC, Redis); authorization protocol implementation
• Observability Engineering: Telemetry system design (Prometheus, OpenTelemetry); automated anomaly detection; distributed tracing; performance dashboarding (Grafana)
• System Architecture: Microservice design patterns; high-throughput event processing; fault-tolerance implementation; horizontal scaling architectures
• Technical Leadership: Architecture governance systems; engineering standards development; build-vs-buy evaluation frameworks; technical roadmap creation
Preferred:
• Master’s preferred
• Full-stack development experience on a modern stack
• Modeling user interactions with AI systems; modeling multi-agent behavior loops with tools like Temporal
• Agentic memory patterns and their use with tools like Mem0 and Temporal
• Experience with agentic RAG; domain-level semantic layer design with graph and vector databases
Company:
PepsiCo is a food and beverage company. Founded in 1898, the company is headquartered in Purchase, New York, USA, and has more than 10,000 employees. The company is currently late stage.