Overview:
Job Title: AI Reliability Engineer (AI SRE)
Company: R2 Technologies
Location: Alpharetta, GA (Hybrid / Remote Options Available)
Employment Type: Full-Time / Contractual
About R2 Technologies: R2 Technologies is a Certified Minority Business Enterprise (MBE) headquartered in Alpharetta, GA. With over two decades of experience across global markets, we have built a reputation as a trusted partner for IT staffing excellence and cutting-edge digital product innovation. We are driven by innovation and operate on a simple philosophy: "We deliver what we promise, and we promise only what we can deliver." Beyond providing top-tier IT talent, R2 builds proprietary solutions such as SmartEnt, an Enterprise AI & IoT Intelligence Platform built on advanced NLP and AI technologies. By partnering closely with our clients, we deliver technology-driven outcomes that are realistic, measurable, and impactful.
Job Summary: As enterprise AI shifts from prototypes to mission-critical production systems, we need engineers who can guarantee stability. R2 Technologies is seeking an AI Reliability Engineer to merge traditional Site Reliability Engineering (SRE) with LLM operations. You will be the guardian of our production AI, responsible for monitoring foundation models for performance drift, optimizing token usage and GPU costs, and ensuring high-availability inference for our SmartEnt platform.
Key Responsibilities:
- Deploy, scale, and manage LLM inference servers (e.g., vLLM, Ray Serve, NVIDIA Triton) on Kubernetes across multi-cloud environments.
- Implement comprehensive observability, logging, and tracing for complex agentic workflows using platforms like LangSmith, MLflow, or Weights & Biases (Weave).
- Monitor production models for data drift, hallucination rates, and latency spikes, implementing automated rollback or model-routing strategies when necessary.
- Optimize cloud infrastructure to balance GPU utilization, inference speed, and token cost (FinOps for AI).
- Automate infrastructure provisioning (IaC) and CI/CD pipelines specifically tailored for machine learning models and fine-tuned adapters.
- Actively utilize AI-assisted coding tools (GitHub Copilot, Cursor) to automate infrastructure management and incident response scripting.
Qualifications:
- Up to 3 years of hands-on experience in SRE, DevOps, MLOps, or Cloud Infrastructure.
- Strong proficiency in containerization and orchestration (Docker, Kubernetes, Helm).
- Experience configuring and scaling GPU-backed workloads in cloud environments (AWS, Azure, or GCP).
- Familiarity with LLM observability tools and trace-level debugging of AI applications.
- Proven experience with, or strong familiarity with, AI coding assistants as a way to enhance productivity.
- Scripting skills in Python and Bash, with a strong focus on system reliability, automation, and cost optimization.
Skills: Reliability Engineering, Kubernetes