Overview:
Job Title: AI/ML Ops & Infrastructure Engineer
Company: R2 Technologies
Location: Alpharetta, GA (Hybrid / Remote Options Available)
Employment Type: Full-Time / Contractual
About R2 Technologies: R2 Technologies is a Certified Minority Business Enterprise (MBE) headquartered in Alpharetta, GA. With over two decades of experience across global markets, we have built a reputation as a trusted partner for IT staffing excellence and cutting-edge digital product innovation. We operate on a simple philosophy: "We deliver what we promise, and we promise only what we can deliver." Beyond providing top-tier IT talent, R2 builds proprietary solutions such as SmartEnt, an Enterprise AI & IoT Intelligence Platform that uses advanced NLP and AI technologies. By partnering closely with our clients, we deliver technology-driven outcomes that are realistic, measurable, and impactful.
Job Summary: The shift from classical Machine Learning to Generative AI requires a new breed of infrastructure engineering. R2 Technologies is looking for an AI/ML Ops & Infrastructure Engineer to build and manage the operational backbone for our advanced LLM and agentic systems. You will move beyond basic CI/CD to implement full-lifecycle LLMOps: managing foundation models, fine-tuned adapters, routing logic, and guardrails. Your work will ensure that our AI solutions, including SmartEnt, run with high performance, optimal GPU utilization, and rigorous compliance.
Key Responsibilities:
- Design and maintain highly scalable LLMOps pipelines for continuous integration, evaluation, and deployment of machine learning models and AI agents.
- Deploy and manage containerized AI applications and model inference servers (e.g., vLLM, Ray Serve, NVIDIA Triton) on Kubernetes across multi-cloud environments (AWS, GCP, Azure).
- Implement comprehensive observability and trace-level logging for multi-step agentic workflows using platforms like LangSmith, W&B Weave, or MLflow.
- Automate infrastructure provisioning and monitoring using tools like Terraform and agent-driven workflows (e.g., n8n, GitHub Actions).
- Optimize GPU computing costs, latency, and token usage for high-traffic AI inference endpoints.
- Enforce security guardrails, toxic output filtering, and robust access policies within the AI deployment infrastructure.
- Actively utilize AI-assisted coding tools (Copilot, Cursor) to automate infrastructure-as-code (IaC) and streamline Kubernetes management.
Qualifications:
- Up to 3 years of hands-on experience in MLOps, DevOps, Site Reliability Engineering (SRE), or Cloud Infrastructure.
- Strong proficiency in containerization and orchestration (Docker, Kubernetes).
- Experience with ML/LLM operational platforms (MLflow, Weights & Biases, Databricks Mosaic AI, or SageMaker).
- Familiarity with serving open-source or fine-tuned LLMs and optimizing inference performance.
- Proven experience with, or strong familiarity with, AI coding assistants as a way to enhance productivity.
- Scripting/programming skills in Python and Bash, along with experience in CI/CD automation.
- Passion for the evolving landscape of AI infrastructure, cost optimization (FinOps), and system reliability.
Skills: NVIDIA, Infrastructure