We are looking for a Senior Python / AI API Engineer to build and deploy production-grade services powering Large Language Model (LLM) applications. This role focuses on developing high-performance APIs for model inference, optimizing GPU workloads, and deploying AI services in cloud environments.
This is an engineering role, not a research position. We are looking for someone who has built and shipped AI systems into production and understands the challenges of scalable inference and model serving.
Key Responsibilities
- Develop high-performance APIs using Python (3.10+) and FastAPI
- Build and deploy LLM inference services using HuggingFace Transformers and PyTorch
- Optimize GPU workloads and CUDA memory usage
- Implement streaming inference APIs for real-time model responses
- Containerize and deploy services using Docker and GPU-enabled infrastructure
- Deploy AI workloads in Azure environments (AKS, ACI, or Container Apps)
Required Skills
- Strong Python development experience (3.10+)
- Hands-on experience building production APIs with FastAPI
- Experience with HuggingFace Transformers and PyTorch
- Solid understanding of REST API design
- Experience deploying containerized applications with Docker
Nice to Have
- Experience with OpenAI-compatible APIs, vLLM, or Text Generation Inference (TGI)
- Experience deploying AI workloads on Azure GPU infrastructure
- Familiarity with LoRA / PEFT fine-tuning
- Exposure to legal or financial NLP use cases
Ideal Candidate: A hands-on engineer who understands how LLM systems run in production, from model loading and tokenization to GPU deployment and scalable APIs.