Job Summary:
Oak Ridge National Laboratory (ORNL) is seeking a Senior Research Software Engineer to join the Incident Modeling and Computational Sciences Group. In this role, you will serve as a technical leader for AI and data infrastructure development, mentor junior staff, and collaborate with multidisciplinary teams to enhance ORNL's AI capabilities.
Responsibilities:
• Serve as a senior technical leader responsible for the architecture, development, and sustained operation of enterprise AI and data infrastructure.
• Drive technical decisions, mentor junior staff, and partner with multidisciplinary teams of scientists, data engineers, and system administrators to deliver reliable, secure, and high-performance AI services to ORNL researchers.
Qualifications:
Required:
• A PhD in computer science, software engineering, or a related technical field and a minimum of 8 years of relevant experience, or an MS in these areas with a minimum of 12 years of relevant experience.
• Demonstrated experience designing, deploying, and operating complex software systems or AI/ML infrastructure in a research, national security, or comparable production environment.
• Experience leading or making significant technical contributions to multi-component software projects, including ownership of architecture decisions and delivery of results to stakeholders.
• Experience deploying and managing containerized applications using Docker and Docker Compose or equivalent technologies in multi-service environments.
• Demonstrated proficiency in Python and at least one additional language (e.g., JavaScript, Bash, C++).
• Experience with Linux shell scripting and working in HPC or GPU cluster environments.
• Experience presenting technical work to diverse audiences, including both technical peers and non-specialist stakeholders.
Preferred:
• Deep expertise deploying and operating LLM inference infrastructure, including serving frameworks such as vLLM, Ollama, or comparable tools, and model routing or proxy solutions such as LiteLLM.
• Experience architecting or administering vector databases (e.g., Milvus, ChromaDB, or similar) and RAG pipelines at scale.
• Expertise in reverse proxy and web infrastructure, including Nginx configuration, TLS/mTLS certificate management, WebSocket proxying, and authentication subrequest patterns.
• Experience designing and operating observability stacks using OpenTelemetry, Prometheus, Grafana, Loki, Tempo, or equivalent tooling.
• Experience maintaining security-sensitive forks of open-source projects, including upstream merge management, CVE triage, patch backporting, and coordinated disclosure workflows.
• Familiarity with JavaScript or TypeScript and component-based frontend frameworks such as Svelte or React.
• Demonstrated experience mentoring junior engineers or leading multidisciplinary technical teams.
• Experience contributing to research proposals, white papers, or program development activities with federal sponsors or comparable R&D organizations.
• Experience working with DOE National Laboratories or other federal research institutions.
• Excellent written and oral communication skills.
• Ability to function well in a fast-paced research environment, set priorities to accomplish multiple tasks within deadlines, and adapt to ever-changing needs.
Company:
Oak Ridge National Laboratory holds a range of R&D assignments, from fundamental nuclear physics to applied R&D on advanced energy systems. Founded in 1943, the laboratory is headquartered in Oak Ridge, Tennessee, USA, and has between 5,001 and 10,000 employees.