Job Summary:
Bot Auto is revolutionizing the transportation of goods with autonomous trucks, aiming to enhance community quality of life. The company is seeking a highly skilled Senior Software Engineer to architect and operate workflow orchestration platforms that support a range of engineering and autonomy workloads.
Responsibilities:
• Architect, deploy, and operate workflow orchestration platforms (e.g., Argo Workflows, Airflow, or a hybrid) supporting simulation, machine learning and model training, data pipelines, CI/CD, and other general-purpose workloads.
• Build internal platforms, abstractions, SDKs, and self-service tooling on top of orchestration engines to make authoring, running, and monitoring workflows simple and reliable for engineers.
• Operate workflow platforms at scale on Kubernetes across cloud (AWS) and on-prem data center environments, handling scheduling, autoscaling, GPU and heterogeneous resources, and cross-cluster orchestration.
• Ensure reliability, performance, and cost efficiency of workloads through observability, queuing and prioritization, retries, and resource optimization.
• Partner with ML, simulation, data, and infrastructure teams to understand workload requirements and deliver fit-for-purpose pipelines.
• Integrate workflow platforms with storage, data streaming and event systems, artifact and model registries, and CI/CD tooling.
• Establish best practices, templates, and documentation for workflow authoring and operations; mentor engineers across the company.
• Handle user-impacting issues promptly with clear communication: mitigate in the short term and follow up with durable long-term solutions.
Qualifications:
Required:
• Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
• 5+ years of hands-on experience in platform engineering, infrastructure, DevOps, or SRE roles
• Significant experience with workflow orchestration platforms such as Argo Workflows, Airflow, or comparable systems
• Strong software development skills in one or more languages: Python, Go, Java, or JavaScript/TypeScript
• Solid understanding of Kubernetes and distributed systems
Preferred:
• Expert-level experience operating Argo Workflows, Airflow, and/or other engines (e.g., Prefect, Dagster, Temporal, Kubeflow Pipelines, Flyte)
• Experience orchestrating ML training, simulation, or large-scale data and batch workloads, including GPU scheduling
• In-depth Kubernetes experience (EKS, GKE, AKS, RKE2/Rancher) and cross-cluster orchestration
• Proficiency with infrastructure-as-code (IaC) tools such as Terraform, Pulumi, OpenTofu, or Ansible
• Experience with data streaming and event platforms such as NATS JetStream, Kafka, Pulsar, or RabbitMQ
• Familiarity with observability stacks: Prometheus, Grafana, Loki, OpenTelemetry, or comparable
• Demonstrated ability to optimize workload cost and performance without compromising reliability
Company:
Transforming American Transportation with Autonomous Trucks. Founded in 2023, the company is headquartered in Houston, USA, with a team of 51-200 employees. The company is currently in the growth stage.