SRE Key Responsibilities
• Design and manage multi-account AWS infrastructure (VPC, Route Tables, EC2, ECS, EKS 1.33, RDS, DynamoDB, ElastiCache (Valkey), S3, Transit Gateway, Resource Access Manager, Lambda, CloudFormation, AWS Backup)
• Configure load balancing and traffic management (ELB, NLB, Target Groups with gRPC, Route 53, Global Accelerator, CloudFront)
• Implement security and compliance controls (IAM, IAM Identity Center, SCPs, GuardDuty, WAF, CloudTrail, ACM, Secrets Manager, Okta integration)
• Manage Cloudflare infrastructure (Zero Trust, Argo Smart Routing, DNS, Workers, Load Balancer, Bot Management, WAF, Rules & Policies, Cache)
• Manage S3 access policies, lifecycle policies, S3 Storage Lens optimization, and cross-region replication
• Operate messaging and notification services (SNS, SES, SQS)
• Architect and manage multi-cluster EKS environments with HA and cross-region DR scenarios using Istio service mesh, Network Policies, Karpenter, HPA, KEDA, Argo CD, and Argo Rollouts
• Implement and maintain Argo CD for multi-cluster application management with HA and cross-region DR configurations
• Configure Argo CD ApplicationSets for managing applications across multiple EKS clusters
• Implement ECR with global cross-region replication for container image distribution and disaster recovery
• Implement Aurora Global Database for cross-region DR; manage Aurora (MySQL- and PostgreSQL-compatible) clusters and standalone RDS for MySQL/PostgreSQL instances for development
• Design and maintain RDS cross-region replication, automated backups, failover strategies, and upgrade procedures
• Establish and maintain DevOps practices, including change management, release management, and deployment strategies
• Build resilient CI/CD pipelines with cross-region artifact replication, automated testing, and failover capabilities
• Develop and maintain shared internal GitHub Actions workflows and reusable actions for standardized deployments
• Implement change approval workflows, deployment gates, and release coordination processes
• Implement Crossplane for automated feature environment creation, upgrades, and AWS resource provisioning
• Deploy applications using Helm, Kustomize (overlays and patches), Jsonnet, and Crossplane for infrastructure orchestration
• Maintain platform operators (External DNS, External Secrets, Reloader) and custom CRDs
• Build a comprehensive observability stack and dashboards (Grafana, Thanos/Prometheus, Loki, Alertmanager, OpenTelemetry, Grafana Alloy, Tempo, Beyla, Pyroscope)
• Configure exporters (Blackbox, MySQL, Redis, YACE CloudWatch, Cloudflare, Node Exporter, Prometheus Pushgateway)
• Support data platforms (Kafka/Kafka UI, Minion, Airflow, JupyterHub, Dask, Superset, Imply, AWS Glue, Athena, QuickSight, Bedrock)
• Optimize CI/CD with GitHub Actions, Actions Runner Controller (ARC), runs-on.com, GitHub Rulesets
• Manage mobile app delivery pipelines (Unity Build Management, Fastlane, Google Play Developer, Apple Developer/Enterprise, Applivery)
• Implement and maintain all infrastructure using Terraform/OpenTofu with Scalr, importing existing manually created resources into code
• Automate operational tasks wherever possible; create comprehensive runbooks for non-automatable procedures
• Conduct thorough post-mortem analysis after incidents, documenting learnings and implementing preventive measures
• Drive cost optimization initiatives using S3 Storage Lens, CloudWatch metrics, rightsizing recommendations, and resource lifecycle management
• Develop automation in Bash, Python, Go, C#/.NET (Unity Game Engine)
• Maintain developer experience tooling (Backstage, ClickUp, Miro, shared GitHub Actions workflows)
• Integrate monitoring and alerting (PagerDuty, Cronitor, Wiz, CloudWatch)
Core Expertise:
• Multi-account AWS architecture with Transit Gateway, Resource Access Manager, VPC design, and Route Tables
• Kubernetes/EKS high availability with cross-region disaster recovery scenarios
• Multi-cluster EKS management with service mesh (Istio), autoscaling (Karpenter, KEDA), GitOps (Argo CD)
• Argo CD enterprise deployment for multi-cluster application management with HA and cross-region DR
• Argo CD ApplicationSets, app-of-apps patterns with Helm, and cluster management strategies
• ECR global cross-region replication strategies for container image distribution and DR
• Cloudflare enterprise features (Zero Trust, Argo Smart Routing, DNS management, Workers, Load Balancer, Bot Management, cache optimization, WAF rules, and other security policies)
• Aurora Global Database implementation and management for cross-region DR
• Aurora (MySQL- and PostgreSQL-compatible) and standalone RDS for MySQL/PostgreSQL instance management
• RDS cross-region replication, automated failover, disaster recovery, and version upgrade strategies
• DevOps best practices including change management, release management, and deployment coordination
• Resilient CI/CD pipelines with automated testing, cross-region artifact distribution, and failover
• Development of shared GitHub Actions workflows and reusable actions for internal use
• Crossplane for Kubernetes-native infrastructure provisioning, feature environment automation, and upgrade orchestration
• Expert-level Terraform/OpenTofu with enterprise policy management (Scalr)
• Infrastructure backporting and migration from ClickOps to IaC
• Complete observability stack (Prometheus, Grafana, Loki, OpenTelemetry, distributed tracing)
• Data pipeline orchestration (Kafka, Airflow) and analytics platforms (Superset, Imply)
• GitHub Actions with self-hosted runners (ARC, runs-on.com)
• Proficiency in Python, Bash, Go, and C#/.NET for automation development
• Security implementations (IAM, SCPs, Okta, WAF, GuardDuty, Wiz)
• Mobile CI/CD (Unity, Fastlane, Apple/Google distribution, and Applivery for development builds)
• Disaster recovery planning, testing, and automation (AWS Backup, cross-region strategies)
• AI/ML infrastructure experience (AWS Bedrock)
• Cost optimization strategies and QuickSight dashboards for AWS cost review
• Post-mortem facilitation and blameless incident analysis
• Runbook creation and maintenance for operational procedures
Technical Skills:
• Container orchestration with advanced networking and progressive delivery
• Infrastructure as Code and GitOps methodologies with automation-first mindset
• Change management workflows, approval gates, and release orchestration
• CI/CD pipeline design with automated testing, security scanning, and deployment strategies
• Incident response, on-call management, post-mortem analysis, DR execution
• Crossplane composition design and custom resource definitions
• Custom CRD and operator development in Kubernetes
• Event-driven architecture (Lambda, SQS, SNS, SES)
• Real-time analytics and BI platforms
• Developer portal management (Backstage)
• Multi-region failover automation and orchestration
• Cost analysis and optimization using native AWS tools
• Automation of repetitive operational tasks
• Technical documentation and runbook authoring
• Database performance tuning and optimization (Aurora, MySQL, PostgreSQL)
• Argo CD backup, restore, and disaster recovery procedures
• Cloudflare Workers development & deployment using Wrangler
Soft Skills: Strong troubleshooting, cross-functional communication, self-directed, documentation-focused, cost-conscious, continuous improvement mindset