Job Title: Site Reliability Engineer (SRE) Lead, Messaging Services
Location: Irving, TX
Duration: Long-term
Job Description
We are seeking an experienced Site Reliability Engineer (SRE) Lead for Messaging Services to drive platform reliability, observability, and operational excellence across enterprise IBM MQ and Kafka environments. The role focuses on the high availability, scalability, security, and resilience of critical messaging platforms that support large-scale, real-time business applications.
Key Responsibilities
- Lead reliability engineering initiatives for large-scale IBM MQ and Kafka messaging platforms.
- Manage and support high-volume messaging environments with thousands of runtimes and mission-critical workloads.
- Drive platform stability through patching, upgrades, vulnerability remediation, and end-of-life (EOL) modernization efforts.
- Establish and maintain SRE best practices, including service level indicators (SLIs), service level objectives (SLOs), error budgets, incident response, and postmortem reviews.
- Enhance observability and monitoring capabilities for message flow health, queue depth, consumer lag, throughput, and latency.
- Design and implement proactive fault detection, automated recovery, and self-healing solutions.
- Support high-availability architectures, disaster recovery strategies, failover mechanisms, and capacity planning.
- Partner with application, infrastructure, and security teams to ensure secure, reliable, and scalable messaging services.
- Participate in production support, on-call rotations, and critical incident management.
Required Skills & Experience
- Strong experience in Site Reliability Engineering (SRE), Production Engineering, or Platform Engineering.
- Hands-on expertise with IBM MQ including:
  - Queue Managers
  - MQ Clustering
  - Channels
  - Dead Letter Queue (DLQ) Management
  - Performance Tuning and Troubleshooting
- Strong experience with Kafka / Confluent Platform including:
  - Topics
  - Brokers
  - Partitions
  - Consumer Groups
  - Cluster Administration
- Experience implementing monitoring, observability, and alerting solutions.
- Knowledge of incident management, root cause analysis, and operational excellence practices.
- Experience with Linux/Unix environments.
- Scripting and automation experience using Shell, Python, or similar technologies.
- Understanding of messaging security, including SSL/TLS, authentication, and vulnerability remediation.