Remote Reliability Engineer Jobs in Texas (NOW HIRING)

Senior SRE (Storage Platforms) DETAILS Location: 100% Remote Position Type: 3M C2H Hourly / Salary: to $120W2+ (based on experience) JOB SUMMARY Vaco is currently seeking a Senior SRE (Storage ...

At Blackhawk Network, you'll enjoy the best of both worlds: focused remote work plus in-person ... You will own the reliability, scalability, and observability of a critical data pipeline platform.

Site Reliability Engineer

Windcrest, TX · Remote

$51.50 - $68.25/hr

US REMOTE, WORK FROM HOME. Rackspace Professional Services is helping customers solve business challenges using Kubernetes. RPS is seeking Kubernetes DevOps engineers who can help guide clients ...

Cloud Engineer (Managed Services)

Addison, TX · Remote

$54 - $72/hr

Remote Position Type: 6M C2H Hourly / Salary: BOE! JOB SUMMARY Vaco is currently seeking a Cloud ... While automation, DevOps practices, and reliability engineering are part of the role, it is applied ...

DevOps Engineer

Dallas, TX · On-site +1

$52.25 - $71.50/hr

Leap Event Technology is a remote-friendly company. This position is open to any candidate in North ... You'll help drive modern DevOps practices, ensure system reliability, and scale enterprise-level ...

Remote Reliability Engineer information

What are the key skills and qualifications needed to thrive as a Remote Reliability Engineer, and why are they important?

To thrive as a Remote Reliability Engineer, you need a strong background in systems engineering, software development, and infrastructure management, often supported by a degree in computer science or a related field. Proficiency with cloud platforms (such as AWS, Azure, or GCP), monitoring tools (like Prometheus, Grafana), and relevant certifications (e.g., AWS Certified DevOps Engineer) is highly valuable. Excellent problem-solving, communication, and collaboration skills are crucial for working effectively across distributed teams and responding to incidents. These abilities ensure system reliability, quick incident resolution, and seamless remote teamwork, which are vital for maintaining high service uptime and user satisfaction.

How do Remote Reliability Engineers typically collaborate with on-site teams to address urgent technical issues?

Remote Reliability Engineers often use a combination of video conferencing, instant messaging, and collaborative monitoring tools to stay closely connected with on-site teams. When urgent technical issues arise, they participate in real-time troubleshooting sessions, analyze system logs remotely, and may guide on-site staff through step-by-step resolution procedures. Strong communication channels and regular check-ins are essential for swift and effective collaboration, even across different time zones. This structure allows Remote Reliability Engineers to contribute significantly to system uptime while working from a distance.

What is a Remote Reliability Engineer?

A Remote Reliability Engineer is a professional who works from a remote location to ensure that systems, applications, or infrastructure are reliable, available, and performing well. Their responsibilities typically include monitoring system health, diagnosing issues, implementing preventative measures, and collaborating with teams to improve system reliability. They often use tools for automation, incident response, and performance monitoring, all while working offsite. This role is critical in minimizing downtime and ensuring a smooth user experience, especially for companies with complex technical environments. Remote Reliability Engineers must have strong problem-solving skills and be proficient in cloud technologies, automation, and incident management.

What is the difference between a Remote Reliability Engineer and a Remote Site Reliability Engineer?

Aspect | Remote Reliability Engineer | Remote Site Reliability Engineer
Credentials | Typically requires certifications like AWS Certified Solutions Architect, Linux Foundation certifications | Similar credentials, often with additional focus on site-specific tools and monitoring
Work Environment | Primarily remote, focusing on cloud infrastructure and system reliability | Remote with some on-site responsibilities, focusing on infrastructure and operational stability
Industry Usage | Used across tech, cloud providers, SaaS companies | Common in data centers, cloud providers, and large enterprise IT
Search & Comparison Intent | Often compared due to overlapping roles in system reliability and cloud infrastructure | Compared for on-site vs remote operational responsibilities

The main difference is that Remote Reliability Engineers focus on cloud and system reliability remotely, while Remote Site Reliability Engineers may have some on-site duties related to infrastructure. Both roles require similar skills and certifications but differ in their work environment and specific responsibilities.

Infographic: Remote Reliability Engineer job openings in Texas as of May 2026. Employment types: 88% Full Time, 8% Part Time, 3% Contract, 1% Nights. Work arrangement: 81% Physical, 7% Hybrid, 12% Remote.

Future Openings - SRE Support Engineer - Observability

Virtasant

Austin, TX • On-site, Remote

$56.50 - $75/hr

Full-time

Posted 16 days ago


Job description

SRE Support Engineer - Observability
While this position is not currently open, we are interviewing strong candidates for upcoming opportunities on this team.
Location: Remote | Time Zone: (US, Canada, Brazil, Chile, Colombia, Mexico) (8AM-5PM Pacific)
Freedom to grow. Power to deliver.
Virtasant is a global technology services company delivering large-scale cloud, data, and engineering solutions across 130+ countries. We partner with some of the world's largest organizations to help them build, operate, and scale internal platforms used by tens of thousands of engineers.
For this role, you will be supporting one of the most advanced internal developer platforms in the world, powering products used by hundreds of millions of people. The problems you will solve are deep, complex, and essential to keeping a global-scale organization moving.
Role Overview
The Observability & Tools Support Engineer provides high-impact technical support for customers of a large technology company's internal IaaS platform, with a focus on monitoring, alerting, telemetry, and operational tooling.
This role spans a wide range of support, from white-glove onboarding and end-to-end customer enablement to deep technical troubleshooting across Linux, networking, and observability systems (especially Prometheus and AlertManager). You will also contribute to improving the support function itself: strengthening tooling, documentation, workflows, and feedback loops so the service scales.
Success depends on excellent troubleshooting, strong written communication, comfort working with highly technical customers, and the maturity to identify patterns and drive operational improvements beyond individual ticket resolution.
Business Outcome
Become a trusted frontline expert for the customer's observability ecosystem and operational tooling, delivering fast, accurate support across Slack and tickets, improving monitoring reliability, and reducing incident impact through better triage, troubleshooting, onboarding, and knowledge capture.
Success Measures
  • Healthy volume of threads and tickets handled with high-quality outcomes
  • Consistent achievement of time-based SLAs
  • High customer satisfaction through surveys
  • Accurate classification of issue type, severity, and recurring patterns
  • Reduced repeat issues through better docs, tooling, and scalable onboarding
What Will Be True When You Succeed
  • Customers can onboard smoothly to monitoring/alerting with minimal friction
  • Monitoring and alerting issues are resolved quickly, with fewer escalations
  • Linux and networking-related incidents reach resolution faster due to strong troubleshooting and clean handoffs
  • Engineering and SRE teams receive clear, actionable feedback based on real customer trends
  • Knowledge base content prevents tickets and accelerates self-service
Core Work Units
1) Frontline Support for Observability & Tooling
  • Manage Slack threads and tickets (roughly 50/50)
  • Handle a broad range of customer support, from simple issue resolution to end-to-end onboarding
  • Provide clear, structured guidance to highly technical customers
  • Maintain strong attention to detail while managing multiple interactions in parallel

2) Deep-Dive Troubleshooting & Incident Support
  • Troubleshoot, isolate, and resolve monitoring and alerting issues (especially Prometheus + AlertManager)
  • Troubleshoot complex Linux and networking issues (TCP/IP fundamentals required)
  • Support OpenTelemetry, tracing, and telemetry pipelines, including investigation of gaps in signals and instrumentation
  • Drive incidents to resolution in partnership with Engineering/SRE teams

3) Documentation & Knowledge Development
  • Build and maintain customer-facing and internal knowledge base articles
  • Create informational posts for the community support platform
  • Turn repeated issues into reusable guides, checklists, and onboarding playbooks

4) Trend Analysis & Feedback to Engineering
  • Analyze and categorize customer interaction trends
  • Provide accurate, meaningful feedback to Engineering and SRE orgs to improve product/tooling
  • Identify "top offenders" and propose practical fixes (tooling, docs, process, product)

5) Operational Excellence & Continuous Improvement
  • Participate in post-mortem reviews and drive follow-through on improvements
  • Contribute meaningfully to team objectives and goals (process, tooling, and service scaling)
  • Bring creativity and discretion to resolve highly complex issues "outside the box"

High-Quality Work - what top performance looks like
Frontline Support
  • Moves smoothly from triage to deeper analysis without losing the customer
  • Communicates clearly and confidently with technical users
  • Maintains clean follow-ups and thread hygiene even with high context switching

Troubleshooting
  • Rapidly isolates issues across monitoring/alerting configs, Linux runtime behavior, and network connectivity
  • Uses structured approaches to incident handling: hypothesis → test → evidence → resolution
  • Produces high-signal writeups that accelerate downstream resolution

Documentation & Enablement
  • Documentation is clear enough that customers avoid opening tickets
  • Onboarding flows reduce time-to-value and prevent common misconfigurations
  • Captures "tribal knowledge" quickly and makes it reusable

Operational Excellence
  • Obsesses over details: correct severity, accurate tagging, clean timelines, strong handoffs
  • Spots patterns early and proactively proposes improvements that scale support

Typical Day / Work Patterns
  • ~50% Slack support, ~50% ticket handling
  • Deep-dive investigations during lower ticket volume periods
  • Documentation writing and lightweight tooling/process improvements when patterns emerge
  • Weekly team review of escalations, themes, and operational improvements
  • High rate of context switching and parallel issue management
Required Skills & Experience (Non-Negotiable)
  • Several years supporting highly scalable applications and web services
  • Hands-on experience with open-source observability and cloud-native tooling, including:
    • Kubernetes (and container fundamentals)
    • Prometheus and AlertManager troubleshooting
    • OpenTelemetry and distributed tracing concepts
  • Strong understanding of the Linux operating system (command line, process/network debugging, logs)
  • Good understanding of infrastructure observability principles (signals, alerting strategy, SLO thinking, noise reduction)
  • Good understanding of the TCP/IP suite and practical networking troubleshooting
  • Strong experience troubleshooting ambiguous, multi-layer issues
  • Excellent analytical capability and strong attention to detail
  • Strong written and verbal communication (clear, structured, customer-friendly)
  • Comfortable working with a very technical customer base
  • Passion for Technical Support and a service mindset
Nice-to-Haves
  • Experience improving or supporting internal support tooling or workflows (automation, templates, runbooks)
  • Experience operating at scale in a services environment (pattern detection, KPI/SLA awareness, operational process maturity)
  • Familiarity with Grafana, log aggregation, incident tooling, and production support practices
  • Prior SRE or platform support experience

Minimum Qualifications
  • 3-7+ years in Technical Support Engineering, SRE support, DevOps, Platform Support, or similar
  • Demonstrated experience supporting distributed systems, IaaS, or cloud platforms
  • Strong Linux, troubleshooting, and customer-facing communication background
  • Evidence of documentation, knowledge-base contributions, and process improvement mindset

Disqualifiers: weak Linux fundamentals, inability to troubleshoot systematically, poor written communication, or discomfort supporting highly technical users.
What You'll Love
  • Real technical problem solving with tangible customer impact
  • A role that blends deep troubleshooting with scaling support via docs, tooling, and process
  • High autonomy in a remote-first environment

What May Be Challenging
  • High context switching and managing multiple threads in parallel
  • Repeated patterns that require discipline to convert pain into scalable improvements
  • Supporting high-visibility systems where speed and accuracy matter

Differentiation
Industry: Remote-first, trust-based culture; global team; autonomy; modern systems; meaningful technical challenges
Internal: High-impact, customer-facing observability support; direct influence on tooling and process maturity; opportunity to shape scalable support practices
Our team: Technology Operations
Locations: Brazil, Chile, Colombia, Mexico, Canada, USA
Remote status: Fully Remote