Site Reliability Engineer
- Expired: July 22, 2021. Applications are no longer accepted.
StackPath is a platform of secure Internet services built at the cloud's edge. StackPath services enable developers to build protection and performance into any cloud-based solution—from apps, to games, web sites, and beyond—without needing cloud security and delivery expertise of their own. More than 800,000 customers already use StackPath services, ranging from early-stage enterprises to Fortune 100 organizations. Headquartered in Dallas, Texas, StackPath has offices across the U.S. and around the world.
About The Role
The StackPath Site Reliability Engineering (SRE) team combines software, systems and network engineering to deploy and run a portfolio of high-performance edge services including CDN, WAF and Compute. SRE’s daily focus is on the availability, change velocity, performance and capacity of customer-facing services and supporting internal systems.
On the SRE team you will have the opportunity to apply your experience against systems at scale – where a single week can involve shifting terabits of traffic between sites, deploying configuration changes to shave milliseconds off billions of requests, or enabling a new software feature on thousands of systems using automated tooling you designed and built.
This role reports to the VP of Site Reliability Engineering.
Essential Duties And Responsibilities
- Respond to incidents during on-call duty
- Respond to complex customer escalations, which often cross system, network and software boundaries
- Design, develop and maintain internal service metrics (SLA, SLO, SLI) in cross-team collaborations
- Design, develop and maintain dashboards, tooling, alarms and playbooks in collaboration with operations teams to support service-level objectives
- Design, develop and maintain reusable monitoring and canary infrastructure
- Design, execute and evaluate performance experiments
- Collaborate with development teams to complete production readiness checklists prior to major feature launches
- Collaborate with operations and engineering teams in determining root cause of major incidents, performance anomalies, or other customer-impacting issues
Qualifications
- Experience with monitoring and alerting platforms (Prometheus and Alertmanager, Grafana, Zabbix, Nagios)
- Experience with a Linux server environment
- Experience with scripting languages (Python, Ruby, Perl)
- Experience with systems programming languages (Go, C)
- Experience with configuration management systems (Puppet, Ansible, Chef)
- Expert-level proficiency in systems, network or software engineering
- Excited about working on a remote-first engineering team
- Proficient at troubleshooting complex systems
- Production experience in a service provider environment
- Comfortable with a software engineering workflow for collaboration and configuration management — branches, pull requests, merges, conflicts
- Product launches
- Software and platform feature releases
- Live streaming event planning and execution
- Network reach and capability expansion
- Network and system automation tooling development
- Telemetry and monitoring system development
- Defining service metrics (SLA, SLO, SLI) during new product development
StackPath is an Equal Opportunity Employer. EOE/AA M/F/D/V
If your experience and qualifications match our current needs, a member of our human resources team will contact you. We look forward to hearing from you.
StackPath collects and processes personal data submitted by job applicants in accordance with our