Job Summary:
Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. The company is hiring a Senior Platform & Reliability Engineer to own service reliability end-to-end, prevent incidents, and lead recovery efforts when production degrades.
Responsibilities:
• Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.
• Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
• Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
• Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.
• Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
• Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
• Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.
Qualifications:
Required:
• Experience setting and enforcing SLIs/SLOs/error budgets for critical user flows.
• Proven ability to drive failure isolation across API, workers, queues, and dependencies.
• Expertise in defining probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety in Kubernetes.
• Experience with BullMQ/Redis for queue and job safety, including poison pill containment and workload isolation.
• Demonstrated ability to lead Sev1/Sev2 incident response end-to-end.
• Strong skills in observability quality, on-call effectiveness, runbooks, and postmortem discipline.
• Ability to gate risky deploys and enforce reliability guardrails.
Preferred:
• Calm, structured incident commander under pressure.
• Ability to think in failure modes and blast radius by default.
• Pragmatic approach to stabilizing quickly and implementing durable fixes.
• High ownership and strong written communication skills.
Company:
Vizcom builds tools that shorten the distance between having ideas and bringing them to life. Founded in 2021, the company is headquartered in San Francisco, California, US, with a team of 51-200 employees. The company is currently early stage.