Job Description
The role focuses on making Glydways' centralized planning system highly reliable, available, and restart-safe in real-world operations. Responsibilities include:
- Owning the reliability, availability, and failover behavior of the centralized planning system in production. Designing and implementing leader election, health checks, and heartbeat protocols.
- Defining and building state continuity mechanisms so backup instances can take over from recent state instead of cold-starting. Extending and refining recovery behaviors so that the system reaches a safe state first.
- Expanding and maintaining observability: logs, metrics, traces, dashboards, and alerts for key service indicators. Hardening configuration, pipelines, and deployments for the system and related services.
- Designing and maintaining automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests. Applying safety-critical, requirements-driven reasoning to functional changes.
- Collaborating with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones.
About Glydways
Glydways is reimagining what public transit can be, guided by the belief that mobility connects people to housing, education, employment, commerce, and care.