Be a key contributor on an Agile development team, collaboratively realizing business value through an iterative software development lifecycle.
Build and execute the monitoring strategy for ScienceLogic SaaS infrastructure.
Define, deploy, and maintain system and service monitors.
ScienceLogic is a leader in IT Operations Management, giving modern IT operations actionable insights for faster problem resolution and prediction. They see everything across cloud and distributed architectures, contextualize data through relationship mapping, and act on this insight through integration and automation.
Lead incident response as Incident Commander, coordinating teams, communications, and service restoration
Produce executive-level incident reports, run RCAs, and drive continuous improvement
Enforce change management and risk assessment for production changes
Truelogic is a leading provider of nearshore staff augmentation services headquartered in New York, delivering top-tier technology solutions to companies of all sizes. Their team of 600+ highly skilled tech professionals, based in Latin America, drives digital disruption by partnering with U.S. companies on their most impactful projects.
Ensure near-zero downtime with monitoring and alerting, self-healing automation, and continuous improvement
Create highly automated, available and scalable systems by applying software and infrastructure principles
Employ and advise clients on DevOps and SRE principles and practices, covering deployment pipelines, HA, service reliability, technical debt, and operational toil for live services running at scale
66degrees is an AI transformation partner. They guide enterprises from business challenges to quantifiable outcomes, helping businesses reach their inflection point where chaotic data becomes a strategic asset, complexity becomes clarity, and AI becomes an engine for growth. They believe in thriving through challenges and winning together.
Lead reliability-focused design and readiness reviews.
Build, operate, and continuously improve our observability stack.
Own and evolve incident management practices.
Transcend is building the privacy platform that easily embeds privacy into your entire tech stack. They are growing quickly, are backed by top-tier investors, and are proud to serve some of the world's most iconic brands.
Operate and maintain large-scale data systems, ensuring stability and performance.
Design, implement, and optimize deployment processes using virtualization.
Monitor system health, analyze failures, and identify instability sources.
Jobgether is a platform that uses AI-powered matching to connect candidates with companies. They ensure applications are reviewed quickly, objectively, and fairly, then share a shortlist of top candidates directly with the hiring company.
Provide support for MEMX exchange platforms, including on-call coverage, incident response, and issue triage
Help isolate and resolve unplanned system outages
Enhance monitoring and alerting based on symptoms
MEMX is building a next-generation exchange that will bring greater competition, transparency, and efficiency to equity trading. We offer competitive employee benefits and perks and will continue to make this a priority to attract the best.
Respond to production incidents and contribute to post-incident analysis.
Identify and automate manual processes to improve efficiency and reduce risk.
Enhance monitoring tools and platforms to improve observability.
Restaurant365 is a SaaS company that provides a unique, centralized solution for accounting and back-office operations for restaurants. They focus on empowering team members to produce top-notch results while elevating their skills.
Automate the provisioning of all of Juniper Square’s infrastructure in code.
Partner with our Platform Engineering team on building developer tooling and improving developer experiences via joint initiatives and enhancements.
Partner with our Data Engineering team on improving our data posture and driving operational excellence.
Juniper Square's mission is to unlock the full potential of private markets by digitizing them to bring efficiency, transparency, and access. They are a values-driven organization with a hybrid workplace strategy, allowing employees to collaborate effectively across multiple countries and offering physical offices in several major cities.
Work directly with customers to ensure successful Teleport deployments.
Meet regularly with customers, understand pain points blocking deployments and remove roadblocks.
Work with customers to articulate the problem they are trying to solve, gather requirements, and make the business case to the product and engineering teams to invest in resolving the issue.
Teleport is the Infrastructure Identity Company, modernizing identity, access, and policy for infrastructure, improving engineering velocity and resiliency of critical infrastructure against human factors and/or compromise. They are a fast-growing, well-funded Y Combinator company that values craft, strongly supports work/life balance, and embraces a culture of humility, honesty, and transparency.
Design, build, and maintain highly available, scalable infrastructure.
Manage and optimize infrastructure across GCP, AWS, Azure, and other cloud providers.
Develop comprehensive monitoring, logging, and alerting systems.
Bobsled is seeking a Site Reliability Engineer to enhance its data-sharing platform's reliability and scalability. We're a company that values growth, offering flexible work hours in a fully remote environment and fully sponsored individual coaching for all employees.
Design, implement, and manage cloud infrastructure using Infrastructure as Code (IaC) tools.
Design, build, and maintain scalable CI/CD pipelines using tools like CircleCI or GitHub Actions.
Implement and maintain observability tooling (Prometheus, Grafana, Datadog), and lead incident response to ensure system reliability.
Engine is transforming business travel into something personalized, rewarding, and simple. More than 20,000 companies already rely on Engine to support over 1 million travelers and billions in bookings each year.
Contribute to high impact AWS cloud infrastructure initiatives.
Participate in operability and production readiness reviews.
Advocate and implement Site Reliability Engineering practices.
Patreon is a media and community platform where creators give fans access to exclusive work. They have generated over $10 billion for creators and have 25 million+ paid memberships, with a hybrid work model and offices in New York and San Francisco.
Act as first responder and incident commander during production incidents
Improve reliability and uptime across all Wormhole services
Harden infrastructure for security and operational resiliency
Wormhole Foundation empowers passionate people in the research and development of blockchain interoperability technologies. They support teams building secure, open-source, and decentralized products within the Wormhole ecosystem.
Own and operate core platform systems across AWS, GCP, Vercel, GitHub, and Cloudflare.
Improve reliability, scalability, and security of production and non-production environments.
Improve local development environments and onboarding experience for engineers.
Moxie empowers ambitious aesthetic entrepreneurs to build profitable, independent practices. A global, remote-first team of more than 140 people supports hundreds of practices nationwide as they unlock sustainable success for aesthetic entrepreneurs.
Manage and troubleshoot Linux-based systems in production and non-production environments.
Improve infrastructure automation, monitoring, and operational processes.
Assist with incident response, root cause analysis, and continuous improvement.
Higher Logic provides online communities and communication tools to help organizations build, retain, and grow their member or customer base. They are a global company with offices throughout the US, Canada, and Australia, serving more than 3,000 customers.
Leverage infrastructure as code (Terraform) to build and maintain complex production and analytics workflows including networking and containerized services.
Rapidly diagnose and resolve faults in system services as part of a 24/7 on-call rotation focused on actionable alerting and eliminating toil.
Improve speed of delivery by developing and maintaining CI/CD pipelines.
Linus Health is a Boston-based digital health company transforming brain health worldwide. They combine cutting-edge neuroscience, clinical expertise, and AI to advance early detection and intervention for cognitive and brain disorders, empowering people to live longer, healthier lives. With 100+ team members and growing, they’re entering a phase of accelerated growth and looking for top talent to help shape their future.
Work with your team to build and roll out new features, then use the results to iterate and improve.
Drive projects from initial ideation all the way to operations once they are in the hands of customers.
Maintain critical systems, and own their reliability, performance, and availability.
Grafana Labs is a remote-first, open-source powerhouse with over 20M users. They provide observability strategies for over 3,000 companies, featuring scalable metrics, logs, and traces, and thrive in an innovation-driven environment with transparency, autonomy, and trust.
Own and maintain the incident response process, including defining procedures, tools, and best practices
Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems
Lead capacity planning initiatives, focusing on both short- and long-term scalability while optimizing costs
Underdog makes sports more fun by building the best products for sports fans. They are a fast-growing sports company valued at $1.3B with a focus on a seamless, simple, easy-to-use, intuitive, and fun app.
Design, implement, and maintain scalable and reliable infrastructure solutions.
Automate deployments and maintain a resilient, secure SaaS application platform.
Develop comprehensive monitoring and alerting solutions, and respond to incidents.
Veeam is the #1 global market leader in data resilience. They believe businesses should control all their data whenever and wherever they need it, and they provide data resilience through data backup, data recovery, data portability, data security, and data intelligence. Based in Seattle, Veeam protects over 550,000 customers worldwide who trust Veeam to keep their businesses running.
Ensure the smooth operation and high availability of Clarifai's core services
Monitor system performance, identify bottlenecks, and implement optimizations to enhance reliability and efficiency
Design and implement scalable, secure, and cost-effective infrastructure solutions
Clarifai is a leading AI platform specializing in computer vision and generative AI, empowering organizations to transform unstructured data into actionable insights. Founded in 2013, they have a globally distributed team and $100M in funding, and are committed to building a diverse and inclusive team.