Architect and deploy secure, scalable infrastructure using Terraform, CloudFormation, or similar tools.
Ensure the platform meets strict SLA requirements for enterprise clients, minimizing downtime.
Implement comprehensive monitoring, logging, and alerting to provide deep visibility into system health.
Filevine provides cloud-based workflow tools for legal professionals, helping them manage organizations and serve clients. They are recognized as a fast-growing and innovative technology company with a team of passionate professionals.
Design, build, and maintain highly available, scalable infrastructure.
Manage and optimize infrastructure across GCP, AWS, Azure, and other cloud providers.
Develop comprehensive monitoring, logging, and alerting systems.
Bobsled is seeking a Site Reliability Engineer to enhance its data-sharing platform's reliability and scalability. We're a company that values growth, offering flexible work hours in a fully remote environment and fully sponsored individual coaching for all employees.
Architect and maintain self-healing systems with 99.9%+ availability targets.
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
Groupon is a marketplace where customers discover new experiences and services everyday and local businesses thrive. Even with thousands of employees spread across multiple continents, they still maintain a culture that inspires innovation, rewards risk-taking and celebrates success.
Design, implement, and maintain scalable and reliable infrastructure solutions.
Automate deployments and maintain a resilient, secure SaaS application platform.
Develop comprehensive monitoring and alerting solutions, and respond to incidents.
Veeam is the #1 global market leader in data resilience, believing businesses should control all their data whenever and wherever they need it, providing data resilience through data backup, data recovery, data portability, data security, and data intelligence. Based in Seattle, Veeam protects over 550,000 customers worldwide who trust Veeam to keep their businesses running.
Respond to production incidents and contribute to post-incident analysis.
Identify and automate manual processes to improve efficiency and reduce risk.
Enhance monitoring tools and platforms to improve observability.
Restaurant365 is a SaaS company that provides a unique, centralized solution for accounting and back-office operations for restaurants. They focus on empowering team members to produce top-notch results while elevating their skills.
Build Enterprise-Scale Infrastructure leveraging infrastructure-as-code to manage complex cloud environments.
Sustain Platform Health and Performance owning critical systems in production, including reliability and security.
Enable Teams and Customers to Move Faster creating abstractions and tooling that deploy, run, and scale AI/ML workloads.
Cake is on a mission to make cutting-edge AI accessible to enterprise teams. Backed by top investors, Cake is seeing strong adoption and is positioned for rapid growth in the next 12 months, emphasizing ownership, clear communication, and collaboration.
Design, implement, and manage cloud infrastructure using Infrastructure as Code (IaC) tools.
Design, build, and maintain scalable CI/CD pipelines using tools like CircleCI or GitHub Actions.
Implement and maintain observability tooling (Prometheus, Grafana, Datadog), and lead incident response to ensure system reliability.
Engine is transforming business travel into something personalized, rewarding, and simple. More than 20,000 companies already rely on Engine to support over 1 million travelers and billions in annual bookings each year.
Ensure the smooth operation and high availability of Clarifai's core services
Monitor system performance, identify bottlenecks, and implement optimizations to enhance reliability and efficiency
Design and implement scalable, secure, and cost-effective infrastructure solutions
Clarifai is a leading AI platform specializing in computer vision and generative AI, empowering organizations to transform unstructured data into actionable insights. Founded in 2013, they have a diverse, globally distributed team with $100M in funding and are committed to building a diverse and inclusive team.
Make deployments boring (in the best way possible)
Own CI/CD pipelines: optimize build times, improve caching, reduce flakiness
Evolve our Kubernetes (EKS) deployment strategy for reliability and speed
Obvious is building an AI-native workspace, an operating system for work that puts co-intelligence at the center. They are a small and talent-dense team with world-class builders, former founders, and leaders from companies like Netflix, Google, and Meta.
Ensure near-zero downtime with monitoring and alerting, self-healing automation, and continuous improvement
Create highly automated, available and scalable systems by applying software and infrastructure principles
Employ and advise clients on DevOps and SRE principles and practices, covering deployment pipelines, HA, service reliability, technical debt, and operational toil for live services running at scale
66degrees is an AI transformation partner. They guide enterprises from business challenges to quantifiable outcomes, helping businesses reach their inflection point where chaotic data becomes a strategic asset, complexity becomes clarity, and AI becomes an engine for growth. They believe in thriving through challenges and winning together.
Configure and maintain cloud infrastructure automation using Terraform, focusing on CDN optimization and content delivery performance
Develop capacity planning strategies and performance optimization initiatives for high-volume spatial content delivery.
Instrument services to understand system health.
Miris is a cutting-edge technology company building the future of 3D content delivery at global scale. Our mission is to empower creators and developers to deliver high-fidelity, photorealistic 3D experiences to billions of users instantly, seamlessly, and across all major platforms and devices.
Building world-class AI infrastructure to support a 100+ person research team.
Designing and scaling multi-cloud systems that support high-performance model training and inference.
Improving monitoring, alerting and system observability for AI workloads
Canva is redefining how the world experiences design. It has campuses in Sydney and Melbourne, and co-working spaces in other major cities, trusting employees to choose the balance that empowers them and their team to achieve their goals.
Leverage infrastructure as code (Terraform) to build and maintain complex production and analytics workflows including networking and containerized services.
Rapidly diagnose and resolve faults in system services as part of a 24/7 on-call rotation focused on actionable alerting and eliminating toil.
Improve speed of delivery by developing and maintaining CI/CD pipelines.
Linus Health is a Boston-based digital health company transforming brain health worldwide. They combine cutting-edge neuroscience, clinical expertise, and AI to advance early detection and intervention for cognitive and brain disorders, empowering people to live longer, healthier lives. With 100+ team members and growing, they’re entering a phase of accelerated growth and looking for top talent to help shape their future.
Work directly with customers to ensure successful Teleport deployments.
Meet regularly with customers, understand pain points blocking deployments and remove roadblocks.
Work with customers to articulate the problem they are trying to solve, gather requirements, and make the business case to the product and engineering teams to invest in resolving the issue.
Teleport is the Infrastructure Identity Company, modernizing identity, access, and policy for infrastructure, improving engineering velocity and resiliency of critical infrastructure against human factors and/or compromise. They are a fast-growing, well-funded Y-Combinator company that values craft, strongly supports work/life balance, and embraces a culture of humility, honesty, and transparency.
Support and evolve the reliability of platforms used by the AI Research team.
Ensure production services meet expectations for availability, latency, and operational readiness.
Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps.
Algolia is a pioneer and market leader in AI Search, empowering 17,000+ businesses to deliver blazing-fast, predictive search and browse experiences. They have raised $150 million in Series D funding, quadrupling their valuation to $2.25 billion, investing in their market-leading platform.
Contribute to high impact AWS cloud infrastructure initiatives.
Participate in operability and production readiness reviews.
Advocate and implement Site Reliability Engineering practices.
Patreon is a media and community platform where creators give fans access to exclusive work. They have generated over $10 billion for creators and have 25 million+ paid memberships, with a hybrid work model and offices in New York and San Francisco.
Build and operate scalable backend services and internal APIs for the AI platform.
Integrate LLMs and AI tool execution into reliable, production-ready workflows.
Own production reliability for AI platform infrastructure through observability, alerting, and incident response.
MaintainX is the world's leading Asset and Work Intelligence platform for industrial and frontline environments. They are a modern IoT-enabled cloud-based tool for reliability, safety, and operations on physical equipment and facilities, powering operational excellence for 13,000+ businesses. MaintainX recently completed a $150 million Series D round, at a valuation of $2.5 billion.
Automate the provisioning of all of Juniper Square’s infrastructure in code.
Partner with our Platform Engineering team on building developer tooling / improving developer experiences via joint initiatives and enhancements.
Partner with our Data Engineering team on improving our data posture and driving operational excellence.
Juniper Square's mission is to unlock the full potential of private markets by digitizing them to bring efficiency, transparency, and access. They are a values-driven organization with a hybrid workplace strategy, allowing employees to collaborate effectively across multiple countries and offering physical offices in several major cities.
Building monitoring, alerting, logging, and observability from the ground up.
Improving our security posture across auth, IAM, policies, and data access.
Software Mind develops solutions that make an impact for companies around the globe. They build cross-functional engineering teams that take ownership and crave more, embracing openness, acting with respect, showing grit & guts and combining employment with enjoyment.
Design, implement, and maintain high-performance ML training and inference platforms.
Ship tools that allow any ML engineer to deploy a model in minutes, not days.
Improve scalability, reliability, and cost efficiency of model training and serving systems.
Speechify's mission is to make sure that reading is never a barrier to learning. With nearly 200 people around the globe working in a 100% distributed setting, Speechify's team includes frontend and backend engineers, AI research scientists, and others.