Build and maintain a Python fleet-tracking system that manages the full lifecycle of servers.
Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting.
Create and maintain metrics, dashboards, and alerting for hardware health across the fleet.
FAL is committed to keeping a large fleet of GPU servers healthy and productive. They offer a collaborative and supportive culture with learning and growth opportunities.
Own performance optimization and reliability of large-scale GPU clusters and InfiniBand networking for HPC workloads.
Diagnose and resolve complex system-level issues across GPU, network, and compute layers, and integrate new hardware components.
Develop automation for monitoring, fault detection, and proactive remediation in distributed compute environments.
Our partner is building a next-generation AI cloud infrastructure environment, focusing on large-scale high-performance computing systems. They foster a highly technical engineering culture with experts across systems, networking, and virtualization, offering career development and continuous learning opportunities.
Develop and maintain automated provisioning pipelines for bare-metal servers across global data centers.
Perform security monitoring, and repair and recover from hardware or software failures.
Act as technical lead, mentor engineers, and report directly to the CTO.
Kayzen is a mobile demand-side platform (DSP) that democratizes programmatic advertising. With 160B+ daily ad requests and 1B+ ads served per day globally, it powers top mobile marketing teams with a focus on performance, transparency, and control.
Lead the development of Ansible playbooks and workflows, integrating AI to supercharge automation processes.
Manage and optimize Red Hat Enterprise Linux environments, ensuring peak performance and security.
Implement best practices for security, patching, and compliance to maintain robust and secure systems.
General Dynamics Mission Systems engineers a diverse portfolio of high-technology solutions, products, and services that enable customers to successfully execute missions across all domains of operation. With a global team of 12,000+ top professionals, they partner with the best in industry to expand the bounds of innovation in the defense and scientific arenas.
Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.
Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.
Design, deploy, and maintain HPC clusters and cloud-based compute environments.
Support scientific workflows and compute-intensive applications in life sciences.
Administer HPC schedulers like SLURM and implement Infrastructure-as-Code with tools such as Terraform.
Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. The company operates as a remote team focused on streamlining recruitment through technology.
Monitor, operate, and support production AI infrastructure platforms.
Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.
Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.
Lead end-to-end technical delivery for client-facing scientific and AI-driven projects.
Design, build, and deploy scalable software systems that extend or wrap scientific models.
Act as a technical liaison between researchers, product teams, and engineering stakeholders.
This role sits at the intersection of advanced AI systems, scientific computing, and real-world drug discovery applications, where engineering directly enables breakthroughs in life sciences. The position requires strong autonomy and the ability to operate in highly ambiguous, research-driven environments where experimentation and execution go hand in hand.
Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
Drive cloud cost efficiency, estimate schedules, and use AI tools as first-class collaborators in daily workflows.
Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.
Design, build, and implement robust infrastructure solutions aligned with business needs and security best practices.
Automate resource deployment, compute and storage allocation, and optimize delivery of key infrastructure services.
Troubleshoot escalated issues, perform root-cause analyses, and drive process improvements using AI and automation.
Hyland is the pioneer of the Content Innovation Cloud™, delivering ubiquitous enterprise intelligence to organizations through solutions that unlock actionable insights and drive automation. Trusted by thousands of organizations worldwide, including many of the Fortune 100, Hyland has grown to nearly 4,000 employees with a culture focused on employee initiatives, wellbeing, and innovation.
Maintain and support core infrastructure with deep Linux expertise.
Design scalable networks using VLANs, routing, VPNs, and UniFi equipment.
Automate provisioning with Ansible, Bash/Python, and MAAS-based workflows.
A European deep-tech company is developing a decentralized, energy-efficient cloud platform using distributed bare-metal infrastructure. It is a startup or hyper-growth environment that values autonomy, speed, and problem-solving.
Act as senior technical authority in application support, ensuring stability, performance, and reliability of enterprise systems.
Partner with technology and business teams to define enhancements, production support strategies, and drive incident management.
Mentor junior analysts, influence operational practices, and improve system resilience in a global financial technology environment.
Jobgether is an AI-powered job matching platform that helps candidates get reviewed quickly and objectively against role requirements. They focus on using technology to connect top-fitting candidates with hiring companies.