Source Job

You will design, build, and maintain observability platform tools and frameworks, including systems that monitor and analyze the performance and health of software applications and infrastructure. You will collaborate closely with development, site reliability engineering, DevOps, and infrastructure teams.

Python Java Splunk Grafana Docker

12 jobs similar to Lead Software Engineer - Observability

Jobs ranked by similarity.

  • Automate manual processes to provide efficiencies in managing services.
  • Create visual representations of systems and services using Grafana dashboards.
  • Collaborate with engineering teams to align development efforts with reliability, scalability, and business objectives.

Vultr is on a mission to make high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators around the world.

Europe

Act as a trusted technical advisor to enterprise customers, bridging the gap between product and customer outcomes. Design, demonstrate, and validate Dash0’s technical capabilities in real-world environments through Proofs of Concept (POCs). Partner with sales and product teams to guide observability architecture discussions and ensure customers realize the full technical value of Dash0.

Dash0 is building a delightful, simple, and AI-centric platform that eliminates vendor lock-in and meaningless toil for observability.

US

  • Designs, implements, and continuously improves observability strategies across services.
  • Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
  • Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards.

Truelogic is a leading provider of nearshore staff augmentation services. They have a team of 600+ highly skilled tech professionals based in Latin America, partnering with U.S. companies on impactful projects and valuing expertise and aspirations.

  • Design and implement foundational patterns and libraries for Python applications.
  • Develop and maintain robust CI/CD pipelines using tools such as Jenkins and ArgoCD.
  • Instrument observability through tools such as CloudWatch and Datadog to monitor and optimize application performance across multiple environments.

As a leader in aging care innovation, Honor provides the technology, tools, and services that empower older adults to live life on their own terms.

Design, implement, monitor, and maintain Sysdig's infrastructure at scale across different clouds and on-prem. Collaborate with development teams to improve system reliability, performance, and scalability. Participate in on-call rotation, respond to incidents, conduct root cause analyses, and implement preventive measures.

Sysdig helps organizations secure innovation in the cloud with runtime insights, open innovation, and agentic AI, trusted by over 60% of the Fortune 500.

India Unlimited PTO

Seeking an experienced Site Reliability Engineer to help build highly resilient and scalable systems by automating, measuring, and monitoring everything. Implement highly available and scalable architectures for core and third-party components of Acquia Source. Implement metrics, monitoring, and incident response processes.

Acquia is an open source digital experience company providing technology to brands that allows them to embrace innovation and create customer moments that matter.

UK

Run the production environment by monitoring availability and taking a holistic view of system health. Build software and systems to manage platform infrastructure and applications. Improve reliability, quality, and time-to-market of our suite of software solutions.

NICE software products are used by 25,000+ global businesses to deliver extraordinary customer experiences, fight financial crime and ensure public safety.

$74,900–$99,000/yr
US

  • Lead and mentor a team of Specialists, fostering a culture of ownership and continuous learning.
  • Enable Change and Problem Management teams to leverage Datadog observability tools for evaluating release quality.
  • Oversee implementation and optimization of CI/CD Observability pipelines to ensure Operational Readiness standards are met.

BWH Hotels has been a global leader in hospitality for nearly 80 years, inspiring travel through unique experiences. Headquartered in Phoenix, Arizona, BWH Hotels has a portfolio of 18 brands and fosters a workplace culture where contributions truly matter.

US Unlimited PTO

  • Lead infrastructure resiliency efforts, including recovery mechanisms, tenant isolation, and load spike handling.
  • Improve observability and operability of systems.
  • Build performance-critical, user-facing infrastructure like real-time event processing.

Jobgether is a platform that uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against a role's core requirements. It identifies top-fitting candidates and shares this shortlist directly with the hiring company.

$145,000–$185,000/yr
US Unlimited PTO

  • Be a keen learner, working with cloud-native, highly scalable infrastructure and gaining expertise in container orchestration, networking, and observability.
  • Be a passionate problem solver, tackling scalability, reliability, and troubleshooting challenges in distributed systems.
  • Be a great communicator, engaging directly with developers, engineering teams, and product teams to understand infrastructure challenges and provide solutions.

Temporal provides an open-source programming model that simplifies code, improves application reliability, and helps developers focus on delivering features faster. They aim to be the reliable foundation of every developer’s toolbox and value curiosity, drive, collaboration, genuineness, and humility.

Europe

Lead the Reliability & Operations function within the Developer & Production Enablement (DPE) division of RWS’s Product & Technology organization. Take ownership of global production operations and lead the transition from manual, ticket-based workflows to platform-integrated automation. Ensure stability today, while designing for scalability and autonomy in the future.

RWS's purpose is to unlock global understanding, valuing every language and culture, and celebrating diversity and inclusion to make the company strong.

ANZ

  • Own challenging infrastructure problems end-to-end by understanding how engineers use the platform.
  • Design scalable, maintainable services and contribute to technical proposals.
  • Contribute to the roadmap, highlighting opportunities, validating approaches and helping keep our platform solutions current with cloud best practices.

Canva's intuitive suite of design products is powered by its large distributed infrastructure group, which sets large and ambitious goals.