Datadog

Modern monitoring & analytics. See inside any stack, any app, at any scale, anywhere

Software Engineer - Resilience

$105k – $140k AngelList Est.

About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale—trillions of data points per day—providing always-on alerting, metrics visualization, logs, and application tracing for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.

The team:

The Resilience Engineering group at Datadog focuses on improving resilience in our software and staff. We define the on-call tooling and incident response process for the entire company, constantly iterating on them through the lessons we learn from production. We help out during the most complex production incidents: our Resilience Engineers excel at troubleshooting and have a passion for problem solving and efficiency. We also build the chaos platform and tooling so that engineers can take a measured approach to breaking and testing systems for resilience, and can reproduce past bugs and incidents to verify their remediation.

The opportunity:

When we design systems, our Software Engineers and Site Reliability Engineers invest heavily in making them reliable and robust. However, it wouldn’t be pragmatic to expect our systems to be perfect and never fail. Being prepared to deal with unknown failures, from both a technical and an organizational standpoint, is the core work of Resilience Engineers.

You will:

  • Use your expertise to help analyze complex production issues and write postmortems in partnership with other engineering teams. You will also help reproduce some of our past incidents (several teams, high complexity) at large scale in our staging and production environments to verify in practice that our fixes work.
  • Contribute to the development of our self-service chaos platform, implemented on top of Kubernetes.
  • Help define how the whole company responds to incidents, and build tooling along the way to streamline that process. You will also help train our on-call staff, preparing newcomers for their on-call responsibilities and refreshing the rest of the staff on what we’ve learnt from past incidents.


  • You have significant programming experience and are willing to dive into unfamiliar codebases and find obscure bugs.
  • You have architected, built, and operated distributed systems to solve problems at high scale in cloud-based environments.
  • You have been on-call for critical systems and have experience handling incidents using a formal organizational process.
  • You want to work in a fast-paced, high-growth environment that respects its engineers and customers.

Bonus points:

  • You've worked on chaos engineering projects before.
  • You’ve been an Incident Commander or have contributed to defining an incident response process.
  • You have Linux/Kubernetes experience.
  • You have experience in cross-team, cross-functional projects.
Paris • Eugene • Remote
Hires remotely
Job type
Visa sponsorship
Not Available

Medical insurance

Retirement savings plan

Open paid time off

Catered lunches

Snacks & drinks

Fitness fund

Commuter benefits

Outings & events

Referral bonus

Datadog at a glance

Datadog focuses on SaaS, Enterprise Software, Information Technology, Analytics, and Software. The company has offices in New York City, San Francisco, Boston, and Chicago. It has a very large team of 1,001 to 5,000 employees. To date, Datadog has raised $147.9M of funding; its latest round closed in September 2019 at a valuation of $11B.

You can view their website at https://www.datadoghq.com or find them on Twitter and LinkedIn.

More jobs at Datadog

Open-Source Software Engineer - .NET / C#

Software Engineer - Alerting

Software Engineer - Compute

Software Engineer - Cloud Metrics
