Senior DevOps Engineer, Data Platform Reliability

Netflix, Los Gatos, California

Leading subscription service for watching TV episodes and movies

At Netflix, the big data platform is at the core of the product decisions that directly shape our customers' experience when they watch Netflix.
To support these business needs, we operate one of the largest big data infrastructures in the AWS cloud. In addition to relying on big data compute engines like Spark, Presto, and Flink, we also build an ecosystem of tools and services that allow all Netflix teams to leverage the platform as a cohesive service. To learn more, here is our recent talk (slides or video) that describes our big data infrastructure.
Our platform runs tens of thousands of jobs and processes over a trillion events every day. We support over a thousand data analysts, data scientists, and engineers across the company.
As a member of the team, you will help drive operational excellence for this ecosystem of complex, large-scale systems by re-imagining how we automate and build tools to lower operational barriers, improve visibility into problem areas, and improve the reliability of the platform.

Specifically, you will:

  • Develop effective tooling, alerts, and responses to both identify and address reliability risks.
  • Build tools and automation to reduce operational tasks, improve automatic issue identification and routing, and predict platform performance in accordance with SLAs based on overall platform health and progress.
  • Participate in the on-call rotation to manage incidents and handle unknown or new issues.
  • Drive issue resolution and root cause identification with the various data infrastructure teams.
  • Evangelize best practices around collaboration and reliability to all partner teams.

Job Qualifications:

  • Effective root cause identification, triage and mitigation
  • Experience with configuration and troubleshooting of Linux, Java, Tomcat, and other middleware technologies
  • Understanding of large-scale complex systems from a reliability perspective
  • Strong communication skills and the ability to engage partner teams effectively
  • Strong automation mindset and a passion for identifying strategies to mitigate issues going forward
  • Experience with cloud computing platforms (particularly AWS) a plus
  • Strong Linux system-level analysis and network analysis experience a plus

About Netflix

Netflix is the world’s leading Internet television network with over 100 million members in over 190 countries enjoying more than 125 million hours of TV shows and movies per day, including original series, documentaries and feature films. Members can watch as much as they want, anytime, anywhere, on nearly any Internet-connected screen. Members can play, pause and resume watching, all without commercials or commitments.

Want to learn more about Netflix? Visit Netflix's website.