Netflix has more than 130 million subscribers worldwide. To support such a large subscriber base, we run a large, distributed, and ever-changing system. The Resilience Engineering team’s goal is to make this complex system as resilient as possible, so that our customers enjoy a great experience. The opportunity to impact Netflix and its customers is huge! If you like scale and global impact, this is an amazing place to be.
How do we make our system more resilient? We find vulnerabilities and risks in our system before they lead to customer-facing outages. To find vulnerabilities, we build Chaos tools that inject events we expect the system to handle, and then check that the service stays healthy. We are currently using these tools to build a platform for load testing services with production traffic, which lets us better understand the limits of our production systems. Finally, we track patterns of risks and vulnerabilities, which reveal our biggest availability challenges and help us come up with risk mitigation strategies. You can read more about the practice of Chaos engineering here.
Who you are
You are intensely curious about how complex distributed systems operate and fail at scale
When you code, you reflect and seek feedback on design choices and trade-offs you make
You value engineering excellence and write testable, clear, and reusable code
You think freely and independently, and are ready to share your view
You are humble and eager to learn from mistakes, and you socialize the lessons learned
You can argue both sides of most disagreements
You collaborate well with partner teams
What you’ll do
Study the problems in the software resilience space
Create new solutions and see them through, from conception to production
Write code to support our existing solutions
Work with partner teams to find and fix vulnerabilities in their services
Requirements
You have built or contributed to a variety of systems, ideally in different technologies
You have experience with microservice architectures and understand scaling and concurrency concerns
You have strong software design and development skills in modern programming languages
Nice to have
Experience with multi-site high availability
Experience with Chaos engineering or testing in production
Experience creating products for engineers
Experience developing tools to improve reliability
Experience with internet-scale infrastructure
Netflix is the world’s leading Internet television network with over 100 million members in over 190 countries enjoying more than 125 million hours of TV shows and movies per day, including original series, documentaries and feature films. Members can watch as much as they want, anytime, anywhere, on nearly any Internet-connected screen. Members can play, pause and resume watching, all without commercials or commitments.