Sr. Software Engineer - Spark Compute Infrastructure, ML Platform

Netflix, Remote, United States

Leading subscription service for watching TV episodes and movies

Duration: Full-Time

Would you like to manage our Spark compute infrastructure and optimize the ML Spark pipelines that power Netflix recommendations? We think of the Netflix service as hundreds of millions of different products, serving a uniquely personalized experience to each of our 200+ million members. One of the teams powering this effort is the ML Platform Data & Feature Infra team, which is responsible for building the scalable and efficient compute infrastructure used to train our personalization ML models.

The Opportunity
In this role, you will manage the Spark compute infrastructure used to train the ML algorithms that power Netflix personalization. You will drive operational excellence through tooling and automation, and you will work closely with ML researchers and engineers to scale their ad hoc explorations and manage production ML pipelines. This role will allow you to gain intimate knowledge of Netflix personalization while working for a unique and pioneering company that is redefining how video content is consumed globally.

Here are some examples of the types of things you would work on:

  • Optimize the ML Spark pipelines for both resource and latency efficiency, and help with capacity planning for our compute infrastructure
  • Increase research productivity by quickly troubleshooting Spark performance issues and any roadblocks in adoption of our compute infrastructure
  • Build tools and automation to make the infrastructure more robust and to report on cluster cost utilization and efficiency
  • Manage a large-scale Spark cluster (several thousand EC2 instances) that powers the ML production pipelines fueling innovation for Recommendations research
  • Collaborate with our Big Data Platform teams to build, deploy and upgrade our compute infrastructure using the latest and greatest open source libraries

Minimum Qualifications

  • 4+ years of relevant experience managing large-scale distributed data systems
  • Strong automation mindset and a passion for root cause analysis and strategies to mitigate issues
  • Experience with big data technologies like Spark, Mesos/YARN/Kubernetes, HDFS or Elasticsearch
  • Experience with performance tuning and debugging scalability issues of Spark applications
  • Excellent communication and people engagement skills
  • Expertise in scripting languages
  • Experience with cloud computing platforms like AWS

Preferred Qualifications

  • Exposure to functional languages like Scala
  • Experience working on Notebooks such as Jupyter or Polynote
  • Experience working on container (Docker) platforms

Netflix is an equal opportunity employer and strives to build diverse teams from all walks of life. We offer a unique culture of freedom and responsibility with a clear long-term view. We recommend reading through these to understand what working at Netflix is like.

About Netflix

Netflix is the world’s leading Internet television network with over 100 million members in over 190 countries enjoying more than 125 million hours of TV shows and movies per day, including original series, documentaries and feature films. Members can watch as much as they want, anytime, anywhere, on nearly any Internet-connected screen. Members can play, pause and resume watching, all without commercials or commitments.

Want to learn more about Netflix? Visit Netflix's website.