Director, Site Reliability Engineering

Okta, San Francisco, CA or San Jose, CA

Okta is an integrated identity and mobility management service.

Okta authenticates, authorizes and provisions millions of users a day. The service is hosted on Amazon Web Services (AWS) across multiple availability zones and geographically separated regions. The service is designed for high throughput and 100% availability. We're looking for a technical leader to help us continue to scale the service with great people and reliable, cost-effective and efficient infrastructure, processes and tooling.

Job Duties and Responsibilities:  

  • "Always On" service delivery.
  • Work with our Program Management, Finance and Product Development organizations for capacity forecasting and production budget management.
  • People Management - you will lead and develop a distributed team of Site Reliability Engineers, Database Reliability Engineers, and First Line Managers.
  • Ensure that we build and maintain the automation tools and processes required to reliably and efficiently manage and secure our fleet.
  • Work with other parties within Engineering to accelerate SRE adoption of Agile and DevOps methodologies and help the SRE teams fully leverage the source control, CI, quality engineering and release management resources available to them.
  • Partner with Development to provide the infrastructure and services required to enable innovation and to ensure the products we build have the tools and telemetry required to operate them effectively and efficiently.
  • Continue to evolve our service architecture of microservices, containers, and a monolith to take advantage of new cloud infrastructure services and modern scalability concepts (e.g. pets vs. cattle).
  • Participate in 24x7 site reliability rotations and escalation workflows.

Minimum REQUIRED Knowledge, Skills, and Abilities:  

  • 8+ years of experience in technical leadership.
  • 5+ years of experience in people management.
  • Extensive experience using Agile and DevOps methodologies to build product infrastructure along with the monitoring, alerting and tooling required to operate it.
  • 3+ years of experience running large-scale infrastructure supporting a cloud service, preferably in AWS.
  • Solid background in Linux system administration and understanding of automation scripting languages (e.g. Python), configuration management systems (e.g. Chef), and logging and monitoring frameworks (e.g. Splunk, Zabbix).
  • Deep expertise in securing cloud infrastructure (e.g. security monitoring, PAM, key-based authentication, role-based authorization, audit logging and patching).
  • Experience navigating security certification audits (e.g. FedRAMP) is a plus.
  • Effective verbal and written communication and interpersonal skills.

Education and Training:

  • Degree in Computer Science or a related field, or equivalent experience.

Okta is an Equal Opportunity Employer.  


About Okta

Okta is the leading independent provider of identity for the enterprise. The Okta Identity Cloud connects and protects employees of many of the world's largest enterprises. It also securely connects enterprises to their partners, suppliers and customers. With deep integrations to over 5,000 applications, the Okta Identity Cloud enables simple and secure access for any user from any device. Thousands of customers, including 20th Century Fox, Adobe, Dish Networks, Experian, Flex, LinkedIn, and News Corp, trust Okta to help them work faster, boost revenue and stay secure. Okta helps customers fulfill their missions faster by making it safe and easy to use the technologies they need to do their most significant work.

Want to learn more about Okta? Visit Okta's website.