About the department
As part of the Cloudflare Engineering organization, SREs are primarily responsible for production reliability. SREs are based in San Francisco, London, Singapore, Austin and Lisbon, and use this global distribution to provide follow-the-sun coverage, which allows work to be focused in business hours in each location.
SREs are supported by all engineering teams at Cloudflare, who participate in on-call schedules for their services. The SRE teams facilitate remediation and follow-up of production issues and mature the tooling that enables all engineering teams to self-serve in production. Incident follow-up work across all engineering teams is prioritized above product innovation, and the impact of production incidents influences the priority.
Currently, SREs support two main environments. Edge SRE focuses on the edge distribution, where most client traffic is served. Core SRE focuses on core services such as the control plane, data pipeline and other supporting services.
Edge SRE project work is organized in four development areas: Platform Engineering, Production Tooling, Hardware Lifecycle and Observability.
Who you are
- You have 5+ years of software engineering, reliability, or operations experience in a customer-focused environment.
- You have 2+ years of experience managing a team of 5 or more engineers on projects in the areas of distributed systems, tooling, Linux, internetworking, infrastructure security or infrastructure management
- You are comfortable collaborating and co-ordinating on cross-team projects and workflows
- You can provide a strong technical vision for systems and infrastructure teams
- You have experience building services and systems, have successfully taken projects from inception to production, and are comfortable diving in to provide leadership for major projects when needed
- You are capable of leading a discussion with upper management, and are able to tailor the level of technical detail to suit your audience
What you'll do
We are looking for an Engineering Manager to join the Edge SRE team in London. You will lead and develop a team of SREs who are responsible for Cloudflare edge production and for building the tools all teams need to understand and interact with it. You will play a lead role in driving our Observability initiatives for edge services, and will lead engineers who build tools and best practices that let engineering teams debug in production, measure availability and performance indicators, and track and report on thresholds.
- Lead a team of engineers who are working to keep the Cloudflare edge reliable and scalable
- Mentor, grow, and empower your team by giving them the skills, confidence and motivation to make decisions
- Help the individuals on your team to build and execute personal development plans that align with Cloudflare’s goals and objectives
- Take an active role in prioritizing the roadmap for the SRE Org
- Drive cross-team and cross-org alignment in engineering, infrastructure and product teams
- Partner with other Engineering Managers across Cloudflare to achieve reliability outcomes for their services
- Participate in deep technical design discussions within your team, and across partner teams, and ensure that we're building the right systems and keeping the quality high
Examples of desirable skills, knowledge and experience
- Hands-on experience with software or reliability engineering
- Experience leading and hiring a team that builds and runs tools and platforms
- Excellence in planning and overseeing execution to meet commitments and deliver with predictability
- Observability: Tracking and refining key customer
- Incident root cause analysis and follow-ups
- Incident management
- Comfortable managing teams/projects with deadlines and short release cycles
- Experience using observability tools such as Jaeger, OpenTracing, ELK, Prometheus, Thanos, Grafana, ClickHouse
- Experience running and maturing distributed systems
- Familiarity with proxies, DNS, databases, Internet and security
- Experience developing tools and APIs
Cloudflare is the simplest way to make websites faster, safer and smarter. Millions of websites have signed up for our service, including large enterprises, major consumer destinations, and government agencies. With offices in San Francisco and London, Cloudflare operates a highly-available global network that has security measures built into every layer and regularly clocks in lightning-fast speeds.
We're on a mission to build a better web, and we need smart, talented people to join our team. Our team works at the forefront of leading technologies, including nginx and the Go and Lua programming languages. We're a strong supporter of the open source community and regularly share our technology learnings at https://blog.cloudflare.com.
Want to learn more about Cloudflare? Visit Cloudflare's website.