GoDaddy: OpenStack Stability SWAT
Role: Sr Dir of SRE (Observability & ITSM)
Overview: Tasked with stabilizing our internal cloud offering and improving visibility into its operation.
Situation: GoDaddy’s internal OpenStack cloud platform was experiencing stability issues that impacted internal services and development workflows. Limited visibility into the individual OpenStack components made troubleshooting difficult.
Task: Form and lead a cross-functional working group to audit the existing OpenStack environment, identify improvements in telemetry and stability, and implement solutions that make the cloud platform more reliable.
Action:
- Assembled and led a cross-functional SWAT team comprising members from various engineering disciplines.
- Conducted a thorough audit of the OpenStack environment to identify root causes of instability and telemetry gaps.
- Prioritized and implemented improvements to telemetry for various OpenStack components (e.g., Nova, Neutron, Cinder, Keystone).
- Drove initiatives to enhance the overall stability and resilience of the cloud platform.
Tech Stack Used: OpenStack, RabbitMQ, Syslog, Graphite, Elastic (Elasticsearch), Moogsoft.
Result: The cross-functional working group's audit and telemetry improvements across OpenStack components led to a more stable internal cloud environment. Improved telemetry provided better insight into component health and performance, enabling proactive maintenance and faster incident resolution.
Context: A stable and reliable internal cloud platform was crucial for GoDaddy’s engineering productivity and service delivery. This project directly addressed critical infrastructure stability issues, ensuring that internal teams had a dependable platform for development, testing, and hosting internal applications.