OpenStack Stability SWAT Leadership

GoDaddy: OpenStack Stability SWAT

Role: Sr. Director of SRE (Observability & ITSM)

Overview: Tasked with stabilizing GoDaddy’s internal OpenStack cloud platform and improving visibility into its operation.

Situation: GoDaddy’s internal OpenStack cloud platform was experiencing stability issues, impacting internal services and development workflows. There was a lack of comprehensive visibility into the various OpenStack components, making troubleshooting difficult.

Task: To form and lead a cross-functional working group to audit the existing OpenStack environment, identify areas for improvement in telemetry and stability, and implement solutions to create a more reliable cloud platform.

Action:

  • Assembled and led a cross-functional SWAT team comprising members from various engineering disciplines.
  • Conducted a thorough audit of the OpenStack environment to identify root causes of instability and telemetry gaps.
  • Prioritized and implemented improvements to telemetry for various OpenStack components (e.g., Nova, Neutron, Cinder, Keystone).
  • Drove initiatives to enhance the overall stability and resilience of the cloud platform.

Tech Stack Used: OpenStack, RabbitMQ, Syslog, Graphite, Elastic (Elasticsearch), Moogsoft.

Result: The cross-functional working group audited and improved telemetry across the OpenStack components, resulting in a more stable internal cloud environment. The improved telemetry gave better insight into component health and performance, facilitating proactive maintenance and faster incident resolution.

Context: A stable and reliable internal cloud platform was crucial for GoDaddy’s engineering productivity and service delivery. This project directly addressed critical infrastructure stability issues, ensuring that internal teams had a dependable platform for development, testing, and hosting internal applications.