Network Telemetry & Tooling Consolidation

GoDaddy: Network Telemetry & Tooling Consolidation

Role: Sr Dir of SRE (Observability & ITSM)

Overview: In collaboration with the network engineering, storage engineering, and architecture teams, GoDaddy underwent a strategic overhaul of network tools and telemetry to prepare for SDN and modernize the network stack. My team and I were responsible for the observability/telemetry portion of this initiative.

Situation: GoDaddy’s network monitoring relied on a disparate set of tools (Zabbix, Nagios, CA Spectrum), leading to inefficiencies, data silos, and difficulties in gaining a unified view of network health. A modernized and consolidated approach was needed to support future network architectures like SDN and improve incident response.

Task: To design and implement a consolidated network telemetry platform. This involved selecting and integrating new tools for SNMP data collection, flow data analysis, syslog analysis, event correlation, and integration with ServiceNow for CMDB and device provisioning data.

Action:

  • Sevone Integration: Deployed SevOne to collect SNMP data, replacing Zabbix, Nagios, and CA Spectrum. Used this data to build a network topology fed into Moogsoft for enhanced event correlation.
  • Kentik Integration: Implemented Kentik to collect sFlow and NetFlow data from network and security devices across the company, providing deep visibility into network traffic.
  • Elastalert for Syslog: Designed, built, and deployed an Elasticsearch cluster with Elastalert to analyze syslog data from network devices, enabling more intelligent and context-aware alerting.
  • Moogsoft for Correlation: Leveraged Moogsoft to correlate events using a vertex entropy method, relying on the underlying topology data from SevOne to understand node fragility and potential impact of outages.
  • ServiceNow Integration: Developed custom integrations with ServiceNow to synchronize device records, metadata, and provisioning information with other monitoring applications.
  • Tech Stack Used: Sevone, Kentik, Kafka, Fluentbit, Elastic (Elasticsearch), Moogsoft, ServiceNow.

Result:

  • Reduced network team overhead and decreased MTTR by 40%.
  • Provided near real-time topology for operators, aiding in accurate diagnosis of incidents.
  • Improved automated mitigation times from ~15 minutes to sub-minute once a bad actor was detected.

Context: This project was a critical component of GoDaddy’s broader network modernization strategy. By consolidating tools and improving telemetry, the initiative enhanced network visibility, operational efficiency, and the ability to proactively manage one of the world’s largest and most complex network infrastructures, directly supporting the reliability of GoDaddy’s extensive product offerings.