Cloudflare, a leader in internet security, reported a bug in its logging system that caused the loss of about 55% of customer logs over a 3.5-hour period. The bug was in the log collection service and disrupted the delivery of event logs to users.
Cloudflare's Logging Service and Its Importance
Cloudflare provides a logging solution that lets customers monitor site traffic, investigate security incidents, troubleshoot issues, and optimize performance. The Logpush service transfers these logs to external storage platforms such as AWS S3, Elasticsearch, Microsoft Azure, Splunk, and Google Cloud Storage. Cloudflare handles over 50 trillion event logs daily, of which approximately 4.5 trillion are delivered to customers, so the reliability of this service is critical.
The Incident
The root of the problem was a malfunction in Logfwdr, a key component of Cloudflare's logging infrastructure. A flawed configuration update created a 'blank configuration' that falsely indicated no logs needed to be forwarded, causing logs to be lost. Logfwdr has a failsafe designed to forward all logs when it encounters such a misconfiguration, but in this case the failsafe itself triggered an overload: it began forwarding logs for all customers, overwhelming the system. The excess load also hit Buftee, the distributed buffering system that holds logs when downstream systems are overloaded. Buftee faced a volume 40 times its capacity, which led to a shutdown and required a complete system restart, worsening the situation and increasing data loss.
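The failsafe behaviour described above can be illustrated with a minimal sketch. It is not Cloudflare's code: the function `select_customers`, the customer names, and the numbers are all hypothetical, chosen only to show how a "forward everything on blank configuration" rule multiplies the load.

```python
from __future__ import annotations


def select_customers(config: set[str], all_customers: set[str]) -> set[str]:
    """Return the customers whose logs get forwarded.

    Fail-open rule (assumed for illustration): an empty configuration is
    treated as "forward everything" rather than "forward nothing".
    """
    if not config:
        # Blank configuration: the failsafe forwards logs for every customer.
        return set(all_customers)
    # Normal case: forward only for customers with log delivery configured.
    return config & all_customers


# Made-up numbers, chosen only to show the effect of the failsafe on load.
all_customers = {f"customer-{i}" for i in range(1000)}
configured = {f"customer-{i}" for i in range(50)}

normal_load = len(select_customers(configured, all_customers))
failsafe_load = len(select_customers(set(), all_customers))
amplification = failsafe_load / normal_load

print(normal_load, failsafe_load, amplification)  # 50 1000 20.0
```

In this toy setup, a single blank configuration turns a workload of 50 customers into one of 1000, which is the kind of sudden jump a downstream buffer has to be sized or protected against.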
Measures for Future Prevention
In response, Cloudflare has taken several corrective actions:
Enhanced Detection & Alerts: A new misconfiguration detection and alerting mechanism has been installed to promptly notify teams of any anomalies in log-forwarding configurations.
Configuration Optimization: Buftee's setup has been refined to prevent similar overloads from resulting in system-wide outages.
Regular Stress Testing: Cloudflare plans to run regular stress tests that replicate sudden surges in log volume, to verify that all failsafe systems can handle these scenarios.
These changes aim to strengthen Cloudflare's logging infrastructure and prevent such incidents from recurring, in keeping with the company's commitment to reliable and secure log management for its users.
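As an illustration of the misconfiguration detection mentioned in the list of corrective actions, the sketch below checks a new forwarding configuration before it is applied. It is a hypothetical example, not Cloudflare's mechanism: the names `ConfigCheck` and `check_forwarding_config` and the `max_drop_ratio` threshold are assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ConfigCheck:
    ok: bool
    alerts: list[str] = field(default_factory=list)


def check_forwarding_config(
    new_config: set[str],
    previous_config: set[str],
    max_drop_ratio: float = 0.5,  # hypothetical threshold
) -> ConfigCheck:
    """Flag configurations that look like accidents before they are applied."""
    alerts: list[str] = []
    if not new_config:
        # A blank configuration is almost certainly a mistake: alert instead
        # of silently falling back to "forward everything".
        alerts.append("blank configuration: no customers listed")
    elif len(new_config) < len(previous_config) * (1 - max_drop_ratio):
        # A sudden large shrink is suspicious even if the config is not empty.
        alerts.append("configuration shrank sharply compared to the previous one")
    return ConfigCheck(ok=not alerts, alerts=alerts)


previous = {"a", "b", "c", "d"}
blank = check_forwarding_config(set(), previous)
shrunk = check_forwarding_config({"a"}, previous)
normal = check_forwarding_config({"a", "b", "c"}, previous)

print(blank.ok, shrunk.ok, normal.ok)  # False False True
```

A check like this would sit in front of the deployment step, so that a suspicious configuration raises an alert and keeps the previous configuration in place rather than reaching the forwarding service.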