October 6, 2022

Robotic Notes


Meta details its approach to detecting data errors in IT infrastructure




Meta Platforms Inc. today detailed its approach to detecting so-called silent data corruptions, or SDCs, subtle errors that often emerge in information technology infrastructure and are highly difficult to troubleshoot.

Outages and other technical issues are a frequent occurrence in data centers. As a result, companies use a variety of methods to ensure that important business information is not lost in the event of a malfunction. One of the most common approaches is to create multiple copies of a record, which ensures that a backup is available in the event the original record is lost.

But despite the steps that companies take to protect business information, data errors still frequently emerge in IT infrastructure. Among the most complex are the malfunctions that Meta refers to as SDCs, which occur when a server’s central processing unit computes an incorrect result.

Servers and other data center systems automatically generate logs about notable events such as a malfunction. Those logs can then be used by administrators to carry out troubleshooting. SDC errors are challenging to fix because they do not appear in server logs, which makes them highly difficult to detect.

Meta’s engineers have developed multiple methods of detecting SDCs, the company detailed today in a blog post that shared technical information about two of the most important ones.

The first technique that Meta uses to detect SDCs is known as ripple testing.

To carry out ripple testing, Meta connects an error detection system to the applications running on a given server. The error detection system, with the help of the applications to which it’s connected, carries out a series of specialized computing operations. If the operations return an incorrect result, Meta can conclude that there was an SDC error caused by the server’s CPU.
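The general idea of comparing computed results against known-correct answers can be illustrated with a minimal Python sketch. This is not Meta's implementation: the function names, the two operations and the test vectors are all hypothetical, chosen only to show the pattern of running a short computation and flagging any result that differs from the expected value.

```python
import hashlib

# Hypothetical operations; a real test would target specific CPU features.
def checksum_op(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def multiply_op(pair):
    a, b = pair
    return a * b

# Each vector pairs an operation and input with the result a healthy CPU
# must produce. These vectors are illustrative assumptions, not Meta's.
KNOWN_ANSWERS = [
    (checksum_op, b"abc",
     "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"),
    (multiply_op, (1234567, 7654321), 9449772114007),
]

def run_ripple_test(vectors=KNOWN_ANSWERS):
    """Return (op_name, input, expected, actual) for every mismatch."""
    mismatches = []
    for op, arg, expected in vectors:
        actual = op(arg)
        if actual != expected:
            # No exception was raised and nothing was logged: the only
            # evidence of the fault is the wrong value itself.
            mismatches.append((op.__name__, arg, expected, actual))
    return mismatches
```

An empty list means every operation returned the expected value; any entry points to a result that was silently wrong and would warrant further investigation of the machine.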

“Ripple tests are typically in the order of hundreds of milliseconds within the fleet,” explained Meta engineer Harish Dattatraya Dixit. “They are scheduled based on workload behavior and can be switched on and off per workload.”

Because they can be completed in under a second, ripple tests require a relatively limited amount of infrastructure resources to carry out. A related benefit is that it’s possible to perform ripple tests fairly often. But while effective, this method cannot spot all types of SDCs, which is why Meta also uses a second error detection technique dubbed opportunistic testing.

Whereas a ripple test can be completed in under a second, opportunistic tests take several minutes to carry out, which reflects the fact that they are much more thorough. Meta built a custom software tool called Fleetscanner to manage the process. The company runs opportunistic tests on servers when they’re not actively used, for example while they’re undergoing maintenance.

Meta carries out opportunistic tests when a machine reboots, as well as when it installs updates to the onboard operating system or firmware. The company also searches for SDC errors when certain changes are made to the server cluster to which a machine is attached.
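The triggers listed above amount to a simple scheduling policy, which might be sketched in Python as follows. The event names and the function are hypothetical and are not taken from Fleetscanner; in particular, the article does not say which cluster changes qualify, so a single `CLUSTER_CHANGE` event stands in for them here.

```python
from enum import Enum, auto

class MachineEvent(Enum):
    # Hypothetical event names, used only for illustration.
    REBOOT = auto()
    OS_UPDATE = auto()
    FIRMWARE_UPDATE = auto()
    CLUSTER_CHANGE = auto()
    SERVING_TRAFFIC = auto()

# Events during which the machine is not actively serving workloads,
# so a test lasting several minutes does not compete with production.
OPPORTUNISTIC_TRIGGERS = frozenset({
    MachineEvent.REBOOT,
    MachineEvent.OS_UPDATE,
    MachineEvent.FIRMWARE_UPDATE,
    MachineEvent.CLUSTER_CHANGE,
})

def should_run_opportunistic_test(event: MachineEvent) -> bool:
    """Return True if the event is a window for a longer SDC test."""
    return event in OPPORTUNISTIC_TRIGGERS
```

Tying the longer tests to maintenance events means their multi-minute cost is paid while the machine is already out of service.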

Meta carries out 2.5 billion ripple tests every month across its data centers and has run a total of 68 million opportunistic tests to date. Ripple tests spot about 70% of SDC errors, the company says, while the rest are detected through opportunistic testing.

