How often do SSDs Fail? Prevent Premature SSD failures

We are all aware that SSDs degrade over time. But how often do they fail, and what are the chances that your drive will fail? This article reviews several research papers that examine different aspects of SSD failure rates.

SSDs come with a TBW (Terabytes Written) rating, which is an estimate of how much data can be written to the drive during its lifetime. If you divide this number by the warranty period, you get the amount of data you can write per day (or per year) while staying within the rated endurance. However, solid-state drives can fail before the warranty period ends, and a drive may never reach its rated TBW, for several reasons. The primary ones are high-density flash (mainly QLC), power failures, excessive heat, and overload.
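The division described above can be sketched in a few lines of Python. This is only an illustration: the 600 TBW rating, 1 TB capacity, and 5-year warranty are made-up example figures, not the specifications of any particular drive, and the function names are hypothetical.

```python
def daily_write_budget_gb(tbw_tb: float, warranty_years: float) -> float:
    """Average GB per day you can write and still stay within the TBW rating."""
    warranty_days = warranty_years * 365
    return tbw_tb * 1000 / warranty_days  # 1 TB = 1000 GB (decimal units)


def drive_writes_per_day(tbw_tb: float, capacity_tb: float, warranty_years: float) -> float:
    """Same budget expressed as full-drive writes per day (DWPD)."""
    return tbw_tb / (capacity_tb * warranty_years * 365)


# Hypothetical example drive: 1 TB capacity, 600 TBW, 5-year warranty.
budget = daily_write_budget_gb(600, 5)
dwpd = drive_writes_per_day(600, 1, 5)
print(f"Write budget: {budget:.0f} GB/day ({dwpd:.2f} drive writes per day)")
```

Comparing this budget with the amount of data you actually write per day gives a rough idea of whether wear-out is a realistic concern for your workload.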

It isn’t possible to say in advance which SSD is going to fail and which isn’t. The best we can do is choose the right SSD based on its features, specifications, benchmarks, and reviews. Still, if you want to know how often SSDs fail, we can take a look at some research on this matter.

I have read most of this research, and the findings are pretty interesting. Let’s get started.

Why do SSDs fail?

The early signs of SSD failure are usually easy to recognize. Your SSD may show unusually high activity even when the system is idle. Sometimes it may disconnect from the system, and you will end up in the BIOS/UEFI. Other signs include poor read/write speeds, excessive heat, and blue screens of death.

There are two types of SSD failures: wear-out failures and non-wear-out failures.

Wear-out failures occur when the insulating layer around a cell’s floating gate (or charge-trap layer) degrades, leaving the cell unable to hold its data. Each cell comes with a limited number of program/erase (P/E) cycles, which reflects the endurance of this layer. Once these cycles are used up, the drive can fail at any time. This is more common in high-density flash such as TLC and QLC.

All other issues are usually non-wear-out failures.

The primary type of non-wear-out failure is controller failure. The controller is the busiest and hottest component on your drive, and it typically fails due to manufacturing defects or excessive heat.

Additionally, voltage fluctuations and electrical malfunctions within the drive can cause failures. There may also be firmware issues. In some poorly designed SSDs, there may be flawed algorithms (mainly wear-leveling) that can lead to an early failure of your drive.

General SSD failure rates

The Annualized Failure Rate (AFR) is generally between 1% and 2% in consumer-grade SSDs. Additionally, most SSDs are rated for an MTBF (Mean Time Between Failures) of 1 to 1.5 million hours.
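The two figures above are related. As a rough sketch, an MTBF rating can be converted into an approximate AFR if you assume a constant failure rate (an exponential lifetime model) and a drive that is powered on all year. Both are simplifying assumptions of this example, not something the manufacturers or the studies below state, and the function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours x 365 days, assuming the drive is always powered on


def afr_from_mtbf(mtbf_hours: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Approximate annualized failure rate, assuming a constant failure rate."""
    return 1 - math.exp(-hours_per_year / mtbf_hours)


for mtbf in (1_000_000, 1_500_000):
    print(f"MTBF {mtbf:,} h -> AFR of about {afr_from_mtbf(mtbf) * 100:.2f}%")
```

Under these assumptions, a rated MTBF of 1 to 1.5 million hours works out to an AFR below 1%, somewhat lower than the 1% to 2% range quoted above. Treat an MTBF rating as a statistical figure for a large population of drives, not as the expected lifetime of a single drive.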

Most importantly, the less dense the NAND Flash memory, the less likely it is to experience premature failures. Sadly, in the consumer market, we have only two options to choose from: TLC and QLC. MLC has now gone almost extinct in consumer-grade drives. However, in the data centers, SLC and MLC are preferred. SLC has the best endurance.

Almost all the studies, including the ones discussed below, have found no SSDs with an AFR higher than 2% in either consumer or enterprise environments.

Some studies related to SSD failures

1. MLC SSDs fail 2 to 10% more often than SLC drives

In our article comparing all types of flash memories, we discussed that MLC is less reliable than SLC because a single cell is used to store two bits of information rather than one.

In consumer markets, MLC SSDs are rare and extremely expensive. The study I am referring to here is “Flash Reliability in Production: The Expected and the Unexpected,” presented by Bianca Schroeder, Raghav Lagisetty, and Arif Merchant at the 14th USENIX Conference on File and Storage Technologies (FAST ’16).

For this study, SSDs in Google’s data centers were analyzed over multiple years, covering millions of drive days. The drives used SLC and MLC flash (including enterprise MLC). The primary objective was to collect data on these drives, including age, workload, and environmental conditions. Both wear-out failures and non-wear-out failures were analyzed in this research.

The first finding was that wear-out failures were more common than non-wear-out failures. The non-wear-out failures primarily included firmware issues and controller failures.

The main finding was that the MLC SSDs were failing 2 to 10% more often than the SLC drives. The authors attributed this to the complexity of MLC NAND flash.

2. HDDs fail more often than SSDs: 2 to 4% vs. 1.5 to 2% per year

The study I am referring to is “A Large-Scale Study of Flash Memory Failures in the Field.” It was done by Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. They presented it at the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems.

The study examined SLC, MLC, and TLC NAND flash in real-world production environments.

It found that the annual failure rate for SSDs was between 1.5% and 2%. HDDs in the same environment, on the other hand, had an annual failure rate between 2% and 4%.

3. 3D TLC offers 2x better endurance than planar TLC

This study is “Reliability Characterization of 3D NAND Flash Memory.” It was done by R. S. Tiwari, J. Kim, Y. Cai, and R. D. H. Chakraborty and presented at the 2017 IEEE International Memory Workshop.

The study found that 3D TLC NAND offers about twice the endurance of planar TLC NAND flash. Additionally, 3D TLC offers better data retention, along with lower read-disturb sensitivity.

How to prevent premature SSD failures?

1. Firmware Updates

SSD manufacturers continue to study their drives and identify common issues. If a problem can be fixed in software, they release a firmware update. Therefore, if your SSD shows signs of premature failure that trace back to a firmware flaw, an update may resolve them. Make sure to keep the firmware up to date.

2. Keep your SSD cool

Excessive heat can cause premature SSD failure because heat is the enemy of electronics. Use an SSD heatsink and ensure good airflow in your system. This is especially important if you do data-intensive work on your computer, when the drive needs to stay within a safe temperature range. Although SSDs come with a thermal throttling feature, you don’t want your SSD to reach those high temperatures repeatedly.

3. Take care of the write operations

Writing data is the most demanding task for an SSD, and each cell has a limited number of P/E cycles. Therefore, keep your programs up to date, and if you frequently need to process large amounts of data, consider using two drives to share the workload. Make sure TRIM is enabled so that the drive knows which blocks no longer hold valid data and can clean them up efficiently.

4. Maintain SSD capacity

Running your SSD close to its full capacity can harm the drive, raise its temperature, and shorten its life. This is because the drive has to perform extra erase and write operations when little free space is left. Additionally, wear-leveling and garbage-collection algorithms need free space to function correctly.
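A quick way to keep an eye on this is to check the free-space ratio from a script. The sketch below uses Python’s standard `shutil.disk_usage`. The 15% minimum is an arbitrary example threshold chosen for illustration, not a figure from the studies above or from any manufacturer, and the function names are hypothetical.

```python
import shutil


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Fraction of the volume that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def is_too_full(path: str, min_free: float = 0.15) -> bool:
    """True if the volume holding `path` has less than `min_free` free space.

    The 15% default is an illustrative assumption, not a vendor recommendation.
    """
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) < min_free


if __name__ == "__main__":
    # "." checks the volume that holds the current working directory.
    if is_too_full("."):
        print("Drive is nearly full: consider freeing up some space.")
    else:
        print("Free space looks fine.")
```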

5. Monitor SSD health

You can use your SSD manufacturer’s software to check its health. Other programs, such as CrystalDiskInfo and DiskGenius, can also help you check your drive’s health. If the drive has degraded, back up your data first and then replace the drive.

SSD health check using CrystalDiskInfo
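The same kind of health data can also be read from a script. The sketch below assumes the smartmontools `smartctl` utility (a different tool from the ones named above) and an NVMe drive. The JSON field names are assumptions based on what recent smartmontools releases emit for NVMe drives, so check them against your own output. The 90% wear limit, the sample report, and the function names are illustrative.

```python
import json
import subprocess


def read_smart_report(device: str) -> dict:
    """Run smartctl and return its JSON report (usually needs admin rights)."""
    # smartctl's exit status is a bitmask and can be nonzero even when the
    # report is valid, so the return code is deliberately not checked here.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def nvme_health_warnings(report: dict, wear_limit: int = 90) -> list:
    """Return human-readable warnings derived from an NVMe health log."""
    # Assumed key name; verify it against your own smartctl JSON output.
    log = report.get("nvme_smart_health_information_log", {})
    warnings = []
    if log.get("critical_warning", 0) != 0:
        warnings.append("controller reports a critical warning")
    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare < threshold:
        warnings.append(f"spare blocks at {spare}%, below the {threshold}% threshold")
    used = log.get("percentage_used")
    if used is not None and used >= wear_limit:
        warnings.append(f"{used}% of rated endurance used")
    return warnings


# Made-up sample report, so the check can be tried without a real drive.
sample = {"nvme_smart_health_information_log": {
    "critical_warning": 0, "available_spare": 100,
    "available_spare_threshold": 10, "percentage_used": 93}}
print(nvme_health_warnings(sample))
```

On a real system you would pass the output of `read_smart_report("/dev/nvme0")` instead of the sample dictionary. The device path is also an assumption and differs between machines.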

Conclusion

SSDs can fail prematurely for various reasons. There are some manufacturing faults that could be present in any drive. However, the factors within our control include selecting the right SSD specifications, managing temperature, allocating sufficient free space for algorithms, and maintaining the drive.

I hope this helps!
