How often do SSDs fail? Prevent premature SSD failures

We all know SSDs degrade over time. But how often do they fail, and what are the chances that your drive will? This article looks in detail at several research papers and what they conclude about SSD failure rates.

SSDs come with a TBW (Terabytes Written) rating, which is an estimate of how much data can be written to the drive during its lifetime. If you divide this number by the warranty period, you get the average amount of data you can write per day (or per year) while staying within the rating. However, solid-state drives can fail before the warranty period ends, and the rated TBW is not always reached, for many reasons. The main ones are high-density flash (mainly QLC), power failures, heat, and overload.
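
To make the TBW arithmetic concrete, here is a minimal sketch. The 600 TBW rating, 1 TB capacity, and 5-year warranty are made-up example values, not figures for any particular drive, and the function names are my own.

```python
def daily_write_budget_gb(tbw_tb: float, warranty_years: float) -> float:
    """Average GB per day that can be written while staying within the TBW rating."""
    warranty_days = warranty_years * 365
    return tbw_tb * 1000 / warranty_days


def drive_writes_per_day(tbw_tb: float, capacity_tb: float, warranty_years: float) -> float:
    """DWPD: how many times the full capacity can be written per day over the warranty."""
    return tbw_tb / (capacity_tb * warranty_years * 365)


# Hypothetical drive: 600 TBW, 1 TB capacity, 5-year warranty.
budget_gb = daily_write_budget_gb(600, 5)
dwpd = drive_writes_per_day(600, 1.0, 5)
print(f"{budget_gb:.1f} GB/day, {dwpd:.2f} DWPD")
```

If your typical daily writes are far below this budget, the TBW rating is unlikely to be the thing that ends your drive's life.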

It isn’t possible to say in advance which SSD is going to fail and which isn’t. The best we can do is choose the right SSD based on its features, specifications, benchmarks, and reviews. But if you want to know how often SSDs fail, we can look at some research on the matter.

I have read most of this research and the findings are pretty interesting. Let’s get started.

Why do SSDs fail?

Generally, a failing SSD will start to show unusual load even when the system is idle. Sometimes it may disconnect from the system, and you will end up in the BIOS/UEFI. Other signs are poor read/write speeds, excessive heat, and a blue screen of death.

There are two types of SSD failures: wear-out failures and non-wear-out failures.

Wear-out failures happen when the insulation layer around a cell’s floating gate transistor is damaged and the cell can no longer hold its data. Each cell has a limited number of program/erase (P/E) cycles, which reflects the endurance of this floating gate transistor. Once these cycles are used up, the drive can fail at any time. This is more common in high-density flash such as TLC and QLC.
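
The link between P/E cycles and total write endurance can be sketched with a common rule of thumb: capacity times P/E cycles, divided by the write amplification factor. The numbers below (1000 GB, 1000 cycles, write amplification of 2) are assumptions for illustration only. Real drives vary a lot, and manufacturers may rate TBW more conservatively than this estimate.

```python
def estimated_tbw_tb(capacity_gb: float, pe_cycles: int, write_amplification: float) -> float:
    """Rough endurance estimate in TB of host writes.

    write_amplification is the ratio of data physically written to flash
    versus data written by the host (1.0 would be the ideal case).
    """
    if write_amplification <= 0:
        raise ValueError("write_amplification must be positive")
    return capacity_gb * pe_cycles / write_amplification / 1000


# Hypothetical drive: 1000 GB, 1000 P/E cycles per cell, write amplification of 2.
example_tbw = estimated_tbw_tb(1000, 1000, 2.0)
print(f"Estimated endurance: {example_tbw:.0f} TB written")
```

The same formula shows why denser flash wears out sooner: fewer P/E cycles per cell directly lowers the estimate.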

All other issues are normally non-wear-out failures.

The primary type of non-wear-out failure is controller failure. The controller is the busiest and hottest component on your drive, and it generally fails due to manufacturing faults and excessive heat.

There can also be voltage fluctuations and electrical malfunctions inside the drive that cause failures, as well as firmware issues. In some poorly designed SSDs, bad algorithms (mainly wear leveling) can result in an early failure of your drive.

General SSD failure rates

The Annualized Failure Rate (AFR) is generally between 1% and 2% in consumer-grade SSDs. Also, most SSDs come with an MTBF (Mean Time Between Failures) of 1 to 1.5 million hours.
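
MTBF and AFR are two views of the same thing, and you can convert between them. The sketch below assumes a constant failure rate (an exponential model) and a drive that is powered on all year, which is a simplification. The function name is my own.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours x 365 days, i.e. a drive that is always on


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (1_000_000, 1_500_000):
    print(f"MTBF {mtbf:,} h -> AFR {afr_from_mtbf(mtbf):.2%}")
```

Under this model, an MTBF of 1 to 1.5 million hours works out to an AFR a little below 1%, so spec-sheet MTBF figures tend to look somewhat better than the 1% to 2% AFR range quoted above.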

Most importantly, the less dense the NAND flash memory, the lower the chance of premature failure. Sadly, in the consumer market, we have just two options to choose from: TLC and QLC. MLC has almost gone extinct in consumer-grade drives. In data centers, however, SLC and MLC are preferred. SLC has the best endurance.

Almost all the studies I looked at found no SSDs with an AFR higher than 2% in either consumer or enterprise environments.

Some studies related to SSD failures

1. MLC SSDs fail 2 to 10% more than SLC drives

In our article about the comparison of all types of flash memories, we discussed that MLC is less reliable than SLC because a single cell is being used to store two bits of information rather than one.

In consumer markets, MLC SSDs are rare and really expensive. The study I am talking about is Flash Reliability in Production: The Expected and the Unexpected, which was presented by Bianca Schroeder, Raghav Lagisetty, and Arif Merchant at the 14th USENIX Conference on File and Storage Technologies (FAST ’16).

For this study, millions of SSDs from Google’s data centers were analyzed over multiple years. There were only two types of SSDs: MLC and SLC. The main goal was to collect data about these drives, including age, workload, and environment. Both wear-out failures and non-wear-out failures were analyzed in this research.

The first finding was that wear-out failures were more common than non-wear-out failures. The non-wear-out failures mainly included firmware issues and controller failures.

The main finding was that the MLC SSDs were failing 2 to 10% more than the SLC drives. They concluded that it was happening due to the complexity of the MLC NAND Flash.

2. HDDs fail more often than SSDs (2 to 4% vs. 1.5 to 2% AFR)

The study I am talking about is A Large-Scale Study of Flash Memory Failures in the Field. It was done by Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. They presented it at the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems.

The study looked at SLC, MLC, and TLC NAND flash in real production environments.

It found that the annual failure rate for SSDs was between 1.5% and 2%. HDDs in the same environment, on the other hand, had an annual failure rate between 2% and 4%.
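
To get a feel for what these percentages mean in practice, here is a small sketch that turns an AFR into expected failures per year for a group of drives. It assumes drives fail independently of each other, which real fleets only approximate. The fleet size of 1,000 is a made-up example.

```python
def expected_failures(drive_count: int, afr: float) -> float:
    """Average number of drives expected to fail in one year."""
    return drive_count * afr


def prob_at_least_one_failure(drive_count: int, afr: float) -> float:
    """Chance that one or more drives fail in a year, assuming independent failures."""
    return 1 - (1 - afr) ** drive_count


# AFR ranges quoted above, applied to a hypothetical fleet of 1,000 drives.
fleet = 1000
ssd_range = (expected_failures(fleet, 0.015), expected_failures(fleet, 0.02))
hdd_range = (expected_failures(fleet, 0.02), expected_failures(fleet, 0.04))
print(f"SSDs: {ssd_range[0]:.0f} to {ssd_range[1]:.0f} failures per year")
print(f"HDDs: {hdd_range[0]:.0f} to {hdd_range[1]:.0f} failures per year")
```

For a home user with a handful of drives, `prob_at_least_one_failure` is the more useful number, since the expected count is well below one.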

3. 3D TLC offers 2x better endurance than planar TLC

This study is Reliability Characterization of 3D NAND Flash Memory. It was done by R. S. Tiwari, J. Kim, Y. Cai, and R. D. H. Chakraborty and presented at the 2017 IEEE International Memory Workshop.

The study found that the 3D TLC NAND offers 2x more endurance compared to the planar TLC NAND flash. Also, the 3D TLC has better data retention capabilities along with lower read-disturb sensitivity.

How to prevent premature SSD failures?

1. Firmware Updates

SSD manufacturers keep studying their drives and find common issues. If an issue can be fixed in software, they release a firmware update. So, if your SSD has a flaw from the manufacturing process that could lead to premature failure, a firmware update may fix it. Make sure to keep the firmware updated.

2. Keep your SSD cool

Excessive heat can result in premature SSD failure because heat is an enemy of electronics. Use an SSD heatsink and ensure good airflow in your system. Especially if you do data-intensive work on your computer, it is important to keep the temperature around your drive in check. Although SSDs come with a thermal-throttling feature, it is bad for your SSD to reach those high temperatures over and over again.

3. Take care of the write operations

Writing data is the most resource-intensive task for an SSD, and SSDs come with limited P/E cycles. So, make sure to keep your programs up to date, and if you have to write a lot of data, consider using two drives to share the workload. Use TRIM so that blocks of data that are no longer needed are properly managed and cleaned up by the drive’s algorithms.
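
If you want to see how your write workload compares to the drive's rating, a rough projection like the one below can help. It is a sketch with assumptions: NVMe drives report "data units written" in units of 1000 × 512 bytes, the projection assumes your future write rate matches your past average, and it uses power-on hours as the measure of elapsed time. The counter values and the 600 TBW rating are hypothetical.

```python
NVME_DATA_UNIT_BYTES = 512 * 1000  # one NVMe "data unit" is a thousand 512-byte blocks
HOURS_PER_YEAR = 8760


def terabytes_written(data_units_written: int) -> float:
    """Convert the NVMe data-units-written counter to decimal terabytes."""
    return data_units_written * NVME_DATA_UNIT_BYTES / 1e12


def years_until_tbw(tbw_tb: float, written_tb: float, power_on_hours: float):
    """Years of powered-on use left before reaching the TBW rating, or None if unknown."""
    if written_tb <= 0 or power_on_hours <= 0:
        return None  # not enough history to estimate a write rate
    tb_per_year = written_tb / (power_on_hours / HOURS_PER_YEAR)
    remaining_tb = max(tbw_tb - written_tb, 0)
    return remaining_tb / tb_per_year


# Hypothetical readings: 97,656,250 data units written over 8,760 power-on hours.
written = terabytes_written(97_656_250)
years_left = years_until_tbw(600, written, 8760)
print(f"{written:.1f} TB written, about {years_left:.1f} years of writes left")
```

Reaching the TBW figure does not mean the drive dies that day, but it is a reasonable point to start planning a replacement.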

4. Maintain SSD capacity

Using your SSD close to its full capacity can harm your drive, increase its temperature, and reduce its life. This is because the drive has to do extra erase and write operations when little free space is left. Also, the wear-leveling and garbage-collection algorithms need spare space to work properly.
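
A quick way to keep an eye on this is a small script that warns you when a drive gets too full. The 80% limit below is my own assumption, not a manufacturer figure, so adjust it to taste. `shutil.disk_usage` is part of Python's standard library.

```python
import shutil


def used_fraction(total_bytes: int, free_bytes: int) -> float:
    """Fraction of the drive that is in use, between 0.0 and 1.0."""
    return (total_bytes - free_bytes) / total_bytes


def is_too_full(total_bytes: int, free_bytes: int, max_used: float = 0.8) -> bool:
    """True when usage is above the chosen limit (80% by default, an assumed value)."""
    return used_fraction(total_bytes, free_bytes) > max_used


def check_path(path: str = "/") -> bool:
    """Check the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return is_too_full(usage.total, usage.free)


if __name__ == "__main__":
    print("Drive is too full" if check_path() else "Free space looks fine")
```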

5. Monitor SSD health

You can use your SSD manufacturer’s software to check its health. There are also other programs, like CrystalDiskInfo and DiskGenius, which will help you check your drive’s health. If the drive has degraded, it is time to back up first and then replace it.
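
If you prefer a script over a GUI tool, smartmontools can print a drive's SMART data as JSON (`smartctl -j -a <device>`), and a few lines of Python can flag the worrying values. This is only a sketch. The field names below are what I understand smartctl to report for NVMe drives, so check them against your own output. The thresholds (80% endurance used, 70 °C) and the sample report are hypothetical.

```python
def health_warnings(report: dict, max_percentage_used: int = 80, max_temp_c: int = 70) -> list:
    """Return human-readable warnings for one parsed smartctl JSON report."""
    warnings = []
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall SMART health check failed")

    # Assumed key for the NVMe health log; SATA drives report different fields.
    log = report.get("nvme_smart_health_information_log", {})
    if log.get("critical_warning", 0) != 0:
        warnings.append("controller raised a critical warning")
    if log.get("percentage_used", 0) >= max_percentage_used:
        warnings.append(f"rated endurance is {log['percentage_used']}% used")
    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare <= threshold:
        warnings.append("spare blocks are at or below the threshold")
    if log.get("media_errors", 0) > 0:
        warnings.append(f"{log['media_errors']} media errors recorded")
    if log.get("temperature", 0) > max_temp_c:
        warnings.append(f"temperature is {log['temperature']} C")
    return warnings


# Hypothetical report, shaped like the parsed JSON described above.
sample_report = {
    "smart_status": {"passed": True},
    "nvme_smart_health_information_log": {
        "critical_warning": 0,
        "temperature": 45,
        "available_spare": 100,
        "available_spare_threshold": 10,
        "percentage_used": 85,
        "media_errors": 0,
    },
}
sample_warnings = health_warnings(sample_report)
print(sample_warnings)
```

To use it on a real drive, run smartctl with the JSON flag (for example on `/dev/nvme0`, which is just an example device path), load the output with `json.loads`, and pass the resulting dictionary to `health_warnings`.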

Conclusion

SSDs can fail prematurely for many reasons. Manufacturing faults can affect any drive. However, the things that are in our control are choosing the right SSD specifications, taking care of the temperature, keeping free space for the algorithms, and maintaining the drive.

I hope this helps!