
What is CPU Cache? Understanding L1, L2, and L3 Cache

The CPU cache plays a critical role in reducing CPU bottlenecks. Because modern CPUs are so fast, they need memory with very low latency and very high bandwidth to keep up, and the cache memory inside the CPU does that job. The cache acts as a very high-speed store for frequently accessed data and instructions. So, instead of fetching the required data from the main memory (RAM) every time, the CPU can serve many requests from its own faster memory, which improves overall performance.

Without a cache, a CPU couldn’t deliver its true performance: it would spend most of its time waiting for data to arrive from the RAM or from secondary storage.

After the registers, the CPU cache is the first memory the execution units interact with. The L1 cache, being nearest to the cores, is the fastest, followed by the L2 and then the L3 cache. Let’s discuss everything in detail in this article.

Let’s understand the basics of how a CPU works first

The three main components of the CPU are the ALU (Arithmetic Logic Unit), the Control Unit (CU), and the registers. The ALU does the actual computation that makes the CPU what it is, while the Control Unit directs the operations of the processor: it tells the ALU and the memory what to do based on the instructions. Registers are small, high-speed storage locations within the CPU that hold the data and instructions the CPU is processing at that very moment.

Every pass through the Fetch-Decode-Execute cycle, even for very small operations, produces results that must be stored somewhere. These results land in the registers first. If they are needed again during the ongoing operations, they stay in the registers, which hold operands for arithmetic operations, intermediate results, and control information. But if the results are not needed immediately, or are too large, they are written out to the main memory; the CPU uses memory addresses to read from and write to the RAM.

The cache sits as an intermediate memory level between the CPU (registers) and the main memory (RAM). Instead of moving results all the way out to the RAM, the CPU can keep them in the cache if they are used frequently. Cache memory is faster than RAM, so serving frequently accessed data from it results in very low latency.

CPU memory hierarchy

Memory management is handled primarily by the operating system, which keeps track of where data is stored and makes sure there are no conflicts. The registers and the cache offer very limited space, typically kilobytes for the register file and a few megabytes for the cache, yet we routinely run programs that need gigabytes of RAM. The cache can therefore only ever hold the most important and most frequently used data. How is that importance decided? CPUs use replacement algorithms such as LRU (Least Recently Used) to determine which data to keep in the cache based on usage patterns, and software developers can optimize their applications to use data access patterns effectively, as the sketch below illustrates.
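
To make “optimizing data access patterns” concrete, here is a minimal C sketch (the matrix size and loop shapes are my own illustrative choices, not from the article). It fills a matrix and then sums it twice: the row-major walk follows the order the data actually sits in memory, so most accesses hit an already-fetched cache line, while the column-major walk jumps a whole row ahead on every step and keeps missing:

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static int grid[N][N];   /* 64 MB of ints: far larger than any CPU cache */

int main(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)           /* touch everything once */
        for (int j = 0; j < N; j++)
            grid[i][j] = i + j;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)           /* row-major: cache-friendly */
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)           /* column-major: cache-hostile */
        for (int i = 0; i < N; i++)
            sum += grid[i][j];
    clock_t t2 = clock();

    printf("row-major:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("(sum = %ld)\n", sum);
    return 0;
}
```

On typical hardware the second loop runs several times slower even though it performs exactly the same arithmetic; the only difference is how well each traversal cooperates with the cache.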

How is the CPU Cache made?

The CPU cache is made of SRAM (Static Random Access Memory). Like DRAM it is volatile, but it is static in behavior: the stored data stays put without any refreshing for as long as power is supplied. SRAM is built from transistors (six transistors form a single SRAM cell that stores one bit of data), with a flip-flop-like latch as the basic storage element. SRAM is hard to scale and very expensive, which is why it is reserved for critical places like CPU caches, networking devices, digital cameras, etc.

Because the flip-flops use transistors as the storage medium, they can achieve very high read/write performance: individual transistors can switch extremely fast (experimental devices have been demonstrated at hundreds of gigahertz). With their help, the flip-flops, and hence the cache, reach a level of performance no other storage medium matches. But since storing a single bit takes six transistors, scaling SRAM up is a tough task. This is why caches in CPUs are small, and why SRAM can’t replace DRAM as the primary memory.
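
As a rough worked example of why this matters: a 64 KB cache data array alone holds 64 × 1024 × 8 = 524,288 bits, which at six transistors per bit is over 3.1 million transistors before counting tag storage and control logic. A DRAM cell, by contrast, needs just one transistor and one capacitor per bit, which is why DRAM scales to gigabytes while SRAM does not.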

Types of CPU Cache

1. L1 (Level 1) Cache

The L1 cache is the fastest, closest, and smallest cache, embedded within each CPU core. The most common L1 size is 64 KB, but it can vary between 16 KB and 128 KB, and older CPUs may have less than 16 KB. The L1 cache is usually split into two parts: an L1 data cache and an L1 instruction cache. Its main role is to hold the data and instructions the CPU core accesses most frequently, reducing latency by avoiding fetches from the L2 cache, the L3 cache, or the main memory.

Size and Speed

The L1 cache generally comes in sizes between 32 KB and 64 KB per core; most modern, faster CPUs use 64 KB. Quoted theoretical bandwidth figures vary between 50 GB/s and 100 GB/s. Because the L1 cache generally runs at the same clock as the CPU, its effective speed is very high compared to other memory levels: on a cache hit, data can be delivered to the CPU within 1 to 3 cycles.

How does L1 cache work?

When the CPU needs a piece of data or an instruction, the first step is to check the L1 cache for it. If the data is there, it is called a cache hit, and the CPU gets what it needs within just a few clock cycles (generally 1 to 3). A cache miss is when the required data isn’t in the L1 and the CPU has to go to the L2, the L3, or the main memory (RAM).

The CPU decides which data to place in the L1 cache based on two principles: temporal locality and spatial locality. Temporal locality says that recently used data is likely to be used again soon, so the CPU keeps it in the L1 cache. Spatial locality is more predictive in nature: data near the recently accessed data (mostly from the same block in memory) is moved into the L1 cache as well.
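
As a small illustration (the function and its names are mine, not tied to any particular CPU), both principles show up in even a trivial loop:

```c
#include <stddef.h>

/* A dot product: 'acc' is reused on every iteration (temporal locality),
   while a[i] and b[i] march through memory in order (spatial locality),
   so each cache line fetched (typically 64 bytes) serves ~8 consecutive
   8-byte longs before the next line is needed. */
long dot(const long *a, const long *b, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```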

Because of the L1 cache’s small size, choosing what to keep is critical. For replacement, CPUs use algorithms like LRU, in which the least recently used data is evicted to make room for new data. Some CPUs instead use random replacement to keep things simple, since more elaborate algorithms add hardware overhead and can slow things down.
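
Here is a toy software model of LRU bookkeeping within a single 4-way set (the structure and names are illustrative; as noted above, real hardware often uses cheaper approximations such as pseudo-LRU or random replacement):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct set {
    uint64_t tag[WAYS];    /* which memory block each way holds */
    bool     valid[WAYS];  /* is the way occupied?              */
    uint8_t  age[WAYS];    /* 0 = most recently used            */
};

/* Look up 'tag' in the set; returns true on a hit.
   On a miss, evicts the least recently used (or first invalid) way. */
bool access_set(struct set *s, uint64_t tag) {
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {        /* cache hit */
            for (int v = 0; v < WAYS; v++)            /* age fresher lines */
                if (s->valid[v] && s->age[v] < s->age[w]) s->age[v]++;
            s->age[w] = 0;
            return true;
        }
    }
    int victim = 0;                                   /* cache miss */
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }      /* prefer empty ways */
        if (s->age[w] > s->age[victim]) victim = w;   /* else the oldest   */
    }
    for (int v = 0; v < WAYS; v++)
        if (s->valid[v]) s->age[v]++;
    s->tag[victim] = tag;
    s->valid[victim] = true;
    s->age[victim] = 0;
    return false;
}
```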

CPU caches are generally designed to be associative: the cache is divided into multiple sets, each with a specific number of lines for storing data. Put simply, associativity answers this question: “Where in the cache should a specific piece of data from memory go?” More specifically, it defines how memory locations are mapped to cache lines.

The process of Cache Memory Mapping

The main memory (RAM) of your computer receives data from secondary storage in the form of processes. Each process is subdivided into pages, while the main memory is divided into equal-sized frames; each frame is the same size as a page. This subdivision, and bringing the processes into main memory, is the job of the operating system.

To move data between the main memory and the cache, however, the main memory is divided into blocks and the cache into lines, with the line size equal to the block size. The process of placing data from memory blocks into cache lines is called mapping. This is an abstract description, because many steps are involved in translating addresses and data values when mapping them to each other.

There are three main types of cache memory mapping methods used in CPU cache. These are as follows.

| Cache Type | Flexibility | Performance | Complexity |
|---|---|---|---|
| Direct-Mapped | Low (1 specific line per block) | Fast lookup, but high collision rate | Simple hardware |
| Fully Associative | High (any line for any block) | Best performance (no collisions), but slower due to search | Complex hardware |
| Set-Associative | Medium (n lines per set) | Balanced performance, fewer collisions than direct-mapped | Moderate complexity |
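
To make the mapping concrete, here is a small sketch that splits a memory address into its three fields. The geometry is an assumption chosen for illustration: a 32 KB, 8-way set-associative cache with 64-byte lines, giving 32768 / (64 × 8) = 64 sets:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64   /* bytes per cache line -> 6 offset bits */
#define NUM_SETS  64   /* sets in the cache    -> 6 index bits  */

int main(void) {
    uint64_t addr   = 0x7ffdeadbeef0ULL;               /* arbitrary example  */
    uint64_t offset = addr % LINE_SIZE;                /* byte within line   */
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;   /* which set to probe */
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS);   /* identifies block   */
    printf("addr=0x%llx -> tag=0x%llx, set=%llu, offset=%llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}
```

Direct-mapped is simply the special case of one line per set, and fully associative the case of a single set spanning the whole cache.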

Functions and Roles of the L1 cache

The main function, as we discussed earlier, is to store the data and instructions the CPU uses most frequently. The L1 cache works at the same clock frequency as the CPU. The instruction cache (L1i) lets the CPU fetch instructions separately from data, while the data cache (L1d) holds the data the CPU is actively working on.

A major advantage of the L1 cache is that each CPU core has its own. This helps the cores with parallel processing and minimizes the delays a shared cache would cause. When the CPU is multitasking and must switch rapidly between different working sets, the per-core L1 cache provides the fast memory that makes those switches cheap, so it offers major benefits in multitasking as well.

2. L2 (Level 2) Cache

The L2 cache can be core-specific or shared: it may be dedicated to a single core or shared by multiple cores. CPUs with core-specific L2 caches include the Intel Core i7-9700K, the AMD Ryzen 7 5800X, and the Intel Xeon Gold 6248; CPUs with a shared L2 cache include the AMD Opteron 6174 and the Intel Core 2 Quad Q6600. Most consumer-oriented CPUs have a core-specific L2 cache, while shared L2 caches are generally found in server CPUs.

The L2 cache comes after the L1 cache and before the L3 cache. We covered the cache miss above: if the CPU requests data that is not present in the L1 cache, it checks the L2 cache next. If that lookup is a hit (meaning the required data is in the L2 cache), the CPU saves the time it would have spent searching the lower memory levels. The L2 cache also helps with multithreading by providing fast access to data shared among threads.

Size and Speed of L2 Cache

The smallest common L2 size is 256 KB, found mostly in low-end and older processors. The most common range is 512 KB to 2 MB. However, some high-end CPUs, mostly server processors, can have up to 16 MB of L2 cache.

The L2 cache generally delivers data within 3 to 10 clock cycles: slower than the L1 cache, but faster than both the L3 cache and the main memory. Total L2 bandwidth generally falls within 50 to 100 GB/s, although it varies heavily with the CPU architecture and clock speed. We can estimate it with this formula:

Data Transfer Rate (GB/s) = Clock Speed (GHz) × Bus Width (bytes) × 2

Let’s take the AMD Ryzen 7 5800X as an example. It has a base clock of 3.8 GHz and boosts up to 4.7 GHz, and the estimated speed of its L2 cache works out to around 58 GB/s. The same formula applies to the L1 cache as well. The effective speed of the L2 cache is lower in practice because of its higher latency: since the L2 cache takes 3 to 10 clock cycles to deliver data, the same raw bandwidth yields different real-world results. Even so, this is very fast compared to the RAM in our computers.
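
Plugging the numbers in (the 8-byte bus width and the double-pumped transfer are assumptions for illustration; the article does not specify them):

```c
#include <stdio.h>

/* Estimate per the formula above: GB/s = GHz x bus width (bytes) x 2.
   The 8-byte bus width is an assumed value, not a documented spec. */
int main(void) {
    double clock_ghz = 3.8;  /* Ryzen 7 5800X base clock */
    double bus_bytes = 8.0;  /* assumed bus width */
    printf("~%.1f GB/s\n", clock_ghz * bus_bytes * 2.0);  /* prints ~60.8 */
    return 0;
}
```

This lands in the same ballpark as the ~58 GB/s estimate quoted above; the exact figure depends on which clock and bus width you assume.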

How does L2 Cache work?

The L2 cache works much like the L1 cache; the real difference is that it sits a little farther from the CPU core where the actual work happens. Although it comes after the L1 cache, its capacity is larger, so more of the frequently used data can be kept in it. Just as with the L1 cache, algorithms like LRU (Least Recently Used) decide which data goes into the L2 cache.

The L2 cache can likewise be direct-mapped, set-associative, or fully associative. The practical difference shows up in the latency at which the CPU can access data from it.

Functions and Roles

When the CPU modifies data, it first modifies the L1 cache, but those changes must also be reflected in the L2 cache. You can think of the L2 cache as a buffer for the L1 cache: it serves as additional storage when data is not found in the L1. In multi-core CPUs, coordination between the L1 and L2 caches is important; if changes in the L1 cache aren’t propagated to the L2 cache, the L2 ends up holding stale data that is useless to the CPU.

3. L3 (Level 3) Cache

The L3 cache sits between the L2 cache and the main memory of your computer. It is shared among multiple cores and their L1 and L2 caches. The L3 cache has the largest capacity of all the cache levels but also the highest latency. You can think of it as a backup plan for the L1 and then the L2 cache: if the required data isn’t available in either of them, the L3 cache is searched next.

If the CPU modifies some data, the changes appear in the L3 cache as well. The L3 cache uses coherence protocols so that data in the cache stays consistent across all the cores, and it can track which core is changing or accessing which data. Here too, replacement algorithms work to keep the most useful data inside the L3 cache.
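
One visible side effect of this coherence machinery is “false sharing”. The hedged sketch below (POSIX threads; the 64-byte line size and the GCC/Clang alignment attribute are assumptions) runs two threads that increment counters which either share one cache line or sit on separate lines. The padded case is typically several times faster, because the cores stop invalidating each other’s copy of the line:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Two counters on the SAME 64-byte cache line: every write by one core
   forces the coherence protocol to yank the line away from the other. */
static volatile long same_line[2];

/* Two counters padded onto SEPARATE cache lines. */
static struct { volatile long v; char pad[56]; }
    separate[2] __attribute__((aligned(64)));

static void *bump(void *arg) {
    volatile long *c = arg;
    for (long i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double timed_pair(volatile long *x, volatile long *y) {
    pthread_t t1, t2;
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {  /* build with: cc -O2 -pthread falsesharing.c */
    printf("same cache line:      %.2f s\n",
           timed_pair(&same_line[0], &same_line[1]));
    printf("separate cache lines: %.2f s\n",
           timed_pair(&separate[0].v, &separate[1].v));
    return 0;
}
```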

The L3 cache is generally set-associative, so any given cache line can be stored in one of several ways within its set.

Size and Speed

The L3 cache generally ranges in size from 2 MB to 64 MB. Most consumer CPUs have between 3 MB and 20 MB, while server CPUs can go beyond 64 MB. Access time is generally within 10 to 20 nanoseconds, which at multi-gigahertz clock speeds corresponds to several dozen clock cycles, noticeably higher than the latency of the L1 and L2 caches.
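
Those latency numbers can actually be observed from software. The sketch below (in the spirit of the resources linked at the end; all sizes are my own choices) chases pointers around a randomly shuffled ring so the prefetcher cannot help, and reports nanoseconds per access. The figure jumps each time the working set outgrows a cache level, from L1-like latency at 16 KB up to RAM latency at 64 MB:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STEPS 10000000L

static double chase(size_t bytes) {
    size_t n = bytes / sizeof(void *);
    void **ring  = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)                    /* link a random ring   */
        ring[order[i]] = &ring[order[(i + 1) % n]];

    void **p = &ring[order[0]];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long s = 0; s < STEPS; s++) p = *p;          /* each hop is one load */
    clock_gettime(CLOCK_MONOTONIC, &b);

    double ns = ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / STEPS;
    if (p == NULL) puts("");                          /* keep the loop alive  */
    free(ring); free(order);
    return ns;
}

int main(void) {                                      /* build with: cc -O2   */
    size_t sizes[] = { 16u << 10, 256u << 10, 4u << 20, 64u << 20 };
    for (int i = 0; i < 4; i++)                       /* 16 KB .. 64 MB       */
        printf("%6zu KB: %5.1f ns/access\n", sizes[i] >> 10, chase(sizes[i]));
    return 0;
}
```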

4. L4 (Level 4) Cache

Some rarer CPUs, like Intel’s Haswell parts with eDRAM or IBM’s POWER9, can also have an L4 cache. Most of the time, the L4 cache is made of eDRAM rather than SRAM. It has a much larger capacity than the L3 cache, generally 128 MB or more, but it is slower, with latency ranging between 20 and 30 nanoseconds. The L4 cache often acts as what is called a victim cache: data evicted from the L3 cache is kept there. However, because it is rare in both consumer and server CPUs, we won’t dwell on it here.

Useful Resources:

https://users.sussex.ac.uk/mfb21/compilers/slides/15-handout.pdf

https://en.algorithmica.org/hpc/cpu-cache

https://cseweb.ucsd.edu/classes/fa10/cse240a/pdf/08/CSE240A-MBT-L15-Cache.ppt.pdf
