November 28, 2020


Deep Analysis: Uncovering the Secret of AMD Zen3's 39% Performance Improvement


The Ryzen 5000 series, based on AMD's new Zen3 architecture, has finally been released. Is the performance of the Ryzen 9 5950X and Ryzen 9 5900X good enough to make you pay for them?

We will also review the Ryzen 7 5800X and Ryzen 5 5600X in the future, so please stay tuned.

This time, single-core performance, arguably Ryzen's only remaining weakness, is finally no longer a liability: it now pulls ahead of Intel across the board. And all of this was achieved while staying on the same 7nm process, purely through the redesign of the Zen3 architecture, which is arguably the biggest change since the original Zen.

Today, let’s talk about some of the innovations of the Zen3 architecture.

Of course, processor architecture design is a highly specialized field, and we cannot cover it with professional depth. Instead, let's look at the more accessible aspects and see where such a leap in performance came from.

First of all, everything needs a goal, and that is especially true of processor architecture design. Zen3 had three goals:

The first is to improve single-thread performance, technically measured as IPC (instructions per clock cycle). Previous generations focused on multi-core scaling, so it was time to raise single-core performance to a sufficient level; otherwise it would remain a weak point and undermine long-term competitiveness. (A short sketch of the IPC arithmetic follows after the three goals.)

Second, while keeping the 8-core CCD module, unify the cores and cache to improve communication efficiency and reduce latency.

Third, continue to improve energy efficiency: performance gains must not come with runaway power consumption.
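To make the first goal concrete, here is a minimal sketch of the IPC arithmetic; the instruction and cycle counts are invented for illustration, not measurements of real hardware.

```c
#include <stdio.h>

/* IPC = retired instructions / clock cycles.
 * Performance = IPC * frequency, so at a fixed frequency and core count
 * any performance gain has to come from IPC. Values below are hypothetical. */
int main(void) {
    double instructions = 8.0e9;   /* hypothetical retired instruction count */
    double cycles_old   = 5.0e9;   /* hypothetical cycles on the old core    */
    double cycles_new   = 4.2e9;   /* hypothetical cycles on the new core    */

    double ipc_old = instructions / cycles_old;
    double ipc_new = instructions / cycles_new;

    printf("IPC old: %.2f, IPC new: %.2f, gain: %.1f%%\n",
           ipc_old, ipc_new, (ipc_new / ipc_old - 1.0) * 100.0);
    return 0;
}
```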

To this end, the Zen3 architecture reworks every module: front end, prefetch, decode, execution, integer, floating point, load, store, cache, and more.

First, Zen3 has a sophisticated branch predictor feeding two paths that queue and dispatch instructions: an 8-way associative 32KB first-level instruction cache with the x86 decoders, and a 4K-entry operation cache (op cache).

The x86 decoders are limited to a maximum of four instructions per clock cycle, but instructions that have already been seen can be served from the op cache, which can deliver eight per cycle. The combination of the two greatly improves instruction delivery efficiency and is a direct step up from Zen2.
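As a rough illustration of why the op cache matters, here is a toy model, not AMD's actual dispatch logic, that estimates how many cycles it takes to deliver a block of instructions when a given fraction hits the op cache at 8 per cycle and the rest go through the decoders at 4 per cycle; the hit rates are invented.

```c
#include <stdio.h>

/* Toy front-end model: a fraction of fetched ops hit the op cache
 * (8 ops/cycle) and the rest go through the x86 decoders (4 ops/cycle).
 * Rates and hit fractions are illustrative, not measured. */
static double frontend_cycles(double total_ops, double opcache_hit_rate) {
    double from_opcache = total_ops * opcache_hit_rate;
    double from_decoder = total_ops - from_opcache;
    return from_opcache / 8.0 + from_decoder / 4.0;
}

int main(void) {
    double ops = 1.0e6;
    printf(" 0%% op-cache hits: %.0f cycles\n", frontend_cycles(ops, 0.0));
    printf("50%% op-cache hits: %.0f cycles\n", frontend_cycles(ops, 0.5));
    printf("90%% op-cache hits: %.0f cycles\n", frontend_cycles(ops, 0.9));
    return 0;
}
```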

Instruction dispatch is followed by the execution engine stage, split into integer and floating-point halves, to which six instructions can be dispatched per clock cycle.

On the integer side, there are still four integer units, but they are spread out more, and a dedicated branch unit and store-data units have been added to improve throughput; three addresses can be generated per clock cycle.

Floating point is divided into six pipelines to further improve throughput and efficiency.

On the memory side, three loads can be performed per clock cycle, or one load plus two stores, increasing throughput again and allowing more flexibility across different workloads.

Describing Zen3 on its own may not mean much, so let's compare it with Zen2 and see just how much has changed, starting with the core.

On the front end: L1 BTB capacity is doubled, branch predictor bandwidth is higher, recovery from mispredictions is faster, op-cache fetch is faster, and switching between the op-cache and decode pipelines is finer-grained, among other changes.

In the execution engines: dedicated branch and store-data units, larger integer windows, lower latency for certain integer/floating-point instructions, 6-wide fetch and dispatch, wider floating-point dispatch, a faster floating-point FMAC (fused multiply-accumulate), and so on.

In load/store: higher load bandwidth (from 2 to 3 per cycle), higher store bandwidth (from 1 to 2 per cycle), more flexible load/store operations, better memory dependency detection, and so on.

These are the key core and cache metrics of the Zen, Zen2 and Zen3 architectures. At first glance Zen3 may not look like a big step beyond Zen2, but the raw numbers do not fully reflect the deeper changes: Zen3 makes breakthroughs in key metrics, such as the issue width jumping from 10/11 to 16, which brings far more than a slight performance improvement.

As a result of these improvements, the Zen3 architecture's IPC increases by an average of 19%, thanks to a combination of front-end, load/store, execution engine, cache prefetch, micro-op cache and branch prediction improvements, among others.

So where does that 19% figure come from?

AMD fixed both the Zen3 and Zen2 configurations at 8 cores and 4GHz, then compared how performance changed across different applications.
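Because cores and frequency are pinned, each application's speedup is, to a first approximation, its IPC gain, and the headline number is an average over the whole suite. AMD has not published its exact aggregation method; the sketch below uses a geometric mean over invented per-application speedups purely to show the arithmetic.

```c
#include <stdio.h>
#include <math.h>

/* With cores and frequency pinned, per-application speedup equals the
 * per-application IPC gain. The headline figure is an average over a
 * suite; a geometric mean is one common convention (AMD's exact method
 * is not published), and these speedups are invented. Build with -lm. */
int main(void) {
    double speedup[] = { 1.09, 1.13, 1.18, 1.25, 1.36 };
    int n = (int)(sizeof speedup / sizeof speedup[0]);

    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(speedup[i]);

    double geomean = exp(log_sum / n);
    printf("average IPC uplift: %.1f%%\n", (geomean - 1.0) * 100.0);
    return 0;
}
```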

Of course, different workloads improve by different amounts. The biggest gains come exactly where Ryzen used to be weak: online games such as PUBG, LOL and CS:GO improve by 35-39%, and with the higher clock frequency on top of that, the Ryzen 5000 finally delivers a dramatic change in online gaming.

In fact, most of the gains above the 19% average came from games, which is why the Ryzen 5000 has taken one of Intel's last remaining strongholds, gaming performance, and is arguably the best gaming processor in the world.

The gains are smaller in benchmarks and games that are hard to optimize deeply, especially in single-threaded performance: POV-Ray 9%, CPU-Z 12%, CineBench R20 13% and CineBench R15 18%. Even so, these are very significant real-world improvements, far beyond the roughly 5% that some generational core updates manage at best.

If that overview is not detailed enough and you want a closer look, let's break the architecture down into its modules and see how each one changes.

On the front end, Zen3 builds a faster branch predictor that can handle more instructions per clock cycle, while switching between the op cache and the instruction cache is quicker and adapts more flexibly and efficiently to different workloads.

Of course, branch prediction is never 100% accurate; mispredictions are inevitable, and when they happen the key is to recover quickly. Zen3 greatly reduces the recovery delay so the pipeline can get back on track, and the accuracy of branch prediction itself has also improved.
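To get a feel for why prediction accuracy and recovery latency matter at all, here is a classic, self-contained demo that is not specific to Zen3: the same loop runs over data that makes its branch nearly random and then over sorted data that makes it predictable, and on most modern CPUs the second run is several times faster.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)
#define REPEAT 200

/* Sum the elements above a threshold; the `if` is the branch whose
 * predictability we vary. Compile with low optimization (e.g. -O1),
 * otherwise the compiler may replace the branch with a conditional
 * move or vectorize the loop and hide the effect. */
static long long sum_above(const int *v, int n, int threshold) {
    long long sum = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > threshold)
            sum += v[i];
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    static int v[N];
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = 0;
    for (int r = 0; r < REPEAT; r++)
        s1 += sum_above(v, N, 128);        /* random data: ~50% mispredicts */
    clock_t t1 = clock();

    qsort(v, N, sizeof v[0], cmp_int);     /* sorted data: branch is predictable */

    clock_t t2 = clock();
    long long s2 = 0;
    for (int r = 0; r < REPEAT; r++)
        s2 += sum_above(v, N, 128);
    clock_t t3 = clock();

    printf("unsorted: %lld in %.2f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted:   %lld in %.2f s\n", s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    return 0;
}
```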

In the fetch and decode stage there is more detail on the branch predictor, in particular where the accuracy improvement comes from: a redesigned branch target buffer, a doubled L1 BTB, a reorganized L2 BTB, a larger indirect target array (ITA), a shorter pipeline, reduced misprediction latency, and so on.

At the same time, the 32KB 8-way associative first-level instruction cache has been optimized to improve prefetch capability and utilization.

The op cache is also more refined: queue fetching is more efficient, and switching between the op-cache and instruction-cache pipelines is smoother.

On the execution engine side, the floating-point and integer dispatch widths are increased, FMAC latency is reduced, and the execution windows are larger.

In integer execution, the number of integer scheduler entries increases from 92 to 96 (arranged as 4×24), and the physical register file, which renames logical registers to improve the efficiency of out-of-order execution, grows from 180 to 192 entries.

Issue width per clock cycle also grows from 7 to 10, made up of 4 ALUs (arithmetic logic units), 3 AGUs (address generation units), 1 branch unit and 2 store-data units.

In addition, the reorder buffer (ROB), which tracks in-flight x86 instructions, grows from 224 to 256 entries.

Zen3 still has four integer units, but the ALU and AGU schedulers are now shared, so the load is balanced between them better.

In floating-point execution, widening the unit to six pipelines means six micro-ops can be issued at once, and the store and float-to-int operations that used to occupy the MUL and ADD pipes now have pipelines of their own, so the MUL and ADD units are free to handle actual multiply and add instructions when needed.

There is also a faster 4-cycle FMAC, separate F2I (float-to-int) and store units, and a larger scheduler.

In load/store, the number of store queue entries grows from 48 to 64, the bandwidth of the 32KB first-level data cache is increased so that 3 loads can be performed per clock cycle (or 2 floating-point loads plus 1 store), and the prefetch algorithms are improved to take better advantage of the doubled level-3 cache capacity available to each core.

Let's go back up a level and look at Zen3's design in terms of core and cache.

Here is the familiar layout of the CCD cores and cache. Each CCD in Zen2 and Zen3 has 8 physical cores and 32MB of level-3 cache, but in Zen2 the CCD is split into two halves, with each group of 4 cores sharing its own 16MB of L3, whereas in Zen3 it is one unified block, with all 8 cores sharing the full 32MB. In effect, the L3 capacity available to each core is doubled.

On Zen2, if a core needed instructions or data that sat in the other half of the L3, the request had to take a detour around the chip and latency rose sharply. Now the access is direct: when the first core needs data held near the eighth core, it can reach it quickly within the same CCX.
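A small sketch of the topology difference described above; the cycle counts are placeholders chosen only to show the shape of the change, not AMD's published latencies.

```c
#include <stdio.h>
#include <stdbool.h>

/* Toy model of per-core L3 reachability. Latencies are placeholder
 * values purely to illustrate "direct hit vs. detour"; they are not
 * AMD's measured figures. */
enum { ZEN2, ZEN3 };

static int l3_direct_mb(int arch) { return arch == ZEN2 ? 16 : 32; }

static int l3_latency_cycles(int arch, int requester_core, int data_core) {
    /* Zen2: cores 0-3 and 4-7 sit in different CCXes of the same CCD. */
    bool same_ccx = (arch == ZEN3) || (requester_core / 4 == data_core / 4);
    return same_ccx ? 40        /* hypothetical in-CCX L3 hit            */
                    : 40 + 60;  /* hypothetical detour through the IO die */
}

int main(void) {
    printf("Zen2: core 0 sees %d MB directly; data near core 7 costs ~%d cycles\n",
           l3_direct_mb(ZEN2), l3_latency_cycles(ZEN2, 0, 7));
    printf("Zen3: core 0 sees %d MB directly; data near core 7 costs ~%d cycles\n",
           l3_direct_mb(ZEN3), l3_latency_cycles(ZEN3, 0, 7));
    return 0;
}
```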

Now the cache details. The capacities of the first, second and third levels are unchanged, but efficiency is much higher: the 32KB first-level instruction cache supports 32-byte-per-cycle fetch, the 32KB first-level data cache supports up to 3 loads and 2 stores per cycle, and the 512KB second-level cache is faster.

With the larger size and uniform access, lines evicted from the level-2 cache can all be kept in the L3 as victims, serving as a backup. Since they are very likely to be accessed again, whichever core needs them next can retrieve them straight from the L3.
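As a rough sketch of what "the L3 keeps L2 victims" means, here is a heavily simplified model; real caches are set-associative with far smarter replacement, and the sizes here are toy values.

```c
#include <stdio.h>
#include <string.h>

/* Simplified victim-L3 model: lines evicted from a tiny toy L2 are
 * parked in the L3; a later L2 miss checks the L3 before going to
 * memory. Addresses must be non-zero in this toy. */
#define L2_LINES 2
#define L3_LINES 8

static long l2[L2_LINES], l3[L3_LINES];
static int l2_used, l3_next;

static int in_cache(const long *c, int n, long addr) {
    for (int i = 0; i < n; i++)
        if (c[i] == addr) return 1;
    return 0;
}

static void access_line(long addr) {
    if (in_cache(l2, l2_used, addr)) { printf("%ld: L2 hit\n", addr); return; }
    if (in_cache(l3, L3_LINES, addr)) printf("%ld: L3 hit (former victim)\n", addr);
    else                              printf("%ld: miss, fetched from memory\n", addr);
    if (l2_used == L2_LINES) {               /* evict the oldest L2 line ...   */
        l3[l3_next] = l2[0];                 /* ... into the L3 as a victim    */
        l3_next = (l3_next + 1) % L3_LINES;
        memmove(l2, l2 + 1, (L2_LINES - 1) * sizeof l2[0]);
        l2_used--;
    }
    l2[l2_used++] = addr;
}

int main(void) {
    long pattern[] = { 1, 2, 3, 1, 2, 3 };   /* 1..3 thrash the 2-line L2 */
    for (unsigned i = 0; i < sizeof pattern / sizeof pattern[0]; i++)
        access_line(pattern[i]);
    return 0;
}
```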

In addition, each core can have 64 outstanding misses in flight from the level-2 cache to the level-3 cache, and the level-3 cache can have 192 outstanding misses to memory.

The Ryzen 5000 series continues the chiplet design, combining one or two CCD dies with an IOD (which houses the memory controller and I/O). But since each CCD now contains a single CCX instead of two independent ones, communication between the CCDs, the IOD and memory is more uniform and efficient.

When two CCDs are paired with one IOD, both see the same bandwidth and the topology stays consistent.

This is where the benefits of the chiplet design show up again: it is easy to scale to 16 cores, and the move to the Zen3 architecture happens entirely within the package, without changing the layout or the platform.

In terms of security, Zen3 adds Control-flow Enforcement Technology (CET), which Intel already supports. It introduces a shadow stack that contains only return addresses and lives in system memory, protected by the processor's memory-management hardware. If malicious code exploits a vulnerability to modify the stack, the tampering can be detected and stopped before it causes harm.
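A toy, software-only illustration of the shadow-stack idea; with real CET the CPU maintains the shadow stack in hardware and ordinary code cannot write to it.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy software model of a shadow stack: every "call" pushes the return
 * address onto both the normal stack and a separate shadow stack, and
 * every "return" compares the two. */
#define DEPTH 64

static unsigned long normal_stack[DEPTH], shadow_stack[DEPTH];
static int top;

static void do_call(unsigned long return_address) {
    normal_stack[top] = return_address;
    shadow_stack[top] = return_address;
    top++;
}

static unsigned long do_return(void) {
    top--;
    if (normal_stack[top] != shadow_stack[top]) {
        fprintf(stderr, "control-flow violation: return address tampered\n");
        abort();
    }
    return normal_stack[top];
}

int main(void) {
    do_call(0x401000);                    /* hypothetical return address */
    printf("clean return to %#lx\n", do_return());

    do_call(0x401000);
    normal_stack[top - 1] = 0xdeadbeef;   /* simulate a stack overwrite */
    do_return();                          /* detected and aborted */
    return 0;
}
```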

In terms of the instruction set, Zen3 adds MPK (Memory Protection Keys), which lets software change memory access permissions more efficiently, and the VAES and VPCLMULQDQ instructions gain AVX2 (256-bit) support.
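For MPK specifically, here is a minimal Linux sketch using the glibc pkey wrappers (Linux 4.9+ and glibc 2.27+, error handling omitted); it shows the point of the feature: access to already-tagged memory can be toggled with a cheap per-thread register update instead of a fresh mprotect call.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Minimal Memory Protection Keys sketch: tag a page with a pkey, then
 * toggle access rights with pkey_set instead of a full mprotect call.
 * Requires a CPU with pku support (such as Zen3) and kernel/glibc
 * support; error handling is omitted for brevity. */
int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    int pkey = pkey_alloc(0, 0);                 /* key with full access */
    pkey_mprotect(buf, page, PROT_READ | PROT_WRITE, pkey);

    buf[0] = 'A';                                /* allowed */
    printf("wrote '%c' while access enabled\n", buf[0]);

    pkey_set(pkey, PKEY_DISABLE_ACCESS);         /* revoke for this thread */
    /* touching buf here would now fault with SIGSEGV */

    pkey_set(pkey, 0);                           /* restore access */
    printf("still readable: '%c'\n", buf[0]);

    pkey_free(pkey);
    munmap(buf, page);
    return 0;
}
```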

Finally, on energy efficiency: according to AMD, the Ryzen 9 5950X and Ryzen 9 5900X are 2.8x and 2.4x as efficient as the i9-10900K respectively, and compared with Zen2's Ryzen 9 3950X and Ryzen 9 3900XT they improve by 12% and 26% respectively, delivering better performance with no increase in power consumption.

In short, Zen3 hits all of its stated goals: IPC up by an average of 19%, latency reduced (8 cores sharing 32MB of L3 cache), memory access dramatically faster (direct access to the doubled L3), higher clock speeds (up to 4.9GHz), much better energy efficiency (up to 2.8x), and a significant jump in game frame rates (about 26% on average at 1080p).

AMD's next stop is Zen4, which will move to the more advanced 5nm process. It is currently in design, is progressing on schedule, and should launch in the first half of 2022.
