What you need to know about Intel's Nehalem CPU

Building blocks

"Atom" is the brand name for Intel's newly-launched ultramobile processor line, but it could just as well be the name for Intel's next-generation 45nm microarchitecture. This new core microarchitecture, codenamed Nehalem, forms the basic building block from which Intel will assemble the brains for everything from high-end servers to svelte notebooks. Insofar as Nehalem represents a lot more than just a new processor, it's a significant shift for Intel at almost every level.

In this article, I'll give a general overview of Nehalem, focusing on the major changes and big new features that the architecture will eventually bring to Intel's entire x86 processor line. A more in-depth examination of Nehalem from me will show up later in the spring; for now, read on for the highlights.

Here's what you need to know about Nehalem.

It's the bandwidth, stupid

Moore's Law has given processor designers an embarrassment of transistor riches, and nowhere is that more apparent than in Intel's 45nm Nehalem processor. Debuting in 4- and 8-core variants later this year, Nehalem packs a ton of hardware into a single processor socket. (Early numbers put the transistor count of a quad-core Nehalem at 781 million; no numbers for the 8-core model have appeared yet.)

But trying to feed all of that hardware with the Intel platform's existing frontside bus architecture would be folly. So, just as importantly, Nehalem also sounds the long-overdue death knell for Intel's positively geriatric frontside bus architecture. The radical change in Intel's system bandwidth situation that Intel's new QuickPath Interconnect (QPI) represents is perhaps the largest single factor that shaped Nehalem's design.

Between QuickPath and Nehalem's integrated memory controller, a Nehalem processor will have access to an unprecedented amount of aggregate bandwidth, especially in two- and four-socket implementations. What this means is that Intel no longer has to equip its processors with freakishly large unified caches designed to mitigate the effects of the bandwidth starvation with which Intel platforms currently struggle. The chipmaker is now free to use all of the transistors that Moore's Law affords more flexibly and intelligently, and this freedom has profound effects on every aspect of Nehalem.

Let's take a look at what this bandwidth improvement means for Nehalem-based products across all segments, from servers to notebooks.

Remixing the microprocessor

In some ways, Nehalem is Intel's most significant processor since the Pentium 4, insofar as it signifies a major shift for the company's x86 strategy. The ill-fated Pentium 4 was a relatively radical design conceived with clockspeed in mind. Nehalem, in contrast, is a more progressive evolution of Intel's existing, mobile-oriented Core 2 products; all of its changes are made with a view to exploiting the large amounts of parallelism that Moore's Law affords at the 45nm process node and to taking advantage of QPI's bandwidth.

Because of this emphasis on parallelism and bandwidth, "Nehalem," broadly conceived, is less of a "processor" in the classical sense than it is a set of building blocks that can be assembled in different configurations for different market segments.

[Figure: A four-core Nehalem processor, with three DDR3 channels and four QPI links]

Nehalem-derived processors (if it's still appropriate to call them "processors" and not "systems-on-a-chip," or SoCs) will mix the following elements in different proportions, depending on the platform and product:
That's quite a bit of customization available, and this approach is what will let Intel slip Nehalem into all kinds of market segments. Indeed, Nehalem's processor core actually reminds me of the Linux kernel in that it's a small unit that can be augmented in different ways with add-ons so that it fits everything from a set-top box to a supercomputer cluster.

So far, Intel has said that Nehalem will scale from two to eight cores, but the company has talked about only the four-core, server-oriented part. All Nehalem configurations have a number of Nehalem cores, each with a 32KB, four-way set-associative instruction cache, a 32KB, eight-way set-associative data cache, and a private, low-latency 256KB L2 cache. The cores are all attached to an inclusive L3 cache that will be sized to fit the number of cores and target market.

The four-core part that Intel has detailed weighs in at 781 million transistors, many of which no doubt go to the very generous 8MB L3 cache. This part also includes an on-die, three-channel DDR3 memory controller and a QuickPath interface that supports four QuickPath links. As I noted above, the number of memory channels and QuickPath links in other Nehalem-based products can be expected to vary with the part and target market.

Nehalem's core

The basic building block of Intel's Nehalem family is a new version of the Core microarchitecture, which sports a number of major changes from its Core 2 Duo incarnation. In fact, Nehalem's core represents the biggest overhaul that this microarchitecture has undergone since the transition from Core to Core 2. Most areas of the processor have undergone major revisions in order to take advantage of the amount of bandwidth made available by QuickPath. The important exception here is the execution hardware, which, except for the addition of some floating-point and integer shuffle blocks on port 5, is substantially unchanged from Core 2.

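The L1 cache figures quoted above translate into a number of sets once a cache line size is fixed. The article does not state a line size, so the sketch below assumes 64-byte lines; the function names are made up for illustration, and the L2 is left out because its associativity is not given here.

```python
# Assumption: 64-byte cache lines. The article gives sizes and
# associativity for the L1 caches but not the line size.
LINE_BYTES = 64
KB = 1024


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache."""
    lines = size_bytes // line_bytes
    if lines % ways != 0:
        raise ValueError("cache lines must divide evenly into ways")
    return lines // ways


def split_address(addr, size_bytes, ways, line_bytes=LINE_BYTES):
    """Split an address into (tag, set index, byte offset)."""
    sets = num_sets(size_bytes, ways, line_bytes)
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


# L1 figures from the article: 32KB four-way (instructions),
# 32KB eight-way (data).
l1i_sets = num_sets(32 * KB, ways=4)
l1d_sets = num_sets(32 * KB, ways=8)

print("L1I sets:", l1i_sets)
print("L1D sets:", l1d_sets)
print("L1D (tag, index, offset) for 0x12345:",
      split_address(0x12345, 32 * KB, ways=8))
```

Under the 64-byte assumption, doubling the associativity at the same capacity halves the number of sets, which is why the two 32KB caches index differently.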
The execution engine and instruction window

At a high level, you can think of Nehalem as a design that takes the very wide, extremely robust execution engine from its predecessor, the Core 2 Duo, and focuses on keeping it as busy as possible by feeding it code and data at an unprecedented rate.

[Figure: The Core 2 Duo's execution engine, which is substantially the same as Nehalem's.]

Another, related way to conceive of Nehalem's overall goal is to imagine that the Core 2 Duo's thirsty execution engine has been separated from the pools of code and data that lie in main memory by relatively thin pipes (the frontside bus and cache hierarchy) and a strong pump (the front end and memory unit) that does the best it can to keep instructions and data flowing given the circumstances. Nehalem, then, is all about replacing the plumbing with very wide pipes and beefing up the pump in order to take full advantage of all this new capacity; this way, the execution engine can get much closer to reaching its full potential.

In terms of keeping the execution engine fed, the return of simultaneous multithreading (SMT) to Intel's mainstream product line has an important impact on Nehalem's design. By letting each core on the die run two instruction streams at the same time, SMT increases overall system bandwidth usage and keeps the core's execution units busier with code and data, so that they waste less time (and power) sitting idle each cycle.

Because of the increased flow of instructions and data through the core that SMT enables, the (re)introduction of SMT meant that buffers on both the instruction and data sides of Nehalem had to be enlarged.

On the instruction side, Intel had to enlarge Nehalem's instruction window to accommodate more instructions in-flight. Specifically, Nehalem's reorder buffer (ROB) has been enlarged to 128 entries from Core 2's 96 entries, a 33 percent increase. (Architecture trivia buffs will recall that the Pentium 4 could track 126 instructions in-flight.)
To go along with this much larger number of in-flight instructions, Nehalem's reservation station has been expanded to 36 entries from Core 2's 32. The reorder buffer is statically partitioned so that the number of instructions from any one thread can never dominate the structure, thereby ensuring that this critical resource is shared fairly by the two running threads. The reservation station is competitively shared, so that instructions from one thread can dominate it from time to time as needed.

On the data side, the number of load buffers has gone from 32 in Core 2 to 48 in Nehalem, and the number of store buffers has gone from 20 to 32. Both of these resources are statically partitioned to ensure fairness.

It bears pointing out that this static partitioning strategy for shared resources effectively reduces the number of entries available to each thread below the number available in Core 2; that is, the two threads sharing Nehalem's ROB will have only 64 entries each, instead of the full 96 entries afforded a thread in the Core 2's ROB. I won't speculate on the degree to which this will impact Nehalem's single-threaded performance, because I don't know a) the threshold beyond which a decrease in ROB entries significantly impacts the core's ability to extract maximum instruction-level parallelism, or b) if Intel has some way of mitigating this decrease, like, say, letting one thread use all the available entries if it's the only one executing. Intel has been a bit coy about the details of how this partitioning is managed, but I expect more details to emerge as we get closer to Nehalem's launch.

Update: Intel says that they do option b, i.e., if only one thread is executing, that thread gets full use of the shared resources.

The front end

In order to push more instructions into the enlarged instruction window, Nehalem's front end has undergone some significant changes. In fact, Nehalem's front end is probably the most altered part of the processor.

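Before getting into the front-end details, the two sharing policies described for the instruction window can be made concrete with a toy model. This is only a sketch of the policies as the article describes them, not of Intel's actual hardware: the `SharedBuffer` class and `fill` helper are hypothetical names, and the only real figures are the 128-entry ROB and the 36-entry reservation station.

```python
from dataclasses import dataclass, field


@dataclass
class SharedBuffer:
    """Toy model of a buffer shared by two SMT threads (hypothetical)."""
    capacity: int
    static: bool  # True: fixed per-thread quota; False: competitive sharing
    used: dict = field(default_factory=lambda: {0: 0, 1: 0})

    def limit(self, active_threads):
        # Static partitioning splits the entries evenly between running
        # threads; a lone thread gets everything (the article's "option b").
        if self.static and active_threads > 1:
            return self.capacity // active_threads
        return self.capacity

    def allocate(self, thread, active_threads=2):
        """Try to claim one entry for `thread`; return True on success."""
        if sum(self.used.values()) >= self.capacity:
            return False
        if self.used[thread] >= self.limit(active_threads):
            return False
        self.used[thread] += 1
        return True


def fill(buffer, thread, requests, active_threads=2):
    """Issue `requests` allocations and count how many were granted."""
    granted = 0
    for _ in range(requests):
        if buffer.allocate(thread, active_threads):
            granted += 1
    return granted


rob_two_threads = fill(SharedBuffer(128, static=True), thread=0, requests=200)
rob_one_thread = fill(SharedBuffer(128, static=True), thread=0, requests=200,
                      active_threads=1)
rs_two_threads = fill(SharedBuffer(36, static=False), thread=0, requests=200)

print("ROB entries for one of two threads:", rob_two_threads)  # 64
print("ROB entries for a lone thread:", rob_one_thread)        # 128
print("RS entries one greedy thread can take:", rs_two_threads)  # 36
```

The model shows the trade-off the article points at: the static quota caps a thread at 64 ROB entries while a sibling is running, whereas the competitively shared reservation station lets one thread take every entry.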
The first major innovation that Nehalem brings to Intel's line is the addition of a dedicated loop stream detector (LSD) to the instruction pipeline after the decode stage.

In the Core 2 Duo, Intel introduced an 18-entry instruction queue between the fetch and decode stages. This queue was big enough to cache a small loop so that a cached loop could execute repeatedly from the queue without having to constantly re-fetch the necessary instructions from the L1 instruction cache. By keeping the fetch hardware idle during such loops, Core 2 was able to save power.

Nehalem takes this loop caching concept a bit further by moving this loop cache down below the decode units so that it caches up to 28 decoded uops instead of raw x86 instructions. Because the instructions cached in the LSD are already decoded, a loop can now execute without activating either the fetch or decode hardware, a feature that saves power and boosts performance.

As David Kanter points out in his excellent article on Nehalem, the LSD provides much of the benefit of the Pentium 4's trace cache without the added complexity and the negative impact on the decode and L1 cache hardware. The improved LSD is a great feature that will serve the core in good stead in every context, from servers to mobiles.

Intel made other major improvements to Nehalem's front end in the area of macrofusion. You may recall from my article on the Core 2 Duo that macrofusion is a technique that the Core microarchitecture uses to fuse some pairs of x86 instructions (compare + jump) together prior to decoding. On a cycle when two instructions can be macrofused, this gives the processor's front end a "virtual" fifth decoder, because it is decoding an extra x86 instruction on that cycle.

Nehalem widens the number of x86 instructions that can be macrofused in two ways. First, it adds four new compare + jump branch conditions to the list of macrofusable instruction pairs.
Second, and this is major, it can now macrofuse 64-bit instructions. Core 2's macrofusion hardware is limited to 32-bit instructions only. By expanding the list of instructions that can be macrofused to include new branch conditions and 64-bit instructions, Nehalem gains that virtual fifth decoder for a greater number of instructions. This improves overall decode bandwidth and plays a key role in keeping the core's execution engine fed.

There are a few other improvements to Nehalem's front end that I'll mention only in passing here. First is the processor's new multi-level branch predictor, which can store more branch history and give better performance on code that has too many branches to fit into a regular predictor. As always, branch prediction is one area where improvements translate directly into increased performance and power efficiency.

Nehalem also improves Core 2's return stack buffer by renaming it. Duplicating the return stack buffer helps performance in SMT situations.

SSE 4.2, virtualization, and multithreading

Intel has added a number of new instructions to Nehalem and it has sped up others. The 4.2 version of Intel's SSE vector extensions takes the x86 ISA back to the future just a bit by adding new string manipulation instructions. I say "back to the future" because ISA-level support for string processing is a hallmark of CISC architectures that was actively deprecated in the post-RISC years; typically, when a writer wants to give an example of crufty old corners of the x86 ISA that have caused pain for chip architects, string manipulation instructions are what he or she reaches for. But the new SSE 4.2 string instructions are aimed at accelerating XML processing, which makes them Web-friendly and therefore modern (i.e., not crufty).

SSE 4.2 also includes a CRC instruction that accelerates storage and networking applications, as well as a POPCNT instruction that's useful for a variety of pattern matching tasks.

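To make the last two instructions less abstract, here is a software sketch of the work they take over. It assumes the CRC in question is CRC-32C (the Castagnoli polynomial), which the article does not spell out, and it runs bit by bit in plain Python rather than calling the hardware instruction; the function names are illustrative only.

```python
# Assumption: the CRC variant is CRC-32C (Castagnoli). This constant is
# the bit-reflected form of that polynomial.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c_update(crc, data):
    """Fold `data` into a running CRC value, one bit at a time.

    This accumulate step is the part a hardware CRC instruction would
    perform; the initial value and final inversion are left to software.
    """
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc


def crc32c(data):
    """Full CRC-32C with the conventional initial value and final XOR."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def popcount(value):
    """Count the set bits in a non-negative integer, as POPCNT does."""
    return bin(value).count("1")


print(hex(crc32c(b"123456789")))  # standard check value: 0xe3069283
print(popcount(0b1011))           # 3
```

The inner loop runs eight shift-and-XOR steps per input byte, which is the kind of per-byte cost that a dedicated instruction removes from storage and networking code paths.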
Also, to provide better support for multithreaded applications, Intel decreased the latency of its thread synchronization primitives.

On the virtualization front, Nehalem speeds up VM transitions and has some substantial improvements, which I won't detail here, to its virtual memory system that will greatly reduce the number of such transitions required by the hypervisor.

Conclusions: it's about the platform

With the advent of Nehalem, Intel makes the giant leap from what is fundamentally still its decades-old monolithic-processor-plus-FSB platform to a fully modern SoC and NUMA (see diagram below) platform.

[Figure: Intel's current four-socket Xeon platform vs. Nehalem's fully connected NUMA topology. Note that main memory is never more than one hop away from a socket.]

This leap is long overdue (AMD was there years ago), but when Intel makes it in the fourth quarter of this year, it will change everything about its broader platform picture. The increase in bandwidth alone will improve performance on multisocket servers, and even desktop and mobile platforms will benefit from the higher levels of integration and performance that the integrated memory controller brings with it.

In sum, Intel's entire processor product line will benefit from the large structural changes that I've outlined here, as well as from the smaller, core-specific improvements that Nehalem embodies. And from here on out, Nehalem's mix-and-match approach to products and platforms will be par for the course for Intel, as well as for rival AMD.

With the launch of the first four-core, eight-thread Nehalem, the future of hardware will have arrived; and with all of that parallelism available, the performance ball will be squarely in the software industry's court. But that's a topic for another day.