CPU Pipeline


History of the CPU Pipeline

The original 8086 processor had 14 CPU registers, which are still in use today. Four are general purpose registers -- AX, BX, CX, and DX. Four are segment registers that are used to help with pointers -- Code Segment (CS), Data Segment (DS), Extra Segment (ES), and Stack Segment (SS). Four are index registers that point to various memory locations -- Source Index (SI), Destination Index (DI), Base Pointer (BP), and Stack Pointer (SP). The last two are the status flags register (FLAGS) and the instruction pointer (IP).

The instruction pointer's job is to point to the next instruction to be run. The processor follows the instruction pointer, fetches the CPU instruction at that location, and decodes it. After decoding, there is an execute stage where the instruction is run. Some instructions read from memory or write to it, others perform calculations or comparisons or do other work. When the work is done, the instruction goes through a retire stage and the instruction pointer is modified to point to the next instruction.
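
To make that loop concrete, here is a minimal sketch of the fetch/decode/execute/retire cycle written as a toy bytecode interpreter in C. The opcodes, registers, and program below are invented purely for illustration; a real 8086 decodes x86 machine code, not this format.

/* Toy fetch/decode/execute/retire loop. Everything here is made up for
   illustration: OP_LOAD_IMM, OP_ADD, and OP_HALT are not real x86 opcodes. */
#include <stdio.h>

enum { OP_LOAD_IMM, OP_ADD, OP_HALT };

int main(void) {
    /* Each toy instruction: opcode, destination register, operand. */
    int program[][3] = {
        { OP_LOAD_IMM, 0, 5 },   /* r0 = 5   */
        { OP_LOAD_IMM, 1, 7 },   /* r1 = 7   */
        { OP_ADD,      0, 1 },   /* r0 += r1 */
        { OP_HALT,     0, 0 },
    };
    int regs[4] = { 0 };
    int ip = 0;                           /* the instruction pointer */

    for (;;) {
        int *insn = program[ip];          /* fetch: follow the instruction pointer */
        int op = insn[0];                 /* decode */
        if (op == OP_HALT) break;
        switch (op) {                     /* execute */
        case OP_LOAD_IMM: regs[insn[1]] = insn[2];        break;
        case OP_ADD:      regs[insn[1]] += regs[insn[2]]; break;
        }
        ip++;                             /* retire: advance to the next instruction */
    }
    printf("r0 = %d\n", regs[0]);         /* prints 12 */
    return 0;
}

Every processor of that era was, at heart, this loop in silicon: one instruction handled completely before the next one begins.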

In 1982 an instruction cache was added to the processor. Instead of jumping out to memory for every instruction, the CPU would read several bytes beyond the current instruction pointer. This dramatically improved performance by removing round trips to memory every few cycles. In 1985, the 386 added cache memory for data as well as expanding the instruction cache.


In 1989, the i486 moved to a five-stage pipeline. Instead of having a single instruction inside the CPU, each stage of the pipeline could hold an instruction. This addition more than doubled the performance compared to a 386 processor at the same clock rate. The fetch stage extracted an instruction from the cache. (The instruction cache at this time was generally 8 kilobytes.) The second stage would decode the instruction. The third stage would translate memory addresses and displacements needed for the instruction. The fourth stage would execute the instruction. The fifth stage would retire the instruction, writing the results back to registers and memory.

In 1993, the Pentium architecture added a second, separate pipeline, making the processor superscalar. The main pipeline worked like the i486 pipeline, but the second pipeline ran some simpler instructions, such as direct integer math, in parallel and much faster.

In 1995, Intel released the Pentium Pro processor. This chip had several new features, including an out-of-order execution core (OOO core) and speculative execution. The pipeline was expanded to 12 stages, and it included something termed a 'superpipeline' where many instructions could be processed simultaneously.

In 2002, the Pentium 4 processor introduced a technology called Hyper-Threading. For most users the CPU's OOO core was effectively idle much of the time, even under load. To help give a steady flow of instructions to the OOO core, Intel attached a second front end. The operating system would see two processors rather than one. There were two sets of x86 registers. There were two instruction decoders that looked at two sets of instruction pointers and processed both sets of results. The results were processed by a single, shared OOO core.

CPU Instruction Pipelines

The i486 has a five-stage pipeline. The stages are: Fetch, D1 (main decode), D2 (secondary decode, also called translate), EX (execute), and WB (write back to registers and memory). One instruction can be in each stage of the pipeline.


The chips starting with the 8086 up through the 386 did not have an internal pipeline. They processed only a single instruction at a time, independently and completely.

To see the i486 pipeline in action, consider the following sequence of instructions, which swaps the values of two variables using the XOR trick:

XOR a, b
XOR b, a
XOR a, b

The first instruction enters the Fetch stage. On the next step the first instruction moves to the D1 stage (main decode) and the second instruction is brought into the Fetch stage. On the third step the first instruction moves to D2, the second instruction moves to D1, and a third instruction is fetched. On the next step something goes wrong. The first instruction moves to EX ... but the other instructions do not advance. The decoder stops because the second XOR instruction requires the result of the first instruction. The variable a is needed by the second instruction, but it won't be written until the first instruction is done. So the instructions in the pipeline wait until the first instruction works its way through the EX and WB stages. Only after the first instruction is complete can the second instruction make its way through the pipeline. The third instruction will similarly get stuck, waiting for the second instruction to complete.

This is called a pipeline stall or a pipeline bubble.  
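
The bubble is easier to see laid out cycle by cycle. Below is a purely illustrative C sketch of a five-stage pipeline running the XOR swap above; it assumes one instruction per stage and that a result only becomes visible after write-back (real chips forward results earlier), so the exact cycle counts should not be read as real i486 behavior.

/* Toy model of a five-stage pipeline (Fetch, D1, D2, EX, WB) running the
   XOR swap. For illustration only: one instruction per stage, and an
   instruction may not leave D1 while an older, unfinished instruction
   still has to write one of its operands. */
#include <stdio.h>

enum { NOT_FETCHED = -1, F, D1, D2, EX, WB, DONE };
static const char *names[] = { "F", "D1", "D2", "EX", "WB" };

typedef struct {
    const char *text;
    char dst, src;   /* XOR reads both operands and writes dst */
    int stage;
} Insn;

int main(void) {
    Insn prog[] = {
        { "XOR a, b", 'a', 'b', NOT_FETCHED },
        { "XOR b, a", 'b', 'a', NOT_FETCHED },
        { "XOR a, b", 'a', 'b', NOT_FETCHED },
    };
    const int n = 3;
    int done = 0;

    printf("cycle   %-10s %-10s %-10s\n", prog[0].text, prog[1].text, prog[2].text);
    for (int cycle = 1; ; cycle++) {
        /* Advance oldest-first so a stage vacated this cycle can be reused. */
        for (int i = 0; i < n; i++) {
            Insn *in = &prog[i];
            if (in->stage == DONE) continue;

            /* Data hazard: stall in D1 while an older, unfinished
               instruction still has to write one of our operands. */
            if (in->stage == D1) {
                int blocked = 0;
                for (int j = 0; j < i; j++)
                    if (prog[j].stage != DONE &&
                        (prog[j].dst == in->src || prog[j].dst == in->dst))
                        blocked = 1;
                if (blocked) continue;
            }

            /* Structural hazard: only one instruction per stage. */
            int next = in->stage + 1, occupied = 0;
            if (next <= WB)
                for (int j = 0; j < n; j++)
                    if (j != i && prog[j].stage == next) occupied = 1;
            if (occupied) continue;

            in->stage = next;
            if (in->stage == DONE) done++;
        }
        if (done == n) break;

        printf("%5d   ", cycle);
        for (int i = 0; i < n; i++)
            printf("%-10s ", (prog[i].stage >= F && prog[i].stage <= WB)
                                 ? names[prog[i].stage] : "");
        printf("\n");
    }
    return 0;
}

Running it prints a table in which the second and third instructions each sit in D1 for two extra cycles, which is exactly the stall described above.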

The Out Of Order Core Pipeline


Even with parallel pipelines, fast instructions waiting behind slow instructions was still a problem. The OOO core was a huge departure from the previous chip designs with their linear paths. It added some complexity and introduced nonlinear paths.



The first thing that happens is that instructions are fetched from memory into the processor’s instruction cache.

The decoding stage was modified slightly from earlier chips. Instead of processing just a single instruction at the instruction pointer, the Pentium Pro processor could decode up to three instructions per cycle. Decoding produces small fragments of operations called micro-ops or μ-ops. For example, an instruction that adds a register to a value in memory might decode into separate micro-ops: one to load the value, one to do the addition, and one to store the result.

Next is a stage (or set of stages) called micro-op translation, followed by register aliasing. Many operations are going on at once and we will potentially be doing work out of order, so an instruction could read from a register at the same time another instruction is writing to it. Writing to a register could potentially stomp on a value that another instruction needs. Inside the processor the original registers (such as AX, BX, CX, DX, and so on) are translated (or aliased) into internal registers that are hidden from the programmer. The registers and memory addresses need to have their values mapped to a temporary value for processing. Currently 4 micro-ops can go through translation every cycle.
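
The aliasing idea can be sketched in a few lines of C. This is not Intel's actual mechanism, just a simplified model under two assumptions: reads use whichever internal register currently holds an architectural value, and every write is handed a fresh internal register. The rename_src and rename_dst helpers are invented for this sketch, and real hardware also recycles internal registers through a free list, which is omitted here.

/* Sketch of register aliasing (renaming). Illustrative only. */
#include <stdio.h>

#define ARCH_REGS 4            /* pretend we only have AX, BX, CX, DX */
#define PHYS_REGS 16           /* pool of hidden internal registers   */

static const char *arch_names[ARCH_REGS] = { "AX", "BX", "CX", "DX" };
static int map[ARCH_REGS];     /* architectural register -> internal register */
static int next_phys = 0;

/* Reads go to whichever internal register currently holds the value. */
static int rename_src(int arch) { return map[arch]; }

/* Writes get a fresh internal register, so they cannot stomp on a value
   that an older in-flight instruction still needs to read. */
static int rename_dst(int arch) {
    map[arch] = next_phys++ % PHYS_REGS;   /* no free-list handling in this sketch */
    return map[arch];
}

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++) map[i] = next_phys++;

    /* ADD AX, BX reads AX and BX and writes AX. */
    int s1 = rename_src(0), s2 = rename_src(1), d = rename_dst(0);
    printf("ADD %s, %s -> reads p%d, p%d; writes p%d\n",
           arch_names[0], arch_names[1], s1, s2, d);

    /* A later MOV AX, CX writes AX again and gets yet another internal
       register, so the ADD above can still complete into its own copy. */
    s1 = rename_src(2); d = rename_dst(0);
    printf("MOV %s, %s -> reads p%d; writes p%d\n",
           arch_names[0], arch_names[2], s1, d);
    return 0;
}

Because the second write to AX lands in a different internal register, it cannot stomp on the value the earlier ADD is still using, and both micro-ops can be in flight at once.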

After micro-op translation is complete, all of the instruction’s micro-ops enter a reorder buffer, or ROB. The ROB currently holds up to 128 micro-ops.

These micro-ops are now ready for processing. They are placed in the Reservation Station (RS). The RS currently can hold 36 micro-ops at any one time.

Now the magic of the OOO core happens. The micro-ops are processed simultaneously on multiple execution units, and each execution unit runs as fast as it can. Micro-ops can be processed out of order as long as their data is ready, sometimes skipping over unready micro-ops for a long time while working on other micro-ops that are ready.
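
Here is a small, invented sketch of that "run whatever is ready" behavior: four micro-ops sit in a toy reservation station, the first is a slow load, and the scheduler issues anything whose inputs are available. The micro-ops, latencies, and single-dependency model are made up purely for illustration.

/* Toy reservation station: issue micro-ops as soon as their inputs are
   ready, not in program order. Illustrative only. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    const char *text;
    int dep;        /* index of an older micro-op this one waits on, or -1 */
    int latency;    /* cycles to execute once issued */
    int finish;     /* cycle in which it completes, once issued */
    bool issued, done;
} MicroOp;

int main(void) {
    MicroOp rs[] = {
        { "uop0: load r1 <- [mem] (slow)", -1, 3 },
        { "uop1: add  r2 <- r1 + 4",        0, 1 },   /* waits on uop0 */
        { "uop2: mul  r3 <- r4 * r5",      -1, 1 },   /* independent   */
        { "uop3: sub  r6 <- r3 - 1",        2, 1 },   /* waits on uop2 */
    };
    const int n = 4;
    int completed = 0;

    for (int cycle = 1; completed < n; cycle++) {
        /* Issue: anything whose dependency has completed may start now. */
        for (int i = 0; i < n; i++) {
            if (!rs[i].issued && (rs[i].dep < 0 || rs[rs[i].dep].done)) {
                rs[i].issued = true;
                rs[i].finish = cycle + rs[i].latency - 1;
                printf("cycle %d: issue    %s\n", cycle, rs[i].text);
            }
        }
        /* Complete: anything whose execution time has elapsed finishes here. */
        for (int i = 0; i < n; i++) {
            if (rs[i].issued && !rs[i].done && cycle >= rs[i].finish) {
                rs[i].done = true;
                completed++;
                printf("cycle %d: complete %s\n", cycle, rs[i].text);
            }
        }
    }
    return 0;
}

The trace it prints shows uop2 and uop3 finishing before uop1 has even started, because uop1 is stuck waiting on the slow load while independent work flows around it.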

The original Pentium Pro OOO core had six execution units: two integer processors, one floating-point processor, a load unit, a store address unit, and a store data unit.

So what happens when the code hits an 'if' statement or a 'switch' statement? Speculative execution was the solution. When a conditional statement (such as an 'if' block) is encountered, the OOO core guesses which way the branch will go and simply keeps decoding and running instructions along that path before the condition is actually known. As soon as the core figures out whether the guess was correct, the results from the wrongly guessed path are discarded.
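
Speculation is hard to observe directly, but its effect shows up in ordinary code. The sketch below is loosely based on the well-known sorted-versus-unsorted branching experiment: both passes do identical work, but in the sorted case the core's guesses about the 'if' are almost always right, so very little speculative work is thrown away. Exact results vary by compiler and CPU, and an optimizer may replace the branch with a conditional move and hide the effect, so treat this as an experiment to try rather than a guaranteed outcome.

/* Sorted vs. unsorted branching experiment. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000
#define REPS 100

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static long long sum_big(const int *v) {
    long long sum = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            if (v[i] >= 128)          /* the branch being guessed */
                sum += v[i];
    return sum;
}

int main(void) {
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;       /* random values: branch is unpredictable */

    clock_t t0 = clock();
    long long unsorted_sum = sum_big(data);
    clock_t t1 = clock();

    qsort(data, N, sizeof data[0], cmp_int);   /* now the branch is predictable */

    clock_t t2 = clock();
    long long sorted_sum = sum_big(data);
    clock_t t3 = clock();

    printf("unsorted: sum=%lld  %.2fs\n", unsorted_sum, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted:   sum=%lld  %.2fs\n", sorted_sum,   (double)(t3 - t2) / CLOCKS_PER_SEC);
    return 0;
}

On many machines the sorted pass runs noticeably faster, even though it is summing exactly the same numbers.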

Finally, CPUs with Hyper-Threading enabled expose two virtual processors for a single, shared OOO core. The two front ends share the reorder buffer and the OOO core, but they appear to the operating system as two separate processors.
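
From software, this is visible only as a larger processor count. On Linux and most other Unix-like systems, a C program can ask how many logical processors the operating system sees; on a hyper-threaded machine that count includes the virtual siblings.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Logical processors currently online; with Hyper-Threading this is
       typically twice the number of physical cores. */
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors visible to the OS: %ld\n", cpus);
    return 0;
}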

 
