CPU Architecture Overview

Modern microprocessors are among the most complex systems ever created by humans. A single silicon chip, roughly the size of a fingernail, can contain a complete high-performance processor, large cache memories, and the logic required to interface it to external devices. A processor executes a sequence of instructions, each of which performs some primitive operation, such as adding two numbers. An instruction is encoded in binary form as a sequence of one or more bytes. The term microprocessor is synonymous with central processing unit (CPU), the processing component used in traditional computers, tablets, smartphones, and other devices.
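
As a concrete illustration of "one or more bytes", the sketch below lists the machine-code bytes of two instructions. The encodings shown are for the x86 architecture (an assumption made for the example; other architectures use different, often fixed-length, encodings), and the small C program merely prints how long each encoded instruction is.

    #include <stdio.h>

    /* x86 machine code, shown here only as data. */
    static const unsigned char add_eax_ebx[] = { 0x01, 0xD8 }; /* add eax, ebx : 2 bytes */
    static const unsigned char ret_insn[]    = { 0xC3 };       /* ret          : 1 byte  */

    int main(void) {
        printf("add eax, ebx is %zu bytes; ret is %zu byte\n",
               sizeof add_eax_ebx, sizeof ret_insn);
        return 0;
    }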







Arithmetic Logic Unit (ALU)

The ALU performs the computing functions of the microprocessor. An arithmetic logic unit (ALU) is a digital circuit that carries out arithmetic and logic operations; it is the part of the CPU that actually processes data. The ALU takes data from CPU registers, processes it, and copies the result back into registers before moving on to the next batch of data.
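
Each of the C operators below typically maps to a single ALU operation of the kind just described. A compiler may fold constant expressions at compile time, so the operands here come from argc to make sure the ALU does the work at run time:

    #include <stdio.h>

    int main(int argc, char **argv) {
        (void)argv;
        unsigned a = (unsigned)argc + 28;  /* runtime value, so the work is not done by the compiler */
        unsigned b = 13;

        unsigned sum  = a + b;   /* arithmetic: addition    */
        unsigned diff = a - b;   /* arithmetic: subtraction */
        unsigned conj = a & b;   /* logic: bitwise AND      */
        unsigned shl  = a << 2;  /* logic: left shift       */

        printf("%u %u %u %u\n", sum, diff, conj, shl);
        return 0;
    }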





CPU Registers

Registers are the ALU's workbenches. They are memory circuits located inside the CPU that hold data before and after processing. The control unit tells the ALU what operation to perform on the data stored in the CPU registers, and the ALU stores the result in an output register. Most CPU operations are performed by one or more ALUs, which load data from input registers. The control unit (CU) moves data between these registers.
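
The register-to-ALU-to-register flow can be made explicit with inline assembly. This is a minimal sketch assuming GCC or Clang on an x86-64 system; the "r" constraints ask the compiler to place the operands in registers, and the single addq is the ALU operation whose result lands back in a register:

    #include <stdio.h>

    /* a + b via an explicit register-to-ALU-to-register round trip (x86-64, GCC/Clang only). */
    static long add_via_registers(long a, long b) {
        long result;
        __asm__("addq %2, %0"         /* ALU: add the register holding b into the register holding a */
                : "=r"(result)        /* output: a register the compiler picks                        */
                : "0"(a), "r"(b));    /* inputs: a shares the output register, b gets its own         */
        return result;
    }

    int main(void) {
        printf("%ld\n", add_via_registers(40, 2));  /* prints 42 */
        return 0;
    }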




Floating Point Unit (FPU)

The Floating Point Unit (FPU) is the CPU component that handles calculations based on the IEEE floating-point standard, which defines how real numbers (numbers that can contain fractional parts) are represented in CPU calculations. A processor may contain multiple ALUs, floating point units, and other execution units, which means that certain instruction sequences can execute very quickly while others cannot.
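
To see the encoding the FPU works with, the snippet below prints the raw bit pattern of a float. It assumes, as is true on essentially all current hardware, that float is the IEEE 754 single-precision format:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = 1.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* view the bit pattern without violating aliasing rules */
        /* 1.5f -> 0x3FC00000: sign 0, biased exponent 0x7F (127), fraction 0x400000 */
        printf("%f -> 0x%08X\n", f, (unsigned)bits);
        return 0;
    }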




Instruction Pipelining

The term pipeline refers to the discrete series of steps that the CPU follows to process instructions. Instruction pipelining is a technique that implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows higher CPU throughput (the number of instructions that can be executed in a unit of time) than would otherwise be possible at a given clock rate. Early pre-Pentium CPUs had only a single pipeline and thus could only process a single instruction at a time. The Pentium introduced a dual-pipeline design that enabled the CPU to process two instructions simultaneously, and later CPUs have even more pipelines and can thus process more instructions at once.
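
The throughput benefit can be captured with a simple back-of-the-envelope model. Assuming an idealized pipeline with no stalls or hazards, the first instruction needs one cycle per stage and every later instruction completes one cycle after its predecessor:

    #include <stdio.h>

    /* Idealized k-stage pipeline with no stalls: total cycles = k + (n - 1). */
    static unsigned long long pipeline_cycles(unsigned stages, unsigned long long n_instr) {
        return n_instr == 0 ? 0 : stages + (n_instr - 1);
    }

    int main(void) {
        /* 5 stages, 1,000,000 instructions: 1,000,004 cycles,
           so throughput approaches one instruction per clock. */
        printf("%llu cycles\n", pipeline_cycles(5, 1000000ULL));
        return 0;
    }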




Superscalar Architecture

If one pipeline is good, more are better. Using multiple pipelines allows multiple instructions to be processed in parallel, an architecture called superscalar. A superscalar processor processes multiple instructions per clock cycle.
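
Whether a superscalar core can actually complete several instructions per cycle depends on the instructions being independent. The sketch below (results will vary by compiler and CPU) sums an array first with one dependency chain and then with two independent chains that a superscalar core can overlap:

    #include <stdio.h>
    #include <stddef.h>

    /* One dependency chain: every add must wait for the previous one. */
    static long sum_serial(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Two independent chains: the adds into s0 and s1 do not depend on
       each other, so a superscalar core can issue them in the same cycle. */
    static long sum_two_chains(const long *a, size_t n) {
        long s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n)
            s0 += a[i];
        return s0 + s1;
    }

    int main(void) {
        long data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        printf("%ld %ld\n", sum_serial(data, 8), sum_two_chains(data, 8));
        return 0;
    }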



CPU Bus Interfaces

Bus interfaces are the pathways that connect the processor to memory and other components. A bus is a subsystem that transfers data between computer components. For example, older processors connect to the chipset's northbridge via a dedicated bus called the Front Side Bus (FSB). Common bus types include the Front Side Bus (FSB), which carries data between the CPU and the memory controller hub; the Direct Media Interface (DMI), a point-to-point interconnect between an Intel memory controller hub (or, in later designs, the CPU itself) and an Intel I/O controller hub on the computer's motherboard; and the QuickPath Interconnect (QPI), a point-to-point interconnect that links a CPU with an integrated memory controller to the I/O hub and to other CPUs in multi-socket systems.
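
Peak bus bandwidth is simply the base clock times the transfers per clock times the bus width. The figures in the example below are assumptions chosen to match a typical quad-pumped, 64-bit FSB:

    #include <stdio.h>

    /* Peak theoretical bandwidth in bytes/second:
       base clock (Hz) x transfers per clock x bus width (bytes). */
    static double bus_bandwidth(double base_clock_hz, double transfers_per_clock,
                                double width_bytes) {
        return base_clock_hz * transfers_per_clock * width_bytes;
    }

    int main(void) {
        /* Assumed figures: a quad-pumped 333 MHz FSB ("1333 MT/s"), 64 bits wide. */
        printf("%.1f GB/s peak\n", bus_bandwidth(333e6, 4.0, 8.0) / 1e9);   /* about 10.7 GB/s */
        return 0;
    }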





CPU Cache

To bridge the gap between the CPU and slow main RAM, a special type of memory called Static RAM (SRAM) is used as a cache. The cache is a smaller, faster memory that stores copies of the data from frequently used main memory locations. Normal system memory is made of Dynamic RAM (DRAM); this type of memory can hold data for only a short duration before it needs to be refreshed (re-energized) or it will lose its contents. SRAM cache memory, in contrast, never needs to be refreshed and is therefore much faster than DRAM.

All modern (fast) CPUs, with few specialized exceptions, have multiple levels of CPU cache. The first CPUs that used a cache had only one level of cache, not split into L1d (for data) and L1i (for instructions). Almost all current CPUs with caches have a split L1 cache. They also have L2 caches and, for larger processors, L3 caches as well. The L2 cache is usually not split and is often shared between cores. The L3 cache, and higher-level caches, are shared and not split.

When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory. A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. The time taken to fetch one cache line from memory (the read latency due to a cache miss) matters because the CPU will run out of things to do while waiting for the cache line; when a CPU reaches this state, it is said to stall. As CPUs become faster relative to main memory, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory.

Various techniques have been employed to keep the CPU busy during this time, including out-of-order execution, in which the CPU (Pentium Pro and later Intel designs, for example) attempts to execute independent instructions after the instruction that is waiting for the cache miss data. Another technology, used by many processors, is simultaneous multithreading (SMT), or, in Intel's terminology, hyper-threading (HT), which allows an alternate thread to use the CPU core while the first thread waits for required CPU resources to become available. Just as the amount of main system RAM matters for fast access to the data you are working on, so does cache size: the larger the caches, the more of the data the CPU needs can be accessed quickly.
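
The practical consequence of cache lines and misses is that memory access order matters. The sketch below sums the same matrix twice: the first loop walks memory in layout order and fully reuses each fetched cache line (assumed here to be 64 bytes), while the second strides across rows and touches a new cache line on nearly every access, causing far more misses:

    #include <stdio.h>

    #define N 1024

    static double m[N][N];   /* C stores this row-major: m[i][0..N-1] are contiguous */

    /* Walks memory in layout order, so each cache line is fully used once fetched. */
    static double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Strides N * sizeof(double) bytes between accesses, missing far more often. */
    static double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }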







Intel Smart Cache

CPU Cache is an area of fast memory located on the processor. Intel Smart Cache refers to the architecture that allows all cores to dynamically share access to the last level cache.





CPU Clock Speed

The CPU clock speed is a measurement of how many clock cycles a CPU executes per second. One cycle per second is equal to one hertz (Hz); modern CPU clock speeds are measured in gigahertz (GHz), or billions of cycles per second. Clock speed determines how fast instructions execute: some instructions require one cycle, others require multiple cycles, and some processors execute multiple instructions during one cycle. Complex Instruction Set Computer (CISC) processors use complex instructions; each requires many clock cycles to execute but accomplishes a lot of work. Reduced Instruction Set Computer (RISC) processors use fewer, simpler instructions; each takes few clock cycles but accomplishes relatively little work. These differences in efficiency mean that one CPU cannot be directly compared to another purely on the basis of clock speed. The comparison is complicated because different CPUs have different strengths and weaknesses. For example, the Athlon is generally faster than the Pentium 4 clock for clock on both integer and floating-point operations (that is, it does more work per CPU cycle), but the Pentium 4 has an extended instruction set that may allow it to run optimized software literally twice as fast as the Athlon. The only safe use of direct clock speed comparisons is within a single family of CPUs.

The clock speed of a CPU is determined by the maximum speed of the CPU itself and the maximum speed that the motherboard is capable of handling. The motherboard clock speed is governed by an onboard component called the quartz crystal circuit, which oscillates at a fixed frequency when fed current. The quartz crystal circuit sets the tempo for the CPU by firing a small electrical charge with each oscillation onto a special wire on the CPU, the CLK (clock) wire. Each pulse from the quartz crystal circuit onto the CLK wire equals one clock cycle on the CPU.
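
A quick way to make "billions of cycles per second" concrete is to compute the duration of a single cycle:

    #include <stdio.h>

    /* One gigahertz is one cycle per nanosecond, so the cycle time
       in nanoseconds is simply 1 / (frequency in GHz). */
    static double cycle_time_ns(double ghz) {
        return 1.0 / ghz;
    }

    int main(void) {
        printf("3.0 GHz -> %.3f ns per cycle\n", cycle_time_ns(3.0));   /* about 0.333 ns */
        return 0;
    }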






CPU Clock Multipliers

The clock multiplier is the mechanism the CPU uses to run its internal clock at a faster frequency than the externally supplied clock. For example, a system with an external clock of 133 MHz and a 10x clock multiplier will have an internal CPU clock of 1.33 GHz. The external address and data buses of the CPU (often collectively termed the front side bus, or FSB, in PC contexts) also use the external clock as a fundamental timing base; however, they can also employ a (small) multiple of this base frequency (typically two or four) in order to transfer data faster. The internal frequency of a microprocessor is usually based on the front side bus (FSB) frequency: to calculate the internal frequency, the CPU multiplies the bus frequency by the clock multiplier. For this calculation, the CPU uses the actual bus frequency, not the effective bus frequency. To determine the actual bus frequency for processors that use double data rate (DDR) buses (AMD Athlon and Duron) and quad data rate buses (all Intel microprocessors starting with the Pentium 4), the effective bus speed should be divided by 2 for AMD or 4 for Intel.

Clock multipliers on many modern processors are fixed; it is usually not possible to change them. Some versions of processors have unlocked clock multipliers; that is, they can be "overclocked" by increasing the clock multiplier setting in the motherboard's BIOS setup program. Many Intel CPUs have the maximum clock multiplier locked: these CPUs may be underclocked (run at a lower frequency), but they cannot be overclocked by raising the clock multiplier beyond what the CPU design intends. Processors that cannot be overclocked by increasing their clock multiplier can still be overclocked by another technique: increasing the FSB frequency. Note that the internal CPU clock speed is not the same as the external clock speed the CPU uses to talk to the rest of the PC.
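
The multiplier arithmetic described above is easy to check numerically. The figures in the example (an 800 MT/s quad-pumped bus with a 15x multiplier) are assumptions chosen for illustration:

    #include <stdio.h>

    /* Internal CPU clock = actual bus clock x multiplier, where the actual
       bus clock is the effective (marketed) rate divided by the data rate:
       2 for DDR buses, 4 for quad-pumped buses. */
    static double internal_clock_mhz(double effective_bus_mhz, int data_rate,
                                     double multiplier) {
        return (effective_bus_mhz / data_rate) * multiplier;
    }

    int main(void) {
        /* Assumed example: 800 MT/s quad-pumped bus (actual 200 MHz), 15x multiplier. */
        printf("%.0f MHz internal clock\n", internal_clock_mhz(800.0, 4, 15.0));  /* 3000 MHz */
        return 0;
    }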






Cores

In CPU terminology, cores are the independent central processing units contained in a single computing component (die or chip); the core count describes how many of them the chip has.
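
On Linux and other POSIX-like systems, the number of logical processors can be queried at run time with sysconf. Note that with SMT/hyper-threading enabled this reports hardware threads rather than physical cores:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* _SC_NPROCESSORS_ONLN is a common glibc/BSD extension, not strict POSIX. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        if (n < 1) {
            perror("sysconf");
            return 1;
        }
        printf("%ld logical processors online\n", n);
        return 0;
    }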







Instruction Set

An instruction set refers to the basic set of commands and instructions that a microprocessor understands and can carry out.






Instruction Set Extensions

Instruction Set Extensions are additional instructions which can increase performance when the same operations are performed on multiple data objects. These can include SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions).
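
The sketch below uses SSE intrinsics to perform four single-precision additions with one instruction, which is exactly the "same operation on multiple data objects" pattern these extensions target. It assumes an x86 CPU with SSE support and a compiler flag such as -msse:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics (x86 only) */

    int main(void) {
        float a[4] = {  1.0f,  2.0f,  3.0f,  4.0f };
        float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        float r[4];

        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        __m128 vr = _mm_add_ps(va, vb);   /* one SIMD instruction, four additions */
        _mm_storeu_ps(r, vr);

        printf("%.0f %.0f %.0f %.0f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }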