From: mark@hubcap.clemson.edu (Mark Smotherman)
Newsgroups: comp.sys.m88k
Subject: Re: Infos on 88110
Date: 29 Apr 94 02:25:27 GMT
Organization: Clemson University
Lines: 199
Message-ID:
References: <1994Apr28.111853.29366@roche.com>
NNTP-Posting-Host: hubcap.clemson.edu

Motorola MC88110 Overview
Mark Smotherman
April 1994

The 88110 is a superscalar implementation of Motorola's M88K RISC
architecture. The 88110 extends the architecture by introducing a
separate floating-point register file and new graphics instructions.
The design provides dispatch of up to two instructions per cycle to
the ten functional units, out-of-order issue for stores and one branch
from store buffers and a branch reservation station, speculative
execution beyond conditional branches, and exception and branch
misprediction recovery using a history buffer.

One of the noteworthy features of the 88110 is its large set of
functional units. Each of these units, except the divide unit, either
completes in a single cycle or is fully pipelined and thus able to
receive a new instruction on each clock cycle.

Hardware design  -- Single-chip design, 1.5M transistors
I-cache          -- 8KB, 32-byte line size, 2-way set associative,
                    physically addressed, pseudo-random replacement,
                    software i-cache coherency
Fetch width      -- 2 instructions, unless the instruction pair
                    starts with the last instruction in a cache line
Decoder width    -- 2 instructions
Max issue/cycle  -- Up to two instructions can be issued; no address
                    alignment or instruction type restrictions on
                    issue (``symmetric'' issue)
Inst. window     -- Reservation stations for branches and stores
Execution order  -- Program-order issue except for stores and
                    branches, speculative execution, out-of-order
                    completion
Branch predict   -- Static branch prediction based on opcode choice;
                    32-entry Branch Target Instruction Cache with two
                    target instructions per entry, fully associative,
                    virtually addressed, FIFO replacement, software
                    invalidate on context switch
Recovery method  -- Instructions issued past branch are tagged as
                    conditional and flushed if branch is mispredicted;
                    registers already written are restored using a
                    history buffer of 12 entries, repair rate of two
                    registers restored per cycle
Function units   -- 1 instruction / branch unit
                    1 data cache unit
                    2 integer units (32-bit operands)
                    1 bit-field unit (32-bit operands)
                    1 floating-point add unit (80-bit fp operands)
                    1 multiply unit (64-bit int., 80-bit fp)
                    1 divide unit (64/80-bit operands)
                    2 graphics units (64-bit operands)
Integer regs.    -- 32 32-bit registers (88100 code uses these for FP)
FP regs.         -- 32 80-bit registers
                    (both register files are 8-ported: 4 read,
                    2 write, 2 history)
Dep. checking    -- Register scoreboard
Latencies        -- Integer add/sub   Issue = 1      Result = 1
                    Integer mul       Issue = 1      Result = 3
                    Integer div       Issue = 18     Result = 18
                    FP cmp            Issue = 1      Result = 1
                    FP add/sub        Issue = 1      Result = 3
                    FP mul            Issue = 1      Result = 3
                    FP div            Issue = 13-26  Result = 13-26
Data cache       -- 8KB, 32-byte line size, 2-way set associative,
                    physically addressed, either write-through or
                    write-back with write-allocate software selectable
                    on page or block basis, non-blocking, prefetch and
                    zero-allocate instructions available as well as
                    non-allocating store-through instructions,
                    pseudo-random replacement, dual tags for snooping,
                    MESI cache coherency based on write invalidates
Load use penalty -- One cycle
Load bypass      -- Yes
Hardware support -- 4-entry load queue and 3-entry store instruction
                    reservation station; tagged (conditional)
                    load/stores cannot change the cache until the
                    branch is resolved
Page size        -- 4K bytes
Data TLB         -- 32 page entries, fully associative, FIFO or
                    software-managed replacement; 8 block entries,
                    blocks are variable in size to 64MB, fully
                    associative, software-managed replacement
Inst. TLB        -- same as Data TLB
Exceptions       -- Precise exceptions occur in program order by
                    allowing all prior instructions to complete; the
                    register files are restored to state just prior to
                    the excepting instruction using the history buffer
Interrupts       -- Precise interrupts occur by aborting all
                    incomplete instructions and restoring the register
                    files for out-of-order completions using the
                    history buffer

Instruction Sequencer

The instruction unit of the 88110 performs instruction fetch, decode,
and issue along with data flow control, branch execution, and
exception handling. On each cycle a pair of instructions is decoded
and then dispatched in program order to the proper execution units
along with the associated operands, assuming there are no resource or
data conflicts.
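
As a rough illustration of the fetch-width rule in the summary above,
the following Python sketch computes how many instructions a single
I-cache access returns and how many sets the cache has. The cache
size, line size, and associativity are taken from the summary; the
4-byte instruction size is the fixed 32-bit M88K instruction width.
The function names are made up for this sketch, and fetches from the
branch target instruction cache are not modelled.

```python
LINE_BYTES = 32          # I-cache line size (from the summary)
CACHE_BYTES = 8 * 1024   # I-cache capacity (from the summary)
WAYS = 2                 # 2-way set associative (from the summary)
INSTR_BYTES = 4          # M88K instructions are fixed 32-bit words


def sets_in_cache():
    """Number of sets: capacity divided by (line size * ways)."""
    return CACHE_BYTES // (LINE_BYTES * WAYS)


def fetch_width(addr):
    """Instructions returned by one I-cache fetch starting at addr.

    A fetch that starts at the last word of a line returns one
    instruction; any other word-aligned fetch returns two.
    """
    if addr % INSTR_BYTES != 0:
        raise ValueError("instruction addresses must be word aligned")
    last_slot = LINE_BYTES - INSTR_BYTES   # byte offset 28 in the line
    return 1 if addr % LINE_BYTES == last_slot else 2
```

For example, fetch_width(0x1018) gives 2 while fetch_width(0x101C)
gives 1, since offset 0x1C is the eighth and last word of its line.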

Two instructions are fetched each cycle from the instruction cache by
the instruction sequencer, unless a fetch is made to the last word in
a cache line, in which case only one instruction is obtained.
Additionally, two instructions are fetched per cycle from a branch
target instruction cache (TIC) if possible. A pair of instructions
cannot be placed in the TIC if they cross an instruction cache line.

Data conflicts are recognized by a register ``scoreboard''. An
instruction that writes to a register will set a lock bit for that
register in the scoreboard. Subsequent instructions with RAW and WAW
dependencies on this register are then stalled until the register is
updated and unlocked. Because the sequencer reads source registers for
both instructions at the same time, instruction pairs with a WAR
dependency between the first instruction and the second can be issued
in the same cycle. Stores and conditional branches can be dispatched
even when the source register is locked.

The sequencer also keeps track of the availability of execution units.
Due to the rich set of execution units, two instructions can be
dispatched under many circumstances. However, if the instruction unit
is not able to dispatch the first instruction of the pair, due to an
unavailable resource, neither instruction will be issued, even if the
resources of the second are available.

Unlike some other dual-issue processors (e.g., DEC Alpha 21064), the
88110 instruction sequencer is aggressive and will attempt to keep two
instructions in the decode stage during each cycle. That is, if the
first instruction of a pair can be issued and the second cannot, the
first instruction is sent, the second instruction is moved into the
first decode slot, and another instruction is fetched into the second
decode slot.

The 88110 has two write-back buses for destination registers (each
80 bits wide).
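
The scoreboard and pairing rules described above can be sketched in a
few lines of Python. This is a deliberately simplified model, not a
description of the actual hardware logic: the Instr record, the unit
names, and the function names are all hypothetical, and the model
ignores latencies, write-back bus arbitration, and reservation station
capacity.

```python
from collections import namedtuple

# Hypothetical instruction record for this sketch:
#   unit                  name of the execution unit required
#   dest                  destination register, or None
#   srcs                  tuple of source registers
#   ignores_locked_srcs   True for stores and conditional branches,
#                         which may dispatch with a locked source
Instr = namedtuple("Instr", "unit dest srcs ignores_locked_srcs")


def can_dispatch(ins, locked, free_units):
    """Check unit availability and scoreboard lock bits."""
    if free_units.get(ins.unit, 0) == 0:
        return False                      # no free execution unit
    if ins.dest is not None and ins.dest in locked:
        return False                      # WAW on a locked register
    if not ins.ignores_locked_srcs and any(s in locked for s in ins.srcs):
        return False                      # RAW on a locked register
    return True                           # WAR is never checked


def dispatch_pair(slots, locked, free_units):
    """Dispatch up to two decode slots in program order.

    Updates locked and free_units in place and returns the list of
    instructions dispatched. If the first slot stalls, nothing is
    dispatched; if only the second stalls, the caller would move it
    into the first slot for the next cycle.
    """
    issued = []
    for ins in slots[:2]:
        if not can_dispatch(ins, locked, free_units):
            break
        free_units[ins.unit] -= 1
        if ins.dest is not None:
            locked.add(ins.dest)          # set the scoreboard lock bit
        issued.append(ins)
    return issued
```

In this model a second instruction that reads the first one's
destination stalls, because the first sets the lock bit before the
second is checked, while a store in the second slot still dispatches.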
Due to the variance of latencies between functional units, it is
possible that three or more instructions may attempt to use the
write-back buses for destination registers on a given cycle, causing a
pipeline stall for some instructions. Arbitration between the
instructions favors results from lower-cycle-count instructions over
results from larger-cycle-count instructions. This is an important
aspect of scheduling, since it may be the case that the
larger-cycle-count instructions are on the critical path of the
program and thereby limit best performance.

Branch Handling

The 88110 retains the delayed branches of the 88100 but also adds
normal branches. Branch execution on the 88110 uses static branch
prediction to choose the path of the branch when the condition is not
yet known. Once a path has been chosen, speculative execution
proceeds, allowing instructions from the predicted, but possibly
incorrect, path to be issued and executed with results written to the
register file. A history buffer is used to restore the correct
register state prior to a mispredicted branch.

Load/Store Handling

The load/store unit, which is critical to overall performance, is the
most sophisticated unit of the 88110 and is one of the most
sophisticated data units of all commercial microprocessors to date.
One load or one store instruction can be issued on each clock cycle.
Loads have a latency of two cycles on a data cache hit, and they are
allowed to bypass stores if there is no address conflict. Load
operations are buffered in a four-deep FIFO queue while store
instructions are held in a three-deep reservation station.

Stores can be issued even if the data to store is not yet available;
thus, dependent stores can be scheduled in the same cycle as the
value-producing instruction.

Bus

The bus has 32 address lines and 64 data lines. It is a
split-transaction, pipelined design.
Data transfers occur either in single beat (byte, half-word, word, or
doubleword) or burst mode (4 doublewords). Burst transfers use
critical-word first with wrap-around in order to quickly refill cache
lines and allow early restart of CPU operations.

L2 Cache Controller

The 88410 is the second-level cache controller and can support up to a
1MB direct-mapped cache. Line size is selectable at 32 bytes or 64
bytes. Write policy and cache coherency follow the on-chip data cache.
Multi-level inclusion is used, so that anything in the L1 cache must
be in the L2 cache. A separate set of tags is used to track inclusion
and can filter non-hitting snoop transactions away from the processor.

An extensive article on the 88110 can be found in the April 1992 issue
of IEEE Micro.

--
Mark Smotherman, CS Dept., Clemson University, Clemson, SC 29634-1906
(803) 656-5878, mark@cs.clemson.edu or mark@hubcap.clemson.edu