From: mark@hubcap.clemson.edu (Mark Smotherman)
Newsgroups: comp.sys.m88k
Subject: Re: Infos on 88110
Date: 29 Apr 94 02:25:27 GMT
Organization: Clemson University
Lines: 199
Message-ID: <mark.767586327@hubcap>
References: <1994Apr28.111853.29366@roche.com>
NNTP-Posting-Host: hubcap.clemson.edu


Motorola MC88110 Overview

Mark Smotherman
April 1994


The 88110 is a superscalar implementation of Motorola's M88K RISC
architecture.  The 88110 extends the architecture by introducing a
separate floating-point register file and new graphics instructions.
The design provides dispatch of up to two instructions per cycle to
the ten functional units, out-of-order issue of stores from a store
reservation station and of one branch from a branch reservation
station, speculative execution beyond conditional branches, and
exception and branch misprediction recovery using a history buffer.

One of the noteworthy features of the 88110 is its large set of 
functional units.  Each of these units, except the divide unit, either
completes in a single cycle or is fully pipelined and thus able to
receive a new instruction on each clock cycle.


Hardware design -- Single-chip design, 1.5M transistors

I-cache		-- 8KB, 32-byte line size, 2-way set associative,
		   physically addressed, pseudo-random replacement,
		   software i-cache coherency
Fetch width	-- 2 instructions, unless the instruction pair starts
		   with the last instruction in a cache line
Decoder width	-- 2 instructions
Max issue/cycle	-- Up to two instructions can be issued; no address
		   alignment or instruction type restrictions on
		   issue (``symmetric'' issue)

Inst. window	-- Reservation stations for branches and stores
Execution order -- Program-order issue except for stores and branches,
		   speculative execution, out-of-order completion

Branch predict	-- Static branch prediction based on opcode choice;
		   32-entry Branch Target Instruction Cache with two
		   target instructions per entry, fully associative,
		   virtually addressed, FIFO replacement, software
		   invalidate on context switch
Recovery method	-- Instructions issued past a branch are tagged as
		   conditional and flushed if the branch is mispredicted;
		   registers already written are restored using a
		   history buffer of 12 entries, repair rate of two
		   registers restored per cycle

Function units	--	1 instruction / branch unit
			1 data cache unit
			2 integer units (32-bit operands)
			1 bit-field unit (32-bit operands)
			1 floating-point add unit (80-bit fp operands)
			1 multiply unit (64-bit int., 80-bit fp)
			1 divide unit (64/80-bit operands)
			2 graphics units (64-bit operands)

Integer regs.	-- 32 32-bit registers (88100 code uses these for FP)
FP regs.	-- 32 80-bit registers
		   (both register files are 8-ported: 4 read, 2 write,
		   2 history)
Dep. checking	-- Register scoreboard

Latencies	--	Integer add/sub	 Issue =  1	Result =  1
			Integer mul	 Issue =  1	Result =  3
			Integer div	 Issue = 18	Result = 18
			FP cmp		 Issue =  1	Result =  1
			FP add/sub	 Issue =  1	Result =  3
			FP mul		 Issue =  1	Result =  3
			FP div		 Issue = 13-26	Result = 13-26

Data cache	-- 8KB, 32-byte line size, 2-way set associative,
		   physically addressed, either write-through or
		   write-back with write-allocate software selectable
		   on page or block basis, non-blocking, prefetch and
		   zero-allocate instructions available as well as
		   non-allocating store-through instructions, pseudo-
		   random replacement, dual tags for snooping, MESI
		   cache coherency based on write invalidates
Load use penalty-- One cycle
Load bypass	-- Yes
Hardware support-- 4-entry load queue and 3-entry store instruction
		   reservation station; tagged (conditional) load/stores
		   cannot change the cache until the branch is resolved

Page size	-- 4K bytes
Data TLB	-- 32 page entries, fully associative, FIFO or software-
		   managed replacement; 8 block entries, blocks are
		   variable in size up to 64MB, fully associative,
		   software-managed replacement
Inst. TLB	-- same as Data TLB

Exceptions	-- Precise exceptions occur in program order by allowing
		   all prior instructions to complete; the register files
		   are restored to state just prior to the excepting
		   instruction using the history buffer
Interrupts	-- Precise interrupts occur by aborting all incomplete
		   instructions and restoring the register files for
		   out-of-order completions using the history buffer
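As a worked example of the cache parameters above, the Python sketch
below splits a 32-bit physical address for the 8KB, 2-way, 32-byte-line
configuration.  The function names are hypothetical; the only inputs are
the sizes in the spec sheet, and the 32-bit address width matches the 32
address lines of the bus.  It shows that the set index and line offset
together use 12 bits, exactly the byte offset within a 4K page.

```python
# Hypothetical helpers: derive a cache's tag/index/offset split from the
# spec-sheet sizes (8KB, 2-way, 32-byte lines, 32-bit physical address).

def geometry(cache_bytes, line_bytes, ways, addr_bits=32):
    """Return (sets, offset_bits, index_bits, tag_bits); sizes must be powers of two."""
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, offset_bits, index_bits, tag_bits

def split(addr, offset_bits, index_bits):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

sets, off, idx, tag = geometry(8 * 1024, 32, 2)
print(sets, off, idx, tag)          # 128 5 7 20
# Set index plus byte offset covers exactly the 12-bit offset of a 4KB page.
print(off + idx == 12)              # True
print(split(0x12345678, off, idx))  # (74565, 51, 24)
```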


Instruction Sequencer

The instruction unit of the 88110 performs instruction fetch, decode,
and issue along with data flow control, branch execution, and exception
handling.  On each cycle a pair of instructions is decoded and then
dispatched in program order to the proper execution units along with
the associated operands, assuming there are no resource or data conflicts.

The instruction sequencer fetches two instructions each cycle from the
instruction cache, unless the fetch is made to the last word in a cache
line, in which case only one instruction is obtained.  Additionally, two
instructions per cycle can be fetched from the branch target instruction
cache (TIC) when possible.  A pair of instructions cannot be placed in
the TIC if the pair crosses an instruction cache line boundary.
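The fetch rule can be sketched as follows.  This is a simplified Python
model, not Motorola's logic: it assumes 4-byte instructions (as in all
M88K code) and the 32-byte I-cache line from the spec sheet, and it
ignores cache misses and the TIC.  The helper names are hypothetical.

```python
LINE_BYTES = 32   # I-cache line size (spec sheet)
INSN_BYTES = 4    # M88K instructions are 32 bits

def fetch_count(pc):
    """Instructions one fetch delivers when starting at word-aligned pc."""
    if pc % INSN_BYTES:
        raise ValueError("pc must be word aligned")
    # A fetched pair cannot straddle two lines, so the last word comes alone.
    return 1 if pc % LINE_BYTES == LINE_BYTES - INSN_BYTES else 2

def fetch_cycles(pc, n):
    """Cycles to fetch n sequential instructions starting at pc (hits only)."""
    cycles = 0
    while n > 0:
        got = min(fetch_count(pc), n)
        pc += got * INSN_BYTES
        n -= got
        cycles += 1
    return cycles

print(fetch_cycles(0x1000, 8))   # 4: line-aligned, four full pairs
print(fetch_cycles(0x1004, 8))   # 5: a pair boundary lands on the last word
```

Starting one word into a line costs an extra fetch cycle per line crossed,
which is one reason a compiler might align hot loop heads on line
boundaries.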

Data conflicts are recognized by a register ``scoreboard''.  An instruction
that writes to a register sets a lock bit for that register in the
scoreboard.  Subsequent instructions with read-after-write (RAW) or
write-after-write (WAW) dependencies on this register are then stalled
until the register is updated and unlocked.  Because the sequencer reads
the source registers of both instructions in a pair at the same time, a
pair in which the second instruction writes a register read by the first
(a write-after-read, or WAR, dependency) can be issued in the same cycle.
Stores and conditional branches can be dispatched even when their source
register is locked.
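A minimal Python sketch of the scoreboard rules described above, assuming
one lock bit per register and treating every source of a store or
conditional branch as exempt from the lock check (a simplification).  The
classes and the issue_pair function are hypothetical illustrations, not
the 88110's actual logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Insn:
    dest: Optional[int]              # destination register, or None (e.g. a store)
    srcs: Tuple[int, ...] = ()       # source registers
    ignores_src_locks: bool = False  # stores/conditional branches (simplified)

class Scoreboard:
    def __init__(self):
        self.locked = set()          # registers with a result still pending

    def can_issue(self, insn):
        # RAW: some source is still waiting for its value
        if not insn.ignores_src_locks and any(r in self.locked for r in insn.srcs):
            return False
        # WAW: the destination already has a write pending
        return insn.dest is None or insn.dest not in self.locked

    def issue(self, insn):
        if insn.dest is not None:
            self.locked.add(insn.dest)   # set the lock bit

    def writeback(self, reg):
        self.locked.discard(reg)         # value written, clear the lock bit

def issue_pair(sb, first, second):
    """How many of an in-order pair issue this cycle: 0, 1 or 2."""
    if not sb.can_issue(first):
        return 0                 # first blocked, so nothing issues
    sb.issue(first)
    # second is checked against first's new lock, so RAW/WAW on it stall;
    # WAR is harmless because both instructions' sources were read together.
    if sb.can_issue(second):
        sb.issue(second)
        return 2
    return 1

sb = Scoreboard()
print(issue_pair(sb, Insn(2, (3, 4)), Insn(5, (2,))))   # 1: RAW on r2
sb = Scoreboard()
print(issue_pair(sb, Insn(5, (2,)), Insn(2, (6,))))     # 2: WAR on r2 is fine
```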

The sequencer also keeps track of the availability of execution units.
Due to the rich set of execution units, two instructions can be dispatched
under many circumstances.  However, if the first instruction of the pair
cannot be dispatched because a resource is unavailable, neither instruction
is issued, even if the resources needed by the second are available.
Unlike some other dual-issue processors (e.g., the DEC Alpha 21064), the
88110 instruction sequencer tries to keep two instructions in the decode
stage every cycle.
That is, if the first instruction of a pair can be issued and the second
cannot, the first instruction is sent, the second instruction is moved
into the first decode slot, and another instruction is fetched into the
second decode slot.
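The benefit of refilling the decode slots can be sketched with a toy
Python model.  It assumes each instruction has a fixed earliest issue
cycle (a hypothetical input), strictly in-order issue, and fetch that
always keeps up.  The "fixed pairs" function is a simplified stand-in for
a less aggressive policy, not an exact model of the 21064.

```python
# ready[i] is the earliest cycle at which instruction i could issue.

def cycles_with_refill(ready):
    """88110-style: a slot is refilled as soon as its instruction issues."""
    i, cycle = 0, 0
    while i < len(ready):
        issued = 0
        # Slot 1 then slot 2; if slot 1 cannot issue, nothing issues.
        while issued < 2 and i < len(ready) and ready[i] <= cycle:
            i += 1
            issued += 1
        cycle += 1
    return cycle            # cycles spent until every instruction has issued

def cycles_fixed_pairs(ready):
    """Contrast: a new pair is decoded only after both halves have issued."""
    cycle = 0
    for base in range(0, len(ready), 2):
        pair = ready[base:base + 2]
        t = max(cycle, pair[0])              # first half issues
        if len(pair) == 2:
            t = max(t, pair[1])              # second half, same cycle or later
        cycle = t + 1                        # next pair enters the cycle after
    return cycle

ready = [0, 1, 0, 0, 0]   # the second instruction is briefly stalled
print(cycles_with_refill(ready))   # 3
print(cycles_fixed_pairs(ready))   # 4
```

With refill, the stalled instruction moves into the first slot and pairs
with the next instruction, so the one-cycle hiccup does not ripple through
the rest of the stream.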

The 88110 has two write-back buses for destination registers (each 80 bits
wide).  Because latencies vary across the functional units, three or more
instructions may try to use the write-back buses on the same cycle,
stalling some of them.  Arbitration favors results from instructions with
shorter latencies over results from instructions with longer latencies.
This matters for scheduling: the longer-latency instructions may lie on
the program's critical path, so delaying their results can limit
performance.
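The arbitration rule can be sketched in Python as follows, assuming two
buses, results described as hypothetical (name, latency, ready_cycle)
tuples, and ties broken by list order.  It is a model of the policy the
text describes, not the chip's arbiter.

```python
def writeback_schedule(results, buses=2):
    """results: (name, latency, ready_cycle) tuples -> {name: write-back cycle}."""
    done = {}
    cycle = min((r[2] for r in results), default=0)
    while len(done) < len(results):
        waiting = [r for r in results if r[0] not in done and r[2] <= cycle]
        # Shorter-latency results win the buses; sort is stable, so ties
        # keep list (program) order.
        waiting.sort(key=lambda r: r[1])
        for name, _latency, _ready in waiting[:buses]:
            done[name] = cycle
        cycle += 1              # losers retry on the next cycle
    return done

results = [("fmul", 3, 3), ("add", 1, 3), ("ld", 2, 3)]
print(writeback_schedule(results))   # {'add': 3, 'ld': 3, 'fmul': 4}
```

The three-cycle multiply loses a cycle even though it was ready at the
same time; if it feeds the critical path, that cycle shows up directly in
run time.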

Branch Handling

The 88110 retains the delayed branches of the 88100 but also adds normal
(non-delayed) branches.  When a branch condition is not yet known, the
88110 uses static branch prediction to choose a path.  Once a path has been
chosen, speculative execution proceeds: instructions from the predicted,
but possibly incorrect, path are issued and executed, and their results
are written to the register file.  If the branch turns out to be
mispredicted, the history buffer is used to restore the register state
that existed before the branch.
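A minimal Python sketch of history-buffer recovery, using the 12-entry
capacity and two-registers-per-cycle repair rate from the spec sheet.  It
assumes a single outstanding predicted branch, and the class and method
names are hypothetical.  Restoring newest-first matters when the same
register is written more than once on the wrong path.

```python
import math

class HistoryBuffer:
    CAPACITY = 12           # entries (spec sheet)
    RESTORE_PER_CYCLE = 2   # registers repaired per cycle (spec sheet)

    def __init__(self, regs):
        self.regs = regs    # register file, modified in place
        self.entries = []   # (register, overwritten value), oldest first

    def speculative_write(self, reg, value):
        if len(self.entries) == self.CAPACITY:
            # Assumption: a real design would stall issue here.
            raise RuntimeError("history buffer full")
        self.entries.append((reg, self.regs[reg]))  # save the old value first
        self.regs[reg] = value

    def branch_resolved_correct(self):
        self.entries.clear()    # speculative results become permanent

    def branch_mispredicted(self):
        """Undo writes newest-first; return the cycles spent restoring."""
        count = len(self.entries)
        while self.entries:
            reg, old = self.entries.pop()
            self.regs[reg] = old
        return math.ceil(count / self.RESTORE_PER_CYCLE)

regs = [0] * 32
regs[3] = 7
hb = HistoryBuffer(regs)
hb.speculative_write(3, 10)
hb.speculative_write(3, 11)   # same register twice on the wrong path
hb.speculative_write(5, 1)
print(hb.branch_mispredicted(), regs[3], regs[5])   # 2 7 0
```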

Load/Store Handling

The load/store unit, which is critical to overall performance, is the most
sophisticated unit of the 88110 and one of the most sophisticated data
units of any commercial microprocessor to date.  One load or one store
instruction can be issued on each clock cycle.  Loads have a latency of
two cycles on a data cache hit, and they are allowed to bypass stores if
there is no address conflict.  Load operations are buffered in a four-deep
FIFO queue while store instructions are held in a three-deep reservation
station.  Stores can be issued even if the data to store is not yet
available; thus, dependent stores can be scheduled in the same cycle as
the value-producing instruction.
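The load-bypass check can be sketched in Python using the queue depths
given above.  The sketch assumes store addresses are already known and
that every pending store is older than the queued loads; the class and
method names are hypothetical, and real hardware may compare addresses at
a coarser granularity than bytes.

```python
from collections import deque

LOAD_QUEUE_DEPTH = 4      # four-deep load FIFO
STORE_STATION_DEPTH = 3   # three-entry store reservation station

def overlaps(a, a_size, b, b_size):
    """True if byte ranges [a, a+a_size) and [b, b+b_size) intersect."""
    return a < b + b_size and b < a + a_size

class LoadStoreSketch:
    def __init__(self):
        self.loads = deque()   # (addr, size) in program order
        self.stores = []       # (addr, size), assumed older than every queued load

    def push_load(self, addr, size):
        if len(self.loads) == LOAD_QUEUE_DEPTH:
            return False       # full: dispatch would stall
        self.loads.append((addr, size))
        return True

    def push_store(self, addr, size):
        if len(self.stores) == STORE_STATION_DEPTH:
            return False
        self.stores.append((addr, size))
        return True

    def oldest_load_may_bypass(self):
        """Oldest load may go ahead of pending stores if no address overlaps."""
        if not self.loads:
            return False
        addr, size = self.loads[0]
        return not any(overlaps(addr, size, s, s_size) for s, s_size in self.stores)

lsu = LoadStoreSketch()
lsu.push_store(0x100, 4)              # store still waiting for its data
lsu.push_load(0x104, 4)
print(lsu.oldest_load_may_bypass())   # True: different word
lsu.push_store(0x104, 2)
print(lsu.oldest_load_may_bypass())   # False: overlapping halfword
```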

Bus

The bus has 32 address lines and 64 data lines.  It is a split-transaction,
pipelined design.  Data transfers occur either as single beats (byte,
halfword, word, or doubleword) or in burst mode (four doublewords).  Burst
transfers use critical-word-first ordering with wrap-around so that cache
lines are refilled quickly and the CPU can restart early.
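The burst order for a 32-byte line can be sketched in Python as below.
It assumes a simple sequential (modulo-4) wrap after the critical
doubleword; the post does not spell out the exact wrap sequence, and the
function name is hypothetical.

```python
LINE_BYTES = 32
BEAT_BYTES = 8                      # one doubleword per beat on the 64-bit bus
BEATS = LINE_BYTES // BEAT_BYTES    # four-beat burst

def burst_order(miss_addr):
    """Doubleword addresses of a burst refill, critical doubleword first."""
    line_base = miss_addr & ~(LINE_BYTES - 1)
    first = (miss_addr % LINE_BYTES) // BEAT_BYTES
    # Assumed sequential wrap: first, first+1, ... modulo the line.
    return [line_base + ((first + i) % BEATS) * BEAT_BYTES for i in range(BEATS)]

print([hex(a) for a in burst_order(0x1014)])
# ['0x1010', '0x1018', '0x1000', '0x1008']
```

The doubleword holding the missed byte (0x1014) arrives on the first beat,
so the load can complete before the rest of the line is filled.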

L2 Cache Controller

The 88410 is the second-level cache controller and can support up to a
1MB direct-mapped cache.  Line size is selectable at 32 bytes or 64 bytes.
Write policy and cache coherency follow the on-chip data cache.  Multilevel
inclusion is maintained, so anything in the L1 cache must also be in the
L2 cache.  A separate set of tags is used to track inclusion; it lets the
controller filter out snoop transactions that cannot hit in the on-chip
caches, so they never reach the processor.
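The snoop-filtering idea can be sketched in Python with a direct-mapped
1MB L2 and an "also in L1" bit per line.  This is a hypothetical model:
it assumes 32-byte L2 lines (so one L2 line matches one L1 line) and shows
the general inclusion consequence that evicting an L2 line held in L1
forces an L1 invalidation; it does not describe the 88410's internals.

```python
class InclusiveL2Sketch:
    def __init__(self, size=1 << 20, line=32):   # 1MB, 32-byte lines
        self.line = line
        self.nsets = size // line                # direct mapped: one line per set
        self.tags = [None] * self.nsets
        self.in_l1 = [False] * self.nsets        # inclusion-tracking bit

    def _index_tag(self, addr):
        line_addr = addr // self.line
        return line_addr % self.nsets, line_addr // self.nsets

    def fill(self, addr, into_l1=True):
        """Install a line; return True if inclusion forces an L1 invalidation."""
        idx, tag = self._index_tag(addr)
        victim_in_l1 = self.tags[idx] not in (None, tag) and self.in_l1[idx]
        self.tags[idx] = tag
        self.in_l1[idx] = into_l1
        return victim_in_l1

    def snoop_reaches_cpu(self, addr):
        """Forward a bus snoop to the processor only if L1 might hold the line."""
        idx, tag = self._index_tag(addr)
        return self.tags[idx] == tag and self.in_l1[idx]

l2 = InclusiveL2Sketch()
print(l2.fill(0x12340))                 # False: the set was empty
print(l2.snoop_reaches_cpu(0x12348))    # True: same 32-byte line
print(l2.snoop_reaches_cpu(0x55540))    # False: filtered away
print(l2.fill(0x12340 + (1 << 20)))     # True: conflicting line evicts one held in L1
```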


An extensive article on the 88110 can be found in the April 1992 issue of
IEEE Micro.
-- 
Mark Smotherman, CS Dept., Clemson University, Clemson, SC 29634-1906
  (803) 656-5878,  mark@cs.clemson.edu  or  mark@hubcap.clemson.edu