CS 2733/2731, Spring 2004
Computer Organization II
Review for Final Exam (Wednesday, 5 May 2004)
Previous Reviews:
Refer to the reviews for Exam 1 and Exam 2 for material covered
up to the second exam:
See
Review for Exam 1 and
Review for Exam 2
Previous Exams:
Previous exams are available through the individual
web pages. (See here,
with links at the end of each main page.)
This final may be significantly different from previous ones.
Review topics:
Here are the new topics since the second exam.
Note that the final will emphasize Chapter 6, since that is
the most important topic since the second exam.
The final also covers the new topics of
caches, virtual memory, buses, and the recitation on exceptions.
- Exceptions and the trap handler in MIPS (Recitation 10).
- The specifics of how the trap handler works.
- Take-home quiz on exceptions:
Quiz 7
- Pipelined Implementation (Chapter 6).
- Overview (Section 6.1, includes especially discussion of
hazards).
- Pipelined datapath (Section 6.2, ignores hazards).
Note the several ways of graphically representing pipelines
on pages 461-465, especially the series of diagrams
in Figures 6.22-6.24.
- Pipelined control (Section 6.3, ignores hazards).
The control signals from the Control unit are the same as
for the single-cycle implementation, but the pipelined implementation
makes use of extra latched
register storage between pipeline stages to pass along control
information. (See Figures 6.29 and 6.30. See also the
good series of diagrams in Figures 6.31-6.35.)
- Data hazards and forwarding (Section 6.4).
You should understand how forwarding works, with the Forwarding Unit
and extra control and data lines.
This applies to dependencies between the register result of one
instruction and the use of that new register value in subsequent
instructions. (See especially Figure 6.38 and the
series of diagrams in Figures 6.41-6.42. We did not cover
Figure 6.43 or page 488.) There are 6 inputs to the forwarding
unit and 2 outputs.
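The forwarding decision can be sketched in Python (a simplified model, not the actual hardware; the six inputs and two outputs follow the textbook's forwarding unit, with the EX/MEM stage checked first because it holds the most recent result):

```python
def forward(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd,
            id_ex_rs, id_ex_rt):
    """Six inputs: two RegWrite signals, two destination registers (rd),
    and the two source registers (rs, rt) of the instruction in EX.
    Two outputs (ForwardA, ForwardB): 0 = use the register-file value,
    1 = forward from MEM/WB, 2 = forward from EX/MEM."""
    def select(src):
        # EX/MEM has priority: it is the more recent result.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return 2
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
            return 1
        return 0
    return select(id_ex_rs), select(id_ex_rt)
```

For example, if the instruction in EX/MEM writes register 5 and the instruction in EX reads register 5 as rs, ForwardA is 2 (take the ALU result straight from the EX/MEM pipeline register). Note that register $zero is never forwarded.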
- Data hazards and stalls (Section 6.5).
In case of a dependency involving a lw instruction,
the machine must stall for one cycle. This uses the
Hazard Detection Unit.
Note how the stall is inserted by
de-asserting write lines to the PC and the IF/ID registers,
and by inserting all zeros on control lines. (These zeros
propagate along from cycle to cycle, as the "bubble" of
a stalled instruction moves along the pipeline.)
See diagrams in Figures 6.47-6.49. Note that the
final forwarding is done by the Forwarding Unit as before.
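The Hazard Detection Unit's stall condition can be sketched as follows (a simplified model; the unit knows the instruction in EX is a lw because its MemRead control signal is asserted):

```python
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    # Stall one cycle when the instruction in EX is a lw (MemRead asserted)
    # whose destination register (its rt field) matches a source register
    # of the instruction currently being decoded in ID.
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)
```

When this returns true, the PC and IF/ID write lines are de-asserted and the control lines are zeroed, inserting the bubble described above.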
- Branch hazard (Section 6.6). In case of a branch,
if the branch is taken (in one simple implementation),
the machine must stall and wipe out the start of the next instruction.
(Branch handling also needs to move into step 2 of the pipeline.)
See Figures 6.51 and 6.52. Skip Section 6.6 from page 501 on.
The stall is inserted by zeroing the instruction in the
IF/ID pipeline register (making it a nop instruction),
so that as it moves through the pipeline, nothing happens.
- Exceptions (Section 6.7). Example of arithmetic overflow.
IF.Flush, ID.Flush, EX.Flush lines to put nop
in first stage and to set control lines to zeros in second
and third stages.
- Caches (Chapter 7).
- The general idea of caching, used not just for memory,
but with disk storage and elsewhere.
- SRAM versus DRAM (B.5, pages B-26 to B-33).
- The idea of hashing, with a hash function and
with some method for resolving collisions:
- Open addressing, that is, using the next available
sequential location after the hash address.
- Bucketing, that is, using a linked list attached
to each hash address.
- Overflow area. Using an additional area for data that
collided.
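The first two collision strategies can be sketched in Python (a toy table with integer keys; `key % size` stands in for the hash function):

```python
class OpenAddressTable:
    """Open addressing: on a collision, probe the next sequential slot."""
    def __init__(self, size=8):
        self.slots = [None] * size           # each slot: (key, value) or None

    def _hash(self, key):
        return key % len(self.slots)         # toy hash function

    def put(self, key, value):
        i = self._hash(key)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)    # next available sequential slot
        self.slots[i] = (key, value)

    def get(self, key):
        i = self._hash(key)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None


class BucketTable:
    """Bucketing: a list (standing in for a linked list) per hash address."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        self.buckets[key % len(self.buckets)].append((key, value))

    def get(self, key):
        for k, v in self.buckets[key % len(self.buckets)]:
            if k == key:
                return v
        return None
```

With a table of size 8, keys 3 and 11 both hash to slot 3; open addressing places 11 in slot 4, while bucketing chains both at index 3.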
- A very simple example, with 3-bit cache addresses, and 5-bit
memory addresses. Use the low-order 3 bits of the memory address
for the cache address. See Figures 7.5 and 7.6.
- A simple approach to a cache, involving a cache table such
as the one shown in Figure 7.7, with 1024 cache entries
(using a 10-bit cache address), indexes
in the range from 0 to 1023, a valid bit, a 20-bit tag field,
and a 32-bit data field.
For lookup:
- The CPU generates an address.
- Extract a 10-bit index from bits 11-2 of the address.
- For the address of a word, bits 1-0 will be 0.
- Compare the 20-bit quantity from bits 31-12 of the
address with the tag field. If equal, you have a hit,
and return the data field. If not equal, you have
a miss; go out to main memory.
- In case of a miss, stall the CPU, fetch the word from
memory, load it into the cache, and restart the instruction
so that this time there will be a cache hit.
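The lookup steps above can be sketched as follows (a software model of the Figure 7.7 cache; the bit positions match the text, with the index from bits 11-2 and the tag from bits 31-12):

```python
def split_address(addr):
    index = (addr >> 2) & 0x3FF      # bits 11-2: 10-bit cache index
    tag = (addr >> 12) & 0xFFFFF     # bits 31-12: 20-bit tag
    return index, tag

def lookup(cache, addr):
    """cache is a list of 1024 entries, each a (valid, tag, data) triple."""
    index, tag = split_address(addr)
    valid, stored_tag, data = cache[index]
    if valid and stored_tag == tag:
        return 'hit', data           # return the cached word
    return 'miss', None              # go out to main memory
```

Two different memory addresses with the same low-order bits map to the same cache entry; the tag comparison is what tells them apart.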
- A very similar but slightly more complicated example as
shown in Figure 7.8, with 14-bit cache addresses (16K entries), and
a 16-bit tag field. Once again, bits 0 and 1 are not used,
since the cache is fetching words.
- Using spatial locality
(which means that if an item is referenced, items whose addresses
are close by will tend to be referenced soon):
As illustrated in Figure 7.10,
each cache entry could be a block of 4 words, so that a cache
miss will fetch 4 adjacent words, leading to likely hits afterwards.
- Associative caches, as illustrated in Figure 7.19,
where 4 completely distinct words are held for each cache index
(4-way associative).
- Virtual Memory (Section 7.4, just a small part).
- Computer memory is divided into contiguous pages
(blocks) of some convenient size, sometimes 4K = 4096 bytes.
- Virtual memory is arranged as a sequence of such pages,
but virtual memory doesn't actually exist.
- Physical memory is arranged as a non-sequential
collection of such pages, but it is what really exists.
- The hardware allows a mapping (transformation) between
virtual memory and physical memory, using a page table.
The page table gives the physical page number for each virtual
page, if that virtual page is present in physical memory.
(But there are too many virtual pages for them all to be present.)
Figure 7.21 shows an overview of the translation, and
Figure 7.22 shows what the table and its use might look like.
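The translation can be sketched in Python (a simplified model of the Figure 7.21 mapping, assuming 4K pages; `page_table[vpn]` holds a valid bit and a physical page number):

```python
PAGE_SIZE = 4096                         # 4K bytes per page, as in the text

def translate(page_table, virtual_addr):
    """Map a virtual address to a physical address via the page table."""
    vpn = virtual_addr // PAGE_SIZE      # virtual page number
    offset = virtual_addr % PAGE_SIZE    # offset is unchanged by translation
    valid, ppn = page_table[vpn]
    if not valid:
        # Page fault: the page is on disk, not in physical memory;
        # handled by software after an interrupt.
        raise LookupError('page fault')
    return ppn * PAGE_SIZE + offset
```

Only the page number is translated; the offset within the page is carried through unchanged.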
- When a page is needed for use in a program in virtual memory,
and it is not in physical memory (a page fault),
it must be read in from disk,
and it must replace a page that is already in physical memory.
A page fault is handled by software after an interrupt.
- The Page Replacement Problem refers to the problem
of deciding which page in physical memory to replace.
We would like to replace (overwrite) a page that has not been
used for a long time. Specifically, we would like to use the
LRU (least recently used) algorithm, to replace the page
that has gone the longest time without being used. It turns
out that LRU is too complex for reasonable implementation.
- LRU approximation: This is a clever algorithm that
is efficient to implement, yet is almost as good as the real LRU.
Each page has an extra reference bit (RB) (in addition to a
valid bit that says whether the page is in physical
memory or not). Every time we reference a page in physical memory,
we set the RB to 1. In order to find a page for replacement, we cycle
through all pages in a circular way, looking for a page with
a 0 for its RB (0 meaning that it hasn't been
recently referenced). When we find a 1 for a RB,
we change the RB to 0, but go on, so we won't get to that
page until cycling though all pages. This guarantees that we
have to go once through all pages before using a page that has
been recently used.
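This LRU approximation (often called the clock or second-chance algorithm) can be sketched as:

```python
def choose_victim(ref_bits, hand):
    """Scan circularly from position `hand`; clear each reference bit (RB)
    that is 1, giving that page a second chance, and stop at the first
    page whose RB is 0. Returns (victim_page, new_hand); ref_bits is
    modified in place."""
    n = len(ref_bits)
    while True:
        if ref_bits[hand] == 0:
            return hand, (hand + 1) % n
        ref_bits[hand] = 0           # recently used: clear RB and move on
        hand = (hand + 1) % n
```

The loop always terminates: even if every RB starts at 1, each is cleared as the hand passes, so after at most one full cycle a 0 is found. This is exactly the guarantee stated above.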
- Buses (Section 8.4)
- Structure: data lines, address lines, control lines.
(A bus may use the same lines for data and addresses.)
- Types: processor-memory (short and fast to connect these all-important
components), I/O buses (general purpose for I/O devices), backplane
(designed to allow several kinds of buses to coexist).
(See figures on page 659.)
- Protocols: Synchronous (using a clock), asynchronous (no clock,
uses handshaking).
- Example protocol: Figures 8.10 and 8.11. (See also the
handout illustrating these figures.)
- Performance: Synchronous is faster, often used between processor
and memory. Everything must be clocked,
and all actions are precisely synchronized to clock cycles.
- Asynchronous is slower, has longer lines, and is more versatile.
Usually used for I/O devices.
- Bus arbitration: bus master and slave. The master decides which
slaves get access to the bus. With several masters, one
needs arbitration. Four common types of arbitration:
- Daisy-chain: "simple and cheap, but not fair or fast".
Any device can signal on a request line that it wants to use
the bus. The granting of the right to use the bus goes
from a "Bus arbiter" to the highest priority device.
If that device doesn't want the bus, it passes the grant on to the
next device in priority order, and so on to the lowest priority.
Thus the granting chains through devices from highest to lowest priority.
- Centralized, parallel arbitration: Uses a request line
from each device to the arbiter. (Used by the common PCI bus.)
- Distributed arbitration by self-selection: the devices
requesting bus access determine among themselves who gets access.
(Each device wanting bus access places a code indicating
its identity on the bus.) (NuBus and Apple II)
- Distributed arbitration by collision detection:
Each device wanting to use the bus checks if it is in use,
and if not, starts using it. If two or more devices start
at nearly the same time, there will be a collision.
In case of a collision, the devices stop using the bus,
wait a random time interval, and try again. (Ethernet)
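The "wait a random time interval" step is, in classic Ethernet, binary exponential backoff; a minimal sketch (counting slot times rather than real times):

```python
import random

def backoff_slots(collision_count, cap=10):
    """After the k-th consecutive collision on the same transmission,
    wait a random number of slot times chosen uniformly from
    0 .. 2**k - 1 (the doubling is capped so waits stay bounded)."""
    k = min(collision_count, cap)
    return random.randint(0, 2 ** k - 1)
```

Doubling the range after each collision makes it increasingly unlikely that the same two devices pick the same wait and collide again.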
Likely final exam questions:
- No questions about the use of CMOS transistors to create
gates.
- Probably several questions about MIPS assembly and machine
language. I might ask for the code for a simple loop
(use of beq or bne), or
for a simple call to a function (jal)
and the code for the function (jr $ra to return), or
access to memory (use of lw or sw),
or saving and restoring a register on the stack.
- Possibly one question about the correspondence between
machine code and assembler code (as in Lab 8:
Hand Assembly of MIPS Code).
- At least one, perhaps 2, questions on either the single-cycle or
multi-cycle implementation.
One of these questions might even ask you to do
something creative, such as implementing a new instruction.
The text has exercises asking how to implement the
addi instruction in both the single-cycle
and the multi-cycle models, without additional control lines.
(Instruction decode does have to recognize the new instruction.)
You should practice writing a data path diagram for the
implementation of addi on both single- and
multi-cycle implementations.
- Extra emphasis on the pipelining chapter, the first seven
sections.
- I might ask about the pipelined datapath
(lots of figures in section 6.2).
- I might ask about the pipelined
control (more figures in section 6.3).
- I may give you a diagram with a forwarding unit,
for you to explain how the forwarding unit works in simple
cases not involving a stall (section 6.4, figures 6.41, 6.42).
There are 6 inputs to the forwarding unit, and you should
understand how each of them is needed and is used.
Similarly you should understand the two outputs.
- I might ask about the hazard detection unit and the stall used
to handle the lw instruction (section 6.5,
figures 6.47, 6.48, 6.49). How does the computer know it is
a lw instruction? How does the computer
insert the stall?
- I might ask about handling the beq instruction
by moving everything into step 2, and by putting in a 1-cycle
stall bubble (for successful branch) (section 6.6,
figures 6.51, 6.52).
- I might ask about handling arithmetic overflow exception
by flushing the pipeline and restarting the instruction at address
40000040 (section 6.7, figures 6.55, 6.56).
- Probably one question about exception handlers.
- Possibly one question about caches.
- Possibly one question about virtual memory.
- Possibly one question about buses.
Good Luck!!!
Revision date: 2004-04-26.
(Please use ISO
8601, the International Standard.)