CS 2733/2731, Spring 2004
Computer Organization II
Review for Final Exam (Wednesday, 5 May 2004)
Previous Reviews:
Refer to the reviews for Exam 1 and Exam 2 for material covered
up to the second exam:
See
Review for Exam 1 and
Review for Exam 2
Previous Exams:
Previous exams are available through the individual
web pages. (See here,
with links at the end of each main page.)
This final may be significantly different from previous ones.
Review topics:
Here are the new topics since the second exam.
Note that the final will emphasize Chapter 6, since that is
the most important topic since the second exam.
The final also covers the new topics of
caches, virtual memory, buses, and the recitation on exceptions.
- Exceptions and the trap handler in MIPS (Recitation 10).
- The specifics of how the trap handler works.
- Take-home quiz on exceptions:
Quiz 7
- Pipelined Implementation (Chapter 6).
- Overview (Section 6.1, includes especially discussion of
hazards).
- Pipelined datapath (Section 6.2, ignores hazards).
Note the several ways of graphically representing pipelines
on pages 461-465, especially the series of diagrams
in Figures 6.22-6.24.
- Pipelined control (Section 6.3, ignores hazards).
The control signals from the Control unit are the same as
for the single-cycle implementation, but the pipelined implementation
makes use of extra latched
register storage between pipeline stages to pass along control
information. (See Figures 6.29 and 6.30. See also the
good series of diagrams in Figures 6.31-6.35.)
- Data hazards and forwarding (Section 6.4).
You should understand how forwarding works, with the Forwarding Unit
and extra control and data lines.
This applies to dependencies between the register result of one
instruction and the use of that new register value in subsequent
instructions. (See especially Figure 6.38 and the
series of diagrams in Figures 6.41-6.42. We did not cover
Figure 6.43 or page 488.) There are 6 inputs to the forwarding
unit and 2 outputs.
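The forwarding decision can be sketched in Python (a simplified model, not the actual hardware; the six inputs and two outputs follow the textbook's forwarding unit, with the EX/MEM stage checked first because it holds the most recent result):

```python
def forward(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd,
            id_ex_rs, id_ex_rt):
    """Six inputs: two RegWrite signals, two destination registers (rd),
    and the two source registers (rs, rt) of the instruction in EX.
    Two outputs (ForwardA, ForwardB): 0 = use the register-file value,
    1 = forward from MEM/WB, 2 = forward from EX/MEM."""
    def select(src):
        # EX/MEM has priority: it is the more recent result.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return 2
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
            return 1
        return 0
    return select(id_ex_rs), select(id_ex_rt)
```

For example, if the instruction in EX/MEM writes register 5 and the instruction in EX reads register 5 as rs, ForwardA is 2 (take the ALU result straight from the EX/MEM pipeline register). Note that register $zero is never forwarded.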
- Data hazards and stalls (Section 6.5).
In case of a dependency involving a lw instruction,
the machine must stall for one cycle. This uses the
Hazard Detection Unit.
Note how the stall is inserted by
de-asserting write lines to the PC and the IF/ID registers,
and by inserting all zeros on control lines. (These zeros
propagate along from cycle to cycle, as the "bubble" of
a stalled instruction moves along the pipeline.)
See diagrams in Figures 6.47-6.49. Note that the
final forwarding is done by the Forwarding Unit as before.
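The Hazard Detection Unit's stall condition can be sketched as follows (a simplified model; the unit knows the instruction in EX is a lw because its MemRead control signal is asserted):

```python
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    # Stall one cycle when the instruction in EX is a lw (MemRead asserted)
    # whose destination register (its rt field) matches a source register
    # of the instruction currently being decoded in ID.
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)
```

When this returns true, the PC and IF/ID write lines are de-asserted and the control lines are zeroed, inserting the bubble described above.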
- Branch hazard (Section 6.6). In case of a branch,
if the branch is taken (in one simple implementation),
the machine must stall and wipe out the start of the next instruction.
(Branch handling also needs to move into step 2 of the pipeline.)
See Figures 6.51 and 6.52. Skip Section 6.6 from page 501 on.
The stall is inserted by zeroing the instruction in the
IF/ID pipeline register (making it a nop instruction),
so that as it moves through the pipeline, nothing happens.
- Exceptions (Section 6.7). Example of arithmetic overflow.
IF.Flush, ID.Flush, EX.Flush lines to put nop
in first stage and to set control lines to zeros in second
and third stages.
- Caches (Chapter 7).
- The general idea of caching, used not just for memory,
but with disk storage and elsewhere.
- SRAM versus DRAM (B.5, pages B-26 to B-33).
- The idea of hashing, with a hash function and
with some method for resolving collisions:
- Open addressing, that is, using the next available
sequential location after the hash address.
- Bucketing, that is, using a linked list attached
to each hash address.
- Overflow area. Using an additional area for data that
collided.
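The first two collision strategies can be sketched in Python (a toy table with integer keys; `key % size` stands in for the hash function):

```python
class OpenAddressTable:
    """Open addressing: on a collision, probe the next sequential slot."""
    def __init__(self, size=8):
        self.slots = [None] * size           # each slot: (key, value) or None

    def _hash(self, key):
        return key % len(self.slots)         # toy hash function

    def put(self, key, value):
        i = self._hash(key)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)    # next available sequential slot
        self.slots[i] = (key, value)

    def get(self, key):
        i = self._hash(key)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None


class BucketTable:
    """Bucketing: a list (standing in for a linked list) per hash address."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        self.buckets[key % len(self.buckets)].append((key, value))

    def get(self, key):
        for k, v in self.buckets[key % len(self.buckets)]:
            if k == key:
                return v
        return None
```

With a table of size 8, keys 3 and 11 both hash to slot 3; open addressing places 11 in slot 4, while bucketing chains both at index 3.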
- A very simple example, with 3-bit cache addresses, and 5-bit
memory addresses. Use the low-order 3 bits of the memory address
for the cache address. See Figures 7.5 and 7.6.
- A simple approach to a cache, involving a cache table such
as the one shown in Figure 7.7, with 1024 cache entries
(using a 10-bit cache address), indexes
in the range from 0 to 1023, a valid bit, a 20-bit tag field,
and a 32-bit data field.
For lookup:
- The CPU generates an address.
- Extract a 10-bit index from bits 11-2 of the address.
- For the address of a word, bits 1-0 will be 0.
- Compare the 20-bit quantity from bits 31-12 of the
address with the tag field. If equal, you have a hit,
and return the data field. If not equal, you have
a miss; go out to main memory.
- In case of a miss, stall the CPU, fetch the word from
memory, load it into the cache, and restart the instruction
so that this time there will be a cache hit.
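The lookup steps above can be sketched as follows (a software model of the Figure 7.7 cache; the bit positions match the text, with the index from bits 11-2 and the tag from bits 31-12):

```python
def split_address(addr):
    index = (addr >> 2) & 0x3FF      # bits 11-2: 10-bit cache index
    tag = (addr >> 12) & 0xFFFFF     # bits 31-12: 20-bit tag
    return index, tag

def lookup(cache, addr):
    """cache is a list of 1024 entries, each a (valid, tag, data) triple."""
    index, tag = split_address(addr)
    valid, stored_tag, data = cache[index]
    if valid and stored_tag == tag:
        return 'hit', data           # return the cached word
    return 'miss', None              # go out to main memory
```

Two different memory addresses with the same low-order bits map to the same cache entry; the tag comparison is what tells them apart.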
- A very similar but slightly more complicated example as
shown in Figure 7.8, with 14-bit cache addresses (16K entries), and
a 16-bit tag field. Once again, bits 0 and 1 are not used,
since the cache is fetching words.
- Using spatial locality
(which means that if an item is referenced, items whose addresses
are close by will tend to be referenced soon):
As illustrated in Figure 7.10,
each cache entry could be a block of 4 words, so that a cache
miss will fetch 4 adjacent words, leading to likely hits afterwards.
- Associative caches, as illustrated in Figure 7.19,
where 4 completely distinct words are held for each cache index
(4-way associative).
- Virtual Memory (Section 7.4, just a small part).
- Computer memory is divided into contiguous pages
(blocks) of some convenient size, sometimes 4K = 4096 bytes.
- Virtual memory is arranged as a sequence of such pages,
but virtual memory doesn't actually exist.
- Physical memory is arranged as a non-sequential
collection of such pages, but it is what really exists.
- The hardware allows a mapping (transformation) between
virtual memory and physical memory, using a page table.
The page table gives the physical page number for each virtual
page, if that virtual page is present in physical memory.
(But there are too many virtual pages for them all to be present.)
Figure 7.21 shows an overview of the translation, and
Figure 7.22 shows what the table and its use might look like.
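The translation can be sketched in Python (a simplified model of the Figure 7.21 mapping, assuming 4K pages; `page_table[vpn]` holds a valid bit and a physical page number):

```python
PAGE_SIZE = 4096                         # 4K bytes per page, as in the text

def translate(page_table, virtual_addr):
    """Map a virtual address to a physical address via the page table."""
    vpn = virtual_addr // PAGE_SIZE      # virtual page number
    offset = virtual_addr % PAGE_SIZE    # offset is unchanged by translation
    valid, ppn = page_table[vpn]
    if not valid:
        # Page fault: the page is on disk, not in physical memory;
        # handled by software after an interrupt.
        raise LookupError('page fault')
    return ppn * PAGE_SIZE + offset
```

Only the page number is translated; the offset within the page is carried through unchanged.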
- When a page is needed for use in a program in virtual memory,
and it is not in physical memory (a page fault),
it must be read in from disk,
and it must replace a page that is already in physical memory.
A page fault is handled by software after an interrupt.
- The Page Replacement Problem refers to the problem
of deciding which page in physical memory to replace.
We would like to replace (overwrite) a page that has not been
used for a long time. Specifically, we would like to use the
LRU (least recently used) algorithm, to replace the page
that has gone the longest time without being used. It turns
out that LRU is too complex for reasonable implementation.
- LRU approximation: This is a clever algorithm that
is efficient to implement, yet is almost as good as the real LRU.
Each page has an extra reference bit (RB) (in addition to a
valid bit that says whether the page is in physical
memory or not). Every time we reference a page in physical memory,
we set the RB to 1. In order to find a page for replacement, we cycle
through all pages in a circular way, looking for a page with
a 0 for its RB (0 meaning that it hasn't been
recently referenced). When we find a 1 for a RB,
we change the RB to 0, but go on, so we won't get to that
page until cycling though all pages. This guarantees that we
have to go once through all pages before using a page that has
been recently used.
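This LRU approximation (often called the clock or second-chance algorithm) can be sketched as:

```python
def choose_victim(ref_bits, hand):
    """Scan circularly from position `hand`; clear each reference bit (RB)
    that is 1, giving that page a second chance, and stop at the first
    page whose RB is 0. Returns (victim_page, new_hand); ref_bits is
    modified in place."""
    n = len(ref_bits)
    while True:
        if ref_bits[hand] == 0:
            return hand, (hand + 1) % n
        ref_bits[hand] = 0           # recently used: clear RB and move on
        hand = (hand + 1) % n
```

The loop always terminates: even if every RB starts at 1, each is cleared as the hand passes, so after at most one full cycle a 0 is found. This is exactly the guarantee stated above.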
- Buses (Section 8.4)
- Structure: data lines, address lines, control lines.
(A bus may use the same lines for data and addresses.)
- Types: processor-memory (short and fast to connect these all-important
components), I/O buses (general purpose for I/O devices), backplane
(designed to allow several kinds of buses to coexist).
(See figures on page 659.)
- Protocols: Synchronous (using a clock), asynchronous (no clock,
uses handshaking).
- Example protocol: Figures 8.10 and 8.11. (See also the
handout illustrating these figures.)
- Performance: Synchronous is faster, often used between processor
and memory. Everything must be clocked,
and all actions are precisely synchronized to clock cycles.
- Asynchronous is slower, has longer lines, and is more versatile.
Usually used for I/O devices.
- Bus arbitration: bus master and slave. The master decides which
slaves get access to the bus. With several masters, one
needs arbitration. Four common types of arbitration:
- Daisy-chain: "simple and cheap, but not fair or fast".
Any device can signal on a request line that it wants to use
the bus. The granting of the right to use the bus goes
from a "Bus arbiter" to the highest priority device.
If that device doesn't want the bus, it passes the grant on to the
next device in priority order, and so on to the lowest priority.
Thus the granting chains through devices from highest to lowest priority.
- Centralized, parallel arbitration: Uses a request line
from each device to the arbiter. (Used by the common PCI bus.)
- Distributed arbitration by self-selection: the devices
requesting bus access determine among themselves who gets access.
(Each device wanting bus access places a code indicating
its identity on the bus.) (NuBus and Apple II)
- Distributed arbitration by collision detection:
Each device wanting to use the bus checks if it is in use,
and if not, starts using it. If two or more devices start
at nearly the same time, there will be a collision.
In case of a collision, the devices stop using the bus,
wait a random time interval, and try again. (Ethernet)
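The "wait a random time interval" step is, in classic Ethernet, binary exponential backoff; a minimal sketch (counting slot times rather than real times):

```python
import random

def backoff_slots(collision_count, cap=10):
    """After the k-th consecutive collision on the same transmission,
    wait a random number of slot times chosen uniformly from
    0 .. 2**k - 1 (the doubling is capped so waits stay bounded)."""
    k = min(collision_count, cap)
    return random.randint(0, 2 ** k - 1)
```

Doubling the range after each collision makes it increasingly unlikely that the same two devices pick the same wait and collide again.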
Likely final exam questions:
- No questions about the use of CMOS transistors to create
gates.
- Probably several questions about MIPS assembly and machine
language. I might ask for the code for a simple loop
(use of beq or bne), or
for a simple call to a function (jal)
and the code for the function (jr $ra to return), or
access to memory (use of lw or sw),
or saving and restoring a register on the stack.
- Possibly one question about the correspondence between
machine code and assembler code (as in Lab 8:
Hand Assembly of MIPS Code).
- At least one, perhaps 2, questions on either the single-cycle or
multi-cycle implementation.
One of these questions might even ask you to do
something creative, such as implementing a new instruction.
The text has exercises asking how to implement the
addi instruction in both the single-cycle
and the multi-cycle models, without additional control lines.
(Instruction decode does have to recognize the new instruction.)
You should practice writing a data path diagram for the
implementation of addi on both single- and
multi-cycle implementations.
- Extra emphasis on the pipelining chapter, the first seven
sections.
- I might ask about the pipelined datapath
(lots of figures in section 6.2).
- I might ask about the pipelined
control (more figures in section 6.3).
- I may give you a diagram with a forwarding unit,
for you to explain how the forwarding unit works in simple
cases not involving a stall (section 6.4, figures 6.41, 6.42).
There are 6 inputs to the forwarding unit, and you should
understand how each of them is needed and is used.
Similarly you should understand the two outputs.
- I might ask about the hazard detection unit and the stall used
to handle the lw instruction (section 6.5,
figures 6.47, 6.48, 6.49). How does the computer know it is
a lw instruction? How does the computer
insert the stall?
- I might ask about handling the beq instruction
by moving everything into step 2, and by putting in a 1-cycle
stall bubble (for successful branch) (section 6.6,
figures 6.51, 6.52).
- I might ask about handling arithmetic overflow exception
by flushing the pipeline and restarting the instruction at address
40000040 (section 6.7, figures 6.55, 6.56).
- Probably one question about exception handlers.
- Possibly one question about caches.
- Possibly one question about virtual memory.
- Possibly one question about buses.
Good Luck!!!
Revision date: 2004-04-26.
(Please use ISO
8601, the International Standard.)