# University of California, Berkeley - College of Engineering 

## Department of Electrical Engineering and Computer Sciences

Spring 2015
Instructors: Krste Asanović, Vladimir Stojanovic


After the exam, indicate on the line above where you fall in the emotion spectrum between "sad" \& "smiley"...

| Last Name | ANSWER KEY |
| ---: | ---: |
| First Name |  |
| Student ID Number |  |
| CS61C Login | cs61c- |
| The name of your SECTION TA (please circle) | David \| Donggyu | Fred | Jeffrey | Martin <br> Nolan \| Sagar | Shreyas | William |
| Name of the person to your Left |  |
| Name of the person to your Right |  |
| All the work is my own. I had no prior knowledge of the exam <br> contents nor will I share the contents with others in CS61C <br> who have not taken it yet. (please sign) |  |

## Instructions (Read Me!)

- This booklet contains 14 numbered pages including the cover page. The back of each page is blank and can be used for scratch-work, but will not be looked at for grading.
- Please turn off all cell phones, smartwatches, and other mobile devices. Remove all hats \& headphones. Place your backpacks, laptops and jackets under your seat.
- You have 180 minutes to complete this exam. The exam is closed book; no computers, phones, or calculators are allowed. You may use three handwritten 8.5 "x11" pages (front and back) of notes in addition to the provided green sheet.
- There may be partial credit for incomplete answers; write as much of the solution as you can. We will deduct points if your solution is far more complicated than necessary. When we provide a blank, please fit your answer within the space provided. "IEC format" refers to the mebi, tebi, etc prefixes.
- You must complete ALL THE QUESTIONS, regardless of your score on the midterm. Clobbering only works from the Final to the Midterm, not vice versa. You have 3 hours... relax.

|  | M1-1 | M1-2 | M1-3 | M2-1 | M2-2 | M2-3 | M2-4 | F1 | F2 | F3 | Total |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Points <br> Possible | 9 | 9 | 9 | 10 | 4 | 8 | 12 | 9 | 10 | 10 | 90 |
| Points <br> Earned |  |  |  |  |  |  |  |  |  |  |  |

## M1-1: I smell a potpourri section covering midterm one... (9 points)

a) Which of the following number representations give 0xFFFFFFFE the most positive value when converted to decimal?
A) Bias (with standard bias)
B) Unsigned
C) Two's complement
D) Sign and Magnitude
B) Unsigned
b) Consider a plot that shows the mapping between 8 -bit two's complement binary numbers and their decimal equivalents (i.e. binary is on the x-axis and decimal is on the $y$-axis). Fill in the plot to the left and answer the following questions.

i) Fill in the plot to the left.
ii) Describe (in binary) where discontinuities occur in the plot, if any:
Discontinuity from Ob01111111 to Ob10000000. Jumps from most positive number to most negative number.
iii) What are the most positive and most negative decimal values that this representation can store?
Largest: 127
Smallest: -128
c) Consider the C code below. Indicate where the values on the right live in memory (using (S)tack,
(H)eap, s(T)atic, or (C)ode). Assume no registers are used:

```
```

\#define a 10

```
```

\#define a 10
int b = 0;
int b = 0;
int main(int argc, char** argv) {
int main(int argc, char** argv) {
int c = a;
int c = a;
char d[10];
char d[10];
int* e = malloc(sizeof(int));
int* e = malloc(sizeof(int));
}

```
```

}

```
```

a: $\qquad$
b: $\qquad$
*d: $\qquad$
*e: _ ${ }^{H}$

d) Convert the following instructions from TAL to hex or vice versa. Use register names when possible.
i) 1 w \$s0, $0(\$ \mathrm{aO})$
$0 \times 8 \mathrm{c} 900000$
ii) $0 \times 02021021$
addu \$v0 \$s0 \$v0
$\qquad$

## M1-2: I'll believe it when I C it (9 points)

Your friend wants to take 61C next semester, and is learning C early to get ahead. They try to implement the ROT13 function as practice, but the code they wrote has some bugs, so you've been called in, as the C expert, to help debug their program, reproduced below:

```
/* Applies the ROT13 cipher. rot13("happytimes") == "uncclgvzrf" */
void rot13(char *str) {
    while (*str) {
        if (str >= 'a' && str <= 'z') {
            *str = (*str + 13) % 26;
        }
        str++;
    }
}
int main(int argc, char *argv[]) {
    char a[] = "happy";
    char b[] = "times";
    char *s = "XXXXXXXXXXXX"; // 12 X's
    // Apply cipher to a and b.
    rot13(a);
    rot13(b) ;
    printf("%s%s\n", a, b);
    // Concatenate and place in s.
    int i = 0;
    for (int j = 0; a[j]; ) s[i++] = a[j++];
    for (int j = 0; b[j]; ) s[i++] = b[j++];
    printf("%s\n", s);
}
```

a) You want to impress your friend, so you predict the result of executing the program as it is written, just by looking at it. If the program is guaranteed to execute without crashing, describe what it prints, otherwise explain the bug that may cause a crash.

Modifying string literals is undefined, causes a bus error in modern systems.
b) Now, fix all the errors in the program so that it executes correctly. Fill in the corrections you made in the table below. You may not need all the rows.

| Line \# | Insert Before / Replace / Delete | Change (Explanation or Code) |
| :---: | :---: | :---: |
| 4 | Replace | if (*str >= 'a' \&\& *str <= 'z') \{ |
| 5 | Replace | *str = (*str - 'a' + 13) \% 26 + 'a'; |
| 14 | Replace | Change char *s to char s[] |
| 25 | I / R | s[i] = '\0'; |

$\qquad$

## M1-3: I don't want to MIPS a thing (9 points)

The following C code recursively sums the elements in an array of length $n$.

```
int32_t sum_arr(int32_t *arr, size_t n) {
        if (n) {
            return sum_arr(arr + 1, n - 1) + arr[0];
            }
        return 0;
}
```

Translate sum_arr into MIPS below. Your code must follow all function calling conventions, and you may not use any pseudoinstructions. You may not need every blank.

```
sum_arr: bne $a1, $zero_ non_zero
    addu $v0, $0, $0
    jr $ra
non_zero: _addiu $sp, $sp, -8
    sw $s0, 4($sp)
    sw $ra, 0($sp)
    addiu $s0, $a0, 0 # Store arr into $s0
    addiu $a0, $a0, 4
    _addiu $a1, $a1, -1
        jal sum_arr
    Iw $t0, 0($s0)
    addu $v0, $v0, $t0
    lw $s0, 4($sp)
    lw $ra, 0($sp)
    addiu $sp, $sp, }
    jr $ra
```


## M2-1: I couldn't come up with a clever title for SDS. (10 points)

a) Give the simplest Boolean expression for the following circuit in terms of $A$ and $B$, using the minimum number of AND, OR, and NOT gates:


C = $\qquad$ $A+\sim B$ $\qquad$
(You must show your work above to earn points.)
b) Using as few states as possible, complete the transition table for an FSM that takes an input with 3 values: 0 , 1 , or 2 . The machine will output a 1 when the sum of the inputs seen so far is divisible by 3 . Otherwise it should output a 0.

Assume you have seen no digits at the start state. You might not need all of the states, and you should not draw additional states. You must represent your FSM using the table to the left, the table is the only part that will be graded. The first transition has been filled in for you.

| CURRENT <br> STATE | INPUT/OUTPUT | NEXT <br> STATE |
| :---: | :---: | :---: |
| A | $1 / 0$ | B |
| A | $0 / 1$ | A |
| A | $2 / 0$ | C |
| B | $2 / 1$ | A |
| B | $0 / 0$ | B |
| B | $1 / 0$ | C |
| C | $1 / 1$ | A |
| C | $2 / 0$ | B |
| C | $0 / 0$ | C |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |


$\qquad$

## M2-1: (continued)

c) Suppose we add registers to the unoptimized circuit in part A to increase the clock rate (this modification is shown below). What is the longest clock-to-Q that the registers on inputs A and B can have that will result in correct behavior when the circuit is clocked at 10 MHz ?


Assumptions:

- Assume that clock-to-Q > hold time
- All registers have a setup time of 2 ns
- All logic gates have a delay of 25 ns
- Bubbles on gates do not introduce additional delay

Answer: $\qquad$ 48ns $\qquad$

## M2-2: Float like a butterfly and sting like an IEEE (4 points)

Let's take another look at the IEEE754 standard for single-precision floating-point numbers. [x, y) represents a range where $x$ is included and $y$ is not.
a) How many floats are representable in the interval [0.5, 1)? Answer: $1 \cdot 2^{\wedge} 23$
b) How many floats are representable in the interval [0, 0.5)? Answer: $\quad 126 \cdot 2^{\wedge} 23+1$ Note: We will accept both $126 \times 2^{23}$ and $126 \times 2^{23}+1$. This change will be made .

## M2-3: If this exam were a CPU, you'd be halfway through the pipeline (8 points)

We found that the instruction fetch and memory stages are the critical path of our 5-stage pipelined MIPS CPU. Therefore, we changed the IF and MEM stages to take two cycles while increasing the clock rate. You can assume that the register file is written at the falling edge of the clock.


Assume that no pipelining optimizations have been made, and that branch comparisons are made by the ALU. Here's how our pipeline looks when executing two add instructions:

| Clock Cycle \# | $\mathbf{1}$ | $\mathbf{2}$ | $\mathbf{3}$ | $\mathbf{4}$ | $\mathbf{5}$ | $\mathbf{6}$ | $\mathbf{7}$ | $\mathbf{8}$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| add \$t0, \$t1, \$t2 | IF1 | IF2 | ID | EX | MEM1 | MEM2 | WB |  |
| add \$t3, \$t4, \$t5 |  | IF1 | IF2 | ID | EX | MEM1 | MEM2 | WB |

Make sure you take a careful look at the above diagram before answering the following questions:
a. How many stalls would a data hazard between back-to-back instructions require?

3 stalls
b. How many stalls would be needed after a branch instruction?

4 stalls
c. Suppose the old clock period was 150 ns and the new clock period is now 100 ns . Would our processor have a significant speedup executing a large chunk of code...
i. Without any pipelining hazards? Explain your answer in 1-2 sentences.

## Yes, due to 1.5x throughput

ii. With $50 \%$ of the code containing back-to-back data hazards? Explain your answer in 12 sentences.
Yes, penalty is 300 ns per hazard in both cases, so our new processor will still have higher throughtput.
$\qquad$

## M2-4: Some say there's nothing better than cold, hard cache (12

 points)a) What shape do the following trade-off curves have? Select a shape and enter its number into the box for each of the graphs. Unless they are the parameters being varied, assume that associativity, capacity and block size are constant. You should assume that the axes are linear.

$\underset{\text { 1: Constant }}{\text { 个 }}$
b) Consider a system with inclusive L1 and L2 caches with 4B cache block size. Assume we have 1 MiB of on-chip memory available and want to determine how much of this memory we should give to the L1 cache and how much to the L2 cache. We will try to minimize the AMAT to do so.

Assume both caches are fully associative with LRU replacement. Their combined capacity is 1 MiB (excluding tags and meta-data). You can consider all miss rates approximate.

Say you are running the following program starting from cold L1 and L2 caches:

```
#define ARRAY_SIZE 256*1024
int a[ARRAY_SIZE];
int sum = 0; // assume sum, i, and j are stored in registers
for (int i = 0; i < 100000; i++) {
    for (int j = 0; j < ARRAY_SIZE; j++) sum += a[j];
    for (int j = ARRAY_SIZE-1; j >= 0; j--) sum += a[j];
}
```

1) How would we compute AMAT if we had the local L1 miss rate ("L1Miss"), the local L2 miss rate ("L2Miss") and the memory access time ("Memory")? Use "H1" and "H2" to represent the L1 and L2 hit times respectively. (We will compute these quantities later in the question)
```
AMAT = H1 + L1Miss * (H2 + L2Miss * Memory)
```

$\qquad$

## M2-4: (continued)

2) For the program above, express the local miss rate for the L1 cache in general terms as a function of the L1 cache size (write L1 for the size of L1 in bytes). Hint: The miss rate is 0 for a 1 MiB cache, 0.5 for a 0.5 MiB cache and 1 for a 0 MiB (i.e., no) L1 cache.

LocalMiss1 = $1-(\mathrm{L} 1 / 1 \mathrm{MiB})$
3) What is the global miss rate for the L2 cache as a function of the L1 cache size? Hint: Start by expressing the global miss rate as a function of the L 2 cache size.

GlobalMiss2 $=1-(\mathrm{L} 2 / 1 \mathrm{MiB})=(1 \mathrm{MiB}-\mathrm{L} 2) / 1 \mathrm{MiB}=\mathrm{L} 1 / 1 \mathrm{MiB}$
4) What is the local miss rate for the L2 cache as function of the L1 and L2 sizes? Hint: Use your results from questions 2 and 3.

LocalMiss2 $=[\mathrm{L} 1 / 1 \mathrm{MiB}] /[1-(\mathrm{L} 1 / 1 \mathrm{MiB})]=\mathrm{L} 1 /(1 \mathrm{MiB}-\mathrm{L} 1)=\mathrm{L} 1 / \mathrm{L} 2$
5) Assume the hit time of the L1 cache is 10 cycles, the hit time of the L2 cache is 20 cycles and the memory access time is 100 cycles. Using the formula from question 1, what is the AMAT for this system as a function of only the L1 size?

```
AMAT = H1 + LocalMiss1 * (H2 + LocalMiss2 * Memory)
= 10 + [L2/1MiB]*[20 + (L1/L2)*100]
= 10 + (1MiB-L1)*20/1MiB + L1*100/1MiB
= 10 + 20-20*L1/1MIB + 100*L1/1MiB
= 30 + 80*L1/1MiB
```

6) What sizes of L1 and L2 caches should we pick to minimize the AMAT? (assume the caches have non-zero size, i.e., both of them exist)
$L 1=4 B, L 2=1 M i B-4 B$
$\qquad$

## F1: Paging all CS61C students (9 points)

Consider a byte-addressed machine with a 13-bit physical address space that can hold two pages in memory. Every process is given 16 MiB of virtual memory and pages are evicted with an LRU replacement scheme.
a) What are the sizes of the following fields in bits?

Virtual Page Number: $\qquad$ 12 $\qquad$ Virtual Address Offset: $\qquad$ 12 $\qquad$

Physical Page Number: $\qquad$ 1 $\qquad$ Physical Address Offset: $\qquad$ 12 $\qquad$
b) Consider the following code snippet:

```
// a and b are both valid pointers to
// different arrays of length ARRAY_SIZE
void enumerate(int* a, int* b) {
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = i;
        b[i] = ARRAY_SIZE - i;
    }
}
```

The compiled binary for the program containing this code snippet weighs in at 4096B. If this code was executed on the machine, what is the maximum value of ARRAY_SIZE that would allow this code to execute with 0 page faults in the best-case scenario? (Answer in IEC prefix: 8Gi, 32Ti, etc)

Page Size: 4KiB
Each array must be $2 \mathrm{KiB}=2^{\wedge 11} / 4=512$ ints long

ARRAY_SIZE = $\qquad$ 512 $\qquad$
c) How could we modify the above code snippet to allow a larger ARRAY_SIZE and execute with the fewest page faults in the best-case scenario? Write the new code below:

```
for (int i = 0; i < ARRAY_SIZE; i++)
    a[i] = i;
for (int i = 0; i < ARRAY_SIZE; i++)
    b[i] = ARRAY_SIZE - i;
```

Access each array sequentially by splitting the for loop into two separate loops. Now each array can be as large as a page.

## F2: Why can't you use parallelism at a gas station? It might cause a

## spark. (10 points)

1. Optimize factorial() using SIMD intrinsic(AVX).
```
double factorial(int k) {
    int i;
    double f = 1.0;
    for (i = 1 ; i <= k ; i++) {
        f *= (double) i;
    }
    return f;
}
```

You might find the following intrinsics useful:


```
double factorial(int k) \{
    int i, j;
    double f_init[] = \{1.0, 1.0, 1.0, 1.0\};
    double f_res[4];
    double \(f=1.0\);
    // initialize f_vec
    __m256d f_vec = _ mm256 loadu pd(f init);
    // vectorize factorial
    for (i = 1 ; i <= \(k / 4 * 4\); __i += 4) \{
        double l[] = \{
            (double) __i, (double) _i+1,
            (double) _i+2, (double) _i+3\};
            __m256d data = \(\quad\) mm256 loadu pd(l);
            f vec = mm256 mul pd(f vec, data)_;
    \}
    // reduce vector
    mm256 store pd(f res, f vec);
    for ( \(\mathrm{j}=0\); \(\mathrm{j}<4\); \(\mathrm{j}++\) ) \{
        \(f=f * f r e s[j] ;\)
    \}
    // handle tails
    for ( ; i <= k ; i++) \{
        f \({ }^{*}=\) (double) \(i\);
    \}
    return f;
\}
```

$\qquad$

## F2: (continued)

## 2. Cache Coherence:

We are given the task of counting the number of even and odd numbers in an array, A, which only holds integers greater than 0 . Using a single thread is too slow, so we have decided to parallelize it with the following code:

```
#include <stdio.h>
#include "omp.h"
void count_eo (int *A, int size, int threads) {
    int result[2] = {0, 0};
    int i,j;
    omp_set_num_threads(threads);
    #pragma omp parallel for
    for (j=0; j<size; j++)
        result[(A[j] % 2 == 0) ? 0 : 1] += 1;
    printf("Even: %d\n", result[0]);
    printf("Odd: %d\n", result[1]);
}
```

As we increase the number of threads running this code:
a) Will it print the correct values for Even and Odd? If not, explain the error.

No, there may be a data race
b) Can there be false sharing if the cache block size is 8 bytes?

Yes
c) What about 4 bytes?

No
$\qquad$

## F3: This isn't a bathroom. Why is there potpourri? (10 points)

## 1. T/F Questions (Circle one. If the circling is unclear, you will receive no credit.)

1) CPUs need separate instructions to access $I / O$ devices. (True / False) False
2) Segmentation (base + bound) has fragmentation problems. (True / False) True
3) Exceptions in early pipeline stages override exceptions in later stages for a given instruction. (True / False) True
4) Exceptions are handled in the pipeline stage where they occur. (True / False) False

## 2. Polling, Interrupts, and DMA

1) Choose polling or interrupt for the following devices.

| Device | Data Rate | Transfer Block Size | Polling/Interrupt? |
| :---: | :---: | :---: | :---: |
| A | $80 \mathrm{~B} / \mathrm{s}$ | 4 B | Polling |
| B | $400 \mathrm{MB} / \mathrm{s}$ | 4 B | Interrupt |
| C | $400 \mathrm{MB} / \mathrm{s}$ | 2 KB | Interrupt |

2) To support interrupts, the CPU should be able to save and restore the current state. Which of the following should be saved before handling interrupts to ensure correct execution?
a, b
a. Program Counter
b. User Registers
c. TLB
d. Caches
3) To which device in 1) is direct memory access (DMA) most beneficial? Explain briefly.

Device C. CPU can do useful work while transferring large data blocks.

## 3. Warehouse Scale Computing and Amdahl's Law.

1) We are going to train convolutional neural networks on Amazon EC2. It turns out that $90 \%$ of training can be parallelized but the rest takes twice as long due to the communication overhead among the instances. What is the maximum speedup we can achieve?
$1 /(0.9 / s+2$ * 0.1$)<=5 \rightarrow$ Maximum speedup $=5$
2) Which of the following can increase the maximum speedup in 1)? $c, d$
a. Use more instances
b. Deploy the application across multiple arrays.
c. Reduce the number of reduce operations.
d. Increase network bandwidth.
e. Increase the capacity of disks.
$\qquad$

## F3: (continued)

4. Hamming Error-Correction Code (ECC)
1) Suppose we have a one-byte data value, $0110_{\text {two. }}$. Fill in the encoded data in the following table.

| Bit | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Encoded Data | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| P1 | X |  | X |  | X |  | X |
| P 2 |  | X | X |  |  | X | X |
| P 4 |  |  |  | X | X | X | X |

2) Assume that we have an encoded value, 1001110two with a single-bit error. Indicate below each parity bit if it has an error:

| Parity Bit | P1 | P2 | P4 |
| :---: | :---: | :---: | :---: |
| OK/ERROR | OK | Error | Error |

Incorrect bit position: $\qquad$ 6

Correct data: $\qquad$

## 5. Dependability and RAID

1) Which of the following can increase the availability?
a, d, e
a. Increasing MTTF
b. Decreasing MTTF
c. Increasing MTTR
d. Decreasing MTTR
e. Redundant data copies
2) Explain very briefly why RAID1 is the most expensive form of RAID. Mirrored disk $\rightarrow 100 \%$ overhead
3) How many check disks are needed for RAID3?

1
4) Explain why RAID5 has a higher write throughput than RAID4.

Check information is distributed across all disks in a group

