# A Brief Overview of Computer Architecture
- Big ideas: Memory, CPUs, and Networking
- Let's draw some pictures!

## Link to Lecture
https://tinyurl.com/rsmas-python2


## Memory Hierarchy
![](http://faculty.etsu.edu/tarnoff/othr2150/mem_hier.gif)

## CPU Layout and Networking
1. Modern machines typically have somewhere between 1 and 16 cores (you can check yours with the snippet below)
2. Modern processors also support hyperthreading / virtual cores
	- Often much of the benefit of two cores for the price of one!
![](https://www.gamingscan.com/wp-content/uploads/2017/12/cpu-core-for-gaming.jpg)
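As a quick check on your own machine, the standard library reports how many logical cores the OS sees (hyperthreads count, so a hyperthreaded 4-core CPU typically shows 8):

```python
import os
import multiprocessing

# Both report *logical* cores, so virtual / hyperthreaded cores are counted too
print(os.cpu_count())
print(multiprocessing.cpu_count())
```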

## Numbers Every Programmer Should Know
[Source](https://gist.github.com/hellerbarde/2843375) and Jeff Dean, Peter Norvig, etc.

Latency Comparison Numbers (~2012)
-----------

| Event                              | Nanoseconds   | Microseconds | Milliseconds | Comparison    |
|------------------------------------|--------------:|--------:|----:|-----------------------------|
| L1 cache reference                 |           0.5 |       - |   - | -                           |
| L2 cache reference                 |           7.0 |       - |   - | 14x L1 cache                |
| Main memory reference              |         100.0 |       - |   - | 20x L2 cache, 200x L1 cache |
| Compress 1K bytes with Zippy       |       3,000.0 |       3 |   - | -                           |
| Send 1K bytes over 1 Gbps network  |      10,000.0 |      10 |   - | -                           |
| Read 4K randomly from SSD          |     150,000.0 |     150 |   - | ~1GB/sec SSD                |
| Read 1 MB sequentially from memory |     250,000.0 |     250 |   - | -                           |
| Round trip within same datacenter  |     500,000.0 |     500 |   - | -                           |
| Read 1 MB sequentially from SSD    |   1,000,000.0 |   1,000 |   1 | ~1GB/sec SSD, 4X memory     |
| Disk seek                          |  10,000,000.0 |  10,000 |  10 | 20x datacenter roundtrip    |
| Read 1 MB sequentially from disk   |  20,000,000.0 |  20,000 |  20 | 80x memory, 20X SSD         |
| Send packet CA → Netherlands → CA  | 150,000,000.0 | 150,000 | 150 | -                           |


Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

## Humanized Latency Numbers
Let's multiply all of these durations by a billion:

Magnitudes:
----------------------------------------------

Minute:
-----
    L1 cache reference                  0.5 s         One heart beat (0.5 s)
    L2 cache reference                  7 s           Long yawn

Hour:
-----
    Main memory reference               100 s         Brushing your teeth
    Compress 1K bytes with Zippy        50 min        One episode of a TV show (including ad breaks)

Day:
----
    Send 2K bytes over 1 Gbps network   5.5 hr        From lunch to end of work day

Week:
-----
    SSD random read                     1.7 days      A normal weekend
    Read 1 MB sequentially from memory  2.9 days      A long weekend
    Round trip within same datacenter   5.8 days      A medium vacation
    Read 1 MB sequentially from SSD    11.6 days      Waiting for almost 2 weeks for a delivery

Year
----
    Disk seek                           16.5 weeks    A semester in university
    Read 1 MB sequentially from disk    7.8 months    Almost producing a new human being
    The above 2 together                1 year

Decade
-----
    Send packet CA->Netherlands->CA     4.8 years     Average time it takes to complete a bachelor's degree


## Conclusions
1. CPUs are really fast
    - The goal: Have the CPU always be working
2. Memory is fast-ish, hard drives are slow
3. You want to make your CPU happy by always giving it data that is close at hand
4. Networks are reeeeaaally slow

# Processes and Threads
1. Since there's so much downtime when we need to get data from disk or over a network, we can use the CPU for another task
2. Switching between tasks can keep our CPU working
    - 3,000-30,000 ns for a context switch, or about 50 minutes to 8 hours in human time
3. To allow this, we want multiple different programs that are able to run at the same time
    - These are called processes
4. A process has several parts
    - The program code
    - The call stack (what is actually happening in the program)
    - Variables, open files, memory, etc.
5. Within a process, we can also have multiple mini-programs running
    - These are called threads
    - A child thread and its parent share the same memory (usually), as in the sketch below
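A minimal sketch of that shared memory, using the standard-library `threading` module (the variable name `counter` is just for illustration):

```python
import threading

counter = 0  # owned by the parent, visible to the child thread

def child():
    global counter
    counter += 1  # the child modifies the parent's memory directly

t = threading.Thread(target=child)
t.start()
t.join()
print(counter)  # 1 -- the child's update is visible to the parent
```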

## Rolling Your Own Processes in Python: `subprocess`
1. Running a script from the command line
```python
import subprocess
# Pass the command as a list of arguments (or as a single string with shell=True)
subprocess.call(['python3', 'run_me.py'])
```
2. Getting the stdout and stderr from a process
```python
# shell=True lets the shell expand the *.py glob; PIPE captures both streams
output, errors = subprocess.Popen('ls -la *.py', shell=True,
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE).communicate()
```
3. Piping processes
```python
p1 = subprocess.Popen(["cat", "file.log"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["tail", "-1"], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits.
output, err = p2.communicate()
```

# Parallel Programming
1. Lots of models
2. Hard to do super efficiently
3. However, there are some easy ways to get gains

## Multi-Core Parallel Computing vs. Distributed Parallel Computing
1. Using many cores on a single machine is different from using cores spread across multiple machines
2. Pegasus, as far as I can tell, makes it look as if you are using a lot of cores on one machine

## Big Ideas that Matter

### `numpy` is your friend
1. Vectorizing functions is the easiest step toward writing fast and parallel code (see the sketch below)
2. Many NumPy operations are already parallelized across cores (via optimized, compiled libraries)
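A minimal sketch of what vectorizing means, with a made-up array: replace an element-by-element Python loop with a single NumPy expression that runs in compiled code.

```python
import numpy as np

x = np.random.rand(1_000_000)

# Loop version: one Python-level operation per element (slow)
slow = [xi * 2.0 + 1.0 for xi in x]

# Vectorized version: one NumPy expression over the whole array,
# executed in compiled code (and sometimes across multiple cores)
fast = x * 2.0 + 1.0
```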

### Split-Apply-Combine
1. Split
    - Think about how to break up the data so that you can compute the same thing on smaller parts of the larger dataset
2. Apply
    - Solve the problem a bunch of times on smaller versions of your large data
3. Combine
    - Then combine those mini solutions into one big solution (sketched below)
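A toy sketch of the pattern with NumPy (the array and chunk count are made up):

```python
import numpy as np

data = np.arange(12)

# Split: break the data into smaller, equally sized chunks
chunks = np.array_split(data, 4)

# Apply: solve the same problem on each small chunk
partial_sums = [chunk.sum() for chunk in chunks]

# Combine: merge the mini solutions into one big solution
total = sum(partial_sums)
assert total == data.sum()
```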

### Data Needs to be Independent
1. In image analysis, you can apply convolutions in parallel because each output pixel depends only on a small neighborhood of the input
2. If every computation needs all of the data, parallelizing is much less efficient (see the sketch below)
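A quick way to see the difference, with a toy array: squaring is element-wise and independent, while a running total makes every output depend on all earlier inputs.

```python
import numpy as np

x = np.arange(8)

# Independent: each output depends on a single input element,
# so the work can be split across cores (or machines) freely
squares = x ** 2

# Not independent: each output depends on all earlier inputs,
# so chunks cannot be computed in isolation without extra coordination
running_total = np.cumsum(x)
```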



## MapReduce or Map and then Reduce
```python
# The parallel version is equivalent to the following sequential code
def map_parallel(func, input_data):
    intermediate_data = []
    for i in input_data:
        intermediate_data.append(func(i))
    return intermediate_data

def reduce(func, intermediate_data):
    starting_value = None
    for value in intermediate_data:
        if starting_value is None:
            starting_value = value
        else:
            starting_value = func(starting_value, value)
    return starting_value
```
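For instance, the two functions above could be used like this (a hypothetical word-count task):

```python
# Map: words per line; Reduce: sum the per-line counts into a grand total
lines = ["a b c", "d e", "f"]
counts = map_parallel(lambda line: len(line.split()), lines)
total = reduce(lambda a, b: a + b, counts)
print(total)  # 6
```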

### MapReduce in Python: `multiprocessing`
1. Prepare data to be split between processors
2. Use `multiprocessing` to apply `map` step
3. Combine the results of the `map` result as needed
4. Easy when your data is small
5. Good for simple functions
6. [Some good examples](https://stackoverflow.com/questions/2846653/how-to-use-threading-in-python)

```python
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))

###################################

# Sequential: my_function and my_array stand in for your own code and data
results = []
for item in my_array:
    results.append(my_function(item))

# Parallel: multiprocessing.dummy offers the same Pool API but uses threads
# instead of processes, which is handy for I/O-bound work
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)
pool.close()
pool.join()
```

### `Numba`
1. For more complicated parallel tasks
2. Compiles Python functions to fast machine code just in time (via LLVM)
3. Can parallelize code easily (if you read the documentation :D )

```python
import numba
import numpy as np

# nopython=True compiles the whole function to machine code;
# parallel=True lets Numba run supported array operations across cores
@numba.jit(nopython=True, parallel=True)
def logistic_regression(Y, X, w, iterations):
    for i in range(iterations):
        w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
    return w
```
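A minimal way to call it, with made-up random data just to exercise the compiled function (shapes and iteration count are arbitrary):

```python
import numpy as np

X = np.random.randn(1000, 10)
Y = np.random.choice(np.array([-1.0, 1.0]), size=1000)
w = logistic_regression(Y, X, np.zeros(10), 100)
```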


# Post Session Discussion
1. Virtual environments: https://conda.io/docs/user-guide/tasks/manage-environments.html
2. Figuring out your dependencies: `pip freeze`
3. `pylint <filename>` for linting
4. `autopep8 <filename>` for correcting easy linting errors

## Testing
1. Two types
	1. Unit Test
		- Tests an individual function
		- Define some input and the desired output
		- Check whether the function actually produces it
	2. Integration Test
		- Combining functions
		- Combining modules
		- Very simple example cases
2. Tests usually live in a separate folder (see the sketch below)
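A minimal unit-test sketch (file and function names are hypothetical; runnable with `python -m pytest`):

```python
# tests/test_math_utils.py  (hypothetical path -- tests usually live in tests/)

def add(a, b):
    """Function under test; normally you would import it from your own module."""
    return a + b

def test_add():
    # Define some input and the desired output, then check that they match
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```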