Distributed Training on the UPMEM PiM system

Xavier0301/BiMT

About

Implementation of the BTHOWeN Weightless Neural Network architecture on an UPMEM Processing-in-Memory (PiM) system.

What is going on during BTHOWeN inference?

For now, let's focus on inference, assuming we have a pre-trained model. In BTHOWeN, inference can be broken down into the following steps (a plain C sketch follows the list):

  1. The input is binarized.
  2. The binarized input is remapped to a pre-defined random order.
  3. The whole remapped input is forwarded to each discriminator (one per class).
  4. Each discriminator splits the input into a number of chunks equal to the number of filters it contains.
  5. Each chunk is hashed k times, producing k hashes (k is predefined; it is 2 in MNIST-Small, for example).
  6. The corresponding filter is probed k times (once for each hash produced), and the minimum of these probes is taken as the filter's response.
  7. A popcount is performed over the filter responses of each discriminator.
  8. The discriminator with the highest popcount gives the predicted class.
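
To make these steps concrete, here is a minimal, purely host-side C sketch of steps 3-8, assuming an unpacked bit layout and MNIST-Small-like dimensions. `hash_chunk` is a placeholder for the model's hash function (BTHOWeN uses H3-style hashing) and is not defined here; all names and constants are illustrative, not taken from this repository.

```c
#include <stdint.h>

/* Illustrative dimensions only; see the MNIST-Small numbers below. */
#define NUM_CLASSES       10      /* one discriminator per class            */
#define NUM_CHUNKS        50      /* chunks = filters per discriminator     */
#define CHUNK_BITS        28      /* bits per chunk                         */
#define NUM_HASHES         2      /* k: hashes per chunk                    */
#define FILTER_ENTRIES  1024      /* filter length = hash range [0; 1023]   */

/* filters[class][chunk][entry]: one filter bit per entry, stored unpacked. */
typedef uint8_t filters_t[NUM_CLASSES][NUM_CHUNKS][FILTER_ENTRIES];

/* Placeholder for the model's hash function (H3-style in BTHOWeN). */
extern uint16_t hash_chunk(const uint8_t chunk[CHUNK_BITS], unsigned which_hash);

/* Steps 3-8: probe every discriminator with the binarized, remapped input
 * (already split into one chunk per filter) and return the predicted class. */
int bthowen_infer(const filters_t filters,
                  const uint8_t chunks[NUM_CHUNKS][CHUNK_BITS]) {
    int best_class = 0;
    int best_popcount = -1;
    for (int d = 0; d < NUM_CLASSES; d++) {             /* each discriminator    */
        int popcount = 0;
        for (int c = 0; c < NUM_CHUNKS; c++) {          /* each chunk/filter     */
            uint8_t response = 1;
            for (unsigned h = 0; h < NUM_HASHES; h++) { /* k probes, min-reduced */
                uint16_t idx = hash_chunk(chunks[c], h) % FILTER_ENTRIES;
                if (filters[d][c][idx] < response)
                    response = filters[d][c][idx];
            }
            popcount += response;                       /* step 7: popcount      */
        }
        if (popcount > best_popcount) {                 /* step 8: argmax        */
            best_popcount = popcount;
            best_class = d;
        }
    }
    return best_class;
}
```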

Naive PiM implementation

The problem we need to solve is how to divide the work between Host and DPU, and how to divide the work within a DPU. Before we do any smart division of work, we need to establish a naive baseline implementation (a host-side sketch follows the list):

  1. Host: input binarization and remapping.
  2. Host -> DPU transfer of the input, or multiple inputs.
  3. DPU: run a number of tasklets equal to the number of inputs received, and have each tasklet perform the whole inference for its input.
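
As a rough illustration of this flow, here is a hedged host-side sketch using the UPMEM host API (`dpu.h`). The binary name `bthowen_dpu`, the transfer symbols `input_buffer` / `predictions`, and the batch size are assumptions, not names from this repository.

```c
#include <dpu.h>        /* UPMEM host-side API */
#include <stdint.h>

#define NR_DPUS          1
#define INPUTS_PER_DPU  16       /* assumption: one tasklet per input            */
#define INPUT_BYTES   1504       /* ~1500 B unpacked input, padded to 8-B multiple */

int main(void) {
    uint8_t inputs[INPUTS_PER_DPU][INPUT_BYTES];    /* binarized + remapped on host */
    uint32_t predictions[INPUTS_PER_DPU];
    /* ... fill `inputs` from the dataset ... */

    struct dpu_set_t set;
    DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));
    DPU_ASSERT(dpu_load(set, "./bthowen_dpu", NULL));

    /* Host -> DPU: the batch of inputs. */
    DPU_ASSERT(dpu_copy_to(set, "input_buffer", 0, inputs, sizeof(inputs)));

    /* DPU: each tasklet runs the whole inference for one input. */
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    /* DPU -> Host: one predicted class per input. */
    DPU_ASSERT(dpu_copy_from(set, "predictions", 0, predictions, sizeof(predictions)));

    DPU_ASSERT(dpu_free(set));
    return 0;
}
```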

Considerations for a PiM implementation

  1. Each DPU has 64 MB of MRAM
  2. Each DPU has 64 KB of WRAM

MNIST-Small

If we take the MNIST-Small model of 70 KiB as a reference (the rounded numbers are also collected in a short sketch after the list):

  1. Each input is 1500 bits, or 1500 B if we don't pack it
  2. Each input is decomposed into around 50 chunks of 28 bits (or 28 B; the division is easy if we don't pack)
  3. Each of these 50 chunks is hashed into 2 different numbers in the [0; 1023] range
  4. The 500 filters (50 per discriminator) process these hashes one by one, applying a min-reduction to each filter's probe outputs
  5. Each discriminator takes these 50 results and applies a popcount
  6. The model takes these 10 popcounts and outputs the argmax
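
The same rounded numbers, collected as constants for orientation only (this is an assumed layout, not the repository's actual headers):

```c
/* MNIST-Small dimensions, rounded as in the list above (assumed layout).      */
#define SMALL_INPUT_BITS      1500    /* binarized input size                  */
#define SMALL_CHUNK_BITS        28    /* bits hashed together per filter       */
#define SMALL_NUM_CHUNKS        50    /* ~ SMALL_INPUT_BITS / SMALL_CHUNK_BITS */
#define SMALL_NUM_HASHES         2    /* k                                     */
#define SMALL_FILTER_ENTRIES  1024    /* hash range [0; 1023]                  */
#define SMALL_NUM_CLASSES       10
/* filters per discriminator = one per chunk ~ 50; ~500 filters in total       */
/* probes per input ~ 10 * 50 * 2 = 1000 (the "2*500 filter probes" below)     */
/* filter storage ~ 500 * 1024 bits ~ 62 KiB with these rounded counts,        */
/* consistent with the ~70 KiB model size quoted above                         */
```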

MNIST-Large

If we take the MNIST-Large model of 960 KiB as a reference (rough arithmetic follows the list):

  1. Each input is 5000 bits, or 5000 B if we don't pack it
  2. Each input is decomposed into around 100 chunks of 49 bits (or 49 B; the division is easy if we don't pack)
  3. Each of these 100 chunks is hashed into 4 different numbers in the [0; 8191] range
  4. The 1000 filters (100 per discriminator) process these hashes one by one, applying a min-reduction to each filter's probe outputs
  5. Each discriminator takes these 100 results and applies a popcount
  6. The model takes these 10 popcounts and outputs the argmax
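
The corresponding rough arithmetic for MNIST-Large, using the rounded numbers from the list above:

```c
/* Rough MNIST-Large arithmetic (rounded numbers from the list above):            */
/*   probes per input      = 10 discriminators * 100 filters * 4 hashes = 4000    */
/*   per-discriminator size = 100 filters * 8192 bits ~ 100 KiB                   */
/*   whole model           ~ 10 * 100 KiB, consistent with the 960 KiB above      */
```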

PiM implementation: targeted for MNIST-Small

The whole model almost fits into WRAM. I am not sure what the relative load of hashing vs. filter probing is, but one pool of tasklets will be reserved for hashing and another for filter probing; a DPU-side sketch follows the list.

  1. Host: input binarization and remapping.
  2. (Host: hashing, or do it on the DPUs.)
  3. Host -> DPU transfer of the input, or multiple inputs.
  4. (DPU: hashing. We can divide the hashing work as follows: each tasklet takes charge of hashing #Chunks / #Hashing_Tasklets chunks. It is best if the work is done in a consistently predictable pattern (e.g. from chunk 0 to the last chunk), so that the tasklets working on filters can start in parallel.)
  5. DPU: we need to do around 2*500 filter probes for each input. We can divide this work by letting each tasklet cover all discriminators but only a subset of filters. The exact number of chunks each tasklet handles is #Discriminators * (#Chunks / #Filter_Tasklets).
  6. DPU: we can distribute the popcounts among the filter tasklets as well.
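
A hedged DPU-side sketch of this split, assuming the UPMEM tasklet runtime (`me()` from `<defs.h>`, `NR_TASKLETS` set at compile time, a barrier from `<barrier.h>`). The pool sizes, buffer names, placeholder hash/probe bodies and the full-barrier synchronization are simplifying assumptions, not the actual implementation.

```c
#include <defs.h>       /* me(); NR_TASKLETS is defined at compile time */
#include <barrier.h>
#include <stdint.h>

#define NUM_CHUNKS            50
#define NUM_HASHES             2
#define NUM_CLASSES           10
#define NR_HASHING_TASKLETS    4      /* assumption: how the tasklet pool splits */
#define NR_FILTER_TASKLETS    (NR_TASKLETS - NR_HASHING_TASKLETS)

BARRIER_INIT(hashes_ready, NR_TASKLETS);

/* Shared WRAM buffers for one input; the filter tables themselves are left
 * out of this sketch (the ~70 KiB model does not quite fit in 64 KB WRAM). */
uint16_t hashes[NUM_CHUNKS][NUM_HASHES];
uint8_t  responses[NUM_CLASSES][NUM_CHUNKS];

int main(void) {
    unsigned int tid = me();

    if (tid < NR_HASHING_TASKLETS) {
        /* Hashing pool: tasklet t handles chunks t, t + pool, t + 2*pool, ...
         * i.e. a predictable order starting from chunk 0. */
        for (unsigned int c = tid; c < NUM_CHUNKS; c += NR_HASHING_TASKLETS)
            for (unsigned int h = 0; h < NUM_HASHES; h++)
                hashes[c][h] = 0 /* placeholder: hash of chunk c, function h */;
    }

    /* Simplification: a full barrier instead of letting filter tasklets start
     * as soon as the first hashes are available. */
    barrier_wait(&hashes_ready);

    if (tid >= NR_HASHING_TASKLETS) {
        unsigned int ft = tid - NR_HASHING_TASKLETS;
        /* Filter pool: each tasklet covers all discriminators but only a subset
         * of chunks, i.e. #Discriminators * (#Chunks / #Filter_Tasklets) filters. */
        for (unsigned int c = ft; c < NUM_CHUNKS; c += NR_FILTER_TASKLETS)
            for (unsigned int d = 0; d < NUM_CLASSES; d++)
                responses[d][c] = 0 /* placeholder: min over the k probes */;
    }

    /* The per-discriminator popcounts over `responses` can be distributed over
     * the filter tasklets in the same strided way. */
    return 0;
}
```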

PiM implementation: targeted for MNIST-Large

Each discriminator almost fits into WRAM. It is more advantageous to do the hashing on the host so that the hash results can be reused across the 10 DPUs (one per discriminator); a host-side sketch follows the list.

  1. Host: input binarization and remapping.
  2. Host: input hashing.
  3. Host -> DPU transfer of the hashed input (or multiple hashed inputs) to the 10 DPUs in parallel.
  4. DPU: the work divides simply and elegantly: each tasklet takes charge of #Filters / #Tasklets filters.
  5. DPU: we can distribute the popcount between tasklets as well.
  6. DPU -> Host transfer of the result, or a batch of results.
  7. Host: compute the argmax.
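
A hedged host-side sketch of this mapping with one DPU per discriminator, again assuming the UPMEM host API; the binary and symbol names (`bthowen_large_dpu`, `hashed_input`, `popcount_out`) are placeholders.

```c
#include <dpu.h>        /* UPMEM host-side API */
#include <stdint.h>
#include <stdio.h>

#define NR_DPUS          10      /* one DPU per discriminator */
#define NUM_CHUNKS      100
#define NUM_HASHES        4

int main(void) {
    /* The hashed input is identical for every discriminator, so it is computed
     * once on the host and broadcast to all 10 DPUs; each DPU holds only its
     * own discriminator's filters and returns a single popcount. */
    uint16_t hashed_input[NUM_CHUNKS][NUM_HASHES];   /* host-side hashing */
    uint64_t popcounts[NR_DPUS];
    /* ... binarize, remap and hash the input into `hashed_input` ... */

    struct dpu_set_t set, dpu;
    uint32_t each_dpu;
    DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));
    DPU_ASSERT(dpu_load(set, "./bthowen_large_dpu", NULL));

    /* Host -> 10 DPUs in parallel: broadcast the hashed input. */
    DPU_ASSERT(dpu_copy_to(set, "hashed_input", 0, hashed_input, sizeof(hashed_input)));
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    /* DPU -> Host: gather one popcount per discriminator. */
    DPU_FOREACH(set, dpu, each_dpu) {
        DPU_ASSERT(dpu_prepare_xfer(dpu, &popcounts[each_dpu]));
    }
    DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_FROM_DPU, "popcount_out", 0,
                             sizeof(uint64_t), DPU_XFER_DEFAULT));

    /* Host: argmax over the 10 popcounts gives the predicted class. */
    uint32_t best = 0;
    for (uint32_t d = 1; d < NR_DPUS; d++)
        if (popcounts[d] > popcounts[best])
            best = d;
    printf("predicted class: %u\n", best);

    DPU_ASSERT(dpu_free(set));
    return 0;
}
```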
