
Commit

update docs
chengchingwen committed Dec 21, 2023
1 parent fbcf4d6 commit 40922f8
Showing 7 changed files with 55 additions and 16 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ The core idea of this package is to make the attention operation composable, so
be used directly for high-dimensional attention, such as over images or video.


This package contains 3 submodules: `MatMul`, `Masks`, and `Functional`.
This package contains 3 submodules: `Matmul`, `Masks`, and `Functional`.

1. `Matmul` defines an Array wrapper `CollapsedDimsArray{T}(array, ni::Integer, nj::Integer)` which treats an n-dimensional
array as a 3-dimensional array while preserving the original shape. By explicitly specifying which dimensions should be
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -4,7 +4,7 @@ using Documenter
DocMeta.setdocmeta!(NeuralAttentionlib, :DocTestSetup, :(using NeuralAttentionlib); recursive=true)

makedocs(;
modules=[NeuralAttentionlib],
modules=[NeuralAttentionlib.Matmul, NeuralAttentionlib.Masks, NeuralAttentionlib.Functional],
authors="chengchingwen <adgjl5645@hotmail.com> and contributors",
repo="https://github.com/chengchingwen/NeuralAttentionlib.jl/blob/{commit}{path}#{line}",
sitename="NeuralAttentionlib.jl",
6 changes: 6 additions & 0 deletions docs/src/api.md
@@ -11,6 +11,12 @@ Modules = [NeuralAttentionlib, NeuralAttentionlib.Functional, NeuralAttentionlib
Pages = ["functional.jl"]
```

```@autodocs
Modules = [NeuralAttentionlib]
Pages = ["types.jl", "utils.jl"]
Filter = t -> !(t isa typeof(NeuralAttentionlib.var"@imexport"))
```

## Mask

```@autodocs
27 changes: 23 additions & 4 deletions docs/src/index.md
@@ -9,7 +9,29 @@ CurrentModule = NeuralAttentionlib
`NeuralAttentionlib.jl` aims to provide highly extendable and reusable functions for implementing attention variants.
It will be powering [`Transformers.jl`](https://github.com/chengchingwen/Transformers.jl).

## Outline

# Design

![overview](assets/overview.png)

The core idea of this package is to make the attention operation composable, so that most attention variants can
be easily defined without rewriting other parts. For example, normal attention uses `softmax` on the attention score to
normalize the weight of each entry. If you want to replace `softmax` with another normalization function, such as an
L2 norm, the problem is that different normalizations require different ways to mask specific entries such as paddings.
With this package, we can easily do this by providing a different `AbstractMaskOp` to `masked_score`, so no copy-paste
is needed. As another example, some position embeddings add values to the attention scores; with this package, you can
directly chain the position embedding function (or use `biased_score`) with other score functions. Moreover, the same
definition can be used directly for high-dimensional attention, such as over images or video.


This package contains 3 submodules: `Matmul`, `Masks`, and `Functional`.

1. `Matmul` defines an Array wrapper `CollapsedDimsArray{T}(array, ni::Integer, nj::Integer)` which treats an n-dimensional array as a 3-dimensional array while preserving the original shape. By explicitly specifying which dimensions should be the "batch" and "length" dimensions, the implementations of attention do not need to worry about the input dimensions.
2. `Masks` provides an interface for defining non-allocating masks, with support for both CPU and GPU (using Julia's broadcast interface), and many pre-defined masks. For example, `CausalMask()` is just a Julia object and it will NOT allocate an `n^2` attention score mask on either CPU or GPU. These masks are also composable: you can use `&`/`|` to combine, for example, a causal mask and a padding mask without extra allocation or the need to write extra code.
3. `Functional` contains the implementations of the "attention score"s, "mixing"s, and "attention operation"s. The interface for "attention score"s allows you to chain different score functions together, such as `normalized_score`, `masked_score`, and `biased_score`, and the interface for "attention operation"s allows you to provide different score functions and mixing functions. Other parts, such as reshaping for multi-head attention, are handled automatically. A rough sketch of how these pieces fit together is shown below.
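
The sketch below is illustrative only: `CausalMask`, `normalized_score`, `masked_score`, the `AbstractMaskOp` idea, and the `$` chaining come from the description above, while the remaining names (`LengthMask`, `NaiveMaskOp`, `scaled_dot_product_score`, `weighted_sum_mixing`, `generic_qkv_attention`), their argument orders, and the import lines are assumptions, so treat it as a sketch of the design rather than copy-paste-ready code.

```julia
using NNlib: softmax
using NeuralAttentionlib: CausalMask, LengthMask, NaiveMaskOp,
    normalized_score, masked_score, scaled_dot_product_score,
    weighted_sum_mixing, generic_qkv_attention, $

# hypothetical sizes: feature = 64, length = 10, batch = 3
q = randn(Float32, 64, 10, 3)
k = randn(Float32, 64, 10, 3)
v = randn(Float32, 64, 10, 3)

# Non-allocating masks, composed with `&`: causal structure plus per-sample lengths (padding).
mask = CausalMask() & LengthMask([10, 7, 5])

# Chain score functions with `$` (partial application, see the Terminology page):
# a softmax-normalized, masked, scaled dot-product score.
score = normalized_score $ softmax $
        masked_score $ NaiveMaskOp() $ mask $
        scaled_dot_product_score

# Provide the score chain and a mixing function to a generic attention operation.
y = generic_qkv_attention(weighted_sum_mixing, score, q, k, v)

# Swapping softmax for another normalizer (e.g. an L2-style norm) only changes one link of the chain.
l2norm(x; dims = 1) = x ./ sqrt.(sum(abs2, x; dims = dims) .+ eps(eltype(x)))
score_l2 = normalized_score $ l2norm $ masked_score $ NaiveMaskOp() $ mask $ scaled_dot_product_score
```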


# Outline

```@contents
Pages = [
@@ -18,6 +40,3 @@ Pages = [
"api.md",
]
```



12 changes: 6 additions & 6 deletions docs/src/term.md
@@ -1,17 +1,17 @@
# Terminology

Explanation of the terms and naming used in these docs.

## Prerequisite

Some terms that are useful for understanding these docs.

### 1. [PartialFunctions](https://github.com/archermarx/PartialFunctions.jl)
### 1. Partial Functions

This actually lives outside the scope of this package, but it is extremely useful for illustrating the overall design.
We'll use the `$` operation to denote partial function application
(i.e. `f $ x` is equivalent to `(arg...)->f(x, arg...)`).
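
For example (assuming `$` can be imported directly from `NeuralAttentionlib`; otherwise it can be accessed as `NeuralAttentionlib.:$`):

```julia
using NeuralAttentionlib: $

prefix = println $ "score = "   # fixes the first argument of println
prefix(0.5)                     # equivalent to println("score = ", 0.5)

add = (+) $ 1 $ 2               # arguments accumulate left to right
add(3)                          # +(1, 2, 3) == 6
```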


### 2. Feature / Length / Batch Dimension

@@ -23,7 +23,7 @@ Under the context of attention operation in deep learning, the input data can be

For example, given 3 sentences as a batch, where each sentence has 10 words and each word is represented by
a vector of 32 elements, the data will be stored in a 3-dim array of size `(32, 10, 3)`.
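
As a small sketch (hypothetical sizes; the reading of `CollapsedDimsArray(array, ni, nj)` in the comments, where the last `nj` dimensions collapse into "batch", the `ni` dimensions before them into "length", and the rest into "feature", is an interpretation rather than a quote from its docs):

```julia
using NeuralAttentionlib: CollapsedDimsArray

# 3 sentences per batch, 10 words each, 32-element word vectors:
x = randn(Float32, 32, 10, 3)        # (feature, length, batch)

# For higher-dimensional data, CollapsedDimsArray groups dimensions explicitly.
# A (32, 16, 16, 3) image batch viewed as feature = 32, length = 16 * 16, batch = 3,
# while the original 4-dimensional shape is preserved underneath:
img = randn(Float32, 32, 16, 16, 3)
ca  = CollapsedDimsArray(img, 2, 1)  # ni = 2 dims form "length", nj = 1 dim forms "batch"
size(ca)                             # expected (32, 256, 3) in the collapsed 3-dimensional view
```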

Generally speaking, *batch* is how many independent data samples you run in one function call,
usually just for performance/optimization needs. *length* is how many entries each data sample has,
like the number of words in a sentence or the number of pixels in an image. *feature* is the number of values you use to
@@ -67,7 +67,7 @@ The overall attention operation can be viewed as three mutually inclusive blocks:
The attention operation is actually a special way to "mix" (or "pick", as it is often described) the input information.
In (probably) the first [attention paper](https://arxiv.org/abs/1409.0473), attention is defined as a weighted
sum of the input sequence given a word embedding. The idea is further generalized to *QKV attention* in the first
[transformer paper](https://arxiv.org/abs/1706.03762).
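
As a toy illustration of that weighted sum (single query, no masking, no multi-head; purely illustrative and not the package's implementation):

```julia
using LinearAlgebra  # for K' * q

# K, V hold one key/value vector per input entry, with size (feature, length);
# q is a single query vector of size (feature,).
function toy_qkv_attention(q::AbstractVector, K::AbstractMatrix, V::AbstractMatrix)
    score = (K' * q) ./ sqrt(length(q))   # one score per input entry
    α = exp.(score .- maximum(score))     # numerically stable softmax
    α ./= sum(α)
    return V * α                          # weighted sum of the value vectors
end

q = randn(Float32, 8)
K = randn(Float32, 8, 10)
V = randn(Float32, 8, 10)
toy_qkv_attention(q, K, V)   # an 8-element mixed output vector
```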

### 1. Attention Score

8 changes: 4 additions & 4 deletions src/types.jl
@@ -90,7 +90,7 @@ end
p::F # dropout probability
end
Structure for holding parameters of `multihead_qkv_attention`.
Structure for holding parameters of [`multihead_qkv_attention`](@ref).
(op::MultiheadQKVAttenOp)(q, k, v, mask = nothing)
@@ -115,7 +115,7 @@ const MultiheadQKVAttenOpWithScore{F} = WithScore{MultiheadQKVAttenOp{F}}
p::F # dropout probability
end
Structure for holding parameters of `multihead_qkv_attention`.
Structure for holding parameters of [`multihead_qkv_attention`](@ref).
(op::CausalMultiheadQKVAttenOp)(q, k, v, mask = nothing)
@@ -140,7 +140,7 @@ const CausalMultiheadQKVAttenOpWithScore{F} = WithScore{CausalMultiheadQKVAttenO
p::F
end
Structure for holding parameters of `grouped_query_attention`.
Structure for holding parameters of [`grouped_query_attention`](@ref).
(op::GroupedQueryAttenOp)(q, k, v, mask = nothing)
@@ -167,7 +167,7 @@ const GroupedQueryAttenOpWithScore{F} = WithScore{GroupedQueryAttenOp{F}}
p::F
end
Structure for holding parameters of `grouped_query_attention`.
Structure for holding parameters of [`grouped_query_attention`](@ref).
(op::CausalGroupedQueryAttenOp)(q, k, v, mask = nothing)
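
For reference, these op structs are callable, as the signatures above show. A hypothetical usage sketch follows; only the `(op)(q, k, v, mask = nothing)` call form comes from the docstrings, while the `MultiheadQKVAttenOp(4)` constructor form and the import line are assumptions:

```julia
using NeuralAttentionlib: MultiheadQKVAttenOp

q = randn(Float32, 64, 10, 3)   # (feature, length, batch)
k = randn(Float32, 64, 12, 3)
v = randn(Float32, 64, 12, 3)

op = MultiheadQKVAttenOp(4)     # assumed constructor: 4 heads, default (no) dropout
y  = op(q, k, v)                # a mask, e.g. CausalMask(), could be passed as the 4th argument
```
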
14 changes: 14 additions & 0 deletions src/utils.jl
@@ -4,6 +4,14 @@ as_bool(b::StaticBool) = Bool(b)
as_char(c::Char) = c
as_char(c::StaticInt) = Char(c)

"""
PrefixedFunction(f, args::NTuple{N}) <: Function
A type representing a partially-applied version of the function `f`, with the first `N` arguments fixed to the
values `args`. In other words, `PrefixedFunction(f, args)` behaves similarly to `(xs...)->f(args..., xs...)`.
See also [`NeuralAttentionlib.:\$`](@ref).
"""
struct PrefixedFunction{F, A<:Tuple} <: Function
f::F
arg::A
@@ -17,6 +25,12 @@ Base.show(io::IO, ::MIME"text/plain", f::PrefixedFunction) = show(io, f)

@inline (f::PrefixedFunction)(args...) = f.f(f.arg..., args...)

"""
f \$ x
f \$ x \$ y \$ ...
Partially-applied function. Return a [`PrefixedFunction`](@ref).
"""
($)(f::Function, x) = PrefixedFunction(f, (x,))
($)(f::PrefixedFunction, x) = PrefixedFunction(f.f, (f.arg..., x))
($)(f::PrefixedFunction, g::PrefixedFunction) = PrefixedFunction(f.f, (f.arg..., g.f, g.arg...))
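
A brief illustration of how these definitions compose (assuming the unexported names can be imported directly):

```julia
using NeuralAttentionlib: PrefixedFunction, $

f = (+) $ 1 $ 2           # PrefixedFunction(+, (1, 2)), i.e. (xs...) -> +(1, 2, xs...)
f isa PrefixedFunction    # true
f(3)                      # 6
```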
