
Commit

update docs
chengchingwen committed Dec 21, 2023
1 parent fbcf4d6 commit 40922f8
Showing 7 changed files with 55 additions and 16 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ The core idea of this package is to make the attention operation composable, so
be used directly for high-dimensional attention, such as over images or video.


This package contains 3 submodules: `MatMul`, `Masks`, and `Functional`.
This package contains 3 submodules: `Matmul`, `Masks`, and `Functional`.

1. `Matmul` defines an Array wrapper `CollapsedDimsArray{T}(array, ni::Integer, nj::Integer)` which treats an n-dimensional
array as a 3-dimensional array while preserving the original shape. By explicitly specifying which dimensions should be
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -4,7 +4,7 @@ using Documenter
DocMeta.setdocmeta!(NeuralAttentionlib, :DocTestSetup, :(using NeuralAttentionlib); recursive=true)

makedocs(;
modules=[NeuralAttentionlib],
modules=[NeuralAttentionlib.Matmul, NeuralAttentionlib.Masks, NeuralAttentionlib.Functional],
authors="chengchingwen <adgjl5645@hotmail.com> and contributors",
repo="https://github.com/chengchingwen/NeuralAttentionlib.jl/blob/{commit}{path}#{line}",
sitename="NeuralAttentionlib.jl",
6 changes: 6 additions & 0 deletions docs/src/api.md
@@ -11,6 +11,12 @@ Modules = [NeuralAttentionlib, NeuralAttentionlib.Functional, NeuralAttentionlib
Pages = ["functional.jl"]
```

```@autodocs
Modules = [NeuralAttentionlib]
Pages = ["types.jl", "utils.jl"]
Filter = t -> !(t isa typeof(NeuralAttentionlib.var"@imexport"))
```

## Mask

```@autodocs
27 changes: 23 additions & 4 deletions docs/src/index.md
@@ -9,7 +9,29 @@ CurrentModule = NeuralAttentionlib
`NeuralAttentionlib.jl` aims to provide highly extendable and reusable functions for implementing attention variants.
It will be powering [`Transformers.jl`](https://github.com/chengchingwen/Transformers.jl).

## Outline

# Design

![overview](assets/overview.png)

The core idea of this package is to make the attention operation composable, so that most attention variants can
be easily defined without rewriting other parts. For example, normal attention uses `softmax` on the attention score to
normalize the weight of each entry. If you want to replace `softmax` with another normalization function, such as an
L2 norm, the problem is that different normalizations require different ways to mask specific entries such as paddings.
With this package, we can easily do this by providing a different `AbstractMaskOp` to `masked_score`, so no copy-paste
is needed. As another example, some position embeddings add values to the attention scores; with this package, you can
directly chain the position embedding function (or use `biased_score`) with other score functions. Moreover, the same
definition can be used directly for high-dimensional attention, such as over images or video.


This package contains 3 submodules: `Matmul`, `Masks`, and `Functional`.

1. `Matmul` defines an Array wrapper `CollapsedDimsArray{T}(array, ni::Integer, nj::Integer)` which treats an n-dimensional array as a 3-dimensional array while preserving the original shape. By explicitly specifying which dimensions should be the "batch" and "length" dimensions, the implementations of attention do not need to worry about the input dimensions.
2. `Masks` provides an interface for defining non-allocating masks, with support for both CPU and GPU (using Julia's broadcast interface), and many pre-defined masks. For example, `CausalMask()` is just a Julia object and it will NOT allocate an `n^2` attention score mask on either CPU or GPU. These masks are also composable: you can use `&`/`|` to combine, for example, a causal mask and a padding mask without extra allocation or the need to write extra code.
3. `Functional` contains the implementations of the "attention score"s, "mixing"s, and "attention operation"s. The interface for "attention score"s allows you to chain different score functions together, such as `normalized_score`, `masked_score`, and `biased_score`, and the interface for "attention operation"s allows you to provide different score functions and mixing functions. Other parts, such as reshaping for multi-head attention, are handled automatically. A rough sketch of how these pieces fit together is shown below.
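
The sketch below is illustrative only: `CausalMask`, `normalized_score`, `masked_score`, the `AbstractMaskOp` idea, and the `$` chaining come from the description above, while the remaining names (`LengthMask`, `NaiveMaskOp`, `scaled_dot_product_score`, `weighted_sum_mixing`, `generic_qkv_attention`), their argument orders, and the import lines are assumptions, so treat it as a sketch of the design rather than copy-paste-ready code.

```julia
using NNlib: softmax
using NeuralAttentionlib: CausalMask, LengthMask, NaiveMaskOp,
    normalized_score, masked_score, scaled_dot_product_score,
    weighted_sum_mixing, generic_qkv_attention, $

# hypothetical sizes: feature = 64, length = 10, batch = 3
q = randn(Float32, 64, 10, 3)
k = randn(Float32, 64, 10, 3)
v = randn(Float32, 64, 10, 3)

# Non-allocating masks, composed with `&`: causal structure plus per-sample lengths (padding).
mask = CausalMask() & LengthMask([10, 7, 5])

# Chain score functions with `$` (partial application, see the Terminology page):
# a softmax-normalized, masked, scaled dot-product score.
score = normalized_score $ softmax $
        masked_score $ NaiveMaskOp() $ mask $
        scaled_dot_product_score

# Provide the score chain and a mixing function to a generic attention operation.
y = generic_qkv_attention(weighted_sum_mixing, score, q, k, v)

# Swapping softmax for another normalizer (e.g. an L2-style norm) only changes one link of the chain.
l2norm(x; dims = 1) = x ./ sqrt.(sum(abs2, x; dims = dims) .+ eps(eltype(x)))
score_l2 = normalized_score $ l2norm $ masked_score $ NaiveMaskOp() $ mask $ scaled_dot_product_score
```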


# Outline

```@contents
Pages = [
@@ -18,6 +40,3 @@ Pages = [
"api.md",
]
```



12 changes: 6 additions & 6 deletions docs/src/term.md
@@ -1,17 +1,17 @@
# Terminology

Explanation of the terms and naming used in these docs.

## Prerequisite

Some terms that are useful for understanding these docs.

### 1. [PartialFunctions](https://github.com/archermarx/PartialFunctions.jl)
### 1. Partial Functions

This actually lives outside the scope of this package, but it is extremely useful for illustrating the overall design.
We'll use the `$` operation to denote partial function application
(i.e. `f $ x` is equivalent to `(arg...)->f(x, arg...)`).
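
For example (assuming `$` can be imported directly from `NeuralAttentionlib`; otherwise it can be accessed as `NeuralAttentionlib.:$`):

```julia
using NeuralAttentionlib: $

prefix = println $ "score = "   # fixes the first argument of println
prefix(0.5)                     # equivalent to println("score = ", 0.5)

add = (+) $ 1 $ 2               # arguments accumulate left to right
add(3)                          # +(1, 2, 3) == 6
```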


### 2. Feature / Length / Batch Dimension

@@ -23,7 +23,7 @@ Under the context of attention operation in deep learning, the input data can be

For example, given 3 sentences as a batch, where each sentence has 10 words and each word is represented by
a vector of 32 elements, the data will be stored in a 3-dim array of size `(32, 10, 3)`.
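
As a small sketch (hypothetical sizes; the reading of `CollapsedDimsArray(array, ni, nj)` in the comments, where the last `nj` dimensions collapse into "batch", the `ni` dimensions before them into "length", and the rest into "feature", is an interpretation rather than a quote from its docs):

```julia
using NeuralAttentionlib: CollapsedDimsArray

# 3 sentences per batch, 10 words each, 32-element word vectors:
x = randn(Float32, 32, 10, 3)        # (feature, length, batch)

# For higher-dimensional data, CollapsedDimsArray groups dimensions explicitly.
# A (32, 16, 16, 3) image batch viewed as feature = 32, length = 16 * 16, batch = 3,
# while the original 4-dimensional shape is preserved underneath:
img = randn(Float32, 32, 16, 16, 3)
ca  = CollapsedDimsArray(img, 2, 1)  # ni = 2 dims form "length", nj = 1 dim forms "batch"
size(ca)                             # expected (32, 256, 3) in the collapsed 3-dimensional view
```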

Generally speaking, *batch* is how many independent data samples you run in one function call,
usually just for performance/optimization needs. *length* is how many entries each data sample has,
like the number of words in a sentence or the number of pixels in an image. *feature* is the number of values you use to
@@ -67,7 +67,7 @@ The overall attention operation can be viewed as three mutually inclusive blocks:
The attention operation is actually a special way to "mix" (or "pick", as it is often described) the input information.
In (probably) the first [attention paper](https://arxiv.org/abs/1409.0473), attention is defined as a weighted
sum of the input sequence given a word embedding. The idea is further generalized to *QKV attention* in the first
[transformer paper](https://arxiv.org/abs/1706.03762).
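
As a toy illustration of that weighted sum (single query, no masking, no multi-head; purely illustrative and not the package's implementation):

```julia
using LinearAlgebra  # for K' * q

# K, V hold one key/value vector per input entry, with size (feature, length);
# q is a single query vector of size (feature,).
function toy_qkv_attention(q::AbstractVector, K::AbstractMatrix, V::AbstractMatrix)
    score = (K' * q) ./ sqrt(length(q))   # one score per input entry
    α = exp.(score .- maximum(score))     # numerically stable softmax
    α ./= sum(α)
    return V * α                          # weighted sum of the value vectors
end

q = randn(Float32, 8)
K = randn(Float32, 8, 10)
V = randn(Float32, 8, 10)
toy_qkv_attention(q, K, V)   # an 8-element mixed output vector
```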

### 1. Attention Score

8 changes: 4 additions & 4 deletions src/types.jl
@@ -90,7 +90,7 @@ end
p::F # dropout probability
end
Structure for holding parameters of `multihead_qkv_attention`.
Structure for holding parameters of [`multihead_qkv_attention`](@ref).
(op::MultiheadQKVAttenOp)(q, k, v, mask = nothing)
@@ -115,7 +115,7 @@ const MultiheadQKVAttenOpWithScore{F} = WithScore{MultiheadQKVAttenOp{F}}
p::F # dropout probability
end
Structure for holding parameters of `multihead_qkv_attention`.
Structure for holding parameters of [`multihead_qkv_attention`](@ref).
(op::CausalMultiheadQKVAttenOp)(q, k, v, mask = nothing)
@@ -140,7 +140,7 @@ const CausalMultiheadQKVAttenOpWithScore{F} = WithScore{CausalMultiheadQKVAttenO
p::F
end
Structure for holding parameters of `grouped_query_attention`.
Structure for holding parameters of [`grouped_query_attention`](@ref).
(op::GroupedQueryAttenOp)(q, k, v, mask = nothing)
@@ -167,7 +167,7 @@ const GroupedQueryAttenOpWithScore{F} = WithScore{GroupedQueryAttenOp{F}}
p::F
end
Structure for holding parameters of `grouped_query_attention`.
Structure for holding parameters of [`grouped_query_attention`](@ref).
(op::CausalGroupedQueryAttenOp)(q, k, v, mask = nothing)
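
For reference, these op structs are callable, as the signatures above show. A hypothetical usage sketch follows; only the `(op)(q, k, v, mask = nothing)` call form comes from the docstrings, while the `MultiheadQKVAttenOp(4)` constructor form and the import line are assumptions:

```julia
using NeuralAttentionlib: MultiheadQKVAttenOp

q = randn(Float32, 64, 10, 3)   # (feature, length, batch)
k = randn(Float32, 64, 12, 3)
v = randn(Float32, 64, 12, 3)

op = MultiheadQKVAttenOp(4)     # assumed constructor: 4 heads, default (no) dropout
y  = op(q, k, v)                # a mask, e.g. CausalMask(), could be passed as the 4th argument
```
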
14 changes: 14 additions & 0 deletions src/utils.jl
@@ -4,6 +4,14 @@ as_bool(b::StaticBool) = Bool(b)
as_char(c::Char) = c
as_char(c::StaticInt) = Char(c)

"""
PrefixedFunction(f, args::NTuple{N}) <: Function
A type representing a partially-applied version of the function `f`, with the first `N` arguments fixed to the
values `args`. In other words, `PrefixedFunction(f, args)` behaves similarly to `(xs...)->f(args..., xs...)`.
See also [`NeuralAttentionlib.:\$`](@ref).
"""
struct PrefixedFunction{F, A<:Tuple} <: Function
f::F
arg::A
@@ -17,6 +25,12 @@ Base.show(io::IO, ::MIME"text/plain", f::PrefixedFunction) = show(io, f)

@inline (f::PrefixedFunction)(args...) = f.f(f.arg..., args...)

"""
f \$ x
f \$ x \$ y \$ ...
Partially-applied function. Return a [`PrefixedFunction`](@ref).
"""
($)(f::Function, x) = PrefixedFunction(f, (x,))
($)(f::PrefixedFunction, x) = PrefixedFunction(f.f, (f.arg..., x))
($)(f::PrefixedFunction, g::PrefixedFunction) = PrefixedFunction(f.f, (f.arg..., g.f, g.arg...))
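
A brief illustration of how these definitions compose (assuming the unexported names can be imported directly):

```julia
using NeuralAttentionlib: PrefixedFunction, $

f = (+) $ 1 $ 2           # PrefixedFunction(+, (1, 2)), i.e. (xs...) -> +(1, 2, xs...)
f isa PrefixedFunction    # true
f(3)                      # 6
```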
