post: reasons to reject

p208p2002 · Jul 5, 2024 · 64cd43f
1 parent 4a5b397
commit 64cd43f
Showing 4 changed files with 122 additions and 0 deletions.
diff --git a/public/docs/reason-to-reject/cut.png b/public/docs/reason-to-reject/cut.png
diff --git a/public/docs/reason-to-reject/document.md b/public/docs/reason-to-reject/document.md
@@ -0,0 +1,113 @@
+# Reasons to Reject? Aligning Language Models with Judgments
+
+<document-info>
+- tags: #論文筆記
+- date: 2024/07/05
+</document-info>
+
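+This study is the first to systematically explore aligning LLMs with language feedback (**judgments**), and it proposes the Contrastive Unlikelihood Training (CUT) framework.
+
+Experiments show that with only 1,317 training examples, CUT surpasses the 175B DaVinci003. Further analysis suggests that **judgments** have greater potential than RL rewards for LLM alignment.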
+
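+## Problem Setting
+
+Suppose we have a set of **instruction**-**response**-**judgment** triples $(x, y, j)$, where the instruction $x = [x_1, \ldots, x_M]$, the **response** $y = [y_1, \ldots, y_N]$, and the **judgment** $j = [j_1, \ldots, j_Q]$ are token sequences of length $M$, $N$, and $Q$, respectively. The **response** may be flawed, or it may be considered fully satisfactory. The **judgment** analyzes the strengths and weaknesses of the **response** and can be drafted by humans or by AI models. The goal of aligning LLMs with **judgments** is to have the LLM keep the appropriate behaviors mentioned among the strengths and, more importantly, address the weaknesses to prevent future misbehavior.
+
+### Possible Solutions
+#### Forward Prediction
+Predict the **response** first and then its **judgment**.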
+
+$$
+\mathcal{L} = -\frac{1}{N}\sum_t \log p(y_t \mid y_{<t},x)-\frac{1}{Q} \sum_t \log p(j_t \mid j_{<t},y,x)
+$$
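+> In Forward Prediction, learning to generate judgments does not necessarily translate into better response generation, because the response is generated before the judgment.
+#### Imitation learning from language feedback (ILF)
+Ask the LLM to refine its **response** according to the **judgment**, which yields an improved **response** $\hat{y}$.
+
+$$\hat{y} = \text{LLM}(x,y,j)$$
+
+There are two ways to learn from the improved **response** $\hat{y}$: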
+
+- **ILF-MLE**
+$$
+\mathcal{L}_i^{mle} = -\frac{1}{N} \sum_t \log p(\hat{y}_t \mid \hat{y}_{<t},x)
+$$
+
+- **ILF-DPO**
+
+$$
+\mathcal{L}_i^{dpo} = \text{DPO}(x,y,\hat{y})
+$$
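+
+For reference, if $\text{DPO}$ here denotes the standard Direct Preference Optimization objective with $\hat{y}$ as the preferred response and $y$ as the rejected one, it expands as follows. The symbols $\pi_\theta$ (the model being trained), $\pi_{ref}$ (a frozen reference model), $\beta$ (a scaling coefficient), and $\sigma$ (the sigmoid function) are assumptions of this sketch and do not appear elsewhere in these notes:
+
+$$
+\text{DPO}(x,y,\hat{y}) = -\log \sigma\left(\beta \log \frac{\pi_\theta(\hat{y} \mid x)}{\pi_{ref}(\hat{y} \mid x)} - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}\right)
+$$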
+
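+> ILF uses judgments only indirectly, which limits its ability to discover and correct the weaknesses pointed out in the judgments.
+
+#### Hindsight
+The LLM generates the response $y$ conditioned on the sequence $[x, j]$.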
+$$
+\mathcal{L}_h = -\frac{1}{N}\sum_t \log p (y_t \mid y_{<t},x,j)
+$$
+
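+> Hindsight uses unsatisfactory responses as maximum likelihood targets, which inevitably raises the risk of generating unsatisfactory responses.
+
+## Contrastive Unlikelihood Training
+Contrastive Unlikelihood Training (CUT) is a fine-tuning framework for aligning LLMs with **judgments**. Its core idea is to contrast response generation under different conditions in order to identify which behaviors the LLM should keep and which specific content needs to be adjusted.
+
+Appropriate content is handled with maximum likelihood estimation (MLE), while inappropriate content is handled with Unlikelihood Training (UT).
+
+### Incorporating Judgments into Alignment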
+
+![image](./cut.png)
+
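+If the response to an instruction meets human expectations ($x \rightarrow y$), we call it **aligned**. Otherwise, a **judgment** points out the errors in the **response**.
+
+Suppose the task is to generate a **response** that satisfies the **judgment**; we write this as $[x, j] \rightarrow y$.
+
+Based on this idea, we construct three types of alignment data: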
+
+
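+- Align-P: the pair of $x$ and $y$ is satisfactory.
+- Align-N: the LLM made some mistakes when generating the **response**, so a **judgment** $j$ is needed to point them out.
+- Misalign: the authentic negative **judgment** ($j^-$) in Align-N is replaced with a fake positive **judgment** ($j^+$).
+
+### Learning from Contrasts
+
+#### Align-N vs. Misalign
+The set $U$ records the positions $t$ whose tokens respond more strongly to $j^-$, that is, whose probability conditioned on $j^-$ exceeds $\lambda$ times their probability conditioned on $j^+$. These tokens are considered inappropriate: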
+$$
+U = \{ t \mid p(y_t \mid y_{<t}, x, j^{-}) - \lambda \cdot p(y_t \mid y_{<t}, x, j^{+}) > 0 \}
+$$
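+
+As a small illustration of this selection rule, consider a three-token response. All probabilities below and the choice $\lambda = 1.2$ are made-up values for this sketch, not numbers from the paper:
+
+$$
+\begin{array}{l|ccc}
+t & 1 & 2 & 3 \\ \hline
+p(y_t \mid y_{<t}, x, j^{-}) & 0.60 & 0.50 & 0.30 \\
+p(y_t \mid y_{<t}, x, j^{+}) & 0.55 & 0.20 & 0.30 \\
+p(y_t \mid y_{<t}, x, j^{-}) - \lambda \cdot p(y_t \mid y_{<t}, x, j^{+}) & -0.06 & 0.26 & -0.06
+\end{array}
+$$
+
+Only $t = 2$ gives a positive difference, so $U = \{2\}$ in this example. Token 1 is slightly more likely under $j^-$ than under $j^+$, but not by the margin that $\lambda$ requires, so it is not flagged.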
+
+
+We want the appropriate tokens ($t \notin U$) to have a higher likelihood, while the inappropriate tokens ($t \in U$) are penalized:
+$$
+\begin{array}{l}
+\mathcal{L}_1 = -\frac{1}{N}(\sum_{t \notin U} \log p(y_t \mid y_{<t},x) \\
+\hspace{1.3cm} +\sum_{t \in U} \alpha p(y_t \mid y_{<t},x,j^-)^\gamma \log(1-p(y_t \mid y_{<t},x))) 
+\end{array}
+$$
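+
+To make the two terms concrete, here is a worked evaluation under assumed values (none of them from the paper): $N = 3$, $U = \{2\}$, $\alpha = 1$, $\gamma = 1$, $p(y_t \mid y_{<t}, x) = (0.5, 0.4, 0.5)$ for $t = 1, 2, 3$, and $p(y_2 \mid y_{<2}, x, j^-) = 0.5$, using natural logarithms:
+
+$$
+\begin{array}{l}
+\sum_{t \notin U} \log p(y_t \mid y_{<t},x) = \log 0.5 + \log 0.5 \approx -1.386 \\
+\sum_{t \in U} \alpha p(y_t \mid y_{<t},x,j^-)^\gamma \log(1-p(y_t \mid y_{<t},x)) = 0.5 \cdot \log(1 - 0.4) \approx -0.255 \\
+\mathcal{L}_1 \approx -\frac{1}{3}(-1.386 - 0.255) \approx 0.547
+\end{array}
+$$
+
+In this sketch, lowering $p(y_2 \mid y_{<2}, x)$ moves $\log(1 - p)$ toward $0$ and so reduces the loss, which is how the unlikelihood term pushes probability away from the flagged token.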
+
+#### Align-P vs. Align-N
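+Align-P and Align-N share the same form $[x,j]\rightarrow y$, but when only the instruction is considered ($x \rightarrow y$), only Align-P holds.
+
+First, the model needs to learn the direct $x\rightarrow y$ relation. Then, if the semantic relation $x\rightarrow y$ does not hold, a **judgment** has to step in and describe the type of error in $y$: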
+
+$$
+\begin{array}{l}
+\mathcal{L}_2 = - \frac{\mathbb{1}(x \rightarrow y)}{N} \sum_t \log p(y_t \mid y_{<t},x) \\
+\hspace{1.3cm}-\frac{1-\mathbb{1}(x \rightarrow y)}{N} \sum_t \log p(y_t \mid y_{<t},j,x)
+\end{array}
+$$
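+
+Read as a case split, the indicator $\mathbb{1}(x \rightarrow y)$ selects exactly one of the two MLE terms for each example. The labels in parentheses are this note's reading of which data type falls into each case:
+
+$$
+\mathcal{L}_2 =
+\begin{cases}
+-\frac{1}{N} \sum_t \log p(y_t \mid y_{<t},x) & \text{if } x \rightarrow y \text{ holds (Align-P)} \\
+-\frac{1}{N} \sum_t \log p(y_t \mid y_{<t},j,x) & \text{otherwise (Align-N)}
+\end{cases}
+$$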
+
+Finally, we combine the two losses: $\mathcal{L}_{cut} = \mathcal{L}_1 + \mathcal{L}_2$
+
+
+## Experiments
+
+#### Instruction Following
+![image](./exp1.png)
+
+> For AlpacaEval, we report the winning rate of the responses generated by our models against DaVinci003 using GPT4 as the judge.
+
+#### Applying CUT to Different Models
+![image](./exp2.png)
+
diff --git a/public/docs/reason-to-reject/exp1.png b/public/docs/reason-to-reject/exp1.png
diff --git a/public/docs/reason-to-reject/exp2.png b/public/docs/reason-to-reject/exp2.png