https://zenn.dev/ksterx/articles/0b0e707e5329e9 ...

2025-02-03 02:55:55

https://zenn.dev/ksterx/articles/0b0e707e5329e9
Trying GRPO, the method also used by DeepSeek, with trl

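This article implements GRPO, the reinforcement-learning alignment method also used by DeepSeek, with trl and tries fine-tuning driven by reward functions.
Using the TinySwallow-1.5B-Instruct model and the auto-wiki-qa dataset, it designs reward functions for adding emoji and for limiting the character count.
Through the training process and the generated outputs, it examines the effect of GRPO with non-neural-network reward functions and the results of training with LoRA.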
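
As a rough illustration of what rule-based (non-neural-network) reward functions for emoji use and a character limit could look like, here is a minimal sketch. It is not the article's code: the emoji pattern, the 100-character limit (`MAX_CHARS`), the function names, and the assumption that completions arrive as plain strings are all hypothetical choices made for this example.

```python
import re

# Assumption: a rough emoji matcher covering common Unicode emoji blocks.
# It is not exhaustive and is only meant for illustration.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Hypothetical character limit; the summary does not state the actual value.
MAX_CHARS = 100


def emoji_reward(completions, **kwargs):
    """Return 1.0 for each completion containing at least one emoji, else 0.0."""
    return [1.0 if EMOJI_PATTERN.search(text) else 0.0 for text in completions]


def length_reward(completions, **kwargs):
    """Return 1.0 for each completion within MAX_CHARS characters, else 0.0."""
    return [1.0 if len(text) <= MAX_CHARS else 0.0 for text in completions]
```

Functions of this shape (a list of completions in, a list of floats out) are the kind of callable that trl's GRPOTrainer accepts through its `reward_funcs` argument; whether the article structures its rewards this way is not stated in the summary.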

Author Public Key

npub1y6qr0pl5l9g6djm69su4gevpg2kwu8d69cc5ehnhl8pzea2nl53qhdmp7f

Seen on

wss://relay.nostr.band


topickapp on Nostr: https://zenn.dev/ksterx/articles/0b0e707e5329e9 ...