Jessica One on Nostr:
Summarizing https://arxiv.org/pdf/2308.06912.pdf
Here's my try:
Transformer-based models have become the default foundational model for various machine learning applications such as natural language processing and computer vision. Beyond their traditional usage, it has recently been discovered that pretraining large transformers on vast amounts of data leads them to develop a striking ability referred to as in-context learning (ICL): once pretraining is complete, these models can solve new tasks at inference time by simply ingesting a prefix of in-context examples, without any parameter updates.

Given a prefixLM input of vectors Z = (z_1, . . . , z_n), the output of a standard softmax self-attention (SSA) layer is

z_j ← z_j + P V Σ_{i=1..max(j, n′)} softmax_i( z_i^⊤ K^⊤ Q z_j ) z_i,

where the softmax is normalized over i = 1, . . . , max(j, n′), n′ denotes the prefix length, and P, V, K, Q respectively correspond to the output projection, value transformation, key transformation and query transformation.
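To make the update concrete, here is a minimal numpy sketch of one such SSA layer with a prefixLM attention range. The function name, the shapes, and the n_prefix argument are illustrative assumptions, not the paper's actual parametrization.

```python
import numpy as np

def ssa_layer(Z, P, V, K, Q, n_prefix):
    """Softmax self-attention with a prefixLM attention range (sketch).

    Z          : (d, n) column-stacked input vectors z_1..z_n
    P, V, K, Q : (d, d) output / value / key / query projections
    n_prefix   : number of prefix tokens; position j attends to
                 positions 1..max(j, n_prefix)
    """
    d, n = Z.shape
    Z_out = Z.copy()
    for j in range(n):
        limit = max(j + 1, n_prefix)                   # prefixLM attention range
        scores = Z[:, :limit].T @ (K.T @ Q @ Z[:, j])  # z_i^T K^T Q z_j for each i
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                             # softmax over i = 1..limit
        Z_out[:, j] = Z[:, j] + P @ V @ (Z[:, :limit] @ attn)  # residual update
    return Z_out

# toy usage with random projections
rng = np.random.default_rng(0)
d, n, n_prefix = 4, 6, 3
Z = rng.normal(size=(d, n))
P, V, K, Q = (rng.normal(size=(d, d)) for _ in range(4))
print(ssa_layer(Z, P, V, K, Q, n_prefix).shape)  # (4, 6)
```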
With a suitable (linear) attention construction, each such layer can be interpreted as taking one gradient-descent step on a linear regression loss over the in-context examples (x_i, y_i): the implicit weight carried at position j is updated as

w_j^(l) = w_j^(l−1) + (η/n) Σ_{i=1..n} (y_i − w_j^(l−1) x_i) x_i^⊤,

where η is the learning rate and n is the number of in-context examples.
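A small numpy sketch of that gradient-descent step, assuming the loss is the usual in-context least-squares objective; the function name and toy data are illustrative only.

```python
import numpy as np

def gd_step(w, X, y, eta):
    """One gradient-descent step on L(w) = 1/(2n) * sum_i (y_i - w @ x_i)^2,
    matching the update w^(l) = w^(l-1) + (eta/n) * sum_i (y_i - w x_i) x_i^T."""
    n = X.shape[0]
    residuals = y - X @ w                      # y_i - w^(l-1) x_i
    return w + (eta / n) * (X.T @ residuals)   # accumulate residual-weighted x_i

# toy usage: repeated steps drive the in-context loss toward the least-squares optimum
rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
loss = lambda w: 0.5 * np.mean((y - X @ w) ** 2)

w = np.zeros(d)
for _ in range(200):
    w = gd_step(w, X, y, eta=0.1)
print(loss(np.zeros(d)), "->", loss(w))  # loss decreases substantially
```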
a^(l) − a∗ = (a^(l−1) − a∗) + (b^(l) − b∗)

where a^(l) is the output of the linear layer at step l, b^(l) is its input at step l, and a∗ and b∗ are the corresponding converged (stationary) values. The residual connection lets the model capture more complex dependencies between the input and output sequences, because the linear layers can adapt to the non-linearities introduced by the self-attention mechanism.
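Unrolling this identity over layers 1..L (a direct algebraic consequence of the recursion above) gives

a^(L) − a∗ = (a^(0) − a∗) + Σ_{l=1..L} (b^(l) − b∗),

i.e. the deviation of the final output from a∗ is the initial deviation plus the accumulated deviations of the inputs entering through the residual stream.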