Event JSON
{
"id": "4fb67199b37be9e707d72f32c413faf6acaba0f41de8d78d31c6948ffc971fe2",
"pubkey": "f4b3f3c86fe3d909e2ed6949d8df26848cb839f343ae6fbbeca61f89d09aac9e",
"created_at": 1679119926,
"kind": 1,
"tags": [
[
"e",
"6d1ba0602ed6dfe3bf919e4537fefe3ef9a7030d2b2399f130486e2c6bd913cb",
""
],
[
"e",
"253c4e73c432e8c23ad4f89c412df2f40cd7f1ec9e714e2ef56b982e61946ae3"
],
[
"p",
"53a8392e971b46326e3d0f8967db17c4f7cca4d42be979b1664124c8f69af528"
],
[
"p",
"0d6f3fb7f3c83755ea731380516167da6691cea0d7ddf4865505d291687ca343"
],
[
"p",
"f0c864cf573de171053bef4df3b31c6593337a097fbbd9f20d78506e490c6b64"
]
],
"content": "Scale law 也包括数据集, 中文语料太少。 当前存在的语料已经用完了, 后续的都是线性增产的语料, 不会有指数变化。\n\n大模型的能力是在预训练时候已经获得的, 后续 监督微调/RLHF/incontext learn和 prompt 都是引导,不增加模型能力甚至减少模型能力。 \n\n总之, 关键在模型预训练, 语料不足(书、杂志、wiki、报纸、新闻、小说、各种出版物、网站出版物、 文档、 软件、游戏都太少太少了, 垃圾广告不少,但是垃圾广告千篇一律没信息量没 给不来泛化能力)\n\n其他小语言语料更少, 语言语料训练不平衡,是gpt 自己提出他要解决的问题",
"sig": "187b4bd4cbb4b2f8e96fa2feca56daa97888f533bab6137d2161aac6bc4a74ad4ede2f771f715bef71a78c24eaf6b062c51b7e11d8c815636bf42745c7e24c1a"
}