ResearchBuzz on Nostr:
#AI #MachineLearning #OpenAccess #LLM #HuggingFace
"Many have claimed that training large language models requires copyrighted data, making truly open AI development impossible. Today, Pleias is proving otherwise with the release of Common Corpus...—the largest fully open multilingual dataset for training LLMs, containing over 2 trillion tokens of permissibly licensed content with provenance information (2,003,039,184,047 tokens)."
https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open