2024-11-14 12:58:30

#AI #MachineLearning #OpenAccess #LLM #HuggingFace

"Many have claimed that training large language models requires copyrighted data, making truly open AI development impossible. Today, Pleias is proving otherwise with the release of Common Corpus...—the largest fully open multilingual dataset for training LLMs, containing over 2 trillion tokens of permissibly licensed content with provenance information (2,003,039,184,047 tokens)."

https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open

Author Public Key

npub1edz6ysaqe6ratc29kzpqvgp33twzr0xefqv9ka7mdxjyacfy52mq42rj6j

ResearchBuzz on Nostr