halalmoney on Nostr: “The company could have saved itself a giant headache by building in robust data ...
“The company could have saved itself a giant headache by building in robust data record-keeping from the start, she says. Instead, it is common in the AI industry to build data sets for AI models by scraping the web indiscriminately and then outsourcing the work of removing duplicates or irrelevant data points, filtering unwanted things, and fixing typos. These methods, and the sheer size of the data set, mean tech companies tend to have a very limited understanding of what has gone into training their models.
Tech companies don’t document how they collect or annotate AI training data and don’t even tend to know what’s in the data set, says Nithya Sambasivan, a former research scientist at Google and an entrepreneur who has studied AI’s data practices.”
https://www.technologyreview.com/2023/04/19/1071789/openais-hunger-for-data-is-coming-back-to-bite-it/
Tech companies don’t document how they collect or annotate AI training data and don’t even tend to know what’s in the data set, says Nithya Sambasivan, a former research scientist at Google and an entrepreneur who has studied AI’s data practices.”
https://www.technologyreview.com/2023/04/19/1071789/openais-hunger-for-data-is-coming-back-to-bite-it/