"Separately, the authors also tested several contemporaneous large language models ...

Miguel Afonso Caetano /

npub1pwu…95z7

2025-02-11 12:41:59

"Separately, the authors also tested several contemporaneous large language models (GPT-4, GPT-3.5 and Llama 3 8B). GPT-4's edit summaries in particular were rated as significantly better than those provided by the human Wikipedia editors who originally made the edits in the sample – both using an automated scoring method based on semantic similarity, and in a quality ranking by human raters (where "to ensure high-quality results, instead of relying on the crowdsourcing platforms [like Mechanical Turk, frequently used in similar studies], we recruited 3 MSc students to perform the annotation").

This outcome joins some other recent research indicating that modern LLMs can match or even surpass the average Wikipedia editor in certain tasks (see e.g. our coverage: "'Wikicrow' AI less 'prone to reasoning errors (or hallucinations)' than human Wikipedia editors when writing gene articles").

A substantial part of the paper is devoted to showing that this particular task (generating good edit summaries) is both important and in need of improvements, motivating the use of AI to "overcome this problem and help editors write useful edit summaries":"

https://meta.wikimedia.org/wiki/Research:Newsletter/2025/January

#Wikipedia #AI #GenerativeAI #LLMs #ChatBots #ChatGPT #GPT4

Author Public Key

npub1pwuvltfvfme0d987k6rs3an6jnv9k2w32zqdlqt5hpy9u3cmv6ps2r95z7
