Using artificial intelligence for Wikimedia projects
Various projects seek to improve Wikipedia and other Wikimedia projects using artificial intelligence tools.
ORES
The Objective Revision Evaluation Service (ORES) is an artificial intelligence service for grading the quality of Wikipedia edits.[4][5] The Wikimedia Foundation presented the ORES project in November 2015.[6]
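ORES exposes its scores through a public web API. The following sketch queries the "damaging" model for a single revision; it assumes the historical ores.wikimedia.org v3 endpoint and Python's requests library, and the revision ID shown is a placeholder.

import requests

# Query ORES for the probability that a given revision is damaging.
# Endpoint and response shape follow the historical ORES v3 API.
ORES_URL = "https://ores.wikimedia.org/v3/scores/{wiki}/{revid}/{model}"

def score_revision(revid: int, wiki: str = "enwiki", model: str = "damaging") -> float:
    response = requests.get(ORES_URL.format(wiki=wiki, revid=revid, model=model))
    response.raise_for_status()
    # The response nests the result as wiki -> scores -> revid -> model -> score.
    score = response.json()[wiki]["scores"][str(revid)][model]["score"]
    return score["probability"]["true"]

print(score_revision(123456))  # e.g. 0.03: low probability that the edit is damaging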
The best-known bot that fights vandalism is ClueBot NG. The bot was created by Wikipedia users Christopher Breneman and Naomi Amethyst in 2010 (succeeding the original ClueBot created in 2007; NG stands for Next Generation)[7] and uses machine learning and Bayesian statistics to determine whether an edit is vandalism.[8][9]
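ClueBot NG's actual classifier is more elaborate than any short example, but the Bayesian text-classification idea behind it can be sketched as follows; the labelled edits and bag-of-words features below are hypothetical placeholders, not the bot's real training data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: 1 = vandalism, 0 = constructive edit.
edits = [
    "POOP POOP POOP",
    "Fixed citation formatting in the lead section",
    "u r all losers lol",
    "Added 2020 census figures with a source",
]
labels = [1, 0, 1, 0]

# Bag-of-words features; the real bot also weighs signals such as
# user history and edit size rather than raw text alone.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(edits)

classifier = MultinomialNB().fit(X, labels)
new_edit = vectorizer.transform(["lol this page is garbage"])
print(classifier.predict_proba(new_edit))  # [P(constructive), P(vandalism)]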
Detox
Detox was a project by Google, in collaboration with the Wikimedia Foundation, to research methods for addressing users who post unkind comments in Wikimedia community discussions.[10] As part of the Detox project, the Wikimedia Foundation and Google's Jigsaw unit used artificial intelligence for basic research and to develop technical solutions to the problem. In October 2016 the two organizations published "Ex Machina: Personal Attacks Seen at Scale", describing their findings.[11][12] Various popular media outlets reported on the publication of this paper and described the social context of the research.[13][14][15]
Bias reduction
In August 2018, a company called Primer reported attempting to use artificial intelligence to create Wikipedia articles about women as a way to address gender bias on Wikipedia.[16][17]
Using Wikimedia projects for artificial intelligence
Wikipedia datasets are widely used for training AI models.[26]
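Such datasets are commonly consumed through standard tooling. As one sketch, a pre-processed English Wikipedia dump can be loaded with the Hugging Face datasets library; the "wikimedia/wikipedia" dataset name and the "20231101.en" snapshot are assumptions, since available snapshots change as new dumps are published.

from datasets import load_dataset

# Load a pre-processed English Wikipedia snapshot from the Hugging Face Hub.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

print(wiki[0]["title"])       # article title
print(wiki[0]["text"][:200])  # first 200 characters of the article text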
Content in Wikimedia projects is useful as a dataset for advancing artificial intelligence research and applications. For instance, the development of Google's Perspective API, which identifies toxic comments in online forums, used a dataset of hundreds of thousands of Wikipedia talk page comments with human-labelled toxicity levels.[27] Subsets of the Wikipedia corpus are considered the largest well-curated datasets available for AI training.[19][20]
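Perspective itself is exposed as a REST endpoint. A minimal sketch of a toxicity query follows; the API key is a placeholder, and the request and response shapes reflect the publicly documented commentanalyzer API.

import requests

API_KEY = "YOUR_API_KEY"  # placeholder; keys are issued through Google Cloud
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "You are a wonderful editor!"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(URL, json=payload)
response.raise_for_status()

# summaryScore.value is a probability-like toxicity estimate in [0, 1].
print(response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"])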
A 2012 paper reported that more than 1,000 academic articles, including those using artificial intelligence, examine Wikipedia, reuse information from Wikipedia, use technical extensions linked to Wikipedia, or research communication about Wikipedia.[28] A 2017 paper described Wikipedia as the mother lode for human-generated text available for machine learning.[29]
A 2016 research project called "One Hundred Year Study on Artificial Intelligence" named Wikipedia as a key early project for understanding the interplay between artificial intelligence applications and human engagement.[30]
There are concerns about the lack of attribution to Wikipedia articles in large language models like ChatGPT.[19][31] While Wikipedia's licensing policy lets anyone use its text, including in modified form, it requires that credit be given, implying that using its content in AI-generated answers without clarifying the sourcing may violate its terms of use.[19]
Johnson, Isaac; Lescak, Emily (2022). "Considerations for Multilingual Wikipedia Research". arXiv:2204.02483 [cs.CY].
Mamadouh, Virginie (2020). "Wikipedia: Mirror, Microcosm, and Motor of Global Linguistic Diversity". Handbook of the Changing World Language Map. Springer International Publishing. pp. 3773–3799. doi:10.1007/978-3-030-02438-3_200. ISBN 978-3-030-02438-3. Some versions have expanded dramatically using machine translation through the work of bots or web robots generating articles by translating them automatically from the other Wikipedias, often the English Wikipedia. […] In any event, the English Wikipedia is different from the others because it clearly serves a global audience, while other versions serve more localized audience, even if the Portuguese, Spanish, and French Wikipedias also serves a public spread across different continents
Khincha, Siddharth; Jain, Chelsi; Gupta, Vivek; Kataria, Tushar; Zhang, Shuo (2023). "InfoSync: Information Synchronization across Multilingual Semi-structured Tables". arXiv:2307.03313 [cs.CL].
Villalobos, Pablo; Ho, Anson; Sevilla, Jaime; Besiroglu, Tamay; Heim, Lennart; Hobbhahn, Marius (2022). "Will we run out of data? Limits of LLM scaling based on human-generated data". arXiv:2211.04325 [cs.LG].
"Wikipedia Built the Internet's Brain. Now Its Leaders Want Credit". Observer. 28 March 2025. Retrieved 2 April 2025. Attributions, however, remain a sticking point. Citations not only give credit but also help Wikipedia attract new editors and donors. "If our content is getting sucked into an LLM without attribution or links, that's a real problem for us in the short term,"