Wikipedia Is Developing AI Training Data in Response to Bot Overload
On Wednesday, the Wikimedia Foundation announced a partnership with Kaggle, the Google-owned data science platform, to release a version of Wikipedia optimized for training AI models. The initial release covers English and French and offers engineers a stripped-down version of Wikipedia's raw text, with references and Markdown code removed.
Wikipedia is a nonprofit, volunteer-run platform funded primarily by donations, and it does not own the content it hosts. That has not stopped it from sharing its immense repository of knowledge far and wide. Kiwix, for instance, is an offline version of Wikipedia that has made its way into North Korea, where it serves as a covert source of information.
Bots scraping Wikipedia for AI training content have sent traffic soaring, and costs have escalated along with it. The Foundation announced earlier this month that bandwidth consumption has grown by 50% since January 2024. To ease the load, it plans to release a standardized, JSON-formatted version of Wikipedia articles, in the hope that this will discourage AI developers from flooding the website with requests.
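To illustrate why a structured release is easier on everyone than scraping, here is a minimal Python sketch of reading article records from a JSON Lines file instead of fetching and parsing live pages. The sample records and the field names (`name`, `abstract`, `sections`) are assumptions made up for this example, not the schema of the published dataset.

```python
import json

# Hypothetical sample: two article records, one JSON object per line.
# The field names ("name", "abstract", "sections") are illustrative
# assumptions, not the schema of the actual Kaggle dataset.
SAMPLE_JSONL = """\
{"name": "Ada Lovelace", "abstract": "English mathematician.", "sections": [{"title": "Early life", "text": "Born in London."}]}
{"name": "Kaggle", "abstract": "Data science platform.", "sections": []}
"""


def iter_articles(lines):
    """Yield one parsed article dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def plain_text(article):
    """Join the abstract and section texts into a single string."""
    parts = [article.get("abstract", "")]
    parts += [section.get("text", "") for section in article.get("sections", [])]
    return "\n".join(part for part in parts if part)


articles = list(iter_articles(SAMPLE_JSONL.splitlines()))
texts = {article["name"]: plain_text(article) for article in articles}
```

With a local file, the same `iter_articles` function could be fed an open file handle, so the text is read once from disk rather than requested page by page from Wikipedia's servers.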
Brenda Flynn, Kaggle's partnerships lead, told The Verge that Kaggle is excited to host the Wikimedia Foundation's data and to help keep it accessible, available, and useful for all.
The tech industry's dismissive attitude toward content creators is evident in how little value it places on individual creative work. Some believe that content should be free and that taking it to train AI models constitutes fair use because of the transformative nature of AI. But someone has to create the content in the first place, and that is not cheap. AI startups have routinely sidestepped accepted norms of web-crawling etiquette, ignoring a site's stated wishes not to be swamped with requests.
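The crawling etiquette in question usually means honoring a site's robots.txt file. As a rough sketch of what a well-behaved crawler checks before requesting a page, the Python standard library's `urllib.robotparser` can be used as follows. The robots.txt content, the bot names, and the example.org URL are hypothetical and not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.org/wiki/Page"

# The site asks this (hypothetical) AI crawler to stay away entirely.
ai_bot_ok = parser.can_fetch("ExampleAIBot", page)

# Other crawlers may fetch the page but should wait between requests.
polite_bot_ok = parser.can_fetch("PoliteBot", page)
delay_seconds = parser.crawl_delay("PoliteBot")
```

A crawler that respects these answers skips disallowed pages and pauses for the requested delay between fetches, which is the behavior the norms described above are meant to ensure.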
Training the language models behind conversational AI takes enormous amounts of data. Over the past few years, training datasets have become one of the most prized resources in the AI gold rush. It is no secret that the leading models are trained on copyrighted works, and several AI companies are still fighting in court over these issues.
Some Wikipedia contributors may be unhappy about their content being used for AI training, for various reasons. All contributions to Wikipedia are licensed under Creative Commons Attribution-ShareAlike licenses, which let anyone share, adapt, and build upon the content as long as they credit the original author and release their own works under the same terms.
The dataset on Kaggle is free for developers to download. According to Gizmodo, the Wikimedia Foundation draws the data from a "Structured Content" beta program within Wikimedia Enterprise, a paid offering for high-volume users. The Foundation says AI companies and other users are still expected to honor Wikipedia's attribution and licensing terms.
- The Wikimedia Foundation, in collaboration with Google-owned Kaggle, is releasing a version of Wikipedia tailored for AI model training, starting with English and French.
- The tech industry's approach to content creators is questionable, with AI startups disregarding accepted norms of web-crawling etiquette.
- AI companies, including those that train on Wikipedia's data, are expected to comply with the Creative Commons Attribution-ShareAlike licenses that govern all contributions to Wikipedia.
- Kaggle welcomes the partnership with the Wikimedia Foundation as an opportunity to keep the data accessible, available, and useful.
- Over the past few years, large amounts of data have become crucial for training conversational language models, making training datasets one of the most prized resources in the AI gold rush.
- AI developers have been flooding Wikipedia with requests for data, driving a significant increase in bandwidth consumption. The Foundation aims to discourage this by releasing a standardized, JSON-formatted version of Wikipedia articles.