
Content Providers Grant Access to OpenAI for Leading Position in ChatGPT's Responses

AI companies push content providers to adopt their platforms or risk falling behind.

Photo: Chris Ratcliffe/Bloomberg (Getty Images)

The Financial Times Seals a Deal with OpenAI, Joining the Bandwagon of Content Licensing for ChatGPT

In a move that's becoming increasingly common, The Financial Times has inked a deal with OpenAI to license its journalism for training and informing ChatGPT's models. The deal follows similar agreements signed by Axel Springer and the Associated Press. Apparently, OpenAI is willing to shell out millions for the exclusive rights to use content. However, ChatGPT was initially trained on a plethora of web-scraped content that OpenAI didn't pay for, which leaves us wondering: why pay for some datasets and not others?

OpenAI's licensing deals seem to send a clear message: "We're gonna use your content anyway, so sign a deal with us or get left behind." The primary benefit of such an arrangement appears to be a coveted spot in ChatGPT's answers. Some publishers might also be keen to secure a relationship with the next big information distribution channel before it takes over. But here's the kicker—OpenAI is already using a considerable amount of publishers' content.

OpenAI claims to train its AI models in part on "publicly available data." But what the heck is that supposed to mean? The term assumes that any internet content that's free to read can also be used to build ChatGPT. Gizmodo falls into this category: our website appears over 34,000 times in GPT-2's WebText dataset, the last dataset OpenAI disclosed using to train an AI model.

But here's the catch: our website, like many others, relies primarily on advertisements for revenue. If readers can get our content through ChatGPT instead, that's a direct threat to our business model. The New York Times, which appears significantly more often in GPT-2's WebText dataset, has sued OpenAI for copyright infringement over this matter.

In a press release, the Financial Times Group CEO, John Ridding, stated this deal would "broaden the reach" of their work, offering "early insights into how content is surfaced through AI."

Matthew Butterick, a lawyer representing Sarah Silverman and other book authors suing OpenAI, summed it up when he said in an interview with Gizmodo, "The thing about AI is it's not really artificial intelligence. It's human intelligence which has been harvested from one place, divorced from its creators, then this big tech company puts a price tag on it and sells it to someone else."

Butterick is litigating six copyright lawsuits against AI companies. As a writer, coder, and designer himself, he sees how AI could undermine these industries. Generally speaking, his cases revolve around the concern that AI uses the work of creators while simultaneously threatening their livelihoods.

OpenAI's licensing deals have raised quite a few eyebrows about the content ChatGPT uses for free. Tech firms have argued that generative AI is fair use of copyrighted works because it transforms them into something new. The AI community also points to Google Search, which caches copyrighted content to create a helpful, information-finding tool. AI chatbots have started including hyperlinks, much like Google. Ultimately, a court will have to decide if generative AI is indeed fair use.

Publishers aren't the only ones caught up in this content tug-of-war. The New York Times recently reported that OpenAI trained GPT-4 on over one million hours of transcribed YouTube videos. Just days before that report, YouTube's CEO said that using the platform's videos for AI training would be a "clear violation" of its policies.

OpenAI's content licensing deals muddy the waters of this discussion. The company seems to be using internet content for free while also paying others for their work. While OpenAI is one of the culprits, other tech companies, such as Apple, have been more proactive about paying for all their training data. Adobe reportedly paid $3 per minute of video to train its AI video generator.

But is even a one-time payment for obtaining AI training data enough? We're talking about a tool that could potentially upend the media industry for writers, audio and video producers, and more. So, signing a deal with OpenAI could guarantee you a good spot in ChatGPT's results, but it seems like the AI chatbot may have been using your content anyway. For now, AI companies seem to be keen to use everything on the internet and deal with the legal implications later.

Insight: Definition of Publicly Available Data

Publicly available data refers to information that the general public can access without restrictions. Examples include data found on the internet, in public databases, or through open-access resources. The term generally excludes personal or sensitive information protected by privacy laws, unless that information has been deidentified or aggregated so that it cannot be linked to specific individuals.

  1. The Financial Times Group CEO, John Ridding, indicated that the deal with OpenAI will "broaden the reach" of their work, offering "early insights into how content is surfaced through AI."
  2. Matthew Butterick, a lawyer representing authors suing AI companies, stated that AI is not artificial intelligence but rather "human intelligence which has been harvested from one place, divorced from its creators, then this big tech company puts a price tag on it and sells it to someone else."
  3. OpenAI's content licensing deals have raised questions about the content used by ChatGPT, with some tech firms arguing that generative AI is fair use of copyrighted works, while others, like YouTube, consider using their content for AI training a "clear violation" of their policies.
  4. The term "publicly available data" is often used to describe information easily accessible to the public without restrictions, but it seems to be a gray area in AI training, with companies using internet content for free while also paying others for their work.
