Human moderators outperform AI in content moderation, but at roughly 40 times the cost
In a study titled "AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety," researchers found that multimodal large language models (MLLMs) are a more cost-effective solution for brand safety tasks than human moderators.
Human moderators achieved higher accuracy, especially in nuanced or context-heavy cases, but at a far steeper price: human content moderation is approximately 40 times more expensive than the most cost-efficient machine-learning labor, represented by MLLMs such as Gemini and GPT.
The study, which will be presented at the upcoming Computer Vision in Advertising and Marketing (CVAM) workshop at the 2025 International Conference on Computer Vision, evaluated six MLLMs and human reviewers on a dataset of 1,500 videos categorized into Drugs, Alcohol and Tobacco (DAT); Death, Injury and Military Conflict (DIMC); and Kid's Content.
Model performance was scored on three standard machine-learning metrics: precision (the fraction of flagged videos that truly violate policy), recall (the fraction of violating videos that are caught), and F1, the harmonic mean of the two.
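For concreteness, here is a minimal sketch of how F1 is computed from precision and recall; the counts in the example are hypothetical and not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: a moderator flags 80 videos, of which 60 truly
# violate policy (precision = 60/80), out of 90 violating videos in the
# dataset (recall = 60/90).
precision, recall = 60 / 80, 60 / 90
print(f"F1 = {f1_score(precision, recall):.3f}")  # F1 ≈ 0.706
```

Because the harmonic mean penalizes imbalance, a model cannot earn a high F1 by maximizing one metric at the expense of the other.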
The study's authors concluded that compact MLLMs offer a significantly cheaper alternative to their larger counterparts without sacrificing accuracy; indeed, the Gemini models achieved the highest F1 scores among the MLLMs evaluated.
Industry experts such as Jon Morra, Zefr's Chief AI Officer, advocate a hybrid human-and-AI approach as the most effective and economical path forward for content moderation in the brand safety and suitability landscape: AI handles broad-scale triage and flagging, while humans take on edge cases and complex decisions.
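As an illustration only (the study does not prescribe an implementation), a hybrid pipeline of this kind might route content as follows; the labels, thresholds, and function names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # e.g. "DAT", "DIMC", "kids_content", or "safe"
    confidence: float  # the model's calibrated confidence in its label

# Hypothetical thresholds; in practice these would be tuned per category
# against the precision/recall targets discussed above.
AUTO_APPROVE = 0.95  # confident "safe" verdicts pass through untouched
AUTO_FLAG = 0.90     # confident violation verdicts are actioned by the AI

def route(verdict: Verdict) -> str:
    """AI triages clear-cut cases; ambiguous ones escalate to humans."""
    if verdict.label == "safe" and verdict.confidence >= AUTO_APPROVE:
        return "approve"
    if verdict.label != "safe" and verdict.confidence >= AUTO_FLAG:
        return "flag"
    return "human_review"  # edge cases and complex decisions

# Example: a borderline verdict falls through to human review.
print(route(Verdict(label="DIMC", confidence=0.62)))  # human_review
```

Routing only the low-confidence remainder to humans is what keeps the roughly 40x cost differential manageable while preserving human accuracy on the hardest cases.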
Brand safety still demands human judgment, empathy, and discretion for sensitive content; at the same time, outsourcing and augmenting moderation with AI can support compliance with ethical frameworks and international regulations.
In conclusion, while human moderators deliver superior brand safety outcomes, their roughly 40-fold cost premium over MLLMs makes a hybrid model the recommended path: AI where cost-efficiency matters most, human judgment where it is most needed.