After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse material (CSAM) in an AI training dataset that tainted image generators, the controversial dataset was immediately taken down in 2023.
Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a cleaned version of the LAION-5B dataset called Re-LAION-5B, claiming that it is “the first web-wide dataset of text-link-image pairs that has been thoroughly cleaned of known links to suspected CSAM.”
To clean up the dataset, LAION worked with the Internet Watch Foundation (IWF) and the Canadian Centre for Child Protection (C3P) to remove 2,236 links that matched hashed images in the online safety organizations' databases. The removals include all links flagged by Thiel, as well as content flagged by LAION's partners and other watchdog organizations such as Human Rights Watch, which warned of privacy issues after finding photos of real children included in the dataset without their consent.
In his study, Thiel warned that “[including] child abuse material in the training data of AI models provides tools that associate children with illicit sexual activity and leverages known imagery of child abuse to generate new, potentially realistic child abuse content.”
Thiel called on LAION and other researchers who scour the internet for AI training data to create a new safety standard to better filter out not only CSAM, but also explicit images that could be combined with photos of children to create CSAM. (Recently, the U.S. Department of Justice explicitly said that “CSAM generated by AI is still CSAM.”)
LAION's new dataset does not alter the models trained on the previous dataset, but LAION claims that Re-LAION-5B sets “a new safety standard for cleaning web-wide image link datasets.” Illegal content previously “slipped” through LAION's filters, and researchers have now developed an improved system “for identifying and removing illegal content,” LAION's blog states.
Thiel told Ars he agreed that LAION had set a new safety standard with its latest version, but “there are certainly ways to improve it.” However, “those methods would require ownership of all the original images or a brand new crawl.” LAION's post made clear that it used only image hashes and did not conduct a new crawl, a step that could have risked revealing even more illegal or sensitive content. (On Threads, Thiel shared more detailed impressions of LAION's efforts to clean up the dataset.)
LAION warned that “current state-of-the-art filters alone are not reliable enough to provide protection against CSAM in web-scale data compilation scenarios.”
“To ensure better filtering, hash lists of suspicious links or images created by expert organizations (in our case, IWF and C3P) are a suitable choice,” says LAION's blog. “We recommend that research labs and any other organizations compiling datasets from the public web collaborate with organizations like IWF and C3P to obtain such hash lists and use them for filtering. In the longer term, a larger joint initiative can be created that makes such hash lists available to the research community working on compiling datasets from the web.”
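For readers curious what hash-list filtering looks like in practice, the following is a minimal Python sketch of the general idea, not LAION's actual pipeline. The article does not say which hash functions or data layout were used, so everything here is an assumption for illustration: the `Row` record, the stored `image_hash` field, the `url_digest` and `filter_rows` helpers, and the use of SHA-256 for link hashes are all hypothetical.

```python
import hashlib
from typing import AbstractSet, Iterable, Iterator, NamedTuple


class Row(NamedTuple):
    """One text-link-image pair (hypothetical layout)."""
    url: str
    caption: str
    image_hash: str  # assumed: hex digest of the image, recorded at crawl time


def url_digest(url: str) -> str:
    """Hash a link so it can be compared against a hash list.

    SHA-256 is an assumption; a real hash list defines its own scheme.
    """
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def filter_rows(
    rows: Iterable[Row],
    blocked_url_hashes: AbstractSet[str],
    blocked_image_hashes: AbstractSet[str],
) -> Iterator[Row]:
    """Yield only rows whose link hash and image hash are both absent
    from the hash lists supplied by a partner organization."""
    for row in rows:
        if url_digest(row.url) in blocked_url_hashes:
            continue  # link matches the hash list: drop the row
        if row.image_hash in blocked_image_hashes:
            continue  # stored image hash matches the hash list: drop the row
        yield row
```

Because a filter like this compares only stored hashes, it never needs to download or view an image. The trade-off is the one critics raise later in this article: it can catch only material that has already been identified and hashed.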
According to LAION, the bigger concern is that some links to known CSAM that were scraped into a 2022 dataset are still active more than a year later.
“This is a clear indication that law enforcement authorities need to intensify their efforts to shut down domains hosting such image content on the public web. This must follow the information and recommendations of organizations such as IWF and C3P to make the Internet a safer place, including for various types of research-related activities,” says LAION's blog.
HRW researcher Hye Jung Han praised LAION for removing sensitive data she had pointed out, while calling for further interventions.
“LAION's responsive removal of some children's personal photos from the dataset is very welcome and will help protect these children from having their images misused by AI systems,” Han told Ars. “Now it is up to governments to enact child data protection laws that protect the privacy of all children online.”
Although LAION's blog said the content removals represented an “upper limit” of CSAM that existed in the original dataset, AI specialist and Creative.AI co-founder Alex Champandard told Ars he was skeptical that all CSAM was removed.
“They only filter out previously identified cases of CSAM, which is only a partial solution,” Champandard told Ars. “Statistically, most cases of CSAM have probably never been reported or investigated by C3P or IWF. A more reasonable estimate of the problem is around 25,000 cases of things you would never want to train generative models on – maybe even 50,000.”
Champandard agreed with Han that more regulations are needed to protect people from AI harms when training data is scraped from the web.
“There is room for improvement on all fronts: privacy, copyright, illegal content, etc.,” Champandard said. Since “too many privacy rights are violated with such datasets scraped from the web,” Champandard said that datasets like LAION's “will not stand the test of time.”
“LAION is simply working on the regulatory gap and the backlogs in the justice system until policymakers recognize the magnitude of the problem,” Champandard said.