As courts and regulators continue to wrestle with the copyright conundrums that artificial intelligence has unleashed, companies find themselves walking a precarious legal tightrope. Building AI that is both commercially viable and ethically responsible is no small feat.

In response, a new category of platforms has emerged, promising โ€œresponsibleโ€ and โ€œcopyright-safeโ€ AI. โ€” โ€œAll the AI, with none of the copyright calories!โ€ But beneath this appealing promise often lies a tangled web of legal ambiguity and ethical gray zones.

Some companies genuinely invest in responsible AI practices, auditing data pipelines and designing systems with compliance in mind. Others, however, merely wrap opaque models in marketing-friendly labels. In the most superficial cases, โ€œsafetyโ€ comes down to crude regex filters that block certain termsโ€”a cosmetic fix that provides more illusion than protection. This raises a critical question: how can we test whether training data is truly copyright-compliant and ethically sourced?

I often draw upon seemingly unrelated subjects to tackle complex problems. In this case, my literary hero, Sherlock Holmes, will provide the basis for our strategy. Meanwhile, Spider-Manโ€”my nieceโ€™s favoriteโ€”will serve as our test subject.

In โ€œThe Adventure of Silver Blaze,โ€ written in 1892 by Sir Arthur Conan Doyle, Sherlock Holmes investigates the mysterious disappearance of a racehorse. The master detective makes a crucial observation about โ€œthe curious incident of the dog in the night-time.โ€

The key insight wasnโ€™t what happened during the crimeโ€”it was what didnโ€™t happen. The dogโ€™s silence during the theft revealed that it recognized the perpetrator. Holmes used this absence of expected behavior as the thread that unraveled the entire mystery.

When investigating potentially unauthorized copyrighted material in AI training data, we can apply the same principle: look for what shouldnโ€™t be there, and let that absence (or presence) guide us to the truth.

This investigative approach isnโ€™t revolutionaryโ€”creators have used similar techniques for decades to protect their intellectual property:

  • Trap Streets: Cartographers would add fictitious streets, towns, or landmarksโ€”known as trap streetsโ€”to their maps. If these fake details appeared in another map, it was clear evidence that their work had been copied without permission.
  • Mountweazels: Similarly, dictionary and encyclopedia publishers included fake words or entries (called mountweazels) to detect unauthorized reproductions. For example, the word โ€œesquivalienceโ€ was included in the New Oxford American Dictionary as a copyright trap. Its definition โ€“ โ€œthe willful avoidance of oneโ€™s official responsibilities.โ€ clever
  • Watermark Lyrics: Genius.com secretly watermarked song lyrics with patterns of apostrophes that spelled โ€œRed-handedโ€ in Morse code. This allowed them to catch Google allegedly copying their lyrics and displaying them in search results without permission.

To demonstrate this principle in action, consider this carefully crafted prompt designed to test AI image generators:

A fictional superhero: He wears a tight-fitting suit that covers his entire body from head to toe. Color Scheme: The costume primarily features two bold colorsโ€”These colors are arranged in a specific pattern across the suit. His entire head is encased in a mask that leaves no skin exposed. Concealed beneath or incorporated into his wrist area are devices that allow him to shoot web-like strands, though these are not always visible. He does not wear a cape, belt, or external armor.

Notice whatโ€™s missing from this description: any explicit mention of Spider-Man, Marvel, web-slinging, or spider imagery.

When this prompt was tested across different AI image generation models, the results were revealing. Despite never explicitly naming Spider-Man, multiple AI systems generated images that were unmistakably the iconic Marvel characterโ€”complete with the distinctive red and blue color scheme, web patterns, and characteristic pose.

The AI models filled in details that werenโ€™t specified in the prompt, drawing from training data that clearly included copyrighted Spider-Man imagery. Like Holmesโ€™s silent dog, what the AI didnโ€™t need to be told revealed everything about what it had learned.

These images were generated by different AI models, WITHOUT explicitly mentioning Spider-Man in ANY of the prompts.

The Spider-Man experiment exposes the fragility of โ€œcopyright-safeโ€ claims. If AI systems can recreate iconic characters from vague prompts, it suggests they were trained on copyrighted material, regardless of assurances to the contrary.

For businesses, regulators, and investors, this methodology offers a practical way to probe beyond marketing claims. By crafting prompts that describe copyrighted content without naming it directly, you can assess whether an AI system has been trained on potentially problematic data.

The lesson from Baker Street remains relevant in our digital age: sometimes the most revealing evidence is found not in whatโ€™s explicitly present, but in what shouldnโ€™t be there at all. Never accept โ€œsafe AIโ€ branding at face value. Demand transparency, test aggressively, because though the dog may not barkโ€”the lawsuits surely will.

10 Indicators that itโ€™s time for a new websiteInsights

10 Indicators that itโ€™s time for a new website

Jonathan AlbarranJonathan AlbarranNovember 27, 2023
Building meaningful work relationships with your teamInsights

Building meaningful work relationships with your team

Jonathan AlbarranJonathan AlbarranNovember 27, 2023
The Great Restructuring: How AI Is Transforming Knowledge Work from the Ground Up (2025)Insights

The Great Restructuring: How AI Is Transforming Knowledge Work from the Ground Up (2025)

Jonathan AlbarranJonathan AlbarranOctober 28, 2025

Leave a Reply