
AI Training: What Creators Need To Know About Copyright, Tokens, And Data Winter

from the beware-the-data-winter dept

This is the final piece in a series of posts that explores how we can rethink the intersection of AI, creativity, and policy. From examining outdated regulatory metaphors to questioning copyright norms and highlighting the risks of stifling innovation, each post addresses a different piece of the AI puzzle. Together, they advocate for a more balanced, forward-thinking approach that acknowledges the potential of technological evolution while safeguarding the rights of creators and ensuring AI’s development serves the broader interests of society. You can read the first, second, third, fourth, fifth, and sixth posts in the series.

As the conversation about AI’s impact on creative industries continues, there’s a common misconception that AI models are “stealing” content by absorbing it for free. But if we take a closer look at how AI training works, it becomes clear that this isn’t the case at all. AI models don’t simply replicate or repackage creative works—they break them down into something much more abstract: tokens. These tokens are tiny, fragmented pieces of data that no longer represent the creative expression of an idea. And here’s where the distinction lies: copyright is meant to protect expression, not individual words, phrases, or patterns that make up those works.

The Lego Analogy: Breaking Down Creative Works into Tokens

Imagine you’re a creator, and your work is like a detailed Lego model of the Star Wars Millennium Falcon. It’s intricate, with every piece perfectly assembled to create something unique and valuable. Now imagine that an AI system comes along—not to take your Millennium Falcon and display it as its own creation, but to break it down into individual Lego blocks. These blocks are then scattered among millions of others from different sources, and the AI uses them to build entirely new structures—things that look nothing like the Millennium Falcon.

In this analogy, the Lego blocks are the tokens that AI models use. These tokens are fragments of data—tiny bits of information stripped of the original context and creative expression. Just like Lego pieces, tokens are abstract and can be recombined in an infinite number of ways to create something entirely new. The AI doesn’t copy your Falcon; it takes the building blocks (tokens) and uses them to create something that’s not a replica of the original but something completely different, like a castle or a spaceship you’ve never seen before.

This is the key distinction: AI models aren’t absorbing entire creative works and reproducing them as their own. They’re learning patterns from vast datasets and using those patterns to generate new content. The tokens no longer reflect the expression of the original work, and thus, they don’t infringe on the creative essence that copyright law is designed to protect.
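To make that concrete, here is a minimal Python sketch of tokenization in general (not of any particular model's training pipeline), using OpenAI's open-source tiktoken library; the sample sentence is invented for the example:

```python
# A minimal tokenization sketch using tiktoken (pip install tiktoken).
# The sentence is an invented example, not taken from any training set.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common byte-pair encoding

text = "The Millennium Falcon is an intricate model built from thousands of bricks."
ids = enc.encode(text)                      # the sentence as a flat list of integers

print(ids)
print([enc.decode([i]) for i in ids])       # the same IDs as sub-word fragments
```

What a model learns are statistical patterns over billions of such fragments, not stored copies of the works they came from; the individual tokens are the Lego bricks, not the Falcon.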

Why Recent Content Matters: AI Needs to Reflect Modern Language and Values

There’s another critical point that often gets overlooked: AI models need access to recent, contemporary content to be useful, relevant, and ethical. Let’s imagine for a moment what would happen if AI models were restricted to learning only from public domain works, many of which are decades or even centuries old.

While public domain works are valuable, they often reflect the social norms and biases of their time. If AI models are trained primarily on outdated texts, there’s a serious risk that they could “speak” in a way that’s misogynistic, biased, anti-LGBTQ+, or even outright racist. Many public domain works contain language and ideas that are no longer acceptable in today’s society, and if AI is limited to these sources, it may inadvertently propagate harmful, antiquated views.

To ensure that AI reflects current values, inclusive language, and modern social norms, it needs access to recent content. This means analyzing and learning from today’s books, articles, speeches, and other forms of communication. If creators and copyright holders opt out of allowing their content to be used for AI training, we risk creating models that don’t reflect the diversity, progress, and inclusivity of modern society.

For example, language evolves quickly—just look at the increased use of gender-neutral pronouns or terms like intersectionality in recent years. If AI is cut off from these contemporary linguistic trends, it will struggle to understand and engage with the world as it is today. It would be like asking an AI trained exclusively on Shakespearean English to have a conversation with a 21st-century teenager—it simply wouldn’t work.

Article 4 of the EU Directive: Opting Out of Text and Data Mining

Let’s bring the EU Directive on Copyright in the Digital Single Market (DSM) into the picture. Article 4 of the Directive allows copyright holders to opt out of having their content used for text and data mining (TDM). TDM is crucial for training AI models, as it allows them to analyze and learn from large datasets. The opt-out mechanism gives creators and copyright holders the ability to expressly reserve their works so that they cannot be used for TDM.
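In practice, that reservation generally needs to be machine-readable to work at scale. One common convention (a sketch, not a legally mandated format) is a robots.txt file that blocks known AI crawlers; GPTBot and CCBot below are real, published crawler user-agents, listed here purely as examples:

```
# robots.txt: a sketch of a machine-readable TDM reservation
# GPTBot is OpenAI's web crawler; CCBot is Common Crawl's.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Whether signals like these satisfy Article 4's "expressly reserved" standard is still being debated, which is exactly why how the opt-out is exercised matters.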

However, it’s important to remember that this opt-out applies to all AI models, not just generative AI systems like ChatGPT. This means that by opting out in a broad, blanket manner, creators could inadvertently limit the potential of AI models that have nothing to do with creative industries—tools that are critical for advancements in healthcare, education, and even in day-to-day conveniences that many of us benefit from.

The Risk of a Data Winter: Why Broad Opt-Outs Could Harm Innovation

What happens if creators and copyright holders across Europe start opting out of TDM on a large scale? The answer is something AI researchers dread: a data winter. Without access to a diverse and rich array of data, AI models will struggle to evolve. This could slow innovation not just in the creative industries, but across the entire economy.

AI needs high-quality data to function properly. The principle of Garbage In, Garbage Out applies here: if AI models are starved of diverse input, their output will be flawed, biased, and of lower quality. And while data restrictions may seem remote from some industries, the effects ripple outward. Every AI tool we rely on—from smart assistants to medical research applications—depends on robust training data. Restricting access to this data doesn’t just hinder progress in AI innovation; it stifles public interest tools that have far-reaching benefits for society.

Think about it: many creators themselves probably use AI-driven tools in their daily lives—whether it’s for streamlining workflows, generating new ideas, or even just organizing information. By opting out of TDM, they could inadvertently be damaging the very tools that enhance their own creative processes.

The Way Forward: Balance Between Protection and Innovation

While copyright is crucial for protecting creators and ensuring fair compensation, it’s equally important not to over-regulate in a way that stifles innovation. AI models aren’t absorbing entire works for free; they’re breaking them down into unrecognizable tokens that enable transformative uses. Rather than opting out of TDM as a knee-jerk reaction, creators should consider the long-term consequences of limiting AI’s potential to innovate and enhance their own industries.

A balance needs to be struck. Copyright protection should ensure that creators are fairly compensated, but it shouldn’t be wielded as a tool to restrict the very data that drives AI innovation. Creators and policymakers must recognize that AI isn’t the enemy—it’s a collaborator. And if we’re not careful, we might find ourselves facing a data winter, where the tools we rely on for both convenience and advancement are weakened due to short-sighted decisions.

Caroline De Cock is a communications and policy expert, author, and entrepreneur. She serves as Managing Director of N-square Consulting and Square-up Agency, and Head of Research at Information Labs. Caroline specializes in digital rights, policy advocacy, and strategic innovation, driven by her commitment to fostering global connectivity and positive change.

Filed Under: building blocks, copyright, creativity, creativity and ai, culture, data winter, featured, llms, training
