Is it legal for AI to scrape licensed material and reproduce it in the generated output?
- Author
- User
- maintainer
There are some big legal questions that affect every content creator and content provider today:
- Is it legal for an internet robot to scrape all available content for the purpose of training a generative AI model?
- Is it legal for an AI company to train a generative AI model on the scraped content?
- Is it legal when a generative AI model reproduces licensed and/or copyrighted content during inference?
Hmm... Let us apply some logical thinking and common sense when trying to answer these questions...
I am not an AI, so this will not be simulated reasoning or the chain of thought of an LLM ;)
Is it legal for an internet robot to scrape all available content for the purpose of training a generative AI model?
Whether it is legal to scrape available internet content for the purpose of training an AI model depends very much on the country in which the scraper and the AI company are located:
- In a country with strict copyright law without any exemptions, you would usually have to purchase a publishing, manufacturing, or redistribution license for every copy you make in order to prepare it for training an AI.
  - To feed content into a model during training, you have to at least copy it onto the computer where the training is running. You will therefore need a license for this copy in a country with restrictive copyright.
  - Because the model can reproduce learned content more than once, you will need permission to manufacture or publish it. Otherwise, if you are not going to purchase such a license, you will have to ensure technically that your model does not reproduce whole copies or significant parts of the content it is trained on, and that your model, just like a scientific author or journalist, follows all citation rules of the country of residence.
- In a country like the USA, or in parts of the EU, where so-called fair use or comparable copyright exceptions exist, you would have to legally obtain all copies of the content you use for model training.
  - To feed content into a model during training, you have to at least copy it onto the computer where the training is running. You will therefore need to claim fair use for all content you are using.
  - Also, current case law indicates that every copy of content you make when claiming fair use should have been obtained legally.
- If you train your model in a country where the state does not care about copyright, you can ignore everything related to copyright and content licensing.
- By the way, only human work is covered by copyright laws. So if you use an open source large language model (LLM) for training your model, or the supplier of a commercial model does not restrict its usage for training other models, you are welcome to do it this way. This is called model distillation. But be careful if the open-sourced LLM reproduces original copyrighted work without any notice (just like DeepSeek R1, which was trained in China, where very weak copyright restrictions exist for content crafted outside of China). If these reproductions are correctly identified, then you can at least exclude these responses from your training set...
Is it legal for an AI company to train a generative AI model on the scraped content?
This is the next step after putting together the training data for your model. What happens during the training process, and what is the result of this training?
- During training, the structure and the weight parameters of the model are optimized, and at the end you get a set of trained parameters consisting of floating point numbers with some specified precision.
- Is this set of floating point numbers, and maybe also the information about the internal model structure, a copy of the original content used for training? Clearly no, and no part of the model parameter set can be matched to any specific content used for training.
- This means that if you copy your model to another computer in order to optimize inference, you are not copying copyrighted material related to the content used for training.
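To make the "set of floating point numbers" tangible, here is a minimal sketch of a training run. It is a toy, not an LLM: the function `train`, the sample data, and the hyperparameters are assumptions made up for illustration. It only shows what kind of artifact comes out of training, not how a real model stores what it has learned.

```python
# Toy "training" run: fit y = w*x + b by plain gradient descent (pure Python).
# Hypothetical example; real LLM training differs in scale and architecture.

def train(samples, steps=2000, lr=0.05):
    """Optimize two parameters on (x, y) pairs and return them."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}

# The "content" used for training: points on the line y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
params = train(samples)

# The result of training is just a handful of floats.
print(params)
```

The four training samples are not stored anywhere in `params`. Only two numbers remain, and copying `params` to another machine copies those numbers and nothing else.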
Is it legal when a generative AI model reproduces licensed and/or copyrighted content during inference?
I think this is actually the hottest question around LLMs: reproducing original copyrighted work when asked to do so by their users. It is in fact the subject of several trials in the US and in other countries, where content owners are suing LLM suppliers for copyright infringement.
I think this question should be answered as follows:
- If the original work is restricted by copyright, especially when "all rights reserved" is claimed, and the model supplier lets the model reproduce the whole or at least significant parts of the original content used for training, then this model supplier has to obtain an explicit license for manufacturing or republishing this original content.
- If such a manufacturing or republishing license is not obtained and the original work is restricted by copyright, the LLM supplier has to technically restrict the reproduction of the original content according to the journalistic citation rules of the country from which the user is asking. The original content source and the copyright status must be provided in any case.
- If the LLM reproduces content originally published under an open source license, then this reproduction is correct and fair if the reproduced content is properly quoted and accompanied by the respective original open source license. For example, if an LLM has learned original content published under the terms and conditions of the Universal General Public License, then this LLM is allowed to reproduce this original content completely or in parts, but only if the reproduced original content is properly quoted and the original UGPL license is correctly stated.
- Only if an LLM reproduces original content that is completely free of any copyright, or that is licensed under a completely free license without any obligation to attribute the original author or to provide and redistribute the original source code under its respective license, does nothing special need to be done during inference. Of course, the model supplier should keep the original copies of this free content along with the respective copyright status, just in case somebody (falsely) claims copyright and starts a trial ;)
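The rules above could be enforced, at least roughly, by a filter that sits between the model and the user. The following is a minimal sketch under stated assumptions: `ProtectedWork`, `find_reproduced_works`, `guard_output`, and the threshold of 8 consecutive words are all hypothetical names and values chosen for illustration. What counts as a "significant part" is a legal question that a word count cannot settle.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProtectedWork:
    """A known work together with its source title and license status."""
    title: str
    license: str
    text: str

def word_ngrams(text, n):
    """Return the set of all runs of n consecutive lowercase words."""
    # Naive whitespace tokenization; punctuation stays attached to words.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_reproduced_works(output, works, n=8):
    """List works that share at least one n-word run with the output."""
    out_grams = word_ngrams(output, n)
    return [w for w in works if out_grams & word_ngrams(w.text, n)]

def guard_output(output, works, n=8):
    """Append source and license notices when verbatim overlap is found."""
    hits = find_reproduced_works(output, works, n)
    if not hits:
        return output
    notices = [f"Source: {w.title} (license: {w.license})" for w in hits]
    return output + "\n\n" + "\n".join(notices)

works = [
    ProtectedWork(
        title="Example Essay",
        license="all rights reserved",
        text="the quick brown fox jumps over the lazy dog near the quiet river bank",
    )
]
answer = "He wrote: the quick brown fox jumps over the lazy dog near the river"
print(guard_output(answer, works))
```

A stricter variant could shorten or block the overlapping passage instead of only annotating it, depending on which license the supplier actually holds.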
Final considerations:
- Is it technically feasible to supplement any content used for AI training with its respective license? - YES!
- Is it technically feasible to craft a system prompt for a model that encourages the reproduction of original content along with the original license? - Absolutely YES.
- Is it technically feasible to ensure that any model trained on these content-license pairs will reproduce original content only with a properly noted original license? - Most probably, YES
- Is it financially worthwhile for an LLM provider to ensure such license annotations? - YES, because the risk of going out of business if a copyright claim wins at trial might be much more expensive for the investors behind the LLM company.
- What if the investors in the companies behind the actual LLMs have invested much more money than the investors in the legal entities owning the copyrighted work? - Then it depends on the lobbying, the political power, and the legal and financial strength of the opposing parties (and their respective support by the voters in a democratic "setup" ;)
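The first two feasibility points can be sketched in a few lines. This is a hypothetical illustration only: the `LicensedDocument` record, the tag layout in `to_training_text`, and the wording of `SYSTEM_PROMPT` are assumptions, not an established format. Whether a model trained this way reliably repeats the license is exactly the "most probably" from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LicensedDocument:
    """One piece of training content paired with its source and license."""
    source: str
    license: str
    text: str

def to_training_text(doc):
    """Serialize a content-license pair into one training sample.

    The tag names are made up for this sketch.
    """
    return (
        f"<source>{doc.source}</source>\n"
        f"<license>{doc.license}</license>\n"
        f"<content>\n{doc.text}\n</content>"
    )

def filter_by_license(docs, allowed):
    """Keep only documents whose license is in the allowed set."""
    return [d for d in docs if d.license in allowed]

# Hypothetical system prompt encouraging license-aware reproduction.
SYSTEM_PROMPT = (
    "When you reproduce original content, quote it, name its source, "
    "and state the license it was published under."
)

docs = [
    LicensedDocument("example.org/post-1", "CC0-1.0", "Free text."),
    LicensedDocument("example.org/post-2", "all rights reserved", "Protected text."),
]
free_docs = filter_by_license(docs, allowed={"CC0-1.0"})
print(to_training_text(free_docs[0]))
```

Keeping the license next to the content also gives the supplier the record of copyright status that the previous section recommends keeping for a possible trial.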
P.S. Just in case you want to better understand the inner workings of a "classical" LLM:
Sebastian Raschka
Build a Large Language Model (From Scratch)
ISBN-13: 978-1633437166, ISBN-10: 1633437167
UPDATE on 8 August 2025, 08:28 p.m.: Added a hyperlink to an Ars Technica post regarding the class action against Anthropic.