Introduction
Data scraping, i.e. the automated extraction of information from online and publicly accessible sources, is essential for training artificial intelligence (AI) and large language models (LLMs). These models rely on vast datasets to achieve accuracy and functionality.
However, data scraping – and its subsequent processing for AI training – raises significant legal concerns that AI companies should be aware of, including personal data protection issues, copyright infringement, and breaches of terms of service.
This article provides a high-level overview of the legal landscape on data scraping in the EU, US, and UK, with the aim of helping AI companies navigate these complex challenges and innovate responsibly.
Processing of (Scraped) Personal Data in Compliance with the GDPR and Other Data Protection Laws
Scraped data can (and most likely will) include personal data, i.e. data that identifies, or can lead to the identification of, an individual.
Both (i) the scraping of personal data and (ii) its subsequent use for AI training are processing operations (collectively referred to as the “Processing”) under the EU General Data Protection Regulation (GDPR) and under UK personal data protection legislation.
As a general rule, personal data of EU and UK individuals can be processed only if the company can show a lawful ground, such as the data subject's consent to the Processing or the company's legitimate interest in the Processing.
Given that consent is unlikely to be obtained from the data subjects, AI companies tend to rely on the ground of legitimate interest for the Processing, thereby avoiding any consent requirements.
However, legitimate interest allows a company to carry out the Processing only if certain conditions are met, such as:
- a) the interest is lawful,
- b) the Processing is necessary to achieve that interest, and
- c) the interest is not overridden by the data subjects’ rights.
Due to these requirements, the use of the legitimate interest ground for data scraping and AI training has come under the scrutiny of data protection authorities (“DPAs”) in Europe. The DPAs' views, however, diverge.
For example, the Italian DPA (the Garante) questioned the legality of relying on legitimate interest for scraping personal data to train AI models and temporarily banned a generative AI product in Italy.
Similarly, the Dutch DPA published guidelines on data scraping and stated that web scraping based on legitimate interest is almost always illegal. On a different occasion, the Dutch DPA argued that relying on legitimate interest for purely commercial reasons is not permitted, but its position was criticized by the EU Commission.
The UK DPA, the Information Commissioner's Office (ICO), took a more balanced approach in its analysis and consultation on the matter. First, it said that the business interest in developing a model and deploying it for commercial gain, either on the developer's own platform or by bringing it to market for third parties to procure, can be deemed a legitimate interest. Second, as for the necessity requirement (also indicated above), it stated that “the ICO’s understanding is that currently, most generative AI training is only possible using the volume of data obtained through large-scale scraping”. Finally, the ICO noted that data scraping is a so-called “invisible processing” operation, where people are not aware that their personal data is being processed in this way. As such, companies must carry out a data protection impact assessment (DPIA) to ensure adequate measures are in place to safeguard individuals’ personal data protection rights against these high-risk processing activities.
In the USA, laws and authorities are more lenient; as such, the legal threshold for data scraping is generally considered lower.
Breach of Copyright Laws
The Processing also raises questions of copyright protection and infringement.
When the collected data is protected by copyright, the developer should either have the copyright holders’ permission to use the data or be able to rely on a legal exception to copyright.
In the EU, the Copyright Directive provides for the text and data mining (TDM) exception which, generally speaking, allows private organizations to scrape public data if the access is i) open, and ii) legal.
The access is not legal if:
- the website or platform has opted out of the TDM exception, which makes the scraping of that site illegal; or
- scraping is prohibited by the platform's terms and conditions (TCs), as further discussed below.
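As a purely illustrative sketch of how a scraper might check for a machine-readable opt-out before collecting data, the Python snippet below reads a site's robots.txt rules using the standard library. It rests on assumptions not made in this article: that the site expresses its opt-out through robots.txt, and that the scraper identifies itself with the hypothetical user agent "ExampleAIBot". Whether a robots.txt rule amounts to a valid reservation under the TDM exception is a legal question that this sketch does not answer, and it does not cover restrictions set out in a site's TCs.

```python
from urllib.robotparser import RobotFileParser


def may_scrape(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse rules from text so the example runs without network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: the site blocks one AI crawler and allows all others.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for agent in ("ExampleAIBot", "OtherBot"):
        allowed = may_scrape(EXAMPLE_ROBOTS_TXT, agent, page)
        print(f"{agent}: {'allowed' if allowed else 'opted out'}")
```

In practice, a check of this kind would be only one input into a broader legal review, alongside the site's TCs and any other opt-out signals it publishes.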
In a decision that indirectly touches upon copyright infringement, the French competition authority fined Google (again) €250 million in 2024 for the unauthorized use of media content to feed and train Gemini, previously known as Bard.
In the UK, TDM without permission from the copyright holders is allowed, in principle, only for research organizations and for non-commercial purposes.
In the USA, the use of copyrighted material may be allowed, without permission, under the fair use doctrine. This requires a careful analysis that takes into account, among other factors:
- whether the scraping transforms the original content for a different purpose;
- whether the scraping competes with the original work;
- the amount of the original work that is copied; and
- the effect on the market for the original work.
Breach of Scraped Websites’ TCs
Some websites and platforms prohibit the collection of data through scraping in their TCs.
In fact, this is the case for many platforms, as they themselves derive a commercial benefit from the data publicly displayed on them. As data is a valuable asset, platforms guard their ownership of it through their TCs.
Where this is the case, data scraping may result in breach of contract or in tort liability.
One of the most famous cases in this respect is the LinkedIn vs. hiQ dispute, in which LinkedIn managed to show that hiQ’s scraping of LinkedIn was done in breach of its terms, as “LinkedIn’s User Agreement expressly prohibits scraping of its site” (see United States District Court, N.D. California, hiQ Labs, Inc. v. LinkedIn Corporation, Case No. 17-cv-03301-EMC).
Conclusion
Data scraping is, in principle, essential for training AI models. However, this practice brings significant legal challenges in data protection, copyright laws, and compliance with terms of service.
In the EU and UK, the stringent requirements of the GDPR and equivalent legislation make the processing of personal data complex and subject to frequent scrutiny by authorities. The legitimate interest ground is frequently used to avoid consent requirements, but its acceptance varies among authorities, and significant cases highlight the risks involved. Additionally, copyright laws in the EU, the UK, and the US offer different degrees of flexibility, and high-profile cases on breach of contractual terms, such as the LinkedIn vs. hiQ dispute, emphasize the importance of staying within legal boundaries and carefully considering compliance.
AI companies must adopt responsible practices to navigate these legal complexities. Moreover, as the regulatory environment evolves, ongoing dialogue between AI developers, legal experts, and regulatory bodies is essential to balance AI innovation with legal and ethical standards.