The rapid development of artificial intelligence (AI) and its reliance on vast datasets have raised critical questions about the balance between innovation and intellectual property (IP) rights. As AI-generated content (AIGC) grows, concerns over data scraping, which is often essential for AI training, are driving debates about copyright infringement, unfair competition (in countries such as China), the enforceability of website terms of use, and technological safeguards.
So, what are the issues at stake and what can content owners and AI developers do to manage the risk surrounding them?
Scraping and Copyright Infringement
At its core, data scraping involves extracting large volumes of information from websites, often through automated bots. While browsing by humans inherently implies a license to view and reproduce content on their devices, this implied license does not extend to bots performing large-scale scraping. This distinction often forms the basis for alleging copyright infringement against unauthorized scrapers.
Even so, the question arises whether such scraping can be defended under the doctrine of fair use or fair dealing. These defences are limited or absent in many jurisdictions, however, leaving the issue unresolved.
Other Rights and Interests Related to Data
Apart from copyright, other rights or interests may attach to data. Taking China as an example, if a dataset is collected and produced to bring monetary benefits to its owner, others who simply scrape the dataset would unfairly harm the owner's interests without any justifiable grounds. The scraping and follow-up use of the dataset may also be deemed unfair competition under the Anti-Unfair Competition Law.
Enforceability of Website Terms of Use
Website owners frequently use express terms of use to regulate access, including prohibitions on scraping. These terms, when legally enforceable, can form the foundation of contractual claims against scrapers. For example, a European case involving airline Ryanair upheld terms of use as an enforceable contract, leading to a ruling against a price comparison platform that violated these terms.
However, the effectiveness of such enforcement remains limited. Quantifying damages caused by scraping is challenging, and pursuing litigation across jurisdictions is resource-intensive. Strengthening the prominence and clarity of website terms of use may improve enforceability and provide a stronger deterrent.
The Role of Technological Protection Measures
Technological Protection Measures (TPMs) and Digital Rights Management (DRM) systems serve as safeguards against unauthorized data access and tampering. These measures include anti-crawling mechanisms, such as systems that differentiate human browsing from bot activity. For example, Getty Images successfully traced copyright infringement in a case involving Stable Diffusion by relying on watermarked content embedded in its dataset.
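To make the idea of differentiating human browsing from bot activity concrete, the following is a minimal Python sketch, not a description of any particular vendor's system. It assumes two simple signals: a user-agent string that declares itself as automated, and a request rate over a sliding time window that is higher than a person would plausibly produce. The class name, thresholds, and user-agent hints are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical thresholds and hints, chosen only for illustration.
WINDOW_SECONDS = 10.0
MAX_REQUESTS_PER_WINDOW = 20
BOT_UA_HINTS = ("bot", "crawler", "spider", "scrapy", "python-requests")


class BotDetector:
    """Flags clients whose traffic looks automated rather than human."""

    def __init__(self, window=WINDOW_SECONDS, max_requests=MAX_REQUESTS_PER_WINDOW):
        self.window = window
        self.max_requests = max_requests
        # Maps a client identifier (e.g. an IP address) to recent request times.
        self._hits = defaultdict(deque)

    def looks_like_bot(self, client_id, user_agent, now):
        """Record one request at time `now` (seconds) and classify the client."""
        hits = self._hits[client_id]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        ua = user_agent.lower()
        self_declared = any(hint in ua for hint in BOT_UA_HINTS)
        too_fast = len(hits) > self.max_requests
        return self_declared or too_fast
```

A site might call `looks_like_bot` on each incoming request and respond to flagged clients with a challenge or a block. Both signals are easy to evade, for example by spoofing the user-agent string or slowing the request rate, so real deployments typically combine many more signals.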
Yet, these measures are not foolproof. Techniques like data cleaning, often used during AI training, can remove watermarks or other identifiers, making it harder to trace or prove infringement. Moreover, identifying the individuals or entities responsible for scraping often requires court-ordered discovery actions, which can be hindered by legal and jurisdictional challenges.
Policy and Legal Frameworks Across Jurisdictions
Legal certainty varies widely across countries, influencing the balance of power between content owners, data centres, and AI developers. Singapore, for instance, offers legal clarity that facilitates enforcement against data centres hosting scraping activities. Conversely, jurisdictions like Indonesia, which lack fair use defenses and recognition of clickwrap agreements, present challenges in proving and addressing copyright infringement.
Data Centre and Cloud Service Liability: Impact on AI Developers
The new reporting requirements from the US Department of Commerce for AI developers aim to enhance oversight and national security by mandating detailed disclosures about AI model development, cybersecurity measures, and testing outcomes. This could lead to increased operational costs as companies invest in compliance resources and modify processes to meet reporting standards.
In another developing area, a data centre has been sued for enabling copyright infringement. This is probably a tactic used when the data centre's user cannot be identified.
Recommendations for Stakeholders
Conclusion
The tension between protecting IP and fostering AI development highlights the need for clearer legal frameworks and proactive measures. While no country outright opposes AI innovation, the degree of legal certainty each one offers significantly affects stakeholders. Striking a balance between innovation and rights protection is key to ensuring the sustainable growth of AI technologies.