CREATe is publishing today a new working paper by Thomas Margoni and Martin Kretschmer: “A deeper look into the EU Text and Data Mining exceptions: Harmonisation, data ownership, and the future of technology”.
The research was conducted as part of the EU Horizon 2020 project reCreating Europe: Rethinking digital copyright law for a culturally diverse, accessible, creative Europe. The work complements three case studies on the copyright implications of training data in selected AI environments, investigating (1) data scraping for scientific purposes; (2) machine learning in the context of Natural Language Processing (NLP); and (3) computer vision in the context of content moderation of images. This joint work with Pinar Oruc has just been released as an Interim Report here.
Thomas Margoni and I have been participants in the debate about exceptions for Text and Data Mining (TDM) since the UK Hargreaves Review in 2011, leading to the introduction of section 29A into the UK Copyright, Designs and Patents Act (CDPA 1988) and the call for intervention at EU level, recognised in the Directive on Copyright in the Digital Single Market (CDSM). See CREATe’s CDSM policy and implementation resources. This new paper consolidates the arguments we made in different fora (blogs, open letters, conferences) archived here.
There is global attention on new data analytic methods. Data scraping (a typical first step for advanced data analytics), text and data mining (TDM, the extraction of knowledge from data) and machine learning (ML, often also simply referred to as Artificial Intelligence or AI) are seen as critical technologies. The legal issues involved in the regulation of data range from privacy and data protection (such as the GDPR) to proprietary approaches (such as copyright, database rights, or proposed new rights in data themselves).
This paper focusses on one specific intervention: the introduction of two exceptions for text and data mining in the Directive on Copyright in the Digital Single Market (CDSM). Art. 3 is a mandatory exception for text and data mining (TDM) for the purposes of scientific research; Art. 4 permits text and data mining by anyone, but allows rightsholders to “contract-out”, for example by using “machine-readable means” to prevent TDM use of publicly available online content.
We trace the context in which copyright law is used as a lever to enable emerging technologies and support innovation. Within the EU copyright intervention, we identify elements that may underpin a transparent legal framework for AI, such as the possibility of retaining (permanent) copies for further verification. On the other hand, we identify several pitfalls, including an excessively broad definition of TDM, which makes the entire field of data-driven AI development dependent on an exception. We analyse the implications of limiting the scope of the exceptions to the right of reproduction (which leaves the communication of research results in a grey zone). We also argue that the limitation of Art. 3 to certain beneficiaries remains problematic, and that the requirement of lawful access is difficult to operationalise.
In conclusion, we argue that there should be no need for a TDM exception for the act of extracting informational value from protected works. The EU’s CDSM provisions may paradoxically favour the development of biased AI systems, because the price and accessibility conditions attached to training data create the wrong incentives. We also identify some old and new areas of the EU acquis which will play a crucial role in the future relationship of EU copyright law with technological innovation.
The full paper can be downloaded here.
For further information about our ongoing work on Text and Data Mining and Machine Learning, visit our Resource Page Legal Approaches to Data: Scraping, Mining and Learning.
A report by Pinar Oruc from a recent workshop with the ESRC Urban Big Data Centre on legal approaches to data scraping can be found here.