Legal approaches to Data: Scraping, Mining and Learning


[BETA] This resource page documents task 3.3 of WP3 of the reCreating Europe project. It focusses on copyright and input data used as training material for AI and machine learning applications.

Whereas the technical ability to increase the stock of knowledge has clear positive implications for science, society and the economy, the unregulated use of data may also pose threats to the subjects who own that data or to whom that data refers. The law, in fields such as copyright, technological protection measures and contracts, has developed rules intended to mitigate those threats and to balance the protection of personal autonomy and financial investments with the promotion of creativity and innovation. However, legal rules, which are necessarily general and abstract, often fail to offer the required level of detailed guidance to data scientists during their day-to-day activities. At the same time, the deeper implications of regulating technology via private law are difficult to identify and require a proper methodological approach. These considerations often lead to legal uncertainty for researchers, technologists and creative industries in areas where the use of analytical techniques, machine learning, content moderation and the advancement of science and culture could substantially contribute to socio-economic development.

This resource page reflects the ongoing work by Prof. Thomas Margoni, Prof. Martin Kretschmer and Dr Pinar Oruc and introduces the Case Studies and the Executive Summary of D3.6 - Interim Report for Task 3.3 of WP3 of reCreating Europe.

Project Summary: Big data mining and machine learning require the compilation of corpora (e.g. literary works, public domain material, data) that are often “available on the internet”. The collection stage is usually followed by processing and annotation of the collected data, depending on the type of learning (supervised/unsupervised) and the purpose of the algorithm. Copyright law has a direct impact on this process, as the corpora could include works protected by copyright, and any digital copy, temporary or permanent, in whole or in part, direct or indirect, has the potential to infringe copyright (Art. 2 InfoSoc Directive). Furthermore, the changes made to the collected material can amount to ‘adaptation’, and the relevant exceptions, such as those for research or text and data mining, might not sufficiently cover the activities of stakeholders in this area. This project will analyse case studies on data scraping, natural language processing and computer vision to assess whether the current legal framework is well equipped for the development of AI applications, especially in the field of machine learning, or, if not, what kind of measures should be developed (legal reform, policy initiatives, licences and licence compatibility tools, etc.).

Case study 1: Data scraping for scientific purposes

“Scraping” involves manually or automatically collecting data from websites, and takes different forms such as web scraping, web harvesting and web crawling. Data scraping involves the collection of both protected and unprotected data, which is then restructured, validated and stored. Data scraping can be performed once to provide an accurate snapshot, or it can be used for real-time updates. Although data scraping is treated here as a separate case study of a technological process, it is a data collection method: it can be a preliminary step for data analytics and feed into Natural Language Processing and Computer Vision.
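The collect, restructure, validate and store stages described above can be sketched in a few lines. This is a minimal illustration only, not the workflow of any project discussed here: the HTML snippet, the `PaperParser` class and the field names are all hypothetical, and the page content is held in a string rather than fetched from a live website.

```python
from html.parser import HTMLParser
import json

# Hypothetical page content; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="paper"><a href="/p/1">Text and data mining in the EU</a> <span class="year">2021</span></li>
  <li class="paper"><a href="/p/2">Database rights revisited</a> <span class="year">n/a</span></li>
  <li class="paper"><a href="/p/3">Scraping for science</a> <span class="year">2019</span></li>
</ul>
"""


class PaperParser(HTMLParser):
    """Collects raw href/title/year records from the hypothetical markup above."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being built, if inside an <li class="paper">
        self._field = None    # which field incoming text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "paper":
            self._current = {"href": None, "title": "", "year": ""}
        elif self._current is not None and tag == "a":
            self._current["href"] = attrs.get("href")
            self._field = "title"
        elif self._current is not None and tag == "span" and attrs.get("class") == "year":
            self._field = "year"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def scrape(html):
    """Collection and restructuring: turn markup into a list of dictionaries."""
    parser = PaperParser()
    parser.feed(html)
    return parser.records


def validate(records):
    """Validation: keep only records with a link, a title and a four-digit year."""
    valid = []
    for r in records:
        year = r["year"].strip()
        if r["href"] and r["title"].strip() and year.isdigit() and len(year) == 4:
            valid.append({"href": r["href"], "title": r["title"].strip(), "year": int(year)})
    return valid


def store(records):
    """Storage: serialise the snapshot (here to a JSON string instead of a file)."""
    return json.dumps(records, indent=2)


records = validate(scrape(SAMPLE_HTML))
snapshot = store(records)
```

Running the sketch once yields a one-off snapshot; re-running it on a schedule would correspond to the real-time update scenario. Note that each stage makes a copy of the scraped material, and it is these copying and editing steps that the legal assessment has to address.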

From a copyright law perspective, scraping needs to be assessed for the type of data collected, for the activities performed both during scraping (copying and editing) and afterwards (using data in outputs), and for whether there are contractual terms on the websites prohibiting scraping. The case study can be downloaded as part of the Interim report.

Case study 2: Machine learning, in the context of Natural Language Processing (NLP)

Natural language processing (NLP) is a technology at the intersection of computer science, AI and linguistics. It is a form of machine learning where the purposes can range from analysing larger texts to computers generating realistic texts. Once the data is collected (through scraping or otherwise), NLP requires pre-processing to simplify and standardize the text. The edited text then goes through supervised or unsupervised training processes. Supervised learning requires labelled text data, so supervised workflows include an “annotation” stage. Unsupervised NLP, on the other hand, uses unlabelled data and instead detects patterns. This requires large datasets and is not suitable for all research projects.
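The difference between the pre-processing, annotation and pattern-detection steps can be illustrated with a small sketch. It is a toy example under stated assumptions: the sentences, the labels and the `preprocess` helper are hypothetical, and counting adjacent word pairs stands in for the far more elaborate pattern detection that real unsupervised models perform.

```python
import re
from collections import Counter


def preprocess(text):
    """Simplify and standardise text: lowercase it and keep only word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Supervised route: each text carries a human-assigned label (the annotation stage).
labelled = [
    ("The court upheld the copyright claim.", "legal"),
    ("The model was trained on a large corpus.", "technical"),
]
training_examples = [(preprocess(text), label) for text, label in labelled]

# Unsupervised route: no labels, only patterns found in the data itself.
# Here the "pattern" is simply the most frequent pair of adjacent words.
unlabelled = [
    "Text and data mining supports research.",
    "Data mining requires large datasets.",
]
pair_counts = Counter()
for text in unlabelled:
    tokens = preprocess(text)
    pair_counts.update(zip(tokens, tokens[1:]))

most_common_pair = pair_counts.most_common(1)[0]
```

Both routes begin by producing an edited copy of the source text, while only the supervised route adds annotations on top of it. These are the copying, editing and annotating activities that matter for the copyright assessment.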

From a copyright perspective, NLP needs to be assessed for the type of data collected (protected or unprotected), the activities performed in the text analysis (copying, editing, annotating and using pre-trained language models) and the outputs in the trained model (whether it uses the training text data). The case study can be downloaded as part of the Interim report.

Case study 3: Computer vision, in the context of content moderation of images

This case study focusses on computer vision. While computer vision has many uses, such as facial recognition or self-driving cars, the case study concentrates on the example of object recognition technology used for content moderation of images. Computer vision involves the collection of images and videos (protected and unprotected), followed by pre-processing such as cropping, rotating or converting colour. Training can be supervised or unsupervised, in both cases based on features of the images. If supervised, images will be annotated in full or partially. If unsupervised, the computer will detect similarities and classify images, but will be unable to interpret them. In content moderation, human moderators are still widely used for uncertain decisions about visual content involving violence, nudity or criminal activity.
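The pre-processing operations and the hand-off to human moderators can be sketched as follows. This is an illustrative sketch only: the tiny image, the helper functions and the thresholds in `route` are hypothetical, and a real pipeline would use an imaging library and a trained classifier rather than nested lists and a hand-supplied score.

```python
# A tiny hypothetical image: rows of (R, G, B) pixels, 2 rows by 3 columns.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(10, 10, 10), (200, 200, 200), (255, 255, 255)],
]


def crop(img, top, left, height, width):
    """Return the sub-image starting at (top, left) with the given size."""
    return [row[left:left + width] for row in img[top:top + height]]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def to_grey(img):
    """Convert colour to greyscale using the common BT.601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row] for row in img]


def route(score, remove_at=0.9, allow_below=0.2):
    """Route a moderation decision from a classifier score between 0 and 1.

    The thresholds are assumptions chosen for illustration: confident cases
    are decided automatically, uncertain ones go to a human moderator.
    """
    if score >= remove_at:
        return "remove"
    if score < allow_below:
        return "allow"
    return "human_review"


cropped = crop(image, 0, 1, 1, 2)
rotated = rotate90(image)
grey = to_grey(image)
decisions = [route(s) for s in (0.95, 0.5, 0.05)]
```

Each pre-processing helper returns a new, altered copy of the source image, which is why copying and editing appear among the activities to be assessed from a copyright perspective.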

From a copyright perspective, computer vision needs to be assessed for the type of data collected (protected or unprotected), the activities performed for content moderation (copying, editing, annotating and moderation decisions) and the outputs in the trained model (whether it uses the training image data). The case study can be downloaded as part of the Interim report.

Interim Report (7 July 2021)

Executive Summary: There is global attention on new data analytic methods. Machine learning (essentially pattern recognition dressed as Artificial Intelligence or AI) is seen as a critical technology. Data scraping, the acquiring and structuring of information from online sources, is a typical first step for many advanced data analytic methods.

The technologies of scraping, mining and learning are often conflated, as are the legal regimes under which they are regulated. One regulatory lever under one legal regime will not deliver policy aims, such as innovation, personal dignity or the currently popular ‘data sovereignty’. The legal issues involved in the governance of data range from proprietary approaches (copyright, database rights) to privacy and data protection.

In addition, there is a wide range of public law instruments, for example relating to public sector data governance or the right to non-discrimination. Competition law again (which may be both privately and publicly enforceable) increasingly prescribes conduct in relation to data, such as in merger or acquisition cases, or in transparency provisions (Art. 17 CDSM; and centrally in the proposed DMA).

The scope of our enquiry in this report is within private law, specifically on the attempt to assert quasi-proprietary control of information and data or, conversely, to limit such attempts, for example by exempting desired activities via copyright exceptions, such as the exception for text and data mining in Arts. 3 and 4 CDSM.

The copyright regime offers a template with a centuries-old tradition of exclusive rights, supplemented in the EU since 1996 by a sui generis database right. While data or information are not subject matter within copyright law, almost all materials used to construct so-called corpora for new data analytic methods are protected by copyright law: scientific papers, images, videos, and so on.

The research design we adopt for this interim report is a reverse inductive strategy. We focus on case studies of three technological processes to explore in detail possible descriptions that would allow legal analysis, and an assessment of the need for a harmonisation of rights and connected exceptions under copyright law.

The case studies were selected in consultation with stakeholders, reflecting a need by scientific researchers and technology companies for a better legal understanding of what they do. They are designed to reflect a range of techniques and processes that underpin advanced data analytics, responding to the EU policy objective of supporting innovation in this field.

The three case studies are:

  • Data scraping for scientific purposes
  • Machine learning, in the context of Natural Language Processing (NLP)
  • Computer vision, in the context of content moderation of images

In parallel, we offer a thorough analysis of the policy rationale and legal context for the introduction of the two exceptions for text and data mining in the CDSM Directive (Art. 3 Text and data mining for the purposes of scientific research; Art. 4 Exception or limitation for text and data mining) which includes an analysis of how the right of reproduction (Art. 2 ISD) and its limitations (mainly Art. 5(1) ISD) interface with the overall regulatory framework of data analytics. This part of the Interim report is written as a self-contained scientific paper and will be added to CREATe Working Papers.

The Interim report (including case studies) is available here.

Margoni and Kretschmer's working paper entitled A deeper look into the EU Text and Data Mining exceptions: Harmonisation, data ownership, and the future of technology will be available here.

Workshop (27 May 2021)

This invitation-only workshop was organised as a collaboration between CREATe and the Urban Big Data Centre, both research centres at the University of Glasgow. The workshop sought to explore initial findings on the legal implications of data analysis with researchers and industry participants who use advanced data analytic techniques.

Workshop Programme

27 May 2021, 10:00 – 12:00, Online

10:00 – 10:05: Welcome and introduction to the day (Prof. Martin Kretschmer, CREATe and Prof. Nick Bailey, UBDC)

10:05 – 10:10: Data science needs law (Dr Andrew McHugh, UBDC)

10:10 – 10:55: “The Law of Data Scraping”, Dr Sheona Burrow, CREATe (15 min), with comments from Bartolomeo Meletti, CREATe (5 min) and Dr Andrew McHugh, UBDC (5 min); Q&A (15 min).

10:55 – 11:00: BREAK

11:00 – 11:55: “Data Scraping, Data Mining, Data Learning”, Dr Pinar Oruc, CREATe (15 min), with comments from Tobias Mckenney, Google (10 min) and Dr Richard Eckart de Castilho, Ubiquitous Knowledge Processing (UKP) Lab at the Technical University of Darmstadt (10 min); Q&A (20 min).

11:55 – 12:00: Concluding remarks (Prof. Thomas Margoni, CREATe and CiTiP).

Workshop Summary

Event summary can be found here as a blog post.

Images or short clips from the event

Option 1: Seek permission from everyone at the event for full recording (and we should crop and remove the talks before the audience joins the room)

Option 2: Short clips to be cropped from the presentations (Andrew, Sheona and Pinar – also Martin and Thomas?)

Option 3: Screenshots of the first slides of the presentations (Andrew, Sheona and Pinar)

Connected Projects

Paper by Dr Sheona Burrow “The Law of Data Scraping: A review of UK law on text and data mining”, which is part of the CREATe Open Science series and was supported by the ESRC Urban Big Data Centre (UBDC) at the University of Glasgow.

Abstract: Data is perceived to be a key asset in the digital economy. Many governments have been keen to promote and exploit data driven economies. Data scraping is a widely used technique that automatically extracts information from different (often online) sources, whilst data mining is the machine reading of data to identify useful information not immediately obvious on human reading. In 2014, the UK implemented a limited exception to copyright law for text and data mining (TDM). However, copyright is only one layer of legal protection available to ‘data’ and the protection of data has been the subject of a long-running tension between property based rights and concurrent protection for data owners in liability rules arising through competition and contract law. Maintaining an appropriate balance between protecting rightholders and users has remained problematic. This paper summarises the legal protection available in the UK for different types of data, and the (limited) interpretation of that protection by the UK courts. The analysis is situated in a review of the academic literature. Ultimately this paper will conclude that the layered protection for data is confusing for end users, and that the case law on the protection and exceptions available to those seeking to engage in TDM is limited and fact dependent.

Further Reading

[This is a preliminary list, tracking the development of our own work.]