Report – UBDC and CREATe Workshop on the Legal approaches to Data: Scraping, Mining and Learning

Report by Dr Pinar Oruc, Research Associate at CREATe, University of Glasgow.

As part of the ongoing collaboration between CREATe and the ESRC Urban Big Data Centre (UBDC), an invitation-only online workshop on the legal implications of data collection and analysis took place on 27 May 2021. The presentations were followed by feedback from commentators and other workshop participants with law, digital humanities, and data science expertise.

Welcome by Professors Nick Bailey (UBDC) and Martin Kretschmer (CREATe)
Presentation ‘Data Science Needs Law’ by Dr Andrew McHugh
First session ‘Law of Data Scraping’ by Dr Sheona Burrow (CREATe), based on her recently published paper, comments from Bartolomeo Meletti (CREATe and Learning on Screen) and Dr Andrew McHugh (UBDC)
Second session on ‘Data Scraping, Data Mining, Data Learning’ by Dr Pinar Oruc, based on the work being developed for reCreating Europe, comments by Tobias McKenney (Google) and Dr Richard Eckart de Castilho (Ubiquitous Knowledge Processing (UKP) Lab at the Technical University of Darmstadt)
Closing remarks by Professor Thomas Margoni (CREATe and CiTiP)

Text & Data Mining. Original illustration by Davide Bonazzi for CopyrightUser.org

Various developments in data science have allowed the use of existing knowledge in new ways. Data scraping has made it easier to collect large amounts of data and is used in different sectors. Similarly, research on AI is advancing further in many countries, which increases the need for data. However, the law in the area has posed a challenge to the stakeholders wanting to use existing data. Copyright law is only one of the legal challenges, in addition to the sui generis database right, personal data protection law, contract law, confidentiality and competition law. This workshop was based on existing work that focuses on both the UK law and the EU law. The initial findings show that there is significant complexity in the area of data mining and machine learning, due to overlapping legal protection, ongoing developments in the technology and the different perspectives of the stakeholders. The goal of this workshop was to contribute to the ongoing dialogue and to reduce the legal uncertainties in this area.

Event summary

In the opening presentations, Professor Nick Bailey explained the research at the UBDC using new forms of data and described how their attempts at unleashing and interpreting this new data are stymied by legal concerns. Professor Kretschmer also welcomed the participants and introduced the activities of CREATe and the ongoing collaboration with the UBDC on the uncertainty and practical problems that the changing legal framework creates for data science researchers.

In his presentation, Dr Andrew McHugh explained UBDC’s research priorities and expressed their interest in using big data. Andrew introduced their research on the impact of AirBnB on the housing market and the larger research questions on housing policy. This project has required decisions on how to collect the needed data and whether they can rely on automated scraping of the data only, without seeking licenses. Given the uncertainties around the UK text and data mining (TDM) exception, UBDC has decided to cooperate with CREATe so that they can proceed with more legal certainty – which led to the commissioning of the paper by Dr Sheona Burrow.

The first session was opened by Professor Martin Kretschmer, who explained the background of the text and data mining copyright exception in the UK. Following the Hargreaves Review in 2011, UK policymakers decided that copyright needed to adapt to technological change and enable new research tools, especially for developing new data analytic techniques, and to make way for a more flexible approach to data mining. This led to the 2014 UK copyright exception for text and data mining in CDPA section 29A. As the UK was bound by EU law at the time, this exception is limited to non-commercial research purposes. The Hargreaves Review recommended that the UK should open discussions at EU level to reconceive data analytic exceptions, and indeed the EU introduced TDM exceptions in the Directive on Copyright in the Digital Single Market (CDSM Directive), now found in Articles 3 and 4 of the Directive. Due to the UK’s withdrawal from the EU and the decision not to implement the CDSM Directive, the UK can now choose to introduce a wider TDM exception. However, Martin noted that this much-needed copyright approach will only be useful if linked to the other applicable areas of law, so the law of scraping needs to be assessed in detail.

Dr Sheona Burrow presented her recent publication “The Law of Data Scraping: A review of UK law on text and data mining”. Noting that the law on text and data mining is complex, Sheona first introduced the legal regimes that are relevant for TDM: copyright law, sui generis database right (SGDR), data protection law, contract law, confidentiality, and competition law.

The legal protection is further complicated by the limited case law in this area, which is highly fact-dependent. To illustrate this issue, Sheona provided three different scenarios. First, sports databases receive limited protection apart from confidentiality: the existing cases usually turn on the sports data having been created rather than obtained, so the databases do not attract SGDR. Second, price databases are not protected either, but their use can be restricted by contract law, as illustrated by the Ryanair case. Third, location databases are further complicated by the personal data aspect of locations and will be affected by the GDPR regime.

After introducing the methodology and illustrating that both academic literature and case law are limited in the area of TDM, Sheona discussed four important cases (BHB v William Hill, Ryanair v PR Aviation, Racing Partnership Ltd v Done Brothers, 77M v Ordnance Survey) and their implications for copyright, SGDR, contract and confidentiality law. The upshot is that any TDM user should be aware of the problems caused by the overlapping legal protections, further muddied by unharmonised exceptions, fact-dependent case law, contractual override, and unclear open-licensing contract terms.

In response to Sheona’s presentation, the first commentator, Dr Andrew McHugh, mentioned UBDC’s concerns about how the variety of different legal regimes has limited their research activities. This is also affected by technological change in the area, as what is clarified now might not fit the technology that comes next. Embracing a risk-based approach could be the way ahead, but it requires ongoing communication between data researchers and law scholars.

The second commentator, Bartolomeo Meletti, referred to the BoB and TRILT archives held at Learning on Screen for educational use and the challenges that the layered protection for data poses to making these resources available for research purposes. Bartolomeo also highlighted his work for reCreating Europe on the development of best practices for documentary makers and the immersive heritage sector, and asked whether a code of best practices for the research community might facilitate the lawful use of data. Sheona responded that a code might be useful to inform ambiguous concepts embedded in the legislation such as ‘research’ and ‘non-commercial purpose’. According to Sheona, more clarity around permitted uses would also help rights holders avoid engaging in unnecessary licensing negotiations.

In the second session, chaired by Professor Thomas Margoni, Dr Pinar Oruc presented three different case studies broadly covering the technological processes in the field of AI. As part of the reCreating Europe task on input training data for machine learning applications, the team of Professors Margoni and Kretschmer and Dr Oruc are working to classify these technological processes in the light of copyright law, with the goal of properly identifying the various technical steps that may trigger copyright. The case studies focus on data scraping, natural language processing and computer vision. The presentation went through the stages identified for these technological processes and indicated the copyright concerns for them.

The case study on data scraping focused on three main stages: collection, processing, and analysis. Data scraping involves the collection of both protected and unprotected data, using methods such as screen scraping, web crawling or parsing. Scraping can place a strain on the target website and lead to the scraper being blocked, which scrapers may avoid by distributing the requests among different proxies. The collected data is then restructured, cleaned, and validated by the scrapers. This data is not usually visible in the project outputs, but the purpose of the research (for copyright exceptions) and the scraped websites’ terms of service (contractual override) will further determine whether there will be copyright law obstacles. While data scraping is addressed as a technological process on its own, it can be seen as a pre-condition to other types of data analytics.
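The three stages described above can be illustrated with a minimal Python sketch. It is not the workflow used by UBDC or presented at the workshop: the HTML snippet, the `listing` class, the `data-area` attribute and all function names are hypothetical, and the "collection" step parses an in-memory string rather than fetching a live website.

```python
from html.parser import HTMLParser
from statistics import mean

# Hypothetical page content standing in for a scraped listings page.
SAMPLE_HTML = """
<ul>
  <li class="listing" data-area="West End">£ 85 </li>
  <li class="listing" data-area="West End">£110</li>
  <li class="listing" data-area="Southside">n/a</li>
  <li class="listing" data-area="Southside">£60</li>
</ul>
"""


class ListingParser(HTMLParser):
    """Stage 1 (collection): pull (area, raw price text) pairs out of the page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._area = None  # set while we are inside a listing element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "listing":
            self._area = attrs.get("data-area")

    def handle_data(self, data):
        if self._area is not None and data.strip():
            self.rows.append((self._area, data.strip()))
            self._area = None

    def handle_endtag(self, tag):
        if tag == "li":
            self._area = None


def clean(rows):
    """Stage 2 (processing): restructure, clean and validate the raw rows."""
    records = []
    for area, raw in rows:
        digits = raw.replace("£", "").replace(" ", "")
        if digits.isdigit():  # validation: drop rows without a numeric price
            records.append({"area": area, "price": int(digits)})
    return records


def analyse(records):
    """Stage 3 (analysis): average price per area."""
    by_area = {}
    for record in records:
        by_area.setdefault(record["area"], []).append(record["price"])
    return {area: mean(prices) for area, prices in by_area.items()}


def run_pipeline(html):
    parser = ListingParser()
    parser.feed(html)
    return analyse(clean(parser.rows))


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_HTML))
```

In this sketch only the aggregated averages leave the pipeline, which mirrors the observation that the scraped data is not usually visible in the project outputs.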

Both the natural language processing and computer vision case studies were handled in four stages: collection, pre-processing, training, and the use of trained models. For both processes, data (text or images and videos) is collected either through scraping or otherwise. The collected data goes through methods of cleaning and forms of standardisation. It is then used for training, which is either supervised or unsupervised, the former requiring further annotation and human input. The outcome is the trained model, which can perform a variety of tasks such as translation or content moderation, depending on what the training was targeting.
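As a rough illustration of these four stages, the following Python sketch trains a toy supervised text classifier. It is an assumption-laden simplification rather than any method discussed at the workshop: the labelled corpus is invented, the "model" is a plain word-count table per label, and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Stage 1 (collection): hypothetical texts. The labels are the human
# annotation that supervised training requires.
CORPUS = [
    ("Great flat, lovely host", "positive"),
    ("Lovely area and great value", "positive"),
    ("Dirty room, rude host", "negative"),
    ("Rude staff and dirty kitchen", "negative"),
]


def preprocess(text):
    """Stage 2 (pre-processing): lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def train(corpus):
    """Stage 3 (training): count how often each word appears under each label."""
    model = defaultdict(Counter)
    for text, label in corpus:
        model[label].update(preprocess(text))
    return model


def predict(model, text):
    """Stage 4 (use of the trained model): pick the label whose words match best."""
    tokens = preprocess(text)
    scores = {
        label: sum(counts[token] for token in tokens)
        for label, counts in model.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(CORPUS)
    print(predict(model, "The host was lovely"))
    print(predict(model, "dirty and rude"))
```

Note that the trained model in this sketch holds aggregate word counts per label, not the collected texts themselves; real models are far more complex, but the separation between input data and trained model is the same in outline.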

From a copyright law perspective, these technological processes face similar problems regarding the status of the necessary data, the scope of the reproduction and adaptation rights, and the applicable copyright exceptions, which will be further expanded on in the reCreating Europe output.

In response to this presentation, the commentators further discussed the complexities in the area of machine learning and the availability of data. Tobias McKenney noted that the current discussions around AI focus on concerns such as explainability, fairness and bias, but that it is also necessary to recognise the impact of copyright in limiting the availability of data. Being pushed to use only freely available data in machine learning might be contributing to biased outputs. Similarly, not being able to store and examine the datasets (due to copyright) also reduces the explainability of the AI. Furthermore, Tobias drew attention to (i) the implications of storing the training data on the cloud for intermediary liability, and (ii) the uncertainty for companies due to regional legal differences in obtaining the data and in the applicable exceptions.

The second commentator, Richard Eckart de Castilho, shared his views on the legal uncertainty in the field of Natural Language Processing. Although machine learning models can potentially be built to make the access to data more local and more fleeting, doing so can cause significant strain on third-party data sources. All of these attempts could still be in vain, because AI researchers are not clear on what is temporary enough to benefit from the copyright exception in InfoSoc Directive Article 5(1). Richard also observed that copyright law does not address the abstraction of data: throughout the training, the AI will be restructuring the input data to the point that it is too abstract for the human eye to recognise the input data. That in turn raises the question of whether the AI output should be treated as a new original work.

The audience discussions also included the implications for trade secret law, patent law, machine-generated music, originality implications of structuring scraped data and the risk of injunctions completely eliminating the AI trained on partially copyright protected material (as the individual protected material cannot be extracted from the trained model) – thus highlighting again the importance of an interdisciplinary approach in this field.

In his concluding remarks, Professor Margoni pointed out how the workshop discussion clearly brought to the fore the need for a clearer, broader and more balanced framework in the field of data analytics in the EU. From a European Union copyright law point of view, this is needed for both internal and external reasons. Regarding internal reasons, it is important to strike a proper balance in the consolidating digital single market among the plurality of interests and actors that are affected by the law in this field. These include those identified by the CDSM (mainly publishers and research and cultural heritage institutions) but also those left out from the Directive, such as journalists, private researchers, citizens (e.g. citizen science), EU firms, start-ups and SMEs and, more broadly, technological development in fields such as AI and ML. Externally, the EU legal framework has to measure itself against that of other “competitors”, i.e. legal systems that compete with the EU in the creative and cultural industries as well as in the technological sectors (e.g. US, Japan, Canada, South Korea, Singapore, etc), which have taken a more innovation-friendly approach to data analytics. What this means for the future of the creative, cultural and technological sectors in the EU and for issues of regulatory competition in key areas such as a transparent, fair and accountable AI framework remains an open question, which demands close scrutiny also in the light of the (many) legislative proposals recently issued by the EU legislature.

You can find further information about this event and our upcoming work on Text and Data Mining and Machine Learning on our new (beta) Resource Page for Legal Approaches to Data: Scraping, Mining and Learning.

Our next steps will include sharing new publications, exploring further collaboration between UBDC and CREATe and organising a second (public) workshop. If you are interested in future developments, please keep following our blog!
