

Report: Text and Data Mining, data ownership and open science

Posted on 20 December 2019 by Admin

Guest post by Jiarong Zhang (PhD Candidate, CREATe).

The 2019 International Open Access Week (21 to 27 October 2019) was dedicated to the theme ‘Open for Whom? Equity in Open Knowledge’. Within this framework, OpenAIRE, in collaboration with EOSC-Hub, organised a week of webinars covering key issues such as cost transparency of Open Access publishing, Research Data Management, Open Science Policies, Plan S compliance for Open Access Journals, and Inclusive Science. Under the ‘research data’ theme, Dr Thomas Margoni (CREATe’s co-director and senior lecturer in IP and Internet Law) gave a presentation on ‘Text and Data Mining, data ownership and open science’.

Thomas’ presentation is based on his seminal work on Open Science developed at CREATe through a number of research projects and initiatives, including OpenAIRE, OpenMinTeD and the recent Information, (Research) Data and Open Science Workshop organised within the 2019 CREATe Symposium. The presentation focused on the question of whether, when and how non-personal data is owned, and why this matters for open science, from both a copyright theory and a copyright law perspective.

Thomas observed that it is generally accepted that copyright theory does not protect mere data (such as facts, principles, ideas, methods of operation, etc.), as these are the fundamental elements of scientific knowledge, which should be available to everyone and which lack originality. However, data, or certain aspects thereof, may be protected by copyright law, or by other areas of law such as trade secrets or contracts. Additionally, in the EU, it is important not to forget the Sui Generis Database Right, which protects non-original databases built with a substantial investment in obtaining (not creating!), verifying and presenting data against substantial extractions. This is an EU peculiarity that does not exist in other innovation-oriented systems such as the US, Canada, Singapore or Japan.

The presentation argued that in the EU, due to this overprotection of non-personal data, innovation is often halted by obstacles (if a researcher needs to clear rights before performing an experiment, this is an obstacle that introduces transaction costs), which causes many initiatives in areas such as text and data mining and machine learning to relocate to legal systems that are more innovation friendly. It should be kept in mind that this legal framework is in all likelihood not intentional, in the sense that when the current laws were drafted (the InfoSoc Directive dates from 2001), the technology we are discussing now simply did not exist. At the time, the “new” technology was the Internet, and a specific exception for using the Internet was created (Art. 5(1) InfoSoc). But this way of legislating, i.e. creating a “high level of protection” through broadly harmonised rights (e.g. Arts. 2, 3 and 4 InfoSoc) and carving out specific “exceptions” for specific cases, amounts to vetoing the future. If by default everything is not reusable unless specifically allowed, then only what was already known at the drafting table (i.e. the past) could be allowed, whereas what was not yet known (i.e. the future) by definition could not be permitted by way of a dedicated exception.

Contrast this with those countries that keep an “open door” to the future by way of a flexible norm that can be continually reinterpreted to balance the interests of right holders with those of innovators and the public at large. The US has fair use, a standard that has recently been adopted in a number of other countries. Japan enacted a rather generous exception for text and data mining almost ten years ago. The EU has only very recently adopted an exception that is, however, very limited in its scope, its beneficiaries, and its relationship to contracts and technological protection measures (Arts. 3 and 4 CDSM Directive).

By way of illustration, these new articles only cover the right of reproduction, not the rights of distribution/communication to the public and adaptation. Beneficiaries are limited to research organisations operating for research purposes, which excludes individuals, micro-enterprises and SMEs (Art. 4 opens a small door here, but one that right holders can close by “reserving” the right to TDM). The role of technological measures (whether protection or integrity measures) remains unclear.

Finally, the presentation offered an overview of some of the tools and guides developed by OpenAIRE to help researchers navigate these waters. These guides are intended to address common issues researchers are likely to encounter in data management. The webinar ended with a Q&A session, with interesting questions from the public asking, among other things, how to balance research data utilisation and the public interest, whether universities should own data that their employees have produced, what the differences are between the protection of personal data and research (non-personal) data, and what influence Brexit might have on data management for UK projects funded by the EU. The webinar, Q&A and slides are available at