Critiquing Google-PRS (2012) Study of Copyright Infringement

Critiquing a report commissioned by Google and PRS (2012) titled “The six business models for copyright infringement”. The report is available to download from here.

Critique by Nicola Searle, Economist at the Intellectual Property Office

I am an economist and I’m going to talk about this report commissioned in 2012 by Google and PRS titled “The six business models for copyright infringement”. My understanding is that it was largely written by the former chief economist of PRS, Will Page, who’s both an economist and a lobbyist, in conjunction with the external consultants they used. They got contributions from most of the collecting societies, so in terms of having buy-in from the stakeholders, which is one of the key things in the lobbying world, that’s quite important. And they also had the assistance of the Department for Culture, Media and Sport. So, in terms of actually bringing together two very disparate firms, Google and PRS, it’s actually quite an interesting study. So I’m going to run through the report itself and then I’m going to critique it.

To start with, the report doesn’t have research questions, it has goals. One of the goals is a segmentation driven investigation of sites that are thought, by major rights holders, to be significantly facilitating copyright infringement. And given that these are major rights holders, or rather these organisations who commissioned the report at least on the collecting society side are major rights holders, you can see why this was an interesting study for them. And secondly, the aim of the study is to provide quantitative data to inform the debate around infringement and enforcement.

The report’s major findings are the classification of six major business models, and I’ll come back and work through them very quickly after I go through the methodology a bit more. So they have listed i) live TV gateways, ii) peer to peer communities, iii) subscription communities, iv) music transaction sites, v) rewarded freemium sites and vi) embedded streaming sites – those are the six that they found. The methodology starts off with stakeholder interviews, which is not unusual in this type of report, and based on those stakeholder interviews they identified about 150 of the top infringing sites. They took a number of steps to reduce the selection bias, but that’s how the sites were identified. From there they went to the websites and did both manual and automatic data collection, then a statistical segmentation analysis, which is often called a cluster analysis, and then interpretation.

The stakeholder interviews again represent a lot of the collecting societies and lobbying groups, so these were the ones who provided the list of infringing websites. That list was then narrowed down to the top 150, which were cross-referenced and looked at in terms of size. So these are basically the top 150. I think their estimate was that there were about 150,000 of these sites as a whole. The data itself ended up being 153 sites from the stakeholders, and that was their training data. They then did another analysis on 104 websites, which were independently selected without using the stakeholder interviews, to validate the data. They got 101 metrics per observation, which is quite a lot, and they also did some revenue estimates. The revenue estimates, and you can look through the calculations in the paper, rely on quite a lot of assumptions to estimate the amount of money that some of these sites are potentially making.

They have also got quite a few dummy variables, so those are ‘yes / no’ type variables covering things like the type of pricing structure these sites use, media categories, formats etc. And those would have been primarily done based on the researchers’ observations, so you would look at a site and decide whether it was doing this or that, so whether it’s offering games or it’s not offering games. So those are kind of interpretive in some ways. And they also used link statistics, and that was through a number of different sources. I’ll come to the sources in a minute. So quite a variety of different types of data that they’ve gathered here.
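As a rough illustration of what coding observations into dummy variables involves, here is a minimal Python sketch. The site names, the `pricing` field and the `offers_games` field are invented for the example; they are not taken from the report, and the report's own coding scheme may well differ.

```python
# Hypothetical observations: names, fields and values are invented for illustration.
sites = [
    {"name": "site_a", "pricing": "advertising", "offers_games": True},
    {"name": "site_b", "pricing": "subscription", "offers_games": False},
    {"name": "site_c", "pricing": "advertising", "offers_games": False},
]


def encode_dummies(records, categorical_field, boolean_fields):
    """Turn one categorical field and some yes/no fields into 0/1 columns."""
    # Every distinct category value becomes its own yes/no column.
    levels = sorted({r[categorical_field] for r in records})
    rows = []
    for r in records:
        row = {
            f"{categorical_field}_{level}": int(r[categorical_field] == level)
            for level in levels
        }
        # A researcher's yes/no judgement is stored directly as 1 or 0.
        for field in boolean_fields:
            row[field] = int(r[field])
        rows.append(row)
    return rows


encoded = encode_dummies(sites, "pricing", ["offers_games"])
for site, row in zip(sites, encoded):
    print(site["name"], row)
```

Each 0 or 1 here records a judgement someone made while looking at a site, which is why the variables are interpretive rather than measured.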

So the collection, and this is quite interesting because, I’ve seen a couple of studies looking at Google data and it is fairly new and Google holds a lot of data. They view data as a massive asset, which is one of the reasons they don’t share it that often. So you can see that from Google they took things like page views, ad planner data and brand rank, so a lot of information that wouldn’t necessarily be available to the public. Then we also count our media compete data, which looks at website referrals, so in terms of how your website actually ranks from search engine optimisation it really is quite important in terms of how people are actually finding the website. So that’s more on the consumer side. Alexa, the reputation scorer, so how well your website is ranked internationally. Then, they’ve got the WHOIS lookup, so that’s kind of identifying locations, so geographical, who actually owns the website. And then we have Team Cymru Community Services, which apparently, somewhere in Wales, they’ve got a list of country codes associated with all of the different websites, in terms of IP addresses and then some data on the top level domain and then a lot of manual data collection.

Statistical analysis – the report did various segmentation approaches using different ways of doing cluster analysis and they got a cut off value of six clusters. So you can see on Page 5 of the report, and I’m actually going to come back to this diagram in a minute, but you can see they’ve got six, so these are the different segments that they’ve done, you can see them graphically. And they used various ways of doing this.

The first one was a live TV gateway and that’s about 33% of the sample and this is essentially streaming, so you’re looking at live, free to air and pay TV. It’s predominantly advertising funded. It’s, according to them, one of the fastest growing sectors and it tends to be US based. I should note that it’s interesting that they’ve looked at the geography of this. I suspect it’s partially because they want to identify who to lobby.

Peer to peer communities is about 19% of the sample, a range of content types so really all over the place. Advertising and donation funding is the main revenue factor and European based.

The subscription community is 5% of the sample and provides a range of content and as the name suggests is subscription.

The music transaction site is a little bit more of a traditional paying site, so 13% of the sample, mostly music but we’re starting to see more games and e-books and it’s transaction based.

Rewarded freemium is 18% of the sample, mostly music and freemium is a wonderful phrase that is the idea that you have a basic free service funded by advertising and then a premium service by subscription and that’s mostly in the Netherlands and the US.

And then finally, embedded streaming, which was 12% of the sample. It’s a little unclear on the content because it’s a bit of a mix, and it’s mostly advertisement with some chargeable donations. Interestingly enough, on this side the content contributors who upload are actually financially rewarded, which is interesting given what we said about the incentives to actually upload. And it’s mostly in the Netherlands.

So when I come to the critique of this, first of all I think, and this is the official line from the Intellectual Property Office on this, that the goal or rather the biggest value of this piece of research is the fact that they got Google and PRS to work together, and that’s huge. Two groups that are pretty much opposed working together is something that I think we’d all want to see more, the kind of sharing of data, the insights we can get from that, aren’t going to happen easily in a research arena outside of those corporations – there’s a lot of value on that. Hence, the official line from the Intellectual Property Office is that we welcome this kind of research and want to see more of it.

While it is pretty impressive that they got these two corporations to work together, and I think some of the limitations of the research probably stem from that, the fact is that we have a back story here where the creators are accusing Google of profiting from these infringing websites.

The other point is there’s no literature review in it and that’s not actually that unusual in this type of report. Often the literature review ends up as an appendix or scrapped altogether because the point, the audience is not academics, it is people who want to look at it and get bullet point facts and figures. So that’s kind of why it’s not there.

It’s also interesting that the background is not there, and the tension between the contributing and commissioning parties is not there. I mean there’s not really discussion about why we’re interested in looking at who’s doing what. And so that background is not there.

The data driven approach, I’m not entirely convinced that they needed to do this kind of data approach that’s very emphasised in the report. The big idea that this is data driven, quantitative and I suspect that’s partially the fault of economists for this emphasis on data as being so much better than qualitative research. So I do wonder if their focus on the data driven is more of a strategic effort rather than an actual research effort. So that obsession with data is a bit interesting.

The stakeholder interviews themselves again would be part of a traditional research process for this type of report and you wouldn’t necessarily find this in more academic research because the stakeholder interview is essentially trying to get everyone on board to make the report happen. So that’s a necessary function of it. It does mean that when we got to the actual data analysis and identification of the sample that you’re starting from a slightly biased perspective because they’re going to have certain sites that they’ll be looking at. On the other hand you could argue quite easily that they’re the best people who actually know where these sites are.

Now the data itself (and I understand that Richard Mortier is going to have quite an in depth look at this). I should say I get very excited about this kind of data, just in terms of the possibilities and what we can see on the internet, it is quite possible that some of the data that we’ve not been able to use before is there, so kind of the answers and the possibilities are huge. So that’s quite exciting.

Data does have its flaws and I think particularly in this study, you can see it. There’s a lot of assumptions made, so when I mentioned the revenue data, they’ve made a lot of assumptions for that to happen. They’ve got a mix of qualitative and quantitative measurements that they’ve translated into dummy variables which in some cases are probably questionable. Just to give you an example, the Alexa data on your website rank is notoriously … it’s not very good. Quite often the website statistics that the owner sees will be very different from the Alexa data, so that’s just one indicator of the problem of using this type of data.

So it’s a bit problematic from the data side. I’m very excited about the idea of crawling websites for data, because there’s a lot of stuff there that we could be doing, but I suspect again, for economists that’s a fairly new way of looking at data and I suspect there’ll be a lot of problems looking at that. They’ve also got some issues with missing data and they’ve made some assumptions to account for that. So there’s quite a few problems from the data in general.

When we come to the statistics, it’s actually a bit more problematic. The statistical analysis they’ve done is a classification analysis as opposed to a prediction analysis and economists don’t tend to do classification analysis so I’m going to be particularly suspect of this methodology.

So what it’s actually telling us, I think we could debate that, is basically what these sites look like. It’s not actually telling us about what it means, what kind of revenues they’re actually achieving, it’s just sort of labelling them and I think there’s a bit of a challenge there. And if we come back to the dendrogram on Page 61, you can see they’ve got some challenges here. At the very bottom of the dendrogram, if we in this room were the observations, each one of us would be one observation, so we would each be one point, one little line at the bottom. So there’s absolutely no grouping. At the very top, you would have all of us being all humans, for example, and in the middle you would have people grouped together by things like hair and eye colour, for example. So you can see as you go from the bottom to the top, you actually get much more grouping. The problem with this dendrogram is you can see that there really isn’t that much space. The space in the dendrogram between groupings actually is a measurement of how far apart they are in terms of distance statistically, and in this one you can see there’s really not much space between them. So they’re actually really bunched up and that suggests that really the cut off of six is a little bit arbitrary, and that’s one of the critiques I think economists would have of this statistical analysis. You often find that there’s a lot of subjective decision making that goes into the statistical analysis. I also suspect that one of the reasons they have this problem is because they have so many variables per observation. So it’s a little bit not convincing from the statistical side.
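To make the point about space in a dendrogram concrete, here is a minimal Python sketch of agglomerative clustering that records the distance at which each merge happens, which is the height a dendrogram would draw it at. It assumes single linkage and Euclidean distance on made-up two-dimensional points; the linkage method and distance measure used in the report are not stated in this talk, so treat both as illustrative assumptions.

```python
import math
from itertools import combinations


def single_linkage_heights(points):
    """Merge the two closest clusters repeatedly; return each merge distance."""
    clusters = [[p] for p in points]  # bottom of the dendrogram: one point each
    heights = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance is the closest pair of members.
            d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
        heights.append(d)
    return heights


def largest_gap(heights):
    """Biggest jump between successive merge heights."""
    return max(b - a for a, b in zip(heights, heights[1:]))


# Hypothetical data: two tight, well separated pairs versus evenly spaced points.
separated = [(0, 0), (0, 1), (10, 0), (10, 1)]
bunched = [(0, 0), (1, 0), (2, 0), (3, 0)]

print(single_linkage_heights(separated), largest_gap(separated_heights := single_linkage_heights(separated)))
print(single_linkage_heights(bunched), largest_gap(bunched_heights := single_linkage_heights(bunched)))
```

In the separated data the last merge happens far above the earlier ones, so there is an obvious place to cut. In the bunched data every merge happens at the same height, so any choice of cut-off is as defensible as any other.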

In addition, they used something called a Random Forests analysis which I’m actually not that familiar with. But interestingly enough, it kept having TM after it. Turns out, they’ve actually trademarked it for software, which I thought was interesting. But the issue is that they’ve got a lot of categorical variables in here and you would think that that would actually lead to better clustering and in this case it’s not really shown that, as I say. Not always, but you would think you’d get a better clustering and you’re not getting that. So I think there’s a lot of problems with the actual analysis from that perspective. So again, it’s not a very parsimonious approach. There’s a lot of data in there that’s not really being looked at.

Finally, I’d say from the interpretation side, the idea that there really isn’t a theoretical basis for the differences between all the classifications that they’ve ended up with. So when we have six, it’s a little bit arbitrary. They’ve not explained why they’ve cut off between the six. The statistical analysis supporting it isn’t necessarily that good so it’s just a little bit not convincing from that perspective. But again, I should say, the overall point is they actually got PRS and Google to work together, and that’s really impressive.


Critique by Dr. Richard Mortier, Transitional Fellow in Computer Science, HORIZON / CREATe, University of Nottingham

I’m Richard Mortier and I’m from a computer science background. I didn’t like this report as I was reading it. I didn’t think there was that much actual content in it, which I could really take away and believe. Although I guess in some sense I’d missed the point that if it was just getting them to work together, it was a good thing. I think my problems with it really kind of fall under three main areas.

The first was in concept that I wasn’t very convinced by the way they set about getting their initial data set of which of the websites to go and look at. It seemed that just picking 150 or so and then validating them with another 100 was almost like a cherry picked approach. It was quite a small data set as well. It felt to me that there would be a lot more that could be gathered out there and one of the things about doing this kind of data collection and data analysis with internet related things is you really can go to the scale if you want to. You don’t have to do so much by hand. There’s a lot more automatic processes that you could use for that.

Secondly, it wasn’t quite clear to me how they’d gone about collecting a lot of the data they collected. They’d done some scraping, and some manual analysis where the scraping of the data off the websites using scripts hadn’t worked so well because of the particular types of tools they were using. They could have picked better tools to do that with. And I think that if they’d used better tools they could have done a much more consistent and much more temporal analysis, and actually looked at and certainly presented in much better detail some of the trends across this data as well. They didn’t have to do almost a snapshot approach, which again is another thing, that when you’re doing this kind of data collection and data analysis in networks, you can do this kind of thing automatically and over long periods of time. Some of the work I’ve done in the past, when I was at Microsoft, we did a data collection off their corporate network and that ran for about three weeks and we could get all the traffic off the Cambridge part of the Microsoft corporate network over three weeks and analyse that for trends. When I worked at Sprint Labs, looking at their backbone network, the data collection that was running for that I think ran for four or five years, and you really can see trends in the data there. It’s quite possible to do that sort of work with the kinds of tools and technology we’ve got today. We don’t need to do this kind of snapshot, manually curated approach and really take that kind of intervention over the data set.

So I thought initially I wasn’t that convinced by the starting point that they went to, I wasn’t quite sure what they were going to find by starting from there and then gathering all this data about it. I wasn’t convinced by the tools and the techniques they’d used to gather a lot of that data. I also had, I guess I would have hoped that they would have done more, particularly given that Google were involved with this, it felt that maybe Google could have provided more and could have done a better job of providing them with seed data sets. Google have all the search index information they’ve got there. I mean, they could do more with that, I think, to strengthen this data collection. The other thing about that was that because of where they started, it was very much where they believed to be infringing sites and there was almost nothing about actual use, actual infringement that was going on through these sites. And I thought that was perhaps a more interesting thing to go and look at. So I had some problems with the concept of it, I had some problems with the data that they gathered.

Thirdly, I had particular issues with the presentation of the results they had throughout. So things like that are fairly standard. I thought the six (categories that they decided) was completely arbitrary from the data they had. They basically seemed to base the choice of six segments on, there’s a plot they’ve got of how the cluster tightness is going down and it’s slightly steeper at around six, but it’s really no more than slightly steeper, it’s not really a very kind of oh, obviously it’s there, it must be six, kind of result, even by eye.
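One way to put a number on how obvious a bend in such a plot is: for each candidate number of clusters, compare the improvement in tightness gained by getting to that number with the improvement gained by going one further. The Python sketch below does that on two invented curves; the tightness values are hypothetical and are not read off the report's plot.

```python
def elbow_strength(tightness):
    """For each interior k: (improvement going into k) minus (improvement leaving k)."""
    ks = sorted(tightness)
    bends = {}
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_in = tightness[prev_k] - tightness[k]
        drop_out = tightness[k] - tightness[next_k]
        bends[k] = drop_in - drop_out  # large value = sharp elbow at k
    return bends


# Hypothetical within-cluster spread by number of clusters (lower = tighter).
clear_curve = {1: 100, 2: 60, 3: 20, 4: 18, 5: 17}
gentle_curve = {4: 50, 5: 44, 6: 38, 7: 34, 8: 30}

clear_bends = elbow_strength(clear_curve)
gentle_bends = elbow_strength(gentle_curve)

print(clear_bends)   # sharp bend at k = 3
print(gentle_bends)  # only a slight bend at k = 6
```

In the first curve the bend at three clusters dwarfs everything else. In the second the largest bend is at six, but it is tiny relative to the overall range of the curve, which is the kind of result that does not settle the choice by eye.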

I had particular issues with a lot of the charts they present. They do scale normalised comparisons within each of the metrics and they present these in groups on charts and they join up the data points, having explicitly said before they begin that section, that you can’t do comparisons between the different metrics because they’ve normalised them all separately. And so that seemed to me a very odd and misleading thing to do, to then explicitly draw your eye to the comparison between the things after having said you can’t compare them. So it was one of these things where the presentation seemed very odd.
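A small Python sketch shows why joining up separately normalised metrics misleads. It assumes min-max scaling within each metric, which is one common way to normalise; the talk does not say which scaling the report used, and the metric names and values below are invented.

```python
def min_max(values):
    """Rescale one metric to the range 0..1 using only that metric's own values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw values for three segments (A, B, C) on two unrelated metrics.
raw = {
    "monthly_page_views": [2_000_000, 500_000, 1_000_000],
    "percent_with_ads": [90, 85, 80],
}

# Each metric is normalised separately, as in the charts being criticised.
normalised = {metric: min_max(values) for metric, values in raw.items()}

for metric, values in normalised.items():
    print(metric, values)
```

Segment B comes out at 0.0 on page views and 0.5 on adverts. A line drawn between those two points suggests B is stronger on adverts than on traffic, but the two numbers are positions within two unrelated ranges. The advert metric only varies between 80 and 90 per cent, yet it is stretched across the whole axis.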

The other thing that I disliked about the presentation and maybe this is coming from, it’s a different goal almost in the sense that this was a report being put out, it wasn’t an academic paper trying to be published. But from an academic point of view, there’s certainly been a move in the network measurement community with which I’m most familiar, that you have to be very clear about obviously how you’ve collected your data, how you’ve analysed your data, but to the extent now that at least one of the big network measurement conferences, you’re not allowed to have your paper entered for the award for the conference for the best paper if you haven’t made your data set public. So you have to be publishing the raw data that you used. And again, that was something which I felt this was rather opaque in how they’d actually gone about the details of the collection and the analysis and on what grounds they were coming to some of their conclusions. Perhaps that’s because of sort of where it’s coming from and the fact that it is a report of this type rather than an academic paper, but again that was something that didn’t gel well for me.

In defence, certainly from the kind of work I’ve done in the past, I find this sort of classification study, to be a useful thing to do. There is sometimes benefit, particularly in very complex systems of just describing them in some way and describing how they really are, how things really are, rather than the kind of intuitions people may have about how things work. So in that sense I didn’t have so much problem with that as a goal if you like, I just didn’t particularly like the way they’d gone about it in this particular study.

Having said all that, it’s not to say that I think doing this well is easy. Designing good measurement studies and carrying them out very rigorously is a very difficult thing to do. I think it does take a lot of time and effort and real investment of thought to work out how to do it well and to make sure you’ve gathered the right data, so I’m not trying to say that I think it was easy and they did a bad job of it. I just think they could have done a much better job than they did – I wasn’t that impressed with the results they came out with.

General Comments from the floor.

Prof. Martin Kretschmer (MK): So given that a lot of this data is privately held, do you see any strategies obviously using scraping or some other way of overcoming that problem, institutional problem in our area, contested policy where you can’t necessarily, can’t do a collaboration?

Dr. Richard Mortier (RM): I did try and think a little bit about how I might have gone about trying to do this myself, trying to do it better. I really do think that a lot of the data that would be relevant to this would be held by Google and given that they were in the report, I don’t see a problem with them having done that. We’ve certainly seen in the measurement community, we’ve seen quite a lot of papers over the last two or three years starting to come out of Google and Facebook and a lot of these companies doing proper, or attempting to do proper, rigorous analysis of the data they have.

MK: And they have publication of the data sets?

RM: So in those cases they tend not to agree to the publication of the data set, at least in its raw form, and therefore they don’t get given the awards for the papers. But they’re willing to do that, essentially, they’re willing to make that trade. The point I was making there was simply that there is definitely a movement towards trying to encourage that sort of thing to come out and I have heard people suggest that it should be that you have to publish the data, or it should be held much more like I believe the psychology community and so on do, where you have to hold your data for a period of time and allow other people to interrogate it, I think. Is that right? Somebody who’s in psychology can tell me whether that’s wrong or not.

Commentator: Yes, that’s right.

RM: So there’s sort of more of a move within computer science to try and become more like some of these more traditional sciences, in approach. So I think I really would have expected Google to do a better job of trying to help them use that data. I’ve wondered whether other things that could have been done, if they’d talked to perhaps network operators – this is again, personal bias, I have most experience working with them – but the network operators might have been able to give them some insight into where to go to start looking in terms of what actual use was going on, because the network operators can see to a reasonable extent what you’re doing in terms of, or you could imagine looking from that vantage point and trying to see what people were doing and then using that to see what was going on and then taking that sort of where people were visiting, going and scraping those sites and then doing some analysis on those sites to see what was going on. So I thought there were other ways in that maybe could have been tried, that might have been more successful.

Prof. Lilian Edwards (LE): Can I quickly ask a boring legalistic question? It isn’t really a methodology question but I think it is germane to using these kinds of methods in future and I’m fascinated, and also it’s unfair because I ought to be asking the people who wrote the paper but they’re not here. But I’m wondering if Nicola might know something about it, basically, from the way you’re talking. If I picked it up correctly, data was used about the advertising revenue that these sites generate, presumably from Google Adwords. Is that right?

Dr. Nicola Searle (NS): No, they actually, it’s not actual, well – they’ve got Ad Planner, so that’s a wide range of metrics, but it’s not, they had to use average value for a lot of that and that is probably the value of the website to potential advertisers as opposed to actual revenue. The revenue itself, so I’m trying to find the actual page, they made assumptions. So monthly donation revenue, if donation, they basically made calculations based on the unique visitors. So they’re assumptions, it’s not factual data, proxies.

LE: So in fact, all the important parts of this study seem to me to be estimated.

NS: Yes, a lot of it is.

LE: You know, whether it’s an infringing site and whether it makes money, right? Because my follow up comment, which is now probably irrelevant, is if they were using, which clearly would be incredibly valuable information and very sensitive and very confidential, if they were using information from Google that only Google would know about how much money these sites actually made, then my question was can Google actually choose to give that data without the consent of the website, because surely no infringing website would ever agree for that data to be passed on, so I’m very curious about the internal commercial confidentiality and ethics involved.

Next Commentator: One of the other things is that for truly infringing websites and people who visit them, for example, the first thing you do is switch off all the crap including adverts. So as a data sample it is almost entirely spurious, even to use the Google advertising information because anyone with any nous has already switched it all off. In terms of the data set here, very questionable, very questionable, and I really like the talk because you pointed out that we got PRS and Google to work together and in terms of the one thing that comes out of this that’s interesting, it’s that, but I take everything else with a pinch of salt.

NS: I just want to add, as you’ve just said, Google may not have all the data because the ads themselves could be placed by a third party, so it’s not necessarily Google revenues that they would have.

Next Commentator: That’s just what I was going to say, that Google, you can’t always assume that there’s only Google advertising, there is a lot of other advertising, fundamentally there’s also direct advertising, but there are advertising networks working in foreign countries that Google has no idea exactly what traffic’s going through them. So a lot of the pirated websites won’t be using Google because Google’s adverts just won’t pay out. They’ll be using these shadier companies to advertise, which are a lot harder obviously to get the data from.

RM: There was another thing there as well that they couldn’t do anything about people who were accessing these things using apps or other kind of mobile phone things. So obviously if you’re just browsing them on your phone that’s still trackable or at least as trackable as anything else would have been in the way they did this. But they were not doing anything to do with any smartphone apps or anything and that potentially would be a different and differently growing sector.

NS: One quick example: for music transaction revenue, they’ve taken monthly page views times the transaction value of an album at the cheapest rate. And then the other one, advertising revenue, is basically clicks per mille times page views. I mean, it’s not actual revenue at all. It’s based on clicks and their estimate of what the ad value would be.
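The two proxies just described can be written out as a short Python sketch. The function names and every number below are hypothetical, and dividing page views by 1,000 before applying the per-mille rate is an assumed reading of the description given here; the report's own calculations should be consulted for the exact formulas.

```python
def music_transaction_revenue(monthly_page_views, cheapest_album_price):
    """Proxy as described: every page view is valued as one cheapest-rate album sale."""
    return monthly_page_views * cheapest_album_price


def advertising_revenue(monthly_page_views, value_per_thousand_views):
    """Proxy as described: an estimated per-mille (per thousand) rate times page views."""
    return monthly_page_views / 1000 * value_per_thousand_views


# Hypothetical inputs, not figures from the report.
views = 200_000
album_price = 2.00       # assumed cheapest album price
rate_per_mille = 1.50    # assumed advertising value per thousand page views

print(music_transaction_revenue(views, album_price))
print(advertising_revenue(views, rate_per_mille))
```

Both outputs scale directly with page views and an assumed rate, so they describe what a site might be worth under those assumptions rather than what it actually earned.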

Next Commentator: Just because it’s been mentioned a number of times, I’m not as convinced as some that we should be so impressed by Google’s participation in this exercise. I think from Google’s point of view, it wasn’t a particularly risky thing to do. They didn’t contribute a huge amount of data. There’s a sense in which there’s a bit of green washing going on, that Google can be seen to be engaged in the debate. I did notice in the report a lot of the implicit blame is towards things like payment providers and perhaps sectors that Google is not particularly engaged in, at least for the time being. I also wonder, in the same way that we saw in an issue like, say, net neutrality where it was framed as content providers versus the internet industry and then we were all very shocked when Google and Verizon started cooperating and we could see obviously there was a shared commercial interest there where users were no longer part of that conversation. So I do, I would sort of congratulate both PRS and Google on working together but I think I’d suggest that maybe it’s not as radical in terms of capturing the whole copyright debate as Google or indeed PRS would like to portray it as. I think both of them had a self interest in being seen to cooperate and I think from Google’s point of view, its legitimacy in the eyes of parts of the commercial creative industries, that’s good for Google. But I’m just not sure it’s entirely, it’s good for everybody in that way. It’s not harmful but perhaps it’s not such a big deal.