02.12.2013

Data and text mining

11.57 Data and text mining has been defined as automated analytical techniques that work by ‘copying existing electronic information, for instance articles in scientific journals and other works, and analysing the data they contain for patterns, trends and other useful information’.^[69]

11.58 Data and text mining is becoming increasingly important in a number of research sectors, including medicine, business, marketing, academic publishing and genomics.^[70] Employing technology to mine journal databases has been referred to as ‘non-consumptive’ research, because it does not involve human reading or viewing of the works.^[71] Researchers and research institutions have highlighted the value of data mining in paving the way for novel discoveries, increased research output and early identification of problems.^[72]

11.59 At the commercial level, the ability to extract value from data is an increasingly important feature of the digital economy. For example, the McKinsey Global Institute suggests that data has the potential to generate significant financial value across commercial and other sectors, and become a key basis of competition, underpinning new waves of productivity growth and innovation.^[73] The Cyberspace Law and Policy Centre submitted that data mining

has the potential to grant ‘immense inferential power’ to allow businesses, researchers and institutions to ‘make proactive knowledge-driven decisions’. There are significant potential commercial benefits—data mining has the potential to improve business profits by allowing businesses to better understand and predict the interests of customers so as to focus their efforts and resources on more profitable areas.^[74]

Non-expressive use

11.60 There has been growing recognition that data and text mining should not constitute infringement because it is a ‘non-expressive’ use. Non-expressive use rests on the fundamental principle that copyright law protects the expression of ideas and information and not the information or data itself. For example, consider a computer algorithm employed to search through a text to obtain metadata, which discovers two facts about Moby Dick:

first, that the word ‘whale’ appears 1119 times; second, that the word ‘dinosaur’ appears 0 times. While a whale is certainly central to the expression contained in Moby Dick, this data is not. Rather, metadata of this sort … is factual and non-expressive, and incapable of infringing the rights of copyright holders.^[75]
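The kind of word count described in the quoted passage can be sketched in a few lines of Python. This is an illustrative sketch only, not the method used by the authors of the brief: the function name `word_counts` and the sample sentence are hypothetical, the sample is a placeholder rather than the text of the novel, and the exact figure obtained for a real text (such as the 1119 quoted above) would depend on the edition used and on how words are split.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Return case-insensitive counts of each word in the text.

    Assumption: a 'word' is a run of ASCII letters or apostrophes.
    Other tokenisation rules would give different totals.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Placeholder sentence standing in for a full work (hypothetical input).
sample = "The whale surfaced. A whale! No other whale was seen."

counts = word_counts(sample)
print(counts["whale"])     # occurrences of 'whale' in the sample
print(counts["dinosaur"])  # a Counter reports 0 for words that never appear
```

The output of such a process is a table of numbers about the text. It does not reproduce or communicate any of the author's sentences, which is the sense in which the use is described as non-expressive.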

11.61 Academics use the Moby Dick example to argue that ‘acts of copying that do not communicate the author’s original expression to the public do not generally constitute copyright infringement’.^[76] They suggest that to the extent that data and text mining do not substitute for the author’s original expression, such non-expressive uses

are properly considered equivalent to (or a subset of) highly transformative uses: their ‘purpose and character’ is such that they do not merely supersede the objects of the original creation.^[77]

11.62 Similarly, Burrell and others submitted that uses that treat copyright material as mere data—rather than for its expressive value—do not compete with the original works and should not be treated as falling within the scope of the copyright owner’s rights.^[78]

11.63 Similar thinking was evidenced in the Hargreaves Review, which recommended an exception for uses of works enabled by technology which do not trade on the underlying and expressive purpose of the work. As a result of the recommendation, the UK Government will introduce an exception that allows a person who already has access to a work (whether under licence or otherwise) to copy the work as part of a technological process of analysis and synthesis of the content of the work for non-commercial purposes.^[79]

Current law

11.64 There is no exception in the Copyright Act that covers data and text mining. Where the data or text mining processes involve the copying, digitisation, or reformatting of copyright material without permission, it may give rise to copyright infringement.

11.65 One issue is whether data and text mining, if done for the purposes of ‘research or study’, would be covered by the fair dealing exception. The reach of the fair dealing exceptions may not extend to text mining if the whole dataset needs to be copied and converted into a suitable format. Such copying would be more than a ‘reasonable portion’ of the work concerned.^[80] Nor is it clear that copying for text mining would fall under the s 43B exception relating to temporary reproduction of works as part of a technical process; this seems unlikely.

11.66 A number of stakeholders argued that data and text mining should be covered by fair use,^[81] drawing on the principle of non-expressive use, or uses that do not trade on the underlying or expressive purpose of the work.^[82] Others suggested that data and text mining are properly considered as ‘transformative’ uses.^[83]

11.67 The Australian Information Industry Association (AIIA) argued that it is important for legislative reform to encourage research, development and competition in the data analytics field.^[84] Universities Australia suggested that subjecting data and text mining to fair use would put Australian universities

on a level playing field with their counterparts in the US (who rely on fair use to engage in non-consumptive uses such as data mining and text mining for socially useful purposes) as well as the UK (who will soon have the benefit of a stand-alone exception for non-commercial data mining and text mining).^[85]

11.68 The Commonwealth Scientific and Industrial Research Organisation (CSIRO) agreed that if laws in Australia are more restrictive than elsewhere, the increased cost of research would make Australia a less attractive research destination.^[86]

11.69 A number of stakeholders suggested that data and text mining should be limited to non-commercial research and study.^[87] However, the CSIRO argued that the commercial/non-commercial distinction is not useful, since

such a limitation would seem to mean that ‘commercial research’ must duplicate effort and would be at odds with a goal of making information (as opposed to illegal copies of journal articles, for example) efficiently available to researchers.^[88]

11.70 Other stakeholders agreed.^[89] Google submitted that there are clear public benefits to facilitating data and text mining ‘regardless of whether this occurs within the confines of a university or other public research institution, or in the private sector’.^[90]

11.71 On the other hand, publishers opposed an exception for data and text mining and suggested that ‘the relative immaturity of the text/data mining market should not be considered as indicative of market failure demanding legislative intervention’.^[91]

11.72 The Association of Learned and Professional Society Publishers (ALPSP) argued that ‘publishers are not blocking access to articles for text and data mining—publishers are reporting that current requests are very low, and in the main, they are granted’.^[92] Therefore, it was suggested that solutions lie in cooperation between users and publishers to create licensing solutions.^[93] Exceptions, it was argued, would not create an environment conducive to collaboration:

Data and text mining solutions are best found in market-based initiatives, like proactive voluntary licensing, that offer faster and more flexible ways to adapt to changing market needs and preferences … Value proposals and business models for publishers in the field of data and text mining are only now emerging, and publishers are experimenting with various contractual and operational models.^[94]

11.73 Publishers also argued that licensing helps offset publishers’ costs to support content mining on a large scale, and that increases in costs ‘could act as a significant disincentive to publishers to continue to invest in programmes to enrich and enhance published content, which in turn facilitates greater usage and encouragement’.^[95]

Fair use

11.74 The ALRC considers that the unlicensed use of copyright material for non-expressive purposes, such as data and text mining, should be considered under the fair use exception recommended in this Report.

11.75 The ALRC agrees that non-expressive use can be considered a subset of transformative use. Just as transformative use is not listed as an illustrative purpose, the ALRC does not consider it necessary to include ‘non-expressive use’ or ‘data and text mining’ in the list of illustrative purposes.

11.76 Arguments in favour of considering data and text mining under a fair use exception, rather than introducing a new specific exception, largely parallel the more general arguments for introducing fair use. Data and text mining can ‘cover a range of activities which do or may not raise the same issues’.^[96] It is clear that data and text mining technologies are still evolving and they will become useful across a wide range of sectors in the economy, both commercial and non-commercial. The ALRC considers that fair use is sufficiently flexible to balance the competing interests between ‘copyright owners on the one side and academic and commercial users of data mining techniques on the other’.^[97]

11.77 Whether a use is fair must, in each instance, be assessed after considering the following fairness factors.

The purpose and character of the use

11.78 Data and text mining for one of the illustrative purposes of fair use, such as ‘research or study’, ‘education’ or ‘library or archive use’, is more likely to be fair. For example, the ALRC considers that the illustrative purpose of ‘research or study’ under fair use would allow data and text mining on the same grounds as the exception being implemented in the UK. This broadly aligns with the view of publishers, who had few problems with data mining for non-commercial purposes where a person has subscribed to the content that is being mined.^[98]

11.79 A finding that data and text mining is transformative would weigh heavily in favour of fair use. For example, to the extent that data and text mining allows ‘for the creation of new information, new aesthetics, new insight and understanding’,^[99] its use may be considered transformative.

11.80 Data and text mining for a commercial purpose would generally weigh against a finding of fair use, but not always. The Cyberspace Law and Policy Centre submitted that data mining may be done in relation to commercial medical research, and it is not clear that commerciality ought always to be decisive when all the fairness factors are considered.^[100]

The nature of the copyright material used

11.81 Copyright exists to protect the expression of ideas and facts, rather than the facts themselves. US courts have held that the scope of fair use is greater with respect to factual than non-factual works.^[101] It has also been held that ‘the second factor may be of limited usefulness where the creative work of art is being used for a transformative purpose’.^[102]

The amount and substantiality of the part used

11.82 The amount and substantiality needed will depend on the purpose and character of the use. The ALRC envisages that many data and text mining exercises, to be useful, will involve reproduction of entire works. Fair use case law in the US makes it clear that reproduction of the whole of a work can, depending on the circumstances, amount to fair use.^[103]

Effect of the use upon the market

11.83 The effect on the market would be a relevant factor. Where the use is non-expressive or highly transformative, there will be good arguments that such uses are not a substitute for the original work, and therefore cannot directly harm the market for the original. For the market factor to work against fair use, the unlicensed use must harm ‘traditional, reasonable, or likely to be developed’ markets.^[104]

11.84 The ALRC appreciates the arguments that licensing solutions are being developed for data and text mining. The mere availability of a licence should not mean that unlicensed uses are never fair. However, where a licence is offered on reasonable terms, it will be more difficult to argue that the unlicensed use is fair. The availability of such a licence will weigh against a finding of fair use, especially where the use is also commercial and non-transformative.

[69]
UK Government Intellectual Property Office, Consultation on Copyright (2011), 80.
[70]
R Van Noorden, ‘Text-mining Spat Heats Up’ (2013) 495 Nature 295 provides examples of text mining including: linking genes to research, mapping the brain and drug discovery.
[71]
C Haven, Non-consumptive research? Text-mining? Welcome to the Hotspot of Humanities Research at Stanford (2012) <http://news.stanford.edu/news/2010/december/jockers-digitize-texts-120110.html> at 22 April 2013; Association of Research Libraries, Code of Best Practices in Fair Use for Academic and Research Libraries (2012).
[72]
See, eg, UK Government, Consultation on Copyright: Summary of Responses (2012), 17.
[73]
McKinsey Global Institute, Big Data: The Next Frontier for Innovation, Competition and Productivity (2011), Executive Summary. It is suggested that big data equates to financial value of $300 billion (US Health Care); 250 billion Euros (EU Public sector administration); global personal location data ($100 billion in revenue for service providers and $700 billion for end users).
[74]
Cyberspace Law and Policy Centre, Submission 201.
[75]
M Jockers, M Sag and J Schultz, Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild v. Hathitrust (2013), 18.
[76]
Ibid, 1609.
[77]
M Jockers, M Sag and J Schultz, Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild v. Hathitrust (2013).
[78]
R Burrell, M Handler, E Hudson, and K Weatherall, Submission 716.
[79]
Intellectual Property Office, Data Analysis for Non-commercial Research (2013).
[80]
Copyright Act 1968 (Cth) s 40(5) setting out what is a ‘reasonable portion’ with respect to different works.
[81]
Internet Industry Association, Submission 253; Google, Submission 217; Society of University Lawyers, Submission 158; R Xavier, Submission 146.
[82]
ADA and ALCC, Submission 213; Australian Industry Group, Submission 179.
[83]
ADA and ALCC, Submission 213; R Xavier, Submission 146; M Rimmer, Submission 138.
[84]
AIIA, Submission 211. See also Internet Industry Association, Submission 253, who supported an exception around copying for the purposes of extracting information.
[85]
Universities Australia, Submission 754.
[86]
CSIRO, Submission 242.
[87]
AFL, Submission 717; Cricket Australia, Submission 700; CSIRO, Submission 242; Telstra Corporation Limited, Submission 222; M Rimmer, Submission 138.
[88]
CSIRO, Submission 242. The problematic distinction between commercial/non-commercial was also highlighted by Cyberspace Law and Policy Centre, Submission 640 and John Wiley & Sons, Submission 239.
[89]
Universities Australia, Submission 754; Google, Submission 600; Cyberspace Law and Policy Centre, Submission 201.
[90]
Google, Submission 600.
[91]
John Wiley & Sons, Submission 239; Australian Publishers Association, Submission 225; ALPSP, Submission 199.
[92]
ALPSP, Submission 199.
[93]
Australian Publishers Association, Submission 225.
[94]
IASTMP, Submission 200.
[95]
John Wiley & Sons, Submission 239. The APA argued that cost implications arise because ‘crawling can affect platform performance and response times, and may require the development and maintenance of parallel content delivery systems; costs are then incurred to ensure that adequate performance and access (whether for licensed or unlicensed users) is maintained’: Australian Publishers Association, Submission 225.
[96]
Intellectual Property Committee, Law Council of Australia, Submission 765. See also John Wiley & Sons, Submission 239 which submitted that ‘there is currently little or no uniform understanding of what TDM actually is, nor how best it can be enabled or supported’.
[97]
Cyberspace Law and Policy Centre, Submission 201.
[98]
International Association of Scientific Technical and Medical Publishers, Submission 560.
[99]
P Leval, ‘Toward a Fair Use Standard’ (1989–1990) 103 Harvard Law Review 1105, 1111.
[100]
Cyberspace Law and Policy Centre, Submission 640.
[101]
Basic Books, Inc v Kinko’s Graphics Corp, 758 F Supp 1522 (SDNY, 1991), 1533.
[102]
Bill Graham Archives v Dorling Kindersley, Ltd, 448 F3d 605 (2nd Cir, 2006), 612.
[103]
The Authors Guild Inc v HathiTrust, WL 4808939 (SDNY, 2012).
[104]
Princeton University Press v Michigan Document Services, Inc, 99 F 3d 1381 (6th Cir, 1996), [26].