04.06.2013

Text and data mining

8.41 Data and text mining has been defined as ‘automated analytical techniques’ that work by ‘copying existing electronic information, for instance articles in scientific journals and other works, and analysing the data they contain for patterns, trends and other useful information’.^[62] Data and text mining has also been described as ‘a computational process whereby text or datasets are crawled by software that recognises entities, relationships and actions’.^[63]

8.42 The growth of digital technology has seen increasing amounts of data stored in databases and repositories. Data and text mining to extract patterns across large data sets and journal articles is becoming more widely used in a number of research sectors, including medicine, business, marketing, academic publishing and genomics.^[64] This type of research has been referred to as ‘non-consumptive’ research, because it does not involve reading or viewing of the works.^[65]

8.43 The Terms of Reference refer to the general interests of Australians to ‘access, use and interact with content in the advancement of education, research and culture’. Researchers and research institutions have highlighted the value of data mining in paving the way for novel discoveries, increased research output and early identification of problems.^[66]

8.44 The Cyberspace Law and Policy Centre submitted that data mining

has the potential to grant ‘immense inferential power’ to allow businesses, researchers and institutions to ‘make proactive knowledge-driven decisions’. There are significant potential commercial benefits—data mining has the potential to improve business profits by allowing businesses to better understand and predict the interests of customers so as to focus their efforts and resources on more profitable areas.^[67]

8.45 At the commercial level, the ability to extract value from data is an increasingly important feature of the digital economy. For example, the McKinsey Global Institute suggests that data has the potential to generate significant financial value across commercial and other sectors, and become a key basis of competition, underpinning new waves of productivity growth and innovation.^[68]

Current law

8.46 There is no specific exception in the Copyright Act for text or data mining. Where the text or data mining process involves the copying, digitisation, or reformatting of copyright material without permission, it may give rise to copyright infringement.

8.47 One issue is whether text mining, if done for the purposes of research or study, would be covered by the fair dealing exceptions. The reach of the fair dealing exceptions may not extend to text mining if the whole dataset needs to be copied and converted into a suitable format. Such copying would be more than a ‘reasonable portion’ of the work concerned.^[69] Nor is it clear whether copying for text mining would fall under the exception relating to temporary reproduction of works as part of a technical process, under s 43B of the Copyright Act, but it seems unlikely.

International comparisons

8.48 The need for a specific text mining exception has been hotly contested in the UK. The Hargreaves Review recommended that the UK Government ‘press at EU level for the introduction of an exception allowing uses of a work enabled by technology which do not directly trade on the underlying creative and expressive purpose of the work’.^[70] One example given of such a use was data mining. The report also recommended that the Government ensure that such an exception cannot be overridden by contract.^[71]

8.49 In response to the Hargreaves Review, the Business, Innovation and Skills Committee of the UK Parliament did not endorse a specific exception to deal with data mining for research. Rather, it urged the Government to encourage the early development of models in which ‘licences are readily available at realistic rates to all bona fide licensees’.^[72]

8.50 However, the UK Government has proposed to amend the Copyright, Designs and Patents Act 1988 (UK) so that ‘it is not an infringement of copyright for a person who already has a right to access the work (whether under a licence or otherwise) to copy the work as part of a technical process of analysis and synthesis of the content of the work for the sole purpose of non-commercial research’.^[73] The rationale for this exception was that

the copying involved in text and data analytics is a necessary part of a technical process, and is unlikely to substitute for the work in question (such as a journal article). It is therefore unlikely that permitting mining for research will itself negatively affect the market for or value of copyright works. Indeed, it may be that removing restrictions from analytic technologies would increase the value of articles to researchers.^[74]

8.51 It was also proposed that a licence could not prevent the use of works under the exception, but may impose conditions of access to a licensor’s computer system or to third party systems on which the work is accessed. Where a technological protection measure (TPM) prevents a researcher from benefiting from this exception, an appeal can be made to the Secretary of State.

8.52 Text and data mining has also been considered in the US in the context of ‘transformative use’. In The Authors Guild v HathiTrust, the trial judge found that non-expressive uses such as text searching and computational analysis are fair use and therefore do not infringe the copyright in the underlying material.^[75]

Licensing solutions

8.53 A number of stakeholders submitted that there was no impediment to data or text mining in the Copyright Act.^[76] Some suggested that data and text mining activities may already be covered under the existing research or study fair dealing provisions, or may be covered by statutory licence if done for educational purposes.^[77]

8.54 In particular, publishers argued that the market for data and text mining is still developing, and that solutions to the perceived problem have not had a chance to evolve. For example, John Wiley & Sons submitted that:

There is currently little or no uniform understanding of what TDM (text/data mining) actually is, nor how best it can be enabled or supported. From our experience, there is little consistency across TDM projects as far as activities, processes and results are concerned, let alone definitions around content access methods and protocols or standard licensing terms.^[78]

8.55 The Association of Learned and Professional Society Publishers (ALPSP) argued that ‘publishers are not blocking access to articles for text and data mining—publishers are reporting that current requests are very low, and in the main, they are granted’.^[79] Therefore, it was suggested that solutions lie in co-operation between users and publishers to create licensing solutions.^[80] Exceptions, it was argued, would not create an environment conducive to collaboration:

Data and text mining solutions are best found in market-based initiatives, like proactive voluntary licensing, that offer faster and more flexible ways to adapt to changing market needs and preferences. These solutions must be based on collaboration between users and publishers. Value proposals and business models for publishers in the field of data and text mining are only now emerging, and publishers are experimenting with various contractual and operational models.^[81]

8.56 Publishers also argued that licensing helps offset publishers’ costs to support content mining on a large scale, and that increases in costs ‘could act as a significant disincentive to publishers to continue to invest in programmes to enrich and enhance published content, which in turn facilitates greater usage and encouragement’.^[82]

8.57 Publishers warned that ‘the relative immaturity of the TDM market should not be considered as indicative of market failure demanding legislative intervention’.^[83]

8.58 Other stakeholders were concerned about the reach of any data and text mining exception into commercial operations.^[84] For example, Telstra recognised the value of data and text mining ‘in the context of research, education and culture’, but was opposed to reform that would allow the use of data mining tools or software for commercial exploitation. It gave the example of:

an offshore data-miner that scrapes (or copies) data from an online Australian database, such as a telephone directory. The data-miner then uses the scraped content to establish a competing business, without the need to source, verify, supplement or format the content. The data-miner also avoids the need to employ Australian staff, or to invest in the creation or development of content.^[85]

8.59 IASTMP argued that publishers are increasingly providing licensing solutions for commercial text mining and that they should be allowed to continue providing or facilitating customised data and text mining solutions.^[86]

Facilitating research and study

8.60 A number of stakeholders argued that data and text mining should be permitted, drawing on the principle of ‘non-expressive’ use, or uses that do not trade on the underlying or expressive purpose of the work.^[87]

8.61 For example, the Australian Information Industry Association (AIIA) argued that it is important for legislative reform to encourage research, development and competition in the data analytics field. It suggested a specific exception to allow data and text mining for the purposes of ‘comparison, classification or analysis’ would not negatively impact on the original data provider’s rights and commercial interests because the technology is not intended to reprint the original data, but to provide a synthesised result. These outcomes do not interfere with the economic value of the copyright material nor compete with it.^[88]

8.62 Similarly, others referred to use of academic materials and journals that could be considered as ‘transformative’ uses.^[89] The ADA and ALCC suggested that data and text mining, as a subset of transformative use, may be best supported by a flexible, open-ended exception:

uses which may have been characterised as transformative, such as text and data mining, but may be better seen as ‘non-expressive’ or ‘orthogonal’ uses. Fair use in the US provides the flexibility for new technologies to develop which may straddle the two definitions, and similarly providing courts with the tools to deem when such uses will unreasonably harm the copyright owner.^[90]

8.63 A number of submissions referred to the importance of data and text mining for non-commercial research and study.^[91] However, the Commonwealth Scientific and Industrial Research Organisation (CSIRO) argued that the commercial/non-commercial distinction is not useful, since:

such a limitation would seem to mean that ‘commercial research’ must duplicate effort and would be at odds with a goal of making information (as opposed to illegal copies of journal articles, for example) efficiently available to researchers … As noted, much research is conducted through international collaboration. If the laws in Australia are more restrictive than elsewhere or if the administration of any rights system is cumbersome or onerous and creates excessive cost for research, then that might be expected to impact on the desirability of Australia as a research destination.^[92]

^[62] UK Government Intellectual Property Office, Consultation on Copyright (2011), 80. See also, S Džeroski, ‘Data Mining in a Nutshell’ in S Džeroski and N Lavrač (eds), Relational Data Mining (2001). Data mining programs are often called data-analytics software.

^[63] IASTMP, Submission 200.

^[64] R Van Noorden, ‘Text-Mining Spat Heats Up’ (2013) 495 Nature 295 provides examples of text mining including: linking genes to research, mapping the brain and drug discovery.

^[65] C Haven, Non-consumptive research? Text-mining? Welcome to the Hotspot of Humanities Research at Stanford (2012) <http://news.stanford.edu/news/2010/december/jockers-digitize-texts-120110.html> at 22 April 2013; Association of Research Libraries, Code of Best Practices in Fair Use for Academic and Research Libraries (2012).

^[66] UK Government, Consultation on Copyright: Summary of Responses (2012), 17.

^[67] Cyberspace Law and Policy Centre, Submission 201.

^[68] McKinsey Global Institute, Big Data: The Next Frontier for Innovation, Competition and Productivity (2011), Executive Summary. It is suggested that big data equates to financial value of $300 billion (US Health Care); 250 billion Euros (EU Public sector administration); global personal location data ($100 billion in revenue for service providers and $700 billion for end users).

^[69] Copyright Act 1968 (Cth) s 40(5) setting out what is a ‘reasonable portion’ with respect to different works.

^[70] I Hargreaves, Digital Opportunity: A Review of Intellectual Property and Growth (2011), 47.

^[71] Ibid, 51.

^[72] House of Commons Business, Innovation and Skills Committee, The Hargreaves Review of Intellectual Property: Where next? (2012), 19.

^[73] UK Government, Modernising Copyright: A Modern, Robust and Flexible Framework (2012), 37.

^[74] Ibid.

^[75] This analysis was supported in submissions from the ADA and ALCC, Submission 213 and R Xavier, Submission 146.

^[76] Copyright Agency/Viscopy, Submission 249; APRA/AMCOS, Submission 247; Australian Directors Guild, Submission 226; Australian Copyright Council, Submission 219.

^[77] Copyright Agency/Viscopy, Submission 249; Australian Publishers Association, Submission 225.

^[78] John Wiley & Sons, Submission 239.

^[79] ALPSP, Submission 199.

^[80] Australian Publishers Association, Submission 225.

^[81] IASTMP, Submission 200.

^[82] John Wiley & Sons, Submission 239. The APA argued that cost implications arise because ‘crawling can affect platform performance and response times, and may require the development and maintenance of parallel content delivery systems; costs are then incurred to ensure that adequate performance and access (whether for licensed or unlicensed users) is maintained’: Australian Publishers Association, Submission 225.

^[83] John Wiley & Sons, Submission 239; Australian Publishers Association, Submission 225; ALPSP, Submission 199.

^[84] Telstra Corporation Limited, Submission 222; Australian Broadcasting Corporation, Submission 210; Cyberspace Law and Policy Centre, Submission 201. The Cyberspace Law and Policy Centre stressed that ‘there is a need to manage access to address, technical, competitive and commercial risks’.

^[85] Telstra Corporation Limited, Submission 222.

^[86] IASTMP, Submission 200.

^[87] ADA and ALCC, Submission 213; Australian Industry Group, Submission 179.

^[88] AIIA, Submission 211. See also Internet Industry Association, Submission 253, which also supported an exception around copying for the purposes of extracting information.

^[89] ADA and ALCC, Submission 213; R Xavier, Submission 146; M Rimmer, Submission 138.

^[90] ADA and ALCC, Submission 213.

^[91] CSIRO, Submission 242; Telstra Corporation Limited, Submission 222; M Rimmer, Submission 138.

^[92] CSIRO, Submission 242.