Text and data mining (TDM) is the automated process of extracting useful information and insights from large amounts of unstructured data, for the purposes of identifying trends, patterns and knowledge. This allows organisations to efficiently and cost-effectively gain insight from a wide range of data sources.
Unstructured data is data that is not actively managed by a database management system. When we think of data, we tend to think of factual, quantitative information such as statistics and figures, but in reality unstructured data is estimated to make up 80–90% of the data used by organisations globally. It is less the quantitative content that comes to mind and more everything we see and use online. Text is the most common type of unstructured data, found in websites, Word documents, online articles, social media posts, reviews, video transcripts, e-books and so on. Other types include images, audio and video files. Most new data generated today is unstructured and is difficult to store and manage in a conventional database, which is why organisations need tools and processes, such as text and data mining, to manage, analyse and make use of it.
The most common sources of data for TDM include journal articles, books, datasets, images, social media posts and websites. TDM involves accessing and analysing this content, and then extracting and reproducing at least parts of these works.
The content used in the TDM process is protected by copyright by default. While copyright does not apply to accessing and analysing published content, it does cover reproducing it. TDM goes beyond accessing and gathering information from datasets: it extracts and reproduces information, and it is this act of copying that is subject to copyright. In TDM, technology takes the place of a human viewing or reading something and then copying extracts of it.
TDM is generally NOT permitted in the UK without a licence under existing UK copyright law (the Copyright, Designs and Patents Act 1988, or “CDPA”). The one exception is TDM for non-commercial research.
While there is overlap, TDM and generative AI training are two distinct activities. TDM is the process of turning unstructured data into structured data, and it is this structured data that generative AI models use for training. TDM can exist without generative AI, but generative AI would be much less effective without TDM. In terms of business practice, an organisation would carry out text and data mining to gain insight from published content, whereas it would use published content in a generative AI tool to generate new content.
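To make the idea of turning unstructured data into structured data more concrete, here is a minimal sketch in Python. It is only an illustration, not a description of how any particular TDM tool works: the sample reviews, the stopword list and the `term_counts` function are all invented for this example, and real TDM pipelines are far more sophisticated.

```python
import re
from collections import Counter

# Hypothetical sample of unstructured text (invented for illustration).
reviews = [
    "Delivery was fast and the packaging was excellent.",
    "Slow delivery, but the product itself is excellent.",
    "Excellent product. Delivery could be faster.",
]

# Assumed list of common words to ignore; a real tool would use a fuller list.
STOPWORDS = {"the", "was", "and", "but", "is", "be", "could", "itself"}


def term_counts(documents):
    """Turn free text into structured (term, count) rows, most frequent first."""
    counts = Counter()
    for doc in documents:
        # Lower-case the text and split it into alphabetic words.
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common()


rows = term_counts(reviews)
for term, count in rows[:3]:
    print(term, count)
```

The free-text reviews go in as unstructured data, and what comes out is a small table of terms and counts that could be stored in a database or charted to spot a trend, such as how often customers mention delivery.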