Text and Data Mining

What is Text and Data Mining?

Text and data mining (TDM) uses computational methods to extract and analyze large quantities of text files or data sets to quickly identify patterns and relationships. Text analytics methods include information retrieval, named entity recognition, part of speech tagging, sentiment analysis, network analysis, and topic modelling.
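
As a rough illustration of what identifying patterns computationally can look like, the sketch below counts the most frequent words across a handful of short texts using only the Python standard library. The sample documents, the stopword list, and the function names `tokenize` and `top_terms` are assumptions made for illustration; they do not come from any particular TDM toolkit, and a real project would usually rely on a dedicated text analysis library.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; a real project would load many files instead.
DOCUMENTS = [
    "The whale surfaced near the ship.",
    "The ship followed the whale for days.",
    "A storm scattered the fleet.",
]

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "for", "near"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(documents, n=3):
    """Return the n most frequent non-stopword tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    for term, count in top_terms(DOCUMENTS):
        print(term, count)
```

Simple counts like these are often a first step before methods such as topic modelling or sentiment analysis, which build on the same tokenized text.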

Don't know where to start? Check out Ted Underwood's "Seven ways humanists are using computers to understand text."

Planning Your Project

Finding Corpora

You can find and collect text for analysis from a variety of resources including library content we subscribe to, open access content, social media, and online web resources. Your librarian can help you find corpora suitable for analysis.

Some textual resources are born-digital (e.g., Wikipedia, social media); other works are digitized and converted to text by OCR (optical character recognition). The accuracy of OCR varies by resource, and your corpus may require OCR cleaning depending on your research needs.
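
What OCR cleaning involves depends on the source, but a minimal sketch might look like the following. It assumes plain-text OCR output in which words are sometimes split by a hyphen at a line break and spacing is irregular. The function name `clean_ocr_text` and the sample string are hypothetical, and the rules shown are only a starting point, not a complete cleaning pipeline.

```python
import re


def clean_ocr_text(raw):
    """Apply a few conservative cleanup steps to OCR output."""
    # Rejoin words split by a hyphen at a line break, e.g. "digi-\ntized".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", raw)
    # Turn single line breaks into spaces; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "The manu-\nscript was  digi-\ntized in\n1998.\n\nA second   page follows."
    print(clean_ocr_text(sample))
```

Note that the first rule also merges genuinely hyphenated compounds that happen to break across lines (for example, "well-" followed by "known" becomes "wellknown"), so it is worth spot-checking the results against the page images before relying on them.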

TDM Use and Copyright

Many textual resources are still under copyright, which complicates full-text access. A library subscription to a journal or database does not necessarily mean that we have TDM rights to that content. Please contact your librarian to find out whether you have TDM rights to the library resource you are interested in. In some cases, we can negotiate additional access for you, often at an additional cost that may be borne by the researcher. Please allow sufficient lead time for this negotiation.

Budgeting 

We are happy to consult with researchers on projects using existing content. However, we may be unable to provide licensing or funding for individual text-mining projects whose needs are not covered by university-wide licenses. The library is unable to pay project-by-project fees but will attempt to negotiate a more institutional solution with the vendor. As noted above, we therefore strongly encourage scholars to consider research or grant funding in these cases.