Subject Librarian Toolkit

What is TDM?

Text and data mining (TDM) uses computational methods to extract and analyze large quantities of text files or data sets to quickly identify patterns and relationships. Text analytics methods include information retrieval, bibliometrics, summarization, named entity recognition, part-of-speech tagging, sentiment analysis, network analysis, and topic modelling.
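
As a minimal sketch of one of these methods, named entity recognition, here is an example using the spaCy library (the library choice and example sentence are ours, purely for illustration):

    # Named entity recognition with spaCy.
    # Assumes: pip install spacy, then: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Ada Lovelace met Charles Babbage in London in 1833.")

    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g., "Ada Lovelace PERSON", "London GPE", "1833 DATE"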

By using techniques like counting the frequency of a word in a given text or examining where words occur close to one another (collocation), researchers can get insight into what a text is about or find texts that use certain words.
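
A minimal sketch of both techniques using only the Python standard library (real projects often reach for packages such as NLTK, which has dedicated collocation tools):

    # Word frequency and simple collocation (adjacent word pairs).
    from collections import Counter

    text = "the whale and the sea and the whale again"
    tokens = text.lower().split()

    freq = Counter(tokens)                    # how often each word occurs
    pairs = Counter(zip(tokens, tokens[1:]))  # which words occur next to each other

    print(freq.most_common(2))   # [('the', 3), ('whale', 2)]
    print(pairs.most_common(1))  # the most frequent adjacent pair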

You can read more about the text analysis methods a researcher might use in this Ithaka guide. Ted Underwood has written about "Seven ways humanists are using computers to understand text."

Data scientists and businesses also use these methods to analyze social media trends, gauge the sentiment of product reviews, or extract information from forms and webpages. This Ithaka guide explains the computational methods a data analyst might apply to texts.

What to do when you're asked about TDM?

1. Clarify what the researcher is asking for

Does the researcher need full-text or just metadata?

Is the resource open access, in the public domain, or still under copyright? TDM access to the full text of materials still in copyright is trickier to obtain and often costs more.

Does the researcher have research funds available to pay for content the library has not licensed? We may be unable to provide licensing or funding for individual text-mining projects, so we strongly encourage scholars to consider research or grant funding in these cases.

What is the project timeline? It takes time (anywhere from two to eight weeks) to confirm existing access or to negotiate additional access.

2. Check with e-Resources

Do we have TDM rights to this content? A library subscription to a journal or database does not necessarily include TDM rights to the same content.

If not, ask for a quote for additional TDM access.

3. What can DiSC do?

DiSC offers beginner to intermediate workshops on using Python or R for analysis and can consult on project scoping. You can find links to text mining training tools and tutorials in the text mining research guide.

DiSC cannot serve as research project staff. For researchers looking for project staff with programming or data analysis expertise, you can refer them to these campus partners:

  • The Data Science Institute for consultation and potential staffing on larger-scale text and data mining projects; or
  • Research IT for consultation on hiring programmers for text or data analysis and visualization.

4. What about TDM in the classroom?

Faculty who want to introduce text mining in the classroom may want to explore the vendor platforms we have access to: Gale's Digital Scholar Lab (for Gale content) and Constellate (for JSTOR content). These platforms limit the size of the corpus that can be analyzed, so they are typically not suited to large-scale analysis.

Gale's Digital Scholar Lab is a web-based learning platform that allows students and researchers to apply natural language processing tools to create visualizations from raw text data (OCR text) in Gale's primary source collections. It does not require any programming expertise or software downloads, and Gale provides curriculum materials and sample datasets.

JSTOR's Constellate platform allows researchers to perform text analysis on JSTOR and Portico content in a secure Jupyter notebook environment. It also provides beginner- and intermediate-level series of Jupyter notebooks covering word frequencies, topic modelling, sentiment analysis, and other common text mining techniques. These notebooks are open source and make a great introduction to text mining.
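
Constellate's notebooks use their own code and datasets, but for a sense of what one of these techniques involves, here is a minimal topic modelling sketch using scikit-learn (the toy documents are purely illustrative):

    # Topic modelling with latent Dirichlet allocation (LDA).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "whales and the sea and ships at sea",
        "parliament passed the trade act law",
        "the whale hunt and the ship crew",
        "the law and the courts and parliament",
    ]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                # document-term count matrix

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    words = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-4:]]  # top words per topic
        print(f"Topic {i}:", top)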

FAQs

What is an API?  An API (application programming interface) is a structured way for programs to retrieve data from the web. The website owner (e.g., New York Times, Clarivate, Twitter) makes the information in its databases available via an API so that programmers can search and extract specific information. The website owner typically provides documentation on how to use its API and how to obtain an API key (credentials granting access rights to the API).
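
As a minimal sketch, here is what a typical API request looks like in Python with the requests library; the endpoint and parameter names are hypothetical, so check the provider's documentation for the real ones:

    import requests

    API_KEY = "your-api-key-here"  # issued by the provider after registration
    url = "https://api.example.com/v1/search"          # hypothetical endpoint
    params = {"q": "text mining", "api-key": API_KEY}  # hypothetical parameters

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # raise an error on a failed request
    data = response.json()       # most APIs return JSON
    print(data)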

What is web scraping? Web scraping is an alternative method of retrieving data from the web when an API is unavailable. You can download entire pages or only selected parts of pages. It requires a basic understanding of HTML, CSS, and the Document Object Model (DOM).
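
A minimal scraping sketch with the requests and BeautifulSoup libraries; the URL and CSS selector are hypothetical, and researchers should always check a site's terms of service before scraping:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/articles"  # hypothetical page
    html = requests.get(url, timeout=30).text

    soup = BeautifulSoup(html, "html.parser")  # parses HTML into a DOM-like tree
    for para in soup.select("div.article p"):  # CSS selector: <p> inside <div class="article">
        print(para.get_text(strip=True))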

Should I use Python or R? There is no right answer. Researchers should use the programming language they are most familiar with, or the one most commonly used in their discipline. Both are open source with large communities of practice, and most text and data analysis packages are available in both languages.

Have the texts been OCR'd? Some textual resources are born-digital (e.g., current newspapers, social media, Wikipedia); others are digitized and converted to text by OCR (optical character recognition). OCR accuracy varies by resource; typically, the older the text, the worse the OCR output. Depending on research needs, texts may require OCR cleanup or other pre-processing (e.g., stripping out markup tags).
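
A minimal pre-processing sketch; the cleanup steps a given corpus needs will vary, and the OCR error shown here is just illustrative:

    import re

    raw = "<p>Tlie  whale, <i>again</i>.</p>"  # "Tlie" is a typical OCR misread of "The"

    text = re.sub(r"<[^>]+>", "", raw)        # strip markup tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    text = text.replace("Tlie", "The")        # targeted fix for a known OCR error
    print(text)                               # "The whale, again."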

What type of files are available? Text files are typically provided as structured data, with separate fields for author, title, date, etc., and a single field for the full text. Depending on the vendor, output may come as XML, JSON, and/or CSV files. All of these file types are readily usable in any programming language and convertible from one to another.
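
A minimal conversion sketch with pandas; the field names here are illustrative, since actual field names vary by vendor:

    import pandas as pd

    # Structured records as they might arrive in a vendor's JSON output.
    records = [
        {"author": "Smith, A.", "title": "On Whaling", "date": "1851", "fulltext": "..."},
        {"author": "Jones, B.", "title": "Sea Trade", "date": "1855", "fulltext": "..."},
    ]

    df = pd.DataFrame(records)            # the same structure can be loaded from XML or CSV
    df.to_csv("corpus.csv", index=False)  # write it back out as CSV
    print(df[["author", "title", "date"]])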

For materials still under copyright, output will likely be limited to derived results (sometimes called extracted feature sets): typically metadata, word counts, n-grams, topic models, etc. You are unlikely to be able to download the full text.
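
As an illustration (the schema here is invented; real providers each define their own), one record in an extracted-features set might look like this:

    # Derived features only: the full text cannot be reconstructed from
    # counts alone, which is why providers can release them for
    # in-copyright works.
    record = {
        "id": "doc-0001",
        "title": "On Whaling",                               # metadata
        "wordCount": 4210,
        "unigramCount": {"whale": 12, "sea": 9, "ship": 7},  # word counts
        "bigramCount": {"the whale": 8, "the sea": 5},       # 2-grams
    }

    print(record["unigramCount"])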

What's a Jupyter Notebook anyway? A Jupyter notebook provides an environment where you can combine human-readable narrative with computer-readable code (including R and Python) and display the results of your analysis or visualization. It allows you to document your thought process and research methods in order to produce publishable and reproducible results.
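
For example, a single code cell in a notebook might look like the sketch below (word counts invented for illustration); the surrounding narrative would live in Markdown cells, and the chart renders inline beneath the cell:

    import pandas as pd

    # In a notebook, this cell's output (a bar chart) displays directly
    # below the code. Requires matplotlib alongside pandas.
    counts = pd.Series({"whale": 120, "sea": 95, "ship": 80}, name="frequency")
    counts.plot(kind="bar", title="Word frequencies")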