Text and Data Mining

Text and data mining (TDM) uses computational methods to extract and analyze large quantities of text files or data sets to quickly identify patterns and relationships. TDM methods include information retrieval, named entity recognition, part of speech

Library Databases

Resources available for text and data mining vary by publisher. If you do not see the resource you are looking for, please contact your librarian about obtaining access or where to find corpora for your research needs.

Publisher | Content Available | Access Method | Registration Process | For More Information
Adam Matthew | All Vanderbilt licensed content | API | Contact your subject librarian. | Adam Matthew data mining / text mining statement; Adam Matthew API overview
Annual Reviews | All Vanderbilt licensed content | | Contact your subject librarian. |
American Association for the Advancement of Science (AAAS) | All Vanderbilt licensed content | Download from the AAAS online platform | Contact your subject librarian. | Science Online Journals Institutional License Agreement
Clarivate Analytics | Web of Science | API | Create a user account for the Clarivate Developer Portal. Because this site shares credentials with other Clarivate services, you may already have an existing account. | Available Web of Science APIs; Clarivate Developers Portal
Duke University Press | Vanderbilt licensed ebooks and Project Euclid | | Contact your subject librarian. |
Elsevier | ScienceDirect | API | Request an API key via the Elsevier developers portal. | Elsevier text and data mining policy
Gale | 19th Century Collections Online; Archives of Sexuality and Gender; Associated Press Collections; British Library Newspapers; Early Arabic Printed Books from the British Library; Financial Times Historical Archive; Making of Modern Law; State Papers Online; Full title list | | Contact your subject librarian. |
HistoryMakers | HistoryMakers Digital Archive | | Contact your subject librarian. |
JSTOR | Available data includes metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets on JSTOR. Datasets may include data for up to 25,000 documents. | Zip files containing .txt, .xml, or n-grams | A JSTOR account is required to request a dataset. Register for a free JSTOR account. | JSTOR Data for Research
Linguistic Data Consortium | All Vanderbilt licensed content | | Vanderbilt users can select corpora published from 2022 to present. Contact your subject librarian for access. | LDC corpora by year
OCLC | WorldCat | API | Contact your subject librarian. | WorldCat Search API overview
Oxford University Press | Oxford Historical Treaties | | Contact your subject librarian. |
ProQuest | British Periodicals I-IV; American Periodicals; History Vault: Latino Civil Rights during the Carter Administration; ProQuest History Vault. American Federation of Labor Records: The Samuel Gompers era, 1877-1937 | | Contact your subject librarian. |
Royal Society | Vanderbilt users may perform automated searches of licensed content. | | Contact your subject librarian. |
Sage Journals | All Vanderbilt licensed content | Download from the Sage platform or use the CrossRef Public API | No registration is required. Follow publisher instructions and terms of use. | Text and Data Mining on Sage Journals
Springer Nature | All Vanderbilt licensed and open access content | API | Register via the Springer Nature API Portal. | Springer Nature text and data mining policy
Taylor & Francis | All Vanderbilt licensed journal content | Arranged by request | Contact your subject librarian. | Taylor & Francis Text and Data Mining Policy
TDS Health | Vanderbilt licensed content in Stat!Ref | API | Contact your subject librarian. | TDS Health OpenSearch Support
Wiley | All Vanderbilt licensed content | API | Review the Wiley Text and Data Mining statement and scroll to the Get a Text and Data Mining Token section. Users must log in using their Wiley Online Library credentials. If you are not registered, please do so at the registration page. | Wiley Text and Data Mining statement
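
The JSTOR entry above describes datasets made up of word counts and n-grams. As a rough illustration of what those terms mean, the Python sketch below counts single words and two-word sequences in a short sample sentence. The helper names `tokenize` and `ngram_counts` are hypothetical, and the tokenization rule is an assumption chosen for simplicity; the way JSTOR or any other provider prepares its data may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, joined with single spaces."""
    # zip over n staggered views of the token list yields each n-token window
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


sample = "Text and data mining uses text and data."
tokens = tokenize(sample)

unigrams = ngram_counts(tokens, 1)  # word counts
bigrams = ngram_counts(tokens, 2)   # two-word sequences

print(unigrams.most_common(3))
print(bigrams.most_common(2))
```

The same function can be pointed at the plain-text files in a downloaded dataset, one document at a time, to compare your own counts with those supplied by a provider.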

Freely Available Content for TDM Projects

In addition to the specific resources listed below, check out this list of Open Access disciplinary repositories.

Publisher | Content Available | Access Method | Registration Process | For More Information
arXiv | Offers public API access to e-print content and metadata in the areas of physics, mathematics, and computer science. | API | None | arXiv API access documentation
BioMed Central | Open access content published by BMC | API | Register via the Springer Nature API Portal. | BMC API overview
Caselaw Access Project | All U.S. federal and state case law | API | Some access requires registration for a free API key. | Usage and access
CrossRef | Metadata records with CrossRef DOIs | API | None | Text and data mining for researchers
Digital Public Library of America | Metadata on items and collections | API | Request an API key. | DPLA API Codex
Folger Shakespeare Library | Downloadable files of the Folger Shakespeare texts in six different digital formats. | Downloads available from https://shakespeare.folger.edu/download/. | None | Additional API tools
HathiTrust | Use the HathiTrust APIs to query and retrieve data when you have a known identifier. HathiTrust APIs are not search APIs (e.g., where you use a keyword to search across the collection). | API | To use the Data API, request an API key. | HathiTrust Data Availability and API Options
Internet Archive | 20 million freely downloadable books and texts | Individual works are downloadable from the Internet Archive website. Bulk downloads require a terminal emulator and wget. | None | Instructions for downloading in bulk
Library of Congress | Chronicling America: Historic American Newspapers | API | None | About the site and API directions
Library of Congress | LC for Robots provides machine-readable access to the Library of Congress' digital collections, including images, laws and regulations, and bibliographic information. | Varies | | LC for Robots documentation
National Library of Medicine | Multiple text mining tools for accessing various NLM databases and biomedical literature. | Varies | | Text Mining Tools; NLM Products and Services
OECD | Programmatically access a selection of the most-used datasets covering data for OECD countries and selected non-member economies. OECD datasets are dynamically updated. VU researchers are encouraged to start with the OECD iLibrary subscription access, as more data can be exported in one request. | API; RSS data feeds are also available if needed. Contact your subject librarian if RSS is needed. | None | OECD data for developers
PLOS (Public Library of Science) | Access to article corpus and article metadata. | API | Register to obtain an API key. | PLOS text and data mining documentation
Project Gutenberg | Over 60,000 books, usually out of copyright. | TXT, HTML, ePUB | None | Project Gutenberg Permissions, Licensing and other Common Requests
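
Two of the resources listed above, arXiv and CrossRef, are marked as requiring no registration, so they are a convenient place to try a first API request. The Python sketch below only builds the query URLs; it does not send any requests unless you call `fetch` yourself. The endpoint addresses and parameter names (`search_query`, `start`, `max_results`, `query`, `rows`, `mailto`) are taken from the providers' public documentation rather than from this guide, so please confirm them, along with each provider's rate limits and terms of use, before running anything at scale. The function names and the example email address are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoints as described in the providers' public documentation (assumption:
# verify against the current arXiv and CrossRef API documentation before use).
ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"
CROSSREF_ENDPOINT = "https://api.crossref.org/works"


def arxiv_query_url(search_query, start=0, max_results=10):
    """Build an arXiv API query URL; the response format is Atom XML."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_ENDPOINT}?{urlencode(params)}"


def crossref_query_url(query, rows=5, mailto=None):
    """Build a CrossRef works query URL; the response format is JSON."""
    params = {"query": query, "rows": rows}
    if mailto:
        # Optional contact address that identifies who is making the requests.
        params["mailto"] = mailto
    return f"{CROSSREF_ENDPOINT}?{urlencode(params)}"


def fetch(url, timeout=30):
    """Download the raw response body as text. Not called in this sketch."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(arxiv_query_url("all:electron", max_results=5))
print(crossref_query_url("text mining", rows=2, mailto="you@example.edu"))
```

Keeping URL construction separate from the network call makes it easy to inspect a request before sending it and to add pauses between requests when harvesting many records.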