Research Guides: Text and Data Mining: TDM Tools

Vendor Tools

The following vendors have built-in tools for conducting text analysis on their content. Access may be restricted depending on Vanderbilt subscriptions and copyright.

HathiTrust Resource Center
Supports large-scale computational analysis of the works in the HathiTrust Digital Library.


HTRC Algorithms are web-based, click-and-run tools to perform computational text analysis on volumes in the HathiTrust Digital Library. The algorithms can help you explore, analyze, and visualize public worksets or those you have created.

For more advanced users, the HTRC Data Capsules provide secure computing environments for performing researcher-driven text analysis on the HathiTrust corpus.

Use this proxy link to the beta site and get access to larger datasets (up to 50,000 items).

Gale Digital Scholar Lab
Collect and analyze datasets from Gale Primary Sources
LC for Robots
The Library of Congress provides machine-readable access to its digital collections via APIs and built-in tools.
ProQuest TDM Studio
Unlimited seats
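As a rough illustration of the machine-readable access mentioned under LC for Robots, the sketch below builds a search URL for the loc.gov JSON API and pulls item titles out of a response. It is a minimal sketch, not an official client: the helper names (`build_search_url`, `extract_titles`, `fetch_titles`) are hypothetical, and the query parameters (`q`, `fo=json`, `c`, `sp`) and response fields (`results`, `title`) are assumptions that should be checked against the Library of Congress API documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the Library of Congress website.
LOC_BASE = "https://www.loc.gov"


def build_search_url(query, per_page=25, page=1):
    """Build a loc.gov search URL that asks for JSON instead of HTML.

    Assumed parameters: q (search terms), fo=json (response format),
    c (results per page), sp (page number).
    """
    params = {"q": query, "fo": "json", "c": per_page, "sp": page}
    return f"{LOC_BASE}/search/?{urlencode(params)}"


def extract_titles(payload):
    """Return the title of each item in a parsed search response.

    Assumes the response holds a "results" list whose items may carry a
    "title" field; items without one yield an empty string.
    """
    return [item.get("title", "") for item in payload.get("results", [])]


def fetch_titles(query, per_page=5):
    """Run one search over the network and return the item titles."""
    with urlopen(build_search_url(query, per_page=per_page)) as response:
        return extract_titles(json.load(response))
```

Calling `fetch_titles("suffrage")` would issue a single request. For anything larger than a handful of pages, pause between requests and review the Library's guidance on rate limits first.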

VU Library Data Lake

Vanderbilt University students, staff, and faculty have access to licensed datasets through the VU Library Data Lake. This resource is available through the Databricks platform, which offers data, analytics, and AI solutions at scale. To access the Data Lake, contact your liaison librarian.

Example Use Cases

Generating dashboards and visualizations
Machine Learning with Hugging Face Transformers
ETL & Data Engineering with SQL, Python, R, or Scala
Building LLMs through integrations with OpenAI, John Snow Labs, and others

Collections in the Data Lake

The Vanderbilt Library Data Lake includes five datasets ready to use. In addition to these prepared datasets, the library can load, on request, any dataset already acquired with a TDM license. You can also load your own data into the Data Lake! Databricks offers multiple options for ingesting data. Please review the Data Lake data retention policy for any datasets, notebooks, or other workspace data not part of the library’s permanent collection.

Tools and Tutorials

Programming Historian
A collection of peer-reviewed tutorials for learning a wide variety of digital tools and techniques, suitable for novice to intermediate programmers.
Voyant Tools
Voyant Tools is a web-based reading and analysis environment for digital texts designed for those without programming skills.
TAPoR
TAPoR 3.0 is a directory of tools used for gathering, extracting, manipulating, analyzing and visualizing text.

Building Legal Literacies for Text Data Mining by Rachael Samberg (Editor); Timothy Vollmer (Editor)
ISBN: 9780999797044

Publication Date: 2021-07-01

This book explores the legal literacies covered during the virtual Building Legal Literacies for Text Data Mining Institute, including copyright (both U.S. and international law), technological protection measures, privacy, and ethical considerations. It describes in detail how we developed and delivered the 4-day institute, and also provides ideas for hosting shorter literacy teaching sessions. Finally, we offer reflections and take-aways on the Institute.
Text Analytics with Python by Dipanjan Sarkar
ISBN: 1484243536

Publication Date: 2019-05-22

Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on the recent trends in NLP. You'll see how to use the latest state-of-the-art frameworks in NLP, coupled with machine learning and deep learning models for supervised sentiment analysis powered by Python to solve actual case studies. Start by reviewing Python for NLP fundamentals on strings and text data and move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well. Text summarization and topic models have been overhauled so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset on NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques. There is also a chapter dedicated to semantic analysis where you'll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.
What You'll Learn
* Understand NLP and text syntax, semantics and structure
* Discover text cleaning and feature engineering
* Review text classification and text clustering
* Assess text summarization and topic models
* Study deep learning for NLP
Who This Book Is For
IT professionals, data analysts, developers, linguistic experts, data scientists and engineers, and basically anyone with a keen interest in linguistics, analytics, and generating insights from textual data.
Text Mining and Analysis by Goutam Chakraborty; Murali Pagolu; Satish Garla
ISBN: 161290551X

Publication Date: 2013-10-25

Big data: It's unstructured, it's coming at you fast, and there's lots of it. In fact, the majority of big data is text-oriented, thanks to the proliferation of online sources such as blogs, emails, and social media. However, having big data means little if you can't leverage it with analytics. Now you can explore the large volumes of unstructured text data that your organization has collected with Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. This hands-on guide to text analytics using SAS provides detailed, step-by-step instructions and explanations on how to mine your text data for valuable insight. Through its comprehensive approach, you'll learn not just how to analyze your data, but how to collect, cleanse, organize, categorize, explore, and interpret it as well. Text Mining and Analysis also features an extensive set of case studies, so you can see examples of how the applications work with real-world data from a variety of industries. Text analytics enables you to gain insights about your customers' behaviors and sentiments. Leverage your organization's text data, and use those insights for making better business decisions with Text Mining and Analysis.
Text Mining with R by Julia Silge; David Robinson
ISBN: 9781491981658

Publication Date: 2017-07-18

Much of the data available today is unstructured and text-heavy, making it challenging for analysts to apply their usual data wrangling and visualization tools. With this practical book, you'll explore text-mining techniques with tidytext, a package that authors Julia Silge and David Robinson developed using the tidy principles behind R packages like ggraph and dplyr. You'll learn how tidytext and other tidy tools in R can make text analysis easier and more effective. The authors demonstrate how treating text as data frames enables you to manipulate, summarize, and visualize characteristics of text. You'll also learn how to integrate natural language processing (NLP) into effective workflows. Practical code examples and data explorations will help you generate real insights from literature, news, and social media.
Learn how to apply the tidy text format to NLP
Use sentiment analysis to mine the emotional content of text
Identify a document's most important terms with frequency measurements
Explore relationships and connections between words with the ggraph and widyr packages
Convert back and forth between R's tidy and non-tidy text formats
Use topic modeling to classify document collections into natural groups
Examine case studies that compare Twitter archives, dig into NASA metadata, and analyze thousands of Usenet messages