Text and Data Mining

Text and data mining (TDM) uses computational methods to extract and analyze large quantities of text files or data sets to quickly identify patterns and relationships. TDM methods include information retrieval, named entity recognition, part of speech

Vendor Tools

The following vendors have built-in tools for conducting text analysis on their content. Access may be restricted depending on Vanderbilt subscriptions and copyright.

Use this proxy link to the beta site and get access to larger datasets (up to 50,000 items). 

VU Library Data Lake

Vanderbilt University students, staff, and faculty have access to licensed datasets through the VU Library Data Lake. This resource is available through the Databricks platform, which offers data, analytics, and AI solutions at scale. To access the Data Lake, contact your liaison librarian.

Example Use Cases

  • Generating dashboards and visualizations
  • Machine Learning with Hugging Face Transformers
  • ETL & Data Engineering with SQL, Python, R, or Scala
  • Building LLMs through integrations with OpenAI, John Snow Labs, and others
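
As a rough illustration of the ETL use case in Python, the sketch below extracts text from a few records, tokenizes it, and counts term frequencies. It is a minimal example under stated assumptions: the sample records and the function names (tokenize, term_frequencies) are hypothetical, and it uses only the Python standard library rather than any Databricks or Data Lake API.

```python
import re
from collections import Counter

# Hypothetical sample records standing in for rows of a licensed dataset.
SAMPLE_DOCS = [
    {"id": 1, "text": "Text mining finds patterns in text."},
    {"id": 2, "text": "Data mining finds patterns in data sets."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs):
    """Tokenize the text of each record and count terms across all records."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc["text"]))
    return counts


if __name__ == "__main__":
    freqs = term_frequencies(SAMPLE_DOCS)
    # Print the three most frequent terms with their counts.
    for term, n in freqs.most_common(3):
        print(term, n)
```

In a notebook environment, the same transform step would likely be applied to a table or DataFrame rather than an in-memory list; that part depends on how the dataset was loaded and is not shown here.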

Collections in the Data Lake

The Vanderbilt Library Data Lake includes five ready-to-use datasets. In addition to these prepared datasets, the library can load, on request, any dataset already acquired with a TDM license. You can also load your own data into the Data Lake; Databricks offers multiple options for ingesting data. Please review the Data Lake data retention policy for any datasets, notebooks, or other workspace data that are not part of the library’s permanent collection.

Tools and Tutorials