Data and Source Materials for Mapping Texts

The Mapping Texts project team relied heavily on the Texas Digital Newspaper Collection’s archive of historic documents. The primary source material for this project was a set of 232,500 pages that were digitized and converted to plain text using optical character recognition (OCR). This collection was then processed through several different computational analyses to help the team explore the possibilities for computer-aided “distant reading” of large document collections.

List of Cities and Publications in Document Collection

Natural Language Processing Results

We are sharing data files containing the results of various natural language processing tools that we used on the document collection:

  • Word Counts: Lists of words in descending order of frequency, broken down by year, location, and title
  • Named Entity Recognition results: People, places, and organizations recognized in the text
  • Topic Models: Clusters of commonly co-occurring words found in the collection as a whole, and in subsets grouped by historical era, by city and historical era, and by publication and historical era
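As a rough illustration of the first analysis listed above, the sketch below shows how word counts in descending order of frequency can be produced from OCR plain text. This is a minimal, hypothetical example; the project's actual tooling and tokenization rules are not described in the source.

```python
from collections import Counter
import re

def word_counts(text):
    """Return (word, count) pairs in descending order of frequency.

    A simple sketch: lowercase the OCR text and treat runs of letters
    (and apostrophes) as words. Real OCR output would need more cleanup.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

# Tiny illustrative input, not actual collection data
sample = "Texas news from Texas papers about Texas"
print(word_counts(sample)[0])  # → ('texas', 3)
```

The same function could be applied to subsets of the collection (one year, one city, one title) to produce the sliced versions of the counts.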

Download the data set here: texas_newspapers_naturallanguageprocessing.tar.gz (748 MB, compressed archive)
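For readers unfamiliar with .tar.gz archives, the download above can be unpacked with Python's standard library. The destination directory name here is arbitrary, and the internal layout of the extracted files is not documented in this page.

```python
import tarfile

def extract_archive(archive_path, dest="nlp_results"):
    """Unpack a .tar.gz archive (like the data set above) into dest.

    Returns the list of member paths as a quick sanity check.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest)
        return tar.getnames()

# Example (assumes the archive has been downloaded to the working directory):
# names = extract_archive("texas_newspapers_naturallanguageprocessing.tar.gz")
# print(names[:5])  # peek at the first few files
```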

Visualization Source Code

We are sharing the original source code of the interactive data visualizations created for this research project:

