Data and Source Materials for Mapping Texts
The Mapping Texts project team relied heavily on the Texas Digital Newspaper Collection’s archive of historic documents. The primary source material for this project was a set of 232,500 newspaper pages digitized and converted to plain text using optical character recognition (OCR). The team ran this collection through several computational analyses to explore the possibilities of computer-aided “distant reading” of large document collections.
List of Cities and Publications in Document Collection
Natural Language Processing Results
We are sharing data files containing the results of various natural language processing tools that we used on the document collection:
- Word Counts: Lists of words in descending order of frequency, sliced and diced by year, location, and title
- Named Entity Recognition Results: People, places, and organizations recognized in the text
- Topic Models: Clusters of commonly co-occurring words found in the collection as a whole, and as sliced and diced into different groupings (by historical era, by city and historical era, and by paper and historical era)
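To illustrate the simplest of these outputs, a frequency-ranked word list like the ones in this data set can be produced from OCR plain text with a few lines of Python. This is a minimal sketch, not the project's actual processing pipeline; the sample text and the tokenization rule are illustrative assumptions:

```python
from collections import Counter
import re

def word_counts(text):
    """Return (word, count) pairs in descending order of frequency."""
    # Lowercase and keep only alphabetic runs; real OCR output would
    # need extra cleanup (hyphenation across lines, OCR errors, etc.).
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()

# Hypothetical sample standing in for one OCR'd newspaper page.
sample = "Cotton prices rose. Cotton farmers in Texas watched cotton markets."
print(word_counts(sample)[:3])  # → [('cotton', 3), ('prices', 1), ('rose', 1)]
```

Grouping such counts by year, city, or publication title is then a matter of running the same tally over the corresponding subset of pages.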
Download the data set here: texas_newspapers_naturallanguageprocessing.tar.gz (748 MB, compressed archive)
Visualization Source Code
We are sharing the original source code of the interactive data visualizations created for this research project:
- Visualizing Digitization Quality
The source code for the interactive visualization of text recognition quality is available in a GitHub repository for downloading and re-use.
- Assessing Language Patterns
The source code for the interactive visualization of language patterns is available in a GitHub repository for downloading and re-use:
- Download Link: https://github.com/wi-design/Mapping-Texts