Mapping Texts (http://mappingtexts.library.unt.edu)

Visualization: Assessing Language Patterns
Geoff McGhee, Thu, 29 Mar 2012


This is the second visualization from the project, showing the results of several natural language processing analyses of the original texts. It plots the language patterns embedded in 232,567 pages of historical Texas newspapers, as they evolved over time and space. For any date range and location, you can browse the most common words (word counts), named entities (people, places, etc.), and highly correlated words (topic models).
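As a rough illustration only (the visualization's actual backend is not shown here, and the corpus below is a hypothetical stand-in for the project's newspaper pages), a date-filtered word count of the kind the browser exposes might look like this:

```python
from collections import Counter

# Hypothetical miniature corpus: (year, page text) pairs standing in
# for the project's 232,567 digitized newspaper pages.
pages = [
    (1861, "war news from the front reached town today"),
    (1862, "the war and cotton trade dominated the news"),
    (1925, "oil boom brings new business to the county"),
]

def top_words(pages, start, end, n=3):
    """Count the most common words on pages within a year range."""
    counts = Counter()
    for year, text in pages:
        if start <= year <= end:
            counts.update(text.lower().split())
    return counts.most_common(n)

# Most common words on Civil War-era pages in this toy sample.
print(top_words(pages, 1861, 1865))
```

A real pipeline would also strip stop words and normalize OCR variants, but the core operation, counting word frequencies over a slice of the collection, is this simple.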

See the visualization at language.mappingtexts.org »

Paper: Topic Modeling on Historical Newspapers
Geoff McGhee, Tue, 27 Sep 2011

As part of our ongoing research into text-mining historical newspapers, we’ve been experimenting with new methods for extracting language patterns scattered across millions of digitized words. One of the most intriguing methods for such work to have emerged in recent years is topic modeling. The idea of topic modeling is, at base, to use mathematical and statistical models to identify words that are related to one another and then group them into “topics.” The hope is to thereby expose underlying patterns in the language of large-scale collections that would be hard, if not impossible, to see otherwise.

And so we have been experimenting with topic modeling for this project, concentrating on the popular MALLET software package. We recently presented a paper based on this work at the meeting of the Association for Computational Linguistics in June 2011, “Topic Modeling on Historical Newspapers.”

Download the paper: Topic Modeling on Historical Newspapers

From Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (2011), pp. 96-104.

Visualization: Digitization Quality
Geoff McGhee, Thu, 12 May 2011



This visualization plots the quantity and quality of 232,567 pages of historical Texas newspapers, as they spread out over time and space. The graphs plot the overall quantity of information available by year and the quality of the corpus (measured by comparing the number of words we can recognize to the total number scanned). The map shows the geography of the collection, grouping all newspapers by their publication city, and can display both measures for any location. Clicking on a particular city opens a detailed view of its individual newspapers, where you can examine the same figures title by title. A timeline of historical events related to Texas is also available for context.
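The quality metric described above, recognized words as a share of all words scanned, can be sketched in a few lines. The dictionary and sample page here are hypothetical stand-ins for the project's actual word lists and OCR output:

```python
# Hypothetical known-word dictionary; the real project would check
# OCR tokens against a much larger English/Spanish word list.
known_words = {"the", "cotton", "market", "county", "railroad", "prices"}

def ocr_quality(tokens):
    """Return the fraction of OCR tokens recognized as real words."""
    if not tokens:
        return 0.0
    recognized = sum(1 for t in tokens if t.lower() in known_words)
    return recognized / len(tokens)

# Sample page with two OCR errors ("mxrket", "qqq") among six tokens.
page = ["The", "cotton", "mxrket", "prices", "qqq", "county"]
print(round(ocr_quality(page), 2))  # 4 of 6 tokens recognized -> 0.67
```

Aggregating this ratio by year and by publication city yields exactly the kind of quantity-versus-quality comparison the visualization presents.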

See the visualization »

Mapping Texts: Our Idea
Andrew J. Torget, Thu, 12 May 2011

Mapping Texts is a collaboration between the University of North Texas and Stanford University with a pretty simple mission:  experiment with new methods for finding and analyzing meaningful patterns embedded within massive collections of historical newspapers.
IDEA BEHIND THE PROJECT
Why do we think this is important?  Because, quite simply, historical newspapers are currently being digitized at a scale that is rapidly overwhelming our traditional methods of research.  The Chronicling America project (a joint endeavor of the National Endowment for the Humanities and the Library of Congress), for example, recently digitized its one millionth historical newspaper page, and they will soon make millions more freely available online.
What can scholars do with such an immense wealth of information?  Currently, they cannot do much.  Without tools and methods capable of handling such large datasets—and thus sifting out meaningful patterns embedded within them—scholars typically find themselves confined to performing only basic word searches across enormous collections.  While such basic searches can, indeed, find stray information scattered in unlikely places, they become increasingly less useful as datasets continue to grow in size.  If a search for a particular term yields 4,000,000 results, even those results constitute a dataset far too large for any single scholar to analyze in a meaningful way using traditional methods.
Our goal, then, is to help solve this problem by combining the two most promising methods for finding meaning in such massive collections of historical newspapers:  text-mining and visualization.
THE NEWSPAPERS
For this project, we are experimenting on a collection of about 232,500 pages of historical newspapers digitized by the Texas Digital Newspaper Program at the University of North Texas Library.  These newspapers were digitized in conjunction with the Chronicling America project, as well as under UNT’s own digital newspaper program, and were selected because:
  • With nearly a quarter million pages, we could experiment with scale.
  • The newspapers were all digitized according to the standards set by the national Chronicling America project, providing a uniform sample.
  • The Texas orientation of all the newspapers gave us a consistent geography for our visualization experiments.
BUILDING PROTOTYPES
And so we have been experimenting with mining language patterns and mapping the results.  We are currently working on a series of prototypes of what this might look like, which we will be releasing on this site as we develop them.  These prototypes consist of visualizations that build on top of text- and data-mining that we are doing with the newspaper collection.
Our first prototype, which is nearly ready for initial release, examines the quantity and quality of information available in our newspaper collection as it spread out across both time and space.  Future prototypes will attempt to answer specific research questions using the collection.
THE PARTNERSHIP
The project relies on two teams, one at the University of North Texas and one at Stanford’s Bill Lane Center for the American West, that each bring unique skills to the project.  At UNT, we have expertise in the historical content and a particularly talented team of computer scientists specializing in natural language processing for the text-mining side of the project.  At Stanford’s Lane Center, we have a team deeply skilled in both complex historical visualizations and spatial mapping.  (For more detail on the folks behind the project, see the People section.)
Between the two teams, it seemed to us, we have a unique opportunity to conduct experiments in what might be possible through text-mining and visualizing a large collection of historical newspapers.