Saturday, November 17, 2012

Crime, Punishment, and Data Mining



We've come a long way.
This week’s readings started off with “The History of Humanities Computing,” a pretty comprehensive survey by Susan Hockey.  The story starts back in 1949 (much longer ago than I thought) with a project by Father Roberto Busa, a Jesuit priest who needed a way to index the Latin in the works of Thomas Aquinas (which are abstruse in English, let alone Latin).  This was an enormous task, so he turned for help to the new-fangled computer.  On one hand, this allowed him to do what he needed to do, but on the other, the technology available to Busa imposed pretty significant limitations.  Magnetic tape and punch cards were the only recording media around, and both can only be read sequentially, unlike modern Random Access Memory (RAM), which can be read in any order.  And that is to say nothing of the difficulties in transporting all that tape or “truckloads of punched cards” hither and yon.  The article sums up the early period as “characterized as being hampered by the technology.”  A sort of community started to grow in the 1960s that included at least one conference (Yorktown Heights, 1964) and a few “centers dedicated to the use of computers in humanities.”

Anyone remember these?
Computing in the humanities made even more progress through the 1970s, 1980s and 1990s.  In just these few decades, it consolidated its gains both in terms of technology and penetration into the culture of the humanities.  This is the period during which personal computers became available (along with email!), operating systems made major evolutionary leaps, data storage grew, and the ability to access information departed from the rigid linearity of the early period.  Of all the advances in this era, Hockey insists that the most important was the formulation of a standard encoding format for the humanities (Text Encoding Initiative, or TEI), which was “the first systematic attempt to categorize and define all the features within humanities texts that might interest scholars.”

The Internet was the big story of the new millennium and while the scholarly community initially greeted it with suspicion, they were eventually won over by the increased access to information, the potential for collaborative effort, and the emergence of a body of knowledge that would not have been possible otherwise.

Visualizers help us understand data.
The readings go on to focus on data: what is it, and what do we do with it?  How do we meaningfully depict the data we acquire in our research?  Data Visualization gives a historical sketch, lays out Jacques Bertin’s six visual variables and throws in some of the coolest visualizers I’ve ever seen.  Two stood out: Martin Wattenberg’s Baby Name Voyager and The Jobless Rate for People Like You.  Both take massive amounts of data and present it in the most usable, pliable format I could think of.

The Old Bailey Online is essentially just another way to take an incredibly unwieldy amount of information and pack it down into useful, searchable chunks.  In this case, researchers have digitized the 127 million words in the proceedings from the Old Bailey between 1674 and 1913 and created a database.  The advanced search page allows the user to input various search terms.  I searched for murders, guilty, drawn-and-quartered, all periods, and found that this horrific punishment was administered to those who threatened the life of the King or fomented revolt, those who counterfeited or otherwise altered the coinage of the realm, and those who brought Catholicism into the kingdom (it was just a few decades after the Reformation, after all).  The first two articles that went along with it ("Digitising History From Below: The Old Bailey Proceedings Online, 1674-1834" and "Teaching with the Old Bailey Online") gave a useful developmental history of the site and described how it has been used in the classroom at all levels.  One of the most interesting potential uses has to do with teaching statistics.  What better way to see how data lines up statistically than with a tool like the Old Bailey Online?

The last article, “Data Mining with Criminal Intent,” described a project that married the Old Bailey Online with Zotero and used Voyeur as a visualizer.  Zotero brings the ability to save whole search strings rather than individual results, and Voyeur enables the user to analyze data and display it using a variety of plotting tools.  From a utility standpoint, the authors discovered that while “ordinary historians” latched on to Zotero pretty easily, they struggled more with Voyeur.  In addition to some trouble with the learning curve, the subjects needed some tutoring to appreciate Voyeur’s ability to help them envision new kinds of relatedness.  I must admit that I spent some time playing with Voyeur, and while I certainly see its potential, I can also understand why people might have a hard time with it.  There's so much a user can do with it that the learning curve becomes a little bit overwhelming.
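
For anyone curious what "analyze data and display it" boils down to at its simplest, here is a rough Python sketch of a word-frequency count with a plain-text bar chart.  To be clear, this is my own toy illustration and not how Voyeur actually works: the sample sentence is made up (it is not from the Old Bailey Proceedings), and the function names and the tiny stop-word list are assumptions of mine.

```python
import re
from collections import Counter

# Made-up stand-in text for illustration; NOT a quotation from the Proceedings.
SAMPLE_TEXT = (
    "The prisoner was indicted for theft. "
    "The jury found the prisoner guilty of theft."
)

# A tiny, assumed stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "was", "for", "of", "a", "and"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def text_bar_chart(counts, top_n=5):
    """Render the top_n most frequent words as rows of '#' characters."""
    rows = []
    for word, n in counts.most_common(top_n):
        rows.append(f"{word:<10} {'#' * n}")
    return "\n".join(rows)


if __name__ == "__main__":
    print(text_bar_chart(word_frequencies(SAMPLE_TEXT)))
```

Even a toy like this hints at why the visual side matters: the counts are easy to compute, but it is the picture that makes the pattern jump out.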

Computing for the humanities has come a long way since the 1950s.  If I can ever figure out Voyeur, maybe I can take it a little farther.
