For readers today, there is a wealth of information available about any given work, including its date of publication, its author's biographical details, its genre, and information about previous editions and formats. On top of that, nearly any book seems to be available through online retailers like Amazon, through public or university libraries, or as an e-book. With all this information about recent works at our fingertips, it was surprising to learn how much there still is to know about collections of written works spanning only the past few centuries.
On Friday, September 20, Ted Underwood, professor of English Literature and Information Science at the University of Illinois, addressed the participants of the Mellon Sawyer Seminar, answering questions about how data and the absence of data relate to his work.
Underwood’s area of expertise is distant reading: drawing conclusions about large collections of written work by analyzing metadata and other relatively objective characteristics. As Underwood described in Digital Humanities Quarterly, distant reading was a technique in literary study long before computers were equipped to help with it. When researchers first gained access to computational methods like optical character recognition, which makes the full text of a work searchable, they applied them to a variety of problems and soon learned which kinds of questions computers are suited to answer. People remain far better than computers at closely reading a single text or one author's body of work in order to characterize it, and at answering “why” questions, such as why an author made a specific literary choice.
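To make the idea concrete, here is a minimal sketch of the kind of metadata-level analysis distant reading starts from: counting how many volumes in a catalog fall into each publication decade and genre. The record fields and example data are invented for illustration; they are not Underwood's actual pipeline or data.

```python
from collections import Counter

# Hypothetical, simplified metadata records; a real catalog carries many more fields.
catalog = [
    {"title": "A", "year": 1871, "genre": "fiction"},
    {"title": "B", "year": 1884, "genre": "poetry"},
    {"title": "C", "year": 1902, "genre": "fiction"},
    {"title": "D", "year": 1915, "genre": "fiction"},
]

# Count volumes per (decade, genre): the kind of aggregate view distant reading works from.
by_decade_genre = Counter(
    (record["year"] // 10 * 10, record["genre"]) for record in catalog
)

for (decade, genre), count in sorted(by_decade_genre.items()):
    print(f"{decade}s  {genre:<8} {count}")
```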
In his most recent project, Ted Underwood and his team worked with the HathiTrust Digital Library, a collection of seventeen million volumes gathered from over sixty research libraries. With a library that large, it’s hard to imagine a question one couldn’t answer, given enough time, but Underwood made it clear that even such an extensive collection has its limitations. This brought us to the question of who and what is absent from the data Underwood considers. Since most of the contributing libraries are located in the US, Canada, and Europe, the inclusion of international works correlates with their prestige: award-winning novels from other countries are likely to appear, but not much popular literature. In addition, certain genres, such as children’s and juvenile fiction and newspaper fiction, are underrepresented. Another issue, less specific to computational methods, is how far conclusions generalize to the population: fiction writing certainly isn’t a perfect representation of humanity, or even of the United States, with regard to gender and race. A final, more technical challenge is that many books published after 1923 are still under copyright. To keep these volumes usable in projects like Underwood’s, the library exposes the frequencies of words on each page rather than the running text itself.
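That last point is easier to see with a small sketch. The data layout below, a list of pages mapping words to counts, is a deliberately simplified assumption rather than the actual schema HathiTrust distributes; it only shows why per-page frequency counts are enough for many distant-reading questions even when the running text is withheld.

```python
# Hypothetical per-page word counts for one volume; the real release format differs.
volume_pages = [
    {"the": 54, "whale": 7, "sea": 3},
    {"the": 61, "whale": 2, "ship": 5},
    {"the": 49, "sea": 8, "captain": 4},
]

def relative_frequency(pages, word):
    """Occurrences of `word` per thousand tokens across all pages."""
    total_tokens = sum(sum(counts.values()) for counts in pages)
    word_tokens = sum(counts.get(word, 0) for counts in pages)
    return 1000 * word_tokens / total_tokens

print(f"'whale' per 1000 words: {relative_frequency(volume_pages, 'whale'):.2f}")
```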
With all of these factors considered, Underwood reiterated that you can’t model away bias or the limitations of a collection. What he can do is trace how the characterization of gender in literature has changed over time by finding words and phrases that correlate with the gender of the character being described, use machine learning to detect the genre of a piece of writing, or examine what kinds of backgrounds the authors in a large body of literature come from. In examining the models Underwood used to answer these questions, readers come to see that in literature, shifting human definitions of categories like genre matter more than inherent qualities of the work. Informed by knowledge like this, humanists can be prepared to ask different questions.
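As a rough illustration of the genre-detection idea, here is a minimal sketch of a bag-of-words text classifier, assuming scikit-learn is available. The tiny snippets and labels are invented for the example; Underwood's actual models, features, and training data are far richer than this.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: a few invented snippets labeled by genre.
texts = [
    "the detective examined the body in the locked room",
    "she confessed her love beneath the harvest moon",
    "the inspector traced the stolen jewels to the docks",
    "their hearts raced as the letters arrived each spring",
]
labels = ["detective", "romance", "detective", "romance"]

# Bag-of-words features feeding a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["the constable questioned the butler about the missing will"]))
```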
Arriving at conclusions like these requires a team with a specific skill set. Underwood sought a dual appointment between the Information Science and English departments at the University of Illinois in part to collaborate with researchers trained in quantitative methods like computer programming and statistics. The divide in technical skills between departments seems less than ideal: Underwood’s projects are focused on literature, and literature students would benefit greatly from training opportunities within their own department to work on projects like these. In the question and answer portion of the talk, Underwood and several seminar participants identified strategies for including more quantitative methods in humanities degree programs. These included department-specific methods courses; sending students to a related department like information science or statistics for introductory classes; and even a course on the subject of evidence, co-taught by statisticians and humanists and covering both quantitative and qualitative methods. Perhaps the most distinctive approach proposed at the Friday seminar was designing degree programs with a lighter core and encouraging deeper dives into subfields: a quantitative introductory course taken in a student’s home discipline, followed by courses from other departments for more detailed content. Underwood emphasized the importance of offering such courses in graduate school; without them, research roles are restricted to students who arrive already equipped with the necessary technical skills, which would introduce additional biases into the field.
Underwood’s talks and published work challenge readers to expand their definition of what methods literary scholarship can employ, what questions these methods can be used to answer, and how universities should adapt their course offerings to handle a changing landscape of research projects. It was great to hear from him at this talk, and I look forward to the rest of the series.