Ted Underwood has been poking around in the massive HathiTrust database for a few years now, and it’s taught him that libraries are anything but uniform.
During his talk with the Sawyer Seminar on Friday, Sept. 20, at the University of Pittsburgh, Underwood, a professor of English and information science at the University of Illinois, recalled his childhood self walking through physical libraries looking for books. Back then, it never occurred to him that other libraries might not catalogue their books in exactly the same way.
But now, working with the metadata attached to digitized books in the HathiTrust database, he has come to appreciate the human side of library science.
He admitted he has learned quite a bit about how physical libraries operate. While the Library of Congress provides national cataloguing standards, many cataloguing decisions are ultimately left to individual librarians, he said.
Underwood was the second speaker in the yearlong Sawyer Seminar series “Information Ecosystems: Creating Data (and Absence) from the Quantitative to the Digital Age.” He gave a public lecture on Thursday, Sept. 19, and then spoke with Sawyer Seminar participants on Friday.
Many of Underwood’s projects deal with large collections of data, including the HathiTrust database, which stores the digitized collections of several university libraries, totaling over 17 million volumes. In the past, he and his collaborators have used that data to show that the proportion of fiction written by women declined from the nineteenth century to the mid-twentieth century. He and his team at Illinois are now mining HathiTrust for their latest project, which involves genre markers.
It was here that Underwood discovered that the categories assigned to books were sometimes inconsistent, and in some cases not obvious at all.
Even with all that data at his disposal, Underwood was quick to point out that the corpus includes several blind spots. HathiTrust draws from American university libraries, so the collection reflects that bias: mostly American books (with non-American titles coming from the most prominent authors) and mostly titles that would appeal to academics. There is very little pulp fiction to be had here, nor does he claim that this collection is exhaustive or complete.
The demographics of fiction publication, he reminded the audience, don’t mirror society at large. That’s true now, and it was especially true of the nineteenth century, a period he studies often.
Underwood began his talk with a discussion of data use, including the possibilities opened up by mass digitization: should existing collections be made more complete, or should scholars undertake “heroic recovery projects” of lesser-known repositories of fiction, such as the Australian newspapers in which Katherine Bode found more than 20,000 works of fiction?
While datasets, bias, and research questions attracted attention, Sawyer Seminar participants were also eager to discuss the implications of Underwood’s decision several years ago to move from an appointment solely in English to a joint appointment in English and information science. Underwood explained that he made the move because it allows him to work with graduate students in the computer science department. While English students may be eager to learn computational methods, it can be difficult to add instruction in those skills alongside an already demanding curriculum.
Underwood’s account of his move, and his suggestion that graduate students who want to work on digital projects should have at least two semesters’ worth of programming experience, led to a larger discussion about graduate program requirements and the infrastructural changes needed to implement a digital humanities curriculum.
That set off a brisk exchange about the practicalities of implementing such a curriculum and what it would look like. Some departments are already trying to give their graduate students a foundation in working with data computationally, while others encourage students to build a basis in the home department and seek specialist knowledge in other departments.
Underwood also suggested a “coding across the curriculum” approach, similar to the “writing across the curriculum” programs common in universities.
Even as the need to provide students with quality quantitative training became clear, the qualitative character of Underwood’s work, and of humanists’ work more broadly, was never far from sight.
Some literary scholars, for example, worry about the limitations of computational methods in “reading” and interpreting literature. Computational methods have become fairly good at answering who? and what? questions, but how? and why? questions are more difficult for those methods to get at.
For example, the work of William Shakespeare has attracted much attention from scholars, including one study that found that, of all Shakespeare’s tragedies, Othello shares the most characteristics with his comedies. A fascinating finding, to be sure, but while the study determined that the similarity was “intentional,” it could not explain why Shakespeare made that choice.
For his part, Underwood emphasized the need to couple the distant reading done with computational methods with close reading that humanists excel at.
Shifting between distant and close reading while respecting the demands that both methods require may be one of the constant challenges of the digital humanities. Another is the continued discussion about the present, and future, of the field itself.
Briana Wipf is a PhD student in the English department, where she studies medieval literature and the digital humanities. Follow her on Twitter @Briana_Wipf.