Self-perpetuating data and “guided serendipity”: Colin Allen’s reflection on Charles Darwin, topic modeling, and Margaret Floy Washburn
Colin Allen, distinguished professor in the Department of History and Philosophy of Science at the University of Pittsburgh, embraces the fact that the textual data he uses in his computational work often depends not on his choices but on someone else’s. Data does not emerge, fully formed, for him and his colleagues to study. He discussed this characteristic of data when he addressed the Information Ecosystems Mellon Sawyer Seminar at the University of Pittsburgh on Friday, Feb. 28. Data, as Johanna Drucker has memorably argued, isn’t data as much as it’s capta. If we remember that the Latin data means “things given” while capta means “things taken,” Drucker’s argument makes sense. The stuff we generate in our experiments or gather in the world doesn’t exist naturally. Rather, it’s taken or made (in which case I suppose we’d call it facta). Drucker’s formulation reminds us that data isn’t neutral but often exists according to the individual choices of this or that researcher, or this or that curator. Allen points out that the textual corpus he uses for a given project (Darwin’s reading list, for example) is his data, and that it yields its own data when he runs a topic model of it. The topics produced by the model are data he can then interpret in his own work. In this way, Allen explained to me when I interviewed him for an upcoming episode of the Information Ecosystems podcast, data has a habit of begetting more data. “I think it’s important to realize that …