Information Ecosystems

Information, Power, and Consequences

Data Pipelines, Data Fluidity: Colin Allen on the “Useful Fiction” of Curated Data

By: Jane Rohrer
On: February 28, 2020
In: Colin Allen
Tagged: Big Data, Darwin, data pipelines, Topic modeling

Colin Allen, distinguished professor in the Department of History and Philosophy of Science at the University of Pittsburgh, is both an invited speaker and an ongoing participant in our Seminar. On February 28th, Dr. Allen talked with his fellow participants about his work on what he (and others) call “data pipelines.” Broadly speaking, a data pipeline means that data are collected and recorded in one of many particular ways but are eventually used for purposes other than those for which they were originally collected. And this means, Dr. Allen pointed out, that data are highly fluid, flexible, and even self-perpetuating.

An especially potent example of this in Allen’s own work is his current role as Associate Editor of the Stanford Encyclopedia of Philosophy. While this project has a single, discrete start date back in 1995, it has been anything but static since then; as of March 2018, the site has approximately 1,600 entries, each of which is routinely reviewed and updated. Each new post adds to what is now a highly dynamic reference work containing data culled from all over the web: a pipeline, indeed.

Dr. Allen thoughtfully pointed out that as our relationship to data changes over our collective futures, it is important to remember that data do not enter our world on their own; rather, they are collected and curated. Allen co-authored an article, “Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks,” with Jaimie Murdock and Simon DeDeo in 2017. Charles Darwin left careful records of the books he read from 1837 to 1860, making this piece of his biographical information an especially rich site for data analysis. Allen and his co-authors used topic modeling to represent each of Darwin’s listed texts as “a mixture of topics.” While this method can (and did!) certainly teach us a lot about Darwin (which ideas he was reading about most frequently at certain points of his life, for example), we must remember that the data set (Darwin’s reading list) was long ago curated for purposes having nothing to do with its eventual use (Allen and his colleagues’ project).

Additionally, our changing relationship to data requires that we acknowledge our contemporary perspective. Allen acknowledged that data sets that might seem downright tiny to us today were thought of as massive by their initial collectors; and what is now “Big” Data might someday seem tiny in a way yet-unknowable by current standards. As data sets grow, it becomes more important than ever to acknowledge them as collected, curated information sets—not reflections or statements of simple truths.

Toward the end of his talk, Dr. Allen referred to the usage of digital platforms as “useful fiction”; by this, he means that a digital platform is unlikely ever to convey a truly faithful representation of all facets of any topic. Here, I’m reminded of our first Sawyer Seminar speaker, Matthew Edney, who asked us to think about maps: what they show, and what they necessarily leave out. When we hop onto Google Maps and search for a coffee place, the app is unlikely to actually pull up an exhaustive list of everywhere we might buy coffee; instead, we are handed a curated list of what Google presumes to be the most helpful answers to a user’s inquiry. Similarly, a user of the Stanford Encyclopedia of Philosophy is not greeted by a truly exhaustive collection of everything that has ever had to do with the topic at hand. Such a collection would be as impossible as it would be frustrating. As Dr. Allen suggested, the more data one has, the more difficult it is to keep it active, current, and useful.

So although data do indeed create representative fictions (digital worlds where we are provided but a small sliver of what is really “out there”), this fiction can be deployed to highly useful ends. While Darwin’s self-reported reading list does not, of course, stand as a total representation of all his thought patterns, choices, and influences between 1837 and 1860, it does suggest important things about how he navigated the world, and it can help confirm (or refute) important details of his biography.

While we can’t predict how data collected today might be used tomorrow, we can—as Dr. Allen so expertly points out—take ongoing care to track their movements, however fluid they might be. In uncertain times, it remains a hopeful fact that we very well may be creating or curating data right now that someday changes the world, in ways yet-unknowable.
