The Sawyer Seminar’s November 15 guest was Dr. Sabina Leonelli. Dr. Leonelli teaches Philosophy and History of Science at the University of Exeter, where she is also the co-director of the Egenis Centre for the Study of Life Sciences. Her book Data-Centric Biology: A Philosophical Study was published by the University of Chicago Press in 2016. She is now working on translating her 2018 book, Scientific Research in the Era of Big Data, into English from its original Italian. Both deal abundantly with recent shifts and innovations in how researchers process and understand scientific data. In both her public talk on Thursday, November 14, and the Sawyer Seminar lunch discussion, Dr. Leonelli walked us through the fundamentals of, and distinctions between, Big Data, Open Data, and FAIR Data (Findable, Accessible, Interoperable, Re-Usable). These distinctions, and mindful discussions about them, are increasingly necessary as, to quote Leonelli in Data-Centric Biology, “the rise of data centrism has brought new salience to the epistemological challenges involved in processes of data gathering, classification, and interpretation and…the social structures in which such processes are embedded” (2).
As Leonelli described them, Big Data are defined by their capacity to move, to be (re)used across situations and disciplines, and to be (re)aggregated into different useful and usable platforms. To elaborate: while there is “no rigorous definition of Big Data,” we use them, in general, to complete large-scale projects that could not valuably be done at a smaller scale, often to extract new insights about an entire world, community, or issue. Humanistic examples in practice include using huge aggregates of digitized libraries to track trends in literature over years, decades, or centuries, as our previous participant Ted Underwood did in Distant Horizons: Digital Evidence and Literary Change. Big Data are employed particularly often in fields like healthcare, for tasks ranging from deciding how many doctors to staff at a hospital at a given time to digitizing health records into electronic health records (EHRs).
But Big Data, as Leonelli thoughtfully pointed out, are not necessarily Open or FAIR. Leonelli is a vocal supporter of Open Data and is actively involved in several Open Science initiatives. Open Data can be freely used, re-used, and redistributed by anyone and for anything, subject only, at most, to the requirement of attribution. Anyone who has used their iPhone’s pedometer or purchased a smartwatch to track their heart rate can infer that the ultimate uses of that personal data by companies are often anything but open; Google’s recent acquisition of Fitbit, for example, has users questioning whether they really want strangers in Silicon Valley examining their blood sugar levels or heart arrhythmias, and whether individual consumers, too, should be granted access to a slice of this data.
A stop along the way to making data completely Open is to make them FAIR. Dr. Leonelli was quick to point out that we are, for the most part, a far cry from making this standard a reality. Making data FAIR requires researchers to acquire, develop, and deploy a huge range of (often new) skills across multiple disciplines. This process can be both time-consuming and expensive, two conditions that, as Dr. Leonelli pointed out, can be very difficult to pitch when attracting funding and funders. And while we can certainly begin training students and young people within school systems that increasingly integrate a wide array of computational training and data collection methods, this is not exactly a speedy process, either. In our talk on Friday, and throughout the semester, members of the Sawyer Seminar have recognized the multitude of institutional roadblocks to integrating often rapidly changing computational methods into the generally slow-moving bureaucracy of the U.S. university. In the humanities, some have theorized that the traditional instructor-centric model of classroom learning is especially inhospitable to digital approaches, which so often necessitate “collaboration among students, peer review, or opportunities for students to design aspects of their courses.” Leonelli shared that similar systemic issues come up in her home field of Philosophy of Science, where the complex ethics of data collection and use, and of making data FAIR, can have a hard time cohering with otherwise rigid institutional standards and practices.
Dr. Leonelli gave several case studies to illustrate the true complexity of data collection and use, and what she views as the necessity for Big Data to become Open and FAIR. One example, the collection of data related to cassava crops, was especially striking; cassava is one of the world’s most abundant food crops, and thus an important indicator of global climate changes and trends. Dr. Leonelli pointed out here that while the data collected could indeed be highly useful in, for example, yielding more nutritionally dense crops, it is very important to remember that there is not yet any “objective” standard of data quality. This means that the efficacy of data is entirely dependent on the (inherently flawed and variable) conditions under which they were collected. Furthermore, the process of Big Data collection is not somehow above or immune to the same biases, hierarchies, and, ultimately, potential failures as any other method. Leonelli asked us to consider, for example, what it would and could mean to make this cassava data responsive, useful, and available to the local farmers whose livelihoods actually depend on the crops. This would, of course, be neither a quick nor a simple process, and a large part of it, as I’ve pointed out, would be seeking funding. There are already websites and organizations that provide invaluable aggregations of data just like the cassava example, but our access to them depends on those organizations remaining motivated to keep funding them. Dr. Leonelli pointed out that it is, right now, difficult to find funding for any project beyond ten years, let alone for centuries of human posterity.
As the term implies, the issues of Big Data are vexing and plentiful. And we are, Dr. Leonelli suggested, unlikely to find a simple solution to any of them any time soon. Perhaps the more important thing, for now, is to keep having careful, informed discussions about data. It should not suit us, in other words, to assume that the above-mentioned issues will all work out on their own with no intervention; neither does it work to assume that data are fixed representations of ultimate, universal truths. And we folks of the Sawyer Seminar are thrilled to partake in conversations that complicate and interrupt these assumptions.