On November 15th, Dr. Sabina Leonelli spoke to the participants of the Sawyer Seminar. As a historian and philosopher of science, she is currently the Co-Director of the Exeter Centre for the Study of the Life Sciences and has recently worked on a five-year grant about data access, openness, and infrastructure entitled The Epistemology of Data-Intensive Science. In her conversations at the Friday seminar, Dr. Leonelli focused on practices surrounding data collection and reuse, aiming to move towards a future of Open Data as the standard.
One of her recent publications, an op-ed entitled “Data Shadows: Knowledge, Openness, and Absence,” spoke directly to many of the themes central to the Sawyer Seminar. She defines shadows, beyond being mere absences in data, as “the multiplicity of motives, goals, and conditions through which data may be construed as (in)significant, partial or complete, (un)intelligible, or (in)accessible.” Consequently, the degree to which these shadows exist depends on the context in which the data is considered, especially when data is being reused by parties other than the original creators. In her conversation with seminar participants, Dr. Leonelli discussed her vision for data use and distribution today, which involves most data being open-access, rather than owned by companies or individuals, as well as having the necessary metadata and methodological descriptions to make it valuable to others. This allows data to be reused, recontextualized, and further studied as more information becomes available, potentially allowing for discoveries in numerous fields.
Dr. Leonelli identified several challenges in creating and maintaining open data, as well as some potential benefits. One major challenge is that the standards imposed on data depend on the field in which the data is used. Using her studies of cassava as an example, she noted that no one dataset will work for biologists, nutritionists, chefs, and consumers of the crop, since each cares about different qualities of the plant. In addition, in the Harvard Data Science Review, she claims that there is no such thing as raw data, a point she supported in the talk: data is always collected with some intention and context in mind, and is shaped by the biases of the researcher, so it will often need modification for reuse in another context. Different disciplines have different research cultures, and sometimes data is collected by people with no formal research experience, such as farmers in the aforementioned cassava example. However, these farmers often had more crop-specific expertise and knowledge of consumers’ preferences than scientists did, so they lent a unique and valuable perspective that led researchers to collect additional metadata and expand their taxonomies. Integrating data producers into research can improve both the quality of the data collected and the way it is organized, so that it reflects a greater variety of expertise and is more widely reusable.
Historically, data, once collected, has been considered private: owned by universities and researchers, by companies, or, in the case of personal data, by individuals. Today, nearly everyone produces some kind of data that is collected by companies, for example through a fitness tracker, credit card, or smartphone. If, as Dr. Leonelli suggests, the individuals who produce such data do not own it in the future, restrictions could still be placed on its use, while the possibility of people being exploited into selling their data would be eliminated, a risk that falls especially on vulnerable groups with minimal assets or income. Aside from necessary restrictions, the data would be available for reuse, even outside the discipline where it originated.
Leonelli also asserted that data isn’t a perfect reflection of the world, which aligned closely with part of Dr. Sandra González-Bailón’s recent talk. González-Bailón discussed a futuristic novel in which everything is archived in a global repository; in the novel, when data is omitted, the only remaining option is to correct reality. Clearly, the exact opposite is true in real life, and the anecdote humorously exposed how far some data can be from reconstructing the world. Even so, Dr. Leonelli advocates for a more thorough system of data infrastructure, much like a large-scale archive for digital information. She emphasized that the preservation of physical objects is often planned hundreds of years into the future, whereas it’s difficult to consistently preserve data for a mere ten years, due to the speed of progress in computing. This infrastructure would leave data accessible, with systematic formats and file sizes, as well as relevant metadata. Information about how the data was collected is important as well, especially when human decisions are part of how the data was constructed, as in previous seminar speaker Dr. Ted Underwood’s work, wherein some genres were determined by a group of researchers reading the texts.
As an example of this infrastructure, Leonelli cited a multiyear project entitled GARDIAN, which aims to combine information from across the field of crop science into one widely accessible database. While that example has been successful, it’s just as easy for infrastructure projects to fail. A 2011 project, the Cancer Biomedical Informatics Grid (caBIG), was designed with computational and structural requirements in mind, but with less emphasis on the needs of stakeholders, including researchers themselves. This project was scaled back significantly and lost funding due to concerns about its effectiveness. Clearly, creating a viable system for long-term data storage and use is a difficult task. Further, many of the platforms and file formats used to store data today are rendered obsolete in only a few years, so maintaining a usable resource tends to require substantial long-term funding.
When asked about the implications for training, Dr. Leonelli first explained that no one can be expected to master all disciplines and their data needs, so it’s important to know when to seek out an expert outside of your field, and what such an expert can help you with. She also suggested instituting a required class in data ethics and governance for any field that works with data, as well as brief internships in the field so that students can see what this work looks like in a professional setting.
Dr. Leonelli’s talk was enlightening for anyone who is a user, curator, or even subject of data — it encouraged listeners to think about who owns data in an age when information is ubiquitous, and how one might prepare for futures that today’s researchers have yet to imagine, by building lasting infrastructure.