Information Ecosystems
Information, Power, and Consequences

Open Data and data infrastructure across disciplines

By: Erin O'Rourke
On: November 14, 2019
In: Sabina Leonelli
Tagged: Data, Information, Open Data, Philosophy of Science, Sabina Leonelli
Image of Cassava plant, Franz Eugen Köhler via Wikimedia Commons (Public Domain)

On November 15th, Dr. Sabina Leonelli spoke to the participants of the Sawyer Seminar. As a historian and philosopher of science, she is currently the Co-Director of the Exeter Centre for the Study of the Life Sciences and has recently worked on a five-year grant about data access, openness, and infrastructure entitled The Epistemology of Data-Intensive Science. In her conversations at the Friday seminar, Dr. Leonelli focused on practices surrounding data collection and reuse, aiming to move towards a future of Open Data as the standard.

One of her recent publications, an op-ed entitled “Data Shadows: Knowledge, Openness, and Absence,” spoke directly to many of the themes central to the Sawyer Seminar. She defines shadows, beyond being mere absences in data, as “the multiplicity of motives, goals, and conditions through which data may be construed as (in)significant, partial or complete, (un)intelligible, or (in)accessible.” Consequently, the degree to which these shadows exist depends on the context in which the data is considered, especially when data is being reused by parties other than the original creators. In her conversation with seminar participants, Dr. Leonelli discussed her vision for data use and distribution today: most data would be open-access, rather than owned by companies or individuals, and would carry the metadata and methodological descriptions needed to make it valuable to others. Data documented in this way can be reused, recontextualized, and further studied as more information becomes available, potentially allowing for discoveries in numerous fields.

Dr. Leonelli identified several challenges in creating and maintaining open data, as well as some potential benefits. One major challenge is that the standards imposed on data depend on the field in which the data is used. Using her studies of cassava as an example, she noted that no one dataset will work for biologists, nutritionists, chefs, and consumers of the crop, since each cares about different qualities of the plant. In addition, in the Harvard Data Science Review, she claims that there is no such thing as raw data, a point she supported in the talk: data is always collected with some intention and context in mind, and is shaped by the biases of the researcher, so it will often need modification for reuse in another context. Different disciplines have different research cultures, and sometimes data is collected by people with no formal research experience, such as the farmers in the cassava example. However, these farmers often had more crop-specific expertise and knowledge of consumers’ preferences than scientists did, so they lent a unique and valuable perspective that led researchers to collect additional metadata and expand their taxonomies. Integrating data producers into research can improve both the quality of the data collected and the way it is organized, so that it reflects a greater variety of expertise and becomes more widely reusable.

Historically, data have been considered private once collected: owned by universities and researchers, by companies, or, in the case of personal data, by individuals. Today, nearly everyone produces some kind of data that companies collect, whether through a fitness tracker, credit card, or smartphone. If, as Dr. Leonelli suggests, individuals who produce such data do not own them in the future, restrictions could still be placed on their use, while eliminating the possibility that people, especially vulnerable groups with minimal assets or income, are exploited into selling their data. Aside from necessary restrictions, the data would be available for reuse, even outside the discipline where they originated.

Leonelli also asserted that data isn’t a perfect reflection of the world, a point that aligned closely with part of Dr. Sandra González-Bailón’s recent talk, in which González-Bailón discussed a futuristic novel in which everything is archived in a global repository, so that when data is omitted, the only remaining option is to correct reality. Clearly, the exact opposite is true in real life, and the anecdote humorously exposed how far some data can be from reconstructing the world. Even so, Dr. Leonelli advocates for a more thorough system of data infrastructure, much like a large-scale archive for digital information. She emphasized that the preservation of physical objects is often planned for hundreds of years into the future, whereas it’s difficult to consistently preserve data for a mere ten years, due to the speed of progress in computing. This infrastructure would leave data accessible, with systematic formats and file sizes, as well as relevant metadata. Information about how the data was collected is important as well, especially when human decisions are part of how the data was constructed, as in previous seminar speaker Dr. Ted Underwood’s work, in which some genres were determined by a group of researchers reading the texts.

As an example of this infrastructure, Leonelli cited a multiyear project entitled GARDIAN, which aims to combine information from across the field of crop science into one widely accessible database. While that example has been successful, it’s just as easy for infrastructure projects to fail. A 2011 project, the Cancer Biomedical Informatics Grid (caBIG), was designed with computational and structural requirements in mind, but with less emphasis on the needs of stakeholders, including researchers themselves. This project was scaled back significantly and lost funding due to concerns about its effectiveness. Clearly, creating a viable system for long-term data storage and use is a difficult task. Further, many of the platforms and file formats used to store data today are rendered obsolete in only a few years, so maintaining a usable resource tends to require substantial long-term funding.

When asked about implications for training, Dr. Leonelli first explained that no one can be expected to master all disciplines and their data needs. Instead, it’s important to know when to seek out an expert outside of your field, and what such an expert can help you with. She also suggested instituting a required class in data ethics and governance for any field that works with data, as well as brief internships in the field to see what this work looks like in a professional setting.

Dr. Leonelli’s talk was enlightening for anyone who is a user, curator, or even subject of data — it encouraged listeners to think about who owns data in an age when information is ubiquitous, and how one might prepare for futures that today’s researchers have yet to imagine, by building lasting infrastructure.

