1 Researcher Feedback

This page describes the meeting held on Friday, August 11, 2023 at Penn State that followed an Open Science Bootcamp.

The meeting sought feedback from the community on the topic:

“Making shared data about human participants maximally useful for secondary research”

Participants

Rick Gilmore, Penn State, Department of Psychology
Briana Wham, Penn State, University Libraries
Kevin Clair, Penn State, University Libraries
Jeff Spies, Principal, 221b (via Zoom)
Jenae Neiderhiser, Penn State, Department of Psychology
Doug Teti, Penn State, Department of Human Development and Family Studies
Nicole Lazar, Penn State, Department of Statistics
C. Lee Giles, Penn State, Professor of Information Science and Technology
Deborah Ehrenthal, Professor of Biobehavioral Health, Director, Socail Science Research Institute
Alaina Pearce, Assistant Research Professor, Department of Nutritional Sciences

Themes

Several invitees with specific data and metadata perspectives were unable to attend, the team shifted the focus to a discussion of what current Databrary offers researchers in terms of restricted access data storage and sharing and the systematic reporting of some core demographic data elements.

Findings

Implementing the most flexible and powerful search capability will be essential. Elastic Search was recommended.
- Databrary currently implements a version of Solr. The Databrary 2.0 sandbox has experimented with GraphQL.
Future research is going to depend on placing people in context by linking to other data sources, including those tied to geography. How do we link this broadly to their community (location, socioeconomic status, etc.) while preserving participant privacy? Latitude and longitude are the most useful, but also the riskiest.
- The Play & Learning Across a Year (PLAY) Project is collecting sharing permission in some locations that will permit sharing of Census Block Group-level data. The project uses Databrary extensively now, and will share data with the research community when the project is complete. The intent is to share county-level Census data for all of the \(~n=1,000\) families, since this level of geographic resolution does not require participant permission. So, PLAY represents one illustration of how broader, geographically informed, data might be shared more widely.
- The PLAY project also implements the same algorithm used by the NIMH Data Archive for generating a globally unique identifer (GUID). PLAY will use these identifiers.
Are there requirements for the way the data are organized? Variables and format? Are there controlled vocabularies or standards for sex, ethnicity/race, etc? How do we know how the information was collected (what was asked and who answered)? Better documentation of what measurements were and how they were collected would be optimal. One solution would be need to upload their protocol.
- Databrary 1.0 has semi-controlled vocabularies for some variables (see the page on metadata), but these are limited in scope, and only some investigators use them. Users may upload additional data, including information about questionnaires, context for the data collection, but this is at the discretion of individual researchers.
- PLAY has shared its entire protocol openly, all parent report questions, and has already published a protocol illustrating how those questions are handled after data collection. But none of these practices are standard in the field.
NIH has published the PhenX toolkits and many of these are implemented in the NIH Toolbox and NIH Infant and Toddler (Baby) Toolboxes. Could that work be build upon?
- Yes. There have been some preliminary discussions with the lead PIs on the NIH Baby Toolbox about storing and sharing the video portions of those data on Databrary. But there is much more work that could be done in this vein.
What is your objective for the data that will be in Databrary? What data should be collected from everybody will be based on the goal of what is included in Databrary?
- Video about natural behavior needs a home and is much richer when it is linked with other kinds of data. Databrary aspires to be that home.