5  Demographic data

This page summarizes the demographic data currently reported by shared Databrary volumes.

Databrary analytics

Databrary maintains a website with information about the system’s users and data. The site can be found at:

https://databrary.github.io/analytics.

The data reported here are updated regularly, typically on a weekly or bi-weekly basis.

Data summarized here reflect the state of the Databrary system on 2023-08-31.

Release levels

By default, Databrary exposes certain variables to the public and hides others from public view. Who can see what files and variables is indicated by a set of labels and icons that apply system-wide called the “Databrary Release Levels”:

The above is taken from https://databrary.org/support/irb/release-levels.html.

By default, all files uploaded to the system are assigned a Private release level.

Databrary spreadsheet

Researchers who are authorized by their institutions to access Databrary may curate their own datasets and add a number of types of data and metadata. These data are entered into, and visualized via a “spreadsheet” type of interface:

Figure 5.1: Illustrative spreadsheet from https://databrary.org/volume/8

The virtue of using this spreadsheet interface is that the current Databrary system permits a comma-separated value (CSV) format file of the spreadsheet to be easily exported. Indeed, the databraryr package makes use of this feature, as does the analytics pipeline.

In the images below, the blue ‘checked’ variables are those the system presumes researchers will want to enter in the typical or default case. The grey ‘checked’ variables are generated internally by the Databrary system. For example, the ID variable is a system-wide-unique value for a specific research participant’s data.

(Human) participant data

Most, but not all of Databrary’s current data depict humans. We have a small set of non-human animal data, and have plans to expand that in the future.

For now, when creating a new dataset/volume, the user is prompted to design a spreadsheet that is specific to the needs of their study. The following figure shows one of the “help” images associated with the process of designing a spreadsheet:

Figure 5.2: Participant-level variables supported by Databrary’s spreadsheet

Most participant-level variables are shown to Public audiences (globe icon), with the exception of birthdate and disability.

Exact birthdates are shown only to Authorized Investigators or Affiliates with full Databrary access and from datasets (volumes) that have been fully shared.

Exact age

If a researcher provides a valid test date and a valid birthdate, the system automatically calculates an age in days.

Here is information about age data from shared Databrary volumes:

Figure 5.3: Histogram of age data from shared Databrary volumes; source: https://databrary.github.io/analytics/participant-demographics.html#age

Figure 5.4: Histogram of age data from shared Databrary volumes; source: https://databrary.github.io/analytics/participant-demographics.html#age

Note: Databrary does not currently support entering age values directly.

Gender

Databrary does not currently use a standard vocabulary to collect information about participant gender, nor does the system support information about related measures like natal sex.

Nevertheless, there appear to be \(n=142\) volumes that report gender in some form. Here is a tabular summary of the values for the gender variable across those volumes:

Race

Some \(n=127\) shared volumes report information about race. Here is a tabular summary of the values for the race variable:

Note that the list contains some standard categories (OMB), but also a number of unique, non-standard, values.

Ethnicity

Note that the list contains some standard categories (OMB), but also a number of unique, non-standard, values.

Gestational age

The analytics report does not currently report data about the extent to which gestational_age is reported.

Pregnancy term

\(N=21\) volumes report pregancy term.

A summary table of the reported results shows that these are not standardized:

Birth weight

Only \(n=3\) shared volumes report birth weight. Here is a histogram of shared birth weights converted to pounds and ounces:

Summary of birth weight data from shared Databrary volumes; source: https://databrary.github.io/analytics/include/img/birthweight-hist-1.png
Note

The Play & Learning Across a Year project collects birth weight data among other measures, uses Databrary extensively, will eventually share the full dataset with the broader research community.

Disability

Some \(n=86\) volumes report on disability. Here is a table of the reported values:

Language

Some \(N=111\) volumes report participant_language. Here is a table of the reported values:

Country

Only \(n=2\) shared volumes report country. Here is a table summarizing that data:

State

The analytics report does not currently report data about the extent to which state is reported.

However, the Play & Learning Across a Year (PLAY) Project has collected and will eventually report the U.S. States in which data was collected. This project is also exploring reporting more micro-level geographic data (e.g., county), and where participant (and IRB) permission exists, Census Block Group level data.

Other variables

The analytics report does not currently report data about the extent to which user defined variables like pilot status, exclusion criteria, task names, condition names, group names, context factors, or setting are reported in the Databrary spreadsheet. Most of these user-reported variables require additional metadata to be useful to others. Future work should focus on standardizing how that metadata is reported and incorporated into Databrary.

Figure 5.5: Variables related to a participant or session’s ‘pilot’ status supported by Databrary’s spreadsheet

Figure 5.6: Variables related to a participant or session’s status as ‘included’ or ‘excluded’ from a final dataset supported by Databrary’s spreadsheet

Figure 5.7: Variable(s) related to the task or tasks associated with a participant or session supported by Databrary’s spreadsheet

Figure 5.8: Variable(s) related to an experimental condition associated with a participant or session supported by Databrary’s spreadsheet

Figure 5.9: Variable(s) related to some grouping condition associated with a participant or session supported by Databrary’s spreadsheet

Figure 5.10: Variable(s) related to the context associated with a participant or session supported by Databrary’s spreadsheet

Tags and Keywords

Databrary supports attaching free-form tags and keywords to single participant datasets or entire volumes.

There are \(n=410\) unique tags and keywords. Here is a word cloud that depicts them:

Figure 5.11: Most common tags and keywords across shared Databrary volumes; source: https://databrary.github.io/analytics/tags-and-keywords.html