Databrary 2.0
As of April 2024, Databrary has begun a rewrite led by Montrose Software.
We are making use of two (currently private) repositories for the rewrite:
Once internal licensing discussions have concluded, we may make the repositories public.
Technology stack
Frontend
- TypeScript
- ReactJS
- SCSS
- Webpack and npm
Backend
- Python
- Django
- PostgreSQL
- FFmpeg/Elastic Transcoder
- Docker
Requirements
This section provides additional information about the requirements for specific aspects of the Databrary 2.0 application.
Schema
The Databrary 1.0 schema can be found here.
Montrose recommends that we implement separate tables for individuals and institutions. These entities are combined into a single party table in the current schema.
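A minimal sketch of what the proposed split might look like. It uses Python's built-in sqlite3 module purely for illustration (Databrary 2.0 targets PostgreSQL and Django), and every table and column name below is hypothetical rather than taken from the Databrary schema.

```python
import sqlite3

# Hypothetical table and column names; the real schema may differ.
SCHEMA = """
CREATE TABLE institution (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE individual (
    id             INTEGER PRIMARY KEY,
    first_name     TEXT NOT NULL,
    last_name      TEXT NOT NULL,
    email          TEXT NOT NULL UNIQUE,
    institution_id INTEGER REFERENCES institution(id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one table per entity type."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = build_demo_db()
conn.execute("INSERT INTO institution (id, name) VALUES (1, 'Example University')")
conn.execute(
    "INSERT INTO individual (first_name, last_name, email, institution_id) "
    "VALUES ('Ada', 'Lovelace', 'ada@example.edu', 1)"
)
# An individual points at an institution instead of sharing a table with it.
rows = conn.execute(
    "SELECT i.last_name, n.name FROM individual i "
    "JOIN institution n ON n.id = i.institution_id"
).fetchall()
```

Because an individual row requires an email and a name, an incomplete record can no longer be stored as an institution by accident.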
User Access Privileges
Registration workflow
Registration consists of multiple steps. The following shows requested modifications to the Databrary 1.0 workflow.
Create Account
The reference page for the Databrary 1.0 registration process is here:
Fields are as follows:
- First and Middle Name (required)
- Last Name (required)
- Email (required)
- Add instruction text that specifies an official institutional email must be used for this field (e.g., @psu.edu).
- Implement email validation (against existing database of valid institutional email, flag @gmail.com, etc.)
- Permit free-text email if validation fails
- Affiliation (required)
- Type-ahead search of existing database of authorized Institutions.
- If no matching institution exists, the user can enter a new institution, but should be notified.
- Change label to Institutional Affiliation.
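The email checks listed above could be sketched roughly as follows. The domain sets are hypothetical sample data (only psu.edu appears in the requirements), and the four result labels are an assumption made for illustration.

```python
# Hypothetical sample data; in practice these would come from the database
# of authorized institutions and a maintained list of free mail providers.
INSTITUTIONAL_DOMAINS = {"psu.edu", "nyu.edu"}
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}


def classify_email(email: str) -> str:
    """Return 'institutional', 'flagged', 'unknown', or 'invalid'."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or "." not in domain:
        return "invalid"
    if domain in FREE_MAIL_DOMAINS:
        return "flagged"
    # Accept subdomains such as "mail.psu.edu" as well as exact matches.
    if any(domain == d or domain.endswith("." + d) for d in INSTITUTIONAL_DOMAINS):
        return "institutional"
    return "unknown"
```

Under this sketch, only 'invalid' would block the form. 'flagged' and 'unknown' addresses would still be accepted as free text, in line with the requirement above, but could be surfaced to staff for review.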
Get Started
- Require user to confirm that they have read and agree to the Databrary Access Agreement.
- Link to the Terms and Conditions of Use.
Confirm email
Set Password
- Require a strong password (increase the minimum number of required characters from 7 to 14).
- Validate password to ensure that password is strong
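One possible shape for the password check is sketched below. Only the 14-character minimum comes from the requirements; the character-class rules are assumptions, since "strong" is not yet defined here.

```python
MIN_LENGTH = 14  # from the requirement above


def password_problems(password: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means acceptable."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append(f"must be at least {MIN_LENGTH} characters")
    # The rules below are assumed for illustration, not agreed requirements.
    if not any(c.islower() for c in password):
        problems.append("must contain a lowercase letter")
    if not any(c.isupper() for c in password):
        problems.append("must contain an uppercase letter")
    if not any(c.isdigit() for c in password):
        problems.append("must contain a digit")
    return problems
```

Returning every problem at once, rather than failing on the first, lets the Set Password form show the user everything that needs fixing in one pass.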
Volume interface
- (new): Create bibliographic contributor field. Allow sorting of authors.
- (new): New volumes have a single owner who must be an Authorized Investigator.
- (remove): “Does this volume correspond to a published paper…” and the related lookup of a published paper or resource and completion of reference info. Linked resources will instead be entered in the Add Links tab below.
- (remove): “Add keywords” interface (push to >2.0)
- (future): Pull keywords from related/linked articles
- (change): Default Volume access to Private
- (change): Separate (single) volume owner from Collaborators. Allow transfer of ownership here.
- (remove): “Extend access to … affiliates” checkbox and related functionality.
- (remove): “Investigator (read/write/share)” option for collaborators.
- (change): Make “Read only” default access level for all added Collaborators.
- (discuss): Add expiration date for access to volume for all collaborators.
- (discuss): How to simplify
- 2024-08-06: Required variables:
- File name
- Sharing release level
- File type (from extension)
- (change): “Enter title or paper/dataset citation”
- (future): Way to search Databrary for related datasets.
- (change): Lookup existing funder, but permit user-entered value(s)
- (change): Merge FILE RELEASE LEVELS data with “added on”, “sessions”, “participants” panel
- (change): “HOW TO CITE” field has user editable authors, Databrary specific info is added by the system
- (new): OWNER sub-panel
- (change): Fix column size so that long names and affiliations aren’t cut-off
- (remove): “Create highlight” button and associated workflow. Re-implement in >2.0
- (remove): “Show saved display mode” dropdown and associated workflow. May re-implement some portion in >2.0.
- (remove): “Show summary” functionality. May re-implement some portion in >2.0.
- (remove): Save current display mode functionality.
- (new): Separate interface for Materials
- (discuss): Simpler tabular interface for spreadsheet display?
- (remove): Comments. Consider re-implementing in >2.0
Sessions interface
- (remove): Keywords and Tags.
- (remove): Bars that summarize spreadsheet metadata values.
- (remove): File names sort by timestamp uploaded.
- (remove): Video editing within viewer to create highlights.
- (discuss): Better interface layout for previewing videos.
- 2024-08-06
- (change): Create pop-up window for previewing video/audio
- (discuss): Could pop-up viewer allow other file-type “previews”, e.g. PDF, docx, txt? Or push to later timepoint.
- 2024-08-06
- (change): Make “set as highlight” feature more visible. This applies to a file.
- (change): Simpler tabular interface for viewing files, release level, highlight status, etc.
- Columns include: File name, sharing release level, file type, size, last modified (optional)
- (discuss): Move the button for downloading a single file into the table
- (remove): Timeline.
- (discuss): “Download all files as zip”. With large sessions and/or large files, the zip files are also large and require application resources to create. Are there any third party libraries we could use that would off-load this process? Note: This problem is larger with the “Download all folders as zip” function on the volumes page.
Super User/Admin panel
- (discuss): Now that institutions and users are separate in the database schema, can we prevent the accidental creation of new institutions via the API? In Databrary 1.0, it appears that new users who do not give an email are created as parties with the is_institution flag set to true.
User profile
- (discuss): Toggle volume view between short-name and long name.
API
- (discuss): Are we creating a new version of the API?
- (discuss): What are best practices for supporting external scripting access (e.g., via databraryr or databrarypy) while maintaining system security?
- (discuss): We have users who want Databrary to support Cross-Origin Resource Sharing (CORS). How do we do this? Should it be on an application by application basis?
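As a starting point for the CORS discussion, here is a minimal sketch of the application-by-application option: the server echoes the request's Origin back only when it is on an allowlist. The origins shown are hypothetical. In a Django deployment this logic is usually delegated to middleware (for example, the django-cors-headers package and its CORS_ALLOWED_ORIGINS setting) rather than written by hand.

```python
from typing import Optional

# Hypothetical allowlist of registered client applications.
ALLOWED_ORIGINS = {"https://lab.example.edu", "https://tools.example.org"}


def cors_headers(request_origin: Optional[str]) -> dict[str, str]:
    """Return CORS response headers for an allowed origin, else no headers."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            # Echo the specific origin rather than "*" so access stays per-application.
            "Access-Control-Allow-Origin": request_origin,
            # Tell caches that the response depends on the Origin request header.
            "Vary": "Origin",
        }
    return {}
```

A browser blocks cross-origin reads when the header is absent, so unlisted origins are refused by default.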
Scoping
Databrary 2.0 rewrite
While the core of Databrary 1.0 is understood and will be replicated in Databrary 2.0, some questions will be explored that relate to new features:
File uploads/downloads/transcoding
- (discuss): Should video and audio transcoding be automatic or optional; if optional, how can transcoding be triggered?
- Should transcoding be a “premium” service?
- (discuss): More generally, how can we separate uploading from transcoding?
- (discuss): Can we generate and return a thumbnail of an uploaded video as soon as it is successfully uploaded? Right now, the application returns a placeholder image and the user cannot
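On the thumbnail question, extracting a single frame with FFmpeg is much cheaper than a full transcode, so it could plausibly run right after upload. The sketch below only builds the command line. The function name, the default one-second offset, and the file paths are assumptions.

```python
def thumbnail_command(video_path: str, thumb_path: str, at_seconds: float = 1.0) -> list[str]:
    """Build an ffmpeg invocation that writes one frame of the video as an image."""
    return [
        "ffmpeg",
        "-y",                    # overwrite an existing thumbnail
        "-ss", str(at_seconds),  # seek before decoding so only a little data is read
        "-i", video_path,
        "-frames:v", "1",        # stop after a single video frame
        thumb_path,
    ]


# Example (requires ffmpeg on the PATH):
#   import subprocess
#   subprocess.run(thumbnail_command("upload.mp4", "upload.jpg"), check=True)
```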
Spreadsheet
- (discuss): What features of the existing session/slot spreadsheet interface for managing and visualizing demographic data can be implemented easily and at minimal cost using existing libraries?
See also the section below on support for open data schemas.
Security
Can two-factor authentication be added? If so, at what cost?
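To help size the effort, one common second factor is a time-based one-time password (TOTP, RFC 6238), as used by authenticator apps. The sketch below uses only the Python standard library. It is not a proposal for how Databrary should implement two-factor authentication, and a production system would more likely rely on a maintained library.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for the given Unix time."""
    now = time.time() if at is None else at
    counter = int(now // step)  # number of time steps since the epoch
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

The server stores a per-user secret and, at login, compares the submitted code with the value computed for the current time step.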
Admin/Superuser
- (discuss): Can per-institution (across users and projects), per-user (across projects) or per-project storage quotas be implemented? Can warnings be generated when storage amounts are nearing quotas? How could Super Users manage user requests to increase quotas?
- (discuss): Can a more informative administrative console be developed with by-volume, usage, and storage metrics, including shared vs. unshared data? If so, at what cost?
See also section below about other administrative upgrades. This will probably get pushed to later.
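The quota and warning questions above reduce to a small check that could run at the per-institution, per-user, or per-project level. In this sketch the 90% warning threshold and all names are assumptions.

```python
from enum import Enum


class QuotaStatus(Enum):
    OK = "ok"
    WARNING = "warning"
    EXCEEDED = "exceeded"


WARNING_FRACTION = 0.9  # hypothetical threshold for "nearing quota"


def quota_status(used_bytes: int, quota_bytes: int,
                 warn_at: float = WARNING_FRACTION) -> QuotaStatus:
    """Classify storage use against a quota."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    if used_bytes >= quota_bytes:
        return QuotaStatus.EXCEEDED
    if used_bytes >= warn_at * quota_bytes:
        return QuotaStatus.WARNING
    return QuotaStatus.OK
```

A WARNING result could trigger an email to the owner, and an EXCEEDED result could open a request for a Super User to review.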
Access to volumes
- (discuss): Volume access expires after a user-defined date that is no longer than one year from the date of the last update.
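The expiry rule under discussion can be stated as a clamp: the user picks a date, but it cannot fall later than one year after the last update. This sketch approximates "one year" as 365 days and uses hypothetical function names.

```python
from datetime import date, timedelta

MAX_ACCESS_PERIOD = timedelta(days=365)  # "one year", approximated as 365 days


def effective_expiry(requested: date, last_update: date) -> date:
    """Return the requested expiry date, capped at one year after the last update."""
    latest_allowed = last_update + MAX_ACCESS_PERIOD
    return min(requested, latest_allowed)


def access_is_active(today: date, requested: date, last_update: date) -> bool:
    """True while today is on or before the effective expiry date."""
    return today <= effective_expiry(requested, last_update)
```

For example, with a last update on 2024-08-06, a requested expiry of 2026-01-01 would be capped at 2025-08-06.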
Roadmap (>2.0)
These ideas are on the longer-term roadmap. Some of them may be supported by proposals under review (e.g., HNDS-I or NYU TAC).
Support for open data schemas
JSON-LD
- Databrary should eventually support standard schemas wherever practical, specifically in the JSON-LD format. These should use Schema.org properties.
- Examples of properties that seem relevant to Databrary include:
- Person
- Creative Work
- Event
- For a data collection session or change in status on the site.
- Place
- Testing locations, geographical information, locations of institutions
- Data Types
These examples are not exhaustive.
- By support, I mean that the application should add relevant schema information to the JSON data provided by the API.
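As an illustration of what adding schema information to the API's JSON might look like, the sketch below wraps a user record in a Schema.org Person description. The input field names (first_name, last_name, institution) are hypothetical. The output keys are standard JSON-LD keywords and Schema.org properties.

```python
import json


def person_as_jsonld(record: dict) -> dict:
    """Describe a user record as a Schema.org Person in JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "givenName": record["first_name"],
        "familyName": record["last_name"],
        "affiliation": {
            "@type": "Organization",
            "name": record["institution"],
        },
    }


example = person_as_jsonld(
    {"first_name": "Ada", "last_name": "Lovelace", "institution": "Example University"}
)
serialized = json.dumps(example, indent=2)
```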
NIH CDE
- Databrary should also support NIH Common Data Elements (CDEs), especially for “spreadsheet” data elements.
- Examples of CDE properties of a Person
- Gender: https://cde.nlm.nih.gov/formView?tinyId=vx35JcbgJI
- Sex at Birth: https://cde.nlm.nih.gov/deView?tinyId=rGEh0ckdmr
- Race: https://cde.nlm.nih.gov/deView?tinyId=Fakc6Jy2x
- Race/Ethnicity Self-Identification: https://cde.nlm.nih.gov/deView?tinyId=LakF0YkywC
- Ethnicity: https://cde.nlm.nih.gov/deView?tinyId=PtRlg7yLP_
- Disabilities: https://cde.nlm.nih.gov/deView?tinyId=0md12WGtZXE
- Birth date: https://cde.nlm.nih.gov/deView?tinyId=X1mJv5j3jx.
- List of languages: https://cde.nlm.nih.gov/deView?tinyId=7JDyF9o3ie
- Language secondary text: https://cde.nlm.nih.gov/deView?tinyId=mJsWHcxje9W
There is a set of CDEs that NIH endorses. These should probably be the highest priority.
- (discuss): Should specific data elements be represented in the database schema explicitly? Would doing so make it easier to search and filter data by these characteristics?
- (discuss): Should the database schema support linking data about the same individuals? This would require supporting something like the NIMH GUID, and possibly some additional data sharing consent language that permits data linkage.
- (discuss): Should species and other terms commonly used in non-human animal research be incorporated and supported?
It does not appear that there are NIH-Endorsed elements that contain the term “species”.
Improved search and filtering
Ideas from HNDS-I 2024 proposal:
- Broader set of demographic characteristics.
- Index other text documents (protocols, coding manuals) in materials folders.
Index annotation files, return segments
“Building on this foundation, we will upgrade Databrary to support searching within annotation files linked to videos that tag specific behaviors, utterances, or contexts, starting with the most popular annotation file formats stored on Databrary (Datavyu and CHAT)”
— HNDS-I proposal.
“Virtual volumes” or custom collections
To capitalize on enhanced search and filtering and ease data reuse, users must be able to create their own custom collections of video files, video segments, annotations, and other data derived from multiple, primary datasets. The custom collections or “virtual datasets” will link to but not copy parent datasets and their associated metadata.
— HNDS-I proposal.
Workspaces
…We will implement private, flexible, temporary workspaces for datasets that act like folders in cloud storage. Unlike other forms of cloud storage that provide only a temporary home for research data, Databrary’s workspaces will provide a permanent and flexible home that is just a button press away from being made accessible to the broader research community.
– HNDS-I proposal.
Expanding scriptable access
…We will build on the free, open-source, R package, databraryr, that PI Gilmore developed with NSF support and openly released to the research community. Databraryr wraps Databrary API calls into commands that are useful to researchers who want to download shared data from Databrary. We will add data uploading capabilities to the R package to support Aim 3, develop and publish a parallel Python package, databrarypy.
– HNDS-I proposal.
How do we permit scriptable access while protecting Databrary from unwanted “bot” access?
Some users find that the web interface times out when they try to download large numbers of files or very large files.
More flexible downloading
When users select sessions with large numbers of files or very large files, the application creates a zip file. Generating that file consumes central app resources (compute and memory) and can cause long pauses in a user’s experience with the app.
A better solution would be to create the archive offline and notify the user when it is ready to download.
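The offline approach could look roughly like the sketch below: the archive is written to disk outside the request and a callback fires when it is ready. The thread stands in for a real task queue, and the notify callback (for example, sending an email with a download link) is hypothetical.

```python
import threading
import zipfile
from pathlib import Path
from typing import Callable, Iterable


def build_archive(files: Iterable[Path], archive_path: Path) -> Path:
    """Write the files into a zip archive on disk, one entry at a time."""
    # ZIP_STORED skips compression, which gains little on already-compressed video.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as zf:
        for f in files:
            zf.write(f, arcname=f.name)
    return archive_path


def build_archive_in_background(
    files: Iterable[Path],
    archive_path: Path,
    notify: Callable[[Path], None],
) -> threading.Thread:
    """Start the archive job off the request path; call notify when it finishes."""
    def job() -> None:
        notify(build_archive(files, archive_path))

    worker = threading.Thread(target=job, daemon=True)
    worker.start()
    return worker
```

The web request can then return immediately with a "your download is being prepared" message while the archive is built.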
Administrative panel upgrades
- Quotas on per-user, per-institution storage footprints.
- Infrastructure for managing subscriptions, curation assistance, data deposit fees.
- Links to institutional admin panel functions; data footprint, etc.
Version control
Private volumes for peer review
OSF implements this.
Support for institutional subscriptions
These features are described in a proposal submitted to NYU TAC in August 2024.