Workflow

Modified

August 26, 2024

This page describes the tools we use to generate these data and present them on this site.

GitHub

We use GitHub Pages to serve the website files.

At present, we are using a repository associated with the Databrary organization (https://github.com/databrary/analytics). As a result, the analytics site is served at the following URL: https://databrary.github.io/analytics/.

The site is built locally by Rick Gilmore or Andrea Seisler, then pushed to GitHub.

RStudio

We use RStudio as the integrated development environment for the site. Most of the code is in R Markdown and R, with some CSS and JavaScript.

We use a number of R packages in the workflow.

Databraryr

The databraryr package provides a set of tools for interacting with the Databrary API and gathering data from the site. This package may be useful to some analysts whether or not they care about Databrary-specific analytics.

Most data and metadata used in these reports can be accessed by the public, but specific data about individual participants requires that the user be authorized and logged in to the site using the databraryr::db_login() function.

The package may be installed via devtools::install_github(repo="databrary/databraryr").

See https://databrary.github.io/databraryr/ for documentation about the package.

Targets

We use the targets package to generate data and metadata files that are used to create the visualizations and summaries.

Some of the components are rendered on a regular, time-determined basis, like the weekly report. Others are rendered less often, typically quarterly.
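The targets ecosystem offers one way to express this kind of time-based refresh. The following is a minimal sketch, not the site's actual configuration: it assumes the tarchetypes package is installed, and the target names, the two stand-in functions, and the 7-day and 90-day ages are all hypothetical.

```r
library(targets)
library(tarchetypes)

# Hypothetical stand-ins for functions that would live in R/functions.R.
compute_weekly_counts <- function() data.frame(week = Sys.Date(), n = 0L)
compute_quarterly_summary <- function() data.frame(quarter = "Q1", n = 0L)

# tar_age() treats a target as outdated once its stored value is older
# than `age`, so it is rebuilt on the next tar_make() after that point.
pipeline <- list(
  tar_age(
    weekly_counts,
    compute_weekly_counts(),
    age = as.difftime(7, units = "days")
  ),
  tar_age(
    quarterly_summary,
    compute_quarterly_summary(),
    age = as.difftime(90, units = "days")
  )
)

# In a real _targets.R file, the list of targets is the last expression.
pipeline
```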

Most of the targets call functions in R/functions.R. The specific targets can be viewed in the _targets.R file in the root directory of the repository.

A typical workflow to ‘make’ or update the data files is as follows:

library(targets)
tar_make()
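To see what tar_make() does without touching the real project, the sketch below builds a toy pipeline in a temporary directory. The target names and data are made up for illustration; only the targets functions themselves (tar_script(), tar_make(), tar_outdated(), tar_read()) come from the package.

```r
library(targets)

# Work in a throwaway directory so the sketch does not touch a real project.
project_dir <- tempfile("targets-sketch-")
dir.create(project_dir)
old_wd <- setwd(project_dir)

# Write a tiny _targets.R file; the targets here are illustrative only.
tar_script({
  list(
    tar_target(raw, data.frame(week = 1:3, n = c(5, 8, 13))),
    tar_target(total, sum(raw$n))
  )
}, ask = FALSE)

tar_make()                      # builds both targets and stores the results
outdated <- tar_outdated()      # names of targets that would be rebuilt now
total_value <- tar_read(total)  # retrieves a stored target value

setwd(old_wd)
```

Running tar_make() a second time in the same project skips targets whose code and upstream data have not changed, which is what makes repeated updates cheap.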

Quarto

To generate the site, we use Quarto.

A typical command to regenerate the site is the following:

quarto render src

Configuration files using the YAML markup language control the rendering process. These files and the source R Markdown (.Rmd) files used to generate the site are in the src/ directory.

The rendering command creates a full website in the docs/ directory.
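The data update and the render step can be chained from R. The helper below is a hypothetical sketch (no such function is claimed to exist in the repository); it assumes the quarto command-line tool is on the PATH and that it is called from the repository root. Both steps are passed in as functions so that either can be swapped out or skipped.

```r
# Hypothetical helper: update the data files, then render the site.
rebuild_site <- function(
    update_data = function() targets::tar_make(),
    render = function() system2("quarto", args = c("render", "src"))) {
  update_data()

  # system2() returns the exit status of the command; 0 means success.
  status <- render()
  if (!identical(as.integer(status), 0L)) {
    stop("quarto render failed with exit status ", status)
  }
  invisible(TRUE)
}

# Usage, from the repository root:
# rebuild_site()
```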

Package Reproducibility

We use the renv package to track package dependencies.
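A typical renv round trip looks roughly like the sketch below. It assumes renv is installed and that the code runs from the repository root; the lockfile guard is our addition so the sketch does nothing in a directory that is not an renv project.

```r
# Only act if the working directory has an renv lockfile.
has_lockfile <- file.exists("renv.lock")

if (has_lockfile) {
  renv::restore(prompt = FALSE)  # install the versions recorded in renv.lock
  renv::status()                 # report drift between library and lockfile
}

# After deliberately adding or upgrading a package, record the new state:
# renv::snapshot()
```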

Strategy

Some of the data elements in the report change often, but others do not. In many cases we have found it faster and more convenient to download data files from Databrary once and store copies as comma-separated values (CSV) text files in src/csv.
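The caching idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the code the site uses: `read_cached_csv()`, its `fetch` argument (a zero-argument function that returns a data frame), and the 7-day default are all hypothetical.

```r
# Return the cached CSV if it is fresh enough; otherwise call `fetch`
# to get the data again and overwrite the cached copy.
read_cached_csv <- function(path, fetch, max_age_days = 7) {
  age_days <- if (file.exists(path)) {
    as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  } else {
    Inf  # no cached copy yet, so always fetch
  }

  if (age_days > max_age_days) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    utils::write.csv(fetch(), path, row.names = FALSE)
  }

  utils::read.csv(path)
}

# Usage with a fake download function and a temporary cache location.
demo_path <- file.path(tempfile("cache-"), "csv", "demo.csv")
fetch_calls <- 0
fake_fetch <- function() {
  fetch_calls <<- fetch_calls + 1
  data.frame(id = 1:2, n = c(10, 20))
}

first <- read_cached_csv(demo_path, fake_fetch)   # fetches and writes the file
second <- read_cached_csv(demo_path, fake_fetch)  # reads the cached copy
```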

Some of the CSVs contain potentially identifiable human subjects data, so we list those files in .gitignore to keep them out of version control and prevent them from being uploaded to GitHub.
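One way to double-check that protection before pushing is to ask git whether a given file is ignored. The sketch below wraps the `git check-ignore` command; it assumes git is on the PATH and that the working directory is inside the repository. The function name and the file name in the usage line are hypothetical.

```r
# For each path, return TRUE if git would ignore it.
# `git check-ignore -q` exits with status 0 when the path is ignored
# and 1 when it is not.
is_git_ignored <- function(paths) {
  vapply(paths, function(p) {
    status <- system2(
      "git",
      args = c("check-ignore", "-q", shQuote(p)),
      stdout = FALSE,
      stderr = FALSE
    )
    status == 0
  }, logical(1))
}

# Usage, from the repository root (file name is a placeholder):
# is_git_ignored("src/csv/some-sensitive-file.csv")
```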

These data analyses and visualizations have developed piecemeal over several years. They could undoubtedly be optimized and improved.

The primary developer (Rick Gilmore) has had a ‘git-er-done’ attitude toward the project, especially with the latest refactoring to {bookdown}, {targets}, and {renv}.

Gilmore takes some solace in the following quotation from the father of literate programming, Donald Knuth:

…the real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

File-level dependencies

Several files contain historical data and are used for generating the time series plots in the weekly report:

  • src/csv/volumes-shared-unshared.csv
  • src/csv/citations-monthly.csv
  • src/csv/institutions-investigators.csv
  • src/csv/max-ids.csv
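Because the weekly report depends on these files being present, a quick pre-render check can be useful. The following is a minimal sketch; the `missing_history_files()` helper is hypothetical, and only the four file paths come from the list above.

```r
history_files <- file.path("src", "csv", c(
  "volumes-shared-unshared.csv",
  "citations-monthly.csv",
  "institutions-investigators.csv",
  "max-ids.csv"
))

# Return the subset of `paths` that does not exist under `root`
# (the repository root).
missing_history_files <- function(paths = history_files, root = ".") {
  paths[!file.exists(file.path(root, paths))]
}

missing <- missing_history_files()
if (length(missing) > 0) {
  message("Missing historical data files: ", paste(missing, collapse = ", "))
}
```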

Roadmap

  • Fix which targets are run, and when, for each workflow.
  • Separate functions in R/functions.R into separate files based on their context. The current R/functions.R contains mostly old code from the pre-bookdown version of the site. The styles are inconsistent and the large file is hard to maintain.
  • Devise visualizations of assets by investigator/institution.
  • Write a function to determine max party_id and max volume_id interactively.
  • Eliminate YAML params from index.Rmd.
  • Create a JSON file of latitude and longitude values for the Databrary home page.
  • Add state, country summaries to demographics.
  • Allow user to show code.
  • Move from old pipe %>% to new |>.
    • Done for *.Rmd files, but not *.R files as of 2023-02-24.
  • Migrate to Quarto
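For the pipe migration item above, most calls convert by swapping the operator, but the placeholder syntax differs. The sketch below uses made-up data and the built-in mtcars data set; it assumes R 4.2.0 or later, since the `_` placeholder is not available in earlier versions.

```r
counts <- c(4, 9, 16)

# magrittr style (needs magrittr or a package that re-exports %>%):
#   counts %>% sqrt() %>% sum()

# Native pipe, available since R 4.1.0; no package needed.
total <- counts |> sqrt() |> sum()

# magrittr's `.` placeholder becomes `_` in the native pipe (R >= 4.2.0),
# and `_` must be passed to a named argument.
#   mtcars %>% lm(mpg ~ wt, data = .)
fit <- mtcars |> lm(mpg ~ wt, data = _)
```

The native pipe also requires parentheses on the right-hand side, so `x %>% sqrt` must become `x |> sqrt()`.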