

Transfer REDCap data to a database and use in R without exceeding available memory. Compatible with all databases but specifically optimized for DuckDB—a fast and portable SQL engine with first-class integration in Posit products.

Use Case and Solution

Your REDCap project is outgrowing your computer, and you get this error in R when you request data via the API:

🛑 Error: vector memory limit of 16.0 GB reached, see mem.maxVSize()

R objects live entirely in memory, so problems arise when your data gets too big and you try to load all of it into R at once.

redquack’s solution to prevent this error is to:

  1. Request all record IDs in REDCap, and split them into chunks
  2. Request one chunk of the project data at a time
  3. Transfer the chunk of data to a database
  4. Remove the chunk from memory, and repeat from step 2
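
In rough terms, the loop looks like this minimal sketch (fetch_record_ids() and fetch_chunk() are hypothetical placeholders for the API requests; redcap_to_db() handles all of this for you):

conn <- DBI::dbConnect(duckdb::duckdb(), "redcap.duckdb")
record_ids <- fetch_record_ids()                          # step 1: all record IDs
chunks <- split(record_ids, ceiling(seq_along(record_ids) / 1000))

for (chunk in chunks) {
  piece <- fetch_chunk(chunk)                             # step 2: one chunk of project data
  DBI::dbWriteTable(conn, "data", piece, append = TRUE)   # step 3: append the chunk to the database
  rm(piece)                                               # step 4: drop it from memory and repeat
}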

Once complete, you can retrieve your data from the database and use it in R (see Data Manipulation).

Features

redquack has additional features to make it robust and improve user experience:

  • Retry on API request failures
  • Resume from incomplete transfers
  • Convert data types for optimized queries
  • Store timestamped operation logs
  • Show progress bar and status messages
  • Play sound notifications (quacks on success 🦆)

Installation

From CRAN:

# install.packages("pak")
pak::pak("redquack")

Development version:

pak::pak("dylanpieper/redquack")

These packages are used in the examples but are not imported by redquack:

pak::pak(c("dplyr", "duckdb", "keyring"))

Setup API Token

An API token allows R to interface with REDCap, and it should be stored securely. I recommend storing it with the keyring package. For example:

keyring::key_set("redcap_token")

Basic Usage

Data from REDCap is transferred to a database via a DBI connection in chunks of record IDs:

library(redquack)

duckdb <- DBI::dbConnect(duckdb::duckdb(), "redcap.duckdb")

result <- redcap_to_db(
  conn = duckdb,
  redcap_uri = "https://redcap.example.org/api/",
  token = keyring::key_get("redcap_token"),
  record_id_name = "record_id",
  chunk_size = 1000  
)

The function returns a list of metadata with class redcap_transfer_result:

  • success: Logical indicating whether the transfer completed with no failed chunks
  • error_chunks: Vector of chunk numbers that failed processing
  • time_s: Numeric value for total seconds to transfer and optimize data

These metadata are useful for programming export pipelines and ETL workflows. The actual data is stored in the database and accessed via the connection.
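
For example, here is a minimal sketch of a guard you might add to an export pipeline, reusing the result object from above:

# Stop the pipeline if any chunk failed; otherwise report the elapsed time
if (!result$success) {
  stop("Transfer failed for chunks: ", paste(result$error_chunks, collapse = ", "))
}
message("Transfer completed in ", round(result$time_s, 1), " seconds")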

Database Structure

The database created by redcap_to_db() contains two tables:

  1. data: Contains all exported REDCap records with optimized column types

    data <- DBI::dbGetQuery(duckdb, "SELECT * FROM data LIMIT 1000")
  2. log: Contains timestamped logs of the transfer process for troubleshooting

    log <- DBI::dbGetQuery(duckdb, "SELECT * FROM log")

Data Types

Data is imported as VARCHAR/TEXT for consistent handling across chunks.

For DuckDB, data types are automatically optimized after transfer to improve query performance:

  • INTEGER: Columns with only whole numbers
  • DOUBLE: Columns with decimal numbers
  • DATE: Columns with valid dates
  • TIMESTAMP: Columns with valid timestamps
  • VARCHAR/TEXT: All other columns remain as strings

In DuckDB, you can inspect the column types with a query:

DBI::dbGetQuery(duckdb, "PRAGMA table_info(data)")

You can also automatically convert data types in R using readr:

readr::type_convert(data)

To optimize query performance with other databases, alter the database table manually.
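
For example, here is a hedged sketch assuming a PostgreSQL connection named pg and an age column that should be stored as an integer:

# Hypothetical example: cast a text column to INTEGER in PostgreSQL
DBI::dbExecute(pg, "ALTER TABLE data ALTER COLUMN age TYPE INTEGER USING age::integer")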

Data Manipulation

Manipulate your data with familiar dplyr syntax. The only difference is that you reference the database table first and collect the data into memory last. Everything in between stays the same. Under the hood, dplyr builds a lazy query from its verb functions, and the database engine executes that query when you call collect().

library(dplyr)

demographics <- tbl(duckdb, "data") |>
  filter(is.na(redcap_repeat_instrument)) |>
  select(record_id, age, race, sex, gender) |>
  collect()
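
If you want to see the SQL that dplyr generates for DuckDB, inspect the lazy query with show_query() before collecting:

tbl(duckdb, "data") |>
  filter(is.na(redcap_repeat_instrument)) |>
  select(record_id, age, race, sex, gender) |>
  show_query()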

When you collect your data into memory last, manipulating big data becomes SO MUCH FASTER. The following example uses data with 2,825,092 rows x 397 columns:

system.time(
    duckdb |>
    tbl("data") |>
    collect() |>
    group_by(redcap_repeat_instrument) |>
    summarize(count = n()) |>
    arrange(desc(count))
)
#>   user  system elapsed
#>  5.048   5.006   6.077

system.time(
    duckdb |>
    tbl("data") |>
    group_by(redcap_repeat_instrument) |>
    summarize(count = n()) |>
    arrange(desc(count)) |>
    collect()
)
#>    user  system elapsed
#>   0.040   0.015   0.040

You can also write a Parquet file directly from DuckDB and use it with arrow. A Parquet file will be about 5 times smaller than a DuckDB file and is easier to share:

DBI::dbExecute(duckdb, "COPY (SELECT * FROM data) TO 'redcap.parquet' (FORMAT PARQUET)")
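
As a sketch of reading the file back without loading everything into memory (assuming the arrow package is installed), open it as a dataset and use the same dplyr verbs:

library(arrow)

demographics <- open_dataset("redcap.parquet") |>
  filter(is.na(redcap_repeat_instrument)) |>
  select(record_id, age, race, sex, gender) |>
  collect()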

Remember to close the connection when finished:

DBI::dbDisconnect(duckdb)

Collaboration Opportunities

While this package is optimized only for DuckDB, I invite collaborators to help optimize it for other databases. Focus your edits on R/optimize_data_types.R. Feel free to submit a PR and share any other ideas you may have.

Other REDCap Interfaces