Transfer REDCap data to a database and use in R without exceeding available memory. Compatible with all databases but specifically optimized for DuckDB—a fast and portable SQL engine with first-class integration in R/Posit products.
Motivation
R objects live entirely in memory, which causes three problems unless you use a specialized framework:
- You must load full datasets even if you only need a subset
- Unused objects still consume memory
- Large datasets can easily exceed available memory
redquack’s solution to this problem is to:
- Request all of the REDCap record IDs and split them into chunks
- Process one chunk of REDCap data at a time in a single R object
- Remove each object from memory after it has been transferred to the database
Features
- Auto-resume from incomplete transfers
- Auto-retry for API request failures
- Auto-convert data types for DuckDB
- Timestamped operation logs
- Status messages and progress bar
- Sound notifications (🔊 🦆)
Installation
From CRAN:
# install.packages("pak")
pak::pak("redquack")
Development version:
pak::pak("dylanpieper/redquack")
These packages are used in the examples and are not imported by redquack:
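Judging from the code below, the helper packages are DBI, duckdb, keyring, and dplyr (this list is inferred from the examples, not a definitive dependency list):

```r
# Packages used in this README's examples (inferred; not imported by redquack)
# install.packages("pak")
pak::pak(c("DBI", "duckdb", "keyring", "dplyr"))
```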
Setup API Token
Your REDCap API token allows R to interface with REDCap and should be stored securely. I recommend using the keyring package to store your API token. For example:
keyring::key_set("redcap_token")
Basic Usage
Data from REDCap is transferred to a database via a DBI connection in chunks of record IDs:
library(redquack)
duckdb <- DBI::dbConnect(duckdb::duckdb(), "redcap.duckdb")
result <- redcap_to_db(
conn = duckdb,
redcap_uri = "https://redcap.example.org/api/",
token = keyring::key_get("redcap_token"),
record_id_name = "record_id",
  chunk_size = 1000
  # Increase chunk_size on systems with more memory (faster)
  # Decrease chunk_size on memory-constrained systems (slower)
)
The function returns a list with class redcap_transfer_result:
- success: Logical; TRUE if the transfer completed with no failed processing
- error_chunks: Vector of chunk numbers that failed processing
- time_s: Numeric; total seconds to transfer and optimize the data
Database Structure
The database created by redcap_to_db() contains two tables:
- data: Contains all exported REDCap records with optimized column types
- log: Contains timestamped logs of the transfer process for troubleshooting

data <- DBI::dbGetQuery(duckdb, "SELECT * FROM data LIMIT 1000")
log <- DBI::dbGetQuery(duckdb, "SELECT * FROM log")
Data Types
Data is imported as VARCHAR/TEXT for consistent handling across chunks.
For DuckDB, data types are automatically optimized after transfer to improve query performance:
- INTEGER: Columns with only whole numbers
- DOUBLE: Columns with decimal numbers
- DATE: Columns with valid dates
- TIMESTAMP: Columns with valid timestamps
- VARCHAR/TEXT: All other columns remain as strings
In DuckDB, you can inspect the resulting column types:
DBI::dbGetQuery(duckdb, "PRAGMA table_info(data)")
You can also automatically convert data types in R using readr:
readr::type_convert(data)
To optimize query performance with other databases, you must alter the data table manually.
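As a sketch, the manual step for a PostgreSQL connection might look like the following (the column name and target type are assumptions for illustration; ALTER syntax varies by database):

```r
# Hypothetical example for PostgreSQL: cast a text column to INTEGER.
# Adjust the column name, type, and syntax for your database.
DBI::dbExecute(
  conn,
  "ALTER TABLE data ALTER COLUMN age TYPE INTEGER USING age::INTEGER"
)
```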
Data Manipulation
Query and collect the data with dplyr:
library(dplyr)
demographics <- tbl(duckdb, "data") |>
filter(is.na(redcap_repeat_instrument)) |>
select(record_id, age, race, sex, gender) |>
collect()
If you collect() your data into memory in the last step, after the database has executed the query, it can make a slow process nearly instantaneous. The following example data is 2,825,092 rows x 397 columns:
system.time(
records <- duckdb |>
tbl("data") |>
collect() |>
group_by(redcap_repeat_instrument) |>
summarize(count = n()) |>
arrange(desc(count))
)
#> user system elapsed
#> 5.048 5.006 6.077
system.time(
records <- duckdb |>
tbl("data") |>
group_by(redcap_repeat_instrument) |>
summarize(count = n()) |>
arrange(desc(count)) |>
collect()
)
#> user system elapsed
#> 0.040 0.015 0.040
You can also write a Parquet file directly from DuckDB and use arrow. A Parquet file will be about 5 times smaller than a DuckDB file:
DBI::dbExecute(duckdb, "COPY (SELECT * FROM data) TO 'redcap.parquet' (FORMAT PARQUET)")
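The Parquet file can then be queried lazily with arrow; this sketch mirrors the dplyr example above (it assumes the arrow package is installed):

```r
library(arrow)
library(dplyr)

# open_dataset() reads the Parquet file lazily; collect() materializes results
records <- open_dataset("redcap.parquet") |>
  group_by(redcap_repeat_instrument) |>
  summarize(count = n()) |>
  collect()
```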
Remember to close the connection when finished:
DBI::dbDisconnect(duckdb)
Collaboration Opportunities
While this package is only optimized for DuckDB, I invite collaborators to help optimize it for other databases. The pathway I suggest right now is to target your edits in R/optimize_data_types.R. Feel free to submit a PR and share any other ideas you may have.
Other REDCap Interfaces
REDCapR (R package)
REDCapTidieR (R package)
tidyREDCap (R package)
redcapAPI (R package; includes a package comparison table)
REDCapSync (R package; in development)
PyCap (Python module)