| Title: | Simulation-Based Tools for Bioacoustic Study Design |
|---|---|
| Description: | Many bioacoustic data workflows rely on manual review (i.e., validation) of a subset of call files to provide information to statistical models that account for misclassification by automated algorithms. Because manual review can be prohibitively expensive, simulation can be a valuable tool to aid the design of studies that use validation. This package provides user-friendly functions to reduce the programming burden of simulation studies that compare validation sampling designs. Simulations assume the count-detection model, which is a realistic model for bioacoustic data, especially for bats. For more information, see Oram et al. (2025) <doi:10.1214/25-AOAS2096>. |
| Authors: | Jacob Oram [aut, cre, cph] (ORCID: <https://orcid.org/0009-0001-8405-529X>), Katharine Banner [aut] (ORCID: <https://orcid.org/0000-0002-7103-0042>), Christian Stratton [aut] (ORCID: <https://orcid.org/0000-0001-8051-2185>), Kathryn Irvine [aut] (ORCID: <https://orcid.org/0000-0002-6426-940X>) |
| Maintainer: | Jacob Oram <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-06-08 09:44:41 UTC |
| Source: | https://github.com/j-oram/validationexplorer |
A dataframe in the format output from running run_sims. This exists solely for the purpose of making documentation of visualization functions easier.
parameter. Parameters (lambda, psi, and theta for a two species assemblage).
Mean. The posterior mean.
SD. The standard deviation of the draws of reach parameter.
Naive SE. The standard error, not accounting for the correlation in draws
Time-series SE. The standard error, accounting for correlation in draws.
quantiles. 2.5%, 25%, 50%, 75% and 97.5% quantiles.
Rhat. The Gelman-Rubin statistic for each parameter.
ess_bulk. The effective sample size in the bulk of the distribution.
ess_tail. The effective sample size in the tails of the distribution.
truth. The known true data generating value.
capture. Did the 95% posterior interval contain the true value?
converge. Was the Gelman-Rubin statistic near 1?
theta_scenario. The ID for the classifier scenario.
scenario. The index of the validation scenario.
dataset. The dataset index.
A dataframe containing the columns described below.
A complete small-scale simulation study was run using the functions
in ValidationExplorer.
head(example_output) visualize_parameter_group( example_output, pars = "lambda", theta_scenario = "1", scenarios = 1:2 )head(example_output) visualize_parameter_group( example_output, pars = "lambda", theta_scenario = "1", scenarios = 1:2 )
Example output from summarize_n_validated.
A dataframe with 2 rows and 4 columns
theta_scenario. The ID of the classifier scenario.
scenario. The validation scenario index.
n_validated. The expected number of validated calls in an average dataset simulated under each scenario.
Data was simulated using simulate_validatedData and summarized using summarize_n_validated.
example_val_sumexample_val_sum
mask_by_spp: simulate a validation design
mask_by_spp(data, props_to_val)mask_by_spp(data, props_to_val)
data |
A dataframe containing the columns |
props_to_val |
a vector containing the proportion of recordings to validate for each species |
A list containing two elements: final_df and data_sum. final_df
is a copy of the input data masked according to the validation design
supplied by props_to_val. The second output, data_sum is a dataframe
containing a summary of the number and proportion of ambiguous
(i.e., not validated) recordings. It provides a check that the masking
function is working correctly.
library(dplyr) dat <- sim_dat()$full_df head(dat) dat <- dat %>% tidyr::uncount(weights = count, .remove = FALSE) val_dat <- mask_by_spp(dat, props_to_val = c(rep(.1, 4), rep(.4, 4))) val_dat$final_df %>% group_by(id_spp) %>% summarize(prop_vald = sum(!is.na(true_spp))/n())library(dplyr) dat <- sim_dat()$full_df head(dat) dat <- dat %>% tidyr::uncount(weights = count, .remove = FALSE) val_dat <- mask_by_spp(dat, props_to_val = c(rep(.1, 4), rep(.4, 4))) val_dat$final_df %>% group_by(id_spp) %>% summarize(prop_vald = sum(!is.na(true_spp))/n())
Mask a proportion of all visits: A function for simulating a fixed effort validation design.
mask_FE_all_visits(df, effort_prop, seed = NULL)mask_FE_all_visits(df, effort_prop, seed = NULL)
df |
A dataframe object in the format of |
effort_prop |
The proportion of recordings to be randomly sampled from the first visit for validation |
seed |
An optional random seed to make masking reproducible |
A dataframe object that is a copy of the input df, but with the
appropriate level of effort according to a fixed effort validation design.
library(dplyr) cd_data <- sim_dat()$full_df %>% tidyr::uncount(weights = count, .remove = FALSE) FE_data <- mask_FE_all_visits(cd_data, effort_prop = 0.1) head(cd_data) head(FE_data) FE_data %>% group_by(site, visit) %>% summarize( prop_validated = sum(!is.na(true_spp))/n() )library(dplyr) cd_data <- sim_dat()$full_df %>% tidyr::uncount(weights = count, .remove = FALSE) FE_data <- mask_FE_all_visits(cd_data, effort_prop = 0.1) head(cd_data) head(FE_data) FE_data %>% group_by(site, visit) %>% summarize( prop_validated = sum(!is.na(true_spp))/n() )
MCMC_sum: A custom function for summarizing MCMC posterior sampling
mcmc_sum(out, thin = 1, truth)mcmc_sum(out, thin = 1, truth)
out |
Draws from a model fit using a probabilistic programming language (e.g., Stan, NIMBLE or JAGS). The expected format of this input is a list, where each entry is a Markov chain. |
thin |
An optional thinning interval. |
truth |
A vector with the true parameter values organized alphanumerically by parameter value (e.g., lambda\[1\], lambda\[2\], psi\[1\], psi\[2\], theta\[1,1\], theta\[1,2\], theta\[2,1\], theta\[2,2\]) |
A dataframe object summarizing the MCMC draws, including diagnostics, quantiles and posterior means.
# An example fit of one dataset draws <- ValExp_example_fit # The data generating values truth <- c(11,2,0.3, 0.6, 0.9, 0.15, 0.10, 0.85) mcmc_sum(draws, truth = truth)# An example fit of one dataset draws <- ValExp_example_fit # The data generating values truth <- c(11,2,0.3, 0.6, 0.9, 0.15, 0.10, 0.85) mcmc_sum(draws, truth = truth)
plot_bias_vs_calls: Compare validation designs based on estimation error and expected level of effort
plot_bias_vs_calls( sim_summary, calls_summary, pars = NULL, regex_pars = NULL, theta_scenario, scenarios, max_calls = NULL, convergence_threshold = 1.1 )plot_bias_vs_calls( sim_summary, calls_summary, pars = NULL, regex_pars = NULL, theta_scenario, scenarios, max_calls = NULL, convergence_threshold = 1.1 )
sim_summary |
Simulation summary from many datasets under many validation scenarios in the format output by mcmc_sum. |
calls_summary |
Summary of the validation design in the format as output from summarize_n_validated |
pars |
A character vector of parameters to visualize. |
regex_pars |
String containing the name of a group of parameters to visualize. Must be one of "lambda", "psi", or "theta". |
theta_scenario |
String containing the theta scenario ID that
was used to run simulations in run_sims. This string must match
between |
scenarios |
A vector of integers corresponding to the validation designs you would like to visualize. |
max_calls |
The maximum number of calls that can be manually reviewed. All points beyond this threshold will be shaded gray in the plot. Default value is NULL. |
convergence_threshold |
A threshold for the Gelman-Rubin statistic; values below this threshold indicate that a parameter has converged. |
A ggplot2 object showing the number of calls validated on the x-axis and the average estimation error on the y-axis.
sim_summary <- example_output calls_summary <- example_val_sum plot_bias_vs_calls(sim_summary, calls_summary, regex_pars = "lambda", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05)sim_summary <- example_output calls_summary <- example_val_sum plot_bias_vs_calls(sim_summary, calls_summary, regex_pars = "lambda", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05)
plot_coverage_vs_calls: Compare validation designs based on coverage of 95% posterior intervals and expected level of effort
plot_coverage_vs_calls( sim_summary, calls_summary, pars = NULL, regex_pars = NULL, theta_scenario, scenarios, max_calls = NULL, convergence_threshold = 1.1 )plot_coverage_vs_calls( sim_summary, calls_summary, pars = NULL, regex_pars = NULL, theta_scenario, scenarios, max_calls = NULL, convergence_threshold = 1.1 )
sim_summary |
Simulation summary from many datasets under many validation scenarios in the format output by mcmc_sum. |
calls_summary |
Summary of the validation design in the format as output from summarize_n_validated |
pars |
A character vector of parameters to visualize. |
regex_pars |
String containing the name of a group of parameters to visualize. Must be one of "lambda", "psi", or "theta". |
theta_scenario |
String containing the theta scenario ID that
was used to run simulations in run_sims. This string must match
between |
scenarios |
A vector of integers corresponding to the validation designs you would like to visualize. |
max_calls |
The maximum number of calls that can be manually reviewed. All points beyond this threshold will be shaded gray in the plot. Default value is NULL. |
convergence_threshold |
A threshold for the Gelman-Rubin statistic; values below this threshold (and near 1) indicate that a parameter has converged. |
A ggplot2 object showing the number of calls validated on the x-axis and the average coverage of 95% credible intervals on the y-axis.
sim_summary <- example_output calls_summary <- example_val_sum plot_coverage_vs_calls(sim_summary, calls_summary, regex_pars = "lambda", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05)sim_summary <- example_output calls_summary <- example_val_sum plot_coverage_vs_calls(sim_summary, calls_summary, regex_pars = "lambda", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05)
plot_width_vs_calls: Compare validation designs based on 95% posterior interval width and expected level of effort
plot_width_vs_calls( sim_summary, calls_summary, pars = NULL, regex_pars = NULL, theta_scenario, scenarios, max_calls = NULL, convergence_threshold = 1.1 )plot_width_vs_calls( sim_summary, calls_summary, pars = NULL, regex_pars = NULL, theta_scenario, scenarios, max_calls = NULL, convergence_threshold = 1.1 )
sim_summary |
Simulation summary from many datasets under many validation scenarios in the format output by mcmc_sum. |
calls_summary |
Summary of the validation design in the format as output from summarize_n_validated |
pars |
A character vector of parameters to visualize. |
regex_pars |
String containing the name of a group of parameters to visualize. Must be one of "lambda", "psi", or "theta". |
theta_scenario |
String containing the theta scenario ID that
was used to run simulations in run_sims. This string must match
between |
scenarios |
A vector of integers corresponding to the validation designs you would like to visualize. |
max_calls |
The maximum number of calls that can be manually reviewed. All points beyond this threshold will be shaded gray in the plot. Default value is NULL. |
convergence_threshold |
A threshold for the Gelman-Rubin statistic; values below this threshold indicate that a parameter has converged. |
A ggplot2 object showing the number of calls validated on the x-axis and the average 95% credible interval on the y-axis.
sim_summary <- example_output calls_summary <- example_val_sum plot_width_vs_calls(sim_summary, calls_summary, regex_pars = "lambda", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05)sim_summary <- example_output calls_summary <- example_val_sum plot_width_vs_calls(sim_summary, calls_summary, regex_pars = "lambda", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05)
run_sims: conduct simulations easily
run_sims( data_list, zeros_list, DGVs, theta_scenario_id, parallel = TRUE, niter = 2000, nburn = floor(niter/2), thin = 1, nchains = 3, save_fits = FALSE, save_individual_summaries_list = FALSE, directory = tempdir() )run_sims( data_list, zeros_list, DGVs, theta_scenario_id, parallel = TRUE, niter = 2000, nburn = floor(niter/2), thin = 1, nchains = 3, save_fits = FALSE, save_individual_summaries_list = FALSE, directory = tempdir() )
data_list |
nested list of masked dataframes (datasets nested within scenarios –
this is the format of |
zeros_list |
list of dataframes containing the site/visit/true_spp/id_spp combinations where no calls were observed. |
DGVs |
A named list with entries psi, lambda and theta containing the true values of the respective parameters. |
theta_scenario_id |
A character string ID for the simulations being run. |
parallel |
Should models be fit in parallel? Default value is TRUE. |
niter |
Number of iterations per MCMC chain. |
nburn |
Number of warmup iterations. |
thin |
Thinning interval for the MCMC chains. |
nchains |
The number of chains. |
save_fits |
Should individual model fits be saved? This could require large amounts of disk space if you are fitting many large models to big datasets. Default value is FALSE. |
save_individual_summaries_list |
Should summaries for individual model fits
be saved? While this requires much less space than |
directory |
The directory to save objects. Defaults to |
a dataframe with the summaries (from mcmc_sum) for all scenarios and datasets. A copy of this output is also saved to the current working directory.
# :::::::::::: Simulate data ::::::::::::: # psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) td <- withr::local_tempdir() fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = td ) # ::::::::::::: run simulations on sim'd data ::::::::::: # out <- run_sims( data_list = fake_data$masked_dfs, zeros_list = fake_data$zeros, DGVs = list(lambda = lambda, psi = psi, theta = test_theta1), theta_scenario_id = 'StratBySpecies_1', parallel = FALSE, nchains = 2, niter = 500, nburn = 250, thin = 1, save_fits = FALSE, save_individual_summaries_list = FALSE, directory = td )# :::::::::::: Simulate data ::::::::::::: # psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) td <- withr::local_tempdir() fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = td ) # ::::::::::::: run simulations on sim'd data ::::::::::: # out <- run_sims( data_list = fake_data$masked_dfs, zeros_list = fake_data$zeros, DGVs = list(lambda = lambda, psi = psi, theta = test_theta1), theta_scenario_id = 'StratBySpecies_1', parallel = FALSE, nchains = 2, niter = 500, nburn = 250, thin = 1, save_fits = FALSE, save_individual_summaries_list = FALSE, directory = td )
Simulate data from the count-detection model with counts per site-visit
sim_dat( nsites = 100, nspecies = 8, nvisits = 4, seed = NULL, psi = stats::runif(nspecies, 0.4, 0.9), lambda = abs(stats::rnorm(nspecies, 0, 100)), theta = t(apply(18 * diag(nspecies) + 2, 1, function(x) nimble::rdirch(1, x))) )sim_dat( nsites = 100, nspecies = 8, nvisits = 4, seed = NULL, psi = stats::runif(nspecies, 0.4, 0.9), lambda = abs(stats::rnorm(nspecies, 0, 100)), theta = t(apply(18 * diag(nspecies) + 2, 1, function(x) nimble::rdirch(1, x))) )
nsites |
the number of sites assumed in the design. Default value is 100. |
nspecies |
the number of species in the assemblage. Default is 8. |
nvisits |
the number of visits (detector nights) assumed for each site. Default is 4. |
seed |
optional seed if you would like to reproduce the data simulation. |
psi |
a vector of length nspecies that contains the occurrence probabilities for each species in the assemblage. These values must be in \[0,1\]. Default is to draw a random vector from a U(.4, .9) distribution. |
lambda |
vector of length nspecies that contains the relative activity parameters for each species. Note these values need to be positive. By default, lambda values are the absolute value of normal(0, 100) random variables. |
theta |
n nspecies x nspecies matrix containing the (mis)classification probabilities for each species. All entries must be in (0,1], with the rows of the matrix summing to 1. The default draws rows from a dirichlet distribution with concentrations determined by location in the matrix (diagonal values have higher concentrations). |
A list containing full_df, a complete dataframe simulated under the
user's specified parameter settings. et_sim_datasets.R. The second list
element is params, the parameters used to simulate data in list form.
fake_data <- sim_dat( nsites = 30, nspecies = 2, nvisits = 3, seed = 101, psi = c(.3, .6), lambda = c(8, 3) ) head (fake_data$full_df)fake_data <- sim_dat( nsites = 30, nspecies = 2, nvisits = 3, seed = 101, psi = c(.3, .6), lambda = c(8, 3) ) head (fake_data$full_df)
Simulate many datasets under candidate validation designs
simulate_validatedData( n_datasets, design_type = c("BySpecies", "FixedPercent"), scenarios = NULL, nsites = 100, nspecies = 8, nvisits = 3, psi = runif(nspecies, 0.1, 0.9), lambda = abs(rnorm(nspecies, 0, 5)), theta = t(apply(diag(18, nrow = nspecies) + 2, 1, function(x) { nimble::rdirch(alpha = x) })), confirmable_limits = NULL, scen_expand = TRUE, scen_df = NULL, save_datasets = FALSE, save_masked_datasets = FALSE, directory = tempdir() )simulate_validatedData( n_datasets, design_type = c("BySpecies", "FixedPercent"), scenarios = NULL, nsites = 100, nspecies = 8, nvisits = 3, psi = runif(nspecies, 0.1, 0.9), lambda = abs(rnorm(nspecies, 0, 5)), theta = t(apply(diag(18, nrow = nspecies) + 2, 1, function(x) { nimble::rdirch(alpha = x) })), confirmable_limits = NULL, scen_expand = TRUE, scen_df = NULL, save_datasets = FALSE, save_masked_datasets = FALSE, directory = tempdir() )
n_datasets |
The number of datasets you would like to have simulated. Each of these simulated datasets will be subjected to all candidate validation designs. |
design_type |
Character string, either "BySpecies" for a stratified-by-species design, or "FixedPercentage" for a fixed effort design (see Oram et al., in review for more details on each of these) |
scenarios |
if |
nsites |
number of sites in each dataset |
nspecies |
size of the species assemblage |
nvisits |
the number of visits to each site. Note that these simulations assume a balanced design. |
psi |
a vector of length nspecies with the assumed occurrence probabilities for each species |
lambda |
a vector of length nspecies with the assumed relative activity levels for each species. Make sure the order is correct and matches psi. |
theta |
a matrix containing the (mis)classification probabilities. The rows of this matrix must sum to 1. See vignette for an example. |
confirmable_limits |
A numeric vector containing the lower and and upper bounds on the site-visit probabilities that a recording can be validated ("confirmed"). |
scen_expand |
If |
scen_df |
If |
save_datasets |
logical. If TRUE, the datasets without any masking of true species labels (i.e., corresponding to complete validation of all recordings) will be saved. Default value is FALSE. |
save_masked_datasets |
logical. If TRUE, the masked datasets (i.e., the simulated datasets with partial validation according to the simulation scenario) will be saved. This means that there will be n_datasets x nrow(scenarios_dataframe) datasets saved: one for each dataset under each validation scenario. Default value is FALSE. |
directory |
character. Required if save_datasets = TRUE or save_masked_datasets = TRUE. This is where the datasets will be saved. By default, a temporary directory will be used. This must be changed if access to saved datasets is desired after the end of the R session, as tempdir() is cleared at the end of the session. |
A list containing three elements:
full_datasets: A list of length n_datasets with unmasked datasets (i.e., full validation of all recordings).
If save_datasets = TRUE, then these will be saved individually in directory as dataset_n.rds, where n
is the dataset number.
zeros: A list of length n_datasets containing all of the site-visits where no recordings of a certain
classification were observed. For example, if, in dataset 10, there were no calls from species 1 that were
classified as 3 on visit 4 to site 156, then the 10th entry of this list would contain a dataset with
a row corresponding tosite = 156, visit = 4, true_spp = 1, id_spp = 3, with count = 0. These zeros are
necessary for housekeeping in the model-fitting process. If save_datasets = TRUE, the zeros for each
each dataset will be saved in directory individually as zeros_in_dataset_n.rds, where
n is the dataset number.
masked_dfs: A nested list containing each dataset masked under each scenario. masked_dfs\[\[9\]\]\[\[27\]\] contains
dataset 27, assuming validation scenario 9. If save_masked_datasets = TRUE, then each dataset/scenario
scenario combination is saved individually in directory as dataset_n_masked_under_scenario_s.rds,
where n is the dataset number and s is the scenario number.
psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = tempdir() )psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = tempdir() )
Summarize the number of validated recordings
summarize_n_validated(data_list, scenario_numbers, theta_scenario)summarize_n_validated(data_list, scenario_numbers, theta_scenario)
data_list |
A list of masked datasets in the format output from simulate_validatedData. |
scenario_numbers |
An integer vector denoting the scenarios to be summarized. |
theta_scenario |
The identifier for the classifier scenario as a character string. |
A dataframe object that contains the theta scenario, validation scenario, and number of validated calls.
psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = withr::local_tempdir() ) summarize_n_validated( data_list = fake_data$masked_dfs, scenario_numbers = 1:nrow(fake_data$scenarios_df), theta_scenario = '1' )psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = withr::local_tempdir() ) summarize_n_validated( data_list = fake_data$masked_dfs, scenario_numbers = 1:nrow(fake_data$scenarios_df), theta_scenario = '1' )
Get suggested MCMC settings prior to starting your simulations
tune_mcmc(dataset, zeros, return_fit = TRUE)tune_mcmc(dataset, zeros, return_fit = TRUE)
dataset |
A dataframe containing the validated and ambiguous data to be fit. Expected format is that of a single masked dataframe contained in the output from simulate_validatedData. We recommend using a dataset from your lowest-effort validation scenario. |
zeros |
A dataframe containing the site/visit/true_spp/id_spp combinations that were never observed (count = 0). This will be one of the elements of the zeros object ouput from simulate_validatedData. |
return_fit |
A logical indicating whether the draws from the posterior should be returned. Default value is set to TRUE to encourage visual inspection of chains during the tuning process. |
A list containing the expected time to fit a single dataset with 10,000 iterations per chain, the draws for each chain, a guess for the minimum number of iterations and warmup required to to ensure Rhat <= 1.1 for all model parameters, and a dataframe containing effective sample sizes for each parameter.
psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = withr::local_tempdir() ) # scenario 1 has the lowest effort, so use the 5th dataset from that scenario # Not run during checks if (interactive()) { # note the index of the zeros matches the index of the dataset tune_mcmc(dataset = fake_data$masked_dfs[[1]][[5]], zeros = fake_data$zeros[[5]]) }psi <- c(0.3, 0.6) lambda <- c(11, 2) nspecies <- length(psi) nsites <- 30 nvisits <- 5 test_theta1 <- matrix(c(0.9, 0.1, 0.15, 0.85), byrow = TRUE, nrow = 2) val_scenarios <- list(spp1 = c(.75, .5), spp2 = .5) fake_data <- simulate_validatedData( n_datasets = 5, design_type = "BySpecies", scenarios = val_scenarios, nsites = nsites, nvisits = nvisits, nspecies = nspecies, psi = psi, lambda = lambda, theta = test_theta1, save_datasets = FALSE, save_masked_datasets = FALSE, directory = withr::local_tempdir() ) # scenario 1 has the lowest effort, so use the 5th dataset from that scenario # Not run during checks if (interactive()) { # note the index of the zeros matches the index of the dataset tune_mcmc(dataset = fake_data$masked_dfs[[1]][[5]], zeros = fake_data$zeros[[5]]) }
Draws from an MCMC algorithm fit to simulated data. This object exists solely for easing examples in documentation of mcmc_sum.
A list of length 3 matrices with columns and rows as described here.
Columns: Parameters (lambda, psi, and theta for a two species assemblage)
Rows; Iterations of the MCMC. There are 1000 post-warmup draws for each.
Simulated data using simulate_validatedData, then isolated one dataset and the corresponding zeros. Draws were obtained from model fitting using the code inside tune_mcmc.
str(ValExp_example_fit)str(ValExp_example_fit)
Visualize simulation results for entire groups of parameters
visualize_parameter_group( sim_summary, pars, theta_scenario, scenarios, convergence_threshold = 1.1 )visualize_parameter_group( sim_summary, pars, theta_scenario, scenarios, convergence_threshold = 1.1 )
sim_summary |
Summary output in the format of |
pars |
The parameters to be visualized. |
theta_scenario |
The theta scenario IDs. This should match the |
scenarios |
Scenarios to be visualized. |
convergence_threshold |
If the Gelman-Rubin statistic is below this value, consider an MCMC to have converged. Default value is 1.1, but we recommend 1.05. |
A ggplot object summarizing simulation results.
visualize_parameter_group( example_output, pars = "lambda", theta_scenario = "1", scenarios = 1:2 )visualize_parameter_group( example_output, pars = "lambda", theta_scenario = "1", scenarios = 1:2 )
See detailed simulation study results for a parameter of interest
visualize_single_parameter( sim_summary, par, theta_scenario, scenarios, convergence_threshold = 1.1 )visualize_single_parameter( sim_summary, par, theta_scenario, scenarios, convergence_threshold = 1.1 )
sim_summary |
Summary output in the format of run_sims. |
par |
The parameter to be visualized. |
theta_scenario |
The theta scenario IDs. This should match the |
scenarios |
Scenarios to be visualized. |
convergence_threshold |
If the Gelman-Rubin statistic is below this value, consider an MCMC to have converged. Default value is 1.1, but we recommend 1.05. |
A ggplot object summarizing simulation results.
visualize_single_parameter( example_output, par = "lambda[1]", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05 )visualize_single_parameter( example_output, par = "lambda[1]", theta_scenario = "1", scenarios = 1:2, convergence_threshold = 1.05 )