 Methodology
An internal pilot design for prospective cancer screening trials with unknown disease prevalence
Trials volume 16, Article number: 458 (2015)
Abstract
Background
For studies that compare the diagnostic accuracy of two screening tests, the sample size depends on the prevalence of disease in the study population, and on the variance of the outcome. Both parameters may be unknown during the design stage, which makes finding an accurate sample size difficult.
Methods
To solve this problem, we propose adapting an internal pilot design. In this adapted design, researchers will accrue some percentage of the planned sample size, then estimate both the disease prevalence and the variances of the screening tests. The updated estimates of the disease prevalence and variance are used to conduct a more accurate power and sample size calculation.
Results
We demonstrate that in large samples, the adapted internal pilot design produces no Type I inflation. For small samples (N less than 50), we introduce a novel adjustment of the critical value to control the Type I error rate. We apply the method to two proposed prospective cancer screening studies: 1) a small oral cancer screening study in individuals with Fanconi anemia and 2) a large oral cancer screening trial.
Conclusion
Conducting an internal pilot study without adjusting the critical value can cause Type I error rate inflation in small samples, but not in large samples. An internal pilot approach usually achieves goal power and, for most studies with sample size greater than 50, requires no Type I error correction. Further, we have provided a flexible and accurate approach to bound Type I error below a goal level for studies with small sample size.
Background
Lingen et al. [1] proposed a study to compare the diagnostic accuracy of two screening modalities for the detection of oral premalignant and malignant lesions. During the planning phase of the trial, Lingen et al. considered a paired design with the full area under the receiver operating characteristic curve (AUC) as the outcome.
In a paired cancer screening trial, each participant is given two screening tests [1–4]. The participants are typically volunteers drawn from a standard screening population. Thus, the trial includes both participants with disease and participants without disease. At entry, the disease status of the participants is unknown. Presumably, the disease status of the participants in the trial mirrors the prevalence in the population.
The sample size for the trial proposed by Lingen et al. depended on the prevalence of disease in the population. The reported prevalence of oral malignant and premalignant lesions varied by as much as 16.5 % [5], even in published reports, depending on the population studied. If the prevalence of lesions was 12.1 %, as observed in [5], 2,450 participants would have been required to achieve 95 % power for the trial. However, if the prevalence of lesions was 0.2 % [6], Lingen and his colleagues would have needed to recruit 116,100 participants, a 47-fold increase.
All researchers have an ethical responsibility to choose an accurate sample size. Participants in cancer screening trials may face emotional and physical harm from needless biopsy, false positive diagnoses, and overdiagnosis of nonfatal disease. A study that overestimates the sample size required for a cancer screening trial exposes study participants to needless harm. A study that underestimates the sample size lacks the power to answer the research question, while still exposing study participants to potential harm.
One possible solution to the ethical dilemma is an internal pilot study. In an internal pilot design, investigators use information from the first fraction of study participants accrued to estimate unknown parameters [7–10]. The estimates can then be used to calculate an updated sample size.
Previous work on internal pilot designs for screening studies has assumed that the ratio of cases to noncases is known prior to the start of the study and that the ratio is fixed throughout the course of the study. Wu et al. [11] proposed an internal pilot approach for the comparison of the diagnostic accuracy of screening tests, but, like Coffey and Muller [12], assumed that the ratio of cases to noncases was known before the study and fixed by design during the study. In addition, the method of Wu et al. [11] does not control for possible Type I error inflation. While Gurka et al. [13] considered the use of internal pilot designs for observational studies, they did not suggest any Type I error correction techniques. In general, in small samples, internal pilot designs can inflate Type I error [14]. There are multiple approaches for controlling Type I error inflation in internal pilots when the inflation occurs due to variance reestimation [12, 15–18].
We broaden the definition of the internal pilot design to match the sampling scheme in cancer screening trials. We adapt internal pilot methodology to the cancer screening setting by: 1) allowing the ratio of cases to noncases to vary randomly throughout the study, 2) reestimating the sample size with internal pilot sample estimates of both the disease prevalence and the variance of the outcome, and 3) adjusting the critical value to control for possible Type I error rate inflation caused by sample size reestimation. The critical value correction depends on the unconditional distribution of the test statistic. We show that the approach allows investigators to attain a targeted power level, and control Type I error rate inflation in small samples. We demonstrate, via simulation, that no correction is needed for large samples. The internal pilot approach is applied to two oral cancer screening examples: one small one, where the correction is needed, and one large one, where no correction is needed. We conclude the manuscript with a discussion of the results.
Methods
Study design, hypothesis test, and sample size reestimation
A novel internal pilot study design for screening trials
The novel internal pilot design includes the following steps:

1. Initial planning stage: initial estimation of the sample size needed.

2. Pilot stage: collection of paired screening test scores from a fraction of the planned sample size.

3. Reestimation: sample size reestimation using pilot-sample-based variance and prevalence estimates.

4. Additional data collection: collection of additional data based on the sample size reestimation.

5. Analysis: hypothesis testing, using an adjusted critical value to prevent Type I error inflation.
We expand the notation of Coffey and Muller [9, 12] and Coffey et al. [19] to accommodate our modifications in the internal pilot study design. Throughout the manuscript lower case letters represent fixed variables and upper case letters represent random variables. Matrices are written in bold text.
Data for the internal pilot study can be organized into four sets according to the stage of the study that is of interest (Fig. 1). Let k ∈ {0, 1, 2, +} index the stage of interest. Variables indexed by k = 0 describe the initial planning stage. Since no data has been collected, planning stage variables take on planned or speculated values. Variables indexed by k = 1 and k = 2 identify data observed in the pilot stage and the additional data collection stage, respectively. Variables indexed by k = + describe the entire sample, which includes data from all participants.
Let the random variable N_{ dk } be the number of study participants in stage k with disease status d ∈ {n, c}, with n indicating no disease, and c disease. For example, N_{c1} is the number of individuals with disease in the pilot sample. When the subscript d is dropped, the random variable N_{ k } denotes the number of people both with and without disease in the kth stage of the study. For example, N_{1} is the total number of individuals in the pilot sample, and N_{+} is the final sample size.
Let N_{min} and N_{max} be the minimum and maximum sample sizes allowed by the study investigator, and assume that N_{+} ∈ [min(N_{1}, N_{min}), N_{max}]. Let n_{0} be the initial sample size estimate, and define λ = n_{1}/n_{0}. Let γ_{π} = π/π̂_{0}, where π ∈ (0, 1) is the true prevalence of disease and π̂_{0} ∈ (0, 1) is the initial estimate of the prevalence of disease. Let π̂_{1} = n_{c1}/n_{1} be the estimate of the prevalence of disease from the pilot data. With σ² the true variance of the difference in the two screening test scores and σ̂_{0}² > 0 the variance estimate used for the initial sample size calculation, define γ = σ²/σ̂_{0}². Let SSE_{1} = σ̂_{1}² × (n_{1} − 2), where σ̂_{1}² is the variance of the difference in the two screening test scores estimated after the internal pilot study. Let P_{t} and α_{t} be the target power and Type I error level for the study.
A paired comparison of the diagnostic accuracy of two screening tests
Let y_{idj} be the screening test score for individual i ∈ {1, 2, …, N_{+}}, with disease status d, on screening test j ∈ {A, B}. Assume that the two screening test scores [y_{idA} y_{idB}]′ have a bivariate normal distribution with mean μ_{d} = [μ_{dA} μ_{dB}]′, V(y_{idj}) = σ_{dj}², and Cov(y_{idA}, y_{idB}) = ρ_{d}σ_{dA}σ_{dB}. We assume that the differences between the screening test scores for both the cases and noncases are distributed with equal variance, V(y_{inA} − y_{inB}) = V(y_{icA} − y_{icB}) = σ². Under the bivariate normal assumption, the AUC for screening test j is given by Φ[(μ_{cj} − μ_{nj})/σ] ([20], p. 83, Result 4.8), where Φ is the cumulative distribution function of the standard normal. The difference between the AUCs is then Φ[(μ_{cA} − μ_{nA})/σ] − Φ[(μ_{cB} − μ_{nB})/σ].
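The binormal AUC above is straightforward to compute numerically. The sketch below is illustrative only: the helper name is ours, and we take the standardizing denominator to be √(σ_c² + σ_n²) (the standard binormal form, which coincides with the paper's σ when the two score distributions are independent with unit variance and reproduces the AUC value of about 0.983 quoted in the simulation study):

```python
from math import erf, sqrt

def binormal_auc(mu_c, mu_n, var_c, var_n):
    """AUC of one screening test under the binormal model: Phi of the
    standardized separation between case and noncase score means."""
    z = (mu_c - mu_n) / sqrt(var_c + var_n)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Null-hypothesis parameters from the simulation study
# (mu_c = 3, mu_n = 0, unit variances) give an AUC near 0.983.
auc_null = binormal_auc(3.0, 0.0, 1.0, 1.0)
```

Calling the same function with μ_c = 4 (the alternative-hypothesis mean for the second test) recovers the 0.015 difference in AUCs used in the simulation study.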
For a paired comparison of the AUCs of the two screening tests, we test the hypothesis H_{0} : (μ_{ cA } − μ_{ nA }) − (μ_{ cB } − μ_{ nB }) = (μ_{ cA } − μ_{ cB }) − (μ_{ nA } − μ_{ nB }) = 0 against H_{ A } : ¬ H_{0}. If H_{0} holds, the AUCs, and hence the diagnostic accuracies of the two screening tests, are equal. To test H_{0}, we fit a general linear univariate model with the difference in the screening test scores as the outcome. The approach was inspired by the work of Demler et al. [21]. We assume that the difference between screening test scores is Gaussian and that the observations on different participants are independent.
The general linear univariate model for the final data set can be written as Y_{+} = X_{+}β + ϵ_{+}, where Y_{+} is an N_{+} × 1 matrix containing the difference in the screening test scores for each individual, [y_{idA} − y_{idB}]′, X_{+} is an N_{+} × 2 design matrix that identifies disease status, β is a 2 × 1 matrix of mean differences [μ_{cA} − μ_{cB} μ_{nA} − μ_{nB}]′, and ϵ_{+} is the N_{+} × 1 matrix of errors. We test H_{0} by writing the contrast matrix C = [1 −1], forming θ = Cβ, and using an F statistic ([22], p. 51, Equation 2.32). The final F statistic used in our adapted internal pilot design is written as F_{+}.
Sample size reestimation for an internal pilot with unknown disease prevalence
The initial sample size is calculated as in Muller et al. [23]. For that calculation, the study investigator will specify σ_{0}^{2} and β_{0}. Ideally, speculated values will be based on data from previous studies, closely related published results, or clinical experience.
After the internal pilot, the final sample size can be recalculated using the following iterative algorithm. The goal of the algorithm is to find N_{+}, where the power of the study is equal to P_{ t }, the target power. First, check to see if the pilot data includes either all cases or all noncases. If so, set N_{+} = n_{0}. Otherwise, calculate the final sample size as follows. With n_{c1} and n_{n1} as observed in the initial pilot, define κ to be the greatest common factor of n_{c1} and n_{n1}. Let D = n_{c1}/κ, E = n_{n1}/κ, and R = (D + E).
Speculate that X_{+} will take the form X_{+} = Es(X) ⊗ 1_{m}, where Es(X) is an (R × 2) essence matrix containing D rows of [1 0] (cases) and E rows of [0 1] (noncases), and m is a positive integer chosen so that N_{+} = mR ≥ n_{1}.
Calculate the power as 1 − Pr[F_{+} ≤ f_{crit}] [23], where f_{crit} = F_{F}^{−1}[(1 − α_{t}); 1, N_{+} − 2] and F_{+} has a noncentral F distribution with 1 numerator degree of freedom, N_{+} − 2 denominator degrees of freedom, and noncentrality parameter ω_{+} = δ_{+}/σ̂_{1}², where δ_{+} = (θ − θ_{0})′[C(X_{+}′X_{+})⁻C′]^{−1}(θ − θ_{0}).
Sequentially increment or decrement m until the power of the experiment meets or exceeds P_{ t }, at m = m_{ t }. Set the final sample size to be N_{+} = m_{ t }R, unless N_{+} ≥ N_{ max } or N_{+} ≤ N_{ min }. If N_{+} ≥ N_{ max } then set N_{+} = N_{ max }. If N_{+} ≤ N_{ min } then set N_{+} = N_{ min }. Finally, calculate N_{2} as N_{2} = N_{+} − n_{1}.
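The re-estimation loop can be sketched in a few lines. This is a minimal illustration, not the authors' SAS/IML implementation: the function name is ours, m only grows upward from the smallest value with mR ≥ n₁ (the decrement branch is omitted), θ is the scalar contrast Cβ, and SciPy's noncentral F supplies the power calculation:

```python
from math import gcd
from scipy.stats import f as f_dist, ncf

def reestimate_sample_size(n_c1, n_n1, theta, sigma2_1, power_target=0.90,
                           alpha=0.05, n_min=None, n_max=None):
    """Grow the design in multiples of the pilot case mixture until the
    noncentral F power meets the target, then clamp to [n_min, n_max]."""
    kappa = gcd(n_c1, n_n1)          # greatest common factor of pilot counts
    D, E = n_c1 // kappa, n_n1 // kappa
    R = D + E
    n1 = n_c1 + n_n1
    m = max(1, -(-n1 // R))          # smallest m with m * R >= n1
    while True:
        N = m * R
        # With m*D cases and m*E noncases and C = [1 -1],
        # C (X'X)^{-1} C' = 1/(m*D) + 1/(m*E)
        delta = theta ** 2 / (1.0 / (m * D) + 1.0 / (m * E))
        omega = delta / sigma2_1
        f_crit = f_dist.ppf(1.0 - alpha, 1, N - 2)
        power = 1.0 - ncf.cdf(f_crit, 1, N - 2, omega)
        if power >= power_target:
            break
        m += 1
    if n_max is not None:
        N = min(N, n_max)
    if n_min is not None:
        N = max(N, n_min)
    return N
```

For example, with a balanced pilot of 10 cases and 10 noncases, θ = 1, and σ̂₁² = 2, the loop returns a final sample size in the mid-80s for 90 % power at α = 0.05.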
Simulation studies
Verification of unconditional power
We conducted a simulation study designed to verify the result of Equation (10) below. Simulation study parameters came from modifying an example presented in Kairalla et al. [24]. Kairalla et al. [24] modified a balanced example in Wittes and Brittain [8] so that the numbers of cases to noncases were unequal. Kairalla et al. then assumed a fixed case mixture throughout the study. We, in turn, modified the example in [24] by allowing the ratio of cases to noncases to vary randomly.
Initial parameters were set at: C = [1 −1], P_{t} = 0.90, α_{t} = 0.05, β = [1 0]′, and σ_{0}² = 2. The resulting initial sample size was n_{0} = 96 participants. With λ = 0.5, the pilot sample was fixed at n_{1} = 48. The true rate of disease was set at π = 1/3. The parameter γ ranged from 0.5 to 2 in increments of 0.25, while γ_{π} was fixed at 1. Under the alternative hypothesis, the bivariate normal parameters were set at μ_{c1} = 3, μ_{c2} = 4, ρ_{c12} = 0, μ_{n1} = 0, μ_{n2} = 0, ρ_{n12} = 0, and σ_{c}² = σ_{n}² = 1. To calculate Type I error under H_{0}, the bivariate normal parameters were set at μ_{c1} = 3, μ_{c2} = 3, ρ_{c12} = 0, μ_{n1} = 0, μ_{n2} = 0, ρ_{n12} = 0, and σ_{c}² = σ_{n}² = 1. The distributional parameters under the null correspond to an AUC of 0.983, and to a difference in AUC of 0.015 under the alternative. All programs were written in version 9.3 of SAS/IML® software [25] and are available upon request. The empirical power was calculated as the proportion of times the null hypothesis was rejected. The experiment was repeated 10,000 times. The maximum absolute deviation (MAD) was calculated as the maximum absolute difference between the empirical estimates and the theoretical value. Using a normal approximation to the distribution of a proportion, the half-width of the 95 % CI for a target power of 0.90 is 0.0053.
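A stripped-down version of this Monte Carlo check can be written directly. The sketch below is ours, not the SAS/IML code: it uses a fixed sample size with no internal pilot step, draws difference scores under H₀ (cases and noncases share a zero mean difference), and computes the empirical rejection rate of the two-group F test:

```python
import random
from math import sqrt
from scipy.stats import f as f_dist

def empirical_type1(n, pi, sigma2, n_reps=2000, alpha=0.05, seed=2015):
    """Empirical size of the F test comparing mean difference scores of
    cases vs. noncases when H0 (equal AUCs) is true."""
    rng = random.Random(seed)
    sd = sqrt(sigma2)
    f_crit = f_dist.ppf(1.0 - alpha, 1, n - 2)
    rejections = 0
    for _ in range(n_reps):
        d_case, d_non = [], []
        for _ in range(n):
            # disease status is Bernoulli(pi); difference score is Gaussian
            (d_case if rng.random() < pi else d_non).append(rng.gauss(0.0, sd))
        n_c, n_n = len(d_case), len(d_non)
        if n_c == 0 or n_n == 0:
            continue  # degenerate sample: counted as failing to reject H0
        m_c, m_n = sum(d_case) / n_c, sum(d_non) / n_n
        sse = (sum((x - m_c) ** 2 for x in d_case)
               + sum((x - m_n) ** 2 for x in d_non))
        s2 = sse / (n - 2)
        f_stat = (m_c - m_n) ** 2 / (s2 * (1.0 / n_c + 1.0 / n_n))
        rejections += f_stat > f_crit
    return rejections / n_reps
```

With n = 100, π = 1/3, and σ² = 2, the empirical rate should hover near the nominal 0.05, illustrating that the fixed-sample test itself is properly sized; inflation arises only from the re-estimation step.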
Assessment of Type I error rate inflation
We conducted a simulation study to assess the magnitude of the Type I error rate inflation for a variety of experimental conditions. The Type I error rate was simulated for a prospective cancer screening trial with an internal pilot design. The disease prevalence and variance were either correctly or incorrectly specified and then reestimated using pilot data. The hypothesis test was conducted using either an adjusted or unadjusted critical value.
The empirical Type I error was calculated for 648 different scenarios. The null hypothesis was that there was no difference in the diagnostic accuracy of the screening tests. For each scenario, we simulated 10,000 replicate data sets, conducted the hypothesis test, formed the P value, and decided whether to reject the null hypothesis at the α_{t} = 0.05 level. The number of replicates was chosen so that the half-width of the 95 % confidence interval of the proportion was no more than 0.005. The empirical Type I error was calculated as the proportion of replicates in which the null hypothesis was rejected. For some scenarios, the study population was composed of either all cases or all noncases. For all such scenarios, we considered there to be insufficient evidence to reject the null hypothesis.
The 648 different scenarios came from a range of parameter values. Parameters of the bivariate normal distributions for the cases and noncases were fixed at μ_{ n } ∈ {[0 0]^{'}}, μ_{ c } ∈ {[0.2 0.2]^{'}, [0.5 0.5]^{'}}, σ_{0}^{2} ∈ {0.34}, and ρ_{ n } = ρ_{ c } = 0.5. This corresponded to a difference in the AUCs of test A and test B of 0.05 or 0.1, respectively. The proportion of the initial sample size used for the internal pilot was in the range of λ ∈ {0.25, 0.5, 0.75}. We varied target power, P_{ t } ∈ {0.80, 0.90}, the ratio of the true variance to the initial variance estimate, γ ∈ {0.5, 1, 1.5}, and the ratio of the true population disease prevalence to the initial prevalence estimate, γ_{ π } ∈ {0.1, 1, 1.9}. The initial prevalence estimate was fixed at π_{0} = 0.5, corresponding to a balanced study design.
Validation of Type I error control
We compared our adjusted method to an unadjusted internal pilot approach for a scenario where significant Type I error inflation occurred. The parameters that defined the scenario were μ_{ n } = {[0 0]^{'}}, μ_{ c } = {[0.3 0.92]^{'}}, σ_{0}^{2} ∈ {0.34}, ρ_{ n } = ρ_{ c } = 0.5, π = 0.5, and λ ∈ {0.5}. The parameters correspond to an AUC of 0.64 for test A and an AUC of 0.87 for test B. We varied γ between 0.25 and 4. With P_{ t } = 0.90 and α = 0.05, the initial sample size was 42. The adjusted method was applied to each of three possible prevalence misspecification scenarios with γ_{ π } ∈ {0.1, 1, 1.9}.
Results
Type I error rate control
Overview
In general, internal pilot studies can inflate Type I error rate [14]. Here, we describe a method to bound Type I error rate in internal pilot studies where both the variance of the outcome and the disease prevalence are reestimated in the internal pilot step. First, we give the unconditional power and hence the Type I error for the F test statistic. We uncondition over all possible realizations of N_{1}, N_{c1}, N_{c2}, and N_{2}. After demonstrating that the Type I error rate takes on a maximum value across a specified range of γ and γ_{ π }, we describe a method for identifying the values of γ and γ_{ π } at which the maximum occurs. We choose a critical value for the final hypothesis test so that the maximum Type I error rate is bounded.
Unconditional Type I error
We derive the distribution of the F_{+} statistic under H_{0} and H_{ A }. Under H_{0}, the formulae give an unconditional Type I error. Under H_{ A }, the formulae give unconditional power. Because both the variance and the disease prevalence are reestimated, the test statistic is a function of the pilot sample size and the final sample size. Derivation of the distribution of the test statistic requires obtaining three results:
1. The distributions of N_{1}, N_{c1}, N_{c2}, N_{2}, and N_{+}.

2. The distribution of F_{+} conditional on N_{1}, N_{c1}, N_{c2}, N_{2}, and N_{+}.

3. The unconditional Type I error and power of the F_{+} test statistic.
Each of the three aforementioned results is presented within the Type I error rate control subsection. Throughout this subsection, we use functional notation to emphasize the dependence of variables on N_{1}, N_{c1}, N_{c2}, N_{2}, and σ_{1}². For example, we write N_{2}(σ_{1}², N_{c1}, N_{1}) to indicate that the additional sample size is a function of the pilot variance and the pilot case mixture.
Distributions of σ̂_{1}², N_{1}, N_{c1}, N_{c2}, and N_{2}
The number of participants in the pilot sample is fixed by study design: n_{1} = λn_{0}. Assuming a true disease prevalence of π, N_{c1} ∼ Binomial(n_{1}, π) and N_{c2} ∼ Binomial(N_{2}, π). The random variables σ̂_{1}² and N_{c1} are distributed independently. Summing over all possible values of n_{c1}, the unconditional probability mass function of the additional sample size is

Pr[N_{2} = n_{2}] = Σ_{n_{c1}=0}^{n_{1}} Pr[N_{2} = n_{2} | N_{c1} = n_{c1}] Pr[N_{c1} = n_{c1}],

where the conditional probability mass function extends Equation 18 of [9], and the unconditioning follows from the law of total probability.
where the first line extends Equation 18 of [9], and the second line follows from the law of total probability. The conditional probability mass function of N_{+} is calculated by extending Equation 17 of [9] as follows:
Note that since N_{2} = N_{+} − n_{1},
Power of the final hypothesis test conditional on N_{1}, N_{c1}, N_{c2}, N_{2}, and N_{+}
We show the dependence of the power on N_{1}, N_{c1}, N_{c2}, and N_{2}.
The additional sample size N_{2} is a function of \( {\widehat{\sigma}}_1^2 \) and N_{c1}. Since the power function is strictly monotone increasing, for fixed values of \( {\widehat{\sigma}}_1^2 \), n_{1}, and n_{c1}, there exists one and only one N_{2} = n_{2}. However, for a fixed n_{1} and n_{c1}, there exist infinitely many \( {\widehat{\sigma}}_1^2 \), all of which would yield the same final sample size.
Let q_{1}(n_{2}, n_{c1}) and q_{2}(n_{2}, n_{c1}) represent the smallest and the largest value of \( {\widehat{\sigma}}_1^2 \) that would lead to the additional sample size n_{2} for a fixed n_{1} and n_{c1}. Let q(n_{2}, n_{c1}) be the value of \( {\widehat{\sigma}}_1^2 \) that falls in the interval (q_{1}(n_{2}, n_{c1}), q_{2}(n_{2}, n_{c1})].
We can express the approximate power of the F_{+} test statistic for a value f(n_{2}, n_{c2}, n_{c1}) as a function of n_{2}, n_{c2}, and n_{c1}. Let I(n_{2}, n_{c2}, n_{c1}) represent the probability of rejecting H_{0} when the alternative is true, conditional on n_{c2}, n_{c1} and the value q(n_{2}, n_{c1}). Then
where ν_{+} = N_{+} − 2 and c(n_{2}, n_{c2}, n_{c1}) = ν_{+}/[2f(n_{2}, n_{c2}, n_{c1})], with χ²[a, ω_{+}(n_{1} + n_{2}, n_{c2} + n_{c1})] denoting a noncentral χ² with a degrees of freedom and noncentrality parameter ω_{+}(n_{1} + n_{2}, n_{c2} + n_{c1}). Equation (5) follows from the proof in the Appendix of Coffey and Muller [9].
Expected power of the F test statistic unconditioned from N_{1}, N_{c1}, N_{c2}, N_{2}, and N_{+}
We uncondition Equation (5) from N_{c1}, q(n_{2}, n_{c1}), N_{c2}, and N_{2}. Using the law of total probability, the unconditional power is
Substituting Equation (6) into Equation (5) gives
Unconditioning the power from N_{c2}, we obtain
leading to
with \( {f}_{\chi^2}\left(t,\;{\nu}_1\right) \) defined in Johnson et al. [26]. The distributional results of Coffey et al. [19] hold, conditional on fixed values of N_{1}, N_{c1}, and N_{c2}. The expected power is given by
where F_{+}(N_{+}, N_{c +}, N_{c1}) is the final test statistic, f(N_{+}, N_{c +}, N_{c1}) is an observed value, \( {F}_{\chi^2} \) is the cumulative distribution function of a noncentral χ^{2} [27], F_{ β } is the cumulative distribution function of a beta (one) distributed random variable [27], ν_{1} = n_{1} − 2, and the bounds of the integration depend on n_{c1} and n_{2}. The Type I error can be calculated from Equation (10) when the null hypothesis is true. Notice that when the null hypothesis is true, the χ^{2} distribution in Equation (10) becomes a central χ^{2}.
Bounding Type I error
There exists a maximum Type I error across a specified range of γ and γ_{π}. Let α^{max} be the global maximum Type I error. Power for a study design is maximized when the ratio of the number of study participants with disease to the number without disease is one-to-one. Thus, α^{max} must occur for γ_{π} = 1. The problem of showing that there is a maximum then reduces to showing that a maximum exists with respect to γ when γ_{π} = 1. Coffey and Muller [12] provide evidence to support this assertion.
We propose the following method to find the γ = γ* and γ_{π} = γ_{π}^{b} for which the maximum Type I error occurs:

1. First, fix a range for γ ∈ [a, b] and γ_{π} ∈ [c, d] a priori, based on the previous literature.

2. Find the value of γ_{π} = γ_{π}^{b} that results in a study design with a permissible prevalence value closest to a one-to-one ratio of cases to noncases (that is, the value in [c, d] closest to 1).

3. Finally, for the fixed γ_{π}^{b}, find the value of γ = γ* that yields the maximum Type I error inflation, using Equation (10) and a golden section search algorithm [28].
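The golden section search used in the final step can be sketched generically. The helper below is a standard maximizer for a unimodal function (the function name and tolerance are ours; in practice, f would evaluate the Type I error of Equation (10) at a candidate γ):

```python
from math import sqrt

INV_PHI = (sqrt(5.0) - 1.0) / 2.0    # inverse golden ratio, about 0.618

def golden_section_max(f, a, b, tol=1e-5):
    """Locate the maximizer of a unimodal function f on [a, b] by
    shrinking the bracket by the golden ratio each iteration."""
    while b - a > tol:
        c = b - INV_PHI * (b - a)
        d = a + INV_PHI * (b - a)
        if f(c) > f(d):
            b = d        # maximum lies in [a, d]
        else:
            a = c        # maximum lies in [c, b]
    return 0.5 * (a + b)
```

This variant re-evaluates both interior points per iteration for clarity; a production version would cache one evaluation per step.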
The maximum Type I error is bounded by identifying an adjusted critical value for the final test statistic. For γ = γ* and γ_{ π } = γ_{ π }^{b} we use a bisection search algorithm to find α* so that under H_{0}, Pr{F_{+}(N_{+}, N_{c +}, N_{c1}) ≤ f_{adj}} = α_{ t }, where f_{adj} = F_{ F }^{− 1}[(1 − α*); 1, N_{+} − 2].
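The bisection step can be sketched as follows. Here `type1_at` is a stand-in for an evaluation of the unconditional Type I error (Equation (10)) at nominal level α, assumed monotone increasing in α; the function name and bracket are ours:

```python
def find_adjusted_alpha(type1_at, alpha_target=0.05, lo=1e-4, hi=0.25,
                        tol=1e-6):
    """Bisection: find the nominal level alpha* at which the inflated
    unconditional Type I error equals the target level alpha_t."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if type1_at(mid) > alpha_target:
            hi = mid        # inflated error too large: lower the level
        else:
            lo = mid        # inflated error acceptable: raise the level
    return 0.5 * (lo + hi)

# Toy check: if re-estimation inflated the nominal level by a constant
# 12 %, the adjusted level would be 0.05 / 1.12, about 0.0446.
alpha_star = find_adjusted_alpha(lambda a: 1.12 * a)
```

The critical value f_adj is then the 1 − α* quantile of the central F with 1 and N₊ − 2 degrees of freedom, as in the text.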
Simulation studies results
Verification of unconditional power
The simulation study suggested that, for the parameters chosen, Equation (10) provides a good estimate of unconditional power. The MAD between empirical power and theoretical power always fell within the 95 % confidence interval (Tables 1 and 2). The half-width of the 95 % CI for a target Type I error of 0.05 is 0.0043.
Assessment of Type I error rate inflation
Results from the simulation are presented in Figs. 2, 3, and 4. Overall, the Type I error rate was inflated when the initial sample size was smaller than 50 and the initial prevalence estimate was correct. As the fraction of the initial sample size estimate used in the pilot study increased, the inflation grew smaller. The initial sample sizes for all 648 scenarios ranged from 12 to 2,028 participants, with an interquartile range of 61 to 635 participants. The median observed Type I error was 0.0495, with a minimum of 0.0244, a maximum of 0.0839, and an interquartile range of 0.0479 to 0.0521.
The figures suggest that no Type I error adjustment is needed when the sample size is large. This observation is consistent with the results from Wu et al. [11]. The results from the simulation study by Wu et al. [11] correspond to the subset of results in Figs. 2, 3, and 4 with γ = 1 and γ_{ π } = 1. However, Wu et al. [11] did not consider cases with small initial sample sizes, and thus did not observe the Type I error rate inflation shown in our results. In our first example, we present an application with a large sample size where no adjustment is needed to bound the Type I error rate.
Validation of Type I error control
Results from the Type I error control simulation appear in Fig. 5, which shows a comparison of the Type I error inflation for the adjusted and unadjusted methods. The figure plots Type I error rate as a function of γ, crossclassified by γ_{ π } for the two methods. Figure 5 shows that the adjusted method controls the Type I error rate in small samples. The maximum possible Type I error occurred with γ_{ π }^{b} = 1, γ* = 0.8541 for a Type I error of 0.0564. The adjusted Type I error rate was α* = 0.0438. Note that f_{ adj } is only assigned a value after the pilot sample is collected and N_{+} = n_{+} is reestimated.
Applications
Example 1: A large oral cancer screening trial where no adjustment is needed
One implication of this study is that internal pilot designs often require no penalty for reestimating both outcome variance and disease prevalence. In addition, the internal pilot design ensures that researchers will have sufficient power.
Recall the study by Lingen et al. discussed in the Background section. One aim of the study was to compare the diagnostic accuracy of a combined modality involving both visual and tactile oral exam with VELscope® [29]. The investigators wished to detect oral premalignancy and malignancy. There was substantial uncertainty about the rate of oral premalignancy and malignancy in the target population. The rate of suspicious lesions varies widely in Western populations, ranging from 0.2 % to 16.7 % [5]. Further, the variance of scores for visual and tactile oral exam and for examinations with VELscope was largely unknown. The uncertainty made an internal pilot design attractive.
One critical step for designing an internal pilot study is choosing N_{min} and N_{max}. The investigators wished to estimate a confidence interval for the percentage of oral lesions that were benign. To ensure that the confidence interval had a halfwidth of no more than 0.1 %, the investigators had to make sure that the entire study enrolled at least 96 people with lesions. If the rate of suspicious lesions was about 12.1 %, the minimum sample size could be no less than 800. The upper bound on sample size was fixed by monetary constraints. Previous experience had shown that a sample size of more than 30,000 was fiscally unfeasible. This set N_{max} at 30,000.
The initial power calculation was based on plausible values from the literature. A conservative estimate for the AUC for visual and tactile oral exam is 0.60. A clinically interesting difference between AUCs is 0.06. This corresponds to μ_{ n } ∈ {[0 0]^{'}}, μ_{ c } ∈ {[0.359 0.584]^{'}}, σ = 1, and ρ_{ n } = ρ_{ c } = 0. Assuming that the rate of suspicious lesions in the population is 12.1 %, the initial sample size needed for 95 % power is 2,156 noncases and 294 cases for a total sample size of 2,450.
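The quoted sample size can be checked with a standard noncentral F power calculation. The sketch below is our reading of the parameters above, under the assumption that σ = 1 denotes the standard deviation of the difference scores, so the noncentrality parameter is θ²/[σ²(1/n_c + 1/n_n)] with θ = 0.584 − 0.359 = 0.225:

```python
from scipy.stats import f as f_dist, ncf

theta = 0.584 - 0.359     # difference between the two mean paired differences
sigma2 = 1.0              # variance of the difference scores (our assumption)
n_c, n_n = 294, 2156      # cases and noncases at the initial sample size 2,450

omega = theta ** 2 / (sigma2 * (1.0 / n_c + 1.0 / n_n))   # noncentrality
f_crit = f_dist.ppf(0.95, 1, n_c + n_n - 2)
power = 1.0 - ncf.cdf(f_crit, 1, n_c + n_n - 2, omega)
# power comes out near 0.95, consistent with the planned 95 % power
```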
The final sample size that would be needed for the study would depend on results from the internal pilot. The results presented in the Type I error control validation indicate that Type I error inflation would not be a problem for a study designed with an initial sample size of 2,450. Thus, the final hypothesis test could be carried out with α set to 0.05.
Example 2: A small oral cancer screening trial where adjustment prevents Type I error inflation
A second implication of this manuscript is that internal pilot designs with small sample size require an adjustment to prevent Type I error inflation. Small sample sizes often occur because of biological constraints. For example, Wong et al. [30] are currently recruiting for an oral cancer screening trial in people with Fanconi anemia. Fanconi anemia is a rare genetic disease that occurs in roughly 1 in 131,000 people in the United States. People with Fanconi anemia are at increased risk for oral cancer, although the magnitude of the risk is unknown. The prevalence of oral squamous cell carcinoma could be as high as 100 % or as low as 3 % [31, 32].
Because the study is still in progress, the design has not yet been published. To illustrate the results of our manuscript, we show how an internal pilot trial might be used to compare the diagnostic accuracy of two assays for IL8 for the prediction of oral cancer. In people with Fanconi anemia, IL8 is a useful biomarker for screening for oral cancer [33, 34].
Consider a trial in which people with Fanconi anemia are given two salivary assays: a salivary beadbased assay for IL8, and an enzymelinked immunosorbent assay (ELISA). The diagnostic accuracy (AUC) of the ELISA and the salivary beadbased assay is 0.85 and 0.94, respectively [34, 35]. The target power is set to 0.80. A clinically interesting difference in diagnostic accuracy is a difference between AUCs of 0.09. The target Type I error rate is 0.05. Means and variances of both ELISA and a salivary beadbased assay are available in the literature [34, 35], with μ_{ n } ∈ {[759.4 759.4]^{'}}, μ_{ c } ∈ {[3347.7 4700.0]^{'}}, and σ_{ nA } = σ_{ nB } = σ_{ cA } = σ_{ cB } = 3328174.5. Modest correlation is set at ρ_{ n } = ρ_{ c } = 0.5.
If half the people in the study have oral cancer, the initial sample size required is 84 participants. Thus, the study could be subject to Type I error inflation. If we reestimate the sample size after the first 42 participants have been collected, the Type I error rate could be inflated to 0.054. This inflation occurs at γ_{π}^{b} = 1 and γ* = 0.7254, and represents an 8 % inflation over the target Type I error rate of 0.05. Adjusting gives an adjusted alpha level of α_{adj} = 0.0463, and the adjusted critical value can be calculated as f_{adj} = F_{F}^{−1}[(1 − α_{adj}); 1, N_{+} − 2]. Recall that the actual adjusted critical value depends on the final sample size calculated after the internal pilot is observed. For example, if n_{+} = 100, then f_{adj} = 4.07, and any observed test statistic larger than 4.07 leads to rejection of the null hypothesis.
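The worked critical value is easy to reproduce (a SciPy sketch; F_F^{−1} is the quantile function of the central F distribution):

```python
from scipy.stats import f as f_dist

alpha_adj = 0.0463          # adjusted alpha level from the example
n_plus = 100                # illustrative final sample size

# f_adj = F^{-1}(1 - alpha_adj; 1, N+ - 2), about 4.07 for n_plus = 100
f_adj = f_dist.ppf(1.0 - alpha_adj, 1, n_plus - 2)
```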
Discussion
In this manuscript, we described an internal pilot approach for cancer screening trials when the disease prevalence is unknown. We demonstrated that conducting an internal pilot study without adjusting the critical value caused Type I error rate inflation in small (N < 50) samples, but not in large samples. We also demonstrated that our adjusted method controlled the Type I error rate in small samples.
The approach has both strengths and limitations. A strength is that the method allows investigators to obtain expected power at least as high as needed, for all but the most severe variance and prevalence misspecifications. One limitation is the assumption that the screening test scores follow a bivariate normal distribution and that the assumptions of the general linear univariate model [22] are met. Secondly, the method may be overly conservative, and result in a Type I error rate lower than nominal. However, for prospective cancer screening trials, being conservative is reasonable. Cancer screening methods may be adopted in large populations, and replicable research is vital for maintaining public trust. Finally, the computing time is somewhat lengthy, because the integration and sums in Equation (10) have high computational complexity. For any one study design, the amount of time is reasonable. For example, it took less than eight hours to run all programs used in Example 2. In addition, our simulation study demonstrated that the method is not necessary in screening studies with large sample sizes.
Conclusion
We have shown that an internal pilot approach usually achieves goal power, and, for most studies with sample size greater than 50, requires no Type I error correction. Further, we have provided a flexible and accurate approach to bound Type I error below a goal level for studies with small sample size (N < 50). Both investigators and statisticians should use the new methods for the design of cancer screening trials.
Abbreviations
AUC: area under the receiver operating characteristic curve
MAD: maximum absolute deviation
ELISA: enzyme-linked immunosorbent assay
References
 1.
Lingen MW. Efficacy of oral cancer screening adjunctive techniques. Bethesda (MD): National Institute of Dental and Craniofacial Research, National Institutes of Health, US Department of Health and Human Services (NIH Project Number: 1RC2DE02077901); 2009.
 2.
Berg W, Zhang Z, Lehrer D, Jong R, Pisano E, Barr R, et al. Detection of breast cancer with addition of annual screening ultrasound or a single screening MRI to mammography in women with elevated breast cancer risk. JAMA. 2012;307(13):1394–404.
 3.
Lewin JM, Hendrick RE, D’Orsi CJ, Isaacs PK, Moss LJ, Karellas A, et al. Comparison of full-field digital mammography with screen-film mammography for cancer detection: results of 4,945 paired examinations. Radiology. 2001;218(3):873–80.
 4.
Pisano ED, Gatsonis C, Hendrick E, Yaffe M, Baum JK, Acharyya S, et al. Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med. 2005;353(17):1773–83.
 5.
Lim K, Moles DR, Downer MC, Speight PM. Opportunistic screening for oral cancer and precancer in general dental practice: results of a demonstration study. Br Dent J. 2003;194(9):497–502. discussion 493.
 6.
Field EA, Morrison T, Darling AE, Parr TA, Zakrzewska JM. Oral mucosal screening as an integral part of routine dental care. Br Dent J. 1995;179(7):262–6.
 7.
Stein C. A two-sample test for a linear hypothesis whose power is independent of the variance. Ann Math Stat. 1945;16(3):243–58.
 8.
Wittes J, Brittain E. The role of internal pilot studies in increasing the efficiency of clinical trials. Stat Med. 1990;9(1–2):65–71. discussion −2.
 9.
Coffey CS, Muller KE. Exact test size and power of a Gaussian error linear model for an internal pilot study. Stat Med. 1999;18(10):1199–214.
 10.
Friede T, Kieser M. Sample size recalculation in internal pilot study designs: a review. Biom J. 2006;48(4):537–55.
 11.
Wu C, Liu A, Yu KF. An adaptive approach to designing comparative diagnostic accuracy studies. J Biopharm Stat. 2008;18(1):116–25.
 12.
Coffey CS, Muller KE. Controlling test size while gaining the benefits of an internal pilot design. Biometrics. 2001;57(2):625–31.
 13.
Gurka MJ, Coffey CS, Gurka KK. Internal pilots for observational studies. Biom J. 2010;52(5):590–603. doi:10.1002/bimj.201000050.
 14.
Wittes J, Schabenberger O, Zucker D, Brittain E, Proschan M. Internal pilot studies I: Type I error rate of the naive t-test. Stat Med. 1999;18(24):3481–91.
 15.
Zucker DM, Wittes JT, Schabenberger O, Brittain E. Internal pilot studies II: comparison of various procedures. Stat Med. 1999;18(24):3493–509.
 16.
Miller F. Variance estimation in clinical studies with interim sample size reestimation. Biometrics. 2005;61(2):355–61.
 17.
Denne JS, Jennison C. Estimating the sample size for a t-test using an internal pilot. Stat Med. 1999;18(13):1575–85. doi:10.1002/(SICI)1097-0258(19990715)18:13<1575::AID-SIM153>3.0.CO;2-Z.
 18.
Kieser M, Friede T. Recalculating the sample size in internal pilot study designs with control of the type I error rate. Stat Med. 2000;19(7):901–11. doi:10.1002/(SICI)1097-0258(20000415)19:7<901::AID-SIM405>3.0.CO;2-L.
 19.
Coffey CS, Kairalla JA, Muller KE. Practical methods for bounding Type I error rate with an internal pilot design. Commun Stat Theory Methods. 2007;36(11):2143–57.
 20.
Pepe MS. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press; 2003.
 21.
Demler OV, Pencina MJ, D’Agostino RB. Equivalence of improvement in area under ROC curve and linear discriminant analysis coefficient under assumption of normality. Stat Med. 2011;30(12):1410–8.
 22.
Muller KE, Stewart PW. Linear model theory: univariate, multivariate, and mixed models. New York: Wiley-Interscience; 2006.
 23.
Muller KE, LaVange LM, Ramey SL, Ramey CT. Power calculations for general linear multivariate models including repeated measures applications. J Am Stat Assoc. 1992;87(420):1209–26.
 24.
Kairalla JA, Coffey CS, Muller KE. GLUMIP 2.0: SAS/IML software for planning internal pilots. J Stat Softw. 2008;28(7):1–32.
 25.
SAS Institute Inc. SAS/STAT® 9.3 user’s guide. Cary, NC: SAS Institute Inc.; 2011.
 26.
Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, vol. 1. New York: Wiley-Interscience; 1994.
 27.
Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, vol. 2. New York: Wiley-Interscience; 1995.
 28.
Thisted RA. Elements of statistical computing: numerical computation. Ipswich, Suffolk: Chapman and Hall/CRC; 1988.
 29.
Poh CF, MacAulay CE, Zhang L, Rosin MP. Tracing the “atrisk” oral mucosa field with autofluorescence: steps toward clinical impact. Cancer Prev Res. 2009;2(5):401–4.
 30.
Wong DT. Oral cancer biomarker study. 2012.
 31.
Scheckenbach K, Wagenmann M, Freund M, Schipper J, Hanenberg H. Squamous cell carcinomas of the head and neck in Fanconi anemia: risk, prevention, therapy, and the need for guidelines. Klin Padiatr. 2012;224(3):132–8.
 32.
Rosenberg PS, Socie G, Alter BP, Gluckman E. Risk of head and neck squamous cell cancer and death in patients with Fanconi anemia who did and did not receive transplants. Blood. 2005;105(1):67–73.
 33.
Elashoff D, Zhou H, Reiss J, Wang J, Xiao H, Henson B, et al. Prevalidation of salivary biomarkers for oral cancer detection. Cancer Epidemiol Biomarkers Prev. 2012;21(4):664–72.
 34.
Hu S, Arellano M, Boontheung P, Wang J, Zhou H, Jiang J, et al. Salivary proteomics for oral cancer biomarker discovery. Clin Cancer Res. 2008;14(19):6246–52.
 35.
Arellano-Garcia M, Hu S, Wang J, Henson B, Zhou H, Chia D, et al. Multiplexed immunobead-based assay for detection of oral cancer protein biomarkers in saliva. Oral Dis. 2008;14(8):705–12.
Acknowledgements
The research presented in this paper was supported in part by NIDCR RC2DE020779 and by NIDCR 1 R01 DE02083201A1. The content of this paper is solely the responsibility of the authors, and does not necessarily represent the official views of the National Institute of Dental and Craniofacial Research, nor the National Institutes of Health. This manuscript was submitted to the Department of Biostatistics and Informatics in the Colorado School of Public Health, University of Colorado Denver, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biostatistics for J. T. Brinton.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JTB conducted the literature review, derived the mathematical results, designed and programmed the simulation studies, interpreted the results, and prepared the manuscript. DHG assisted with the literature review, assisted with the mathematical derivations, provided guidance for the design and programming of the simulation studies, and provided expertise on the context of the topic in relation to other work in the field. BMR reviewed the intellectual content of the work and gave important editorial suggestions. DHG conceived of the topic and guided the development of the work. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Brinton, J.T., Ringham, B.M. & Glueck, D.H. An internal pilot design for prospective cancer screening trials with unknown disease prevalence. Trials 16, 458 (2015) doi:10.1186/s13063-015-0951-3
Keywords
 Cancer screening
 Internal pilot
 Area under the curve
 Type I error
 Power
 Receiver operating characteristic analysis