Sample Size for Two-Group Comparisons: t-Test Power Analysis Step by Step

April 6, 2026

You have two groups, one treatment condition, and a continuous outcome. Before collecting a single data point, you need an answer: how many participants per group?

This is the most common power analysis scenario in biomedical and behavioral research. The independent-samples t-test compares means between two groups, and its sample size formula is the foundation that more complex designs build on. If you understand the mechanics here, every other power analysis becomes a variation on the same theme.

What You Need Before You Start

A t-test sample size calculation requires exactly four inputs. You cannot skip or guess at any of them—each one directly controls the result.

1. Expected Effect Size (Cohen’s d)

The effect size quantifies how large a difference you expect between your two groups, expressed in standard deviation units. Cohen’s conventions provide benchmarks:

  • d = 0.2 – Small effect. Subtle differences that require large samples to detect. Common in behavioral interventions, nutritional studies, and public health measures.
  • d = 0.5 – Medium effect. Visible to a careful observer. Typical in many psychological and pharmacological studies.
  • d = 0.8 – Large effect. Obvious differences. Drug vs. placebo in acute conditions, surgical vs. conservative treatment.

Conventions are starting points, not substitutes for domain knowledge. The right effect size comes from pilot data, published meta-analyses in your field, or the minimum difference that would change clinical practice (the MCID). If your pilot found d = 0.6 but the literature suggests d = 0.4, use the more conservative estimate.

2. Significance Level (α)

The probability of concluding the groups differ when they actually don’t (Type I error). Standard: α = 0.05, two-tailed. Use α = 0.01 for higher-stakes decisions or when multiple comparisons are involved.

3. Statistical Power (1 – β)

The probability of detecting the effect if it exists. Standard minimum: 80%. For grant applications and clinical trials, 90% is increasingly expected—it provides a buffer against the inevitable erosion from dropouts and measurement noise.

4. Allocation Ratio

Equal groups (1:1) maximize statistical efficiency. Unequal allocation (e.g., 2:1) may be justified ethically or practically but always increases the total N required.
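The cost of unequal allocation is easy to quantify with the standard generalization of the equal-allocation formula: for a k:1 ratio, n₁ = (1 + 1/k) × [(Zα/2 + Zβ) / d]² and n₂ = k × n₁. A minimal sketch using only the standard library (normal approximation, so values run one or two below t-based software; the function name is my own):

```python
from math import ceil
from statistics import NormalDist

def n_unequal(d, k, alpha=0.05, power=0.80):
    """Per-group sizes (n1, n2) for allocation ratio k:1, normal
    approximation. k = 1 reduces to the equal-allocation formula."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n1 = ceil((1 + 1 / k) * (z / d) ** 2)
    return n1, k * n1

print(n_unequal(0.5, 1))  # (63, 63): total 126
print(n_unequal(0.5, 2))  # (48, 96): total 144
```

For d = 0.5 at 80% power, moving from 1:1 to 2:1 raises the total from 126 to 144 even though power and effect size are unchanged.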

The Formula

For an independent two-sample t-test with equal group sizes:

n = 2 × [(Zα/2 + Zβ) / d]²

Where n is the per-group sample size and d is Cohen’s d. This simplified form works because d already normalizes by the pooled standard deviation.

Equivalently, if you have raw means and standard deviations:

n = 2σ² × (Zα/2 + Zβ)² / Δ²

Where σ is the pooled SD and Δ is the expected mean difference.
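The formula translates directly into a few lines of Python. A sketch using only the standard library; note that this normal approximation can land one participant below t-based software such as G*Power:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """n = 2 * ((z_{alpha/2} + z_beta) / d)^2, rounded up.
    Two-tailed; normal approximation (no t-distribution correction)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z / d) ** 2)

print(n_per_group(0.5))              # 63 (t-based software reports 64)
print(n_per_group(0.5, power=0.90))  # 85
```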

Quick Reference Table

Sample sizes per group for a two-tailed independent t-test at α = 0.05:

Cohen’s d       Power = 80%   Power = 90%   Power = 95%
0.2 (small)         394           527           651
0.3                 176           234           290
0.4                  99           132           163
0.5 (medium)         64            85           105
0.6                  45            59            73
0.8 (large)          26            34            42
1.0                  17            22            27

These numbers are per group. Total N is double. Notice the non-linear scaling: halving the effect size roughly quadruples the required sample.
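The table values can be reproduced exactly by searching for the smallest n whose power, computed from the noncentral t distribution, meets the target (the same computation t-based software such as G*Power performs). A sketch assuming SciPy is installed:

```python
from scipy.stats import nct, t

def n_exact(d, alpha=0.05, power=0.80):
    """Smallest per-group n whose two-tailed independent t-test
    power reaches the target, via the noncentral t distribution."""
    n = 2
    while True:
        df = 2 * n - 2
        tcrit = t.ppf(1 - alpha / 2, df)
        nc = d * (n / 2) ** 0.5  # noncentrality parameter
        achieved = nct.sf(tcrit, df, nc) + nct.cdf(-tcrit, df, nc)
        if achieved >= power:
            return n
        n += 1

print(n_exact(0.5))  # 64, matching the table
print(n_exact(0.8))  # 26
```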

Worked Example: Drug vs. Placebo

A researcher is testing whether a new anxiolytic reduces Hamilton Anxiety Scale (HAM-A) scores compared to placebo.

  • Expected difference: 4 points on the HAM-A scale
  • Published SD for HAM-A in similar populations: 8 points
  • Cohen’s d: 4 / 8 = 0.5 (medium effect)
  • α = 0.05 (two-tailed)
  • Power = 80%

From the table: n = 64 per group, or 128 total participants.

With 15% expected attrition: 128 / 0.85 ≈ 151 total participants to enroll.

If the team wants 90% power instead: 85 per group, 170 total, adjusted to 200 with attrition.
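The attrition adjustment is simple division, but the rounding deserves care: in binary floating point, 170 / 0.85 can evaluate to just above 200, and a naive ceiling would then demand 201 enrollees. Integer ceiling division sidesteps this; a sketch (the helper name is my own):

```python
def enroll_target(completers: int, attrition_pct: int) -> int:
    """Participants to enroll so that after attrition_pct % dropout,
    at least `completers` remain. Integer ceiling division avoids
    floating-point rounding surprises (e.g. 170 / 0.85)."""
    retained = 100 - attrition_pct
    return -(-completers * 100 // retained)

print(enroll_target(128, 15))  # 151 to enroll for 80% power
print(enroll_target(170, 15))  # 200 for 90% power
```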

Three Mistakes That Produce Wrong Numbers

1. Using Conventions When You Have Data

Cohen’s d = 0.2/0.5/0.8 are discipline-agnostic defaults from the 1960s. If you have pilot data or a relevant meta-analysis, calculate d from your actual data. Conventions should be a last resort, not the first choice.

2. Forgetting About Hedges’ g for Small Pilots

Cohen’s d overestimates the effect size in small samples (total N < 50). If your pilot had 15 participants per group, apply the Hedges’ correction before using the estimate for power analysis. The correction factor is approximately (1 – 3/(4N – 9)), where N is the total sample size across both groups; it shrinks d by a few percent, enough to matter for borderline sample sizes.
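The correction is one line of code. Applied to a pilot with 15 per group (N = 30) that found d = 0.6:

```python
def hedges_g(d: float, n_total: int) -> float:
    """Small-sample correction: g = d * (1 - 3 / (4*N - 9)),
    where N is the total sample size across both groups."""
    return d * (1 - 3 / (4 * n_total - 9))

print(round(hedges_g(0.6, 30), 3))  # 0.584: roughly a 3% shrink
```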

3. One-Tailed Tests to Reduce Sample Size

Switching from a two-tailed to a one-tailed test reduces the required n by about 20%. This is only appropriate when there is genuinely no interest in an effect in the opposite direction, a rare situation in practice. Reviewers will question a one-tailed test, and for good reason: it doubles the Type I error rate in the hypothesized direction (from 2.5% to 5%) and leaves an effect in the opposite direction formally undetectable.
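The ~20% figure is easy to verify with the normal approximation (standard library only; values run one or two below t-based software):

```python
from math import ceil
from statistics import NormalDist

def n_approx(d, alpha=0.05, power=0.80, tails=2):
    """Per-group n; tails=1 spends all of alpha in one tail."""
    z_a = NormalDist().inv_cdf(1 - alpha / tails)
    z = z_a + NormalDist().inv_cdf(power)
    return ceil(2 * (z / d) ** 2)

two = n_approx(0.5, tails=2)  # 63
one = n_approx(0.5, tails=1)  # 50
print(1 - one / two)          # about a 20% reduction
```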

When the t-Test Assumptions Fail

The independent t-test assumes normally distributed outcomes and equal variances between groups. When these don’t hold, the power analysis changes:

  • Non-normal data: If you plan to use a Mann-Whitney U test instead, the sample size from the t-test formula needs a correction factor. For normally distributed data, the Mann-Whitney requires about 5% more participants to achieve the same power. For non-normal distributions, the adjustment can be larger or smaller depending on the distribution shape.
  • Unequal variances: If one group is substantially more variable, Welch’s t-test is appropriate. Power analysis for Welch’s test requires specifying separate variances for each group rather than a pooled SD.
  • Paired design: If the same participants provide both measurements (pre/post, crossover), use the paired t-test formula, which substitutes the standard deviation of the within-subject difference scores for the between-subject SD. That SD is typically 40–60% smaller, dramatically reducing the required n.
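For the paired case, a normal-approximation sketch; the pre/post correlation rho is an assumption you must justify (e.g. from test-retest reliability data), and a t-based calculation adds one or two pairs:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_pairs(delta, sd, rho, alpha=0.05, power=0.80):
    """Pairs needed for a paired t-test (normal approximation).
    sd_diff = sd * sqrt(2 * (1 - rho)) is the SD of the difference
    scores; a high correlation rho shrinks it, and hence n."""
    sd_diff = sd * sqrt(2 * (1 - rho))
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil((z * sd_diff / delta) ** 2)

# HAM-A example from above, assuming a 0.6 pre/post correlation:
print(n_pairs(4, 8, 0.6))  # 26 pairs, vs ~63 per group independent
```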

From Formula to Defensible Protocol

A power analysis is only as credible as the justification behind each parameter. When writing the sample size section of a protocol or grant application:

  1. State the primary hypothesis and the statistical test that will be used.
  2. Cite the source of every parameter—the effect size estimate, the SD estimate, and where they came from (which study, which meta-analysis, which pilot dataset).
  3. Show the sensitivity range. Recalculate for d ± 0.1 around your estimate. Reviewers want to know the sample size isn’t fragile.
  4. Include the dropout adjustment and its justification (attrition rates from comparable studies).
  5. Use a sample size calculator that documents its assumptions, or report the software and version used (G*Power, R pwr package, etc.).
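The sensitivity range in step 3 takes three lines once the basic formula is coded (standard library only, normal approximation, so values run slightly below t-based software):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Two-tailed independent t-test, normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z / d) ** 2)

for d in (0.4, 0.5, 0.6):  # point estimate 0.5, swept by +/- 0.1
    print(f"d = {d}: n = {n_per_group(d)} per group")
```

The spread (99 down to 44 per group) is exactly what a reviewer wants to see: the study remains feasible even if the true effect is somewhat smaller than estimated.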

The gap between “I ran a power analysis” and “my power analysis will survive peer review” is entirely about the documentation. The formula is the easy part. The hard part—and the part that determines whether your study design holds up—is defending every number you plugged into it.

For clinical trial applications where sample size interacts with regulatory requirements and adaptive designs, see our guide on sample size calculation for clinical trials.

Ready to analyze your data?

Join the beta waitlist and be the first to try GraphHelix.