"Backtest Sharpe is high" tells you almost nothing — if you tried enough variants, one will always look great. This is the multiple-testing trap. Here is the discipline that distinguishes real strategies from artifacts.
If you test N strategy variants on the same history, the best one will tend to look "significant" by chance, even if every variant is pure noise. Reporting the top Sharpe without correcting for the search is self-deception. Two-thirds of the published quant literature would not survive proper multiple-testing correction.
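To make the trap concrete, here is a minimal simulation sketch. It assumes zero-mean Gaussian daily returns, one year of history, and a zero risk-free rate; the function names and parameters (`best_of_n_sharpe`, `daily_vol`, `seed`) are illustrative choices, not taken from any library.

```python
import random
import statistics

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe of per-period returns (risk-free rate assumed to be 0)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return mean / stdev * periods_per_year ** 0.5

def best_of_n_sharpe(n_variants, n_days=252, daily_vol=0.01, seed=0):
    """Best annualized Sharpe among n_variants pure-noise strategies.

    Every variant has a true Sharpe of exactly zero, so anything
    above zero here is selection bias and nothing else.
    """
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_variants):
        returns = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(returns))
    return max(sharpes)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"N={n:>4}  best Sharpe among noise variants: {best_of_n_sharpe(n):.2f}")
```

The best Sharpe can only rise as N grows, even though no variant has any edge. That is the number a naive report would headline.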
Combinatorial purged cross-validation (CPCV) is cross-validation adapted to time series: slice the series into contiguous blocks, form every combination of train/test blocks, and apply a purge and an embargo between train and test to prevent information leakage. The output is a distribution of backtest metrics rather than a single point, which is far more robust than a single hold-out.
CPCV originates from Marcos López de Prado's work and has become a de facto standard in serious A-share quant research.
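The split logic can be sketched in a few lines. This is a simplified illustration, not López de Prado's reference implementation: it assumes equally spaced observations, treats `purge` as a fixed count of observations dropped from the training set just before each test block (standing in for the label horizon), and treats `embargo` as a fixed count dropped just after. The function name `cpcv_splits` and its parameters are hypothetical.

```python
from itertools import combinations

def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, purge=0, embargo=0):
    """Yield (train_idx, test_idx) index lists for combinatorial purged CV.

    purge:   observations removed from train immediately BEFORE each test block.
    embargo: observations removed from train immediately AFTER each test block.
    """
    # Contiguous, near-equal blocks: block g covers [bounds[g], bounds[g + 1]).
    bounds = [g * n_samples // n_groups for g in range(n_groups + 1)]
    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx = []
        excluded = set()  # test observations plus their purge/embargo zones
        for g in test_groups:
            start, stop = bounds[g], bounds[g + 1]
            test_idx.extend(range(start, stop))
            excluded.update(range(max(0, start - purge),
                                  min(n_samples, stop + embargo)))
        train_idx = [i for i in range(n_samples) if i not in excluded]
        yield train_idx, test_idx

if __name__ == "__main__":
    splits = list(cpcv_splits(120, n_groups=6, n_test_groups=2, purge=3, embargo=2))
    print(len(splits), "train/test splits")
    train_idx, test_idx = splits[0]
    print("first split: test", test_idx[0], "to", test_idx[-1],
          "| train starts at", train_idx[0])
```

With 6 blocks and 2 test blocks per split there are C(6, 2) = 15 splits, and every observation is tested in 5 of them. Fitting and scoring the strategy on each split is what produces a distribution of metrics instead of a single number.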
The deflated Sharpe ratio (DSR) explicitly corrects for the number of variants you tried, the backtest length, and the skewness and kurtosis of returns. A factor that is DSR-significant is far more credible than one with a high raw Sharpe. In schematic form:
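DSR = Z[ (SR_observed - E[max_SR | N_tries]) / SE(SR_observed) ]
where Z is the standard normal CDF, E[max_SR | N_tries] is the expected best-of-N Sharpe under the null of no skill, and SE(SR_observed) is the standard error of the observed Sharpe, which shrinks with backtest length and is adjusted for skew and kurtosis. Most "great Sharpe" candidates collapse here.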
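A minimal sketch of the calculation follows, using the form commonly attributed to Bailey and López de Prado, in which the gap between the observed Sharpe and the expected best-of-N is scaled by the standard error of the observed Sharpe. It assumes per-period (non-annualized) Sharpe ratios, kurtosis quoted as non-excess (3 for a normal distribution), and a caller-supplied `sharpe_std`, the standard deviation of Sharpe estimates across the variants tried. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected maximum Sharpe across n_trials when the true Sharpe is 0."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection, so the benchmark is 0
    q1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)

def deflated_sharpe_ratio(sr_observed, n_obs, n_trials, sharpe_std,
                          skew=0.0, kurtosis=3.0):
    """Probability that the observed per-period Sharpe beats the best-of-N null."""
    sr_benchmark = expected_max_sharpe(n_trials, sharpe_std)
    # Standard error of the Sharpe estimate, adjusted for non-normal returns.
    variance = (1.0 - skew * sr_observed
                + (kurtosis - 1.0) / 4.0 * sr_observed ** 2) / (n_obs - 1)
    return _STD_NORMAL.cdf((sr_observed - sr_benchmark) / math.sqrt(variance))

if __name__ == "__main__":
    # Same observed daily Sharpe, five years of data, different search sizes.
    for n_trials in (1, 10, 100, 1000):
        dsr = deflated_sharpe_ratio(0.1, n_obs=1250, n_trials=n_trials, sharpe_std=0.05)
        print(f"N={n_trials:>4}  DSR={dsr:.3f}")
```

The observed Sharpe is identical in every row of the usage loop; only the number of variants tried changes, and that alone is enough to move a candidate from convincing to unconvincing.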
Backtests must be net of impact cost, fees, and slippage. Many "high Sharpe" factors collapse to zero once costs are deducted. The higher your turnover, the more lethal this gate.
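A rough sketch of the cost gate is below. It lumps fees, slippage, and impact into one linear all-in rate per unit of turnover, which is a simplifying assumption (real impact typically grows faster than linearly with trade size). The return series is synthetic, and names such as `cost_bps_per_turnover` are illustrative.

```python
import statistics

def net_returns(gross_returns, turnover, cost_bps_per_turnover):
    """Deduct trading costs from gross per-period returns.

    turnover: fraction of the book traded in each period (0.5 = 50%).
    cost_bps_per_turnover: assumed all-in cost in basis points per unit of turnover.
    """
    cost_rate = cost_bps_per_turnover / 10_000
    return [r - t * cost_rate for r, t in zip(gross_returns, turnover)]

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe of per-period returns (risk-free rate assumed to be 0)."""
    return statistics.fmean(returns) / statistics.stdev(returns) * periods_per_year ** 0.5

if __name__ == "__main__":
    # 252 synthetic trading days with a small positive gross drift.
    gross = [0.0012, -0.0005, 0.0009, 0.0003, -0.0002, 0.0010] * 42
    print(f"gross Sharpe: {annualized_sharpe(gross):.2f}")
    for daily_turnover in (0.05, 0.5):
        net = net_returns(gross, [daily_turnover] * len(gross), cost_bps_per_turnover=10)
        print(f"daily turnover {daily_turnover:.0%}: net Sharpe {annualized_sharpe(net):.2f}")
```

The gross series is the same in both runs. At 5% daily turnover most of the edge survives; at 50% the same signal has a negative net Sharpe.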
Backtest numbers must be labelled backtest-grade; live equity curves should reconstruct from real positions × real close prices, reconcilable against broker statements. A high backtest Sharpe in marketing copy is a regulatory red flag in licensed-advisor jurisdictions.
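The reconstruction step can be sketched as follows. The data layout (one cash balance, one `{symbol: shares}` dict, and one `{symbol: close}` dict per day) and the function names are assumptions made for illustration; a real book would also need to handle corporate actions, fees, and multiple currencies.

```python
def reconstruct_equity(cash, positions, closes):
    """Daily equity = cash + sum(shares * close) over the positions actually held."""
    equity = []
    for day_cash, day_positions, day_closes in zip(cash, positions, closes):
        holdings_value = sum(shares * day_closes[symbol]
                             for symbol, shares in day_positions.items())
        equity.append(day_cash + holdings_value)
    return equity

def reconciliation_breaks(reconstructed, broker_equity, tolerance=0.01):
    """Indices of days where our equity and the broker statement disagree."""
    return [i for i, (ours, theirs) in enumerate(zip(reconstructed, broker_equity))
            if abs(ours - theirs) > tolerance]

if __name__ == "__main__":
    cash = [1000, 500]
    positions = [{"A": 10}, {"A": 10, "B": 5}]
    closes = [{"A": 20, "B": 100}, {"A": 21, "B": 100}]
    equity = reconstruct_equity(cash, positions, closes)
    print(equity)
    print(reconciliation_breaks(equity, broker_equity=[1200, 1215]))
```

Any index returned by `reconciliation_breaks` is a day to investigate before the curve is shown to anyone.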