You think you have 50 factors, but the correlation matrix often shows only 5-10 truly independent dimensions; the rest are close cousins. Collinearity destabilizes model weights and inflates backtest results. This post covers diagnosis and deduplication (methodology only).
Most A-share factors derive from a handful of underlying signals: price, volume, fund flow, and fundamentals. Varying the window and the transformation creates "new" factors that are highly correlated. For example, 5-day, 10-day, and 20-day momentum typically have pairwise |ρ| of 0.6 or higher, and implied vol versus historical vol is almost a linear relationship.
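Without deduplication, your "50-factor" model has maybe 5 effective dimensions, each "copied" roughly 10×. The model over-relies on those 5 real signals, and the weights jump from fold to fold because near-duplicate factors can trade weight with each other at almost no cost in fit. The result is unstable and over-confident.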
Compute pairwise Spearman correlations. |ρ| > 0.7 indicates high correlation; |ρ| > 0.9 means the two are essentially the same factor.
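A minimal sketch of the pairwise check, using NumPy only. The factor names and the synthetic data are made up for illustration, and the ranking shortcut (a double `argsort`) assumes no tied values; on real data a tie-aware ranking is the safer choice.

```python
import numpy as np

def spearman_matrix(X):
    """Spearman correlation between the columns of X (n_samples, n_factors).

    Ranks come from a double argsort, which assumes no tied values.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def high_corr_pairs(corr, names, threshold=0.7):
    """List factor pairs whose |rho| exceeds the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic example: two "momentum" factors share one underlying signal,
# the third factor is independent. All values are illustrative.
rng = np.random.default_rng(0)
n = 2000
base = rng.normal(size=n)
names = ["mom_5", "mom_10", "value"]
X = np.column_stack([
    base + 0.3 * rng.normal(size=n),
    base + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

corr = spearman_matrix(X)
pairs = high_corr_pairs(corr, names, threshold=0.7)
print(pairs)
```

Only the two momentum-style columns should be flagged here, since they are the same signal plus small independent noise.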
Cluster factors by correlation, keep one representative per cluster (typically highest IC).
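One simple way to do this, sketched below, is to treat every pair with |ρ| above a threshold as linked and take the connected groups as clusters (this is single-linkage clustering cut at that threshold). The correlation matrix, IC values, and factor names are hypothetical, and the 0.7 cutoff is just the threshold from above; other clustering methods would work as well.

```python
import numpy as np

def cluster_by_correlation(corr, threshold=0.7):
    """Group factors into connected components of the graph whose edges
    are the pairs with |rho| > threshold (union-find)."""
    n = corr.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def pick_representatives(clusters, ic):
    """Keep the member with the largest |IC| from each cluster."""
    return sorted(max(members, key=lambda k: abs(ic[k])) for members in clusters)

# Hypothetical inputs, for illustration only.
names = ["mom_5", "mom_10", "mom_20", "value", "turnover"]
corr = np.array([
    [1.00, 0.85, 0.65, 0.05, 0.10],
    [0.85, 1.00, 0.80, 0.02, 0.12],
    [0.65, 0.80, 1.00, 0.01, 0.08],
    [0.05, 0.02, 0.01, 1.00, -0.20],
    [0.10, 0.12, 0.08, -0.20, 1.00],
])
ic = np.array([0.020, 0.035, 0.030, 0.025, -0.040])

clusters = cluster_by_correlation(corr, threshold=0.7)
kept = [names[k] for k in pick_representatives(clusters, ic)]
print(kept)
```

Note the chaining effect: `mom_5` and `mom_20` land in the same cluster even though their own |ρ| is 0.65, because both are linked to `mom_10`. If that is too aggressive, a stricter linkage rule is the usual alternative.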
Regress each factor X_i on all the other factors and take the R² of that regression: VIF_i = 1/(1 - R_i²). VIF > 10 indicates severe collinearity.
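The VIF calculation can be sketched directly from that definition with ordinary least squares. The data below is synthetic (two near-duplicate columns and one independent column), and the function name `vif` is just a label for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_factors).

    Each column is regressed on all the others plus an intercept,
    and VIF_i = 1 / (1 - R_i^2).
    """
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[i] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: columns 0 and 1 are the same signal plus small noise,
# column 2 is independent.
rng = np.random.default_rng(1)
n = 2000
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.1 * rng.normal(size=n),
    base + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

vifs = vif(X)
print(np.round(vifs, 1))
```

The two near-duplicate columns should come out well above 10, while the independent column stays close to 1.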
| Method | Characteristics |
|---|---|
| Cluster + keep representative | Simple, interpretable |
| Orthogonalization (PCA / Gram-Schmidt) | Replaces originals with uncorrelated components (principal components for PCA, sequential residuals for Gram-Schmidt); information-preserving but loses economic meaning |
| Regularization (Ridge / Lasso) | Suppresses collinearity at model level, simple |
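To illustrate the regularization row, here is a small sketch comparing plain least squares with closed-form ridge on two nearly identical factors. Everything in it is an assumption for illustration: the data is synthetic and zero-mean (so no intercept is fitted), and the penalty `lam=10.0` is arbitrary; in practice factors would be standardized and the penalty chosen by cross-validation.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge: w = (X'X + lam * I)^-1 X'y. lam = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two near-duplicate factors driven by the same underlying signal.
rng = np.random.default_rng(2)
n = 500
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.01 * rng.normal(size=n),
    base + 0.01 * rng.normal(size=n),
])
y = base + 0.5 * rng.normal(size=n)

w_ols = ridge_weights(X, y, lam=0.0)
w_ridge = ridge_weights(X, y, lam=10.0)

# Gap between the two weights: OLS can split the shared signal almost
# arbitrarily between the duplicates, ridge pulls the split toward equal.
ols_gap = abs(w_ols[0] - w_ols[1])
ridge_gap = abs(w_ridge[0] - w_ridge[1])
print("OLS:  ", np.round(w_ols, 3))
print("Ridge:", np.round(w_ridge, 3))
```

Ridge leaves the combined weight on the shared signal nearly unchanged but shrinks the poorly determined difference between the two duplicates, which is where the fold-to-fold instability comes from.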
Regress factor B on factor A; the residual is "B's incremental information beyond A." If the residual still correlates significantly with the target, B has value beyond A. Otherwise it is redundant. This is the litmus test for "does this new factor matter?"
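A sketch of that litmus test on synthetic data follows. The helper names, the data-generating setup, and the use of a plain Pearson correlation with a t-statistic on a pooled sample are all assumptions made for illustration; a real factor test would typically use cross-sectional IC per period, and the t-statistic here assumes independent observations.

```python
import numpy as np

def residualize(b, a):
    """Residual of regressing b on a (with intercept)."""
    A = np.column_stack([np.ones_like(a), a])
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return b - A @ beta

def corr_tstat(x, y):
    """Pearson correlation and its t-statistic (assumes i.i.d. samples)."""
    r = np.corrcoef(x, y)[0, 1]
    n = len(x)
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return r, t

rng = np.random.default_rng(3)
n = 2000
a = rng.normal(size=n)        # existing factor A
extra = rng.normal(size=n)    # information that A does not contain
target = a + extra + rng.normal(size=n)

b_useful = 0.8 * a + 0.6 * extra                  # overlaps A, adds new info
b_redundant = 0.8 * a + 0.6 * rng.normal(size=n)  # overlaps A, adds only noise

r_useful, t_useful = corr_tstat(residualize(b_useful, a), target)
r_redundant, t_redundant = corr_tstat(residualize(b_redundant, a), target)
print(f"useful:    r={r_useful:.3f}, t={t_useful:.1f}")
print(f"redundant: r={r_redundant:.3f}, t={t_redundant:.1f}")
```

Both candidate factors have the same raw overlap with A, so a raw correlation check cannot tell them apart. Only the residual of `b_useful` should keep a clearly significant relationship with the target.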