
Factor Collinearity: Diagnosis & Deduplication

You think you have 50 factors, but the correlation matrix often shows only 5-10 truly independent dimensions; the rest are close cousins. Collinearity destabilizes model weights and inflates backtest results. This post covers diagnosis and deduplication (methodology only).

Why collinearity happens

Most A-share factors derive from a handful of underlying signals: price, volume, fund flow, fundamentals. Applying different windows and different transformations to the same signal creates "new" factors that are highly correlated. For example, 5-day, 10-day, and 20-day momentum typically have pairwise |ρ| of 0.6 or higher, and implied volatility versus historical volatility is almost a linear relationship.

Without deduplication, your "50-factor" model may have only about 5 effective dimensions, each "copied" roughly 10×. The model over-relies on those 5 real signals, and because near-duplicates are interchangeable, the weights jump from fold to fold. The result is unstable and over-confident.

Diagnosis triad

1. Correlation matrix

Compute the pairwise Spearman (rank) correlation. |ρ| > 0.7 indicates high correlation; |ρ| > 0.9 means the two are essentially the same factor.
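A minimal sketch of this check, assuming the factors sit in a pandas DataFrame with one column per factor. The function name `high_corr_pairs`, the column names, and the synthetic data are all illustrative, not part of any particular pipeline.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(factors: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """List factor pairs whose absolute Spearman correlation exceeds `threshold`."""
    corr = factors.corr(method="spearman")
    names = list(corr.columns)
    rows = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            rho = corr.iloc[i, j]
            if abs(rho) > threshold:
                rows.append((names[i], names[j], rho))
    out = pd.DataFrame(rows, columns=["factor_a", "factor_b", "rho"])
    # Strongest relationships first, regardless of sign.
    order = out["rho"].abs().sort_values(ascending=False).index
    return out.reindex(order).reset_index(drop=True)


# Synthetic data (assumption): two momentum-like factors share one signal,
# and a third factor is independent of both.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
factors = pd.DataFrame({
    "mom_5d": base + 0.1 * rng.normal(size=500),
    "mom_10d": base + 0.1 * rng.normal(size=500),
    "value": rng.normal(size=500),
})
pairs = high_corr_pairs(factors, threshold=0.7)
print(pairs)
```

In real data the correlation would usually be computed per cross-section (per date) and then averaged; this sketch pools everything into one sample for brevity.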

2. Hierarchical clustering

Cluster factors by correlation (for example, using 1 - |ρ| as the distance), then keep one representative per cluster, typically the factor with the highest IC.
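One way to sketch the clustering step uses SciPy's hierarchical clustering with 1 - |ρ| as the distance. The average-linkage choice, the 0.3 distance cutoff, the function name `pick_representatives`, and the IC values below are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def pick_representatives(factors: pd.DataFrame, ic: pd.Series, max_dist: float = 0.3) -> list:
    """Cluster factors on 1 - |Spearman rho| and keep the highest-|IC| member of each cluster."""
    corr = factors.corr(method="spearman").abs().to_numpy()
    dist = np.clip(1.0 - corr, 0.0, None)   # guard against tiny negative values
    dist = (dist + dist.T) / 2.0            # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    # linkage expects the condensed (upper-triangle) form of the distance matrix.
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=max_dist, criterion="distance")
    reps = []
    for label in np.unique(labels):
        members = factors.columns[labels == label]
        reps.append(ic.loc[members].abs().idxmax())
    return reps


# Synthetic data (assumption): three momentum variants and two volatility variants.
rng = np.random.default_rng(1)
n = 500
trend, vol = rng.normal(size=(2, n))
factors = pd.DataFrame({
    "mom_5d": trend + 0.2 * rng.normal(size=n),
    "mom_10d": trend + 0.2 * rng.normal(size=n),
    "mom_20d": trend + 0.2 * rng.normal(size=n),
    "vol_a": vol + 0.2 * rng.normal(size=n),
    "vol_b": vol + 0.2 * rng.normal(size=n),
})
# Hypothetical IC per factor; in practice these come from your own IC analysis.
ic = pd.Series({"mom_5d": 0.02, "mom_10d": 0.04, "mom_20d": 0.03,
                "vol_a": -0.05, "vol_b": 0.01})
reps = pick_representatives(factors, ic)
print(reps)
```

The representative is chosen on |IC| so that a strongly negative factor is not discarded merely because of its sign.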

3. VIF (Variance Inflation Factor)

Regress X_i on all the other factors and take the R² of that regression: VIF_i = 1/(1 - R_i²). VIF > 10 (equivalently R_i² > 0.9) indicates severe collinearity.
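The VIF definition can be sketched directly with NumPy least squares. The function name `vif` and the synthetic factors are illustrative; the regression includes an intercept, which is an assumption about how the factors are centered.

```python
import numpy as np
import pandas as pd


def vif(factors: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing factor i on all the others."""
    X = factors.to_numpy(dtype=float)
    out = {}
    for i, name in enumerate(factors.columns):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other factors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")


# Synthetic data (assumption): "b" is nearly a copy of "a"; "c" is independent.
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
factors = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=1000),
    "c": rng.normal(size=1000),
})
vifs = vif(factors)
print(vifs)
```

Unlike the pairwise correlation matrix, VIF also catches a factor that is a linear combination of several others, even when no single pairwise correlation is high.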

Handling methods

Method	Trade-offs
Cluster + keep representative	Simple and interpretable
Orthogonalization (PCA / Gram-Schmidt)	Replaces the originals with orthogonal components; preserves information but loses economic meaning
Regularization (Ridge / Lasso)	Suppresses collinearity at the model level; simple to apply
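As an illustration of the orthogonalization row, here is a minimal PCA sketch via SVD that also reports how many components are needed to reach a variance target. The 95% target, the function name `pca_orthogonalize`, and the synthetic factors are assumptions; the sketch also assumes no factor has zero variance.

```python
import numpy as np


def pca_orthogonalize(X: np.ndarray, var_target: float = 0.95):
    """Standardize factors, then keep the fewest principal components reaching `var_target`."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # First index at which cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), var_target)) + 1
    k = min(k, len(s))
    scores = Z @ Vt[:k].T  # mutually uncorrelated component scores
    return scores, explained[:k]


# Synthetic data (assumption): six "factors" built from only two latent signals.
rng = np.random.default_rng(3)
n = 1000
s1, s2 = rng.normal(size=(2, n))
X = np.column_stack(
    [s1 + 0.1 * rng.normal(size=n) for _ in range(3)]
    + [s2 + 0.1 * rng.normal(size=n) for _ in range(3)]
)
scores, explained = pca_orthogonalize(X)
print(scores.shape, explained)
```

The number of retained components gives a rough read on the effective dimensionality of the factor set, at the cost that each component is a blend of the original factors.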

The value of residual factors

Regress factor B on factor A; the residual is "B's incremental information beyond A." If the residual still correlates significantly with the target, B adds value beyond A. Otherwise it is redundant. This is the litmus test for "does this new factor matter?"
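The residual test can be sketched as follows, assuming a single pooled sample and Spearman rank correlation as the significance check. The helper name `residualize`, the variable names, and the synthetic data are illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr


def residualize(b: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Return the part of factor b that a linear regression on factor a cannot explain."""
    A = np.column_stack([np.ones(len(a)), a])  # intercept + factor a
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return b - A @ coef


# Synthetic data (assumption): the target depends on two independent signals.
rng = np.random.default_rng(4)
n = 2000
sig_a, sig_c = rng.normal(size=(2, n))
target = sig_a + sig_c + rng.normal(size=n)

factor_a = sig_a
b_redundant = sig_a + 0.3 * rng.normal(size=n)  # factor A plus pure noise
b_useful = 0.8 * sig_a + 0.6 * sig_c            # carries a second signal

rho_redundant, p_redundant = spearmanr(residualize(b_redundant, factor_a), target)
rho_useful, p_useful = spearmanr(residualize(b_useful, factor_a), target)
print("redundant:", rho_redundant, p_redundant)
print("useful:   ", rho_useful, p_useful)
```

In a real setting the regression would typically be run per cross-section and the residual IC tracked over time, rather than judged from one pooled p-value.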
