feat: add normality and variance homogeneity statistical tests #67

Merged
maskedsyntax merged 2 commits into main from feat/statistical-tests
Mar 3, 2026

maskedsyntax (Member) commented Mar 3, 2026

Summary

  • 2 new checks in checks/statistical_tests.py:
    • normality: Shapiro-Wilk for n <= 5,000; D'Agostino-Pearson for larger samples. Flags non-normal numeric columns with statistic + p-value in the issue description. Severity is critical at p < 0.001, warning otherwise.
    • variance_homogeneity: Levene's test (median-centred, robust to non-normality) across groups defined by the target column. Reports std ratio between groups. Skipped automatically when no target column is set or groups are too small.
  • Normality in summaries: _summarize_numeric now embeds normality: {test, statistic, p_value, is_normal} directly in each numeric column's summary dict, so the result is visible in JSON/HTML/Markdown reports without running the check separately.
  • Bug fix: _summarize_numeric was crashing when a column contained inf values (np.histogram with range=(-inf, inf)). All distribution statistics (quantiles, histogram, MAD, skewness, etc.) are now computed on finite-only values; infinite_count still reflects the raw column.
  • Config: StatisticalTestThresholds added to config.py with all tunable knobs (p-values, cutoffs, min group sizes).
  • 30 new tests in tests/test_statistical_tests.py. All 180 tests pass (150 existing + 30 new).
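
The test-selection and severity rules from the normality bullet can be sketched as follows. This is a minimal illustration, not the code in checks/statistical_tests.py: the function name, ALPHA, and MIN_SAMPLES are assumptions (the PR does not state the significance level or the minimum sample size), while the 5,000 cutoff and the p < 0.001 severity boundary come from the description above. SciPy documents that the Shapiro-Wilk p-value may be inaccurate for N > 5000, which is consistent with switching tests at that size.

```python
import numpy as np
from scipy import stats

SHAPIRO_MAX_N = 5_000   # cutoff stated in the PR description
CRITICAL_P = 0.001      # severity boundary stated in the PR description
ALPHA = 0.05            # ASSUMED significance level, not given in the PR
MIN_SAMPLES = 20        # ASSUMED minimum sample size, not given in the PR


def normality_check(values, alpha=ALPHA):
    """Hypothetical helper: test one numeric column for normality."""
    x = np.asarray(values, dtype=float)
    x = x[np.isfinite(x)]  # drop NaN and +/-inf before testing
    if x.size < MIN_SAMPLES:
        return None  # too few observations to test

    if x.size <= SHAPIRO_MAX_N:
        test = "shapiro_wilk"
        stat, p = stats.shapiro(x)
    else:
        test = "dagostino_pearson"
        stat, p = stats.normaltest(x)

    is_normal = bool(p >= alpha)
    if is_normal:
        severity = None
    else:
        severity = "critical" if p < CRITICAL_P else "warning"

    return {
        "test": test,
        "statistic": float(stat),
        "p_value": float(p),
        "is_normal": is_normal,
        "severity": severity,
    }
```

The returned dict mirrors the {test, statistic, p_value, is_normal} shape mentioned for the summaries; the severity key is added here only to show the critical/warning split.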

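The inf-handling fix described in the summary can be illustrated with a small sketch. np.histogram raises a ValueError when the autodetected range is not finite, so the idea is to count infinities on the raw column and compute everything else on the finite subset. The function name and the returned keys other than infinite_count are assumptions for illustration; the real _summarize_numeric computes more statistics than shown here.

```python
import numpy as np


def summarize_numeric_sketch(values, bins=10):
    """Hypothetical reduced version of a numeric summary that tolerates inf."""
    x = np.asarray(values, dtype=float)

    # Counted on the raw column, so the report still shows the infinities.
    infinite_count = int(np.isinf(x).sum())

    # np.isfinite is False for NaN as well as +/-inf.
    finite = x[np.isfinite(x)]
    if finite.size == 0:
        return {"infinite_count": infinite_count, "histogram": None}

    counts, edges = np.histogram(finite, bins=bins)
    median = float(np.median(finite))
    mad = float(np.median(np.abs(finite - median)))  # median absolute deviation

    return {
        "infinite_count": infinite_count,
        "median": median,
        "mad": mad,
        "histogram": {"counts": counts.tolist(), "bin_edges": edges.tolist()},
    }
```

Passing the unfiltered array to np.histogram is what triggers the crash; filtering first keeps the bin edges finite.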
Test plan

  • uv run pytest tests/test_statistical_tests.py -v (30/30 pass)
  • uv run pytest tests/ --ignore=tests/test_statistical_tests.py -q (150/150 pass, no regressions)
  • uv run ruff check . (clean)

- config.py: add StatisticalTestThresholds (normality p-value, Shapiro
  max-n cutoff, min sample size, Levene p-value and min group size)
  wired into HashPrepConfig
- checks/statistical_tests.py: two new checks —
    - normality: Shapiro-Wilk (n ≤ 5000) or D'Agostino-Pearson (n > 5000)
      per numeric column; flags non-normal distributions with stat + p-value
      in the issue description
    - variance_homogeneity: Levene's test (median-centred, robust) across
      target-column groups; reports std ratio alongside the test result;
      skipped when no target column is set or groups are too small
- summaries/variables.py: embed normality result (test name, statistic,
  p_value, is_normal bool) into each numeric column's summary dict; also
  fix a pre-existing crash where infinite values in a column caused
  np.histogram to receive range=(-inf, inf) — all distribution stats
  now computed on finite-only values
- checks/__init__.py + core/analyzer.py: register normality and
  variance_homogeneity in CHECKS registry and ALL_CHECKS list
- tests/test_statistical_tests.py: 30 tests covering normality check
  unit, Levene check unit, summary embedding, and end-to-end integration
  via DatasetAnalyzer; all 180 tests pass (150 existing + 30 new)
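
The variance_homogeneity check described above can be sketched with scipy.stats.levene, whose center="median" option gives the median-centred (Brown-Forsythe) variant. This is an illustration under stated assumptions, not the merged implementation: the function name, alpha, and min_group_size defaults are hypothetical, and dropping undersized groups before testing is one possible reading of "skipped when groups are too small".

```python
import numpy as np
from scipy import stats


def variance_homogeneity_sketch(values, groups, alpha=0.05, min_group_size=5):
    """Hypothetical helper: Levene's test across groups of a target column."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)

    samples = []
    for label in np.unique(groups):
        sample = values[groups == label]
        sample = sample[np.isfinite(sample)]
        if sample.size >= min_group_size:  # ASSUMED handling of small groups
            samples.append(sample)

    if len(samples) < 2:
        return None  # nothing to compare, so the check is skipped

    stat, p = stats.levene(*samples, center="median")

    stds = [float(s.std(ddof=1)) for s in samples]
    smallest = min(stds)
    std_ratio = max(stds) / smallest if smallest > 0 else float("inf")

    return {
        "statistic": float(stat),
        "p_value": float(p),
        "equal_variance": bool(p >= alpha),
        "std_ratio": std_ratio,
    }
```

The std ratio (largest group standard deviation over smallest) is reported alongside the p-value because it gives a sense of effect size that the test statistic alone does not.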

vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hashprep | Ready | Preview, Comment | Mar 3, 2026 10:50am |

maskedsyntax merged commit 9f6f822 into main on Mar 3, 2026
6 checks passed
maskedsyntax deleted the feat/statistical-tests branch on March 3, 2026 11:07