feat: add mutual information scoring and Shannon entropy #68

Merged
maskedsyntax merged 1 commit into main from feat/mutual-information-entropy
Mar 3, 2026

maskedsyntax (Member) commented Mar 3, 2026

Summary

  • Shannon entropy in variable summaries:

    • Categorical columns: entropy_bits + normalized_entropy from value-count probabilities
    • Numeric columns: same, after discretising into configurable equal-width bins
    • Helper _shannon_entropy() returns None for constant columns (< 2 distinct values)
  • summarize_mutual_information() in summaries/mutual_info.py:

    • Detects task type from the target column's inferred type (Categorical -> classification, Numeric -> regression)
    • Uses mutual_info_classif or mutual_info_regression from sklearn
    • Label-encodes categorical features; median-fills missing numerics
    • Excludes high-cardinality categoricals (> 50 unique) and text/datetime columns
    • Returns scores sorted descending; stored in summaries["mutual_information"] when a target column is set
  • low_mutual_information check in checks/mutual_info.py:

    • Flags features with MI below the configured warning threshold (default 0.01 nats)
    • Requires target_col to be set; no-op otherwise
  • Config: MutualInfoThresholds added to config.py with low_mi_warning, max_categories_for_mi, min_samples_for_mi, and entropy_bins

  • 28 new tests in tests/test_mutual_info.py. Threshold-sensitive tests use per-test seeded RNGs and n=2000 to avoid KNN estimator variance. All 208 tests pass (180 existing + 28 new).
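The entropy fields described in the summary can be illustrated with a minimal standard-library sketch. It is not the PR's `_shannon_entropy()` implementation: the function names, the default of 10 bins, and the choice to normalise by log2 of the number of observed distinct values are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return (entropy_bits, normalized_entropy), or None for < 2 distinct values.

    Hypothetical stand-in for the PR's helper; normalising by log2(k),
    where k is the number of observed distinct values, is an assumption.
    """
    counts = Counter(v for v in values if v is not None)
    k = len(counts)
    if k < 2:
        return None  # constant (or empty) column: entropy is not reported
    total = sum(counts.values())
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy_bits, entropy_bits / math.log2(k)


def equal_width_bins(values, n_bins=10):
    """Map each number to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)  # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Categorical column: entropy straight from value counts.
categorical = shannon_entropy(["a", "b", "a", "b"])

# Numeric column: discretise first, then reuse the same helper.
numeric = shannon_entropy(equal_width_bins([0.0, 1.0, 2.0, 3.0], n_bins=2))
```

Both calls describe a perfectly balanced two-way split, so each yields 1 bit of entropy and a normalised entropy of 1.0.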

Test plan

  • uv run pytest tests/test_mutual_info.py -v: 28/28 pass
  • uv run pytest tests/ --ignore=tests/test_mutual_info.py -q: 180/180 pass
  • uv run ruff check . && uv run ruff format --check .: clean

- config.py: add MutualInfoThresholds (low_mi_warning, max_categories,
  min_samples, entropy_bins) wired into HashPrepConfig
- summaries/mutual_info.py: new module — summarize_mutual_information()
  computes sklearn MI scores (mutual_info_classif for categorical targets,
  mutual_info_regression for numeric targets) for all eligible features,
  with label-encoding for categoricals; scores sorted descending and stored
  in summaries["mutual_information"] when a target column is set
- summaries/variables.py: add _shannon_entropy() helper; embed entropy
  (entropy_bits + normalized_entropy) in numeric summaries (discretised
  into bins) and categorical summaries (from value-count probabilities)
- checks/mutual_info.py: new low_mutual_information check — flags features
  whose MI with the target is below the configured warning threshold
- checks/__init__.py + core/analyzer.py: register low_mutual_information
  in CHECKS and ALL_CHECKS; inject MI summary into analyzer.summaries
- summaries/__init__.py: export summarize_mutual_information
- tests/test_mutual_info.py: 28 tests covering entropy in summaries, MI
  computation correctness, low_mi check unit, and end-to-end integration;
  threshold-sensitive tests use per-test seeded RNGs and n=2000 to avoid
  KNN estimator variance; all 208 tests pass (180 existing + 28 new)
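The MI scoring and low-MI flagging described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in summaries/mutual_info.py or checks/mutual_info.py: `mi_scores` and `low_mi_features` are hypothetical names, task type is guessed from the pandas dtype rather than HashPrep's inferred column types, free-text detection is omitted, and the target is assumed to have no missing values. Only `mutual_info_classif`, `mutual_info_regression`, and the pandas/NumPy calls are real library APIs.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def mi_scores(df, target_col, max_categories=50, random_state=0):
    """Return {feature: MI in nats}, sorted descending (hypothetical helper)."""
    y = df[target_col]
    # Assumption: a non-numeric dtype means a categorical target (classification).
    is_classification = not pd.api.types.is_numeric_dtype(y)
    columns, discrete = {}, []
    for name in df.columns.drop(target_col):
        col = df[name]
        if pd.api.types.is_datetime64_any_dtype(col):
            continue  # datetime columns are excluded
        if pd.api.types.is_numeric_dtype(col):
            columns[name] = col.fillna(col.median())  # median-fill missing numerics
            discrete.append(False)
        elif col.nunique() <= max_categories:
            columns[name] = pd.factorize(col)[0]  # label-encode categoricals
            discrete.append(True)
        # else: high-cardinality categorical, excluded
    X = pd.DataFrame(columns, index=df.index)
    mask = np.array(discrete, dtype=bool)
    if is_classification:
        raw = mutual_info_classif(
            X, pd.factorize(y)[0], discrete_features=mask, random_state=random_state
        )
    else:
        raw = mutual_info_regression(
            X, y, discrete_features=mask, random_state=random_state
        )
    pairs = sorted(zip(X.columns, raw), key=lambda kv: kv[1], reverse=True)
    return {name: float(score) for name, score in pairs}


def low_mi_features(scores, threshold=0.01):
    """Names of features whose MI with the target is below the threshold."""
    return [name for name, score in scores.items() if score < threshold]


# Seeded RNG and n=2000, mirroring the test setup described in the PR.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
demo = pd.DataFrame({
    "x": x,                                    # strongly related to the target
    "noise": rng.normal(size=n),               # independent of the target
    "group": rng.choice(["a", "b", "c"], n),   # independent categorical
    "y": x + 0.1 * rng.normal(size=n),         # numeric target -> regression
})
scores = mi_scores(demo, "y")
flagged = low_mi_features(scores)
```

Because sklearn's estimator is KNN-based, scores for genuinely independent features hover near zero rather than being exactly zero, which is why fixing `random_state` and using a reasonably large sample makes threshold comparisons more stable.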
vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project   Deployment  Actions           Updated (UTC)
hashprep  Ready       Preview, Comment  Mar 3, 2026 1:19pm

@maskedsyntax maskedsyntax merged commit af1d38d into main Mar 3, 2026
6 checks passed
@maskedsyntax maskedsyntax deleted the feat/mutual-information-entropy branch March 3, 2026 13:55