Code Metrics Analysis Project
This project analyzes code metrics from open-source Python projects on GitHub to investigate relationships between code complexity and issues/fixes.
Features
- Code Metrics Analysis: Measures LOC, cyclomatic complexity, cognitive complexity, inheritance depth, and maintainability index
- Git Commit Analysis: Analyzes git commit logs to find commits with "fix" in the message and tracks which files were changed
- Statistical Analysis:
  - Correlation analysis (Pearson and Spearman)
  - Linear regression modeling
  - Hypothesis testing (ANOVA, Kruskal-Wallis)
  - Confidence intervals
  - Variance-covariance analysis
  - Pivot tables
  - Discrete distribution analysis
- Visualizations: Creates comprehensive plots and charts
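To make the correlation step concrete, here is a minimal standard-library sketch of Pearson and Spearman correlation between a complexity metric and a per-file fix count. It is not the project's implementation, which may well rely on a statistics library. The function names and the toy data are assumptions made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data invented for illustration (not real measurements).
complexity = [2, 5, 9, 14, 30]
fix_count = [0, 1, 1, 3, 4]
print(round(pearson(complexity, fix_count), 3))
print(round(spearman(complexity, fix_count), 3))
```

Spearman works on ranks, so it is less sensitive than Pearson to a few very complex files dominating the result, which is one reason to report both.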
Setup
- Install dependencies:
pip install -r requirements.txt
- Set up GitHub API token (optional but recommended):
  - Create a .env file in the project root
  - Add your GitHub token: GITHUB_TOKEN=your_token_here
  - Get a token from: https://github.com/settings/tokens
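As a rough illustration of how the optional token could be picked up, the sketch below reads GITHUB_TOKEN from the environment and falls back to a simple .env parser. The helper names (parse_env_text, load_github_token, auth_headers) are hypothetical, and the project itself may load the file differently, for example through a dotenv library.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_github_token(env_path=".env"):
    """Return the token from the environment, else from the .env file, else None."""
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        return token
    path = Path(env_path)
    if path.is_file():
        return parse_env_text(path.read_text(encoding="utf-8")).get("GITHUB_TOKEN")
    return None


def auth_headers(token):
    """Build GitHub API request headers; without a token, send no Authorization."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Because the token is optional, the sketch returns None rather than raising when nothing is configured; requests then go out unauthenticated at the lower rate limit.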
Usage
Run the main analysis script:
python main.py
The script will:
- Use a curated list of popular Python projects that use semantic commits
- Clone repositories with full git history
- Analyze code metrics for all Python files
- Parse git commit logs to find "fix" commits (using semantic commit formats like "fix:", "fix(scope):", etc.) and track changed files
- Perform statistical analysis
- Generate visualizations
- Save results to the results/ and figures/ directories
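The commit-parsing step above can be sketched as follows, assuming the log is read with `git log --name-only` and a custom `--pretty=format:` string. The marker string, function names, and the narrow "fix:" / "fix(scope):" regex are assumptions for illustration, not the project's actual code.

```python
import re
import subprocess
from collections import Counter

MARKER = "@@commit@@"  # hypothetical marker that flags the start of each commit
LOG_FORMAT = f"--pretty=format:{MARKER}%H%x09%s"  # hash, tab, subject line
FIX_SUBJECT = re.compile(r"^fix(\([^)]*\))?!?:", re.IGNORECASE)


def read_git_log(repo_path):
    """Run git log in repo_path and return its raw text output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", LOG_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(text):
    """Split log text into commits, each with a hash, subject, and file list."""
    commits = []
    for line in text.splitlines():
        if line.startswith(MARKER):
            commit_hash, _, subject = line[len(MARKER):].partition("\t")
            commits.append({"hash": commit_hash, "subject": subject, "files": []})
        elif line.strip() and commits:
            # Any other non-blank line is a path changed by the current commit.
            commits[-1]["files"].append(line.strip())
    return commits


def count_fix_touches(commits):
    """Count, per .py file, how many fix commits changed it."""
    counts = Counter()
    for commit in commits:
        if FIX_SUBJECT.match(commit["subject"]):
            counts.update(f for f in commit["files"] if f.endswith(".py"))
    return counts
```

Separating read_git_log from parse_git_log keeps the parser testable on a plain string, with no repository needed. The per-file counts can then be joined with the code metrics by file path.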
Configuration
Edit config.py to customize:
- Number of repositories to analyze (MAX_REPOSITORIES)
- Minimum stars for repository selection (MIN_STARS)
- Excluded directories (EXCLUDE_DIRS)
- Statistical significance level (SIGNIFICANCE_LEVEL)
- Confidence level (CONFIDENCE_LEVEL)
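For orientation, a config.py with these settings might look roughly like the sketch below. Only the setting names come from this README. Every value is a placeholder chosen for illustration and is not the project's default, and the is_excluded helper is hypothetical.

```python
from pathlib import PurePosixPath

# Placeholder values for illustration only; check the real config.py for defaults.
MAX_REPOSITORIES = 10
MIN_STARS = 1000
EXCLUDE_DIRS = {"tests", "docs", "venv", ".git", "__pycache__"}
SIGNIFICANCE_LEVEL = 0.05  # alpha used when judging p-values
CONFIDENCE_LEVEL = 0.95    # coverage used for confidence intervals


def is_excluded(path):
    """Hypothetical helper: True if any directory in path is in EXCLUDE_DIRS."""
    return any(part in EXCLUDE_DIRS for part in PurePosixPath(path).parts[:-1])
```

With these placeholder values, is_excluded("pkg/tests/test_api.py") is true, while is_excluded("pkg/core.py") is false.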
Output
- results/raw_metrics.csv: All collected code metrics
- results/analysis_results.json: Statistical analysis results
- figures/: Various visualization plots
Project Structure
- main.py: Main orchestration script
- github_client.py: GitHub API client
- code_analyzer.py: Code metrics analyzer
- data_collector.py: Data collection pipeline
- statistical_analysis.py: Statistical analysis functions
- visualizer.py: Visualization functions
- config.py: Configuration settings
Requirements
- Python 3.8+
- Git (for cloning repositories)
- GitHub API token (optional, increases rate limits)
Notes
- The analysis focuses on popular Python projects that use semantic commits
- Fix detection recognizes the following commit message patterns:
  - fix: (conventional commits)
  - fix(scope): (conventional commits with scope)
  - Fix:, FIX: (case variations)
  - fixes #123, fix #123 (issue references)
  - fixed, fixing, bugfix, bug fix (variations)
- Only Python files (.py) are tracked for fix commits
- Full git history is cloned (not shallow) to analyze all commits
- Temporary cloned repositories are cleaned up after analysis
- The curated repository list can be modified in config.py
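The fix-detection patterns listed in these notes could be expressed as a single regular expression, as in the sketch below. This is one possible reading of the list, not the project's actual matcher. The names FIX_PATTERN and is_fix_commit are hypothetical.

```python
import re

FIX_PATTERN = re.compile(
    r"""
    ^fix(\([^)]*\))?!?:        # "fix:" or "fix(scope):" at the start of the message
    | \bfix(es|ed)?\s+\#\d+    # "fix 123"-style issue references with a hash sign
    | \bfix(ed|ing)\b          # "fixed", "fixing"
    | \bbug\s?fix              # "bugfix", "bug fix"
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_fix_commit(message):
    """Return True if the commit message matches one of the fix patterns."""
    return FIX_PATTERN.search(message) is not None


for msg in ["fix(api): retry on timeout", "Fixes #123", "feat: add prefix support"]:
    print(msg, "->", is_fix_commit(msg))
```

The word boundaries keep words such as "prefix" or "fixtures" from counting as fixes. A plain substring search for "fix" would count them.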