
Code Metrics Analysis Project

This project analyzes code metrics from open-source Python projects on GitHub to investigate relationships between code complexity and issues/fixes.

Features

  • Code Metrics Analysis: Measures LOC, cyclomatic complexity, cognitive complexity, inheritance depth, and maintainability index
  • Git Commit Analysis: Analyzes git commit logs to find commits with "fix" in the message and tracks which files were changed
  • Statistical Analysis:
      • Correlation analysis (Pearson and Spearman)
      • Linear regression modeling
      • Hypothesis testing (ANOVA, Kruskal-Wallis)
      • Confidence intervals
      • Variance-covariance analysis
      • Pivot tables
      • Discrete distribution analysis
  • Visualizations: Creates comprehensive plots and charts
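
To make two of the metrics above concrete, here is a minimal sketch of counting lines of code and approximating cyclomatic complexity with Python's standard ast module. The function names are hypothetical and the counting rule is a simplified McCabe-style approximation; the project's own analyzer in code_analyzer.py may use a dedicated library and different rules.

```python
import ast

# Node types that add one decision point each (simplified McCabe-style rule;
# an assumption for illustration, not necessarily what code_analyzer.py does).
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.AsyncFor,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def count_loc(source: str) -> int:
    """Count non-blank lines that are not pure comment lines."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def approx_cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" contributes two extra paths.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)

if __name__ == "__main__":
    print(count_loc(SAMPLE), approx_cyclomatic_complexity(SAMPLE))
```

The sample function has one if, one boolean operator, and one for loop, so the approximation counts three decision points on top of the base path.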

Setup

  1. Install dependencies:
     pip install -r requirements.txt
  2. Set up GitHub API token (optional but recommended):
       • Create a .env file in the project root
       • Add your GitHub token: GITHUB_TOKEN=your_token_here
       • Get a token from: https://github.com/settings/tokens
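
As a rough illustration of how the token could be picked up at runtime, the sketch below checks the GITHUB_TOKEN environment variable first and then falls back to a .env file, using only the standard library. The helper name load_github_token is hypothetical; the project itself may load the file differently (for example through a dotenv package).

```python
import os
from pathlib import Path


def load_github_token(env_path=".env"):
    """Return the GitHub token, or None if it is not configured.

    Hypothetical helper: prefers the process environment, then a
    simple KEY=VALUE .env file.
    """
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        return token

    path = Path(env_path)
    if not path.is_file():
        return None

    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "GITHUB_TOKEN":
            # Tolerate optional surrounding quotes.
            return value.strip().strip("'\"") or None
    return None


if __name__ == "__main__":
    found = load_github_token()
    print("token configured" if found else "no token; lower API rate limits apply")
```

Returning None rather than raising keeps the token optional, which matches the setup description above.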

Usage

Run the main analysis script:

python main.py

The script will:

  1. Use a curated list of popular Python projects that use semantic commits
  2. Clone repositories with full git history
  3. Analyze code metrics for all Python files
  4. Parse git commit logs to find "fix" commits (using semantic commit formats like "fix:", "fix(scope):", etc.) and track changed files
  5. Perform statistical analysis
  6. Generate visualizations
  7. Save results to results/ and figures/ directories
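
Step 4 can be sketched as follows: ask git for each commit's subject and changed files, then keep only the .py paths. This is an illustrative sketch, not the project's data_collector.py; the function names and the choice of a \x01 marker byte to separate commits are assumptions.

```python
import subprocess

# Marker placed at the start of every commit header line so that headers
# can be told apart from file names (an arbitrary choice for this sketch).
MARK = "\x01"
LOG_FORMAT = "--pretty=format:%x01%H%x09%s"  # marker, hash, tab, subject


def run_git_log(repo_path):
    """Return raw 'git log --name-only' output for a cloned repository."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", LOG_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_name_only_log(log_text):
    """Turn the raw log text into a list of commit dictionaries."""
    commits = []
    for line in log_text.splitlines():
        if line.startswith(MARK):
            sha, _, subject = line[len(MARK):].partition("\t")
            commits.append({"sha": sha, "subject": subject, "files": []})
        elif line.strip() and commits:
            commits[-1]["files"].append(line.strip())
    return commits


def python_files(commit):
    """Keep only the Python files touched by a commit."""
    return [path for path in commit["files"] if path.endswith(".py")]


if __name__ == "__main__":
    for commit in parse_name_only_log(run_git_log(".")):
        print(commit["sha"][:8], commit["subject"], python_files(commit))
```

Separating the parsing from the subprocess call makes the parser easy to check against canned log text without needing a repository.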

Configuration

Edit config.py to customize:

  • Number of repositories to analyze (MAX_REPOSITORIES)
  • Minimum stars for repository selection (MIN_STARS)
  • Excluded directories (EXCLUDE_DIRS)
  • Statistical significance level (SIGNIFICANCE_LEVEL)
  • Confidence level (CONFIDENCE_LEVEL)
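
For orientation, a config.py using the setting names listed above might look like the sketch below. Only the names come from this README; every value shown is an illustrative placeholder, not the project's actual default.

```python
# Illustrative config.py sketch. The setting names are the ones listed
# above; all values are placeholder assumptions.

MAX_REPOSITORIES = 10        # number of repositories to analyze
MIN_STARS = 1000             # minimum stars for repository selection

# Directories skipped when collecting Python files (assumed examples).
EXCLUDE_DIRS = ["tests", "docs", "venv", ".git", "__pycache__"]

SIGNIFICANCE_LEVEL = 0.05    # alpha used in hypothesis tests
CONFIDENCE_LEVEL = 0.95      # level used for confidence intervals
```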

Output

  • results/raw_metrics.csv: All collected code metrics
  • results/analysis_results.json: Statistical analysis results
  • figures/: Various visualization plots

Project Structure

  • main.py: Main orchestration script
  • github_client.py: GitHub API client
  • code_analyzer.py: Code metrics analyzer
  • data_collector.py: Data collection pipeline
  • statistical_analysis.py: Statistical analysis functions
  • visualizer.py: Visualization functions
  • config.py: Configuration settings

Requirements

  • Python 3.8+
  • Git (for cloning repositories)
  • GitHub API token (optional, increases rate limits)

Notes

  • The analysis focuses on popular Python projects that use semantic commits
  • Fix detection recognizes semantic commit formats:
      • fix: (conventional commits)
      • fix(scope): (conventional commits with scope)
      • Fix:, FIX: (case variations)
      • fixes #123, fix #123 (issue references)
      • fixed, fixing, bugfix, bug fix (variations)
  • Only Python files (.py) are tracked for fix commits
  • Full git history is cloned (not shallow) to analyze all commits
  • Temporary cloned repositories are cleaned up after analysis
  • The curated repository list can be modified in config.py
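
The fix-detection formats listed above can be approximated with a single case-insensitive regular expression. This is a sketch under that assumption; the pattern and the is_fix_commit name are hypothetical, and the project's real matcher may be stricter or looser.

```python
import re

# Matches fix, fixes, fixed, fixing, bugfix and "bug fix" as whole words,
# in any letter case. Word boundaries keep words such as "prefix" or
# "fixture" from matching.
FIX_PATTERN = re.compile(r"\b(?:bug\s?)?fix(?:es|ed|ing)?\b", re.IGNORECASE)


def is_fix_commit(message: str) -> bool:
    """Return True if the commit message looks like a fix."""
    return FIX_PATTERN.search(message) is not None


if __name__ == "__main__":
    for msg in ("fix(parser): handle None", "feat: add fixture loader"):
        print(msg, "->", is_fix_commit(msg))
```

A keyword match like this also accepts messages where "fix" appears in passing (for example "docs: explain how to fix your setup"), so restricting the search to the subject line or to the conventional-commit prefix is a reasonable tightening if false positives matter.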