# Code Metrics Analysis Project
This project analyzes code metrics from open-source Python projects on GitHub to investigate the relationship between code complexity and bug-fix commits.
## Features
- **Code Metrics Analysis**: Measures LOC, cyclomatic complexity, cognitive complexity, inheritance depth, and maintainability index
- **Git Commit Analysis**: Analyzes git commit logs to find commits with "fix" in the message and tracks which files were changed
- **Statistical Analysis**:
  - Correlation analysis (Pearson and Spearman)
  - Linear regression modeling
  - Hypothesis testing (ANOVA, Kruskal-Wallis)
  - Confidence intervals
  - Variance-covariance analysis
  - Pivot tables
  - Discrete distribution analysis
- **Visualizations**: Generates plots and charts of the collected metrics and statistical results
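
As a rough illustration of two of the metrics listed above, the sketch below approximates cyclomatic complexity by counting decision points with Python's standard `ast` module, and counts non-blank, non-comment lines. It is not the project's `code_analyzer.py` implementation; the function names `approx_cyclomatic_complexity` and `count_loc` and the set of counted node types are assumptions made for this example.

```python
import ast

# Node types treated as decision points in this simplified sketch (an assumption).
_DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.AsyncFor,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def approx_cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands, i.e. two extra branches.
            complexity += len(node.values) - 1
    return complexity


def count_loc(source: str) -> int:
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


sample = '''
def classify(n):
    # toy function
    if n < 0 and n % 2:
        return "neg-odd"
    for i in range(n):
        if i == 3:
            return "has-three"
    return "other"
'''
print(approx_cyclomatic_complexity(sample), count_loc(sample))  # prints: 5 7
```

Dedicated metric libraries typically apply more rules than this (for example, per-function scoring), so values from the real pipeline may differ.
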
## Setup
1. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
2. **Set up GitHub API token** (optional but recommended):
   - Create a `.env` file in the project root
   - Add your GitHub token: `GITHUB_TOKEN=your_token_here`
   - Get a token from: https://github.com/settings/tokens
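
The token setup above could be consumed roughly as follows. This is a standard-library sketch, not the code in `github_client.py`: the helper names `parse_env_text`, `load_env_file`, and `github_headers` are hypothetical, and the project may well load `.env` through a dedicated library instead.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Return the variables defined in `path`, or {} if the file is missing."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    return parse_env_text(env_path.read_text(encoding="utf-8"))


def github_headers(token=None):
    """Build GitHub API headers; Authorization is added only if a token exists."""
    if token is None:
        # Prefer the process environment, then fall back to the .env file.
        token = os.environ.get("GITHUB_TOKEN") or load_env_file().get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Without a token the headers carry no `Authorization` entry, so requests go out unauthenticated and are subject to the lower rate limit.
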
## Usage
Run the main analysis script:
```bash
python main.py
```
The script will:
1. Use a curated list of popular Python projects that use semantic commits
2. Clone repositories with full git history
3. Analyze code metrics for all Python files
4. Parse git commit logs to find fix commits (semantic commit formats such as `fix:` and `fix(scope):`) and track the files they changed
5. Perform statistical analysis
6. Generate visualizations
7. Save results to `results/` and `figures/` directories
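
Step 4 above can be sketched with `git log --name-only` and a custom `--pretty` format. The control-character delimiters, the function names `parse_git_log` and `python_files_by_commit`, and the returned dictionary shape are assumptions for illustration; the project's `data_collector.py` may work differently.

```python
import subprocess

# \x01 marks the start of a commit, \x02 separates hash from subject.
_LOG_FORMAT = "%x01%H%x02%s"


def parse_git_log(output):
    """Turn delimited `git log --name-only` output into a list of dicts."""
    commits = []
    for chunk in output.split("\x01")[1:]:
        header, _, rest = chunk.partition("\n")
        commit_hash, _, subject = header.partition("\x02")
        files = [line.strip() for line in rest.splitlines() if line.strip()]
        commits.append({"hash": commit_hash, "subject": subject, "files": files})
    return commits


def python_files_by_commit(repo_path):
    """Run git in `repo_path` and keep only the .py files of each commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only",
         f"--pretty=format:{_LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    commits = parse_git_log(result.stdout)
    for commit in commits:
        commit["files"] = [f for f in commit["files"] if f.endswith(".py")]
    return commits
```

Calling `python_files_by_commit("path/to/clone")` on a full (non-shallow) clone would return one entry per commit, which can then be filtered by subject to isolate fix commits.
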
## Configuration
Edit `config.py` to customize:
- Number of repositories to analyze (`MAX_REPOSITORIES`)
- Minimum stars for repository selection (`MIN_STARS`)
- Excluded directories (`EXCLUDE_DIRS`)
- Statistical significance level (`SIGNIFICANCE_LEVEL`)
- Confidence level (`CONFIDENCE_LEVEL`)
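
For orientation, a `config.py` exposing the settings above might look like the sketch below. The setting names come from this README, but every value shown is a placeholder assumption rather than the project's actual default, and the `is_excluded` helper is hypothetical.

```python
from pathlib import PurePosixPath

# Placeholder values, chosen only for illustration.
MAX_REPOSITORIES = 10
MIN_STARS = 1000
EXCLUDE_DIRS = {"tests", "docs", "venv", ".git", "__pycache__"}
SIGNIFICANCE_LEVEL = 0.05   # alpha used for hypothesis tests
CONFIDENCE_LEVEL = 0.95     # coverage used for confidence intervals


def is_excluded(relative_path):
    """True if any directory component of the path is in EXCLUDE_DIRS."""
    directories = PurePosixPath(relative_path).parts[:-1]
    return any(part in EXCLUDE_DIRS for part in directories)
```

With these placeholder values, `is_excluded("pkg/tests/test_x.py")` is true while `is_excluded("pkg/core.py")` is false.
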
## Output
- `results/raw_metrics.csv`: All collected code metrics
- `results/analysis_results.json`: Statistical analysis results
- `figures/`: Various visualization plots
## Project Structure
- `main.py`: Main orchestration script
- `github_client.py`: GitHub API client
- `code_analyzer.py`: Code metrics analyzer
- `data_collector.py`: Data collection pipeline
- `statistical_analysis.py`: Statistical analysis functions
- `visualizer.py`: Visualization functions
- `config.py`: Configuration settings
## Requirements
- Python 3.8+
- Git (for cloning repositories)
- GitHub API token (optional; raises the API rate limit)
## Notes
- The analysis focuses on popular Python projects that use semantic commits
- Fix detection recognizes semantic commit formats:
  - `fix:` (conventional commits)
  - `fix(scope):` (conventional commits with scope)
  - `Fix:`, `FIX:` (case variations)
  - `fixes #123`, `fix #123` (issue references)
  - `fixed`, `fixing`, `bugfix`, `bug fix` (variations)
- Only Python files (`.py`) changed in fix commits are tracked
- Full git history is cloned (not shallow) to analyze all commits
- Temporary cloned repositories are cleaned up after analysis
- The curated repository list can be modified in `config.py`
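
The fix-detection formats listed above can be approximated with a single case-insensitive regular expression. This is a sketch under the assumption that matching follows exactly those listed forms; `FIX_PATTERN` and `is_fix_commit` are hypothetical names, and the project's real pattern may be broader or narrower.

```python
import re

FIX_PATTERN = re.compile(
    r"""
    ^\s*fix(?:\([^)]*\))?:         # fix: or fix(scope): at the start of the message
    | \bfix(?:e[sd])?\s+\#\d+      # fix #123, fixes #123, fixed #123
    | \b(?:fixed|fixing)\b         # fixed, fixing
    | \bbug\s?fix                  # bugfix, bug fix
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_fix_commit(message):
    """Return True if the commit message matches one of the fix formats."""
    return FIX_PATTERN.search(message) is not None


for msg in ["fix(parser): off-by-one", "Fixes #123", "feat: add prefix option"]:
    print(msg, "->", is_fix_commit(msg))
```

The word boundaries keep words such as "prefix" or "fixtures" from matching. A bare message like "fix typo" (no colon, no issue number) is not matched by this sketch, since it is not among the listed formats.
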