Diffstat (limited to 'README.md')
-rw-r--r-- README.md 94
1 files changed, 94 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..83b74d4
--- /dev/null
+++ b/README.md
@@ -0,0 +1,94 @@
+# Code Metrics Analysis Project
+
+This project analyzes code metrics from open-source Python projects on GitHub to investigate how code complexity relates to the number of bug-fix commits.
+
+## Features
+
+- **Code Metrics Analysis**: Measures LOC, cyclomatic complexity, cognitive complexity, inheritance depth, and maintainability index
+- **Git Commit Analysis**: Parses git commit logs to find fix commits (identified by their commit message) and tracks which files each one changed
+- **Statistical Analysis**:
+  - Correlation analysis (Pearson and Spearman)
+  - Linear regression modeling
+  - Hypothesis testing (ANOVA, Kruskal-Wallis)
+  - Confidence intervals
+  - Variance-covariance analysis
+  - Pivot tables
+  - Discrete distribution analysis
+- **Visualizations**: Generates plots and charts from the collected metrics and analysis results
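
To make the correlation step concrete, here is a minimal pure-Python sketch of Pearson and Spearman correlation. It is not the project's implementation (which may well rely on a statistics library); the function names and the sample numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-file data: cyclomatic complexity vs. number of fix commits.
complexity = [3, 7, 12, 20, 35]
fix_commits = [0, 1, 1, 4, 9]
r_pearson = pearson(complexity, fix_commits)
r_spearman = spearman(complexity, fix_commits)
```

Spearman works on ranks, so it captures monotonic but non-linear relationships that Pearson can understate, which is one reason to report both.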
+
+## Setup
+
+1. **Install dependencies**:
+
+```bash
+pip install -r requirements.txt
+```
+
+2. **Set up GitHub API token** (optional but recommended):
+   - Create a `.env` file in the project root
+   - Add your GitHub token: `GITHUB_TOKEN=your_token_here`
+   - Get a token from: https://github.com/settings/tokens
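
As a rough illustration of how an optional token might be consumed, the sketch below builds request headers from the `GITHUB_TOKEN` environment variable. It assumes the variable has already been loaded into the environment (for example by a `.env` loader); `build_github_headers` is a hypothetical helper, not a function taken from `github_client.py`.

```python
import os


def build_github_headers(environ=None):
    """Return HTTP headers for the GitHub REST API.

    The Authorization header is added only when GITHUB_TOKEN is set,
    so unauthenticated use (with lower rate limits) still works.
    """
    env = os.environ if environ is None else environ
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Passing an explicit mapping keeps the example independent of the real environment.
anonymous = build_github_headers({})
authenticated = build_github_headers({"GITHUB_TOKEN": "your_token_here"})
```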
+
+## Usage
+
+Run the main analysis script:
+
+```bash
+python main.py
+```
+
+The script will:
+
+1. Start from a curated list of popular Python projects that follow semantic commit conventions
+2. Clone repositories with full git history
+3. Analyze code metrics for all Python files
+4. Parse git commit logs to find "fix" commits (using semantic commit formats like "fix:", "fix(scope):", etc.) and track changed files
+5. Perform statistical analysis
+6. Generate visualizations
+7. Save results to `results/` and `figures/` directories
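
Step 4 above can be sketched as follows: read `git log --name-only` output and count how often each Python file appears in a fix commit. This is only an illustration under stated assumptions. The marker string, the function names, and the simple `is_fix` predicate used in the example are hypothetical and not taken from `data_collector.py`.

```python
import subprocess
from collections import Counter

# Hypothetical marker that lets the parser tell commit headers from file names.
COMMIT_MARKER = "@@commit@@"


def read_git_log(repo_path):
    """Return the log of a cloned repository as text: one header line per
    commit (hash, tab, subject) followed by the files that commit changed."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only",
         f"--pretty=format:{COMMIT_MARKER}%H%x09%s"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def count_fix_touches(log_text, is_fix):
    """Count, per .py file, how many fix commits changed it."""
    counts = Counter()
    current_is_fix = False
    for line in log_text.splitlines():
        if line.startswith(COMMIT_MARKER):
            _sha, _, subject = line[len(COMMIT_MARKER):].partition("\t")
            current_is_fix = is_fix(subject)
            continue
        path = line.strip()
        if path and current_is_fix and path.endswith(".py"):
            counts[path] += 1
    return counts


# Hand-written sample in the same shape as the git output above.
sample_log = (
    "@@commit@@abc123\tfix: handle empty input\n"
    "src/parser.py\nREADME.md\n\n"
    "@@commit@@def456\tfeat: add exporter\n"
    "src/exporter.py\n\n"
    "@@commit@@789abc\tfix(parser): off-by-one\n"
    "src/parser.py\ntests/test_parser.py\n"
)
fix_counts = count_fix_touches(sample_log, lambda s: s.lower().startswith("fix"))
```

Keeping the parser separate from the `git` call means it can be exercised on sample text without cloning anything.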
+
+## Configuration
+
+Edit `config.py` to customize:
+
+- Number of repositories to analyze (`MAX_REPOSITORIES`)
+- Minimum stars for repository selection (`MIN_STARS`)
+- Excluded directories (`EXCLUDE_DIRS`)
+- Statistical significance level (`SIGNIFICANCE_LEVEL`)
+- Confidence level (`CONFIDENCE_LEVEL`)
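
For orientation, a `config.py` exposing these settings could look roughly like the sketch below. Only the setting names come from the list above. Every value is a placeholder assumption, and `is_excluded` is a hypothetical helper showing one way `EXCLUDE_DIRS` might be applied.

```python
from pathlib import PurePosixPath

# All values below are illustrative placeholders, not the project's defaults.
MAX_REPOSITORIES = 10       # number of repositories to analyze
MIN_STARS = 1000            # minimum stars for repository selection
EXCLUDE_DIRS = {"tests", "docs", "venv", ".git"}
SIGNIFICANCE_LEVEL = 0.05   # alpha for hypothesis tests
CONFIDENCE_LEVEL = 0.95     # level for confidence intervals


def is_excluded(path):
    """True if any directory component of a repository-relative path is excluded."""
    directories = PurePosixPath(path).parts[:-1]
    return any(part in EXCLUDE_DIRS for part in directories)
```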
+
+## Output
+
+- `results/raw_metrics.csv`: All collected code metrics
+- `results/analysis_results.json`: Statistical analysis results
+- `figures/`: Various visualization plots
+
+## Project Structure
+
+- `main.py`: Main orchestration script
+- `github_client.py`: GitHub API client
+- `code_analyzer.py`: Code metrics analyzer
+- `data_collector.py`: Data collection pipeline
+- `statistical_analysis.py`: Statistical analysis functions
+- `visualizer.py`: Visualization functions
+- `config.py`: Configuration settings
+
+## Requirements
+
+- Python 3.8+
+- Git (for cloning repositories)
+- GitHub API token (optional, increases rate limits)
+
+## Notes
+
+- The analysis focuses on popular Python projects that use semantic commits
+- Fix detection recognizes semantic commit formats:
+  - `fix:` (conventional commits)
+  - `fix(scope):` (conventional commits with scope)
+  - `Fix:`, `FIX:` (case variations)
+  - `fixes #123`, `fix #123` (issue references)
+  - `fixed`, `fixing`, `bugfix`, `bug fix` (variations)
+- Only changes to Python files (`.py`) are counted when tracking fix commits
+- Repositories are cloned with full git history (not shallow) so that every commit can be analyzed
+- Temporary cloned repositories are cleaned up after analysis
+- The curated repository list can be modified in `config.py`
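
The fix-detection formats listed in these notes could be expressed as a single regular expression along the following lines. This is a sketch, not the pattern the project actually uses. `FIX_PATTERN` and `is_fix_commit` are hypothetical names, and searching the whole message rather than only the subject line is an assumption.

```python
import re

FIX_PATTERN = re.compile(
    r"""
      ^\s*fix(\([^)]*\))?:        # "fix:" or "fix(scope):" at the start
    | \bfix(es)?\s+\#\d+          # "fix #123" or "fixes #123"
    | \b(fixed|fixing)\b          # "fixed", "fixing"
    | \bbug\s?fix(es)?\b          # "bugfix" or "bug fix"
    """,
    re.IGNORECASE | re.VERBOSE,   # IGNORECASE covers "Fix:" and "FIX:"
)


def is_fix_commit(message):
    """True if the commit message matches one of the fix formats."""
    return FIX_PATTERN.search(message) is not None


examples = [
    "fix: handle None in loader",
    "fix(parser): off-by-one in tokenizer",
    "FIX: wrong default",
    "Fixes #123",
    "bug fix in cache eviction",
    "feat: add prefix option",
]
flags = [is_fix_commit(m) for m in examples]
```

Word boundaries keep words such as "prefix" or "suffix" from being counted as fixes, at the cost of missing informal phrasings that none of the listed formats cover.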