Diffstat (limited to 'README.md')
| -rw-r--r-- | README.md | 94 |
1 files changed, 94 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..83b74d4
--- /dev/null
+++ b/README.md
@@ -0,0 +1,94 @@
+# Code Metrics Analysis Project
+
+This project analyzes code metrics from open-source Python projects on GitHub to investigate relationships between code complexity and issues/fixes.
+
+## Features
+
+- **Code Metrics Analysis**: Measures LOC, cyclomatic complexity, cognitive complexity, inheritance depth, and maintainability index
+- **Git Commit Analysis**: Analyzes git commit logs to find commits with "fix" in the message and tracks which files were changed
+- **Statistical Analysis**:
+  - Correlation analysis (Pearson and Spearman)
+  - Linear regression modeling
+  - Hypothesis testing (ANOVA, Kruskal-Wallis)
+  - Confidence intervals
+  - Variance-covariance analysis
+  - Pivot tables
+  - Discrete distribution analysis
+- **Visualizations**: Creates comprehensive plots and charts
+
+## Setup
+
+1. **Install dependencies**:
+
+```bash
+pip install -r requirements.txt
+```
+
+2. **Set up GitHub API token** (optional but recommended):
+   - Create a `.env` file in the project root
+   - Add your GitHub token: `GITHUB_TOKEN=your_token_here`
+   - Get a token from: https://github.com/settings/tokens
+
+## Usage
+
+Run the main analysis script:
+
+```bash
+python main.py
+```
+
+The script will:
+
+1. Use a curated list of popular Python projects that use semantic commits
+2. Clone repositories with full git history
+3. Analyze code metrics for all Python files
+4. Parse git commit logs to find "fix" commits (using semantic commit formats like "fix:", "fix(scope):", etc.) and track changed files
+5. Perform statistical analysis
+6. Generate visualizations
+7. Save results to `results/` and `figures/` directories
+
+## Configuration
+
+Edit `config.py` to customize:
+
+- Number of repositories to analyze (`MAX_REPOSITORIES`)
+- Minimum stars for repository selection (`MIN_STARS`)
+- Excluded directories (`EXCLUDE_DIRS`)
+- Statistical significance level (`SIGNIFICANCE_LEVEL`)
+- Confidence level (`CONFIDENCE_LEVEL`)
+
+## Output
+
+- `results/raw_metrics.csv`: All collected code metrics
+- `results/analysis_results.json`: Statistical analysis results
+- `figures/`: Various visualization plots
+
+## Project Structure
+
+- `main.py`: Main orchestration script
+- `github_client.py`: GitHub API client
+- `code_analyzer.py`: Code metrics analyzer
+- `data_collector.py`: Data collection pipeline
+- `statistical_analysis.py`: Statistical analysis functions
+- `visualizer.py`: Visualization functions
+- `config.py`: Configuration settings
+
+## Requirements
+
+- Python 3.8+
+- Git (for cloning repositories)
+- GitHub API token (optional, increases rate limits)
+
+## Notes
+
+- The analysis focuses on popular Python projects that use semantic commits
+- Fix detection recognizes semantic commit formats:
+  - `fix:` (conventional commits)
+  - `fix(scope):` (conventional commits with scope)
+  - `Fix:`, `FIX:` (case variations)
+  - `fixes #123`, `fix #123` (issue references)
+  - `fixed`, `fixing`, `bugfix`, `bug fix` (variations)
+- Only Python files (.py) are tracked for fix commits
+- Full git history is cloned (not shallow) to analyze all commits
+- Temporary cloned repositories are cleaned up after analysis
+- The curated repository list can be modified in `config.py`
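
The fix-detection formats listed in the README's Notes section could be expressed as a single case-insensitive regular expression. The sketch below is illustrative only and is not part of this commit: the names `FIX_PATTERN` and `is_fix_commit` are hypothetical, and the project's real matching logic is not shown in this diff. It assumes the whole commit message is searched and that the `fix:` and `fix(scope):` prefixes must appear at the start of the message.

```python
import re

# Hypothetical pattern covering the formats named in the README notes.
# The actual regex used by the project does not appear in this diff.
FIX_PATTERN = re.compile(
    r"""
    ^fix(\([^)]*\))?:         # fix: or fix(scope): at the start of the message
    | \bfix(es)?\s+\#\d+      # issue references such as fix #123 or fixes #123
    | \b(fixed|fixing)\b      # verb variations
    | \bbug\s?fix             # bugfix or bug fix
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_fix_commit(message: str) -> bool:
    """Return True if the commit message matches one of the fix formats."""
    return FIX_PATTERN.search(message.strip()) is not None


if __name__ == "__main__":
    for msg in ["fix(parser): handle empty input", "feat: add prefix option"]:
        print(msg, "->", is_fix_commit(msg))
```

The word boundaries keep unrelated words such as "prefix" from counting as a fix, which a plain substring search for "fix" would not do.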