1 files changed, 94 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..83b74d4
--- /dev/null
+++ b/README.md
@@ -0,0 +1,94 @@
+# Code Metrics Analysis Project
+
+This project collects code metrics from open-source Python projects on GitHub to investigate how code complexity relates to the number of fix commits that touch each file.
+
+## Features
+
+- **Code Metrics Analysis**: Measures LOC, cyclomatic complexity, cognitive complexity, inheritance depth, and maintainability index
+- **Git Commit Analysis**: Scans git commit logs for fix commits (semantic formats such as `fix:` and the variations listed under Notes) and tracks which files each one changed
+- **Statistical Analysis**:
+  - Correlation analysis (Pearson and Spearman)
+  - Linear regression modeling
+  - Hypothesis testing (ANOVA, Kruskal-Wallis)
+  - Confidence intervals
+  - Variance-covariance analysis
+  - Pivot tables
+  - Discrete distribution analysis
+- **Visualizations**: Generates plots and charts of the collected metrics and the statistical results
+
+## Setup
+
+1. **Install dependencies**:
+
+```bash
+pip install -r requirements.txt
+```
+
+2. **Set up GitHub API token** (optional but recommended):
+   - Create a `.env` file in the project root
+   - Add your GitHub token: `GITHUB_TOKEN=your_token_here`
+   - Get a token from: https://github.com/settings/tokens
+
+## Usage
+
+Run the main analysis script:
+
+```bash
+python main.py
+```
+
+The script will:
+
+1. Use a curated list of popular Python projects that use semantic commits
+2. Clone repositories with full git history
+3. Analyze code metrics for all Python files
+4. Parse git commit logs to find "fix" commits (using semantic commit formats like "fix:", "fix(scope):", etc.) and track changed files
+5. Perform statistical analysis
+6. Generate visualizations
+7. Save results to `results/` and `figures/` directories
+
+## Configuration
+
+Edit `config.py` to customize:
+
+- Number of repositories to analyze (`MAX_REPOSITORIES`)
+- Minimum stars for repository selection (`MIN_STARS`)
+- Excluded directories (`EXCLUDE_DIRS`)
+- Statistical significance level (`SIGNIFICANCE_LEVEL`)
+- Confidence level (`CONFIDENCE_LEVEL`)
+
+## Output
+
+- `results/raw_metrics.csv`: All collected code metrics
+- `results/analysis_results.json`: Statistical analysis results
+- `figures/`: Various visualization plots
+
+## Project Structure
+
+- `main.py`: Main orchestration script
+- `github_client.py`: GitHub API client
+- `code_analyzer.py`: Code metrics analyzer
+- `data_collector.py`: Data collection pipeline
+- `statistical_analysis.py`: Statistical analysis functions
+- `visualizer.py`: Visualization functions
+- `config.py`: Configuration settings
+
+## Requirements
+
+- Python 3.8+
+- Git (for cloning repositories)
+- GitHub API token (optional, increases rate limits)
+
+## Notes
+
+- The analysis focuses on popular Python projects that use semantic commits
+- Fix detection recognizes semantic commit formats and common variations:
+  - `fix:` (conventional commits)
+  - `fix(scope):` (conventional commits with scope)
+  - `Fix:`, `FIX:` (case variations)
+  - `fixes #123`, `fix #123` (issue references)
+  - `fixed`, `fixing`, `bugfix`, `bug fix` (variations)
+- Only Python files (`.py`) are counted among the files changed by fix commits
+- Repositories are cloned with full git history (not shallow) so that every commit can be analyzed
+- Temporary cloned repositories are cleaned up after analysis
+- The curated repository list can be modified in `config.py`
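
The fix-detection rules listed in the README's Notes section can be sketched as a single regular expression. This is only an illustration under stated assumptions: the diff does not include the project's real matching code, so `FIX_PATTERN`, `is_fix_commit`, and `python_files` are hypothetical names, and checking only the first line of the commit message is an assumption rather than documented behavior.

```python
import re

# Hypothetical pattern covering the formats listed in the README's Notes.
# The project's actual implementation is not shown in this diff.
FIX_PATTERN = re.compile(
    r"""
    ^\s*fix(\([^)]*\))?:        # fix: or fix(scope): at the start of the subject
    | \bfix(es)?\s+\#\d+        # issue references such as fix 123 / fixes 123 with a hash
    | \b(fixed|fixing)\b        # past and progressive forms
    | \bbug\s?fix\b             # bugfix or bug fix
    """,
    re.IGNORECASE | re.VERBOSE,  # IGNORECASE handles Fix:, FIX:, and similar
)


def is_fix_commit(message: str) -> bool:
    """Return True if the commit subject line looks like a fix commit."""
    # Assumption: only the subject (first line) is inspected.
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return bool(FIX_PATTERN.search(subject))


def python_files(changed_paths):
    """Keep only .py paths, mirroring the README's 'only Python files' note."""
    return [path for path in changed_paths if path.endswith(".py")]


if __name__ == "__main__":
    for msg in ["fix(parser): handle tabs", "Fixes #123", "feat: add prefix option"]:
        print(f"{msg!r}: {is_fix_commit(msg)}")
```

The word boundaries keep words such as "prefix" or "fixture" from matching, which a plain substring search for "fix" would not do.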