diff --git a/tooling/Notes.md b/tooling/Notes.md
new file mode 100644
index 00000000..0f22ea02
--- /dev/null
+++ b/tooling/Notes.md
@@ -0,0 +1,177 @@
+# C++ Function/Global Parser Tool - Database Output Summary
+
+## Overview
+This tool parses C++ source files with Tree-sitter and extracts functions and global variables, together with the memory addresses recorded in their trailing comments. The extracted data is stored in an SQLite database for analysis and lookup.
+
+## Database Schema
+
+The tool creates an SQLite database (default: `gh.db`) with three main tables:
+
+### 1. Functions Table
+```sql
+CREATE TABLE Functions (
+    filepath TEXT,
+    name TEXT,
+    address TEXT,
+    PRIMARY KEY (name, filepath)
+);
+```
+
+**Purpose**: Stores function definitions that have function bodies (actual implementations)
+- `filepath`: Source file path where the function is defined
+- `name`: Function name (identifier)
+- `address`: 8-character hexadecimal memory address extracted from comments
+- **Primary Key**: Combination of name and filepath (allows the same function name in different files)
+
+### 2. Imports Table
+```sql
+CREATE TABLE Imports (
+    filepath TEXT,
+    name TEXT,
+    address TEXT,
+    PRIMARY KEY (name, filepath)
+);
+```
+
+**Purpose**: Stores function declarations without bodies (imports/forward declarations)
+- Same schema as the Functions table
+- Keeps declarations separate from function definitions
+- Useful for tracking external function references
+
+### 3. Globals Table
+```sql
+CREATE TABLE Globals (
+    filepath TEXT,
+    name TEXT,
+    address TEXT
+);
+```
+
+**Purpose**: Stores global variable declarations marked with `extern`
+- `filepath`: Source file path where the global is declared
+- `name`: Global variable name (identifier)
+- `address`: 8-character hexadecimal memory address from comments
+- **No Primary Key**: Allows duplicate global names across files
+
+## Address Format
+
+The tool extracts addresses from C++ comments using this regex pattern:
+```regex
+//\s*([0-9a-fA-F]{8})
+```
+
+**Expected Comment Format**:
+```cpp
+void myFunction(); // 12345678
+extern int globalVar; // ABCDEF00
+```
+
+- Addresses must be exactly 8 hexadecimal characters
+- Can be uppercase or lowercase
+- Must be in a C++ line comment (`//`)
+- Whitespace after `//` is optional
+
+## Tool Modes
+
+### 1. Functions Mode (`-m functions`)
+- **Default mode**
+- Parses C++ files for function definitions and declarations
+- Populates the `Functions` and `Imports` tables
+- Functions with bodies go to `Functions`; declarations without bodies go to `Imports`
+
+### 2. Globals Mode (`-m globals`)
+- Parses C++ files for `extern` global variable declarations
+- Populates the `Globals` table
+- Only processes variables marked with the `extern` storage class
+
+### 3. Duplicates Mode (`-m duplicates`)
+- **Analysis mode** - doesn't process files
+- Checks the existing database for duplicate addresses and names
+- Reports conflicts across all tables
+- Returns exit code 1 if duplicates are found, 0 if clean
+
+### 4. Dump-Tree Mode (`-m dump-tree`)
+- **Debug mode** - doesn't use the database
+- Outputs the Tree-sitter AST for debugging parsing issues
+- Useful for understanding how the parser interprets source code
+
+## Data Quality Checks
+
+The tool includes built-in validation:
+
+### Duplicate Address Detection
+- Scans all tables for addresses used multiple times
+- Reports format: `"DUPLICATE ADDRESS: {address} appears {count} times in: {entries}"`
+- Cross-references the Functions, Imports, and Globals tables
+
+### Duplicate Name Detection
+- Checks for function names appearing in multiple files
+- Checks for global names appearing in multiple files
+- Helps identify naming conflicts
+
+## Usage Examples
+
+### Basic Function Extraction
+```bash
+./tool file1.cpp file2.cpp -d output.db -m functions
+```
+
+### Global Variable Extraction
+```bash
+./tool globals.h -d output.db -m globals
+```
+
+### Batch Processing with File List
+```bash
+./tool -l filelist.txt -d output.db -m functions
+```
+
+### Quality Assurance Check
+```bash
+./tool -d output.db -m duplicates
+```
+
+## Database Queries for Users
+
+### Find Function by Name
+```sql
+SELECT * FROM Functions WHERE name = 'functionName';
+SELECT * FROM Imports WHERE name = 'functionName';
+```
+
+### Find All Symbols at Address
+```sql
+SELECT 'Function' as type, name, filepath FROM Functions WHERE address = '12345678'
+UNION ALL
+SELECT 'Import' as type, name, filepath FROM Imports WHERE address = '12345678'
+UNION ALL
+SELECT 'Global' as type, name, filepath FROM Globals WHERE address = '12345678';
+```
+
+### List All Functions in File
+```sql
+SELECT name, address FROM Functions WHERE filepath = 'path/to/file.cpp'
+ORDER BY name;
+```
+
+### Find Functions Without Addresses
+```sql
+SELECT name, filepath FROM Functions WHERE address = '' OR address IS NULL;
+```
+
+### Address Range Analysis
+```sql
+SELECT name, address, filepath FROM Functions
+WHERE upper(address) BETWEEN '10000000' AND '20000000'
+ORDER BY upper(address);
+```
+
+## Integration Considerations
+
+- **Database Format**: Standard SQLite3 - compatible with most tools and languages
+- **File Paths**: Relative to the game source directory (the `game_re` folder in the repo root), so paths include the `gh_auto` and `gh_fix` subfolders
+- **Address Format**: Always 8-character hex strings (32-bit addresses), stored as text - pad with leading zeros if needed
+- **Case Sensitivity**: Function/global names are case-sensitive, as in C++
+- **Unicode Support**: Handles UTF-8 encoded source files
+
+This database serves as a symbol table for reverse engineering, debugging, and code analysis workflows.
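
Because addresses are stored as 8-character hex text, SQLite's `CAST(... AS INTEGER)` will not interpret them as hexadecimal numbers, so numeric range checks may be easier to do in the consuming code. The following is a minimal Python sketch of that idea, assuming the `Functions` schema described above. The helper name `functions_in_range`, the in-memory database, and the sample rows are hypothetical and only serve the illustration.

```python
import sqlite3


def functions_in_range(conn, low, high):
    """Return (name, address, filepath) rows whose address lies in [low, high].

    The hex text is converted with int(text, 16) on the Python side;
    rows with an empty or NULL address are skipped.
    """
    rows = conn.execute(
        "SELECT name, address, filepath FROM Functions"
    ).fetchall()
    hits = [r for r in rows if r[1] and low <= int(r[1], 16) <= high]
    return sorted(hits, key=lambda r: int(r[1], 16))


# Hypothetical sample data using the documented Functions schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Functions ("
    "filepath TEXT, name TEXT, address TEXT, "
    "PRIMARY KEY (name, filepath))"
)
conn.executemany(
    "INSERT INTO Functions VALUES (?, ?, ?)",
    [
        ("gh_auto/a.cpp", "init", "0fffffff"),    # just below the range
        ("gh_auto/a.cpp", "update", "1000abcd"),  # lowercase hex digits
        ("gh_fix/b.cpp", "render", "1FFF0000"),   # uppercase hex digits
        ("gh_fix/b.cpp", "stub", ""),             # no address recorded
    ],
)

in_range = functions_in_range(conn, 0x10000000, 0x20000000)
for name, address, filepath in in_range:
    print(name, address, filepath)
```

Converting in Python sidesteps any dependence on letter case or on how SQLite coerces text to numbers, at the cost of reading every row. For large databases a text comparison in SQL on the zero-padded, case-normalized address may be preferable.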