Add notes

Guus Waals 2025-05-29 15:57:22 +08:00
parent 3d40dc7e80
commit 7c18d04724
1 changed file with 177 additions and 0 deletions

tooling/Notes.md Normal file

@ -0,0 +1,177 @@
# C++ Function/Global Parser Tool - Database Output Summary
## Overview
This tool parses C++ source files using Tree-sitter to extract function and global variable information along with their memory addresses from comments. The extracted data is stored in an SQLite database for analysis and lookup purposes.
## Database Schema
The tool creates an SQLite database (default: `gh.db`) with three main tables:
### 1. Functions Table
```sql
CREATE TABLE Functions (
filepath TEXT,
name TEXT,
address TEXT,
PRIMARY KEY (name, filepath)
);
```
**Purpose**: Stores function definitions that have function bodies (actual implementations)
- `filepath`: Source file path where the function is defined
- `name`: Function name (identifier)
- `address`: 8-character hexadecimal memory address extracted from comments
- **Primary Key**: Combination of name and filepath (allows same function name in different files)
### 2. Imports Table
```sql
CREATE TABLE Imports (
filepath TEXT,
name TEXT,
address TEXT,
PRIMARY KEY (name, filepath)
);
```
**Purpose**: Stores function declarations without bodies (imports/forward declarations)
- Same schema as Functions table
- Distinguishes between function definitions and declarations
- Useful for tracking external function references
### 3. Globals Table
```sql
CREATE TABLE Globals (
filepath TEXT,
name TEXT,
address TEXT
);
```
**Purpose**: Stores global variable declarations marked with `extern`
- `filepath`: Source file path where the global is declared
- `name`: Global variable name (identifier)
- `address`: 8-character hexadecimal memory address from comments
- **No Primary Key**: Nothing enforces uniqueness, so the same global name can appear more than once, within a file or across files
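The key behaviour of the three tables can be checked with a short Python sketch. It recreates the schema in a throwaway in-memory database rather than `gh.db`; the file names, symbol names, and addresses are made up for illustration.

```python
import sqlite3

# Schema copied from the CREATE TABLE statements above.
SCHEMA = """
CREATE TABLE Functions (filepath TEXT, name TEXT, address TEXT, PRIMARY KEY (name, filepath));
CREATE TABLE Imports   (filepath TEXT, name TEXT, address TEXT, PRIMARY KEY (name, filepath));
CREATE TABLE Globals   (filepath TEXT, name TEXT, address TEXT);
"""

conn = sqlite3.connect(":memory:")  # throwaway database, not gh.db
conn.executescript(SCHEMA)

# Same function name in two different files: allowed by the composite key.
conn.execute("INSERT INTO Functions VALUES ('a.cpp', 'init', '00401000')")
conn.execute("INSERT INTO Functions VALUES ('b.cpp', 'init', '00402000')")

# Same (name, filepath) pair again: rejected by the primary key.
try:
    conn.execute("INSERT INTO Functions VALUES ('a.cpp', 'init', '00403000')")
    functions_rejected_duplicate = False
except sqlite3.IntegrityError:
    functions_rejected_duplicate = True

# Globals has no key, so an identical row can be inserted twice.
for _ in range(2):
    conn.execute("INSERT INTO Globals VALUES ('g.h', 'g_state', '00500000')")

function_rows = conn.execute("SELECT COUNT(*) FROM Functions").fetchone()[0]
global_rows = conn.execute("SELECT COUNT(*) FROM Globals").fetchone()[0]
```

How the tool itself reacts to a key conflict (skip, replace, or fail) is not stated in these notes, so the sketch only shows what the schema permits.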
## Address Format
The tool extracts addresses from C++ comments using this regex pattern:
```regex
//\s*([0-9a-fA-F]{8})
```
**Expected Comment Format**:
```cpp
void myFunction(); // 12345678
extern int globalVar; // ABCDEF00
```
- Exactly 8 hexadecimal characters are captured; the pattern is not anchored at the end, so a longer hex run contributes only its first 8 characters
- Can be uppercase or lowercase
- Must be in a C++ line comment (`//`)
- Whitespace after `//` is optional
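The pattern can be tried out in Python as follows. This is only a sketch: the tool's own regex engine is not named in these notes, and `extract_address` is a hypothetical helper, not part of the tool.

```python
import re

# Same pattern as above, compiled with Python's re module.
ADDRESS_RE = re.compile(r"//\s*([0-9a-fA-F]{8})")

def extract_address(line):
    """Return the 8-character hex address from a line comment, or None."""
    match = ADDRESS_RE.search(line)
    return match.group(1) if match else None

first = extract_address("void myFunction(); // 12345678")
second = extract_address("extern int globalVar; // ABCDEF00")
missing = extract_address("int helper(); // no address here")
no_space = extract_address("void f(); //0040abcd")
```

Here `first` and `second` hold the addresses from the two example lines, `missing` is `None` because the comment does not start with hex digits, and `no_space` shows that the whitespace after `//` really is optional.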
## Tool Modes
### 1. Functions Mode (`-m functions`)
- **Default mode**
- Parses C++ files for function definitions and declarations
- Populates `Functions` and `Imports` tables
- Distinguishes between functions with bodies vs. declarations only
### 2. Globals Mode (`-m globals`)
- Parses C++ files for `extern` global variable declarations
- Populates `Globals` table
- Only processes variables marked with `extern` storage class
### 3. Duplicates Mode (`-m duplicates`)
- **Analysis mode** - doesn't process files
- Checks existing database for duplicate addresses and names
- Reports conflicts across all tables
- Returns exit code 1 if duplicates found, 0 if clean
### 4. Dump-Tree Mode (`-m dump-tree`)
- **Debug mode** - doesn't use database
- Outputs Tree-sitter AST for debugging parsing issues
- Useful for understanding how the parser interprets source code
## Data Quality Checks
The tool includes built-in validation:
### Duplicate Address Detection
- Scans all tables for addresses used multiple times
- Reports format: `"DUPLICATE ADDRESS: {address} appears {count} times in: {entries}"`
- Cross-references Functions, Imports, and Globals tables
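A cross-table duplicate check of this kind can be approximated with one query. The sketch below is not the tool's implementation: the SQL, the `find_duplicate_addresses` helper, the layout of `entries`, and the sample rows are all assumptions. Addresses are upper-cased before grouping so that `0040abcd` and `0040ABCD` count as the same address; whether the tool does this is not stated.

```python
import sqlite3

DUPLICATE_ADDRESSES_SQL = """
SELECT UPPER(address) AS addr,
       COUNT(*) AS count,
       GROUP_CONCAT(kind || ' ' || name || ' (' || filepath || ')', ', ') AS entries
FROM (
    SELECT 'Function' AS kind, name, filepath, address FROM Functions
    UNION ALL
    SELECT 'Import', name, filepath, address FROM Imports
    UNION ALL
    SELECT 'Global', name, filepath, address FROM Globals
)
WHERE address IS NOT NULL AND address != ''
GROUP BY addr
HAVING COUNT(*) > 1
ORDER BY addr
"""

def find_duplicate_addresses(conn):
    """Return (address, count, entries) rows for addresses used more than once."""
    return conn.execute(DUPLICATE_ADDRESSES_SQL).fetchall()

# Demo on a throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Functions (filepath TEXT, name TEXT, address TEXT, PRIMARY KEY (name, filepath));
CREATE TABLE Imports   (filepath TEXT, name TEXT, address TEXT, PRIMARY KEY (name, filepath));
CREATE TABLE Globals   (filepath TEXT, name TEXT, address TEXT);
INSERT INTO Functions VALUES ('a.cpp', 'init', '00401000');
INSERT INTO Imports   VALUES ('b.h', 'init_import', '00401000');
INSERT INTO Globals   VALUES ('g.h', 'g_state', '00500000');
""")

duplicates = find_duplicate_addresses(conn)
for address, count, entries in duplicates:
    print(f"DUPLICATE ADDRESS: {address} appears {count} times in: {entries}")
```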
### Duplicate Name Detection
- Checks for function names appearing in multiple files
- Checks for global names appearing in multiple files
- Helps identify naming conflicts and potential issues
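The name check can be sketched the same way, by counting distinct file paths per name. `names_in_multiple_files` is a hypothetical helper, and the rows are made up; only the table and column names come from the schema above.

```python
import sqlite3

KNOWN_TABLES = ("Functions", "Imports", "Globals")

def names_in_multiple_files(conn, table):
    """Return (name, file_count) for names that occur in more than one file."""
    if table not in KNOWN_TABLES:
        # The table name is interpolated into SQL, so restrict it to known values.
        raise ValueError(f"unknown table: {table}")
    sql = (
        f"SELECT name, COUNT(DISTINCT filepath) AS files FROM {table} "
        "GROUP BY name HAVING COUNT(DISTINCT filepath) > 1 ORDER BY name"
    )
    return conn.execute(sql).fetchall()

# Demo on a throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Functions (filepath TEXT, name TEXT, address TEXT, PRIMARY KEY (name, filepath));
INSERT INTO Functions VALUES ('gh_auto/a.cpp', 'init', '00401000');
INSERT INTO Functions VALUES ('gh_fix/a.cpp', 'init', '00401000');
INSERT INTO Functions VALUES ('gh_auto/a.cpp', 'update', '00402000');
""")

repeated = names_in_multiple_files(conn, "Functions")
```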
## Usage Examples
### Basic Function Extraction
```bash
./tool file1.cpp file2.cpp -d output.db -m functions
```
### Global Variable Extraction
```bash
./tool globals.h -d output.db -m globals
```
### Batch Processing with File List
```bash
./tool -l filelist.txt -d output.db -m functions
```
### Quality Assurance Check
```bash
./tool -d output.db -m duplicates
```
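Because the duplicates mode reports its result through the exit code, it can serve as a gate in a build script. The sketch below assumes the binary is reachable as `./tool`, as in the examples above; both helper names are hypothetical.

```python
import subprocess

def duplicates_command(db_path, tool="./tool"):
    """Build the argument list for the duplicates check shown above."""
    return [tool, "-d", db_path, "-m", "duplicates"]

def database_is_clean(db_path, tool="./tool"):
    """Run the duplicates check; exit code 0 means no duplicates were found."""
    result = subprocess.run(duplicates_command(db_path, tool))
    return result.returncode == 0
```

A caller would use it as `if not database_is_clean("output.db"): ...`. Any non-zero exit code is treated as "not clean" here, which also covers failures other than duplicates.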
## Database Queries for Users
### Find Function by Name
```sql
SELECT * FROM Functions WHERE name = 'functionName';
SELECT * FROM Imports WHERE name = 'functionName';
```
### Find All Symbols at Address
```sql
SELECT 'Function' as type, name, filepath FROM Functions WHERE address = '12345678'
UNION ALL
SELECT 'Import' as type, name, filepath FROM Imports WHERE address = '12345678'
UNION ALL
SELECT 'Global' as type, name, filepath FROM Globals WHERE address = '12345678';
```
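From a script, the same lookup is better written with a bound parameter than with a pasted literal. The sketch below also compares case-insensitively, since addresses may appear in either case in comments and the notes do not say whether the tool normalizes them. `symbols_at` and the sample rows are hypothetical.

```python
import sqlite3

SYMBOLS_AT_SQL = """
SELECT 'Function' AS type, name, filepath FROM Functions WHERE address = :addr COLLATE NOCASE
UNION ALL
SELECT 'Import' AS type, name, filepath FROM Imports WHERE address = :addr COLLATE NOCASE
UNION ALL
SELECT 'Global' AS type, name, filepath FROM Globals WHERE address = :addr COLLATE NOCASE
"""

def symbols_at(conn, address):
    """Return (type, name, filepath) for every symbol recorded at the address."""
    return conn.execute(SYMBOLS_AT_SQL, {"addr": address}).fetchall()

# Demo on a throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Functions (filepath TEXT, name TEXT, address TEXT, PRIMARY KEY (name, filepath));
CREATE TABLE Imports   (filepath TEXT, name TEXT, address TEXT, PRIMARY KEY (name, filepath));
CREATE TABLE Globals   (filepath TEXT, name TEXT, address TEXT);
INSERT INTO Functions VALUES ('a.cpp', 'init', '0040ABCD');
INSERT INTO Globals   VALUES ('g.h', 'g_state', '0040abcd');
INSERT INTO Imports   VALUES ('b.h', 'other', '00500000');
""")

hits = symbols_at(conn, "0040abcd")
```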
### List All Functions in File
```sql
SELECT name, address FROM Functions WHERE filepath = 'path/to/file.cpp'
ORDER BY name;
```
### Find Functions Without Addresses
```sql
SELECT name, filepath FROM Functions WHERE address = '' OR address IS NULL;
```
### Address Range Analysis
```sql
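-- Addresses are fixed-width hex text, so compare them as strings:
-- CAST('ABCDEF00' AS INTEGER) would parse a decimal prefix and yield 0.
SELECT name, address, filepath FROM Functions
WHERE upper(address) BETWEEN '10000000' AND '20000000'
ORDER BY upper(address);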
```
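Addresses are stored as hex text, and SQLite's `CAST(... AS INTEGER)` reads text as a decimal number, so it cannot turn them into their numeric values. When real integers are needed, for example to compare against numeric bounds, one option is to register a small conversion function on the connection. The sketch below uses Python's `sqlite3.Connection.create_function`; `hex_to_int` and the sample rows are hypothetical.

```python
import sqlite3

def hex_to_int(text):
    """Convert a hex string to an integer; return None for NULL or malformed input."""
    try:
        return int(text, 16)
    except (TypeError, ValueError):
        return None

conn = sqlite3.connect(":memory:")  # throwaway database with made-up rows
conn.create_function("hex_to_int", 1, hex_to_int)
conn.execute(
    "CREATE TABLE Functions (filepath TEXT, name TEXT, address TEXT, "
    "PRIMARY KEY (name, filepath))"
)
conn.executemany(
    "INSERT INTO Functions VALUES (?, ?, ?)",
    [
        ("a.cpp", "low", "0FFFFFFF"),
        ("a.cpp", "inside", "1a2b3c4d"),
        ("b.cpp", "high", "20000001"),
        ("b.cpp", "unknown", ""),
    ],
)

rows = conn.execute(
    "SELECT name, address FROM Functions "
    "WHERE hex_to_int(address) BETWEEN ? AND ? "
    "ORDER BY hex_to_int(address)",
    (0x10000000, 0x20000000),
).fetchall()
```

Rows with an empty or missing address convert to NULL and therefore drop out of the range filter.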
## Integration Considerations
- **Database Format**: Standard SQLite3 - compatible with most tools and languages
- **File Paths**: Relative to the game source directory (the `game_re` folder in the repo root), so paths include the `gh_auto` and `gh_fix` subfolders
- **Address Format**: Always 8-character hex strings (32 bit addresses) - pad with leading zeros if needed
- **Case Sensitivity**: Function/global names are case-sensitive as per C++ standards
- **Unicode Support**: Handles UTF-8 encoded source files
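When building lookups from other tools, integers and `0x`-prefixed strings need to be brought into the 8-character form first. A minimal sketch, assuming uppercase output (the case actually stored in the database is not specified, so pair it with a case-insensitive comparison); `normalize_address` is a hypothetical helper:

```python
def normalize_address(value):
    """Format an integer or hex string as an 8-character, zero-padded hex string."""
    # int(..., 16) accepts an optional "0x" prefix and surrounding whitespace.
    number = value if isinstance(value, int) else int(value, 16)
    if not 0 <= number <= 0xFFFFFFFF:
        raise ValueError(f"address does not fit in 32 bits: {value!r}")
    return f"{number:08X}"

from_int = normalize_address(0x401000)
from_prefixed = normalize_address("0x401000")
from_lower = normalize_address("abcdef00")
```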
This database serves as a comprehensive symbol table for reverse engineering, debugging, and code analysis workflows.