C++ Function/Global Parser Tool - Database Output Summary

Overview

This tool parses C++ source files with Tree-sitter to extract functions and global variables together with the memory addresses recorded in their comments. The extracted data is stored in an SQLite database for analysis and lookup.

Database Schema

The tool creates an SQLite database (default: gh.db) with three main tables:

1. Functions Table

CREATE TABLE Functions (
    filepath TEXT,
    name TEXT,
    address TEXT,
    type INTEGER,
    PRIMARY KEY (name, filepath)
);

Where type is one of the following:

  • 0: Auto
  • 1: Fix
  • 2: Stub
  • 3: Ref

Purpose: Stores function definitions, i.e. functions that have a body (an actual implementation)

  • filepath: Source file path where the function is defined
  • name: Function name (identifier)
  • address: 8-character hexadecimal memory address extracted from comments
  • type: Integer category code, one of the values listed above
  • Primary Key: Combination of name and filepath (allows the same function name in different files)
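The effect of the composite primary key can be checked with a minimal sketch using Python's built-in sqlite3 module. The schema is copied from above; the file paths, function name, addresses, and the try_insert helper are hypothetical and only serve the illustration.

```python
import sqlite3

# Schema copied from the notes; everything inserted below is made-up sample data.
SCHEMA = """
CREATE TABLE Functions (
    filepath TEXT,
    name TEXT,
    address TEXT,
    type INTEGER,
    PRIMARY KEY (name, filepath)
);
"""

def try_insert(conn, filepath, name, address, type_code):
    """Insert one row; return False if the (name, filepath) key already exists."""
    try:
        conn.execute(
            "INSERT INTO Functions (filepath, name, address, type) VALUES (?, ?, ?, ?)",
            (filepath, name, address, type_code),
        )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

first = try_insert(conn, "gh_auto/a.cpp", "doThing", "00401000", 0)   # new key
second = try_insert(conn, "gh_fix/a.cpp", "doThing", "00401000", 1)   # same name, other file
third = try_insert(conn, "gh_auto/a.cpp", "doThing", "00402000", 0)   # same name and file
```

Note that the second insert succeeds even though it reuses the address of the first: the schema itself does not constrain addresses, which is why a separate duplicate check is useful.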

2. Imports Table

CREATE TABLE Imports (
    filepath TEXT,
    name TEXT,
    address TEXT,
    type INTEGER,
    PRIMARY KEY (name, filepath)
);

Purpose: Stores function declarations without bodies (imports/forward declarations)

  • Same schema as Functions table
  • Distinguishes between function definitions and declarations
  • Useful for tracking external function references

3. Globals Table

CREATE TABLE Globals (
    filepath TEXT,
    name TEXT,
    address TEXT
);

Purpose: Stores global variable declarations marked with extern

  • filepath: Source file path where the global is declared
  • name: Global variable name (identifier)
  • address: 8-character hexadecimal memory address from comments
  • No Primary Key: Nothing enforces uniqueness, so the same global name can appear in several files

Address Format

The tool extracts addresses from C++ comments using this regex pattern:

//\s*([0-9a-fA-F]{8})

Expected Comment Format:

void myFunction(); // 12345678
extern int globalVar; // ABCDEF00

  • Addresses must be exactly 8 hexadecimal characters
  • Can be uppercase or lowercase
  • Must be in a C++ line comment (//)
  • Whitespace after // is optional
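To see how the pattern behaves on a few inputs, here is a small sketch using Python's re module. The pattern is copied from above; the extract_address helper and the use of an unanchored search are assumptions, since the notes do not say how the tool applies the regex or whether it performs further validation.

```python
import re

# Pattern copied from the notes.
ADDRESS_RE = re.compile(r"//\s*([0-9a-fA-F]{8})")

def extract_address(line):
    """Return the captured 8-digit hex string from a line, or None if absent."""
    match = ADDRESS_RE.search(line)
    return match.group(1) if match else None

samples = [
    "void myFunction(); // 12345678",    # standard form
    "extern int globalVar; //ABCDEF00",  # no space after the slashes
    "int tooShort; // 1234",             # fewer than 8 hex digits
    "int blockComment; /* 12345678 */",  # not a line comment
    "int tooLong; // 123456789",         # 9 hex digits
]
results = [extract_address(s) for s in samples]
```

One detail worth knowing: the pattern has no anchor or word boundary after the eighth digit, so with a plain search the 9-digit sample still yields its first 8 digits rather than being rejected.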

Tool Modes

1. Functions Mode (-m functions)

  • Default mode
  • Parses C++ files for function definitions and declarations
  • Populates Functions and Imports tables
  • Distinguishes between functions with bodies vs. declarations only

2. Globals Mode (-m globals)

  • Parses C++ files for extern global variable declarations
  • Populates Globals table
  • Only processes variables marked with extern storage class

3. Duplicates Mode (-m duplicates)

  • Analysis mode - doesn't process files
  • Checks existing database for duplicate addresses and names
  • Reports conflicts across all tables
  • Returns exit code 1 if duplicates are found, 0 if the database is clean

4. Dump-Tree Mode (-m dump-tree)

  • Debug mode - doesn't use database
  • Outputs Tree-sitter AST for debugging parsing issues
  • Useful for understanding how the parser interprets source code

Data Quality Checks

The tool includes built-in validation:

Duplicate Address Detection

  • Scans all tables for addresses used multiple times
  • Reports format: "DUPLICATE ADDRESS: {address} appears {count} times in: {entries}"
  • Cross-references Functions, Imports, and Globals tables
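A cross-table duplicate address check along these lines could look like the sketch below. It is not the tool's actual implementation: the query, the duplicate_address_report helper, the layout of each entry, and the case-folding with upper() are assumptions, and the sample rows are invented. Only the table layouts and the report line format come from these notes.

```python
import sqlite3

# Table layouts copied from the notes.
SCHEMA = """
CREATE TABLE Functions (filepath TEXT, name TEXT, address TEXT, type INTEGER,
                        PRIMARY KEY (name, filepath));
CREATE TABLE Imports   (filepath TEXT, name TEXT, address TEXT, type INTEGER,
                        PRIMARY KEY (name, filepath));
CREATE TABLE Globals   (filepath TEXT, name TEXT, address TEXT);
"""

# Pool all three tables, then keep addresses that occur more than once.
DUPLICATE_ADDRESSES_SQL = """
SELECT upper(address) AS addr,
       COUNT(*) AS uses,
       GROUP_CONCAT(kind || ' ' || name || ' (' || filepath || ')', ', ') AS entries
FROM (
    SELECT 'Function' AS kind, name, filepath, address FROM Functions
    UNION ALL
    SELECT 'Import', name, filepath, address FROM Imports
    UNION ALL
    SELECT 'Global', name, filepath, address FROM Globals
)
WHERE address IS NOT NULL AND address <> ''
GROUP BY upper(address)
HAVING COUNT(*) > 1
ORDER BY addr;
"""

def duplicate_address_report(conn):
    """Return one report line per address that is used by more than one row."""
    return [
        f"DUPLICATE ADDRESS: {addr} appears {uses} times in: {entries}"
        for addr, uses, entries in conn.execute(DUPLICATE_ADDRESSES_SQL)
    ]

# Hypothetical sample data, only to exercise the query.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO Functions VALUES ('gh_fix/a.cpp', 'initGame', '00401000', 1)")
conn.execute("INSERT INTO Imports VALUES ('gh_auto/b.cpp', 'initGame', '00401000', 0)")
conn.execute("INSERT INTO Globals VALUES ('gh_auto/g.h', 'g_state', '0050a000')")
report = duplicate_address_report(conn)
```

A wrapper script could exit with status 1 when report is non-empty, mirroring the exit code behaviour described for the duplicates mode.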

Duplicate Name Detection

  • Checks for function names appearing in multiple files
  • Checks for global names appearing in multiple files
  • Helps identify naming conflicts and potential issues
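The name checks can be approximated with a GROUP BY over a single table. The sketch below is an assumption about how such a check might be written, not the tool's code; names_in_multiple_files and the sample rows are hypothetical.

```python
import sqlite3

ALLOWED_TABLES = ("Functions", "Imports", "Globals")

def names_in_multiple_files(conn, table):
    """Return {name: sorted file paths} for names found in more than one file."""
    # Table names cannot be bound as SQL parameters, so validate against a whitelist.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    rows = conn.execute(
        f"SELECT name, filepath FROM {table} "
        f"WHERE name IN (SELECT name FROM {table} GROUP BY name "
        f"HAVING COUNT(DISTINCT filepath) > 1)"
    )
    found = {}
    for name, filepath in rows:
        found.setdefault(name, set()).add(filepath)
    return {name: sorted(paths) for name, paths in found.items()}

# Hypothetical sample data using the Globals layout from the notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Globals (filepath TEXT, name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO Globals VALUES (?, ?, ?)",
    [
        ("gh_auto/g.h", "g_state", "0050A000"),
        ("gh_fix/g.h", "g_state", "0050A000"),
        ("gh_auto/g.h", "g_other", "0050A004"),
    ],
)
conflicts = names_in_multiple_files(conn, "Globals")
```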

Usage Examples

Basic Function Extraction

./tool file1.cpp file2.cpp -d output.db -m functions

Global Variable Extraction

./tool globals.h -d output.db -m globals

Batch Processing with File List

./tool -l filelist.txt -d output.db -m functions

Quality Assurance Check

./tool -d output.db -m duplicates

Database Queries for Users

Find Function by Name

SELECT * FROM Functions WHERE name = 'functionName';
SELECT * FROM Imports WHERE name = 'functionName';

Find All Symbols at Address

SELECT 'Function' as type, name, filepath FROM Functions WHERE address = '12345678'
UNION ALL
SELECT 'Import' as type, name, filepath FROM Imports WHERE address = '12345678'
UNION ALL
SELECT 'Global' as type, name, filepath FROM Globals WHERE address = '12345678';

List All Functions in File

SELECT name, address FROM Functions WHERE filepath = 'path/to/file.cpp'
ORDER BY name;

Find Functions Without Addresses

SELECT name, filepath FROM Functions WHERE address = '' OR address IS NULL;

Address Range Analysis

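-- address is stored as fixed-width hex text, so compare it as a string;
-- CAST(address AS INTEGER) reads only a leading decimal prefix and cannot parse hex.
SELECT name, address, filepath FROM Functions
WHERE upper(address) BETWEEN '10000000' AND '20000000'
ORDER BY upper(address);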
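Because the address column holds hex text rather than numbers, range filtering can also be done in client code by parsing each value with int(address, 16). The sketch below assumes the Functions layout from these notes; functions_in_range and the sample rows are hypothetical. It also records what SQLite's CAST returns for hex-looking strings, since CAST only reads a leading decimal prefix.

```python
import sqlite3

def functions_in_range(conn, low, high):
    """Return (name, address, filepath) rows with low <= address <= high, sorted by address."""
    rows = conn.execute(
        "SELECT name, address, filepath FROM Functions "
        "WHERE address IS NOT NULL AND address <> ''"
    ).fetchall()
    selected = [row for row in rows if low <= int(row[1], 16) <= high]
    return sorted(selected, key=lambda row: int(row[1], 16))

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Functions (filepath TEXT, name TEXT, address TEXT, type INTEGER, "
    "PRIMARY KEY (name, filepath))"
)
conn.executemany(
    "INSERT INTO Functions VALUES (?, ?, ?, ?)",
    [
        ("b.cpp", "late", "1FFFFFF0", 0),
        ("a.cpp", "early", "10000010", 0),
        ("c.cpp", "outside", "0ABCDEF0", 0),
        ("d.cpp", "unknown", "", 0),
    ],
)
in_range = functions_in_range(conn, 0x10000000, 0x20000000)

# CAST treats the text as decimal: letters stop the parse, digits are read in base 10.
cast_letters = conn.execute("SELECT CAST('ABCDEF00' AS INTEGER)").fetchone()[0]
cast_digits = conn.execute("SELECT CAST('12345678' AS INTEGER)").fetchone()[0]
```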

Integration Considerations

  • Database Format: Standard SQLite3 - compatible with most tools and languages
  • File Paths: Relative to the game source directory (the game_re folder in the repo root), so paths include subfolders such as gh_auto and gh_fix
  • Address Format: Always 8-character hex strings (32-bit addresses); pad with leading zeros if needed
  • Case Sensitivity: Function and global names are case-sensitive, as in C++
  • Unicode Support: Handles UTF-8 encoded source files
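For code that consumes the database, converting between integers and the 8-character string form might look like the following sketch. The helper names are hypothetical, and emitting uppercase digits is an assumption: the notes accept both cases and do not say which one is stored.

```python
import re

HEX8_RE = re.compile(r"[0-9a-fA-F]{8}")

def format_address(value):
    """Format a 32-bit integer as an 8-character, zero-padded hex string."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("address does not fit in 32 bits")
    return f"{value:08X}"

def parse_address(text):
    """Parse an 8-character hex string (either case) into an integer."""
    if not HEX8_RE.fullmatch(text):
        raise ValueError(f"not an 8-character hex address: {text!r}")
    return int(text, 16)

padded = format_address(0x401000)                       # leading zeros added
round_trip = format_address(parse_address("abcdef00"))  # case normalized
```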

This database serves as a comprehensive symbol table for reverse engineering, debugging, and code analysis workflows.