Parser Implementation

The libmagic-rs parser is built on the nom parser combinator library and turns magic file syntax into the project's AST representation.

Architecture Overview

The parser follows a modular design where individual components are implemented and tested separately, then composed into higher-level parsers:

Magic File Text → Individual Parsers → Combined Parsers → Complete AST
                      ↓
              Numbers, Offsets, Operators, Values → Rules → Rule Hierarchies

Implemented Components

Number Parsing (`parse_number`)

Handles decimal and hexadecimal numbers, with an optional leading minus sign, and reports an error when a value overflows:

// Decimal numbers
parse_number("123");    // Ok(("", 123))
parse_number("-456");   // Ok(("", -456))

// Hexadecimal numbers
parse_number("0x1a");   // Ok(("", 26))
parse_number("-0xFF");  // Ok(("", -255))

Features:

✅ Decimal and hexadecimal format support
✅ Signed and unsigned number handling
✅ Overflow protection with proper error reporting
✅ Comprehensive test coverage (15+ test cases)
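
The overflow handling described above can be sketched with checked arithmetic. This is a minimal, dependency-free illustration, not the project's nom-based implementation: the `NumberError` type and the plain `Result` return shape are assumptions made for this example.

```rust
/// Error kinds for this sketch (hypothetical; the real parser reports nom errors).
#[derive(Debug, PartialEq)]
pub enum NumberError {
    /// No digits were found where a number was expected.
    NoDigits,
    /// The digits do not fit in an i64.
    Overflow,
}

/// Parses an optionally negative decimal or `0x`-prefixed hexadecimal number
/// from the front of `input`, returning the unconsumed remainder and the value.
pub fn parse_number(input: &str) -> Result<(&str, i64), NumberError> {
    // Optional leading minus sign.
    let (negative, rest) = match input.strip_prefix('-') {
        Some(r) => (true, r),
        None => (false, input),
    };

    // A `0x`/`0X` prefix followed by a hex digit selects radix 16.
    let hex = rest.strip_prefix("0x").or_else(|| rest.strip_prefix("0X"));
    let (radix, digits): (u32, &str) = match hex {
        Some(r) if r.starts_with(|c: char| c.is_ascii_hexdigit()) => (16, r),
        _ => (10, rest),
    };

    // Take the longest run of digits valid for the radix.
    let end = digits
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(digits.len());
    if end == 0 {
        return Err(NumberError::NoDigits);
    }

    // Accumulate with checked arithmetic. Negative numbers are built by
    // subtraction so that i64::MIN is representable.
    let mut value: i64 = 0;
    for c in digits[..end].chars() {
        let digit = c.to_digit(radix).expect("digit validated above") as i64;
        value = value
            .checked_mul(radix as i64)
            .and_then(|v| {
                if negative {
                    v.checked_sub(digit)
                } else {
                    v.checked_add(digit)
                }
            })
            .ok_or(NumberError::Overflow)?;
    }

    Ok((&digits[end..], value))
}
```

With this sketch, `parse_number("123abc")` leaves `"abc"` unconsumed, and a value one past `i64::MAX` yields `NumberError::Overflow` instead of wrapping.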

Offset Parsing (`parse_offset`)

Converts numeric values into OffsetSpec::Absolute variants:

// Basic offsets
parse_offset("0");      // Ok(("", OffsetSpec::Absolute(0)))
parse_offset("0x10");   // Ok(("", OffsetSpec::Absolute(16)))
parse_offset("-4");     // Ok(("", OffsetSpec::Absolute(-4)))

// With whitespace handling
parse_offset(" 123 ");  // Ok(("", OffsetSpec::Absolute(123)))

Features:

✅ Absolute offset parsing with full number format support
✅ Whitespace handling (leading and trailing)
✅ Negative offset values (stored as OffsetSpec::Absolute with a negative value)
📋 Indirect offset parsing (planned)
📋 Relative offset parsing (planned)
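
The whitespace handling and the wrapping into `OffsetSpec::Absolute` can be sketched as follows. This is an illustration under assumptions, not the project's code: the `number` helper is a simplified stand-in for `parse_number`, the one-variant `OffsetSpec` is redeclared locally, and `Option` replaces nom's `IResult` to keep the example dependency-free.

```rust
/// Local stand-in for the AST type; only the variant used here is declared.
#[derive(Debug, PartialEq)]
pub enum OffsetSpec {
    Absolute(i64),
}

/// Simplified stand-in for `parse_number`: decimal or `0x` hex, optional `-`.
/// Returns `None` on missing digits or overflow.
fn number(input: &str) -> Option<(&str, i64)> {
    let (sign, rest) = match input.strip_prefix('-') {
        Some(r) => (-1, r),
        None => (1, input),
    };
    let (radix, digits) = match rest.strip_prefix("0x") {
        Some(r) => (16, r),
        None => (10, rest),
    };
    let end = digits
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(digits.len());
    // An empty digit run makes from_str_radix fail, which becomes `None`.
    let magnitude = i64::from_str_radix(&digits[..end], radix).ok()?;
    Some((&digits[end..], sign * magnitude))
}

/// Skips leading whitespace, parses a number, skips trailing whitespace,
/// and wraps the value as an absolute offset.
pub fn parse_offset(input: &str) -> Option<(&str, OffsetSpec)> {
    let (rest, value) = number(input.trim_start())?;
    Some((rest.trim_start(), OffsetSpec::Absolute(value)))
}
```

Consuming the trailing whitespace here means the next parser in a chain starts directly at its own token, for example `parse_offset("0 byte")` leaves `"byte"`.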

Operator Parsing (`parse_operator`)

Parses comparison and bitwise operators with multiple syntax variants:

// Equality operators
parse_operator("=");    // Ok(("", Operator::Equal))
parse_operator("==");   // Ok(("", Operator::Equal))

// Inequality operators
parse_operator("!=");   // Ok(("", Operator::NotEqual))
parse_operator("<>");   // Ok(("", Operator::NotEqual))

// Bitwise operators
parse_operator("&");    // Ok(("", Operator::BitwiseAnd))

Features:

✅ Multiple syntax variants for compatibility
✅ Precedence handling (longer operators matched first)
✅ Whitespace tolerance
✅ Invalid operator rejection with clear errors
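
The "longer operators matched first" rule can be illustrated with an ordered token table. This is a minimal sketch that uses only the standard library; the local `Operator` enum mirrors the variants shown above, and the `Option` return type is an assumption standing in for nom's `IResult`.

```rust
/// Local copy of the operator variants used in this guide.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operator {
    Equal,
    NotEqual,
    BitwiseAnd,
}

/// Matches one operator at the front of `input`, tolerating surrounding
/// whitespace. Returns the remaining input and the operator.
pub fn parse_operator(input: &str) -> Option<(&str, Operator)> {
    // Two-character tokens come first, so "==" is never consumed as "="
    // with a stray "=" left in the remaining input.
    const TABLE: [(&str, Operator); 5] = [
        ("==", Operator::Equal),
        ("!=", Operator::NotEqual),
        ("<>", Operator::NotEqual),
        ("=", Operator::Equal),
        ("&", Operator::BitwiseAnd),
    ];

    let trimmed = input.trim_start();
    // The first table entry that prefixes the input wins.
    TABLE.iter().find_map(|(token, op)| {
        trimmed
            .strip_prefix(*token)
            .map(|rest| (rest.trim_start(), *op))
    })
}
```

If `"="` were listed before `"=="`, the input `"=="` would match `Equal` and leave `"="` behind, which the next parser would then reject.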

Value Parsing (`parse_value`)

Handles string, numeric, and byte-sequence values, using type precedence to decide how ambiguous input is interpreted:

// Each comment shows the parsed Value; the remaining-input half of the result is omitted.

// String literals with escape sequences
parse_value("\"Hello\"");           // Value::String("Hello".to_string())
parse_value("\"Line1\\nLine2\"");   // Value::String("Line1\nLine2".to_string())

// Numeric values
parse_value("123");                 // Value::Uint(123)
parse_value("-456");                // Value::Int(-456)
parse_value("0x1a");                // Value::Uint(26)

// Hex byte sequences
parse_value("\\x7f\\x45");          // Value::Bytes(vec![0x7f, 0x45])
parse_value("7f454c46");            // Value::Bytes(vec![0x7f, 0x45, 0x4c, 0x46])

Features:

✅ Quoted string parsing with escape sequence support
✅ Numeric literal parsing (decimal and hexadecimal)
✅ Hex byte sequence parsing (with and without \x prefix)
✅ Intelligent type precedence to avoid parsing conflicts
✅ Comprehensive escape sequence handling (\n, \t, \r, \\, \", \', \0)
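
A sketch of how the listed escape sequences might be decoded inside a quoted string. The function name `parse_quoted_string` and the `Option` return type are assumptions for illustration; the project's parser is nom-based and may be structured differently.

```rust
/// Parses a double-quoted string at the front of `input`, decoding the escape
/// sequences \n, \t, \r, \\, \", \' and \0. Returns the remaining input and
/// the decoded text, or `None` for a missing quote, an unterminated string,
/// or an unknown escape.
pub fn parse_quoted_string(input: &str) -> Option<(&str, String)> {
    let body = input.strip_prefix('"')?;
    let mut out = String::new();
    let mut chars = body.char_indices();

    while let Some((i, c)) = chars.next() {
        match c {
            // Closing quote: everything after it is the remaining input.
            '"' => return Some((&body[i + 1..], out)),
            // Backslash: decode the character that follows it.
            '\\' => {
                let (_, escaped) = chars.next()?;
                out.push(match escaped {
                    'n' => '\n',
                    't' => '\t',
                    'r' => '\r',
                    '0' => '\0',
                    '\\' | '"' | '\'' => escaped,
                    _ => return None,
                });
            }
            other => out.push(other),
        }
    }

    // Reached the end of input without a closing quote.
    None
}
```

The decoded text has to be an owned `String` because escapes change its length relative to the source; only the remaining input can stay a borrowed slice.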

Parser Design Principles

Error Handling

All parsers use nom's IResult type for consistent error handling:

use nom::IResult;

pub fn parse_number(input: &str) -> IResult<&str, i64> {
    // Implementation with proper error propagation (body elided in this guide)
    todo!()
}

Error Categories:

Syntax Errors: Invalid characters or malformed input
Overflow Errors: Numbers too large for target type
Format Errors: Invalid hex digits, unterminated strings, etc.

Memory Safety

All parsing operations are memory-safe with no unsafe code:

Bounds Checking: All buffer access is bounds-checked
Overflow Protection: Numeric parsing includes overflow detection
Resource Management: No manual memory management required

Performance Optimization

The parser is designed for efficiency:

Zero-Copy: String slices used where possible to avoid allocations
Early Termination: Parsers fail fast on invalid input
Minimal Backtracking: Parser combinators designed to minimize backtracking
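
The zero-copy and fail-fast points can be illustrated with a small tokenizer whose outputs are sub-slices of the input. `take_token` is a hypothetical helper written for this example, not a function from the project.

```rust
/// Splits the next whitespace-delimited token off the front of `input`.
/// Both returned slices borrow from `input`; nothing is copied or allocated.
/// Returns (remaining input, token).
pub fn take_token(input: &str) -> Option<(&str, &str)> {
    let trimmed = input.trim_start();
    // Fail fast: reject empty input before doing any further work.
    if trimmed.is_empty() {
        return None;
    }
    // The token ends at the first whitespace character, or at end of input.
    let end = trimmed
        .find(char::is_whitespace)
        .unwrap_or(trimmed.len());
    Some((&trimmed[end..], &trimmed[..end]))
}
```

Because the results are borrowed, the compiler ties their lifetime to the input buffer, so a caller cannot keep a token alive after the magic file text is dropped.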

Testing Strategy

Each parser component has comprehensive test coverage:

Test Categories

Basic Functionality: Core parsing behavior
Edge Cases: Boundary values, empty input, etc.
Error Conditions: Invalid input handling
Whitespace Handling: Leading/trailing whitespace tolerance
Remaining Input: Proper handling of unconsumed input

Example Test Structure

#[test]
fn test_parse_number_positive() {
    assert_eq!(parse_number("123"), Ok(("", 123)));
    assert_eq!(parse_number("0x1a"), Ok(("", 26)));
}

#[test]
fn test_parse_number_with_remaining_input() {
    assert_eq!(parse_number("123abc"), Ok(("abc", 123)));
    assert_eq!(parse_number("0xFF rest"), Ok((" rest", 255)));
}

#[test]
fn test_parse_number_edge_cases() {
    assert_eq!(parse_number("0"), Ok(("", 0)));
    assert_eq!(parse_number("-0"), Ok(("", 0)));
    assert!(parse_number("").is_err());
    assert!(parse_number("abc").is_err());
}

Current Limitations

Not Yet Implemented

Complete Rule Parsing: Integration of components into full rule parser
Hierarchical Structure: Parent-child rule relationships
Advanced Offsets: Indirect and relative offset specifications
Extended Operators: Additional comparison and bitwise operators
Type Specifications: Parsing of type declarations (byte, short, long, string)
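
As a rough idea of what the hierarchical step could look like once it exists, the sketch below builds parent-child relationships from the leading `>` markers that magic files use for continuation levels. Everything here is hypothetical: the `Rule` struct is a simplified stand-in for `MagicRule`, and `split_level` and `build_hierarchy` are names invented for this example rather than planned APIs.

```rust
/// Simplified stand-in for MagicRule: only the level, raw text, and children.
#[derive(Debug, PartialEq)]
pub struct Rule {
    pub level: usize,
    pub text: String,
    pub children: Vec<Rule>,
}

/// Counts the leading `>` markers of a line and returns (level, rest of line).
pub fn split_level(line: &str) -> (usize, &str) {
    let rest = line.trim_start_matches('>');
    (line.len() - rest.len(), rest)
}

/// Turns a flat list of lines into a forest. Each rule becomes a child of the
/// nearest preceding rule with a lower level; level-0 rules become roots.
pub fn build_hierarchy(lines: &[&str]) -> Vec<Rule> {
    let mut roots: Vec<Rule> = Vec::new();
    // Rules that may still receive children, ordered from root to deepest.
    let mut open: Vec<Rule> = Vec::new();

    for line in lines {
        let (level, text) = split_level(line);
        // Close every open rule that cannot be an ancestor of the new one.
        while open.last().map_or(false, |top| top.level >= level) {
            close_top(&mut open, &mut roots);
        }
        open.push(Rule {
            level,
            text: text.to_string(),
            children: Vec::new(),
        });
    }
    while !open.is_empty() {
        close_top(&mut open, &mut roots);
    }
    roots
}

/// Pops the deepest open rule and attaches it to its parent, or to the roots
/// when no parent remains.
fn close_top(open: &mut Vec<Rule>, roots: &mut Vec<Rule>) {
    if let Some(done) = open.pop() {
        match open.last_mut() {
            Some(parent) => parent.children.push(done),
            None => roots.push(done),
        }
    }
}
```

For the lines `"0 a"`, `">4 b"`, `">>8 c"`, `">6 d"`, `"0 e"`, this yields two roots, with `b` and `d` as children of `a` and `c` as a child of `b`.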

Planned Enhancements

Better Error Messages: More descriptive error reporting with line numbers
Performance Optimization: Specialized parsers for common patterns
Streaming Support: Incremental parsing for large magic files
Syntax Extensions: Support for additional magic file syntax variants

Integration Points

Because every component returns nom's IResult and hands back the unconsumed input, the parsers can be chained in sequence:

// Future complete rule parser will combine components:
fn parse_magic_rule(input: &str) -> IResult<&str, MagicRule> {
    let (input, offset) = parse_offset(input)?;
    let (input, typ) = parse_type(input)?; // Not yet implemented
    let (input, op) = parse_operator(input)?;
    let (input, value) = parse_value(input)?;
    let (input, message) = parse_message(input)?; // Not yet implemented

    Ok((
        input,
        MagicRule {
            offset,
            typ,
            op,
            value,
            message,
            children: vec![],
            level: 0,
        },
    ))
}

This modular approach ensures each component is thoroughly tested and can be composed reliably into more complex parsers.
