
Introduction

Stringy is a smarter alternative to the standard strings command that uses binary analysis to extract meaningful strings from executables. Unlike traditional string extraction tools, Stringy focuses on data structures rather than arbitrary byte runs.

Why Stringy?

The standard strings command has several limitations:

  • Noise: Dumps every printable byte sequence, including padding and table data
  • UTF-16 Issues: Produces interleaved garbage when scanning UTF-16 strings
  • No Context: Provides no information about where strings come from
  • No Prioritization: Treats all strings equally, regardless of relevance

Stringy addresses these issues by being:

  • Data-structure aware: Only extracts strings from actual binary data structures
  • Section-aware: Prioritizes meaningful sections like .rodata, .rdata, __cstring
  • Encoding-aware: Properly handles ASCII/UTF-8, UTF-16LE, and UTF-16BE
  • Semantically intelligent: Identifies and tags URLs, domains, file paths, GUIDs, etc.
  • Ranked: Presents the most relevant strings first

Key Features

Multi-Format Support

  • ELF (Linux executables and libraries)
  • PE (Windows executables and DLLs)
  • Mach-O (macOS executables and frameworks)

Smart String Extraction

  • Section-aware extraction prioritizing string-rich sections
  • Multi-encoding support (ASCII, UTF-8, UTF-16LE/BE)
  • Deduplication with metadata preservation
  • Configurable minimum length filtering

Semantic Classification

  • Network: URLs, domains, IP addresses
  • Filesystem: File paths, registry keys
  • Identifiers: GUIDs, email addresses, user agents
  • Code: Format strings, Base64 data
  • Symbols: Import/export names, demangled symbols

Multiple Output Formats

  • Human-readable: Sorted tables for interactive analysis
  • JSONL: Machine-readable format for automation
  • YARA-friendly: Optimized for security rule creation

Use Cases

Binary Analysis & Reverse Engineering

Extract meaningful strings to understand program functionality, identify libraries, and discover embedded resources.

Malware Analysis

Quickly identify network indicators, file paths, registry keys, and other artifacts of interest in suspicious binaries.

YARA Rule Development

Generate high-confidence string candidates for creating detection rules, with automatic escaping and formatting.

Security Research

Analyze binaries for hardcoded credentials, API endpoints, configuration data, and other security-relevant strings.

Project Status

Stringy is in active development with a solid foundation already in place. The core infrastructure is complete and robust:

Implemented:

  • Complete binary format detection (ELF, PE, Mach-O)
  • Comprehensive section classification with intelligent weighting
  • Import/export symbol extraction from all formats
  • String extraction engines (ASCII/UTF-8, UTF-16LE/BE)
  • Semantic classification system (URLs, paths, GUIDs, etc.)
  • Ranking, scoring, and normalization algorithms
  • Output formatters (table, JSONL, YARA)
  • Full CLI interface with filtering, encoding, and mode flags
  • Noise filtering with multi-layered heuristics
  • Type-safe error handling and data structures
  • Extensible architecture with trait-based parsers

See the Architecture Overview for technical details and the Contributing guide to get involved.

Installation

Pre-built Binaries

Pre-built binaries for Linux, macOS, and Windows are available on the Releases page.

Download the appropriate archive for your platform, extract it, and place the stringy binary somewhere on your PATH.

From Source

Prerequisites

  • Rust: Version 1.91 or later (see rustup.rs if you need to install Rust)
  • Git: For cloning the repository

Build and Install

git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy
cargo install --path .

This installs the stringy binary to ~/.cargo/bin/, which should be in your PATH.

Verify Installation

stringy --version

Development Build

For development and testing, Stringy uses just and mise to manage tooling:

git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy
just setup         # Install tools and components
just gen-fixtures  # Generate test fixtures (requires Zig via mise)
just test          # Run tests

If you do not use just, the minimum requirements are:

cargo build --release
cargo test

Troubleshooting

Build Failures

Update Rust to the latest version:

rustup update

Clear the build cache:

cargo clean
cargo build --release

Getting Help

If you encounter issues:

  1. Check the troubleshooting guide
  2. Search existing GitHub issues
  3. Open a new issue with your OS, Rust version (rustc --version), and complete error output

Next Steps

Once installed, see the Quick Start guide to begin using Stringy.

Quick Start

This guide will get you up and running with Stringy in minutes.

Basic Usage

Analyze a Binary

stringy /path/to/binary

Stringy will:

  • Detect ELF, PE, or Mach-O format automatically
  • Extract ASCII and UTF-16 strings from prioritized sections
  • Apply semantic classification (URLs, paths, GUIDs, etc.)
  • Rank results by relevance and display them in a table

Example Output (TTY)

String                                   Tags              Score  Section
------                                   ----              -----  -------
https://api.example.com/v1/users         url                 95   .rdata
{12345678-1234-1234-1234-123456789abc}   guid                87   .rdata
/usr/local/bin/application               filepath            82   __cstring
Error: %s at line %d                     fmt                 78   .rdata
MyApplication v1.2.3                     version             75   .rsrc

Common Use Cases

Security Analysis

Extract network indicators and file paths:

stringy --only-tags url --only-tags domain --only-tags filepath --only-tags regpath malware.exe

YARA Rule Development

Generate rule candidates:

stringy --yara --min-len 8 target.bin > candidates.yar

JSON Output for Automation

stringy --json --debug binary.elf | jq 'select(.display_score > 80)'

Extraction-Only Mode

Skip classification and ranking for fast raw extraction:

stringy --raw binary

Understanding the Output

Score Column

Strings are ranked using a display score from 0-100:

  • 90-100: High-value indicators (URLs, GUIDs in high-priority sections)
  • 70-89: Meaningful strings (file paths, format strings)
  • 50-69: Moderate relevance (imports, version info)
  • 0-49: Low relevance (short or noisy strings)

See Output Formats for the full band-mapping table.

Tags

Semantic classifications help identify string types:

Tag | Description | Example
url | Web URLs | https://example.com/api
domain | Domain names | api.example.com
ipv4/ipv6 | IP addresses | 192.168.1.1
filepath | File paths | /usr/bin/app
regpath | Registry paths | HKEY_LOCAL_MACHINE\...
guid | GUIDs/UUIDs | {12345678-1234-...}
email | Email addresses | user@example.com
b64 | Base64 data | SGVsbG8gV29ybGQ=
fmt | Format strings | Error: %s
import/export | Symbol names | CreateFileW
demangled | Demangled symbols | std::io::Read::read
user-agent-ish | User-agent-like strings | Mozilla/5.0 ...
version | Version strings | v1.2.3
manifest | Manifest data | PE/Mach-O embedded XML
resource | Resource strings | PE VERSIONINFO/STRINGTABLE
dylib-path | Dynamic library paths | /usr/lib/libfoo.dylib
rpath | Runtime search paths | /usr/local/lib
rpath-var | Rpath variables | @loader_path/../lib
framework-path | Framework paths (macOS) | /System/Library/...

Sections

Shows where strings were found:

  • ELF: .rodata, .data.rel.ro, .comment
  • PE: .rdata, .rsrc, version info
  • Mach-O: __TEXT,__cstring, __DATA_CONST

Filtering and Options

By String Length

# Minimum 6 characters
stringy --min-len 6 binary

By Encoding

# ASCII only
stringy --enc ascii binary

# UTF-16 only (useful for Windows binaries)
stringy --enc utf16 binary.exe

By Tags

# Only network-related strings
stringy --only-tags url --only-tags domain --only-tags ipv4 --only-tags ipv6 binary

# Exclude Base64 noise
stringy --no-tags b64 binary

Limit Results

# Top 50 results
stringy --top 50 binary

Summary

Append a summary block after table output (TTY only):

stringy --summary binary

Output Formats

Table (Default)

Best for interactive analysis:

stringy binary

JSON Lines

For programmatic processing:

stringy --json binary | jq 'select(.tags[] == "Url")'

YARA Format

For security rule creation:

stringy --yara binary > rule_candidates.yar

Tips and Best Practices

Start Broad, Then Focus

  1. Run basic analysis first: stringy binary
  2. Identify interesting patterns in high-scoring results
  3. Use filters to focus: --only-tags url --only-tags filepath

Combine with Other Tools

# Find strings, then search for references
stringy --json binary | jq -r 'select(.score > 80) | .text' | xargs -I {} grep -r "{}" /path/to/source

# Extract URLs for further analysis
stringy --only-tags url --json binary | jq -r '.text' | sort -u

Performance Considerations

  • Use --top N to limit output for large binaries
  • Use --enc to restrict to a single encoding
  • Consider --min-len to reduce noise

Next Steps

Command Line Interface

Basic Syntax

stringy [OPTIONS] <FILE>
stringy [OPTIONS] -        # read from stdin

Options

Input/Output

Option | Description | Default
<FILE> | Binary file to analyze (use - for stdin) | -
--json | JSONL output; conflicts with --yara | -
--yara | YARA rule output; conflicts with --json | -
--help | Show help | -
--version | Show version | -

Filtering

Option | Description | Default
--min-len N | Minimum string length (must be >= 1) | 4
--top N | Limit to top N strings by score (applied after all filters) | -
--enc ENCODING | Filter by encoding: ascii, utf8, utf16, utf16le, utf16be | all
--only-tags TAG | Include strings with any of these tags (OR); repeatable | all
--no-tags TAG | Exclude strings with any of these tags; repeatable | none

Mode Flags

Option | Description
--raw | Extraction-only mode (no tagging, ranking, or scoring); conflicts with --only-tags, --no-tags, --top, --debug, --yara
--summary | Append summary block (TTY table mode only); conflicts with --json, --yara
--debug | Include score-breakdown fields (section_weight, semantic_boost, noise_penalty) in JSON output; conflicts with --raw

Encoding Options

The --enc flag accepts exactly one encoding value per invocation:

Value | Description
ascii | 7-bit ASCII only
utf8 | UTF-8 (includes ASCII)
utf16 | UTF-16 (both little- and big-endian)
utf16le | UTF-16 Little Endian only
utf16be | UTF-16 Big Endian only

Examples

# ASCII only
stringy --enc ascii binary

# UTF-16 only (common for Windows)
stringy --enc utf16 app.exe

# UTF-8 only
stringy --enc utf8 binary

Tag Filtering

Tags are specified with the repeatable --only-tags and --no-tags flags. Repeat the flag for each tag value:

# Network indicators only
stringy --only-tags url --only-tags domain --only-tags ipv4 --only-tags ipv6 malware.exe

# Exclude noisy Base64
stringy --no-tags b64 binary

# File system related
stringy --only-tags filepath --only-tags regpath app.exe

Available Tags

Tag | Description | Example
url | HTTP/HTTPS URLs | https://api.example.com
domain | Domain names | example.com
ipv4 | IPv4 addresses | 192.168.1.1
ipv6 | IPv6 addresses | 2001:db8::1
filepath | File paths | /usr/bin/app
regpath | Registry paths | HKEY_LOCAL_MACHINE\...
guid | GUIDs/UUIDs | {12345678-1234-...}
email | Email addresses | user@example.com
b64 | Base64 data | SGVsbG8=
fmt | Format strings | Error: %s
user-agent-ish | User-agent-like strings | Mozilla/5.0 ...
demangled | Demangled symbols | std::io::Read::read
import | Import names | CreateFileW
export | Export names | main
version | Version strings | v1.2.3
manifest | Manifest data | XML/JSON config
resource | Resource strings | UI text
dylib-path | Dynamic library paths | /usr/lib/libfoo.dylib
rpath | Runtime search paths | /usr/local/lib
rpath-var | Rpath variables | @loader_path/../lib
framework-path | Framework paths (macOS) | /System/Library/Frameworks/...

Output Formats

Table (Default, TTY)

When stdout is a TTY, results are shown as a table with columns:

String | Tags | Score | Section

When piped (non-TTY), output is plain text with one string per line and no headers.

JSON Lines (--json)

Each line is a JSON object with full metadata. See Output Formats for the schema.

YARA (--yara)

Generates a YARA rule template. See Output Formats for details.

Exit Codes

Code | Meaning
0 | Success (including unknown binary format, empty binary, no filter matches)
1 | General runtime error
2 | Configuration or validation error (tag overlap, --summary in non-TTY)
3 | File not found
4 | Permission denied

Clap argument parsing errors (invalid flag, flag conflict, invalid tag name) use clap’s own exit code (typically 2).

Advanced Usage

Pipeline Integration

# Extract URLs and check them
stringy --only-tags url --json binary | jq -r '.text' | xargs -I {} curl -I {}

# Find high-score strings
stringy --json binary | jq 'select(.score > 80)'

# Count strings by tag
stringy --json binary | jq -r '.tags[]' | sort | uniq -c

Batch Processing

# Process multiple files
find /path/to/binaries -type f -exec stringy --json {} \; > all_strings.jsonl

# Compare two versions
stringy --json old_binary > old.jsonl
stringy --json new_binary > new.jsonl
diff <(jq -r '.text' old.jsonl | sort) <(jq -r '.text' new.jsonl | sort)

Focused Analysis

# Fast scan for high-value strings only
stringy --top 20 --min-len 8 --only-tags url --only-tags guid --only-tags filepath large_binary

# Extraction-only mode (no classification overhead)
stringy --raw binary

Output Formats

Stringy supports three output formats optimized for different use cases.

Table Format (Default)

TTY Mode

When stdout is a TTY, results are shown as an aligned table. Columns appear in this order:

String                                   Tags              Score  Section
------                                   ----              -----  -------
https://api.example.com/v1/users         Url                 95   .rdata
{12345678-1234-1234-1234-123456789abc}   guid                87   .rdata
/usr/local/bin/application               filepath            82   __cstring
Error: %s at line %d                     fmt                 78   .rdata

Features:

  • Truncation: Long strings are truncated with ... indicator
  • Sorting: Results sorted by score (highest first)
  • Alignment: Columns properly aligned for readability
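
The truncation behavior can be sketched as a character-aware helper. This is illustrative only: the display width and the exact placement of the ... marker are assumptions, not Stringy's actual implementation.

```rust
/// Truncate a cell to `width` characters, appending "..." when cut.
/// Width and marker are assumptions for illustration.
fn truncate_cell(s: &str, width: usize) -> String {
    if s.chars().count() <= width {
        s.to_string()
    } else {
        // Reserve three characters for the "..." indicator.
        let cut: String = s.chars().take(width.saturating_sub(3)).collect();
        format!("{cut}...")
    }
}
```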

Plain Text (Piped / Non-TTY)

When stdout is piped, output switches to plain text with one string per line and no headers or table formatting. This is designed for downstream tool consumption.

--summary Block

When --summary is passed (TTY mode only; conflicts with --json and --yara), a summary block is appended after the table showing aggregate statistics about the extraction.

JSON Lines Format

Machine-readable format with one JSON object per line (JSONL), ideal for automation and pipeline integration.

stringy --json binary

Example Output

{"text":"https://api.example.com/v1/users","encoding":"Ascii","offset":4096,"rva":4096,"section":".rdata","length":31,"tags":["Url"],"score":95,"confidence":1.0,"source":"SectionData"}
{"text":"{12345678-1234-1234-1234-123456789abc}","encoding":"Ascii","offset":8192,"rva":8192,"section":".rdata","length":38,"tags":["guid"],"score":87,"confidence":0.95,"source":"SectionData"}

Schema

Each JSON object contains:

Field | Type | Description
text | string | The extracted string (demangled if applicable)
original_text | string or null | Original mangled form (present only when demangled)
encoding | string | Encoding: Ascii, Utf8, Utf16Le, Utf16Be
offset | number | File offset in bytes
rva | number or null | Relative Virtual Address (if available)
section | string or null | Section name where found
length | number | String length in bytes
tags | array | Semantic classification tags
score | number | Internal relevance score
display_score | number or null | Display score (0-100 band-mapped); only present with --debug
confidence | number | Confidence score from noise filtering (0.0-1.0)
source | string | Source type: SectionData, ImportName, ExportName, etc.

Debug Fields

When --debug is passed, four additional fields appear:

Field | Type | Description
display_score | number or null | Display score (0-100 band-mapped)
section_weight | number or null | Section weight contribution to score
semantic_boost | number or null | Semantic classification boost
noise_penalty | number or null | Noise penalty applied

Raw Mode

With --raw --json, output contains extraction-only data: score is 0, tags is empty, and display_score is absent.

Processing Examples

# Extract only URLs
stringy --json binary | jq 'select(.tags[] == "Url") | .text'

# High-score strings only
stringy --json binary | jq 'select(.score > 80)'

# Group by section
stringy --json binary | jq -r '.section' | sort | uniq -c

# Find strings in specific section
stringy --json binary | jq 'select(.section == ".rdata")'

YARA Format

Specialized format for creating YARA detection rules with proper escaping and metadata.

stringy --yara binary

Example Output

// YARA rule generated by Stringy
// Binary: binary
// Generated: 1234567890

rule binary_strings {
  meta:
    description = "Strings extracted from binary"
    generated_by = "stringy"
    generated_at = "1234567890"
  strings:
    // tag: filepath
    // score: 82
    $filepath_1 = "/usr/local/bin/application" ascii
    // tag: fmt
    // score: 78
    $fmt_1 = "Error: %s at line %d" ascii
    // tag: Url
    // score: 95
    $Url_1 = "https://api.example.com/v1/users" ascii
    // skipped (length > 200 chars): 245
  condition:
    any of them
}

Features

  • Rule naming: Rule name is derived from the filename with non-alphanumeric characters replaced by _ and a _strings suffix added
  • Tag grouping: Strings are grouped by their first tag with // tag: <name> comments and per-string // score: <N> annotations
  • Variable naming: Variables use tag-derived names (e.g., $Url_1, $filepath_1, $fmt_1) rather than sequential $sN
  • Proper escaping: Handles special characters and binary data
  • Long string handling: Strings over 200 characters are replaced with // skipped (length > 200 chars): N (where N is the character count)
  • Modifiers: Appropriate ascii/wide modifiers based on encoding
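
The rule-naming and long-string rules above can be sketched as two small helpers. These are illustrative, not the actual src/output/yara/ code; in particular, the escaping shown (backslash and quote only) is a simplification of the real escaping logic.

```rust
/// Derive a YARA rule name: non-alphanumeric characters become '_'
/// and a "_strings" suffix is appended.
fn yara_rule_name(filename: &str) -> String {
    let base: String = filename
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect();
    format!("{base}_strings")
}

/// Emit one string definition, replacing over-long strings (> 200 chars)
/// with a skip comment. Escaping here is a simplified sketch.
fn yara_string_line(var: &str, s: &str) -> String {
    let len = s.chars().count();
    if len > 200 {
        format!("// skipped (length > 200 chars): {len}")
    } else {
        let escaped = s.replace('\\', "\\\\").replace('"', "\\\"");
        format!("${var} = \"{escaped}\" ascii")
    }
}
```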

Score Behavior

Stringy uses a band-mapping system to convert internal scores to display scores (0-100):

Internal Score | Display Score | Meaning
<= 0 | 0 | Low relevance
1-79 | 1-49 | Low relevance
80-119 | 50-69 | Moderate
120-159 | 70-89 | Meaningful
160-220 | 90-100 | High-value
> 220 | 100 (clamped) | High-value
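
The band table can be realized as a small mapping function. This is a sketch, not Stringy's actual normalizer: linear interpolation within each band is an assumption, and the real src/pipeline/normalizer.rs may round or interpolate differently.

```rust
/// Map an internal score onto the 0-100 display scale using the band table.
/// Linear interpolation inside each band is assumed for illustration.
fn band_map(internal: i32) -> i32 {
    // (internal_lo, internal_hi, display_lo, display_hi)
    const BANDS: [(i32, i32, i32, i32); 4] = [
        (1, 79, 1, 49),
        (80, 119, 50, 69),
        (120, 159, 70, 89),
        (160, 220, 90, 100),
    ];
    if internal <= 0 {
        return 0;
    }
    for (lo, hi, dlo, dhi) in BANDS {
        if internal <= hi {
            return dlo + (internal - lo) * (dhi - dlo) / (hi - lo);
        }
    }
    100 // anything above 220 clamps to 100
}
```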

Format Comparison

Feature | Table | JSON | YARA
Interactive use | Yes | No | No
Automation | No | Yes | No
Rule creation | No | No | Yes
Full metadata | No | Yes | No

Output Customization

Filtering

All formats support the same filtering options:

# Limit results
stringy --top 50 --json binary

# Filter by tags
stringy --only-tags url --only-tags domain --yara binary

# Minimum score threshold (post-process)
stringy --json binary | jq 'select(.score >= 70)'

Redirection

# Save to file
stringy --json binary > strings.jsonl
stringy --yara binary > rules.yar

# Pipe to other tools
stringy --json binary | jq 'select(.tags[] == "Url")' | less

Architecture Overview

Stringy is built as a modular Rust library with a clear separation of concerns. The architecture follows a pipeline approach where binary data flows through several processing stages.

High-Level Architecture

Binary File -> Format Detection -> Container Parsing -> String Extraction -> Deduplication -> Classification -> Ranking -> Output

Core Components

1. Container Module (src/container/)

Handles binary format detection and parsing using the goblin crate with comprehensive section analysis.

  • Format Detection: Automatically identifies ELF, PE, and Mach-O formats via goblin::Object::parse()
  • Section Classification: Categorizes sections by string likelihood with weighted scoring
  • Metadata Extraction: Collects imports, exports, and detailed structural information
  • Cross-Platform Support: Handles platform-specific section characteristics and naming conventions

Supported Formats

Format | Parser | Key Sections (Weight) | Import/Export Support
ELF | ElfParser | .rodata (10.0), .comment (9.0), .data.rel.ro (7.0) | Dynamic and static
PE | PeParser | .rdata (10.0), .rsrc (9.0), read-only .data (7.0) | Import/export tables
Mach-O | MachoParser | __TEXT,__cstring (10.0), __TEXT,__const (9.0) | Symbol tables

Section Weight System

Container parsers assign weights (1.0-10.0) to sections based on how likely they are to contain meaningful strings. Higher weights indicate higher-value sections. For example, .rodata (read-only data) receives a weight of 10.0, while .text (executable code) receives 1.0.
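
Using only the weights quoted in this chapter, a name-based lookup might look like the following sketch. The 3.0 default for unlisted sections is a placeholder assumption, not Stringy's actual fallback.

```rust
/// Illustrative section-weight lookup based on the weights in this chapter.
/// Unlisted sections get an assumed mid-low default.
fn section_weight(name: &str) -> f32 {
    match name {
        ".rodata" | ".rdata" | "__cstring" => 10.0, // string-rich, read-only
        ".comment" | ".rsrc" | "__const" => 9.0,
        ".data.rel.ro" => 7.0,
        ".text" => 1.0, // executable code: low string value
        _ => 3.0,       // hypothetical default
    }
}
```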

2. Extraction Module (src/extraction/)

Implements encoding-aware string extraction algorithms with configurable parameters.

  • ASCII/UTF-8: Scans for printable character sequences with noise filtering
  • UTF-16: Detects little-endian and big-endian wide strings with confidence scoring
  • PE Resources: Extracts version info, manifests, and string table resources from PE binaries
  • Mach-O Load Commands: Extracts strings from Mach-O load commands (dylib paths, rpaths)
  • Deduplication: Groups strings by (text, encoding) keys, preserves all occurrence metadata, merges tags using set union, and calculates combined scores with occurrence-based bonuses
  • Noise Filters: Applies configurable filters to reduce false positives
  • Section-Aware: Uses container parser weights to prioritize extraction areas
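
The ASCII/UTF-8 pass boils down to scanning for runs of printable bytes of at least the minimum length. A minimal, dependency-free sketch follows; the printable set (including tab) is an assumption, and the real extractor also applies noise filtering and confidence scoring.

```rust
/// Return (file_offset, string) pairs for printable-ASCII runs of at least
/// `min_len` bytes. Treating tab as printable is an assumption here.
fn extract_ascii(data: &[u8], min_len: usize) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in data.iter().enumerate() {
        let printable = (0x20..0x7f).contains(&b) || b == b'\t';
        match (printable, start) {
            (true, None) => start = Some(i), // run begins
            (false, Some(s)) => {
                // Run ends: keep it if it meets the minimum length.
                if i - s >= min_len {
                    out.push((s, String::from_utf8_lossy(&data[s..i]).into_owned()));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that reaches the end of the buffer.
    if let Some(s) = start {
        if data.len() - s >= min_len {
            out.push((s, String::from_utf8_lossy(&data[s..]).into_owned()));
        }
    }
    out
}
```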

Deduplication System

The deduplication module (src/extraction/dedup/) provides comprehensive string deduplication:

  • Grouping Strategy: Strings are grouped by (text, encoding) tuple, ensuring UTF-8 and UTF-16 versions are kept separate
  • Occurrence Preservation: All occurrence metadata (offset, RVA, section, source, tags, score, confidence) is preserved
  • Tag Merging: Tags from all occurrences are merged using HashSet for uniqueness, then converted to a sorted Vec<Tag>
  • Combined Scoring: Calculates combined scores using a base score (maximum across occurrences) plus bonuses for multiple occurrences, cross-section appearances, and multi-source appearances
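
The grouping and merging strategy can be sketched with standard-library maps. The +5-per-extra-occurrence bonus below is hypothetical; the real scorer also adds cross-section and multi-source bonuses, and its exact constants are not shown here.

```rust
use std::collections::{BTreeSet, HashMap};

struct Occurrence {
    text: String,
    encoding: &'static str, // e.g. "Ascii", "Utf16Le"
    tags: Vec<&'static str>,
    score: i32,
}

/// Group occurrences by (text, encoding), merge tags as a set, and score
/// each group as the max occurrence score plus an assumed occurrence bonus.
fn dedup(
    occurrences: Vec<Occurrence>,
) -> HashMap<(String, &'static str), (Vec<&'static str>, i32)> {
    let mut groups: HashMap<(String, &'static str), Vec<Occurrence>> = HashMap::new();
    for occ in occurrences {
        groups
            .entry((occ.text.clone(), occ.encoding))
            .or_default()
            .push(occ);
    }
    groups
        .into_iter()
        .map(|(key, occs)| {
            // Set union of tags, emitted in sorted order.
            let tags: BTreeSet<_> = occs.iter().flat_map(|o| o.tags.iter().copied()).collect();
            let base = occs.iter().map(|o| o.score).max().unwrap_or(0);
            let bonus = 5 * (occs.len() as i32 - 1); // hypothetical bonus
            (key, (tags.into_iter().collect(), base + bonus))
        })
        .collect()
}
```

Note that a UTF-8 "hello" and a UTF-16LE "hello" land in different groups, matching the (text, encoding) keying described above.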

3. Classification Module (src/classification/)

Applies semantic analysis to extracted strings with comprehensive tagging system.

  • Pattern Matching: Uses regex to identify URLs, IPs, domains, paths, GUIDs, emails, format strings, base64, user-agent strings, and version strings
  • Symbol Processing: Demangles Rust symbols and processes imports/exports
  • Context Analysis: Considers section context and source type for classification

Supported Classification Tags

Category | Tags | Examples
Network | url, domain, ipv4, ipv6 | https://api.com, example.com, 192.168.1.1
Filesystem | filepath, regpath, dylib-path, rpath, rpath-var, framework-path | /usr/bin/app, HKEY_LOCAL_MACHINE\...
Identifiers | guid, email, user-agent-ish | {12345678-...}, user@domain.com
Code | fmt, b64, import, export, demangled | Error: %s, SGVsbG8=, CreateFileW
Resources | version, manifest, resource | v1.2.3, XML config, UI strings
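
As a toy illustration of content-based tagging, the following uses crude substring heuristics in place of Stringy's regex matchers; the real patterns in src/classification/patterns/ are far stricter and cover many more tags.

```rust
/// Toy tagger: substring heuristics standing in for the real regex-based
/// classifiers. Only a handful of tags are sketched here.
fn quick_tags(s: &str) -> Vec<&'static str> {
    let mut tags = Vec::new();
    if s.starts_with("http://") || s.starts_with("https://") {
        tags.push("url");
    }
    if s.starts_with('/') {
        tags.push("filepath");
    }
    if s.starts_with("HKEY_") {
        tags.push("regpath");
    }
    // Very loose email check; real classification would validate structure.
    if s.contains('@') && s.contains('.') && !s.contains(' ') && !s.contains('/') {
        tags.push("email");
    }
    tags
}
```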

4. Ranking Module (src/classification/ranking.rs)

Implements the scoring algorithm to prioritize relevant strings using multiple factors:

  • Section Weight: Based on the section’s classification (higher weights for string-oriented sections like .rodata)
  • Semantic Boost: Bonus points for strings with recognized semantic tags (URLs, GUIDs, paths, etc.)
  • Noise Penalty: Penalty for characteristics indicating noise (low confidence, repetitive patterns, high entropy)

The internal score is then mapped to a display score (0-100) using a band-mapping system. See Output Formats for the display-score band table.

5. Output Module (src/output/)

Formats results for different use cases:

  • Table (src/output/table/): TTY-aware output with color-coded scores, or plain text when piped. Columns: String, Tags, Score, Section.
  • JSON (src/output/json.rs): JSONL format with complete structured data including all metadata fields
  • YARA (src/output/yara/): Properly escaped strings with appropriate modifiers and long-string skipping

6. Pipeline Module (src/pipeline/)

Orchestrates the entire flow from file reading through output:

  • Configuration (src/pipeline/config.rs): PipelineConfig, FilterConfig, and EncodingFilter
  • Filtering (src/pipeline/filter.rs): FilterEngine applies post-extraction filtering by min-length, encoding, tags, and top-N
  • Score Normalization (src/pipeline/normalizer.rs): ScoreNormalizer maps internal scores to display scores (0-100) and populates display_score on each FoundString unconditionally in all non-raw executions
  • Orchestration (src/pipeline/mod.rs): Pipeline::run drives the full pipeline

Data Flow

1. Binary Analysis Phase

The pipeline reads the file, detects the binary format via goblin, and dispatches to the appropriate container parser (ELF, PE, or Mach-O). The parser returns a ContainerInfo struct containing sections with weights, imports, and exports. Unknown or unparseable formats fall back to unstructured raw byte scanning.

2. String Extraction Phase

Strings are extracted from each section using encoding-specific extractors (ASCII, UTF-8, UTF-16LE, UTF-16BE). Import and export symbol names are included as high-value strings. PE resources (version info, manifests, string tables) and Mach-O load command strings are also extracted. Results are then deduplicated by grouping on (text, encoding).

3. Classification Phase

Each string is passed through pattern matchers that assign semantic tags based on content. Rust mangled symbols are demangled. The ranking algorithm then computes a score for each string combining section weight, semantic boost, and noise penalty.

4. Output Phase

Strings are sorted by score (descending), filtered according to user options (tags, encoding, top-N), and formatted for the selected output mode (table, JSON, or YARA).
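
The sort-then-limit step can be sketched as follows; the (score, text) tuple shape is illustrative, as the pipeline actually operates on FoundString values with the filters described above applied first.

```rust
/// Sort results by score descending, then apply an optional --top limit.
fn finalize(mut results: Vec<(i32, String)>, top: Option<usize>) -> Vec<(i32, String)> {
    results.sort_by(|a, b| b.0.cmp(&a.0)); // highest score first
    if let Some(n) = top {
        results.truncate(n);
    }
    results
}
```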

Data Structures

Core types are defined in src/types/mod.rs:

pub struct FoundString {
    pub text: String,
    pub original_text: Option<String>, // pre-demangled form
    pub encoding: Encoding,
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
    pub length: u32,
    pub tags: Vec<Tag>,
    pub score: i32,
    pub section_weight: Option<i32>, // debug only
    pub semantic_boost: Option<i32>, // debug only
    pub noise_penalty: Option<i32>,  // debug only
    pub display_score: Option<i32>,  // populated in all non-raw executions
    pub source: StringSource,
    pub confidence: f32,
}

Key Design Decisions

Error Handling

  • Comprehensive error types with context via thiserror
  • Graceful degradation for partially corrupted binaries
  • Unknown formats fall back to raw byte scanning rather than erroring

Extensibility

  • Trait-based architecture for easy format addition
  • Pluggable classification systems
  • Configurable output formats

Performance

  • Section-aware extraction reduces scan time
  • Regex caching via once_cell::sync::Lazy for repeated pattern matching
  • Weight-based prioritization avoids scanning low-value sections

Module Dependencies

main.rs
+-- lib.rs (public API, re-exports)
+-- types/
|   +-- mod.rs (core data structures: Tag, FoundString, Encoding, etc.)
|   +-- error.rs (StringyError, Result)
|   +-- constructors.rs (constructor implementations)
|   +-- found_string.rs (FoundString builder methods)
|   +-- tests.rs
+-- container/
|   +-- mod.rs (format detection, ContainerParser trait)
|   +-- elf/
|   |   +-- mod.rs (ELF parser)
|   |   +-- tests.rs
|   +-- pe/
|   |   +-- mod.rs (PE parser)
|   |   +-- tests.rs
|   +-- macho/
|   |   +-- mod.rs (Mach-O parser)
|   |   +-- tests.rs
+-- extraction/
|   +-- mod.rs (extraction orchestration)
|   +-- ascii/ (ASCII/UTF-8 extraction)
|   +-- utf16/ (UTF-16LE/BE extraction with confidence scoring)
|   +-- dedup/ (deduplication with scoring)
|   +-- filters/ (noise filter implementations)
|   +-- pe_resources/ (PE version info, manifests, string tables)
|   +-- macho_load_commands.rs (Mach-O load command strings)
+-- classification/
|   +-- mod.rs (classification framework)
|   +-- patterns/ (regex-based pattern matching)
|   +-- symbols.rs (symbol processing and demangling)
|   +-- ranking.rs (scoring algorithm)
+-- output/
|   +-- mod.rs (OutputFormat, OutputMetadata, formatting dispatch)
|   +-- json.rs (JSONL format)
|   +-- table/ (TTY and plain text table formatting)
|   +-- yara/ (YARA rule generation with escaping)
+-- pipeline/
    +-- mod.rs (Pipeline::run orchestration)
    +-- config.rs (PipelineConfig, FilterConfig, EncodingFilter)
    +-- filter.rs (post-extraction filtering)
    +-- normalizer.rs (score band mapping)

External Dependencies

Core Dependencies

  • goblin - Multi-format binary parsing (ELF, PE, Mach-O)
  • pelite - PE resource extraction (version info, manifests, string tables)
  • serde + serde_json - Serialization
  • thiserror - Error handling
  • clap - CLI argument parsing
  • regex - Pattern matching for classification
  • rustc-demangle - Rust symbol demangling
  • indicatif - Progress bars and spinners for CLI output
  • tempfile - Temporary file creation for stdin-to-Pipeline bridging
  • once_cell - Lazy-initialized static regex patterns
  • patharg - Input argument handling (file path or stdin)

Testing Strategy

Unit Tests

  • Each module has comprehensive unit tests
  • Mock data for parser testing
  • Edge case coverage for string extraction

Integration Tests

  • End-to-end CLI functionality via assert_cmd
  • Real binary file testing with compiled fixtures
  • Snapshot testing via insta
  • Cross-platform validation

Performance Tests

  • Benchmarks via criterion in benches/

Binary Format Support

Stringy supports the three major executable formats across different platforms. Each format has unique characteristics that influence string extraction strategies.

ELF (Executable and Linkable Format)

Used primarily on Linux and other Unix-like systems.

Key Sections for String Extraction

Section | Priority | Description
.rodata | High | Read-only data, often contains string literals
.rodata.str1.1 | High | Aligned string literals
.data.rel.ro | Medium | Read-only after relocation
.comment | Medium | Compiler and build information
.note.* | Low | Various metadata notes

ELF-Specific Features

  • Symbol Tables: Extract import/export names from .dynsym and .symtab
  • Dynamic Strings: Process .dynstr for library names and symbols
  • Section Flags: Use SHF_EXECINSTR and SHF_WRITE for classification
  • Virtual Addresses: Map file offsets to runtime addresses
  • Dynamic Linking: Parse DT_NEEDED entries to extract library dependencies
  • Symbol Types: Support for functions (STT_FUNC), objects (STT_OBJECT), TLS variables (STT_TLS), and indirect functions (STT_GNU_IFUNC)
  • Symbol Visibility: Filter hidden and internal symbols from exports (STV_HIDDEN, STV_INTERNAL)

Enhanced Symbol Extraction

The ELF parser provides comprehensive symbol extraction with:

  1. Import Detection: Identifies all undefined symbols (SHN_UNDEF) that need runtime resolution

    • Supports multiple symbol types: functions, objects, TLS variables, and indirect functions
    • Handles both global and weak bindings
    • Maps symbols to their providing libraries using version information
  2. Export Detection: Extracts all globally visible defined symbols

    • Filters out hidden (STV_HIDDEN) and internal (STV_INTERNAL) symbols
    • Includes both strong and weak symbols
    • Supports all relevant symbol types
  3. Library Dependencies: Extracts DT_NEEDED entries from the dynamic section

    • Provides list of required shared libraries
    • Used in conjunction with version information for symbol-to-library mapping
  4. Symbol-to-Library Mapping: Maps imported symbols to their providing libraries

    • Uses ELF version tables (versym and verneed) for best-effort attribution
    • Process: versym index → verneed entry → library filename
    • Falls back to heuristics for unversioned symbols (e.g., common libc symbols)
    • Returns None when version information is unavailable or ambiguous

Implementation Details

impl ElfParser {
    fn classify_section(section: &SectionHeader, name: &str) -> SectionType {
        // Check executable flag first
        if section.sh_flags & SHF_EXECINSTR != 0 {
            return SectionType::Code;
        }

        // Classify by name patterns
        match name {
            ".rodata" | ".rodata.str1.1" => SectionType::StringData,
            ".data.rel.ro" => SectionType::ReadOnlyData,
            // ... more classifications
            _ => SectionType::Other,
        }
    }

    fn extract_imports(&self, elf: &Elf, libraries: &[String]) -> Vec<ImportInfo> {
        // Extract undefined symbols from dynamic symbol table
        // Supports STT_FUNC, STT_OBJECT, STT_TLS, STT_GNU_IFUNC, STT_NOTYPE
        // Handles both STB_GLOBAL and STB_WEAK bindings
        // Maps symbols to libraries using version information
    }

    fn extract_exports(&self, elf: &Elf) -> Vec<ExportInfo> {
        // Extract defined symbols with global/weak binding
        // Filters out STV_HIDDEN and STV_INTERNAL symbols
        // Includes all relevant symbol types
    }

    fn extract_needed_libraries(&self, elf: &Elf) -> Vec<String> {
        // Parse DT_NEEDED entries from dynamic section
        // Returns list of required shared library names
    }

    fn get_symbol_providing_library(
        &self,
        elf: &Elf,
        sym_index: usize,
        libraries: &[String],
    ) -> Option<String> {
        // 1. Get version index from versym table for this symbol
        // 2. Look up version in verneed to find library name
        // 3. Match with DT_NEEDED entries
        // 4. Fallback to heuristics for unversioned symbols
    }
}

Library Dependency Mapping

The ELF parser implements symbol-to-library mapping using ELF version information:

  1. Version Symbol Table (versym): Maps each dynamic symbol to a version index

    • Index 0 (VER_NDX_LOCAL): Local symbol, not available externally
    • Index 1 (VER_NDX_GLOBAL): Global symbol, no specific version
    • Index ≥ 2: Versioned symbol, references verneed entry
  2. Version Needed Table (verneed): Lists library dependencies with version requirements

    • Each entry contains a library filename (from DT_NEEDED)
    • Auxiliary entries specify version names and indices
    • Links version indices to specific libraries
  3. Mapping Process:

    Symbol → versym[sym_index] → version_index → verneed lookup → library_name
    
  4. Fallback Strategies:

    • For unversioned symbols: Attempt to match common symbols (e.g., printf, malloc) to libc
    • If only one library is needed: Attribute to that library (least accurate)
    • Otherwise: Return None to avoid false positives
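The fallback chain described above can be sketched as follows. This is a minimal illustration, not Stringy's actual API: `fallback_library` and the hard-coded symbol list are hypothetical.

```rust
// Hypothetical sketch of the fallback strategies for unversioned symbols.
fn fallback_library(symbol: &str, libraries: &[String]) -> Option<String> {
    // Strategy 1: well-known libc symbols are attributed to the libc entry, if present.
    const LIBC_SYMBOLS: &[&str] = &["printf", "malloc", "free", "memcpy", "strlen"];
    if LIBC_SYMBOLS.contains(&symbol) {
        if let Some(libc) = libraries.iter().find(|l| l.starts_with("libc.so")) {
            return Some(libc.clone());
        }
    }
    // Strategy 2: if only one library is needed, it must be the provider (least accurate).
    if libraries.len() == 1 {
        return Some(libraries[0].clone());
    }
    // Strategy 3: return None rather than guess among multiple candidates.
    None
}
```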

Limitations

ELF’s indirect linking model means symbol-to-library mapping is best-effort:

  • Accuracy: Version-based mapping is accurate when version information is present, but many binaries lack version info
  • Unversioned Symbols: Symbols without version information cannot be definitively mapped without relocation analysis
  • Relocation Tables: PLT/GOT relocations would provide definitive mapping but require complex analysis
  • Static Linking: Statically linked binaries have no dynamic section, so all imports have library: None
  • Stripped Binaries: Stripped binaries may lack symbol tables entirely

The current implementation is sufficient for most string classification use cases where approximate library attribution is acceptable.

PE (Portable Executable)

Used on Windows for executables, DLLs, and drivers.

Key Sections for String Extraction

| Section | Priority | Description |
| --- | --- | --- |
| .rdata | High | Read-only data section |
| .rsrc | High | Resources (version info, strings, etc.) |
| .data | Medium | Initialized data (check write flag) |
| .text | Low | Code section (imports/exports only) |

PE-Specific Features

  • Resources: Extract from VERSIONINFO, STRINGTABLE, and manifest resources
  • Import/Export Tables: Process IAT and EAT for symbol names
  • UTF-16 Prevalence: Windows APIs favor wide strings
  • Section Characteristics: Use IMAGE_SCN_* flags for classification

Enhanced Import/Export Extraction

The PE parser provides comprehensive import/export extraction:

  1. Import Extraction: Extracts from PE import directory using goblin’s pe.imports

    • Each import includes: function name, DLL name, and RVA
    • Example: printf from msvcrt.dll
    • Iterates through pe.imports to create ImportInfo with name, library (DLL), and address (RVA)
  2. Export Extraction: Extracts from PE export directory using goblin’s pe.exports

    • Each export includes: function name, address, and ordinal
    • Note: PE executables typically don’t export symbols (only DLLs do)
    • Ordinal is derived from index since goblin doesn’t expose it directly
    • Handles unnamed exports with “ordinal_{i}” naming

Resource Extraction (Phase 2 Complete)

PE resources are particularly rich sources of strings. The PE parser now provides comprehensive resource string extraction:

VERSIONINFO Extraction

  • Extracts all StringFileInfo key-value pairs from VS_VERSIONINFO structures
  • Supports multiple language variants via translation table
  • Common extracted fields:
    • CompanyName: Company or organization name
    • FileDescription: File purpose and description
    • FileVersion: File version string (e.g., “1.0.0.0”)
    • ProductName: Product name
    • ProductVersion: Product version string
    • LegalCopyright: Copyright information
    • InternalName: Internal file identifier
    • OriginalFilename: Original filename
  • Uses pelite’s high-level version_info() API for reliable parsing
  • All strings are UTF-16LE encoded in the resource
  • Tagged with Tag::Version and Tag::Resource

STRINGTABLE Extraction

  • Parses RT_STRING resources (type 6) containing localized UI strings
  • Handles block structure: strings grouped in blocks of 16
  • Block ID calculation: (StringID >> 4) + 1
  • String format: u16 length (in UTF-16 code units) + UTF-16LE string data
  • Supports multiple language variants
  • Extracts all non-empty strings from all blocks
  • Tagged with Tag::Resource
  • Common use cases: UI labels, error messages, dialog text
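The block layout above can be sketched as a small parser. This is an illustrative helper, not Stringy's implementation; it assumes the block's bytes have already been located in the resource tree using the `(StringID >> 4) + 1` block ID.

```rust
// Parse one RT_STRING block: 16 entries, each a u16 length (in UTF-16 code
// units) followed by that many UTF-16LE code units. Empty entries have length 0.
fn parse_string_block(data: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut pos = 0;
    for _ in 0..16 {
        if pos + 2 > data.len() {
            break;
        }
        let len = u16::from_le_bytes([data[pos], data[pos + 1]]) as usize;
        pos += 2;
        let end = pos + len * 2;
        if end > data.len() {
            break;
        }
        let units: Vec<u16> = data[pos..end]
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        if len > 0 {
            if let Ok(s) = String::from_utf16(&units) {
                out.push(s); // keep only non-empty, valid strings
            }
        }
        pos = end;
    }
    out
}
```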

MANIFEST Extraction

  • Extracts RT_MANIFEST resources (type 24) containing application manifests
  • Automatic encoding detection:
    • UTF-8 with BOM (EF BB BF)
    • UTF-16LE with BOM (FF FE)
    • UTF-16BE with BOM (FE FF)
    • Fallback: byte pattern analysis
  • Returns full XML manifest content
  • Tagged with Tag::Manifest and Tag::Resource
  • Manifest contains:
    • Assembly identity (name, version, architecture)
    • Dependency information
    • Compatibility settings
    • Security settings (requestedExecutionLevel)
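The BOM checks above can be sketched as follows. This is an illustrative helper; the byte-pattern fallback analysis is elided, and the enum is a stand-in for Stringy's own types.

```rust
#[derive(Debug, PartialEq)]
enum ManifestEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
    Unknown,
}

// Detect manifest encoding from the byte-order mark, if present.
fn detect_manifest_encoding(data: &[u8]) -> ManifestEncoding {
    match data {
        [0xEF, 0xBB, 0xBF, ..] => ManifestEncoding::Utf8,
        [0xFF, 0xFE, ..] => ManifestEncoding::Utf16Le,
        [0xFE, 0xFF, ..] => ManifestEncoding::Utf16Be,
        _ => ManifestEncoding::Unknown, // fall back to byte-pattern analysis
    }
}
```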

Usage Example

use stringy::extraction::extract_resource_strings;
use stringy::types::Tag;

let pe_data = std::fs::read("example.exe")?;
let strings = extract_resource_strings(&pe_data);

// Filter version info strings
let version_strings: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::Version))
    .collect();

// Filter string table entries
let ui_strings: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::Resource) && !s.tags.contains(&Tag::Version))
    .collect();

Implementation Details

impl PeParser {
    fn classify_section(section: &SectionTable) -> SectionType {
        let name = String::from_utf8_lossy(&section.name);

        // Check characteristics
        if section.characteristics & IMAGE_SCN_CNT_CODE != 0 {
            return SectionType::Code;
        }

        match name.trim_end_matches('\0') {
            ".rdata" => SectionType::StringData,
            ".rsrc" => SectionType::Resources,
            // ... more classifications
            _ => SectionType::Other,
        }
    }

    fn extract_imports(&self, pe: &PE) -> Vec<ImportInfo> {
        // Iterates through pe.imports
        // Creates ImportInfo with name, library (DLL), and address (RVA)
    }

    fn extract_exports(&self, pe: &PE) -> Vec<ExportInfo> {
        // Iterates through pe.exports
        // Creates ExportInfo with name, address, and ordinal
        // Handles unnamed exports with "ordinal_{i}" naming
    }

    fn calculate_section_weight(section_type: SectionType, name: &str) -> f32 {
        // Returns weight values based on section type and name
        // Higher weights indicate higher string likelihood
    }
}

Section Weight Calculation

The PE parser uses a weight-based system to prioritize sections for string extraction:

| Section Type | Weight | Rationale |
| --- | --- | --- |
| StringData (.rdata) | 10.0 | Primary string storage |
| Resources (.rsrc) | 9.0 | Version info, string tables |
| ReadOnlyData | 7.0 | May contain constants |
| WritableData (.data) | 5.0 | Runtime state, lower priority |
| Code (.text) | 1.0 | Unlikely to contain strings |
| Debug | 2.0 | Internal metadata |
| Other | 1.0 | Minimal priority |

Limitations

Resource string extraction is comprehensive for VERSIONINFO (all StringFileInfo fields), STRINGTABLE (full RT_STRING block parsing with language support), and MANIFEST (encoding detection and XML extraction), but several resource types are not yet parsed:

  • Dialog Resources: RT_DIALOG parsing not yet implemented
  • Menu Resources: RT_MENU parsing not yet implemented
  • Icon Strings: RT_ICON metadata extraction not yet implemented

Future Enhancements:

  • Dialog resource parsing for control text and window titles
  • Menu resource parsing for menu item text
  • Icon and cursor resource metadata
  • Accelerator table string extraction

Mach-O (Mach Object)

Used on macOS and iOS for executables, frameworks, and libraries.

Key Sections for String Extraction

| Segment | Section | Priority | Description |
| --- | --- | --- | --- |
| __TEXT | __cstring | High | C string literals |
| __TEXT | __const | High | Constant data |
| __DATA_CONST | * | Medium | Read-only after fixups |
| __DATA | * | Low | Writable data |

Mach-O-Specific Features

  • Load Commands: Extract strings from LC_* commands
  • Segment/Section Model: Two-level naming scheme
  • Fat Binaries: Multi-architecture support
  • String Pools: Centralized string storage in __cstring

Load Command Processing

Mach-O load commands contain valuable strings:

  • LC_LOAD_DYLIB: Library paths and names
  • LC_RPATH: Runtime search paths
  • LC_ID_DYLIB: Library identification
  • LC_BUILD_VERSION: Build tool information

Implementation Details

impl MachoParser {
    fn classify_section(segment_name: &str, section_name: &str) -> SectionType {
        match (segment_name, section_name) {
            ("__TEXT", "__cstring") => SectionType::StringData,
            ("__DATA_CONST", _) => SectionType::ReadOnlyData,
            ("__DATA", _) => SectionType::WritableData,
            // ... more classifications
            _ => SectionType::Other,
        }
    }
}

Cross-Platform Considerations

Encoding Differences

| Platform | Primary Encoding | Notes |
| --- | --- | --- |
| Linux/Unix | UTF-8 | ASCII-compatible, variable width |
| Windows | UTF-16LE | Wide strings common in APIs |
| macOS | UTF-8 | Similar to Linux, some UTF-16 |

String Storage Patterns

  • ELF: Strings often in .rodata with null terminators
  • PE: Mix of ANSI and Unicode APIs, resources use UTF-16
  • Mach-O: Centralized in __cstring, mostly UTF-8

Section Weight Calculation

Different formats require different weighting strategies:

fn calculate_section_weight(format: BinaryFormat, section_type: SectionType) -> i32 {
    match (format, section_type) {
        (BinaryFormat::Elf, SectionType::StringData) => 10,   // .rodata
        (BinaryFormat::Pe, SectionType::Resources) => 9,      // .rsrc
        (BinaryFormat::MachO, SectionType::StringData) => 10, // __cstring
        // ... more weights
        _ => 1,
    }
}

Format Detection

Stringy uses goblin for robust format detection:

pub fn detect_format(data: &[u8]) -> BinaryFormat {
    match Object::parse(data) {
        Ok(Object::Elf(_)) => BinaryFormat::Elf,
        Ok(Object::PE(_)) => BinaryFormat::Pe,
        Ok(Object::Mach(_)) => BinaryFormat::MachO,
        _ => BinaryFormat::Unknown,
    }
}

Future Enhancements

Planned Format Extensions

  • WebAssembly (WASM): Growing importance in web and edge computing
  • Java Class Files: JVM bytecode analysis
  • Android APK/DEX: Mobile application analysis

Enhanced Resource Support

  • PE: Dialog resources, icon strings, version blocks
  • Mach-O: Plist resources, framework bundles
  • ELF: Note sections, build IDs, GNU attributes

Architecture-Specific Features

  • ARM64: Pointer authentication, tagged pointers
  • x86-64: RIP-relative addressing hints
  • RISC-V: Emerging architecture support

This comprehensive format support ensures Stringy can effectively analyze binaries across all major platforms while respecting the unique characteristics of each format.

String Extraction

Stringy’s string extraction engine is designed to find meaningful strings while avoiding noise and false positives. The extraction process is encoding-aware, section-aware, and configurable.

Extraction Pipeline

Binary Data → Section Analysis → Encoding Detection → String Scanning → Deduplication → Classification

Encoding Support

ASCII Extraction

The most common encoding found in binaries. ASCII extraction provides the foundational string scan with configurable minimum length thresholds.

UTF-16LE Extraction

UTF-16LE extraction is supported, primarily for Windows PE binaries, with confidence scoring and noise filtering integration; see the UTF-16 Extraction section later in this chapter for details. The remainder of this section covers the ASCII/UTF-8 path.

Algorithm

  1. Scan for printable sequences: Characters in range 0x20-0x7E (strict printable ASCII)
  2. Length filtering: Configurable minimum length (default: 4 characters)
  3. Null termination: Respect null terminators but don’t require them
  4. Section awareness: Integrate with section metadata for context-aware filtering

Basic Extraction

use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};

let data = b"Hello\0World\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);

for string in strings {
    println!("Found: {} at offset {}", string.text, string.offset);
}

Configuration

use stringy::extraction::AsciiExtractionConfig;

// Default configuration (min_length: 4, no max_length)
let config = AsciiExtractionConfig::default();

// Custom minimum length
let config = AsciiExtractionConfig::new(8);

// Custom minimum and maximum length
let mut config = AsciiExtractionConfig::default();
config.max_length = Some(256);

UTF-8 Extraction

UTF-8 extraction builds on ASCII extraction and handles multi-byte characters. See the main extraction module for UTF-8 support.

Implementation Details

fn extract_ascii_strings(data: &[u8], min_len: usize) -> Vec<RawString> {
    let mut strings = Vec::new();
    let mut current_string = Vec::new();
    let mut start_offset = 0;

    for (i, &byte) in data.iter().enumerate() {
        if is_printable_ascii(byte) {
            if current_string.is_empty() {
                start_offset = i;
            }
            current_string.push(byte);
        } else {
            if current_string.len() >= min_len {
                strings.push(RawString {
                    data: current_string.clone(),
                    offset: start_offset,
                    encoding: Encoding::Ascii,
                });
            }
            current_string.clear();
        }
    }

    // Flush a trailing run that reaches the end of the buffer unterminated
    if current_string.len() >= min_len {
        strings.push(RawString {
            data: current_string,
            offset: start_offset,
            encoding: Encoding::Ascii,
        });
    }

    strings
}

Noise Filtering

Stringy implements a multi-layered heuristic filtering system to reduce false positives and identify noise in extracted strings. The filtering system uses a combination of entropy analysis, character distribution, linguistic patterns, length checks, repetition detection, and context-aware filtering.

Filter Architecture

The noise filtering system consists of multiple independent filters that can be combined with configurable weights:

  1. Character Distribution Filter: Detects abnormal character frequency distributions
  2. Entropy Filter: Uses Shannon entropy to detect padding/repetition and random binary
  3. Linguistic Pattern Filter: Analyzes vowel-to-consonant ratios and common bigrams
  4. Length Filter: Penalizes excessively long strings and very short strings in low-weight sections
  5. Repetition Filter: Detects repeated character patterns and repeated substrings
  6. Context-Aware Filter: Boosts confidence for strings in high-weight sections

Character Distribution Analysis

Detects strings with abnormal character distributions:

  • Excessive punctuation (>80%): Low confidence (0.2)
  • Excessive repetition (>90% same character): Very low confidence (0.1)
  • Excessive non-alphanumeric (>70%): Low confidence (0.3)
  • Reasonable distribution: High confidence (1.0)
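These thresholds can be sketched as a small scoring function. The helper name and the exact cutoffs mirror the list above but are illustrative, not Stringy's actual API.

```rust
// Score a string's character distribution; lower scores indicate likely noise.
fn char_distribution_confidence(s: &str) -> f32 {
    let total = s.chars().count() as f32;
    if total == 0.0 {
        return 0.0;
    }
    let punct = s.chars().filter(|c| c.is_ascii_punctuation()).count() as f32;
    let alnum = s.chars().filter(|c| c.is_alphanumeric()).count() as f32;

    // Share of the string taken by its single most frequent character.
    let mut counts = std::collections::HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0u32) += 1;
    }
    let max_share = counts.values().copied().max().unwrap_or(0) as f32 / total;

    if max_share > 0.9 {
        0.1 // excessive repetition (>90% same character)
    } else if punct / total > 0.8 {
        0.2 // excessive punctuation
    } else if (total - alnum) / total > 0.7 {
        0.3 // excessive non-alphanumeric
    } else {
        1.0 // reasonable distribution
    }
}
```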

Entropy-Based Filtering

Uses Shannon entropy (bits per byte) to classify strings:

  • Very low entropy (<1.5 bits/byte): Likely padding or repetition (confidence: 0.1)
  • Very high entropy (>7.5 bits/byte): Likely random binary (confidence: 0.2)
  • Optimal range (3.5-6.0 bits/byte): High confidence (1.0)
  • Acceptable range (2.0-7.0 bits/byte): Moderate confidence (0.4-0.7)
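Shannon entropy in bits per byte, as used by these thresholds, is a standard computation and can be written as:

```rust
// Shannon entropy of a byte buffer, in bits per byte (0.0 to 8.0).
fn shannon_entropy(data: &[u8]) -> f32 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f32;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f32 / len;
            -p * p.log2() // each symbol contributes -p * log2(p)
        })
        .sum()
}
```

Padding like `\x00\x00\x00\x00` scores 0.0; uniformly random bytes approach 8.0.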

Linguistic Pattern Detection

Analyzes text for word-like patterns:

  • Vowel-to-consonant ratio: Reasonable range 0.2-0.8 for English
  • Common bigrams: Detects common English patterns (th, he, in, er, an, re, on, at, en, nd)
  • Handles non-English: Gracefully handles non-English strings without over-penalizing

Length-Based Filtering

Applies penalties based on string length:

  • Excessively long (>200 characters): Low confidence (0.3) - likely table data
  • Very short in low-weight sections (<4 chars, weight <0.5): Moderate confidence (0.5)
  • Normal length (4-100 characters): High confidence (1.0)

Repetition Detection

Identifies repetitive patterns:

  • Repeated characters (e.g., “AAAA”, “0000”): Very low confidence (0.1)
  • Repeated substrings (e.g., “abcabcabc”): Low confidence (0.2)
  • Normal strings: High confidence (1.0)
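The repeated-substring case can be sketched as follows (an illustrative helper, byte-based for simplicity): a string is repetitive if it is some shorter pattern tiled end to end.

```rust
// True if the whole string is a shorter pattern repeated, e.g. "abcabcabc"
// ("abc" x 3) or "AAAA" ("A" x 4).
fn is_repeated_substring(s: &str) -> bool {
    let n = s.len();
    for period in 1..=n / 2 {
        // Only a divisor of the length can tile the string exactly.
        if n % period == 0 {
            let pattern = &s.as_bytes()[..period];
            if s.as_bytes().chunks(period).all(|c| c == pattern) {
                return true;
            }
        }
    }
    false
}
```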

Context-Aware Filtering

Boosts or reduces confidence based on section context:

  • String data sections (.rodata, .rdata, __cstring): High confidence (0.9-1.0)
  • Read-only data sections: High confidence (0.9)
  • Resource sections: Maximum confidence (1.0) - known-good sources
  • Code sections: Lower confidence (0.3-0.5)
  • Writable data sections: Moderate confidence (0.6)
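This mapping can be sketched as a single match. The enum is a stand-in for Stringy's own section types, and the exact values within the ranges above are illustrative.

```rust
enum SectionType {
    StringData,   // .rodata, .rdata, __cstring
    ReadOnlyData,
    Resources,
    Code,
    WritableData,
}

// Base confidence contributed by section context.
fn context_confidence(section: &SectionType) -> f32 {
    match section {
        SectionType::Resources => 1.0,    // known-good sources
        SectionType::StringData => 0.95,
        SectionType::ReadOnlyData => 0.9,
        SectionType::WritableData => 0.6,
        SectionType::Code => 0.4,
    }
}
```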

Configuration

use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};

// Default configuration
let config = NoiseFilterConfig::default();

// Customize thresholds
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;

// Customize filter weights
config.filter_weights = FilterWeights {
    entropy_weight: 0.3,
    char_distribution_weight: 0.25,
    linguistic_weight: 0.2,
    length_weight: 0.15,
    repetition_weight: 0.05,
    context_weight: 0.05,
};

Using Noise Filters

use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
use stringy::types::SectionType;

let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();

let confidence = filter.calculate_confidence("Hello, World!", &context);
if confidence >= 0.5 {
    // String passed filtering threshold
}

Confidence Scoring

Each string is assigned a confidence score (0.0-1.0) indicating how likely it is to be legitimate:

  • 1.0: Maximum confidence (strings from known-good sources like imports, exports, resources)
  • 0.7-0.9: High confidence (likely legitimate strings)
  • 0.5-0.7: Moderate confidence (may need review)
  • 0.0-0.5: Low confidence (likely noise, filtered out by default)

The confidence score is separate from the score field used for final ranking. Confidence specifically represents the noise filtering assessment.

Performance

Noise filtering is designed to add minimal overhead (<10% per acceptance criteria). Individual filters are optimized for performance, and the composite filter allows enabling/disabling specific filters to balance accuracy and speed.

UTF-16 Extraction

Critical for Windows binaries and some resources. Supports both UTF-16LE (Little-Endian) and UTF-16BE (Big-Endian) with automatic byte order detection.

UTF-16LE (Little-Endian)

Most common on Windows platforms. The default minimum length is 3 characters.

Detection heuristics:

  • Even-length sequences (2-byte alignment required)
  • Low byte printable, high byte mostly zero
  • Null termination patterns (0x00 0x00)
  • Advanced confidence scoring with multiple heuristics
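For the common ASCII-in-UTF-16LE case, these heuristics reduce to a simple scan: low byte printable, high byte zero. This is an illustrative sketch only; Stringy's real extractor handles full Unicode ranges and confidence scoring.

```rust
// Extract runs of ASCII characters stored as UTF-16LE code units.
fn scan_utf16le_ascii(data: &[u8], min_chars: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    for pair in data.chunks_exact(2) {
        let (lo, hi) = (pair[0], pair[1]);
        if hi == 0 && (0x20..=0x7E).contains(&lo) {
            current.push(lo as char);
        } else {
            // Run broken by a non-printable unit or the 0x00 0x00 terminator.
            if current.len() >= min_chars {
                out.push(current.clone());
            }
            current.clear();
        }
    }
    if current.len() >= min_chars {
        out.push(current); // flush a trailing run
    }
    out
}
```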

UTF-16BE (Big-Endian)

Found in Java .class files, network protocols, and some cross-platform binaries.

Detection heuristics:

  • Even-length sequences
  • High byte printable, low byte mostly zero
  • Reverse byte order from UTF-16LE
  • Same advanced confidence scoring as UTF-16LE

Automatic Byte Order Detection

The ByteOrder::Auto mode automatically detects and extracts both UTF-16LE and UTF-16BE strings from the same data, avoiding duplicates and correctly identifying the encoding of each string.

Implementation

UTF-16 extraction is implemented in src/extraction/utf16.rs following the pattern established in the ASCII extractor. The implementation provides:

  • extract_utf16_strings(): Main extraction function supporting both byte orders
  • extract_utf16le_strings(): UTF-16LE-specific extraction (backward compatibility)
  • extract_from_section(): Section-aware extraction with proper metadata population
  • Utf16ExtractionConfig: Configuration for minimum/maximum character count, byte order selection, and confidence thresholds
  • ByteOrder enum: Control which byte order(s) to scan (LE, BE, Auto)

Usage Example:

use stringy::extraction::utf16::{extract_utf16_strings, Utf16ExtractionConfig, ByteOrder};

// Extract UTF-16LE strings from Windows PE binary
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::LE,
    min_length: 3,
    confidence_threshold: 0.6,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);

// Extract both UTF-16LE and UTF-16BE with auto-detection
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::Auto,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);

Configuration:

use stringy::extraction::utf16::{Utf16ExtractionConfig, ByteOrder};

// Default configuration (min_length: 3, byte_order: Auto, confidence_threshold: 0.5)
let config = Utf16ExtractionConfig::default();

// Custom minimum character length
let config = Utf16ExtractionConfig::new(5);

// Custom configuration
let mut config = Utf16ExtractionConfig::default();
config.min_length = 3;
config.max_length = Some(256);
config.byte_order = ByteOrder::LE;
config.confidence_threshold = 0.6;

UTF-16-Specific Confidence Scoring

UTF-16 extraction uses advanced confidence scoring to detect false positives from null-interleaved binary data. The confidence score combines multiple heuristics:

  1. Valid Unicode range check: Validates code points are in valid Unicode ranges (U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF), penalizes private use areas and invalid surrogates

  2. Printable character ratio: Calculates ratio of printable characters including common Unicode ranges

  3. ASCII ratio: Boosts confidence for ASCII-heavy strings (>50% characters in ASCII printable range)

  4. Null pattern detection: Flags suspicious patterns like:

    • Excessive nulls (>30% of characters)
    • Regular null intervals (every 2nd, 4th, 8th position)
    • Fixed-offset nulls indicating structured binary data
  5. Byte order consistency: Verifies byte order is consistent throughout the string (for Auto mode)

Confidence Formula:

confidence = (valid_unicode_weight × valid_ratio)
           + (printable_weight × printable_ratio)
           + (ascii_weight × ascii_ratio)
           - (null_pattern_penalty)
           - (invalid_range_penalty)

The result is clamped to 0.0-1.0 range.

Examples:

  • High confidence: “Microsoft Corporation” (>90% printable, valid Unicode, no null patterns)
  • Medium confidence: “Test123” (>70% printable, valid Unicode)
  • Low confidence: Null-interleaved binary table data (excessive nulls, regular patterns)

The UTF-16-specific confidence score is combined with general noise filtering confidence when noise filtering is enabled, using the minimum of both scores.

False Positive Prevention

UTF-16 extraction is prone to false positives because binary data with null bytes can look like UTF-16 strings. The confidence scoring system mitigates this by:

  • Detecting null-interleaved patterns: Binary tables with numeric data (e.g., [0x01, 0x00, 0x02, 0x00]) are flagged as suspicious
  • Penalizing regular null patterns: Data with nulls at fixed intervals (every 2nd, 4th, 8th byte) receives lower confidence
  • Validating Unicode ranges: Invalid code points and surrogate pairs reduce confidence
  • Configurable threshold: The utf16_confidence_threshold (default 0.5) can be tuned to balance recall and precision

Recommendations:

  • For Windows PE binaries: Use ByteOrder::LE with confidence_threshold: 0.6
  • For Java .class files: Use ByteOrder::BE with confidence_threshold: 0.5
  • For unknown formats: Use ByteOrder::Auto with confidence_threshold: 0.5
  • For high-precision extraction: Increase confidence_threshold to 0.7-0.8

Performance Considerations

UTF-16 scanning adds overhead compared to ASCII/UTF-8 extraction:

  • Scanning both byte orders: Auto mode doubles the work by scanning for both LE and BE
  • Confidence scoring: The multi-heuristic confidence calculation adds computational cost
  • Recommendations:
    • Use specific byte order (LE or BE) when the target format is known
    • Auto mode is best for unknown or mixed-format binaries
    • Consider disabling UTF-16 extraction for formats that don’t use it (e.g., pure ELF binaries)

Section-Aware Extraction

Different sections have different string extraction strategies.

High-Priority Sections

ELF: .rodata and variants

  • Strategy: Aggressive extraction, low noise filtering
  • Encodings: ASCII/UTF-8 primary, UTF-16 secondary
  • Minimum length: 3 characters

PE: .rdata

  • Strategy: Balanced extraction
  • Encodings: ASCII and UTF-16LE equally
  • Minimum length: 4 characters

Mach-O: __TEXT,__cstring

  • Strategy: High confidence, null-terminated focus
  • Encodings: UTF-8 primary
  • Minimum length: 3 characters

Medium-Priority Sections

ELF: .data.rel.ro

  • Strategy: Conservative extraction
  • Noise filtering: Enhanced
  • Minimum length: 5 characters

PE: .data (read-only)

  • Strategy: Moderate extraction
  • Context checking: Enhanced validation

Low-Priority Sections

Writable data sections

  • Strategy: Very conservative
  • High noise filtering: Skip obvious runtime data
  • Minimum length: 6+ characters

Resource Sections

PE Resources (.rsrc)

  • VERSIONINFO: Extract version strings, product names
  • STRINGTABLE: Localized UI strings
  • RT_MANIFEST: XML manifest data

fn extract_pe_resources(pe: &PE, data: &[u8]) -> Vec<RawString> {
    let mut strings = Vec::new();

    // Extract version info
    if let Some(version_info) = extract_version_info(pe, data) {
        strings.extend(version_info);
    }

    // Extract string tables
    if let Some(string_tables) = extract_string_tables(pe, data) {
        strings.extend(string_tables);
    }

    strings
}

Deduplication Strategy

Canonicalization

Strings are canonicalized while preserving important metadata:

  1. Normalize whitespace: Convert tabs/newlines to spaces
  2. Trim boundaries: Remove leading/trailing whitespace
  3. Case preservation: Maintain original case for analysis
  4. Encoding normalization: Convert to UTF-8 for comparison
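The whitespace steps above can be sketched as follows (an assumed helper; Stringy's actual implementation may differ):

```rust
// Normalize tabs/newlines/space runs to single spaces and trim boundaries,
// while preserving the original case for analysis.
fn canonicalize(text: &str) -> String {
    // split_whitespace collapses all whitespace runs and drops leading/trailing
    // whitespace in one pass.
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```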

Metadata Preservation

When duplicates are found:

struct DeduplicatedString {
    canonical_text: String,
    occurrences: Vec<StringOccurrence>,
    primary_encoding: Encoding,
    best_section: Option<String>,
}

struct StringOccurrence {
    offset: u64,
    section: Option<String>,
    encoding: Encoding,
    length: u32,
}

Deduplication Algorithm

fn deduplicate_strings(strings: Vec<RawString>) -> Vec<DeduplicatedString> {
    let mut map: HashMap<String, DeduplicatedString> = HashMap::new();

    for string in strings {
        let canonical = canonicalize(&string.text);

        map.entry(canonical.clone())
            .or_insert_with(|| DeduplicatedString::new(canonical))
            .add_occurrence(string);
    }

    map.into_values().collect()
}

Configuration Options

Extraction Configuration

use stringy::extraction::{ByteOrder, Encoding, ExtractionConfig};

pub struct ExtractionConfig {
    pub min_ascii_length: usize,          // Default: 4
    pub min_wide_length: usize,           // Default: 3 (for UTF-16)
    pub enabled_encodings: Vec<Encoding>, // Default: ASCII, UTF-8
    pub noise_filtering_enabled: bool,    // Default: true
    pub min_confidence_threshold: f32,    // Default: 0.5
    pub utf16_min_confidence: f32,        // Default: 0.7 (for UTF-16LE)
    pub utf16_byte_order: ByteOrder,      // Default: Auto
    pub utf16_confidence_threshold: f32,  // Default: 0.5 (UTF-16-specific)
}

UTF-16 Configuration Examples:

use stringy::extraction::{ExtractionConfig, Encoding, ByteOrder};

// Extract UTF-16LE strings from Windows PE binary
let mut config = ExtractionConfig::default();
config.min_wide_length = 3;
config.utf16_confidence_threshold = 0.6;
config.utf16_byte_order = ByteOrder::LE;
config.enabled_encodings.push(Encoding::Utf16Le);

// Extract both UTF-16LE and UTF-16BE with auto-detection
let mut config = ExtractionConfig::default();
config.enabled_encodings.push(Encoding::Utf16Le);
config.enabled_encodings.push(Encoding::Utf16Be);
config.utf16_byte_order = ByteOrder::Auto;

Noise Filter Configuration

use stringy::extraction::config::NoiseFilterConfig;

pub struct NoiseFilterConfig {
    pub entropy_min: f32,              // Default: 1.5
    pub entropy_max: f32,              // Default: 7.5
    pub max_length: usize,             // Default: 200
    pub max_repetition_ratio: f32,     // Default: 0.7
    pub min_vowel_ratio: f32,          // Default: 0.1
    pub max_vowel_ratio: f32,          // Default: 0.9
    pub filter_weights: FilterWeights, // Default: balanced weights
}

Filter Weights

use stringy::extraction::config::FilterWeights;

pub struct FilterWeights {
    pub entropy_weight: f32,           // Default: 0.25
    pub char_distribution_weight: f32, // Default: 0.20
    pub linguistic_weight: f32,        // Default: 0.20
    pub length_weight: f32,            // Default: 0.15
    pub repetition_weight: f32,        // Default: 0.10
    pub context_weight: f32,           // Default: 0.10
}

All weights must sum to 1.0. The configuration validates this automatically.
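The check can be sketched as follows (an illustrative stand-in for the configuration's own validation):

```rust
// Verify a set of filter weights sums to 1.0, tolerating float rounding.
fn weights_are_valid(weights: &[f32]) -> bool {
    let sum: f32 = weights.iter().sum();
    (sum - 1.0).abs() < 1e-6
}
```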

Encoding Selection

#[non_exhaustive]
pub enum EncodingFilter {
    /// Match a specific encoding exactly
    Exact(Encoding),
    /// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
    Utf16Any,
}

Section Filtering

pub struct SectionFilter {
    pub include_sections: Option<Vec<String>>,
    pub exclude_sections: Option<Vec<String>>,
    pub include_debug: bool,
    pub include_resources: bool,
}

Performance Optimizations

Memory Mapping

Large files use memory mapping for efficient access via mmap-guard:

fn extract_from_large_file(path: &Path) -> Result<Vec<RawString>> {
    let data = mmap_guard::map_file(path)?;
    // data implements Deref<Target = [u8]>
    extract_strings(&data[..])
}

Note: The Pipeline::run API handles memory mapping automatically.

Parallel Processing

Parallel processing is not yet implemented. Section extraction currently runs sequentially.

Regex Caching

Pattern matching uses cached regex compilation:

lazy_static! {
    static ref URL_REGEX: Regex = Regex::new(r"https?://[^\s]+").unwrap();
    static ref GUID_REGEX: Regex = Regex::new(r"\{[0-9a-fA-F-]{36}\}").unwrap();
}

Quality Assurance

Validation Heuristics

The noise filtering system implements comprehensive validation:

  • Entropy checking: Uses Shannon entropy to detect padding/repetition and random binary data
  • Language detection: Analyzes vowel-to-consonant ratios and common bigrams
  • Context validation: Considers section type, weight, and permissions
  • Character distribution: Detects abnormal frequency distributions
  • Repetition detection: Identifies repeated patterns and padding
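The entropy check above can be sketched with a standard Shannon-entropy computation over bytes (illustrative, not the exact implementation). Padding such as "AAAA" scores near 0 bits per byte; random binary data approaches 8.

```rust
// Shannon entropy in bits per byte. Low values indicate padding or
// repetition; very high values indicate likely random binary data.
fn shannon_entropy(s: &str) -> f32 {
    let bytes = s.as_bytes();
    if bytes.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    let len = bytes.len() as f32;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f32 / len;
            -p * p.log2()
        })
        .sum()
}
```

A filter then compares this value against the configured entropy_min and entropy_max bounds.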

False Positive Reduction

The multi-layered filtering system targets common sources of false positives:

  • Padding detection: Identifies repeated character sequences (e.g., “AAAA”, “\x00\x00\x00\x00”)
  • Table data: Filters excessively long strings likely to be structured data
  • Binary noise: High-entropy strings are flagged as likely random binary
  • Context awareness: Strings in code sections receive lower confidence scores

Performance Characteristics

Noise filtering is designed for minimal overhead:

  • Target overhead: <10% compared to extraction without filtering
  • Optimized filters: Each filter is independently optimized
  • Configurable: Can enable/disable individual filters to balance accuracy and speed
  • Scalable: Handles large binaries efficiently

Examples

Basic Extraction with Filtering

use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};

let data = b"Hello World\0AAAA\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);

let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();

let filtered: Vec<_> = strings
    .into_iter()
    .filter(|s| filter.calculate_confidence(&s.text, &context) >= 0.5)
    .collect();

Custom Filter Configuration

use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};

let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;

config.filter_weights = FilterWeights {
    entropy_weight: 0.4,
    char_distribution_weight: 0.3,
    linguistic_weight: 0.15,
    length_weight: 0.1,
    repetition_weight: 0.03,
    context_weight: 0.02,
};

This comprehensive extraction system ensures high-quality string extraction while maintaining performance and minimizing false positives through multi-layered noise filtering.

Classification System

Stringy applies semantic analysis to extracted strings, identifying patterns that indicate specific types of data. This helps analysts focus on the most relevant information quickly.

Classification Pipeline

Raw String -> Pattern Matching -> Validation -> Tag Assignment

Semantic Categories

URLs

  • Pattern: https?://[^\s<>"{}|\\^\[\]]+
  • Examples: https://example.com/path, http://malware.site/payload
  • Validation: Must start with http:// or https://

Domain Names

  • Pattern: RFC 1035 compliant domain format
  • Examples: example.com, subdomain.evil.site
  • Validation: Valid TLD from known list, not a URL or email

IP Addresses

  • IPv4 Pattern: Standard dotted-decimal notation
  • IPv6 Pattern: Full and compressed formats
  • Examples: 192.168.1.1, ::1, 2001:db8::1
  • Validation: Valid octet ranges for IPv4, proper format for IPv6

File Paths

  • POSIX Pattern: Paths starting with /
  • Windows Pattern: Drive letters (C:\) or relative paths
  • UNC Pattern: \\server\share format
  • Examples: /etc/passwd, C:\Windows\System32, \\server\share\file

Registry Paths

  • Pattern: HKEY_* or HK*\ prefixes
  • Examples: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft
  • Validation: Must start with valid registry root key

GUIDs

  • Pattern: \{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}
  • Examples: {12345678-1234-1234-1234-123456789abc}
  • Validation: Strict format compliance with braces required

Email Addresses

  • Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  • Examples: admin@malware.com, user.name+tag@example.co.uk
  • Validation: Single @, valid TLD length and characters, no empty parts

Base64 Data

  • Pattern: [A-Za-z0-9+/]{20,}={0,2}
  • Examples: U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==
  • Validation: Length >= 20, length divisible by 4, padding rules, entropy threshold
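The structural checks can be sketched as below (the entropy threshold is omitted; looks_like_base64 is an illustrative name, not Stringy API):

```rust
// Structural Base64 validation: minimum length, length divisible by 4,
// at most two '=' padding characters at the end, and a valid alphabet.
fn looks_like_base64(s: &str) -> bool {
    if s.len() < 20 || s.len() % 4 != 0 {
        return false;
    }
    let trimmed = s.trim_end_matches('=');
    // At most two '=' padding characters, only at the end.
    if s.len() - trimmed.len() > 2 {
        return false;
    }
    trimmed
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}
```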

Format Strings

  • Pattern: %[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}
  • Examples: Error: %s at line %d, User {0} logged in
  • Validation: Reasonable specifier count, context-aware thresholds

User Agents

  • Pattern: Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+
  • Examples: Mozilla/5.0 (Windows NT 10.0; Win64; x64), Chrome/117.0.5938.92
  • Validation: Known browser identifiers and minimum length

Pattern Matching Engine

The semantic classifier uses cached regex patterns via once_cell::sync::Lazy and applies validation checks to reduce false positives.

use once_cell::sync::Lazy;
use regex::Regex;

static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$")
        .expect("Invalid GUID regex")
});

Using the Classification System

use stringy::classification::SemanticClassifier;
use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};

let classifier = SemanticClassifier::new();
let context = StringContext::new(
    SectionType::StringData,
    BinaryFormat::Elf,
    Encoding::Ascii,
    StringSource::SectionData,
)
.with_section_name(".rodata".to_string());

let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);
if tags.contains(&Tag::Guid) {
    // Handle GUID indicator
}

Validation Rules

  • GUID: Braced, hyphenated, hex-only format.
  • Email: TLD must be alphabetic and 2-24 characters long; domain must include a dot.
  • Base64: Length must be divisible by 4, padding allowed only at the end, entropy threshold applied.
  • Format String: Must contain at least one specifier and pass context-aware length checks.
  • User Agent: Must contain a known browser token and meet minimum length.

Performance Notes

  • Regexes are compiled once via once_cell::sync::Lazy and reused across calls.
  • Minimum length checks avoid unnecessary regex work on short inputs.
  • The classifier is stateless and thread-safe.

Testing

  • Unit tests: tests/classification_tests.rs
  • Integration tests: tests/classification_integration_tests.rs

Run tests with:

just test

Ranking Algorithm

Stringy’s ranking system prioritizes strings by relevance, helping analysts focus on the most important findings first. The algorithm combines multiple factors to produce a comprehensive relevance score.

Scoring Formula

Final Score = SectionWeight + SemanticBoost - NoisePenalty

Each component contributes to the overall relevance assessment. The resulting internal score is then mapped to a display score (0-100) via band mapping.

Note: Section weights use a 1.0-10.0 scale, and semantic boosts add to the internal score. The pipeline’s normalizer then maps the combined internal score to a 0-100 display score using the band table shown in Display Score Mapping below.

Section Weight

Different sections have varying likelihood of containing meaningful strings. Container parsers assign weights (1.0-10.0) to each section based on its type and name.

Weight Ranges

Section Type             | Typical Weight | Examples
-------------------------|----------------|---------------------------------
Dedicated string storage | 8.0-10.0       | .rodata, __TEXT,__cstring, .rsrc
Read-only data           | 7.0            | .data.rel.ro, __DATA_CONST
General data             | 5.0            | .data
Code sections            | 1.0            | .text

Format-specific adjustments are applied based on section names. For example, ELF .rodata.str1.1 (aligned strings) and PE .rsrc (rich resources) receive additional priority.

Semantic Boost

Strings with recognized semantic meaning receive score boosts based on their tags.

Boost Categories

Tag Category                            | Boost Level | Examples
----------------------------------------|-------------|-----------------------------
Network (URL, Domain, IP)               | High        | https://api.evil.com
Identifiers (GUID, Email)               | High        | {12345678-1234-...}
File System (Path, Registry)            | Medium-High | C:\Windows\System32\evil.dll
User-Agent-like strings                 | Medium-High | Mozilla/5.0 ...
Version/Manifest                        | Medium      | MyApp v1.2.3
Code Artifacts (Format strings, Base64) | Medium      | Error: %s at line %d
Symbols (Import, Export)                | Low-Medium  | CreateFileW, main

Strings with multiple semantic tags receive additional (diminishing) bonuses for each extra tag.
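One plausible diminishing-bonus scheme is a geometric series: each tag beyond the first adds half of the previous tag's bonus. This is illustrative only; the actual decay schedule is an implementation detail.

```rust
// Illustrative diminishing bonus: the second tag adds base/2, the
// third adds base/4, and so on. Not Stringy's actual formula.
fn multi_tag_bonus(tag_count: usize, base_bonus: f32) -> f32 {
    if tag_count <= 1 {
        return 0.0;
    }
    (1..tag_count)
        .map(|i| base_bonus / 2f32.powi(i as i32))
        .sum()
}
```

With this scheme the total extra bonus is bounded by base_bonus no matter how many tags a string carries.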

Noise Penalty

Various factors indicate low-quality or noisy strings, and receive penalties:

Penalty Categories

  • High Entropy: Strings with high Shannon entropy (randomness) are likely binary data or encoded content and receive significant penalties.

  • Excessive Length: Very long strings are often noise (padding, embedded data). Longer strings receive progressively larger penalties.

  • Repeated Patterns: Strings with excessive character repetition (e.g., AAAAAAA...) are penalized based on the repetition ratio.

  • Common Noise Patterns: Known noise patterns receive penalties, including padding characters, hex dump patterns, and table-like data with excessive delimiters.
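The repetition penalty can be sketched with a simple repetition-ratio measure (illustrative, not the exact implementation): 0.0 for fully varied text, approaching 1.0 for runs of a single character like "AAAAAAAA".

```rust
// Fraction of characters that are repeats of already-seen characters.
// "AAAAAAAA" scores 0.875; "abcdefgh" scores 0.0.
fn repetition_ratio(s: &str) -> f32 {
    let chars: Vec<char> = s.chars().collect();
    if chars.is_empty() {
        return 0.0;
    }
    let mut distinct = chars.clone();
    distinct.sort_unstable();
    distinct.dedup();
    1.0 - distinct.len() as f32 / chars.len() as f32
}
```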

Display Score Mapping

The internal score is mapped to a display score (0-100) using bands:

Internal Score | Display Score | Meaning
---------------|---------------|--------------
<= 0           | 0             | Low relevance
1-79           | 1-49          | Low relevance
80-119         | 50-69         | Moderate
120-159        | 70-89         | Meaningful
160-220        | 90-100        | High-value
> 220          | 100 (clamped) | High-value
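A hypothetical band-mapping function matching this table, with linear interpolation inside each band, might look like the following (the real normalizer may differ in how it interpolates):

```rust
// Map an internal relevance score to a 0-100 display score using the
// band table above. Linear interpolation within bands is an assumption.
fn display_score(internal: i32) -> i32 {
    let map = |x: i32, (lo, hi): (i32, i32), (dlo, dhi): (i32, i32)| {
        dlo + (x - lo) * (dhi - dlo) / (hi - lo)
    };
    match internal {
        i if i <= 0 => 0,
        i if i <= 79 => map(i, (1, 79), (1, 49)),
        i if i <= 119 => map(i, (80, 119), (50, 69)),
        i if i <= 159 => map(i, (120, 159), (70, 89)),
        i if i <= 220 => map(i, (160, 220), (90, 100)),
        _ => 100,
    }
}
```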

Filtering Recommendations

  • Interactive analysis: Show display scores >= 50
  • Automated processing: Use display scores >= 70
  • YARA rules: Focus on display scores >= 80
  • High-confidence indicators: Display scores >= 90

Contributing to Stringy

We welcome contributions to Stringy! This guide will help you get started with development, testing, and submitting changes.

Development Setup

Prerequisites

  • Rust: 1.91 or later (MSRV - Minimum Supported Rust Version)
  • Git: For version control
  • just: Task runner (install via cargo install just or your package manager)
  • mise: Tool version manager (manages Zig and other dev tools)
  • Zig: Cross-compiler for test fixtures (managed by mise)

Clone and Setup

git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy

# Generate test fixtures (ELF/PE/Mach-O via Zig cross-compilation)
just gen-fixtures

# Run the full check suite
just check

Development Tools

Install recommended tools for development:

# Code formatting
rustup component add rustfmt

# Linting
rustup component add clippy

# Documentation
cargo install mdbook

# Test runner (required by just recipes)
cargo install cargo-nextest

# Coverage (optional)
cargo install cargo-llvm-cov

Project Structure

src/
+-- main.rs              # CLI entry point (thin wrapper)
+-- lib.rs               # Library root and public API re-exports
+-- types/
|   +-- mod.rs           # Core data structures (Tag, FoundString, Encoding, etc.)
|   +-- error.rs         # StringyError enum, Result alias
+-- container/
|   +-- mod.rs           # Format detection, ContainerParser trait
|   +-- elf.rs           # ELF parser
|   +-- pe.rs            # PE parser
|   +-- macho.rs         # Mach-O parser
+-- extraction/
|   +-- mod.rs           # Extraction orchestration
|   +-- ascii/           # ASCII/UTF-8 extraction
|   +-- utf16/           # UTF-16LE/BE extraction
|   +-- dedup/           # Deduplication with scoring
|   +-- filters/         # Noise filter implementations
|   +-- pe_resources/    # PE version info, manifests, string tables
+-- classification/
|   +-- mod.rs           # Classification framework
|   +-- patterns/        # Regex-based pattern matching
|   +-- symbols.rs       # Symbol processing and demangling
|   +-- ranking.rs       # Scoring algorithm
+-- output/
|   +-- mod.rs           # OutputFormat, OutputMetadata, dispatch
|   +-- json.rs          # JSONL format
|   +-- table/           # TTY and plain text table formatting
|   +-- yara/            # YARA rule generation
+-- pipeline/
    +-- mod.rs           # Pipeline::run orchestration
    +-- config.rs        # PipelineConfig, FilterConfig, EncodingFilter
    +-- filter.rs        # Post-extraction filtering
    +-- normalizer.rs    # Score band mapping

tests/
+-- integration_cli.rs           # CLI argument and flag tests
+-- integration_cli_errors.rs    # CLI error handling tests
+-- integration_elf.rs           # ELF-specific tests
+-- integration_pe.rs            # PE-specific tests
+-- integration_macho.rs         # Mach-O-specific tests
+-- integration_extraction.rs    # Extraction tests
+-- integration_flows_1_5.rs     # End-to-end flow tests (1-5)
+-- integration_flows_6_8.rs     # End-to-end flow tests (6-8)
+-- ... (additional test files)
+-- fixtures/                    # Test binary files (flat structure)
+-- snapshots/                   # Insta snapshot files

docs/
+-- src/                 # mdbook documentation
+-- book.toml            # Documentation config

Development Workflow

1. Create a Branch

git checkout -b feature/your-feature-name
# or
git checkout -b fix/issue-description

2. Make Changes

Use just recipes for development commands:

# Format code
just format

# Lint (clippy with -D warnings)
just lint

# Run tests
just test

# Full pre-commit check (fmt + lint + test)
just check

# Full CI suite locally
just ci-check

3. Test Your Changes

# Generate fixtures if needed
just gen-fixtures

# Run all tests
just test

# Run a specific test
cargo nextest run test_name

# Regenerate snapshots after changing test_binary.c
INSTA_UPDATE=always cargo nextest run

4. Update Documentation

If your changes affect the public API or add new features:

# Update API docs
cargo doc --open

# Update user documentation
cd docs
mdbook serve --open

Coding Standards

Rust Style

  • Use cargo fmt for formatting
  • Follow cargo clippy recommendations (warnings are errors)
  • No unsafe code (#![forbid(unsafe_code)] is enforced)
  • Zero warnings policy
  • ASCII only in source code (no emojis, em-dashes, smart quotes)
  • Files under 500 lines; split larger files into module directories
  • No blanket #[allow] without inline justification

Testing

Write comprehensive tests:

  • Use insta for snapshot testing
  • Binary fixtures in tests/fixtures/ (flat structure)
  • Integration tests use two naming patterns: integration_*.rs and test_*.rs
  • Use assert_cmd for CLI testing (note: assert_cmd is non-TTY)

Contribution Areas

High-Priority Areas

  1. String Extraction Engine - UTF-16 detection improvements, noise filtering enhancements
  2. Classification System - New semantic patterns, improved confidence scoring
  3. Output Formats - Customization options, additional format support

Getting Started Ideas

  • Add new semantic patterns (email formats, crypto constants)
  • Improve test coverage
  • Enhance error messages
  • Add documentation examples

Submitting Changes

Pull Request Process

  1. Fork the repository on GitHub
  2. Create a feature branch from main
  3. Make your changes following the guidelines above
  4. Add tests for new functionality (this is policy, not optional)
  5. Sign off commits with git commit -s (DCO enforced by GitHub App)
  6. Submit a pull request with a clear description

PR Requirements

  • Sign off commits with git commit -s (DCO enforced)
  • Pass CI (clippy, rustfmt, tests, CodeQL, cargo-deny)
  • Include tests for new functionality
  • Be reviewed (human or CodeRabbit) for correctness, safety, and style
  • No unwrap() in library code; no unchecked errors or unvalidated input

Review Process

  1. Automated checks must pass (CI/CD)
  2. Code review by maintainers
  3. Testing on multiple platforms
  4. Documentation review if applicable
  5. Merge after approval

Community Guidelines

Getting Help

  • GitHub Issues: Bug reports and feature requests
  • Discussions: General questions and ideas
  • Documentation: Check existing docs first

Release Process

Version Numbering

We follow Semantic Versioning:

  • MAJOR: Breaking changes
  • MINOR: New features (backward compatible)
  • PATCH: Bug fixes (backward compatible)

Release Checklist

  1. Update version in Cargo.toml
  2. Update changelog via git-cliff
  3. Run full test suite
  4. Update documentation
  5. Create release tag (vX.Y.Z)
  6. Releases are built via cargo-dist

Thank you for contributing to Stringy!

API Documentation

This page provides an overview of Stringy’s public API. For complete API documentation, run cargo doc --open in the project directory.

Core Types

FoundString

The primary data structure representing an extracted string with metadata.

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FoundString {
    /// The extracted string text
    pub text: String,
    /// Pre-demangled form (if symbol was demangled)
    pub original_text: Option<String>,
    /// The encoding used for this string
    pub encoding: Encoding,
    /// File offset where the string was found
    pub offset: u64,
    /// Relative Virtual Address (if available)
    pub rva: Option<u64>,
    /// Section name where the string was found
    pub section: Option<String>,
    /// Length of the string in bytes
    pub length: u32,
    /// Semantic tags applied to this string
    pub tags: Vec<Tag>,
    /// Relevance score for ranking
    pub score: i32,
    /// Section weight component of score (debug only)
    pub section_weight: Option<i32>,
    /// Semantic boost component of score (debug only)
    pub semantic_boost: Option<i32>,
    /// Noise penalty component of score (debug only)
    pub noise_penalty: Option<i32>,
    /// Display score 0-100, populated by ScoreNormalizer in all non-raw executions
    pub display_score: Option<i32>,
    /// Source of the string (section data, import, etc.)
    pub source: StringSource,
    /// UTF-16 confidence score
    pub confidence: f32,
}

Encoding

Supported string encodings.

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

Tag

Semantic classification tags.

#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub enum Tag {
    Url,
    Domain,
    IPv4,
    IPv6,
    FilePath,
    RegistryPath,
    Guid,
    Email,
    Base64,
    FormatString,
    UserAgent,
    DemangledSymbol,
    Import,
    Export,
    Version,
    Manifest,
    Resource,
    DylibPath,
    Rpath,
    RpathVariable,
    FrameworkPath,
}

EncodingFilter

Filter for restricting output by string encoding, corresponding to the --enc CLI flag.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EncodingFilter {
    /// Match a specific encoding exactly
    Exact(Encoding),
    /// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
    Utf16Any,
}

Used with FilterConfig to limit results to a specific encoding. Utf16Any matches both Utf16Le and Utf16Be.
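The matching behavior can be sketched with local stand-in types mirroring the enums above (the matches method here is assumed, not the published API):

```rust
// Stand-in types mirroring Encoding and EncodingFilter; the matches()
// semantics follow the description above.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

enum EncodingFilter {
    Exact(Encoding),
    Utf16Any,
}

impl EncodingFilter {
    fn matches(&self, enc: Encoding) -> bool {
        match self {
            EncodingFilter::Exact(e) => *e == enc,
            EncodingFilter::Utf16Any => {
                matches!(enc, Encoding::Utf16Le | Encoding::Utf16Be)
            }
        }
    }
}
```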

FilterConfig

Post-extraction filtering configuration. All fields have sensible defaults; empty tag vectors are no-ops.

pub struct FilterConfig {
    /// Minimum string length to include (default: 4)
    pub min_length: usize,          // --min-len
    /// Restrict to a specific encoding
    pub encoding: Option<EncodingFilter>, // --enc
    /// Only include strings with these tags (empty = no filter)
    pub include_tags: Vec<Tag>,     // --only-tags
    /// Exclude strings with these tags (empty = no filter)
    pub exclude_tags: Vec<Tag>,     // --no-tags
    /// Limit output to top N strings by score
    pub top_n: Option<usize>,      // --top
}

Builder-style construction:

let config = FilterConfig::new()
    .with_min_length(6)
    .with_encoding(EncodingFilter::Exact(Encoding::Utf8))
    .with_include_tags(vec![Tag::Url, Tag::Domain])
    .with_top_n(20);

Main API Functions

BasicExtractor::extract

Extract strings from binary data using the BasicExtractor, which implements the StringExtractor trait.

pub trait StringExtractor {
    fn extract(
        &self,
        data: &[u8],
        container_info: &ContainerInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;
}

Parameters:

  • data: Binary data to analyze
  • container_info: Parsed container metadata (sections, imports, exports)
  • config: Extraction configuration options

Returns:

  • Result<Vec<FoundString>>: Extracted strings with metadata

Example:

use stringy::{BasicExtractor, ExtractionConfig, StringExtractor};
use stringy::container::{detect_format, create_parser};

let data = std::fs::read("binary.exe")?;
let format = detect_format(&data);
let parser = create_parser(format)?;
let container_info = parser.parse(&data)?;

let extractor = BasicExtractor::new();
let config = ExtractionConfig::default();
let strings = extractor.extract(&data, &container_info, &config)?;

for string in strings {
    println!("{}: {}", string.score, string.text);
}

detect_format

Detect the binary format of the given data.

pub fn detect_format(data: &[u8]) -> BinaryFormat

Parameters:

  • data: Binary data to analyze

Returns:

  • BinaryFormat: Detected format (ELF, PE, MachO, or Unknown)

Example:

use stringy::detect_format;

let data = std::fs::read("binary")?;
let format = detect_format(&data);
println!("Detected format: {:?}", format);

Configuration

ExtractionConfig

Configuration options for string extraction. The struct has 16 fields with sensible defaults.

pub struct ExtractionConfig {
    /// Minimum string length in bytes (default: 1)
    pub min_length: usize,
    /// Maximum string length in bytes (default: 4096)
    pub max_length: usize,
    /// Whether to scan executable sections (default: true)
    pub scan_code_sections: bool,
    /// Whether to include debug sections (default: false)
    pub include_debug: bool,
    /// Section types to prioritize (default: StringData, ReadOnlyData, Resources)
    pub section_priority: Vec<SectionType>,
    /// Whether to include import/export names (default: true)
    pub include_symbols: bool,
    /// Minimum length for ASCII strings (default: 1)
    pub min_ascii_length: usize,
    /// Minimum length for UTF-16 strings (default: 1)
    pub min_wide_length: usize,
    /// Which encodings to extract (default: ASCII, UTF-8)
    pub enabled_encodings: Vec<Encoding>,
    /// Enable/disable noise filtering (default: true)
    pub noise_filtering_enabled: bool,
    /// Minimum confidence threshold (default: 0.5)
    pub min_confidence_threshold: f32,
    /// Minimum UTF-16LE confidence threshold (default: 0.7)
    pub utf16_min_confidence: f32,
    /// Which UTF-16 byte order(s) to scan (default: Auto)
    pub utf16_byte_order: ByteOrder,
    /// Minimum UTF-16-specific confidence threshold (default: 0.5)
    pub utf16_confidence_threshold: f32,
    /// Enable/disable deduplication (default: true)
    pub enable_deduplication: bool,
    /// Deduplication threshold (default: None)
    pub dedup_threshold: Option<usize>,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 1,
            max_length: 4096,
            scan_code_sections: true,
            include_debug: false,
            section_priority: vec![
                SectionType::StringData,
                SectionType::ReadOnlyData,
                SectionType::Resources,
            ],
            include_symbols: true,
            min_ascii_length: 1,
            min_wide_length: 1,
            enabled_encodings: vec![Encoding::Ascii, Encoding::Utf8],
            noise_filtering_enabled: true,
            min_confidence_threshold: 0.5,
            utf16_min_confidence: 0.7,
            utf16_byte_order: ByteOrder::Auto,
            utf16_confidence_threshold: 0.5,
            enable_deduplication: true,
            dedup_threshold: None,
        }
    }
}

SemanticClassifier

The SemanticClassifier is constructed via SemanticClassifier::new() and currently has no configuration options. Classification patterns are built-in.

Pipeline Components

ScoreNormalizer

Maps internal relevance scores to a 0-100 display scale using band mapping.

let normalizer = ScoreNormalizer::new();
normalizer.normalize(&mut strings);
// Each FoundString now has display_score populated

Invoked unconditionally by the pipeline in all non-raw executions. Negative internal scores map to display_score = 0. See Ranking for the full band-mapping table.

FilterEngine

Applies post-extraction filtering and sorting. Consumes the input vector and returns a filtered, sorted result.

let engine = FilterEngine::new();
let filtered = engine.apply(strings, &filter_config);

Filter order:

  1. Minimum length (min_length)
  2. Encoding match (encoding)
  3. Include tags (include_tags – keep only strings with at least one matching tag)
  4. Exclude tags (exclude_tags – remove strings with any matching tag)
  5. Stable sort by score (descending), then offset (ascending), then text (ascending)
  6. Top-N truncation (top_n)
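The stable sort in step 5 can be sketched with a minimal stand-in struct rather than the real FoundString:

```rust
// Stable sort: score descending, then offset ascending, then text
// ascending. Rust's sort_by is stable, so equal keys keep their order.
struct Found {
    score: i32,
    offset: u64,
    text: String,
}

fn sort_found(strings: &mut [Found]) {
    strings.sort_by(|a, b| {
        b.score
            .cmp(&a.score)
            .then_with(|| a.offset.cmp(&b.offset))
            .then_with(|| a.text.cmp(&b.text))
    });
}
```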

Example: FilterConfig + FilterEngine

use stringy::{FilterConfig, FilterEngine, EncodingFilter, Encoding, Tag};

let config = FilterConfig::new()
    .with_min_length(6)
    .with_include_tags(vec![Tag::Url, Tag::Domain])
    .with_top_n(10);

let engine = FilterEngine::new();
let results = engine.apply(strings, &config);
// results contains at most 10 strings, all >= 6 chars,
// all tagged Url or Domain, sorted by score descending

Container Parsing

ContainerParser Trait

Trait for implementing binary format parsers.

pub trait ContainerParser {
    /// Detect if this parser can handle the given data
    fn detect(data: &[u8]) -> bool
    where
        Self: Sized;

    /// Parse the container and extract metadata
    fn parse(&self, data: &[u8]) -> Result<ContainerInfo>;
}
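A minimal detect() implementation for a hypothetical parser might check the format's magic bytes. The \x7fELF signature below is the real ELF magic; MyElfParser is illustrative.

```rust
// Hypothetical parser type; detect() checks the real ELF magic bytes.
struct MyElfParser;

impl MyElfParser {
    fn detect(data: &[u8]) -> bool {
        // Every ELF file begins with 0x7F followed by "ELF".
        data.starts_with(b"\x7fELF")
    }
}
```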

ContainerInfo

Information about a parsed binary container.

pub struct ContainerInfo {
    /// The binary format detected
    pub format: BinaryFormat,
    /// List of sections in the binary
    pub sections: Vec<SectionInfo>,
    /// Import information
    pub imports: Vec<ImportInfo>,
    /// Export information
    pub exports: Vec<ExportInfo>,
    /// Resource metadata (PE format only)
    pub resources: Option<Vec<ResourceMetadata>>,
}

SectionInfo

Information about a section within the binary.

pub struct SectionInfo {
    /// Section name
    pub name: String,
    /// File offset of the section
    pub offset: u64,
    /// Size of the section in bytes
    pub size: u64,
    /// Relative Virtual Address (if available)
    pub rva: Option<u64>,
    /// Classification of the section type
    pub section_type: SectionType,
    /// Whether the section is executable
    pub is_executable: bool,
    /// Whether the section is writable
    pub is_writable: bool,
    /// Weight indicating likelihood of containing meaningful strings (1.0-10.0)
    pub weight: f32,
}

Output Formatting

OutputFormatter Trait

Trait for implementing output formatters.

pub trait OutputFormatter {
    /// Returns the name of this formatter
    fn name(&self) -> &'static str;

    /// Format the strings for output
    fn format(&self, strings: &[FoundString], metadata: &OutputMetadata) -> Result<String>;
}

Built-in Formatters

The library provides free functions rather than formatter structs:

  • format_table(strings, metadata) - Human-readable table format (TTY-aware)
  • format_json(strings, metadata) - JSONL format
  • format_yara(strings, metadata) - YARA rule format
  • format_output(strings, metadata) - Dispatches based on metadata.output_format

Example:

use stringy::output::{format_json, OutputMetadata};

let metadata = OutputMetadata::new("binary.exe".to_string());
let output = format_json(&strings, &metadata)?;
println!("{}", output);

Error Handling

StringyError

Comprehensive error type for the library.

#[derive(Debug, thiserror::Error)]
pub enum StringyError {
    #[error("Unsupported file format (supported: ELF, PE, Mach-O)")]
    UnsupportedFormat,

    #[error("File I/O error: {0}")]
    IoError(#[from] std::io::Error),

    #[error("Binary parsing error: {0}")]
    ParseError(String),

    #[error("Invalid encoding in string at offset {offset}")]
    EncodingError { offset: u64 },

    #[error("Configuration error: {0}")]
    ConfigError(String),

    #[error("Serialization error: {0}")]
    SerializationError(String),

    #[error("Validation error: {0}")]
    ValidationError(String),

    #[error("Memory mapping error: {0}")]
    MemoryMapError(String),
}

Result Type

Convenient result type alias.

pub type Result<T> = std::result::Result<T, StringyError>;

Advanced Usage

Custom Classification

Implement custom semantic classifiers:

use stringy::classification::{ClassificationResult, Classifier};

pub struct CustomClassifier {
    // Custom implementation
}

impl Classifier for CustomClassifier {
    fn classify(&self, text: &str, context: &StringContext) -> Vec<ClassificationResult> {
        // Custom classification logic
        vec![]
    }
}

Memory-Mapped Files

For large files, use memory mapping via mmap-guard:

let data = mmap_guard::map_file(path)?;
// data implements Deref<Target = [u8]>

Note: The Pipeline::run API handles memory mapping automatically. Direct use of mmap_guard is only needed when using lower-level APIs.

Parallel Processing

Parallel processing is not yet implemented. Stringy currently processes files sequentially. The Pipeline API processes one file at a time.

Feature Flags

Stringy currently has no optional feature flags. All functionality is included by default.

Examples

Basic String Extraction (Pipeline API)

use stringy::pipeline::{Pipeline, PipelineConfig};
use std::path::Path;

fn main() -> stringy::Result<()> {
    let config = PipelineConfig::default();
    let pipeline = Pipeline::new(config);
    pipeline.run(Path::new("binary.exe"))?;
    Ok(())
}

Filtered Extraction

use stringy::{BasicExtractor, ExtractionConfig, StringExtractor, Tag};
use stringy::container::{detect_format, create_parser};

fn extract_network_indicators(data: &[u8]) -> stringy::Result<Vec<String>> {
    let format = detect_format(data);
    let parser = create_parser(format)?;
    let container_info = parser.parse(data)?;

    let extractor = BasicExtractor::new();
    let config = ExtractionConfig::default();
    let strings = extractor.extract(data, &container_info, &config)?;

    let network_strings: Vec<String> = strings
        .into_iter()
        .filter(|s| {
            s.tags
                .iter()
                .any(|tag| matches!(tag, Tag::Url | Tag::Domain | Tag::IPv4 | Tag::IPv6))
        })
        .filter(|s| s.score >= 70)
        .map(|s| s.text)
        .collect();

    Ok(network_strings)
}

Custom Output Format

use serde_json::json;
use stringy::output::{OutputMetadata, OutputFormatter};
use stringy::FoundString;

pub struct CustomFormatter;

impl OutputFormatter for CustomFormatter {
    fn name(&self) -> &'static str {
        "custom"
    }

    fn format(&self, strings: &[FoundString], _metadata: &OutputMetadata) -> stringy::Result<String> {
        let output = json!({
            "total_strings": strings.len(),
            "high_confidence": strings.iter().filter(|s| s.score >= 80).count(),
            "strings": strings.iter().take(20).collect::<Vec<_>>()
        });

        Ok(serde_json::to_string_pretty(&output)?)
    }
}

For complete API documentation with all methods and implementation details, run:

cargo doc --open

Configuration

Note

The configuration file system described below is planned but not yet implemented. Stringy currently uses CLI flags exclusively for configuration. See the CLI Reference for available options.

Stringy provides extensive configuration options to customize string extraction, classification, and output formatting. Configuration is currently supplied via command-line arguments or programmatically through the API; configuration file support is planned.

Configuration File

Note: Configuration file support is planned for future releases.

Default Location

~/.config/stringy/config.toml

Example Configuration

[extraction]
min_ascii_len = 4
min_utf16_len = 3
max_string_len = 1024
encodings = ["ascii", "utf16le"]
include_debug = false
include_symbols = true

[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
detect_emails = true
detect_base64 = true
detect_format_strings = true
min_confidence = 0.7

[output]
format = "human"
max_results = 100
show_scores = true
show_offsets = true
color = true

[ranking]
section_weight_multiplier = 1.0
semantic_boost_multiplier = 1.0
noise_penalty_multiplier = 1.0

# Profile-specific configurations
[profiles.security]
encodings = ["ascii", "utf8", "utf16le"]
min_ascii_len = 6
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath"]
min_score = 70

[profiles.yara]
format = "yara"
min_ascii_len = 8
exclude_tags = ["import", "export"]
min_score = 80

[profiles.development]
include_debug = true
include_symbols = true
max_results = 500

Extraction Configuration

String Length Limits

Control the minimum and maximum string lengths:

[extraction]
min_ascii_len = 4     # Minimum ASCII string length
min_utf16_len = 3     # Minimum UTF-16 string length
max_string_len = 1024 # Maximum string length (prevents memory issues)

CLI equivalent:

stringy --min-len 6 --max-len 500 binary

Encoding Selection

Choose which encodings to extract:

[extraction]
encodings = ["ascii", "utf8", "utf16le", "utf16be"]

Available encodings:

  • ascii: 7-bit ASCII
  • utf8: UTF-8 (includes ASCII)
  • utf16le: UTF-16 Little Endian
  • utf16be: UTF-16 Big Endian

CLI equivalent:

stringy --enc ascii,utf16le binary
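UTF-16 is where encoding awareness matters most: utf16le text stores each ASCII character as a little-endian byte pair, which is why plain `strings` emits interleaved noise on it. A minimal, self-contained sketch of LE decoding (illustrative only, not stringy's internals):

```rust
/// Decode a run of UTF-16LE bytes the way an extractor might: pair the
/// bytes little-endian into code units, then validate them as UTF-16.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.is_empty() || bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|p| u16::from_le_bytes([p[0], p[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    // "Hi!" as stored in a Windows binary: each ASCII byte followed by a
    // zero byte -- the interleaving that confuses byte-oriented scanners.
    let raw = [0x48, 0x00, 0x69, 0x00, 0x21, 0x00];
    assert_eq!(decode_utf16le(&raw).as_deref(), Some("Hi!"));
    assert_eq!(decode_utf16le(&[0x48]), None); // odd length: not UTF-16
}
```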

Section Filtering

Control which sections to analyze:

[extraction]
include_sections = [".rodata", ".rdata", "__cstring"]
exclude_sections = [".debug_info", ".comment"]
include_debug = false
include_resources = true

CLI equivalent:

stringy --sections .rodata,.rdata --no-debug binary

Symbol Processing

Configure import/export symbol handling:

[extraction]
include_symbols = true
demangle_rust = true
demangle_cpp = false   # Future feature

CLI equivalent (the flags below disable these options):

stringy --no-symbols --no-demangle binary

Classification Configuration

Pattern Detection

Enable/disable specific semantic patterns:

[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
detect_emails = true
detect_base64 = true
detect_format_strings = true
detect_user_agents = true

Confidence Thresholds

Set minimum confidence levels:

[classification]
min_confidence = 0.7         # Overall minimum confidence
url_min_confidence = 0.8     # URL-specific threshold
domain_min_confidence = 0.75 # Domain-specific threshold
path_min_confidence = 0.6    # File path threshold
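The tag-specific keys act as overrides of the global floor. Resolution might look like this (a sketch of the configuration semantics, not stringy's code):

```rust
use std::collections::HashMap;

/// Resolve the effective confidence threshold for a tag: a tag-specific
/// value wins, otherwise the global min_confidence applies.
fn threshold_for(tag: &str, per_tag: &HashMap<&str, f64>, global: f64) -> f64 {
    per_tag.get(tag).copied().unwrap_or(global)
}

fn main() {
    let per_tag = HashMap::from([("url", 0.8), ("domain", 0.75), ("path", 0.6)]);
    assert_eq!(threshold_for("url", &per_tag, 0.7), 0.8);
    assert_eq!(threshold_for("guid", &per_tag, 0.7), 0.7); // falls back to global
}
```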

Custom Patterns

Add custom regex patterns:

[classification.custom_patterns]
api_key = 'api[_-]?key["\s]*[:=]["\s]*[a-zA-Z0-9]{20,}'
crypto_address = '(bc1|[13])[a-zA-HJ-NP-Z0-9]{25,62}'
jwt_token = 'eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+'
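For instance, the jwt_token pattern above targets three dot-separated base64url segments, the first starting with "eyJ" (the base64 encoding of `{"`, how JWT headers begin). A rough stdlib-only version of the same shape check:

```rust
/// Loose structural check for a JWT-shaped string: three non-empty
/// base64url segments separated by dots, header starting with "eyJ".
fn looks_like_jwt(s: &str) -> bool {
    let parts: Vec<&str> = s.split('.').collect();
    parts.len() == 3
        && parts[0].starts_with("eyJ")
        && parts.iter().all(|p| {
            !p.is_empty()
                && p.bytes()
                    .all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_')
        })
}

fn main() {
    assert!(looks_like_jwt("eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.c2lnbmF0dXJl"));
    assert!(!looks_like_jwt("just.three.parts")); // header does not start with eyJ
    assert!(!looks_like_jwt("eyJub2RvdHM"));      // only one segment
}
```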

Ranking Configuration

Weight Adjustments

Customize section and semantic weights:

[ranking.section_weights]
string_data = 40
resources = 35
readonly_data = 25
debug = 15
writable_data = 10
code = 5
other = 0

[ranking.semantic_boosts]
url = 25
domain = 20
guid = 20
filepath = 15
format_string = 10
import = 8
export = 8
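stringy's exact scoring formula is internal; as an illustration of how additive section weights, semantic boosts, and noise penalties could combine into a 0-100 score (the base value of 50 and the clamping are assumptions for this sketch):

```rust
/// Illustrative score composition: a base value plus section weight plus
/// semantic boosts plus (negative) noise penalties, clamped to 0..=100.
fn score(section_weight: i32, boosts: &[i32], penalties: &[i32]) -> u8 {
    let raw = 50 + section_weight + boosts.iter().sum::<i32>()
        + penalties.iter().sum::<i32>();
    raw.clamp(0, 100) as u8
}

fn main() {
    // A URL in a string-data section: 50 + 40 + 25 = 115, clamped to 100.
    assert_eq!(score(40, &[25], &[]), 100);
    // A long, high-entropy blob in a code section: 50 + 5 - 15 - 20 = 20.
    assert_eq!(score(5, &[], &[-15, -20]), 20);
}
```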

Penalty Configuration

Adjust noise detection penalties:

[ranking.penalties]
high_entropy_threshold = 4.5
high_entropy_penalty = -15
length_penalty_threshold = 200
max_length_penalty = -20
repetition_threshold = 0.7
repetition_penalty = -12
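The high_entropy_threshold is presumably compared against Shannon entropy in bits per byte: a run of one repeated value scores 0.0, uniformly random bytes approach 8.0, and English text sits around 4. A self-contained way to compute it:

```rust
use std::collections::HashMap;

/// Shannon entropy of a byte string in bits per byte.
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in data {
        *counts.entry(b).or_insert(0) += 1;
    }
    let n = data.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Padding-like repetition: entropy 0, no penalty.
    assert_eq!(shannon_entropy(b"AAAAAAAAAAAAAAAA"), 0.0);
    // 24 distinct characters: log2(24) ~ 4.58, above the 4.5 threshold.
    assert!(shannon_entropy(b"Qz9xV2kLmP4sT7yB1nRd6HgJ") > 4.5);
}
```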

Output Configuration

Format Selection

Choose default output format:

[output]
format = "human"  # human, json, yara
max_results = 100 # Limit number of results
show_all = false  # Override max_results limit

Display Options

Customize what information to show:

[output]
show_scores = true
show_offsets = true
show_sections = true
show_encodings = true
show_tags = true
color = true                   # Enable colored output
truncate_long_strings = true
max_string_display_length = 80
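An illustrative version of what max_string_display_length controls, counting characters rather than bytes so multi-byte UTF-8 is never split (the `...` marker is an assumption of this sketch):

```rust
/// Truncate a string for display to at most `max` characters,
/// marking the cut with an ellipsis.
fn truncate_display(s: &str, max: usize) -> String {
    if s.chars().count() <= max {
        s.to_string()
    } else {
        let head: String = s.chars().take(max).collect();
        format!("{head}...")
    }
}

fn main() {
    assert_eq!(truncate_display("short", 80), "short");
    assert_eq!(truncate_display("abcdefgh", 4), "abcd...");
}
```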

Filtering

Set default filters:

[output]
min_score = 50
only_tags = []         # Empty = show all tags
exclude_tags = ["b64"] # Exclude Base64 by default

Format-Specific Configuration

PE Configuration

Windows PE-specific options:

[formats.pe]
extract_version_info = true
extract_manifests = true
extract_string_tables = true
prefer_utf16 = true
include_resource_names = true

ELF Configuration

Linux ELF-specific options:

[formats.elf]
include_build_id = true
include_gnu_version = true
process_dynamic_strings = true
include_note_sections = false

Mach-O Configuration

macOS Mach-O-specific options:

[formats.macho]
process_load_commands = true
include_framework_paths = true
process_fat_binaries = "first" # first, all, or specific arch

Performance Configuration

Memory Management

Control memory usage:

[performance]
use_memory_mapping = true
memory_map_threshold = 10485760 # 10MB
max_memory_usage = 1073741824   # 1GB

Parallel Processing

Configure parallelization:

[performance]
enable_parallel = true
max_threads = 0        # 0 = auto-detect
chunk_size = 1048576   # 1MB chunks

Caching

Enable various caches:

[performance]
cache_regex_compilation = true
cache_section_analysis = true
cache_string_hashes = true

Environment Variables

Override configuration with environment variables:

Variable              Description              Example
STRINGY_CONFIG        Config file path         ~/.stringy.toml
STRINGY_MIN_LEN       Minimum string length    6
STRINGY_FORMAT        Output format            json
STRINGY_MAX_RESULTS   Result limit             50
NO_COLOR              Disable colored output   1
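Resolving one of these overrides is a parse-with-fallback, for example STRINGY_MIN_LEN (the fallback-on-malformed-value behavior is an assumption of this sketch):

```rust
use std::env;

/// Resolve the minimum string length, letting an environment variable
/// override the built-in default; malformed values fall back.
fn min_len(raw: Option<String>, default: usize) -> usize {
    raw.and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let resolved = min_len(env::var("STRINGY_MIN_LEN").ok(), 4);
    println!("effective min length: {resolved}");
    // The pure resolution logic is easy to check without touching the env:
    assert_eq!(min_len(Some("6".into()), 4), 6);
    assert_eq!(min_len(Some("not-a-number".into()), 4), 4);
    assert_eq!(min_len(None, 4), 4);
}
```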

Profiles

Use predefined configuration profiles:

Security Analysis Profile

stringy --profile security malware.exe

Equivalent to:

min_ascii_len = 6
encodings = ["ascii", "utf8", "utf16le"]
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath"]
min_score = 70

YARA Development Profile

stringy --profile yara suspicious.dll

Equivalent to:

format = "yara"
min_ascii_len = 8
exclude_tags = ["import", "export"]
min_score = 80
max_results = 50

Development Profile

stringy --profile dev application

Equivalent to:

include_debug = true
include_symbols = true
max_results = 500
show_all_metadata = true

Validation

Configuration validation ensures settings are compatible:

# This would generate a warning
[extraction]
min_ascii_len = 10
max_string_len = 5 # Invalid: min > max

Migration

When upgrading Stringy versions, configuration migration is handled automatically:

# Backup current config
cp ~/.config/stringy/config.toml ~/.config/stringy/config.toml.backup

# Stringy will migrate on first run
stringy --version

Examples

Minimal Configuration

[extraction]
min_ascii_len = 6

[output]
format = "json"
max_results = 50

Comprehensive Security Analysis

[extraction]
min_ascii_len = 6
min_utf16_len = 4
encodings = ["ascii", "utf8", "utf16le"]
include_debug = false

[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
min_confidence = 0.8

[output]
format = "json"
min_score = 70
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath", "guid"]

[ranking.semantic_boosts]
url = 30
domain = 25
ipv4 = 25
ipv6 = 25
filepath = 20
regpath = 20
guid = 20

This flexible configuration system allows Stringy to be adapted for various use cases, from interactive analysis to automated security pipelines.

Performance

Stringy is designed for efficient analysis of binary files, from small executables to large system libraries.

How It Works

Stringy memory-maps input files via mmap-guard for zero-copy access, then processes sections in weight-priority order. Regex patterns for semantic classification are compiled once using LazyLock statics.

The processing pipeline is single-threaded and sequential:

  1. Format detection and section analysis – O(n) where n = number of sections
  2. String extraction – O(m) where m = total section size
  3. Deduplication – hash-based grouping of identical strings
  4. Classification – O(k) where k = number of unique strings
  5. Ranking and sorting – O(k log k)
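Step 3's hash-based grouping can be sketched in isolation (types heavily simplified; the real FoundString keeps richer metadata than a section name and offset):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Hit {
    text: String,
    section: String,
    offset: usize,
}

/// Group identical strings: the first-seen location becomes the canonical
/// metadata, and every later occurrence only bumps a count.
fn dedup(hits: Vec<Hit>) -> Vec<(Hit, usize)> {
    let mut seen: HashMap<String, usize> = HashMap::new(); // text -> index in out
    let mut out: Vec<(Hit, usize)> = Vec::new();
    for h in hits {
        match seen.get(&h.text) {
            Some(&i) => out[i].1 += 1,
            None => {
                seen.insert(h.text.clone(), out.len());
                out.push((h, 1));
            }
        }
    }
    out
}

fn main() {
    let hits = vec![
        Hit { text: "GET".into(), section: ".rodata".into(), offset: 0x10 },
        Hit { text: "GET".into(), section: ".data".into(), offset: 0x80 },
        Hit { text: "POST".into(), section: ".rodata".into(), offset: 0x20 },
    ];
    let unique = dedup(hits);
    assert_eq!(unique.len(), 2);
    assert_eq!(unique[0].1, 2); // "GET" seen twice, first location kept
    println!("{unique:?}");
}
```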

Reducing Processing Time

Use CLI flags to narrow the work Stringy does:

# Limit to top results (skip sorting the long tail)
stringy --top 50 binary

# Increase minimum length to reduce noise and string count
stringy --min-len 8 binary

# Restrict to a single encoding (skip UTF-16 detection)
stringy --enc ascii binary

# Skip classification and ranking entirely
stringy --raw binary

--raw mode is the fastest option – it extracts and deduplicates strings without running the classifier or ranker.

Benchmarking

Stringy includes Criterion benchmarks for core components:

# Run all benchmarks
just bench

# Run a specific benchmark
cargo bench --bench elf
cargo bench --bench pe
cargo bench --bench classification
cargo bench --bench ascii_extraction

Profiling

# CPU profiling with perf (Linux)
perf record --call-graph dwarf -- stringy large_file.exe
perf report

# macOS profiling with Instruments
xcrun xctrace record --template "Time Profiler" --launch -- stringy binary

# Memory profiling
/usr/bin/time -l stringy large_file.exe   # macOS
/usr/bin/time -v stringy large_file.exe   # Linux

Batch Processing

Stringy processes one file per invocation. For batch workflows, use standard Unix tools:

# Process multiple files
find /path/to/binaries -type f -exec stringy --json {} \; > all_strings.jsonl

# Parallel processing with xargs
find /binaries -name "*.exe" -print0 | xargs -0 -P 4 -I {} stringy --json {} > results.jsonl

Troubleshooting

This guide helps resolve common issues when using Stringy. If you don’t find a solution here, please check the GitHub issues or create a new issue.

Installation Issues

“cargo: command not found”

Problem: Rust/Cargo is not installed or not in PATH.

Solution:

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Verify installation
cargo --version

Build Failures

Problem: Compilation errors during cargo build.

Common causes and solutions:

Outdated Rust Version

# Update Rust
rustup update

# Check version (should be 1.91+)
rustc --version

Missing System Dependencies

# Ubuntu/Debian
sudo apt update
sudo apt install build-essential pkg-config

# Fedora/RHEL
sudo dnf groupinstall "Development Tools"
sudo dnf install pkg-config

# macOS
xcode-select --install

Network Issues

# Use alternative registry
export CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse

# Or use offline mode if dependencies are cached
cargo build --offline

Permission Denied

Problem: Cannot execute the binary after installation.

Solution:

# Make binary executable
chmod +x ~/.cargo/bin/stringy

# Or reinstall with proper permissions
cargo install --path . --force

Runtime Issues

“Unsupported file format”

Problem: Stringy cannot detect the binary format.

Diagnosis:

# Check file type
file binary_file

# Check if it's actually a binary
hexdump -C binary_file | head

Note: Unknown or unparseable formats (plain text, etc.) do not trigger this error: Stringy falls back to unstructured raw byte scanning and succeeds. The message only appears if something else goes wrong during parsing.

“Permission denied” when reading files

Problem: Cannot read the target binary file.

Solutions:

# Check file permissions
ls -l binary_file

# Make readable
chmod +r binary_file

# Run with appropriate privileges
sudo stringy system_binary

Output Issues

No Strings Found

Problem: Stringy reports no strings in a binary that should have strings.

Diagnosis:

# Check with traditional strings command
strings binary_file | head -20

# Try different encodings (one at a time)
stringy --enc ascii binary_file
stringy --enc utf8 binary_file
stringy --enc utf16le binary_file
stringy --enc utf16be binary_file

# Lower minimum length
stringy --min-len 1 binary_file

Common causes:

  • Packed or encrypted binary
  • Unusual string encoding
  • Strings in unexpected sections
  • Very short strings below minimum length

Garbled Output

Problem: String output contains garbled or binary characters.

Solutions:

# Force specific encoding
stringy --enc ascii binary_file

# Increase minimum length to filter noise
stringy --min-len 6 binary_file

# Use JSON output which properly escapes invalid sequences
stringy --json binary_file | jq '.text'

# Filter by score
stringy --json binary_file | jq 'select(.score > 70)'

Missing Expected Strings

Problem: Known strings are not appearing in output.

Diagnosis:

# Check if strings exist with traditional tools
strings binary_file | grep "expected_string"

# Try each encoding
stringy --enc ascii binary_file | grep "expected"
stringy --enc utf8 binary_file | grep "expected"
stringy --enc utf16le binary_file | grep "expected"

# Lower score threshold
stringy --json binary_file | jq 'select(.score > 0)' | grep "expected"

Error Messages

“--summary requires a TTY”

Problem: --summary flag used when stdout is piped or redirected.

Solution: --summary only works when stdout is a terminal. Remove --summary when piping output:

# This will error (exit 1):
stringy --summary binary | grep foo

# This works:
stringy --summary binary

Tag overlap error

Problem: Same tag appears in both --only-tags and --no-tags.

Solution: Remove the duplicate tag from one of the two flags. This is a runtime validation error (exit 1).

“Invalid UTF-8 sequence”

Problem: String contains invalid UTF-8 bytes.

Solution: This is usually normal for binary data. Stringy handles this automatically, but you can:

# Use ASCII only to avoid UTF-8 issues
stringy --enc ascii binary_file

# Use JSON output which properly escapes invalid sequences
stringy --json binary_file

“Regex compilation failed”

Problem: Internal regex pattern compilation error.

Solution: This indicates a bug. Please report it with:

# Get version information
stringy --version

File Not Found

Problem: The specified file does not exist.

Exit code: 1

Solution: Check the file path and ensure the file exists:

ls -l /path/to/binary

Performance Issues

Very Slow Processing

Problem: Stringy takes too long to process files.

Solutions:

# Increase minimum length to reduce extraction volume
stringy --min-len 8 large_file.exe

# Limit results
stringy --top 50 large_file.exe

# Use ASCII only
stringy --enc ascii large_file.exe

# Use raw mode (skip classification)
stringy --raw large_file.exe

High Memory Usage

Problem: Stringy uses too much memory.

Solutions:

# Limit results
stringy --top 100 file.exe

# Increase minimum length
stringy --min-len 8 file.exe

# Use raw mode to skip classification
stringy --raw file.exe

Exit Code Reference

Code   Meaning
0      Success (including unknown binary format, empty binary, no filter matches)
1      Runtime error (file not found, tag overlap, --summary in non-TTY)
2      Argument parsing error (invalid flag, flag conflict, invalid tag name)

Debugging Tips

Compare with Traditional Tools

# Compare with standard strings
strings binary_file > traditional.txt
stringy --json binary_file | jq -r '.text' > stringy.txt
diff traditional.txt stringy.txt

Test with Known Good Files

# Test with system binaries
stringy /bin/ls        # Linux
stringy /bin/cat       # Linux
stringy /usr/bin/grep  # macOS

Getting Help

Information to Include in Bug Reports

  1. System information:

    stringy --version
    rustc --version
    uname -a  # Linux/macOS
    
  2. File information:

    file binary_file
    ls -l binary_file
    
  3. Exact command line used

  4. Expected vs actual behavior

Where to Get Help

  1. Documentation: Check this guide and the project documentation
  2. GitHub Issues: Search existing issues or create a new one
  3. Discussions: Use GitHub Discussions for questions and ideas