
Introduction

Stringy is a smarter alternative to the standard strings command that uses binary analysis to extract meaningful strings from executables. Unlike traditional string extraction tools, Stringy focuses on data structures rather than arbitrary byte runs.

Why Stringy?

The standard strings command has several limitations:

  • Noise: Dumps every printable byte sequence, including padding and table data
  • UTF-16 Issues: Produces interleaved garbage when scanning UTF-16 strings
  • No Context: Provides no information about where strings come from
  • No Prioritization: Treats all strings equally, regardless of relevance

Stringy addresses these issues by being:

  • Data-structure aware: Only extracts strings from actual binary data structures
  • Section-aware: Prioritizes meaningful sections like .rodata, .rdata, __cstring
  • Encoding-aware: Properly handles ASCII/UTF-8, UTF-16LE, and UTF-16BE
  • Semantically intelligent: Identifies and tags URLs, domains, file paths, GUIDs, etc.
  • Ranked: Presents the most relevant strings first

Key Features

Multi-Format Support

  • ELF (Linux executables and libraries)
  • PE (Windows executables and DLLs)
  • Mach-O (macOS executables and frameworks)

Smart String Extraction

  • Section-aware extraction prioritizing string-rich sections
  • Multi-encoding support (ASCII, UTF-8, UTF-16LE/BE)
  • Deduplication with metadata preservation
  • Configurable minimum length filtering

Semantic Classification

  • Network: URLs, domains, IP addresses
  • Filesystem: File paths, registry keys
  • Identifiers: GUIDs, email addresses, user agents
  • Code: Format strings, Base64 data
  • Symbols: Import/export names, demangled symbols

Multiple Output Formats

  • Human-readable: Sorted tables for interactive analysis
  • JSONL: Machine-readable format for automation
  • YARA-friendly: Optimized for security rule creation

Use Cases

Binary Analysis & Reverse Engineering

Extract meaningful strings to understand program functionality, identify libraries, and discover embedded resources.

Malware Analysis

Quickly identify network indicators, file paths, registry keys, and other artifacts of interest in suspicious binaries.

YARA Rule Development

Generate high-confidence string candidates for creating detection rules, with automatic escaping and formatting.

Security Research

Analyze binaries for hardcoded credentials, API endpoints, configuration data, and other security-relevant strings.

Project Status

Stringy is in active development with a solid foundation already in place. The core infrastructure is complete and robust:

Implemented:

  • Complete binary format detection (ELF, PE, Mach-O)
  • Comprehensive section classification with intelligent weighting
  • Import/export symbol extraction from all formats
  • String extraction engines (ASCII/UTF-8, UTF-16LE/BE)
  • Semantic classification system (URLs, paths, GUIDs, etc.)
  • Ranking, scoring, and normalization algorithms
  • Output formatters (table, JSONL, YARA)
  • Full CLI interface with filtering, encoding, and mode flags
  • Noise filtering with multi-layered heuristics
  • Type-safe error handling and data structures
  • Extensible architecture with trait-based parsers

See the Architecture Overview for technical details and the Contributing guide to get involved.

Installation

Pre-built Binaries

Pre-built binaries for Linux, macOS, and Windows are available on the Releases page.

Download the appropriate archive for your platform, extract it, and place the stringy binary somewhere on your PATH.

From Source

Prerequisites

  • Rust: Version 1.91 or later (see rustup.rs if you need to install Rust)
  • Git: For cloning the repository

Build and Install

git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy
cargo install --path .

This installs the stringy binary to ~/.cargo/bin/, which should be in your PATH.

Verify Installation

stringy --version

Development Build

For development and testing, Stringy uses just and mise to manage tooling:

git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy
just setup         # Install tools and components
just gen-fixtures  # Generate test fixtures (requires Zig via mise)
just test          # Run tests

If you do not use just, the minimum requirements are:

cargo build --release
cargo test

Troubleshooting

Build Failures

Update Rust to the latest version:

rustup update

Clear the build cache:

cargo clean
cargo build --release

Getting Help

If you encounter issues:

  1. Check the troubleshooting guide
  2. Search existing GitHub issues
  3. Open a new issue with your OS, Rust version (rustc --version), and complete error output

Next Steps

Once installed, see the Quick Start guide to begin using Stringy.

Quick Start

This guide will get you up and running with Stringy in minutes.

Basic Usage

Analyze a Binary

stringy /path/to/binary

Stringy will:

  • Detect ELF, PE, or Mach-O format automatically
  • Extract ASCII and UTF-16 strings from prioritized sections
  • Apply semantic classification (URLs, paths, GUIDs, etc.)
  • Rank results by relevance and display them in a table

Example Output (TTY)

String                                   Tags              Score  Section
------                                   ----              -----  -------
https://api.example.com/v1/users         url                 95   .rdata
{12345678-1234-1234-1234-123456789abc}   guid                87   .rdata
/usr/local/bin/application               filepath            82   __cstring
Error: %s at line %d                     fmt                 78   .rdata
MyApplication v1.2.3                     version             75   .rsrc

Common Use Cases

Security Analysis

Extract network indicators and file paths:

stringy --only-tags url --only-tags domain --only-tags filepath --only-tags regpath malware.exe

YARA Rule Development

Generate rule candidates:

stringy --yara --min-len 8 target.bin > candidates.yar

JSON Output for Automation

stringy --json --debug binary.elf | jq 'select(.display_score > 80)'

Extraction-Only Mode

Skip classification and ranking for fast raw extraction:

stringy --raw binary

Understanding the Output

Score Column

Strings are ranked using a display score from 0-100:

  • 90-100: High-value indicators (URLs, GUIDs in high-priority sections)
  • 70-89: Meaningful strings (file paths, format strings)
  • 50-69: Moderate relevance (imports, version info)
  • 0-49: Low relevance (short or noisy strings)

See Output Formats for the full band-mapping table.

Tags

Semantic classifications help identify string types:

Tag | Description | Example
url | Web URLs | https://example.com/api
domain | Domain names | api.example.com
ipv4/ipv6 | IP addresses | 192.168.1.1
filepath | File paths | /usr/bin/app
regpath | Registry paths | HKEY_LOCAL_MACHINE\...
guid | GUIDs/UUIDs | {12345678-1234-...}
email | Email addresses | user@example.com
b64 | Base64 data | SGVsbG8gV29ybGQ=
fmt | Format strings | Error: %s
import/export | Symbol names | CreateFileW
demangled | Demangled symbols | std::io::Read::read
user-agent-ish | User-agent-like strings | Mozilla/5.0 ...
version | Version strings | v1.2.3
manifest | Manifest data | PE/Mach-O embedded XML
resource | Resource strings | PE VERSIONINFO/STRINGTABLE
dylib-path | Dynamic library paths | /usr/lib/libfoo.dylib
rpath | Runtime search paths | /usr/local/lib
rpath-var | Rpath variables | @loader_path/../lib
framework-path | Framework paths (macOS) | /System/Library/...

Sections

Shows where strings were found:

  • ELF: .rodata, .data.rel.ro, .comment
  • PE: .rdata, .rsrc, version info
  • Mach-O: __TEXT,__cstring, __DATA_CONST

Filtering and Options

By String Length

# Minimum 6 characters
stringy --min-len 6 binary

By Encoding

# ASCII only
stringy --enc ascii binary

# UTF-16 only (useful for Windows binaries)
stringy --enc utf16 binary.exe

By Tags

# Only network-related strings
stringy --only-tags url --only-tags domain --only-tags ipv4 --only-tags ipv6 binary

# Exclude Base64 noise
stringy --no-tags b64 binary

Limit Results

# Top 50 results
stringy --top 50 binary

Summary

Append a summary block after table output (TTY only):

stringy --summary binary

Output Formats

Table (Default)

Best for interactive analysis:

stringy binary

JSON Lines

For programmatic processing:

stringy --json binary | jq 'select(.tags[] == "Url")'

YARA Format

For security rule creation:

stringy --yara binary > rule_candidates.yar

Tips and Best Practices

Start Broad, Then Focus

  1. Run basic analysis first: stringy binary
  2. Identify interesting patterns in high-scoring results
  3. Use filters to focus: --only-tags url --only-tags filepath

Combine with Other Tools

# Find strings, then search for references
stringy --json binary | jq -r 'select(.score > 80) | .text' | xargs -I {} grep -r "{}" /path/to/source

# Extract URLs for further analysis
stringy --only-tags url --json binary | jq -r '.text' | sort -u

Performance Considerations

  • Use --top N to limit output for large binaries
  • Use --enc to restrict to a single encoding
  • Consider --min-len to reduce noise

Next Steps

Command Line Interface

Basic Syntax

stringy [OPTIONS] <FILE>
stringy [OPTIONS] -        # read from stdin

Options

Input/Output

Option | Description | Default
<FILE> | Binary file to analyze (use - for stdin) | -
--json | JSONL output; conflicts with --yara | -
--yara | YARA rule output; conflicts with --json | -
--help | Show help | -
--version | Show version | -

Filtering

Option | Description | Default
--min-len N | Minimum string length (must be >= 1) | 4
--top N | Limit to top N strings by score (applied after all filters) | -
--enc ENCODING | Filter by encoding: ascii, utf8, utf16, utf16le, utf16be | all
--only-tags TAG | Include strings with any of these tags (OR); repeatable | all
--no-tags TAG | Exclude strings with any of these tags; repeatable | none

Mode Flags

Option | Description
--raw | Extraction-only mode (no tagging, ranking, or scoring); conflicts with --only-tags, --no-tags, --top, --debug, --yara
--summary | Append summary block (TTY table mode only); conflicts with --json, --yara
--debug | Include score-breakdown fields (section_weight, semantic_boost, noise_penalty) in JSON output; conflicts with --raw

Encoding Options

The --enc flag accepts exactly one encoding value per invocation:

Value | Description
ascii | 7-bit ASCII only
utf8 | UTF-8 (includes ASCII)
utf16 | UTF-16 (both little- and big-endian)
utf16le | UTF-16 Little Endian only
utf16be | UTF-16 Big Endian only

Examples

# ASCII only
stringy --enc ascii binary

# UTF-16 only (common for Windows)
stringy --enc utf16 app.exe

# UTF-8 only
stringy --enc utf8 binary

Tag Filtering

Tags are specified with the repeatable --only-tags and --no-tags flags. Repeat the flag for each tag value:

# Network indicators only
stringy --only-tags url --only-tags domain --only-tags ipv4 --only-tags ipv6 malware.exe

# Exclude noisy Base64
stringy --no-tags b64 binary

# File system related
stringy --only-tags filepath --only-tags regpath app.exe

Available Tags

Tag | Description | Example
url | HTTP/HTTPS URLs | https://api.example.com
domain | Domain names | example.com
ipv4 | IPv4 addresses | 192.168.1.1
ipv6 | IPv6 addresses | 2001:db8::1
filepath | File paths | /usr/bin/app
regpath | Registry paths | HKEY_LOCAL_MACHINE\...
guid | GUIDs/UUIDs | {12345678-1234-...}
email | Email addresses | user@example.com
b64 | Base64 data | SGVsbG8=
fmt | Format strings | Error: %s
user-agent-ish | User-agent-like strings | Mozilla/5.0 ...
demangled | Demangled symbols | std::io::Read::read
import | Import names | CreateFileW
export | Export names | main
version | Version strings | v1.2.3
manifest | Manifest data | XML/JSON config
resource | Resource strings | UI text
dylib-path | Dynamic library paths | /usr/lib/libfoo.dylib
rpath | Runtime search paths | /usr/local/lib
rpath-var | Rpath variables | @loader_path/../lib
framework-path | Framework paths (macOS) | /System/Library/Frameworks/...

Output Formats

Table (Default, TTY)

When stdout is a TTY, results are shown as a table with columns:

String | Tags | Score | Section

When piped (non-TTY), output is plain text with one string per line and no headers.

JSON Lines (--json)

Each line is a JSON object with full metadata. See Output Formats for the schema.

YARA (--yara)

Generates a YARA rule template. See Output Formats for details.

Exit Codes

Code | Meaning
0 | Success (including unknown binary format, empty binary, no filter matches)
1 | General runtime error
2 | Configuration or validation error (tag overlap, --summary in non-TTY)
3 | File not found
4 | Permission denied

Clap argument parsing errors (invalid flag, flag conflict, invalid tag name) use clap’s own exit code (typically 2).

Advanced Usage

Pipeline Integration

# Extract URLs and check them
stringy --only-tags url --json binary | jq -r '.text' | xargs -I {} curl -I {}

# Find high-score strings
stringy --json binary | jq 'select(.score > 80)'

# Count strings by tag
stringy --json binary | jq -r '.tags[]' | sort | uniq -c

Batch Processing

# Process multiple files
find /path/to/binaries -type f -exec stringy --json {} \; > all_strings.jsonl

# Compare two versions
stringy --json old_binary > old.jsonl
stringy --json new_binary > new.jsonl
diff <(jq -r '.text' old.jsonl | sort) <(jq -r '.text' new.jsonl | sort)

Focused Analysis

# Fast scan for high-value strings only
stringy --top 20 --min-len 8 --only-tags url --only-tags guid --only-tags filepath large_binary

# Extraction-only mode (no classification overhead)
stringy --raw binary

Output Formats

Stringy supports three output formats optimized for different use cases.

Table Format (Default)

TTY Mode

When stdout is a TTY, results are shown as an aligned table. Columns appear in this order:

String                                   Tags              Score  Section
------                                   ----              -----  -------
https://api.example.com/v1/users         Url                 95   .rdata
{12345678-1234-1234-1234-123456789abc}   guid                87   .rdata
/usr/local/bin/application               filepath            82   __cstring
Error: %s at line %d                     fmt                 78   .rdata

Features:

  • Truncation: Long strings are truncated with ... indicator
  • Sorting: Results sorted by score (highest first)
  • Alignment: Columns properly aligned for readability
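
The truncation behavior can be sketched as a character-aware helper. This is illustrative only: the display width and the exact placement of the ... marker are assumptions, not Stringy's actual implementation.

```rust
/// Truncate a cell to `width` characters, appending "..." when cut.
/// Width and marker are assumptions for illustration.
fn truncate_cell(s: &str, width: usize) -> String {
    if s.chars().count() <= width {
        s.to_string()
    } else {
        // Reserve three characters for the "..." indicator.
        let cut: String = s.chars().take(width.saturating_sub(3)).collect();
        format!("{cut}...")
    }
}
```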

Plain Text (Piped / Non-TTY)

When stdout is piped, output switches to plain text with one string per line and no headers or table formatting. This is designed for downstream tool consumption.

--summary Block

When --summary is passed (TTY mode only; conflicts with --json and --yara), a summary block is appended after the table showing aggregate statistics about the extraction.

JSON Lines Format

Machine-readable format with one JSON object per line (JSONL), ideal for automation and pipeline integration.

stringy --json binary

Example Output

{"text":"https://api.example.com/v1/users","encoding":"Ascii","offset":4096,"rva":4096,"section":".rdata","length":31,"tags":["Url"],"score":95,"confidence":1.0,"source":"SectionData"}
{"text":"{12345678-1234-1234-1234-123456789abc}","encoding":"Ascii","offset":8192,"rva":8192,"section":".rdata","length":38,"tags":["guid"],"score":87,"confidence":0.95,"source":"SectionData"}

Schema

Each JSON object contains:

Field | Type | Description
text | string | The extracted string (demangled if applicable)
original_text | string or null | Original mangled form (present only when demangled)
encoding | string | Encoding: Ascii, Utf8, Utf16Le, Utf16Be
offset | number | File offset in bytes
rva | number or null | Relative Virtual Address (if available)
section | string or null | Section name where found
length | number | String length in bytes
tags | array | Semantic classification tags
score | number | Internal relevance score
display_score | number or null | Display score (0-100 band-mapped); only present with --debug
confidence | number | Confidence score from noise filtering (0.0-1.0)
source | string | Source type: SectionData, ImportName, ExportName, etc.

Debug Fields

When --debug is passed, four additional fields appear:

Field | Type | Description
display_score | number or null | Display score (0-100 band-mapped)
section_weight | number or null | Section weight contribution to score
semantic_boost | number or null | Semantic classification boost
noise_penalty | number or null | Noise penalty applied

Raw Mode

With --raw --json, output contains extraction-only data: score is 0, tags is empty, and display_score is absent.

Processing Examples

# Extract only URLs
stringy --json binary | jq 'select(.tags[] == "Url") | .text'

# High-score strings only
stringy --json binary | jq 'select(.score > 80)'

# Group by section
stringy --json binary | jq -r '.section' | sort | uniq -c

# Find strings in specific section
stringy --json binary | jq 'select(.section == ".rdata")'

YARA Format

Specialized format for creating YARA detection rules with proper escaping and metadata.

stringy --yara binary

Example Output

// YARA rule generated by Stringy
// Binary: binary
// Generated: 1234567890

rule binary_strings {
  meta:
    description = "Strings extracted from binary"
    generated_by = "stringy"
    generated_at = "1234567890"
  strings:
    // tag: filepath
    // score: 82
    $filepath_1 = "/usr/local/bin/application" ascii
    // tag: fmt
    // score: 78
    $fmt_1 = "Error: %s at line %d" ascii
    // tag: Url
    // score: 95
    $Url_1 = "https://api.example.com/v1/users" ascii
    // skipped (length > 200 chars): 245
  condition:
    any of them
}

Features

  • Rule naming: Rule name is derived from the filename with non-alphanumeric characters replaced by _ and a _strings suffix added
  • Tag grouping: Strings are grouped by their first tag with // tag: <name> comments and per-string // score: <N> annotations
  • Variable naming: Variables use tag-derived names (e.g., $Url_1, $filepath_1, $fmt_1) rather than sequential $sN
  • Proper escaping: Handles special characters and binary data
  • Long string handling: Strings over 200 characters are replaced with // skipped (length > 200 chars): N (where N is the character count)
  • Modifiers: Appropriate ascii/wide modifiers based on encoding
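
The rule-naming and long-string rules above can be sketched as two small helpers. These are illustrative, not the actual src/output/yara/ code; in particular, the escaping shown (backslash and quote only) is a simplification of the real escaping logic.

```rust
/// Derive a YARA rule name: non-alphanumeric characters become '_'
/// and a "_strings" suffix is appended.
fn yara_rule_name(filename: &str) -> String {
    let base: String = filename
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect();
    format!("{base}_strings")
}

/// Emit one string definition, replacing over-long strings (> 200 chars)
/// with a skip comment. Escaping here is a simplified sketch.
fn yara_string_line(var: &str, s: &str) -> String {
    let len = s.chars().count();
    if len > 200 {
        format!("// skipped (length > 200 chars): {len}")
    } else {
        let escaped = s.replace('\\', "\\\\").replace('"', "\\\"");
        format!("${var} = \"{escaped}\" ascii")
    }
}
```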

Score Behavior

Stringy uses a band-mapping system to convert internal scores to display scores (0-100):

Internal Score | Display Score | Meaning
<= 0 | 0 | Low relevance
1-79 | 1-49 | Low relevance
80-119 | 50-69 | Moderate
120-159 | 70-89 | Meaningful
160-220 | 90-100 | High-value
> 220 | 100 (clamped) | High-value
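
The band table can be realized as a small mapping function. This is a sketch, not Stringy's actual normalizer: linear interpolation within each band is an assumption, and the real src/pipeline/normalizer.rs may round or interpolate differently.

```rust
/// Map an internal score onto the 0-100 display scale using the band table.
/// Linear interpolation inside each band is assumed for illustration.
fn band_map(internal: i32) -> i32 {
    // (internal_lo, internal_hi, display_lo, display_hi)
    const BANDS: [(i32, i32, i32, i32); 4] = [
        (1, 79, 1, 49),
        (80, 119, 50, 69),
        (120, 159, 70, 89),
        (160, 220, 90, 100),
    ];
    if internal <= 0 {
        return 0;
    }
    for (lo, hi, dlo, dhi) in BANDS {
        if internal <= hi {
            return dlo + (internal - lo) * (dhi - dlo) / (hi - lo);
        }
    }
    100 // anything above 220 clamps to 100
}
```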

Format Comparison

Feature | Table | JSON | YARA
Interactive use | Yes | No | No
Automation | No | Yes | No
Rule creation | No | No | Yes
Full metadata | No | Yes | No

Output Customization

Filtering

All formats support the same filtering options:

# Limit results
stringy --top 50 --json binary

# Filter by tags
stringy --only-tags url --only-tags domain --yara binary

# Minimum score threshold (post-process)
stringy --json binary | jq 'select(.score >= 70)'

Redirection

# Save to file
stringy --json binary > strings.jsonl
stringy --yara binary > rules.yar

# Pipe to other tools
stringy --json binary | jq 'select(.tags[] == "Url")' | less

Architecture Overview

Stringy is built as a modular Rust library with a clear separation of concerns. The architecture follows a pipeline approach where binary data flows through several processing stages.

High-Level Architecture

Binary File -> Format Detection -> Container Parsing -> String Extraction -> Deduplication -> Classification -> Ranking -> Output

Core Components

1. Container Module (src/container/)

Handles binary format detection and parsing using the goblin crate with comprehensive section analysis.

  • Format Detection: Automatically identifies ELF, PE, and Mach-O formats via goblin::Object::parse()
  • Section Classification: Categorizes sections by string likelihood with weighted scoring
  • Metadata Extraction: Collects imports, exports, and detailed structural information
  • Cross-Platform Support: Handles platform-specific section characteristics and naming conventions

Supported Formats

Format | Parser | Key Sections (Weight) | Import/Export Support
ELF | ElfParser | .rodata (10.0), .comment (9.0), .data.rel.ro (7.0) | Dynamic and static
PE | PeParser | .rdata (10.0), .rsrc (9.0), read-only .data (7.0) | Import/export tables
Mach-O | MachoParser | __TEXT,__cstring (10.0), __TEXT,__const (9.0) | Symbol tables

Section Weight System

Container parsers assign weights (1.0-10.0) to sections based on how likely they are to contain meaningful strings. Higher weights indicate higher-value sections. For example, .rodata (read-only data) receives a weight of 10.0, while .text (executable code) receives 1.0.
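
Using only the weights quoted in this chapter, a name-based lookup might look like the following sketch. The 3.0 default for unlisted sections is a placeholder assumption, not Stringy's actual fallback.

```rust
/// Illustrative section-weight lookup based on the weights in this chapter.
/// Unlisted sections get an assumed mid-low default.
fn section_weight(name: &str) -> f32 {
    match name {
        ".rodata" | ".rdata" | "__cstring" => 10.0, // string-rich, read-only
        ".comment" | ".rsrc" | "__const" => 9.0,
        ".data.rel.ro" => 7.0,
        ".text" => 1.0, // executable code: low string value
        _ => 3.0,       // hypothetical default
    }
}
```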

2. Extraction Module (src/extraction/)

Implements encoding-aware string extraction algorithms with configurable parameters.

  • ASCII/UTF-8: Scans for printable character sequences with noise filtering
  • UTF-16: Detects little-endian and big-endian wide strings with confidence scoring
  • PE Resources: Extracts version info, manifests, and string table resources from PE binaries
  • Mach-O Load Commands: Extracts strings from Mach-O load commands (dylib paths, rpaths)
  • Deduplication: Groups strings by (text, encoding) keys, preserves all occurrence metadata, merges tags using set union, and calculates combined scores with occurrence-based bonuses
  • Noise Filters: Applies configurable filters to reduce false positives
  • Section-Aware: Uses container parser weights to prioritize extraction areas
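
The ASCII/UTF-8 pass boils down to scanning for runs of printable bytes of at least the minimum length. A minimal, dependency-free sketch follows; the printable set (including tab) is an assumption, and the real extractor also applies noise filtering and confidence scoring.

```rust
/// Return (file_offset, string) pairs for printable-ASCII runs of at least
/// `min_len` bytes. Treating tab as printable is an assumption here.
fn extract_ascii(data: &[u8], min_len: usize) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in data.iter().enumerate() {
        let printable = (0x20..0x7f).contains(&b) || b == b'\t';
        match (printable, start) {
            (true, None) => start = Some(i), // run begins
            (false, Some(s)) => {
                // Run ends: keep it if it meets the minimum length.
                if i - s >= min_len {
                    out.push((s, String::from_utf8_lossy(&data[s..i]).into_owned()));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that reaches the end of the buffer.
    if let Some(s) = start {
        if data.len() - s >= min_len {
            out.push((s, String::from_utf8_lossy(&data[s..]).into_owned()));
        }
    }
    out
}
```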

Deduplication System

The deduplication module (src/extraction/dedup/) provides comprehensive string deduplication:

  • Grouping Strategy: Strings are grouped by (text, encoding) tuple, ensuring UTF-8 and UTF-16 versions are kept separate
  • Occurrence Preservation: All occurrence metadata (offset, RVA, section, source, tags, score, confidence) is preserved
  • Tag Merging: Tags from all occurrences are merged using HashSet for uniqueness, then converted to a sorted Vec<Tag>
  • Combined Scoring: Calculates combined scores using a base score (maximum across occurrences) plus bonuses for multiple occurrences, cross-section appearances, and multi-source appearances
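
The grouping and merging strategy can be sketched with standard-library maps. The +5-per-extra-occurrence bonus below is hypothetical; the real scorer also adds cross-section and multi-source bonuses, and its exact constants are not shown here.

```rust
use std::collections::{BTreeSet, HashMap};

struct Occurrence {
    text: String,
    encoding: &'static str, // e.g. "Ascii", "Utf16Le"
    tags: Vec<&'static str>,
    score: i32,
}

/// Group occurrences by (text, encoding), merge tags as a set, and score
/// each group as the max occurrence score plus an assumed occurrence bonus.
fn dedup(
    occurrences: Vec<Occurrence>,
) -> HashMap<(String, &'static str), (Vec<&'static str>, i32)> {
    let mut groups: HashMap<(String, &'static str), Vec<Occurrence>> = HashMap::new();
    for occ in occurrences {
        groups
            .entry((occ.text.clone(), occ.encoding))
            .or_default()
            .push(occ);
    }
    groups
        .into_iter()
        .map(|(key, occs)| {
            // Set union of tags, emitted in sorted order.
            let tags: BTreeSet<_> = occs.iter().flat_map(|o| o.tags.iter().copied()).collect();
            let base = occs.iter().map(|o| o.score).max().unwrap_or(0);
            let bonus = 5 * (occs.len() as i32 - 1); // hypothetical bonus
            (key, (tags.into_iter().collect(), base + bonus))
        })
        .collect()
}
```

Note that a UTF-8 "hello" and a UTF-16LE "hello" land in different groups, matching the (text, encoding) keying described above.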

3. Classification Module (src/classification/)

Applies semantic analysis to extracted strings with comprehensive tagging system.

  • Pattern Matching: Uses regex to identify URLs, IPs, domains, paths, GUIDs, emails, format strings, base64, user-agent strings, and version strings
  • Symbol Processing: Demangles Rust symbols and processes imports/exports
  • Context Analysis: Considers section context and source type for classification

Supported Classification Tags

Category | Tags | Examples
Network | url, domain, ipv4, ipv6 | https://api.com, example.com, 192.168.1.1
Filesystem | filepath, regpath, dylib-path, rpath, rpath-var, framework-path | /usr/bin/app, HKEY_LOCAL_MACHINE\...
Identifiers | guid, email, user-agent-ish | {12345678-...}, user@domain.com
Code | fmt, b64, import, export, demangled | Error: %s, SGVsbG8=, CreateFileW
Resources | version, manifest, resource | v1.2.3, XML config, UI strings
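
As a toy illustration of content-based tagging, the following uses crude substring heuristics in place of Stringy's regex matchers; the real patterns in src/classification/patterns/ are far stricter and cover many more tags.

```rust
/// Toy tagger: substring heuristics standing in for the real regex-based
/// classifiers. Only a handful of tags are sketched here.
fn quick_tags(s: &str) -> Vec<&'static str> {
    let mut tags = Vec::new();
    if s.starts_with("http://") || s.starts_with("https://") {
        tags.push("url");
    }
    if s.starts_with('/') {
        tags.push("filepath");
    }
    if s.starts_with("HKEY_") {
        tags.push("regpath");
    }
    // Very loose email check; real classification would validate structure.
    if s.contains('@') && s.contains('.') && !s.contains(' ') && !s.contains('/') {
        tags.push("email");
    }
    tags
}
```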

4. Ranking Module (src/classification/ranking.rs)

Implements the scoring algorithm to prioritize relevant strings using multiple factors:

  • Section Weight: Based on the section’s classification (higher weights for string-oriented sections like .rodata)
  • Semantic Boost: Bonus points for strings with recognized semantic tags (URLs, GUIDs, paths, etc.)
  • Noise Penalty: Penalty for characteristics indicating noise (low confidence, repetitive patterns, high entropy)

The internal score is then mapped to a display score (0-100) using a band-mapping system. See Output Formats for the display-score band table.

5. Output Module (src/output/)

Formats results for different use cases:

  • Table (src/output/table/): TTY-aware output with color-coded scores, or plain text when piped. Columns: String, Tags, Score, Section.
  • JSON (src/output/json.rs): JSONL format with complete structured data including all metadata fields
  • YARA (src/output/yara/): Properly escaped strings with appropriate modifiers and long-string skipping

6. Pipeline Module (src/pipeline/)

Orchestrates the entire flow from file reading through output:

  • Configuration (src/pipeline/config.rs): PipelineConfig, FilterConfig, and EncodingFilter
  • Filtering (src/pipeline/filter.rs): FilterEngine applies post-extraction filtering by min-length, encoding, tags, and top-N
  • Score Normalization (src/pipeline/normalizer.rs): ScoreNormalizer maps internal scores to display scores (0-100) and populates display_score on each FoundString unconditionally in all non-raw executions
  • Orchestration (src/pipeline/mod.rs): Pipeline::run drives the full pipeline

Data Flow

1. Binary Analysis Phase

The pipeline reads the file, detects the binary format via goblin, and dispatches to the appropriate container parser (ELF, PE, or Mach-O). The parser returns a ContainerInfo struct containing sections with weights, imports, and exports. Unknown or unparseable formats fall back to unstructured raw byte scanning.

2. String Extraction Phase

Strings are extracted from each section using encoding-specific extractors (ASCII, UTF-8, UTF-16LE, UTF-16BE). Import and export symbol names are included as high-value strings. PE resources (version info, manifests, string tables) and Mach-O load command strings are also extracted. Results are then deduplicated by grouping on (text, encoding).

3. Classification Phase

Each string is passed through pattern matchers that assign semantic tags based on content. Rust mangled symbols are demangled. The ranking algorithm then computes a score for each string combining section weight, semantic boost, and noise penalty.

4. Output Phase

Strings are sorted by score (descending), filtered according to user options (tags, encoding, top-N), and formatted for the selected output mode (table, JSON, or YARA).
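
The sort-then-limit step can be sketched as follows; the (score, text) tuple shape is illustrative, as the pipeline actually operates on FoundString values with the filters described above applied first.

```rust
/// Sort results by score descending, then apply an optional --top limit.
fn finalize(mut results: Vec<(i32, String)>, top: Option<usize>) -> Vec<(i32, String)> {
    results.sort_by(|a, b| b.0.cmp(&a.0)); // highest score first
    if let Some(n) = top {
        results.truncate(n);
    }
    results
}
```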

Data Structures

Core types are defined in src/types/mod.rs:

pub struct FoundString {
    pub text: String,
    pub original_text: Option<String>, // pre-demangled form
    pub encoding: Encoding,
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
    pub length: u32,
    pub tags: Vec<Tag>,
    pub score: i32,
    pub section_weight: Option<i32>, // debug only
    pub semantic_boost: Option<i32>, // debug only
    pub noise_penalty: Option<i32>,  // debug only
    pub display_score: Option<i32>,  // populated in all non-raw executions
    pub source: StringSource,
    pub confidence: f32,
}

Key Design Decisions

Error Handling

  • Comprehensive error types with context via thiserror
  • Graceful degradation for partially corrupted binaries
  • Unknown formats fall back to raw byte scanning rather than erroring

Extensibility

  • Trait-based architecture for easy format addition
  • Pluggable classification systems
  • Configurable output formats

Performance

  • Section-aware extraction reduces scan time
  • Regex caching via once_cell::sync::Lazy for repeated pattern matching
  • Weight-based prioritization avoids scanning low-value sections

Module Dependencies

main.rs
+-- lib.rs (public API, re-exports)
+-- types/
|   +-- mod.rs (core data structures: Tag, FoundString, Encoding, etc.)
|   +-- error.rs (StringyError, Result)
|   +-- constructors.rs (constructor implementations)
|   +-- found_string.rs (FoundString builder methods)
|   +-- tests.rs
+-- container/
|   +-- mod.rs (format detection, ContainerParser trait)
|   +-- elf/
|   |   +-- mod.rs (ELF parser)
|   |   +-- tests.rs
|   +-- pe/
|   |   +-- mod.rs (PE parser)
|   |   +-- tests.rs
|   +-- macho/
|   |   +-- mod.rs (Mach-O parser)
|   |   +-- tests.rs
+-- extraction/
|   +-- mod.rs (extraction orchestration)
|   +-- ascii/ (ASCII/UTF-8 extraction)
|   +-- utf16/ (UTF-16LE/BE extraction with confidence scoring)
|   +-- dedup/ (deduplication with scoring)
|   +-- filters/ (noise filter implementations)
|   +-- pe_resources/ (PE version info, manifests, string tables)
|   +-- macho_load_commands.rs (Mach-O load command strings)
+-- classification/
|   +-- mod.rs (classification framework)
|   +-- patterns/ (regex-based pattern matching)
|   +-- symbols.rs (symbol processing and demangling)
|   +-- ranking.rs (scoring algorithm)
+-- output/
|   +-- mod.rs (OutputFormat, OutputMetadata, formatting dispatch)
|   +-- json.rs (JSONL format)
|   +-- table/ (TTY and plain text table formatting)
|   +-- yara/ (YARA rule generation with escaping)
+-- pipeline/
    +-- mod.rs (Pipeline::run orchestration)
    +-- config.rs (PipelineConfig, FilterConfig, EncodingFilter)
    +-- filter.rs (post-extraction filtering)
    +-- normalizer.rs (score band mapping)

External Dependencies

Core Dependencies

  • goblin - Multi-format binary parsing (ELF, PE, Mach-O)
  • pelite - PE resource extraction (version info, manifests, string tables)
  • serde + serde_json - Serialization
  • thiserror - Error handling
  • clap - CLI argument parsing
  • regex - Pattern matching for classification
  • rustc-demangle - Rust symbol demangling
  • indicatif - Progress bars and spinners for CLI output
  • tempfile - Temporary file creation for stdin-to-Pipeline bridging
  • once_cell - Lazy-initialized static regex patterns
  • patharg - Input argument handling (file path or stdin)

Testing Strategy

Unit Tests

  • Each module has comprehensive unit tests
  • Mock data for parser testing
  • Edge case coverage for string extraction

Integration Tests

  • End-to-end CLI functionality via assert_cmd
  • Real binary file testing with compiled fixtures
  • Snapshot testing via insta
  • Cross-platform validation

Performance Tests

  • Benchmarks via criterion in benches/

Binary Format Support

Stringy supports the three major executable formats across different platforms. Each format has unique characteristics that influence string extraction strategies.

ELF (Executable and Linkable Format)

Used primarily on Linux and other Unix-like systems.

Key Sections for String Extraction

Section | Priority | Description
.rodata | High | Read-only data, often contains string literals
.rodata.str1.1 | High | Aligned string literals
.data.rel.ro | Medium | Read-only after relocation
.comment | Medium | Compiler and build information
.note.* | Low | Various metadata notes

ELF-Specific Features

  • Symbol Tables: Extract import/export names from .dynsym and .symtab
  • Dynamic Strings: Process .dynstr for library names and symbols
  • Section Flags: Use SHF_EXECINSTR and SHF_WRITE for classification
  • Virtual Addresses: Map file offsets to runtime addresses
  • Dynamic Linking: Parse DT_NEEDED entries to extract library dependencies
  • Symbol Types: Support for functions (STT_FUNC), objects (STT_OBJECT), TLS variables (STT_TLS), and indirect functions (STT_GNU_IFUNC)
  • Symbol Visibility: Filter hidden and internal symbols from exports (STV_HIDDEN, STV_INTERNAL)

Enhanced Symbol Extraction

The ELF parser provides comprehensive symbol extraction with:

  1. Import Detection: Identifies all undefined symbols (SHN_UNDEF) that need runtime resolution

    • Supports multiple symbol types: functions, objects, TLS variables, and indirect functions
    • Handles both global and weak bindings
    • Maps symbols to their providing libraries using version information
  2. Export Detection: Extracts all globally visible defined symbols

    • Filters out hidden (STV_HIDDEN) and internal (STV_INTERNAL) symbols
    • Includes both strong and weak symbols
    • Supports all relevant symbol types
  3. Library Dependencies: Extracts DT_NEEDED entries from the dynamic section

    • Provides list of required shared libraries
    • Used in conjunction with version information for symbol-to-library mapping
  4. Symbol-to-Library Mapping: Maps imported symbols to their providing libraries

    • Uses ELF version tables (versym and verneed) for best-effort attribution
    • Process: versym index → verneed entry → library filename
    • Falls back to heuristics for unversioned symbols (e.g., common libc symbols)
    • Returns None when version information is unavailable or ambiguous

Implementation Details

impl ElfParser {
    fn classify_section(section: &SectionHeader, name: &str) -> SectionType {
        // Check executable flag first
        if section.sh_flags & SHF_EXECINSTR != 0 {
            return SectionType::Code;
        }

        // Classify by name patterns
        match name {
            ".rodata" | ".rodata.str1.1" => SectionType::StringData,
            ".data.rel.ro" => SectionType::ReadOnlyData,
            // ... more classifications
            _ => SectionType::Other,
        }
    }

    fn extract_imports(&self, elf: &Elf, libraries: &[String]) -> Vec<ImportInfo> {
        // Extract undefined symbols from dynamic symbol table
        // Supports STT_FUNC, STT_OBJECT, STT_TLS, STT_GNU_IFUNC, STT_NOTYPE
        // Handles both STB_GLOBAL and STB_WEAK bindings
        // Maps symbols to libraries using version information
    }

    fn extract_exports(&self, elf: &Elf) -> Vec<ExportInfo> {
        // Extract defined symbols with global/weak binding
        // Filters out STV_HIDDEN and STV_INTERNAL symbols
        // Includes all relevant symbol types
    }

    fn extract_needed_libraries(&self, elf: &Elf) -> Vec<String> {
        // Parse DT_NEEDED entries from dynamic section
        // Returns list of required shared library names
    }

    fn get_symbol_providing_library(
        &self,
        elf: &Elf,
        sym_index: usize,
        libraries: &[String],
    ) -> Option<String> {
        // 1. Get version index from versym table for this symbol
        // 2. Look up version in verneed to find library name
        // 3. Match with DT_NEEDED entries
        // 4. Fallback to heuristics for unversioned symbols
    }
}

Library Dependency Mapping

The ELF parser implements symbol-to-library mapping using ELF version information:

  1. Version Symbol Table (versym): Maps each dynamic symbol to a version index

    • Index 0 (VER_NDX_LOCAL): Local symbol, not available externally
    • Index 1 (VER_NDX_GLOBAL): Global symbol, no specific version
    • Index ≥ 2: Versioned symbol, references verneed entry
  2. Version Needed Table (verneed): Lists library dependencies with version requirements

    • Each entry contains a library filename (from DT_NEEDED)
    • Auxiliary entries specify version names and indices
    • Links version indices to specific libraries
  3. Mapping Process:

    Symbol → versym[sym_index] → version_index → verneed lookup → library_name
    
  4. Fallback Strategies:

    • For unversioned symbols: Attempt to match common symbols (e.g., printf, malloc) to libc
    • If only one library is needed: Attribute to that library (least accurate)
    • Otherwise: Return None to avoid false positives
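The fallback chain described above can be sketched as follows. This is a minimal illustration, not Stringy's actual API: `fallback_library` and the hard-coded symbol list are hypothetical.

```rust
// Hypothetical sketch of the fallback strategies for unversioned symbols.
fn fallback_library(symbol: &str, libraries: &[String]) -> Option<String> {
    // Strategy 1: well-known libc symbols are attributed to the libc entry, if present.
    const LIBC_SYMBOLS: &[&str] = &["printf", "malloc", "free", "memcpy", "strlen"];
    if LIBC_SYMBOLS.contains(&symbol) {
        if let Some(libc) = libraries.iter().find(|l| l.starts_with("libc.so")) {
            return Some(libc.clone());
        }
    }
    // Strategy 2: if only one library is needed, it must be the provider (least accurate).
    if libraries.len() == 1 {
        return Some(libraries[0].clone());
    }
    // Strategy 3: return None rather than guess among multiple candidates.
    None
}
```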

Limitations

ELF’s indirect linking model means symbol-to-library mapping is best-effort:

  • Accuracy: Version-based mapping is accurate when version information is present, but many binaries lack version info
  • Unversioned Symbols: Symbols without version information cannot be definitively mapped without relocation analysis
  • Relocation Tables: PLT/GOT relocations would provide definitive mapping but require complex analysis
  • Static Linking: Statically linked binaries have no dynamic section, so all imports have library: None
  • Stripped Binaries: Stripped binaries may lack symbol tables entirely

The current implementation is sufficient for most string classification use cases where approximate library attribution is acceptable.

PE (Portable Executable)

Used on Windows for executables, DLLs, and drivers.

Key Sections for String Extraction

| Section | Priority | Description |
| --- | --- | --- |
| .rdata | High | Read-only data section |
| .rsrc | High | Resources (version info, strings, etc.) |
| .data | Medium | Initialized data (check write flag) |
| .text | Low | Code section (imports/exports only) |

PE-Specific Features

  • Resources: Extract from VERSIONINFO, STRINGTABLE, and manifest resources
  • Import/Export Tables: Process IAT and EAT for symbol names
  • UTF-16 Prevalence: Windows APIs favor wide strings
  • Section Characteristics: Use IMAGE_SCN_* flags for classification

Enhanced Import/Export Extraction

The PE parser provides comprehensive import/export extraction:

  1. Import Extraction: Extracts from PE import directory using goblin’s pe.imports

    • Each import includes: function name, DLL name, and RVA
    • Example: printf from msvcrt.dll
    • Iterates through pe.imports to create ImportInfo with name, library (DLL), and address (RVA)
  2. Export Extraction: Extracts from PE export directory using goblin’s pe.exports

    • Each export includes: function name, address, and ordinal
    • Note: PE executables typically don’t export symbols (only DLLs do)
    • Ordinal is derived from index since goblin doesn’t expose it directly
    • Handles unnamed exports with “ordinal_{i}” naming

Resource Extraction (Phase 2 Complete)

PE resources are particularly rich sources of strings. The PE parser now provides comprehensive resource string extraction:

VERSIONINFO Extraction

  • Extracts all StringFileInfo key-value pairs from VS_VERSIONINFO structures
  • Supports multiple language variants via translation table
  • Common extracted fields:
    • CompanyName: Company or organization name
    • FileDescription: File purpose and description
    • FileVersion: File version string (e.g., “1.0.0.0”)
    • ProductName: Product name
    • ProductVersion: Product version string
    • LegalCopyright: Copyright information
    • InternalName: Internal file identifier
    • OriginalFilename: Original filename
  • Uses pelite’s high-level version_info() API for reliable parsing
  • All strings are UTF-16LE encoded in the resource
  • Tagged with Tag::Version and Tag::Resource

STRINGTABLE Extraction

  • Parses RT_STRING resources (type 6) containing localized UI strings
  • Handles block structure: strings grouped in blocks of 16
  • Block ID calculation: (StringID >> 4) + 1
  • String format: u16 length (in UTF-16 code units) + UTF-16LE string data
  • Supports multiple language variants
  • Extracts all non-empty strings from all blocks
  • Tagged with Tag::Resource
  • Common use cases: UI labels, error messages, dialog text
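The block layout above can be sketched as a small parser. This is an illustrative helper, not Stringy's implementation; it assumes the block's bytes have already been located in the resource tree using the `(StringID >> 4) + 1` block ID.

```rust
// Parse one RT_STRING block: 16 entries, each a u16 length (in UTF-16 code
// units) followed by that many UTF-16LE code units. Empty entries have length 0.
fn parse_string_block(data: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut pos = 0;
    for _ in 0..16 {
        if pos + 2 > data.len() {
            break;
        }
        let len = u16::from_le_bytes([data[pos], data[pos + 1]]) as usize;
        pos += 2;
        let end = pos + len * 2;
        if end > data.len() {
            break;
        }
        let units: Vec<u16> = data[pos..end]
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        if len > 0 {
            if let Ok(s) = String::from_utf16(&units) {
                out.push(s); // keep only non-empty, valid strings
            }
        }
        pos = end;
    }
    out
}
```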

MANIFEST Extraction

  • Extracts RT_MANIFEST resources (type 24) containing application manifests
  • Automatic encoding detection:
    • UTF-8 with BOM (EF BB BF)
    • UTF-16LE with BOM (FF FE)
    • UTF-16BE with BOM (FE FF)
    • Fallback: byte pattern analysis
  • Returns full XML manifest content
  • Tagged with Tag::Manifest and Tag::Resource
  • Manifest contains:
    • Assembly identity (name, version, architecture)
    • Dependency information
    • Compatibility settings
    • Security settings (requestedExecutionLevel)
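The BOM checks above can be sketched as follows. This is an illustrative helper; the byte-pattern fallback analysis is elided, and the enum is a stand-in for Stringy's own types.

```rust
#[derive(Debug, PartialEq)]
enum ManifestEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
    Unknown,
}

// Detect manifest encoding from the byte-order mark, if present.
fn detect_manifest_encoding(data: &[u8]) -> ManifestEncoding {
    match data {
        [0xEF, 0xBB, 0xBF, ..] => ManifestEncoding::Utf8,
        [0xFF, 0xFE, ..] => ManifestEncoding::Utf16Le,
        [0xFE, 0xFF, ..] => ManifestEncoding::Utf16Be,
        _ => ManifestEncoding::Unknown, // fall back to byte-pattern analysis
    }
}
```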

Usage Example

use stringy::extraction::extract_resource_strings;
use stringy::types::Tag;

let pe_data = std::fs::read("example.exe")?;
let strings = extract_resource_strings(&pe_data);

// Filter version info strings
let version_strings: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::Version))
    .collect();

// Filter string table entries
let ui_strings: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::Resource) && !s.tags.contains(&Tag::Version))
    .collect();

Implementation Details

impl PeParser {
    fn classify_section(section: &SectionTable) -> SectionType {
        let name = String::from_utf8_lossy(&section.name);

        // Check characteristics
        if section.characteristics & IMAGE_SCN_CNT_CODE != 0 {
            return SectionType::Code;
        }

        match name.trim_end_matches('\0') {
            ".rdata" => SectionType::StringData,
            ".rsrc" => SectionType::Resources,
            // ... more classifications
            _ => SectionType::Other,
        }
    }

    fn extract_imports(&self, pe: &PE) -> Vec<ImportInfo> {
        // Iterates through pe.imports
        // Creates ImportInfo with name, library (DLL), and address (RVA)
    }

    fn extract_exports(&self, pe: &PE) -> Vec<ExportInfo> {
        // Iterates through pe.exports
        // Creates ExportInfo with name, address, and ordinal
        // Handles unnamed exports with "ordinal_{i}" naming
    }

    fn calculate_section_weight(section_type: SectionType, name: &str) -> f32 {
        // Returns weight values based on section type and name
        // Higher weights indicate higher string likelihood
    }
}

Section Weight Calculation

The PE parser uses a weight-based system to prioritize sections for string extraction:

| Section Type | Weight | Rationale |
| --- | --- | --- |
| StringData (.rdata) | 10.0 | Primary string storage |
| Resources (.rsrc) | 9.0 | Version info, string tables |
| ReadOnlyData | 7.0 | May contain constants |
| WritableData (.data) | 5.0 | Runtime state, lower priority |
| Code (.text) | 1.0 | Unlikely to contain strings |
| Debug | 2.0 | Internal metadata |
| Other | 1.0 | Minimal priority |

Limitations

Resource string extraction is comprehensive for VERSIONINFO (all StringFileInfo fields), STRINGTABLE (full RT_STRING block parsing with language support), and MANIFEST (encoding detection and XML extraction), but several resource types are not yet parsed:

  • Dialog Resources: RT_DIALOG parsing not yet implemented
  • Menu Resources: RT_MENU parsing not yet implemented
  • Icon Strings: RT_ICON metadata extraction not yet implemented

Future Enhancements:

  • Dialog resource parsing for control text and window titles
  • Menu resource parsing for menu item text
  • Icon and cursor resource metadata
  • Accelerator table string extraction

Mach-O (Mach Object)

Used on macOS and iOS for executables, frameworks, and libraries.

Key Sections for String Extraction

| Segment | Section | Priority | Description |
| --- | --- | --- | --- |
| __TEXT | __cstring | High | C string literals |
| __TEXT | __const | High | Constant data |
| __DATA_CONST | * | Medium | Read-only after fixups |
| __DATA | * | Low | Writable data |

Mach-O-Specific Features

  • Load Commands: Extract strings from LC_* commands
  • Segment/Section Model: Two-level naming scheme
  • Fat Binaries: Multi-architecture support
  • String Pools: Centralized string storage in __cstring

Load Command Processing

Mach-O load commands contain valuable strings:

  • LC_LOAD_DYLIB: Library paths and names
  • LC_RPATH: Runtime search paths
  • LC_ID_DYLIB: Library identification
  • LC_BUILD_VERSION: Build tool information

Implementation Details

impl MachoParser {
    fn classify_section(segment_name: &str, section_name: &str) -> SectionType {
        match (segment_name, section_name) {
            ("__TEXT", "__cstring") => SectionType::StringData,
            ("__DATA_CONST", _) => SectionType::ReadOnlyData,
            ("__DATA", _) => SectionType::WritableData,
            // ... more classifications
            _ => SectionType::Other,
        }
    }
}

Cross-Platform Considerations

Encoding Differences

| Platform | Primary Encoding | Notes |
| --- | --- | --- |
| Linux/Unix | UTF-8 | ASCII-compatible, variable width |
| Windows | UTF-16LE | Wide strings common in APIs |
| macOS | UTF-8 | Similar to Linux, some UTF-16 |

String Storage Patterns

  • ELF: Strings often in .rodata with null terminators
  • PE: Mix of ANSI and Unicode APIs, resources use UTF-16
  • Mach-O: Centralized in __cstring, mostly UTF-8

Section Weight Calculation

Different formats require different weighting strategies:

fn calculate_section_weight(format: BinaryFormat, section_type: SectionType) -> i32 {
    match (format, section_type) {
        (BinaryFormat::Elf, SectionType::StringData) => 10,   // .rodata
        (BinaryFormat::Pe, SectionType::Resources) => 9,      // .rsrc
        (BinaryFormat::MachO, SectionType::StringData) => 10, // __cstring
        // ... more weights
        _ => 1,
    }
}

Format Detection

Stringy uses goblin for robust format detection:

pub fn detect_format(data: &[u8]) -> BinaryFormat {
    match Object::parse(data) {
        Ok(Object::Elf(_)) => BinaryFormat::Elf,
        Ok(Object::PE(_)) => BinaryFormat::Pe,
        Ok(Object::Mach(_)) => BinaryFormat::MachO,
        _ => BinaryFormat::Unknown,
    }
}

Future Enhancements

Planned Format Extensions

  • WebAssembly (WASM): Growing importance in web and edge computing
  • Java Class Files: JVM bytecode analysis
  • Android APK/DEX: Mobile application analysis

Enhanced Resource Support

  • PE: Dialog resources, icon strings, version blocks
  • Mach-O: Plist resources, framework bundles
  • ELF: Note sections, build IDs, GNU attributes

Architecture-Specific Features

  • ARM64: Pointer authentication, tagged pointers
  • x86-64: RIP-relative addressing hints
  • RISC-V: Emerging architecture support

This comprehensive format support ensures Stringy can effectively analyze binaries across all major platforms while respecting the unique characteristics of each format.

String Extraction

Stringy’s string extraction engine is designed to find meaningful strings while avoiding noise and false positives. The extraction process is encoding-aware, section-aware, and configurable.

Extraction Pipeline

Binary Data → Section Analysis → Encoding Detection → String Scanning → Deduplication → Classification

Encoding Support

ASCII Extraction

The most common encoding found in binaries. ASCII extraction provides the foundational string scan with configurable minimum length thresholds.

UTF-16LE Extraction

UTF-16LE extraction is supported, primarily for Windows PE binaries, with confidence scoring and noise filtering integration; see the UTF-16 Extraction section later in this chapter for details. The remainder of this section covers the ASCII/UTF-8 path.

Algorithm

  1. Scan for printable sequences: Characters in range 0x20-0x7E (strict printable ASCII)
  2. Length filtering: Configurable minimum length (default: 4 characters)
  3. Null termination: Respect null terminators but don’t require them
  4. Section awareness: Integrate with section metadata for context-aware filtering

Basic Extraction

use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};

let data = b"Hello\0World\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);

for string in strings {
    println!("Found: {} at offset {}", string.text, string.offset);
}

Configuration

use stringy::extraction::AsciiExtractionConfig;

// Default configuration (min_length: 4, no max_length)
let config = AsciiExtractionConfig::default();

// Custom minimum length
let config = AsciiExtractionConfig::new(8);

// Custom minimum and maximum length
let mut config = AsciiExtractionConfig::default();
config.max_length = Some(256);

UTF-8 Extraction

UTF-8 extraction builds on ASCII extraction and handles multi-byte characters. See the main extraction module for UTF-8 support.

Implementation Details

fn extract_ascii_strings(data: &[u8], min_len: usize) -> Vec<RawString> {
    let mut strings = Vec::new();
    let mut current_string = Vec::new();
    let mut start_offset = 0;

    for (i, &byte) in data.iter().enumerate() {
        if is_printable_ascii(byte) {
            if current_string.is_empty() {
                start_offset = i;
            }
            current_string.push(byte);
        } else {
            if current_string.len() >= min_len {
                strings.push(RawString {
                    data: current_string.clone(),
                    offset: start_offset,
                    encoding: Encoding::Ascii,
                });
            }
            current_string.clear();
        }
    }

    // Flush a trailing run that reaches the end of the buffer unterminated
    if current_string.len() >= min_len {
        strings.push(RawString {
            data: current_string,
            offset: start_offset,
            encoding: Encoding::Ascii,
        });
    }

    strings
}

Noise Filtering

Stringy implements a multi-layered heuristic filtering system to reduce false positives and identify noise in extracted strings. The filtering system uses a combination of entropy analysis, character distribution, linguistic patterns, length checks, repetition detection, and context-aware filtering.

Filter Architecture

The noise filtering system consists of multiple independent filters that can be combined with configurable weights:

  1. Character Distribution Filter: Detects abnormal character frequency distributions
  2. Entropy Filter: Uses Shannon entropy to detect padding/repetition and random binary
  3. Linguistic Pattern Filter: Analyzes vowel-to-consonant ratios and common bigrams
  4. Length Filter: Penalizes excessively long strings and very short strings in low-weight sections
  5. Repetition Filter: Detects repeated character patterns and repeated substrings
  6. Context-Aware Filter: Boosts confidence for strings in high-weight sections

Character Distribution Analysis

Detects strings with abnormal character distributions:

  • Excessive punctuation (>80%): Low confidence (0.2)
  • Excessive repetition (>90% same character): Very low confidence (0.1)
  • Excessive non-alphanumeric (>70%): Low confidence (0.3)
  • Reasonable distribution: High confidence (1.0)
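These thresholds can be sketched as a small scoring function. The helper name and the exact cutoffs mirror the list above but are illustrative, not Stringy's actual API.

```rust
// Score a string's character distribution; lower scores indicate likely noise.
fn char_distribution_confidence(s: &str) -> f32 {
    let total = s.chars().count() as f32;
    if total == 0.0 {
        return 0.0;
    }
    let punct = s.chars().filter(|c| c.is_ascii_punctuation()).count() as f32;
    let alnum = s.chars().filter(|c| c.is_alphanumeric()).count() as f32;

    // Share of the string taken by its single most frequent character.
    let mut counts = std::collections::HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0u32) += 1;
    }
    let max_share = counts.values().copied().max().unwrap_or(0) as f32 / total;

    if max_share > 0.9 {
        0.1 // excessive repetition (>90% same character)
    } else if punct / total > 0.8 {
        0.2 // excessive punctuation
    } else if (total - alnum) / total > 0.7 {
        0.3 // excessive non-alphanumeric
    } else {
        1.0 // reasonable distribution
    }
}
```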

Entropy-Based Filtering

Uses Shannon entropy (bits per byte) to classify strings:

  • Very low entropy (<1.5 bits/byte): Likely padding or repetition (confidence: 0.1)
  • Very high entropy (>7.5 bits/byte): Likely random binary (confidence: 0.2)
  • Optimal range (3.5-6.0 bits/byte): High confidence (1.0)
  • Acceptable range (2.0-7.0 bits/byte): Moderate confidence (0.4-0.7)
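Shannon entropy in bits per byte, as used by these thresholds, is a standard computation and can be written as:

```rust
// Shannon entropy of a byte buffer, in bits per byte (0.0 to 8.0).
fn shannon_entropy(data: &[u8]) -> f32 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f32;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f32 / len;
            -p * p.log2() // each symbol contributes -p * log2(p)
        })
        .sum()
}
```

Padding like `\x00\x00\x00\x00` scores 0.0; uniformly random bytes approach 8.0.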

Linguistic Pattern Detection

Analyzes text for word-like patterns:

  • Vowel-to-consonant ratio: Reasonable range 0.2-0.8 for English
  • Common bigrams: Detects common English patterns (th, he, in, er, an, re, on, at, en, nd)
  • Handles non-English: Gracefully handles non-English strings without over-penalizing

Length-Based Filtering

Applies penalties based on string length:

  • Excessively long (>200 characters): Low confidence (0.3) - likely table data
  • Very short in low-weight sections (<4 chars, weight <0.5): Moderate confidence (0.5)
  • Normal length (4-100 characters): High confidence (1.0)

Repetition Detection

Identifies repetitive patterns:

  • Repeated characters (e.g., “AAAA”, “0000”): Very low confidence (0.1)
  • Repeated substrings (e.g., “abcabcabc”): Low confidence (0.2)
  • Normal strings: High confidence (1.0)
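The repeated-substring case can be sketched as follows (an illustrative helper, byte-based for simplicity): a string is repetitive if it is some shorter pattern tiled end to end.

```rust
// True if the whole string is a shorter pattern repeated, e.g. "abcabcabc"
// ("abc" x 3) or "AAAA" ("A" x 4).
fn is_repeated_substring(s: &str) -> bool {
    let n = s.len();
    for period in 1..=n / 2 {
        // Only a divisor of the length can tile the string exactly.
        if n % period == 0 {
            let pattern = &s.as_bytes()[..period];
            if s.as_bytes().chunks(period).all(|c| c == pattern) {
                return true;
            }
        }
    }
    false
}
```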

Context-Aware Filtering

Boosts or reduces confidence based on section context:

  • String data sections (.rodata, .rdata, __cstring): High confidence (0.9-1.0)
  • Read-only data sections: High confidence (0.9)
  • Resource sections: Maximum confidence (1.0) - known-good sources
  • Code sections: Lower confidence (0.3-0.5)
  • Writable data sections: Moderate confidence (0.6)
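This mapping can be sketched as a single match. The enum is a stand-in for Stringy's own section types, and the exact values within the ranges above are illustrative.

```rust
enum SectionType {
    StringData,   // .rodata, .rdata, __cstring
    ReadOnlyData,
    Resources,
    Code,
    WritableData,
}

// Base confidence contributed by section context.
fn context_confidence(section: &SectionType) -> f32 {
    match section {
        SectionType::Resources => 1.0,    // known-good sources
        SectionType::StringData => 0.95,
        SectionType::ReadOnlyData => 0.9,
        SectionType::WritableData => 0.6,
        SectionType::Code => 0.4,
    }
}
```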

Configuration

use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};

// Default configuration
let config = NoiseFilterConfig::default();

// Customize thresholds
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;

// Customize filter weights
config.filter_weights = FilterWeights {
    entropy_weight: 0.3,
    char_distribution_weight: 0.25,
    linguistic_weight: 0.2,
    length_weight: 0.15,
    repetition_weight: 0.05,
    context_weight: 0.05,
};

Using Noise Filters

use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
use stringy::types::SectionType;

let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();

let confidence = filter.calculate_confidence("Hello, World!", &context);
if confidence >= 0.5 {
    // String passed filtering threshold
}

Confidence Scoring

Each string is assigned a confidence score (0.0-1.0) indicating how likely it is to be legitimate:

  • 1.0: Maximum confidence (strings from known-good sources like imports, exports, resources)
  • 0.7-0.9: High confidence (likely legitimate strings)
  • 0.5-0.7: Moderate confidence (may need review)
  • 0.0-0.5: Low confidence (likely noise, filtered out by default)

The confidence score is separate from the score field used for final ranking. Confidence specifically represents the noise filtering assessment.

Performance

Noise filtering is designed to add minimal overhead (<10% per acceptance criteria). Individual filters are optimized for performance, and the composite filter allows enabling/disabling specific filters to balance accuracy and speed.

UTF-16 Extraction

Critical for Windows binaries and some resources. Supports both UTF-16LE (Little-Endian) and UTF-16BE (Big-Endian) with automatic byte order detection.

UTF-16LE (Little-Endian)

Most common on Windows platforms. The default minimum length is 3 characters.

Detection heuristics:

  • Even-length sequences (2-byte alignment required)
  • Low byte printable, high byte mostly zero
  • Null termination patterns (0x00 0x00)
  • Advanced confidence scoring with multiple heuristics
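For the common ASCII-in-UTF-16LE case, these heuristics reduce to a simple scan: low byte printable, high byte zero. This is an illustrative sketch only; Stringy's real extractor handles full Unicode ranges and confidence scoring.

```rust
// Extract runs of ASCII characters stored as UTF-16LE code units.
fn scan_utf16le_ascii(data: &[u8], min_chars: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    for pair in data.chunks_exact(2) {
        let (lo, hi) = (pair[0], pair[1]);
        if hi == 0 && (0x20..=0x7E).contains(&lo) {
            current.push(lo as char);
        } else {
            // Run broken by a non-printable unit or the 0x00 0x00 terminator.
            if current.len() >= min_chars {
                out.push(current.clone());
            }
            current.clear();
        }
    }
    if current.len() >= min_chars {
        out.push(current); // flush a trailing run
    }
    out
}
```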

UTF-16BE (Big-Endian)

Found in Java .class files, network protocols, and some cross-platform binaries.

Detection heuristics:

  • Even-length sequences
  • High byte printable, low byte mostly zero
  • Reverse byte order from UTF-16LE
  • Same advanced confidence scoring as UTF-16LE

Automatic Byte Order Detection

The ByteOrder::Auto mode automatically detects and extracts both UTF-16LE and UTF-16BE strings from the same data, avoiding duplicates and correctly identifying the encoding of each string.

Implementation

UTF-16 extraction is implemented in src/extraction/utf16.rs following the pattern established in the ASCII extractor. The implementation provides:

  • extract_utf16_strings(): Main extraction function supporting both byte orders
  • extract_utf16le_strings(): UTF-16LE-specific extraction (backward compatibility)
  • extract_from_section(): Section-aware extraction with proper metadata population
  • Utf16ExtractionConfig: Configuration for minimum/maximum character count, byte order selection, and confidence thresholds
  • ByteOrder enum: Control which byte order(s) to scan (LE, BE, Auto)

Usage Example:

use stringy::extraction::utf16::{extract_utf16_strings, Utf16ExtractionConfig, ByteOrder};

// Extract UTF-16LE strings from Windows PE binary
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::LE,
    min_length: 3,
    confidence_threshold: 0.6,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);

// Extract both UTF-16LE and UTF-16BE with auto-detection
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::Auto,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);

Configuration:

use stringy::extraction::utf16::{Utf16ExtractionConfig, ByteOrder};

// Default configuration (min_length: 3, byte_order: Auto, confidence_threshold: 0.5)
let config = Utf16ExtractionConfig::default();

// Custom minimum character length
let config = Utf16ExtractionConfig::new(5);

// Custom configuration
let mut config = Utf16ExtractionConfig::default();
config.min_length = 3;
config.max_length = Some(256);
config.byte_order = ByteOrder::LE;
config.confidence_threshold = 0.6;

UTF-16-Specific Confidence Scoring

UTF-16 extraction uses advanced confidence scoring to detect false positives from null-interleaved binary data. The confidence score combines multiple heuristics:

  1. Valid Unicode range check: Validates code points are in valid Unicode ranges (U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF), penalizes private use areas and invalid surrogates

  2. Printable character ratio: Calculates ratio of printable characters including common Unicode ranges

  3. ASCII ratio: Boosts confidence for ASCII-heavy strings (>50% characters in ASCII printable range)

  4. Null pattern detection: Flags suspicious patterns like:

    • Excessive nulls (>30% of characters)
    • Regular null intervals (every 2nd, 4th, 8th position)
    • Fixed-offset nulls indicating structured binary data
  5. Byte order consistency: Verifies byte order is consistent throughout the string (for Auto mode)

Confidence Formula:

confidence = (valid_unicode_weight × valid_ratio)
           + (printable_weight × printable_ratio)
           + (ascii_weight × ascii_ratio)
           - (null_pattern_penalty)
           - (invalid_range_penalty)

The result is clamped to 0.0-1.0 range.

Examples:

  • High confidence: “Microsoft Corporation” (>90% printable, valid Unicode, no null patterns)
  • Medium confidence: “Test123” (>70% printable, valid Unicode)
  • Low confidence: Null-interleaved binary table data (excessive nulls, regular patterns)

The UTF-16-specific confidence score is combined with general noise filtering confidence when noise filtering is enabled, using the minimum of both scores.

False Positive Prevention

UTF-16 extraction is prone to false positives because binary data with null bytes can look like UTF-16 strings. The confidence scoring system mitigates this by:

  • Detecting null-interleaved patterns: Binary tables with numeric data (e.g., [0x01, 0x00, 0x02, 0x00]) are flagged as suspicious
  • Penalizing regular null patterns: Data with nulls at fixed intervals (every 2nd, 4th, 8th byte) receives lower confidence
  • Validating Unicode ranges: Invalid code points and surrogate pairs reduce confidence
  • Configurable threshold: The utf16_confidence_threshold (default 0.5) can be tuned to balance recall and precision

Recommendations:

  • For Windows PE binaries: Use ByteOrder::LE with confidence_threshold: 0.6
  • For Java .class files: Use ByteOrder::BE with confidence_threshold: 0.5
  • For unknown formats: Use ByteOrder::Auto with confidence_threshold: 0.5
  • For high-precision extraction: Increase confidence_threshold to 0.7-0.8

Performance Considerations

UTF-16 scanning adds overhead compared to ASCII/UTF-8 extraction:

  • Scanning both byte orders: Auto mode doubles the work by scanning for both LE and BE
  • Confidence scoring: The multi-heuristic confidence calculation adds computational cost
  • Recommendations:
    • Use specific byte order (LE or BE) when the target format is known
    • Auto mode is best for unknown or mixed-format binaries
    • Consider disabling UTF-16 extraction for formats that don’t use it (e.g., pure ELF binaries)

Section-Aware Extraction

Different sections have different string extraction strategies.

High-Priority Sections

ELF: .rodata and variants

  • Strategy: Aggressive extraction, low noise filtering
  • Encodings: ASCII/UTF-8 primary, UTF-16 secondary
  • Minimum length: 3 characters

PE: .rdata

  • Strategy: Balanced extraction
  • Encodings: ASCII and UTF-16LE equally
  • Minimum length: 4 characters

Mach-O: __TEXT,__cstring

  • Strategy: High confidence, null-terminated focus
  • Encodings: UTF-8 primary
  • Minimum length: 3 characters

Medium-Priority Sections

ELF: .data.rel.ro

  • Strategy: Conservative extraction
  • Noise filtering: Enhanced
  • Minimum length: 5 characters

PE: .data (read-only)

  • Strategy: Moderate extraction
  • Context checking: Enhanced validation

Low-Priority Sections

Writable data sections

  • Strategy: Very conservative
  • High noise filtering: Skip obvious runtime data
  • Minimum length: 6+ characters

Resource Sections

PE Resources (.rsrc)

  • VERSIONINFO: Extract version strings, product names
  • STRINGTABLE: Localized UI strings
  • RT_MANIFEST: XML manifest data

fn extract_pe_resources(pe: &PE, data: &[u8]) -> Vec<RawString> {
    let mut strings = Vec::new();

    // Extract version info
    if let Some(version_info) = extract_version_info(pe, data) {
        strings.extend(version_info);
    }

    // Extract string tables
    if let Some(string_tables) = extract_string_tables(pe, data) {
        strings.extend(string_tables);
    }

    strings
}

Deduplication Strategy

Canonicalization

Strings are canonicalized while preserving important metadata:

  1. Normalize whitespace: Convert tabs/newlines to spaces
  2. Trim boundaries: Remove leading/trailing whitespace
  3. Case preservation: Maintain original case for analysis
  4. Encoding normalization: Convert to UTF-8 for comparison
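The whitespace steps above can be sketched as follows (an assumed helper; Stringy's actual implementation may differ):

```rust
// Normalize tabs/newlines/space runs to single spaces and trim boundaries,
// while preserving the original case for analysis.
fn canonicalize(text: &str) -> String {
    // split_whitespace collapses all whitespace runs and drops leading/trailing
    // whitespace in one pass.
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```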

Metadata Preservation

When duplicates are found:

struct DeduplicatedString {
    canonical_text: String,
    occurrences: Vec<StringOccurrence>,
    primary_encoding: Encoding,
    best_section: Option<String>,
}

struct StringOccurrence {
    offset: u64,
    section: Option<String>,
    encoding: Encoding,
    length: u32,
}

Deduplication Algorithm

fn deduplicate_strings(strings: Vec<RawString>) -> Vec<DeduplicatedString> {
    let mut map: HashMap<String, DeduplicatedString> = HashMap::new();

    for string in strings {
        let canonical = canonicalize(&string.text);

        map.entry(canonical.clone())
            .or_insert_with(|| DeduplicatedString::new(canonical))
            .add_occurrence(string);
    }

    map.into_values().collect()
}

Configuration Options

Extraction Configuration

use stringy::extraction::{ByteOrder, Encoding, ExtractionConfig};

pub struct ExtractionConfig {
    pub min_ascii_length: usize,          // Default: 4
    pub min_wide_length: usize,           // Default: 3 (for UTF-16)
    pub enabled_encodings: Vec<Encoding>, // Default: ASCII, UTF-8
    pub noise_filtering_enabled: bool,    // Default: true
    pub min_confidence_threshold: f32,    // Default: 0.5
    pub utf16_min_confidence: f32,        // Default: 0.7 (for UTF-16LE)
    pub utf16_byte_order: ByteOrder,      // Default: Auto
    pub utf16_confidence_threshold: f32,  // Default: 0.5 (UTF-16-specific)
}

UTF-16 Configuration Examples:

use stringy::extraction::{ExtractionConfig, Encoding, ByteOrder};

// Extract UTF-16LE strings from Windows PE binary
let mut config = ExtractionConfig::default();
config.min_wide_length = 3;
config.utf16_confidence_threshold = 0.6;
config.utf16_byte_order = ByteOrder::LE;
config.enabled_encodings.push(Encoding::Utf16Le);

// Extract both UTF-16LE and UTF-16BE with auto-detection
let mut config = ExtractionConfig::default();
config.enabled_encodings.push(Encoding::Utf16Le);
config.enabled_encodings.push(Encoding::Utf16Be);
config.utf16_byte_order = ByteOrder::Auto;

Noise Filter Configuration

use stringy::extraction::config::NoiseFilterConfig;

pub struct NoiseFilterConfig {
    pub entropy_min: f32,              // Default: 1.5
    pub entropy_max: f32,              // Default: 7.5
    pub max_length: usize,             // Default: 200
    pub max_repetition_ratio: f32,     // Default: 0.7
    pub min_vowel_ratio: f32,          // Default: 0.1
    pub max_vowel_ratio: f32,          // Default: 0.9
    pub filter_weights: FilterWeights, // Default: balanced weights
}

Filter Weights

use stringy::extraction::config::FilterWeights;

pub struct FilterWeights {
    pub entropy_weight: f32,           // Default: 0.25
    pub char_distribution_weight: f32, // Default: 0.20
    pub linguistic_weight: f32,        // Default: 0.20
    pub length_weight: f32,            // Default: 0.15
    pub repetition_weight: f32,        // Default: 0.10
    pub context_weight: f32,           // Default: 0.10
}

All weights must sum to 1.0. The configuration validates this automatically.
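The check can be sketched as follows (an illustrative stand-in for the configuration's own validation):

```rust
// Verify a set of filter weights sums to 1.0, tolerating float rounding.
fn weights_are_valid(weights: &[f32]) -> bool {
    let sum: f32 = weights.iter().sum();
    (sum - 1.0).abs() < 1e-6
}
```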

Encoding Selection

#[non_exhaustive]
pub enum EncodingFilter {
    /// Match a specific encoding exactly
    Exact(Encoding),
    /// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
    Utf16Any,
}

Section Filtering

pub struct SectionFilter {
    pub include_sections: Option<Vec<String>>,
    pub exclude_sections: Option<Vec<String>>,
    pub include_debug: bool,
    pub include_resources: bool,
}

Performance Optimizations

Memory Mapping

Large files use memory mapping for efficient access via mmap-guard:

fn extract_from_large_file(path: &Path) -> Result<Vec<RawString>> {
    let data = mmap_guard::map_file(path)?;
    // data implements Deref<Target = [u8]>
    extract_strings(&data[..])
}

Note: The Pipeline::run API handles memory mapping automatically.

Parallel Processing

Parallel processing is not yet implemented. Section extraction currently runs sequentially.

Regex Caching

Pattern matching uses cached regex compilation:

lazy_static! {
    static ref URL_REGEX: Regex = Regex::new(r"https?://[^\s]+").unwrap();
    static ref GUID_REGEX: Regex = Regex::new(r"\{[0-9a-fA-F-]{36}\}").unwrap();
}

Quality Assurance

Validation Heuristics

The noise filtering system implements comprehensive validation:

  • Entropy checking: Uses Shannon entropy to detect padding/repetition and random binary data
  • Language detection: Analyzes vowel-to-consonant ratios and common bigrams
  • Context validation: Considers section type, weight, and permissions
  • Character distribution: Detects abnormal frequency distributions
  • Repetition detection: Identifies repeated patterns and padding
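The entropy check above can be sketched with a standard Shannon-entropy computation over bytes (illustrative, not the exact implementation). Padding such as "AAAA" scores near 0 bits per byte; random binary data approaches 8.

```rust
// Shannon entropy in bits per byte. Low values indicate padding or
// repetition; very high values indicate likely random binary data.
fn shannon_entropy(s: &str) -> f32 {
    let bytes = s.as_bytes();
    if bytes.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    let len = bytes.len() as f32;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f32 / len;
            -p * p.log2()
        })
        .sum()
}
```

A filter then compares this value against the configured entropy_min and entropy_max bounds.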

False Positive Reduction

The multi-layered filtering system targets common sources of false positives:

  • Padding detection: Identifies repeated character sequences (e.g., “AAAA”, “\x00\x00\x00\x00”)
  • Table data: Filters excessively long strings likely to be structured data
  • Binary noise: High-entropy strings are flagged as likely random binary
  • Context awareness: Strings in code sections receive lower confidence scores

Performance Characteristics

Noise filtering is designed for minimal overhead:

  • Target overhead: <10% compared to extraction without filtering
  • Optimized filters: Each filter is independently optimized
  • Configurable: Can enable/disable individual filters to balance accuracy and speed
  • Scalable: Handles large binaries efficiently

Examples

Basic Extraction with Filtering

use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};

let data = b"Hello World\0AAAA\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);

let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();

let filtered: Vec<_> = strings
    .into_iter()
    .filter(|s| filter.calculate_confidence(&s.text, &context) >= 0.5)
    .collect();

Custom Filter Configuration

use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};

let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;

config.filter_weights = FilterWeights {
    entropy_weight: 0.4,
    char_distribution_weight: 0.3,
    linguistic_weight: 0.15,
    length_weight: 0.1,
    repetition_weight: 0.03,
    context_weight: 0.02,
};

This comprehensive extraction system ensures high-quality string extraction while maintaining performance and minimizing false positives through multi-layered noise filtering.

Classification System

Stringy applies semantic analysis to extracted strings, identifying patterns that indicate specific types of data. This helps analysts focus on the most relevant information quickly.

Classification Pipeline

Raw String -> Pattern Matching -> Validation -> Tag Assignment

Semantic Categories

URLs

  • Pattern: https?://[^\s<>"{}|\\^\[\]]+
  • Examples: https://example.com/path, http://malware.site/payload
  • Validation: Must start with http:// or https://

Domain Names

  • Pattern: RFC 1035 compliant domain format
  • Examples: example.com, subdomain.evil.site
  • Validation: Valid TLD from known list, not a URL or email

IP Addresses

  • IPv4 Pattern: Standard dotted-decimal notation
  • IPv6 Pattern: Full and compressed formats
  • Examples: 192.168.1.1, ::1, 2001:db8::1
  • Validation: Valid octet ranges for IPv4, proper format for IPv6

File Paths

  • POSIX Pattern: Paths starting with /
  • Windows Pattern: Drive letters (C:\) or relative paths
  • UNC Pattern: \\server\share format
  • Examples: /etc/passwd, C:\Windows\System32, \\server\share\file

Registry Paths

  • Pattern: HKEY_* or HK*\ prefixes
  • Examples: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft
  • Validation: Must start with valid registry root key

GUIDs

  • Pattern: \{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}
  • Examples: {12345678-1234-1234-1234-123456789abc}
  • Validation: Strict format compliance with braces required

Email Addresses

  • Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  • Examples: admin@malware.com, user.name+tag@example.co.uk
  • Validation: Single @, valid TLD length and characters, no empty parts

Base64 Data

  • Pattern: [A-Za-z0-9+/]{20,}={0,2}
  • Examples: U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==
  • Validation: Length >= 20, length divisible by 4, padding rules, entropy threshold
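The structural checks can be sketched as below (the entropy threshold is omitted; looks_like_base64 is an illustrative name, not Stringy API):

```rust
// Structural Base64 validation: minimum length, length divisible by 4,
// at most two '=' padding characters at the end, and a valid alphabet.
fn looks_like_base64(s: &str) -> bool {
    if s.len() < 20 || s.len() % 4 != 0 {
        return false;
    }
    let trimmed = s.trim_end_matches('=');
    // At most two '=' padding characters, only at the end.
    if s.len() - trimmed.len() > 2 {
        return false;
    }
    trimmed
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}
```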

Format Strings

  • Pattern: %[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}
  • Examples: Error: %s at line %d, User {0} logged in
  • Validation: Reasonable specifier count, context-aware thresholds

User Agents

  • Pattern: Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+
  • Examples: Mozilla/5.0 (Windows NT 10.0; Win64; x64), Chrome/117.0.5938.92
  • Validation: Known browser identifiers and minimum length

Pattern Matching Engine

The semantic classifier uses cached regex patterns via once_cell::sync::Lazy and applies validation checks to reduce false positives.

use once_cell::sync::Lazy;
use regex::Regex;

static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$")
        .expect("Invalid GUID regex")
});

Using the Classification System

use stringy::classification::SemanticClassifier;
use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};

let classifier = SemanticClassifier::new();
let context = StringContext::new(
    SectionType::StringData,
    BinaryFormat::Elf,
    Encoding::Ascii,
    StringSource::SectionData,
)
.with_section_name(".rodata".to_string());

let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);
if tags.contains(&Tag::Guid) {
    // Handle GUID indicator
}

Validation Rules

  • GUID: Braced, hyphenated, hex-only format.
  • Email: TLD must be alphabetic and 2-24 characters long; domain must include a dot.
  • Base64: Length must be divisible by 4, padding allowed only at the end, entropy threshold applied.
  • Format String: Must contain at least one specifier and pass context-aware length checks.
  • User Agent: Must contain a known browser token and meet minimum length.

Performance Notes

  • Regexes are compiled once via once_cell::sync::Lazy and reused across calls.
  • Minimum length checks avoid unnecessary regex work on short inputs.
  • The classifier is stateless and thread-safe.

Testing

  • Unit tests: tests/classification_tests.rs
  • Integration tests: tests/classification_integration_tests.rs

Run tests with:

just test

Ranking Algorithm

Stringy’s ranking system prioritizes strings by relevance, helping analysts focus on the most important findings first. The algorithm combines multiple factors to produce a comprehensive relevance score.

Scoring Formula

Final Score = SectionWeight + SemanticBoost - NoisePenalty

Each component contributes to the overall relevance assessment. The resulting internal score is then mapped to a display score (0-100) via band mapping.

Note: Section weights use a 1.0-10.0 scale, and semantic boosts add to the internal score. The pipeline’s normalizer then maps the combined internal score to a 0-100 display score using the band table shown in Display Score Mapping below.

Section Weight

Different sections have varying likelihood of containing meaningful strings. Container parsers assign weights (1.0-10.0) to each section based on its type and name.

Weight Ranges

Section Type             | Typical Weight | Examples
-------------------------|----------------|---------------------------------
Dedicated string storage | 8.0-10.0       | .rodata, __TEXT,__cstring, .rsrc
Read-only data           | 7.0            | .data.rel.ro, __DATA_CONST
General data             | 5.0            | .data
Code sections            | 1.0            | .text

Format-specific adjustments are applied based on section names. For example, ELF .rodata.str1.1 (aligned strings) and PE .rsrc (rich resources) receive additional priority.

Semantic Boost

Strings with recognized semantic meaning receive score boosts based on their tags.

Boost Categories

Tag Category                            | Boost Level | Examples
----------------------------------------|-------------|-----------------------------
Network (URL, Domain, IP)               | High        | https://api.evil.com
Identifiers (GUID, Email)               | High        | {12345678-1234-...}
File System (Path, Registry)            | Medium-High | C:\Windows\System32\evil.dll
User-Agent-like strings                 | Medium-High | Mozilla/5.0 ...
Version/Manifest                        | Medium      | MyApp v1.2.3
Code Artifacts (Format strings, Base64) | Medium      | Error: %s at line %d
Symbols (Import, Export)                | Low-Medium  | CreateFileW, main

Strings with multiple semantic tags receive additional (diminishing) bonuses for each extra tag.
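One plausible diminishing-bonus scheme is a geometric series: each tag beyond the first adds half of the previous tag's bonus. This is illustrative only; the actual decay schedule is an implementation detail.

```rust
// Illustrative diminishing bonus: the second tag adds base/2, the
// third adds base/4, and so on. Not Stringy's actual formula.
fn multi_tag_bonus(tag_count: usize, base_bonus: f32) -> f32 {
    if tag_count <= 1 {
        return 0.0;
    }
    (1..tag_count)
        .map(|i| base_bonus / 2f32.powi(i as i32))
        .sum()
}
```

With this scheme the total extra bonus is bounded by base_bonus no matter how many tags a string carries.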

Noise Penalty

Various factors indicate low-quality or noisy strings, and receive penalties:

Penalty Categories

  • High Entropy: Strings with high Shannon entropy (randomness) are likely binary data or encoded content and receive significant penalties.

  • Excessive Length: Very long strings are often noise (padding, embedded data). Longer strings receive progressively larger penalties.

  • Repeated Patterns: Strings with excessive character repetition (e.g., AAAAAAA...) are penalized based on the repetition ratio.

  • Common Noise Patterns: Known noise patterns receive penalties, including padding characters, hex dump patterns, and table-like data with excessive delimiters.
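The repetition penalty can be sketched with a simple repetition-ratio measure (illustrative, not the exact implementation): 0.0 for fully varied text, approaching 1.0 for runs of a single character like "AAAAAAAA".

```rust
// Fraction of characters that are repeats of already-seen characters.
// "AAAAAAAA" scores 0.875; "abcdefgh" scores 0.0.
fn repetition_ratio(s: &str) -> f32 {
    let chars: Vec<char> = s.chars().collect();
    if chars.is_empty() {
        return 0.0;
    }
    let mut distinct = chars.clone();
    distinct.sort_unstable();
    distinct.dedup();
    1.0 - distinct.len() as f32 / chars.len() as f32
}
```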

Display Score Mapping

The internal score is mapped to a display score (0-100) using bands:

Internal Score | Display Score | Meaning
---------------|---------------|--------------
<= 0           | 0             | Low relevance
1-79           | 1-49          | Low relevance
80-119         | 50-69         | Moderate
120-159        | 70-89         | Meaningful
160-220        | 90-100        | High-value
> 220          | 100 (clamped) | High-value
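A hypothetical band-mapping function matching this table, with linear interpolation inside each band, might look like the following (the real normalizer may differ in how it interpolates):

```rust
// Map an internal relevance score to a 0-100 display score using the
// band table above. Linear interpolation within bands is an assumption.
fn display_score(internal: i32) -> i32 {
    let map = |x: i32, (lo, hi): (i32, i32), (dlo, dhi): (i32, i32)| {
        dlo + (x - lo) * (dhi - dlo) / (hi - lo)
    };
    match internal {
        i if i <= 0 => 0,
        i if i <= 79 => map(i, (1, 79), (1, 49)),
        i if i <= 119 => map(i, (80, 119), (50, 69)),
        i if i <= 159 => map(i, (120, 159), (70, 89)),
        i if i <= 220 => map(i, (160, 220), (90, 100)),
        _ => 100,
    }
}
```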

Filtering Recommendations

  • Interactive analysis: Show display scores >= 50
  • Automated processing: Use display scores >= 70
  • YARA rules: Focus on display scores >= 80
  • High-confidence indicators: Display scores >= 90

Contributing to Stringy

We welcome contributions to Stringy! This guide will help you get started with development, testing, and submitting changes.

Development Setup

Prerequisites

  • Rust: 1.91 or later (MSRV - Minimum Supported Rust Version)
  • Git: For version control
  • just: Task runner (install via cargo install just or your package manager)
  • mise: Tool version manager (manages Zig and other dev tools)
  • Zig: Cross-compiler for test fixtures (managed by mise)

Clone and Setup

git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy

# Generate test fixtures (ELF/PE/Mach-O via Zig cross-compilation)
just gen-fixtures

# Run the full check suite
just check

Development Tools

Install recommended tools for development:

# Code formatting
rustup component add rustfmt

# Linting
rustup component add clippy

# Documentation
cargo install mdbook

# Test runner (required by just recipes)
cargo install cargo-nextest

# Coverage (optional)
cargo install cargo-llvm-cov

Project Structure

src/
+-- main.rs              # CLI entry point (thin wrapper)
+-- lib.rs               # Library root and public API re-exports
+-- types/
|   +-- mod.rs           # Core data structures (Tag, FoundString, Encoding, etc.)
|   +-- error.rs         # StringyError enum, Result alias
+-- container/
|   +-- mod.rs           # Format detection, ContainerParser trait
|   +-- elf.rs           # ELF parser
|   +-- pe.rs            # PE parser
|   +-- macho.rs         # Mach-O parser
+-- extraction/
|   +-- mod.rs           # Extraction orchestration
|   +-- ascii/           # ASCII/UTF-8 extraction
|   +-- utf16/           # UTF-16LE/BE extraction
|   +-- dedup/           # Deduplication with scoring
|   +-- filters/         # Noise filter implementations
|   +-- pe_resources/    # PE version info, manifests, string tables
+-- classification/
|   +-- mod.rs           # Classification framework
|   +-- patterns/        # Regex-based pattern matching
|   +-- symbols.rs       # Symbol processing and demangling
|   +-- ranking.rs       # Scoring algorithm
+-- output/
|   +-- mod.rs           # OutputFormat, OutputMetadata, dispatch
|   +-- json.rs          # JSONL format
|   +-- table/           # TTY and plain text table formatting
|   +-- yara/            # YARA rule generation
+-- pipeline/
    +-- mod.rs           # Pipeline::run orchestration
    +-- config.rs        # PipelineConfig, FilterConfig, EncodingFilter
    +-- filter.rs        # Post-extraction filtering
    +-- normalizer.rs    # Score band mapping

tests/
+-- integration_cli.rs           # CLI argument and flag tests
+-- integration_cli_errors.rs    # CLI error handling tests
+-- integration_elf.rs           # ELF-specific tests
+-- integration_pe.rs            # PE-specific tests
+-- integration_macho.rs         # Mach-O-specific tests
+-- integration_extraction.rs    # Extraction tests
+-- integration_flows_1_5.rs     # End-to-end flow tests (1-5)
+-- integration_flows_6_8.rs     # End-to-end flow tests (6-8)
+-- ... (additional test files)
+-- fixtures/                    # Test binary files (flat structure)
+-- snapshots/                   # Insta snapshot files

docs/
+-- src/                 # mdbook documentation
+-- book.toml            # Documentation config

Development Workflow

1. Create a Branch

git checkout -b feature/your-feature-name
# or
git checkout -b fix/issue-description

2. Make Changes

Use just recipes for development commands:

# Format code
just format

# Lint (clippy with -D warnings)
just lint

# Run tests
just test

# Full pre-commit check (fmt + lint + test)
just check

# Full CI suite locally
just ci-check

3. Test Your Changes

# Generate fixtures if needed
just gen-fixtures

# Run all tests
just test

# Run a specific test
cargo nextest run test_name

# Regenerate snapshots after changing test_binary.c
INSTA_UPDATE=always cargo nextest run

4. Update Documentation

If your changes affect the public API or add new features:

# Update API docs
cargo doc --open

# Update user documentation
cd docs
mdbook serve --open

Coding Standards

Rust Style

  • Use cargo fmt for formatting
  • Follow cargo clippy recommendations (warnings are errors)
  • No unsafe code (#![forbid(unsafe_code)] is enforced)
  • Zero warnings policy
  • ASCII only in source code (no emojis, em-dashes, smart quotes)
  • Files under 500 lines; split larger files into module directories
  • No blanket #[allow] without inline justification

Testing

Write comprehensive tests:

  • Use insta for snapshot testing
  • Binary fixtures in tests/fixtures/ (flat structure)
  • Integration tests use two naming patterns: integration_*.rs and test_*.rs
  • Use assert_cmd for CLI testing (note: assert_cmd is non-TTY)

Contribution Areas

High-Priority Areas

  1. String Extraction Engine - UTF-16 detection improvements, noise filtering enhancements
  2. Classification System - New semantic patterns, improved confidence scoring
  3. Output Formats - Customization options, additional format support

Getting Started Ideas

  • Add new semantic patterns (email formats, crypto constants)
  • Improve test coverage
  • Enhance error messages
  • Add documentation examples

Submitting Changes

Pull Request Process

  1. Fork the repository on GitHub
  2. Create a feature branch from main
  3. Make your changes following the guidelines above
  4. Add tests for new functionality (this is policy, not optional)
  5. Sign off commits with git commit -s (DCO enforced by GitHub App)
  6. Submit a pull request with a clear description

PR Requirements

  • Sign off commits with git commit -s (DCO enforced)
  • Pass CI (clippy, rustfmt, tests, CodeQL, cargo-deny)
  • Include tests for new functionality
  • Be reviewed (human or CodeRabbit) for correctness, safety, and style
  • No unwrap() in library code; no unchecked errors or unvalidated input

Review Process

  1. Automated checks must pass (CI/CD)
  2. Code review by maintainers
  3. Testing on multiple platforms
  4. Documentation review if applicable
  5. Merge after approval

Community Guidelines

Getting Help

  • GitHub Issues: Bug reports and feature requests
  • Discussions: General questions and ideas
  • Documentation: Check existing docs first

Release Process

Version Numbering

We follow Semantic Versioning:

  • MAJOR: Breaking changes
  • MINOR: New features (backward compatible)
  • PATCH: Bug fixes (backward compatible)

Release Checklist

  1. Update version in Cargo.toml
  2. Update changelog via git-cliff
  3. Run full test suite
  4. Update documentation
  5. Create release tag (vX.Y.Z)
  6. Releases are built via cargo-dist

Thank you for contributing to Stringy!

API Documentation

This page provides an overview of Stringy’s public API. For complete API documentation, run cargo doc --open in the project directory.

Core Types

FoundString

The primary data structure representing an extracted string with metadata.

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FoundString {
    /// The extracted string text
    pub text: String,
    /// Pre-demangled form (if symbol was demangled)
    pub original_text: Option<String>,
    /// The encoding used for this string
    pub encoding: Encoding,
    /// File offset where the string was found
    pub offset: u64,
    /// Relative Virtual Address (if available)
    pub rva: Option<u64>,
    /// Section name where the string was found
    pub section: Option<String>,
    /// Length of the string in bytes
    pub length: u32,
    /// Semantic tags applied to this string
    pub tags: Vec<Tag>,
    /// Relevance score for ranking
    pub score: i32,
    /// Section weight component of score (debug only)
    pub section_weight: Option<i32>,
    /// Semantic boost component of score (debug only)
    pub semantic_boost: Option<i32>,
    /// Noise penalty component of score (debug only)
    pub noise_penalty: Option<i32>,
    /// Display score 0-100, populated by ScoreNormalizer in all non-raw executions
    pub display_score: Option<i32>,
    /// Source of the string (section data, import, etc.)
    pub source: StringSource,
    /// UTF-16 confidence score
    pub confidence: f32,
}

Encoding

Supported string encodings.

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

Tag

Semantic classification tags.

#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub enum Tag {
    Url,
    Domain,
    IPv4,
    IPv6,
    FilePath,
    RegistryPath,
    Guid,
    Email,
    Base64,
    FormatString,
    UserAgent,
    DemangledSymbol,
    Import,
    Export,
    Version,
    Manifest,
    Resource,
    DylibPath,
    Rpath,
    RpathVariable,
    FrameworkPath,
}

EncodingFilter

Filter for restricting output by string encoding, corresponding to the --enc CLI flag.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EncodingFilter {
    /// Match a specific encoding exactly
    Exact(Encoding),
    /// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
    Utf16Any,
}

Used with FilterConfig to limit results to a specific encoding. Utf16Any matches both Utf16Le and Utf16Be.
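The matching behavior can be sketched with local stand-in types mirroring the enums above (the matches method here is assumed, not the published API):

```rust
// Stand-in types mirroring Encoding and EncodingFilter; the matches()
// semantics follow the description above.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

enum EncodingFilter {
    Exact(Encoding),
    Utf16Any,
}

impl EncodingFilter {
    fn matches(&self, enc: Encoding) -> bool {
        match self {
            EncodingFilter::Exact(e) => *e == enc,
            EncodingFilter::Utf16Any => {
                matches!(enc, Encoding::Utf16Le | Encoding::Utf16Be)
            }
        }
    }
}
```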

FilterConfig

Post-extraction filtering configuration. All fields have sensible defaults; empty tag vectors are no-ops.

pub struct FilterConfig {
    /// Minimum string length to include (default: 4)
    pub min_length: usize,          // --min-len
    /// Restrict to a specific encoding
    pub encoding: Option<EncodingFilter>, // --enc
    /// Only include strings with these tags (empty = no filter)
    pub include_tags: Vec<Tag>,     // --only-tags
    /// Exclude strings with these tags (empty = no filter)
    pub exclude_tags: Vec<Tag>,     // --no-tags
    /// Limit output to top N strings by score
    pub top_n: Option<usize>,      // --top
}

Builder-style construction:

let config = FilterConfig::new()
    .with_min_length(6)
    .with_encoding(EncodingFilter::Exact(Encoding::Utf8))
    .with_include_tags(vec![Tag::Url, Tag::Domain])
    .with_top_n(20);

Main API Functions

BasicExtractor::extract

Extract strings from binary data using the BasicExtractor, which implements the StringExtractor trait.

pub trait StringExtractor {
    fn extract(
        &self,
        data: &[u8],
        container_info: &ContainerInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;
}

Parameters:

  • data: Binary data to analyze
  • container_info: Parsed container metadata (sections, imports, exports)
  • config: Extraction configuration options

Returns:

  • Result<Vec<FoundString>>: Extracted strings with metadata

Example:

use stringy::{BasicExtractor, ExtractionConfig, StringExtractor};
use stringy::container::{detect_format, create_parser};

let data = std::fs::read("binary.exe")?;
let format = detect_format(&data);
let parser = create_parser(format)?;
let container_info = parser.parse(&data)?;

let extractor = BasicExtractor::new();
let config = ExtractionConfig::default();
let strings = extractor.extract(&data, &container_info, &config)?;

for string in strings {
    println!("{}: {}", string.score, string.text);
}

detect_format

Detect the binary format of the given data.

pub fn detect_format(data: &[u8]) -> BinaryFormat

Parameters:

  • data: Binary data to analyze

Returns:

  • BinaryFormat: Detected format (ELF, PE, MachO, or Unknown)

Example:

use stringy::detect_format;

let data = std::fs::read("binary")?;
let format = detect_format(&data);
println!("Detected format: {:?}", format);

Configuration

ExtractionConfig

Configuration options for string extraction. The struct has 16 fields with sensible defaults.

pub struct ExtractionConfig {
    /// Minimum string length in bytes (default: 1)
    pub min_length: usize,
    /// Maximum string length in bytes (default: 4096)
    pub max_length: usize,
    /// Whether to scan executable sections (default: true)
    pub scan_code_sections: bool,
    /// Whether to include debug sections (default: false)
    pub include_debug: bool,
    /// Section types to prioritize (default: StringData, ReadOnlyData, Resources)
    pub section_priority: Vec<SectionType>,
    /// Whether to include import/export names (default: true)
    pub include_symbols: bool,
    /// Minimum length for ASCII strings (default: 1)
    pub min_ascii_length: usize,
    /// Minimum length for UTF-16 strings (default: 1)
    pub min_wide_length: usize,
    /// Which encodings to extract (default: ASCII, UTF-8)
    pub enabled_encodings: Vec<Encoding>,
    /// Enable/disable noise filtering (default: true)
    pub noise_filtering_enabled: bool,
    /// Minimum confidence threshold (default: 0.5)
    pub min_confidence_threshold: f32,
    /// Minimum UTF-16LE confidence threshold (default: 0.7)
    pub utf16_min_confidence: f32,
    /// Which UTF-16 byte order(s) to scan (default: Auto)
    pub utf16_byte_order: ByteOrder,
    /// Minimum UTF-16-specific confidence threshold (default: 0.5)
    pub utf16_confidence_threshold: f32,
    /// Enable/disable deduplication (default: true)
    pub enable_deduplication: bool,
    /// Deduplication threshold (default: None)
    pub dedup_threshold: Option<usize>,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 1,
            max_length: 4096,
            scan_code_sections: true,
            include_debug: false,
            section_priority: vec![
                SectionType::StringData,
                SectionType::ReadOnlyData,
                SectionType::Resources,
            ],
            include_symbols: true,
            min_ascii_length: 1,
            min_wide_length: 1,
            enabled_encodings: vec![Encoding::Ascii, Encoding::Utf8],
            noise_filtering_enabled: true,
            min_confidence_threshold: 0.5,
            utf16_min_confidence: 0.7,
            utf16_byte_order: ByteOrder::Auto,
            utf16_confidence_threshold: 0.5,
            enable_deduplication: true,
            dedup_threshold: None,
        }
    }
}

SemanticClassifier

The SemanticClassifier is constructed via SemanticClassifier::new() and currently has no configuration options. Classification patterns are built-in.

Pipeline Components

ScoreNormalizer

Maps internal relevance scores to a 0-100 display scale using band mapping.

let normalizer = ScoreNormalizer::new();
normalizer.normalize(&mut strings);
// Each FoundString now has display_score populated

Invoked unconditionally by the pipeline in all non-raw executions. Negative internal scores map to display_score = 0. See Ranking for the full band-mapping table.

FilterEngine

Applies post-extraction filtering and sorting. Consumes the input vector and returns a filtered, sorted result.

let engine = FilterEngine::new();
let filtered = engine.apply(strings, &filter_config);

Filter order:

  1. Minimum length (min_length)
  2. Encoding match (encoding)
  3. Include tags (include_tags – keep only strings with at least one matching tag)
  4. Exclude tags (exclude_tags – remove strings with any matching tag)
  5. Stable sort by score (descending), then offset (ascending), then text (ascending)
  6. Top-N truncation (top_n)
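The stable sort in step 5 can be sketched with a minimal stand-in struct rather than the real FoundString:

```rust
// Stable sort: score descending, then offset ascending, then text
// ascending. Rust's sort_by is stable, so equal keys keep their order.
struct Found {
    score: i32,
    offset: u64,
    text: String,
}

fn sort_found(strings: &mut [Found]) {
    strings.sort_by(|a, b| {
        b.score
            .cmp(&a.score)
            .then_with(|| a.offset.cmp(&b.offset))
            .then_with(|| a.text.cmp(&b.text))
    });
}
```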

Example: FilterConfig + FilterEngine

use stringy::{FilterConfig, FilterEngine, EncodingFilter, Encoding, Tag};

let config = FilterConfig::new()
    .with_min_length(6)
    .with_include_tags(vec![Tag::Url, Tag::Domain])
    .with_top_n(10);

let engine = FilterEngine::new();
let results = engine.apply(strings, &config);
// results contains at most 10 strings, all >= 6 chars,
// all tagged Url or Domain, sorted by score descending

Container Parsing

ContainerParser Trait

Trait for implementing binary format parsers.

pub trait ContainerParser {
    /// Detect if this parser can handle the given data
    fn detect(data: &[u8]) -> bool
    where
        Self: Sized;

    /// Parse the container and extract metadata
    fn parse(&self, data: &[u8]) -> Result<ContainerInfo>;
}
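A minimal detect() implementation for a hypothetical parser might check the format's magic bytes. The \x7fELF signature below is the real ELF magic; MyElfParser is illustrative.

```rust
// Hypothetical parser type; detect() checks the real ELF magic bytes.
struct MyElfParser;

impl MyElfParser {
    fn detect(data: &[u8]) -> bool {
        // Every ELF file begins with 0x7F followed by "ELF".
        data.starts_with(b"\x7fELF")
    }
}
```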

ContainerInfo

Information about a parsed binary container.

pub struct ContainerInfo {
    /// The binary format detected
    pub format: BinaryFormat,
    /// List of sections in the binary
    pub sections: Vec<SectionInfo>,
    /// Import information
    pub imports: Vec<ImportInfo>,
    /// Export information
    pub exports: Vec<ExportInfo>,
    /// Resource metadata (PE format only)
    pub resources: Option<Vec<ResourceMetadata>>,
}

SectionInfo

Information about a section within the binary.

pub struct SectionInfo {
    /// Section name
    pub name: String,
    /// File offset of the section
    pub offset: u64,
    /// Size of the section in bytes
    pub size: u64,
    /// Relative Virtual Address (if available)
    pub rva: Option<u64>,
    /// Classification of the section type
    pub section_type: SectionType,
    /// Whether the section is executable
    pub is_executable: bool,
    /// Whether the section is writable
    pub is_writable: bool,
    /// Weight indicating likelihood of containing meaningful strings (1.0-10.0)
    pub weight: f32,
}

Output Formatting

OutputFormatter Trait

Trait for implementing output formatters.

pub trait OutputFormatter {
    /// Returns the name of this formatter
    fn name(&self) -> &'static str;

    /// Format the strings for output
    fn format(&self, strings: &[FoundString], metadata: &OutputMetadata) -> Result<String>;
}

Built-in Formatters

The library provides free functions rather than formatter structs:

  • format_table(strings, metadata) - Human-readable table format (TTY-aware)
  • format_json(strings, metadata) - JSONL format
  • format_yara(strings, metadata) - YARA rule format
  • format_output(strings, metadata) - Dispatches based on metadata.output_format

Example:

use stringy::output::{format_json, OutputMetadata};

let metadata = OutputMetadata::new("binary.exe".to_string());
let output = format_json(&strings, &metadata)?;
println!("{}", output);

Error Handling

StringyError

Comprehensive error type for the library.

#[derive(Debug, thiserror::Error)]
pub enum StringyError {
    #[error("Unsupported file format (supported: ELF, PE, Mach-O)")]
    UnsupportedFormat,

    #[error("File I/O error: {0}")]
    IoError(#[from] std::io::Error),

    #[error("Binary parsing error: {0}")]
    ParseError(String),

    #[error("Invalid encoding in string at offset {offset}")]
    EncodingError { offset: u64 },

    #[error("Configuration error: {0}")]
    ConfigError(String),

    #[error("Serialization error: {0}")]
    SerializationError(String),

    #[error("Validation error: {0}")]
    ValidationError(String),

    #[error("Memory mapping error: {0}")]
    MemoryMapError(String),
}

Result Type

Convenient result type alias.

pub type Result<T> = std::result::Result<T, StringyError>;

Advanced Usage

Custom Classification

Implement custom semantic classifiers:

use stringy::classification::{ClassificationResult, Classifier};

pub struct CustomClassifier {
    // Custom implementation
}

impl Classifier for CustomClassifier {
    fn classify(&self, text: &str, context: &StringContext) -> Vec<ClassificationResult> {
        // Custom classification logic
        vec![]
    }
}

Memory-Mapped Files

For large files, use memory mapping via mmap-guard:

let data = mmap_guard::map_file(path)?;
// data implements Deref<Target = [u8]>

Note: The Pipeline::run API handles memory mapping automatically. Direct use of mmap_guard is only needed when using lower-level APIs.

Parallel Processing

Parallel processing is not yet implemented. Stringy currently processes files sequentially. The Pipeline API processes one file at a time.

Feature Flags

Stringy currently has no optional feature flags. All functionality is included by default.

Examples

Basic String Extraction (Pipeline API)

use stringy::pipeline::{Pipeline, PipelineConfig};
use std::path::Path;

fn main() -> stringy::Result<()> {
    let config = PipelineConfig::default();
    let pipeline = Pipeline::new(config);
    pipeline.run(Path::new("binary.exe"))?;
    Ok(())
}

Filtered Extraction

use stringy::{BasicExtractor, ExtractionConfig, StringExtractor, Tag};
use stringy::container::{detect_format, create_parser};

fn extract_network_indicators(data: &[u8]) -> stringy::Result<Vec<String>> {
    let format = detect_format(data);
    let parser = create_parser(format)?;
    let container_info = parser.parse(data)?;

    let extractor = BasicExtractor::new();
    let config = ExtractionConfig::default();
    let strings = extractor.extract(data, &container_info, &config)?;

    let network_strings: Vec<String> = strings
        .into_iter()
        .filter(|s| {
            s.tags
                .iter()
                .any(|tag| matches!(tag, Tag::Url | Tag::Domain | Tag::IPv4 | Tag::IPv6))
        })
        .filter(|s| s.score >= 70)
        .map(|s| s.text)
        .collect();

    Ok(network_strings)
}

Custom Output Format

use serde_json::json;
use stringy::output::{OutputMetadata, OutputFormatter};
use stringy::FoundString;

pub struct CustomFormatter;

impl OutputFormatter for CustomFormatter {
    fn name(&self) -> &'static str {
        "custom"
    }

    fn format(&self, strings: &[FoundString], _metadata: &OutputMetadata) -> stringy::Result<String> {
        let output = json!({
            "total_strings": strings.len(),
            "high_confidence": strings.iter().filter(|s| s.score >= 80).count(),
            "strings": strings.iter().take(20).collect::<Vec<_>>()
        });

        Ok(serde_json::to_string_pretty(&output)?)
    }
}

For complete API documentation with all methods and implementation details, run:

cargo doc --open

Configuration

Note

The configuration file system described below is planned but not yet implemented. Stringy currently uses CLI flags exclusively for configuration. See the CLI Reference for available options.

Stringy provides extensive configuration options to customize string extraction, classification, and output formatting. Configuration is currently supplied via command-line arguments or programmatically through the API; configuration file support is planned.

Configuration File

Note: Configuration file support is planned for future releases.

Default Location

~/.config/stringy/config.toml

Example Configuration

[extraction]
min_ascii_len = 4
min_utf16_len = 3
max_string_len = 1024
encodings = ["ascii", "utf16le"]
include_debug = false
include_symbols = true

[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
detect_emails = true
detect_base64 = true
detect_format_strings = true
min_confidence = 0.7

[output]
format = "human"
max_results = 100
show_scores = true
show_offsets = true
color = true

[ranking]
section_weight_multiplier = 1.0
semantic_boost_multiplier = 1.0
noise_penalty_multiplier = 1.0

# Profile-specific configurations
[profiles.security]
encodings = ["ascii", "utf8", "utf16le"]
min_ascii_len = 6
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath"]
min_score = 70

[profiles.yara]
format = "yara"
min_ascii_len = 8
exclude_tags = ["import", "export"]
min_score = 80

[profiles.development]
include_debug = true
include_symbols = true
max_results = 500

Extraction Configuration

String Length Limits

Control the minimum and maximum string lengths:

[extraction]
min_ascii_len = 4     # Minimum ASCII string length
min_utf16_len = 3     # Minimum UTF-16 string length
max_string_len = 1024 # Maximum string length (prevents memory issues)

CLI equivalent:

stringy --min-len 6 --max-len 500 binary

Encoding Selection

Choose which encodings to extract:

[extraction]
encodings = ["ascii", "utf8", "utf16le", "utf16be"]

Available encodings:

  • ascii: 7-bit ASCII
  • utf8: UTF-8 (includes ASCII)
  • utf16le: UTF-16 Little Endian
  • utf16be: UTF-16 Big Endian

CLI equivalent:

stringy --enc ascii,utf16le binary
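UTF-16 is where encoding awareness matters most: utf16le text stores each ASCII character as a little-endian byte pair, which is why plain `strings` emits interleaved noise on it. A minimal, self-contained sketch of LE decoding (illustrative only, not stringy's internals):

```rust
/// Decode a run of UTF-16LE bytes the way an extractor might: pair the
/// bytes little-endian into code units, then validate them as UTF-16.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.is_empty() || bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|p| u16::from_le_bytes([p[0], p[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    // "Hi!" as stored in a Windows binary: each ASCII byte followed by a
    // zero byte -- the interleaving that confuses byte-oriented scanners.
    let raw = [0x48, 0x00, 0x69, 0x00, 0x21, 0x00];
    assert_eq!(decode_utf16le(&raw).as_deref(), Some("Hi!"));
    assert_eq!(decode_utf16le(&[0x48]), None); // odd length: not UTF-16
}
```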

Section Filtering

Control which sections to analyze:

[extraction]
include_sections = [".rodata", ".rdata", "__cstring"]
exclude_sections = [".debug_info", ".comment"]
include_debug = false
include_resources = true

CLI equivalent:

stringy --sections .rodata,.rdata --no-debug binary

Symbol Processing

Configure import/export symbol handling:

[extraction]
include_symbols = true
demangle_rust = true
demangle_cpp = false   # Future feature

CLI equivalent (the flags below disable these options):

stringy --no-symbols --no-demangle binary

Classification Configuration

Pattern Detection

Enable/disable specific semantic patterns:

[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
detect_emails = true
detect_base64 = true
detect_format_strings = true
detect_user_agents = true

Confidence Thresholds

Set minimum confidence levels:

[classification]
min_confidence = 0.7         # Overall minimum confidence
url_min_confidence = 0.8     # URL-specific threshold
domain_min_confidence = 0.75 # Domain-specific threshold
path_min_confidence = 0.6    # File path threshold
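The tag-specific keys act as overrides of the global floor. Resolution might look like this (a sketch of the configuration semantics, not stringy's code):

```rust
use std::collections::HashMap;

/// Resolve the effective confidence threshold for a tag: a tag-specific
/// value wins, otherwise the global min_confidence applies.
fn threshold_for(tag: &str, per_tag: &HashMap<&str, f64>, global: f64) -> f64 {
    per_tag.get(tag).copied().unwrap_or(global)
}

fn main() {
    let per_tag = HashMap::from([("url", 0.8), ("domain", 0.75), ("path", 0.6)]);
    assert_eq!(threshold_for("url", &per_tag, 0.7), 0.8);
    assert_eq!(threshold_for("guid", &per_tag, 0.7), 0.7); // falls back to global
}
```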

Custom Patterns

Add custom regex patterns:

[classification.custom_patterns]
api_key = 'api[_-]?key["\s]*[:=]["\s]*[a-zA-Z0-9]{20,}'
crypto_address = '(bc1|[13])[a-zA-HJ-NP-Z0-9]{25,62}'
jwt_token = 'eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+'
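For instance, the jwt_token pattern above targets three dot-separated base64url segments, the first starting with "eyJ" (the base64 encoding of `{"`, how JWT headers begin). A rough stdlib-only version of the same shape check:

```rust
/// Loose structural check for a JWT-shaped string: three non-empty
/// base64url segments separated by dots, header starting with "eyJ".
fn looks_like_jwt(s: &str) -> bool {
    let parts: Vec<&str> = s.split('.').collect();
    parts.len() == 3
        && parts[0].starts_with("eyJ")
        && parts.iter().all(|p| {
            !p.is_empty()
                && p.bytes()
                    .all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_')
        })
}

fn main() {
    assert!(looks_like_jwt("eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.c2lnbmF0dXJl"));
    assert!(!looks_like_jwt("just.three.parts")); // header does not start with eyJ
    assert!(!looks_like_jwt("eyJub2RvdHM"));      // only one segment
}
```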

Ranking Configuration

Weight Adjustments

Customize section and semantic weights:

[ranking.section_weights]
string_data = 40
resources = 35
readonly_data = 25
debug = 15
writable_data = 10
code = 5
other = 0

[ranking.semantic_boosts]
url = 25
domain = 20
guid = 20
filepath = 15
format_string = 10
import = 8
export = 8
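stringy's exact scoring formula is internal; as an illustration of how additive section weights, semantic boosts, and noise penalties could combine into a 0-100 score (the base value of 50 and the clamping are assumptions for this sketch):

```rust
/// Illustrative score composition: a base value plus section weight plus
/// semantic boosts plus (negative) noise penalties, clamped to 0..=100.
fn score(section_weight: i32, boosts: &[i32], penalties: &[i32]) -> u8 {
    let raw = 50 + section_weight + boosts.iter().sum::<i32>()
        + penalties.iter().sum::<i32>();
    raw.clamp(0, 100) as u8
}

fn main() {
    // A URL in a string-data section: 50 + 40 + 25 = 115, clamped to 100.
    assert_eq!(score(40, &[25], &[]), 100);
    // A long, high-entropy blob in a code section: 50 + 5 - 15 - 20 = 20.
    assert_eq!(score(5, &[], &[-15, -20]), 20);
}
```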

Penalty Configuration

Adjust noise detection penalties:

[ranking.penalties]
high_entropy_threshold = 4.5
high_entropy_penalty = -15
length_penalty_threshold = 200
max_length_penalty = -20
repetition_threshold = 0.7
repetition_penalty = -12
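The high_entropy_threshold is presumably compared against Shannon entropy in bits per byte: a run of one repeated value scores 0.0, uniformly random bytes approach 8.0, and English text sits around 4. A self-contained way to compute it:

```rust
use std::collections::HashMap;

/// Shannon entropy of a byte string in bits per byte.
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in data {
        *counts.entry(b).or_insert(0) += 1;
    }
    let n = data.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Padding-like repetition: entropy 0, no penalty.
    assert_eq!(shannon_entropy(b"AAAAAAAAAAAAAAAA"), 0.0);
    // 24 distinct characters: log2(24) ~ 4.58, above the 4.5 threshold.
    assert!(shannon_entropy(b"Qz9xV2kLmP4sT7yB1nRd6HgJ") > 4.5);
}
```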

Output Configuration

Format Selection

Choose default output format:

[output]
format = "human"  # human, json, yara
max_results = 100 # Limit number of results
show_all = false  # Override max_results limit

Display Options

Customize what information to show:

[output]
show_scores = true
show_offsets = true
show_sections = true
show_encodings = true
show_tags = true
color = true                   # Enable colored output
truncate_long_strings = true
max_string_display_length = 80
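An illustrative version of what max_string_display_length controls, counting characters rather than bytes so multi-byte UTF-8 is never split (the `...` marker is an assumption of this sketch):

```rust
/// Truncate a string for display to at most `max` characters,
/// marking the cut with an ellipsis.
fn truncate_display(s: &str, max: usize) -> String {
    if s.chars().count() <= max {
        s.to_string()
    } else {
        let head: String = s.chars().take(max).collect();
        format!("{head}...")
    }
}

fn main() {
    assert_eq!(truncate_display("short", 80), "short");
    assert_eq!(truncate_display("abcdefgh", 4), "abcd...");
}
```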

Filtering

Set default filters:

[output]
min_score = 50
only_tags = []         # Empty = show all tags
exclude_tags = ["b64"] # Exclude Base64 by default

Format-Specific Configuration

PE Configuration

Windows PE-specific options:

[formats.pe]
extract_version_info = true
extract_manifests = true
extract_string_tables = true
prefer_utf16 = true
include_resource_names = true

ELF Configuration

Linux ELF-specific options:

[formats.elf]
include_build_id = true
include_gnu_version = true
process_dynamic_strings = true
include_note_sections = false

Mach-O Configuration

macOS Mach-O-specific options:

[formats.macho]
process_load_commands = true
include_framework_paths = true
process_fat_binaries = "first" # first, all, or specific arch

Performance Configuration

Memory Management

Control memory usage:

[performance]
use_memory_mapping = true
memory_map_threshold = 10485760 # 10MB
max_memory_usage = 1073741824   # 1GB

Parallel Processing

Configure parallelization:

[performance]
enable_parallel = true
max_threads = 0        # 0 = auto-detect
chunk_size = 1048576   # 1MB chunks

Caching

Enable various caches:

[performance]
cache_regex_compilation = true
cache_section_analysis = true
cache_string_hashes = true

Environment Variables

Override configuration with environment variables:

Variable              Description              Example
STRINGY_CONFIG        Config file path         ~/.stringy.toml
STRINGY_MIN_LEN       Minimum string length    6
STRINGY_FORMAT        Output format            json
STRINGY_MAX_RESULTS   Result limit             50
NO_COLOR              Disable colored output   1
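Resolving one of these overrides is a parse-with-fallback, for example STRINGY_MIN_LEN (the fallback-on-malformed-value behavior is an assumption of this sketch):

```rust
use std::env;

/// Resolve the minimum string length, letting an environment variable
/// override the built-in default; malformed values fall back.
fn min_len(raw: Option<String>, default: usize) -> usize {
    raw.and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let resolved = min_len(env::var("STRINGY_MIN_LEN").ok(), 4);
    println!("effective min length: {resolved}");
    // The pure resolution logic is easy to check without touching the env:
    assert_eq!(min_len(Some("6".into()), 4), 6);
    assert_eq!(min_len(Some("not-a-number".into()), 4), 4);
    assert_eq!(min_len(None, 4), 4);
}
```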

Profiles

Use predefined configuration profiles:

Security Analysis Profile

stringy --profile security malware.exe

Equivalent to:

min_ascii_len = 6
encodings = ["ascii", "utf8", "utf16le"]
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath"]
min_score = 70

YARA Development Profile

stringy --profile yara suspicious.dll

Equivalent to:

format = "yara"
min_ascii_len = 8
exclude_tags = ["import", "export"]
min_score = 80
max_results = 50

Development Profile

stringy --profile dev application

Equivalent to:

include_debug = true
include_symbols = true
max_results = 500
show_all_metadata = true

Validation

Configuration validation ensures settings are compatible:

# This would generate a warning
[extraction]
min_ascii_len = 10
max_string_len = 5 # Invalid: min > max

Migration

When upgrading Stringy versions, configuration migration is handled automatically:

# Backup current config
cp ~/.config/stringy/config.toml ~/.config/stringy/config.toml.backup

# Stringy will migrate on first run
stringy --version

Examples

Minimal Configuration

[extraction]
min_ascii_len = 6

[output]
format = "json"
max_results = 50

Comprehensive Security Analysis

[extraction]
min_ascii_len = 6
min_utf16_len = 4
encodings = ["ascii", "utf8", "utf16le"]
include_debug = false

[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
min_confidence = 0.8

[output]
format = "json"
min_score = 70
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath", "guid"]

[ranking.semantic_boosts]
url = 30
domain = 25
ipv4 = 25
ipv6 = 25
filepath = 20
regpath = 20
guid = 20

This flexible configuration system allows Stringy to be adapted for various use cases, from interactive analysis to automated security pipelines.

Performance

Stringy is designed for efficient analysis of binary files, from small executables to large system libraries.

How It Works

Stringy memory-maps input files via mmap-guard for zero-copy access, then processes sections in weight-priority order. Regex patterns for semantic classification are compiled once using LazyLock statics.

The processing pipeline is single-threaded and sequential:

  1. Format detection and section analysis – O(n) where n = number of sections
  2. String extraction – O(m) where m = total section size
  3. Deduplication – hash-based grouping of identical strings
  4. Classification – O(k) where k = number of unique strings
  5. Ranking and sorting – O(k log k)
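Step 3's hash-based grouping can be sketched in isolation (types heavily simplified; the real FoundString keeps richer metadata than a section name and offset):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Hit {
    text: String,
    section: String,
    offset: usize,
}

/// Group identical strings: the first-seen location becomes the canonical
/// metadata, and every later occurrence only bumps a count.
fn dedup(hits: Vec<Hit>) -> Vec<(Hit, usize)> {
    let mut seen: HashMap<String, usize> = HashMap::new(); // text -> index in out
    let mut out: Vec<(Hit, usize)> = Vec::new();
    for h in hits {
        match seen.get(&h.text) {
            Some(&i) => out[i].1 += 1,
            None => {
                seen.insert(h.text.clone(), out.len());
                out.push((h, 1));
            }
        }
    }
    out
}

fn main() {
    let hits = vec![
        Hit { text: "GET".into(), section: ".rodata".into(), offset: 0x10 },
        Hit { text: "GET".into(), section: ".data".into(), offset: 0x80 },
        Hit { text: "POST".into(), section: ".rodata".into(), offset: 0x20 },
    ];
    let unique = dedup(hits);
    assert_eq!(unique.len(), 2);
    assert_eq!(unique[0].1, 2); // "GET" seen twice, first location kept
    println!("{unique:?}");
}
```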

Reducing Processing Time

Use CLI flags to narrow the work Stringy does:

# Limit to top results (skip sorting the long tail)
stringy --top 50 binary

# Increase minimum length to reduce noise and string count
stringy --min-len 8 binary

# Restrict to a single encoding (skip UTF-16 detection)
stringy --enc ascii binary

# Skip classification and ranking entirely
stringy --raw binary

--raw mode is the fastest option – it extracts and deduplicates strings without running the classifier or ranker.

Benchmarking

Stringy includes Criterion benchmarks for core components:

# Run all benchmarks
just bench

# Run a specific benchmark
cargo bench --bench elf
cargo bench --bench pe
cargo bench --bench classification
cargo bench --bench ascii_extraction

Profiling

# CPU profiling with perf (Linux)
perf record --call-graph dwarf -- stringy large_file.exe
perf report

# macOS profiling with Instruments
xcrun xctrace record --template "Time Profiler" --launch -- stringy binary

# Memory profiling
/usr/bin/time -l stringy large_file.exe   # macOS
/usr/bin/time -v stringy large_file.exe   # Linux

Batch Processing

Stringy processes one file per invocation. For batch workflows, use standard Unix tools:

# Process multiple files
find /path/to/binaries -type f -exec stringy --json {} \; > all_strings.jsonl

# Parallel processing with xargs
find /binaries -name "*.exe" -print0 | xargs -0 -P 4 -I {} stringy --json {} > results.jsonl

Troubleshooting

This guide helps resolve common issues when using Stringy. If you don’t find a solution here, please check the GitHub issues or create a new issue.

Installation Issues

“cargo: command not found”

Problem: Rust/Cargo is not installed or not in PATH.

Solution:

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Verify installation
cargo --version

Build Failures

Problem: Compilation errors during cargo build.

Common causes and solutions:

Outdated Rust Version

# Update Rust
rustup update

# Check version (should be 1.91+)
rustc --version

Missing System Dependencies

# Ubuntu/Debian
sudo apt update
sudo apt install build-essential pkg-config

# Fedora/RHEL
sudo dnf groupinstall "Development Tools"
sudo dnf install pkg-config

# macOS
xcode-select --install

Network Issues

# Use alternative registry
export CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse

# Or use offline mode if dependencies are cached
cargo build --offline

Permission Denied

Problem: Cannot execute the binary after installation.

Solution:

# Make binary executable
chmod +x ~/.cargo/bin/stringy

# Or reinstall with proper permissions
cargo install --path . --force

Runtime Issues

“Unsupported file format”

Problem: Stringy cannot detect the binary format.

Diagnosis:

# Check file type
file binary_file

# Check if it's actually a binary
hexdump -C binary_file | head

Note: Unknown or unparseable formats (plain text, etc.) do not trigger this error: Stringy falls back to unstructured raw byte scanning and succeeds. The message only appears if something else goes wrong during parsing.

“Permission denied” when reading files

Problem: Cannot read the target binary file.

Solutions:

# Check file permissions
ls -l binary_file

# Make readable
chmod +r binary_file

# Run with appropriate privileges
sudo stringy system_binary

Output Issues

No Strings Found

Problem: Stringy reports no strings in a binary that should have strings.

Diagnosis:

# Check with traditional strings command
strings binary_file | head -20

# Try different encodings (one at a time)
stringy --enc ascii binary_file
stringy --enc utf8 binary_file
stringy --enc utf16le binary_file
stringy --enc utf16be binary_file

# Lower minimum length
stringy --min-len 1 binary_file

Common causes:

  • Packed or encrypted binary
  • Unusual string encoding
  • Strings in unexpected sections
  • Very short strings below minimum length

Garbled Output

Problem: String output contains garbled or binary characters.

Solutions:

# Force specific encoding
stringy --enc ascii binary_file

# Increase minimum length to filter noise
stringy --min-len 6 binary_file

# Use JSON output which properly escapes invalid sequences
stringy --json binary_file | jq '.text'

# Filter by score
stringy --json binary_file | jq 'select(.score > 70)'

Missing Expected Strings

Problem: Known strings are not appearing in output.

Diagnosis:

# Check if strings exist with traditional tools
strings binary_file | grep "expected_string"

# Try each encoding
stringy --enc ascii binary_file | grep "expected"
stringy --enc utf8 binary_file | grep "expected"
stringy --enc utf16le binary_file | grep "expected"

# Lower score threshold
stringy --json binary_file | jq 'select(.score > 0)' | grep "expected"

Error Messages

“--summary requires a TTY”

Problem: --summary flag used when stdout is piped or redirected.

Solution: --summary only works when stdout is a terminal. Remove --summary when piping output:

# This will error (exit 1):
stringy --summary binary | grep foo

# This works:
stringy --summary binary

Tag overlap error

Problem: Same tag appears in both --only-tags and --no-tags.

Solution: Remove the duplicate tag from one of the two flags. This is a runtime validation error (exit 1).

“Invalid UTF-8 sequence”

Problem: String contains invalid UTF-8 bytes.

Solution: This is usually normal for binary data. Stringy handles this automatically, but you can:

# Use ASCII only to avoid UTF-8 issues
stringy --enc ascii binary_file

# Use JSON output which properly escapes invalid sequences
stringy --json binary_file

“Regex compilation failed”

Problem: Internal regex pattern compilation error.

Solution: This indicates a bug. Please report it with:

# Get version information
stringy --version

File Not Found

Problem: The specified file does not exist.

Exit code: 1

Solution: Check the file path and ensure the file exists:

ls -l /path/to/binary

Performance Issues

Very Slow Processing

Problem: Stringy takes too long to process files.

Solutions:

# Increase minimum length to reduce extraction volume
stringy --min-len 8 large_file.exe

# Limit results
stringy --top 50 large_file.exe

# Use ASCII only
stringy --enc ascii large_file.exe

# Use raw mode (skip classification)
stringy --raw large_file.exe

High Memory Usage

Problem: Stringy uses too much memory.

Solutions:

# Limit results
stringy --top 100 file.exe

# Increase minimum length
stringy --min-len 8 file.exe

# Use raw mode to skip classification
stringy --raw file.exe

Exit Code Reference

Code   Meaning
0      Success (including unknown binary format, empty binary, no filter matches)
1      Runtime error (file not found, tag overlap, --summary in non-TTY)
2      Argument parsing error (invalid flag, flag conflict, invalid tag name)

Debugging Tips

Compare with Traditional Tools

# Compare with standard strings
strings binary_file > traditional.txt
stringy --json binary_file | jq -r '.text' > stringy.txt
diff traditional.txt stringy.txt

Test with Known Good Files

# Test with system binaries
stringy /bin/ls        # Linux
stringy /bin/cat       # Linux
stringy /usr/bin/grep  # macOS

Getting Help

Information to Include in Bug Reports

  1. System information:

    stringy --version
    rustc --version
    uname -a  # Linux/macOS
    
  2. File information:

    file binary_file
    ls -l binary_file
    
  3. Exact command line used

  4. Expected vs actual behavior

Where to Get Help

  1. Documentation: Check this guide and the project documentation
  2. GitHub Issues: Search existing issues or create a new one
  3. Discussions: Use GitHub Discussions for questions and ideas