Introduction
Stringy is a smarter alternative to the standard strings command: it uses binary analysis to extract meaningful strings from executables. Unlike traditional string extraction tools, Stringy focuses on data structures rather than arbitrary byte runs.
Why Stringy?
The standard strings command has several limitations:
- Noise: Dumps every printable byte sequence, including padding and table data
- UTF-16 Issues: Produces interleaved garbage when scanning UTF-16 strings
- No Context: Provides no information about where strings come from
- No Prioritization: Treats all strings equally, regardless of relevance
Stringy addresses these issues by being:
- Data-structure aware: Only extracts strings from actual binary data structures
- Section-aware: Prioritizes meaningful sections like .rodata, .rdata, and __cstring
- Encoding-aware: Properly handles ASCII/UTF-8, UTF-16LE, and UTF-16BE
- Semantically intelligent: Identifies and tags URLs, domains, file paths, GUIDs, etc.
- Ranked: Presents the most relevant strings first
Key Features
Multi-Format Support
- ELF (Linux executables and libraries)
- PE (Windows executables and DLLs)
- Mach-O (macOS executables and frameworks)
Smart String Extraction
- Section-aware extraction prioritizing string-rich sections
- Multi-encoding support (ASCII, UTF-8, UTF-16LE/BE)
- Deduplication with metadata preservation
- Configurable minimum length filtering
Semantic Classification
- Network: URLs, domains, IP addresses
- Filesystem: File paths, registry keys
- Identifiers: GUIDs, email addresses, user agents
- Code: Format strings, Base64 data
- Symbols: Import/export names, demangled symbols
Multiple Output Formats
- Human-readable: Sorted tables for interactive analysis
- JSONL: Machine-readable format for automation
- YARA-friendly: Optimized for security rule creation
Use Cases
Binary Analysis & Reverse Engineering
Extract meaningful strings to understand program functionality, identify libraries, and discover embedded resources.
Malware Analysis
Quickly identify network indicators, file paths, registry keys, and other artifacts of interest in suspicious binaries.
YARA Rule Development
Generate high-confidence string candidates for creating detection rules, with automatic escaping and formatting.
Security Research
Analyze binaries for hardcoded credentials, API endpoints, configuration data, and other security-relevant strings.
Project Status
Stringy is in active development with a solid foundation already in place. The core infrastructure is complete and robust:
Implemented:
- Complete binary format detection (ELF, PE, Mach-O)
- Comprehensive section classification with intelligent weighting
- Import/export symbol extraction from all formats
- String extraction engines (ASCII/UTF-8, UTF-16LE/BE)
- Semantic classification system (URLs, paths, GUIDs, etc.)
- Ranking, scoring, and normalization algorithms
- Output formatters (table, JSONL, YARA)
- Full CLI interface with filtering, encoding, and mode flags
- Noise filtering with multi-layered heuristics
- Type-safe error handling and data structures
- Extensible architecture with trait-based parsers
See the Architecture Overview for technical details and the Contributing guide to get involved.
Installation
Pre-built Binaries
Pre-built binaries for Linux, macOS, and Windows are available on the Releases page.
Download the appropriate archive for your platform, extract it, and place the stringy binary somewhere on your PATH.
From Source
Prerequisites
- Rust: Version 1.91 or later (see rustup.rs if you need to install Rust)
- Git: For cloning the repository
Build and Install
git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy
cargo install --path .
This installs the stringy binary to ~/.cargo/bin/, which should be in your PATH.
Verify Installation
stringy --version
Development Build
For development and testing, Stringy uses just and mise to manage tooling:
git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy
just setup # Install tools and components
just gen-fixtures # Generate test fixtures (requires Zig via mise)
just test # Run tests
If you do not use just, the minimum requirements are:
cargo build --release
cargo test
Troubleshooting
Build Failures
Update Rust to the latest version:
rustup update
Clear the build cache:
cargo clean
cargo build --release
Getting Help
If you encounter issues:
- Check the troubleshooting guide
- Search existing GitHub issues
- Open a new issue with your OS, Rust version (rustc --version), and complete error output
Next Steps
Once installed, see the Quick Start guide to begin using Stringy.
Quick Start
This guide will get you up and running with Stringy in minutes.
Basic Usage
Analyze a Binary
stringy /path/to/binary
Stringy will:
- Detect ELF, PE, or Mach-O format automatically
- Extract ASCII and UTF-16 strings from prioritized sections
- Apply semantic classification (URLs, paths, GUIDs, etc.)
- Rank results by relevance and display them in a table
Example Output (TTY)
String Tags Score Section
------ ---- ----- -------
https://api.example.com/v1/users url 95 .rdata
{12345678-1234-1234-1234-123456789abc} guid 87 .rdata
/usr/local/bin/application filepath 82 __cstring
Error: %s at line %d fmt 78 .rdata
MyApplication v1.2.3 version 75 .rsrc
Common Use Cases
Security Analysis
Extract network indicators and file paths:
stringy --only-tags url --only-tags domain --only-tags filepath --only-tags regpath malware.exe
YARA Rule Development
Generate rule candidates:
stringy --yara --min-len 8 target.bin > candidates.yar
JSON Output for Automation
stringy --json --debug binary.elf | jq 'select(.display_score > 80)'
Extraction-Only Mode
Skip classification and ranking for fast raw extraction:
stringy --raw binary
Understanding the Output
Score Column
Strings are ranked using a display score from 0-100:
- 90-100: High-value indicators (URLs, GUIDs in high-priority sections)
- 70-89: Meaningful strings (file paths, format strings)
- 50-69: Moderate relevance (imports, version info)
- 0-49: Low relevance (short or noisy strings)
See Output Formats for the full band-mapping table.
Tags
Semantic classifications help identify string types:
| Tag | Description | Example |
|---|---|---|
url | Web URLs | https://example.com/api |
domain | Domain names | api.example.com |
ipv4/ipv6 | IP addresses | 192.168.1.1 |
filepath | File paths | /usr/bin/app |
regpath | Registry paths | HKEY_LOCAL_MACHINE\... |
guid | GUIDs/UUIDs | {12345678-1234-...} |
email | Email addresses | user@example.com |
b64 | Base64 data | SGVsbG8gV29ybGQ= |
fmt | Format strings | Error: %s |
import/export | Symbol names | CreateFileW |
demangled | Demangled symbols | std::io::Read::read |
user-agent-ish | User-agent-like strings | Mozilla/5.0 ... |
version | Version strings | v1.2.3 |
manifest | Manifest data | PE/Mach-O embedded XML |
resource | Resource strings | PE VERSIONINFO/STRINGTABLE |
dylib-path | Dynamic library paths | /usr/lib/libfoo.dylib |
rpath | Runtime search paths | /usr/local/lib |
rpath-var | Rpath variables | @loader_path/../lib |
framework-path | Framework paths (macOS) | /System/Library/... |
Sections
Shows where strings were found:
- ELF: .rodata, .data.rel.ro, .comment
- PE: .rdata, .rsrc, version info
- Mach-O: __TEXT,__cstring, __DATA_CONST
Filtering and Options
By String Length
# Minimum 6 characters
stringy --min-len 6 binary
By Encoding
# ASCII only
stringy --enc ascii binary
# UTF-16 only (useful for Windows binaries)
stringy --enc utf16 binary.exe
By Tags
# Only network-related strings
stringy --only-tags url --only-tags domain --only-tags ipv4 --only-tags ipv6 binary
# Exclude Base64 noise
stringy --no-tags b64 binary
Limit Results
# Top 50 results
stringy --top 50 binary
Summary
Append a summary block after table output (TTY only):
stringy --summary binary
Output Formats
Table (Default)
Best for interactive analysis:
stringy binary
JSON Lines
For programmatic processing:
stringy --json binary | jq 'select(.tags[] == "Url")'
YARA Format
For security rule creation:
stringy --yara binary > rule_candidates.yar
Tips and Best Practices
Start Broad, Then Focus
- Run a basic analysis first: stringy binary
- Identify interesting patterns in high-scoring results
- Use filters to focus: --only-tags url --only-tags filepath
Combine with Other Tools
# Find strings, then search for references
stringy --json binary | jq -r 'select(.score > 80) | .text' | xargs -I {} grep -r "{}" /path/to/source
# Extract URLs for further analysis
stringy --only-tags url --json binary | jq -r '.text' | sort -u
Performance Considerations
- Use --top N to limit output for large binaries
- Use --enc to restrict to a single encoding
- Consider --min-len to reduce noise
Next Steps
- Learn about output formats in detail
- Understand the classification system
- Explore advanced CLI options
- Read about performance optimization
Command Line Interface
Basic Syntax
stringy [OPTIONS] <FILE>
stringy [OPTIONS] - # read from stdin
Options
Input/Output
| Option | Description | Default |
|---|---|---|
<FILE> | Binary file to analyze (use - for stdin) | - |
--json | JSONL output; conflicts with --yara | - |
--yara | YARA rule output; conflicts with --json | - |
--help | Show help | - |
--version | Show version | - |
Filtering
| Option | Description | Default |
|---|---|---|
--min-len N | Minimum string length (must be >= 1) | 4 |
--top N | Limit to top N strings by score (applied after all filters) | - |
--enc ENCODING | Filter by encoding: ascii, utf8, utf16, utf16le, utf16be | all |
--only-tags TAG | Include strings with any of these tags (OR); repeatable | all |
--no-tags TAG | Exclude strings with any of these tags; repeatable | none |
Mode Flags
| Option | Description |
|---|---|
--raw | Extraction-only mode (no tagging, ranking, or scoring); conflicts with --only-tags, --no-tags, --top, --debug, --yara |
--summary | Append summary block (TTY table mode only); conflicts with --json, --yara |
--debug | Include score-breakdown fields (section_weight, semantic_boost, noise_penalty) in JSON output; conflicts with --raw |
Encoding Options
The --enc flag accepts exactly one encoding value per invocation:
| Value | Description |
|---|---|
ascii | 7-bit ASCII only |
utf8 | UTF-8 (includes ASCII) |
utf16 | UTF-16 (both little- and big-endian) |
utf16le | UTF-16 Little Endian only |
utf16be | UTF-16 Big Endian only |
Examples
# ASCII only
stringy --enc ascii binary
# UTF-16 only (common for Windows)
stringy --enc utf16 app.exe
# UTF-8 only
stringy --enc utf8 binary
Tag Filtering
Tags are specified with the repeatable --only-tags and --no-tags flags. Repeat the flag for each tag value:
# Network indicators only
stringy --only-tags url --only-tags domain --only-tags ipv4 --only-tags ipv6 malware.exe
# Exclude noisy Base64
stringy --no-tags b64 binary
# File system related
stringy --only-tags filepath --only-tags regpath app.exe
Available Tags
| Tag | Description | Example |
|---|---|---|
url | HTTP/HTTPS URLs | https://api.example.com |
domain | Domain names | example.com |
ipv4 | IPv4 addresses | 192.168.1.1 |
ipv6 | IPv6 addresses | 2001:db8::1 |
filepath | File paths | /usr/bin/app |
regpath | Registry paths | HKEY_LOCAL_MACHINE\... |
guid | GUIDs/UUIDs | {12345678-1234-...} |
email | Email addresses | user@example.com |
b64 | Base64 data | SGVsbG8= |
fmt | Format strings | Error: %s |
user-agent-ish | User-agent-like strings | Mozilla/5.0 ... |
demangled | Demangled symbols | std::io::Read::read |
import | Import names | CreateFileW |
export | Export names | main |
version | Version strings | v1.2.3 |
manifest | Manifest data | XML/JSON config |
resource | Resource strings | UI text |
dylib-path | Dynamic library paths | /usr/lib/libfoo.dylib |
rpath | Runtime search paths | /usr/local/lib |
rpath-var | Rpath variables | @loader_path/../lib |
framework-path | Framework paths (macOS) | /System/Library/Frameworks/... |
Output Formats
Table (Default, TTY)
When stdout is a TTY, results are shown as a table with columns:
String | Tags | Score | Section
When piped (non-TTY), output is plain text with one string per line and no headers.
JSON Lines (--json)
Each line is a JSON object with full metadata. See Output Formats for the schema.
YARA (--yara)
Generates a YARA rule template. See Output Formats for details.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success (including unknown binary format, empty binary, no filter matches) |
| 1 | General runtime error |
| 2 | Configuration or validation error (tag overlap, --summary in non-TTY) |
| 3 | File not found |
| 4 | Permission denied |
Clap argument parsing errors (invalid flag, flag conflict, invalid tag name) use clap’s own exit code (typically 2).
Advanced Usage
Pipeline Integration
# Extract URLs and check them
stringy --only-tags url --json binary | jq -r '.text' | xargs -I {} curl -I {}
# Find high-score strings
stringy --json binary | jq 'select(.score > 80)'
# Count strings by tag
stringy --json binary | jq -r '.tags[]' | sort | uniq -c
Batch Processing
# Process multiple files
find /path/to/binaries -type f -exec stringy --json {} \; > all_strings.jsonl
# Compare two versions
stringy --json old_binary > old.jsonl
stringy --json new_binary > new.jsonl
diff <(jq -r '.text' old.jsonl | sort) <(jq -r '.text' new.jsonl | sort)
Focused Analysis
# Fast scan for high-value strings only
stringy --top 20 --min-len 8 --only-tags url --only-tags guid --only-tags filepath large_binary
# Extraction-only mode (no classification overhead)
stringy --raw binary
Output Formats
Stringy supports three output formats optimized for different use cases.
Table Format (Default)
TTY Mode
When stdout is a TTY, results are shown as an aligned table. Columns appear in this order:
String Tags Score Section
------ ---- ----- -------
https://api.example.com/v1/users Url 95 .rdata
{12345678-1234-1234-1234-123456789abc} guid 87 .rdata
/usr/local/bin/application filepath 82 __cstring
Error: %s at line %d fmt 78 .rdata
Features:
- Truncation: Long strings are truncated with a ... indicator
- Sorting: Results sorted by score (highest first)
- Alignment: Columns properly aligned for readability
Plain Text (Piped / Non-TTY)
When stdout is piped, output switches to plain text with one string per line and no headers or table formatting. This is designed for downstream tool consumption.
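The TTY switch can be done with the standard library alone (IsTerminal has been in std since Rust 1.70); a minimal sketch of the decision, not the shipped implementation:

```rust
use std::io::{stdout, IsTerminal};

/// Sketch of the output-mode switch: aligned table on a terminal,
/// one plain string per line when stdout is piped.
fn output_mode() -> &'static str {
    if stdout().is_terminal() {
        "table" // headers, aligned columns, colors
    } else {
        "plain" // one string per line, no headers
    }
}
```

Downstream tools therefore never need a flag to disable table formatting; piping alone is enough.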
--summary Block
When --summary is passed (TTY mode only; conflicts with --json and --yara), a summary block is appended after the table showing aggregate statistics about the extraction.
JSON Lines Format
Machine-readable format with one JSON object per line (JSONL), ideal for automation and pipeline integration.
stringy --json binary
Example Output
{"text":"https://api.example.com/v1/users","encoding":"Ascii","offset":4096,"rva":4096,"section":".rdata","length":31,"tags":["Url"],"score":95,"confidence":1.0,"source":"SectionData"}
{"text":"{12345678-1234-1234-1234-123456789abc}","encoding":"Ascii","offset":8192,"rva":8192,"section":".rdata","length":38,"tags":["guid"],"score":87,"confidence":0.95,"source":"SectionData"}
Schema
Each JSON object contains:
| Field | Type | Description |
|---|---|---|
text | string | The extracted string (demangled if applicable) |
original_text | string or null | Original mangled form (present only when demangled) |
encoding | string | Encoding: Ascii, Utf8, Utf16Le, Utf16Be |
offset | number | File offset in bytes |
rva | number or null | Relative Virtual Address (if available) |
section | string or null | Section name where found |
length | number | String length in bytes |
tags | array | Semantic classification tags |
score | number | Internal relevance score |
display_score | number or null | Display score (0-100 band-mapped); only present with --debug |
confidence | number | Confidence score from noise filtering (0.0-1.0) |
source | string | Source type: SectionData, ImportName, ExportName, etc. |
Debug Fields
When --debug is passed, four additional fields appear:
| Field | Type | Description |
|---|---|---|
display_score | number or null | Display score (0-100 band-mapped) |
section_weight | number or null | Section weight contribution to score |
semantic_boost | number or null | Semantic classification boost |
noise_penalty | number or null | Noise penalty applied |
Raw Mode
With --raw --json, output contains extraction-only data: score is 0, tags is empty, and display_score is absent.
Processing Examples
# Extract only URLs
stringy --json binary | jq 'select(.tags[] == "Url") | .text'
# High-score strings only
stringy --json binary | jq 'select(.score > 80)'
# Group by section
stringy --json binary | jq -r '.section' | sort | uniq -c
# Find strings in specific section
stringy --json binary | jq 'select(.section == ".rdata")'
YARA Format
Specialized format for creating YARA detection rules with proper escaping and metadata.
stringy --yara binary
Example Output
// YARA rule generated by Stringy
// Binary: binary
// Generated: 1234567890
rule binary_strings {
meta:
description = "Strings extracted from binary"
generated_by = "stringy"
generated_at = "1234567890"
strings:
// tag: filepath
// score: 82
$filepath_1 = "/usr/local/bin/application" ascii
// tag: fmt
// score: 78
$fmt_1 = "Error: %s at line %d" ascii
// tag: Url
// score: 95
$Url_1 = "https://api.example.com/v1/users" ascii
// skipped (length > 200 chars): 245
condition:
any of them
}
Features
- Rule naming: The rule name is derived from the filename, with non-alphanumeric characters replaced by _ and a _strings suffix added
- Tag grouping: Strings are grouped by their first tag, with // tag: <name> comments and per-string // score: <N> annotations
- Variable naming: Variables use tag-derived names (e.g., $Url_1, $filepath_1, $fmt_1) rather than sequential $sN
- Proper escaping: Handles special characters and binary data
- Long string handling: Strings over 200 characters are replaced with // skipped (length > 200 chars): N (where N is the character count)
- Modifiers: Appropriate ascii/wide modifiers based on encoding
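The rule-naming step can be sketched in a few lines. This is an illustration of the documented rule (non-alphanumeric characters become _, then a _strings suffix), not the shipped implementation:

```rust
/// Derive a YARA rule name from a filename: every non-alphanumeric
/// character becomes '_' and a "_strings" suffix is appended.
/// A sketch of the documented behavior only.
fn rule_name(filename: &str) -> String {
    let sanitized: String = filename
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect();
    format!("{sanitized}_strings")
}
```

For example, a file named libfoo-1.2.so would yield a rule named libfoo_1_2_so_strings.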
Score Behavior
Stringy uses a band-mapping system to convert internal scores to display scores (0-100):
| Internal Score | Display Score | Meaning |
|---|---|---|
| <= 0 | 0 | Low relevance |
| 1-79 | 1-49 | Low relevance |
| 80-119 | 50-69 | Moderate |
| 120-159 | 70-89 | Meaningful |
| 160-220 | 90-100 | High-value |
| > 220 | 100 (clamped) | High-value |
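The band table can be sketched as a small mapping function. Linear interpolation inside each band is an assumption for illustration; the real normalizer (src/pipeline/normalizer.rs) may map values within a band differently:

```rust
/// Map an internal score onto the 0-100 display scale using the band
/// table above. In-band linear interpolation is an assumption.
fn band_map(internal: i32) -> i32 {
    // (internal_lo, internal_hi, display_lo, display_hi)
    const BANDS: [(i32, i32, i32, i32); 4] = [
        (1, 79, 1, 49),
        (80, 119, 50, 69),
        (120, 159, 70, 89),
        (160, 220, 90, 100),
    ];
    if internal <= 0 {
        return 0;
    }
    if internal > 220 {
        return 100; // clamped
    }
    for (ilo, ihi, dlo, dhi) in BANDS {
        if (ilo..=ihi).contains(&internal) {
            // scale the position within the internal band onto the display band
            return dlo + (internal - ilo) * (dhi - dlo) / (ihi - ilo);
        }
    }
    unreachable!("bands cover 1..=220")
}
```

With this sketch, an internal score of 120 lands at display score 70, the bottom of the "meaningful" band.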
Format Comparison
| Feature | Table | JSON | YARA |
|---|---|---|---|
| Interactive use | Yes | No | No |
| Automation | No | Yes | No |
| Rule creation | No | No | Yes |
| Full metadata | No | Yes | No |
Output Customization
Filtering
All formats support the same filtering options:
# Limit results
stringy --top 50 --json binary
# Filter by tags
stringy --only-tags url --only-tags domain --yara binary
# Minimum score threshold (post-process)
stringy --json binary | jq 'select(.score >= 70)'
Redirection
# Save to file
stringy --json binary > strings.jsonl
stringy --yara binary > rules.yar
# Pipe to other tools
stringy --json binary | jq 'select(.tags[] == "Url")' | less
Architecture Overview
Stringy is built as a modular Rust library with a clear separation of concerns. The architecture follows a pipeline approach where binary data flows through several processing stages.
High-Level Architecture
Binary File -> Format Detection -> Container Parsing -> String Extraction -> Deduplication -> Classification -> Ranking -> Output
Core Components
1. Container Module (src/container/)
Handles binary format detection and parsing using the goblin crate with comprehensive section analysis.
- Format Detection: Automatically identifies ELF, PE, and Mach-O formats via goblin::Object::parse()
- Section Classification: Categorizes sections by string likelihood with weighted scoring
- Metadata Extraction: Collects imports, exports, and detailed structural information
- Cross-Platform Support: Handles platform-specific section characteristics and naming conventions
Supported Formats
| Format | Parser | Key Sections (Weight) | Import/Export Support |
|---|---|---|---|
| ELF | ElfParser | .rodata (10.0), .comment (9.0), .data.rel.ro (7.0) | Dynamic and static |
| PE | PeParser | .rdata (10.0), .rsrc (9.0), read-only .data (7.0) | Import/export tables |
| Mach-O | MachoParser | __TEXT,__cstring (10.0), __TEXT,__const (9.0) | Symbol tables |
Section Weight System
Container parsers assign weights (1.0-10.0) to sections based on how likely they are to contain meaningful strings. Higher weights indicate higher-value sections. For example, .rodata (read-only data) receives a weight of 10.0, while .text (executable code) receives 1.0.
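A minimal sketch of such a weight table follows. The weights for listed sections mirror the table above; the 5.0 default for unlisted sections is an assumption:

```rust
/// Illustrative section-weight lookup for a few well-known sections.
/// Listed weights follow the documentation; the default is assumed.
fn section_weight(name: &str) -> f32 {
    match name {
        ".rodata" | ".rdata" | "__TEXT,__cstring" => 10.0, // string-rich data
        ".comment" | ".rsrc" | "__TEXT,__const" => 9.0,
        ".data.rel.ro" => 7.0,
        ".text" => 1.0, // executable code: lowest value
        _ => 5.0,       // assumed neutral default for unknown sections
    }
}
```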
2. Extraction Module (src/extraction/)
Implements encoding-aware string extraction algorithms with configurable parameters.
- ASCII/UTF-8: Scans for printable character sequences with noise filtering
- UTF-16: Detects little-endian and big-endian wide strings with confidence scoring
- PE Resources: Extracts version info, manifests, and string table resources from PE binaries
- Mach-O Load Commands: Extracts strings from Mach-O load commands (dylib paths, rpaths)
- Deduplication: Groups strings by (text, encoding) keys, preserves all occurrence metadata, merges tags using set union, and calculates combined scores with occurrence-based bonuses
- Noise Filters: Applies configurable filters to reduce false positives
- Section-Aware: Uses container parser weights to prioritize extraction areas
Deduplication System
The deduplication module (src/extraction/dedup/) provides comprehensive string deduplication:
- Grouping Strategy: Strings are grouped by (text, encoding) tuple, ensuring UTF-8 and UTF-16 versions are kept separate
- Occurrence Preservation: All occurrence metadata (offset, RVA, section, source, tags, score, confidence) is preserved
- Tag Merging: Tags from all occurrences are merged using a HashSet for uniqueness, then converted to a sorted Vec<Tag>
- Combined Scoring: Calculates combined scores using a base score (maximum across occurrences) plus bonuses for multiple occurrences, cross-section appearances, and multi-source appearances
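The grouping and merging strategy can be sketched as follows. The Occurrence struct is a stripped-down stand-in for FoundString, and the +2-per-extra-occurrence bonus is an illustrative assumption, not the shipped formula:

```rust
use std::collections::{BTreeSet, HashMap};

/// Simplified occurrence record; the real FoundString carries far more
/// metadata (offset, RVA, section, source, confidence, ...).
#[derive(Clone)]
struct Occurrence {
    score: i32,
    tags: Vec<String>,
}

/// Group occurrences by (text, encoding), merge tags via set union, and
/// score each group as the maximum occurrence score plus an assumed
/// bonus per extra occurrence.
fn dedup(
    items: Vec<(String, String, Occurrence)>, // (text, encoding, occurrence)
) -> HashMap<(String, String), (Vec<String>, i32)> {
    let mut groups: HashMap<(String, String), Vec<Occurrence>> = HashMap::new();
    for (text, enc, occ) in items {
        groups.entry((text, enc)).or_default().push(occ);
    }
    groups
        .into_iter()
        .map(|(key, occs)| {
            // BTreeSet gives uniqueness plus a stable sorted order
            let tags: BTreeSet<String> =
                occs.iter().flat_map(|o| o.tags.iter().cloned()).collect();
            let base = occs.iter().map(|o| o.score).max().unwrap_or(0);
            let bonus = 2 * (occs.len() as i32 - 1); // assumed bonus
            (key, (tags.into_iter().collect(), base + bonus))
        })
        .collect()
}
```

Note that the same text in ASCII and UTF-16 forms produces two separate groups, matching the (text, encoding) keying described above.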
3. Classification Module (src/classification/)
Applies semantic analysis to extracted strings with comprehensive tagging system.
- Pattern Matching: Uses regex to identify URLs, IPs, domains, paths, GUIDs, emails, format strings, base64, user-agent strings, and version strings
- Symbol Processing: Demangles Rust symbols and processes imports/exports
- Context Analysis: Considers section context and source type for classification
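As a greatly simplified illustration of the tagging idea, here are std-only shape checks standing in for the real regex patterns in src/classification/patterns/ (deliberately incomplete and far looser than the shipped matchers):

```rust
/// Std-only shape checks standing in for the real regex-based
/// classifiers. A sketch of the idea, not the shipped patterns.
fn classify(s: &str) -> Vec<&'static str> {
    let mut tags = Vec::new();
    if s.starts_with("http://") || s.starts_with("https://") {
        tags.push("url");
    }
    if s.starts_with('/') && !s.contains(' ') {
        tags.push("filepath");
    }
    if s.starts_with("HKEY_") {
        tags.push("regpath");
    }
    // GUID shape: "{8-4-4-4-12}" hex groups, 38 chars including braces
    if s.len() == 38
        && s.starts_with('{')
        && s.ends_with('}')
        && s[1..37].chars().all(|c| c.is_ascii_hexdigit() || c == '-')
    {
        tags.push("guid");
    }
    tags
}
```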
Supported Classification Tags
| Category | Tags | Examples |
|---|---|---|
| Network | url, domain, ipv4, ipv6 | https://api.com, example.com, 192.168.1.1 |
| Filesystem | filepath, regpath, dylib-path, rpath, rpath-var, framework-path | /usr/bin/app, HKEY_LOCAL_MACHINE\... |
| Identifiers | guid, email, user-agent-ish | {12345678-...}, user@domain.com |
| Code | fmt, b64, import, export, demangled | Error: %s, SGVsbG8=, CreateFileW |
| Resources | version, manifest, resource | v1.2.3, XML config, UI strings |
4. Ranking Module (src/classification/ranking.rs)
Implements the scoring algorithm to prioritize relevant strings using multiple factors:
- Section Weight: Based on the section's classification (higher weights for string-oriented sections like .rodata)
- Semantic Boost: Bonus points for strings with recognized semantic tags (URLs, GUIDs, paths, etc.)
- Noise Penalty: Penalty for characteristics indicating noise (low confidence, repetitive patterns, high entropy)
The internal score is then mapped to a display score (0-100) using a band-mapping system. See Output Formats for the display-score band table.
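A toy combination of the three factors might look like this. The multipliers (x10 section weight, +40 per semantic tag, confidence-scaled noise penalty) are pure assumptions for illustration, not the shipped constants:

```rust
/// Toy combination of the three documented scoring factors.
/// All multipliers here are illustrative assumptions.
fn internal_score(section_weight: f32, semantic_tags: usize, confidence: f32) -> i32 {
    let weight_part = (section_weight * 10.0) as i32; // section weight
    let boost = 40 * semantic_tags as i32;            // semantic boost
    let penalty = ((1.0 - confidence) * 50.0) as i32; // noise penalty
    weight_part + boost - penalty
}
```

Under these assumed constants, a clean URL in .rodata (weight 10.0, one tag, confidence 1.0) would score 140 internally, landing in the "meaningful" display band.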
5. Output Module (src/output/)
Formats results for different use cases:
- Table (src/output/table/): TTY-aware output with color-coded scores, or plain text when piped. Columns: String, Tags, Score, Section.
- JSON (src/output/json.rs): JSONL format with complete structured data including all metadata fields
- YARA (src/output/yara/): Properly escaped strings with appropriate modifiers and long-string skipping
6. Pipeline Module (src/pipeline/)
Orchestrates the entire flow from file reading through output:
- Configuration (src/pipeline/config.rs): PipelineConfig, FilterConfig, and EncodingFilter
- Filtering (src/pipeline/filter.rs): FilterEngine applies post-extraction filtering by min-length, encoding, tags, and top-N
- Score Normalization (src/pipeline/normalizer.rs): ScoreNormalizer maps internal scores to display scores (0-100) and populates display_score on each FoundString unconditionally in all non-raw executions
- Orchestration (src/pipeline/mod.rs): Pipeline::run drives the full pipeline
Data Flow
1. Binary Analysis Phase
The pipeline reads the file, detects the binary format via goblin, and dispatches to the appropriate container parser (ELF, PE, or Mach-O). The parser returns a ContainerInfo struct containing sections with weights, imports, and exports. Unknown or unparseable formats fall back to unstructured raw byte scanning.
2. String Extraction Phase
Strings are extracted from each section using encoding-specific extractors (ASCII, UTF-8, UTF-16LE, UTF-16BE). Import and export symbol names are included as high-value strings. PE resources (version info, manifests, string tables) and Mach-O load command strings are also extracted. Results are then deduplicated by grouping on (text, encoding).
3. Classification Phase
Each string is passed through pattern matchers that assign semantic tags based on content. Rust mangled symbols are demangled. The ranking algorithm then computes a score for each string combining section weight, semantic boost, and noise penalty.
4. Output Phase
Strings are sorted by score (descending), filtered according to user options (tags, encoding, top-N), and formatted for the selected output mode (table, JSON, or YARA).
Data Structures
Core types are defined in src/types/mod.rs:
pub struct FoundString {
    pub text: String,
    pub original_text: Option<String>, // pre-demangled form
    pub encoding: Encoding,
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
    pub length: u32,
    pub tags: Vec<Tag>,
    pub score: i32,
    pub section_weight: Option<i32>, // debug only
    pub semantic_boost: Option<i32>, // debug only
    pub noise_penalty: Option<i32>,  // debug only
    pub display_score: Option<i32>,  // populated in all non-raw executions
    pub source: StringSource,
    pub confidence: f32,
}
Key Design Decisions
Error Handling
- Comprehensive error types with context via thiserror
- Graceful degradation for partially corrupted binaries
- Unknown formats fall back to raw byte scanning rather than erroring
Extensibility
- Trait-based architecture for easy format addition
- Pluggable classification systems
- Configurable output formats
Performance
- Section-aware extraction reduces scan time
- Regex caching via once_cell::sync::Lazy for repeated pattern matching
- Weight-based prioritization avoids scanning low-value sections
Module Dependencies
main.rs
+-- lib.rs (public API, re-exports)
+-- types/
| +-- mod.rs (core data structures: Tag, FoundString, Encoding, etc.)
| +-- error.rs (StringyError, Result)
| +-- constructors.rs (constructor implementations)
| +-- found_string.rs (FoundString builder methods)
| +-- tests.rs
+-- container/
| +-- mod.rs (format detection, ContainerParser trait)
| +-- elf/
| | +-- mod.rs (ELF parser)
| | +-- tests.rs
| +-- pe/
| | +-- mod.rs (PE parser)
| | +-- tests.rs
| +-- macho/
| | +-- mod.rs (Mach-O parser)
| | +-- tests.rs
+-- extraction/
| +-- mod.rs (extraction orchestration)
| +-- ascii/ (ASCII/UTF-8 extraction)
| +-- utf16/ (UTF-16LE/BE extraction with confidence scoring)
| +-- dedup/ (deduplication with scoring)
| +-- filters/ (noise filter implementations)
| +-- pe_resources/ (PE version info, manifests, string tables)
| +-- macho_load_commands.rs (Mach-O load command strings)
+-- classification/
| +-- mod.rs (classification framework)
| +-- patterns/ (regex-based pattern matching)
| +-- symbols.rs (symbol processing and demangling)
| +-- ranking.rs (scoring algorithm)
+-- output/
| +-- mod.rs (OutputFormat, OutputMetadata, formatting dispatch)
| +-- json.rs (JSONL format)
| +-- table/ (TTY and plain text table formatting)
| +-- yara/ (YARA rule generation with escaping)
+-- pipeline/
+-- mod.rs (Pipeline::run orchestration)
+-- config.rs (PipelineConfig, FilterConfig, EncodingFilter)
+-- filter.rs (post-extraction filtering)
+-- normalizer.rs (score band mapping)
External Dependencies
Core Dependencies
- goblin - Multi-format binary parsing (ELF, PE, Mach-O)
- pelite - PE resource extraction (version info, manifests, string tables)
- serde + serde_json - Serialization
- thiserror - Error handling
- clap - CLI argument parsing
- regex - Pattern matching for classification
- rustc-demangle - Rust symbol demangling
- indicatif - Progress bars and spinners for CLI output
- tempfile - Temporary file creation for stdin-to-Pipeline bridging
- once_cell - Lazy-initialized static regex patterns
- patharg - Input argument handling (file path or stdin)
Testing Strategy
Unit Tests
- Each module has comprehensive unit tests
- Mock data for parser testing
- Edge case coverage for string extraction
Integration Tests
- End-to-end CLI functionality via assert_cmd
- Real binary file testing with compiled fixtures
- Snapshot testing via insta
- Cross-platform validation
Performance Tests
- Benchmarks via criterion in benches/
Binary Format Support
Stringy supports the three major executable formats across different platforms. Each format has unique characteristics that influence string extraction strategies.
ELF (Executable and Linkable Format)
Used primarily on Linux and other Unix-like systems.
Key Sections for String Extraction
| Section | Priority | Description |
|---|---|---|
.rodata | High | Read-only data, often contains string literals |
.rodata.str1.1 | High | Aligned string literals |
.data.rel.ro | Medium | Read-only after relocation |
.comment | Medium | Compiler and build information |
.note.* | Low | Various metadata notes |
ELF-Specific Features
- Symbol Tables: Extract import/export names from .dynsym and .symtab
- Dynamic Strings: Process .dynstr for library names and symbols
- Section Flags: Use SHF_EXECINSTR and SHF_WRITE for classification
- Virtual Addresses: Map file offsets to runtime addresses
- Dynamic Linking: Parse DT_NEEDED entries to extract library dependencies
- Symbol Visibility: Filter hidden and internal symbols from exports (STV_HIDDEN, STV_INTERNAL)
Enhanced Symbol Extraction
The ELF parser provides comprehensive symbol extraction:
- Import Detection: Identifies all undefined symbols (SHN_UNDEF) that need runtime resolution
- Supports multiple symbol types: functions, objects, TLS variables, and indirect functions
- Handles both global and weak bindings
- Maps symbols to their providing libraries using version information
- Export Detection: Extracts all globally visible defined symbols
- Filters out hidden (STV_HIDDEN) and internal (STV_INTERNAL) symbols
- Includes both strong and weak symbols
- Supports all relevant symbol types
- Library Dependencies: Extracts DT_NEEDED entries from the dynamic section
- Provides list of required shared libraries
- Used in conjunction with version information for symbol-to-library mapping
- Symbol-to-Library Mapping: Maps imported symbols to their providing libraries
- Uses ELF version tables (versym and verneed) for best-effort attribution
- Process: versym index → verneed entry → library filename
- Falls back to heuristics for unversioned symbols (e.g., common libc symbols)
- Returns None when version information is unavailable or ambiguous
Implementation Details
impl ElfParser {
fn classify_section(section: &SectionHeader, name: &str) -> SectionType {
// Check executable flag first
if section.sh_flags & SHF_EXECINSTR != 0 {
return SectionType::Code;
}
// Classify by name patterns
match name {
".rodata" | ".rodata.str1.1" => SectionType::StringData,
".data.rel.ro" => SectionType::ReadOnlyData,
// ... more classifications
}
}
fn extract_imports(&self, elf: &Elf, libraries: &[String]) -> Vec<ImportInfo> {
// Extract undefined symbols from dynamic symbol table
// Supports STT_FUNC, STT_OBJECT, STT_TLS, STT_GNU_IFUNC, STT_NOTYPE
// Handles both STB_GLOBAL and STB_WEAK bindings
// Maps symbols to libraries using version information
}
fn extract_exports(&self, elf: &Elf) -> Vec<ExportInfo> {
// Extract defined symbols with global/weak binding
// Filters out STV_HIDDEN and STV_INTERNAL symbols
// Includes all relevant symbol types
}
fn extract_needed_libraries(&self, elf: &Elf) -> Vec<String> {
// Parse DT_NEEDED entries from dynamic section
// Returns list of required shared library names
}
fn get_symbol_providing_library(
&self,
elf: &Elf,
sym_index: usize,
libraries: &[String],
) -> Option<String> {
// 1. Get version index from versym table for this symbol
// 2. Look up version in verneed to find library name
// 3. Match with DT_NEEDED entries
// 4. Fallback to heuristics for unversioned symbols
}
}
Library Dependency Mapping
The ELF parser implements symbol-to-library mapping using ELF version information:
- Version Symbol Table (versym): Maps each dynamic symbol to a version index
  - Index 0 (VER_NDX_LOCAL): Local symbol, not available externally
  - Index 1 (VER_NDX_GLOBAL): Global symbol, no specific version
  - Index ≥ 2: Versioned symbol, references a verneed entry
- Version Needed Table (verneed): Lists library dependencies with version requirements
  - Each entry contains a library filename (from DT_NEEDED)
  - Auxiliary entries specify version names and indices
  - Links version indices to specific libraries
- Mapping Process: Symbol → versym[sym_index] → version_index → verneed lookup → library_name
- Fallback Strategies:
  - For unversioned symbols: Attempt to match common symbols (e.g., `printf`, `malloc`) to libc
  - If only one library is needed: Attribute to that library (least accurate)
  - Otherwise: Return `None` to avoid false positives
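The lookup chain above can be sketched as a small function. This is an illustrative sketch with simplified stand-in types (plain slices and a `HashMap`), not goblin's actual version-table API; the function name and the single-library fallback rule follow the strategies described in the text.

```rust
use std::collections::HashMap;

const VER_NDX_LOCAL: u16 = 0;
const VER_NDX_GLOBAL: u16 = 1;

/// Best-effort symbol-to-library attribution: versym index -> verneed -> library.
fn providing_library(
    sym_index: usize,
    versym: &[u16],                 // one version index per dynamic symbol
    verneed: &HashMap<u16, String>, // version index -> library filename
    needed: &[String],              // DT_NEEDED entries
) -> Option<String> {
    let ver = *versym.get(sym_index)?;
    match ver {
        VER_NDX_LOCAL | VER_NDX_GLOBAL => {
            // Unversioned symbol: only attribute when exactly one library is needed.
            if needed.len() == 1 {
                Some(needed[0].clone())
            } else {
                None // ambiguous; avoid false positives
            }
        }
        // Versioned symbol: look up the library via the verneed table.
        v => verneed.get(&v).cloned(),
    }
}
```

Real versym entries also carry a hidden bit (0x8000) that a full implementation must mask off before the lookup.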
Limitations
ELF’s indirect linking model means symbol-to-library mapping is best-effort:
- Accuracy: Version-based mapping is accurate when version information is present, but many binaries lack version info
- Unversioned Symbols: Symbols without version information cannot be definitively mapped without relocation analysis
- Relocation Tables: PLT/GOT relocations would provide definitive mapping but require complex analysis
- Static Linking: Statically linked binaries have no dynamic section, so all imports have `library: None`
- Stripped Binaries: Stripped binaries may lack symbol tables entirely
The current implementation is sufficient for most string classification use cases where approximate library attribution is acceptable.
PE (Portable Executable)
Used on Windows for executables, DLLs, and drivers.
Key Sections for String Extraction
| Section | Priority | Description |
|---|---|---|
| `.rdata` | High | Read-only data section |
| `.rsrc` | High | Resources (version info, strings, etc.) |
| `.data` | Medium | Initialized data (check write flag) |
| `.text` | Low | Code section (imports/exports only) |
PE-Specific Features
- Resources: Extract from `VERSIONINFO`, `STRINGTABLE`, and manifest resources
- Import/Export Tables: Process IAT and EAT for symbol names
- UTF-16 Prevalence: Windows APIs favor wide strings
- Section Characteristics: Use `IMAGE_SCN_*` flags for classification
Enhanced Import/Export Extraction
The PE parser provides comprehensive import/export extraction:
- Import Extraction: Extracts from the PE import directory using goblin's `pe.imports`
  - Each import includes: function name, DLL name, and RVA
  - Example: `printf` from `msvcrt.dll`
  - Iterates through `pe.imports` to create `ImportInfo` with name, library (DLL), and address (RVA)
- Export Extraction: Extracts from the PE export directory using goblin's `pe.exports`
  - Each export includes: function name, address, and ordinal
  - Note: PE executables typically don't export symbols (only DLLs do)
  - Ordinal is derived from the index since goblin doesn't expose it directly
  - Handles unnamed exports with `ordinal_{i}` naming
Resource Extraction (Phase 2 Complete)
PE resources are particularly rich sources of strings. The PE parser now provides comprehensive resource string extraction:
VERSIONINFO Extraction
- Extracts all StringFileInfo key-value pairs from VS_VERSIONINFO structures
- Supports multiple language variants via translation table
- Common extracted fields:
  - `CompanyName`: Company or organization name
  - `FileDescription`: File purpose and description
  - `FileVersion`: File version string (e.g., "1.0.0.0")
  - `ProductName`: Product name
  - `ProductVersion`: Product version string
  - `LegalCopyright`: Copyright information
  - `InternalName`: Internal file identifier
  - `OriginalFilename`: Original filename
- Uses pelite's high-level `version_info()` API for reliable parsing
- All strings are UTF-16LE encoded in the resource
- Tagged with `Tag::Version` and `Tag::Resource`
STRINGTABLE Extraction
- Parses RT_STRING resources (type 6) containing localized UI strings
- Handles block structure: strings grouped in blocks of 16
- Block ID calculation: `(StringID >> 4) + 1`
- String format: u16 length (in UTF-16 code units) + UTF-16LE string data
- Supports multiple language variants
- Extracts all non-empty strings from all blocks
- Tagged with `Tag::Resource`
- Common use cases: UI labels, error messages, dialog text
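The block-ID formula and the length-prefixed string layout can be sketched as follows. The block layout (16 length-prefixed UTF-16LE strings per block) follows the Windows RT_STRING format; the function names here are illustrative, not Stringy's internal API.

```rust
/// RT_STRING block ID for a given string ID: strings are grouped 16 per block.
fn block_id(string_id: u16) -> u16 {
    (string_id >> 4) + 1
}

/// Parse one RT_STRING block: 16 entries, each a u16 length (in UTF-16 code
/// units) followed by that many UTF-16LE code units. Empty slots have length 0.
fn parse_string_block(data: &[u8]) -> Vec<String> {
    let mut strings = Vec::new();
    let mut pos = 0;
    for _ in 0..16 {
        if pos + 2 > data.len() {
            break; // truncated block
        }
        let len = u16::from_le_bytes([data[pos], data[pos + 1]]) as usize;
        pos += 2;
        let end = pos + len * 2;
        if end > data.len() {
            break;
        }
        let units: Vec<u16> = data[pos..end]
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        if let Ok(s) = String::from_utf16(&units) {
            if !s.is_empty() {
                strings.push(s); // keep only non-empty entries
            }
        }
        pos = end;
    }
    strings
}
```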
MANIFEST Extraction
- Extracts RT_MANIFEST resources (type 24) containing application manifests
- Automatic encoding detection:
- UTF-8 with BOM (EF BB BF)
- UTF-16LE with BOM (FF FE)
- UTF-16BE with BOM (FE FF)
- Fallback: byte pattern analysis
- Returns full XML manifest content
- Tagged with `Tag::Manifest` and `Tag::Resource`
- Manifest contains:
- Assembly identity (name, version, architecture)
- Dependency information
- Compatibility settings
- Security settings (requestedExecutionLevel)
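The BOM-based encoding detection above can be sketched as a small decoder. This is an illustrative sketch: the fallback here simply assumes UTF-8, whereas the real implementation also applies byte-pattern analysis.

```rust
/// Decode a manifest resource by checking for a BOM, then falling back to UTF-8.
fn decode_manifest(data: &[u8]) -> Option<String> {
    match data {
        // UTF-8 BOM: EF BB BF
        [0xEF, 0xBB, 0xBF, rest @ ..] => String::from_utf8(rest.to_vec()).ok(),
        // UTF-16LE BOM: FF FE
        [0xFF, 0xFE, rest @ ..] => {
            let units: Vec<u16> = rest
                .chunks_exact(2)
                .map(|c| u16::from_le_bytes([c[0], c[1]]))
                .collect();
            String::from_utf16(&units).ok()
        }
        // UTF-16BE BOM: FE FF
        [0xFE, 0xFF, rest @ ..] => {
            let units: Vec<u16> = rest
                .chunks_exact(2)
                .map(|c| u16::from_be_bytes([c[0], c[1]]))
                .collect();
            String::from_utf16(&units).ok()
        }
        // No BOM: assume UTF-8 (the real parser does byte-pattern analysis here)
        _ => String::from_utf8(data.to_vec()).ok(),
    }
}
```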
Usage Example
use stringy::extraction::extract_resource_strings;
use stringy::types::Tag;
let pe_data = std::fs::read("example.exe")?;
let strings = extract_resource_strings(&pe_data);
// Filter version info strings
let version_strings: Vec<_> = strings.iter()
.filter(|s| s.tags.contains(&Tag::Version))
.collect();
// Filter string table entries
let ui_strings: Vec<_> = strings.iter()
.filter(|s| s.tags.contains(&Tag::Resource) && !s.tags.contains(&Tag::Version))
.collect();
Implementation Details
impl PeParser {
fn classify_section(section: &SectionTable) -> SectionType {
let name = String::from_utf8_lossy(&section.name);
// Check characteristics
if section.characteristics & IMAGE_SCN_CNT_CODE != 0 {
return SectionType::Code;
}
match name.trim_end_matches('\0') {
".rdata" => SectionType::StringData,
".rsrc" => SectionType::Resources,
// ... more classifications
}
}
fn extract_imports(&self, pe: &PE) -> Vec<ImportInfo> {
// Iterates through pe.imports
// Creates ImportInfo with name, library (DLL), and address (RVA)
}
fn extract_exports(&self, pe: &PE) -> Vec<ExportInfo> {
// Iterates through pe.exports
// Creates ExportInfo with name, address, and ordinal
// Handles unnamed exports with "ordinal_{i}" naming
}
fn calculate_section_weight(section_type: SectionType, name: &str) -> f32 {
// Returns weight values based on section type and name
// Higher weights indicate higher string likelihood
}
}
Section Weight Calculation
The PE parser uses a weight-based system to prioritize sections for string extraction:
| Section Type | Weight | Rationale |
|---|---|---|
| StringData (.rdata) | 10.0 | Primary string storage |
| Resources (.rsrc) | 9.0 | Version info, string tables |
| ReadOnlyData | 7.0 | May contain constants |
| WritableData (.data) | 5.0 | Runtime state, lower priority |
| Code (.text) | 1.0 | Unlikely to contain strings |
| Debug | 2.0 | Internal metadata |
| Other | 1.0 | Minimal priority |
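The weight table above maps directly to a match expression. A minimal sketch, with `SectionType` as a local stand-in for the crate's enum:

```rust
#[derive(Clone, Copy, PartialEq)]
enum SectionType {
    StringData,   // .rdata
    Resources,    // .rsrc
    ReadOnlyData,
    WritableData, // .data
    Code,         // .text
    Debug,
    Other,
}

/// Weight values from the table: higher weight = higher string likelihood.
fn section_weight(section_type: SectionType) -> f32 {
    match section_type {
        SectionType::StringData => 10.0,  // primary string storage
        SectionType::Resources => 9.0,    // version info, string tables
        SectionType::ReadOnlyData => 7.0, // may contain constants
        SectionType::WritableData => 5.0, // runtime state, lower priority
        SectionType::Debug => 2.0,        // internal metadata
        SectionType::Code | SectionType::Other => 1.0,
    }
}
```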
Limitations
The current PE parser handles the most common resource types:
- VERSIONINFO: Complete extraction of all StringFileInfo fields
- STRINGTABLE: Full parsing of RT_STRING blocks with language support
- MANIFEST: Encoding detection and XML extraction
Some resource types are not yet handled:
- Dialog Resources: RT_DIALOG parsing not yet implemented
- Menu Resources: RT_MENU parsing not yet implemented
- Icon Strings: RT_ICON metadata extraction not yet implemented
Future Enhancements:
- Dialog resource parsing for control text and window titles
- Menu resource parsing for menu item text
- Icon and cursor resource metadata
- Accelerator table string extraction
Mach-O (Mach Object)
Used on macOS and iOS for executables, frameworks, and libraries.
Key Sections for String Extraction
| Segment | Section | Priority | Description |
|---|---|---|---|
| `__TEXT` | `__cstring` | High | C string literals |
| `__TEXT` | `__const` | High | Constant data |
| `__DATA_CONST` | * | Medium | Read-only after fixups |
| `__DATA` | * | Low | Writable data |
Mach-O-Specific Features
- Load Commands: Extract strings from `LC_*` commands
- Segment/Section Model: Two-level naming scheme
- Fat Binaries: Multi-architecture support
- String Pools: Centralized string storage in `__cstring`
Load Command Processing
Mach-O load commands contain valuable strings:
- `LC_LOAD_DYLIB`: Library paths and names
- `LC_RPATH`: Runtime search paths
- `LC_ID_DYLIB`: Library identification
- `LC_BUILD_VERSION`: Build tool information
Implementation Details
impl MachoParser {
fn classify_section(segment_name: &str, section_name: &str) -> SectionType {
match (segment_name, section_name) {
("__TEXT", "__cstring") => SectionType::StringData,
("__DATA_CONST", _) => SectionType::ReadOnlyData,
("__DATA", _) => SectionType::WritableData,
// ... more classifications
}
}
}
Cross-Platform Considerations
Encoding Differences
| Platform | Primary Encoding | Notes |
|---|---|---|
| Linux/Unix | UTF-8 | ASCII-compatible, variable width |
| Windows | UTF-16LE | Wide strings common in APIs |
| macOS | UTF-8 | Similar to Linux, some UTF-16 |
String Storage Patterns
- ELF: Strings often in `.rodata` with null terminators
- PE: Mix of ANSI and Unicode APIs; resources use UTF-16
- Mach-O: Centralized in `__cstring`, mostly UTF-8
Section Weight Calculation
Different formats require different weighting strategies:
fn calculate_section_weight(format: BinaryFormat, section_type: SectionType) -> i32 {
match (format, section_type) {
(BinaryFormat::Elf, SectionType::StringData) => 10, // .rodata
(BinaryFormat::Pe, SectionType::Resources) => 9, // .rsrc
(BinaryFormat::MachO, SectionType::StringData) => 10, // __cstring
// ... more weights
}
}
Format Detection
Stringy uses goblin for robust format detection:
pub fn detect_format(data: &[u8]) -> BinaryFormat {
match Object::parse(data) {
Ok(Object::Elf(_)) => BinaryFormat::Elf,
Ok(Object::PE(_)) => BinaryFormat::Pe,
Ok(Object::Mach(_)) => BinaryFormat::MachO,
_ => BinaryFormat::Unknown,
}
}
Future Enhancements
Planned Format Extensions
- WebAssembly (WASM): Growing importance in web and edge computing
- Java Class Files: JVM bytecode analysis
- Android APK/DEX: Mobile application analysis
Enhanced Resource Support
- PE: Dialog resources, icon strings, version blocks
- Mach-O: Plist resources, framework bundles
- ELF: Note sections, build IDs, GNU attributes
Architecture-Specific Features
- ARM64: Pointer authentication, tagged pointers
- x86-64: RIP-relative addressing hints
- RISC-V: Emerging architecture support
This comprehensive format support ensures Stringy can effectively analyze binaries across all major platforms while respecting the unique characteristics of each format.
String Extraction
Stringy’s string extraction engine is designed to find meaningful strings while avoiding noise and false positives. The extraction process is encoding-aware, section-aware, and configurable.
Extraction Pipeline
Binary Data → Section Analysis → Encoding Detection → String Scanning → Deduplication → Classification
Encoding Support
ASCII Extraction
ASCII is the most common encoding in binaries. ASCII extraction provides the foundation of string extraction, with configurable minimum-length thresholds.
UTF-16LE Extraction
UTF-16LE extraction is implemented and primarily targets Windows PE binaries, with confidence scoring and noise-filter integration.
Algorithm
- Scan for printable sequences: Characters in range 0x20-0x7E (strict printable ASCII)
- Length filtering: Configurable minimum length (default: 4 characters)
- Null termination: Respect null terminators but don’t require them
- Section awareness: Integrate with section metadata for context-aware filtering
Basic Extraction
use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};
let data = b"Hello\0World\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);
for string in strings {
println!("Found: {} at offset {}", string.text, string.offset);
}
Configuration
use stringy::extraction::AsciiExtractionConfig;
// Default configuration (min_length: 4, no max_length)
let config = AsciiExtractionConfig::default();
// Custom minimum length
let config = AsciiExtractionConfig::new(8);
// Custom minimum and maximum length
let mut config = AsciiExtractionConfig::default();
config.max_length = Some(256);
UTF-8 Extraction
UTF-8 extraction builds on ASCII extraction and handles multi-byte characters. See the main extraction module for UTF-8 support.
Implementation Details
fn extract_ascii_strings(data: &[u8], min_len: usize) -> Vec<RawString> {
    let mut strings = Vec::new();
    let mut current_string = Vec::new();
    let mut start_offset = 0;
    for (i, &byte) in data.iter().enumerate() {
        if is_printable_ascii(byte) {
            if current_string.is_empty() {
                start_offset = i;
            }
            current_string.push(byte);
        } else {
            if current_string.len() >= min_len {
                strings.push(RawString {
                    data: current_string.clone(),
                    offset: start_offset,
                    encoding: Encoding::Ascii,
                });
            }
            current_string.clear();
        }
    }
    // Flush a trailing run that reaches the end of the buffer
    // (strings are not required to be null-terminated).
    if current_string.len() >= min_len {
        strings.push(RawString {
            data: current_string,
            offset: start_offset,
            encoding: Encoding::Ascii,
        });
    }
    strings
}
Noise Filtering
Stringy implements a multi-layered heuristic filtering system to reduce false positives and identify noise in extracted strings. The filtering system uses a combination of entropy analysis, character distribution, linguistic patterns, length checks, repetition detection, and context-aware filtering.
Filter Architecture
The noise filtering system consists of multiple independent filters that can be combined with configurable weights:
- Character Distribution Filter: Detects abnormal character frequency distributions
- Entropy Filter: Uses Shannon entropy to detect padding/repetition and random binary
- Linguistic Pattern Filter: Analyzes vowel-to-consonant ratios and common bigrams
- Length Filter: Penalizes excessively long strings and very short strings in low-weight sections
- Repetition Filter: Detects repeated character patterns and repeated substrings
- Context-Aware Filter: Boosts confidence for strings in high-weight sections
Character Distribution Analysis
Detects strings with abnormal character distributions:
- Excessive punctuation (>80%): Low confidence (0.2)
- Excessive repetition (>90% same character): Very low confidence (0.1)
- Excessive non-alphanumeric (>70%): Low confidence (0.3)
- Reasonable distribution: High confidence (1.0)
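The thresholds above can be sketched as a standalone check. The cutoffs mirror the documented values; the check order (repetition before punctuation) is an assumption of this sketch, not confirmed crate behavior.

```rust
/// Confidence based on character distribution, using the documented thresholds.
fn char_distribution_confidence(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let punct = s.chars().filter(|c| c.is_ascii_punctuation()).count();
    let non_alnum = s.chars().filter(|c| !c.is_alphanumeric()).count();
    // Highest count of any single character (O(n^2); fine for short strings).
    let max_same = s
        .chars()
        .map(|c| s.chars().filter(|&d| d == c).count())
        .max()
        .unwrap_or(0);
    if max_same as f32 / total as f32 > 0.9 {
        0.1 // excessive repetition (>90% same character)
    } else if punct as f32 / total as f32 > 0.8 {
        0.2 // excessive punctuation (>80%)
    } else if non_alnum as f32 / total as f32 > 0.7 {
        0.3 // excessive non-alphanumeric (>70%)
    } else {
        1.0 // reasonable distribution
    }
}
```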
Entropy-Based Filtering
Uses Shannon entropy (bits per byte) to classify strings:
- Very low entropy (<1.5 bits/byte): Likely padding or repetition (confidence: 0.1)
- Very high entropy (>7.5 bits/byte): Likely random binary (confidence: 0.2)
- Optimal range (3.5-6.0 bits/byte): High confidence (1.0)
- Acceptable range (2.0-7.0 bits/byte): Moderate confidence (0.4-0.7)
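Shannon entropy in bits per byte is straightforward to compute from byte frequencies. A minimal sketch of the calculation and the documented thresholds (the exact shape of the moderate-confidence band is an assumption here):

```rust
/// Shannon entropy of a byte slice, in bits per byte (0.0 for empty input).
fn shannon_entropy(data: &[u8]) -> f32 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f32;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f32 / len;
            -p * p.log2() // -sum(p * log2(p))
        })
        .sum()
}

/// Map entropy to a confidence value using the documented ranges.
fn entropy_confidence(s: &str) -> f32 {
    match shannon_entropy(s.as_bytes()) {
        e if e < 1.5 => 0.1,                  // padding or repetition
        e if e > 7.5 => 0.2,                  // likely random binary
        e if (3.5..=6.0).contains(&e) => 1.0, // optimal range
        _ => 0.5,                             // acceptable but not optimal
    }
}
```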
Linguistic Pattern Detection
Analyzes text for word-like patterns:
- Vowel-to-consonant ratio: Reasonable range 0.2-0.8 for English
- Common bigrams: Detects common English patterns (th, he, in, er, an, re, on, at, en, nd)
- Handles non-English: Gracefully handles non-English strings without over-penalizing
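The vowel-ratio and bigram checks can be sketched together. The 0.2-0.8 range and the bigram list come from the text; how the two signals combine into a score is an assumption of this sketch.

```rust
const COMMON_BIGRAMS: [&str; 10] =
    ["th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"];

/// Score a string by word-likeness: vowel ratio in range plus common bigrams.
fn linguistic_confidence(s: &str) -> f32 {
    let lower = s.to_lowercase();
    let letters: Vec<char> = lower.chars().filter(|c| c.is_ascii_alphabetic()).collect();
    if letters.is_empty() {
        return 0.5; // no letters: neither word-like nor heavily penalized
    }
    let vowels = letters.iter().filter(|c| "aeiou".contains(**c)).count();
    let ratio = vowels as f32 / letters.len() as f32;
    let vowel_ok = (0.2..=0.8).contains(&ratio);
    let has_bigram = COMMON_BIGRAMS.iter().any(|bg| lower.contains(bg));
    match (vowel_ok, has_bigram) {
        (true, true) => 1.0,
        (true, false) | (false, true) => 0.7,
        (false, false) => 0.4, // not over-penalized: may be non-English
    }
}
```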
Length-Based Filtering
Applies penalties based on string length:
- Excessively long (>200 characters): Low confidence (0.3) - likely table data
- Very short in low-weight sections (<4 chars, weight <0.5): Moderate confidence (0.5)
- Normal length (4-100 characters): High confidence (1.0)
Repetition Detection
Identifies repetitive patterns:
- Repeated characters (e.g., “AAAA”, “0000”): Very low confidence (0.1)
- Repeated substrings (e.g., “abcabcabc”): Low confidence (0.2)
- Normal strings: High confidence (1.0)
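Both repetition cases above reduce to period-finding: a repeated character is a period of 1, a repeated substring a longer period that evenly divides the length. A minimal sketch with the documented confidence values:

```rust
/// Detect repeated-character and repeated-substring patterns.
fn repetition_confidence(s: &str) -> f32 {
    let chars: Vec<char> = s.chars().collect();
    if chars.len() < 2 {
        return 1.0;
    }
    // Single repeated character, e.g. "AAAA" or "0000".
    if chars.iter().all(|&c| c == chars[0]) {
        return 0.1;
    }
    // Repeated substring, e.g. "abcabcabc": some unit repeated >= 2 times.
    let n = chars.len();
    for period in 1..=n / 2 {
        if n % period == 0 && chars.iter().enumerate().all(|(i, &c)| c == chars[i % period]) {
            return 0.2;
        }
    }
    1.0 // normal string
}
```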
Context-Aware Filtering
Boosts or reduces confidence based on section context:
- String data sections (.rodata, .rdata, __cstring): High confidence (0.9-1.0)
- Read-only data sections: High confidence (0.9)
- Resource sections: Maximum confidence (1.0) - known-good sources
- Code sections: Lower confidence (0.3-0.5)
- Writable data sections: Moderate confidence (0.6)
Configuration
use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};
// Default configuration
let config = NoiseFilterConfig::default();
// Customize thresholds
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;
// Customize filter weights
config.filter_weights = FilterWeights {
entropy_weight: 0.3,
char_distribution_weight: 0.25,
linguistic_weight: 0.2,
length_weight: 0.15,
repetition_weight: 0.05,
context_weight: 0.05,
};
Using Noise Filters
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
use stringy::types::SectionType;
let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();
let confidence = filter.calculate_confidence("Hello, World!", &context);
if confidence >= 0.5 {
// String passed filtering threshold
}
Confidence Scoring
Each string is assigned a confidence score (0.0-1.0) indicating how likely it is to be legitimate:
- 1.0: Maximum confidence (strings from known-good sources like imports, exports, resources)
- 0.7-0.9: High confidence (likely legitimate strings)
- 0.5-0.7: Moderate confidence (may need review)
- 0.0-0.5: Low confidence (likely noise, filtered out by default)
The confidence score is separate from the score field used for final ranking. Confidence specifically represents the noise filtering assessment.
Performance
Noise filtering is designed to add minimal overhead (<10% per acceptance criteria). Individual filters are optimized for performance, and the composite filter allows enabling/disabling specific filters to balance accuracy and speed.
UTF-16 Extraction
Critical for Windows binaries and some resources. Supports both UTF-16LE (Little-Endian) and UTF-16BE (Big-Endian) with automatic byte order detection.
UTF-16LE (Little-Endian)
Most common on Windows platforms. Default 3 character minimum.
Detection heuristics:
- Even-length sequences (2-byte alignment required)
- Low byte printable, high byte mostly zero
- Null termination patterns (0x00 0x00)
- Advanced confidence scoring with multiple heuristics
UTF-16BE (Big-Endian)
Found in Java .class files, network protocols, some cross-platform binaries.
Detection heuristics:
- Even-length sequences
- High byte printable, low byte mostly zero
- Reverse byte order from UTF-16LE
- Same advanced confidence scoring as UTF-16LE
Automatic Byte Order Detection
The ByteOrder::Auto mode automatically detects and extracts both UTF-16LE and UTF-16BE strings from the same data, avoiding duplicates and correctly identifying the encoding of each string.
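The core of auto-detection is deciding, for a 2-byte-aligned run, which byte of each pair carries the printable character. A simplified sketch of that heuristic; the 75% majority threshold is an assumption of this sketch, not a documented constant:

```rust
#[derive(Debug, PartialEq)]
enum DetectedOrder {
    Le,
    Be,
    Unknown,
}

/// Vote per code unit: LE puts the printable byte first, BE puts it second.
fn detect_byte_order(data: &[u8]) -> DetectedOrder {
    let printable = |b: u8| (0x20..=0x7E).contains(&b);
    let mut le_votes = 0usize; // low byte printable, high byte zero
    let mut be_votes = 0usize; // high byte printable, low byte zero
    for pair in data.chunks_exact(2) {
        if printable(pair[0]) && pair[1] == 0 {
            le_votes += 1;
        }
        if printable(pair[1]) && pair[0] == 0 {
            be_votes += 1;
        }
    }
    let pairs = data.len() / 2;
    if pairs == 0 {
        return DetectedOrder::Unknown;
    }
    // Require a clear (>= 75%) majority before committing to an order.
    if le_votes * 4 >= pairs * 3 && le_votes > be_votes {
        DetectedOrder::Le
    } else if be_votes * 4 >= pairs * 3 && be_votes > le_votes {
        DetectedOrder::Be
    } else {
        DetectedOrder::Unknown
    }
}
```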
Implementation
UTF-16 extraction is implemented in src/extraction/utf16.rs following the pattern established in the ASCII extractor. The implementation provides:
- `extract_utf16_strings()`: Main extraction function supporting both byte orders
- `extract_utf16le_strings()`: UTF-16LE-specific extraction (backward compatibility)
- `extract_from_section()`: Section-aware extraction with proper metadata population
- `Utf16ExtractionConfig`: Configuration for minimum/maximum character count, byte order selection, and confidence thresholds
- `ByteOrder` enum: Control which byte order(s) to scan (LE, BE, Auto)
Usage Example:
use stringy::extraction::utf16::{extract_utf16_strings, Utf16ExtractionConfig, ByteOrder};
// Extract UTF-16LE strings from Windows PE binary
let config = Utf16ExtractionConfig {
byte_order: ByteOrder::LE,
min_length: 3,
confidence_threshold: 0.6,
..Default::default()
};
let strings = extract_utf16_strings(data, &config);
// Extract both UTF-16LE and UTF-16BE with auto-detection
let config = Utf16ExtractionConfig {
byte_order: ByteOrder::Auto,
..Default::default()
};
let strings = extract_utf16_strings(data, &config);
Configuration:
use stringy::extraction::utf16::{Utf16ExtractionConfig, ByteOrder};
// Default configuration (min_length: 3, byte_order: Auto, confidence_threshold: 0.5)
let config = Utf16ExtractionConfig::default();
// Custom minimum character length
let config = Utf16ExtractionConfig::new(5);
// Custom configuration
let mut config = Utf16ExtractionConfig::default();
config.min_length = 3;
config.max_length = Some(256);
config.byte_order = ByteOrder::LE;
config.confidence_threshold = 0.6;
UTF-16-Specific Confidence Scoring
UTF-16 extraction uses advanced confidence scoring to detect false positives from null-interleaved binary data. The confidence score combines multiple heuristics:
- Valid Unicode range check: Validates code points are in valid Unicode ranges (U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF); penalizes private use areas and invalid surrogates
- Printable character ratio: Calculates the ratio of printable characters, including common Unicode ranges
- ASCII ratio: Boosts confidence for ASCII-heavy strings (>50% of characters in the ASCII printable range)
- Null pattern detection: Flags suspicious patterns such as:
  - Excessive nulls (>30% of characters)
  - Regular null intervals (every 2nd, 4th, 8th position)
  - Fixed-offset nulls indicating structured binary data
- Byte order consistency: Verifies the byte order is consistent throughout the string (for Auto mode)
Confidence Formula:
confidence = (valid_unicode_weight × valid_ratio)
+ (printable_weight × printable_ratio)
+ (ascii_weight × ascii_ratio)
- (null_pattern_penalty)
- (invalid_range_penalty)
The result is clamped to 0.0-1.0 range.
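The formula translates directly into code. A minimal sketch; the specific weight values (0.4/0.4/0.2) are assumed for illustration, not the crate's actual defaults:

```rust
/// Combine the UTF-16 heuristic ratios into a single confidence score,
/// following the documented formula: weighted sum minus penalties, clamped.
fn utf16_confidence(
    valid_ratio: f32,
    printable_ratio: f32,
    ascii_ratio: f32,
    null_pattern_penalty: f32,
    invalid_range_penalty: f32,
) -> f32 {
    // Assumed weights for this sketch.
    let (valid_w, printable_w, ascii_w) = (0.4f32, 0.4f32, 0.2f32);
    let raw = valid_w * valid_ratio
        + printable_w * printable_ratio
        + ascii_w * ascii_ratio
        - null_pattern_penalty
        - invalid_range_penalty;
    raw.clamp(0.0, 1.0)
}
```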
Examples:
- High confidence: “Microsoft Corporation” (>90% printable, valid Unicode, no null patterns)
- Medium confidence: “Test123” (>70% printable, valid Unicode)
- Low confidence: Null-interleaved binary table data (excessive nulls, regular patterns)
The UTF-16-specific confidence score is combined with general noise filtering confidence when noise filtering is enabled, using the minimum of both scores.
False Positive Prevention
UTF-16 extraction is prone to false positives because binary data with null bytes can look like UTF-16 strings. The confidence scoring system mitigates this by:
- Detecting null-interleaved patterns: Binary tables with numeric data (e.g., `[0x01, 0x00, 0x02, 0x00]`) are flagged as suspicious
- Penalizing regular null patterns: Data with nulls at fixed intervals (every 2nd, 4th, 8th byte) receives lower confidence
- Validating Unicode ranges: Invalid code points and surrogate pairs reduce confidence
- Configurable threshold: The `utf16_confidence_threshold` (default 0.5) can be tuned to balance recall and precision
Recommendations:
- For Windows PE binaries: Use `ByteOrder::LE` with `confidence_threshold: 0.6`
- For Java .class files: Use `ByteOrder::BE` with `confidence_threshold: 0.5`
- For unknown formats: Use `ByteOrder::Auto` with `confidence_threshold: 0.5`
- For high-precision extraction: Increase `confidence_threshold` to 0.7-0.8
Performance Considerations
UTF-16 scanning adds overhead compared to ASCII/UTF-8 extraction:
- Scanning both byte orders: Auto mode doubles the work by scanning for both LE and BE
- Confidence scoring: The multi-heuristic confidence calculation adds computational cost
- Recommendations:
- Use specific byte order (LE or BE) when the target format is known
- Auto mode is best for unknown or mixed-format binaries
- Consider disabling UTF-16 extraction for formats that don’t use it (e.g., pure ELF binaries)
Section-Aware Extraction
Different sections have different string extraction strategies.
High-Priority Sections
ELF: .rodata and variants
- Strategy: Aggressive extraction, low noise filtering
- Encodings: ASCII/UTF-8 primary, UTF-16 secondary
- Minimum length: 3 characters
PE: .rdata
- Strategy: Balanced extraction
- Encodings: ASCII and UTF-16LE equally
- Minimum length: 4 characters
Mach-O: __TEXT,__cstring
- Strategy: High confidence, null-terminated focus
- Encodings: UTF-8 primary
- Minimum length: 3 characters
Medium-Priority Sections
ELF: .data.rel.ro
- Strategy: Conservative extraction
- Noise filtering: Enhanced
- Minimum length: 5 characters
PE: .data (read-only)
- Strategy: Moderate extraction
- Context checking: Enhanced validation
Low-Priority Sections
Writable data sections
- Strategy: Very conservative
- High noise filtering: Skip obvious runtime data
- Minimum length: 6+ characters
Resource Sections
PE Resources (.rsrc)
- VERSIONINFO: Extract version strings, product names
- STRINGTABLE: Localized UI strings
- RT_MANIFEST: XML manifest data
fn extract_pe_resources(pe: &PE, data: &[u8]) -> Vec<RawString> {
let mut strings = Vec::new();
// Extract version info
if let Some(version_info) = extract_version_info(pe, data) {
strings.extend(version_info);
}
// Extract string tables
if let Some(string_tables) = extract_string_tables(pe, data) {
strings.extend(string_tables);
}
strings
}
Deduplication Strategy
Canonicalization
Strings are canonicalized while preserving important metadata:
- Normalize whitespace: Convert tabs/newlines to spaces
- Trim boundaries: Remove leading/trailing whitespace
- Case preservation: Maintain original case for analysis
- Encoding normalization: Convert to UTF-8 for comparison
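The first three canonicalization steps can be sketched as a small helper (encoding normalization is assumed to have already happened upstream, since Rust strings are UTF-8):

```rust
/// Canonicalize a string for deduplication: map tabs/newlines to spaces,
/// trim leading/trailing whitespace, and preserve the original case.
fn canonicalize(s: &str) -> String {
    s.chars()
        .map(|c| if c == '\t' || c == '\n' || c == '\r' { ' ' } else { c })
        .collect::<String>()
        .trim()
        .to_string()
}
```

Note that two occurrences differing only in surrounding whitespace collapse to the same map key, while their offsets and sections are kept in the occurrence list.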
Metadata Preservation
When duplicates are found:
struct DeduplicatedString {
canonical_text: String,
occurrences: Vec<StringOccurrence>,
primary_encoding: Encoding,
best_section: Option<String>,
}
struct StringOccurrence {
offset: u64,
section: Option<String>,
encoding: Encoding,
length: u32,
}
Deduplication Algorithm
fn deduplicate_strings(strings: Vec<RawString>) -> Vec<DeduplicatedString> {
let mut map: HashMap<String, DeduplicatedString> = HashMap::new();
for string in strings {
let canonical = canonicalize(&string.text);
map.entry(canonical.clone())
.or_insert_with(|| DeduplicatedString::new(canonical))
.add_occurrence(string);
}
map.into_values().collect()
}
Configuration Options
Extraction Configuration
use stringy::extraction::{ByteOrder, Encoding, ExtractionConfig};
pub struct ExtractionConfig {
pub min_ascii_length: usize, // Default: 4
pub min_wide_length: usize, // Default: 3 (for UTF-16)
pub enabled_encodings: Vec<Encoding>, // Default: ASCII, UTF-8
pub noise_filtering_enabled: bool, // Default: true
pub min_confidence_threshold: f32, // Default: 0.5
pub utf16_min_confidence: f32, // Default: 0.7 (for UTF-16LE)
pub utf16_byte_order: ByteOrder, // Default: Auto
pub utf16_confidence_threshold: f32, // Default: 0.5 (UTF-16-specific)
}
UTF-16 Configuration Examples:
use stringy::extraction::{ExtractionConfig, Encoding, ByteOrder};
// Extract UTF-16LE strings from Windows PE binary
let mut config = ExtractionConfig::default();
config.min_wide_length = 3;
config.utf16_confidence_threshold = 0.6;
config.utf16_byte_order = ByteOrder::LE;
config.enabled_encodings.push(Encoding::Utf16Le);
// Extract both UTF-16LE and UTF-16BE with auto-detection
let mut config = ExtractionConfig::default();
config.enabled_encodings.push(Encoding::Utf16Le);
config.enabled_encodings.push(Encoding::Utf16Be);
config.utf16_byte_order = ByteOrder::Auto;
Noise Filter Configuration
use stringy::extraction::config::NoiseFilterConfig;
pub struct NoiseFilterConfig {
pub entropy_min: f32, // Default: 1.5
pub entropy_max: f32, // Default: 7.5
pub max_length: usize, // Default: 200
pub max_repetition_ratio: f32, // Default: 0.7
pub min_vowel_ratio: f32, // Default: 0.1
pub max_vowel_ratio: f32, // Default: 0.9
pub filter_weights: FilterWeights, // Default: balanced weights
}
Filter Weights
use stringy::extraction::config::FilterWeights;
pub struct FilterWeights {
pub entropy_weight: f32, // Default: 0.25
pub char_distribution_weight: f32, // Default: 0.20
pub linguistic_weight: f32, // Default: 0.20
pub length_weight: f32, // Default: 0.15
pub repetition_weight: f32, // Default: 0.10
pub context_weight: f32, // Default: 0.10
}
All weights must sum to 1.0. The configuration validates this automatically.
Encoding Selection
#[non_exhaustive]
pub enum EncodingFilter {
/// Match a specific encoding exactly
Exact(Encoding),
/// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
Utf16Any,
}
Section Filtering
pub struct SectionFilter {
pub include_sections: Option<Vec<String>>,
pub exclude_sections: Option<Vec<String>>,
pub include_debug: bool,
pub include_resources: bool,
}
Performance Optimizations
Memory Mapping
Large files use memory mapping for efficient access via mmap-guard:
fn extract_from_large_file(path: &Path) -> Result<Vec<RawString>> {
let data = mmap_guard::map_file(path)?;
// data implements Deref<Target = [u8]>
extract_strings(&data[..])
}
Note: The Pipeline::run API handles memory mapping automatically.
Parallel Processing
Parallel processing is not yet implemented. Section extraction currently runs sequentially.
Regex Caching
Pattern matching uses cached regex compilation:
lazy_static! {
static ref URL_REGEX: Regex = Regex::new(r"https?://[^\s]+").unwrap();
static ref GUID_REGEX: Regex = Regex::new(r"\{[0-9a-fA-F-]{36}\}").unwrap();
}
Quality Assurance
Validation Heuristics
The noise filtering system implements comprehensive validation:
- Entropy checking: Uses Shannon entropy to detect padding/repetition and random binary data
- Language detection: Analyzes vowel-to-consonant ratios and common bigrams
- Context validation: Considers section type, weight, and permissions
- Character distribution: Detects abnormal frequency distributions
- Repetition detection: Identifies repeated patterns and padding
False Positive Reduction
The multi-layered filtering system targets common sources of false positives:
- Padding detection: Identifies repeated character sequences (e.g., “AAAA”, “\x00\x00\x00\x00”)
- Table data: Filters excessively long strings likely to be structured data
- Binary noise: High-entropy strings are flagged as likely random binary
- Context awareness: Strings in code sections receive lower confidence scores
Performance Characteristics
Noise filtering is designed for minimal overhead:
- Target overhead: <10% compared to extraction without filtering
- Optimized filters: Each filter is independently optimized
- Configurable: Can enable/disable individual filters to balance accuracy and speed
- Scalable: Handles large binaries efficiently
Examples
Basic Extraction with Filtering
use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
let data = b"Hello World\0AAAA\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);
let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();
let filtered: Vec<_> = strings
.into_iter()
.filter(|s| filter.calculate_confidence(&s.text, &context) >= 0.5)
.collect();
Custom Filter Configuration
use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;
config.filter_weights = FilterWeights {
entropy_weight: 0.4,
char_distribution_weight: 0.3,
linguistic_weight: 0.15,
length_weight: 0.1,
repetition_weight: 0.03,
context_weight: 0.02,
};
This comprehensive extraction system ensures high-quality string extraction while maintaining performance and minimizing false positives through multi-layered noise filtering.
Classification System
Stringy applies semantic analysis to extracted strings, identifying patterns that indicate specific types of data. This helps analysts focus on the most relevant information quickly.
Classification Pipeline
Raw String -> Pattern Matching -> Validation -> Tag Assignment
Semantic Categories
URLs
- Pattern: `https?://[^\s<>"{}|\\^\[\]]+`
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`
Domain Names
- Pattern: RFC 1035 compliant domain format
- Examples: `example.com`, `subdomain.evil.site`
- Validation: Valid TLD from known list; not a URL or email
IP Addresses
- IPv4 Pattern: Standard dotted-decimal notation
- IPv6 Pattern: Full and compressed formats
- Examples: `192.168.1.1`, `::1`, `2001:db8::1`
- Validation: Valid octet ranges for IPv4, proper format for IPv6
File Paths
- POSIX Pattern: Paths starting with `/`
- Windows Pattern: Drive letters (`C:\`) or relative paths
- UNC Pattern: `\\server\share` format
- Examples: `/etc/passwd`, `C:\Windows\System32`, `\\server\share\file`
Registry Paths
- Pattern: `HKEY_*` or `HK*\` prefixes
- Examples: `HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft`
- Validation: Must start with a valid registry root key
GUIDs
- Pattern: `\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}`
- Examples: `{12345678-1234-1234-1234-123456789abc}`
- Validation: Strict format compliance; braces required
Email Addresses
- Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Examples: `admin@malware.com`, `user.name+tag@example.co.uk`
- Validation: Single `@`, valid TLD length and characters, no empty parts
Base64 Data
- Pattern: `[A-Za-z0-9+/]{20,}={0,2}`
- Examples: `U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==`
- Validation: Length >= 20, length divisible by 4, padding rules, entropy threshold
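The structural half of the Base64 rules can be sketched as a plain check; the entropy threshold is omitted here, and the function name is illustrative rather than Stringy's API.

```rust
/// Structural Base64 check mirroring the documented rules: length >= 20,
/// length divisible by 4, alphabet-only content, padding only at the end.
/// (Sketch only; Stringy additionally applies an entropy threshold.)
fn looks_like_base64(s: &str) -> bool {
    if s.len() < 20 || s.len() % 4 != 0 {
        return false;
    }
    let trimmed = s.trim_end_matches('=');
    // At most two '=' padding characters, and only at the end.
    if s.len() - trimmed.len() > 2 {
        return false;
    }
    trimmed
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}

fn main() {
    println!("{}", looks_like_base64("U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==")); // true
    println!("{}", looks_like_base64("short")); // false
}
```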
Format Strings
- Pattern: `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}`
- Examples: `Error: %s at line %d`, `User {0} logged in`
- Validation: Reasonable specifier count, context-aware thresholds
User Agents
- Pattern: `Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+`
- Examples: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `Chrome/117.0.5938.92`
- Validation: Known browser identifiers and minimum length
Pattern Matching Engine
The semantic classifier uses cached regex patterns via once_cell::sync::Lazy and applies validation checks to reduce false positives.
#![allow(unused)]
fn main() {
use once_cell::sync::Lazy;
use regex::Regex;
static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$")
.expect("Invalid GUID regex")
});
}
Using the Classification System
#![allow(unused)]
fn main() {
use stringy::classification::SemanticClassifier;
use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};
let classifier = SemanticClassifier::new();
let context = StringContext::new(
SectionType::StringData,
BinaryFormat::Elf,
Encoding::Ascii,
StringSource::SectionData,
)
.with_section_name(".rodata".to_string());
let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);
if tags.contains(&Tag::Guid) {
// Handle GUID indicator
}
}
Validation Rules
- GUID: Braced, hyphenated, hex-only format.
- Email: TLD length must be between 2 and 24 and alphabetic; domain must include a dot.
- Base64: Length must be divisible by 4, padding allowed only at the end, entropy threshold applied.
- Format String: Must contain at least one specifier and pass context-aware length checks.
- User Agent: Must contain a known browser token and meet minimum length.
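The email rules above can be sketched as a post-regex validation pass. This is a hedged illustration: the function name and the exact sequence of checks are assumptions, following the stated bounds (single `@`, a dotted domain, an alphabetic TLD of length 2-24).

```rust
/// Validation pass applied after an email-shaped match (illustrative sketch).
fn validate_email(candidate: &str) -> bool {
    // Split into local part and domain; both must be non-empty.
    let mut parts = candidate.splitn(2, '@');
    let (local, domain) = match (parts.next(), parts.next()) {
        (Some(l), Some(d)) if !l.is_empty() && !d.is_empty() => (l, d),
        _ => return false,
    };
    let _ = local;
    // Exactly one '@': the domain half must not contain another.
    if domain.contains('@') {
        return false;
    }
    // Domain must include a dot, and the TLD must be 2-24 alphabetic chars.
    match domain.rsplit_once('.') {
        Some((host, tld)) if !host.is_empty() => {
            (2..=24).contains(&tld.len()) && tld.chars().all(|c| c.is_ascii_alphabetic())
        }
        _ => false,
    }
}

fn main() {
    println!("{}", validate_email("admin@malware.com")); // true
    println!("{}", validate_email("not-an-email")); // false
}
```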
Performance Notes
- Regexes are compiled once via `once_cell::sync::Lazy` and reused across calls.
- Minimum length checks avoid unnecessary regex work on short inputs.
- The classifier is stateless and thread-safe.
Testing
- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration_tests.rs`
Run tests with:
just test
Ranking Algorithm
Stringy's ranking system prioritizes strings by relevance, helping analysts focus on the most important findings first. The algorithm combines multiple factors to produce a comprehensive relevance score.
Scoring Formula
Final Score = SectionWeight + SemanticBoost - NoisePenalty
Each component contributes to the overall relevance assessment. The resulting internal score is then mapped to a display score (0-100) via band mapping.
Note: Section weights use a 1.0-10.0 scale, and semantic boosts add to the internal score. The pipeline's normalizer then maps the combined internal score to a 0-100 display score using the band table shown in Display Score Mapping below.
Section Weight
Different sections have varying likelihood of containing meaningful strings. Container parsers assign weights (1.0-10.0) to each section based on its type and name.
Weight Ranges
| Section Type | Typical Weight | Examples |
|---|---|---|
| Dedicated string storage | 8.0-10.0 | .rodata, __TEXT,__cstring, .rsrc |
| Read-only data | 7.0 | .data.rel.ro, __DATA_CONST |
| General data | 5.0 | .data |
| Code sections | 1.0 | .text |
Format-specific adjustments are applied based on section names. For example, ELF .rodata.str1.1 (aligned strings) and PE .rsrc (rich resources) receive additional priority.
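A weight-assignment function following the table above might look like the sketch below. The specific weights, name matches, and the `.rodata.str1.1` bump are illustrative assumptions, not the parsers' literal values.

```rust
/// Assign a section weight (1.0-10.0) from its name and executability.
/// Values are illustrative, chosen to match the documented ranges.
fn section_weight(name: &str, is_executable: bool) -> f32 {
    let base = match name {
        // Dedicated string storage: 8.0-10.0
        n if n.starts_with(".rodata") || n == "__cstring" || n == ".rsrc" => 9.0,
        // Read-only data: 7.0
        ".data.rel.ro" | "__DATA_CONST" => 7.0,
        // General data: 5.0
        ".data" => 5.0,
        // Code sections: 1.0
        _ if is_executable => 1.0,
        _ => 3.0,
    };
    // Format-specific bump: ELF aligned-string sections are especially string-rich.
    if name == ".rodata.str1.1" { base + 1.0 } else { base }
}

fn main() {
    println!("{}", section_weight(".rodata.str1.1", false));
    println!("{}", section_weight(".text", true));
}
```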
Semantic Boost
Strings with recognized semantic meaning receive score boosts based on their tags.
Boost Categories
| Tag Category | Boost Level | Examples |
|---|---|---|
| Network (URL, Domain, IP) | High | https://api.evil.com |
| Identifiers (GUID, Email) | High | {12345678-1234-...} |
| File System (Path, Registry) | Medium-High | C:\Windows\System32\evil.dll |
| User-Agent-like strings | Medium-High | Mozilla/5.0 ... |
| Version/Manifest | Medium | MyApp v1.2.3 |
| Code Artifacts (Format strings, Base64) | Medium | Error: %s at line %d |
| Symbols (Import, Export) | Low-Medium | CreateFileW, main |
Strings with multiple semantic tags receive additional (diminishing) bonuses for each extra tag.
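One way to model the diminishing per-tag bonus is a halving series, sketched below. The halving factor and bonus value are illustrative assumptions, not Stringy's actual boost schedule.

```rust
/// Diminishing bonus for extra semantic tags: each additional tag past the
/// first contributes half of the previous tag's bonus (illustrative values).
fn multi_tag_bonus(tag_count: usize, first_extra_bonus: f64) -> f64 {
    (1..tag_count)
        .map(|i| first_extra_bonus / 2f64.powi((i - 1) as i32))
        .sum()
}

fn main() {
    // One tag earns no extra bonus; three tags earn 10 + 5 = 15.
    println!("{}", multi_tag_bonus(1, 10.0));
    println!("{}", multi_tag_bonus(3, 10.0));
}
```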
Noise Penalty
Various factors indicate low-quality or noisy strings, and receive penalties:
Penalty Categories
- High Entropy: Strings with high Shannon entropy (randomness) are likely binary data or encoded content and receive significant penalties.
- Excessive Length: Very long strings are often noise (padding, embedded data). Longer strings receive progressively larger penalties.
- Repeated Patterns: Strings with excessive character repetition (e.g., `AAAAAAA...`) are penalized based on the repetition ratio.
- Common Noise Patterns: Known noise patterns receive penalties, including padding characters, hex dump patterns, and table-like data with excessive delimiters.
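A repetition ratio can be sketched as the fraction of characters that equal their predecessor. The 0.7 cutoff and the function names are illustrative assumptions, not Stringy's exact penalty formula.

```rust
/// Fraction of characters that repeat their predecessor (0.0 = no repeats).
fn repetition_ratio(s: &str) -> f64 {
    let chars: Vec<char> = s.chars().collect();
    if chars.len() < 2 {
        return 0.0;
    }
    let repeats = chars.windows(2).filter(|w| w[0] == w[1]).count();
    repeats as f64 / (chars.len() - 1) as f64
}

/// Treat heavily repetitive strings as padding-like (threshold illustrative).
fn is_padding_like(s: &str) -> bool {
    repetition_ratio(s) >= 0.7
}

fn main() {
    println!("{}", is_padding_like("AAAAAAAA")); // true
    println!("{}", is_padding_like("Hello World")); // false
}
```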
Display Score Mapping
The internal score is mapped to a display score (0-100) using bands:
| Internal Score | Display Score | Meaning |
|---|---|---|
| <= 0 | 0 | Low relevance |
| 1-79 | 1-49 | Low relevance |
| 80-119 | 50-69 | Moderate |
| 120-159 | 70-89 | Meaningful |
| 160-220 | 90-100 | High-value |
| > 220 | 100 (clamped) | High-value |
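The band table can be sketched as a mapping function. Linear interpolation within each band is an assumption here; the pipeline's normalizer may distribute values within bands differently.

```rust
/// Map an internal relevance score to a 0-100 display score using the
/// documented bands (interpolation within bands is an assumption).
fn display_score(internal: i32) -> i32 {
    match internal {
        i if i <= 0 => 0,
        1..=79 => 1 + (internal - 1) * 48 / 78,       // -> 1..=49 (low)
        80..=119 => 50 + (internal - 80) * 19 / 39,   // -> 50..=69 (moderate)
        120..=159 => 70 + (internal - 120) * 19 / 39, // -> 70..=89 (meaningful)
        160..=220 => 90 + (internal - 160) * 10 / 60, // -> 90..=100 (high-value)
        _ => 100,                                     // clamped
    }
}

fn main() {
    println!("{}", display_score(-5)); // 0
    println!("{}", display_score(120)); // 70
    println!("{}", display_score(300)); // 100
}
```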
Filtering Recommendations
- Interactive analysis: Show display scores >= 50
- Automated processing: Use display scores >= 70
- YARA rules: Focus on display scores >= 80
- High-confidence indicators: Display scores >= 90
Contributing to Stringy
We welcome contributions to Stringy! This guide will help you get started with development, testing, and submitting changes.
Development Setup
Prerequisites
- Rust: 1.91 or later (MSRV - Minimum Supported Rust Version)
- Git: For version control
- just: Task runner (install via `cargo install just` or your package manager)
- mise: Tool version manager (manages Zig and other dev tools)
- Zig: Cross-compiler for test fixtures (managed by mise)
Clone and Setup
git clone https://github.com/EvilBit-Labs/Stringy
cd Stringy
# Generate test fixtures (ELF/PE/Mach-O via Zig cross-compilation)
just gen-fixtures
# Run the full check suite
just check
Development Tools
Install recommended tools for development:
# Code formatting
rustup component add rustfmt
# Linting
rustup component add clippy
# Documentation
cargo install mdbook
# Test runner (required by just recipes)
cargo install cargo-nextest
# Coverage (optional)
cargo install cargo-llvm-cov
Project Structure
src/
+-- main.rs # CLI entry point (thin wrapper)
+-- lib.rs # Library root and public API re-exports
+-- types/
| +-- mod.rs # Core data structures (Tag, FoundString, Encoding, etc.)
| +-- error.rs # StringyError enum, Result alias
+-- container/
| +-- mod.rs # Format detection, ContainerParser trait
| +-- elf.rs # ELF parser
| +-- pe.rs # PE parser
| +-- macho.rs # Mach-O parser
+-- extraction/
| +-- mod.rs # Extraction orchestration
| +-- ascii/ # ASCII/UTF-8 extraction
| +-- utf16/ # UTF-16LE/BE extraction
| +-- dedup/ # Deduplication with scoring
| +-- filters/ # Noise filter implementations
| +-- pe_resources/ # PE version info, manifests, string tables
+-- classification/
| +-- mod.rs # Classification framework
| +-- patterns/ # Regex-based pattern matching
| +-- symbols.rs # Symbol processing and demangling
| +-- ranking.rs # Scoring algorithm
+-- output/
| +-- mod.rs # OutputFormat, OutputMetadata, dispatch
| +-- json.rs # JSONL format
| +-- table/ # TTY and plain text table formatting
| +-- yara/ # YARA rule generation
+-- pipeline/
+-- mod.rs # Pipeline::run orchestration
+-- config.rs # PipelineConfig, FilterConfig, EncodingFilter
+-- filter.rs # Post-extraction filtering
+-- normalizer.rs # Score band mapping
tests/
+-- integration_cli.rs # CLI argument and flag tests
+-- integration_cli_errors.rs # CLI error handling tests
+-- integration_elf.rs # ELF-specific tests
+-- integration_pe.rs # PE-specific tests
+-- integration_macho.rs # Mach-O-specific tests
+-- integration_extraction.rs # Extraction tests
+-- integration_flows_1_5.rs # End-to-end flow tests (1-5)
+-- integration_flows_6_8.rs # End-to-end flow tests (6-8)
+-- ... (additional test files)
+-- fixtures/ # Test binary files (flat structure)
+-- snapshots/ # Insta snapshot files
docs/
+-- src/ # mdbook documentation
+-- book.toml # Documentation config
Development Workflow
1. Create a Branch
git checkout -b feature/your-feature-name
# or
git checkout -b fix/issue-description
2. Make Changes
Use just recipes for development commands:
# Format code
just format
# Lint (clippy with -D warnings)
just lint
# Run tests
just test
# Full pre-commit check (fmt + lint + test)
just check
# Full CI suite locally
just ci-check
3. Test Your Changes
# Generate fixtures if needed
just gen-fixtures
# Run all tests
just test
# Run a specific test
cargo nextest run test_name
# Regenerate snapshots after changing test_binary.c
INSTA_UPDATE=always cargo nextest run
4. Update Documentation
If your changes affect the public API or add new features:
# Update API docs
cargo doc --open
# Update user documentation
cd docs
mdbook serve --open
Coding Standards
Rust Style
- Use `cargo fmt` for formatting
- Follow `cargo clippy` recommendations (warnings are errors)
- No `unsafe` code (`#![forbid(unsafe_code)]` is enforced)
- Zero warnings policy
- ASCII only in source code (no emojis, em-dashes, smart quotes)
- Files under 500 lines; split larger files into module directories
- No blanket `#[allow]` without inline justification
Testing
Write comprehensive tests:
- Use `insta` for snapshot testing
- Binary fixtures in `tests/fixtures/` (flat structure)
- Integration tests use two naming patterns: `integration_*.rs` and `test_*.rs`
- Use `assert_cmd` for CLI testing (note: `assert_cmd` is non-TTY)
Contribution Areas
High-Priority Areas
- String Extraction Engine - UTF-16 detection improvements, noise filtering enhancements
- Classification System - New semantic patterns, improved confidence scoring
- Output Formats - Customization options, additional format support
Getting Started Ideas
- Add new semantic patterns (email formats, crypto constants)
- Improve test coverage
- Enhance error messages
- Add documentation examples
Submitting Changes
Pull Request Process
- Fork the repository on GitHub
- Create a feature branch from `main`
- Make your changes following the guidelines above
- Add tests for new functionality (this is policy, not optional)
- Sign off commits with `git commit -s` (DCO enforced by GitHub App)
- Submit a pull request with a clear description
PR Requirements
- Sign off commits with `git commit -s` (DCO enforced)
- Pass CI (clippy, rustfmt, tests, CodeQL, cargo-deny)
- Include tests for new functionality
- Be reviewed (human or CodeRabbit) for correctness, safety, and style
- No `unwrap()` in library code, unchecked errors, or unvalidated input
Review Process
- Automated checks must pass (CI/CD)
- Code review by maintainers
- Testing on multiple platforms
- Documentation review if applicable
- Merge after approval
Community Guidelines
Getting Help
- GitHub Issues: Bug reports and feature requests
- Discussions: General questions and ideas
- Documentation: Check existing docs first
Release Process
Version Numbering
We follow Semantic Versioning:
- MAJOR: Breaking changes
- MINOR: New features (backward compatible)
- PATCH: Bug fixes (backward compatible)
Release Checklist
- Update version in `Cargo.toml`
- Update changelog via `git-cliff`
- Run full test suite
- Update documentation
- Create release tag (`vX.Y.Z`)
- Releases are built via `cargo-dist`
Thank you for contributing to Stringy!
API Documentation
This page provides an overview of Stringy's public API. For complete API documentation, run `cargo doc --open` in the project directory.
Core Types
FoundString
The primary data structure representing an extracted string with metadata.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FoundString {
/// The extracted string text
pub text: String,
/// Pre-demangled form (if symbol was demangled)
pub original_text: Option<String>,
/// The encoding used for this string
pub encoding: Encoding,
/// File offset where the string was found
pub offset: u64,
/// Relative Virtual Address (if available)
pub rva: Option<u64>,
/// Section name where the string was found
pub section: Option<String>,
/// Length of the string in bytes
pub length: u32,
/// Semantic tags applied to this string
pub tags: Vec<Tag>,
/// Relevance score for ranking
pub score: i32,
/// Section weight component of score (debug only)
pub section_weight: Option<i32>,
/// Semantic boost component of score (debug only)
pub semantic_boost: Option<i32>,
/// Noise penalty component of score (debug only)
pub noise_penalty: Option<i32>,
/// Display score 0-100, populated by ScoreNormalizer in all non-raw executions
pub display_score: Option<i32>,
/// Source of the string (section data, import, etc.)
pub source: StringSource,
/// UTF-16 confidence score
pub confidence: f32,
}
Encoding
Supported string encodings.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum Encoding {
Ascii,
Utf8,
Utf16Le,
Utf16Be,
}
Tag
Semantic classification tags.
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub enum Tag {
Url,
Domain,
IPv4,
IPv6,
FilePath,
RegistryPath,
Guid,
Email,
Base64,
FormatString,
UserAgent,
DemangledSymbol,
Import,
Export,
Version,
Manifest,
Resource,
DylibPath,
Rpath,
RpathVariable,
FrameworkPath,
}
EncodingFilter
Filter for restricting output by string encoding, corresponding to the --enc CLI flag.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EncodingFilter {
/// Match a specific encoding exactly
Exact(Encoding),
/// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
Utf16Any,
}
Used with FilterConfig to limit results to a specific encoding. Utf16Any matches both Utf16Le and Utf16Be.
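The matching behavior might be implemented roughly as below. The enums are re-declared locally so the sketch is self-contained (the real types live in `stringy`), and the `matches` method is an assumed shape, not the library's confirmed API.

```rust
/// Local stand-ins for stringy's Encoding and EncodingFilter types.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

enum EncodingFilter {
    /// Match a specific encoding exactly
    Exact(Encoding),
    /// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
    Utf16Any,
}

impl EncodingFilter {
    /// True if this filter accepts the given encoding (hypothetical method).
    fn matches(&self, enc: Encoding) -> bool {
        match self {
            EncodingFilter::Exact(e) => *e == enc,
            EncodingFilter::Utf16Any => matches!(enc, Encoding::Utf16Le | Encoding::Utf16Be),
        }
    }
}

fn main() {
    let f = EncodingFilter::Utf16Any;
    println!("{}", f.matches(Encoding::Utf16Le)); // true
    println!("{}", f.matches(Encoding::Ascii)); // false
}
```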
FilterConfig
Post-extraction filtering configuration. All fields have sensible defaults; empty tag vectors are no-ops.
pub struct FilterConfig {
/// Minimum string length to include (default: 4)
pub min_length: usize, // --min-len
/// Restrict to a specific encoding
pub encoding: Option<EncodingFilter>, // --enc
/// Only include strings with these tags (empty = no filter)
pub include_tags: Vec<Tag>, // --only-tags
/// Exclude strings with these tags (empty = no filter)
pub exclude_tags: Vec<Tag>, // --no-tags
/// Limit output to top N strings by score
pub top_n: Option<usize>, // --top
}
Builder-style construction:
let config = FilterConfig::new()
.with_min_length(6)
.with_encoding(EncodingFilter::Exact(Encoding::Utf8))
.with_include_tags(vec![Tag::Url, Tag::Domain])
.with_top_n(20);
Main API Functions
BasicExtractor::extract
Extract strings from binary data using the BasicExtractor, which implements the StringExtractor trait.
pub trait StringExtractor {
fn extract(
&self,
data: &[u8],
container_info: &ContainerInfo,
config: &ExtractionConfig,
) -> Result<Vec<FoundString>>;
}
Parameters:
- `data`: Binary data to analyze
- `container_info`: Parsed container metadata (sections, imports, exports)
- `config`: Extraction configuration options
Returns:
- `Result<Vec<FoundString>>`: Extracted strings with metadata
Example:
use stringy::{BasicExtractor, ExtractionConfig, StringExtractor};
use stringy::container::{detect_format, create_parser};
let data = std::fs::read("binary.exe")?;
let format = detect_format(&data);
let parser = create_parser(format)?;
let container_info = parser.parse(&data)?;
let extractor = BasicExtractor::new();
let config = ExtractionConfig::default();
let strings = extractor.extract(&data, &container_info, &config)?;
for string in strings {
println!("{}: {}", string.score, string.text);
}
detect_format
Detect the binary format of the given data.
pub fn detect_format(data: &[u8]) -> BinaryFormat
Parameters:
- `data`: Binary data to analyze
Returns:
- `BinaryFormat`: Detected format (ELF, PE, MachO, or Unknown)
Example:
use stringy::detect_format;
let data = std::fs::read("binary")?;
let format = detect_format(&data);
println!("Detected format: {:?}", format);
Configuration
ExtractionConfig
Configuration options for string extraction. The struct has 16 fields with sensible defaults.
pub struct ExtractionConfig {
/// Minimum string length in bytes (default: 1)
pub min_length: usize,
/// Maximum string length in bytes (default: 4096)
pub max_length: usize,
/// Whether to scan executable sections (default: true)
pub scan_code_sections: bool,
/// Whether to include debug sections (default: false)
pub include_debug: bool,
/// Section types to prioritize (default: StringData, ReadOnlyData, Resources)
pub section_priority: Vec<SectionType>,
/// Whether to include import/export names (default: true)
pub include_symbols: bool,
/// Minimum length for ASCII strings (default: 1)
pub min_ascii_length: usize,
/// Minimum length for UTF-16 strings (default: 1)
pub min_wide_length: usize,
/// Which encodings to extract (default: ASCII, UTF-8)
pub enabled_encodings: Vec<Encoding>,
/// Enable/disable noise filtering (default: true)
pub noise_filtering_enabled: bool,
/// Minimum confidence threshold (default: 0.5)
pub min_confidence_threshold: f32,
/// Minimum UTF-16LE confidence threshold (default: 0.7)
pub utf16_min_confidence: f32,
/// Which UTF-16 byte order(s) to scan (default: Auto)
pub utf16_byte_order: ByteOrder,
/// Minimum UTF-16-specific confidence threshold (default: 0.5)
pub utf16_confidence_threshold: f32,
/// Enable/disable deduplication (default: true)
pub enable_deduplication: bool,
/// Deduplication threshold (default: None)
pub dedup_threshold: Option<usize>,
}
impl Default for ExtractionConfig {
fn default() -> Self {
Self {
min_length: 1,
max_length: 4096,
scan_code_sections: true,
include_debug: false,
section_priority: vec![
SectionType::StringData,
SectionType::ReadOnlyData,
SectionType::Resources,
],
include_symbols: true,
min_ascii_length: 1,
min_wide_length: 1,
enabled_encodings: vec![Encoding::Ascii, Encoding::Utf8],
noise_filtering_enabled: true,
min_confidence_threshold: 0.5,
utf16_min_confidence: 0.7,
utf16_byte_order: ByteOrder::Auto,
utf16_confidence_threshold: 0.5,
enable_deduplication: true,
dedup_threshold: None,
}
}
}
SemanticClassifier
The SemanticClassifier is constructed via SemanticClassifier::new() and currently has no configuration options. Classification patterns are built-in.
Pipeline Components
ScoreNormalizer
Maps internal relevance scores to a 0-100 display scale using band mapping.
let normalizer = ScoreNormalizer::new();
normalizer.normalize(&mut strings);
// Each FoundString now has display_score populated
Invoked unconditionally by the pipeline in all non-raw executions. Negative internal scores map to display_score = 0. See Ranking for the full band-mapping table.
FilterEngine
Applies post-extraction filtering and sorting. Consumes the input vector and returns a filtered, sorted result.
let engine = FilterEngine::new();
let filtered = engine.apply(strings, &filter_config);
Filter order:
- Minimum length (`min_length`)
- Encoding match (`encoding`)
- Include tags (`include_tags`: keep only strings with at least one matching tag)
- Exclude tags (`exclude_tags`: remove strings with any matching tag)
- Stable sort by score (descending), then offset (ascending), then text (ascending)
- Top-N truncation (`top_n`)
Example: FilterConfig + FilterEngine
use stringy::{FilterConfig, FilterEngine, EncodingFilter, Encoding, Tag};
let config = FilterConfig::new()
.with_min_length(6)
.with_include_tags(vec![Tag::Url, Tag::Domain])
.with_top_n(10);
let engine = FilterEngine::new();
let results = engine.apply(strings, &config);
// results contains at most 10 strings, all >= 6 chars,
// all tagged Url or Domain, sorted by score descending
Container Parsing
ContainerParser Trait
Trait for implementing binary format parsers.
pub trait ContainerParser {
/// Detect if this parser can handle the given data
fn detect(data: &[u8]) -> bool
where
Self: Sized;
/// Parse the container and extract metadata
fn parse(&self, data: &[u8]) -> Result<ContainerInfo>;
}
ContainerInfo
Information about a parsed binary container.
pub struct ContainerInfo {
/// The binary format detected
pub format: BinaryFormat,
/// List of sections in the binary
pub sections: Vec<SectionInfo>,
/// Import information
pub imports: Vec<ImportInfo>,
/// Export information
pub exports: Vec<ExportInfo>,
/// Resource metadata (PE format only)
pub resources: Option<Vec<ResourceMetadata>>,
}
SectionInfo
Information about a section within the binary.
pub struct SectionInfo {
/// Section name
pub name: String,
/// File offset of the section
pub offset: u64,
/// Size of the section in bytes
pub size: u64,
/// Relative Virtual Address (if available)
pub rva: Option<u64>,
/// Classification of the section type
pub section_type: SectionType,
/// Whether the section is executable
pub is_executable: bool,
/// Whether the section is writable
pub is_writable: bool,
/// Weight indicating likelihood of containing meaningful strings (1.0-10.0)
pub weight: f32,
}
Output Formatting
OutputFormatter Trait
Trait for implementing output formatters.
pub trait OutputFormatter {
/// Returns the name of this formatter
fn name(&self) -> &'static str;
/// Format the strings for output
fn format(&self, strings: &[FoundString], metadata: &OutputMetadata) -> Result<String>;
}
Built-in Formatters
The library provides free functions rather than formatter structs:
- `format_table(strings, metadata)`: Human-readable table format (TTY-aware)
- `format_json(strings, metadata)`: JSONL format
- `format_yara(strings, metadata)`: YARA rule format
- `format_output(strings, metadata)`: Dispatches based on `metadata.output_format`
Example:
use stringy::output::{format_json, OutputMetadata};
let metadata = OutputMetadata::new("binary.exe".to_string());
let output = format_json(&strings, &metadata)?;
println!("{}", output);
Error Handling
StringyError
Comprehensive error type for the library.
#[derive(Debug, thiserror::Error)]
pub enum StringyError {
#[error("Unsupported file format (supported: ELF, PE, Mach-O)")]
UnsupportedFormat,
#[error("File I/O error: {0}")]
IoError(#[from] std::io::Error),
#[error("Binary parsing error: {0}")]
ParseError(String),
#[error("Invalid encoding in string at offset {offset}")]
EncodingError { offset: u64 },
#[error("Configuration error: {0}")]
ConfigError(String),
#[error("Serialization error: {0}")]
SerializationError(String),
#[error("Validation error: {0}")]
ValidationError(String),
#[error("Memory mapping error: {0}")]
MemoryMapError(String),
}
Result Type
Convenient result type alias.
pub type Result<T> = std::result::Result<T, StringyError>;
Advanced Usage
Custom Classification
Implement custom semantic classifiers:
use stringy::classification::{ClassificationResult, Classifier};
pub struct CustomClassifier {
// Custom implementation
}
impl Classifier for CustomClassifier {
fn classify(&self, text: &str, context: &StringContext) -> Vec<ClassificationResult> {
// Custom classification logic
vec![]
}
}
Memory-Mapped Files
For large files, use memory mapping via mmap-guard:
let data = mmap_guard::map_file(path)?;
// data implements Deref<Target = [u8]>
Note: The Pipeline::run API handles memory mapping automatically. Direct use of mmap_guard is only needed when using lower-level APIs.
Parallel Processing
Parallel processing is not yet implemented. Stringy currently processes files sequentially. The Pipeline API processes one file at a time.
Feature Flags
Stringy currently has no optional feature flags. All functionality is included by default.
Examples
Basic String Extraction (Pipeline API)
use stringy::pipeline::{Pipeline, PipelineConfig};
use std::path::Path;
fn main() -> stringy::Result<()> {
let config = PipelineConfig::default();
let pipeline = Pipeline::new(config);
pipeline.run(Path::new("binary.exe"))?;
Ok(())
}
Filtered Extraction
use stringy::{BasicExtractor, ExtractionConfig, StringExtractor, Tag};
use stringy::container::{detect_format, create_parser};
fn extract_network_indicators(data: &[u8]) -> stringy::Result<Vec<String>> {
let format = detect_format(data);
let parser = create_parser(format)?;
let container_info = parser.parse(data)?;
let extractor = BasicExtractor::new();
let config = ExtractionConfig::default();
let strings = extractor.extract(data, &container_info, &config)?;
let network_strings: Vec<String> = strings
.into_iter()
.filter(|s| {
s.tags
.iter()
.any(|tag| matches!(tag, Tag::Url | Tag::Domain | Tag::IPv4 | Tag::IPv6))
})
.filter(|s| s.score >= 70)
.map(|s| s.text)
.collect();
Ok(network_strings)
}
Custom Output Format
use serde_json::json;
use stringy::output::{OutputMetadata, OutputFormatter};
use stringy::FoundString;
pub struct CustomFormatter;
impl OutputFormatter for CustomFormatter {
fn name(&self) -> &'static str {
"custom"
}
fn format(&self, strings: &[FoundString], _metadata: &OutputMetadata) -> stringy::Result<String> {
let output = json!({
"total_strings": strings.len(),
"high_confidence": strings.iter().filter(|s| s.score >= 80).count(),
"strings": strings.iter().take(20).collect::<Vec<_>>()
});
Ok(serde_json::to_string_pretty(&output)?)
}
}
For complete API documentation with all methods and implementation details, run:
cargo doc --open
Configuration
Note
The configuration file system described below is planned but not yet implemented. Stringy currently uses CLI flags exclusively for configuration. See the CLI Reference for available options.
Stringy provides extensive configuration options to customize string extraction, classification, and output formatting. Configuration can be provided through command-line arguments, configuration files, or programmatically via the API.
Configuration File
Note: Configuration file support is planned for future releases.
Default Location
~/.config/stringy/config.toml
Example Configuration
[extraction]
min_ascii_len = 4
min_utf16_len = 3
max_string_len = 1024
encodings = ["ascii", "utf16le"]
include_debug = false
include_symbols = true
[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
detect_emails = true
detect_base64 = true
detect_format_strings = true
min_confidence = 0.7
[output]
format = "human"
max_results = 100
show_scores = true
show_offsets = true
color = true
[ranking]
section_weight_multiplier = 1.0
semantic_boost_multiplier = 1.0
noise_penalty_multiplier = 1.0
# Profile-specific configurations
[profiles.security]
encodings = ["ascii", "utf8", "utf16le"]
min_ascii_len = 6
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath"]
min_score = 70
[profiles.yara]
format = "yara"
min_ascii_len = 8
exclude_tags = ["import", "export"]
min_score = 80
[profiles.development]
include_debug = true
include_symbols = true
max_results = 500
Extraction Configuration
String Length Limits
Control the minimum and maximum string lengths:
[extraction]
min_ascii_len = 4 # Minimum ASCII string length
min_utf16_len = 3 # Minimum UTF-16 string length
max_string_len = 1024 # Maximum string length (prevents memory issues)
CLI equivalent:
stringy --min-len 6 --max-len 500 binary
Encoding Selection
Choose which encodings to extract:
[extraction]
encodings = ["ascii", "utf8", "utf16le", "utf16be"]
Available encodings:
- `ascii`: 7-bit ASCII
- `utf8`: UTF-8 (includes ASCII)
- `utf16le`: UTF-16 Little Endian
- `utf16be`: UTF-16 Big Endian
CLI equivalent:
stringy --enc ascii,utf16le binary
Section Filtering
Control which sections to analyze:
[extraction]
include_sections = [".rodata", ".rdata", "__cstring"]
exclude_sections = [".debug_info", ".comment"]
include_debug = false
include_resources = true
CLI equivalent:
stringy --sections .rodata,.rdata --no-debug binary
Symbol Processing
Configure import/export symbol handling:
[extraction]
include_symbols = true
demangle_rust = true
demangle_cpp = false # Future feature
CLI equivalent:
stringy --no-symbols --no-demangle binary
Classification Configuration
Pattern Detection
Enable/disable specific semantic patterns:
[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
detect_emails = true
detect_base64 = true
detect_format_strings = true
detect_user_agents = true
Confidence Thresholds
Set minimum confidence levels:
[classification]
min_confidence = 0.7 # Overall minimum confidence
url_min_confidence = 0.8 # URL-specific threshold
domain_min_confidence = 0.75 # Domain-specific threshold
path_min_confidence = 0.6 # File path threshold
Custom Patterns
Add custom regex patterns:
[classification.custom_patterns]
api_key = 'api[_-]?key["\s]*[:=]["\s]*[a-zA-Z0-9]{20,}'
crypto_address = '(bc1|[13])[a-zA-HJ-NP-Z0-9]{25,62}'
jwt_token = 'eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+'
Ranking Configuration
Weight Adjustments
Customize section and semantic weights:
[ranking.section_weights]
string_data = 40
resources = 35
readonly_data = 25
debug = 15
writable_data = 10
code = 5
other = 0
[ranking.semantic_boosts]
url = 25
domain = 20
guid = 20
filepath = 15
format_string = 10
import = 8
export = 8
Penalty Configuration
Adjust noise detection penalties:
[ranking.penalties]
high_entropy_threshold = 4.5
high_entropy_penalty = -15
length_penalty_threshold = 200
max_length_penalty = -20
repetition_threshold = 0.7
repetition_penalty = -12
Output Configuration
Format Selection
Choose default output format:
[output]
format = "human" # human, json, yara
max_results = 100 # Limit number of results
show_all = false # Override max_results limit
Display Options
Customize what information to show:
[output]
show_scores = true
show_offsets = true
show_sections = true
show_encodings = true
show_tags = true
color = true # Enable colored output
truncate_long_strings = true
max_string_display_length = 80
Filtering
Set default filters:
[output]
min_score = 50
only_tags = [] # Empty = show all tags
exclude_tags = ["b64"] # Exclude Base64 by default
Format-Specific Configuration
PE Configuration
Windows PE-specific options:
[formats.pe]
extract_version_info = true
extract_manifests = true
extract_string_tables = true
prefer_utf16 = true
include_resource_names = true
ELF Configuration
Linux ELF-specific options:
[formats.elf]
include_build_id = true
include_gnu_version = true
process_dynamic_strings = true
include_note_sections = false
Mach-O Configuration
macOS Mach-O-specific options:
[formats.macho]
process_load_commands = true
include_framework_paths = true
process_fat_binaries = "first" # first, all, or specific arch
Performance Configuration
Memory Management
Control memory usage:
[performance]
use_memory_mapping = true
memory_map_threshold = 10485760 # 10MB
max_memory_usage = 1073741824 # 1GB
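The `memory_map_threshold` idea is sketched below in Python: files under the threshold are read outright, larger files are memory-mapped for zero-copy access. The threshold is lowered so the example is self-contained; the `open_bytes` helper is illustrative only.

```python
import mmap, os, tempfile

THRESHOLD = 1024  # stand-in for the configured 10 MB

def open_bytes(path: str):
    """Return file contents: bytes for small files, an mmap for large ones."""
    if os.path.getsize(path) < THRESHOLD:
        with open(path, "rb") as f:
            return f.read()              # plain read for small files
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 4096)            # 4 KiB: above the 1 KiB threshold
data = open_bytes(tmp.name)
print(type(data).__name__)               # mmap
```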
Parallel Processing
Configure parallelization:
[performance]
enable_parallel = true
max_threads = 0 # 0 = auto-detect
chunk_size = 1048576 # 1MB chunks
Caching
Enable various caches:
[performance]
cache_regex_compilation = true
cache_section_analysis = true
cache_string_hashes = true
Environment Variables
Override configuration with environment variables:
| Variable | Description | Example |
|---|---|---|
| STRINGY_CONFIG | Config file path | ~/.stringy.toml |
| STRINGY_MIN_LEN | Minimum string length | 6 |
| STRINGY_FORMAT | Output format | json |
| STRINGY_MAX_RESULTS | Result limit | 50 |
| NO_COLOR | Disable colored output | 1 |
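As the table says, a set environment variable overrides the corresponding config-file value. The Python sketch below models that lookup for `STRINGY_MIN_LEN`; the `effective_min_len` helper is hypothetical, not Stringy's code.

```python
import os

def effective_min_len(config_value: int) -> int:
    """Environment variable wins over the config-file value when set."""
    env = os.environ.get("STRINGY_MIN_LEN")
    return int(env) if env is not None else config_value

os.environ["STRINGY_MIN_LEN"] = "6"
print(effective_min_len(4))   # 6 - environment wins
del os.environ["STRINGY_MIN_LEN"]
print(effective_min_len(4))   # 4 - falls back to the config file
```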
Profiles
Use predefined configuration profiles:
Security Analysis Profile
stringy --profile security malware.exe
Equivalent to:
min_ascii_len = 6
encodings = ["ascii", "utf8", "utf16le"]
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath"]
min_score = 70
YARA Development Profile
stringy --profile yara suspicious.dll
Equivalent to:
format = "yara"
min_ascii_len = 8
exclude_tags = ["import", "export"]
min_score = 80
max_results = 50
Development Profile
stringy --profile dev application
Equivalent to:
include_debug = true
include_symbols = true
max_results = 500
show_all_metadata = true
Validation
Configuration validation ensures settings are compatible:
# This would generate a warning
[extraction]
min_ascii_len = 10
max_string_len = 5 # Invalid: min > max
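The check behind that warning is a simple cross-field comparison. The sketch below shows the shape of such a validator in Python; the warning text and helper name are illustrative, not Stringy's actual messages.

```python
def validate(extraction: dict) -> list[str]:
    """Return warnings for incompatible extraction settings."""
    warnings = []
    lo = extraction.get("min_ascii_len")
    hi = extraction.get("max_string_len")
    if lo is not None and hi is not None and lo > hi:
        warnings.append(f"min_ascii_len ({lo}) exceeds max_string_len ({hi})")
    return warnings

print(validate({"min_ascii_len": 10, "max_string_len": 5}))
print(validate({"min_ascii_len": 4, "max_string_len": 200}))  # []
```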
Migration
When upgrading Stringy versions, configuration migration is handled automatically:
# Backup current config
cp ~/.config/stringy/config.toml ~/.config/stringy/config.toml.backup
# Stringy will migrate on first run
stringy --version
Examples
Minimal Configuration
[extraction]
min_ascii_len = 6
[output]
format = "json"
max_results = 50
Comprehensive Security Analysis
[extraction]
min_ascii_len = 6
min_utf16_len = 4
encodings = ["ascii", "utf8", "utf16le"]
include_debug = false
[classification]
detect_urls = true
detect_domains = true
detect_ips = true
detect_paths = true
detect_guids = true
min_confidence = 0.8
[output]
format = "json"
min_score = 70
only_tags = ["url", "domain", "ipv4", "ipv6", "filepath", "regpath", "guid"]
[ranking.semantic_boosts]
url = 30
domain = 25
ipv4 = 25
ipv6 = 25
filepath = 20
regpath = 20
guid = 20
This flexible configuration system allows Stringy to be adapted for various use cases, from interactive analysis to automated security pipelines.
Performance
Stringy is designed for efficient analysis of binary files, from small executables to large system libraries.
How It Works
Stringy memory-maps input files via mmap-guard for zero-copy access, then processes sections in weight-priority order. Regex patterns for semantic classification are compiled once using LazyLock statics.
The processing pipeline is single-threaded and sequential:
- Format detection and section analysis – O(n) where n = number of sections
- String extraction – O(m) where m = total section size
- Deduplication – hash-based grouping of identical strings
- Classification – O(k) where k = number of unique strings
- Ranking and sorting – O(k log k)
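Steps 3–5 can be condensed into a small model: hash-based grouping that keeps every occurrence's offset as metadata, followed by the O(k log k) sort. The Python sketch below is illustrative only, with made-up scores.

```python
from collections import defaultdict

# (text, offset, score) triples as they might come out of extraction.
hits = [("GET /", 0x100, 90), ("tmp", 0x200, 20), ("GET /", 0x300, 90)]

groups = defaultdict(list)
for text, offset, score in hits:
    groups[text].append(offset)          # dedup: group identical strings

# Keep one entry per unique string, carrying all offsets as metadata.
unique = [(text, groups[text], score)
          for text, offset, score in hits if groups[text][0] == offset]
ranked = sorted(unique, key=lambda t: t[2], reverse=True)

print([t[0] for t in ranked])  # ['GET /', 'tmp'] - highest score first
print(ranked[0][1])            # [256, 768] - both offsets preserved
```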
Reducing Processing Time
Use CLI flags to narrow the work Stringy does:
# Limit to top results (skip sorting the long tail)
stringy --top 50 binary
# Increase minimum length to reduce noise and string count
stringy --min-len 8 binary
# Restrict to a single encoding (skip UTF-16 detection)
stringy --enc ascii binary
# Skip classification and ranking entirely
stringy --raw binary
--raw mode is the fastest option: it extracts and deduplicates strings without running the classifier or ranker.
Benchmarking
Stringy includes Criterion benchmarks for core components:
# Run all benchmarks
just bench
# Run a specific benchmark
cargo bench --bench elf
cargo bench --bench pe
cargo bench --bench classification
cargo bench --bench ascii_extraction
Profiling
# CPU profiling with perf (Linux)
perf record --call-graph dwarf -- stringy large_file.exe
perf report
# macOS profiling with Instruments
xcrun xctrace record --template "Time Profiler" --launch -- stringy binary
# Memory profiling
/usr/bin/time -l stringy large_file.exe # macOS
/usr/bin/time -v stringy large_file.exe # Linux
Batch Processing
Stringy processes one file per invocation. For batch workflows, use standard Unix tools:
# Process multiple files
find /path/to/binaries -type f -exec stringy --json {} \; > all_strings.jsonl
# Parallel processing with xargs
find /binaries -name "*.exe" -print0 | xargs -0 -P 4 -I {} stringy --json {} > results.jsonl
Troubleshooting
This guide helps resolve common issues when using Stringy. If you don’t find a solution here, please check the GitHub issues or create a new issue.
Installation Issues
“cargo: command not found”
Problem: Rust/Cargo is not installed or not in PATH.
Solution:
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
# Verify installation
cargo --version
Build Failures
Problem: Compilation errors during cargo build.
Common causes and solutions:
Outdated Rust Version
# Update Rust
rustup update
# Check version (should be 1.91+)
rustc --version
Missing System Dependencies
# Ubuntu/Debian
sudo apt update
sudo apt install build-essential pkg-config
# Fedora/RHEL
sudo dnf groupinstall "Development Tools"
sudo dnf install pkg-config
# macOS
xcode-select --install
Network Issues
# Use alternative registry
export CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse
# Or use offline mode if dependencies are cached
cargo build --offline
Permission Denied
Problem: Cannot execute the binary after installation.
Solution:
# Make binary executable
chmod +x ~/.cargo/bin/stringy
# Or reinstall with proper permissions
cargo install --path . --force
Runtime Issues
“Unsupported file format”
Problem: Stringy cannot detect the binary format.
Diagnosis:
# Check file type
file binary_file
# Check if it's actually a binary
hexdump -C binary_file | head
Note: Unknown or unparseable formats (plain text, etc.) do not cause an error; Stringy falls back to unstructured raw-byte scanning and still succeeds. This message only appears when something else goes wrong during parsing.
“Permission denied” when reading files
Problem: Cannot read the target binary file.
Solutions:
# Check file permissions
ls -l binary_file
# Make readable
chmod +r binary_file
# Run with appropriate privileges
sudo stringy system_binary
Output Issues
No Strings Found
Problem: Stringy reports no strings in a binary that should have strings.
Diagnosis:
# Check with traditional strings command
strings binary_file | head -20
# Try different encodings (one at a time)
stringy --enc ascii binary_file
stringy --enc utf8 binary_file
stringy --enc utf16le binary_file
stringy --enc utf16be binary_file
# Lower minimum length
stringy --min-len 1 binary_file
Common causes:
- Packed or encrypted binary
- Unusual string encoding
- Strings in unexpected sections
- Very short strings below minimum length
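The encoding cause is easy to see concretely: UTF-16 stores each ASCII character followed by a NUL byte, so an ASCII-only scanner either misses the string or emits one-letter fragments. A quick Python demonstration:

```python
# "config" encoded as UTF-16LE interleaves NUL bytes between characters.
data = "config".encode("utf-16le")
print(data)                     # b'c\x00o\x00n\x00f\x00i\x00g\x00'
print(data.decode("utf-16le"))  # 'config' when decoded correctly
```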
Garbled Output
Problem: String output contains garbled or binary characters.
Solutions:
# Force specific encoding
stringy --enc ascii binary_file
# Increase minimum length to filter noise
stringy --min-len 6 binary_file
# Use JSON output which properly escapes invalid sequences
stringy --json binary_file | jq '.text'
# Filter by score
stringy --json binary_file | jq 'select(.score > 70)'
Missing Expected Strings
Problem: Known strings are not appearing in output.
Diagnosis:
# Check if strings exist with traditional tools
strings binary_file | grep "expected_string"
# Try each encoding
stringy --enc ascii binary_file | grep "expected"
stringy --enc utf8 binary_file | grep "expected"
stringy --enc utf16le binary_file | grep "expected"
# Lower score threshold
stringy --json binary_file | jq 'select(.score > 0)' | grep "expected"
Error Messages
“--summary requires a TTY”
Problem: --summary flag used when stdout is piped or redirected.
Solution: --summary only works when stdout is a terminal. Remove --summary when piping output:
# This will error (exit 1):
stringy --summary binary | grep foo
# This works:
stringy --summary binary
Tag overlap error
Problem: Same tag appears in both --only-tags and --no-tags.
Solution: Remove the duplicate tag from one of the two flags. This is a runtime validation error (exit 1).
“Invalid UTF-8 sequence”
Problem: String contains invalid UTF-8 bytes.
Solution: This is usually normal for binary data. Stringy handles this automatically, but you can:
# Use ASCII only to avoid UTF-8 issues
stringy --enc ascii binary_file
# Use JSON output which properly escapes invalid sequences
stringy --json binary_file
“Regex compilation failed”
Problem: Internal regex pattern compilation error.
Solution: This indicates a bug. Please report it with:
# Get version information
stringy --version
File Not Found
Problem: The specified file does not exist.
Exit code: 1
Solution: Check the file path and ensure the file exists:
ls -l /path/to/binary
Performance Issues
Very Slow Processing
Problem: Stringy takes too long to process files.
Solutions:
# Increase minimum length to reduce extraction volume
stringy --min-len 8 large_file.exe
# Limit results
stringy --top 50 large_file.exe
# Use ASCII only
stringy --enc ascii large_file.exe
# Use raw mode (skip classification)
stringy --raw large_file.exe
High Memory Usage
Problem: Stringy uses too much memory.
Solutions:
# Limit results
stringy --top 100 file.exe
# Increase minimum length
stringy --min-len 8 file.exe
# Use raw mode to skip classification
stringy --raw file.exe
Exit Code Reference
| Code | Meaning |
|---|---|
| 0 | Success (including unknown binary format, empty binary, no filter matches) |
| 1 | Runtime error (file not found, tag overlap, --summary in non-TTY) |
| 2 | Argument parsing error (invalid flag, flag conflict, invalid tag name) |
Debugging Tips
Compare with Traditional Tools
# Compare with standard strings
strings binary_file > traditional.txt
stringy --json binary_file | jq -r '.text' > stringy.txt
diff traditional.txt stringy.txt
Test with Known Good Files
# Test with system binaries
stringy /bin/ls # Linux
stringy /bin/cat # Linux
stringy /usr/bin/grep # macOS
Getting Help
Information to Include in Bug Reports
- System information:
  stringy --version
  rustc --version
  uname -a  # Linux/macOS
- File information:
  file binary_file
  ls -l binary_file
- Exact command line used
- Expected vs actual behavior
Where to Get Help
- Documentation: Check this guide and the project documentation
- GitHub Issues: Search existing issues or create a new one
- Discussions: Use GitHub Discussions for questions and ideas