String Extraction
Stringy’s string extraction engine is designed to find meaningful strings while avoiding noise and false positives. The extraction process is encoding-aware, section-aware, and configurable.
Extraction Pipeline
Binary Data → Section Analysis → Encoding Detection → String Scanning → Deduplication → Classification
Encoding Support
ASCII Extraction
The most common encoding across binary formats. ASCII extraction provides the foundation of string extraction, with configurable minimum length thresholds.
UTF-16LE Extraction
UTF-16LE extraction targets Windows PE binaries, where wide strings are common. It integrates confidence scoring with the noise filtering system.
Algorithm
- Scan for printable sequences: Characters in range 0x20-0x7E (strict printable ASCII)
- Length filtering: Configurable minimum length (default: 4 characters)
- Null termination: Respect null terminators but don’t require them
- Section awareness: Integrate with section metadata for context-aware filtering
Basic Extraction
use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};

let data = b"Hello\0World\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);
for string in strings {
    println!("Found: {} at offset {}", string.text, string.offset);
}
Configuration
use stringy::extraction::AsciiExtractionConfig;
// Default configuration (min_length: 4, no max_length)
let config = AsciiExtractionConfig::default();
// Custom minimum length
let config = AsciiExtractionConfig::new(8);
// Custom minimum and maximum length
let mut config = AsciiExtractionConfig::default();
config.max_length = Some(256);
UTF-8 Extraction
UTF-8 extraction builds on ASCII extraction and handles multi-byte characters. See the main extraction module for UTF-8 support.
Implementation Details
/// A byte is printable ASCII if it falls in the 0x20-0x7E range.
fn is_printable_ascii(byte: u8) -> bool {
    (0x20..=0x7E).contains(&byte)
}

// Simplified sketch: the public API takes an AsciiExtractionConfig
// rather than a bare minimum length.
fn extract_ascii_strings(data: &[u8], min_len: usize) -> Vec<RawString> {
    let mut strings = Vec::new();
    let mut current_string = Vec::new();
    let mut start_offset = 0;
    for (i, &byte) in data.iter().enumerate() {
        if is_printable_ascii(byte) {
            if current_string.is_empty() {
                start_offset = i;
            }
            current_string.push(byte);
        } else {
            if current_string.len() >= min_len {
                strings.push(RawString {
                    data: current_string.clone(),
                    offset: start_offset,
                    encoding: Encoding::Ascii,
                });
            }
            current_string.clear();
        }
    }
    // Flush a trailing run that reaches the end of the data,
    // so strings without a terminator byte are not dropped.
    if current_string.len() >= min_len {
        strings.push(RawString {
            data: current_string,
            offset: start_offset,
            encoding: Encoding::Ascii,
        });
    }
    strings
}
Noise Filtering
Stringy implements a multi-layered heuristic filtering system to reduce false positives and identify noise in extracted strings. The filtering system uses a combination of entropy analysis, character distribution, linguistic patterns, length checks, repetition detection, and context-aware filtering.
Filter Architecture
The noise filtering system consists of multiple independent filters that can be combined with configurable weights:
- Character Distribution Filter: Detects abnormal character frequency distributions
- Entropy Filter: Uses Shannon entropy to detect padding/repetition and random binary
- Linguistic Pattern Filter: Analyzes vowel-to-consonant ratios and common bigrams
- Length Filter: Penalizes excessively long strings and very short strings in low-weight sections
- Repetition Filter: Detects repeated character patterns and repeated substrings
- Context-Aware Filter: Boosts confidence for strings in high-weight sections
Character Distribution Analysis
Detects strings with abnormal character distributions:
- Excessive punctuation (>80%): Low confidence (0.2)
- Excessive repetition (>90% same character): Very low confidence (0.1)
- Excessive non-alphanumeric (>70%): Low confidence (0.3)
- Reasonable distribution: High confidence (1.0)
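These thresholds translate directly into code. A minimal sketch (the function name and the check order, repetition before punctuation, are assumptions, not the library's internals):

```rust
use std::collections::HashMap;

/// Sketch of the character-distribution thresholds listed above.
fn char_distribution_confidence(s: &str) -> f32 {
    let n = s.chars().count() as f32;
    if n == 0.0 {
        return 0.0;
    }
    let mut counts: HashMap<char, u32> = HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let max_same = *counts.values().max().unwrap() as f32;
    let punct = s.chars().filter(|c| c.is_ascii_punctuation()).count() as f32;
    let non_alnum = s.chars().filter(|c| !c.is_alphanumeric()).count() as f32;

    if max_same / n > 0.9 {
        0.1 // excessive repetition
    } else if punct / n > 0.8 {
        0.2 // excessive punctuation
    } else if non_alnum / n > 0.7 {
        0.3 // excessive non-alphanumeric
    } else {
        1.0 // reasonable distribution
    }
}

fn main() {
    println!("{}", char_distribution_confidence("AAAAAAAAAA")); // prints 0.1
    println!("{}", char_distribution_confidence("Hello World")); // prints 1
}
```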
Entropy-Based Filtering
Uses Shannon entropy (bits per byte) to classify strings:
- Very low entropy (<1.5 bits/byte): Likely padding or repetition (confidence: 0.1)
- Very high entropy (>7.5 bits/byte): Likely random binary (confidence: 0.2)
- Optimal range (3.5-6.0 bits/byte): High confidence (1.0)
- Acceptable range (2.0-7.0 bits/byte): Moderate confidence (0.4-0.7)
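The measure behind these bands is standard Shannon entropy over byte frequencies. A minimal sketch (the function name is illustrative; Stringy's internal API may differ):

```rust
/// Shannon entropy in bits per byte: -sum(p * log2(p)) over byte frequencies.
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // A single repeated byte has zero entropy -> flagged as padding.
    println!("{:.2}", shannon_entropy(b"AAAAAAAA")); // prints 0.00
    // Ordinary English text typically falls in the optimal 3.5-6.0 band.
    println!("{:.2}", shannon_entropy(b"The quick brown fox jumps over the lazy dog"));
}
```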
Linguistic Pattern Detection
Analyzes text for word-like patterns:
- Vowel-to-consonant ratio: Reasonable range 0.2-0.8 for English
- Common bigrams: Detects common English patterns (th, he, in, er, an, re, on, at, en, nd)
- Handles non-English: Gracefully handles non-English strings without over-penalizing
Length-Based Filtering
Applies penalties based on string length:
- Excessively long (>200 characters): Low confidence (0.3) - likely table data
- Very short in low-weight sections (<4 chars, weight <0.5): Moderate confidence (0.5)
- Normal length (4-100 characters): High confidence (1.0)
Repetition Detection
Identifies repetitive patterns:
- Repeated characters (e.g., “AAAA”, “0000”): Very low confidence (0.1)
- Repeated substrings (e.g., “abcabcabc”): Low confidence (0.2)
- Normal strings: High confidence (1.0)
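Both checks are cheap to sketch. An illustrative version (function names hypothetical, byte-oriented for simplicity):

```rust
/// True if the string is one character repeated (e.g. "AAAA", "0000").
fn is_repeated_char(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => s.chars().count() >= 4 && chars.all(|c| c == first),
        None => false,
    }
}

/// True if the string is a whole number of copies of a shorter substring,
/// e.g. "abcabcabc" is three copies of "abc".
fn is_repeated_substring(s: &str) -> bool {
    let b = s.as_bytes();
    let n = b.len();
    (1..=n / 2).any(|p| n % p == 0 && b.chunks(p).all(|chunk| chunk == &b[..p]))
}

fn main() {
    println!("{}", is_repeated_char("AAAA")); // prints true
    println!("{}", is_repeated_substring("abcabcabc")); // prints true
    println!("{}", is_repeated_substring("Hello, World")); // prints false
}
```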
Context-Aware Filtering
Boosts or reduces confidence based on section context:
- String data sections (.rodata, .rdata, __cstring): High confidence (0.9-1.0)
- Read-only data sections: High confidence (0.9)
- Resource sections: Maximum confidence (1.0) - known-good sources
- Code sections: Lower confidence (0.3-0.5)
- Writable data sections: Moderate confidence (0.6)
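Put together, the context filter reduces to a small mapping from section metadata to a confidence value. An illustrative sketch (section names and exact values are examples drawn from the ranges above, not the library's internals):

```rust
/// Map section context to a confidence value, mirroring the ranges above.
fn section_confidence(name: &str, writable: bool, executable: bool) -> f32 {
    match name {
        ".rsrc" => 1.0,                             // resources: known-good sources
        ".rodata" | ".rdata" | "__cstring" => 0.95, // string data sections
        _ if executable => 0.4,                     // code sections (0.3-0.5)
        _ if writable => 0.6,                       // writable data sections
        _ => 0.9,                                   // other read-only data
    }
}

fn main() {
    println!("{}", section_confidence(".rodata", false, false)); // prints 0.95
    println!("{}", section_confidence(".text", false, true)); // prints 0.4
}
```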
Configuration
use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};
// Default configuration
let config = NoiseFilterConfig::default();
// Customize thresholds
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;
// Customize filter weights
config.filter_weights = FilterWeights {
    entropy_weight: 0.3,
    char_distribution_weight: 0.25,
    linguistic_weight: 0.2,
    length_weight: 0.15,
    repetition_weight: 0.05,
    context_weight: 0.05,
};
Using Noise Filters
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
use stringy::types::SectionType;
let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();
let confidence = filter.calculate_confidence("Hello, World!", &context);
if confidence >= 0.5 {
    // String passed filtering threshold
}
Confidence Scoring
Each string is assigned a confidence score (0.0-1.0) indicating how likely it is to be legitimate:
- 1.0: Maximum confidence (strings from known-good sources like imports, exports, resources)
- 0.7-0.9: High confidence (likely legitimate strings)
- 0.5-0.7: Moderate confidence (may need review)
- 0.0-0.5: Low confidence (likely noise, filtered out by default)
The confidence score is separate from the score field used for final ranking. Confidence specifically represents the noise filtering assessment.
Performance
Noise filtering is designed to add minimal overhead (<10% per acceptance criteria). Individual filters are optimized for performance, and the composite filter allows enabling/disabling specific filters to balance accuracy and speed.
UTF-16 Extraction
Critical for Windows binaries and some resources. Supports both UTF-16LE (Little-Endian) and UTF-16BE (Big-Endian) with automatic byte order detection.
UTF-16LE (Little-Endian)
Most common on Windows platforms. The default minimum length is 3 characters.
Detection heuristics:
- Even-length sequences (2-byte alignment required)
- Low byte printable, high byte mostly zero
- Null termination patterns (0x00 0x00)
- Advanced confidence scoring with multiple heuristics
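The first three heuristics can be sketched as a single predicate over a 2-byte-aligned window (the function name is hypothetical; the real extractor scores confidence rather than returning a boolean):

```rust
/// Heuristic sketch: does a 2-byte-aligned slice look like a UTF-16LE run?
/// Checks the "low byte printable, high byte zero" pattern from the list above.
fn looks_like_utf16le(data: &[u8], min_chars: usize) -> bool {
    // Even length is required for 2-byte alignment.
    if data.len() < min_chars * 2 || data.len() % 2 != 0 {
        return false;
    }
    data.chunks_exact(2)
        .all(|pair| pair[1] == 0x00 && (0x20..=0x7E).contains(&pair[0]))
}

fn main() {
    // "Hi!" encoded as UTF-16LE: low byte is the ASCII char, high byte zero.
    let le = [b'H', 0x00, b'i', 0x00, b'!', 0x00];
    println!("{}", looks_like_utf16le(&le, 3)); // prints true
    // The same code units in big-endian order fail the LE check.
    let be = [0x00, b'H', 0x00, b'i', 0x00, b'!'];
    println!("{}", looks_like_utf16le(&be, 3)); // prints false
}
```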
UTF-16BE (Big-Endian)
Found in Java .class files, network protocols, some cross-platform binaries.
Detection heuristics:
- Even-length sequences
- High byte printable, low byte mostly zero
- Reverse byte order from UTF-16LE
- Same advanced confidence scoring as UTF-16LE
Automatic Byte Order Detection
The ByteOrder::Auto mode automatically detects and extracts both UTF-16LE and UTF-16BE strings from the same data, avoiding duplicates and correctly identifying the encoding of each string.
Implementation
UTF-16 extraction is implemented in src/extraction/utf16.rs following the pattern established in the ASCII extractor. The implementation provides:
- extract_utf16_strings(): Main extraction function supporting both byte orders
- extract_utf16le_strings(): UTF-16LE-specific extraction (backward compatibility)
- extract_from_section(): Section-aware extraction with proper metadata population
- Utf16ExtractionConfig: Configuration for minimum/maximum character count, byte order selection, and confidence thresholds
- ByteOrder enum: Control which byte order(s) to scan (LE, BE, Auto)
Usage Example:
use stringy::extraction::utf16::{extract_utf16_strings, Utf16ExtractionConfig, ByteOrder};
// Extract UTF-16LE strings from Windows PE binary
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::LE,
    min_length: 3,
    confidence_threshold: 0.6,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);

// Extract both UTF-16LE and UTF-16BE with auto-detection
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::Auto,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);
Configuration:
use stringy::extraction::utf16::{Utf16ExtractionConfig, ByteOrder};
// Default configuration (min_length: 3, byte_order: Auto, confidence_threshold: 0.5)
let config = Utf16ExtractionConfig::default();
// Custom minimum character length
let config = Utf16ExtractionConfig::new(5);
// Custom configuration
let mut config = Utf16ExtractionConfig::default();
config.min_length = 3;
config.max_length = Some(256);
config.byte_order = ByteOrder::LE;
config.confidence_threshold = 0.6;
UTF-16-Specific Confidence Scoring
UTF-16 extraction uses advanced confidence scoring to detect false positives from null-interleaved binary data. The confidence score combines multiple heuristics:
- Valid Unicode range check: Validates that code points fall in valid Unicode ranges (U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF); penalizes private use areas and invalid surrogates
- Printable character ratio: Calculates the ratio of printable characters, including common Unicode ranges
- ASCII ratio: Boosts confidence for ASCII-heavy strings (>50% of characters in the ASCII printable range)
- Null pattern detection: Flags suspicious patterns such as:
  - Excessive nulls (>30% of characters)
  - Regular null intervals (every 2nd, 4th, 8th position)
  - Fixed-offset nulls indicating structured binary data
- Byte order consistency: Verifies the byte order is consistent throughout the string (for Auto mode)
Confidence Formula:
confidence = (valid_unicode_weight × valid_ratio)
+ (printable_weight × printable_ratio)
+ (ascii_weight × ascii_ratio)
- (null_pattern_penalty)
- (invalid_range_penalty)
The result is clamped to 0.0-1.0 range.
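As a sketch, the formula maps onto code directly. The weight and penalty values below are illustrative assumptions, not the library's defaults:

```rust
/// Sketch of the weighted UTF-16 confidence combination described above.
fn utf16_confidence(
    valid_ratio: f32,
    printable_ratio: f32,
    ascii_ratio: f32,
    null_pattern_penalty: f32,
    invalid_range_penalty: f32,
) -> f32 {
    // Assumed weights; the real values are internal to the library.
    let (valid_w, printable_w, ascii_w) = (0.4, 0.4, 0.2);
    let raw = valid_w * valid_ratio
        + printable_w * printable_ratio
        + ascii_w * ascii_ratio
        - null_pattern_penalty
        - invalid_range_penalty;
    raw.clamp(0.0, 1.0) // result is clamped to the 0.0-1.0 range
}

fn main() {
    // Clean text: all ratios near 1.0, no penalties.
    println!("{}", utf16_confidence(1.0, 1.0, 1.0, 0.0, 0.0));
    // Null-interleaved table data: heavy null-pattern penalty clamps to 0.
    println!("{}", utf16_confidence(0.9, 0.5, 0.5, 2.0, 0.0)); // prints 0
}
```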
Examples:
- High confidence: “Microsoft Corporation” (>90% printable, valid Unicode, no null patterns)
- Medium confidence: “Test123” (>70% printable, valid Unicode)
- Low confidence: Null-interleaved binary table data (excessive nulls, regular patterns)
The UTF-16-specific confidence score is combined with general noise filtering confidence when noise filtering is enabled, using the minimum of both scores.
False Positive Prevention
UTF-16 extraction is prone to false positives because binary data with null bytes can look like UTF-16 strings. The confidence scoring system mitigates this by:
- Detecting null-interleaved patterns: Binary tables with numeric data (e.g., [0x01, 0x00, 0x02, 0x00]) are flagged as suspicious
- Penalizing regular null patterns: Data with nulls at fixed intervals (every 2nd, 4th, 8th byte) receives lower confidence
- Validating Unicode ranges: Invalid code points and surrogate pairs reduce confidence
- Configurable threshold: The utf16_confidence_threshold (default 0.5) can be tuned to balance recall and precision
Recommendations:
- For Windows PE binaries: Use ByteOrder::LE with confidence_threshold: 0.6
- For Java .class files: Use ByteOrder::BE with confidence_threshold: 0.5
- For unknown formats: Use ByteOrder::Auto with confidence_threshold: 0.5
- For high-precision extraction: Increase confidence_threshold to 0.7-0.8
Performance Considerations
UTF-16 scanning adds overhead compared to ASCII/UTF-8 extraction:
- Scanning both byte orders: Auto mode doubles the work by scanning for both LE and BE
- Confidence scoring: The multi-heuristic confidence calculation adds computational cost
Recommendations:
- Use specific byte order (LE or BE) when the target format is known
- Auto mode is best for unknown or mixed-format binaries
- Consider disabling UTF-16 extraction for formats that don’t use it (e.g., pure ELF binaries)
Section-Aware Extraction
Different sections have different string extraction strategies.
High-Priority Sections
ELF: .rodata and variants
- Strategy: Aggressive extraction, low noise filtering
- Encodings: ASCII/UTF-8 primary, UTF-16 secondary
- Minimum length: 3 characters
PE: .rdata
- Strategy: Balanced extraction
- Encodings: ASCII and UTF-16LE equally
- Minimum length: 4 characters
Mach-O: __TEXT,__cstring
- Strategy: High confidence, null-terminated focus
- Encodings: UTF-8 primary
- Minimum length: 3 characters
Medium-Priority Sections
ELF: .data.rel.ro
- Strategy: Conservative extraction
- Noise filtering: Enhanced
- Minimum length: 5 characters
PE: .data (read-only)
- Strategy: Moderate extraction
- Context checking: Enhanced validation
Low-Priority Sections
Writable data sections
- Strategy: Very conservative
- High noise filtering: Skip obvious runtime data
- Minimum length: 6+ characters
Resource Sections
PE Resources (.rsrc)
- VERSIONINFO: Extract version strings, product names
- STRINGTABLE: Localized UI strings
- RT_MANIFEST: XML manifest data
fn extract_pe_resources(pe: &PE, data: &[u8]) -> Vec<RawString> {
    let mut strings = Vec::new();

    // Extract version info
    if let Some(version_info) = extract_version_info(pe, data) {
        strings.extend(version_info);
    }

    // Extract string tables
    if let Some(string_tables) = extract_string_tables(pe, data) {
        strings.extend(string_tables);
    }

    strings
}
Deduplication Strategy
Canonicalization
Strings are canonicalized while preserving important metadata:
- Normalize whitespace: Convert tabs/newlines to spaces
- Trim boundaries: Remove leading/trailing whitespace
- Case preservation: Maintain original case for analysis
- Encoding normalization: Convert to UTF-8 for comparison
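These rules can be sketched as a small helper (the body is an assumption; the deduplication algorithm below calls a `canonicalize` of this shape):

```rust
/// Sketch of the canonicalization rules above: normalize whitespace,
/// trim boundaries, preserve case. (Body is illustrative.)
fn canonicalize(s: &str) -> String {
    // A &str input is already UTF-8, so encoding normalization is implicit here.
    s.trim()
        .chars()
        .map(|c| match c {
            '\t' | '\n' | '\r' => ' ', // normalize tabs/newlines to spaces
            other => other,            // case is preserved for analysis
        })
        .collect()
}

fn main() {
    println!("{:?}", canonicalize("  Hello\tWorld\n")); // prints "Hello World"
}
```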
Metadata Preservation
When duplicates are found:
struct DeduplicatedString {
    canonical_text: String,
    occurrences: Vec<StringOccurrence>,
    primary_encoding: Encoding,
    best_section: Option<String>,
}

struct StringOccurrence {
    offset: u64,
    section: Option<String>,
    encoding: Encoding,
    length: u32,
}
Deduplication Algorithm
fn deduplicate_strings(strings: Vec<RawString>) -> Vec<DeduplicatedString> {
    let mut map: HashMap<String, DeduplicatedString> = HashMap::new();
    for string in strings {
        let canonical = canonicalize(&string.text);
        map.entry(canonical.clone())
            .or_insert_with(|| DeduplicatedString::new(canonical))
            .add_occurrence(string);
    }
    map.into_values().collect()
}
Configuration Options
Extraction Configuration
use stringy::extraction::{ByteOrder, Encoding, ExtractionConfig};
pub struct ExtractionConfig {
    pub min_ascii_length: usize,          // Default: 4
    pub min_wide_length: usize,           // Default: 3 (for UTF-16)
    pub enabled_encodings: Vec<Encoding>, // Default: ASCII, UTF-8
    pub noise_filtering_enabled: bool,    // Default: true
    pub min_confidence_threshold: f32,    // Default: 0.5
    pub utf16_min_confidence: f32,        // Default: 0.7 (for UTF-16LE)
    pub utf16_byte_order: ByteOrder,      // Default: Auto
    pub utf16_confidence_threshold: f32,  // Default: 0.5 (UTF-16-specific)
}
UTF-16 Configuration Examples:
use stringy::extraction::{ExtractionConfig, Encoding, ByteOrder};
// Extract UTF-16LE strings from Windows PE binary
let mut config = ExtractionConfig::default();
config.min_wide_length = 3;
config.utf16_confidence_threshold = 0.6;
config.utf16_byte_order = ByteOrder::LE;
config.enabled_encodings.push(Encoding::Utf16Le);
// Extract both UTF-16LE and UTF-16BE with auto-detection
let mut config = ExtractionConfig::default();
config.enabled_encodings.push(Encoding::Utf16Le);
config.enabled_encodings.push(Encoding::Utf16Be);
config.utf16_byte_order = ByteOrder::Auto;
Noise Filter Configuration
use stringy::extraction::config::NoiseFilterConfig;
pub struct NoiseFilterConfig {
    pub entropy_min: f32,              // Default: 1.5
    pub entropy_max: f32,              // Default: 7.5
    pub max_length: usize,             // Default: 200
    pub max_repetition_ratio: f32,     // Default: 0.7
    pub min_vowel_ratio: f32,          // Default: 0.1
    pub max_vowel_ratio: f32,          // Default: 0.9
    pub filter_weights: FilterWeights, // Default: balanced weights
}
Filter Weights
use stringy::extraction::config::FilterWeights;
pub struct FilterWeights {
    pub entropy_weight: f32,           // Default: 0.25
    pub char_distribution_weight: f32, // Default: 0.20
    pub linguistic_weight: f32,        // Default: 0.20
    pub length_weight: f32,            // Default: 0.15
    pub repetition_weight: f32,        // Default: 0.10
    pub context_weight: f32,           // Default: 0.10
}
All weights must sum to 1.0. The configuration validates this automatically.
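The validation can be sketched as a simple sum check with a floating-point tolerance (the `validate` method name is an assumption):

```rust
struct FilterWeights {
    entropy_weight: f32,
    char_distribution_weight: f32,
    linguistic_weight: f32,
    length_weight: f32,
    repetition_weight: f32,
    context_weight: f32,
}

impl FilterWeights {
    /// Check that the six weights sum to 1.0, within f32 tolerance.
    fn validate(&self) -> Result<(), String> {
        let sum = self.entropy_weight
            + self.char_distribution_weight
            + self.linguistic_weight
            + self.length_weight
            + self.repetition_weight
            + self.context_weight;
        if (sum - 1.0).abs() > 1e-5 {
            return Err(format!("filter weights sum to {sum}, expected 1.0"));
        }
        Ok(())
    }
}

fn main() {
    let defaults = FilterWeights {
        entropy_weight: 0.25,
        char_distribution_weight: 0.20,
        linguistic_weight: 0.20,
        length_weight: 0.15,
        repetition_weight: 0.10,
        context_weight: 0.10,
    };
    assert!(defaults.validate().is_ok());
}
```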
Encoding Selection
#[non_exhaustive]
pub enum EncodingFilter {
    /// Match a specific encoding exactly
    Exact(Encoding),
    /// Match any UTF-16 variant (UTF-16LE or UTF-16BE)
    Utf16Any,
}
Section Filtering
pub struct SectionFilter {
    pub include_sections: Option<Vec<String>>,
    pub exclude_sections: Option<Vec<String>>,
    pub include_debug: bool,
    pub include_resources: bool,
}
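One plausible way the include/exclude lists combine (the semantics sketched here, with exclusion winning over inclusion, are an assumption rather than the library's documented behavior):

```rust
/// Sketch: a section passes if it is not excluded and, when an include
/// list is present, appears in it. Exclusion wins over inclusion.
fn section_allowed(
    include: &Option<Vec<String>>,
    exclude: &Option<Vec<String>>,
    name: &str,
) -> bool {
    if let Some(ex) = exclude {
        if ex.iter().any(|s| s == name) {
            return false;
        }
    }
    match include {
        Some(inc) => inc.iter().any(|s| s == name),
        None => true, // no include list: everything not excluded passes
    }
}

fn main() {
    let include = Some(vec![".rodata".to_string(), ".rdata".to_string()]);
    println!("{}", section_allowed(&include, &None, ".rodata")); // prints true
    println!("{}", section_allowed(&include, &None, ".text")); // prints false
}
```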
Performance Optimizations
Memory Mapping
Large files use memory mapping for efficient access via mmap-guard:
fn extract_from_large_file(path: &Path) -> Result<Vec<RawString>> {
    let data = mmap_guard::map_file(path)?;
    // data implements Deref<Target = [u8]>
    extract_strings(&data[..])
}
Note: The Pipeline::run API handles memory mapping automatically.
Parallel Processing
Parallel processing is not yet implemented. Section extraction currently runs sequentially.
Regex Caching
Pattern matching uses cached regex compilation:
lazy_static! {
    static ref URL_REGEX: Regex = Regex::new(r"https?://[^\s]+").unwrap();
    static ref GUID_REGEX: Regex = Regex::new(r"\{[0-9a-fA-F-]{36}\}").unwrap();
}
Quality Assurance
Validation Heuristics
The noise filtering system implements comprehensive validation:
- Entropy checking: Uses Shannon entropy to detect padding/repetition and random binary data
- Language detection: Analyzes vowel-to-consonant ratios and common bigrams
- Context validation: Considers section type, weight, and permissions
- Character distribution: Detects abnormal frequency distributions
- Repetition detection: Identifies repeated patterns and padding
False Positive Reduction
The multi-layered filtering system targets common sources of false positives:
- Padding detection: Identifies repeated character sequences (e.g., “AAAA”, “\x00\x00\x00\x00”)
- Table data: Filters excessively long strings likely to be structured data
- Binary noise: High-entropy strings are flagged as likely random binary
- Context awareness: Strings in code sections receive lower confidence scores
Performance Characteristics
Noise filtering is designed for minimal overhead:
- Target overhead: <10% compared to extraction without filtering
- Optimized filters: Each filter is independently optimized
- Configurable: Can enable/disable individual filters to balance accuracy and speed
- Scalable: Handles large binaries efficiently
Examples
Basic Extraction with Filtering
use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig};
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
let data = b"Hello World\0AAAA\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);
let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();
let filtered: Vec<_> = strings
    .into_iter()
    .filter(|s| filter.calculate_confidence(&s.text, &context) >= 0.5)
    .collect();
Custom Filter Configuration
use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;
config.filter_weights = FilterWeights {
    entropy_weight: 0.4,
    char_distribution_weight: 0.3,
    linguistic_weight: 0.15,
    length_weight: 0.1,
    repetition_weight: 0.03,
    context_weight: 0.02,
};
This comprehensive extraction system ensures high-quality string extraction while maintaining performance and minimizing false positives through multi-layered noise filtering.