Binary Format Support
Stringy supports the three major executable formats: ELF, PE, and Mach-O. Each format has characteristics that shape the string extraction strategy.
ELF (Executable and Linkable Format)
Used primarily on Linux and other Unix-like systems.
Key Sections for String Extraction
| Section | Priority | Description |
|---|---|---|
| .rodata | High | Read-only data, often contains string literals |
| .rodata.str1.1 | High | Aligned string literals |
| .data.rel.ro | Medium | Read-only after relocation |
| .comment | Medium | Compiler and build information |
| .note.* | Low | Various metadata notes |
ELF-Specific Features
- Symbol Tables: Extract import/export names from .dynsym and .symtab
- Dynamic Strings: Process .dynstr for library names and symbols
- Section Flags: Use SHF_EXECINSTR and SHF_WRITE for classification
- Virtual Addresses: Map file offsets to runtime addresses
- Dynamic Linking: Parse DT_NEEDED entries to extract library dependencies
- Symbol Types: Support for functions (STT_FUNC), objects (STT_OBJECT), TLS variables (STT_TLS), and indirect functions (STT_GNU_IFUNC)
- Symbol Visibility: Filter hidden and internal symbols from exports (STV_HIDDEN, STV_INTERNAL)
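The "Virtual Addresses" item can be illustrated with a small sketch. It assumes a simplified `SectionRange` struct (a hypothetical stand-in, not Stringy's actual type) that carries the three section header fields involved, and it treats a section with `sh_addr == 0` as not loaded.

```rust
/// Hypothetical stand-in for the ELF section header fields used here.
struct SectionRange {
    sh_offset: u64, // file offset of the section's first byte
    sh_addr: u64,   // virtual address at run time (0 if the section is not loaded)
    sh_size: u64,   // section size in bytes
}

/// Translate a file offset to a virtual address, if a loaded section contains it.
fn file_offset_to_vaddr(sections: &[SectionRange], offset: u64) -> Option<u64> {
    sections
        .iter()
        .filter(|s| s.sh_addr != 0)
        .find(|s| offset >= s.sh_offset && offset - s.sh_offset < s.sh_size)
        .map(|s| s.sh_addr + (offset - s.sh_offset))
}
```

A string found at file offset 0x1010 inside a section with `sh_offset = 0x1000` and `sh_addr = 0x401000` would be reported at address 0x401010.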
Enhanced Symbol Extraction
The ELF parser now provides comprehensive symbol extraction with:
- Import Detection: Identifies all undefined symbols (SHN_UNDEF) that need runtime resolution
  - Supports multiple symbol types: functions, objects, TLS variables, and indirect functions
  - Handles both global and weak bindings
  - Maps symbols to their providing libraries using version information
- Export Detection: Extracts all globally visible defined symbols
  - Filters out hidden (STV_HIDDEN) and internal (STV_INTERNAL) symbols
  - Includes both strong and weak symbols
  - Supports all relevant symbol types
- Library Dependencies: Extracts DT_NEEDED entries from the dynamic section
  - Provides list of required shared libraries
  - Used in conjunction with version information for symbol-to-library mapping
- Symbol-to-Library Mapping: Maps imported symbols to their providing libraries
  - Uses ELF version tables (versym and verneed) for best-effort attribution
  - Process: versym index → verneed entry → library filename
  - Falls back to heuristics for unversioned symbols (e.g., common libc symbols)
  - Returns None when version information is unavailable or ambiguous
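The import and export rules above reduce to checks on three symbol table fields. The following is a minimal sketch, assuming a hypothetical `RawSymbol` struct rather than Stringy's or goblin's own types; the constant values come from the ELF specification.

```rust
// Constants from the ELF specification.
const SHN_UNDEF: u16 = 0;
const STB_GLOBAL: u8 = 1;
const STB_WEAK: u8 = 2;
const STV_INTERNAL: u8 = 1;
const STV_HIDDEN: u8 = 2;

/// The three symbol-table fields the rules depend on (hypothetical struct).
struct RawSymbol {
    st_info: u8,   // binding in the high nibble, type in the low nibble
    st_other: u8,  // visibility in the low two bits
    st_shndx: u16, // defining section index, SHN_UNDEF if undefined
}

impl RawSymbol {
    fn binding(&self) -> u8 {
        self.st_info >> 4
    }

    fn visibility(&self) -> u8 {
        self.st_other & 0x3
    }

    fn is_global_or_weak(&self) -> bool {
        matches!(self.binding(), STB_GLOBAL | STB_WEAK)
    }

    /// Import: undefined here, so it must be resolved at run time.
    fn is_import(&self) -> bool {
        self.st_shndx == SHN_UNDEF && self.is_global_or_weak()
    }

    /// Export: defined here, externally bound, and not hidden or internal.
    fn is_export(&self) -> bool {
        self.st_shndx != SHN_UNDEF
            && self.is_global_or_weak()
            && !matches!(self.visibility(), STV_HIDDEN | STV_INTERNAL)
    }
}
```

The null symbol at index 0 has local binding, so it is rejected by `is_import` even though its section index is SHN_UNDEF.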
Implementation Details
impl ElfParser {
    fn classify_section(section: &SectionHeader, name: &str) -> SectionType {
        // Check executable flag first
        if section.sh_flags & SHF_EXECINSTR != 0 {
            return SectionType::Code;
        }
        // Classify by name patterns
        match name {
            ".rodata" | ".rodata.str1.1" => SectionType::StringData,
            ".data.rel.ro" => SectionType::ReadOnlyData,
            // ... more classifications
        }
    }

    fn extract_imports(&self, elf: &Elf, libraries: &[String]) -> Vec<ImportInfo> {
        // Extract undefined symbols from dynamic symbol table
        // Supports STT_FUNC, STT_OBJECT, STT_TLS, STT_GNU_IFUNC, STT_NOTYPE
        // Handles both STB_GLOBAL and STB_WEAK bindings
        // Maps symbols to libraries using version information
    }

    fn extract_exports(&self, elf: &Elf) -> Vec<ExportInfo> {
        // Extract defined symbols with global/weak binding
        // Filters out STV_HIDDEN and STV_INTERNAL symbols
        // Includes all relevant symbol types
    }

    fn extract_needed_libraries(&self, elf: &Elf) -> Vec<String> {
        // Parse DT_NEEDED entries from dynamic section
        // Returns list of required shared library names
    }

    fn get_symbol_providing_library(
        &self,
        elf: &Elf,
        sym_index: usize,
        libraries: &[String],
    ) -> Option<String> {
        // 1. Get version index from versym table for this symbol
        // 2. Look up version in verneed to find library name
        // 3. Match with DT_NEEDED entries
        // 4. Fallback to heuristics for unversioned symbols
    }
}
Library Dependency Mapping
The ELF parser implements symbol-to-library mapping using ELF version information:
- Version Symbol Table (versym): Maps each dynamic symbol to a version index
  - Index 0 (VER_NDX_LOCAL): Local symbol, not available externally
  - Index 1 (VER_NDX_GLOBAL): Global symbol, no specific version
  - Index ≥ 2: Versioned symbol, references verneed entry
- Version Needed Table (verneed): Lists library dependencies with version requirements
  - Each entry contains a library filename (from DT_NEEDED)
  - Auxiliary entries specify version names and indices
  - Links version indices to specific libraries
- Mapping Process:
  Symbol → versym[sym_index] → version_index → verneed lookup → library_name
- Fallback Strategies:
  - For unversioned symbols: Attempt to match common symbols (e.g., printf, malloc) to libc
  - If only one library is needed: Attribute to that library (least accurate)
  - Otherwise: Return None to avoid false positives
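The mapping process can be sketched over already-decoded tables. This is an illustration under stated assumptions, not the parser's actual code: `VerneedEntry` and `providing_library` are hypothetical names, the tables are assumed to be parsed into plain slices, and the common-symbol heuristic for libc is left out so that only the single-library fallback remains.

```rust
const VER_NDX_LOCAL: u16 = 0;
const VER_NDX_GLOBAL: u16 = 1;
// The high bit of a versym entry marks a hidden version; it is not part of the index.
const VERSYM_HIDDEN: u16 = 0x8000;

/// Simplified, already-decoded verneed entry (hypothetical type).
struct VerneedEntry {
    library: String,           // file name the versions come from, e.g. "libc.so.6"
    version_indices: Vec<u16>, // version indices carried by the auxiliary entries
}

/// Best-effort lookup: versym[sym_index] -> version index -> verneed entry -> library.
fn providing_library<'a>(
    versym: &[u16],
    verneed: &'a [VerneedEntry],
    needed: &'a [String],
    sym_index: usize,
) -> Option<&'a str> {
    let index = versym.get(sym_index).map(|&v| v & !VERSYM_HIDDEN);
    match index {
        // Local symbols are never provided by another library.
        Some(VER_NDX_LOCAL) => None,
        // Versioned symbol: find the verneed entry that owns this index.
        Some(i) if i > VER_NDX_GLOBAL => verneed
            .iter()
            .find(|e| e.version_indices.contains(&i))
            .map(|e| e.library.as_str()),
        // Unversioned: attribute only when there is a single candidate library.
        _ if needed.len() == 1 => Some(needed[0].as_str()),
        _ => None,
    }
}
```

Returning `None` for unversioned symbols when several libraries are needed matches the "avoid false positives" rule in the fallback list.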
Limitations
ELF’s indirect linking model means symbol-to-library mapping is best-effort:
- Accuracy: Version-based mapping is accurate when version information is present, but many binaries lack version info
- Unversioned Symbols: Symbols without version information cannot be definitively mapped without relocation analysis
- Relocation Tables: PLT/GOT relocations would provide definitive mapping but require complex analysis
- Static Linking: Statically linked binaries have no dynamic section, so all imports have library: None
- Stripped Binaries: Stripped binaries may lack symbol tables entirely
The current implementation is sufficient for most string classification use cases where approximate library attribution is acceptable.
PE (Portable Executable)
Used on Windows for executables, DLLs, and drivers.
Key Sections for String Extraction
| Section | Priority | Description |
|---|---|---|
| .rdata | High | Read-only data section |
| .rsrc | High | Resources (version info, strings, etc.) |
| .data | Medium | Initialized data (check write flag) |
| .text | Low | Code section (imports/exports only) |
PE-Specific Features
- Resources: Extract from VERSIONINFO, STRINGTABLE, and manifest resources
- Import/Export Tables: Process IAT and EAT for symbol names
- UTF-16 Prevalence: Windows APIs favor wide strings
- Section Characteristics: Use IMAGE_SCN_* flags for classification
Enhanced Import/Export Extraction
The PE parser provides comprehensive import/export extraction:
- Import Extraction: Extracts from PE import directory using goblin’s pe.imports
  - Each import includes: function name, DLL name, and RVA
  - Example: printf from msvcrt.dll
  - Iterates through pe.imports to create ImportInfo with name, library (DLL), and address (RVA)
- Export Extraction: Extracts from PE export directory using goblin’s pe.exports
  - Each export includes: function name, address, and ordinal
  - Note: PE executables typically don’t export symbols (only DLLs do)
  - Ordinal is derived from index since goblin doesn’t expose it directly
  - Handles unnamed exports with “ordinal_{i}” naming
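The index-derived ordinal and the placeholder naming for unnamed exports could look roughly like this. The sketch works on plain `(name, rva)` pairs instead of goblin's export type, and `ExportRecord` and `build_exports` are hypothetical names chosen for illustration.

```rust
/// Hypothetical simplified export record.
#[derive(Debug, PartialEq)]
struct ExportRecord {
    name: String,
    address: u64,
    ordinal: u16,
}

/// Build export records from (optional name, RVA) pairs in table order.
fn build_exports(raw: &[(Option<&str>, u64)]) -> Vec<ExportRecord> {
    raw.iter()
        .enumerate()
        .map(|(i, (name, rva))| {
            // Unnamed exports get a placeholder derived from their position.
            let name = match name {
                Some(n) => n.to_string(),
                None => format!("ordinal_{}", i),
            };
            ExportRecord {
                name,
                address: *rva,
                ordinal: i as u16, // derived from the index, as described above
            }
        })
        .collect()
}
```

Because real PE ordinals are offset by the export directory's ordinal base, an index-derived ordinal is best read as a position in the export table rather than the value a loader would use.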
Resource Extraction (Phase 2 Complete)
PE resources are particularly rich sources of strings. The PE parser now provides comprehensive resource string extraction:
VERSIONINFO Extraction
- Extracts all StringFileInfo key-value pairs from VS_VERSIONINFO structures
- Supports multiple language variants via translation table
- Common extracted fields:
  - CompanyName: Company or organization name
  - FileDescription: File purpose and description
  - FileVersion: File version string (e.g., “1.0.0.0”)
  - ProductName: Product name
  - ProductVersion: Product version string
  - LegalCopyright: Copyright information
  - InternalName: Internal file identifier
  - OriginalFilename: Original filename
- Uses pelite’s high-level version_info() API for reliable parsing
- All strings are UTF-16LE encoded in the resource
- Tagged with Tag::Version and Tag::Resource
STRINGTABLE Extraction
- Parses RT_STRING resources (type 6) containing localized UI strings
- Handles block structure: strings grouped in blocks of 16
- Block ID calculation: (StringID >> 4) + 1
- String format: u16 length (in UTF-16 code units) + UTF-16LE string data
- Supports multiple language variants
- Extracts all non-empty strings from all blocks
- Tagged with Tag::Resource
- Common use cases: UI labels, error messages, dialog text
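The block layout described above can be decoded with a short loop. This is a minimal sketch, assuming the raw bytes of one RT_STRING block have already been located; `parse_string_block` is a hypothetical helper, not part of Stringy's API. It inverts the block ID formula to recover each string's ID.

```rust
/// Parse one RT_STRING block: up to 16 entries, each a u16 length (in UTF-16
/// code units) followed by that many UTF-16LE code units.
/// Returns (string ID, text) pairs for the non-empty entries.
fn parse_string_block(block_id: u32, data: &[u8]) -> Vec<(u32, String)> {
    // Inverse of block_id = (string_id >> 4) + 1.
    let first_id = block_id.saturating_sub(1) * 16;
    let mut strings = Vec::new();
    let mut pos = 0usize;
    for slot in 0..16u32 {
        // Read the length prefix; stop quietly if the block is truncated.
        let len = match data.get(pos..pos + 2) {
            Some(b) => u16::from_le_bytes([b[0], b[1]]) as usize,
            None => break,
        };
        pos += 2;
        let raw = match data.get(pos..pos + len * 2) {
            Some(raw) => raw,
            None => break,
        };
        pos += len * 2;
        if len == 0 {
            continue; // empty slot: this string ID is unused
        }
        let units: Vec<u16> = raw
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        strings.push((first_id + slot, String::from_utf16_lossy(&units)));
    }
    strings
}
```

For block 7, the first slot holds string ID 96, and (96 >> 4) + 1 gives back 7.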
MANIFEST Extraction
- Extracts RT_MANIFEST resources (type 24) containing application manifests
- Automatic encoding detection:
  - UTF-8 with BOM (EF BB BF)
  - UTF-16LE with BOM (FF FE)
  - UTF-16BE with BOM (FE FF)
  - Fallback: byte pattern analysis
- Returns full XML manifest content
- Tagged with Tag::Manifest and Tag::Resource
- Manifest contains:
  - Assembly identity (name, version, architecture)
  - Dependency information
  - Compatibility settings
  - Security settings (requestedExecutionLevel)
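The BOM checks listed above can be expressed in a few lines. The sketch below is illustrative only: `ManifestEncoding`, `detect_bom`, and `decode_manifest` are hypothetical names, and where the text mentions byte pattern analysis as the fallback, this version simply assumes UTF-8 when no BOM is present.

```rust
#[derive(Debug, PartialEq)]
enum ManifestEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Return the encoding announced by a BOM and the BOM's length in bytes.
fn detect_bom(data: &[u8]) -> Option<(ManifestEncoding, usize)> {
    if data.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((ManifestEncoding::Utf8, 3))
    } else if data.starts_with(&[0xFF, 0xFE]) {
        Some((ManifestEncoding::Utf16Le, 2))
    } else if data.starts_with(&[0xFE, 0xFF]) {
        Some((ManifestEncoding::Utf16Be, 2))
    } else {
        None
    }
}

/// Decode 16-bit units with the given byte order; a trailing odd byte is ignored.
fn decode_utf16(bytes: &[u8], to_unit: fn([u8; 2]) -> u16) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|c| to_unit([c[0], c[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}

/// Decode manifest bytes to text, stripping the BOM if there is one.
fn decode_manifest(data: &[u8]) -> String {
    match detect_bom(data) {
        Some((ManifestEncoding::Utf16Le, skip)) => decode_utf16(&data[skip..], u16::from_le_bytes),
        Some((ManifestEncoding::Utf16Be, skip)) => decode_utf16(&data[skip..], u16::from_be_bytes),
        Some((ManifestEncoding::Utf8, skip)) => String::from_utf8_lossy(&data[skip..]).into_owned(),
        // Simplifying assumption: no BOM means UTF-8.
        None => String::from_utf8_lossy(data).into_owned(),
    }
}
```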
Usage Example
use stringy::extraction::extract_resource_strings;
use stringy::types::Tag;

let pe_data = std::fs::read("example.exe")?;
let strings = extract_resource_strings(&pe_data);

// Filter version info strings
let version_strings: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::Version))
    .collect();

// Filter string table entries: resource strings that are neither
// version info nor manifest content
let ui_strings: Vec<_> = strings.iter()
    .filter(|s| {
        s.tags.contains(&Tag::Resource)
            && !s.tags.contains(&Tag::Version)
            && !s.tags.contains(&Tag::Manifest)
    })
    .collect();
Implementation Details
impl PeParser {
    fn classify_section(section: &SectionTable) -> SectionType {
        let name = String::from_utf8_lossy(&section.name);
        // Check characteristics
        if section.characteristics & IMAGE_SCN_CNT_CODE != 0 {
            return SectionType::Code;
        }
        match name.trim_end_matches('\0') {
            ".rdata" => SectionType::StringData,
            ".rsrc" => SectionType::Resources,
            // ... more classifications
        }
    }

    fn extract_imports(&self, pe: &PE) -> Vec<ImportInfo> {
        // Iterates through pe.imports
        // Creates ImportInfo with name, library (DLL), and address (RVA)
    }

    fn extract_exports(&self, pe: &PE) -> Vec<ExportInfo> {
        // Iterates through pe.exports
        // Creates ExportInfo with name, address, and ordinal
        // Handles unnamed exports with "ordinal_{i}" naming
    }

    fn calculate_section_weight(section_type: SectionType, name: &str) -> f32 {
        // Returns weight values based on section type and name
        // Higher weights indicate higher string likelihood
    }
}
Section Weight Calculation
The PE parser uses a weight-based system to prioritize sections for string extraction:
| Section Type | Weight | Rationale |
|---|---|---|
| StringData (.rdata) | 10.0 | Primary string storage |
| Resources (.rsrc) | 9.0 | Version info, string tables |
| ReadOnlyData | 7.0 | May contain constants |
| WritableData (.data) | 5.0 | Runtime state, lower priority |
| Code (.text) | 1.0 | Unlikely to contain strings |
| Debug | 2.0 | Internal metadata |
| Other | 1.0 | Minimal priority |
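As a rough illustration of how the table might be applied, the sketch below maps each section type to its weight and orders sections so the most promising ones are scanned first. The `SectionType` enum here is a local stand-in with the variants named in the table, and `section_weight` and `rank_sections` are hypothetical helpers; unlike the `calculate_section_weight` signature shown earlier, this version ignores the section name.

```rust
/// Local stand-in for the section categories listed in the table.
#[derive(Clone, Copy)]
enum SectionType {
    StringData,
    Resources,
    ReadOnlyData,
    WritableData,
    Code,
    Debug,
    Other,
}

/// Weights copied from the table above.
fn section_weight(section_type: SectionType) -> f32 {
    match section_type {
        SectionType::StringData => 10.0,
        SectionType::Resources => 9.0,
        SectionType::ReadOnlyData => 7.0,
        SectionType::WritableData => 5.0,
        SectionType::Debug => 2.0,
        SectionType::Code => 1.0,
        SectionType::Other => 1.0,
    }
}

/// Order section names from highest to lowest weight (stable for ties).
fn rank_sections<'a>(sections: &[(&'a str, SectionType)]) -> Vec<&'a str> {
    let mut ranked = sections.to_vec();
    ranked.sort_by(|a, b| section_weight(b.1).total_cmp(&section_weight(a.1)));
    ranked.into_iter().map(|(name, _)| name).collect()
}
```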
Limitations
Resource string extraction currently covers three resource types; the remaining types listed below are not yet implemented:
- VERSIONINFO: Complete extraction of all StringFileInfo fields
- STRINGTABLE: Full parsing of RT_STRING blocks with language support
- MANIFEST: Encoding detection and XML extraction
- Dialog Resources: RT_DIALOG parsing not yet implemented (future enhancement)
- Menu Resources: RT_MENU parsing not yet implemented (future enhancement)
- Icon Strings: RT_ICON metadata extraction not yet implemented
Future Enhancements:
- Dialog resource parsing for control text and window titles
- Menu resource parsing for menu item text
- Icon and cursor resource metadata
- Accelerator table string extraction
Mach-O (Mach Object)
Used on macOS and iOS for executables, frameworks, and libraries.
Key Sections for String Extraction
| Segment | Section | Priority | Description |
|---|---|---|---|
| __TEXT | __cstring | High | C string literals |
| __TEXT | __const | High | Constant data |
| __DATA_CONST | * | Medium | Read-only after fixups |
| __DATA | * | Low | Writable data |
Mach-O-Specific Features
- Load Commands: Extract strings from LC_* commands
- Segment/Section Model: Two-level naming scheme
- Fat Binaries: Multi-architecture support
- String Pools: Centralized string storage in __cstring
Load Command Processing
Mach-O load commands contain valuable strings:
- LC_LOAD_DYLIB: Library paths and names
- LC_RPATH: Runtime search paths
- LC_ID_DYLIB: Library identification
- LC_BUILD_VERSION: Build tool information
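For the three path-bearing commands, the string can be read directly from the command bytes: the first field after the `cmd` and `cmdsize` words is an offset from the start of the command to a NUL-terminated string. The sketch below assumes a little-endian Mach-O and a slice that starts at the command header; `load_command_path` is a hypothetical helper, not Stringy's implementation.

```rust
const LC_REQ_DYLD: u32 = 0x8000_0000;
const LC_LOAD_DYLIB: u32 = 0xc;
const LC_ID_DYLIB: u32 = 0xd;
const LC_RPATH: u32 = 0x1c | LC_REQ_DYLD;

/// Extract the path string from one little-endian load command, if it is
/// LC_LOAD_DYLIB, LC_ID_DYLIB, or LC_RPATH. `cmd` must start at the header.
fn load_command_path(cmd: &[u8]) -> Option<String> {
    let read_u32 = |at: usize| -> Option<u32> {
        let b = cmd.get(at..at + 4)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    };
    let kind = read_u32(0)?;
    let size = read_u32(4)? as usize;
    if !matches!(kind, LC_LOAD_DYLIB | LC_ID_DYLIB | LC_RPATH) {
        return None;
    }
    // Offset of the string, measured from the start of the command.
    let offset = read_u32(8)? as usize;
    let body = cmd.get(offset..size.min(cmd.len()))?;
    // The string is NUL-terminated and padded up to the command size.
    let end = body.iter().position(|&b| b == 0).unwrap_or(body.len());
    Some(String::from_utf8_lossy(&body[..end]).into_owned())
}
```

Bounds are checked with `get` throughout so that a malformed `cmdsize` or offset yields `None` instead of a panic.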
Implementation Details
impl MachoParser {
    fn classify_section(segment_name: &str, section_name: &str) -> SectionType {
        match (segment_name, section_name) {
            ("__TEXT", "__cstring") => SectionType::StringData,
            ("__DATA_CONST", _) => SectionType::ReadOnlyData,
            ("__DATA", _) => SectionType::WritableData,
            // ... more classifications
        }
    }
}
Cross-Platform Considerations
Encoding Differences
| Platform | Primary Encoding | Notes |
|---|---|---|
| Linux/Unix | UTF-8 | ASCII-compatible, variable width |
| Windows | UTF-16LE | Wide strings common in APIs |
| macOS | UTF-8 | Similar to Linux, some UTF-16 |
String Storage Patterns
- ELF: Strings often in .rodata with null terminators
- PE: Mix of ANSI and Unicode APIs, resources use UTF-16
- Mach-O: Centralized in __cstring, mostly UTF-8
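To make the PE case concrete, here is a small sketch of scanning raw bytes for ASCII text stored as UTF-16LE, where every character is followed by a zero byte. It is a deliberately narrow illustration (printable ASCII only, no surrogate handling), and `find_utf16le_strings` is a hypothetical name rather than Stringy's extractor.

```rust
/// Find runs of printable ASCII stored as UTF-16LE (each character followed
/// by 0x00). Returns (byte offset, text) for runs of at least `min_chars`.
fn find_utf16le_strings(data: &[u8], min_chars: usize) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut i = 0;
    while i + 1 < data.len() {
        let start = i;
        let mut text = String::new();
        // Consume (printable byte, 0x00) pairs for as long as they continue.
        while i + 1 < data.len() && data[i + 1] == 0 && (0x20..0x7f).contains(&data[i]) {
            text.push(data[i] as char);
            i += 2;
        }
        if text.len() >= min_chars {
            found.push((start, text));
        }
        // Nothing matched here: step one byte so odd-aligned strings are still found.
        if i == start {
            i += 1;
        }
    }
    found
}
```

Stepping by a single byte after a miss costs some speed but avoids assuming that wide strings start at even offsets.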
Section Weight Calculation
Different formats require different weighting strategies:
fn calculate_section_weight(format: BinaryFormat, section_type: SectionType) -> i32 {
    match (format, section_type) {
        (BinaryFormat::Elf, SectionType::StringData) => 10, // .rodata
        (BinaryFormat::Pe, SectionType::Resources) => 9, // .rsrc
        (BinaryFormat::MachO, SectionType::StringData) => 10, // __cstring
        // ... more weights
    }
}
Format Detection
Stringy uses goblin for robust format detection:
use goblin::Object;

pub fn detect_format(data: &[u8]) -> BinaryFormat {
    match Object::parse(data) {
        Ok(Object::Elf(_)) => BinaryFormat::Elf,
        Ok(Object::PE(_)) => BinaryFormat::Pe,
        Ok(Object::Mach(_)) => BinaryFormat::MachO,
        _ => BinaryFormat::Unknown,
    }
}
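For intuition about what format detection looks at, the following dependency-free sketch checks the well-known magic numbers by hand. It is not how goblin works internally and is no substitute for full parsing; `sniff_format` is a hypothetical helper and the `BinaryFormat` enum is redefined locally so the example stands alone.

```rust
#[derive(Debug, PartialEq)]
enum BinaryFormat {
    Elf,
    Pe,
    MachO,
    Unknown,
}

/// Guess the container format from magic numbers alone.
fn sniff_format(data: &[u8]) -> BinaryFormat {
    if data.starts_with(b"\x7fELF") {
        return BinaryFormat::Elf;
    }
    if let Some(m) = data.get(..4) {
        let magic = u32::from_be_bytes([m[0], m[1], m[2], m[3]]);
        // Thin Mach-O (32/64-bit, either byte order) and fat headers.
        // Caveat: 0xcafebabe is also the Java class file magic.
        if matches!(
            magic,
            0xfeed_face | 0xfeed_facf | 0xcefa_edfe | 0xcffa_edfe | 0xcafe_babe
        ) {
            return BinaryFormat::MachO;
        }
    }
    if data.starts_with(b"MZ") {
        // The u32 at offset 0x3c (e_lfanew) points at the "PE\0\0" signature.
        if let Some(b) = data.get(0x3c..0x40) {
            let pe_offset = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
            if data.get(pe_offset..pe_offset.saturating_add(4)) == Some(&b"PE\0\0"[..]) {
                return BinaryFormat::Pe;
            }
        }
    }
    BinaryFormat::Unknown
}
```

A bare "MZ" header without a valid PE signature is reported as Unknown, since plain DOS executables share the same first two bytes.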
Future Enhancements
Planned Format Extensions
- WebAssembly (WASM): Growing importance in web and edge computing
- Java Class Files: JVM bytecode analysis
- Android APK/DEX: Mobile application analysis
Enhanced Resource Support
- PE: Dialog resources, icon strings, version blocks
- Mach-O: Plist resources, framework bundles
- ELF: Note sections, build IDs, GNU attributes
Architecture-Specific Features
- ARM64: Pointer authentication, tagged pointers
- x86-64: RIP-relative addressing hints
- RISC-V: Emerging architecture support
With ELF, PE, and Mach-O support, Stringy can analyze binaries from Linux, Windows, and macOS while accounting for the conventions of each format.