Crate extractous
Extractous is a library that extracts text from various file formats.
- Supports many file formats such as Word, Excel, PowerPoint, PDF, and many more.
- Strives to be simple, fast, and efficient.
§Quick Start
The entry point of the Extractous API is the Extractor struct.
All public APIs are accessible through an extractor.
The extractor provides functions to extract text from files, URLs, and byte arrays.
To use an extractor, you need to:
§Create and configure an extractor
use extractous::Extractor;
use extractous::PdfParserConfig;
// Create a new extractor. Note that it uses a consuming builder pattern.
let mut extractor = Extractor::new()
    .set_extract_string_max_length(1000);
// Configuration can also be applied conditionally.
let custom_pdf_config = true;
if custom_pdf_config {
    extractor = extractor.set_pdf_config(
        PdfParserConfig::new().set_extract_annotation_text(false)
    );
}
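The reassignment `extractor = extractor.set_pdf_config(...)` above follows from the consuming builder pattern: each setter takes the value by move and hands back the updated value. The following is a minimal sketch of that pattern using only the standard library. `SketchExtractor`, its fields, its setters, and its default values are hypothetical stand-ins chosen for illustration; they are not the extractous implementation.

```rust
// Hypothetical stand-in type for illustration; NOT the extractous Extractor.
#[derive(Debug, Clone, PartialEq)]
struct SketchExtractor {
    max_length: usize,
    annotation_text: bool,
}

impl SketchExtractor {
    // Defaults here are arbitrary values picked for the sketch.
    fn new() -> Self {
        Self { max_length: 10_000, annotation_text: true }
    }

    // Takes `self` by value and returns it, which is what allows chaining.
    fn set_max_length(mut self, max_length: usize) -> Self {
        self.max_length = max_length;
        self
    }

    fn set_annotation_text(mut self, enabled: bool) -> Self {
        self.annotation_text = enabled;
        self
    }
}

// Builds a configured value, optionally applying an extra setting.
fn build(custom: bool) -> SketchExtractor {
    let mut extractor = SketchExtractor::new().set_max_length(1000);
    if custom {
        // The setter consumes the old value, so the result is assigned back.
        extractor = extractor.set_annotation_text(false);
    }
    extractor
}
```

Because each setter moves the value, calling a setter without using its return value would leave the original binding unusable, which is why the conditional branch assigns the result back to the same variable.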
§Extract text
use extractous::Extractor;
// Create a new extractor. Note that it uses a consuming builder pattern.
let extractor = Extractor::new().set_extract_string_max_length(1000);
// Extract text from a file
let text = extractor.extract_file_to_string("README.md").unwrap();
println!("{}", text);
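The StreamReader struct listed under Structs is described as implementing std::io::Read, so any code written against `Read` should be able to consume it. The sketch below shows one way to drain a reader in fixed-size chunks. It is an assumption-laden illustration: `read_in_chunks` is a hypothetical helper, it does not call extractous, and how a reader is obtained from an extractor is not shown in this section.

```rust
use std::io::{self, ErrorKind, Read};

// Hypothetical helper: drains any `Read` implementor in chunks of
// `chunk_size` bytes and returns the collected text.
// Bytes are decoded only at the end, so a multi-byte character that
// straddles two chunks is not corrupted. Invalid UTF-8 is replaced lossily.
fn read_in_chunks<R: Read>(mut reader: R, chunk_size: usize) -> io::Result<String> {
    let mut bytes = Vec::new();
    // Guard against a zero-sized buffer, which would read nothing.
    let mut buf = vec![0u8; chunk_size.max(1)];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => break, // end of stream
            Ok(n) => bytes.extend_from_slice(&buf[..n]),
            // An interrupted read is retried rather than treated as fatal.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}
```

For a quick check without any files, `std::io::Cursor::new(b"some text".to_vec())` can stand in for the reader.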
Structs§
- Extractor for extracting text from different file formats
- Microsoft Office parser configuration settings
- PDF parsing configuration settings
- StreamReader implements std::io::Read
- Tesseract OCR configuration settings
Enums§
- Supported encodings
- Represents errors returned by extractous
- OCR Strategy for PDF parsing
Type Aliases§
- Result that is a wrapper of Result<T, extractous::Error>
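A result alias of this kind fixes the error type so that function signatures only name the success type. The following sketch shows the general idiom with standard-library types only. `SketchError`, `SketchResult`, and `read_first_line` are hypothetical names for illustration and do not reproduce the crate's own error enum or alias.

```rust
use std::fmt;

// Hypothetical error type standing in for a crate-level error enum.
#[derive(Debug)]
enum SketchError {
    Io(std::io::Error),
    Empty,
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Io(e) => write!(f, "I/O error: {}", e),
            SketchError::Empty => write!(f, "input was empty"),
        }
    }
}

impl std::error::Error for SketchError {}

// Lets `?` convert std::io::Error into SketchError automatically.
impl From<std::io::Error> for SketchError {
    fn from(e: std::io::Error) -> Self {
        SketchError::Io(e)
    }
}

// The alias: callers write SketchResult<T> instead of Result<T, SketchError>.
type SketchResult<T> = Result<T, SketchError>;

// Returns the first line of a file, or an error if the file cannot be
// read or contains no lines.
fn read_first_line(path: &str) -> SketchResult<String> {
    let text = std::fs::read_to_string(path)?;
    text.lines()
        .next()
        .map(str::to_owned)
        .ok_or(SketchError::Empty)
}
```

Callers can then match on the error variants instead of calling `unwrap()`, which panics on failure.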