
extractous 0.1.1

Extractous provides a fast and efficient way to extract content from all kinds of file formats, including PDF, Word, Excel, CSV, and email. Internally it uses a natively compiled Apache Tika for formats that are not supported natively in Rust.

Extractous

Extractous is a Rust crate that provides a unified approach for detecting and extracting metadata and text content from various document types such as PDF, Word, HTML, and many other formats.

Features

  • High-level Rust API for extracting text and metadata from many file formats.
  • Strives to be efficient and fast.
  • Internally calls Apache Tika for any file format that is not natively supported in the Rust core.
  • Comprehensive documentation and examples to help you get started quickly.
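
The fallback described in the feature list can be pictured with a small sketch. This is not the crate's actual code: the Backend enum, the choose_backend function, and the NATIVE_EXTENSIONS list (which assumes csv is handled natively) are all hypothetical names and values chosen for illustration.

```rust
use std::path::Path;

/// Which engine would handle a file. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Backend {
    NativeRust,
    Tika,
}

/// Extensions assumed to have a native Rust parser (an illustrative assumption,
/// not a list taken from the crate).
const NATIVE_EXTENSIONS: &[&str] = &["csv"];

/// Picks the native backend when the extension is in the assumed native list,
/// and falls back to Tika for everything else, including files with no extension.
fn choose_backend(path: &Path) -> Backend {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext {
        Some(e) if NATIVE_EXTENSIONS.iter().any(|native| *native == e.as_str()) => {
            Backend::NativeRust
        }
        _ => Backend::Tika,
    }
}

fn main() {
    for name in ["report.pdf", "data.CSV", "notes"] {
        println!("{} -> {:?}", name, choose_backend(Path::new(name)));
    }
}
```

Treating Tika as the default arm means an unknown or missing extension still gets an extraction attempt instead of an error.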

Installation

To use extractous in your Rust project, add the following line to your Cargo.toml file:

[dependencies]
extractous = "0.1.1"

Supported file formats

File Format Native Rust Through Tika
pdf -
csv -

Building

  • GraalVM is required to build tika_native. We recommend installing it with sdkman.
  • To use AWT on macOS, use Bellsoft Liberica NIK (Java 22):
  • sdk install java 24.0.1.r22-nik
  • The build uses Gradle. The Gradle wrapper is included in the project, so there is no need to install Gradle separately.
  • Make sure JAVA_HOME points to the GraalVM JDK and not to any other JDK in your environment. Run java --version; you should see something like:
openjdk 22.0.1 2024-04-16
OpenJDK Runtime Environment Liberica-NIK-24.0.1-1 (build 22.0.1+10)
OpenJDK 64-Bit Server VM Liberica-NIK-24.0.1-1 (build 22.0.1+10, mixed mode, sharing)
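
The JAVA_HOME check above can also be automated. The following is a minimal sketch using only the Rust standard library; it is not part of extractous. The looks_like_graalvm helper and its marker strings ("graalvm", "liberica-nik") are assumptions based on the sample output above and may need adjusting for other GraalVM distributions.

```rust
use std::env;
use std::path::Path;
use std::process::Command;

/// Returns true when the `java --version` text names a GraalVM-based JDK.
/// The marker strings are assumptions taken from the sample output above.
fn looks_like_graalvm(version_output: &str) -> bool {
    let lower = version_output.to_lowercase();
    lower.contains("graalvm") || lower.contains("liberica-nik")
}

fn main() {
    // JAVA_HOME must be set for the check to make sense.
    let java_home = match env::var("JAVA_HOME") {
        Ok(path) => path,
        Err(_) => {
            eprintln!("JAVA_HOME is not set");
            return;
        }
    };

    // Run the java binary inside JAVA_HOME rather than whatever is first on PATH.
    let java = Path::new(&java_home).join("bin").join("java");
    match Command::new(&java).arg("--version").output() {
        Ok(out) => {
            // Check both streams, since some JDKs print version info to stderr.
            let text = format!(
                "{}{}",
                String::from_utf8_lossy(&out.stdout),
                String::from_utf8_lossy(&out.stderr)
            );
            if looks_like_graalvm(&text) {
                println!("JAVA_HOME looks like a GraalVM JDK: {}", java_home);
            } else {
                eprintln!("JAVA_HOME does not look like GraalVM:\n{}", text);
            }
        }
        Err(err) => eprintln!("could not run {}: {}", java.display(), err),
    }
}
```

Running this before the build reports a misconfigured JAVA_HOME directly instead of leaving it to surface as a failure during the tika_native build.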