[go: up one dir, main page]

Search Results for "docx metadata extraction"

Showing 61 open source projects for "docx metadata extraction"

View related business solutions
  • Gen AI apps are built with MongoDB Atlas Icon
    Gen AI apps are built with MongoDB Atlas

    The database for AI-powered applications.

    MongoDB Atlas is the developer-friendly database used to build, scale, and run gen AI and LLM-powered apps—without needing a separate vector database. Atlas offers built-in vector search, global availability across 115+ regions, and flexible document modeling. Start building AI apps faster, all in one place.
    Start Free
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 1
    ContextGem

    ContextGem

    ContextGem: Effortless LLM extraction from documents

    ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.​
    Downloads: 2 This Week
    Last Update:
    See Project
  • 2
    dude uncomplicated data extraction

    dude uncomplicated data extraction

    dude uncomplicated data extraction: A simple framework

    Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax. Dude is currently in Pre-Alpha. Please expect breaking changes. You can run your scraper from terminal/shell/command-line by supplying URLs, the output filename of your choice and the paths to your python scripts to dude scrape command.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Portable Executable Parser

    Portable Executable Parser

    lightweight Go package to parse, analyze and extract metadata

    Saferwall PE is a lightweight Go package for parsing, analyzing, and extracting metadata from Portable Executable (PE) binaries. Designed with malware analysis in mind, it is robust against malformed PE files and provides detailed insights into executable structures.​
    Downloads: 12 This Week
    Last Update:
    See Project
  • 4
    MegaParse

    MegaParse

    File Parser optimised for LLM Ingestion with no loss

    MegaParse is a file parser optimized for Large Language Model (LLM) ingestion, ensuring no loss of information. It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.
    Downloads: 0 This Week
    Last Update:
    See Project
  • AI-based, Comprehensive Service Management for Businesses and IT Providers Icon
    AI-based, Comprehensive Service Management for Businesses and IT Providers

    Modular solutions for change management, asset management and more

    ChangeGear provides IT staff with the functions required to manage everything from ticketing to incident, change and asset management and more. ChangeGear includes a virtual agent, self-service portals and AI-based features to support analyst and end user productivity.
    Learn More
  • 5
    Dungbeetle

    Dungbeetle

    A distributed job server

    Dungbeetle is a metadata and data lineage tracking tool developed by Zerodha to map and visualize how data flows across systems. It helps teams maintain data transparency by tracking dependencies between databases, tables, and reports, offering a centralized view of data pipelines. Dungbeetle is designed to enhance observability and trust in analytics ecosystems.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 6
    PDFPatcher

    PDFPatcher

    A versatile toolkit for PDF manipulation

    PDFPatcher (aka “PDF补丁丁”) is a versatile toolkit for PDF manipulation—editing document metadata, bookmarks, page layout, content restrictions, rotation, compression, merging/splitting, image extraction, and more, all within an intuitive interface. Merge/split PDFs or images, preserve or add bookmarks, and set page dimensions. Batch style/color/target changes, regex/XPath search/replace, mid‑page positioning. Modify PDF metadata, page numbers, links, initial view mode, and remove open actions.
    Downloads: 24 This Week
    Last Update:
    See Project
  • 7
    GROBID

    GROBID

    A machine learning software for extracting information

    ...All the usual publication metadata are covered (including DOI, PMID, etc.).
    Downloads: 3 This Week
    Last Update:
    See Project
  • 8
    MinerU

    MinerU

    A high-quality tool for convert PDF to Markdown and JSON

    MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 9
    Docspell

    Docspell

    Assist in organizing your piles of documents

    Docspell is a personal document organizer. Or sometimes called a "Document Management System" (DMS). You'll need a scanner to convert your papers into files. Docspell can then assist in organizing the resulting mess. It can unify your files from scanners, emails, and other sources. It is targeted for home use, i.e. families, households, and also for smaller groups/companies. You can associate tags, set correspondent,s and lots of other predefined and custom metadata. If your documents are...
    Downloads: 4 This Week
    Last Update:
    See Project
  • Comet Backup - Fast, Secure Backup Software for MSPs Icon
    Comet Backup - Fast, Secure Backup Software for MSPs

    Fast, Secure Backup Software for Businesses and IT Providers

    Comet is a flexible backup platform, giving you total control over your backup environment and storage destinations.
    Learn More
  • 10
    Trafilatura

    Trafilatura

    Python & command-line tool to gather text on the Web

    Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    Director

    Director

    AI video agents framework for next-gen video interactions

    Director is a video database management system designed to organize, search, and retrieve large collections of video content efficiently.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    tika-python

    tika-python

    Python binding to the Apache Tika™ REST services

    A Python port of the Apache Tika library that makes Tika available using the Tika REST Server. This makes Apache Tika available as a Python library, installable via Setuptools, Pip and easy to install. To use this library, you need to have Java 7+ installed on your system as tika-python starts up the Tika REST server in the background. To get this working in a disconnected environment, download a tika server file (both tika-server.jar and tika-server.jar.md5, which can be found here) and set...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    ArXiv MCP Server

    ArXiv MCP Server

    A Model Context Protocol server for searching and analyzing arXiv

    arxiv-mcp-server bridges AI assistants and the arXiv repository through a clean MCP interface, enabling search, metadata retrieval, and content access without bespoke scraping. With simple tools like “search” and “fetch,” an agent can find papers, pull abstracts, and download PDFs for downstream summarization or analysis. The project includes packaging and CI to publish to PyPI, plus tests and linting for reliability. Issue threads show feature requests such as extracting embedded LaTeX and...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    UCO3D

    UCO3D

    Uncommon Objects in 3D dataset

    ...The repository includes automated downloaders with checksum verification, fine-grained controls to fetch only selected modalities or super-categories, and a lightweight Python API for loading frames, geometry, and splats on demand. Metadata is indexed in SQLite for quick queries at scale, and helper builders handle alignment, undistortion, frame extraction from videos, and cropping around the object.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    deepdoctection

    deepdoctection

    A Repo For Document AI

    DeepDoctection is a document AI framework that applies deep learning techniques to analyze and extract structured data from scanned documents, PDFs, and images. deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models but enables you to build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks and provides an integrated frameworks for...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    Down

    Down

    Streaming downloads using Net::HTTP, http.rb or HTTPX

    Down is a small, reliable Ruby library for downloading files that favors correctness, streaming, and clear error handling. It follows redirects safely, supports timeouts and retries, and streams responses to disk to keep memory usage low—ideal for large downloads or server environments. The API returns file-like objects (often Tempfile) with helpful metadata such as original filename and content type, which plays nicely with file-attachment libraries and background jobs. Multiple HTTP...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    yt-dlp

    yt-dlp

    A youtube-dl fork with additional features and fixes

    yt-dlp is a youtube-dl fork based on the now inactive youtube-dlc. The main focus of this project is adding new features and patches while also keeping up to date with the original project
    Downloads: 332 This Week
    Last Update:
    See Project
  • 18
    AnyTXT Searcher

    AnyTXT Searcher

    A Powerful Desktop Full-Text Search Engine, Just Like Local Google.

    ...It has a powerful document parsing engine built in, which extracts the text of commonly used file formats without installing any other software, and combines the built-in high-speed indexing system to store the metadata of the text. You can quickly find any text in any file on your disk by Anytxt almost in 0.1 second. It works on Windows 11,10, 8, 7, Vista, XP, 2008, 2012, 2016,2022... AnyTXT Searcher supports the following file formats: Plain text (txt, cpp, py, html, etc.) Microsoft OneNote (one) Microsoft Word (doc, docx) Microsoft Excel (xls, xlsx) Microsoft PowerPoint (ppt, pptx) PDF WPS Office (wps, et, dps) EBook (epub, mobi, azw3, fb2 etc.) ...
    Leader badge">
    Downloads: 4,516 This Week
    Last Update:
    See Project
  • 19
    DataExtract

    DataExtract

    Extracts Data Types Like Email Addresses From All Kinds Of Files

    DataExtract is a program that scans files of many different types - text, PDF, Word, Excel etc, extracting all kinds of structured patterns, like email addresses and phone numbers, from them.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 20

    YoungerSibling

    YoungerSibling: Cross-platform OSINT tool for quick data gathering.

    YoungerSibling is a Python-based terminal utility script designed for educational purposes. It provides a set of useful tools to perform tasks like searching the web, performing lookups (Google search, IP lookup, username lookup, etc.), and extracting metadata from images, directly from the terminal. This project aims to help students, developers, and hobbyists learn about web scraping, API usage, and terminal interaction with Python.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 21
    DocWire SDK

    DocWire SDK

    Award-winning modern data processing SDK in C++20

    DocWire SDK, a standout C++20AI driven data processing tool, has received award from SourceForge and strong backing from Microsoft. It handles nearly 100 file types, empowering efficient text extraction, web data extraction, and document analysis. For businesses, the shift to DocWire SDK signifies a leap forward. It promises comprehensive document format support and the ability to extract valuable insights from email boxes, databases, and websites using cutting-edge AI. DocWire SDK aims to...
    Leader badge">
    Downloads: 25 This Week
    Last Update:
    See Project
  • 22
    Honeyview

    Honeyview

    Fast and lightweight image viewer

    Honeyview is a fast and lightweight image viewer that supports a wide range of image formats, including popular ones like JPG, PNG, and BMP, as well as less common formats such as PSD and RAW. Designed for users who need an efficient tool for viewing and organizing images, Honeyview provides a clean and intuitive interface with high-speed performance. It also supports viewing images within compressed archives, such as ZIP or RAR files, without the need for extraction. Additionally, Honeyview...
    Downloads: 64 This Week
    Last Update:
    See Project
  • 23
    TextureAtlas Extractor and GIF Generator

    TextureAtlas Extractor and GIF Generator

    A powerful, free and open-source tool for extracting frames and animat

    TextureAtlas to GIF and Frames converts texture atlases into organized frame collections and GIF/WebP/APNG animations. Perfect for creating showcases and galleries of game sprites. Supported Spritesheet Types * Starling / Sparrow (XML) * TextPacker (TXT) + Can now also attempt to extract unsupported spritesheet formats through pixel, color and bounding box detection Official Download Sites: * https://sourceforge.net/projects/textureatlas-to-gif-and-frames/ *...
    Downloads: 18 This Week
    Last Update:
    See Project
  • 24

    joy of text

    Editor with scripting language, security features & system interfaces.

    Jot was developed general purpose editor for large CAD files. It's command-driven UI requires no mode switching and hence requires fewer keystrokes to get a typical job done. It is particularly useful for checking and cross-referencing between several source, intermediate and output files - a common requirement for CAD work. But jot's usefulness doesn't stop there. It's sophisticated search features can, for example, be used for interactive data mining or automating the extraction of...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 25
    Amplify

    Amplify

    Automatic enrichment, enhancement, and explanation of your data

    Amplify attaches afterburners to your data. Amplify explains metadata extraction, classification, tagging, and reporting. Eriches derivative data generation like thumbnails, previews, conversions, etc. Enhances batteries-included value-adds like data quality reports, image augmentation, OCR, translations, etc. Amplify leverages the decentralized compute provided by Bacalhau to magically enrich your data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • Next