[go: up one dir, main page]

Search Results for "data extraction of website"

Showing 1011 open source projects for "data extraction of website"

View related business solutions
  • Gen AI apps are built with MongoDB Atlas Icon
    Gen AI apps are built with MongoDB Atlas

    The database for AI-powered applications.

    MongoDB Atlas is the developer-friendly database used to build, scale, and run gen AI and LLM-powered apps—without needing a separate vector database. Atlas offers built-in vector search, global availability across 115+ regions, and flexible document modeling. Start building AI apps faster, all in one place.
    Start Free
  • Optimize every aspect of hiring with Greenhouse Recruiting Icon
    Optimize every aspect of hiring with Greenhouse Recruiting

    Hire for what’s next.

    What’s next for many of us is changing. Your company’s ability to hire great talent is as important as ever – so you’ll be ready for whatever’s ahead. Whether you need to scale your team quickly or improve your hiring process, Greenhouse gives you the right technology, know-how and support to take on what’s next.
    Learn More
  • 1
    dude uncomplicated data extraction

    dude uncomplicated data extraction

    dude uncomplicated data extraction: A simple framework

    Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax. Dude is currently in Pre-Alpha. Please expect breaking changes. You can run your scraper from terminal/shell/command-line by supplying URLs, the output filename of your choice and the paths to your python scripts to dude scrape command.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Firecrawl

    Firecrawl

    Turn entire websites into LLM-ready markdown or structured data

    Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 3
    ExtractThinker

    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 4
    pdfly

    pdfly

    CLI tool to extract (meta)data from PDF and manipulate PDF files

    A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.
    Downloads: 7 This Week
    Last Update:
    See Project
  • JS7 JobScheduler is an open source workload automation solution. Icon
    JS7 JobScheduler is an open source workload automation solution.

    JS7 offers cross-platform job execution, managed file transfer, complex no-code job dependencies and a real REST API.

    JS7 JobScheduler is an open source workload automation solution. It is used to run executable files, shell scripts etc. and database procedures.
    Learn More
  • 5
    Parsera

    Parsera

    Lightweight library for scraping web-sites with LLMs

    Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 6
    OCRBase

    OCRBase

    MD/.JSON Document OCR and structured data extraction API

    OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 7
    Article Extractor

    Article Extractor

    To extract main article from given URL with Node.js

    A Node.js library for extracting main content from web articles, removing unnecessary clutter like ads and navigation elements.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 8
    MinerU

    MinerU

    A high-quality tool for convert PDF to Markdown and JSON

    MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 9
    Dendrite

    Dendrite

    Tools to build web AI agents that can authenticate

    Dendrite Python SDK is a toolkit for building web AI agents that can authenticate, interact with, and extract data from any website, facilitating web automation tasks.
    Downloads: 6 This Week
    Last Update:
    See Project
  • Fax.Cloud delivers encrypted, point-to-point faxing with guaranteed delivery and built-in audit trails Icon
    Fax.Cloud delivers encrypted, point-to-point faxing with guaranteed delivery and built-in audit trails

    For organizations in regulated industries needing a solution to replace traditional fax infrastructure and integrate with email or online portals

    Unlike email or file-sharing tools, Fax.Cloud doesn’t bounce around the internet, exposed and vulnerable. It’s direct, encrypted, and verified. You get delivery confirmation, audit trails, and peace of mind, without the spam filters, metadata leaks, or cyber threats.
    Learn More
  • 10
    X-Crawl

    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 11
    Scanopy

    Scanopy

    Clean network diagrams, One-time setup, zero upkeep

    Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. ...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 12
    ContextGem

    ContextGem

    ContextGem: Effortless LLM extraction from documents

    ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.​
    Downloads: 4 This Week
    Last Update:
    See Project
  • 13
    FModel

    FModel

    Unreal Engine Archives Explorer

    FModel is a freeware application designed to explore Unreal Engine games archives, allowing users to delve into the assets and structures of games developed with Unreal Engine.
    Downloads: 73 This Week
    Last Update:
    See Project
  • 14
    tsfresh

    tsfresh

    Automatic extraction of relevant features from time series

    tsfresh is a python package. It automatically calculates a large number of time series characteristics, the so called features. tsfresh is used to to extract characteristics from time series. Without tsfresh, you would have to calculate all characteristics by hand. With tsfresh this process is automated and all your features can be calculated automatically. Further tsfresh is compatible with pythons pandas and scikit-learn APIs, two important packages for Data Science endeavours in python....
    Downloads: 3 This Week
    Last Update:
    See Project
  • 15
    Skyvern

    Skyvern

    Automate browser-based workflows with LLMs and Computer Vision

    ...Skyvern understands how to solve CAPTCHAs to complete complicated workflows. Support for authenticating into user accounts, including support for 2FA/TOTP. Extract data from workflows in any schema of your choice including CSV or JSON. Automate procurement pipelines, breeze through government forms, and complete workflows in any language.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 16
    AgentQL MCP

    AgentQL MCP

    Model Context Protocol server that integrates AgentQL's data

    The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts. ​
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Browser Use

    Browser Use

    Make websites accessible for AI agents

    Browser-Use is a framework that makes websites accessible for AI agents, enabling automated interactions and data extraction from web pages.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 18
    audioFlux

    audioFlux

    A library for audio and music analysis, feature extraction

    A library for audio and music analysis, and feature extraction. Can be used for deep learning, pattern recognition, signal processing, bioinformatics, statistics, finance, etc. audioflux is a deep learning tool library for audio and music analysis, feature extraction. It supports dozens of time-frequency analysis transformation methods and hundreds of corresponding time-domain and frequency-domain feature combinations. It can be provided to deep learning networks for training and is used to...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 19
    Automa

    Automa

    A chrome extension for automating your browser by connecting blocks

    Automa is a browser extension for browser automation. From auto-fill forms, doing a repetitive task, taking a screenshot, to scraping data of the website, it's up to you what you want to do with this extension. Automa has provided various kinds of blocks that will help you do automation, and all you need to do is connect them. Want your workflow to run every day or every time you visit a specific website? You can set the workflow trigger on the trigger block. Try a workflow from the marketplace. ...
    Downloads: 20 This Week
    Last Update:
    See Project
  • 20
    zpdf

    zpdf

    Zero-copy PDF text extraction library written in Zig

    zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 21
    Ferret

    Ferret

    Declarative web scraping

    A web scraping system aiming to simplify data extraction from the web. ferret has a declarative query language that makes it easy to focus on the data that you need to get. ferret has the ability to scrape JS rendered pages, handle all page events, and emulate user interactions. the ferret was designed as a library from the ground up. it can be easily embedded into any Go application. ferret helps you to focus on the data you need using an easy-to-learn declarative language. ferret uses Chrome/Chromium via Chrome Devtools Protocol to handle dynamically rendered web pages. ferret is extremely extensible, and creating custom functions and types is super easy. ferret allows users to focus on the data. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 22
    mtail

    mtail

    Extract internal monitoring data from application logs

    Extract internal monitoring data from application logs for collection in a time-series database. mtail is a tool for extracting metrics from application logs to be exported into a timeseries database or timeseries calculator for alerting and dashboarding. It fills a monitoring niche by being the glue between applications that do not export their own internal state (other than via logs) and existing monitoring systems, such that system operators do not need to patch those applications to instrument them or writing custom extraction code for every such application. ...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 23
    Matomo

    Matomo

    Alternative to Google Analytics that gives you full control over data

    Google Analytics alternative that protects your data and your customers' privacy. Take back control with Matomo – a powerful web analytics platform that gives you 100% data ownership. You could lose your customers’ trust and risk damaging your reputation if people learn their data is used for Google’s “own purposes”. By choosing the ethical alternative, Matomo, you won’t make privacy sacrifices or compromise your site.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 24
    py-pdf-parser

    py-pdf-parser

    A Python tool to help extracting information from structured PDFs

    py-pdf-parser is a Python tool designed to help extract information from structured PDFs. It provides a simple interface to define parsing rules and extract data from PDF documents. ​
    Downloads: 8 This Week
    Last Update:
    See Project
  • 25
    WebHarvest - web data extraction tool
    Web data extraction (web data mining, web scraping) tool. It leverages well proved XML and text processing techologies in order to easely extract useful data from arbitrary web pages.
    Downloads: 13 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next