Search Results for "data extraction of website"

Sort By:

Showing 1011 open source projects for "data extraction of website"

View related business solutions

Gen AI apps are built with MongoDB Atlas
The database for AI-powered applications.

MongoDB Atlas is the developer-friendly database used to build, scale, and run gen AI and LLM-powered apps—without needing a separate vector database. Atlas offers built-in vector search, global availability across 115+ regions, and flexible document modeling. Start building AI apps faster, all in one place.

Start Free
Optimize every aspect of hiring with Greenhouse Recruiting
Hire for what’s next.

What’s next for many of us is changing. Your company’s ability to hire great talent is as important as ever – so you’ll be ready for whatever’s ahead. Whether you need to scale your team quickly or improve your hiring process, Greenhouse gives you the right technology, know-how and support to take on what’s next.

Learn More
1

dude uncomplicated data extraction

dude uncomplicated data extraction: A simple framework

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax. Dude is currently in Pre-Alpha. Please expect breaking changes. You can run your scraper from terminal/shell/command-line by supplying URLs, the output filename of your choice and the paths to your python scripts to dude scrape command.

Downloads: 0 This Week

Last Update: 2024-03-02
See Project
2

Firecrawl

Turn entire websites into LLM-ready markdown or structured data

Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each.

Downloads: 3 This Week

Last Update: 2026-02-02
See Project
3

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 7 This Week

Last Update: 2025-06-09
See Project
4

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 7 This Week

Last Update: 2025-10-13
See Project
JS7 JobScheduler is an open source workload automation solution.
JS7 offers cross-platform job execution, managed file transfer, complex no-code job dependencies and a real REST API.

JS7 JobScheduler is an open source workload automation solution. It is used to run executable files, shell scripts etc. and database procedures.

Learn More
5

Parsera

Lightweight library for scraping web-sites with LLMs

Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.

Downloads: 4 This Week

Last Update: 2025-10-08
See Project
6

OCRBase

MD/.JSON Document OCR and structured data extraction API

OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...

Downloads: 3 This Week

Last Update: 5 days ago
See Project
7

Article Extractor

To extract main article from given URL with Node.js

A Node.js library for extracting main content from web articles, removing unnecessary clutter like ads and navigation elements.

Downloads: 7 This Week

Last Update: 2025-09-04
See Project
8

MinerU

A high-quality tool for convert PDF to Markdown and JSON

MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.

Downloads: 11 This Week

Last Update: 2026-02-06
See Project
9

Dendrite

Tools to build web AI agents that can authenticate

Dendrite Python SDK is a toolkit for building web AI agents that can authenticate, interact with, and extract data from any website, facilitating web automation tasks.

Downloads: 6 This Week

Last Update: 2025-01-29
See Project
Fax.Cloud delivers encrypted, point-to-point faxing with guaranteed delivery and built-in audit trails
For organizations in regulated industries needing a solution to replace traditional fax infrastructure and integrate with email or online portals

Unlike email or file-sharing tools, Fax.Cloud doesn’t bounce around the internet, exposed and vulnerable. It’s direct, encrypted, and verified. You get delivery confirmation, audit trails, and peace of mind, without the spam filters, metadata leaks, or cyber threats.

Learn More
10

X-Crawl

Flexible Node.js AI-assisted crawler library

A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.

Downloads: 7 This Week

Last Update: 2025-04-06
See Project
11

Scanopy

Clean network diagrams, One-time setup, zero upkeep

Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. ...

Downloads: 12 This Week

Last Update: 4 days ago
See Project
12

ContextGem

ContextGem: Effortless LLM extraction from documents

ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.

Downloads: 4 This Week

Last Update: 2025-12-19
See Project
13

FModel

Unreal Engine Archives Explorer

FModel is a freeware application designed to explore Unreal Engine games archives, allowing users to delve into the assets and structures of games developed with Unreal Engine.

Downloads: 73 This Week

Last Update: 2025-12-19
See Project
14

tsfresh

Automatic extraction of relevant features from time series

tsfresh is a python package. It automatically calculates a large number of time series characteristics, the so called features. tsfresh is used to to extract characteristics from time series. Without tsfresh, you would have to calculate all characteristics by hand. With tsfresh this process is automated and all your features can be calculated automatically. Further tsfresh is compatible with pythons pandas and scikit-learn APIs, two important packages for Data Science endeavours in python....

Downloads: 3 This Week

Last Update: 2025-08-30
See Project
15

Skyvern

Automate browser-based workflows with LLMs and Computer Vision

...Skyvern understands how to solve CAPTCHAs to complete complicated workflows. Support for authenticating into user accounts, including support for 2FA/TOTP. Extract data from workflows in any schema of your choice including CSV or JSON. Automate procurement pipelines, breeze through government forms, and complete workflows in any language.

Downloads: 2 This Week

Last Update: 4 days ago
See Project
16

AgentQL MCP

Model Context Protocol server that integrates AgentQL's data

The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts.

Downloads: 0 This Week

Last Update: 2025-04-08
See Project
17

Browser Use

Make websites accessible for AI agents

Browser-Use is a framework that makes websites accessible for AI agents, enabling automated interactions and data extraction from web pages.

Downloads: 7 This Week

Last Update: 2 days ago
See Project
18

audioFlux

A library for audio and music analysis, feature extraction

A library for audio and music analysis, and feature extraction. Can be used for deep learning, pattern recognition, signal processing, bioinformatics, statistics, finance, etc. audioflux is a deep learning tool library for audio and music analysis, feature extraction. It supports dozens of time-frequency analysis transformation methods and hundreds of corresponding time-domain and frequency-domain feature combinations. It can be provided to deep learning networks for training and is used to...

Downloads: 8 This Week

Last Update: 2024-08-09
See Project
19

Automa

A chrome extension for automating your browser by connecting blocks

Automa is a browser extension for browser automation. From auto-fill forms, doing a repetitive task, taking a screenshot, to scraping data of the website, it's up to you what you want to do with this extension. Automa has provided various kinds of blocks that will help you do automation, and all you need to do is connect them. Want your workflow to run every day or every time you visit a specific website? You can set the workflow trigger on the trigger block. Try a workflow from the marketplace. ...

Downloads: 20 This Week

Last Update: 2025-08-11
See Project
20

zpdf

Zero-copy PDF text extraction library written in Zig

zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches. ...

Downloads: 2 This Week

Last Update: 2026-02-01
See Project
21

Ferret

Declarative web scraping

A web scraping system aiming to simplify data extraction from the web. ferret has a declarative query language that makes it easy to focus on the data that you need to get. ferret has the ability to scrape JS rendered pages, handle all page events, and emulate user interactions. the ferret was designed as a library from the ground up. it can be easily embedded into any Go application. ferret helps you to focus on the data you need using an easy-to-learn declarative language. ferret uses Chrome/Chromium via Chrome Devtools Protocol to handle dynamically rendered web pages. ferret is extremely extensible, and creating custom functions and types is super easy. ferret allows users to focus on the data. ...

Downloads: 3 This Week

Last Update: 2025-05-07
See Project
22

mtail

Extract internal monitoring data from application logs

Extract internal monitoring data from application logs for collection in a time-series database. mtail is a tool for extracting metrics from application logs to be exported into a timeseries database or timeseries calculator for alerting and dashboarding. It fills a monitoring niche by being the glue between applications that do not export their own internal state (other than via logs) and existing monitoring systems, such that system operators do not need to patch those applications to instrument them or writing custom extraction code for every such application. ...

Downloads: 5 This Week

Last Update: 2024-08-08
See Project
23

Matomo

Alternative to Google Analytics that gives you full control over data

Google Analytics alternative that protects your data and your customers' privacy. Take back control with Matomo – a powerful web analytics platform that gives you 100% data ownership. You could lose your customers’ trust and risk damaging your reputation if people learn their data is used for Google’s “own purposes”. By choosing the ethical alternative, Matomo, you won’t make privacy sacrifices or compromise your site.

Downloads: 7 This Week

Last Update: 2026-02-03
See Project
24

py-pdf-parser

A Python tool to help extracting information from structured PDFs

py-pdf-parser is a Python tool designed to help extract information from structured PDFs. It provides a simple interface to define parsing rules and extract data from PDF documents.

Downloads: 8 This Week

Last Update: 2025-04-28
See Project
25

WebHarvest - web data extraction tool

Web data extraction (web data mining, web scraping) tool. It leverages well proved XML and text processing techologies in order to easely extract useful data from arbitrary web pages.

14 Reviews

Downloads: 13 This Week

Last Update: 2025-10-25
See Project