[go: up one dir, main page]

Showing 288 open source projects for "crawler"

View related business solutions
  • La version gratuite d'Auth0 s'enrichit ! Icon
    La version gratuite d'Auth0 s'enrichit !

    Gratuit pour 25 000 utilisateurs avec intégration Okta illimitée : concentrez-vous sur le développement de vos applications.

    Vous l'avez demandé, nous l'avons fait ! Les versions gratuite et payante d'Auth0 incluent des options qui vous permettent de développer, déployer et faire évoluer vos applications en toute sécurité. Utilisez Auth0 dès maintenant pour découvrir tous ses avantages.
    Essayez Auth0 gratuitement
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 1
    Crawler Detect

    Crawler Detect

    CrawlerDetect is a PHP class for detecting bots/crawlers/spiders

    Crawler Detect is a PHP library that detects bots, crawlers, and spiders by analyzing user-agent headers and comparing them against a constantly updated list of known crawlers. It's useful for analytics, rate-limiting, or displaying alternative content for automated tools. It is fast, lightweight, and easy to integrate into any PHP application.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 2
    Spatie Crawler

    Spatie Crawler

    An easy to use, powerful crawler implemented in PHP

    Spatie Crawler is a PHP library that allows developers to crawl websites and extract information efficiently. It can be used for web scraping, link checking, or automated testing of web pages. The library is simple to use and supports customizable crawling strategies, including controlling crawl depth and handling redirects. It’s suitable for building crawlers that navigate large or dynamically generated websites.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 3
    SiteOne Crawler

    SiteOne Crawler

    SiteOne Crawler is a website analyzer and exporter

    SiteOne Crawler is a very useful and easy-to-use tool you'll ♥ as a Dev/DevOps, website owner or consultant. Works on all popular platforms - Windows, macOS, and Linux (x64 and arm64 too). It will crawl your entire website in depth, analyze and report problems, show useful statistics and reports, generate an offline version of the website, generate sitemaps, or send reports via email. Watch a detailed video with a sample report for Astro. build website. This crawler can be used as a command...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    EasySpider

    EasySpider

    A visual no-code/code-free web crawler/spider

    A visual code-free/no-code web crawler/spider, supporting both Chinese and English.
    Downloads: 16 This Week
    Last Update:
    See Project
  • Monitoring, Securing, Optimizing 3rd party scripts Icon
    Monitoring, Securing, Optimizing 3rd party scripts

    For developers looking for a solution to monitor, script, and optimize 3rd party scripts

    c/side is crawling many sites to get ahead of new attacks. c/side is the only fully autonomous detection tool for assessing 3rd party scripts. We do not rely purely on threat feed intel or easy to circumvent detections. We also use historical context and AI to review the payload and behavior of scripts.
    Learn More
  • 5
    Heritrix

    Heritrix

    Internet Archive's open-source, web-scale, web crawler project

    Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt. Heritrix is designed to respect the robots.txt exclusion directives...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 6
    TorBot

    TorBot

    Dark Web OSINT Tool

    Contributions to this project are always welcome. To add a new feature fork the dev branch and give a pull request when your new feature is tested and complete. If its a new module, it should be put inside the modules directory. The branch name should be your new feature name in the format <Feature_featurename_version(optional)>. On Linux platforms, you can make an executable for TorBot by using the install.sh script. You will need to give the script the correct permissions using chmod +x...
    Downloads: 14 This Week
    Last Update:
    See Project
  • 7
    WebMagic

    WebMagic

    A scalable web crawler framework for Java

    WebMagic is a scalable crawler framework. It covers the whole lifecycle of crawler, downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler. WebMagic is a simple but scalable crawler framework. You can develop a crawler easily based on it. WebMagic has a simple core with high flexibility, a simple API for html extracting. It also provides annotation with POJO to customize a crawler, and no configuration is needed. Some other features...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    miniblink49

    miniblink49

    Lighter, faster browser kernel of blink to integrate HTML UI in apps

    ... electron). Customize as you wish, simulate another browser environment. Perfect HTML5 support, friendly to various front-end libraries (support HTML5, and friendly to front framework). After turning off the cross-domain switch, you can use various cross-domain functions (support cross-domain). Headless mode, which greatly saves resources and is suitable for crawlers (headless mode, be suitable for Web Crawler).
    Downloads: 10 This Week
    Last Update:
    See Project
  • 9
    SiteOne Crawler (desktop app)

    SiteOne Crawler (desktop app)

    A free, feature-rich web analyzer and exporter/cloner you will love!

    A free in-depth website analyzer providing audits of security, performance, SEO, accessibility and other technical aspects. Available as a desktop application for Windows/macOS/Linux and as a CLI tool for advanced users and CI/CD processes. It also includes an offline web page exporter (website clone, mirror).
    Downloads: 2 This Week
    Last Update:
    See Project
  • AI-based, Comprehensive Service Management for Businesses and IT Providers Icon
    AI-based, Comprehensive Service Management for Businesses and IT Providers

    Modular solutions for change management, asset management and more

    ChangeGear provides IT staff with the functions required to manage everything from ticketing to incident, change and asset management and more. ChangeGear includes a virtual agent, self-service portals and AI-based features to support analyst and end user productivity.
    Learn More
  • 10
    crwlr

    crwlr

    Library for Rapid (Web) Crawler and Scraper Development

    This library provides kind of a framework and a lot of ready-to-use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with. Before diving into the library, let's have a look at the terms crawling and scraping. For most real-world use cases, those two things go hand in hand, which is why this library helps with and combines both. A (web) crawler is a program that (down)loads documents and follows the links in it to load them as well. A crawler could...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    DocSearch

    DocSearch

    The easiest way to add search to your documentation

    Initially created to fulfill our own developers' needs, DocSearch quickly evolved into a successful community project. Over the years, we've explored new ways to address the complexities of search for the open-source community. DocSearch understands how the user input fits into the context of your project and instantly presents the most relevant content with fewer interactions than any other method. With a design very close to the native experience on mobile, we leverage users acquaintance...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 12
    crawley

    crawley

    The unix-way web crawler

    Crawls web pages and prints any link it can find. Fast HTML SAX-parser (powered by golang.org/x/net/html) Small (below 1500 SLOC), idiomatic, 100% test-covered codebase. Grabs most of useful resources URLs (pics, videos, audios, forms, etc...) Found URLs are streamed to stdout and guaranteed to be unique (with fragments omitted) Scan depth (limited by starting host and path, by default - 0) can be configured. Can crawl rules and sitemaps from robots.txt. Brute mode - scan HTML comments for...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 13
    Nebula libp2p DHT

    Nebula libp2p DHT

    A libp2p DHT crawler, monitor, and measurement tool

    A libp2p DHT crawler and monitor that tracks the liveness of peers. The crawler connects to DHT bootstrap peers and then recursively follows all entries in their k-buckets until all peers have been visited. The crawler supports the IPFS, Filecoin, Polkadot, Kusama, Rococo, Westend networks and more. The crawler can store its results as JSON documents or in a postgres database - the --dry-run flag prevents it from doing either. Nebula will print a summary of the crawl at the end instead. A crawl...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 14
    Modlishka

    Modlishka

    Powerful and flexible HTTP reverse proxy

    Modlishka is a powerful and flexible HTTP reverse proxy. It implements an entirely new and interesting approach of handling browser-based HTTP traffic flow, which allows to transparently proxy of multi-domain destination traffic, both TLS and non-TLS, over a single domain, without the requirement of installing any additional certificate on the client. What exactly does this mean? In short, it simply has a lot of potential, that can be used in many use case scenarios. Modlishka was written as...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 15
    X-Crawl

    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 16
    Snap Lens Web Crawler

    Snap Lens Web Crawler

    Crawl and download Snap Lenses from lens.snapchat.com with ease.

    Crawl and download Snap Lenses from lens.snapchat.com with ease. This crawler is a dependency of Snap Camera Server https://snap-camera-server.sourceforge.io
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Proxy_Pool

    Proxy_Pool

    Python crawler proxy IP pool (proxy pool)

    The main function of the crawler agent IP pool project is to regularly collect free agents published on the Internet for verification and storage, and to regularly verify and store agents to ensure the availability of agents, and to provide API and CLI. At the same time, you can also expand the proxy source to increase the quality and quantity of the proxy pool IP.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    ngx_waf

    ngx_waf

    Handy, High performance, ModSecurity compatible Nginx firewall module

    Handy, High-performance Nginx firewall module. Such as black and white list of IPs or IP range, uri black and white list, and request body black list, etc. Directives and rules are easy to write and readable. The IP detection is a constant-time operation. Most of the remaining inspections use caching to improve performance. Compatible with ModSecurity's rules, you can use OWASP ModSecurity Core Rule Set. Supports verifying Google, Bing, Baidu and Yandex crawlers and allowing them...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 19
    Crawl4AI

    Crawl4AI

    Open-source LLM Friendly Web Crawler & Scraper

    Crawl4AI is a high-performance, AI‑ready web crawler tailored for LLM data ingestion and RAG pipelines. It supports adaptive crawling heuristics (stopping when enough info is gathered), structured markdown output, and high-speed parallel execution. Designed to operate at scale with optional Docker deployment and framework integrations.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Colly

    Colly

    Elegant Scraper and Crawler Framework for Golang

    Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving. Clean API. Fast (>1k request/sec on a single core) Manages request delays and maximum concurrency per domain. Automatic cookie and session handling. Sync/async/parallel scraping. Distributed scraping. Caching, automatic encoding of non-unicode responses...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    RobotsTxt

    RobotsTxt

    The repository contains Google's robots.txt parser

    This is a high-performance, production-tested library for parsing and evaluating robots.txt rules against crawler user agents. It implements the core semantics of the Robots Exclusion Protocol: user-agent sections, Allow/Disallow directives, wildcard handling, and precedence rules. The code is optimized for speed and low memory so large crawls can evaluate millions of URLs quickly. It also focuses on correctness—edge cases like overlapping patterns and longest-match resolution are handled...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22
    LambdaHack

    LambdaHack

    Haskell game engine library for roguelike dungeon crawlers

    Haskell game engine library for roguelike dungeon crawlers. LambdaHack is a Haskell game engine library for ASCII roguelike games of arbitrary theme, size and complexity, with optional tactical squad combat. It's packaged together with a sample dungeon crawler in a quirky fantasy setting. To use the engine, you need to specify the content to be procedurally generated. You declare what the game world is made of (entities, their relations, physics and lore) and the engine builds the world...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 23
    Roach

    Roach

    The complete web scraping toolkit for PHP

    Roach is a complete web scraping toolkit for PHP. It is a shameless clone heavily inspired by the popular Scrapy package for Python. Roach allows us to define spiders that crawl and scrape web documents. But wait, there’s more. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well. It’s your all-in-one resource for web scraping in PHP. Roach doesn’t depend on a specific framework. Instead, you can use the core package...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24
    Laravel Sitemap

    Laravel Sitemap

    Create and generate sitemaps with ease

    ... it in the callable you pass to hasCrawled. You can also instruct the underlying crawler to not crawl some pages by passing a callable to shouldCrawl. You can configure the crawler used by the sitemap generator. The sitemap generator can execute JavaScript on each page so it will discover links that are generated by your JS scripts. You can enable this feature by setting execute_javascript in the config file to true.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    DNS Crawler

    DNS Crawler

    A Bulk Domain Assessment Tool

    DNS Crawler is a lightweight, Python-based utility designed for efficient batch processing and assessment of internet domain names. It reads from a list of domains formatted as: domain_name <tab> or ; optional_comment and generates a detailed, Excel-compatible CSV report with columns including: DOMAIN: Domain name REG: Registrar SOA, NS, MX, TXT, SPF, DMARC, MS, A, PTR: Common DNS records for comprehensive domain analysis NOTE: Optional comments from the original input file...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next