corpus free download - SourceForge

Showing 194 open source projects for "corpus"

1

gensim

Topic Modelling for Humans

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Downloads: 4 This Week

Last Update: 2025-10-16
2

ArXiv MCP Server

A Model Context Protocol server for searching and analyzing arXiv

...Issue threads show feature requests such as extracting embedded LaTeX and improving markdown conversion, reflecting active community use in research flows. It’s designed to be drop-in for MCP clients, giving them typed inputs/outputs and predictable errors around a well-known academic corpus. For developers building research copilots, it removes the glue work of wiring arXiv APIs into an agent toolchain.

Downloads: 2 This Week

Last Update: 2026-01-26
3

Kimi K2

Kimi K2 is the large language model series developed by Moonshot AI

Kimi K2 is Moonshot AI’s advanced open-source large language model built on a scalable Mixture-of-Experts (MoE) architecture that combines a trillion total parameters with a subset of ~32 billion active parameters to deliver powerful and efficient performance on diverse tasks. It was trained on an enormous corpus of over 15.5 trillion tokens to push frontier capabilities in coding, reasoning, and general agentic tasks while addressing training stability through novel optimizer and architecture design strategies. The model family includes variants like a foundational base model that researchers can fine-tune for specific use cases and an instruct-optimized variant primed for general-purpose chat and agent-style interactions, offering flexibility for both experimentation and deployment. ...

Downloads: 117 This Week

Last Update: 2026-01-27
4

OSS-Fuzz Gen

LLM powered fuzzing via OSS-Fuzz

...The system integrates with modern LLM-assisted workflows to draft harness code and then iterates based on build errors or low coverage signals. Importantly, it aligns with OSS-Fuzz conventions, generating corpus seeds, build rules, and sanitizer settings so projects can plug in quickly. Reports highlight what functions were targeted, how coverage evolved, and where manual hints could unlock more paths. The goal is pragmatic: shrink the gap between “we should fuzz this” and “we have robust fuzzing running in CI,” especially for understaffed maintainers.

Downloads: 0 This Week

Last Update: 2025-10-12
5

Reor Project

Private & local AI personal knowledge management app

Reor is an AI-powered desktop note-taking app: it automatically links related notes, answers questions on your notes, provides semantic search and can generate AI flashcards. Everything is stored locally and you can edit your notes with an Obsidian-like markdown editor. The hypothesis of the project is that AI tools for thought should run models locally by default. Reor stands on the shoulders of the giants Ollama, Transformers.js & LanceDB to enable both LLMs and embedding models to run locally.

Downloads: 10 This Week

Last Update: 2025-04-13
6

Chronos Forecasting

Pretrained (Language) Models for Probabilistic Time Series Forecasting

...Once trained, probabilistic forecasts are obtained by sampling multiple future trajectories given the historical context. Chronos models have been trained on a large corpus of publicly available time series data, as well as synthetic data generated using Gaussian processes.

Downloads: 3 This Week

Last Update: 2025-12-17
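The sampling step described for Chronos can be illustrated with a minimal sketch. It does not use the Chronos API: `sample_trajectories` is a hypothetical toy random-walk sampler standing in for a pretrained model, and `quantile_forecast` is an assumed helper that turns the sampled paths into per-step quantiles.

```python
import random
import statistics


def sample_trajectories(history, horizon, num_samples, seed=0):
    """Draw future paths from a toy random-walk model (a stand-in, not Chronos)."""
    rng = random.Random(seed)
    # Estimate drift and noise from the historical first differences.
    diffs = [b - a for a, b in zip(history, history[1:])]
    drift = statistics.mean(diffs)
    scale = statistics.pstdev(diffs) or 1.0
    paths = []
    for _ in range(num_samples):
        value, path = history[-1], []
        for _ in range(horizon):
            value += rng.gauss(drift, scale)
            path.append(value)
        paths.append(path)
    return paths


def quantile_forecast(paths, q):
    """Per-step empirical quantile across the sampled paths."""
    forecast = []
    for t in range(len(paths[0])):
        column = sorted(path[t] for path in paths)
        idx = min(len(column) - 1, int(q * len(column)))
        forecast.append(column[idx])
    return forecast


history = [10.0, 11.0, 12.5, 13.0, 14.2, 15.1]
paths = sample_trajectories(history, horizon=4, num_samples=200)
low, median, high = (quantile_forecast(paths, q) for q in (0.1, 0.5, 0.9))
```

The spread between `low` and `high` at each step is what makes the forecast probabilistic rather than a single point estimate.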
7

CodeGeeX4

CodeGeeX4-ALL-9B, a versatile model for all AI software development

CodeGeeX4 is the fourth-generation open source multilingual code large language model (LLM) developed by ZhipuAI. Designed as a powerful AI coding assistant, it supports over 100 programming languages and has been trained on a massive code and natural language corpus. Compared to its predecessors, CodeGeeX4 introduces improved reasoning, stronger alignment with developer needs, and better performance on real-world programming benchmarks. It supports tasks such as code completion, generation from natural language descriptions, code translation, bug fixing, and explanation. The repository provides model checkpoints, inference examples, and fine-tuning guides, making it adaptable for both research and practical software development workflows. ...

Downloads: 4 This Week

Last Update: 13 hours ago
8

IMS Open Corpus Workbench

Indexing and query tools for very large text corpora

The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend e.g. from a Perl script, or through the Web-based GUI CQPweb.

Downloads: 35 This Week

Last Update: 5 days ago
9

nanochat

The best ChatGPT that $100 can buy

nanochat is a from-scratch, end-to-end “mini ChatGPT” that shows the entire path from raw text to a chatty web app in one small, dependency-lean codebase. The repository stitches together every stage of the lifecycle: tokenizer training, pretraining a Transformer on a large web corpus, mid-training on dialogue and multiple-choice tasks, supervised fine-tuning, optional reinforcement learning for alignment, and finally efficient inference with caching. Its north star is approachability and speed: you can boot a fresh GPU box and drive the whole pipeline via a single script, producing a usable chat model in hours and a clear markdown report of what happened. ...

Downloads: 4 This Week

Last Update: 4 days ago
10

Step3-VL-10B

Multimodal model achieving SOTA performance

...Despite having only about 10 billion parameters, it delivers performance that rivals or even surpasses much larger models (10×–20× larger) on a wide range of multimodal benchmarks covering reasoning, perception, and complex tasks, positioning it as one of the most powerful models in its class. It achieves this efficiency and strong performance through unified pre-training on a massive 1.2 trillion-token multimodal corpus that jointly optimizes a language-aligned perception encoder with a powerful decoder, creating deep synergy between image processing and text understanding.

Downloads: 3 This Week

Last Update: 2026-01-22
11

DeepSeek Coder

DeepSeek Coder: Let the Code Write Itself

DeepSeek-Coder is a series of code-specialized language models designed to generate, complete, and infill code (and mixed code + natural language) with high fluency in both English and Chinese. The models are trained from scratch on a massive corpus (~2 trillion tokens), of which about 87% is code and 13% is natural language. This dataset covers project-level code structure (not just line-by-line snippets), using a large context window (e.g. 16K) and a secondary fill-in-the-blank objective to encourage better contextual completions and infilling. Multiple sizes of the model are offered (e.g. 1B, 5.7B, 6.7B, 33B) so users can trade off inference cost vs capability. ...

Downloads: 5 This Week

Last Update: 2025-11-11
12

Honggfuzz

Security oriented software fuzzer

honggfuzz is a general-purpose, high-performance fuzzer that mixes coverage feedback with practical crash triage to uncover memory-safety and logic bugs. It supports multiple fuzzing modes—stdin, file, and networking—so targets can be exercised the same way they run in production. Instrumentation via compiler hooks or hardware/perf counters guides mutations toward previously unseen edges, while persistent mode keeps the target process alive to amortize startup costs. The tool integrates...

Downloads: 1 This Week

Last Update: 2025-10-09
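The idea of letting coverage feedback steer mutations toward unseen edges can be sketched in a few lines. This is a toy illustration, not honggfuzz's implementation: `target`, `mutate`, and `fuzz` are hypothetical names, and the "instrumentation" is simply a set of branch ids returned by the target.

```python
import random


def target(data: bytes) -> set:
    """Toy instrumented target: returns ids of the branches ("edges") it executed."""
    edges = {0}
    if len(data) > 0 and data[0] == ord("F"):
        edges.add(1)
        if len(data) > 1 and data[1] == ord("U"):
            edges.add(2)
            if len(data) > 2 and data[2] == ord("Z"):
                edges.add(3)
    return edges


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte (input must be non-empty)."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seeds, iterations=20000, seed=0):
    """Keep a mutated input in the corpus only if it reaches an unseen edge."""
    rng = random.Random(seed)
    corpus = list(seeds)
    seen = set()
    for item in corpus:
        seen |= target(item)
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        new_edges = target(candidate) - seen
        if new_edges:
            seen |= new_edges
            corpus.append(candidate)
    return corpus, seen


corpus, seen = fuzz([b"AAAA"])
```

Because each kept input becomes a starting point for later mutations, nested conditions can be reached one byte at a time instead of requiring all bytes to be guessed at once.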
13

Echidna

Ethereum smart contract fuzzer

...It uses sophisticated grammar-based fuzzing campaigns based on a contract ABI to falsify user-defined predicates or Solidity assertions. We designed Echidna with modularity in mind, so it can be easily extended to include new mutations or test specific contracts in specific cases. Optional corpus collection, mutation and coverage guidance to find deeper bugs. Powered by Slither to extract useful information before the fuzzing campaign. Source code integration to identify which lines are covered after the fuzzing campaign. Curses-based retro UI, text-only or JSON output.

Downloads: 0 This Week

Last Update: 2026-01-16
14

VoxCPM

TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

...This design helps decouple semantic and acoustic information while preserving fine-grained prosody, leading to more stable and expressive generation than many discrete-token systems. Trained on a large 1.8-million-hour bilingual corpus, VoxCPM can infer appropriate speaking style from context, dynamically adjusting intonation, rhythm, and emotional tone. It supports zero-shot voice cloning from a short reference audio clip, capturing timbre, accent, and pacing to closely mimic a target speaker without per-speaker fine-tuning.

Downloads: 2 This Week

Last Update: 2025-12-05
15

Wikipedia2Vec

A tool for learning vector representations of words and entities

Wikipedia2Vec is an embedding learning tool that creates word and entity vector representations from Wikipedia, enabling NLP models to leverage structured and contextual knowledge.

Downloads: 1 This Week

Last Update: 2025-01-24
16

Syllabic Verse Analysis (SylVA)

Syllabifies and scans syllabic verse texts for metrical annotation

The tool syllabifies and scans texts written in syllabic verse for metrical corpus annotation. It is designed for Old French and Old Occitan and exports the results in PAULA format suitable for the ANNIS platform (http://corpus-tools.org/annis/). It was first used in preparing the metrical treebank containing the Old Occitan Boeci text (cf. Rainsford and Scrivner 2014); development then continued for use with the Old Gallo-Romance Corpus (http://www.ogr-corpus.org).

Downloads: 0 This Week

Last Update: 2025-08-18
17

ddc-dstar-core

ddc-dstar corpus management framework (legacy core)

Legacy core code for the ddc-dstar corpus management framework, suitable for use with dstar-gantry.

Downloads: 0 This Week

Last Update: 2025-11-15
18

iramuteq

IRAMUTEQ: R Interface for Multidimensional Analyses of Texts and Questionnaires. Data-processing software for text corpora or for individuals/characters tables. In particular, it can perform "ALCESTE"-type analyses.

Downloads: 830 This Week

Last Update: 2024-11-03
19

modnlp

Modular Suite of NLP Tools

...It provides an API and tools for (inverted) indexing, storage and retrieval of large amounts of text, with (XML-based) handling of metadata; tools for text categorisation, including functionality for XML parsing, term set reduction (and basic keyword extraction), probabilistic classifier induction, sample classification tools, and evaluation modules; and a suite of corpus management, curation and distributed access tools. If you use the tool, please consider referencing it using the following article: Luz, S., & Sheehan, S. (2020). Methods and visualization tools for the analysis of medical, political and scientific concepts in Genealogies of Knowledge. Palgrave Communications, 6(1), 1-20. ...

Downloads: 17 This Week

Last Update: 2025-12-01
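The inverted indexing mentioned in the modnlp description can be sketched as follows. This is a generic illustration in Python, not modnlp's own API: `tokenize`, `build_inverted_index`, and `search` are hypothetical names, and the three documents are placeholder text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Placeholder documents, invented for this example.
docs = {
    1: "The corpus contains medical texts.",
    2: "Political texts form a second corpus.",
    3: "Scientific concepts evolve.",
}
index = build_inverted_index(docs)
```

A query such as `search(index, "corpus texts")` intersects the posting lists for both terms, so lookup cost depends on the posting lists rather than on scanning every document.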
20

Dodge gpt

Bypass AI-content detection by GPTZero and others, making text undetectable

*New Update* DODGE V10 - STEALTH EDITION: The Only AI Text Humanizer That Defeats GPTZero. Current status: GPTZero resistant, verified 2026. Success rate: 60.7% against all detectors. Based on: real human corpus analysis. Dodge V10 isn't just another "synonym replacer" or "typo adder". It is a sophisticated neural text transformation engine that rewrites content to match the EXACT statistical fingerprint of human writing. 100% Free.

2 Reviews

Downloads: 18 This Week

Last Update: 3 days ago
21

TEXminer

Text Mining Classification for Texts in ASCII, Unicode and PDF Format.

...TEXminer allows Language Detection by Letter Frequency Analysis, finding important Words by Cooccurrence Analysis, Determination of Central Expressions, Thematic Text Classification (also Semantic Groups), Fingerprint Comparison and Word Frequency. Because TEXminer is not designed to have a Reference Corpus, Thematic Model Statistics uses Language Models (lexicons) to have Background Knowledge about certain Languages (English, German, French, Spanish, Italian, Russian), which are derived from the Decaleon Project. The Thematic Models for Standard Vocabulary have been extended (spring 2015). The Thematic Models for Technical Terms have been extended (2015). ...

Downloads: 2 This Week

Last Update: 2025-03-25
22

NLG-Eval

Evaluation code for various unsupervised automated metrics

NLG-Eval is a toolkit for evaluating the quality of natural language generation (NLG) outputs using multiple automated metrics such as BLEU, METEOR, and ROUGE.

Downloads: 3 This Week

Last Update: 2025-01-24
23

LF Aligner

LF Aligner helps translators create translation memories from texts and their translations. It relies on Hunalign for automatic sentence pairing. Input: txt, doc, docx, rtf, pdf, html. Output: tab delimited txt, TMX and xls. With web features. My email address is listed in readme.txt; for support, use the forum here. My personal website: www.farkastranslations.com.

13 Reviews

Downloads: 134 This Week

Last Update: 2023-09-04
24

Letrista

Automatic text generator

...Its effectiveness so far is limited: although it can produce semantically acceptable texts in most cases, it is unable to focus on a specific topic. The minimum corpus needed to generate a usable database is at least 100 pages of text, and results begin to improve from that figure onward.

Downloads: 0 This Week

Last Update: 2023-07-14
25

Paul Graham GPT

RAG on Paul Graham's essays

Paul Graham GPT is a specialized AI-powered search and chat app built on a corpus of essays from Paul Graham, giving users the ability to query and discuss his writings in a conversational way. The repo stores the full text of his essays (chunked), uses embeddings (e.g. via OpenAI embeddings) to allow semantic search over that corpus, and hosts a chat interface that combines retrieval results with LLM-based answering — enabling RAG (retrieval-augmented generation) over a fixed dataset. ...

Downloads: 1 This Week

Last Update: 2025-12-08
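The retrieval half of the RAG flow described for Paul Graham GPT (chunk, embed, rank by similarity, assemble a prompt) can be sketched as below. This is not the project's code: `embed` is a toy word-count vector standing in for a learned embedding model, all function names are hypothetical, and the chunk texts are invented placeholders rather than quotations from the essays.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """Combine retrieved chunks and the question into a prompt for an LLM."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Placeholder chunks, invented for this example.
chunks = [
    "Startups grow fastest when founders talk to users every week.",
    "Good essays start with a question the writer cannot yet answer.",
    "Programming languages differ in how much they let you abstract.",
]
top = retrieve("How do startups grow?", chunks, k=1)
```

The prompt returned by `build_prompt` would then be passed to whichever LLM the application uses; that call is left out here because it depends on the provider.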