  • 1
    NuMarkdown-8B-Thinking

    Reasoning-powered OCR VLM for converting complex documents to Markdown

    NuMarkdown-8B-Thinking is the first reasoning OCR vision-language model (VLM) designed to convert documents into clean Markdown optimized for retrieval-augmented generation (RAG). Built on Qwen 2.5-VL-7B and fine-tuned with synthetic Doc → Reasoning → Markdown examples, it generates thinking tokens before producing the final Markdown to better handle complex layouts and tables. It uses a two-phase training process: supervised fine-tuning (SFT) followed by reinforcement learning (GRPO) with a layout-centric reward for accuracy on challenging documents. The model excels at non-standard layouts and complex table structures, outperforming non-reasoning OCR systems like GPT-4o and OCRFlux, and competing with large closed-source reasoning models like Gemini 2.5. Thinking token usage can range from 20% to 500% of the final answer, depending on task difficulty. NuMarkdown-8B-Thinking is released under the MIT license and supports vLLM and Transformers for deployment.
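    As a rough illustration of the Transformers deployment route mentioned above, a minimal sketch is shown below; the repository id, processor classes, and chat-message format are assumptions based on the usual Qwen2.5-VL-style workflow, so verify them against the model card.
      # Minimal sketch (assumed repo id and Qwen2.5-VL-style processing; verify on the model card).
      import torch
      from PIL import Image
      from transformers import AutoProcessor, AutoModelForImageTextToText

      model_id = "numind/NuMarkdown-8B-Thinking"  # assumed Hugging Face id
      processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
      model = AutoModelForImageTextToText.from_pretrained(
          model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
      )

      image = Image.open("page.png")  # a scanned or rendered document page
      messages = [{"role": "user",
                   "content": [{"type": "image"},
                               {"type": "text", "text": "Convert this page to Markdown."}]}]
      prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
      inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
      out = model.generate(**inputs, max_new_tokens=4096)
      # Output contains the thinking tokens followed by the final Markdown.
      print(processor.decode(out[0], skip_special_tokens=True))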

  • 2
    Open Infra Index

    Production-tested AI infrastructure tools

    open-infra-index is a central “infrastructure index” repository maintained by DeepSeek AI that acts as a catalog and hub for a collection of production-tested AI infrastructure tools and internal building blocks they have open-sourced. Instead of a single monolithic codebase, it functions more like an index or launching point: linking and documenting a set of library repos (e.g. FlashMLA, DeepEP, DeepGEMM, 3FS, etc.) that together form DeepSeek’s infrastructure stack. The repo's README describes the project as sharing “humble building blocks” of their online service—code that is documented, deployed, and battle-tested in production. The timing of its opening matches DeepSeek’s “Open-Source Week” campaign (starting around February 2025) when they gradually released internal infrastructure components publicly. It is licensed under CC0-1.0 (Creative Commons Zero) to maximize openness.

  • 3
    OpenAI Quickstart Python

    Python example app from the OpenAI API quickstart tutorial

    openai-quickstart-python is an official OpenAI repository containing multiple Python quickstart applications that demonstrate how to use different OpenAI API endpoints, including Chat and Assistants. It provides practical, beginner-friendly examples to help developers quickly learn how to send requests, handle responses, and build basic applications using the OpenAI Python SDK. The examples folder includes small, self-contained projects showcasing common use cases like chat completions, tool usage, and interactive interfaces. Each example is designed to be easily runnable with minimal setup—requiring only Python, a virtual environment, and an API key. The repository also includes environment setup guides and example scripts, such as a simple Flask web app for chat interactions, allowing developers to test OpenAI API integrations locally. Overall, openai-quickstart-python serves as an essential starting point for developers looking to prototype and experiment with OpenAI-powered apps.
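    For orientation, the chat-completions pattern that the quickstart examples wrap in a Flask app boils down to a call like the sketch below (the model name and prompt are placeholders; OPENAI_API_KEY is read from the environment).
      # Minimal sketch of the chat-completions call the quickstart apps build on.
      from openai import OpenAI

      client = OpenAI()  # picks up OPENAI_API_KEY from the environment
      response = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder model name
          messages=[{"role": "user", "content": "Suggest a name for a pet superhero."}],
      )
      print(response.choices[0].message.content)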

  • 4
    OpenAI Realtime Console

    React app for inspecting, building and debugging with the Realtime API

    openai-realtime-console is a developer tool created by OpenAI that provides a web-based console for experimenting with the Realtime API. The Realtime API enables low-latency, interactive communication with language models, supporting use cases such as live conversations, real-time transcription, and interactive applications. This console serves as a reference implementation, showing how to establish WebRTC or WebSocket connections, send audio or text inputs, and receive model outputs in real time. It is built as a simple frontend that developers can run locally to test and understand how Realtime API interactions work. The project is intended as an educational and debugging resource rather than a production-ready application. By offering clear examples of streaming inputs and outputs, the console helps developers accelerate prototyping of real-time AI-powered applications.
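    The console itself is a React app, but the connection pattern it demonstrates can be sketched in Python too; the WebSocket URL, beta header, and event names below follow the Realtime API beta documentation and should be treated as assumptions to re-check against the current docs.
      # Hedged sketch of a Realtime API WebSocket session (endpoint, header, and event
      # names follow the beta documentation and may have changed since).
      import asyncio, json, os
      import websockets  # pip install websockets (>=13 uses additional_headers; older releases use extra_headers)

      async def main():
          url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
          headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                     "OpenAI-Beta": "realtime=v1"}
          async with websockets.connect(url, additional_headers=headers) as ws:
              # Ask the model for a text-only response to a simple instruction.
              await ws.send(json.dumps({
                  "type": "response.create",
                  "response": {"modalities": ["text"],
                               "instructions": "Say hello in one short sentence."},
              }))
              async for raw in ws:
                  event = json.loads(raw)
                  print(event.get("type"))
                  if event.get("type") == "response.done":
                      break

      asyncio.run(main())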

  • 5
    OpenAI Realtime Embedded

    Instructions on how to use the Realtime API on Microcontrollers

    openai-realtime-embedded is a repository that provides resources, SDKs, and example links for using OpenAI’s Realtime API on embedded hardware platforms such as microcontrollers. The goal is to enable low-latency conversational agents (e.g. voice-based assistants) running directly on constrained devices by leveraging WebRTC and streaming APIs to communicate with OpenAI services. The repo includes pointers to an ESP32 implementation (maintained on the esp32 branch) and notes that Espressif offers an official example, openai_demo. The main branch does not contain a full cross-platform embedded SDK; its core content is links and a minimal README, so it serves primarily as a launching point for integrating the Realtime API on microcontrollers.

  • 6
    OpenVLA 7B

    Vision-language-action model for robot control via images and text

    OpenVLA 7B is a multimodal vision-language-action model trained on 970,000 robot manipulation episodes from the Open X-Embodiment dataset. It takes camera images and natural language instructions as input and outputs normalized 7-DoF robot actions, enabling control of multiple robot types across various domains. Built on top of LLaMA-2 and DINOv2/SigLIP visual backbones, it allows both zero-shot inference for robot setups seen during pretraining and parameter-efficient fine-tuning for new domains. The model supports real-world robotics tasks and performs robustly in the environments covered by its pretraining data. Its actions include delta values for position, orientation, and gripper status, and can be un-normalized using robot-specific statistics. OpenVLA is MIT-licensed, fully open-source, and designed collaboratively by Stanford, Berkeley, Google DeepMind, and TRI. Deployment is facilitated via Python and Hugging Face tools, with flash attention support for efficient inference.
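    The Hugging Face workflow is roughly as sketched below; the predict_action helper and the unnorm_key value come from OpenVLA's model card and remote code, so treat the exact names and the example key as assumptions.
      # Sketch of zero-shot action prediction following the OpenVLA model-card pattern.
      # predict_action is provided by the model's remote code; "bridge_orig" is an
      # example un-normalization key, adjust it to your robot's statistics.
      import torch
      from PIL import Image
      from transformers import AutoModelForVision2Seq, AutoProcessor

      processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
      vla = AutoModelForVision2Seq.from_pretrained(
          "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
      ).to("cuda:0")

      image = Image.open("camera_frame.png")
      prompt = "In: What action should the robot take to pick up the red block?\nOut:"
      inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
      action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
      print(action)  # 7-DoF action: delta position, delta orientation, gripper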

  • 7
    PRM800K

    800,000 step-level correctness labels on LLM solutions to MATH problems

    PRM800K is a process supervision dataset accompanying the paper Let’s Verify Step by Step, providing 800,000 step-level correctness labels on model-generated solutions to problems from the MATH dataset. The repository releases the raw labels and the labeler instructions used in two project phases, enabling researchers to study how human raters graded intermediate reasoning. Data are stored as newline-delimited JSONL files tracked with Git LFS, where each line is a full solution sample that can contain many step-level labels and rich metadata such as labeler UUIDs, timestamps, generation identifiers, and quality-control flags. Each labeled step can include multiple candidate completions with ratings of -1, 0, or +1, optional human-written corrections (phase 1), and a chosen completion index, along with a final finish reason such as found_error, solution, bad_problem, or give_up.
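    A quick way to get a feel for the data is to stream one of the JSONL files and tally the step ratings and finish reasons; the field names used below (label, steps, completions, rating, finish_reason) mirror the schema described above but should be verified against the released files.
      # Sketch: tally step-level ratings and finish reasons in a PRM800K JSONL file.
      # File name and field names are assumptions based on the schema description.
      import json
      from collections import Counter

      ratings, finishes = Counter(), Counter()
      with open("phase2_train.jsonl") as f:
          for line in f:
              sample = json.loads(line)
              label = sample.get("label", {})
              finishes[label.get("finish_reason")] += 1
              for step in label.get("steps", []):
                  for completion in step.get("completions") or []:
                      ratings[completion.get("rating")] += 1

      print("ratings:", dict(ratings))
      print("finish reasons:", dict(finishes))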

  • 8
    Pearl

    A Production-ready Reinforcement Learning AI Agent Library

    Pearl is a production-ready reinforcement learning and contextual bandit agent library built for real-world sequential decision making. It is organized around modular components—policy learners, replay buffers, exploration strategies, safety modules, and history summarizers—that snap together to form reliable agents with clear boundaries and strong defaults. The library implements classic and modern algorithms across two regimes: contextual bandits (e.g., LinUCB, LinTS, SquareCB, neural bandits) and fully sequential RL (e.g., DQN, PPO-style policy optimization), with attention to practical concerns like nonstationarity and dynamic action spaces. Tutorials demonstrate end-to-end workflows on OpenAI Gym tasks and contextual-bandit setups derived from tabular datasets, emphasizing reproducibility and clear baselines. Pearl’s design favors clarity and deployability: metrics, logging, and evaluation harnesses are integrated so you can monitor learning, compare agents, and catch regressions.

  • 9
    Perception Models

    State-of-the-art Image & Video CLIP, Multimodal Large Language Models

    Perception Models is a state-of-the-art framework developed by Facebook Research for advanced image and video perception tasks. It introduces two primary components: the Perception Encoder (PE) for visual feature extraction and the Perception Language Model (PLM) for multimodal decoding and reasoning. The PE module is a family of vision encoders designed to excel in image and video understanding, surpassing models like SigLIP2, InternVideo2, and DINOv2 across multiple benchmarks. Meanwhile, PLM integrates with PE to power vision-language modeling, achieving results competitive with leading multimodal systems such as QwenVL2.5 and InternVL3, all while being fully reproducible with open data. The project supports a wide range of research applications, from visual recognition and dense prediction to fine-grained multimodal understanding. Additionally, it includes several large-scale open datasets for both image and video perception.

  • 10
    Phi-3-MLX

    Phi-3.5 for Mac: Locally-run Vision and Language Models

    Phi-3-Vision-MLX is an Apple MLX (machine learning on Apple silicon) implementation of Phi-3 Vision, a lightweight multi-modal model designed for vision and language tasks. It focuses on running vision-language AI efficiently on Apple hardware like M1 and M2 chips.

  • 11
    Profile Data

    Analyze computation-communication overlap in V3/R1

    profile-data is a repository that publishes profiling traces and metrics from DeepSeek’s training and inference infrastructure (especially during DeepSeek-V3 / R1 experiments). The profiling data targets insights into computation-communication overlap, pipeline scheduling (e.g. DualPipe), and how MoE / EP / parallelism strategies interact in real systems. The repository contains JSON trace files like train.json, prefill.json, decode.json, and associated assets. Users can load them into tools like Chrome tracing to inspect GPU idle times, overlapping operations, and scheduling alignment. The idea is to bring transparency to internal efficiency tradeoffs, enabling researchers to reproduce, analyze, or improve on DeepSeek’s parallelism strategies. The README explains how trace data corresponds to forward/backward chunks, settings (e.g. EP64, TP1, 4K sequence length), and notes that pipeline communication is excluded for simplicity.
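    Besides opening the files in chrome://tracing, the traces can be summarized programmatically; the sketch below assumes the standard Chrome trace-event format (events with name, ph, and dur fields, possibly wrapped in a traceEvents key), which should be confirmed against the actual files.
      # Sketch: aggregate per-operation time from a Chrome-format trace file.
      # Assumes standard trace-event fields (name, ph, dur in microseconds).
      import json
      from collections import defaultdict

      with open("train.json") as f:
          trace = json.load(f)
      events = trace["traceEvents"] if isinstance(trace, dict) else trace

      total_us = defaultdict(float)
      for ev in events:
          if ev.get("ph") == "X" and "dur" in ev:  # complete events carrying a duration
              total_us[ev.get("name", "?")] += ev["dur"]

      for name, us in sorted(total_us.items(), key=lambda kv: -kv[1])[:15]:
          print(f"{us / 1e3:10.1f} ms  {name}")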

  • 12
    Prompt-to-Prompt

    Latent Diffusion and Stable Diffusion Implementation

    Prompt-to-Prompt is a research codebase that demonstrates how to edit images generated by diffusion models using only changes to the text prompt. Instead of retraining or heavy fine-tuning, it manipulates the model’s cross-attention maps so the structure of the original image is largely preserved while semantics shift according to the revised prompt. The method supports gentle edits (e.g., style, color, lighting) as well as stronger semantic substitutions, and it can localize edits to specific words or regions by selectively updating attention. Because edits are steerable via prompt wording and token weighting, creators can iterate quickly, exploring variations without losing composition. The repository includes reference notebooks and scripts that plug into popular latent diffusion backbones, making it practical to try the technique on your own prompts and seeds. It is especially useful for workflows that need consistent framing, such as product shots, illustrations, and concept art.

  • 13
    PyTorch GAN Zoo

    A mix of GAN implementations including progressive growing

    PyTorch GAN Zoo is a comprehensive open research toolbox designed for experimenting with and developing Generative Adversarial Networks (GANs) using PyTorch. The project provides modular implementations of popular GAN architectures, including Progressive Growing of GANs (PGAN), DCGAN, and an experimental StyleGAN version. It is built to support both researchers and developers who want to train, evaluate, and extend GANs efficiently across diverse datasets such as CelebA-HQ, FashionGen, DTD, and CIFAR-10. In addition to core GAN training, the repository includes tools for model evaluation, such as Inception Score and SWD metrics, as well as advanced features like GDPP for diverse generation and AC-GAN conditioning for class-specific synthesis. The framework also supports “inspirational generation,” enabling style or content transfer from reference images through pre-trained models.
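    Pretrained PGAN checkpoints are exposed through torch.hub, so sampling images can be as short as the sketch below; the hub path and the buildNoiseData/test helpers follow the project's published hub integration and should be double-checked against the README.
      # Sketch: sample images from the pretrained celebAHQ PGAN via torch.hub.
      import torch

      model = torch.hub.load("facebookresearch/pytorch_GAN_zoo:hub", "PGAN",
                             model_name="celebAHQ-256", pretrained=True, useGPU=False)
      noise, _ = model.buildNoiseData(4)   # four latent vectors
      with torch.no_grad():
          images = model.test(noise)       # generated images, values roughly in [-1, 1]
      print(images.shape)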

  • 14
    PyTorch-BigGraph

    Generate embeddings from large-scale graph-structured data

    PyTorch-BigGraph (PBG) is a system for learning embeddings on massive graphs—think billions of nodes and edges—using partitioning and distributed training to keep memory and compute tractable. It shards entities into partitions and buckets edges so that each training pass only touches a small slice of parameters, which drastically reduces peak RAM and enables horizontal scaling across machines. PBG supports multi-relation graphs (knowledge graphs) with relation-specific scoring functions, negative sampling strategies, and typed entities, making it suitable for link prediction and retrieval. Its training loop is built for throughput: asynchronous I/O, memory-mapped tensors, and lock-free updates keep GPUs and CPUs fed even at extreme scale. The toolkit includes evaluation metrics and export tools so learned embeddings can be used in downstream nearest-neighbor search, recommendation, or analytics. In practice, PBG’s design lets practitioners train high-quality graph embeddings on graphs far too large to fit in a single machine’s memory.
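    As a downstream illustration of the nearest-neighbor use mentioned above, exported embeddings can be queried with plain NumPy; the sketch below assumes the embeddings were exported to a TSV of entity name followed by float values, one of the formats PBG's export tooling can produce.
      # Sketch: cosine nearest-neighbor lookup over embeddings exported to TSV
      # (assumes "entity<TAB>v1<TAB>v2..." rows; adjust to your export format).
      import numpy as np

      names, rows = [], []
      with open("entity_embeddings.tsv") as f:
          for line in f:
              parts = line.rstrip("\n").split("\t")
              names.append(parts[0])
              rows.append(np.array(parts[1:], dtype=np.float32))
      emb = np.stack(rows)
      emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize for cosine similarity

      query = names.index("some_entity_id")  # placeholder entity id
      scores = emb @ emb[query]
      for i in np.argsort(-scores)[1:6]:     # skip the query itself
          print(f"{scores[i]:.3f}  {names[i]}")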

  • 15
    QwQ-32B

    QwQ-32B is a reasoning-focused language model for complex tasks

    QwQ-32B is a 32.8 billion parameter reasoning-optimized language model developed by Qwen as part of the Qwen2.5 family, designed to outperform conventional instruction-tuned models on complex tasks. Built with RoPE positional encoding, SwiGLU activations, RMSNorm, and Attention QKV bias, it excels in multi-turn conversation and long-form reasoning. It supports an extended context length of up to 131,072 tokens and incorporates supervised fine-tuning and reinforcement learning for enhanced instruction-following capabilities. The model is capable of structured thinking and delivers competitive performance against top models like DeepSeek-R1 and o1-mini. Recommended usage involves prompts that start with <think>\n, non-greedy sampling, and standardized output prompts for math and multiple-choice tasks. For long inputs, it supports YaRN (Yet another RoPE extensioN method) for context scaling.
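    A plain Transformers generation call roughly follows the sketch below; the chat template is expected to append the leading <think>\n itself, and the non-greedy sampling values shown are the ones suggested on the model card (treat the exact numbers as assumptions).
      # Sketch: chat-style generation with QwQ-32B using the suggested non-greedy sampling.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "Qwen/QwQ-32B"
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto",
                                                   device_map="auto")

      messages = [{"role": "user", "content": "How many r's are in the word 'strawberry'?"}]
      text = tokenizer.apply_chat_template(messages, tokenize=False,
                                           add_generation_prompt=True)  # ends with <think>\n
      inputs = tokenizer(text, return_tensors="pt").to(model.device)
      out = model.generate(**inputs, max_new_tokens=2048, do_sample=True,
                           temperature=0.6, top_p=0.95)
      print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))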

  • 16
    Qwen-Image-Edit

    Advanced bilingual image editing with semantic control

    Qwen-Image-Edit is the image editing extension of Qwen-Image, a 20B parameter model that combines advanced visual and text-rendering capabilities for creative and precise editing. It leverages both Qwen2.5-VL for semantic control and a VAE Encoder for appearance control, enabling users to edit at both the content and detail level. The model excels at semantic edits like style transfer, object rotation, and novel view synthesis, while also handling precise appearance edits such as adding or removing elements without altering surrounding regions. A standout feature is its bilingual text editing in English and Chinese, which preserves original font, size, and style during modifications. Benchmarks confirm its state-of-the-art performance in image editing, establishing it as a reliable foundation for both artistic and practical tasks. Its applications span IP creation, meme generation, background changes, clothing edits, and fine corrections in artworks or calligraphy.

  • 17
    Qwen-VL

    Chat & pretrained large vision language model

    Qwen-VL is Alibaba Cloud’s vision-language large model family, designed to integrate visual and linguistic modalities. It accepts image inputs (with optional bounding boxes) and text, and produces text (and sometimes bounding boxes) as output. The model variants (VL-Plus, VL-Max, etc.) have been upgraded for better visual reasoning, text recognition from images, fine-grained understanding, and support for high image resolutions / extreme aspect ratios. Qwen-VL supports multilingual inputs and conversation (e.g. Chinese, English), and is aimed at tasks like image captioning, question answering on images (VQA, DocVQA), grounding (detecting objects or regions from textual queries), etc.

  • 18
    Qwen2-7B-Instruct

    Instruction-tuned 7B language model for chat and complex tasks

    Qwen2-7B-Instruct is a 7.62-billion-parameter instruction-tuned language model from the Qwen2 series developed by Alibaba's Qwen team. Built on a transformer architecture with SwiGLU activation and group query attention, it is optimized for chat, reasoning, coding, multilingual tasks, and extended context understanding up to 131,072 tokens. The model was pretrained on a large-scale dataset and aligned via supervised fine-tuning and direct preference optimization. It shows strong performance across benchmarks such as MMLU, MT-Bench, GSM8K, and HumanEval, often surpassing similarly sized open-source models. Designed for conversational use, it integrates with Hugging Face Transformers and supports long-context applications via YaRN and vLLM for efficient deployment.
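    For the vLLM route mentioned above, offline batch generation looks roughly like the sketch below; the chat prompt is built with the tokenizer's template, and YaRN long-context settings are left at their defaults.
      # Sketch: offline generation of Qwen2-7B-Instruct with vLLM.
      from transformers import AutoTokenizer
      from vllm import LLM, SamplingParams

      model_id = "Qwen/Qwen2-7B-Instruct"
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      llm = LLM(model=model_id)

      messages = [{"role": "user", "content": "Give me a one-line summary of RoPE."}]
      prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                             add_generation_prompt=True)
      outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=256))
      print(outputs[0].outputs[0].text)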

  • 19
    Qwen2.5-14B-Instruct

    Powerful 14B LLM with strong instruction and long-text handling

    Qwen2.5-14B-Instruct is a powerful instruction-tuned language model developed by the Qwen team, based on the Qwen2.5 architecture. It features 14.7 billion parameters and is optimized for tasks like dialogue, long-form generation, and structured output. The model supports context lengths up to 128K tokens and can generate up to 8K tokens, making it suitable for long-context applications. It demonstrates improved performance in coding, mathematics, and multilingual understanding across over 29 languages. Qwen2.5-14B-Instruct is built on a transformer backbone with RoPE, SwiGLU, RMSNorm, and attention QKV bias. It’s resilient to varied prompt styles and is especially effective for JSON and tabular data generation. The model is instruction-tuned and supports chat templating, making it ideal for chatbot and assistant use cases.

  • 20
    Qwen2.5-VL-3B-Instruct

    Qwen2.5-VL-3B-Instruct: Multimodal model for chat, vision & video

    Qwen2.5-VL-3B-Instruct is a 3.75 billion parameter multimodal model by Qwen, designed to handle complex vision-language tasks in both image and video formats. As part of the Qwen2.5 series, it supports image-text-to-text generation with capabilities like chart reading, object localization, and structured data extraction. The model can serve as an intelligent visual agent capable of interacting with digital interfaces and understanding long-form videos by dynamically sampling resolution and frame rate. It uses a SwiGLU and RMSNorm-enhanced ViT architecture and introduces mRoPE updates for robust temporal and spatial understanding. The model supports flexible image input (file path, URL, base64) and outputs structured responses like bounding boxes or JSON, making it highly versatile in commercial and research settings. It excels in a wide range of benchmarks such as DocVQA, InfoVQA, and AndroidWorld control tasks.

  • 21
    Qwen2.5-VL-7B-Instruct

    Multimodal 7B model for image, video, and text understanding tasks

    Qwen2.5-VL-7B-Instruct is a multimodal vision-language model developed by the Qwen team, designed to handle text, images, and long videos with high precision. Fine-tuned from Qwen2.5-VL, this 7-billion-parameter model can interpret visual content such as charts, documents, and user interfaces, as well as recognize common objects. It supports complex tasks like visual question answering, localization with bounding boxes, and structured output generation from documents. The model is also capable of video understanding with dynamic frame sampling and temporal reasoning, enabling it to analyze and respond to long-form videos. Built with an enhanced ViT architecture using window attention, SwiGLU, and RMSNorm, it aligns closely with Qwen2.5 LLM standards. The model demonstrates high performance across benchmarks like DocVQA, ChartQA, and MMStar, and even functions as a tool-using visual agent.
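    Image question answering with Transformers follows roughly the pattern below; the dedicated Qwen2_5_VLForConditionalGeneration class and the qwen-vl-utils helper are the ones referenced on the model card, so treat the exact imports as assumptions tied to your transformers version.
      # Sketch of image QA following the Qwen2.5-VL model-card pattern (needs a recent
      # transformers release with Qwen2.5-VL support plus the qwen-vl-utils package).
      from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
      from qwen_vl_utils import process_vision_info

      model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
      model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
          model_id, torch_dtype="auto", device_map="auto")
      processor = AutoProcessor.from_pretrained(model_id)

      messages = [{"role": "user", "content": [
          {"type": "image", "image": "https://example.com/chart.png"},  # placeholder URL
          {"type": "text", "text": "What does this chart show?"}]}]
      text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
      images, videos = process_vision_info(messages)
      inputs = processor(text=[text], images=images, videos=videos,
                         padding=True, return_tensors="pt").to(model.device)
      out = model.generate(**inputs, max_new_tokens=256)
      print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                   skip_special_tokens=True)[0])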

  • 22
    Qwen3-Next

    Qwen3-Next: 80B instruct LLM with ultra-long context up to 1M tokens

    Qwen3-Next-80B-A3B-Instruct is the flagship release in the Qwen3-Next series, designed as a next-generation foundation model for ultra-long context and efficient reasoning. With 80B total parameters and 3B activated at a time, it leverages hybrid attention (Gated DeltaNet + Gated Attention) and a high-sparsity Mixture-of-Experts architecture to achieve exceptional efficiency. The model natively supports a context length of 262K tokens and can be extended up to 1 million tokens using RoPE scaling (YaRN), making it highly capable for processing large documents and extended conversations. Multi-Token Prediction (MTP) boosts both training and inference, while stability optimizations such as weight-decayed and zero-centered layernorm ensure robustness. Benchmarks show it performs comparably to larger models like Qwen3-235B on reasoning, coding, multilingual, and alignment tasks while requiring only a fraction of the training cost.
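    Extending beyond the native 262K-token window is done by switching RoPE to YaRN scaling in the model config; a programmatic sketch is shown below, where the key names and the scaling factor follow the Qwen YaRN recipe and should be checked against the current model card.
      # Sketch: enable YaRN RoPE scaling on Qwen3-Next via the config (key names and
      # factor are assumptions drawn from the Qwen YaRN recipe; verify before use).
      from transformers import AutoConfig, AutoModelForCausalLM

      model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"
      config = AutoConfig.from_pretrained(model_id)
      config.rope_scaling = {
          "rope_type": "yarn",
          "factor": 4.0,                                # ~262K tokens * 4 ≈ 1M tokens
          "original_max_position_embeddings": 262144,
      }
      model = AutoModelForCausalLM.from_pretrained(model_id, config=config,
                                                   torch_dtype="auto", device_map="auto")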

  • 23
    Ring

    Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI

    Ring is a reasoning-focused Mixture-of-Experts (MoE) large language model (LLM) developed and open-sourced by inclusionAI, derived from the Ling model family. Its design emphasizes reasoning quality, efficiency, and modular expert activation: the “flash” variant (Ring-flash-2.0) keeps inference cheap by activating only a subset of experts per token, and training applies reinforcement-learning-based reasoning optimization. The architecture and memory design are tuned for efficient, capable reasoning at scale. For users located in mainland China, the model is also available on ModelScope.cn to speed up downloads.

  • 24
    SG2Im

    Code for "Image Generation from Scene Graphs", Johnson et al, CVPR 201

    sg2im is a research codebase that learns to synthesize images from scene graphs—structured descriptions of objects and their relationships. Instead of conditioning on free-form text alone, it leverages graph structure to control layout and interactions, generating scenes that respect constraints like “person left of dog” or “cup on table.” The pipeline typically predicts object layouts (bounding boxes and masks) from the graph, then renders a realistic image conditioned on those layouts. This separation lets the model reason about geometry and composition before committing to texture and color, improving spatial fidelity. The repository includes training code, datasets, and evaluation scripts so researchers can reproduce baselines and extend components such as the graph encoder or image generator. In practice, sg2im demonstrates how structured semantics can guide generative models to produce controllable, compositional imagery.
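    The scene-graph input is a small JSON structure of objects plus subject-relationship-object triples; the sketch below shows the general shape, with the field names and the list-of-graphs wrapping treated as assumptions to verify against the repository's example files.
      # Sketch: a scene graph in the general shape sg2im's demo script consumes
      # (an objects list plus [subject_index, predicate, object_index] triples).
      import json

      scene_graph = {
          "objects": ["sky", "grass", "sheep", "tree"],
          "relationships": [
              [0, "above", 1],         # sky above grass
              [2, "standing on", 1],   # sheep standing on grass
              [3, "right of", 2],      # tree right of sheep
          ],
      }
      with open("my_scene_graph.json", "w") as f:
          json.dump([scene_graph], f, indent=2)  # example scripts typically take a list of graphs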

  • 25
    Seamless Communication

    Foundational Models for State-of-the-Art Speech and Text Translation

    Seamless Communication is a Meta AI research project that open-sources a family of foundational models for speech and text translation. Its core model, SeamlessM4T, is a massively multilingual and multimodal translator handling speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation as well as automatic speech recognition across roughly 100 languages. The family also includes SeamlessExpressive, which aims to preserve the prosody, rhythm, and vocal style of the original speaker in translated speech, and SeamlessStreaming, which performs simultaneous translation with low latency by emitting output before the source utterance has finished. The repository provides pretrained checkpoints, inference and fine-tuning code, evaluation tooling, and demos built on the fairseq2 library, enabling researchers to reproduce results and build real-time, cross-lingual speech applications.