inference engine free download

Showing 72 open source projects for "inference engine"

View related business solutions

Gen AI apps are built with MongoDB Atlas
The database for AI-powered applications.

MongoDB Atlas is the developer-friendly database used to build, scale, and run gen AI and LLM-powered apps—without needing a separate vector database. Atlas offers built-in vector search, global availability across 115+ regions, and flexible document modeling. Start building AI apps faster, all in one place.

Start Free
Employees get more done with Rippling
Streamline your business with an all-in-one platform for HR, IT, payroll, and spend management.

Effortlessly manage the entire employee lifecycle, from hiring to benefits administration. Automate HR tasks, ensure compliance, and streamline approvals. Simplify IT with device management, software access, and compliance monitoring, all from one dashboard. Enjoy timely payroll, real-time financial visibility, and dynamic spend policies. Rippling empowers your business to save time, reduce costs, and enhance efficiency, allowing you to focus on growth. Experience the power of unified management with Rippling today.

Learn More
1

Transformer Engine

A library for accelerating Transformer models on NVIDIA GPUs

Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference. TE provides a collection of highly optimized building blocks for popular Transformer architectures and an automatic mixed precision-like API that can be used seamlessly with your framework-specific code.

Downloads: 1 This Week

Last Update: 2026-02-02
See Project
2

Temporal Inference Engine

A real time inference engine for temporal logical specifications

A real time inference engine for temporal logical specifications, which is able to acquire, process and generate any binary or real signal through POSIX IPC, files or UNIX sockets. Specifications of signals and dynamic systems are represented as special graphs and executed in real time, with a predictable sampling time of few milliseconds. Real time signal processing, dynamic system control, state machine modeling and logical property verification are some fields of application of this software. ...

1 Review

Downloads: 5 This Week

Last Update: 4 days ago
See Project
3

ReactiveMP.jl

High-performance reactive message-passing based Bayesian engine

ReactiveMP.jl is a Julia package that provides an efficient reactive message passing based Bayesian inference engine on a factor graph. The package is a part of the bigger and user-friendly ecosystem for automatic Bayesian inference called RxInfer. While ReactiveMP.jl exports only the inference engine, RxInfer provides convenient tools for model and inference constraints specification as well as routines for running efficient inference both for static and real-time datasets.

Downloads: 11 This Week

Last Update: 2026-02-02
See Project
4

vLLM

A high-throughput and memory-efficient inference and serving engine

vLLM is a fast and easy-to-use library for LLM inference and serving. High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.

Downloads: 30 This Week

Last Update: 2026-02-04
See Project
Get safely back to business
SafetyCulture iAuditor is designed for companies who need to conduct safety inspections & quality audits

Equip your team with a simple safety inspection and observation app that anyone can learn in minutes, so you can get safely back to business from wherever you are.

Learn More
5

SimpleLLM

950 line, minimal, extensible LLM inference engine built from scratch

SimpleLLM is a minimal, extensible large language model inference engine implemented in roughly 950 lines of code, built from scratch to serve both as a learning tool and a research platform for novel inference techniques. It provides the core components of an LLM runtime—such as tokenization, batching, and asynchronous execution—without the abstraction overhead of more complex engines, making it easier for developers and researchers to understand and modify. ...

Downloads: 0 This Week

Last Update: 2026-01-28
See Project
6

gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models

Gemma.cpp is a C++ implementation for running inference with Gemma models efficiently on CPUs and GPUs. Developed by Google, it allows running large language models (LLMs) like Gemma with minimal hardware, focusing on optimized performance and low latency. Gemma.cpp is intended for developers seeking to deploy LLMs in production environments without needing massive computational resources.

Downloads: 5 This Week

Last Update: 2025-03-25
See Project
7

HunyuanWorld-Voyager

RGBD video generation model conditioned on camera input

...The system jointly produces aligned RGB and depth video sequences, making it directly applicable to 3D reconstruction tasks. At its core, Voyager integrates a world-consistent video diffusion model with an efficient long-range world exploration engine powered by auto-regressive inference. To support training, the team built a scalable data engine that automatically curates large video datasets with camera pose estimation and metric depth prediction. As a result, Voyager delivers state-of-the-art performance on world exploration benchmarks while maintaining photometric, style, and 3D consistency.

Downloads: 48 This Week

Last Update: 2025-12-17
See Project
8

SAM 3

Code for running inference and finetuning with SAM 3 model

SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an...

Downloads: 69 This Week

Last Update: 2026-02-03
See Project
9

AI Runner

Offline inference engine for art, real-time voice conversations

AI Runner is an offline inference engine designed to run a collection of AI workloads on your own machine, including image generation for art, real-time voice conversations, LLM-powered chatbots and automated workflows. It is implemented as a desktop-oriented Python application and emphasizes privacy and self-hosting, allowing users to work with text-to-speech, speech-to-text, text-to-image and multimodal models without sending data to external services.

Downloads: 10 This Week

Last Update: 2025-12-11
See Project
AI Powered Global HCM for the Evolving World of Work
For Start-ups, SME's, Large Enterprise

Darwinbox is a new-age & disruptive mobile-first, cloud-based HRMS platform built for the large enterprises to attract, engage and nurture their most critical resource - talent. It is an end-to-end integrated HR system that aids in streamlining activities across the employee lifecycle (Hire to Retire). Our powerful enterprise product features are built with a clear focus on intuitiveness and scalability, with standards of best in class consumer apps. Darwinbox’s motto is to engage, empower, and inspire employees on one side in addition to automating and simplifying all HR processes for the enterprise on the other. Over 350+ leading enterprises with 850k users manage their entire employee lifecycle on this unified platform.

Learn More
10

Pruna AI

Pruna is a model optimization framework built for developers

Pruna is an open-source, self-hostable AI inference engine designed to help teams deploy and manage large language models (LLMs) efficiently across private or hybrid infrastructures. Built with performance and developer ergonomics in mind, Pruna simplifies inference workflows by enabling multi-model orchestration, autoscaling, GPU resource allocation, and compatibility with popular open-source models.

Downloads: 2 This Week

Last Update: 2026-01-22
See Project
11

llama2.c

Inference Llama 2 in one file of pure C

llama2.c is a minimalist implementation of the Llama 2 language model architecture designed to run entirely in pure C. Created by Andrej Karpathy, this project offers an educational and lightweight framework for performing inference on small Llama 2 models without external dependencies. It provides a full training and inference pipeline: models can be trained in PyTorch and later executed using a concise 700-line C program (run.c). While it can technically load Meta’s official Llama 2...

Downloads: 2 This Week

Last Update: 2 days ago
See Project
12

CTranslate2

Fast inference engine for Transformer models

CTranslate2 is a C++ and Python library for efficient inference with Transformer models. The project implements a custom runtime that applies many performance optimization techniques such as weights quantization, layers fusion, batch reordering, etc., to accelerate and reduce the memory usage of Transformer models on CPU and GPU. The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many...

Downloads: 2 This Week

Last Update: 2026-02-04
See Project
13

Open WebUI

User-friendly AI Interface

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with a built-in inference engine for Retrieval Augmented Generation (RAG), making it a powerful AI deployment solution. Key features include effortless setup via Docker or Kubernetes, seamless integration with OpenAI-compatible APIs, granular permissions and user groups for enhanced security, responsive design across devices, and full Markdown and LaTeX support for enriched interactions. ...

Downloads: 90 This Week

Last Update: 18 hours ago
See Project
14

OnnxStream

Lightweight inference library for ONNX files, written in C++

The challenge is to run Stable Diffusion 1.5, which includes a large transformer model with almost 1 billion parameters, on a Raspberry Pi Zero 2, which is a microcomputer with 512MB of RAM, without adding more swap space and without offloading intermediate results on disk. The recommended minimum RAM/VRAM for Stable Diffusion 1.5 is typically 8GB. Generally, major machine learning frameworks and libraries are focused on minimizing inference latency and/or maximizing throughput, all of which at the cost of RAM usage. So I decided to write a super small and hackable inference library specifically focused on minimizing memory consumption: OnnxStream. OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from WeightsProvider. ...

Downloads: 4 This Week

Last Update: 2024-08-14
See Project
15

marqo

Tensor search for humans

A tensor-based search and analytics engine that seamlessly integrates with your applications, websites, and workflows. Marqo is a versatile and robust search and analytics engine that can be integrated into any website or application. Due to horizontal scalability, Marqo provides lightning-fast query times, even with millions of documents. Marqo helps you configure deep-learning models like CLIP to pull semantic meaning from images. It can seamlessly handle image-to-image, image-to-text and...

Downloads: 1 This Week

Last Update: 2026-02-04
See Project
16

Secret Llama

Fully private LLM chatbot that runs entirely with a browser

Secret Llama is a privacy-first large-language-model chatbot that runs entirely inside your web browser, meaning no server is required and your conversation data never leaves your device. It focuses on open-source model support, letting you load families like Llama and Mistral directly in the client for fully local inference. Because everything happens in-browser, it can work offline once models are cached, which is helpful for air-gapped environments or travel. The interface mirrors the modern chat UX you’d expect—streaming responses, markdown, and a clean layout—so there’s no usability tradeoff to gain privacy. Under the hood it uses a web-native inference engine to accelerate model execution with GPU/WebGPU when available, keeping responses responsive even without a backend. ...

Downloads: 3 This Week

Last Update: 2025-11-07
See Project
17

stable-diffusion.cpp

Diffusion model(SD,Flux,Wan,Qwen Image,Z-Image,...) inference

stable-diffusion.cpp is a lightweight, high-performance implementation of Stable Diffusion and related generative models written entirely in portable C/C++, designed to run on virtually any device without heavy dependencies. It enables text-to-image and image-to-image generation, supports a growing set of models like SD1.x, SD2.x, SDXL, SD-Turbo, Qwen Image, and more, and is continually updated with support for cutting-edge model variants including video and image editing models. The project...

Downloads: 23 This Week

Last Update: 4 days ago
See Project
18

Superduper

Superduper: Integrate AI models and machine learning workflows

Superduper is a Python-based framework for building end-2-end AI-data workflows and applications on your own data, integrating with major databases. It supports the latest technologies and techniques, including LLMs, vector-search, RAG, and multimodality as well as classical AI and ML paradigms. Developers may leverage Superduper by building compositional and declarative objects that out-source the details of deployment, orchestration versioning, and more to the Superduper engine. This...

Downloads: 5 This Week

Last Update: 2025-08-26
See Project
19

FlowGram

Extensible workflow development framework

FlowGram is an open-source, node-based workflow development framework and toolkit aimed at helping developers build custom AI-workflow platforms or automation systems through a visual, drag-and-drop interface. Instead of shipping as a ready-made product, it provides the building blocks — a canvas for wiring together nodes, a form engine for configuring node parameters, a variable-scope and type-inference engine, and a set of “materials” (pre-built node types such as code execution, conditional logic, LLM calls, etc.) that can be composed into larger workflows. This makes FlowGram highly flexible: you can prototype data-processing pipelines, AI-agent flows, automation scripts, or even business process automation without writing all the plumbing yourself. ...

Downloads: 1 This Week

Last Update: 2026-01-19
See Project
20

DALI

A GPU-accelerated library containing highly optimized building blocks

...Deep learning applications require complex, multi-stage data processing pipelines that include loading, decoding, cropping, resizing, and many other augmentations. These data processing pipelines, which are currently executed on the CPU, have become a bottleneck, limiting the performance and scalability of training and inference. DALI addresses the problem of the CPU bottleneck by offloading data preprocessing to the GPU. Additionally, DALI relies on its own execution engine, built to maximize the throughput of the input pipeline.

Downloads: 0 This Week

Last Update: 2025-12-08
See Project
21

TensorRT Node for ComfyUI

Enables the best performance on NVIDIA RTX Graphics Cards

ComfyUI_TensorRT is an extension that lets ComfyUI run AI inference through NVIDIA’s TensorRT, aiming to get faster, more efficient execution on supported GPUs. It bridges the gap between ComfyUI’s flexible, node-based workflows and TensorRT’s highly optimized engine format. The result is that complex diffusion or image-processing graphs can be accelerated without the user having to rewrite the pipeline.

Downloads: 1 This Week

Last Update: 2025-10-30
See Project
22

Matrix

Multi-Agent daTa geneRation Infra and eXperimentation framework

...That design makes Matrix particularly well-suited for large-batch inference, model benchmarking, data curation, augmentation, or generation — whether for language, code, dialogue, or multimodal tasks. It supports both open-source LLMs and proprietary models (via integration with model backends), and works with containerized or sandboxed environments for safe tool execution or external code runs.

Downloads: 0 This Week

Last Update: 2025-12-30
See Project
23

TypeDB

TypeDB: a strongly-typed database

TypeDB is a strongly-typed database with a rich and logical type system. TypeDB empowers you to tackle complex problems, and TypeQL is its query language.TypeDB is a database with a rich and logical type system. TypeDB empowers you to solve complex problems, using TypeQL as its query language. TypeDB provides a strong type system for developers to break down complex problems into meaningful and logical systems. Through TypeQL, TypeDB provides powerful abstractions over low-level and complex...

Downloads: 3 This Week

Last Update: 2026-02-05
See Project
24

DeepCamera

Open-Source AI Camera. Empower any camera/CCTV

DeepCamera empowers your traditional surveillance cameras and CCTV/NVR with machine learning technologies. It provides open-source facial recognition-based intrusion detection, fall detection, and parking lot monitoring with the inference engine on your local device. SharpAI-hub is the cloud hosting for AI applications that helps you deploy AI applications with your CCTV camera on your edge device in minutes. SharpAI yolov7_reid is an open-source Python application that leverages AI technologies to detect intruders with traditional surveillance cameras. The source code is here It leverages Yolov7 as a person detector, FastReID for person feature extraction, Milvus the local vector database for self-supervised learning to identify unseen persons, Labelstudio to host images locally and for further usage such as label data and train your own classifier. ...

Downloads: 10 This Week

Last Update: 2024-08-15
See Project
25

Transcoder

Hardware-accelerated video transcoding using Android MediaCodec APIs

Transcoder by DeepMedia is an AI-powered video-to-video speech translation engine that enables fully automated multilingual dubbing. Unlike traditional speech translation systems that rely on multi-stage pipelines, Transcoder directly translates one speaker’s video into another language while preserving facial expressions, lip-sync, and vocal identity. Designed for real-time use and production-grade pipelines, Transcoder combines advanced deep learning models with GPU acceleration to deliver...

Downloads: 8 This Week

Last Update: 2025-03-25
See Project