
gflow 0.4.3

A lightweight, single-node job scheduler written in Rust.

gflow - A lightweight, single-node job scheduler


gflow is a lightweight, single-node job scheduler written in Rust, inspired by Slurm. It is designed for efficiently managing and scheduling tasks, especially on machines with GPU resources.

Core Features

  • Daemon-based Scheduling: A persistent daemon (gflowd) manages the job queue and resource allocation.
  • Rich Job Submission: Supports dependencies, priorities, job arrays, and time limits via the gbatch command.
  • Time Limits: Set maximum runtime for jobs (similar to Slurm's --time) to prevent runaway processes.
  • Service and Job Control: Provides clear commands to inspect the scheduler state (ginfo), query the job queue (gqueue), and control job states (gcancel).
  • tmux Integration: Uses tmux for robust, background task execution and session management.
  • Output Logging: Automatic capture of job output to log files via tmux pipe-pane.
  • Simple Command-Line Interface: Offers a user-friendly and powerful set of command-line tools.

Component Overview

The gflow suite consists of several command-line tools:

  • gflowd: The scheduler daemon that runs in the background, managing jobs and resources.
  • ginfo: Displays scheduler and GPU information.
  • gbatch: Submits jobs to the scheduler, similar to Slurm's sbatch.
  • gqueue: Lists and filters jobs in the queue, similar to Slurm's squeue.
  • gcancel: Cancels jobs and manages job states (internal use).

Installation

Quick Install (Linux x86_64)

Install gflow with a single command:

curl -fsSL https://raw.githubusercontent.com/AndPuQing/gflow/main/install.sh | sh

This downloads the latest release binaries and installs them to ~/.cargo/bin. To install elsewhere, set the GFLOW_INSTALL_DIR environment variable:

curl -fsSL https://raw.githubusercontent.com/AndPuQing/gflow/main/install.sh | GFLOW_INSTALL_DIR=/usr/local/bin sh
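
After installing, the binaries are only found if the install directory is on your PATH. The following is a minimal sketch of that check, assuming the default directory described above; the helper name dir_on_path is hypothetical and not part of gflow.

```shell
# Hypothetical helper (not part of gflow): succeed if directory $1 is listed in PATH.
dir_on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Use GFLOW_INSTALL_DIR if it is set, otherwise the default ~/.cargo/bin.
install_dir="${GFLOW_INSTALL_DIR:-$HOME/.cargo/bin}"

if dir_on_path "$install_dir"; then
    echo "$install_dir is on PATH"
else
    echo "$install_dir is not on PATH; add it in your shell profile"
fi
```

If the directory is reported as missing, add it to PATH in your shell profile and open a new terminal before running gflowd.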

Install via cargo

cargo install gflow

This installs all of the gflow binaries (gflowd, ginfo, gbatch, gqueue, gcancel, gjob).

Install via Conda

You can install gflow using Conda from the conda-forge channel:

conda install -c conda-forge gflow

Build Manually

  1. Clone the repository:

    git clone https://github.com/AndPuQing/gflow.git
    cd gflow
    
  2. Build the project:

    cargo build --release
    

    The executables will be available in the target/release/ directory.

Quick Start

  1. Start the scheduler daemon:

    gflowd up
    

    Run this in a dedicated terminal or tmux session and leave it running. You can check its health at any time with gflowd status and inspect resources with ginfo.

  2. Submit a job: Create a script my_job.sh:

    #!/bin/bash
    echo "Starting job on GPU: $CUDA_VISIBLE_DEVICES"
    sleep 30
    echo "Job finished."
    

    Submit it using gbatch:

    gbatch --gpus 1 ./my_job.sh
    
  3. Check the job queue:

    gqueue
    

    You can also watch the queue update live: watch gqueue.

  4. Stop the scheduler:

    gflowd down
    

    This shuts down the daemon and cleans up the tmux session.

Usage Guide

Submitting Jobs with gbatch

gbatch provides flexible options for job submission.

  • Submit a command directly:

    gbatch --gpus 1 python train.py --epochs 10
    
  • Set a job name and priority:

    gbatch --gpus 1 --name "training-run-1" --priority 10 ./my_job.sh
    
  • Create a job that depends on another:

    # First job
    gbatch --gpus 1 --name "job1" ./job1.sh
    # Get job ID from gqueue, e.g., 123
    
    # Second job depends on the first
    gbatch --gpus 1 --name "job2" --depends-on 123 ./job2.sh
    
  • Set a time limit for a job:

    # 30-minute limit
    gbatch --time 30 python train.py
    
    # 2-hour limit (HH:MM:SS format)
    gbatch --time 2:00:00 python long_training.py
    
    # 5 minutes 30 seconds
    gbatch --time 5:30 python quick_task.py
    

    See docs/TIME_LIMITS.md for detailed documentation on time limits.
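
The three --time examples above suggest that a bare number is read as minutes, two fields as MM:SS, and three fields as HH:MM:SS. The sketch below converts such a string to seconds under that reading, which can help when sanity-checking a limit before submitting. The function name time_limit_to_seconds is hypothetical; it only mirrors the examples and is not gflow's own parser, so treat docs/TIME_LIMITS.md as the authority.

```shell
# Hypothetical helper (not part of gflow): convert a time-limit string to seconds.
# Assumes the three forms shown in the examples: MM, MM:SS, and HH:MM:SS.
time_limit_to_seconds() {
    printf '%s\n' "$1" | awk -F: '
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }  # HH:MM:SS
        NF == 2 { print $1 * 60 + $2; next }              # MM:SS
        NF == 1 { print $1 * 60; next }                   # minutes only
        { exit 1 }                                        # anything else is rejected
    '
}

time_limit_to_seconds 30        # prints 1800
time_limit_to_seconds 2:00:00   # prints 7200
time_limit_to_seconds 5:30      # prints 330
```

awk is used for the arithmetic so that zero-padded fields such as 08 are treated as decimal numbers.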

Querying Jobs with gqueue

gqueue allows you to filter and format the job list.

  • Filter by job state:

    gqueue --states Running,Queued
    
  • Filter by job ID or name:

    gqueue --jobs 123,124
    gqueue --names "training-run-1"
    
  • Customize output format:

    gqueue --format "ID,Name,State,GPUs"
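
The --states and --jobs filters above take comma-separated values. When those values come from a script, a small helper can build the list. This is a sketch only: join_csv is a hypothetical helper, not part of gflow, and the gqueue calls use only the flags shown above.

```shell
# Hypothetical helper (not part of gflow): join all arguments with commas.
# The body runs in a subshell, so the IFS change does not leak to the caller.
join_csv() (
    IFS=,
    printf '%s\n' "$*"
)

states=$(join_csv Running Queued)   # Running,Queued
jobs=$(join_csv 123 124)            # 123,124

# Only call gqueue when it is installed.
if command -v gqueue >/dev/null 2>&1; then
    gqueue --states "$states"
    gqueue --jobs "$jobs"
fi
```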
    

Configuration

gflowd is configured through a TOML file. The default configuration file is located at ~/.config/gflow/gflowd.toml.

Contributing

If you find a bug or have a feature request, please open an Issue. Pull Requests are welcome.

License

gflow is licensed under the MIT License. See LICENSE for more details.