TritonDFT

Automating DFT with a Multi-Agent Framework

Density Functional Theory (DFT) is a cornerstone of materials science, yet executing DFT in practice requires coordinating a complex, multi-step workflow. We present TritonDFT, a multi-agent framework that enables efficient and accurate DFT execution through an expert-curated, extensible workflow design, Pareto-aware parameter inference, and multi-source knowledge augmentation. We further introduce DFTBench, a benchmark for evaluating the agent's multi-dimensional capabilities, spanning science expertise, trade-off optimization, HPC knowledge, and cost efficiency.

Our survey of 19 domain researchers at the PhD level or above indicates that DFT execution is a complex, multi-step process requiring heterogeneous domain expertise. TritonDFT reduces the per-step time from minutes–hours to seconds–minutes and automates the entire workflow.

Introduction

Density Functional Theory (DFT) stands as the computational cornerstone of modern materials science. As a first-principles method, DFT provides high-fidelity predictions to validate theoretical hypotheses and reduce experimental cost. However, executing DFT in practice involves a complex, multi-step workflow. Practitioners must search for structural information, configure input parameters, write DFT software-specific scripts, launch and monitor HPC jobs, and interpret and analyze execution results.

As shown in our internal survey with 19 domain researchers at the PhD level or above, manually handling each step typically takes minutes to hours. This imposes substantial overhead and slows down the discovery process. While existing DFT tools can handle certain low-level details, users still need to manually handle most of the steps and coordinate the overall workflow.

Such manual overhead gives rise to a natural question: Can we leverage Large Language Model (LLM)-based agents to orchestrate these steps and enable automation?

We introduce TritonDFT, a trainable multi-agent framework that enables efficient and accurate DFT execution through:

Expert-curated, extensible workflow design with explicit task-to-executable mapping
Pareto-aware parameter inference for accuracy-cost trade-off optimization
Multi-source knowledge augmentation including domain tools, historical memory, and human-in-the-loop interaction

We further introduce DFTBench, a benchmark for evaluating the agent's multi-dimensional capabilities, spanning science expertise, trade-off optimization, HPC knowledge, and cost efficiency.

Figure 1. Performance analysis with Pass Rate and Cost Efficiency across different material types. Cost Efficiency is measured as (1 / Cost Factor), averaged over all passed cases within each type.

Demo

Watch TritonDFT in action — from user query to automated DFT execution, parameter optimization, and result analysis.

TritonDFT Demo. End-to-end automated DFT workflow execution, showcasing the multi-agent framework with Pareto-aware parameter inference and iterative refinement.

TritonDFT: An Expert-Informed Multi-Agent System

Overview of TritonDFT, a multi-agent system for automated DFT workflow execution. Four specialized agents (Planner, Executor, Analyzer, and Refiner) interact via a shared knowledge base and task-specific tools. The Plan-Execute-Refine loop enables iterative optimization and error recovery.

Multi-Agent Architecture

TritonDFT adopts a Plan-Execute-Refine workflow design with four specialized agents:

Planner Agent

Decomposes user queries into computational steps, selects appropriate DFT methods, and determines task-to-executable mappings based on material properties and desired outputs.

Executor Agent

Generates DFT software-specific input scripts (Quantum ESPRESSO, VASP), manages HPC resource allocation, launches jobs, and monitors execution progress.
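As a rough illustration of the script-generation step, the sketch below renders a Quantum ESPRESSO `pw.x` SCF input for bulk silicon. The function name `build_pw_input`, the default directories, the lattice parameter, and the pseudopotential filename `Si.UPF` are assumptions made for this example and are not taken from TritonDFT itself.

```python
def build_pw_input(prefix, ecutwfc, kgrid, pseudo_dir="./pseudo"):
    """Return a pw.x SCF input for diamond-structure silicon as a string."""
    k1, k2, k3 = kgrid
    return f"""&CONTROL
  calculation = 'scf'
  prefix = '{prefix}'
  pseudo_dir = '{pseudo_dir}'
  outdir = './tmp'
/
&SYSTEM
  ibrav = 2
  celldm(1) = 10.26
  nat = 2
  ntyp = 1
  ecutwfc = {ecutwfc}
/
&ELECTRONS
  conv_thr = 1.0d-8
/
ATOMIC_SPECIES
  Si 28.086 Si.UPF
ATOMIC_POSITIONS alat
  Si 0.00 0.00 0.00
  Si 0.25 0.25 0.25
K_POINTS automatic
  {k1} {k2} {k3} 0 0 0
"""


# Illustrative parameters: a 40 Ry cutoff and an 8x8x8 k-point grid.
si_input = build_pw_input("si", 40.0, (8, 8, 8))
print(si_input)
```

The two values exposed as arguments, the plane-wave cutoff and the k-point grid, are the kind of parameters an agent would tune when trading accuracy against cost.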

Analyzer Agent

Parses DFT output files, extracts physical quantities, validates convergence criteria, and identifies numerical errors or physical inconsistencies.
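A minimal sketch of the parsing step, assuming Quantum ESPRESSO `pw.x` text output, in which the final SCF energy line is prefixed with `!` and a converged run prints "convergence has been achieved". The helper name `parse_scf_output` and the sample text are hypothetical; a production parser would need to handle many more cases.

```python
import re

# The converged total energy line in pw.x output starts with "!".
ENERGY_RE = re.compile(r"^!\s+total energy\s+=\s+(-?\d+\.\d+)\s+Ry", re.MULTILINE)


def parse_scf_output(text):
    """Extract the convergence flag and last converged total energy (Ry)."""
    energies = [float(value) for value in ENERGY_RE.findall(text)]
    return {
        "converged": "convergence has been achieved" in text,
        "total_energy_ry": energies[-1] if energies else None,
    }


# Hand-written sample resembling the relevant lines of a pw.x log.
sample = """
     total energy              =     -15.84000000 Ry
!    total energy              =     -15.85012345 Ry
     convergence has been achieved in   6 iterations
"""

parsed = parse_scf_output(sample)
print(parsed)
```

Intermediate SCF iterations print the energy without the `!` prefix, so the pattern deliberately ignores them.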

Refiner Agent

Adjusts parameters based on convergence tests, recovers from failures, and iteratively optimizes configurations to meet accuracy and cost requirements.
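The interaction between the four agents can be pictured as a simple control loop. The sketch below is only a schematic under stated assumptions: the four callables stand in for the Planner, Executor, Analyzer, and Refiner agents, the "DFT run" is a toy formula rather than a real calculation, and none of the names come from the TritonDFT codebase.

```python
def run_loop(plan, execute, analyze, refine, params, max_iters=5):
    """Repeat execute -> analyze -> refine until the analyzer accepts."""
    steps = plan(params)
    history = []
    for _ in range(max_iters):
        result = execute(steps, params)
        history.append((dict(params), result))
        if analyze(history):
            return params, history
        params = refine(params, history)
    return None, history  # no accepted configuration within the budget


def plan(params):
    return ["scf"]  # a single-step workflow for this toy case


def execute(steps, params):
    # Stand-in for a DFT job: the energy approaches -10.0 as the cutoff grows.
    return -10.0 + 1.0 / params["ecutwfc"] ** 2


def analyze(history):
    # Accept once two consecutive energies agree to within 1e-3.
    if len(history) < 2:
        return False
    return abs(history[-1][1] - history[-2][1]) < 1e-3


def refine(params, history):
    # Raise the cutoff by a fixed step when the run is not yet converged.
    return {**params, "ecutwfc": params["ecutwfc"] + 10}


final_params, history = run_loop(plan, execute, analyze, refine, {"ecutwfc": 20})
print(final_params, len(history))
```

Keeping the full history available to the analyzer and refiner mirrors the idea of a shared knowledge base, although the real system presumably stores far richer state.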

DFTBench: Multi-Dimensional Capability Evaluation

Existing benchmarks extensively cover areas such as graduate-level materials-domain knowledge, but key capabilities in end-to-end DFT workflows, including numerical accuracy, Pareto-optimality, HPC parallelization, and cost efficiency, remain unevaluated. We present DFTBench to evaluate these capabilities.

100 Materials: spanning 10 distinct types

500+ CPU Hours: expert-curated ground truth

Dimensions: multi-faceted evaluation

Evaluation Dimensions

Science Expertise

Understanding of physics and materials science concepts, DFT theory, and domain-specific knowledge required for parameter selection.

Trade-off Optimization

Ability to estimate and optimize the accuracy-cost Pareto frontier, balancing numerical fidelity with computational efficiency.

HPC Knowledge

Expertise in parallelization strategies, resource allocation, job scheduling, and optimization of computational workflows on HPC systems.

Cost Efficiency

Practical efficiency in real-world resource usage, minimizing wall-clock time and computational cost while maintaining accuracy requirements.

Experimental Results

Framework Comparison

We compare TritonDFT with state-of-the-art agentic DFT frameworks. TritonDFT provides the most comprehensive evaluation on a diverse dataset of 10 material categories, uniquely benchmarking accuracy-cost tradeoff, parallel efficiency, and monetary cost.

Method	Architecture: Supported Task Types	Architecture: Parameter Configuration	Architecture: Knowledge Augmentation	Evaluation: Material Types	Evaluation: Ground Truth Curation	Evaluation: Accuracy-Cost Tradeoff	Evaluation: Parallel Efficiency	Evaluation: Monetary Cost
DREAMS	Surface Chemistry (Adsorption)	Physics Only	Open Database	2 (Metal, Insulator)	Public Dataset	✗	✗	✗
VASPilot	Electronic Structure (Band, DOS)	Physics Only	Open Database	1 (Semiconductor)	Public Dataset	✗	✗	✗
AgenticDFT	Geometry & Energetics (Relaxation, Band)	Physics Only	Open Database	2 (Metal, Semiconductor)	Public Dataset	✗	✗	✗
TritonDFT (Ours)	General QE Usage (>10 Task Types)	Physics + HPC (Pareto-aware)	Open Database + Memory + Human-in-the-loop	10 (Metal, Insulator, Superconductor, Topological, ...)	Expert Curated Calculations	✓	✓	✓

Table 1. Comparison of TritonDFT with state-of-the-art agentic DFT frameworks.

Parameter Configuration Performance

Model performance on DFT parameter configuration across different LLMs under varying error thresholds. GPT 5.2 achieves the highest pass rates, while Claude Opus 4.5 excels at advanced parameter satisfaction.

Model	Pass Rate (ΔE < 20 meV/atom)	Cost Factor (ΔE < 20 meV/atom)	Pass Rate (ΔE < 10 meV/atom)	Cost Factor (ΔE < 10 meV/atom)	Pass Rate (ΔE < 1 meV/atom)	Cost Factor (ΔE < 1 meV/atom)	Advanced Param. Satisfaction
GPT 5.2	70.5%	14.29	67.0%	8.95	47.1%	4.23	51.3%
GPT 5.1	39.3%	6.22	32.9%	4.21	9.8%	2.24	43.6%
GPT 4o	52.8%	1.85	38.2%	1.28	13.6%	0.50	28.2%
GPT 4o mini	5.7%	1.01	5.6%	1.17	4.5%	0.97	28.2%
Gemini 2.5 Pro	59.6%	3.77	53.9%	2.95	14.9%	1.24	48.7%
Gemini 2.5 Flash	23.6%	1.85	16.9%	1.68	2.3%	0.78	38.5%
Claude Opus 4.5	9.0%	1.62	5.6%	1.33	4.5%	0.58	53.8%
Claude Sonnet 4.5	30.3%	2.38	25.8%	1.93	21.6%	0.87	38.5%

Table 2. Model performance on DFT parameter configuration. GPT 4o demonstrates the best cost efficiency.
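To make the two headline metrics concrete, the following sketch computes a pass rate under an energy-error threshold and a cost efficiency as the mean of 1 / Cost Factor over passed cases, following the Figure 1 caption. The page does not define how Cost Factor is obtained, so it is treated here as an opaque positive number, and reading "averaged" as the mean of reciprocals is an assumption. The case data are invented for illustration.

```python
def summarize(cases, threshold_mev):
    """cases: list of (absolute energy error in meV/atom, cost factor)."""
    passed = [cost for error, cost in cases if error < threshold_mev]
    pass_rate = len(passed) / len(cases)
    # Cost efficiency is only defined when at least one case passes.
    cost_efficiency = (
        sum(1.0 / cost for cost in passed) / len(passed) if passed else None
    )
    return pass_rate, cost_efficiency


# Made-up cases: (error, cost factor)
cases = [(5.0, 2.0), (15.0, 4.0), (30.0, 1.0), (0.5, 0.5)]

for threshold in (20.0, 10.0, 1.0):
    rate, efficiency = summarize(cases, threshold)
    print(threshold, rate, efficiency)
```

Because the efficiency is averaged only over passed cases, a model with a low pass rate can still report a favorable Cost Factor, which is worth keeping in mind when reading the table.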

Performance by Material Type

Figure 3. Energy Deviation and Computational Cost Variations with different DFT Parameters for Silicon. TritonDFT learns to identify and select configurations on the Pareto frontier, achieving optimal accuracy-cost trade-offs.
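The notion of a Pareto frontier over (cost, error) pairs can be sketched in a few lines. This is a generic textbook construction, not TritonDFT's inference procedure, and the configuration data below are invented: each tuple is a hypothetical (cost, energy error) pair with lower values better on both axes.

```python
def pareto_front(points):
    """Return the non-dominated (cost, error) points, sorted by cost."""
    front = []
    best_error = float("inf")
    for cost, error in sorted(points):
        # Keep a point only if it beats every cheaper point on error.
        if error < best_error:
            front.append((cost, error))
            best_error = error
    return front


def cheapest_within(points, max_error):
    """Pick the lowest-cost frontier point whose error is below max_error."""
    feasible = [p for p in pareto_front(points) if p[1] < max_error]
    return feasible[0] if feasible else None


# Hypothetical configurations: (cost in CPU-hours, error in meV/atom)
configs = [(1.0, 25.0), (2.0, 8.0), (3.0, 9.0), (5.0, 0.8), (9.0, 0.8)]

print(pareto_front(configs))
print(cheapest_within(configs, 10.0))
```

Here (3.0, 9.0) and (9.0, 0.8) are dropped because a cheaper configuration already achieves an equal or lower error.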

Accuracy Analysis

Mean absolute error (MAE, %) across different DFT tasks, computed over successfully finished execution results.

Model	VC-relax	SCF	Band Gap	DOS
GPT 5.2	0.04	0.04	0.09	0.97
GPT 5.1	0.06	0.07	0.31	2.21
GPT 4o	0.10	1.11	2.48	9.04
Gemini 2.5 Pro	0.05	0.09	1.14	1.40
Gemini 2.5 Flash	0.06	0.83	1.21	11.17
Claude Opus 4.5	0.06	0.11	2.10	3.00
Claude Sonnet 4.5	0.09	0.14	2.00	2.12

Table 3. Mean absolute error (%) across different DFT tasks. Lower is better. GPT 5.2 consistently achieves the lowest errors.
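The page reports MAE as a percentage without giving the formula. One plausible reading, assumed here, is the mean of absolute relative deviations from reference values, expressed in percent. The function name and sample numbers are illustrative only.

```python
def mean_abs_pct_error(predicted, reference):
    """Mean of |predicted - reference| / |reference|, in percent."""
    pairs = list(zip(predicted, reference))
    total = sum(abs(p - r) / abs(r) for p, r in pairs)
    return 100.0 * total / len(pairs)


# Invented values: two predictions that are each off by 1% of the reference.
predicted = [99.0, 202.0]
reference = [100.0, 200.0]
mae_pct = mean_abs_pct_error(predicted, reference)
print(round(mae_pct, 6))
```

Under this reading, quantities with reference values near zero would need separate handling, since the relative error becomes ill-defined.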

Throughput & Cost Tradeoff

Figure 4. Throughput comparison (excluding DFT execution time) across different models.

Figure 5. K-point grid and computational cost tradeoff analysis for DFT calculations.

Cost Analysis

Average API cost (USD) per query.

Model	Struct Relax	Band Gap	DOS
GPT 5.2	0.05 ± 0.02	0.15 ± 0.04	0.13 ± 0.04
GPT 5.1	0.04 ± 0.02	0.13 ± 0.04	0.10 ± 0.03
GPT 4o	0.06 ± 0.03	0.18 ± 0.04	0.14 ± 0.03
Gemini 2.5 Pro	0.05 ± 0.03	0.13 ± 0.04	0.11 ± 0.04
Gemini 2.5 Flash	0.01 ± 0.01	0.03 ± 0.01	0.03 ± 0.01
Claude Opus 4.5	0.15 ± 0.08	0.44 ± 0.13	0.37 ± 0.11
Claude Sonnet 4.5	0.15 ± 0.06	0.34 ± 0.09	0.28 ± 0.08

Table 4. Average API cost per query. Gemini 2.5 Flash is the most cost-effective.

Parallel Efficiency

Relative speedup (%) over the default baseline.

Model	16 Cores	32 Cores	64 Cores
GPT 5.2	+14.4%	+4.2%	+15.1%
GPT 5.1	+11.3%	−5.8%	−14.1%
GPT 4o	−21.0%	−34.0%	−23.6%
GPT 4o mini	−25.7%	−25.6%	+2.8%
Gemini 2.5 Pro	+4.7%	−6.4%	−3.3%
Gemini 2.5 Flash	−20.7%	−43.7%	−32.0%
Claude Opus 4.5	+15.4%	+16.1%	+16.1%
Claude Sonnet 4.5	+13.0%	+5.1%	+2.4%

Table 5. Claude Opus 4.5 shows the most consistent speedup across all core counts.
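The exact definition of relative speedup is not given on this page. A common convention, assumed in the sketch below, compares wall-clock times of the default and model-chosen parallel settings at the same core count, so that positive values mean the model's settings ran faster. The timings are invented.

```python
def relative_speedup_pct(baseline_seconds, tuned_seconds):
    """Percent speedup of the tuned run relative to the baseline run."""
    return 100.0 * (baseline_seconds / tuned_seconds - 1.0)


# Invented wall-clock times (seconds) at a fixed core count.
faster = relative_speedup_pct(120.0, 100.0)   # tuned run is faster
slower = relative_speedup_pct(100.0, 125.0)   # tuned run is slower
print(round(faster, 3), round(slower, 3))
```

With this convention, the negative entries in the table correspond to parallelization settings that performed worse than the default.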

Phonon Dispersion Analysis

TritonDFT supports advanced DFT tasks including phonon dispersion calculations. Below we show representative phonon band structures computed end-to-end by TritonDFT, demonstrating the framework's capability to handle complex lattice-dynamics workflows.

Figure 6. Phonon dispersion relations computed by TritonDFT for representative materials, showcasing automated phonon workflow execution including self-consistent calculation, force-constant extraction, and band-path interpolation.

Key Findings

End-to-End Automation

TritonDFT automates the complete DFT workflow, reducing manual intervention time by 80–90% based on our survey with 19 domain researchers.

Pareto Optimization

Pareto-aware parameter inference enables intelligent accuracy-cost trade-offs, with GPT 5.2 achieving 70.5% pass rate at the 20 meV/atom threshold.

Extensible Design

Supports 10 diverse material categories (metal, insulator, superconductor, topological, etc.) and more than 10 Quantum ESPRESSO (QE) task types.

LLM Capability Analysis

Experiments reveal substantial capability differences across LLMs: GPT 5.2 leads in accuracy, GPT 4o in cost efficiency, and Claude Opus 4.5 in parallelization.
