GATO: GPU-Accelerated and Batched Trajectory Optimization for Scalable Edge Model Predictive Control

1 School of Engineering and Applied Science, Columbia University
2 University of Michigan
3 Barnard College, Columbia University
4 Dartmouth College

Performance & Hardware Validation

GATO achieves 18–21× speedup over CPU baselines and 1.4–16× speedup over GPU baselines as batch size increases, covering the under-served regime of tens to low-hundreds of parallel solves at real-time rates.

Abstract

While Model Predictive Control (MPC) delivers strong performance across robotics applications, solving the underlying (batches of) nonlinear trajectory optimization (TO) problems online remains computationally demanding. Existing GPU-accelerated approaches typically (i) parallelize a single solve to meet real-time deadlines, (ii) scale to very large batches at slower-than-real-time rates, or (iii) achieve speed by restricting model generality (e.g., point-mass dynamics or a single linearization). This leaves a large gap in solver performance for the many state-of-the-art MPC applications that require real-time batches of tens to low hundreds of solves. To fill this gap, we present GATO, an open-source, GPU-accelerated, batched TO solver co-designed across algorithm, software, and computational hardware to deliver real-time throughput in these moderate-batch-size regimes. Our approach combines block-, warp-, and thread-level parallelism within and across solves to achieve high performance. We demonstrate its effectiveness through simulated benchmarks showing speedups of 18–21× over CPU baselines and 1.4–16× over GPU baselines as batch size increases; case studies highlighting improved disturbance rejection and convergence behavior; and a hardware validation on an industrial manipulator. We open-source GATO to support reproducibility and adoption.

Key Results

[Figure, left: bar chart of solve times varying batch size across CPU and GPU solvers]
[Figure, right: heat map of GATO solve times varying batch size and time horizon]

(Left) Solve times for 6-DoF manipulator motions while varying the batch size (M) and the underlying solver. N=64 for all solves. GATO scales far better than state-of-the-art CPU and GPU solvers. (Right) A heat map of solve times while varying both batch size (M) and time horizon (N). GATO reaches kHz control rates for real-time iterations of large batches (M=512) of short-horizon (N=8) trajectories, as well as smaller batches (M=32) of longer-horizon (N=128) trajectories, showing the flexibility of the design.

[Figure, left: bar chart of tracking error and scatter plot of joint velocities for figure-8 task]
[Figure, right: end-effector trajectories during figure-8 task with external force applied]

Figure-8 tracking task with an external disturbance applied at the end effector. (Left) The bar chart shows tracking error; the scatter plot shows average total joint velocities. Increasing GATO's batch size improves disturbance rejection, lowering tracking error and joint velocities until the added latency of a larger batch outweighs the optimality gains. (Right) End-effector trajectories realized during this experiment when 50 N of external force is applied at the end effector, again showing that modest batch sizes lead to the best performance.

[Figure, left: cumulative distribution function of solve times for GATO across different batch sizes and pendulum configurations]
[Figure, right: hardware experiment showing GATO successfully and unsuccessfully accounting for time-varying unmodeled disturbance]

(Left) Cumulative distribution function of GATO's solve times for trajectory length N=16 across different batch sizes and varied pendulum configurations. For each batch size, the solver accounts for 100 disturbance scenarios. Larger batch sizes enable more accurate rejection of unmodeled disturbances. (Right) Hardware experiment showing the solver succeeding (M=32) and failing (M=1) to account for the time-varying unmodeled disturbance and reach the targets.

Method

GATO solves batches of nonlinear trajectory optimization problems using a GPU-accelerated Sequential Quadratic Programming (SQP) approach co-designed across algorithm, software, and hardware. At each SQP iteration, it:

  1. Batch initialization: Initializes all trajectories in parallel across GPU threads, amortizing setup cost over the entire batch.
  2. Parallelized KKT formation & factorization: Exploits block-, warp-, and thread-level GPU parallelism both within a single solve (across time steps) and across solves (across the batch) to form and factor the KKT system via Schur complement decomposition with no CPU-GPU data transfer in the inner loop.
  3. Line search & convergence: Applies a GPU-resident line search with merit function evaluation, enabling fast convergence and strong disturbance rejection without falling back to CPU.
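
To make the Schur complement idea in step 2 concrete, here is a minimal CPU-only Python sketch for a single equality-constrained QP subproblem: minimize 0.5 dx'Q dx + q'dx subject to A dx + c = 0. It is an illustration under stated assumptions, not GATO's CUDA implementation. The Hessian Q is assumed diagonal, the reduced system is solved by dense Gaussian elimination, and the names solve_dense and kkt_step_schur are hypothetical. A batched GPU solver would instead exploit the structure across time steps and run many such solves in parallel.

```python
from typing import List, Sequence, Tuple


def solve_dense(S: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve S x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(S, b)]  # augmented matrix [S | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Schur complement is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def kkt_step_schur(
    q_diag: Sequence[float],        # diagonal of Q (assumed positive)
    q: Sequence[float],             # cost gradient
    A: Sequence[Sequence[float]],   # constraint Jacobian, m rows by n columns
    c: Sequence[float],             # constraint residual
) -> Tuple[List[float], List[float]]:
    """Solve the KKT system  Q dx + A' lam = -q,  A dx = -c  via the Schur complement."""
    n, m = len(q), len(c)
    q_inv = [1.0 / d for d in q_diag]
    # Schur complement S = A Q^{-1} A'
    S = [[sum(A[i][k] * q_inv[k] * A[j][k] for k in range(n)) for j in range(m)]
         for i in range(m)]
    # Reduced right-hand side: S lam = c - A Q^{-1} q
    rhs = [c[i] - sum(A[i][k] * q_inv[k] * q[k] for k in range(n)) for i in range(m)]
    lam = solve_dense(S, rhs)
    # Recover the primal step: dx = -Q^{-1} (q + A' lam)
    dx = [-q_inv[k] * (q[k] + sum(A[i][k] * lam[i] for i in range(m))) for k in range(n)]
    return dx, lam


# Toy problem: minimize x1^2 + x2^2 subject to x1 + x2 = 1, starting from the origin.
dx, lam = kkt_step_schur([2.0, 2.0], [0.0, 0.0], [[1.0, 1.0]], [-1.0])
# dx == [0.5, 0.5], lam == [-1.0]
```

Eliminating dx first leaves a smaller system in the multipliers only, which is the part a structured solver can factor efficiently.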

This co-design fills the critical gap between single-solve real-time methods and large-batch offline solvers, enabling scalable edge MPC for manipulation and other robotics applications.
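
The merit-function line search in step 3 can be sketched in plain Python as follows. This is a hedged illustration only: the L1 merit function, the penalty weight mu, the fixed set of candidate step sizes, and the function names are all assumptions made for the example, since the text above does not specify them. Each problem in the batch and each candidate step is independent, which is what makes this step a natural fit for GPU parallelism.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def l1_merit(cost: float, residual: Sequence[float], mu: float) -> float:
    """L1 merit: cost plus mu times the 1-norm of the constraint residual."""
    return cost + mu * sum(abs(r) for r in residual)


def batched_line_search(
    xs: Sequence[Vector],                     # current iterate of each problem
    dxs: Sequence[Vector],                    # proposed step of each problem
    cost: Callable[[Vector], float],
    residual: Callable[[Vector], Vector],
    mu: float = 10.0,                         # assumed penalty weight
    alphas: Sequence[float] = (1.0, 0.5, 0.25, 0.125),  # assumed candidates
) -> List[float]:
    """Per problem, pick the candidate step size with the lowest merit.

    Returns 0.0 for a problem when no candidate improves on the current merit.
    """
    chosen = []
    for x, dx in zip(xs, dxs):        # independent across the batch
        best_alpha = 0.0
        best_merit = l1_merit(cost(x), residual(x), mu)
        for alpha in alphas:          # independent across candidates
            trial = [xi + alpha * di for xi, di in zip(x, dx)]
            m = l1_merit(cost(trial), residual(trial), mu)
            if m < best_merit:
                best_alpha, best_merit = alpha, m
        chosen.append(best_alpha)
    return chosen


# Toy problem: minimize x1^2 + x2^2 subject to x1 + x2 = 1.
def toy_cost(x: Vector) -> float:
    return sum(xi * xi for xi in x)


def toy_residual(x: Vector) -> Vector:
    return [x[0] + x[1] - 1.0]


step_sizes = batched_line_search(
    xs=[[2.0, 2.0], [0.5, 0.5]],
    dxs=[[-1.5, -1.5], [1.0, 1.0]],
    cost=toy_cost,
    residual=toy_residual,
)
# step_sizes == [1.0, 0.0]: the first problem takes the full step to the
# optimum, and the second is already optimal, so every candidate is rejected.
```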

[Figure: GATO system design overview showing algorithm, software, and hardware co-design]

Supported Robot Systems

GATO ships with plant models for two robot arms. Additional systems can be added by implementing _plant.cuh and _grid.cuh (see gato/dynamics/).

Robot           DOF   Notes
KUKA iiwa14     7     Industrial manipulator; grid dynamics pre-generated via GRiD
Neubrex Indy7   6     Collaborative robot; grid dynamics pre-generated via GRiD

Configurable horizon lengths: 8, 16, 32, 64, 128 knot points (set via -DKNOTS at build time).

Quick Start

Setup (Docker + uv)

git clone https://github.com/A2R-Lab/GATO.git
cd GATO
./tools/install.sh   # install Docker and uv
./tools/docker.sh    # build image and enter container

Build the CUDA Solver

./tools/build.sh     # default: iiwa14 + indy7, knots 8/32/128

Custom Build Options

mkdir -p build && cd build
cmake -DPLANT="iiwa14" -DKNOTS="16;64" ..
cmake --build . --parallel

Built Python extension modules are written to python/bsqp/ as bsqpN{N}_{plant}.so. See examples/ for Jupyter notebooks demonstrating MPC with GATO.
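
As a small convenience, the naming pattern above can be turned into a check for which modules a build actually produced. The sketch below only constructs file names and tests whether they exist. It does not import or call the modules, since their Python API is not described here. The helper names are hypothetical, and if your build appends a platform-specific suffix to the file name, the pattern would need adjusting.

```python
from pathlib import Path
from typing import Dict, Sequence, Union

# Horizon lengths listed as configurable on this page.
SUPPORTED_KNOTS = (8, 16, 32, 64, 128)


def bsqp_module_filename(knots: int, plant: str) -> str:
    """Build the expected file name following the bsqpN{N}_{plant}.so pattern."""
    if knots not in SUPPORTED_KNOTS:
        raise ValueError(f"knots must be one of {SUPPORTED_KNOTS}, got {knots}")
    return f"bsqpN{knots}_{plant}.so"


def find_built_modules(
    module_dir: Union[str, Path],
    plant: str,
    knots_options: Sequence[int] = SUPPORTED_KNOTS,
) -> Dict[int, Path]:
    """Map each horizon length to its module path, keeping only files that exist."""
    base = Path(module_dir)
    found = {}
    for n in knots_options:
        candidate = base / bsqp_module_filename(n, plant)
        if candidate.exists():
            found[n] = candidate
    return found


# Example: list the iiwa14 modules present after a build.
available = find_built_modules("python/bsqp", "iiwa14")
```

An empty result suggests the build has not run yet or targeted a different plant or horizon.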

Related Projects

  • MPCGPU: the original single-solve GPU-accelerated MPC solver from which this project is derived
  • GRiD: GPU-accelerated rigid body dynamics with analytical gradients

BibTeX

@inproceedings{du2026gato,
    title={GATO: GPU-Accelerated and Batched Trajectory Optimization for Scalable Edge Model Predictive Control},
    author={Alexander Du and Emre Adabag and Gabriel Bravo and Brian Plancher},
    booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
    year={2026},
    month={June}
}