Multi-Datacenter Coordination

Research Question

How do multiple datacenters at different grid locations interact, how should their controllers coordinate, and can shifting LLM replicas between sites further reduce voltage violations when batch-size control is exhausted?

Overview

When multiple datacenters share a distribution feeder, their controllers must coordinate to avoid conflicting actions. OpenG2G supports multi-DC simulations where each datacenter has its own OFO controller bound to its site, and an optional cross-site load-shifting controller.

OFO coordination: Each per-site OFO controller independently optimizes batch sizes using voltage sensitivities computed for its own loads. The controllers share the same grid state but operate on different datacenter instances.
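As a rough illustration of independent per-site updates against a shared grid state, the sketch below applies one projected-gradient step on a quadratic voltage-violation penalty. It is not OpenG2G's OFO implementation: the function name ofo_batch_step, the 0.95/1.05 p.u. limits, the scalar continuous batch size per site, and all sensitivity values are assumptions made for the example.

```python
V_MIN, V_MAX = 0.95, 1.05  # assumed per-unit voltage limits


def ofo_batch_step(batch, voltages, dv_dbatch, step, batch_min, batch_max):
    """One projected-gradient step on a quadratic voltage-violation penalty.

    dv_dbatch maps bus name -> assumed sensitivity of that bus voltage to
    this site's batch size (negative when more load lowers the voltage).
    """
    grad = 0.0
    for bus, v in voltages.items():
        over = max(0.0, v - V_MAX)
        under = max(0.0, V_MIN - v)
        # Gradient of 0.5 * (over**2 + under**2) with respect to batch size.
        grad += dv_dbatch.get(bus, 0.0) * (over - under)
    # Project back onto the feasible batch range.
    return min(batch_max, max(batch_min, batch - step * grad))


# Both sites read the same grid state but use their own sensitivities.
shared_voltages = {"bus_a": 0.93, "bus_b": 0.96}
site_sensitivities = {
    "site_1": {"bus_a": -1e-4, "bus_b": -2e-5},  # electrically close to bus_a
    "site_2": {"bus_a": -1e-5, "bus_b": -8e-5},  # electrically far from bus_a
}
batches = {"site_1": 64.0, "site_2": 64.0}
for site, sens in site_sensitivities.items():
    batches[site] = ofo_batch_step(
        batches[site], shared_voltages, sens,
        step=1e7, batch_min=8.0, batch_max=128.0,
    )
```

In this toy setup the undervoltage at bus_a makes both sites reduce their batch size, and the site with the larger sensitivity to that bus backs off more.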

Load shifting: When a site's OFO controller has exhausted batch-size adjustments (all models at their min or max feasible batch size), the LoadShiftController can shift LLM replicas between sites. This is a last-resort mechanism: it moves inference servers out of undervoltage sites and into overvoltage sites.

Scripts

Script                          Purpose
analyze_LLM_load_shifting.py    Compare OFO with and without cross-site load shifting
run_ofo.py                      Run multi-DC OFO simulation

Usage

IEEE 123-Bus: OFO vs OFO + Load Shifting

The experiment is defined inline in analyze_LLM_load_shifting.py. It uses multi-model DCs (at least 3 models per site, as warm-start shifting requires) and has load shifting enabled.

python examples/offline/analyze_LLM_load_shifting.py --system ieee123

Outputs (in outputs/ieee123/load_shift_comparison/):

  • voltage_comparison.png: side-by-side voltage envelopes (OFO vs OFO+shift)
  • net_replica_shift.png: per-site net replica changes over time
  • power_comparison.png: per-site DC power traces
  • summary_bar_chart.png: violation time and integral comparison
  • Per-case result_ofo_no_shift.csv / result_ofo_with_shift.csv: voltage metrics plus mean_throughput_tps, integrated_throughput_tokens, and itl_deadline_fraction

The terminal summary prints a two-column comparison (OFO vs OFO + load shift) of violation time, integral violation, worst Vmin/Vmax, mean throughput, and the ITL-over-deadline fraction (ITL: inter-token latency), so you can see the shift controller's effect on grid quality and DC service quality side by side.
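To make the two voltage metrics concrete, here is a minimal sketch of how violation time and integral violation could be computed from sampled voltage envelopes. The definitions used (time spent outside the band, and excess voltage integrated over time with a left-endpoint sum), the 0.95/1.05 p.u. limits, and the function name violation_metrics are assumptions; the script's own definitions may differ.

```python
def violation_metrics(times_s, v_min_trace, v_max_trace, v_lo=0.95, v_hi=1.05):
    """Return (violation_time_s, integral_violation_pu_s).

    times_s:      sample times in seconds, increasing
    v_min_trace:  lowest bus voltage (p.u.) at each sample
    v_max_trace:  highest bus voltage (p.u.) at each sample
    Each sample is held until the next one (left-endpoint rule).
    """
    violation_time = 0.0
    integral = 0.0
    for i in range(len(times_s) - 1):
        dt = times_s[i + 1] - times_s[i]
        # Distance outside the allowed band, counting both directions.
        excess = max(0.0, v_lo - v_min_trace[i]) + max(0.0, v_max_trace[i] - v_hi)
        if excess > 0.0:
            violation_time += dt
            integral += excess * dt
    return violation_time, integral


# Hypothetical 5-sample trace with a two-second undervoltage dip.
times = [0, 1, 2, 3, 4]
v_min = [0.96, 0.94, 0.93, 0.96, 0.96]
v_max = [1.02, 1.02, 1.02, 1.02, 1.02]
vt, vi = violation_metrics(times, v_min, v_max)
print(f"violation time: {vt:.1f} s, integral violation: {vi:.3f} p.u.*s")
```

Running the same function on the no-shift and with-shift traces would give the kind of pairwise comparison the summary reports.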

Load Shifting Rules

The LoadShiftController follows five rules:

  1. Warm start only: Only shifts models already running at both source and destination.
  2. Last resort: Only acts when all models at the violated site have batch sizes at their feasible limit.
  3. Directional: Undervoltage → shift replicas OUT; overvoltage → shift replicas IN.
  4. Capacity-aware: Checks available_gpu_capacity() on the destination before shifting.
  5. Incremental: Shifts gpus_per_shift GPUs per time step, repeating until resolved.
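The five rules can be read as a sequence of guards applied once per control step. The sketch below is one possible reading, not the LoadShiftController source: the Site class, its fields, and plan_shift are hypothetical, and only available_gpu_capacity(), gpus_per_shift, and total_gpu_capacity are names taken from this page. It also assumes a shift is skipped entirely when a full increment does not fit.

```python
from dataclasses import dataclass


@dataclass
class Site:  # hypothetical stand-in for a datacenter site
    name: str
    total_gpu_capacity: int
    gpus_in_use: int
    models: set            # names of models currently running here
    batch_at_limit: bool   # True when every model is at its min or max feasible batch
    violation: str         # "under", "over", or "none"

    def available_gpu_capacity(self):
        return self.total_gpu_capacity - self.gpus_in_use


def plan_shift(violated, other, gpus_per_shift=8):
    """Return (source, destination, model, gpus) for one step, or None."""
    # Rule 2 (last resort): act only when batch-size control is exhausted.
    if violated.violation == "none" or not violated.batch_at_limit:
        return None
    # Rule 3 (directional): undervoltage sheds load, overvoltage attracts it.
    if violated.violation == "under":
        src, dst = violated, other
    else:
        src, dst = other, violated
    # Rule 1 (warm start): only models already running at both ends.
    shared = sorted(src.models & dst.models)
    if not shared:
        return None
    # Rule 4 (capacity-aware): the destination must have room.
    if dst.available_gpu_capacity() < gpus_per_shift:
        return None
    if src.gpus_in_use < gpus_per_shift:
        return None
    # Rule 5 (incremental): one fixed-size move per step; the caller repeats.
    return (src.name, dst.name, shared[0], gpus_per_shift)


site_a = Site("A", 64, 48, {"m1", "m2", "m3"}, batch_at_limit=True, violation="under")
site_b = Site("B", 64, 32, {"m2", "m3", "m4"}, batch_at_limit=False, violation="none")
move = plan_shift(site_a, site_b)  # shifts 8 GPUs of a shared model from A to B
```

Calling plan_shift again on the next step, after updating gpus_in_use at both sites, gives the repeat-until-resolved behavior described in rule 5.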

Configuration

Load shifting is configured via LoadShiftConfig:

LoadShiftConfig(enabled=True, gpus_per_shift=8)
  • enabled: Enable/disable the load shifting controller
  • gpus_per_shift: GPUs moved per control step (default 8)
  • total_gpu_capacity on OfflineDatacenter: Maximum GPUs per site (enforced during shifting)
  • Each site must have at least 3 models for warm-start shifting
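A quick pre-flight check of these constraints might look like the following sketch. LoadShiftConfigSketch is a hypothetical stand-in that mirrors only the two fields listed above (it is not OpenG2G's LoadShiftConfig, and the enabled default here is an assumption), and check_sites and the models_by_site mapping are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadShiftConfigSketch:  # hypothetical stand-in, not the OpenG2G class
    enabled: bool = False     # assumed default
    gpus_per_shift: int = 8   # default taken from this page


def check_sites(config, models_by_site, min_models=3):
    """Return a list of problems; an empty list means the setup looks usable."""
    problems = []
    if not config.enabled:
        return problems  # nothing to validate when shifting is off
    if config.gpus_per_shift <= 0:
        problems.append("gpus_per_shift must be positive")
    for site, models in models_by_site.items():
        if len(models) < min_models:
            problems.append(
                f"{site}: {len(models)} models, need at least {min_models}"
            )
    return problems


cfg = LoadShiftConfigSketch(enabled=True, gpus_per_shift=8)
issues = check_sites(cfg, {"site_1": ["m1", "m2", "m3"], "site_2": ["m1", "m2"]})
```

Here issues would flag site_2, which hosts only two models.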

See Building Simulators for the full component API.