Data Pipeline

OpenG2G ships with built-in support for trace-replay simulations based on real GPU benchmark data. This page describes how raw benchmark measurements are compiled into artifacts that plug into simulation, and how those artifacts are consumed at runtime.

LLM workloads

The data pipeline currently focuses on LLM workloads (inference from the ML.ENERGY Benchmark results, and training via synthetic generation), an important motivating workload class for AI datacenter-grid interactions. We hope to improve and expand the data pipeline to support more workloads.

Overview

Data generation is integrated into the library classes that consume the data. Each class has generate(), save(), load(), and ensure() methods:

Class                Generates                                           Consumed by
InferenceData        Power traces + ITL distribution fits                OfflineDatacenter
LogisticModelStore   Logistic curve fits (power, latency, throughput)    OFOBatchSizeController
TrainingTrace        Synthetic training power trace                      OfflineDatacenter

Each class provides an ensure() classmethod that generates data if it doesn't exist and loads it:

inference_data = InferenceData.ensure(data_dir, models, data_sources, dt_s=0.1)
training_trace = TrainingTrace.ensure(data_dir / "training_trace.csv", training_params)
logistic_models = LogisticModelStore.ensure(data_dir / "logistic_fits.csv", models, data_sources)

Online Simulation

Online simulation with live GPUs does not use the precomputed power traces and ITL distributions, since the running servers supply those measurements directly. However, the OFO controller can still use the logistic fits for gradient estimation.

Config File

A shared config.json (examples/offline/config.json) stores model specifications and benchmark data sources:

{
  "model_specs": [
    {
      "model_label": "Llama-3.1-8B",
      "model_id": "meta-llama/Llama-3.1-8B-Instruct",
      "gpus_per_replica": 1,
      "itl_deadline_s": 0.08,
      "feasible_batch_sizes": [8, 16, 32, 64, 128, 256, 512]
    }
  ],
  "data_sources": [
    {
      "model_label": "Llama-3.1-8B",
      "task": "lm-arena-chat",
      "gpu": "H100",
      "batch_sizes": [8, 16, 32, 64, 96, 128, 192, 256, 384, 512, 768, 1024]
    }
  ],
  "training_trace_params": {}
}
  • model_specs[] entries are parsed as InferenceModelSpec. These describe model identity (GPU requirements, feasible batch sizes, latency deadlines) but not deployment-specific parameters like replica counts or initial batch sizes — those are defined per-experiment in each script.
  • data_sources[] entries are parsed as MLEnergySource, linked to models by model_label.
  • training_trace_params is parsed as TrainingTraceParams. Empty {} uses all defaults.
  • First run downloads benchmark data from the HuggingFace Hub and caches it in data/offline/{hash}/.
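The join between the two lists can be sketched as follows. This is a minimal illustration of the linkage only; the real parsing constructs InferenceModelSpec and MLEnergySource objects. CONFIG abbreviates the example config above:

```python
import json

# Abbreviated copy of the example config.json above.
CONFIG = """
{
  "model_specs": [
    {"model_label": "Llama-3.1-8B",
     "model_id": "meta-llama/Llama-3.1-8B-Instruct",
     "gpus_per_replica": 1,
     "itl_deadline_s": 0.08,
     "feasible_batch_sizes": [8, 16, 32, 64, 128, 256, 512]}
  ],
  "data_sources": [
    {"model_label": "Llama-3.1-8B",
     "task": "lm-arena-chat",
     "gpu": "H100",
     "batch_sizes": [8, 16, 32, 64, 96, 128, 192, 256, 384, 512]}
  ],
  "training_trace_params": {}
}
"""

cfg = json.loads(CONFIG)
specs = {s["model_label"]: s for s in cfg["model_specs"]}
# Pair each benchmark source with the model spec it describes,
# joined on model_label.
linked = [(src, specs[src["model_label"]]) for src in cfg["data_sources"]]
```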

All other configuration — datacenter sizing, controller tuning, workload scenarios, grid setup — is defined programmatically in each example script. See Building Simulators and examples/offline/systems.py for details. For running simulations, see Quickstart and the Examples documentation.

Lazy Generation and Caching

Each data class provides an ensure() classmethod that combines generate-if-missing and load into a single call:

# First run: generates data to data_dir, then loads it.
# Subsequent runs: loads directly from cache.
inference_data = InferenceData.ensure(
    data_dir, models, data_sources,
    dt_s=0.1,
)
training_trace = TrainingTrace.ensure(
    data_dir / "training_trace.csv",
    training_trace_params,
)
logistic_models = LogisticModelStore.ensure(
    data_dir / "logistic_fits.csv",
    models, data_sources,
)

Under the hood, ensure() checks whether the output file or directory exists. If not, it calls generate().save() to create the artifacts. Then it calls load() to return the ready-to-use object.
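That control flow can be sketched in a few lines. The real classes have richer generate/save/load signatures; this only illustrates the pattern:

```python
from pathlib import Path

class EnsureMixin:
    """Illustrative generate-if-missing / load pattern, not OpenG2G's
    actual base class."""

    @classmethod
    def ensure(cls, path, *args, **kwargs):
        path = Path(path)
        if not path.exists():
            # First run: generate the artifact and persist it.
            cls.generate(*args, **kwargs).save(path)
        # Every run: return the ready-to-use object from disk.
        return cls.load(path)
```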

Default data path

The helper load_data_sources() in examples/offline/systems.py computes a hash-based cache path from the data-relevant config keys (data sources and training trace parameters). Different configs automatically get different cache directories, so you can switch configs without manually clearing the cache.
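A sketch of how such a hash-based cache path could be derived. The key names and digest length here are assumptions for illustration, not load_data_sources() internals:

```python
import hashlib
import json
from pathlib import Path

def cache_dir(base, config):
    # Only data-relevant keys feed the hash, so edits to unrelated
    # config sections keep pointing at the same cache directory.
    relevant = {k: config[k] for k in ("data_sources", "training_trace_params")}
    blob = json.dumps(relevant, sort_keys=True).encode()
    return Path(base) / hashlib.sha256(blob).hexdigest()[:12]
```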

Inference Data Generation

InferenceData.generate() uses the mlenergy-data toolkit to download and process GPU benchmark data from the ML.ENERGY Benchmark v3 dataset.

For each model and batch size, it:

  1. Extracts power timelines from benchmark runs
  2. Resamples to a median-duration grid
  3. Fits ITLMixtureModel distributions per batch size
ML.ENERGY Benchmark Dataset                 mlenergy-data
    (Hugging Face hub)                         toolkit

  ┌─────────────────────┐
  │ results.json        │       LLMRuns.from_hf()
  │ (power, latency,    │────────────────────────────>┐
  │  throughput, ITL)   │    Load, filter, validate   │
  │  per model × batch  │                             │
  └─────────────────────┘                             │
                                                      v
                                  ┌───────────────────────────────────┐
         Config file              │ InferenceData.generate()          │
  ┌─────────────────────┐         │                                   │
  │ config.json         │         │ For each model x batch size:      │
  │                     │────────>│  1. Extract power timelines       │
  │ model_specs[] +     │         │  2. Resample to median-duration   │
  │ data_sources[]      │         │  3. Fit ITLMixtureModel           │
  └─────────────────────┘         └───────────┬───────────────────────┘
                                              v
                        ┌─────────────────────────────────┐
                        │ data/offline/{hash}/            │
                        │                                 │
                        │  traces/*.csv                   │  <── GPU power timeseries
                        │  traces_summary.csv             │  <── Trace manifest
                        │  latency_fits.csv               │  <── ITL distribution params
                        │  _manifest.json                 │  <── Version stamp
                        └─────────────────────────────────┘
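Step 2 above (resampling to a median-duration grid) can be sketched as follows, assuming each raw timeline is a pair of (time, power) arrays; the actual processing lives inside InferenceData.generate():

```python
import numpy as np

def resample_to_median_grid(timelines, dt_s=0.1):
    # Shared grid length = median run duration across benchmark repetitions
    # (an assumption about the implementation's details).
    median_dur = float(np.median([t[-1] for t, _ in timelines]))
    grid = np.arange(0.0, median_dur, dt_s)
    # Linearly interpolate every timeline onto the shared grid.
    return grid, [np.interp(grid, t, p) for t, p in timelines]
```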

Logistic Curve Fitting

LogisticModelStore.generate() fits four-parameter logistic curves to power, latency, and throughput versus batch size:

\[p(x) = \frac{P_{\max}}{1 + \exp(-k_p(x - x_{0,p}))} + p_0, \quad x \triangleq \log_2(b)\]

where \(P_{\max}\) is the saturation magnitude, \(k_p\) controls transition sharpness, \(x_{0,p}\) is the characteristic batch size threshold, and \(p_0\) is an offset term. Latency and throughput use the same functional form with their own parameters.
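Evaluated in code, the fit looks like this. This is a sketch of the functional form only; the actual LogisticModel class lives in mlenergy-data:

```python
import numpy as np

def logistic_power(b, P_max, k_p, x0_p, p0):
    # Four-parameter logistic in log2 batch-size space, matching the
    # equation above: p(x) = P_max / (1 + exp(-k_p * (x - x0_p))) + p0.
    x = np.log2(b)
    return P_max / (1.0 + np.exp(-k_p * (x - x0_p))) + p0
```

At the characteristic batch size threshold b = 2^x0, the curve passes through exactly P_max / 2 + p0.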

OpenG2G uses LogisticModel from mlenergy-data at both stages: the curves are fitted during data generation and evaluated again at simulation time.

ITL Mixture Model

Historical ITL measurements exhibit heavy-tailed behavior. The generation step captures this using a weighted mixture of two lognormal distributions per batch size.
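Sampling from such a two-component mixture can be sketched as follows. Parameter names here are illustrative, not the real ITLMixtureModel API:

```python
import numpy as np

def sample_itl_mixture(rng, n, weight, mu1, sigma1, mu2, sigma2):
    # Choose a component per sample with probability (weight, 1 - weight),
    # then draw a lognormal ITL from it; a heavier-tailed second component
    # captures occasional slow tokens.
    use_first = rng.random(n) < weight
    return np.where(
        use_first,
        rng.lognormal(mu1, sigma1, n),
        rng.lognormal(mu2, sigma2, n),
    )
```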

OpenG2G uses ITLMixtureModel from mlenergy-data at both stages: the mixtures are fitted during data generation and sampled from at simulation time.

Training Trace Generation

TrainingTrace.generate() synthesizes a training power trace with configurable high/low plateaus, noise, brief dips, and a warm-up ramp. Generation is based on characteristics derived from real large model training measurements.

Parameters are controlled via TrainingTraceParams. The empty dict {} in the config uses all defaults.
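A hypothetical sketch of the synthesized shape, covering the plateaus, noise, and warm-up ramp described above (brief dips omitted for brevity). Parameter names here are illustrative, not the real TrainingTraceParams fields:

```python
import numpy as np

def synth_training_trace(n_steps, p_low, p_high, period, noise_std,
                         ramp_steps, rng):
    t = np.arange(n_steps)
    # Alternate between high and low power plateaus every `period` steps.
    plateau = np.where((t // period) % 2 == 0, p_high, p_low)
    # Linear warm-up ramp toward full power over the first ramp_steps.
    ramp = np.clip(t / max(ramp_steps, 1), 0.0, 1.0)
    return plateau * ramp + rng.normal(0.0, noise_std, n_steps)
```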

Dataset Access

The mlenergy-data toolkit automatically downloads benchmark data from the ML.ENERGY Benchmark v3 dataset on first run.

To use the dataset:

  1. Request access on Hugging Face
  2. Create a Hugging Face access token
  3. Set the HF_TOKEN environment variable to your token before running

Runtime Integration

At simulation time, the generated artifacts are consumed by two components: