Deploying Vision Models at the Edge: A Deep Dive into Quantization and TensorRT Optimization
Practical strategies for deploying complex vision models on edge devices while maintaining accuracy. Covers INT8 quantization, TensorRT optimization, and real-world benchmarks on Jetson and Hailo platforms.
Running complex vision models on edge devices presents a fundamental engineering challenge: how do you maintain the accuracy of a 200MB floating-point model while running it at 30 FPS on a device with limited compute and memory? This post documents our journey deploying multi-person pose estimation and tracking models on NVIDIA Jetson and Hailo-8 platforms.
The Edge Deployment Challenge
Our production pipeline consists of three major components:
- Person Detection: YOLOv8-based detector fine-tuned for fisheye distortion
- Pose Estimation: HRNet-W32 for 17-keypoint skeleton estimation
- Multi-Object Tracking: ByteTrack with appearance features
Running these sequentially on a Jetson Orin Nano (40 TOPS INT8) with FP32 models yields approximately 3 FPS, far too slow for real-time applications. Our target: 25+ FPS with minimal accuracy degradation.
Understanding Quantization Fundamentals
The Mathematics of INT8 Quantization
Quantization maps floating-point weights and activations to lower-precision integers:

$$q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ q_{\min},\ q_{\max}\right)$$

Where:
- $s$ is the scale factor
- $z$ is the zero-point offset
- $x$ is the original FP32 value

The inverse operation recovers an approximation:

$$\hat{x} = s \cdot (q - z)$$
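To make the round trip concrete, here is a minimal NumPy sketch of the asymmetric scheme above; the values are illustrative, and real pipelines compute per-tensor or per-channel scales during calibration:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # q = clip(round(x / s) + z, qmin, qmax)
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # x_hat = s * (q - z)
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.2, 0.0, 0.5, 2.7], dtype=np.float32)
scale = (x.max() - x.min()) / 255            # MinMax-style scale
zero_point = -128 - round(x.min() / scale)   # aligns x.min() with qmin
q = quantize(x, scale, zero_point)
print(q)                                                    # [-128 -50 -17 127]
print(np.abs(x - dequantize(q, scale, zero_point)).max())   # error <= scale / 2
```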
The key challenge is determining optimal scale factors that minimize quantization error while preserving model accuracy.
Calibration Strategies
We evaluated three calibration approaches for determining scale factors:
1. MinMax Calibration
```python
scale = (max_val - min_val) / (qmax - qmin)
zero_point = qmin - round(min_val / scale)
```
Simple but sensitive to outliers. A single activation spike can dramatically reduce effective precision for the majority of values.
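A quick synthetic illustration of the problem:

```python
import numpy as np

acts = np.random.normal(0, 1, 100_000).astype(np.float32)
scale_clean = (acts.max() - acts.min()) / 255       # ~0.035: fine resolution

spiked = np.append(acts, 50.0)                      # one outlier activation
scale_spiked = (spiked.max() - spiked.min()) / 255  # ~0.21: ~6x coarser

# With the spike, values within +/-3 sigma occupy only ~30 of the 256
# available levels; the rest of the INT8 range encodes almost nothing.
print(scale_clean, scale_spiked)
```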
2. Entropy Calibration (KL Divergence)
Minimizes the information loss between the original FP32 distribution $P$ and the quantized distribution $Q$:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

TensorRT's default calibrator uses this approach, searching over clip thresholds on a 2048-bin activation histogram collapsed to 128 quantized levels.
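For intuition, a simplified sketch of that threshold search follows; TensorRT's actual implementation treats empty bins and bin expansion more carefully, so treat this as illustrative rather than a reference implementation:

```python
import numpy as np

def entropy_calibrate(activations, num_bins=2048, num_levels=128):
    """Pick the clip threshold whose quantized distribution loses the
    least information relative to the reference distribution."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_kl, best_threshold = np.inf, edges[-1]

    for i in range(num_levels, num_bins + 1):
        # P: reference distribution clipped at edges[i], tail folded in
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()

        # Q: P collapsed to num_levels quantization levels, expanded back
        q = np.empty(i)
        step = i / num_levels
        for j in range(num_levels):
            lo, hi = int(j * step), int((j + 1) * step)
            q[lo:hi] = p[lo:hi].mean()

        p, q = p / p.sum(), np.maximum(q / q.sum(), 1e-12)
        mask = p > 0
        kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))  # KL(P || Q)
        if kl < best_kl:
            best_kl, best_threshold = kl, edges[i]

    return best_threshold  # symmetric INT8 scale = threshold / 127
```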
3. Percentile Calibration
Clips outliers by using the 99.99th percentile instead of true min/max:
```python
import numpy as np

def percentile_calibration(tensor, percentile=99.99):
    # Clip to the central percentile range instead of the true min/max
    lower = np.percentile(tensor, 100 - percentile)
    upper = np.percentile(tensor, percentile)
    scale = (upper - lower) / 255        # 256 levels, asymmetric
    zero_point = round(-lower / scale)
    return scale, zero_point
```
This proved most effective for our pose estimation models, which exhibit long-tailed activation distributions.
TensorRT Optimization Pipeline
Step 1: ONNX Export with Dynamic Axes
Export PyTorch models with explicit dynamic dimensions:
```python
import torch
import torch.onnx

def export_pose_model(model, output_path):
    model.eval()
    dummy_input = torch.randn(1, 3, 384, 288).cuda()
    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        input_names=['input'],
        output_names=['heatmaps', 'offsets'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'heatmaps': {0: 'batch_size'},
            'offsets': {0: 'batch_size'}
        },
        opset_version=17,
        do_constant_folding=True
    )
```
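Before building an engine, it is worth sanity-checking the export. A minimal verification sketch (the file name is a placeholder):

```python
import numpy as np
import onnx
import onnxruntime as ort

# Structural validation of the exported graph
onnx.checker.check_model(onnx.load('pose_model.onnx'))

# Run a random input through ONNX Runtime and inspect the output shapes
sess = ort.InferenceSession('pose_model.onnx',
                            providers=['CPUExecutionProvider'])
x = np.random.randn(1, 3, 384, 288).astype(np.float32)
heatmaps, offsets = sess.run(None, {'input': x})
print(heatmaps.shape, offsets.shape)
```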
Step 2: TensorRT Engine Building
Build optimized engines with INT8 precision:
```python
import tensorrt as trt

def build_engine(onnx_path, engine_path, calibrator):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f'Failed to parse {onnx_path}')

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # Enable INT8 with calibration
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator

    # Enable FP16 fallback for sensitive layers
    config.set_flag(trt.BuilderFlag.FP16)

    # Build and serialize
    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)
```
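To load the result for inference, deserialize it through a `trt.Runtime` (the engine file name here is a placeholder):

```python
logger = trt.Logger(trt.Logger.WARNING)
with open('pose_model_int8.engine', 'rb') as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```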
Step 3: Custom Calibrator Implementation
```python
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context

class PoseCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file):
        super().__init__()
        self.data_loader = iter(data_loader)
        self.cache_file = cache_file
        self.batch_size = 8
        # Allocate device memory: batch * C * H * W * sizeof(float32)
        self.device_input = cuda.mem_alloc(
            self.batch_size * 3 * 384 * 288 * 4
        )

    def get_batch_size(self):
        # Required by the calibrator interface
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.data_loader)
            cuda.memcpy_htod(self.device_input,
                             np.ascontiguousarray(batch.numpy()))
            return [int(self.device_input)]
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
```
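Wiring the calibrator into the build might look like this; `calibration_dataset` is a placeholder for whatever Dataset yields your preprocessed frames:

```python
from torch.utils.data import DataLoader

# calibration_dataset: yields preprocessed (3, 384, 288) FP32 tensors drawn
# from representative footage (placeholder). drop_last=True matters because
# the calibrator assumes a fixed batch size of 8.
loader = DataLoader(calibration_dataset, batch_size=8, drop_last=True)

calibrator = PoseCalibrator(loader, cache_file='pose_int8.cache')
build_engine('pose_model.onnx', 'pose_model_int8.engine', calibrator)
```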
Layer-wise Precision Analysis
Not all layers quantize equally well. We developed a systematic approach to identify problematic layers:
Sensitivity Analysis Protocol
```python
def analyze_layer_sensitivity(model, calibration_data, metric_fn):
    """
    Quantize one layer at a time and measure the accuracy impact.
    """
    baseline = metric_fn(model, precision='fp32')
    sensitivities = {}

    for layer_name in model.get_quantizable_layers():
        # Quantize only this layer, leaving everything else in FP32
        model.set_layer_precision(layer_name, 'int8')
        score = metric_fn(model, precision='mixed')
        sensitivities[layer_name] = baseline - score
        model.set_layer_precision(layer_name, 'fp32')  # restore

    return sorted(sensitivities.items(), key=lambda x: x[1], reverse=True)
```
Results: Sensitive Layers in HRNet
| Layer | Sensitivity Score | Action |
|---|---|---|
| stage4.fuse_layers.3.3 | 0.082 | Keep FP16 |
| final_layer.conv | 0.071 | Keep FP16 |
| stage3.fuse_layers.2.2 | 0.043 | Keep FP16 |
| stage2.branches.1.0.conv1 | 0.008 | Quantize INT8 |
| ... | ... | ... |
By keeping only 3 layers in FP16 (2% of total layers), we preserved 99.1% of FP32 accuracy while gaining most of the INT8 speedup.
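TensorRT can enforce this mixed-precision assignment at build time. The sketch below would run between parsing and building inside build_engine() above; the substring matching against layer names is an assumption about how the exporter names ONNX nodes, so verify it against your network:

```python
import tensorrt as trt

def pin_sensitive_layers(network, config, fp16_layer_names):
    # Make TensorRT honor per-layer precision requests instead of
    # treating them as hints
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if any(name in layer.name for name in fp16_layer_names):
            layer.precision = trt.float16
            layer.set_output_type(0, trt.float16)

# e.g. pin_sensitive_layers(network, config,
#     ['stage4.fuse_layers.3.3', 'final_layer.conv', 'stage3.fuse_layers.2.2'])
```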
Memory Optimization Techniques
1. Activation Checkpointing
For multi-stage networks, recompute intermediate activations instead of storing them:
```python
import torch
import torch.nn as nn
import torch.utils.checkpoint

class CheckpointedHRNet(nn.Module):
    def forward(self, x):
        # Stages 1-2: normal forward pass
        x = self.stage1(x)
        x = self.stage2(x)

        # Stages 3-4: activations recomputed on demand instead of stored
        x = torch.utils.checkpoint.checkpoint(
            self.stage3, x, use_reentrant=False
        )
        x = torch.utils.checkpoint.checkpoint(
            self.stage4, x, use_reentrant=False
        )
        return x
```
Memory reduction: 40% with 15% compute overhead.
2. Multi-stream Inference
Overlap data transfer and computation using CUDA streams:
```python
import pycuda.driver as cuda

class PipelinedInference:
    def __init__(self, engine, num_streams=2):
        self.streams = [cuda.Stream() for _ in range(num_streams)]
        self.contexts = [engine.create_execution_context()
                         for _ in range(num_streams)]
        # _allocate_buffers() (omitted) creates host/device buffers and
        # the bindings list for one stream
        self.buffers = [self._allocate_buffers()
                        for _ in range(num_streams)]

    def infer_async(self, inputs):
        results = []
        for i, inp in enumerate(inputs):
            stream_idx = i % len(self.streams)
            stream = self.streams[stream_idx]
            ctx = self.contexts[stream_idx]
            bufs = self.buffers[stream_idx]

            # Async host-to-device copy of the input
            cuda.memcpy_htod_async(bufs['input'], inp, stream)

            # Enqueue inference on this stream
            ctx.execute_async_v2(
                bindings=bufs['bindings'],
                stream_handle=stream.handle
            )

            # Async device-to-host copy of the output
            cuda.memcpy_dtoh_async(bufs['output'], bufs['output_d'], stream)
            results.append((stream, bufs['output']))
        return results
```
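One caveat: each stream's buffers are reused across batches, so the caller must synchronize a stream before touching its host output, and copy it out when more batches than streams are in flight. For example:

```python
pipeline = PipelinedInference(engine, num_streams=2)   # engine from earlier
pending = pipeline.infer_async(input_batches)          # list of input arrays

outputs = []
for stream, host_output in pending:
    stream.synchronize()                # wait for the async D2H copy
    outputs.append(host_output.copy())  # buffer is reused on the next batch
```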
3. Unified Memory for Large Batches
For batch sizes exceeding GPU memory:
```python
# Allocate managed (unified) memory, addressable from both CPU and GPU
buf = cuda.managed_empty(shape, np.float32,
                         mem_flags=cuda.mem_attach_flags.GLOBAL)
```
Allows automatic page migration between CPU and GPU, enabling larger batch processing at the cost of some latency.
Benchmark Results
Jetson Orin Nano (40 TOPS)
| Model | FP32 | FP16 | INT8 | INT8 + Optimizations |
|---|---|---|---|---|
| YOLOv8s (640x640) | 8.2 FPS | 22.1 FPS | 35.4 FPS | 41.2 FPS |
| HRNet-W32 (384x288) | 4.1 FPS | 11.3 FPS | 24.7 FPS | 28.9 FPS |
| ByteTrack | 89.2 FPS | 91.1 FPS | 92.3 FPS | 94.1 FPS |
| Full Pipeline | 2.9 FPS | 7.8 FPS | 18.2 FPS | 25.7 FPS |
Accuracy Comparison (COCO val2017)
| Model | FP32 AP | INT8 AP | Degradation |
|---|---|---|---|
| YOLOv8s Detection | 44.9 | 44.2 | -0.7 |
| HRNet-W32 Pose | 74.4 | 73.8 | -0.6 |
| Combined mAP | 67.2 | 66.4 | -0.8 |
Hailo-8 Deployment Notes
The Hailo-8 accelerator (26 TOPS) uses a different compilation flow:
```bash
# Compile ONNX to Hailo Executable Format (HEF)
hailo compiler pose_model.onnx \
  --hw-arch hailo8 \
  --calib-set calibration_data.npy \
  --output pose_model.hef
```
Key differences from TensorRT:
- Uses proprietary quantization, less control over per-layer precision
- Requires Hailo Dataflow Compiler for optimization
- Better power efficiency (2.5W vs Jetson's 7-15W)
Benchmark: 31 FPS for the full pipeline at 2.5W power consumption.
Production Deployment Checklist
- Calibration Data Quality: Use 500-1000 representative images, covering edge cases
- Thermal Management: INT8 runs hotter; ensure adequate cooling
- Precision Fallback: Keep an FP16 engine for debugging discrepancies
- Version Pinning: Lock TensorRT, CUDA, and cuDNN versions
- Monitoring: Log inference times and detect thermal throttling (see the sketch after this list)
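For the monitoring item, a minimal sketch assuming a standard Linux thermal zone; the zone index and the 85 C limit are placeholders, so check your board's documentation:

```python
import time
from pathlib import Path

# Thermal zone path is board-specific; zone0 is a placeholder index
THERMAL_ZONE = Path('/sys/class/thermal/thermal_zone0/temp')

def timed_infer(infer_fn, frame, temp_limit_c=85.0):
    start = time.perf_counter()
    result = infer_fn(frame)
    latency_ms = (time.perf_counter() - start) * 1e3

    temp_c = int(THERMAL_ZONE.read_text()) / 1000.0  # reported in millidegrees
    if temp_c > temp_limit_c:
        print(f'WARNING: {temp_c:.1f} C near thermal limit, expect throttling')
    print(f'latency={latency_ms:.1f} ms temp={temp_c:.1f} C')
    return result
```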
Conclusion
Edge deployment of complex vision models is achievable with systematic optimization. The key insights:
- Percentile calibration outperformed entropy calibration on our long-tailed activation distributions
- Selective FP16 layers (< 5%) preserve accuracy with minimal speed impact
- Multi-stream inference provides 15-20% throughput improvement
- Combined optimizations achieved 8.8x speedup over FP32 baseline
Our production systems now run reliably at 25+ FPS on Jetson Orin Nano, enabling real-time spatial intelligence in resource-constrained environments.