Deploying Vision Models at the Edge: A Deep Dive into Quantization and TensorRT Optimization
Practical strategies for deploying complex vision models on edge devices while maintaining accuracy. Covers INT8 quantization, TensorRT optimization, and real-world benchmarks on Jetson and Hailo platforms.
Running complex vision models on edge devices presents a fundamental engineering challenge: how do you maintain the accuracy of a 200MB floating-point model while running it at 30 FPS on a device with limited compute and memory? This post documents our journey deploying multi-person pose estimation and tracking models on NVIDIA Jetson and Hailo-8 platforms.
The Edge Deployment Challenge
Our production pipeline consists of three major components:
- Person Detection: YOLOv8-based detector fine-tuned for fisheye distortion
- Pose Estimation: HRNet-W32 for 17-keypoint skeleton estimation
- Multi-Object Tracking: ByteTrack with appearance features
Running these sequentially on a Jetson Orin Nano (40 TOPS INT8) with FP32 models yields approximately 3 FPS, far too slow for real-time applications. Our target: 25+ FPS with minimal accuracy degradation.
Understanding Quantization Fundamentals
The Mathematics of INT8 Quantization
Quantization maps floating-point weights and activations to lower-precision integers:

$$q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ q_{\min},\ q_{\max}\right)$$

Where:
- $s$ is the scale factor
- $z$ is the zero-point offset
- $x$ is the original FP32 value

The inverse operation recovers an approximation:

$$\hat{x} = s \cdot (q - z)$$
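To make the round trip concrete, here is a minimal NumPy sketch of the asymmetric scheme above; the values are illustrative, and real pipelines compute per-tensor or per-channel scales during calibration:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # q = clip(round(x / s) + z, qmin, qmax)
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # x_hat = s * (q - z)
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.2, 0.0, 0.5, 2.7], dtype=np.float32)
scale = (x.max() - x.min()) / 255            # MinMax-style scale
zero_point = -128 - round(x.min() / scale)   # aligns x.min() with qmin
q = quantize(x, scale, zero_point)
print(q)                                                    # [-128 -50 -17 127]
print(np.abs(x - dequantize(q, scale, zero_point)).max())   # error <= scale / 2
```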
The key challenge is determining optimal scale factors that minimize quantization error while preserving model accuracy.
Calibration Strategies
We evaluated three calibration approaches for determining scale factors:
1. MinMax Calibration
```python
scale = (max_val - min_val) / (qmax - qmin)
zero_point = qmin - round(min_val / scale)
```
Simple but sensitive to outliers. A single activation spike can dramatically reduce effective precision for the majority of values.
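A quick synthetic illustration of the problem:

```python
import numpy as np

acts = np.random.normal(0, 1, 100_000).astype(np.float32)
scale_clean = (acts.max() - acts.min()) / 255       # ~0.035: fine resolution

spiked = np.append(acts, 50.0)                      # one outlier activation
scale_spiked = (spiked.max() - spiked.min()) / 255  # ~0.21: ~6x coarser

# With the spike, values within +/-3 sigma occupy only ~30 of the 256
# available levels; the rest of the INT8 range encodes almost nothing.
print(scale_clean, scale_spiked)
```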
2. Entropy Calibration (KL Divergence)
Minimizes the information loss between the original FP32 distribution $P$ and the quantized distribution $Q$:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

TensorRT's default calibrator uses this approach, searching over clip thresholds on a 2048-bin activation histogram collapsed to 128 quantized levels.
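For intuition, a simplified sketch of that threshold search follows; TensorRT's actual implementation treats empty bins and bin expansion more carefully, so treat this as illustrative rather than a reference implementation:

```python
import numpy as np

def entropy_calibrate(activations, num_bins=2048, num_levels=128):
    """Pick the clip threshold whose quantized distribution loses the
    least information relative to the reference distribution."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_kl, best_threshold = np.inf, edges[-1]

    for i in range(num_levels, num_bins + 1):
        # P: reference distribution clipped at edges[i], tail folded in
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()

        # Q: P collapsed to num_levels quantization levels, expanded back
        q = np.empty(i)
        step = i / num_levels
        for j in range(num_levels):
            lo, hi = int(j * step), int((j + 1) * step)
            q[lo:hi] = p[lo:hi].mean()

        p, q = p / p.sum(), np.maximum(q / q.sum(), 1e-12)
        mask = p > 0
        kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))  # KL(P || Q)
        if kl < best_kl:
            best_kl, best_threshold = kl, edges[i]

    return best_threshold  # symmetric INT8 scale = threshold / 127
```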
3. Percentile Calibration
Clips outliers by using the 99.99th percentile instead of true min/max:
```python
import numpy as np

def percentile_calibration(tensor, percentile=99.99):
    # Clip to the central percentile range instead of the true min/max
    lower = np.percentile(tensor, 100 - percentile)
    upper = np.percentile(tensor, percentile)
    scale = (upper - lower) / 255        # 256 levels, asymmetric
    zero_point = round(-lower / scale)
    return scale, zero_point
```
This proved most effective for our pose estimation models, which exhibit long-tailed activation distributions.
TensorRT Optimization Pipeline
Step 1: ONNX Export with Dynamic Axes
Export PyTorch models with explicit dynamic dimensions:
```python
import torch
import torch.onnx

def export_pose_model(model, output_path):
    model.eval()
    dummy_input = torch.randn(1, 3, 384, 288).cuda()
    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        input_names=['input'],
        output_names=['heatmaps', 'offsets'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'heatmaps': {0: 'batch_size'},
            'offsets': {0: 'batch_size'}
        },
        opset_version=17,
        do_constant_folding=True
    )
```
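Before building an engine, it is worth sanity-checking the export. A minimal verification sketch (the file name is a placeholder):

```python
import numpy as np
import onnx
import onnxruntime as ort

# Structural validation of the exported graph
onnx.checker.check_model(onnx.load('pose_model.onnx'))

# Run a random input through ONNX Runtime and inspect the output shapes
sess = ort.InferenceSession('pose_model.onnx',
                            providers=['CPUExecutionProvider'])
x = np.random.randn(1, 3, 384, 288).astype(np.float32)
heatmaps, offsets = sess.run(None, {'input': x})
print(heatmaps.shape, offsets.shape)
```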
Step 2: TensorRT Engine Building
Build optimized engines with INT8 precision:
```python
import tensorrt as trt

def build_engine(onnx_path, engine_path, calibrator):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f'Failed to parse {onnx_path}')

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # Enable INT8 with calibration
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator

    # Enable FP16 fallback for sensitive layers
    config.set_flag(trt.BuilderFlag.FP16)

    # Build and serialize
    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)
```
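To load the result for inference, deserialize it through a `trt.Runtime` (the engine file name here is a placeholder):

```python
logger = trt.Logger(trt.Logger.WARNING)
with open('pose_model_int8.engine', 'rb') as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```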
Step 3: Custom Calibrator Implementation
```python
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context

class PoseCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file):
        super().__init__()
        self.data_loader = iter(data_loader)
        self.cache_file = cache_file
        self.batch_size = 8
        # Allocate device memory: batch * C * H * W * sizeof(float32)
        self.device_input = cuda.mem_alloc(
            self.batch_size * 3 * 384 * 288 * 4
        )

    def get_batch_size(self):
        # Required by the calibrator interface
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.data_loader)
            cuda.memcpy_htod(self.device_input,
                             np.ascontiguousarray(batch.numpy()))
            return [int(self.device_input)]
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
```
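Wiring the calibrator into the build might look like this; `calibration_dataset` is a placeholder for whatever Dataset yields your preprocessed frames:

```python
from torch.utils.data import DataLoader

# calibration_dataset: yields preprocessed (3, 384, 288) FP32 tensors drawn
# from representative footage (placeholder). drop_last=True matters because
# the calibrator assumes a fixed batch size of 8.
loader = DataLoader(calibration_dataset, batch_size=8, drop_last=True)

calibrator = PoseCalibrator(loader, cache_file='pose_int8.cache')
build_engine('pose_model.onnx', 'pose_model_int8.engine', calibrator)
```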
Layer-wise Precision Analysis
Not all layers quantize equally well. We developed a systematic approach to identify problematic layers:
Sensitivity Analysis Protocol
```python
def analyze_layer_sensitivity(model, calibration_data, metric_fn):
    """
    Quantize one layer at a time and measure the accuracy impact.
    """
    baseline = metric_fn(model, precision='fp32')
    sensitivities = {}

    for layer_name in model.get_quantizable_layers():
        # Quantize only this layer, leaving everything else in FP32
        model.set_layer_precision(layer_name, 'int8')
        score = metric_fn(model, precision='mixed')
        sensitivities[layer_name] = baseline - score
        model.set_layer_precision(layer_name, 'fp32')  # restore

    return sorted(sensitivities.items(), key=lambda x: x[1], reverse=True)
```
Results: Sensitive Layers in HRNet
| Layer | Sensitivity Score | Action |
|---|---|---|
| stage4.fuse_layers.3.3 | 0.082 | Keep FP16 |
| final_layer.conv | 0.071 | Keep FP16 |
| stage3.fuse_layers.2.2 | 0.043 | Keep FP16 |
| stage2.branches.1.0.conv1 | 0.008 | Quantize INT8 |
| ... | ... | ... |
By keeping only 3 layers in FP16 (2% of total layers), we preserved 99.1% of FP32 accuracy while gaining most of the INT8 speedup.
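TensorRT can enforce this mixed-precision assignment at build time. The sketch below would run between parsing and building inside build_engine() above; the substring matching against layer names is an assumption about how the exporter names ONNX nodes, so verify it against your network:

```python
import tensorrt as trt

def pin_sensitive_layers(network, config, fp16_layer_names):
    # Make TensorRT honor per-layer precision requests instead of
    # treating them as hints
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if any(name in layer.name for name in fp16_layer_names):
            layer.precision = trt.float16
            layer.set_output_type(0, trt.float16)

# e.g. pin_sensitive_layers(network, config,
#     ['stage4.fuse_layers.3.3', 'final_layer.conv', 'stage3.fuse_layers.2.2'])
```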
Memory Optimization Techniques
1. Activation Checkpointing
For multi-stage networks, recompute intermediate activations instead of storing them:
```python
import torch
import torch.nn as nn
import torch.utils.checkpoint

class CheckpointedHRNet(nn.Module):
    def forward(self, x):
        # Stages 1-2: normal forward pass
        x = self.stage1(x)
        x = self.stage2(x)

        # Stages 3-4: activations recomputed on demand instead of stored
        x = torch.utils.checkpoint.checkpoint(
            self.stage3, x, use_reentrant=False
        )
        x = torch.utils.checkpoint.checkpoint(
            self.stage4, x, use_reentrant=False
        )
        return x
```
Memory reduction: 40% with 15% compute overhead.
2. Multi-stream Inference
Overlap data transfer and computation using CUDA streams:
```python
import pycuda.driver as cuda

class PipelinedInference:
    def __init__(self, engine, num_streams=2):
        self.streams = [cuda.Stream() for _ in range(num_streams)]
        self.contexts = [engine.create_execution_context()
                         for _ in range(num_streams)]
        # _allocate_buffers() (omitted) creates host/device buffers and
        # the bindings list for one stream
        self.buffers = [self._allocate_buffers()
                        for _ in range(num_streams)]

    def infer_async(self, inputs):
        results = []
        for i, inp in enumerate(inputs):
            stream_idx = i % len(self.streams)
            stream = self.streams[stream_idx]
            ctx = self.contexts[stream_idx]
            bufs = self.buffers[stream_idx]

            # Async host-to-device copy of the input
            cuda.memcpy_htod_async(bufs['input'], inp, stream)

            # Enqueue inference on this stream
            ctx.execute_async_v2(
                bindings=bufs['bindings'],
                stream_handle=stream.handle
            )

            # Async device-to-host copy of the output
            cuda.memcpy_dtoh_async(bufs['output'], bufs['output_d'], stream)
            results.append((stream, bufs['output']))
        return results
```
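One caveat: each stream's buffers are reused across batches, so the caller must synchronize a stream before touching its host output, and copy it out when more batches than streams are in flight. For example:

```python
pipeline = PipelinedInference(engine, num_streams=2)   # engine from earlier
pending = pipeline.infer_async(input_batches)          # list of input arrays

outputs = []
for stream, host_output in pending:
    stream.synchronize()                # wait for the async D2H copy
    outputs.append(host_output.copy())  # buffer is reused on the next batch
```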
3. Unified Memory for Large Batches
For batch sizes exceeding GPU memory:
```python
# Allocate managed (unified) memory, addressable from both CPU and GPU
buf = cuda.managed_empty(shape, np.float32,
                         mem_flags=cuda.mem_attach_flags.GLOBAL)
```
Allows automatic page migration between CPU and GPU, enabling larger batch processing at the cost of some latency.
Benchmark Results
Jetson Orin Nano (40 TOPS)
| Model | FP32 | FP16 | INT8 | INT8 + Optimizations |
|---|---|---|---|---|
| YOLOv8s (640x640) | 8.2 FPS | 22.1 FPS | 35.4 FPS | 41.2 FPS |
| HRNet-W32 (384x288) | 4.1 FPS | 11.3 FPS | 24.7 FPS | 28.9 FPS |
| ByteTrack | 89.2 FPS | 91.1 FPS | 92.3 FPS | 94.1 FPS |
| Full Pipeline | 2.9 FPS | 7.8 FPS | 18.2 FPS | 25.7 FPS |
Accuracy Comparison (COCO val2017)
| Model | FP32 AP | INT8 AP | Degradation |
|---|---|---|---|
| YOLOv8s Detection | 44.9 | 44.2 | -0.7 |
| HRNet-W32 Pose | 74.4 | 73.8 | -0.6 |
| Combined mAP | 67.2 | 66.4 | -0.8 |
Hailo-8 Deployment Notes
The Hailo-8 accelerator (26 TOPS) uses a different compilation flow:
```bash
# Compile ONNX to Hailo Executable Format (HEF)
hailo compiler pose_model.onnx \
  --hw-arch hailo8 \
  --calib-set calibration_data.npy \
  --output pose_model.hef
```
Key differences from TensorRT:
- Uses proprietary quantization, less control over per-layer precision
- Requires Hailo Dataflow Compiler for optimization
- Better power efficiency (2.5W vs Jetson's 7-15W)
Benchmark: 31 FPS for the full pipeline at 2.5W power consumption.
Production Deployment Checklist
- Calibration Data Quality: Use 500-1000 representative images, covering edge cases
- Thermal Management: INT8 runs hotter; ensure adequate cooling
- Precision Fallback: Keep an FP16 engine for debugging discrepancies
- Version Pinning: Lock TensorRT, CUDA, and cuDNN versions
- Monitoring: Log inference times and detect thermal throttling (see the sketch after this list)
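For the monitoring item, a minimal sketch assuming a standard Linux thermal zone; the zone index and the 85 C limit are placeholders, so check your board's documentation:

```python
import time
from pathlib import Path

# Thermal zone path is board-specific; zone0 is a placeholder index
THERMAL_ZONE = Path('/sys/class/thermal/thermal_zone0/temp')

def timed_infer(infer_fn, frame, temp_limit_c=85.0):
    start = time.perf_counter()
    result = infer_fn(frame)
    latency_ms = (time.perf_counter() - start) * 1e3

    temp_c = int(THERMAL_ZONE.read_text()) / 1000.0  # reported in millidegrees
    if temp_c > temp_limit_c:
        print(f'WARNING: {temp_c:.1f} C near thermal limit, expect throttling')
    print(f'latency={latency_ms:.1f} ms temp={temp_c:.1f} C')
    return result
```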
Conclusion
Edge deployment of complex vision models is achievable with systematic optimization. The key insights:
- Percentile calibration outperformed entropy calibration on our long-tailed activation distributions
- Selective FP16 layers (< 5%) preserve accuracy with minimal speed impact
- Multi-stream inference provides 15-20% throughput improvement
- Combined optimizations achieved 8.8x speedup over FP32 baseline
Our production systems now run reliably at 25+ FPS on Jetson Orin Nano, enabling real-time spatial intelligence in resource-constrained environments.