CLAUDE: tp_dnn GPU_CHUNK sub-batching - fix full-res OOM
tpdnn_infer processed the whole `count`-scene request in one shot. At full res
(512x640) a 64-scene field tensor is ~10GB (64x124x512x640) and decode p adds
another ~10GB, which causes a CUDA OOM on a 16GB card. The 64x80 synthetic smoke
test didn't expose it.
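
The ~10GB figure can be checked with a quick back-of-envelope calculation. The
sketch below assumes float32 elements (4 bytes), which this message does not
state; `tensor_gb` is a hypothetical helper, not part of tp_dnn.

```python
# Rough size of the field tensor for a given number of scenes.
# Assumption: float32 storage (4 bytes per element).
BYTES_PER_ELEM = 4


def tensor_gb(scenes, channels=124, height=512, width=640):
    """Return the tensor size in GB (1e9 bytes) for `scenes` scenes."""
    return scenes * channels * height * width * BYTES_PER_ELEM / 1e9


full = tensor_gb(64)   # whole 64-scene request in one shot
chunk = tensor_gb(8)   # one sub-batch at CHUNK=8
print(f"64 scenes: {full:.1f} GB, 8 scenes: {chunk:.1f} GB")
```

Two tensors of the full-request size already exceed 16GB, while the same pair at
8 scenes stays well under the ~3-4GB/chunk budget.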
tpdnn_infer now loops over GPU_CHUNK-sized sub-batches (matching infer_server's
GPU_CHUNK), writing each chunk straight to the host output buffers so GPU tensors
stay bounded (~3-4GB/chunk at CHUNK=8). L2 hidden/age/sprev state carries across
chunks and is reset only at the first chunk when l2_reset is set (matches the
server). The chunk size comes from env TPDNN_GPU_CHUNK (default 8).
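
A minimal sketch of the chunked loop described above, in plain Python with a toy
recurrence standing in for the model. `run_chunk`, `infer_chunked`, and
`chunk_size` are hypothetical names for illustration, and clamping the chunk size
to at least 1 is an assumption. Only TPDNN_GPU_CHUNK, its default of 8, and the
reset-at-first-chunk rule come from this commit.

```python
import os


def chunk_size(default=8):
    # TPDNN_GPU_CHUNK and its default of 8 are from this commit;
    # clamping to >= 1 is an assumption.
    return max(1, int(os.environ.get("TPDNN_GPU_CHUNK", default)))


def run_chunk(chunk_scenes, state):
    # Hypothetical stand-in for one sub-batch of inference: a toy recurrence
    # whose `state` plays the role of the L2 hidden/age/sprev carry.
    outs = []
    for scene in chunk_scenes:
        state = 0.5 * state + scene
        outs.append(2.0 * state)
    return outs, state


def infer_chunked(scenes, chunk, l2_reset, state=0.0):
    # `out` stands in for the preallocated host output buffer.
    out = [0.0] * len(scenes)
    for start in range(0, len(scenes), chunk):
        end = min(start + chunk, len(scenes))
        if l2_reset and start == 0:
            state = 0.0  # reset only at the first chunk of the request
        outs, state = run_chunk(scenes[start:end], state)
        out[start:end] = outs  # write each chunk straight to the host buffer
    return out, state


scenes = [float(i) for i in range(12)]
one_shot = infer_chunked(scenes, 12, l2_reset=True)
split_8_4 = infer_chunked(scenes, 8, l2_reset=True)
split_4_4_4 = infer_chunked(scenes, 4, l2_reset=True)
print(one_shot == split_8_4 == split_4_4_4)
```

Because the state is threaded through every call and reset only at `start == 0`,
the per-scene operations run in the same order regardless of chunk size, which is
why the chunked results can match the one-shot result exactly.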
Validated: parity vs the server oracle is still EXACT (0.0) at CHUNK=8 (8+4 split)
and CHUNK=4 (4+4+4 split), so the L2 carry across chunk boundaries is correct.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>