Andrey Filippov authored
tpdnn_infer processed the whole `count`-scene request in one shot. At full resolution (512x640) a 64-scene field tensor is ~10GB (64x124x512x640), plus ~10GB for decode p, which causes a CUDA OOM on a 16GB card. The 64x80 synthetic smoke test did not expose it.

Now loops in GPU_CHUNK-sized sub-batches (same as infer_server's GPU_CHUNK), writing each chunk straight to the host output buffers so GPU tensors stay bounded (~3-4GB per chunk at CHUNK=8). L2 hidden/age/sprev carry across chunks and are reset only at the first chunk when l2_reset is set (matches the server). Chunk size comes from env TPDNN_GPU_CHUNK (default 8).

Validated: parity vs the server oracle is still EXACT (0.0) at CHUNK=8 (8+4) and CHUNK=4 (4+4+4), so the L2 carry across chunk boundaries is correct.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
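
The chunking scheme described above can be illustrated with a minimal sketch. This is not the actual tpdnn_infer code. `run_chunk` and `infer_chunked` are hypothetical names, the "model" is a toy scalar recurrence standing in for the L2 hidden/age/sprev state, and plain Python lists stand in for GPU tensors and host output buffers. Only the `TPDNN_GPU_CHUNK` variable and its default of 8 come from the commit message.

```python
import os


def run_chunk(scenes, state):
    """Hypothetical stand-in for one GPU pass over a sub-batch.

    The toy recurrence makes each output depend on all earlier scenes,
    so any mistake in carrying state across chunks changes the result.
    """
    outputs = []
    for x in scenes:
        state = 0.5 * state + x
        outputs.append(state)
    return outputs, state


def infer_chunked(scenes, state, l2_reset, chunk=None):
    """Process scenes in sub-batches of `chunk`, carrying state across them."""
    if chunk is None:
        # Env var name and default taken from the commit message.
        chunk = int(os.environ.get("TPDNN_GPU_CHUNK", "8"))
    if chunk < 1:
        raise ValueError("chunk size must be >= 1")

    out = [0.0] * len(scenes)  # preallocated "host" output buffer
    for start in range(0, len(scenes), chunk):
        # Reset only at the first chunk, and only when requested.
        if start == 0 and l2_reset:
            state = 0.0
        chunk_out, state = run_chunk(scenes[start:start + chunk], state)
        # Write this chunk's results straight into the host buffer.
        out[start:start + len(chunk_out)] = chunk_out
    return out, state


if __name__ == "__main__":
    scenes = [float(i) for i in range(12)]
    reference, ref_state = run_chunk(scenes, 0.0)  # one-shot pass
    for chunk in (8, 4):  # 8+4 and 4+4+4 splits of 12 scenes
        out, final_state = infer_chunked(scenes, 123.0, True, chunk=chunk)
        print(chunk, out == reference and final_state == ref_state)
```

Because the chunked path performs the same operations in the same order as the one-shot path, the comparison in this sketch is exact rather than within a tolerance, which mirrors the kind of parity check the message reports.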
6c9c2486