jna/tp_dnn.cpp · 6c9c24863fa7cf6a3e6d520e5c1d22d44767f2df · Elphel / tile_processor_gpu

CLAUDE: tp_dnn GPU_CHUNK sub-batching - fix full-res OOM · 6c9c2486

Andrey Filippov authored Jun 27, 2026

tpdnn_infer processed the whole `count`-scene request in one shot. At full res
(512x640) a 64-scene field tensor is ~10GB (64x124x512x640) and decode p adds another
~10GB, which causes a CUDA OOM on a 16GB card. The 64x80 synthetic smoke test didn't
expose it.
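The ~10GB figure can be checked with a few lines of arithmetic. This is only a sketch: `tensor_bytes` is a hypothetical helper, and the 4 bytes per element (float32) is an assumption, since the element type is not stated here.

```cpp
#include <cstdint>

// Bytes for a dense [scenes x channels x height x width] tensor.
// bytes_per_elem = 4 assumes float32 (an assumption, not stated in the commit).
constexpr std::uint64_t tensor_bytes(std::uint64_t scenes, std::uint64_t channels,
                                     std::uint64_t height, std::uint64_t width,
                                     std::uint64_t bytes_per_elem = 4) {
    return scenes * channels * height * width * bytes_per_elem;
}

// One-shot request: 64 scenes -> 10,401,873,920 bytes (~10.4 GB).
static_assert(tensor_bytes(64, 124, 512, 640) == 10401873920ULL, "full batch");
// One 8-scene sub-batch of the same tensor: 1,300,234,240 bytes (~1.3 GB).
static_assert(tensor_bytes(8, 124, 512, 640) == 1300234240ULL, "one chunk");
```

Under that assumption the field tensor alone uses about 10.4 GB for 64 scenes and about 1.3 GB for 8 scenes, so two tensors of that size exceed a 16GB card in the one-shot case.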

tpdnn_infer now loops over GPU_CHUNK-sized sub-batches (same value as infer_server's
GPU_CHUNK), writing each chunk straight to the host output buffers so GPU tensors stay
bounded (~3-4GB/chunk at CHUNK=8). L2 hidden/age/sprev carry across chunks; they are
reset only at the first chunk, and only when l2_reset is set (matches the server). The
chunk size is read from the env variable TPDNN_GPU_CHUNK (default 8).
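The loop structure described above can be sketched as follows. This is not the code in tp_dnn.cpp: `L2State`, `ChunkFn`, `run_chunked` and `gpu_chunk_from_env` are hypothetical names, the state is reduced to two counters, and falling back to 8 on an unparsable TPDNN_GPU_CHUNK value is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <functional>
#include <vector>

// Hypothetical stand-in for the recurrent L2 state (hidden/age/sprev).
struct L2State {
    int resets = 0;       // how many times the state was reset
    int scenes_seen = 0;  // scenes accumulated since the last reset
};

// Sub-batch size from TPDNN_GPU_CHUNK; 8 when unset or not a positive integer
// (the handling of invalid values is an assumption).
inline int gpu_chunk_from_env() {
    const char* s = std::getenv("TPDNN_GPU_CHUNK");
    if (!s) return 8;
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    return (end != s && v > 0 && v <= 1000000) ? static_cast<int>(v) : 8;
}

// Callback that runs inference on scenes [first, first + n) and updates state.
using ChunkFn = std::function<void(int first, int n, L2State& state)>;

// Processes `count` scenes in sub-batches of at most `chunk` scenes.
// The state is reset only before the first chunk, and only if l2_reset is true;
// otherwise it carries over from chunk to chunk. Returns the chunk sizes used.
inline std::vector<int> run_chunked(int count, int chunk, bool l2_reset,
                                    L2State& state, const ChunkFn& infer_chunk) {
    assert(static_cast<bool>(infer_chunk));
    std::vector<int> sizes;
    if (count <= 0 || chunk <= 0) return sizes;
    for (int first = 0; first < count; first += chunk) {
        const int n = std::min(chunk, count - first);
        if (first == 0 && l2_reset) {
            state.scenes_seen = 0;
            ++state.resets;
        }
        // In the real code this step would also copy the chunk's outputs
        // to the host buffers before the next chunk is launched.
        infer_chunk(first, n, state);
        sizes.push_back(n);
    }
    return sizes;
}
```

With 12 scenes, a chunk size of 8 yields sub-batches of 8 and 4, and a chunk size of 4 yields 4, 4 and 4; in both cases the state is reset at most once, before the first sub-batch.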

Validated: parity vs server oracle is still EXACT (0.0) at CHUNK=8 (split 8+4) and
CHUNK=4 (split 4+4+4), so L2 carry across chunk boundaries is correct.
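A parity check of this kind can be sketched as a maximum absolute difference between two output buffers. `max_abs_diff` is a hypothetical helper, and treating the reported 0.0 as a max-abs-difference metric is an assumption about how the comparison was done.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest absolute element-wise difference between two buffers.
// Returns +infinity when the sizes differ. NaN elements are not detected
// by this sketch, because comparisons against NaN are always false.
inline double max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return std::numeric_limits<double>::infinity();
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = std::fabs(static_cast<double>(a[i]) - static_cast<double>(b[i]));
        worst = std::max(worst, d);
    }
    return worst;
}
```

A result of exactly 0.0 means every pair of elements compared equal, which is a stricter criterion than the tolerance-based comparison usually applied to floating-point outputs.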

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
