    CLAUDE: tp_dnn GPU_CHUNK sub-batching - fix full-res OOM · 6c9c2486
    Andrey Filippov authored
    tpdnn_infer processed the whole `count`-scene request in one shot. At full res
    (512x640) a 64-scene field tensor is ~10GB (64x124x512x640) and decode p adds
    ~10GB, so CUDA hits OOM on a 16GB card. The 64x80 synthetic smoke test didn't
    expose it.
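
The ~10GB figure can be reproduced with a short size calculation. This is a minimal sketch that assumes 4-byte float32 elements (the commit does not state the dtype, but 4 bytes matches the quoted size); `tensor_bytes` and `to_gb` are hypothetical helper names, not functions from tp_dnn.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for a dense tensor of shape [scenes, channels, height, width].
// ASSUMPTION: elements are float32 (4 bytes); the commit does not say so.
constexpr std::uint64_t tensor_bytes(std::uint64_t scenes, std::uint64_t channels,
                                     std::uint64_t height, std::uint64_t width) {
    return scenes * channels * height * width * sizeof(float);
}

// Decimal gigabytes, for comparison with the rounded figures in the message.
constexpr double to_gb(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1e9;
}

// 64 scenes x 124 channels x 512 x 640 at full resolution.
constexpr std::uint64_t kFullBatchBytes = tensor_bytes(64, 124, 512, 640);
// The same field tensor restricted to an 8-scene sub-batch.
constexpr std::uint64_t kChunk8Bytes = tensor_bytes(8, 124, 512, 640);
```

Under that assumption the 64-scene field tensor is about 10.4GB and an 8-scene one is about 1.3GB, so the field tensor shrinks linearly with the sub-batch size.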
    
    tpdnn_infer now loops in GPU_CHUNK-sized sub-batches (== infer_server's GPU_CHUNK),
    writing each chunk straight to the host output buffers so GPU tensors stay bounded
    (~3-4GB/chunk at CHUNK=8). L2 hidden/age/sprev carry across chunks; they are reset
    only at the first chunk, and only when l2_reset is set (matches the server). The
    chunk size comes from env TPDNN_GPU_CHUNK (default 8).
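
The sub-batching described above can be sketched as a chunk plan. This is an illustration only, not the code in tp_dnn.cpp: `gpu_chunk_from_env`, `ChunkSpan`, and `plan_chunks` are hypothetical names, and the fallback to 8 on an invalid value is an assumption (the commit only states that the default is 8).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical helper: read TPDNN_GPU_CHUNK, using 8 when it is unset.
// ASSUMPTION: unparsable or non-positive values also fall back to 8.
int gpu_chunk_from_env() {
    const int kDefault = 8;
    const char* s = std::getenv("TPDNN_GPU_CHUNK");
    if (s == nullptr) return kDefault;
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    const bool ok = (end != s) && (*end == '\0') && v > 0 && v <= 1000000;
    return ok ? static_cast<int>(v) : kDefault;
}

// One sub-batch: scenes [first, first + size) of the request.
struct ChunkSpan {
    int first;
    int size;
    bool reset_l2;  // true only for the first chunk, and only if l2_reset
};

// Split a `count`-scene request into chunk-sized spans. The L2 state is
// reset at most once, so it carries over every later chunk boundary.
std::vector<ChunkSpan> plan_chunks(int count, int chunk, bool l2_reset) {
    assert(count >= 0 && chunk > 0);
    std::vector<ChunkSpan> spans;
    for (int first = 0; first < count; first += chunk) {
        const int size = std::min(chunk, count - first);
        spans.push_back({first, size, l2_reset && first == 0});
    }
    return spans;
}
```

In a loop built on such a plan, each span's results would be copied into the host output buffers at offset `first` before the next span runs, so only one chunk's tensors are resident on the GPU at a time.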
    
    Validated: parity vs server oracle still EXACT (0.0) at CHUNK=8 (8+4) and CHUNK=4
    (4+4+4) - L2 carry across chunk boundaries is correct.
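
Why the parity can be exactly 0.0 rather than merely small: if the state carried across chunk boundaries is the same state a one-shot run would hold at that point, every scene sees identical arithmetic in identical order. The sketch below shows this with a toy recurrence standing in for the L2 update. `ToyState`, `run_span`, `run_chunked`, and `max_abs_diff` are hypothetical names, and the recurrence is not the real model.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy stand-in for the carried L2 state (the real one has hidden/age/sprev).
struct ToyState {
    float hidden = 0.0f;
};

// Process scenes [first, first + size), writing results straight into `out`.
void run_span(const std::vector<float>& in, std::vector<float>& out,
              std::size_t first, std::size_t size, ToyState& st) {
    for (std::size_t i = first; i < first + size; ++i) {
        st.hidden = 0.5f * st.hidden + in[i];  // state depends on all earlier scenes
        out[i] = st.hidden;
    }
}

// Run the whole request in `chunk`-sized spans with one shared state object.
std::vector<float> run_chunked(const std::vector<float>& in, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<float> out(in.size());
    ToyState st;  // reset once, before the first chunk only
    for (std::size_t first = 0; first < in.size(); first += chunk) {
        run_span(in, out, first, std::min(chunk, in.size() - first), st);
    }
    return out;
}

// Largest element-wise absolute difference between two equal-length outputs.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        m = std::max(m, std::fabs(a[i] - b[i]));
    }
    return m;
}
```

Calling `run_chunked` with a chunk equal to the request length plays the role of the one-shot oracle. Comparing it against chunk sizes 8 and 4 on a 12-scene input mirrors the 8+4 and 4+4+4 splits in the validation note.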
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tp_dnn.cpp 12.2 KB