Andrey Filippov authored
tpdnn_infer processed the whole `count`-scene request in one shot. At full resolution (512x640) a 64-scene field tensor is ~10GB (64x124x512x640), plus decode p at ~10GB, which causes a CUDA OOM on a 16GB card. The 64x80 synthetic smoke test did not expose this. tpdnn_infer now loops in GPU_CHUNK-sized sub-batches (the same value as infer_server's GPU_CHUNK) and writes each chunk straight to the host output buffers, so GPU tensors stay bounded (~3-4GB per chunk at CHUNK=8). L2 hidden/age/sprev carry across chunks and are reset only at the first chunk when l2_reset is set (matches the server). The chunk size comes from the env variable TPDNN_GPU_CHUNK (default 8). Validated: parity vs the server oracle is still EXACT (0.0) at CHUNK=8 (8+4) and CHUNK=4 (4+4+4), so the L2 carry across chunk boundaries is correct.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
6c9c2486
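
The chunking scheme described in the commit message can be illustrated with a minimal, CPU-only sketch. It is not the code in tp_dnn.cpp: `CarryState`, `field_bytes`, `gpu_chunk_from_env`, `infer_chunk`, and `infer_chunked` are hypothetical names, the per-scene recurrence is a toy stand-in for the real network, and the tensor is assumed to be float32 (4 bytes per element), which is consistent with the ~10GB figure. Only the env variable name `TPDNN_GPU_CHUNK` and its default of 8 come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the recurrent L2 state (hidden/age/sprev in the commit).
struct CarryState {
    double hidden = 0.0;
};

// Size in bytes of a float32 tensor of shape scenes x channels x height x width.
// Assumes 4-byte elements; 64x124x512x640 gives roughly 10.4e9 bytes.
std::uint64_t field_bytes(std::uint64_t scenes, std::uint64_t channels,
                          std::uint64_t height, std::uint64_t width) {
    return scenes * channels * height * width * sizeof(float);
}

// Read the chunk size from TPDNN_GPU_CHUNK, falling back to 8 when the
// variable is unset or not a positive integer.
std::size_t gpu_chunk_from_env() {
    const char* s = std::getenv("TPDNN_GPU_CHUNK");
    if (s == nullptr) return 8;
    const long v = std::strtol(s, nullptr, 10);
    return v > 0 ? static_cast<std::size_t>(v) : 8;
}

// Toy per-scene recurrence: each output depends on the carried state,
// so a wrong reset at a chunk boundary would change the results.
void infer_chunk(const double* in, double* out, std::size_t n, CarryState& st) {
    for (std::size_t i = 0; i < n; ++i) {
        st.hidden = 0.5 * st.hidden + in[i];
        out[i] = st.hidden;
    }
}

// Process `in` in sub-batches of at most `chunk` scenes, writing each
// sub-batch directly into the full-size output buffer. The state is reset
// only before the first chunk, and only when l2_reset is true.
std::vector<double> infer_chunked(const std::vector<double>& in, std::size_t chunk,
                                  bool l2_reset, CarryState& st) {
    assert(chunk > 0);
    std::vector<double> out(in.size());
    for (std::size_t start = 0; start < in.size(); start += chunk) {
        const std::size_t n = std::min(chunk, in.size() - start);
        if (start == 0 && l2_reset) st = CarryState{};
        infer_chunk(in.data() + start, out.data() + start, n, st);
    }
    return out;
}
```

In this sketch, a 12-scene request processed with `chunk` = 12, 8 (8+4), or 4 (4+4+4) yields identical outputs, which mirrors the parity check reported in the commit message. Only one chunk's worth of intermediate data is live at a time, which is the property that keeps peak memory bounded.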
| Name | Last commit | Last update |
|---|---|---|
| .. | ||
| .gitignore | ||
| build_dnn.sh | ||
| build_lib.sh | ||
| build_probe.sh | ||
| fetch_libtorch.sh | ||
| publish_libtorch_mirror.sh | ||
| tp_dnn.cpp | ||
| tp_jna.cpp | ||
| tp_nvrtc_probe.cpp |