Commits · 6c9c24863fa7cf6a3e6d520e5c1d22d44767f2df · Elphel / tile_processor_gpu

27 Jun, 2026 3 commits

CLAUDE: tp_dnn GPU_CHUNK sub-batching - fix full-res OOM · 6c9c2486

Andrey Filippov authored Jun 27, 2026

tpdnn_infer processed the whole `count`-scene request in one shot; at full res
(512x640) a 64-scene field tensor is ~10GB (64x124x512x640) + decode p ~10GB ->
CUDA OOM on a 16GB card. The 64x80 synthetic smoke didn't expose it.

Now loops in GPU_CHUNK-sized sub-batches (== infer_server's GPU_CHUNK), writing each
chunk straight to the host output buffers so GPU tensors stay bounded (~3-4GB/chunk
at CHUNK=8). L2 hidden/age/sprev carry across chunks; reset only at the first chunk
when l2_reset (matches the server). Env TPDNN_GPU_CHUNK (default 8).

Validated: parity vs server oracle still EXACT (0.0) at CHUNK=8 (8+4) and CHUNK=4
(4+4+4) - L2 carry across chunk boundaries is correct.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
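
The sub-batching and the memory arithmetic above can be sketched on the host as follows. This is an illustrative sketch only, not the code in tp_dnn.cpp: the helper names (`gpu_chunk_from_env`, `split_chunks`, `field_bytes`) are hypothetical, and float32 elements are assumed because that is what makes 64x124x512x640 come out near 10GB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical helper: sub-batch size from TPDNN_GPU_CHUNK, falling back to 8.
inline int gpu_chunk_from_env() {
    const char* s = std::getenv("TPDNN_GPU_CHUNK");
    const int v = s ? std::atoi(s) : 0;
    return v > 0 ? v : 8;
}

// Split `count` scenes into (start, size) sub-batches of at most `chunk` scenes.
inline std::vector<std::pair<int, int>> split_chunks(int count, int chunk) {
    assert(count >= 0 && chunk > 0);
    std::vector<std::pair<int, int>> out;
    for (int start = 0; start < count; start += chunk)
        out.emplace_back(start, std::min(chunk, count - start));
    return out;
}

// Bytes in a dense float32 tensor of shape scenes x channels x height x width
// (float32 is an assumption, see the lead-in).
inline std::uint64_t field_bytes(std::uint64_t scenes, std::uint64_t channels,
                                 std::uint64_t height, std::uint64_t width) {
    return scenes * channels * height * width * sizeof(float);
}
```

Under these assumptions `split_chunks(12, 8)` yields sizes 8+4 and `split_chunks(12, 4)` yields 4+4+4, the two splits used in the validation, while `field_bytes(64, 124, 512, 640)` is 10,401,873,920 bytes against 1,300,234,240 bytes for an 8-scene chunk.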

CLAUDE: publish_libtorch_mirror.sh - libtorch -> mirror maven layout (piece 4B) · ea9c117a

Andrey Filippov authored Jun 27, 2026

Server-side copies the already-uploaded libtorch zip into
mirror.elphel.com/maven-dependencies/org/pytorch/libtorch-cxx11-cu128/2.7.1/ (maven
layout: .zip + .pom + .sha1/.md5, no 3.8GB re-upload) so imagej-elphel's -Plibtorch
profile can dependency:unpack it. Pairs with fetch_libtorch.sh (direct download).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
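
For orientation, the standard Maven repository layout maps coordinates to a path as sketched below. The groupId `org.pytorch` is inferred from the directory in the message, and the file names follow the usual `artifactId-version.ext` convention. The names the script actually writes are not shown in the commit, so treat both helpers as hypothetical.

```cpp
#include <cassert>
#include <string>

// Directory of an artifact relative to the repository root:
// dots in the groupId become slashes, followed by artifactId and version.
inline std::string maven_dir(std::string group, const std::string& artifact,
                             const std::string& version) {
    assert(!group.empty() && !artifact.empty() && !version.empty());
    for (char& c : group)
        if (c == '.') c = '/';
    return group + "/" + artifact + "/" + version + "/";
}

// Conventional file name for the artifact or a sidecar file,
// e.g. ext = "zip", "pom", "zip.sha1", "zip.md5".
inline std::string maven_file(const std::string& artifact, const std::string& version,
                              const std::string& ext) {
    return artifact + "-" + version + "." + ext;
}
```

With `("org.pytorch", "libtorch-cxx11-cu128", "2.7.1")` this gives the directory `org/pytorch/libtorch-cxx11-cu128/2.7.1/` named above and, by convention, `libtorch-cxx11-cu128-2.7.1.zip` inside it.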

CLAUDE: native LibTorch L1+L2 inference shim (libtpdnn.so) for JNA · 9202dd30

Andrey Filippov authored Jun 27, 2026

Piece 2 of the native-JNA DNN path. tp_dnn.cpp is a C-ABI port of infer_server.py's
hot path so the Java client can run L1+L2 in-process instead of over TCP:
tpdnn_init/upload/infer/free (+ num_levels/level_frames)
faithfully reproducing build_pyramid, the 16x shift-and-stitch full-res recovery,
decode (ghostbuster + velocity centroid), and the L2 ConvGRU recurrence + track-age.
Loads the TorchScript models from imagej_elphel_dnn (export_torchscript /
export_l2_torchscript). Disables the TorchScript JIT fuser at init (nvrtc element-wise
fusion fails on Blackwell; production wants no runtime nvrtc).

Validated: native vs the running Python server (same CUDA) max|diff| offset5=0,
roi=0 — bit-for-bit. (Oracle dump_ref.py + driver tpdnn_test.cpp, scratch.)

Built standalone via build_dnn.sh (g++ + libtorch 2.7.1+cu128, ABI=1), separate
from the nvcc-built libtileproc.so; fetch_libtorch.sh pulls the pinned libtorch.
Context unification + zero-copy kernel<->tensor sharing is a later step.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

26 Jun, 2026 5 commits

CLAUDE: tp_proc_exec_imclt honors use_ref (gpu_clt_ref) — JNA ref render fix · 7540202f

Andrey Filippov authored Jun 26, 2026

execImcltRbgAll(ref_scene) was dropped on the JNA side: native imclt was
hardcoded to gpu_clt, so a request to render the reference scene actually rendered
the current scene. Add a use_ref arg -> select gpu_clt_ref. Needed so the reference-CLT
post-mortem render reflects the real buffer (CORR2D-all-NaN divergence: inter
correlation needs BOTH gpu_clt + gpu_clt_ref; gpu_clt proven good via SOURCE).
Updated the internal selftest caller to (p,1,0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: add tp_proc_set_corr_indices_td (TD->PD re-upload for JNA CUAS) · 54f134b1

Andrey Filippov authored Jun 26, 2026

Inverse of tp_proc_get_corr_indices/get_corr_td: uploads host-built per-pair TD
corr indices + data back to the device (gpu_corr_indices / gpu_num_corr_tiles /
gpu_corrs_td, pitched) and sets last_num_corr_tiles so the following
corr2d_normalize + get_corr2d use the right count.

Backs GpuQuad.setCorrIndicesTdData (TDCorrTile.convertTDtoPD) on the JNA
rectilinear CUAS path, which previously fell through to base JCuda and NPE'd on
a null device pointer. Buffers are sized num_pairs*ntiles (num_pairs=3 for the
rectilinear config), giving ample headroom for the selected-tile count.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: Add TpProc texture path (oracle): textures_nonoverlap + readback · 2aa4e6e0

Andrey Filippov authored Jun 26, 2026

Oracle (clt_aberrations_quad_corr_GPU) needs the texture path that RT did not:
execTextures (calc_extra -> diff_rgb_combo, calc_textures -> textures),
getTextureIndices, getExtra, getFlatTextures.

Adds to TpProc:
- texture buffers (lazy-allocated in tp_proc_ensure_textures; sizes match
  GpuQuad: max_texture_size=(num_colors+1+(num_cams+num_colors+1))*256)
- tp_proc_exec_textures: textures_nonoverlap<<<1,1>>> mirroring
  GpuQuad.execTextures_DP. The kernel CDP-builds the index list internally
  (create_nonoverlap_list) so we do NOT host pre-fill it, and CDP-launches
  textures_accumulate (dyn-shared attr set on the accumulate fn). linescan_order
  is taken from the caller (0 in production -> diff_rgb_combo in texture_indices
  order). This matches PRODUCTION, not the Stage-5 harness convention (linescan=1
  + host-prefilled indices + stale golden) that produced the documented
  diff_rgb_combo mismatch.
- tp_proc_get_texture_indices / get_diff_rgb_combo / get_textures readback.

Native compiles clean (4 new symbols exported). Not yet deployed: live .so
left untouched (run was active); rebuild via jna/build_lib.sh when free.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
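
The buffer-size formula quoted above can be written out as a small helper. This is a sketch: the function name mirrors the quantity in the message, but its signature and the example camera/color counts are assumptions, and the meaning of the factor 256 (presumably per-tile elements) is not stated in the commit.

```cpp
#include <cassert>
#include <cstddef>

// max_texture_size = (num_colors + 1 + (num_cams + num_colors + 1)) * 256,
// exactly as quoted in the commit message.
inline std::size_t max_texture_size(int num_colors, int num_cams) {
    assert(num_colors > 0 && num_cams > 0);
    return static_cast<std::size_t>(num_colors + 1 + (num_cams + num_colors + 1)) * 256;
}
```

For example, a mono 16-camera configuration (num_colors=1, num_cams=16) gives (1+1+18)*256 = 5120, and a 3-color 4-camera one gives (3+1+8)*256 = 3072.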

CLAUDE: tp_proc_get_corr_indices/combo_indices/corr_td + erase_corrs (oracle TD-corr readback) · 33827862

Andrey Filippov authored Jun 26, 2026

Native readback of gpu_corr_indices / gpu_corrs_combo_indices / gpu_corrs_td (DtoH) and cudaMemset2D
erase of gpu_corrs, for GpuQuadJna's getCorrIndices/getCorrComboIndices/getCorrTdData/eraseGpuCorrs
(oracle TD-correlation path).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: FIX convert_direct deconvolution in JNA — pass (kernels_hor, kernels_vert), not (0, *) · 0399a26d

Andrey Filippov authored Jun 26, 2026

THE production-mismatch bug (RMSE ~1.7 vs JCUDA, invariant to FPN/row-col/MB). convert_direct gates
the deconvolution kernels on `kernels_hor>0` (TileProcessor.cu:2782-2783): with kernels_hor=0 it passes
NULL kernels -> NO deconvolution. tp_proc_exec_convert_direct hardcoded kh=0 (copied from the harness,
whose golden was itself made with no deconvolution), so JNA skipped aberration deconvolution while
production GpuQuad passes (kernels_hor, kernels_vert)=(82,66) and applies it.

Fix: add kernels_vert to TpProc (= kern_tiles/(kernels_hor*num_colors)); exec passes
(no_kernels?0:kernels_hor, no_kernels?0:kernels_vert). tp_proc_convert_selftest now uses no_kernels=1
to keep matching the NO-deconv harness golden (StageProc still PASS: CLT 0.1085 / RBG 0.0201 / corr 2e-5).
Production (GpuQuadJna no_kernels=false) now applies deconvolution = matches JCUDA. .so-only change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
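
The fix amounts to two small pieces of arithmetic, sketched here on the host. The helper names are hypothetical and the divisibility check is an added assumption. Only the formula for kernels_vert and the `no_kernels ? 0 : ...` gating come from the message.

```cpp
#include <cassert>
#include <utility>

// kernels_vert = kern_tiles / (kernels_hor * num_colors), as in the fix.
inline int derive_kernels_vert(int kern_tiles, int kernels_hor, int num_colors) {
    assert(kernels_hor > 0 && num_colors > 0);
    // Assumption: the kernel tile count is an exact multiple of a row of kernels.
    assert(kern_tiles % (kernels_hor * num_colors) == 0);
    return kern_tiles / (kernels_hor * num_colors);
}

// Dimensions handed to convert_direct: (0, 0) disables deconvolution,
// otherwise the real grid size is passed through.
inline std::pair<int, int> convert_direct_kernel_dims(bool no_kernels, int kernels_hor,
                                                      int kernels_vert) {
    return no_kernels ? std::make_pair(0, 0) : std::make_pair(kernels_hor, kernels_vert);
}
```

With kernels_hor=82 and a tile count of 82*66*num_colors, the derived kernels_vert is 66, so the production path passes (82, 66) while the selftest with no_kernels=1 still passes (0, 0).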

25 Jun, 2026 12 commits

CLAUDE: tp_proc_get_tasks — DtoH readback of gpu_ftasks (for updateTasks) · 53599fd3
Andrey Filippov authored Jun 25, 2026

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: tp_proc_set_clt — HtoD upload of a per-cam CLT slice (for setCltData / CLT restore) · 87796b0c
Andrey Filippov authored Jun 25, 2026

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: tp_proc_set_const — upload LPF coefficients to named __constant__ symbols (native module) · 8cf5c5a2

Andrey Filippov authored Jun 25, 2026

Mirrors GpuQuad.setLpfRbg/setLpfCorr (cuModuleGetGlobal + cuMemcpyHtoD) against the native module's
lpf_data / lpf_corr / lpf_rb_corr symbols. Fixes the first JNA-mode NPE (setLpfRbg hit the null
gpuTileProcessor.module).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: Step 2 corr — split TpProc into granular correlate2D_td/_inter_td/combine/normalize + get_corr2d · 2f926aee

Andrey Filippov authored Jun 25, 2026

Match GpuQuad's separate correlation calls: tp_proc_exec_corr2d_td (correlate2D TD, fat_zero=0,
corr_radius=0 -> gpu_corrs_td, returns num_corr_tiles), tp_proc_exec_corr2d_inter_td (correlate2D_inter
clt vs clt_ref, sel_sensors), tp_proc_exec_corr2d_combine (init|no_transpose<<1, num_pairs, pairs_mask
-> num_corr_combo), tp_proc_exec_corr2d_normalize (combo -> corrs_combo / per-pair -> gpu_corrs),
tp_proc_get_corr2d (per-pair de-pitch) + tp_proc_num_corr_tiles/combo. Add per-pair pixel buffer
gpu_corrs (+dstride_corr). The bundled tp_proc_exec_corr2d/StageProc path is retained.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: Step 1 complete — TpProc imclt + corr2d granular ops (full LWIR16-CUAS surface) · d1c277ca

Andrey Filippov authored Jun 25, 2026

Add to the persistent TpProc API: tp_proc_setup_rbg_corr (imclt RBG + correlation buffers + corr
config), tp_proc_exec_imclt / tp_proc_get_rbg, tp_proc_exec_corr2d (correlate2D TD -> corr2D_combine
-> corr2D_normalize) / tp_proc_get_corr2d_combo. launch1() helper. tp_proc_convert_selftest extended
to validate imclt vs aux_chnN.rbg and quad corr vs aux_corr-quad.corr (order-independent, stale golden).

Validated on RTX 5060 Ti via the persistent API: CLT==golden (0.1085), RBG==golden (0.0201),
quad-corr value-err 2.06e-5, no_kernels runs finite. The persistent granular API now covers the
full set of GPU ops the LWIR16 CUAS workflow uses (geometry/convert_direct[+no_kernels/use_center_image/
erase_clt/ref_scene]/imclt/correlations) — the surface GpuQuadJna (integration step 2) delegates to.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: Step 1 — persistent granular native API (TpProc) for the convert_direct core · 06c12c4a

Andrey Filippov authored Jun 25, 2026

Add TpProc: the production-facing persistent instance (buffers allocated once in tp_proc_setup,
reused across set/exec/get, freed in tp_proc_destroy) — the surface GpuQuadJna will delegate to.
API: tp_proc_create/setup, set_geometry/correction_vector/kernels/kernel_offsets/image/center_image/
tasks, exec_geometry (calc_reverse_distortions+rot_derivs+calculate_tiles_offsets), exec_convert_direct
(ref_scene/erase_clt/no_kernels), get_clt, destroy. Includes the fragile convert_direct paths the
migration must preserve: no_kernels (skip deconvolution -> kernels_hor/vert=0), use_center_image
(broadcast one center image to all sensors), erase_clt (erase_clt_tiles), ref_scene (clt_ref buffer).

tp_proc_convert_selftest validates end-to-end on RTX 5060 Ti: standard convert CLT == clt/aux_chnN.clt
golden (max|CLT-golden|=0.1085, == Stage 2, num_active=5120); no_kernels path runs with finite output.
update_image_gpu pitch is in BYTES (the "in floats" comment is misleading).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: Stage 5 — native textures_nonoverlap via JNA (executes on Blackwell; golden mismatch documented) · 341538c7

Andrey Filippov authored Jun 25, 2026

Extend run_convert_pipeline with do_tex: setTextures/setRGBA-equivalent buffers
(texture_indices from tasks, gpu_textures, diff_rgb_combo, color_weights, generate_RBGA_params),
cuFuncSetAttribute(textures_accumulate, MAX_DYNAMIC_SHARED_SIZE_BYTES, shared_size) for the CDP
child, launch textures_nonoverlap <<<1,1>>> (18 args), de-interleave diff_rgb_combo. tp_tex_selftest.

RESULT on RTX 5060 Ti: textures_nonoverlap + its CDP child textures_accumulate EXECUTE correctly
(no errors, shared 58880, 5120 tiles, output rms within ~1% of golden) => Blackwell compatibility
confirmed. BUT diff_rgb_combo does NOT match the Jul-2025 golden numerically (value layers off by
constant ~268, diff layers diff_sigma-sensitive). Ruled out input-CLT sensitivity (same error with
golden CLT), diff_sigma (10.0 closest), arg/param order. NOT used by the LWIR16 CUAS workflow
(cuas/ uses only convert_direct/corr2D_normalize/imclt_rbg_all). Documented known issue (golden
staleness / unverified RGB-path drift), to track later via git bisect + the 107 kernel branch switch.
See imagej-elphel-internal handoffs/2026-06-25_texture-diff-rgb-combo-mismatch.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLAUDE: Stage 4 — native correlate2D/combine/normalize via JNA (quad correlation) · a0984dca

Andrey Filippov authored Jun 25, 2026

Extend run_convert_pipeline with do_corr: after convert_direct, allocate corr buffers
(gpu_corrs_td/combo_td/combo via alloc_image_gpu + corr_indices/combo_indices/num_corr_tiles),
launch correlate2D <<<1,1>>> (TD, CDP; 18 args, generates indices), read num_corr_tiles,
corr2D_combine (quad pairs_mask 0x0f), corr2D_normalize (TD->pixel), de-pitch gpu_corrs_combo.
tp_corr_selftest wrapper (do_corr=1).

Validated on RTX 5060 Ti via JNA: num_pairs=120, num_corr_combo=5120, output stats identical
to golden (max 0.6638, rms 0.0717). clt/aux_corr-quad.corr is OLDER (Apr-2025) than the CLT
golden (Jul-2025) so the active-tile ORDER differs -> pointwise compare is permutation-dominated
(0.66). Order-independent check (sort both, compare distributions): max value error 2.06e-05
== float32 precision => correlate2D/combine/normalize compute the correct values on Blackwell.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
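
The order-independent check described above (sort both, compare distributions) can be sketched as below. The function names are hypothetical and the actual comparison code is not part of this commit message. Note that matching sorted values shows the value distributions agree, not that each value sits at the right tile, and that NaNs would have to be filtered out before sorting.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pointwise max |a[i] - b[i]|; dominated by the permutation when tile order differs.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}

// Order-independent variant: sort copies of both arrays, then compare pointwise.
inline float max_sorted_abs_diff(std::vector<float> a, std::vector<float> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return max_abs_diff(a, b);
}
```

Two arrays holding the same values in a different order give a large `max_abs_diff` (up to the data's own peak, as with the 0.66 above) but a `max_sorted_abs_diff` of zero.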

CLAUDE: Stage 3 — native imclt_rbg_all via JNA + .rbg golden validation · edfc7bae

Andrey Filippov authored Jun 25, 2026

Refactor the Stage-2 selftest into run_convert_pipeline(do_imclt) shared helper;
tp_convert_direct_selftest is now a thin wrapper (do_imclt=0). Add tp_imclt_selftest
(do_imclt=1): after convert_direct, allocate pitched RBG output buffers (alloc_image_gpu,
648x520/cam, mono), launch imclt_rbg_all <<<1,1>>> (gpu_clt -> gpu_corr_images),
de-pitch via cudaMemcpy2D, compare to clt/aux_chnN.rbg golden.

Validated on RTX 5060 Ti via Java->JNA: max|RBG-golden|=0.0201 over peaks 1535 ->
relative ~1.31e-5. convert_direct CLT error unchanged (0.108505) => no regression.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
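
De-pitching means copying the useful `width` floats of each row out of a buffer whose rows start `pitch` bytes apart. A host-only sketch of what the cudaMemcpy2D call does for a device-to-host copy follows. The function name and the use of a std::vector are illustrative assumptions. The pitch is in bytes, as with the CUDA pitched-allocation APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy `height` rows of `width` floats from a pitched buffer into a dense one.
// `pitch_bytes` is the distance between row starts in BYTES, not in floats.
inline std::vector<float> depitch(const unsigned char* src, std::size_t pitch_bytes,
                                  std::size_t width, std::size_t height) {
    assert(src != nullptr);
    assert(pitch_bytes >= width * sizeof(float));
    std::vector<float> dst(width * height);
    if (width == 0) return dst;  // nothing to copy
    for (std::size_t y = 0; y < height; ++y)
        std::memcpy(dst.data() + y * width, src + y * pitch_bytes, width * sizeof(float));
    return dst;
}
```

The device-side equivalent is a single `cudaMemcpy2D(dst, width * sizeof(float), src, pitch_bytes, width * sizeof(float), height, cudaMemcpyDeviceToHost)`, where the width argument is likewise given in bytes.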

CLAUDE: Stage 2 — native convert_direct selftest (first real execution + CDP on Blackwell) · 05ee47d0

Andrey Filippov authored Jun 25, 2026

Add tp_convert_direct_selftest to the JNA shim: mirrors TpHostGpu allTests' convert
path (setImageKernels/setImgBuffers/setCltBuffers/setTasks + calc_reverse_distortions
-> rot_derivs -> calculate_tiles_offsets [CDP] -> convert_direct), reusing the harness
runtime-API host helpers (tp_utils/tp_files/TpParams/tp_paths) for ALL allocation and
porting only the launches to driver-API cuLaunchKernel against the NVRTC module. Reads
CLT back, compares to clt/aux_chnN.clt golden.

build_lib.sh: nvcc + -std=c++17 (static constexpr TpParams members become inline),
-Isrc + cuda-samples Common (helper_cuda.h), --pre-include algorithm.

Validated on RTX 5060 Ti via Java->JNA: num_active_tiles=5120 (all), max|CLT-golden|
=0.1085 over peaks of 12260 -> relative ~8.85e-6 (float32 NVRTC-vs-nvcc variation).
First CDP (calculate_tiles_offsets) and 17-arg pointer-of-pointers convert_direct
launch executing natively on Blackwell.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
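
The relative figure quoted above is the maximum absolute error divided by the data peak. A minimal sketch of that arithmetic follows, with hypothetical helper names. Expressing the result in float32 machine epsilons is an extra illustration, not something the commit does.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Relative error: max |result - golden| divided by the peak magnitude of the data.
inline double relative_error(double max_abs_err, double peak) {
    assert(peak > 0.0);
    return max_abs_err / peak;
}

// The same quantity expressed in units of float32 machine epsilon (about 1.19e-7).
inline double in_float_epsilons(double rel) {
    return rel / static_cast<double>(std::numeric_limits<float>::epsilon());
}
```

For the numbers in this commit, `relative_error(0.1085, 12260.0)` is about 8.85e-6, which is on the order of 74 float32 epsilons.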

CLAUDE: Stage 1 — native TpInstance geometry path (calc_reverse_distortions + rot_derivs) · a5b7c269

Andrey Filippov authored Jun 25, 2026

Add TpInstance to the JNA shim: device buffers (gpu_geometry_correction,
gpu_rByRDist, gpu_rot_deriv, gpu_correction_vector) + setters (HtoD),
the two pure-geometry launches (calcReverseDistortionTable {16,1,1}/{3,3,3},
calc_rot_deriv {num_cams,1,1}/{3,3,3}), and readback getters. Driver-API
cuLaunchKernel against the NVRTC module (mirrors GpuQuad.execCalcReverseDistortions
/ execRotDerivs, no JCuda). build_lib.sh builds libtileproc.so.

Validated via Java->JNA against tile_processor_gpu/clt reference data on the
RTX 5060 Ti: rByRDist == clt/*.rbyrdist to ~1e-7 (aux 16-cam and main),
rot_deriv rows orthogonal to ~1e-10 (scaled-rotation structure, det~zoom^3).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
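
The structural check mentioned for rot_deriv (orthogonal rows, determinant near zoom^3) can be sketched for a generic 3x3 matrix as follows. The type and function names are hypothetical. The validation code itself is not shown in the commit, and the z-axis test matrix is just a convenient example of a scaled rotation.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Determinant of a 3x3 matrix by cofactor expansion along the first row.
inline double det3(const Mat3& m) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

// Largest |cosine| between any two distinct rows; 0 means perfectly orthogonal rows.
inline double max_row_cosine(const Mat3& m) {
    double worst = 0.0;
    for (int i = 0; i < 3; ++i)
        for (int j = i + 1; j < 3; ++j) {
            double dot = 0.0, ni = 0.0, nj = 0.0;
            for (int k = 0; k < 3; ++k) {
                dot += m[i][k] * m[j][k];
                ni += m[i][k] * m[i][k];
                nj += m[j][k] * m[j][k];
            }
            assert(ni > 0.0 && nj > 0.0);
            worst = std::max(worst, std::fabs(dot) / std::sqrt(ni * nj));
        }
    return worst;
}

// Example input: rotation by `angle` about z, uniformly scaled by `zoom`.
inline Mat3 scaled_rotation_z(double zoom, double angle) {
    const double c = std::cos(angle), s = std::sin(angle);
    Mat3 m{};
    m[0] = {zoom * c, -zoom * s, 0.0};
    m[1] = {zoom * s, zoom * c, 0.0};
    m[2] = {0.0, 0.0, zoom};
    return m;
}
```

A scaled rotation has mutually orthogonal rows and a determinant equal to zoom cubed, so `scaled_rotation_z(2.0, 0.3)` yields a row cosine near zero and a determinant near 8.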

CLAUDE: JNA shim for the GPU migration (Stage 0/0b) · eec885a0

Andrey Filippov authored Jun 25, 2026

libtileproc shim (tp_jna.cpp: extern "C" tp_create_module/num_functions/last_error/destroy)
+ standalone tp_nvrtc_probe.cpp + build_probe.sh. NVRTC-compiles the kernels (+ JCUDA defines)
-> cuLink(libcudadevrt, CDP) -> module -> 19 functions, validated on the RTX 5060 Ti (sm_120 via
compute_90 PTX + driver JIT). Build artifacts gitignored. Part of the JCuda->JNA migration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

03 Dec, 2025 2 commits
- Added link to Cuda samples · 75acb784
  Andrey Filippov authored Dec 03, 2025
- Updated eclipse settings · 2bd8aa74
  Andrey Filippov authored Dec 03, 2025
28 Sep, 2025 1 commit
- copied from lwir16 branch · 1d8d3602
  Andrey Filippov authored Sep 28, 2025
22 Jul, 2025 1 commit
- fixed inter-scene correlation · 8612a64e
  Andrey Filippov authored Jul 22, 2025
15 Apr, 2025 2 commits
- debugged initially with jcuda · d8e9a454
  Andrey Filippov authored Apr 15, 2025
- removed unused · 414f6351
  Andrey Filippov authored Apr 15, 2025
13 Apr, 2025 1 commit
- adding image sizes · 6a600b81
  Andrey Filippov authored Apr 13, 2025
12 Apr, 2025 3 commits
- hfree(), gfree() by reference · 6b0ef5a2
  Andrey Filippov authored Apr 12, 2025
- testing refactored · 32ed475a
  Andrey Filippov authored Apr 12, 2025
- Ignoring all clt* · b469db50
  Andrey Filippov authored Apr 12, 2025
10 Apr, 2025 2 commits
- implemented testTextures, testTexturesRGBA · 13ca3625
  Andrey Filippov authored Apr 10, 2025
- testCorrelate2DIntraTD · 3610b7a6
  Andrey Filippov authored Apr 10, 2025
09 Apr, 2025 1 commit
- More code reimplemented as methods · 6f9c7399
  Andrey Filippov authored Apr 09, 2025
08 Apr, 2025 1 commit
- More methods replacing flat code · 14d7689b
  Andrey Filippov authored Apr 08, 2025
07 Apr, 2025 1 commit
- More refactoring · dc090454
  Andrey Filippov authored Apr 07, 2025
06 Apr, 2025 1 commit
- More refactoring · 67816dbf
  Andrey Filippov authored Apr 06, 2025
03 Apr, 2025 1 commit
- More refactoring · 68491042
  Andrey Filippov authored Apr 03, 2025
01 Apr, 2025 2 commits
- GenerateRgbaHost - from class instance · 46256d14
  Andrey Filippov authored Apr 01, 2025
- Refactoring, working state · 1bf5e8bf
  Andrey Filippov authored Apr 01, 2025
31 Mar, 2025 1 commit
- Refactoring · 13f515b9
  Andrey Filippov authored Mar 31, 2025