Andrey Filippov authored
Extend run_convert_pipeline with do_corr

After convert_direct, allocate the correlation buffers (gpu_corrs_td/combo_td/combo via alloc_image_gpu, plus corr_indices/combo_indices/num_corr_tiles). Then launch correlate2D <<<1,1>>> (TD, CDP; 18 args, generates indices), read num_corr_tiles, run corr2D_combine (quad pairs_mask 0x0f) and corr2D_normalize (TD->pixel), and de-pitch gpu_corrs_combo. Add a tp_corr_selftest wrapper (do_corr=1).

Validated on RTX 5060 Ti via JNA: num_pairs=120, num_corr_combo=5120, output stats identical to golden (max 0.6638, rms 0.0717).

clt/aux_corr-quad.corr is OLDER (Apr-2025) than the CLT golden (Jul-2025), so the active-tile ORDER differs and the pointwise compare is permutation-dominated (0.66). Order-independent check (sort both, compare distributions): max value error 2.06e-05, i.e. at float32 precision, so correlate2D/combine/normalize compute the correct values on Blackwell.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
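
The order-independent check described in the message (sort both buffers, then compare the sorted values) can be sketched in host-side C++ as below. This is a minimal illustration, not the code used in the commit: the function names max_pointwise_diff and max_sorted_diff are hypothetical, and it assumes both correlation outputs have already been copied to the host as flat float arrays.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest |a[i] - b[i]| over matching positions.
// This is the pointwise compare, which is sensitive to tile order.
float max_pointwise_diff(const std::vector<float>& a,
                         const std::vector<float>& b) {
    float worst = 0.0f;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// Hypothetical helper: order-independent compare.
// Arguments are taken by value so the caller's buffers stay unsorted.
float max_sorted_diff(std::vector<float> a, std::vector<float> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    // After sorting, values are compared rank by rank, so a pure
    // permutation of the same values yields a difference of zero.
    return max_pointwise_diff(a, b);
}
```

A sorted compare only shows that the two value distributions match; it does not confirm that each value belongs to the right tile, so it is a weaker test than a pointwise compare against a golden file with matching tile order.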
a0984dca