• Andrey Filippov's avatar
    CLAUDE: Stage 4 — native correlate2D/combine/normalize via JNA (quad correlation) · a0984dca
    Andrey Filippov authored
    Extend run_convert_pipeline with do_corr: after convert_direct, allocate corr buffers
    (gpu_corrs_td/combo_td/combo via alloc_image_gpu + corr_indices/combo_indices/num_corr_tiles),
    launch correlate2D <<<1,1>>> (TD, CDP; 18 args, generates indices), read num_corr_tiles,
    corr2D_combine (quad pairs_mask 0x0f), corr2D_normalize (TD->pixel), de-pitch gpu_corrs_combo.
    tp_corr_selftest wrapper (do_corr=1).
    
    Validated on RTX 5060 Ti via JNA: num_pairs=120, num_corr_combo=5120, output stats identical
    to golden (max 0.6638, rms 0.0717). clt/aux_corr-quad.corr is OLDER (Apr-2025) than the CLT
    golden (Jul-2025) so the active-tile ORDER differs -> pointwise compare is permutation-dominated
    (0.66). Order-independent check (sort both, compare distributions): max value error 2.06e-05
    == float32 precision => correlate2D/combine/normalize compute the correct values on Blackwell.
    Co-Authored-By: 's avatarClaude Opus 4.8 (1M context) <noreply@anthropic.com>
    a0984dca
tp_jna.cpp 28.6 KB