CLAUDE: initial import — DNN training/eval/export (migrated from imagej-elphel-internal/c5p_dnn)

L1 RawFCN + L2 ConvGRU(torus), synthetic data gen, training/eval, infer_server, and export_torchscript.py (self-contained TorchScript for native LibTorch inference).
GPLv3 (Elphel norm); headers on all .py/.sh; LICENSE = GPLv3. runs/ checkpoints untracked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Commit 782ef529 authored Jun 26, 2026 by Andrey Filippov
48 changed files
--- a/.gitignore
+++ b/.gitignore
+__pycache__/
+*.pyc
+runs/
+*.venv/
+export_venv/
+*.tif
--- a/DESIGN.md
+++ b/DESIGN.md
+# C5P DNN front-end — design & decisions
+
+CUAS real-time detector DNN front-end: an **all-convolutional (FCN)** per-pixel target estimator
+that replaces the C5P matched-filter velocity bank. Trained on synthetic Gaussian-noise patches,
+deployed inside `CuasDetectRT` (ImageJ) via ONNX Runtime (Java, CPU EP). Produces, per pixel, a
+velocity posterior over an 11×11 grid (±1.25 px/frame, 0.25 step) + detection confidence `s` +
+sub-pixel (dx,dy) offset.
+
+## Files
+- `synth.py` — on-the-fly training patches. `generate_sample` (center / off-center / noise classes),
+  half-cosine bump targets, per-frame constant velocity sampled on a disk, SNR-swept.
+- `model.py` — `RawFCN` (24×24×N → 1×1×124 via valid convs + 2 maxpools, slid densely as an FCN);
+  `fcn_loss` (det BCE + velocity soft-target CE + offset MSE); `vel_bias_loss` (batch-moment de-bias).
+- `train.py` — training loop + ONNX export (dynamic H/W axes). The CLI defaults ARE the "weighted" recipe.
+- `velocity_bias.py` — gain-vs-SNR diagnostic (predicted vs true velocity, fit per SNR bin).
+- `gen_synth_cuas.py` — builds the full-frame `*-CUAS-SYNTHETIC-CUAS.tiff` velocity-reference grid
+  for in-ImageJ testing (radial layout, one velocity cell per target). Writes a `.groundtruth.json`.
+- `compare_dnn_truth.py` — compares saved DNN output vs the real UAS_log.csv (offset/velocity/time).
+- `baselines.py`, `export_onnx.py`, `make_testvec.py`, `viz_*.py`, `extract_B.py` — support/diagnostics.
+
+## Key decisions & findings (2026-06-15)
+
+1. **Runtime temporal depth N.** `CuasDnnInfer` reads N from the ONNX input shape `[B,N,H,W]` and
+   exposes `getNFrames()`; `CuasDetectRT` uses it instead of a hardcoded 8 → 8- vs 9-frame models swap
+   purely by changing `curt_dnn_model`. (Committed: imagej-elphel `9d06cce7`.)
+
+2. **Velocity training DOMAIN vs output grid.** `synth.py` confines training velocity to a *disk* of
+   radius `vmax_px`. It was 1.0, but the output grid runs to ±1.25 (radius-5 cells), so every cell with
+   |v|>1.0 was untrained → underestimate + asymmetric ghosts at the corners. Diagnosed via the synthetic
+   grid (the diagonal (1.0,1.0)=|v|1.414 cell got projected onto the trained boundary → argmax (0.75,0.75)).
+   Fix: train with **vmax_px ≈ 1.4** (covers the ±1.25 on-axis with margin; only super-physical diagonal
+   corners stay untrained). `velocity_bias.py` couldn't catch this — it samples the same disk.
+
+3. **Velocity SCALE de-bias (`vel_bias_loss`).** Softmax-centroid velocity shrinks as confidence drops
+   (centroid of a noise-broadened, grid-bounded posterior regresses toward 0): gain 0.97 clean → 0.76 at
+   SNR=1. **The recurrent layer reduces variance, not bias** — so the bias must be removed in the DNN.
+   `vel_bias_loss`: per equal-population bin of a conditioning var, pooled least-squares gain-through-origin
+   of predicted vs true velocity, penalize `(gain−1)²`. This pins the MEAN scale to 1 per bin WITHOUT
+   penalizing per-sample variance (variance is information-limited; left for the recurrent to average).
+   Result: gain → ~1.0 across SNR (RMSE preserved at low SNR — the intended signature).
+   - `--bias_by snr` (default): clean training label; the correction is baked into the weights so the var
+     need not exist at inference. **Best for the in-loss term.**
+   - `--bias_by s` (confidence): cause-agnostic — corrects on the net's own uncertainty regardless of WHY
+     it's low (Gaussian noise vs clutter vs close targets), so it may transfer to non-Gaussian degraders on
+     REAL data. Near-identical to snr on Gaussian; the real-data A/B (tile-streak, two close targets) is the
+     only discriminator. If it doesn't transfer, augment synth with those factors.
+
+4. **Half-pixel registration.** Training referenced the patch center at `(W−1)/2 = 11.5` (even patch),
+   but deployment (`CuasDnnInfer.inferROI`) puts the ROI/output pixel at index `half = P/2 = 12`
+   (`ix = cx + i − half`, patch spans `[cx−12, cx+11]`). The 0.5 gap was a systematic ½-px position bias.
+   Fix: set `cx0 = cy0 = P/2` in `generate_sample` (matches the deployment reference; even patch kept).
+   The 24-patch asymmetry (12 left / 11 right) is exactly what deployment already imposes, so training now
+   matches it — no odd-25-patch architecture change needed.
+
+5. **Temporal sync (exact).** The DNN output is anchored at the **newest** frame of its N-frame window
+   (`window[0] = framesD[newest]`, training labels frame i=0 = newest). The output is tagged
+   `ts[newest] + " f"+newest`, so `f<n>` IS the motion frame (= level-slice index). No ±-frame slack.
+
+6. **Radial synthetic grid math.** `gen_synth_cuas.py` radial layout: cell (vx,vy) node = `center + 30·(vx,vy)`,
+   velocity `(vx/4, vy/4)` px/frame, `pos = node + v·t`. So the **effective grid spacing = 30 + t/4 px**
+   (breathes outward 0.25 px/frame). The spacing is a whole number of pixels at `t ≡ 0 (mod 4)`: t=8→32px, 12→33px, …
+   Between them, sub-pixel offsets fan out per velocity. center = (320.5, 256.5) on a 640×512 frame.
+
+7. **ONNX deploy convention.** torch exports external-data ONNX (`model.onnx` + `model.onnx.data`); the
+   `.onnx` references the sidecar by its export-time name, so the pair **cannot be renamed flat** (renaming
+   loads the wrong sidecar → size-mismatch error). Deploy each model in **its own subdir** with the canonical
+   `model.onnx` / `model.onnx.data`. Cache: `~/.cache/c5p_dnn/<name>/model.onnx`.
+
+8. **Ghostbuster (untrained-corner velocities).** The velocity grid (radius `vel_radius`=5 cells,
+   corners to R≈7) extends past the trained disk (R = `vmax_px`·`vel_decimate` ≈ 5.6 at vmax 1.4), so
+   the untrained corner cells emit spurious velocity sidelobes (ghosts) of non-trivial strength (field
+   value up to ~0.09 vs ~0.15 real) that would confuse the recurrent. `CuasDetectRT.dnnGhostbust` zeros
+   velocity cells with cell-radius > `curt_dnn_vmax·vel_decimate`; if a pixel's PEAK lands in that region
+   the whole detection is discarded (field=0, **s=0**). Applied to the DNN field + offset before save and
+   the recurrent feed. **`curt_dnn_vmax` (px/frame, default 1.4) must match the loaded model's training
+   `vmax_px`** (PM models = 1.4, `m9_base` = 1.0); too low over-masks trained cells, too high leaves ghosts.
+   (imagej-elphel `0bd16311`.)
+
+9. **Larger attention area (32-patch) — kills trajectory-alias ghosts (T4).** Widened the receptive
+   field to a 32-patch: `RawFCN(patch=32)` = 6 conv3 + 2 pool (32→30→28→14→12→10→5→3→1; pools on even
+   sizes so the cx0=P/2 centering holds), ~119k params. Grows off-center suppression reach
+   `off_max = P/2-margin-1` from 9→13 px, covering the alias reach `vmax·(N-1) ≈ 11.2 px`. On the real
+   synthetic grid the ghost field dropped **0.16 → ~0.003** (~50×, essentially gone). NB: the single-
+   static `ghost_probe.py` did NOT reproduce the ghost (both 24/32 suppress a lone static off-center
+   target) — the real alias needs the multi-target/conditioning context — but the wider RF fixed it
+   regardless. Java: `inferROI` patch is configurable via `CuasDnnInfer.setPatch`; new param
+   **`curt_dnn_patch`** (default 24; set 32 for this model) — MUST match the loaded model. Deployed
+   `~/.cache/c5p_dnn/m9_p32_s/model.onnx`. IMPORTANT: `curt_dnn_vmax` (ghostbuster) must be ≥ the max
+   real-target velocity you want to KEEP, not just the training vmax — e.g. (4,4)=|v|1.414 (cell-R 5.657)
+   needs vmax≥1.42, so use 1.5 (rmax 6.0) to keep (4,4) while still killing far corners (5,5)=7.07.
+   Training disk was 1.4, so (4,4) is marginally extrapolated; a retrain at vmax_px≈1.45–1.5 trains the
+   diagonal corners cleanly.
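The per-bin gain penalty described in decision 3 can be sketched in plain Python. This is a minimal illustration, not the repository's `vel_bias_loss`: the function names, the list-of-tuples data layout, and the rule that leftover samples go to the last bin are all assumptions made for the example. The shrink factors 0.76 and 0.97 are the gains quoted in decision 3, reused here only as demo data.

```python
def pooled_gain(v_pred, v_true):
    """Least-squares gain through the origin, pooled over both velocity components."""
    num = sum(px * tx + py * ty for (px, py), (tx, ty) in zip(v_pred, v_true))
    den = sum(tx * tx + ty * ty for tx, ty in v_true)
    return num / den  # assumes the bin holds at least one non-zero true velocity


def vel_bias_penalty(v_pred, v_true, cond, n_bins=4):
    """Mean of (gain - 1)^2 over equal-population bins of the conditioning variable."""
    order = sorted(range(len(cond)), key=lambda i: cond[i])
    size = len(order) // n_bins
    gains = []
    for b in range(n_bins):
        lo = b * size
        hi = (b + 1) * size if b < n_bins - 1 else len(order)  # leftovers -> last bin
        idx = order[lo:hi]
        gains.append(pooled_gain([v_pred[i] for i in idx], [v_true[i] for i in idx]))
    penalty = sum((g - 1.0) ** 2 for g in gains) / n_bins
    return penalty, gains


# Demo: predictions shrunk by 0.76 at low SNR and 0.97 at high SNR (no noise added).
v_true = [(0.5, -0.25), (1.0, 0.0), (-0.75, 0.5), (0.25, 1.0)]
snr = [1.0, 1.5, 7.0, 8.0]
shrink = [0.76, 0.76, 0.97, 0.97]
v_pred = [(s * x, s * y) for s, (x, y) in zip(shrink, v_true)]

penalty, gains = vel_bias_penalty(v_pred, v_true, snr, n_bins=2)
print([round(g, 2) for g in gains], round(penalty, 5))
```

Because the penalty depends only on the pooled gain of each bin, zero-mean scatter of individual predictions around the true velocity leaves it near zero, which is the "pin the mean, not the variance" property the decision describes.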
+
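The arithmetic in decisions 8 and 9 can be checked with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, `margin=2` is inferred from the quoted 9→13 px change rather than read from the code, and `ghostbust_keeps` only mirrors the documented rule (zero cells with cell-radius > `curt_dnn_vmax·vel_decimate`), not the Java `dnnGhostbust` implementation.

```python
import math

# Layer order assumed from the size chain quoted in decision 9.
OPS_P32 = ["conv3", "conv3", "pool2", "conv3", "conv3", "pool2", "conv3", "conv3"]


def size_chain(patch, ops):
    """Spatial size after each valid 3x3 conv (-2) or 2x2 max-pool (/2)."""
    sizes = [patch]
    for op in ops:
        s = sizes[-1]
        if op == "conv3":
            s -= 2
        elif op == "pool2":
            if s % 2:
                raise ValueError(f"pool on odd size {s}")
            s //= 2
        else:
            raise ValueError(f"unknown op {op!r}")
        sizes.append(s)
    return sizes


def off_max(patch, margin=2):
    """Off-center suppression reach, P/2 - margin - 1 (margin is an inferred value)."""
    return patch // 2 - margin - 1


def alias_reach(vmax, n_frames):
    """Trajectory length in pixels over an N-frame window."""
    return vmax * (n_frames - 1)


def ghostbust_keeps(cell, vmax, vel_decimate=4):
    """True if the velocity cell lies inside the kept radius vmax * vel_decimate."""
    return math.hypot(*cell) <= vmax * vel_decimate


print(size_chain(32, OPS_P32))                 # 32 -> ... -> 1
print(off_max(24), off_max(32))                # reach for the 24- and 32-patch
print(alias_reach(1.4, 9))                     # compare against off_max(32)
print(ghostbust_keeps((4, 4), 1.4), ghostbust_keeps((4, 4), 1.5),
      ghostbust_keeps((5, 5), 1.5))
```

With these inputs the (4,4) cell is masked at vmax 1.4 and kept at 1.5, while (5,5) is masked in both cases, matching the recommendation to set `curt_dnn_vmax` to 1.5.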
+## Training recipe (DGX GB10, NGC pytorch:25.10-py3, ~35 it/s, ~3 min / 6000 steps)
+```
+python train.py --steps 6000 --nframes 9 --vmax 1.4 --w_bias 10 [--bias_by s] --out runs/<name>
+```
+Defaults (frac_pos 0.4, frac_off 0.4, w_vel 1.0, w_off 0.3, snr 1 8, patch 24, vel_radius 5, vel_decimate 4,
+sigma_v 0.9) = the weighted recipe. Sync scripts to `elphel@192.168.0.62:~/c5p_dnn/`, run in the container.
+
+## Deployed models (`~/.cache/c5p_dnn/`) — "perfect-match" set (half-pixel fix + vmax 1.4 + de-bias)
+- `m9_pm/model.onnx`   — 9-frame, snr-de-bias
+- `m9_pm_s/model.onnx` — 9-frame, s-de-bias
+(supersede `m9_dbias*`, which lacked the half-pixel fix; `m9_base` = 9-frame no-de-bias baseline.)
+
+## Open items
+- Real-data A/B: snr vs s de-bias on the tile-streak and two-close-targets cases (the s-transfer test).
+- Connect the DNN field to the recurrent layer (weights, as-is vs splat); tune re-sharpen.
+- (Done — ghostbuster, decision 8.) Corner ghost sidelobes masked at readout.
+- Explore in-network recurrent (extra layers / memory) — cf. Andrey's predecessor 2-stage Siamese
+  tile-disparity CNN (spatial neighbour context; multi-scale 1×1/3×3/5×5 loss).
--- a/DESIGN_v2_mf_hough.md
+++ b/DESIGN_v2_mf_hough.md
+# C5P DNN v2 — Two-stage MF + learned Hough-vote (2026-06-18, Andrey + Claude)
+
+Redesign agreed after the v1 findings: softmax competition (not 121-resolution) kills ghosts;
+the ghostbuster is structurally backwards (clips near-limit reals, keeps under-read fast ghosts);
+the trajectory-alias velocity ramp is **structure, not noise**; low-SNR needs **spatial** evidence
+integration (the spatial analog of the temporal recurrent). This architecture turns the alias into
+the detector.
+
+## Core idea
+A true target (head H at newest frame, velocity V, tail T = H − V·(N−1)) makes the per-pixel MF field
+fire a **predictable pattern** around H: a neighbor pixel P reports (tail-anchored, exact for tail-only)
+`V_P = V + (P − H)/(N−1)` — slope 1/(N−1), the ramp we measured. The invariant: every pixel of the
+target back-projects to the **same tail**:  `P − V_P·(N−1) = T`. So aliased neighbors don't scatter —
+they all point at one place. Accumulating those (s-weighted) votes recovers the target from many weak
+detections (~√M gain for M coherent voters, M being the voter count and not the frame count N), far more noise-resilient than one pixel's s.
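The tail invariant follows directly from the two formulas above: substituting `V_P = V + (P − H)/(N−1)` into `P − V_P·(N−1)` gives `H − V·(N−1) = T`. A small numeric check, with hypothetical function names and made-up example values for H, V and the neighbor pixels:

```python
def tail(head, v, n_frames):
    """T = H - V*(N-1): where the target was N-1 frames before the newest one."""
    k = n_frames - 1
    return (head[0] - v[0] * k, head[1] - v[1] * k)


def aliased_velocity(p, head, v, n_frames):
    """Tail-anchored velocity a neighbor pixel P reports: V + (P - H)/(N-1)."""
    k = n_frames - 1
    return (v[0] + (p[0] - head[0]) / k, v[1] + (p[1] - head[1]) / k)


def back_project(p, v_p, n_frames):
    """Back-project a pixel along its reported velocity: P - V_P*(N-1)."""
    k = n_frames - 1
    return (p[0] - v_p[0] * k, p[1] - v_p[1] * k)


N = 9
H = (100.0, 60.0)   # head position at the newest frame (example value)
V = (1.0, -0.5)     # true velocity in px/frame (example value)
T = tail(H, V, N)

PIXELS = [(100, 60), (103, 58), (96, 65)]  # the head pixel and two neighbors
for p in PIXELS:
    v_p = aliased_velocity(p, H, V, N)
    print(p, v_p, back_project(p, v_p, N))  # every row ends at the same tail T
```

The values were chosen so that all divisions by `N−1 = 8` are exact in floating point; with arbitrary inputs the back-projected tails agree only to rounding error.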
+
+## Stage 1 — local matched filter (per-pixel, neighbor-agnostic)
+- In: N-frame conditioned patch. Out per pixel: continuous **(Vx, Vy, s)** + K latent "secret-message" channels.
+- **No velocity grid** — continuous V (the v1 grid-softmax's competition role moves to Stage 2). Resolves the
+  grid-vs-reg tension: continuous velocity here, competition there.
+- **Velocity range ±2.5–3 from the start** (Andrey 2026-06-18: higher range is needed/beneficial; design wide now to
+  avoid a narrow→wide re-iteration). Continuous V → NO grid-resolution penalty (unlike v1's wider grid: decimate-2
+  step-0.5 gave centroid RMSE 0.06 vs 0.03 — here it's just a wider tanh bound). Wider per-DNN range ⇒ FEWER pyramid
+  levels to stack.
+- RF ≈ trajectory reach `vmax·(N−1)` + bump. At vmax 3: **N=9 → reach 24 → patch ~52**; N=5 → reach 12 → patch ~28.
+  Lock N with the range (T5 coupling) — **lean N=9** (temporal depth = the low-SNR lever; the bigger RF is CNN-shared,
+  so amortized). Still drops v1's neighbor-suppression margin → smaller than a v1-style net at the same vmax.
+- Training: **center-positives only** (report the trajectory through me) + noise negatives (det=0). **Drop the
+  off-center negatives** — we WANT the alias ramp, not suppression. Trained on causal MB (RT bias absorbed, per
+  the −0.36→+0.02 result). CNN-shared features (cheaper than literal per-pixel MF, like the 3d3 coarse layer).
+
+## Stage 2 — learned Hough vote (spatial competition + aggregation)
+- In: the Stage-1 (Vx,Vy,s)[+latent] field. Each pixel casts an s-weighted vote at `T_P = P − V_P·(N−1)`
+  through a **learned vote kernel**; accumulate → soft-argmax → per-target detection. Recover head = T + V·(N−1).
+- **Physics as init + soft bound, not hard-coded:** seed toward the tail-anchored ramp, clamp the deviation to
+  the measured spread (the real all-frames MF spread is NARROWER than tail-only, and MB shifts it); let it learn
+  the rest.
+- **Subsumes both** the v1 velocity-softmax competition AND the ghostbuster: a ghost has one voter → loses; a real
+  target gets coherent votes → wins. No output-velocity clipping (that was backwards).
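A non-learned stand-in for the vote accumulation can make the "one voter loses, coherent voters win" argument concrete. This sketch swaps the learned vote kernel and soft-argmax for hard votes binned to the nearest integer pixel, so it illustrates the principle only; the function name, the `(P, V_P, s)` tuple layout, and all example values are assumptions.

```python
from collections import defaultdict


def hough_tail_votes(field, n_frames):
    """Accumulate s-weighted votes at the rounded tail T_P = P - V_P*(N-1)."""
    k = n_frames - 1
    acc = defaultdict(float)
    for (px, py), (vx, vy), s in field:
        cell = (round(px - vx * k), round(py - vy * k))
        acc[cell] += s
    return dict(acc)


N = 9
H, V = (100.0, 60.0), (1.0, -0.5)

# A real target: 3x3 neighborhood of weak detections following the alias ramp.
target = []
for dx in (-1, 0, 1):
    for dy in (-1, 0, 1):
        p = (H[0] + dx, H[1] + dy)
        v_p = (V[0] + dx / (N - 1), V[1] + dy / (N - 1))
        target.append((p, v_p, 0.2))

# A ghost: one isolated pixel with a higher individual confidence.
ghost = [((140.0, 30.0), (1.25, 1.25), 0.5)]

votes = hough_tail_votes(target + ghost, N)
best = max(votes, key=votes.get)
print(best, votes)
```

Nine weak voters at s=0.2 outvote the single ghost at s=0.5 because their tails coincide, which is the aggregation effect Stage 2 is meant to learn rather than hard-code.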
+
+## Loss schedule — dictated reference → end-to-end (the agreed methodology)
+Structure lives in the ARCHITECTURE (differentiable Stage1→Stage2); "dictatorship" is just the aux-loss weight:
+1. **Reference (hard):** deep supervision — aux loss pins Stage-1 (Vx,Vy,s) to the MF targets + Stage-2 final loss.
+   Interpretable, data-efficient, and the permanent **debugging substrate** (v1 was only debuggable because we could
+   read intermediates).
+2. **Anneal:** lower the Stage-1 aux weight; let the K latent channels carry richer inter-stage features.
+3. **End-to-end:** final loss only (+ small aux as regularizer). **Compare to the reference**; keep the gain if it
+   beats it, else the reference stands.
+Training order: Stage 1 alone (freezable) → Stage 2 on its field → end-to-end.
+
+## Pairs with the recurrent (T3)
+Stage-2 vote = spatial weak-evidence integration; recurrent = temporal. Same principle; eventually one framework.
+
+## To decide at implementation
+N and RF size; K latent channels; vote-accumulator sub-pixel resolution; whether Stage 2 is conv/attention/explicit
+scatter-add; how the head-recovery (T + V·(N−1)) feeds the output/recurrent. Build the reference first, measure
+against v1 (m9_p32_grid121_cblur etc.) on the same synthetic harness.
--- a/LICENSE
+++ b/LICENSE
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 3, 29 June 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The GNU General Public License is a free, copyleft license for
+software and other kinds of works.
+
+  The licenses for most software and other practical works are designed
+to take away your freedom to share and change the works.  By contrast,
+the GNU General Public License is intended to guarantee your freedom to
+share and change all versions of a program--to make sure it remains free
+software for all its users.  We, the Free Software Foundation, use the
+GNU General Public License for most of our software; it applies also to
+any other work released this way by its authors.  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+them if you wish), that you receive source code or can get it if you
+want it, that you can change the software or use pieces of it in new
+free programs, and that you know you can do these things.
+
+  To protect your rights, we need to prevent others from denying you
+these rights or asking you to surrender the rights.  Therefore, you have
+certain responsibilities if you distribute copies of the software, or if
+you modify it: responsibilities to respect the freedom of others.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must pass on to the recipients the same
+freedoms that you received.  You must make sure that they, too, receive
+or can get the source code.  And you must show them these terms so they
+know their rights.
+
+  Developers that use the GNU GPL protect your rights with two steps:
+(1) assert copyright on the software, and (2) offer you this License
+giving you legal permission to copy, distribute and/or modify it.
+
+  For the developers' and authors' protection, the GPL clearly explains
+that there is no warranty for this free software.  For both users' and
+authors' sake, the GPL requires that modified versions be marked as
+changed, so that their problems will not be attributed erroneously to
+authors of previous versions.
+
+  Some devices are designed to deny users access to install or run
+modified versions of the software inside them, although the manufacturer
+can do so.  This is fundamentally incompatible with the aim of
+protecting users' freedom to change the software.  The systematic
+pattern of such abuse occurs in the area of products for individuals to
+use, which is precisely where it is most unacceptable.  Therefore, we
+have designed this version of the GPL to prohibit the practice for those
+products.  If such problems arise substantially in other domains, we
+stand ready to extend this provision to those domains in future versions
+of the GPL, as needed to protect the freedom of users.
+
+  Finally, every program is threatened constantly by software patents.
+States should not allow patents to restrict development and use of
+software on general-purpose computers, but in those that do, we wish to
+avoid the special danger that patents applied to a free program could
+make it effectively proprietary.  To prevent this, the GPL assures that
+patents cannot be used to render the program non-free.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                       TERMS AND CONDITIONS
+
+  0. Definitions.
+
+  "This License" refers to version 3 of the GNU General Public License.
+
+  "Copyright" also means copyright-like laws that apply to other kinds of
+works, such as semiconductor masks.
+
+  "The Program" refers to any copyrightable work licensed under this
+License.  Each licensee is addressed as "you".  "Licensees" and
+"recipients" may be individuals or organizations.
+
+  To "modify" a work means to copy from or adapt all or part of the work
+in a fashion requiring copyright permission, other than the making of an
+exact copy.  The resulting work is called a "modified version" of the
+earlier work or a work "based on" the earlier work.
+
+  A "covered work" means either the unmodified Program or a work based
+on the Program.
+
+  To "propagate" a work means to do anything with it that, without
+permission, would make you directly or secondarily liable for
+infringement under applicable copyright law, except executing it on a
+computer or modifying a private copy.  Propagation includes copying,
+distribution (with or without modification), making available to the
+public, and in some countries other activities as well.
+
+  To "convey" a work means any kind of propagation that enables other
+parties to make or receive copies.  Mere interaction with a user through
+a computer network, with no transfer of a copy, is not conveying.
+
+  An interactive user interface displays "Appropriate Legal Notices"
+to the extent that it includes a convenient and prominently visible
+feature that (1) displays an appropriate copyright notice, and (2)
+tells the user that there is no warranty for the work (except to the
+extent that warranties are provided), that licensees may convey the
+work under this License, and how to view a copy of this License.  If
+the interface presents a list of user commands or options, such as a
+menu, a prominent item in the list meets this criterion.
+
+  1. Source Code.
+
+  The "source code" for a work means the preferred form of the work
+for making modifications to it.  "Object code" means any non-source
+form of a work.
+
+  A "Standard Interface" means an interface that either is an official
+standard defined by a recognized standards body, or, in the case of
+interfaces specified for a particular programming language, one that
+is widely used among developers working in that language.
+
+  The "System Libraries" of an executable work include anything, other
+than the work as a whole, that (a) is included in the normal form of
+packaging a Major Component, but which is not part of that Major
+Component, and (b) serves only to enable use of the work with that
+Major Component, or to implement a Standard Interface for which an
+implementation is available to the public in source code form.  A
+"Major Component", in this context, means a major essential component
+(kernel, window system, and so on) of the specific operating system
+(if any) on which the executable work runs, or a compiler used to
+produce the work, or an object code interpreter used to run it.
+
+  The "Corresponding Source" for a work in object code form means all
+the source code needed to generate, install, and (for an executable
+work) run the object code and to modify the work, including scripts to
+control those activities.  However, it does not include the work's
+System Libraries, or general-purpose tools or generally available free
+programs which are used unmodified in performing those activities but
+which are not part of the work.  For example, Corresponding Source
+includes interface definition files associated with source files for
+the work, and the source code for shared libraries and dynamically
+linked subprograms that the work is specifically designed to require,
+such as by intimate data communication or control flow between those
+subprograms and other parts of the work.
+
+  The Corresponding Source need not include anything that users
+can regenerate automatically from other parts of the Corresponding
+Source.
+
+  The Corresponding Source for a work in source code form is that
+same work.
+
+  2. Basic Permissions.
+
+  All rights granted under this License are granted for the term of
+copyright on the Program, and are irrevocable provided the stated
+conditions are met.  This License explicitly affirms your unlimited
+permission to run the unmodified Program.  The output from running a
+covered work is covered by this License only if the output, given its
+content, constitutes a covered work.  This License acknowledges your
+rights of fair use or other equivalent, as provided by copyright law.
+
+  You may make, run and propagate covered works that you do not
+convey, without conditions so long as your license otherwise remains
+in force.  You may convey covered works to others for the sole purpose
+of having them make modifications exclusively for you, or provide you
+with facilities for running those works, provided that you comply with
+the terms of this License in conveying all material for which you do
+not control copyright.  Those thus making or running the covered works
+for you must do so exclusively on your behalf, under your direction
+and control, on terms that prohibit them from making any copies of
+your copyrighted material outside their relationship with you.
+
+  Conveying under any other circumstances is permitted solely under
+the conditions stated below.  Sublicensing is not allowed; section 10
+makes it unnecessary.
+
+  3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+  No covered work shall be deemed part of an effective technological
+measure under any applicable law fulfilling obligations under article
+11 of the WIPO copyright treaty adopted on 20 December 1996, or
+similar laws prohibiting or restricting circumvention of such
+measures.
+
+  When you convey a covered work, you waive any legal power to forbid
+circumvention of technological measures to the extent such circumvention
+is effected by exercising rights under this License with respect to
+the covered work, and you disclaim any intention to limit operation or
+modification of the work as a means of enforcing, against the work's
+users, your or third parties' legal rights to forbid circumvention of
+technological measures.
+
+  4. Conveying Verbatim Copies.
+
+  You may convey verbatim copies of the Program's source code as you
+receive it, in any medium, provided that you conspicuously and
+appropriately publish on each copy an appropriate copyright notice;
+keep intact all notices stating that this License and any
+non-permissive terms added in accord with section 7 apply to the code;
+keep intact all notices of the absence of any warranty; and give all
+recipients a copy of this License along with the Program.
+
+  You may charge any price or no price for each copy that you convey,
+and you may offer support or warranty protection for a fee.
+
+  5. Conveying Modified Source Versions.
+
+  You may convey a work based on the Program, or the modifications to
+produce it from the Program, in the form of source code under the
+terms of section 4, provided that you also meet all of these conditions:
+
+    a) The work must carry prominent notices stating that you modified
+    it, and giving a relevant date.
+
+    b) The work must carry prominent notices stating that it is
+    released under this License and any conditions added under section
+    7.  This requirement modifies the requirement in section 4 to
+    "keep intact all notices".
+
+    c) You must license the entire work, as a whole, under this
+    License to anyone who comes into possession of a copy.  This
+    License will therefore apply, along with any applicable section 7
+    additional terms, to the whole of the work, and all its parts,
+    regardless of how they are packaged.  This License gives no
+    permission to license the work in any other way, but it does not
+    invalidate such permission if you have separately received it.
+
+    d) If the work has interactive user interfaces, each must display
+    Appropriate Legal Notices; however, if the Program has interactive
+    interfaces that do not display Appropriate Legal Notices, your
+    work need not make them do so.
+
+  A compilation of a covered work with other separate and independent
+works, which are not by their nature extensions of the covered work,
+and which are not combined with it such as to form a larger program,
+in or on a volume of a storage or distribution medium, is called an
+"aggregate" if the compilation and its resulting copyright are not
+used to limit the access or legal rights of the compilation's users
+beyond what the individual works permit.  Inclusion of a covered work
+in an aggregate does not cause this License to apply to the other
+parts of the aggregate.
+
+  6. Conveying Non-Source Forms.
+
+  You may convey a covered work in object code form under the terms
+of sections 4 and 5, provided that you also convey the
+machine-readable Corresponding Source under the terms of this License,
+in one of these ways:
+
+    a) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by the
+    Corresponding Source fixed on a durable physical medium
+    customarily used for software interchange.
+
+    b) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by a
+    written offer, valid for at least three years and valid for as
+    long as you offer spare parts or customer support for that product
+    model, to give anyone who possesses the object code either (1) a
+    copy of the Corresponding Source for all the software in the
+    product that is covered by this License, on a durable physical
+    medium customarily used for software interchange, for a price no
+    more than your reasonable cost of physically performing this
+    conveying of source, or (2) access to copy the
+    Corresponding Source from a network server at no charge.
+
+    c) Convey individual copies of the object code with a copy of the
+    written offer to provide the Corresponding Source.  This
+    alternative is allowed only occasionally and noncommercially, and
+    only if you received the object code with such an offer, in accord
+    with subsection 6b.
+
+    d) Convey the object code by offering access from a designated
+    place (gratis or for a charge), and offer equivalent access to the
+    Corresponding Source in the same way through the same place at no
+    further charge.  You need not require recipients to copy the
+    Corresponding Source along with the object code.  If the place to
+    copy the object code is a network server, the Corresponding Source
+    may be on a different server (operated by you or a third party)
+    that supports equivalent copying facilities, provided you maintain
+    clear directions next to the object code saying where to find the
+    Corresponding Source.  Regardless of what server hosts the
+    Corresponding Source, you remain obligated to ensure that it is
+    available for as long as needed to satisfy these requirements.
+
+    e) Convey the object code using peer-to-peer transmission, provided
+    you inform other peers where the object code and Corresponding
+    Source of the work are being offered to the general public at no
+    charge under subsection 6d.
+
+  A separable portion of the object code, whose source code is excluded
+from the Corresponding Source as a System Library, need not be
+included in conveying the object code work.
+
+  A "User Product" is either (1) a "consumer product", which means any
+tangible personal property which is normally used for personal, family,
+or household purposes, or (2) anything designed or sold for incorporation
+into a dwelling.  In determining whether a product is a consumer product,
+doubtful cases shall be resolved in favor of coverage.  For a particular
+product received by a particular user, "normally used" refers to a
+typical or common use of that class of product, regardless of the status
+of the particular user or of the way in which the particular user
+actually uses, or expects or is expected to use, the product.  A product
+is a consumer product regardless of whether the product has substantial
+commercial, industrial or non-consumer uses, unless such uses represent
+the only significant mode of use of the product.
+
+  "Installation Information" for a User Product means any methods,
+procedures, authorization keys, or other information required to install
+and execute modified versions of a covered work in that User Product from
+a modified version of its Corresponding Source.  The information must
+suffice to ensure that the continued functioning of the modified object
+code is in no case prevented or interfered with solely because
+modification has been made.
+
+  If you convey an object code work under this section in, or with, or
+specifically for use in, a User Product, and the conveying occurs as
+part of a transaction in which the right of possession and use of the
+User Product is transferred to the recipient in perpetuity or for a
+fixed term (regardless of how the transaction is characterized), the
+Corresponding Source conveyed under this section must be accompanied
+by the Installation Information.  But this requirement does not apply
+if neither you nor any third party retains the ability to install
+modified object code on the User Product (for example, the work has
+been installed in ROM).
+
+  The requirement to provide Installation Information does not include a
+requirement to continue to provide support service, warranty, or updates
+for a work that has been modified or installed by the recipient, or for
+the User Product in which it has been modified or installed.  Access to a
+network may be denied when the modification itself materially and
+adversely affects the operation of the network or violates the rules and
+protocols for communication across the network.
+
+  Corresponding Source conveyed, and Installation Information provided,
+in accord with this section must be in a format that is publicly
+documented (and with an implementation available to the public in
+source code form), and must require no special password or key for
+unpacking, reading or copying.
+
+  7. Additional Terms.
+
+  "Additional permissions" are terms that supplement the terms of this
+License by making exceptions from one or more of its conditions.
+Additional permissions that are applicable to the entire Program shall
+be treated as though they were included in this License, to the extent
+that they are valid under applicable law.  If additional permissions
+apply only to part of the Program, that part may be used separately
+under those permissions, but the entire Program remains governed by
+this License without regard to the additional permissions.
+
+  When you convey a copy of a covered work, you may at your option
+remove any additional permissions from that copy, or from any part of
+it.  (Additional permissions may be written to require their own
+removal in certain cases when you modify the work.)  You may place
+additional permissions on material, added by you to a covered work,
+for which you have or can give appropriate copyright permission.
+
+  Notwithstanding any other provision of this License, for material you
+add to a covered work, you may (if authorized by the copyright holders of
+that material) supplement the terms of this License with terms:
+
+    a) Disclaiming warranty or limiting liability differently from the
+    terms of sections 15 and 16 of this License; or
+
+    b) Requiring preservation of specified reasonable legal notices or
+    author attributions in that material or in the Appropriate Legal
+    Notices displayed by works containing it; or
+
+    c) Prohibiting misrepresentation of the origin of that material, or
+    requiring that modified versions of such material be marked in
+    reasonable ways as different from the original version; or
+
+    d) Limiting the use for publicity purposes of names of licensors or
+    authors of the material; or
+
+    e) Declining to grant rights under trademark law for use of some
+    trade names, trademarks, or service marks; or
+
+    f) Requiring indemnification of licensors and authors of that
+    material by anyone who conveys the material (or modified versions of
+    it) with contractual assumptions of liability to the recipient, for
+    any liability that these contractual assumptions directly impose on
+    those licensors and authors.
+
+  All other non-permissive additional terms are considered "further
+restrictions" within the meaning of section 10.  If the Program as you
+received it, or any part of it, contains a notice stating that it is
+governed by this License along with a term that is a further
+restriction, you may remove that term.  If a license document contains
+a further restriction but permits relicensing or conveying under this
+License, you may add to a covered work material governed by the terms
+of that license document, provided that the further restriction does
+not survive such relicensing or conveying.
+
+  If you add terms to a covered work in accord with this section, you
+must place, in the relevant source files, a statement of the
+additional terms that apply to those files, or a notice indicating
+where to find the applicable terms.
+
+  Additional terms, permissive or non-permissive, may be stated in the
+form of a separately written license, or stated as exceptions;
+the above requirements apply either way.
+
+  8. Termination.
+
+  You may not propagate or modify a covered work except as expressly
+provided under this License.  Any attempt otherwise to propagate or
+modify it is void, and will automatically terminate your rights under
+this License (including any patent licenses granted under the third
+paragraph of section 11).
+
+  However, if you cease all violation of this License, then your
+license from a particular copyright holder is reinstated (a)
+provisionally, unless and until the copyright holder explicitly and
+finally terminates your license, and (b) permanently, if the copyright
+holder fails to notify you of the violation by some reasonable means
+prior to 60 days after the cessation.
+
+  Moreover, your license from a particular copyright holder is
+reinstated permanently if the copyright holder notifies you of the
+violation by some reasonable means, this is the first time you have
+received notice of violation of this License (for any work) from that
+copyright holder, and you cure the violation prior to 30 days after
+your receipt of the notice.
+
+  Termination of your rights under this section does not terminate the
+licenses of parties who have received copies or rights from you under
+this License.  If your rights have been terminated and not permanently
+reinstated, you do not qualify to receive new licenses for the same
+material under section 10.
+
+  9. Acceptance Not Required for Having Copies.
+
+  You are not required to accept this License in order to receive or
+run a copy of the Program.  Ancillary propagation of a covered work
+occurring solely as a consequence of using peer-to-peer transmission
+to receive a copy likewise does not require acceptance.  However,
+nothing other than this License grants you permission to propagate or
+modify any covered work.  These actions infringe copyright if you do
+not accept this License.  Therefore, by modifying or propagating a
+covered work, you indicate your acceptance of this License to do so.
+
+  10. Automatic Licensing of Downstream Recipients.
+
+  Each time you convey a covered work, the recipient automatically
+receives a license from the original licensors, to run, modify and
+propagate that work, subject to this License.  You are not responsible
+for enforcing compliance by third parties with this License.
+
+  An "entity transaction" is a transaction transferring control of an
+organization, or substantially all assets of one, or subdividing an
+organization, or merging organizations.  If propagation of a covered
+work results from an entity transaction, each party to that
+transaction who receives a copy of the work also receives whatever
+licenses to the work the party's predecessor in interest had or could
+give under the previous paragraph, plus a right to possession of the
+Corresponding Source of the work from the predecessor in interest, if
+the predecessor has it or can get it with reasonable efforts.
+
+  You may not impose any further restrictions on the exercise of the
+rights granted or affirmed under this License.  For example, you may
+not impose a license fee, royalty, or other charge for exercise of
+rights granted under this License, and you may not initiate litigation
+(including a cross-claim or counterclaim in a lawsuit) alleging that
+any patent claim is infringed by making, using, selling, offering for
+sale, or importing the Program or any portion of it.
+
+  11. Patents.
+
+  A "contributor" is a copyright holder who authorizes use under this
+License of the Program or a work on which the Program is based.  The
+work thus licensed is called the contributor's "contributor version".
+
+  A contributor's "essential patent claims" are all patent claims
+owned or controlled by the contributor, whether already acquired or
+hereafter acquired, that would be infringed by some manner, permitted
+by this License, of making, using, or selling its contributor version,
+but do not include claims that would be infringed only as a
+consequence of further modification of the contributor version.  For
+purposes of this definition, "control" includes the right to grant
+patent sublicenses in a manner consistent with the requirements of
+this License.
+
+  Each contributor grants you a non-exclusive, worldwide, royalty-free
+patent license under the contributor's essential patent claims, to
+make, use, sell, offer for sale, import and otherwise run, modify and
+propagate the contents of its contributor version.
+
+  In the following three paragraphs, a "patent license" is any express
+agreement or commitment, however denominated, not to enforce a patent
+(such as an express permission to practice a patent or covenant not to
+sue for patent infringement).  To "grant" such a patent license to a
+party means to make such an agreement or commitment not to enforce a
+patent against the party.
+
+  If you convey a covered work, knowingly relying on a patent license,
+and the Corresponding Source of the work is not available for anyone
+to copy, free of charge and under the terms of this License, through a
+publicly available network server or other readily accessible means,
+then you must either (1) cause the Corresponding Source to be so
+available, or (2) arrange to deprive yourself of the benefit of the
+patent license for this particular work, or (3) arrange, in a manner
+consistent with the requirements of this License, to extend the patent
+license to downstream recipients.  "Knowingly relying" means you have
+actual knowledge that, but for the patent license, your conveying the
+covered work in a country, or your recipient's use of the covered work
+in a country, would infringe one or more identifiable patents in that
+country that you have reason to believe are valid.
+
+  If, pursuant to or in connection with a single transaction or
+arrangement, you convey, or propagate by procuring conveyance of, a
+covered work, and grant a patent license to some of the parties
+receiving the covered work authorizing them to use, propagate, modify
+or convey a specific copy of the covered work, then the patent license
+you grant is automatically extended to all recipients of the covered
+work and works based on it.
+
+  A patent license is "discriminatory" if it does not include within
+the scope of its coverage, prohibits the exercise of, or is
+conditioned on the non-exercise of one or more of the rights that are
+specifically granted under this License.  You may not convey a covered
+work if you are a party to an arrangement with a third party that is
+in the business of distributing software, under which you make payment
+to the third party based on the extent of your activity of conveying
+the work, and under which the third party grants, to any of the
+parties who would receive the covered work from you, a discriminatory
+patent license (a) in connection with copies of the covered work
+conveyed by you (or copies made from those copies), or (b) primarily
+for and in connection with specific products or compilations that
+contain the covered work, unless you entered into that arrangement,
+or that patent license was granted, prior to 28 March 2007.
+
+  Nothing in this License shall be construed as excluding or limiting
+any implied license or other defenses to infringement that may
+otherwise be available to you under applicable patent law.
+
+  12. No Surrender of Others' Freedom.
+
+  If conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot convey a
+covered work so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you may
+not convey it at all.  For example, if you agree to terms that obligate you
+to collect a royalty for further conveying from those to whom you convey
+the Program, the only way you could satisfy both those terms and this
+License would be to refrain entirely from conveying the Program.
+
+  13. Use with the GNU Affero General Public License.
+
+  Notwithstanding any other provision of this License, you have
+permission to link or combine any covered work with a work licensed
+under version 3 of the GNU Affero General Public License into a single
+combined work, and to convey the resulting work.  The terms of this
+License will continue to apply to the part which is the covered work,
+but the special requirements of the GNU Affero General Public License,
+section 13, concerning interaction through a network will apply to the
+combination as such.
+
+  14. Revised Versions of this License.
+
+  The Free Software Foundation may publish revised and/or new versions of
+the GNU General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+  Each version is given a distinguishing version number.  If the
+Program specifies that a certain numbered version of the GNU General
+Public License "or any later version" applies to it, you have the
+option of following the terms and conditions either of that numbered
+version or of any later version published by the Free Software
+Foundation.  If the Program does not specify a version number of the
+GNU General Public License, you may choose any version ever published
+by the Free Software Foundation.
+
+  If the Program specifies that a proxy can decide which future
+versions of the GNU General Public License can be used, that proxy's
+public statement of acceptance of a version permanently authorizes you
+to choose that version for the Program.
+
+  Later license versions may give you additional or different
+permissions.  However, no additional obligations are imposed on any
+author or copyright holder as a result of your choosing to follow a
+later version.
+
+  15. Disclaimer of Warranty.
+
+  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+  16. Limitation of Liability.
+
+  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGES.
+
+  17. Interpretation of Sections 15 and 16.
+
+  If the disclaimer of warranty and limitation of liability provided
+above cannot be given local legal effect according to their terms,
+reviewing courts shall apply local law that most closely approximates
+an absolute waiver of all civil liability in connection with the
+Program, unless a warranty or assumption of liability accompanies a
+copy of the Program in return for a fee.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+state the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    imagej-elphel
+    Copyright (C) 2017  Elphel
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 3 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+  If the program does terminal interaction, make it output a short
+notice like this when it starts in an interactive mode:
+
+    imagej-elphel  Copyright (C) 2017  Elphel
+    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, your program's commands
+might be different; for a GUI interface, you would use an "about box".
+
+  You should also get your employer (if you work as a programmer) or school,
+if any, to sign a "copyright disclaimer" for the program, if necessary.
+For more information on this, and how to apply and follow the GNU GPL, see
+<http://www.gnu.org/licenses/>.
+
+  The GNU General Public License does not permit incorporating your program
+into proprietary programs.  If your program is a subroutine library, you
+may consider it more useful to permit linking proprietary applications with
+the library.  If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.  But first, please read
+<http://www.gnu.org/philosophy/why-not-lgpl.html>.
--- a/README.md
+++ b/README.md
+# imagej_elphel_dnn
+
+DNN companion to [imagej-elphel](https://git.elphel.com/Elphel/imagej-elphel) for tile-processor
+motion detection / ranging. GPLv3.
+
+## Layers
+- **L1 — `RawFCN`** (`model.py`): fully-convolutional patch net `[B,N,P,P] -> [B,C,1,1]`, slid densely over
+  the full frame (no FC layers). Per-tile detection logit + velocity field. Trained on synthetic sequences.
+- **L2 — `Layer2Net` / ConvGRU on a torus** (`layer2.py`): learned track-before-detect over the L1 field.
+
+## Workflow
+- Training / eval / synthetic data generation: `train.py`, `layer2_train*.py`, `gen_synth_cuas.py`, `synth.py`, eval scripts.
+- Inference (dev/remote): `infer_server.py` (+ `run_infer_server.sh`) — PyTorch server.
+- **Deployment export:** `export_torchscript.py` produces a self-contained TorchScript `.pt` (weights + graph),
+  loaded natively (no Python) by LibTorch in [tile_processor_gpu](https://git.elphel.com/Elphel/tile_processor_gpu)
+  and bundled as a resource in imagej-elphel (`resources/cuas_dnn/`). Build-once on a dev box (PyTorch);
+  deployment needs only the NVIDIA driver + libtorch runtime + the bundled `.pt`.
+
+Checkpoints (`runs/`) are training outputs and are not tracked.
--- a/baselines.py
+++ b/baselines.py
+# baselines.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""
+Phase-correlation and matched-filter velocity baselines for the C5P DNN benchmark. # By Claude on 06/13/2026
+
+Pure numpy (no torch) so they can be unit-tested anywhere. Both estimate (vx,vy) px/frame
+from an [N,H,W] patch stack (frame 0 = newest; target near center). These are the bars the
+network must match (PC) and beat (matched filter = our current linear front-end).
+"""
+
+import numpy as np
+
+
+def _parabolic(c, l, r):
+    """Sub-sample peak offset in [-0.5,0.5] from 3 samples passed as (center, left, right); c is the max."""
+    d = (l - r)
+    den = 2.0 * (l - 2.0 * c + r)
+    return (d / den) if abs(den) > 1e-12 else 0.0
+
+
+def pc_shift(a, b):
+    """Phase correlation: estimate the shift (sx,sy) of `a` relative to `b` (sub-pixel).
+    Positive sx means `a`'s content is at larger x than `b`'s."""
+    H, W = a.shape
+    Fa = np.fft.fft2(a); Fb = np.fft.fft2(b)
+    R = Fa * np.conj(Fb)
+    R /= (np.abs(R) + 1e-9)            # whiten -> phase correlation (fat zero eps)
+    r = np.fft.ifft2(R).real
+    py, px = np.unravel_index(np.argmax(r), r.shape)
+    # sub-pixel parabolic on each axis (wrap neighbors)
+    dx = _parabolic(r[py, px], r[py, (px - 1) % W], r[py, (px + 1) % W])
+    dy = _parabolic(r[py, px], r[(py - 1) % H, px], r[(py + 1) % H, px])
+    sx = (px + dx); sy = (py + dy)
+    if sx > W / 2: sx -= W
+    if sy > H / 2: sy -= H
+    return sx, sy
+
+
+def pc_velocity(frames):
+    """NAIVE PC (kept for contrast, DO NOT use as the bar): per-pair whitening then spatial
+    shift-averaging - amplifies noise, fails at low SNR. See pc_velocity_fd for the real one."""
+    N = frames.shape[0]
+    sx = sy = 0.0
+    for i in range(N - 1):
+        dx, dy = pc_shift(frames[i], frames[i + 1])
+        sx += dx; sy += dy
+    return sx / (N - 1), sy / (N - 1)
+
+
+def pc_velocity_fd(frames, baseline=1, fatzero=1e-2):
+    """Andrey's coherent PC: SUM the cross-power spectra of same-baseline pairs in the # By Claude on 06/13/2026
+    frequency domain BEFORE normalization, normalize once (fat-zero regularized), ifft ->
+    one correlation surface from all pairs jointly. For constant velocity every consecutive
+    pair encodes the same per-frame shift, so they add coherently (~(N-1)x SNR) and a single
+    whitening then localizes the peak. baseline>1 uses longer-baseline pairs (bigger, better-
+    resolved displacement for slow targets - the 'increase pairs/baseline when speed is low').
+    Returns (vx,vy) px/frame. (Masked iterative + tracking-camera refinement not included
+    here - this is the core FD-combined estimate; refinement would push it further.)"""
+    N, H, W = frames.shape
+    F = np.fft.fft2(frames, axes=(-2, -1))          # [N,H,W] complex
+    R = np.zeros((H, W), dtype=complex)
+    for i in range(N - baseline):
+        R += F[i] * np.conj(F[i + baseline])        # combine in FD BEFORE normalization
+    R /= (np.abs(R) + fatzero * np.abs(R).max())    # single whitening (fat zero)
+    r = np.fft.ifft2(R).real
+    py, px = np.unravel_index(np.argmax(r), r.shape)
+    dx = _parabolic(r[py, px], r[py, (px - 1) % W], r[py, (px + 1) % W])
+    dy = _parabolic(r[py, px], r[(py - 1) % H, px], r[(py + 1) % H, px])
+    sx = px + dx; sy = py + dy
+    if sx > W / 2: sx -= W
+    if sy > H / 2: sy -= H
+    return sx / baseline, sy / baseline             # peak shift = v*baseline
+
+
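+def _bump(cx, cy, H, W, radial=False):
+    """Half-cosine bump (peak 1, zero at 1.5 px) centered at (cx,cy) on an [H,W] grid:
+    separable product of x and y profiles, or radially symmetric if radial=True."""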
+    ys = np.arange(H)[:, None] - cy; xs = np.arange(W)[None, :] - cx
+    if radial:
+        rr = np.sqrt(xs * xs + ys * ys)
+        return np.where(rr < 1.5, np.cos(np.pi / 3.0 * rr), 0.0)
+    bx = np.where(np.abs(xs) < 1.5, np.cos(np.pi / 3.0 * np.abs(xs)), 0.0)
+    by = np.where(np.abs(ys) < 1.5, np.cos(np.pi / 3.0 * np.abs(ys)), 0.0)
+    return bx * by
+
+
+def mf_velocity(frames, vel_radius=5, vel_decimate=4, radial=False, parabolic=True):
+    """Matched-filter velocity (= the C5P statistic at the patch center): correlate the
+    stack with the swept bump for each velocity cell, argmax over the grid (+ parabolic
+    sub-cell). Returns (vx,vy) px/frame and the peak response."""
+    N, H, W = frames.shape
+    cx0 = (W - 1) / 2.0; cy0 = (H - 1) / 2.0
+    vdim = 2 * vel_radius + 1
+    resp = np.empty((vdim, vdim), dtype=np.float64)
+    for iy, vyc in enumerate(range(-vel_radius, vel_radius + 1)):
+        for ix, vxc in enumerate(range(-vel_radius, vel_radius + 1)):
+            vx = vxc / vel_decimate; vy = vyc / vel_decimate
+            s = 0.0
+            for i in range(N):
+                t = _bump(cx0 - vx * i, cy0 - vy * i, H, W, radial)
+                s += float((frames[i] * t).sum())
+            resp[iy, ix] = s
+    iy, ix = np.unravel_index(np.argmax(resp), resp.shape)
+    vyc, vxc = iy - vel_radius, ix - vel_radius
+    if parabolic and 0 < ix < vdim - 1 and 0 < iy < vdim - 1:
+        vxc += _parabolic(resp[iy, ix], resp[iy, ix - 1], resp[iy, ix + 1])
+        vyc += _parabolic(resp[iy, ix], resp[iy - 1, ix], resp[iy + 1, ix])
+    return vxc / vel_decimate, vyc / vel_decimate, float(resp[iy, ix])
+
+
+if __name__ == "__main__":
+    import synth
+    rng = np.random.default_rng(7)
+    print("velRMSE px/fr  (PCnaive = bad strawman, PCfd = Andrey's coherent FD-combined, MF = matched filter)")
+    for snr in [2.0, 3.0, 5.0, 8.0, 100.0]:
+        en = []; efd = []; efd4 = []; emf = []
+        for _ in range(200):
+            f, lab = synth.generate_sample(rng, snr=snr, target=True)
+            vxn, vyn = pc_velocity(f)
+            vxf, vyf = pc_velocity_fd(f, baseline=1)
+            vxf4, vyf4 = pc_velocity_fd(f, baseline=4)   # longer baseline for slow targets
+            vxm, vym, _ = mf_velocity(f)
+            en.append(np.hypot(vxn - lab["vx"], vyn - lab["vy"]))
+            efd.append(np.hypot(vxf - lab["vx"], vyf - lab["vy"]))
+            efd4.append(np.hypot(vxf4 - lab["vx"], vyf4 - lab["vy"]))
+            emf.append(np.hypot(vxm - lab["vx"], vym - lab["vy"]))
+        rms = lambda e: np.sqrt(np.mean(np.square(e)))
+        print(f"snr={snr:6.1f}  PCnaive={rms(en):.4f}  PCfd(b1)={rms(efd):.4f}  "
+              f"PCfd(b4)={rms(efd4):.4f}  MF={rms(emf):.4f}")
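The sub-sample refinement and the wrap-around step shared by `pc_shift`, `pc_velocity_fd`, and `mf_velocity` can be checked in isolation. The following is a minimal pure-Python sketch, not part of `baselines.py`: `parabolic_offset` mirrors `_parabolic`, while `unwrap_shift` is a hypothetical helper name for the `if sx > W / 2: sx -= W` logic.

```python
def parabolic_offset(c, l, r):
    """Vertex offset of the parabola through (-1, l), (0, c), (+1, r)."""
    den = 2.0 * (l - 2.0 * c + r)
    return (l - r) / den if abs(den) > 1e-12 else 0.0


def unwrap_shift(peak_index, offset, size):
    """Map a circular-correlation peak position to a signed shift (hypothetical helper)."""
    s = peak_index + offset
    return s - size if s > size / 2 else s


# Sample an exact parabola whose true vertex sits at +0.3 from the center sample.
true_offset = 0.3
l, c, r = (-(x - true_offset) ** 2 for x in (-1.0, 0.0, 1.0))
est = parabolic_offset(c, l, r)

# A peak near index 23 on a 24-sample circular axis is a small negative shift.
shift = unwrap_shift(23, -0.2, 24)
print(est, shift)
```

For an exactly parabolic peak the three-point fit recovers the vertex; on a real correlation surface it is only an approximation, which is why the functions above apply it to the immediate neighbors of the argmax only.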
--- a/build_combo3.py
+++ b/build_combo3.py
+# build_combo3.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Build the standard combo3_hyper.tif from per-type eval tiffs. By Claude on 06/24/2026.
+
+The locked L2 viewing standard (Andrey, 2026-06-24): ImageJ hyperstack axes TZCYX,
+C=type / Z=frame / T=series, grayscale, each 32x32 frame 2x2-tiled to 64x64 (seam at
+center cross), per-slice labels "TYPE sN fF". C=type keeps ImageJ's display range
+per-channel so scrubbing frame(Z)/series(T) holds a common contrast.
+
+Channel order matches gap_eval.py/clean_eval_fwhm.py outputs:
+  gap evals  (5 types): L2_det, L1_s, input, truth, signal
+  clean/easy (3 types): L2_det, L1_s, truth
+Reads whatever subset is present in --dir, in that fixed order.
+
+Usage:  build_combo3.py --dir runs/l1views/mhc_eval --T 128
+        build_combo3.py --dir runs/l1views/mhc_easy --T 120
+"""
+import argparse, os, numpy as np, tifffile
+
+ORDER = ["L2_det", "L1_s", "input", "truth", "signal"]  # fixed C order
+
+
+def tile2x2(a):  # (...,32,32) -> (...,64,64), seam at center cross
+    return np.tile(a, (1, 1, 2, 2)) if a.ndim == 4 else np.tile(a, (2, 2))
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--dir", required=True)
+    ap.add_argument("--T", type=int, default=128, help="frames per series")
+    ap.add_argument("--out", default=None)
+    a = ap.parse_args()
+    # Default filename = folder basename, so open windows get distinct/descriptive ImageJ
+    # titles (e.g. mhgb_train12_hyper.tif), not a generic combo3_hyper.tif. By Claude 06/24/2026.
+    out = a.out or os.path.join(a.dir, os.path.basename(os.path.normpath(a.dir)) + "_hyper.tif")
+
+    types = [t for t in ORDER if os.path.exists(os.path.join(a.dir, f"{t}.tif"))]
+    if not types:
+        raise SystemExit(f"no per-type tiffs found in {a.dir}")
+    stacks = []
+    for t in types:
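+        s = tifffile.imread(os.path.join(a.dir, f"{t}.tif")).astype(np.float32)  # (series*frames,32,32)
+        if s.shape[0] % a.T:  # reshape below needs whole series
+            raise SystemExit(f"{t}.tif: {s.shape[0]} slices is not a multiple of --T={a.T}")
+        nser = s.shape[0] // a.T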
+        s = s.reshape(nser, a.T, s.shape[1], s.shape[2])  # (series,frame,32,32)
+        s = tile2x2(s)                                     # (series,frame,64,64)
+        stacks.append(s)
+    nser, nfr = stacks[0].shape[0], stacks[0].shape[1]
+    C = len(types)
+    vol = np.stack(stacks, axis=2)  # (T=series, Z=frame, C=type, Y, X)
+
+    # ImageJ Labels: C fastest, then Z(frame), then T(series)
+    labels = [f"{types[c]} s{t} f{z}" for t in range(nser) for z in range(nfr) for c in range(C)]
+    tifffile.imwrite(out, vol, imagej=True,
+                     metadata={"axes": "TZCYX", "mode": "grayscale", "Labels": labels})
+    print(f"wrote {out}  TZCYX={vol.shape}  types={types}  series={nser} frames={nfr}")
+
+
+if __name__ == "__main__":
+    main()
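The per-slice labels in `build_combo3.py` only line up with the image planes if the label list is generated in the same order as the planes of the `(T, Z, C, Y, X)` array. Below is a small pure-Python sketch of that ordering, assuming planes are laid out in row-major order of the leading axes (C fastest, then Z, then T, as the comment in `main` states). `flat_index` and the sizes are illustrative and do not appear in the script.

```python
def flat_index(t, z, c, n_frames, n_types):
    """Plane index of (series t, frame z, type c) for a row-major (T, Z, C, Y, X) array."""
    return (t * n_frames + z) * n_types + c


types = ["L2_det", "L1_s", "truth"]   # example subset of ORDER
n_series, n_frames = 2, 4             # small illustrative sizes

# Same comprehension order as in main(): series outermost, type innermost.
labels = [f"{types[c]} s{t} f{z}"
          for t in range(n_series)
          for z in range(n_frames)
          for c in range(len(types))]

# The label for series 1, frame 2, type "truth" sits at the matching plane index.
i = flat_index(1, 2, 2, n_frames, len(types))
print(i, labels[i])
```

Swapping the loop order in the comprehension would leave the image data intact but attach each label to the wrong plane, so the two orderings need to change together.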
--- a/clean_eval_fwhm.py
+++ b/clean_eval_fwhm.py
+# clean_eval_fwhm.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Clean FWHM eval: trained L2 vs L1, on NO-GAP data, bright frames only. By Claude 06/23/2026"""
+import argparse, numpy as np, torch, synth, layer2_data as L1D
+from layer2 import Layer2Net
+
+def fwhm_at(field, cx, cy, G):
+    # roll the field so the true position (cx,cy) sits at the center, then average the
+    # half-max widths of the center row & col (sub-pixel, linear interpolation at the crossings)
+    f=np.roll(np.roll(field,int(round(G//2-cy)),0),int(round(G//2-cx)),1); c=G//2
+    def w(line):
+        pk=float(line[c])
+        if pk<=1e-6: return np.nan
+        h=pk/2
+        # right crossing
+        i=c
+        while i<len(line)-1 and line[i]>=h: i+=1
+        r=(i-1)+(line[i-1]-h)/(line[i-1]-line[i]) if line[i-1]>line[i] else float(i-1)
+        # left
+        i=c
+        while i>0 and line[i]>=h: i-=1
+        l=(i+1)-(line[i+1]-h)/(line[i+1]-line[i]) if line[i+1]>line[i] else float(i+1)
+        return r-l
+    return np.nanmean([w(f[c,:]),w(f[:,c])])
+
+ap=argparse.ArgumentParser()
+ap.add_argument("--l1",default="runs/weighted9_pm/model.pt")
+ap.add_argument("--model",required=True); ap.add_argument("--out",required=True)
+ap.add_argument("--T",type=int,default=120); ap.add_argument("--G",type=int,default=32)
+a=ap.parse_args(); import os; os.makedirs(a.out,exist_ok=True)
+dev="cuda" if torch.cuda.is_available() else "cpu"
+net1,N,_=L1D._load_l1(a.l1,dev)
+ck=torch.load(a.model,map_location=dev); ar=ck["args"]
+net=Layer2Net(ch_in=3,ch_hidden=ar["ch"],grid=a.G,vmax=ar["vmax"]).to(dev)
+net.load_state_dict(ck["model"]); net.eval()
+rng=np.random.default_rng(777)
+# CLEAN no-gap data (matches training)
+frames,pos,vel,present=L1D.render_run(rng,T=a.T,G=a.G,vmax=ar["vmax"],snr=ar["snr"],gaps=False)
+seq=L1D.gen_field_sequence(net1,frames,pos,a.G,N,dev)
+with torch.no_grad():
+    det,_=net(torch.from_numpy(seq[None]).to(dev))
+    l2=torch.sigmoid(det)[0,:,0].cpu().numpy()
+l1s=seq[:,0]
+fl1,fl2,fp_away=[],[],[]
+for t in range(a.T):
+    if not present[t]: continue
+    cx,cy=pos[t]; ci,cj=int(round(cy))%a.G,int(round(cx))%a.G
+    if l1s[t,ci,cj]>0.5: fl1.append(fwhm_at(l1s[t],cx,cy,a.G))   # L1 bright frames
+    if l2[t].max()>0.5:
+        fl2.append(fwhm_at(l2[t],cx,cy,a.G))
+        m=np.ones((a.G,a.G),bool); yy,xx=np.ogrid[:a.G,:a.G]
+        m[(((xx-cj+a.G/2)%a.G-a.G/2)**2+((yy-ci+a.G/2)%a.G-a.G/2)**2)<=9]=False
+        fp_away.append(float(l2[t][m].max()))
+synth.save_tiff_stack(l1s,f"{a.out}/L1_s.tif"); synth.save_tiff_stack(l2,f"{a.out}/L2_det.tif")
+tr=np.zeros((a.T,a.G,a.G),np.float32)
+for t in range(a.T):
+    if present[t]: tr[t]=L1D.halfcos_bump_torus(pos[t,0],pos[t,1],a.G)
+synth.save_tiff_stack(tr,f"{a.out}/truth.tif")
+print(f"=== CLEAN no-gap eval of {a.model} ===")
+print(f"present {int(present.sum())}/{a.T}  L1 locked {len(fl1)}  L2 locked {len(fl2)}")
+print(f"FWHM @ target (bright frames):  L1 ~ {np.nanmean(fl1):.2f} px   L2 ~ {np.nanmean(fl2):.2f} px")
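+if fp_away:
+    print(f"L2 max-FP away-from-target (locked frames): mean {np.mean(fp_away):.2f}  max {np.max(fp_away):.2f}")
+else:
+    print("L2 max-FP away-from-target: n/a (L2 never locked, so np.max would raise on the empty list)")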
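Note on `fwhm_at` above: the half-max width is measured by walking outward from the peak and linearly interpolating the two crossings. The block below is a minimal standalone sketch of that idea on a 1-D Gaussian, not the repository's code. The name `fwhm_1d` and the test profile are illustrative assumptions.

```python
import math

def fwhm_1d(line, c):
    """Full width at half maximum around peak index c, with linearly interpolated crossings."""
    peak = float(line[c])
    if peak <= 1e-6:
        return float("nan")              # no usable peak
    half = peak / 2.0
    # walk right until the profile drops below half-max
    i = c
    while i < len(line) - 1 and line[i] >= half:
        i += 1
    right = (i - 1) + (line[i - 1] - half) / (line[i - 1] - line[i]) if line[i - 1] > line[i] else float(i - 1)
    # walk left the same way
    i = c
    while i > 0 and line[i] >= half:
        i -= 1
    left = (i + 1) - (line[i + 1] - half) / (line[i + 1] - line[i]) if line[i + 1] > line[i] else float(i + 1)
    return right - left

# Hypothetical test profile: unit Gaussian, sigma = 2 px, centered on sample 16
sigma = 2.0
profile = [math.exp(-((x - 16) ** 2) / (2 * sigma ** 2)) for x in range(32)]
width = fwhm_1d(profile, 16)
expected = 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma   # analytic Gaussian FWHM
print(f"measured {width:.3f} px, analytic {expected:.3f} px")
```

Linear interpolation slightly overestimates the width of a convex-tailed profile, so the measured value should be read as approximate at this sampling density.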
--- a/compare_dnn_truth.py
+++ b/compare_dnn_truth.py
+#!/usr/bin/env python3
+# compare_dnn_truth.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""
+compare_dnn_truth.py  - compare the DNN -DNN-OFFSET output against the smoothed
+UAS flight-log ground-truth track of the real target (DJI Mini 4 Pro @ ~600 m).
+
+Reads the ImageJ hyperstack  *-DNN...-OFFSET.tiff  (channels dx, dy, s over the
+save ROI, one Z per scene), parses the per-scene timestamp labels, and for each
+scene reports: max detection s, the sub-pixel detected position (px+dx, py+dy)
+at the peak pixel, the interpolated truth position, and the position error.
+
+This is a first-cut scaffold: it INTROSPECTS the TIFF first (shape, imagej
+metadata, labels) and then parses with the assumptions marked  # VERIFY  below.
+If the axis order / label format differ, flip the marked lines - the structure
+is otherwise correct.
+
+Deps:  pip install tifffile numpy   (matplotlib optional for --plot)
+Needs: UAS_log.csv in the same directory as the TIFF (the script exits without it).
+Usage: python3 compare_dnn_truth.py /path/to/dir_or_file [--plot] [--sthr 0.3]
+"""
+import sys, glob, re, argparse
+import numpy as np
+
+# ----- ground truth from the UAS flight log (smoothed + interpolated) + constant offset -----
+# log cols: timestamp(full s), px, py, range(m). px,py UPDATE at ~5 fps (held between samples ->
+# a staircase) while logged finer, so we SMOOTH (moving average) then interpolate to scene times.
+# The drone curves (NOT constant velocity), hence the log rather than a linear fit. A constant
+# calibration offset aligns log->image:  image_truth(t) = smooth_log(t) + (OFFSET_X, OFFSET_Y).
+OFFSET_X, OFFSET_Y = -4.0, +2.0   # log->image; refine (log(410.1,306.1) vs eyeball(406,308))
+SMOOTH_WIN = 9                    # moving-average window in log samples (~0.3 s) to de-staircase the 5 fps updates
+ROI_X0, ROI_Y0, ROI_W, ROI_H = 395, 300, 70, 20
+VEL_RADIUS = 5     # fan is (2*VR+1)^2 cells (curt_vel_radius)
+VEL_STEP   = 0.25  # px/level-frame per grid cell (= 1/curt_vel_decimate, VD4)
+SEARCH_R   = 3     # px window around the log truth: measure the DNN on the TARGET, not the y~312 clutter
+_LOG = {'t': None, 'x': None, 'y': None}
+
+
+def read_meta(path):
+    """All <key>value</key> props from the ImageJ Info XML (TIFF ImageDescription): curt_* etc. By Claude on 06/16/2026"""
+    import tifffile, xml.etree.ElementTree as ET
+    info = None
+    try:
+        with tifffile.TiffFile(path) as tf:
+            md = tf.imagej_metadata or {}
+            info = md.get('Info')
+            if info is None:
+                try: info = tf.pages[0].tags['ImageDescription'].value
+                except Exception: info = None
+    except Exception:
+        info = None
+    props = {}
+    if info:
+        try:
+            for ch in ET.fromstring(info):
+                props[ch.tag] = ch.text
+        except Exception:
+            pass
+    return props
+
+
+def _smooth(v, w):
+    """Edge-safe moving average (pad with edge values, no zero-bias at the ends)."""
+    if w <= 1 or v.size < w:
+        return v
+    p = w // 2
+    return np.convolve(np.pad(v, p, mode='edge'), np.ones(w) / w, mode='valid')[:v.size]
+
+
+def load_log(path):
+    import csv
+    ts, xs, ys = [], [], []
+    with open(path) as f:
+        rd = csv.reader(f); next(rd)                       # skip header
+        for row in rd:
+            if len(row) < 3:
+                continue
+            ts.append(float(row[0])); xs.append(float(row[1])); ys.append(float(row[2]))
+    o = np.argsort(ts)
+    t = np.array(ts)[o]
+    _LOG['t'], _LOG['x'], _LOG['y'] = t, _smooth(np.array(xs)[o], SMOOTH_WIN), _smooth(np.array(ys)[o], SMOOTH_WIN)
+    print(f"  log: {len(t)} samples, t {t[0]:.3f}..{t[-1]:.3f}, smooth_win={SMOOTH_WIN}, offset=({OFFSET_X:+.1f},{OFFSET_Y:+.1f})")
+
+
+def truth_xy(t, margin=0.3):
+    """Smoothed-log truth (x, y) at full-second timestamp t, + constant offset. None outside log window."""
+    if t is None or _LOG['t'] is None or t < _LOG['t'][0] - margin or t > _LOG['t'][-1] + margin:
+        return None
+    return (float(np.interp(t, _LOG['t'], _LOG['x'])) + OFFSET_X,
+            float(np.interp(t, _LOG['t'], _LOG['y'])) + OFFSET_Y)
+
+
+def parse_ts(label):
+    """ '1773135468_500748-0 f3'  ->  1773135468.500748  (full seconds + fraction, matching the log's timestamps)."""
+    if label is None:
+        return None
+    m = re.search(r'(\d{6,})_(\d+)', str(label))   # search (labels are prefixed 'dx:'/'<vtitle>:'); ts seconds are long
+    if not m:
+        m2 = re.search(r'(\d{6,})', str(label))
+        return float(m2.group(1)) if m2 else None
+    sec = int(m.group(1))                   # FULL seconds, matches the log's full timestamp
+    frac = float('0.' + m.group(2))
+    return sec + frac
+
+
+def load_offset(path):
+    import tifffile
+    with tifffile.TiffFile(path) as tf:
+        arr = tf.asarray()
+        ij = tf.imagej_metadata or {}
+        labels = ij.get('Labels')
+        print(f"  shape={arr.shape} dtype={arr.dtype}  ij(channels={ij.get('channels')}, "
+              f"slices={ij.get('slices')}, frames={ij.get('frames')})")
+        if labels:
+            print(f"  first labels: {labels[:6]}")
+    return arr, labels
+
+
+def to_channels_scenes(arr):
+    """Normalize the array to (3, n_scene, H, W) = (dx,dy,s channels, scenes).  # VERIFY axis order."""
+    a = np.asarray(arr)
+    if a.ndim == 4:
+        # find the axis of size 3 (channels dx,dy,s); the other non-spatial axis is scenes
+        ax3 = next((i for i, s in enumerate(a.shape[:-2]) if s == 3), 0)
+        a = np.moveaxis(a, ax3, 0)                      # -> (3, scenes, H, W)
+    elif a.ndim == 3:
+        # (3*scenes, H, W) flattened: ImageJ interleaves... assume channel-major blocks  # VERIFY
+        n = a.shape[0]
+        assert n % 3 == 0, f"page count {n} not divisible by 3"
+        a = a.reshape(3, n // 3, a.shape[1], a.shape[2])
+    else:
+        raise ValueError(f"unexpected ndim {a.ndim}")
+    return a  # (3, nsc, H, W) : [0]=dx [1]=dy [2]=s
+
+
+def load_hyper(path):
+    """Read -DNN-...-HYPER-RECT -> velocity fan (nvel, nsc, H, W); drops the leading MAX-over-v slice."""
+    import tifffile
+    with tifffile.TiffFile(path) as tf:
+        a = np.asarray(tf.asarray()); ij = tf.imagej_metadata or {}
+        labels = ij.get('Labels')
+    nvel = (2 * VEL_RADIUS + 1) ** 2
+    axv = next((i for i, s in enumerate(a.shape) if s in (nvel, nvel + 1)), 0)   # VERIFY velocity axis
+    a = np.moveaxis(a, axv, 0)
+    if a.shape[0] == nvel + 1:
+        a = a[1:]                        # drop the leading MAX-over-v slice
+    print(f"  hyper: fan shape {a.shape} (nvel,nsc,H,W)")
+    return a, labels
+
+
+def fan_vel(f):
+    """(argmax_vx, argmax_vy, centroid_vx, centroid_vy, s) from one pixel's fan, in px/level-frame."""
+    f = np.clip(np.nan_to_num(np.asarray(f, float)), 0, None)
+    tot = f.sum()
+    if tot <= 0:
+        return (np.nan, np.nan, np.nan, np.nan, 0.0)
+    n = 2 * VEL_RADIUS + 1
+    ix = np.arange(f.size)
+    vx = (ix % n - VEL_RADIUS) * VEL_STEP
+    vy = (ix // n - VEL_RADIUS) * VEL_STEP
+    k = int(np.argmax(f))
+    return (vx[k], vy[k], float((vx * f).sum() / tot), float((vy * f).sum() / tot), float(tot))
+
+
+def log_speed(t, dt):
+    """Log velocity (px per level-frame of dt seconds) at full-second t, from the smoothed-log slope."""
+    if _LOG['t'] is None or t is None:
+        return (np.nan, np.nan)
+    gx = np.gradient(_LOG['x'], _LOG['t']); gy = np.gradient(_LOG['y'], _LOG['t'])   # px/s
+    return (float(np.interp(t, _LOG['t'], gx) * dt), float(np.interp(t, _LOG['t'], gy) * dt))
+
+
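+def main():
+    # ROI_* are reassigned below; without `global` the default-ROI branch raises UnboundLocalError
+    global ROI_X0, ROI_Y0, ROI_W, ROI_H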
+    ap = argparse.ArgumentParser()
+    ap.add_argument('path', help='dir (uses newest *-DNN*-OFFSET.tiff) or a tiff file')
+    ap.add_argument('--sthr', type=float, default=0.3, help='detection threshold on s')
+    ap.add_argument('--plot', action='store_true')
+    args = ap.parse_args()
+
+    path = args.path
+    if not path.lower().endswith(('.tif', '.tiff')):
+        cands = sorted(glob.glob(f"{path}/*-DNN*OFFSET*.tif*"), key=lambda p: __import__('os').path.getmtime(p))
+        if not cands:
+            sys.exit(f"no *-DNN*-OFFSET*.tiff in {path}")
+        path = cands[-1]
+    print(f"file: {path}")
+    import os
+    # Run config from image metadata (all curt_*), preferring it; else the -ROIx_y_w_h filename tag; else default. By Claude on 06/16/2026
+    meta = read_meta(path)
+    if meta:
+        print("metadata: " + ", ".join(f"{k}={meta[k]}" for k in
+              ('curt_dnn_model','curt_dnn_patch','curt_dnn_vmax','curt_save_select') if k in meta))
+    nums = re.findall(r'-?\d+', meta.get('curt_save_select', '')) if meta else []
+    if len(nums) >= 4:
+        ROI_X0, ROI_Y0, ROI_W, ROI_H = map(int, nums[:4])
+        print(f"ROI from metadata: {ROI_X0},{ROI_Y0},{ROI_W},{ROI_H}")
+    else:
+        mroi = re.search(r'ROI(\d+)_(\d+)_(\d+)_(\d+)', os.path.basename(path))
+        if mroi:
+            ROI_X0, ROI_Y0, ROI_W, ROI_H = map(int, mroi.groups())
+            print(f"ROI from filename: {ROI_X0},{ROI_Y0},{ROI_W},{ROI_H}")
+        else:
+            print(f"ROI not in metadata/filename; using default {ROI_X0},{ROI_Y0},{ROI_W},{ROI_H}")
+    logp = os.path.join(os.path.dirname(os.path.abspath(path)), 'UAS_log.csv')
+    if not os.path.exists(logp):
+        sys.exit(f"no UAS_log.csv next to {path}")
+    load_log(logp)
+    arr, labels = load_offset(path)
+    chs = to_channels_scenes(arr)
+    # remote (DGX) -OFFSET is full-frame (5ch dx,dy,s,Vx,Vy at 512x640); local is ROI-sized (3ch).
+    # chs[0,1,2]=dx,dy,s for both; crop a full-frame array to the metadata ROI so the ROI-relative
+    # indexing below works for both. By Claude on 06/20/2026
+    if (chs.shape[-2] != ROI_H) or (chs.shape[-1] != ROI_W):
+        chs = chs[:, :, ROI_Y0:ROI_Y0 + ROI_H, ROI_X0:ROI_X0 + ROI_W]
+        print(f"  cropped full-frame -OFFSET to ROI -> {chs.shape}")
+    # channel order: 3-ch local -OFFSET = {dx,dy,s}; 5-ch remote -OFFSET = {s,Vx,Vy,dx,dy} (s-first). By Claude on 06/20/2026
+    if chs.shape[0] >= 5:
+        s, dx, dy = chs[0], chs[3], chs[4]        # each (nsc, H, W)
+    else:
+        dx, dy, s = chs[0], chs[1], chs[2]
+    nsc = s.shape[0]
+    ts = [parse_ts(labels[i]) if labels and i < len(labels) else None for i in range(nsc)]
+
+    print(f"\nposition: peak s within +/-{SEARCH_R}px of the log truth (excludes distant clutter)")
+    print(f"{'t':>14} {'truth(x,y)':>16} {'s@tgt':>7} {'det(x,y)':>16} {'err_px':>7}")
+    errs = []; det_pts = []
+    for i in range(nsc):
+        t = ts[i]; gt = truth_xy(t)
+        si = np.nan_to_num(s[i], nan=-1.0)
+        if gt is not None:
+            gr = int(round(gt[1] - ROI_Y0)); gc = int(round(gt[0] - ROI_X0))
+            r0, r1 = max(0, gr - SEARCH_R), min(si.shape[0], gr + SEARCH_R + 1)
+            c0, c1 = max(0, gc - SEARCH_R), min(si.shape[1], gc + SEARCH_R + 1)
+            if r1 <= r0 or c1 <= c0:
+                continue
+            lr, lc = np.unravel_index(int(np.argmax(si[r0:r1, c0:c1])), (r1 - r0, c1 - c0))
+            r, c = r0 + lr, c0 + lc
+        else:
+            r, c = np.unravel_index(int(np.argmax(si)), si.shape)
+        smax = float(s[i][r, c]) if np.isfinite(s[i][r, c]) else 0.0
+        det_x = ROI_X0 + c + (dx[i, r, c] if np.isfinite(dx[i, r, c]) else 0.0)
+        det_y = ROI_Y0 + r + (dy[i, r, c] if np.isfinite(dy[i, r, c]) else 0.0)
+        if gt is not None:
+            e = ((det_x - gt[0])**2 + (det_y - gt[1])**2) ** 0.5
+            if smax >= args.sthr:
+                errs.append(e); det_pts.append((t, det_x, det_y))
+            print(f"{t:14.3f} ({gt[0]:6.2f},{gt[1]:6.2f}) {smax:7.3f} ({det_x:6.2f},{det_y:6.2f}) {e:7.2f}")
+        else:
+            print(f"{(t or float('nan')):14.3f} {'(out)':>16} {smax:7.3f}")
+
+    if errs:
+        errs = np.array(errs)
+        print(f"\ndetected {len(errs)}/{nsc} scenes (s>= {args.sthr});  "
+              f"pos err  mean={errs.mean():.2f}  median={np.median(errs):.2f}  max={errs.max():.2f} px")
+
+    # time-offset / trend check: is the position error a velocity-dependent trend (-> log<->image clock
+    # offset, i.e. calibration) rather than random DNN error? Search dt that minimizes residual RMS.
+    if len(det_pts) >= 4:
+        tp = np.array([p[0] for p in det_pts]); dxp = np.array([p[1] for p in det_pts]); dyp = np.array([p[2] for p in det_pts])
+        ex = np.array([dxp[k] - truth_xy(tp[k])[0] for k in range(len(tp))])
+        ey = np.array([dyp[k] - truth_xy(tp[k])[1] for k in range(len(tp))])
+        rms0 = float(np.sqrt(np.mean(ex**2 + ey**2)))
+        sx = float(np.polyfit(tp, ex, 1)[0]); sy = float(np.polyfit(tp, ey, 1)[0])  # px/s trend
+        best = (0.0, rms0)
+        for dtc in np.arange(-0.5, 0.5001, 1.0/60):
+            gs = [truth_xy(t + dtc) for t in tp]
+            if any(g is None for g in gs):
+                continue
+            r = float(np.sqrt(np.mean([(dxp[k]-gs[k][0])**2 + (dyp[k]-gs[k][1])**2 for k in range(len(tp))])))
+            if r < best[1]:
+                best = (float(dtc), r)
+        print(f"\ntime-offset/trend check ({len(tp)} scenes, s>= {args.sthr}):")
+        print(f"  err_x: mean={ex.mean():+.2f}px trend={sx:+.2f}px/s | err_y: mean={ey.mean():+.2f}px trend={sy:+.2f}px/s")
+        print(f"  RMS@dt=0={rms0:.2f}px ; best dt={best[0]:+.3f}s -> RMS={best[1]:.2f}px "
+              f"({'time-offset likely' if best[1] < 0.7*rms0 else 'no strong time-offset -> mostly random/DNN'})")
+
+    # --- velocity from the -DNN-...-HYPER-RECT fan (argmax + centroid) vs the log slope ---
+    import os
+    hpath = path.replace('OFFSET', 'HYPER-RECT')   # match the offset file's level + synth/real
+    if os.path.exists(hpath):
+        fan, hlabels = load_hyper(hpath)
+        _, nschh, H, W = fan.shape   # fan = (nvel, nsc, H, W); scenes is axis 1 // fixed By Claude 06/20/2026
+        hts = [parse_ts(hlabels[i]) if hlabels and i < len(hlabels) else None for i in range(nschh)]
+        good = [t for t in hts if t is not None]
+        dt = float(np.median(np.diff(good))) if len(good) > 1 else (8.0 / 60.0)
+        print(f"\nvelocity from {os.path.basename(hpath)}  (frame dt={dt:.4f}s; px/level-frame)")
+        print(f"{'t':>14} {'argmax(vx,vy)':>16} {'centroid(vx,vy)':>18} {'log(vx,vy)':>16}")
+        for i in range(nschh):
+            smap = fan[:, i].reshape(fan.shape[0], H, W).sum(0)   # (H,W) detection map
+            gt = truth_xy(hts[i])
+            if gt is not None:
+                gr = int(round(gt[1] - ROI_Y0)); gc = int(round(gt[0] - ROI_X0))
+                r0, r1 = max(0, gr - SEARCH_R), min(H, gr + SEARCH_R + 1)
+                c0, c1 = max(0, gc - SEARCH_R), min(W, gc + SEARCH_R + 1)
+                if r1 <= r0 or c1 <= c0:
+                    continue
+                lr, lc = np.unravel_index(int(np.argmax(smap[r0:r1, c0:c1])), (r1 - r0, c1 - c0))
+                rr, cc = r0 + lr, c0 + lc
+            else:
+                rr, cc = np.unravel_index(int(np.argmax(smap)), smap.shape)
+            ax, ay, cx, cy, sval = fan_vel(fan[:, i, rr, cc])
+            lvx, lvy = log_speed(hts[i], dt)
+            tt = hts[i] if hts[i] else float('nan')
+            print(f"{tt:14.3f}   ({ax:+.2f},{ay:+.2f})    ({cx:+.2f},{cy:+.2f})    ({lvx:+.2f},{lvy:+.2f})")
+
+    if args.plot:
+        import matplotlib.pyplot as plt
+        tv = np.array([t if t else np.nan for t in ts])
+        sv = np.array([float(np.nanmax(s[i])) for i in range(nsc)])
+        plt.figure(); plt.plot(tv, sv, 'o-'); plt.axhline(args.sthr, ls='--', c='r')
+        plt.xlabel('t (s)'); plt.ylabel('max s'); plt.title('DNN visibility(t)'); plt.grid(True)
+        plt.show()
+
+
+if __name__ == '__main__':
+    main()
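Note on `fan_vel` above: the flat fan index is mapped to a velocity with x as the fastest axis, then reduced to an argmax and a weighted centroid. The block below is a minimal pure-Python sketch of that mapping under the same constants. The name `fan_velocity` and the two-cell test fan are illustrative assumptions, and it returns `None` for an empty fan where the source returns NaNs.

```python
VEL_RADIUS = 5     # fan is (2*VEL_RADIUS+1)^2 cells, as in the source
VEL_STEP = 0.25    # px/level-frame per grid cell, as in the source

def fan_velocity(fan):
    """Return ((argmax_vx, argmax_vy), (centroid_vx, centroid_vy)) for one pixel's fan."""
    n = 2 * VEL_RADIUS + 1
    # clip negatives and NaNs to zero so they carry no weight
    w = [float(v) if v == v and v > 0 else 0.0 for v in fan]
    total = sum(w)
    if total <= 0:
        return None
    vx = [(k % n - VEL_RADIUS) * VEL_STEP for k in range(len(w))]    # x varies fastest
    vy = [(k // n - VEL_RADIUS) * VEL_STEP for k in range(len(w))]
    k_max = max(range(len(w)), key=w.__getitem__)                    # first maximum wins
    cx = sum(a * b for a, b in zip(vx, w)) / total
    cy = sum(a * b for a, b in zip(vy, w)) / total
    return (vx[k_max], vy[k_max]), (cx, cy)

# Hypothetical fan: equal mass at vx = +0.5 and vx = +1.0 on the vy = 0 row
n = 2 * VEL_RADIUS + 1
fan = [0.0] * (n * n)
fan[5 * n + 7] = 1.0
fan[5 * n + 9] = 1.0
argmax_v, centroid_v = fan_velocity(fan)
print(argmax_v, centroid_v)
```

With two equal peaks the argmax reports the first one while the centroid lands between them, which is why the script prints both columns.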
--- a/dense_check.py
+++ b/dense_check.py
+#!/usr/bin/env python3
+# dense_check.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Validate full-res shift-and-stitch == per-pixel patch inference. By Claude on 06/20/2026.
+Run inside the NGC container (needs model.pt + model.py).
+
+The stride-4 RawFCN, run densely on a whole frame, emits a 1/4-res grid. To recover the
+full-res per-input-pixel field (what the Java CPU inferROI produces, CuasDnnInfer.java:111),
+we run the dense net on the S*S=16 (S=4) sub-pixel shifts of the zero-padded frame and
+interleave the outputs. This checks that the reconstruction matches the per-pixel reference
+(max |diff| < 1e-4; the fp64 pass isolates geometry from rounding) and pins the
+padding/offset alignment before it goes into the server + Java."""
+import sys
+import torch
+import torch.nn.functional as F
+from model import RawFCN
+
+
+def load(run_dir):
+    ck = torch.load(run_dir + "/model.pt", map_location="cpu", weights_only=False)
+    a = ck.get("args", {}) or {}
+    m = RawFCN(n_frames=a.get("nframes", 8), vel_radius=a.get("vel_radius", 5),
+               patch=a.get("patch", 24), velocity_mode=a.get("velocity_mode", "grid"),
+               vmax=a.get("vmax", 1.4))
+    m.load_state_dict(ck["model"])
+    m.eval().cuda()
+    return m, a.get("nframes", 8), a.get("patch", 24)
+
+
+@torch.no_grad()
+def per_pixel(m, x, P, region):
+    # Reference (mirrors CPU inferROI): for each pixel in region, run the net on its P×P
+    # patch (zero-padded at borders), keep the channel vector. All patches batched -> 1 forward.
+    # x: [N, H, W] on GPU.   region = (y0, x0, h, w).
+    N, H, W = x.shape
+    half = P // 2
+    y0, x0, h, w = region
+    xp = F.pad(x, (half, half, half, half))               # [N, H+P, W+P]  zero border (== CPU fill)
+    patches = torch.empty(h * w, N, P, P, device=x.device, dtype=x.dtype)  # [npix, N, P, P]
+    for q in range(h * w):
+        cy, cx = y0 + q // w, x0 + q % w                  # pixel center in ORIGINAL coords
+        patches[q] = xp[:, cy:cy + P, cx:cx + P]          # xp[cy:cy+P] == original rows [cy-half, cy+half)
+    out = m(patches)                                      # [npix, C, 1, 1]
+    return out[:, :, 0, 0].reshape(h, w, -1)              # [h, w, C]
+
+
+@torch.no_grad()
+def shift_stitch(m, x, P, S=4):
+    # Pad by half so a valid dense pass aligns output cell (oi,oj) to input pixel (S*oi, S*oj).
+    # Then 16 shifts (sy,sx in 0..S-1) interleave into the full-res [C,H,W] field.
+    N, H, W = x.shape
+    half = P // 2
+    xp = F.pad(x, (half, half, half, half))               # [N, H+P, W+P]
+    C = m.out_ch
+    full = torch.zeros(C, H, W, device=x.device, dtype=x.dtype)  # [C, H, W] full-res field
+    for sy in range(S):
+        for sx in range(S):
+            xs = xp[:, sy:, sx:]                          # shift the padded frame by (sy,sx)
+            y = m(xs[None])[0]                            # [C, oH, oW]  1/4-res dense grid
+            oh = min(y.shape[1], (H - sy + S - 1) // S)   # cells that land inside [0,H)/[0,W)
+            ow = min(y.shape[2], (W - sx + S - 1) // S)
+            full[:, sy::S, sx::S] = y[:, :oh, :ow]        # interleave at stride S
+    return full.permute(1, 2, 0)                          # [H, W, C]
+
+
+def compare(m, P, N, tag):
+    # one alignment check under the current backend/precision settings
+    torch.manual_seed(0)
+    H, W = 40, 48
+    dt = next(m.parameters()).dtype                       # match model precision (fp32 / fp64)
+    x = torch.randn(N, H, W, dtype=dt).cuda()             # [N, H, W] random conditioned stack
+    region = (P // 2, P // 2, 8, 8)                       # interior region (y0,x0,h,w)
+    ref = per_pixel(m, x, P, region)                      # [8, 8, C]
+    full = shift_stitch(m, x, P)                          # [H, W, C]
+    y0, x0, h, w = region
+    sub = full[y0:y0 + h, x0:x0 + w]                      # [8, 8, C]
+    d = (ref - sub).abs().max().item()
+    rng = ref.abs().max().item()
+    print(f"  [{tag}] max|diff|={d:.3e}  out~{rng:.2f}  rel={d/max(rng,1e-12):.1e}  "
+          f"{'MATCH' if d < 1e-4 else 'mismatch'}")
+
+
+if __name__ == "__main__":
+    run = sys.argv[1] if len(sys.argv) > 1 else "runs/weighted9_pm_s"
+    m, N, P = load(run)
+    # 1) default: cuDNN on, fp32 -> small-patch and dense convs may pick different algos
+    torch.backends.cudnn.enabled = True
+    compare(m, P, N, "cuDNN fp32")
+    # 2) cuDNN OFF: both paths use the same aten kernels -> isolates algorithm-choice noise
+    torch.backends.cudnn.enabled = False
+    compare(m, P, N, "cuDNN OFF fp32")
+    # 3) fp64: near-exact arithmetic -> proves the ALIGNMENT (geometry) is right
+    torch.backends.cudnn.enabled = True
+    compare(m.double(), P, N, "fp64")
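Note on `shift_stitch` above: a stride-S network run on the S*S shifted copies of the padded frame yields every per-pixel output exactly once. The block below is a toy pure-Python sketch of that equivalence. A P x P box sum stands in for `RawFCN`, and `P`, `S`, `H`, `W` are small illustrative values, not the source's 24 and 4.

```python
import random

P, S = 4, 2          # toy receptive field and stride (assumed values)
H, W = 7, 9          # toy frame size

def pad(img, half):
    """Zero-pad a 2-D list by `half` on every side."""
    width = len(img[0]) + 2 * half
    border = [[0.0] * width for _ in range(half)]
    body = [[0.0] * half + row + [0.0] * half for row in img]
    return border + body + [[0.0] * width for _ in range(half)]

def window_sum(xp, y, x):
    return sum(xp[y + dy][x + dx] for dy in range(P) for dx in range(P))

def strided_net(xp):
    """Stand-in for the stride-S FCN: one output per S-th valid window position."""
    hp, wp = len(xp), len(xp[0])
    return [[window_sum(xp, y, x) for x in range(0, wp - P + 1, S)]
            for y in range(0, hp - P + 1, S)]

def shift_stitch(img):
    xp = pad(img, P // 2)
    full = [[None] * W for _ in range(H)]
    for sy in range(S):
        for sx in range(S):
            shifted = [row[sx:] for row in xp[sy:]]      # shift the padded frame
            for oi, row in enumerate(strided_net(shifted)):
                for oj, v in enumerate(row):
                    y, x = sy + S * oi, sx + S * oj      # interleave at stride S
                    if y < H and x < W:
                        full[y][x] = v
    return full

def per_pixel(img):
    """Reference: evaluate the window at every pixel of the zero-padded frame."""
    xp = pad(img, P // 2)
    return [[window_sum(xp, y, x) for x in range(W)] for y in range(H)]

random.seed(0)
img = [[random.random() for _ in range(W)] for _ in range(H)]
stitched = shift_stitch(img)
reference = per_pixel(img)
max_diff = max(abs(a - b) for ra, rb in zip(stitched, reference) for a, b in zip(ra, rb))
print("max |diff| =", max_diff)
```

The toy matches exactly because both paths add the same numbers in the same order. A real network on GPU may pick different kernels for the two paths, which is what the cuDNN on/off and fp64 passes in the script separate out.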
--- a/diag_clean.py
+++ b/diag_clean.py
+# diag_clean.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Why is learned S jittery on clean input? Compare learned S vs EXACT MF path-sum. By Claude 06/18/2026
+Usage: python diag_clean.py [model.pt]  (default /work/runs/stage1_mfs/model.pt)"""
+import sys, numpy as np, torch
+import synth, stage2 as S2
+from model import RawFCN
+dev = "cuda" if torch.cuda.is_available() else "cpu"
+N, vmax, P = 9, 2.8, 52; half, Nm1, HW = P // 2, N - 1, 140
+ckpt = sys.argv[1] if len(sys.argv) > 1 else "/work/runs/stage1_mfs/model.pt"
+print("model:", ckpt)
+s1 = RawFCN(n_frames=N, patch=P, velocity_mode="reg", vmax=vmax).to(dev)
+s1.load_state_dict(torch.load(ckpt, map_location=dev)["model"]); s1.eval()
+
+def render(V, amp, noise, seed=0):
+    rng = np.random.default_rng(seed)
+    fr = rng.standard_normal((N, HW, HW)).astype(np.float32) if noise else np.zeros((N, HW, HW), np.float32)
+    Hx = HW / 2 + V * Nm1 * 0.5; Hy = HW / 2; nb = 4; subs = np.arange(nb) * (1.0 / nb)
+    for i in range(N):
+        acc = np.zeros((HW, HW))
+        for ss in subs: acc += synth.halfcos_bump(Hx - V * (i + ss), Hy, HW, HW)
+        fr[i] += (amp * acc / nb).astype(np.float32)
+    return fr, Hx, Hy
+
+def rough(a): return float(np.std(np.diff(a)))   # pixel-to-pixel roughness (0 = perfectly smooth)
+
+for noise, lbl in [(False, "CLEAN (no noise)"), (True, "NOISY amp=5")]:
+    fr, Hx, Hy = render(1.0, 5, noise)
+    s_t, vx_t, vy_t = S2.stage1_dense(s1, fr, dev=dev, mf_s=True)
+    mf = S2.mf_sum(torch.from_numpy(fr).to(dev), vx_t, vy_t, half, N)   # EXACT MF along the net's own velocity
+    s = s_t.cpu().numpy(); vx = vx_t.cpu().numpy(); mfn = mf.cpu().numpy()
+    yf = int(round(Hy - half)); hxf = Hx - half; xs = np.arange(int(hxf) - 14, int(hxf) + 15)
+    print(f"\n===== {lbl} =====")
+    print("  x-off |  S_learned   Vx_learned   S_exactMF")
+    for x in xs[::2]:
+        print("   %+4d  |  %8.2f    %+6.3f     %8.2f" % (x - hxf, s[yf, x], vx[yf, x], mfn[yf, x]))
+    print("  roughness(diff-std):  S_learned=%.3f   S_exactMF=%.3f   Vx_learned=%.4f"
+          % (rough(s[yf, xs]), rough(mfn[yf, xs]), rough(vx[yf, xs])))
--- a/eval_mfs.py
+++ b/eval_mfs.py
+# eval_mfs.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Eval the MF-S v2 pipeline (Stage-1 native-S weight + retrained refine). By Claude on 06/18/2026"""
+import numpy as np, torch
+import stage2 as S2
+from model import RawFCN
+dev="cuda" if torch.cuda.is_available() else "cpu"; N,vmax,HW=9,2.8,140; Nm1=N-1; half=26
+s1=RawFCN(n_frames=N,patch=52,velocity_mode="reg",vmax=vmax).to(dev)
+s1.load_state_dict(torch.load("/work/runs/stage1_mfs/model.pt",map_location=dev)["model"]); s1.eval()
+net=S2.VoteRefine().to(dev); net.load_state_dict(torch.load("/work/runs/stage2_mfs/model.pt",map_location=dev)["model"]); net.eval()
+def peaks(p,th,r=2):
+    out=[]; H_,W_=p.shape
+    for y in range(r,H_-r):
+        for x in range(r,W_-r):
+            if p[y,x]>th and p[y,x]>=p[y-r:y+r+1,x-r:x+r+1].max()-1e-6: out.append((x,y,p[y,x]))
+    return out
+for TH in (0.5,0.7):
+    ndet=0;ntot=0;errs=[];gh=[];rng=np.random.default_rng(123)
+    for t in range(15):
+        fr,tg=S2.gen_field(rng,HW,4,N,vmax,[3.0,8.0]); s,vx,vy=S2.stage1_dense(s1,fr,dev=dev,mf_s=True)
+        aS,aVx,aVy=S2.vote_scatter(s,vx,vy,Nm1); nrm=aS.max().clamp(min=1e-6); aS=aS/nrm;aVx=aVx/nrm;aVy=aVy/nrm
+        with torch.no_grad(): p=torch.sigmoid(net(aS,aVx,aVy)[0]).cpu().numpy()
+        F_=p.shape[0]; pk=peaks(p,TH)
+        tt=[((hx-tvx*Nm1)-half,(hy-tvy*Nm1)-half) for hx,hy,tvx,tvy in tg]; tt=[(x,y) for x,y in tt if 0<=x<F_ and 0<=y<F_]
+        for tx,ty in tt:
+            ntot+=1; near=[np.hypot(px-tx,py-ty) for px,py,pv in pk if np.hypot(px-tx,py-ty)<8]
+            if near: ndet+=1; errs.append(min(near))
+        for px,py,pv in pk:
+            if all(np.hypot(px-tx,py-ty)>=8 for tx,ty in tt): gh.append(pv)
+    print("th=%.2f: det %d/%d (%.0f%%) locerr %.2f | TRUE ghosts(>8px) %d max %.3f"%(TH,ndet,ntot,100*ndet/ntot,np.median(errs) if errs else -1,len(gh),max(gh) if gh else 0))
--- a/export_onnx.py
+++ b/export_onnx.py
+# export_onnx.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Export an existing model.pt checkpoint to ONNX (no retrain). # By Claude on 06/13/2026
+Usage:  python export_onnx.py /work/runs/weighted/model.pt"""
+import sys, torch
+from model import RawFCN
+
+ckpt = sys.argv[1]
+ck = torch.load(ckpt, map_location="cpu")
+a = ck["args"]
+N = a.get("nframes", 8); P = a.get("patch", 24); vr = a.get("vel_radius", 5)
+m = RawFCN(n_frames=N, vel_radius=vr)
+m.load_state_dict(ck["model"]); m.eval()
+onnx_path = ckpt[:-3] + ".onnx" if ckpt.endswith(".pt") else ckpt + ".onnx"
+dummy = torch.zeros(1, N, P, P)
+torch.onnx.export(m, dummy, onnx_path,
+    input_names=["frames"], output_names=["out"],
+    dynamic_axes={"frames": {0: "B", 2: "H", 3: "W"}, "out": {0: "B", 2: "Hout", 3: "Wout"}},
+    opset_version=17)
+print(f"exported {onnx_path}  (frames[B,{N},H,W] -> out[B,{m.out_ch},Hout,Wout])")
--- a/export_refine.py
+++ b/export_refine.py
+# export_refine.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Export a trained Stage-2 VoteRefine checkpoint to a self-contained ONNX. By Claude on 06/18/2026
+
+CuasStage2Infer feeds a single [1,3,H,W] tensor (accS, accVx, accVy stacked) and reads [1,3,H,W]
+(det logit + Vx,Vy). So we export the inner conv stack (VoteRefine.net), which is exactly that
+map. dynamo=False keeps weights INLINE (no external .onnx.data) - Java's ORT load wants one file.
+Usage:  python export_refine.py /work/runs/stage2_mfs/model.pt
+"""
+import sys, torch
+from stage2 import VoteRefine
+
+ckpt = sys.argv[1]
+ck = torch.load(ckpt, map_location="cpu")
+net = VoteRefine()
+net.load_state_dict(ck["model"]); net.eval()
+onnx_path = ckpt[:-3] + ".onnx" if ckpt.endswith(".pt") else ckpt + ".onnx"
+dummy = torch.zeros(1, 3, 64, 64)        # [B,3,H,W]; H,W dynamic below
+torch.onnx.export(net.net, dummy, onnx_path,
+    input_names=["acc"], output_names=["out"],
+    dynamic_axes={"acc": {0: "B", 2: "H", 3: "W"}, "out": {0: "B", 2: "H", 3: "W"}},
+    opset_version=17, dynamo=False)
+print(f"exported {onnx_path}  (acc[B,3,H,W] -> out[B,3,H,W])")
--- a/extract_B.py
+++ b/extract_B.py
+#!/usr/bin/env python3
+# extract_B.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+# By Claude on 06/12/2026. Extract the measured 4D ambiguity function B from a
+# synthetic-run -C5P-RECT (or -RECT) rendering, verify it against ground truth,
+# and solve the regularized 4D deconvolution D (Wiener) with a
+# condensation-vs-noise-gain sweep.
+#
+# RECT layout: ROI pixel (px,py) -> 12x12 block at [1+py*12 : 12+py*12,
+# 1+px*12 : 12+px*12] (11x11 velocity cells + 1px NaN gap grid).
+
+import argparse, json
+import numpy as np
+import tifffile
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument('--rect', required=True, help='-RECT.tiff rendering (synthetic run)')
+    ap.add_argument('--synth', required=True, help='the -CUAS-SYNTHETIC-CUAS.tiff (for frame index mapping)')
+    ap.add_argument('--gt', required=True, help='ground truth json')
+    ap.add_argument('--roi', default='160,96,320,320', help='SYNTH_ROI x,y,w,h')
+    ap.add_argument('--srad', type=int, default=8, help='spatial PSF half-extent, px')
+    ap.add_argument('--slice', type=int, default=-1, help='RECT slice (0-based); -1 = auto: newest frame t%%4==0, mid-file')
+    args = ap.parse_args()
+
+    with open(args.gt) as f:
+        gt = json.load(f)
+    vrad = gt['vel_radius']; vdim = 2*vrad + 1
+    rx, ry, rw, rh = [int(v) for v in args.roi.split(',')]
+
+    with tifffile.TiffFile(args.synth) as tf:
+        synth_labels = tf.imagej_metadata['Labels'][1:]  # frame t = index
+    with tifffile.TiffFile(args.rect) as tf:
+        rect_labels = tf.imagej_metadata['Labels']
+        nsl = len(tf.pages)
+        # map each rect slice to frame index of its newest data
+        frames = [synth_labels.index(l) for l in rect_labels]
+        sl = args.slice
+        if sl < 0:  # newest frame on the 4-frame phase grid (all targets pixel-centered)
+            cands = [i for i, t in enumerate(frames) if t % 4 == 0 and 20 <= t]
+            sl = cands[len(cands)//2] if cands else nsl//2
+        t = frames[sl]
+        img = tf.asarray(key=sl)
+    print('rect slices %d, using slice %d (label %s, frame t=%d)' % (nsl, sl, rect_labels[sl], t))
+
+    srad = args.srad
+    sdim = 2*srad + 1
+    psfs = {}
+    for tg in gt['targets']:
+        vx_c, vy_c = tg['v_cells']
+        vx, vy = tg['v_pix_per_frame']
+        x = tg['node'][0] + vx*t
+        y = tg['node'][1] + vy*t
+        px = int(np.floor(x)) - rx
+        py = int(np.floor(y)) - ry
+        if not (srad <= px < rw-srad and srad <= py < rh-srad):
+            continue
+        # extract [sdim, sdim, vdim, vdim] patch (spatial dy, dx, vel vy, vx)
+        P = np.zeros((sdim, sdim, vdim, vdim), np.float32)
+        for dy in range(-srad, srad+1):
+            for dx in range(-srad, srad+1):
+                bx = px+dx; by = py+dy
+                P[dy+srad, dx+srad] = img[by*12:by*12+11, bx*12:bx*12+11]
+        psfs[(vx_c, vy_c)] = P
+    print('extracted %d target PSFs' % len(psfs))
+
+    # --- sanity: blob center at (dp=0, v=v_true)? and +-v symmetry of the static one
+    P0 = psfs.get((0, 0))
+    if P0 is not None:
+        c = P0[srad, srad]
+        iy, ix = np.unravel_index(np.argmax(c), c.shape)
+        print('static target: center-pixel argmax at vel (%+d,%+d) (expect 0,0), peak %.2f'
+              % (ix-vrad, iy-vrad, c[iy, ix]))
+        row = c[vrad]
+        asym = np.abs(row - row[::-1]).max() / row.max()
+        print('static target vx row:', np.round(row, 1))
+        print('  max +-vx asymmetry: %.3f (relative)' % asym)
+
+    # --- shift-invariance in velocity: align central targets' PSFs and compare
+    keys = [k for k in psfs if max(abs(k[0]), abs(k[1])) <= 2]
+    aligned = []
+    for (i, j) in keys:
+        P = psfs[(i, j)]
+        A = np.roll(np.roll(P, -j, axis=2), -i, axis=3)  # shift v_true to center
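+        # mask cells rolled across the velocity border (rolling an all-True array is a no-op)
+        M = np.ones((vdim, vdim), bool)
+        if j > 0: M[-j:, :] = False      # rows that wrapped around from the low-vy edge
+        elif j < 0: M[:-j, :] = False    # rows that wrapped around from the high-vy edge
+        if i > 0: M[:, -i:] = False      # columns that wrapped around from the low-vx edge
+        elif i < 0: M[:, :-i] = False    # columns that wrapped around from the high-vx edge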
+        aligned.append((A, M))
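+    if not aligned:
+        raise SystemExit('no central targets (|v_cell| <= 2) inside the ROI; check --roi and --gt')
+    ref = aligned[len(aligned)//2][0]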
+    devs = []
+    for A, M in aligned:
+        d = (np.abs(A - ref)[:, :, M]).max() / ref.max()
+        devs.append(d)
+    print('velocity shift-invariance over %d central targets: max rel deviation %.3f'
+          % (len(keys), max(devs)))
+
+    # --- average PSF (central targets) = B; Wiener D sweep
+    B = np.mean([A for A, _ in aligned], axis=0).astype(np.float64)
+    B /= B.max()
+    # desired G: separable half-cosine, +-1 cell in all four dims
+    g1 = np.array([0.5, 1.0, 0.5])
+    G = np.zeros_like(B)
+    cs, cv = srad, vrad
+    for a in range(-1, 2):
+        for b in range(-1, 2):
+            for cc in range(-1, 2):
+                for d in range(-1, 2):
+                    G[cs+a, cs+b, cv+cc, cv+d] = g1[a+1]*g1[b+1]*g1[cc+1]*g1[d+1]
+    Bf = np.fft.fftn(np.fft.ifftshift(B))
+    Gf = np.fft.fftn(np.fft.ifftshift(G))
+    print('\nlambda    resid     noise_gain   out_extent(cells>10%)')
+    for lam in (1e-4, 1e-3, 1e-2, 0.03, 0.1, 0.3):
+        Df = np.conj(Bf)*Gf / (np.abs(Bf)**2 + lam)
+        D = np.real(np.fft.fftshift(np.fft.ifftn(Df)))
+        out = np.real(np.fft.fftshift(np.fft.ifftn(Df*Bf)))
+        resid = np.sqrt(((out-G)**2).sum()/ (G**2).sum())
+        ngain = np.sqrt((D**2).sum())
+        extent = int((out > 0.1*out.max()).sum())
+        print('%7.0e  %7.3f   %8.3f      %d (G itself: %d)'
+              % (lam, resid, ngain, extent, int((G > 0.1).sum())))
+    np.save(args.rect + '.B.npy', B)
+    print('\nsaved averaged B to', args.rect + '.B.npy')
+
+if __name__ == '__main__':
+    main()
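The lambda sweep at the end of extract_B.py can be illustrated in one dimension. The sketch below is not part of this commit: the Gaussian blur `B`, the grid size, and the helper name `wiener_1d` are assumptions made for illustration, while the formula for `Df` and the `resid` / `ngain` definitions follow the script. It shows the trade-off the sweep prints: a smaller lambda lowers the residual against the desired half-cosine `G` but raises the noise gain.

```python
import numpy as np

def wiener_1d(B, G, lam):
    """Regularized (Wiener-style) kernel D with D (*) B ~ G, circular convolution."""
    Bf = np.fft.fft(np.fft.ifftshift(B))      # move the centered peak to index 0
    Gf = np.fft.fft(np.fft.ifftshift(G))
    Df = np.conj(Bf) * Gf / (np.abs(Bf) ** 2 + lam)
    D = np.real(np.fft.fftshift(np.fft.ifft(Df)))
    out = np.real(np.fft.fftshift(np.fft.ifft(Df * Bf)))   # what D does to B
    resid = np.sqrt(((out - G) ** 2).sum() / (G ** 2).sum())
    ngain = np.sqrt((D ** 2).sum())           # white-noise amplification
    return D, resid, ngain

n = 33
x = np.arange(n) - n // 2
B = np.exp(-x ** 2 / (2 * 2.0 ** 2))          # hypothetical blur, peak-normalized
G = np.zeros(n)
G[n // 2 - 1:n // 2 + 2] = [0.5, 1.0, 0.5]    # desired response: half-cosine, +-1 cell

results = [(lam,) + wiener_1d(B, G, lam)[1:] for lam in (1e-4, 1e-2, 0.3)]
for lam, resid, ngain in results:
    print('%7.0e  resid %.3f  noise_gain %.3f' % (lam, resid, ngain))
```

The 4D case in the script is the same computation with `fftn` / `ifftn` over (dy, dx, vy, vx).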
--- a/gap_eval.py
+++ b/gap_eval.py
+# gap_eval.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Gap-L2 eval on HELD-OUT modulated synthetic: sharpness + coast + hallucination. By Claude 06/23/2026"""
+import argparse, numpy as np, torch, synth, layer2_data as L1D
+from layer2 import Layer2Net
+
+def fwhm_at(field,cx,cy,G):
+    f=np.roll(np.roll(field,int(round(G//2-cy)),0),int(round(G//2-cx)),1); c=G//2
+    def w(l):
+        pk=float(l[c])
+        if pk<=1e-6: return np.nan
+        h=pk/2; i=c
+        while i<len(l)-1 and l[i]>=h: i+=1
+        r=(i-1)+(l[i-1]-h)/(l[i-1]-l[i]) if l[i-1]>l[i] else float(i-1)
+        i=c
+        while i>0 and l[i]>=h: i-=1
+        L=(i+1)-(l[i+1]-h)/(l[i+1]-l[i]) if l[i+1]>l[i] else float(i+1)
+        return r-L
+    return np.nanmean([w(f[c,:]),w(f[:,c])])
+
+ap=argparse.ArgumentParser()
+ap.add_argument("--l1",default="runs/weighted9_pm/model.pt"); ap.add_argument("--model",required=True)
+ap.add_argument("--out",required=True); ap.add_argument("--T",type=int,default=128); ap.add_argument("--G",type=int,default=32)
+ap.add_argument("--seeds",default="3000,3001,3002,3003,3004,3005")
+ap.add_argument("--train_n",type=int,default=0,help="if >0: reproduce first N TRAINING sequences (one rng seed 0, sequential, == build_cache) instead of --seeds")
+a=ap.parse_args(); import os; os.makedirs(a.out,exist_ok=True)
+dev="cuda" if torch.cuda.is_available() else "cpu"
+net1,N,_=L1D._load_l1(a.l1,dev)
+ck=torch.load(a.model,map_location=dev); ar=ck["args"]
+net=Layer2Net(ch_in=3,ch_hidden=ar["ch"],grid=a.G,vmax=ar["vmax"]).to(dev); net.load_state_dict(ck["model"]); net.eval()
+G=a.G; seeds=[int(s) for s in a.seeds.split(",")]
+# train_n mode: ONE rng(0) generating N sequences sequentially == build_cache training data. By Claude 06/23
+train_rng = np.random.default_rng(0) if a.train_n>0 else None
+items = list(range(a.train_n)) if a.train_n>0 else seeds
+INs,L1s,L2s,TRs,SGs=[],[],[],[],[]
+fl1,fl2,s_bright,s_gap,fp_dark=[],[],[],[],[]
+for sd in items:
+    rng = train_rng if a.train_n>0 else np.random.default_rng(sd)
+    fr,pos,vel,pres,sig=L1D.render_run(rng,T=a.T,G=G,vmax=1.4,snr=6.0,gaps=True,bp_lo=6,bp_hi=18,duty_offset=0.2,starter_len=8,return_signal=True)
+    seq=L1D.gen_field_sequence(net1,fr,pos,G,N,dev)
+    with torch.no_grad():
+        det,_=net(torch.from_numpy(seq[None]).to(dev)); l2=torch.sigmoid(det)[0,:,0].cpu().numpy()
+    l1s=seq[:,0]
+    tr=np.zeros((a.T,G,G),np.float32)
+    for t in range(a.T):
+        if pres[t]: tr[t]=L1D.halfcos_bump_torus(pos[t,0],pos[t,1],G)
+    INs.append(fr);L1s.append(l1s);L2s.append(l2);TRs.append(tr);SGs.append(np.broadcast_to(sig[:,None,None],(a.T,G,G)).astype(np.float32))
+    for t in range(a.T):
+        if not pres[t]: continue
+        cx,cy=pos[t]; ci,cj=int(round(cy))%G,int(round(cx))%G
+        l1v=l1s[t,ci,cj]; l2v=float(l2[t,max(0,ci-1):ci+2,max(0,cj-1):cj+2].max())
+        if sig[t]>0.5:   # BRIGHT frame
+            s_bright.append(l2v)
+            if l1v>0.5: fl1.append(fwhm_at(l1s[t],cx,cy,G))
+            if l2[t].max()>0.5: fl2.append(fwhm_at(l2[t],cx,cy,G))
+        else:            # GAP frame (present, L1 starved)
+            s_gap.append(l2v)
+            m=np.ones((G,G),bool); yy,xx=np.ogrid[:G,:G]
+            m[(((xx-cj+G/2)%G-G/2)**2+((yy-ci+G/2)%G-G/2)**2)<=9]=False
+            fp_dark.append(float(l2[t][m].max()))
+def cat(L): return np.concatenate(L,0)
+for nm,st in [("input",INs),("L1_s",L1s),("L2_det",L2s),("truth",TRs),("signal",SGs)]:
+    synth.save_tiff_stack(cat(st).astype(np.float32),f"{a.out}/{nm}.tif")
+_src = f"TRAINING seqs 0..{a.train_n-1}" if a.train_n>0 else f"held-out seeds {seeds}"
+print(f"=== GAP-L2 eval ({a.model}) {_src} ===")
+print(f"FWHM @ target (bright):  L1 ~ {np.nanmean(fl1):.2f}   L2 ~ {np.nanmean(fl2):.2f} px")
+print(f"COAST  L2 s@truth:  BRIGHT frames ~ {np.mean(s_bright):.2f}   GAP frames ~ {np.mean(s_gap):.2f}  (want gap HIGH = coasting)")
+print(f"HALLUCINATION  L2 max away-from-target on GAP frames: mean {np.mean(fp_dark):.2f}  max {np.max(fp_dark):.2f}")
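The hallucination metric in gap_eval.py excludes a disk around the target using wrap-around (torus) distance. The sketch below is not part of this commit. The helper name `torus_exclusion_mask` is hypothetical, and the expression is the one used for `m` above, with the squared radius 9 written as `radius ** 2`. It shows that a target near a corner masks out pixels on the opposite edges as well.

```python
import numpy as np

def torus_exclusion_mask(G, ci, cj, radius):
    """True where a pixel is farther than `radius` from (row ci, col cj) on a GxG torus."""
    yy, xx = np.ogrid[:G, :G]
    dx = (xx - cj + G / 2) % G - G / 2   # signed wrap-around offset in [-G/2, G/2)
    dy = (yy - ci + G / 2) % G - G / 2
    return (dx ** 2 + dy ** 2) > radius ** 2

m = torus_exclusion_mask(32, 0, 31, 3)   # target in the top-right corner
print(m[0, 0], m[31, 31], m[16, 16], int((~m).sum()))
```

Note that the 3x3 window used for `l2v` in the same script clips at the image border instead of wrapping.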
--- a/gen_synth_cuas.py
+++ b/gen_synth_cuas.py
+#!/usr/bin/env python3
+# gen_synth_cuas.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+# By Claude on 06/11/2026, design by Andrey Filippov (2026-06-11).
+#
+# Generate a synthetic *-CUAS-MERGED-CUAS.tiff for the 4D-deconvolution experiment:
+# ideal point targets ("straight line segments" in x,y,t) on a grid, one fine
+# velocity per grid cell, structurally identical to a real merged file (same
+# 640x512 float32, same slice count, timestamps borrowed verbatim from the real
+# file) so it runs through CuasDetectRT unchanged, with the same pyramid levels.
+#
+# Layout (--layout tiled; the default --layout radial is described in its help text):
+# - grid step (40 px recommended, human-friendly; the CLI default of 30 is for radial):
+#   each cell holds ONE target so no other target enters the conv5d kernel attention
+#   area (direct-kernel spatial reach is 8 px at VR5/VD4/NH6; travel budget per
+#   segment = step/2 - reach).
+#   Step 30 would leave only a 7 px budget -> the fastest velocity (1.25 px/frame
+#   per axis) could not complete one clean 6-frame window; step 40 gives 12 px
+#   -> 9 frames, clean windows for every velocity.
+# - velocity per cell: vx_cell = (col % 11) - 5, vy_cell = (row % 11) - 5
+#   (all 121 fine velocities in the top-left 11x11 block, repeating beyond).
+#   Pixel velocity = cell / vel_decimate px/frame.
+# - motion: position = node + v * (frame - segment_start); each target restarts
+#   (jumps back to its node) when its per-axis travel would exceed the budget.
+#   Restart frames are recorded in the ground-truth JSON - analysis can select
+#   windows that do not span a restart.
+# - target rendering: sum-normalized spot at the fractional position (half-cosine
+#   bump by default, or Gaussian with sigma ~0.7 px via --shape gaussian) -
+#   constant total flux at any sub-pixel phase.
+# - slice 1 is "average" (temporal mean; skipped by the importer, kept for
+#   human browsing); the rest reuse the real file's timestamp labels.
+#
+# Usage:
+#   python gen_synth_cuas.py --ref <real-CUAS-MERGED-CUAS.tiff> --out <out.tiff>
+# Ground truth is written next to the output as <out>.groundtruth.json
+
+import argparse, json, os
+import numpy as np
+import tifffile
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument('--ref', required=True, help='real *-CUAS-MERGED-CUAS.tiff (labels/shape source)')
+    ap.add_argument('--out', required=True, help='output synthetic tiff path')
+    ap.add_argument('--layout', choices=['radial','tiled'], default='radial',
+                    help='radial (default): one target per velocity cell at center + step*(vx,vy), velocities point outward from image center - distances only grow, no restarts, full-duration tracks. tiled: velocity by (col%%11,row%%11) over the whole frame with per-target restart jumps.')
+    ap.add_argument('--step', type=int, default=30, help='grid step, px (default 30 radial / use 40 for tiled)')
+    ap.add_argument('--center-x', type=float, default=None, help='radial velocity-0 node X (default image center). Set to a ROI corner so the full +velocity sweep (0..+vmax) lands inside the ROI instead of radiating out of it. By Claude 06/19/2026')
+    ap.add_argument('--center-y', type=float, default=None, help='radial velocity-0 node Y (default image center)')
+    ap.add_argument('--vel-radius', type=int, default=5, help='fine velocity radius, cells (VR)')
+    ap.add_argument('--vel-decimate', type=int, default=4, help='velocity decimation (VD): px/frame = cell/VD')
+    ap.add_argument('--vel-step', type=float, default=None, help='explicit target velocity step, px/frame: v_pix = cell*vel_step (overrides cell/vel_decimate). For LEV-emulation set to v_LEV3/4 so cells 1,2,4 = LEV1,LEV2,LEV3. By Claude 06/19/2026')
+    ap.add_argument('--kernel-reach', type=int, default=8, help='conv kernel spatial reach, px (direct kernel: 8 at VR5/VD4/NH6)')
+    ap.add_argument('--amplitude', type=float, default=100.0, help='target total flux, counts')
+    ap.add_argument('--shape', choices=['halfcosine','gaussian'], default='halfcosine',
+                    help='target spot shape: halfcosine (default) = canonical 3-px half-period half-cosine bump (matches matched-filter kernel, G, condition()); gaussian = sigma-defined')
+    ap.add_argument('--sigma', type=float, default=0.7, help='Gaussian sigma, px (shape=gaussian only)')
+    ap.add_argument('--background', type=float, default=0.0, help='constant background level')
+    ap.add_argument('--background-from', default=None,
+                    help='tiff to use as per-frame background (e.g. the REAL merged file: synthetic targets are ADDED to real clutter); overrides --background')
+    ap.add_argument('--bg-decimate-average', type=int, default=1, help='each output bg frame = mean of N consecutive --background-from frames (noise-floor / level control: N=2^L gives the LEV-L noise floor, std down sqrt(N)). By Claude 06/19/2026')
+    ap.add_argument('--peak', type=float, default=0.0,
+                    help='if >0: set amplitude so a pixel-centered target PEAK equals this (halfcosine: amplitude = 4*peak)')
+    ap.add_argument('--nframes', type=int, default=0, help='limit number of frames (0 = same as ref)')
+    ap.add_argument('--motion-blur', action='store_true', help='causal/RT motion blur (Andrey naive method): each frame = average of sub-frame target positions trailing into the past; streak ~|v|*blur_frac, centroid lags ~0.5*blur_frac*|v| (RT registration bias) // By Claude on 06/17/2026')
+    ap.add_argument('--blur-frac', type=float, default=1.0, help='MB averaging window in frames (1.0 = non-overlap decimation)')
+    ap.add_argument('--blur-subs', type=int, default=4, help='sub-frames per averaging window (the 4-8 subdivision)')
+    args = ap.parse_args()
+
+    with tifffile.TiffFile(args.ref) as tf:
+        labels = list(tf.imagej_metadata['Labels'])
+        nslices, height, width = tf.series[0].shape
+    assert labels[0] == 'average' and len(labels) == nslices
+    ts_labels = labels[1:]
+    navg = max(1, args.bg_decimate_average)   # noise-floor / level control: each output frame averages navg real frames // By Claude 06/19/2026
+    avail = len(ts_labels) // navg
+    nframes = avail if args.nframes <= 0 else min(args.nframes, avail)
+    ts_labels = ts_labels[::navg][:nframes]
+
+    step = args.step
+    vrad = args.vel_radius
+    vdim = 2 * vrad + 1
+    vstep = args.vel_step if args.vel_step is not None else 1.0 / args.vel_decimate  # px/frame per velocity cell // By Claude 06/19/2026
+    budget = step / 2.0 - args.kernel_reach
+    assert budget > 0, 'grid step too small for kernel reach'
+
+    targets = []
+    if args.layout == 'radial':
+        # one target per fine velocity cell, at center + step*(vx_cell, vy_cell), moving
+        # outward: pure expansion, pairwise distances only grow - no restarts ever
+        cx = (args.center_x if args.center_x is not None else width // 2 + 0.5)   # velocity-0 node (phase 0); ROI corner -> +sweep lands in-ROI
+        cy = (args.center_y if args.center_y is not None else height // 2 + 0.5)
+        tid = 0
+        for vy_cell in range(-vrad, vrad + 1):
+            for vx_cell in range(-vrad, vrad + 1):
+                targets.append({
+                    'id': tid,
+                    'node': [cx + step * vx_cell, cy + step * vy_cell],
+                    'v_cells': [vx_cell, vy_cell],
+                    'v_pix_per_frame': [vx_cell * vstep, vy_cell * vstep],
+                    'restart_period_frames': 0,  # never
+                })
+                tid += 1
+    else:  # tiled
+        cols = width // step
+        rows = height // step
+        for row in range(rows):
+            for col in range(cols):
+                vx_cell = (col % vdim) - vrad
+                vy_cell = (row % vdim) - vrad
+                vx = vx_cell * vstep  # px/frame
+                vy = vy_cell * vstep
+                node_x = col * step + step / 2.0 + 0.5  # +0.5: start exactly on a pixel center (phase 0)
+                node_y = row * step + step / 2.0 + 0.5
+                vmax = max(abs(vx), abs(vy))
+                period = int(np.floor(budget / vmax)) if vmax > 0 else 0  # 0 = static, never restarts
+                targets.append({
+                    'id': row * cols + col,
+                    'node': [node_x, node_y],
+                    'v_cells': [vx_cell, vy_cell],
+                    'v_pix_per_frame': [vx, vy],
+                    'restart_period_frames': period,
+                })
+
+    if args.peak > 0:  # peak-referenced amplitude: peak = amplitude * splat_peak/splat_sum at phase 0
+        xs = np.arange(-3, 4) + 0.0
+        prof = (np.where(np.abs(xs) < 1.5, np.cos(np.pi/3.0*np.abs(xs)), 0.0) if args.shape == 'halfcosine'
+                else np.exp(-xs**2/(2*args.sigma**2)))
+        args.amplitude = args.peak * (prof.sum()**2) / (prof.max()**2)
+        print('peak %.3g -> amplitude (total flux) %.4g' % (args.peak, args.amplitude))
+
+    # sum-normalized splat: half-cosine bump (canonical 3-px half-period) or Gaussian
+    def splat_1d(coords, center):
+        if args.shape == 'halfcosine':
+            d = np.abs(coords - center)
+            return np.where(d < 1.5, np.cos(np.pi / 3.0 * d), 0.0)
+        return np.exp(-((coords - center) ** 2) / (2 * args.sigma ** 2))
+    rad = 2 if args.shape == 'halfcosine' else int(np.ceil(3 * args.sigma)) + 1
+    if args.background_from:
+        with tifffile.TiffFile(args.background_from) as tf:
+            bg_labels = tf.imagej_metadata['Labels']
+            first = 1 if bg_labels[0] == 'average' else 0
+            need = nframes * navg
+            raw = tf.asarray()[first:first+need].astype(np.float32)
+        assert raw.shape[0] == need, 'need %d bg frames, have %d (reduce --nframes or --bg-decimate-average)' % (need, raw.shape[0])
+        if navg > 1:
+            raw = raw.reshape(nframes, navg, height, width).mean(axis=1)   # LEV-log2(navg) noise floor: std down sqrt(navg) // By Claude 06/19/2026
+        data = raw.copy()
+        assert data.shape == (nframes, height, width)
+        print('background: %d output frames, each = mean of %d real frames (%d total) from %s'
+              % (nframes, navg, need, args.background_from))
+    else:
+        data = np.full((nframes, height, width), args.background, dtype=np.float32)
+    for t in range(nframes):
+        frame = data[t]
+        for tg in targets:
+            vx, vy = tg['v_pix_per_frame']
+            period = tg['restart_period_frames']
+            dt = t if period == 0 else (t % period)
+            x = tg['node'][0] + vx * dt
+            y = tg['node'][1] + vy * dt
+            # causal/RT motion blur: average sub-frame positions trailing into the past (naive method)
+            if args.motion_blur and (vx or vy):
+                subs = np.arange(args.blur_subs) * (args.blur_frac / args.blur_subs)  # s in [0, blur_frac)
+                pxs = [x - vx * sb for sb in subs]; pys = [y - vy * sb for sb in subs]
+            else:
+                pxs = [x]; pys = [y]
+            ix0 = max(int(np.floor(min(pxs))) - rad, 0)
+            ix1 = min(int(np.floor(max(pxs))) + rad + 1, width)
+            iy0 = max(int(np.floor(min(pys))) - rad, 0)
+            iy1 = min(int(np.floor(max(pys))) + rad + 1, height)
+            if ix0 >= ix1 or iy0 >= iy1:
+                continue
+            xs = np.arange(ix0, ix1) + 0.5  # pixel centers
+            ys = np.arange(iy0, iy1) + 0.5
+            spot = np.zeros((iy1 - iy0, ix1 - ix0), dtype=np.float64)
+            for sx, sy in zip(pxs, pys):
+                spot += np.outer(splat_1d(ys, sy), splat_1d(xs, sx))   # equal-weight trailing average
+            spot /= len(pxs)
+            s = spot.sum()
+            if s > 0:
+                frame[iy0:iy1, ix0:ix1] += (args.amplitude / s) * spot.astype(np.float32)
+
+    avg = data.mean(axis=0, keepdims=True)
+    stack = np.concatenate([avg, data], axis=0)
+    out_labels = ['average'] + ts_labels
+
+    os.makedirs(os.path.dirname(os.path.abspath(args.out)), exist_ok=True)  # abspath: dirname is '' for a bare filename
+    # axes 'ZYX' is essential: without it tifffile declares the planes as CHANNELS
+    # (channels=498 composite) and ImageJ gives each its own display range - slices
+    # then LOOK differently scaled although pixel data is identical
+    tifffile.imwrite(
+        args.out, stack, imagej=True,
+        metadata={'axes': 'ZYX', 'Labels': out_labels})
+
+    gt = {
+        'ref': args.ref,
+        'layout': args.layout,
+        'width': width, 'height': height, 'nframes': nframes,
+        'grid_step': step, 'vel_radius': vrad, 'vel_decimate': args.vel_decimate,
+        'vel_step_px': vstep, 'bg_decimate_average': navg,
+        'kernel_reach': args.kernel_reach, 'travel_budget_px': budget,
+        'amplitude_total_flux': args.amplitude, 'shape': args.shape, 'sigma_px': args.sigma,
+        'background': args.background, 'background_from': args.background_from, 'peak': args.peak,
+        'motion': ('pos = node + v*t, no restarts (radial expansion)' if args.layout == 'radial'
+                   else 'pos = node + v*(t % restart_period); restart_period 0 = static'),
+        'targets': targets,
+    }
+    with open(args.out + '.groundtruth.json', 'w') as f:
+        json.dump(gt, f, indent=1)
+    print('wrote', args.out, stack.shape, 'and ground truth json;',
+          len(targets), 'targets, layout', args.layout, 'step', step)
+
+if __name__ == '__main__':
+    main()
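Two pieces of arithmetic in gen_synth_cuas.py can be checked in isolation: the `--peak` rule (half-cosine: amplitude = 4*peak) and the travel budget quoted in the header comment. The sketch below is not part of this commit. The helper names `halfcos_1d`, `render_spot` and `restart_period` and the 9x9 canvas are assumptions for illustration, while the profile, the sum normalization and the budget formula follow the script.

```python
import numpy as np

def halfcos_1d(coords, center):
    """3-px half-period half-cosine bump, zero beyond 1.5 px from the center."""
    d = np.abs(coords - center)
    return np.where(d < 1.5, np.cos(np.pi / 3.0 * d), 0.0)

def render_spot(x, y, amplitude, size=9):
    """Sum-normalized separable spot on a size x size canvas (total flux = amplitude)."""
    c = np.arange(size) + 0.5                     # pixel centers
    spot = np.outer(halfcos_1d(c, y), halfcos_1d(c, x))
    return amplitude * spot / spot.sum()

def restart_period(step, reach, v):
    """Frames a target may travel before leaving its budget (0 = static, never restarts)."""
    budget = step / 2.0 - reach
    return int(np.floor(budget / v)) if v > 0 else 0

peak = 25.0
centered = render_spot(4.5, 4.5, 4 * peak)        # exactly on a pixel center
shifted = render_spot(4.8, 4.3, 4 * peak)         # arbitrary sub-pixel phase
print(centered.max(), centered.sum(), shifted.sum())
print(restart_period(30, 8, 1.25), restart_period(40, 8, 1.25))
```

At step 30 the fastest velocity gets fewer frames than the 6-frame window mentioned in the header, which is the stated reason for using step 40 with the tiled layout.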
--- a/ghost_probe.py
+++ b/ghost_probe.py
+#!/usr/bin/env python3
+# ghost_probe.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Trajectory-alias ghost probe. # By Claude on 06/16/2026
+A STATIC bright target placed at offset d from the patch center (= the output/ROI pixel) should be
+SUPPRESSED (det s -> 0): it's a neighbour's target, not a fast target through me. The failure mode
+(seen in the clean synthetic grid) is the net hallucinating a FAST velocity whose tail lands on the
+static blob -> a ghost detection at d in the untrained off-center band. This sweeps d and prints s +
+the argmax velocity, so we can compare the 24-patch (off_max=9) vs 32-patch (off_max=13) models:
+the wider RF + extended off-center suppression should drive s->~0 out to a larger d."""
+import argparse, numpy as np, torch, synth
+from model import RawFCN
+
+
+def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("ck")
+    ap.add_argument("--nframes", type=int, default=9)
+    ap.add_argument("--patch", type=int, default=24)
+    ap.add_argument("--vel_radius", type=int, default=5)
+    ap.add_argument("--vel_decimate", type=int, default=4)
+    ap.add_argument("--amp", type=float, default=8.0)   # static target peak (training: snr*bump, snr 1..8)
+    a = ap.parse_args()
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    ck = torch.load(a.ck, map_location=dev)
+    m = RawFCN(n_frames=a.nframes, vel_radius=a.vel_radius, patch=a.patch).to(dev)
+    m.load_state_dict(ck["model"]); m.eval()
+    P = a.patch; cx = P / 2.0; cy = P / 2.0      # deployment reference = P/2
+    n = 2 * a.vel_radius + 1; step = 1.0 / a.vel_decimate
+    ix = np.arange(n * n); vxc = (ix % n - a.vel_radius) * step; vyc = (ix // n - a.vel_radius) * step
+    print(f"model {a.ck}  patch={a.patch}  amp={a.amp}  off_max={P/2 - 2 - 1:.0f}px")
+    print(f"  STATIC target at offset d along +x; want s->0 as d grows (ghost = high s + fast v)")
+    print(f"{'d':>4} {'s':>7} {'argmax v (cells)':>17} {'|v| px':>7}")
+    for d in range(0, P // 2):
+        frames = np.empty((a.nframes, P, P), dtype=np.float32)
+        bump = (a.amp * synth.halfcos_bump(cx + d, cy, P, P)).astype(np.float32)
+        for i in range(a.nframes):
+            frames[i] = bump                      # static: identical every frame
+        with torch.no_grad():
+            out = m(torch.from_numpy(frames[None]).to(dev))[0, :, 0, 0].cpu().numpy()
+        s = sigmoid(out[0]); k = int(np.argmax(out[1:1 + n * n]))
+        vx, vy = vxc[k], vyc[k]
+        print(f"{d:4d} {s:7.3f}   ({vx:+.2f},{vy:+.2f})   {np.hypot(vx, vy):6.3f}")
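The probe above turns the flat argmax index `k` over the velocity logits back into a velocity in px/frame (the `vxc`, `vyc` lookup arrays). A minimal standalone sketch of that mapping follows. It assumes the same row-major grid layout (vx varies fastest) and the script's defaults `vel_radius=5`, `vel_decimate=4`; the helper name `cell_to_velocity` is hypothetical and does not exist in the repository.

```python
def cell_to_velocity(k, vel_radius=5, vel_decimate=4):
    """Flat index k over an n x n velocity grid (n = 2*vel_radius + 1) -> (vx, vy) in px/frame.

    Assumes the row-major layout used by the probe: k % n indexes vx, k // n indexes vy.
    """
    n = 2 * vel_radius + 1
    if not 0 <= k < n * n:
        raise ValueError(f"k={k} is outside 0..{n * n - 1}")
    row, col = divmod(k, n)          # row -> vy cell, col -> vx cell
    step = 1.0 / vel_decimate        # grid step in px/frame
    return (col - vel_radius) * step, (row - vel_radius) * step


# The center cell of the default 11x11 grid is index 60 (zero velocity);
# indices 0 and 120 are opposite corners of the grid.
center = cell_to_velocity(60)
corner_lo = cell_to_velocity(0)
corner_hi = cell_to_velocity(120)
```

With the defaults the grid spans -1.25 to +1.25 px/frame in 0.25 steps, so a static target should ideally peak at the center cell.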
--- a/infer_server.py
+++ b/infer_server.py
+#!/usr/bin/env python3
+# infer_server.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""DGX remote-inference server for the CUAS RawFCN (stateful, batched). By Claude on 06/20/2026.
+
+Java uploads the SUBAVG+LoG-conditioned (optionally synth-mixed) stack ONCE; the DGX builds the
+temporal pyramid (0.5*(now+prev), as Java's temporalAverageLReLU), then serves BATCHED full-res
+inference: one INFER does a whole range of scenes of a level (chunked for GPU memory), shift-and-
+stitch -> on-GPU GHOSTBUSTER (== CuasDetectRT.dnnGhostbust) -> decode -> returns the full-frame
+offset {dx,dy,s,Vx,Vy} + the ROI-only 121-cell softmax*s, per scene. A debug READBACK returns one
+conditioned pyramid frame. Runs inside nvcr.io/nvidia/pytorch:25.10-py3.
+
+Protocol (big-endian, matches Java DataInput/OutputStream). Each request: int32 cmd, then:
+  UPLOAD  (1): int32 T,H,W ; T*H*W float32 (conditioned stack)
+              -> reply: int32 n_levels ; n_levels x int32 frames_per_level ; int32 N ; float64 build_ms
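+  INFER   (2): int32 level, start, count, stride, roi_x, roi_y, roi_w, roi_h ; float64 rmax_cells ;
+              int32 l2_enable, l2_reset ; float64 noise_scale
+              (count scenes: newest = start + j*stride, j=0..count-1; rmax_cells<=0 disables ghostbuster;
+               noise_scale<=0 -> server fallback sqrt(2)^(level-NOISE_REF_LEVEL))
+              -> reply: float64 gpu_ms ; int32 H,W,count,nch,nvel,rh,rw
+                        count*nch*H*W float32 (offset planes, plane-major per scene; nch=5 for L1-only:
+                                               dx,dy,s,Vx,Vy; nch=6 with L2: L1 dx, L1 dy, L2 det, L2 Vx, L2 Vy, L2 age)
+                        count*rh*rw*nvel float32 (ROI softmax*s, pixel-major per scene)
+  READBACK(3): int32 level, frame -> reply: int32 H,W ; H*W float32
+  STATUS  (4): no payload -> reply: int32 len1 ; len1 bytes UTF-8 L1 run dir ; int32 len2 ; len2 bytes UTF-8 L2 run dir
+              (len2 = 0 when the server runs L1-only)
+  BYE     (0): close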
+"""
+import argparse, os, socket, struct, time
+from datetime import datetime
+import numpy as np
+import torch
+import torch.nn.functional as F
+from model import RawFCN
+
+CMD_BYE, CMD_UPLOAD, CMD_INFER, CMD_READBACK = 0, 1, 2, 3
+CMD_STATUS = 4     # report loaded L1/L2 model paths so the client can detect a model change. By Claude on 06/24/2026
+GPU_CHUNK = 16     # scenes processed per batched GPU pass (memory vs utilization)
+VEL_DECIMATE = 4   # velocity-grid cells per px/level-frame (Java curt_vel_decimate); L2 was trained on Vx,Vy in px/frame (cells/4). By Claude on 06/22/2026
+AGE_THR = 0.2      # L2 track-age death threshold: a cell with det<=AGE_THR "dies" (age 0). Raised 0.01->0.2 so the
+                   # weak noise halo dies and the 5x5 max-pool can't dilate age across gaps. By Claude on 06/24/2026
+AGE_K   = 0.5      # ancestor gate: a 5x5 previous-frame neighbor may pass its age only if its det >= AGE_K * (local
+                   # max det in that 5x5) - blocks a weak-but-old straggler from seeding age. By Claude on 06/24/2026
+NOISE_REF_LEVEL = 3  # the net is calibrated to ~LEV3's absolute noise (low-contrast signals tested mainly on LEV3).
+                   # The pyramid averages 2 frames/level so sigma drops sqrt(2)/level; scale each level's L1 input by
+                   # sqrt(2)^(level-REF) to put every level at LEV3's absolute noise (uniform FP). By Claude on 06/24/2026
+
+
+def load_l2(run_dir, device):
+    # Optional Layer-2 (track-before-detect) recurrent net; FCN so it runs on any H,W. By Claude on 06/22/2026
+    from layer2 import Layer2Net
+    ck = torch.load(os.path.join(run_dir, "model.pt"), map_location="cpu", weights_only=False)
+    a = ck.get("args", {}) or {}
+    m = Layer2Net(ch_in=3, ch_hidden=a.get("ch", 24), grid=a.get("G", 32), vmax=a.get("vmax", 1.4))
+    m.load_state_dict(ck["model"]); m.eval().to(device)
+    print(f"loaded L2 {run_dir}/model.pt: ch_hidden={a.get('ch',24)} vmax={a.get('vmax',1.4)}", flush=True)
+    return m
+
+
+def load_model(run_dir, device):
+    ck = torch.load(os.path.join(run_dir, "model.pt"), map_location="cpu", weights_only=False)
+    a = ck.get("args", {}) or {}
+    kw = dict(n_frames=a.get("nframes", 8), vel_radius=a.get("vel_radius", 5),
+              patch=a.get("patch", 24), velocity_mode=a.get("velocity_mode", "grid"),
+              vmax=a.get("vmax", 1.4))
+    if a.get("ch") is not None:
+        kw["ch"] = tuple(a["ch"])
+    m = RawFCN(**kw)
+    m.load_state_dict(ck["model"])
+    m.eval().to(device)
+    print(f"loaded {run_dir}/model.pt: N={kw['n_frames']} patch={kw['patch']} vr={kw['vel_radius']} "
+          f"mode={kw['velocity_mode']} out_ch={m.out_ch}", flush=True)
+    return m, kw["n_frames"], kw["patch"], kw["vel_radius"]
+
+
+def recvall(conn, n):
+    buf = bytearray()
+    while len(buf) < n:
+        chunk = conn.recv(n - len(buf))
+        if not chunk:
+            raise ConnectionError("short read")
+        buf += chunk
+    return bytes(buf)
+
+
+def build_pyramid(log, n_levels_max=8):
+    # Replicates Java's pyramid (CuasDetectRT temporalAverageLReLU, linear). log: [T,H,W].
+    levels = [0.5 * (log[1:] + log[:-1])]                 # [T-1,H,W]
+    while len(levels) < n_levels_max:
+        prev = levels[-1]
+        nl = len(prev) // 2 - 1
+        if nl < 1:
+            break
+        idx = torch.arange(nl, device=log.device)
+        levels.append(0.5 * (prev[2 * idx + 2] + prev[2 * idx]))
+    return levels
+
+
+@torch.no_grad()
+def shift_stitch(m, x, P, S=4):
+    # x: [B,N,H,W] -> full-res field [B,C,H,W] (validated == per-pixel in fp64, dense_check.py).
+    B, N, H, W = x.shape
+    half = P // 2
+    xp = F.pad(x, (half, half, half, half))               # [B,N,H+2*half,W+2*half] (= H+P, W+P for even P)
+    full = torch.zeros(B, m.out_ch, H, W, device=x.device, dtype=x.dtype)
+    for sy in range(S):
+        for sx in range(S):
+            y = m(xp[:, :, sy:, sx:])                     # [B,C,oH,oW]
+            oh = min(y.shape[2], (H - sy + S - 1) // S)
+            ow = min(y.shape[3], (W - sx + S - 1) // S)
+            full[:, :, sy::S, sx::S] = y[:, :, :oh, :ow]
+    return full                                           # [B,C,H,W]
+
+
+@torch.no_grad()
+def decode(field, vr, roi, rmax_cells):
+    # field: [B,C,H,W]. On-GPU ghostbuster (== CuasDetectRT.dnnGhostbust) + decode.
+    vdim = 2 * vr + 1
+    nvel = vdim * vdim
+    B, C, H, W = field.shape
+    x0, y0, rw, rh = roi
+    s = field[:, 0].sigmoid()                             # [B,H,W]
+    p = field[:, 1:1 + nvel].softmax(1)                   # [B,nvel,H,W]
+    k = torch.arange(nvel, device=field.device)
+    cx = (k % vdim - vr).to(field.dtype)                  # [nvel] vx cell coord
+    cy = (k // vdim - vr).to(field.dtype)                 # [nvel] vy cell coord
+    if rmax_cells > 0:                                    # ghostbuster
+        corner = (cx * cx + cy * cy) > (rmax_cells * rmax_cells)   # [nvel] untrained corner cells
+        ghost = corner[p.argmax(1)]                       # [B,H,W] peak lands in a corner -> whole pixel is a ghost
+        p = p * (~corner).to(p.dtype)[None, :, None, None]   # zero corner cells everywhere
+        keep = (~ghost).to(p.dtype)                       # [B,H,W]
+        p = p * keep[:, None]                             # zero all cells at ghost pixels
+        s = s * keep                                      # s=0 at ghost pixels
+    psum = p.sum(1).clamp_min(1e-12)                      # [B,H,W] (normalize the centroid)
+    vx = (p * cx[None, :, None, None]).sum(1) / psum      # [B,H,W] velocity centroid (cells)
+    vy = (p * cy[None, :, None, None]).sum(1) / psum
+    dx, dy = field[:, 1 + nvel], field[:, 1 + nvel + 1]   # [B,H,W]
+    offset5 = torch.stack([dx, dy, s, vx, vy], 1)         # [B,5,H,W]  -> -OFFSET
+    roi_field = (p * s[:, None])[:, :, y0:y0 + rh, x0:x0 + rw]   # [B,nvel,rh,rw] softmax*s over ROI
+    return offset5, roi_field.permute(0, 2, 3, 1).contiguous(), nvel  # [B,5,H,W], [B,rh,rw,nvel]
+
+
+def serve(run_dir, host, port, l2_run=None):
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    torch.backends.cudnn.benchmark = True
+    torch.set_grad_enabled(False)   # inference-only server; L2 recurrence (m2.cell/decode) isn't @no_grad'd. By Claude 06/22/2026
+    m, N, P, vr = load_model(run_dir, device)
+    m2 = load_l2(l2_run, device) if l2_run else None    # optional Layer-2; None -> L1-only (old way). By Claude 06/22/2026
+    print(f"device={device} gpu={torch.cuda.get_device_name(0) if device=='cuda' else 'cpu'} "
+          f"patch={P} N={N} vr={vr} L2={'on('+l2_run+')' if m2 is not None else 'off'}", flush=True)
+    _ = shift_stitch(m, torch.zeros(1, N, 64, 64, device=device), P)   # warm-up
+    if device == "cuda":
+        torch.cuda.synchronize()
+    print("warm-up done", flush=True)
+
+    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+    srv.bind((host, port))
+    srv.listen(4)
+    print(f"listening on {host}:{port} (N={N}, batched full-res shift-and-stitch + ghostbuster)", flush=True)
+    pyr = None
+    while True:
+        conn, addr = srv.accept()
+        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
+        print(f"{datetime.now():%H:%M:%S} client {addr}", flush=True)
+        h_l2 = None     # Layer-2 recurrent hidden state [1,ch,H,W]; persists across INFER chunks, reset on l2_reset. By Claude 06/22/2026
+        age_l2 = None   # L2 track-age field [1,1,H,W]; sprev_l2 = previous-frame L2 det; carried+reset like h_l2. By Claude 06/24/2026
+        sprev_l2 = None
+        try:
+            while True:
+                cmd = struct.unpack(">i", recvall(conn, 4))[0]
+                if cmd == CMD_BYE:
+                    break
+                if cmd == CMD_STATUS:
+                    # Reply with loaded L1 + L2 model paths (len-prefixed UTF-8); empty L2 = L1-only.
+                    # Lets the Java client detect a model change and relaunch. By Claude on 06/24/2026
+                    b1 = run_dir.encode("utf-8")
+                    b2 = (l2_run or "").encode("utf-8")
+                    conn.sendall(struct.pack(">i", len(b1)) + b1 + struct.pack(">i", len(b2)) + b2)
+                    continue
+                if cmd == CMD_UPLOAD:
+                    T, H, W = struct.unpack(">iii", recvall(conn, 12))
+                    data = recvall(conn, T * H * W * 4)
+                    log = torch.from_numpy(np.frombuffer(data, dtype=">f4").astype(np.float32)
+                                           .reshape(T, H, W)).to(device)
+                    t0 = time.perf_counter()
+                    pyr = build_pyramid(log)
+                    if device == "cuda":
+                        torch.cuda.synchronize()
+                    bms = (time.perf_counter() - t0) * 1e3
+                    nl = len(pyr)
+                    print(f"{datetime.now():%H:%M:%S} UPLOAD T={T} {H}x{W} -> {nl} levels "
+                          f"{[len(l) for l in pyr]}  build={bms:.1f}ms ({T*H*W*4/1e6:.1f}MB)", flush=True)
+                    conn.sendall(struct.pack(">i", nl) + b"".join(struct.pack(">i", len(l)) for l in pyr)
+                                 + struct.pack(">id", N, bms))
+                elif cmd == CMD_INFER:
+                    level, start, count, stride, rx, ry, rw, rh = struct.unpack(">iiiiiiii", recvall(conn, 32))
+                    rmax = struct.unpack(">d", recvall(conn, 8))[0]
+                    l2_enable, l2_reset = struct.unpack(">ii", recvall(conn, 8))   # By Claude 06/22/2026
+                    noise_scale = struct.unpack(">d", recvall(conn, 8))[0]         # per-level L1-input noise scale from Java (single source of truth); <=0 -> server fallback. By Claude 06/24/2026
+                    use_l2 = bool(l2_enable) and (m2 is not None)
+                    # Per-level noise normalization: scale this level's L1 input to LEV3's absolute noise so all
+                    # levels sit in the net's trained regime (uniform FP across levels). LEV3 -> 1.0, lower/noisier
+                    # levels scale down, higher levels up. Independent of the age filter. By Claude on 06/24/2026
+                    if noise_scale <= 0.0:                # fallback only: Java didn't send one -> theoretical sqrt(2)^(level-ref)
+                        noise_scale = 2.0 ** ((level - NOISE_REF_LEVEL) / 2.0)
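+                    if pyr is None:                       # INFER before any UPLOAD: drop this client, keep the server up
+                        raise ConnectionError("INFER before UPLOAD")
+                    lev = pyr[level] * noise_scale        # [Tl,H,W]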
+                    H, W = lev.shape[1], lev.shape[2]
+                    nvel = (2 * vr + 1) ** 2
+                    o5_gpu, rf_gpu = [], []
+                    # Time PURE GPU compute (shift-and-stitch + decode), continuous over the whole range -
+                    # the production throughput. Results stay on-GPU (prod feeds Layer 2 there); the D2H copy
+                    # below is dev-only and NOT timed. By Claude on 06/20/2026
+                    if device == "cuda":
+                        ev0 = torch.cuda.Event(enable_timing=True); ev1 = torch.cuda.Event(enable_timing=True)
+                        torch.cuda.synchronize(); ev0.record()
+                    else:
+                        t0 = time.perf_counter()
+                    for c0 in range(0, count, GPU_CHUNK):
+                        b = min(GPU_CHUNK, count - c0)
+                        # newest-first windows (channel 0 = newest), matching the Java order
+                        wins = torch.stack([lev[(start + (c0 + j) * stride) - N + 1:
+                                                (start + (c0 + j) * stride) + 1].flip(0) for j in range(b)])  # [b,N,H,W]
+                        field = shift_stitch(m, wins, P)  # [b,C,H,W]
+                        o5, rf, nv = decode(field, vr, (rx, ry, rw, rh), rmax)   # L1: ghostbusted offset5 + ROI
+                        nvel = nv
+                        if use_l2:
+                            # Layer-2 (track-before-detect) over the scene/time axis. Feed the FULL
+                            # (non-ghostbusted) field as (s, Vx/vd, Vy/vd) px/level-frame; carry the recurrent
+                            # hidden state across chunks (reset on l2_reset at the level's first chunk). Output
+                            # replaces offset5 with {L1 dx, L1 dy, L2 det, L2 Vx*vd, L2 Vy*vd} (vel back to cells
+                            # so Java's existing /vel_decimate viz scaling -> px/level-frame). By Claude 06/22/2026
+                            ong, _, _ = decode(field, vr, (rx, ry, rw, rh), 0.0)        # no ghostbuster (L2 gets full field)
+                            l2in = torch.stack([ong[:, 2], ong[:, 3] / VEL_DECIMATE, ong[:, 4] / VEL_DECIMATE], 1)  # [b,3,H,W]
+                            # FPN-bad margins arrive as NaN; the recurrent circular conv would otherwise spread
+                            # NaN inward by the kernel radius every frame ("eating" the borders). Sanitize the
+                            # input so NaN can never seed/propagate through the hidden state. By Claude 06/22/2026
+                            l2in = torch.nan_to_num(l2in, nan=0.0, posinf=0.0, neginf=0.0)
+                            Hf, Wf = l2in.shape[2], l2in.shape[3]
+                            if (h_l2 is None) or (h_l2.shape[2] != Hf) or (h_l2.shape[3] != Wf) or (l2_reset and c0 == 0):
+                                h_l2 = torch.zeros(1, m2.ch_hidden, Hf, Wf, device=device, dtype=field.dtype)
+                                age_l2 = torch.zeros(1, 1, Hf, Wf, device=device, dtype=field.dtype)   # track age, carried+reset like h_l2
+                                sprev_l2 = torch.zeros(1, 1, Hf, Wf, device=device, dtype=field.dtype)  # previous-frame L2 det
+                            dets, vxs, vys, ages = [], [], [], []
+                            for j in range(b):                                          # forward in time, carry hidden + age
+                                h_l2 = m2.cell(l2in[j:j+1], h_l2)
+                                dlog, vel = m2.decode(h_l2)                             # [1,1,H,W],[1,2,H,W]
+                                s = torch.sigmoid(dlog[:, 0:1])                         # [1,1,H,W] current L2 det
+                                # AGE (track-before-detect persistence): die where det<=AGE_THR, else 1 + oldest age among
+                                # 5x5 PREVIOUS-frame neighbors that are themselves STRONG (det >= AGE_K * local-max det) -
+                                # so a weak-but-old straggler can't seed age; the raised AGE_THR stops the noise halo from
+                                # dilating age across gaps. Level-uniform 5x5 (pyramid keeps ~const px/level-frame). By Claude 06/24/2026
+                                maxS = F.max_pool2d(sprev_l2, 5, 1, 2)                          # local max prev-det in 5x5
+                                elig = (sprev_l2 >= AGE_K * maxS) & (sprev_l2 > AGE_THR)        # strong AND alive ancestors
+                                prev = torch.where(elig, age_l2, torch.zeros_like(age_l2))      # only strong ancestors pass age
+                                age_l2 = torch.where(s > AGE_THR, F.max_pool2d(prev, 5, 1, 2) + 1.0, torch.zeros_like(age_l2))
+                                sprev_l2 = s
+                                dets.append(s[:, 0]); ages.append(age_l2[:, 0]); vxs.append(vel[:, 0]); vys.append(vel[:, 1])
+                            l2vx = torch.cat(vxs, 0) * VEL_DECIMATE; l2vy = torch.cat(vys, 0) * VEL_DECIMATE
+                            o5 = torch.stack([o5[:, 0], o5[:, 1], torch.cat(dets, 0), l2vx, l2vy, torch.cat(ages, 0)], 1)  # +L2 age (6th); keep L1 dx,dy
+                        o5_gpu.append(o5); rf_gpu.append(rf)   # keep on GPU (rf = L1 ROI reference even when L2 on)
+                    if device == "cuda":
+                        ev1.record(); torch.cuda.synchronize(); gms = ev0.elapsed_time(ev1)
+                    else:
+                        gms = (time.perf_counter() - t0) * 1e3
+                    allo = torch.cat(o5_gpu, 0).cpu().numpy().astype(">f4")  # [count,nch,H,W] nch=5 (L1) or 6 (L2:+age) D2H untimed
+                    nch = allo.shape[1]                                      # channel count sent in the header (was hardcoded 5)
+                    allr = torch.cat(rf_gpu, 0).cpu().numpy().astype(">f4")  # [count,rh,rw,nvel]
+                    print(f"{datetime.now():%H:%M:%S} INFER lev={level} {count} scenes (f{start}..,stride {stride}) "
+                          f"ROI={rw}x{rh} ghost={rmax:.1f} nscale={noise_scale:.3f} L2={'on' if use_l2 else 'off'}{'(reset)' if (use_l2 and l2_reset) else ''} "
+                          f"gpu={gms:.1f}ms ({(allo.nbytes+allr.nbytes)/1e6:.1f}MB out)", flush=True)
+                    conn.sendall(struct.pack(">diiiiiii", gms, H, W, count, nch, nvel, rh, rw))  # +nch (offset channels). By Claude 06/24/2026
+                    conn.sendall(allo.tobytes())
+                    conn.sendall(allr.tobytes())
+                elif cmd == CMD_READBACK:
+                    level, frame = struct.unpack(">ii", recvall(conn, 8))
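+                    if pyr is None:                       # READBACK before any UPLOAD: drop this client, keep the server up
+                        raise ConnectionError("READBACK before UPLOAD")
+                    fr = pyr[level][frame].cpu().numpy().astype(">f4")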
+                    print(f"{datetime.now():%H:%M:%S} READBACK lev={level} f={frame}", flush=True)
+                    conn.sendall(struct.pack(">ii", fr.shape[0], fr.shape[1]) + fr.tobytes())
+        except (ConnectionError, struct.error, IndexError) as ex:
+            print(f"client {addr} closed/err: {ex}", flush=True)
+        finally:
+            conn.close()
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--run", default="runs/weighted9_pm_s")
+    ap.add_argument("--l2run", default=None, help="optional Layer-2 run dir (model.pt); omit for L1-only")
+    ap.add_argument("--host", default="0.0.0.0")
+    ap.add_argument("--port", type=int, default=5577)
+    args = ap.parse_args()
+    serve(args.run, args.host, args.port, l2_run=args.l2run)
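The server above reads each INFER request as a fixed sequence of big-endian fields and answers with a fixed header followed by two float32 payloads. A minimal client-side sketch of that framing follows, using only the field order visible in `serve()`. The helper names `pack_infer_request` and `parse_infer_reply_header` are hypothetical; the real client is the Java side, which is not shown in this commit.

```python
import struct

CMD_INFER = 2  # command code, as defined in infer_server.py

# Reply header as sent by the server: gpu_ms, H, W, count, nch, nvel, rh, rw
INFER_REPLY_HEADER = struct.Struct(">diiiiiii")


def pack_infer_request(level, start, count, stride, roi, rmax_cells,
                       l2_enable=False, l2_reset=False, noise_scale=0.0):
    """Build one INFER request in the order the server unpacks it (all big-endian)."""
    roi_x, roi_y, roi_w, roi_h = roi
    return (struct.pack(">i", CMD_INFER)
            + struct.pack(">iiiiiiii", level, start, count, stride, roi_x, roi_y, roi_w, roi_h)
            + struct.pack(">d", rmax_cells)                      # <= 0 disables the ghostbuster
            + struct.pack(">ii", int(l2_enable), int(l2_reset))
            + struct.pack(">d", noise_scale))                    # <= 0 selects the server fallback


def parse_infer_reply_header(buf):
    """Decode the reply header and derive the byte sizes of the two payloads that follow it."""
    gpu_ms, H, W, count, nch, nvel, rh, rw = INFER_REPLY_HEADER.unpack(buf[:INFER_REPLY_HEADER.size])
    return {
        "gpu_ms": gpu_ms, "H": H, "W": W, "count": count, "nch": nch, "nvel": nvel, "rh": rh, "rw": rw,
        "offset_bytes": count * nch * H * W * 4,      # float32 offset planes
        "roi_bytes": count * rh * rw * nvel * 4,      # float32 ROI softmax*s
    }


# Example: 3 scenes of level 3, 16x16 ROI, ghostbuster radius 5.5 cells, L1-only.
request = pack_infer_request(3, 8, 3, 1, (100, 80, 16, 16), 5.5)
fake_header = INFER_REPLY_HEADER.pack(12.5, 512, 640, 3, 5, 121, 16, 16)  # made-up values for illustration
info = parse_infer_reply_header(fake_header)
```

Deriving the payload sizes from the header (rather than assuming 5 offset channels) matters because the server sends 6 channels when Layer-2 is active.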
--- a/l1_samples.py
+++ b/l1_samples.py
+# l1_samples.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Multi-sample L1-output viewer, 2x2-tiled (64x64) only. By Claude on 06/23/2026
+
+Runs frozen L1 over several synthetic gap-runs and writes 2x2-tiled stacks so Andrey can scrub L1
+behavior across samples in Fiji (seam through the center cross => torus continuity visible; no need
+for the single 32x32). Per quantity, pages = nsamples*T concatenated. Channels: input, L1 s, truth
+marker, signal (1=bump rendered / 0=gap). The point: SEE how L1's s-field clears noise near the
+target when present, and how noise returns in a gap (coast-a-gap = higher noise)."""
+
+import argparse
+import numpy as np
+import torch
+import synth
+import layer2_data as L1D
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--l1", default="runs/weighted9_pm/model.pt")
+    ap.add_argument("--T", type=int, default=64); ap.add_argument("--G", type=int, default=32)
+    ap.add_argument("--nsamples", type=int, default=6)
+    ap.add_argument("--vmax", type=float, default=1.4); ap.add_argument("--snr", type=float, default=6.0)
+    ap.add_argument("--gaps", action="store_true")
+    ap.add_argument("--bp_lo", type=int, default=3); ap.add_argument("--bp_hi", type=int, default=9)
+    ap.add_argument("--duty_offset", type=float, default=-0.3); ap.add_argument("--starter_len", type=int, default=8)
+    ap.add_argument("--out", default="runs/l1_samples")
+    a = ap.parse_args()
+    import os; os.makedirs(a.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    net, N, meta = L1D._load_l1(a.l1, dev)
+    G, T = a.G, a.T
+    rkw = dict(gaps=a.gaps, bp_lo=a.bp_lo, bp_hi=a.bp_hi, duty_offset=a.duty_offset, starter_len=a.starter_len)
+
+    def tile2x2(st):                                                 # [T,G,G] -> [T,2G,2G]
+        return np.tile(st, (1, 2, 2))
+
+    inputs, sfields, truths, signals = [], [], [], []
+    print(f"L1 {a.l1} N={N}; {a.nsamples} samples T={T} gaps={a.gaps}", flush=True)
+    for k in range(a.nsamples):
+        rng = np.random.default_rng(100 + k)                        # distinct, reproducible per sample
+        frames, pos, vel, present, signal = L1D.render_run(rng, T=T, G=G, vmax=vmax_(a), snr=a.snr,
+                                                           return_signal=True, **rkw)
+        seq = L1D.gen_field_sequence(net, frames, pos, G, N, dev)   # [T,3,G,G]
+        truth = np.zeros((T, G, G), np.float32)
+        for t in range(T):
+            if present[t]:
+                truth[t] = L1D.halfcos_bump_torus(pos[t, 0], pos[t, 1], G)
+        inputs.append(tile2x2(frames))
+        sfields.append(tile2x2(seq[:, 0]))
+        truths.append(tile2x2(truth))
+        signals.append(tile2x2(np.broadcast_to(signal[:, None, None], (T, G, G)).astype(np.float32)))
+        ng = int((present > 0).sum()); ngap = int(((present > 0) & (signal < 0.5)).sum())
+        print(f"  sample {k}: present {ng}/{T}, gap {ngap}", flush=True)
+
+    # concatenate samples along the page axis -> one scrubbable stack per quantity
+    synth.save_tiff_stack(np.concatenate(inputs, 0),  f"{a.out}/input_2x2.tif")
+    synth.save_tiff_stack(np.concatenate(sfields, 0), f"{a.out}/s_2x2.tif")
+    synth.save_tiff_stack(np.concatenate(truths, 0),  f"{a.out}/truth_2x2.tif")
+    synth.save_tiff_stack(np.concatenate(signals, 0), f"{a.out}/signal_2x2.tif")
+    print(f"wrote {a.out}/{{input,s,truth,signal}}_2x2.tif  "
+          f"({a.nsamples}x{T}={a.nsamples*T} pages, {2*G}x{2*G}, 32-bit; sample k = pages [k*{T},(k+1)*{T}))", flush=True)
+
+
+def vmax_(a):
+    return a.vmax
+
+
+if __name__ == "__main__":
+    main()
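The viewer above tiles each `[T,G,G]` stack 2x2 so that the torus seam runs through the center of every page, then concatenates the samples along the page axis. A small sketch of both steps on toy data follows, assuming NumPy is available. `tile2x2` mirrors the helper in `main()`, while `sample_pages` is a hypothetical convenience for pulling sample `k` back out of the concatenated stack.

```python
import numpy as np


def tile2x2(stack):
    """[T,G,G] -> [T,2G,2G]: repeat every page twice along y and x (torus seam at the center)."""
    return np.tile(stack, (1, 2, 2))


def sample_pages(pages, k, T):
    """Pages of sample k in a stack built by concatenating T-page samples: [k*T, (k+1)*T)."""
    return pages[k * T:(k + 1) * T]


T, G = 3, 4                                           # toy sizes; the script defaults are T=64, G=32
stack = np.arange(T * G * G, dtype=np.float32).reshape(T, G, G)
tiled = tile2x2(stack)                                # shape (T, 2G, 2G)
pages = np.concatenate([tiled, tiled + 100.0], 0)     # two "samples" along the page axis
second = sample_pages(pages, 1, T)
```

Each quadrant of a tiled page is an exact copy of the original page, so a target that wraps around the torus edge appears continuous across the center cross.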
--- a/l2_fp_analysis.py
+++ b/l2_fp_analysis.py
+# l2_fp_analysis.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""L2 false-positive + P_d analysis vs UAS flight-log truth, per pyramid level. By Claude on 06/24/2026.
+
+PLUMBING/CORRECTNESS first pass (not final numbers). For one sequence dir (the center's vNNN dir), it:
+  - finds every  *-OFFSET-<model>.tiff  (level from "-LEVn-"; an untagged legacy file = level 0),
+  - reads only the s-channel pages (label starts with "s:") one at a time (memory-safe; LEV0 is ~3GB),
+  - matches each frame to the UAS truth in  *-UAS_DATA.tsv  by timestamp,
+  - applies the clean-sky ROI geometry (see project_l2_fp_measurement), counts FP "blobs" (local maxima
+    of s above threshold, numpy-only) per pixel-hectare in the HIGH/LOW sky zones, excluding a disk around
+    the UAS truth on IN-FoV frames, and scores P_d (UAS detected within that disk),
+  - sweeps the s-threshold and prints a per-level table.
+
+UAS not always on screen is handled by the TSV `status` (IN FoV / OUT OF FoV / no entry): P_d only on IN-FoV
+frames; other frames contribute to FP only. A sequence with no IN-FoV frames -> pure FP (P_d = n/a).
+
+Run:  /home/elphel/.venvs/c5p/bin/python l2_fp_analysis.py --dir <.../center/vNNN> [--model mexhat_gaps_boost40]
+"""
+import argparse, glob, os, re, numpy as np, tifffile
+
+# --- clean-sky FP geometry (per-sequence; the 620->560m UAS clip family) -----------------------------
+ROI        = (42, 45, 555, 270)   # clean ROI x,y,w,h -> x in [42,597), y in [45,315)
+SIGN       = (27, 200, 123, 135)  # bomb-sign + rotation artefacts exclusion x,y,w,h (right edge 150: artefacts reach x=149). By Claude 06/25/2026
+SKY_SPLIT  = 230                  # high sky y<230, low sky y>=230
+HECTARE    = 100.0 * 100.0        # 10000 px^2 (one "pixel-hectare" = 100x100 px)
+
+
+def build_zone_masks(H, W):
+    base = np.zeros((H, W), bool)
+    x0, y0, w, h = ROI
+    base[y0:y0 + h, x0:x0 + w] = True
+    sx, sy, sw, sh = SIGN
+    base[sy:sy + sh, sx:sx + sw] = False
+    high = base.copy(); high[SKY_SPLIT:, :] = False
+    low  = base.copy(); low[:SKY_SPLIT, :] = False
+    return high, low
+
+
+def local_maxima(s, thr):
+    """Boolean map of 3x3 local maxima with s > thr: about one mark per blob (the >= test marks every pixel of an exact tie). numpy-only."""
+    from numpy.lib.stride_tricks import sliding_window_view
+    sp = np.pad(s, 1, mode="constant", constant_values=-np.inf)
+    nmax = sliding_window_view(sp, (3, 3)).max(axis=(2, 3))   # 3x3 neighborhood max
+    return (s >= nmax) & (s > thr)
+
+
+def norm_ts(label):
+    """'s:1773135520_851518-0 f8' -> '1773135520.851518'; a bare '1773135519.534413' is returned as-is (bare ts string)."""
+    t = label.replace("_", ".")
+    m = re.search(r"\d{5,10}\.\d{6}", t)
+    return m.group(0) if m else t
+
+
+def load_truth(tsv_path):
+    """Map ts (float, rounded to 6 decimals) -> (status, px, py). Rows without a parseable ts are skipped."""
+    truth = {}
+    with open(tsv_path) as f:
+        f.readline()                          # skip the header row
+        for line in f:
+            c = line.rstrip("\n").split("\t")
+            if len(c) < 5:
+                continue
+            ts = norm_ts(c[1])
+            try:
+                key = round(float(ts), 6)
+            except ValueError:
+                continue
+            status = c[2]
+            px = float(c[3]) if c[3] else np.nan
+            py = float(c[4]) if c[4] else np.nan
+            truth[key] = (status, px, py)
+    return truth
+
+
+def level_of(path):
+    m = re.search(r"-LEV(\d+)-", os.path.basename(path))
+    return int(m.group(1)) if m else 0   # untagged legacy file = level 0
+
+
+def analyze_level(tiff_path, truth, thresholds, disk_r, dbg=False):
+    """Return per-threshold dict of accumulated FP blobs / valid hectares / P_d counts."""
+    disk_r2 = disk_r * disk_r
+    acc = {t: dict(fp_hi=0, fp_lo=0, ha_hi=0.0, ha_lo=0.0, pd_hit=0, pd_tot=0) for t in thresholds}
+    matched = 0
+    unmatched = 0
+    with tifffile.TiffFile(tiff_path) as tf:
+        labels = (tf.imagej_metadata or {}).get("Labels") or []
+        H, W = tf.pages[0].shape
+        high0, low0 = build_zone_masks(H, W)
+        yy, xx = np.ogrid[:H, :W]
+        for i, page in enumerate(tf.pages):
+            lab = labels[i] if i < len(labels) else ""
+            if not lab.startswith("s:"):   # s-channel pages only
+                continue
+            key = None
+            try:
+                key = round(float(norm_ts(lab)), 6)
+            except ValueError:
+                pass
+            tr = truth.get(key)
+            if tr is None:
+                unmatched += 1
+            else:
+                matched += 1
+            s = np.asarray(page.asarray(), dtype=np.float32)
+            valid = np.isfinite(s)
+            s = np.where(valid, s, -np.inf)
+            # target disk (only on IN-FoV frames)
+            in_fov = (tr is not None) and (tr[0] == "IN FoV") and np.isfinite(tr[1]) and np.isfinite(tr[2])
+            if in_fov:
+                px, py = tr[1], tr[2]
+                disk = ((xx - px) ** 2 + (yy - py) ** 2) <= disk_r2
+            else:
+                disk = np.zeros((H, W), bool)
+            hi = high0 & valid
+            lo = low0 & valid
+            for t in thresholds:
+                peaks = local_maxima(s, t)
+                # FP = peaks in zone, outside target disk
+                acc[t]["fp_hi"] += int(np.count_nonzero(peaks & hi & ~disk))
+                acc[t]["fp_lo"] += int(np.count_nonzero(peaks & lo & ~disk))
+                acc[t]["ha_hi"] += np.count_nonzero(hi) / HECTARE
+                acc[t]["ha_lo"] += np.count_nonzero(lo) / HECTARE
+                if in_fov:
+                    acc[t]["pd_tot"] += 1
+                    if np.any((s > t) & disk):
+                        acc[t]["pd_hit"] += 1
+    return acc, matched, unmatched
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--dir", required=True, help="center vNNN dir with -OFFSET tiffs + -UAS_DATA.tsv")
+    ap.add_argument("--model", default="mexhat_gaps_boost40")
+    ap.add_argument("--disk", type=float, default=6.0, help="target-disk radius (px) for P_d / FP exclusion")
+    ap.add_argument("--thr", default="0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.6")
+    a = ap.parse_args()
+    thresholds = [float(x) for x in a.thr.split(",")]
+
+    tsvs = glob.glob(os.path.join(a.dir, "*-UAS_DATA.tsv"))
+    truth = load_truth(tsvs[0]) if tsvs else {}
+    print(f"truth: {'<none>' if not tsvs else os.path.basename(tsvs[0])}  rows={len(truth)}  "
+          f"IN-FoV={sum(1 for v in truth.values() if v[0]=='IN FoV')}")
+
+    tiffs = sorted(glob.glob(os.path.join(a.dir, f"*-OFFSET-{a.model}.tiff")), key=level_of)
+    if not tiffs:
+        raise SystemExit(f"no *-OFFSET-{a.model}.tiff in {a.dir}")
+    seq = os.path.basename(tiffs[0]).split("-SUBAVG")[0]   # center name, for cross-sequence concat
+    rows = []   # reusable summary rows (raw counts + derived), one per (level, threshold)
+    for f in tiffs:
+        lev = level_of(f)
+        acc, matched, unmatched = analyze_level(f, truth, thresholds, a.disk)
+        print(f"\n=== LEV{lev}  (frames matched={matched} unmatched={unmatched})  {os.path.basename(f)[:48]}... ===")
+        print(f"{'thr':>5} {'FP/ha_hi':>9} {'FP/ha_lo':>9} {'P_d':>6}")
+        for t in thresholds:
+            d = acc[t]
+            fph = d["fp_hi"] / d["ha_hi"] if d["ha_hi"] > 0 else float("nan")
+            fpl = d["fp_lo"] / d["ha_lo"] if d["ha_lo"] > 0 else float("nan")
+            pd = (d["pd_hit"] / d["pd_tot"]) if d["pd_tot"] > 0 else float("nan")
+            print(f"{t:5.2f} {fph:9.3f} {fpl:9.3f} {pd:6.2f}")
+            rows.append((seq, lev, t, d["fp_hi"], d["ha_hi"], fph, d["fp_lo"], d["ha_lo"], fpl,
+                         d["pd_hit"], d["pd_tot"], pd, matched, unmatched))
+
+    # reusable summary CSV in the same dir (raw counts kept so densities/P_d can be re-aggregated
+    # across sequences without re-reading the tiffs). By Claude on 06/24/2026
+    out = os.path.join(a.dir, f"{seq}-L2FP-{a.model}.csv")
+    with open(out, "w") as fo:
+        fo.write(f"# L2 FP/P_d summary  model={a.model}  disk_r={a.disk}px  "
+                 f"ROI={ROI}  SIGN={SIGN}  sky_split_y={SKY_SPLIT}  hectare={int(HECTARE)}px\n")
+        fo.write("seq,level,thr,fp_hi,ha_hi,fp_per_ha_hi,fp_lo,ha_lo,fp_per_ha_lo,pd_hit,pd_tot,pd,matched,unmatched\n")
+        for r in rows:
+            fo.write(",".join(str(x) for x in r) + "\n")
+    print(f"\nsummary -> {out}")
+
+
+if __name__ == "__main__":
+    main()
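Reviewer note (not part of the commit): the comment above the CSV writer says raw counts are kept so that densities and P_d can be re-aggregated across sequences without re-reading the TIFFs. A minimal sketch of that pooling follows, assuming only the column header written above. `reaggregate` and `_ratio` are hypothetical helpers, not functions in this repo, and "sum the counts first, then divide" is one reading of "re-aggregated".

```python
import csv
from collections import defaultdict

# Raw-count columns taken from the CSV header written by main() above.
RAW_COLS = ("fp_hi", "ha_hi", "fp_lo", "ha_lo", "pd_hit", "pd_tot")


def _ratio(num, den):
    """num/den, or NaN when the denominator is empty (same convention as main())."""
    return num / den if den > 0 else float("nan")


def reaggregate(csv_texts):
    """Pool raw counts per (level, thr) over several summary CSVs, then derive rates.

    csv_texts: iterable of str, each the full text of one *-L2FP-*.csv file.
    Returns {(level_str, thr): {"fp_per_ha_hi", "fp_per_ha_lo", "pd"}}.
    """
    totals = defaultdict(lambda: dict.fromkeys(RAW_COLS, 0.0))
    for text in csv_texts:
        # Drop blank lines and the leading '# ...' comment line before the header.
        lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
        for row in csv.DictReader(lines):
            key = (row["level"], float(row["thr"]))
            for col in RAW_COLS:
                totals[key][col] += float(row[col])
    return {
        key: {
            "fp_per_ha_hi": _ratio(c["fp_hi"], c["ha_hi"]),
            "fp_per_ha_lo": _ratio(c["fp_lo"], c["ha_lo"]),
            "pd": _ratio(c["pd_hit"], c["pd_tot"]),
        }
        for key, c in totals.items()
    }
```

Pooling the counts weights each sequence by its zone area and by its number of IN-FoV frames. Averaging the per-sequence `fp_per_ha_*` or `pd` columns directly would weight every sequence equally instead.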
--- a/layer2.py
+++ b/layer2.py
+# layer2.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 (track-before-detect) — minimal circular-ConvGRU on a torus. By Claude on 06/21/2026
+
+Layer 1 (frozen RawFCN) emits, per level-frame, a dense stride-4 field {s, Vx, Vy, dx, dy}.
+Layer 2 is a RECURRENT net whose hidden state is the running 4D track memory (x, y, vx, vy),
+fed a target-following 32x32 slice of that field one frame at a time. This first cut is the
+SIMPLEST viable version (per Andrey 06/21):
+  - plain circular ConvGRU  (NO explicit velocity-advection warp yet — added as a 2nd step;
+    the conv recurrence still learns local motion implicitly),
+  - dense Gaussian-bump readout (det map + Vx,Vy maps; supervise with a bump at truth),
+  - single target, free-orbit (absolute position = torus-local + winding offset, tracked
+    OUTSIDE the net; not needed for this module's forward/backward).
+
+Torus rationale: xy is a PERIODIC 32x32 grid (Conv2d padding_mode='circular'). With the target
+drift over a window staying << 32 cells, the single target "lives in infinite space" on a tiny
+fixed array — no border code, translation-equivariant everywhere, trivial to batch. vx,vy are
+NOT periodic (bounded by vmax; velocity does not wrap).
+
+UNITS: the field grid is stride-4, so one torus cell = 4 scene px. Vx,Vy channels and the
+velocity readout are kept in Layer-1 units (px/level-frame); vmax≈1.4 px/frame => ~0.35 cells/
+frame => ~2.8 cells over N=8 (<< 32, the R<<G condition the torus relies on). The /4 conversion
+to cells only matters once we add the advection warp.
+
+Run the smoke test:  python layer2.py
+"""
+
+import argparse
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+# ---------------------------------------------------------------------------
+# Recurrent cell
+# ---------------------------------------------------------------------------
+class ConvGRUCellTorus(nn.Module):
+    """One ConvGRU step with circular (toroidal) padding on the xy grid. By Claude on 06/21/2026
+
+    Standard ConvGRU:
+        z = sigmoid(Wz . [x, h])            update gate     [B, Ch, G, G]
+        r = sigmoid(Wr . [x, h])            reset gate      [B, Ch, G, G]
+        n = tanh   (Wn . [x, r*h])          candidate state [B, Ch, G, G]
+        h'= (1 - z) * h + z * n             new hidden      [B, Ch, G, G]
+    All convs are k x k with padding_mode='circular' so the 32x32 grid wraps both axes.
+    """
+    def __init__(self, ch_in, ch_hidden, k=3):
+        super().__init__()
+        pad = k // 2
+        cat = ch_in + ch_hidden                                  # concat of input + hidden along channels
+        # one conv per gate; circular pad makes the receptive field wrap the torus edges
+        self.conv_z = nn.Conv2d(cat, ch_hidden, k, padding=pad, padding_mode='circular')
+        self.conv_r = nn.Conv2d(cat, ch_hidden, k, padding=pad, padding_mode='circular')
+        self.conv_n = nn.Conv2d(cat, ch_hidden, k, padding=pad, padding_mode='circular')
+
+    def forward(self, x, h):
+        # x: [B, Cin, G, G]   h: [B, Ch, G, G]  ->  h_new: [B, Ch, G, G]
+        xh = torch.cat([x, h], dim=1)                            # [B, Cin+Ch, G, G]
+        z = torch.sigmoid(self.conv_z(xh))                       # [B, Ch, G, G] update gate
+        r = torch.sigmoid(self.conv_r(xh))                       # [B, Ch, G, G] reset gate
+        xrh = torch.cat([x, r * h], dim=1)                       # [B, Cin+Ch, G, G] reset-masked hidden
+        n = torch.tanh(self.conv_n(xrh))                         # [B, Ch, G, G] candidate
+        return (1.0 - z) * h + z * n                             # [B, Ch, G, G] new hidden
+
+
+# ---------------------------------------------------------------------------
+# Layer-2 net
+# ---------------------------------------------------------------------------
+class Layer2Net(nn.Module):
+    """Recurrent track-before-detect over a torus field sequence. By Claude on 06/21/2026
+
+    forward(seq) consumes T frames of the Layer-1 field slice and returns, per frame, a dense
+    det logit + (Vx,Vy) over the torus. Hidden state starts at 0 (no track) and accumulates
+    evidence across frames — the recurrence IS the track filter.
+    """
+    def __init__(self, ch_in=3, ch_hidden=24, grid=32, vmax=1.4, k=3):
+        super().__init__()
+        self.ch_in = ch_in           # field channels fed in: s, Vx, Vy
+        self.ch_hidden = ch_hidden   # hidden track-memory channels
+        self.grid = grid             # torus side G (cells); one cell = 4 scene px
+        self.vmax = vmax             # velocity readout bound, px/level-frame (matches Layer-1 training vmax)
+        self.cell = ConvGRUCellTorus(ch_in, ch_hidden, k=k)
+        # readout head: hidden -> det(1) + raw Vx,Vy(2); 1x1 conv = per-cell decode
+        self.head = nn.Conv2d(ch_hidden, 1 + 2, 1)
+
+    def init_hidden(self, B, device, dtype):
+        # zero hidden = "no track yet"; [B, Ch, G, G]
+        return torch.zeros(B, self.ch_hidden, self.grid, self.grid, device=device, dtype=dtype)
+
+    def decode(self, h):
+        # h: [B, Ch, G, G] -> det_logit [B, 1, G, G], vel [B, 2, G, G] bounded to +-vmax
+        o = self.head(h)                                         # [B, 3, G, G]
+        det = o[:, 0:1]                                          # [B, 1, G, G] raw logit
+        vel = self.vmax * torch.tanh(o[:, 1:3])                  # [B, 2, G, G] px/level-frame
+        return det, vel
+
+    def forward(self, seq, h=None):
+        # seq: [B, T, Cin, G, G]  ->  det [B, T, 1, G, G], vel [B, T, 2, G, G]
+        B, T = seq.shape[0], seq.shape[1]
+        if h is None:
+            h = self.init_hidden(B, seq.device, seq.dtype)
+        dets, vels = [], []
+        for t in range(T):                                       # BPTT unrolls this loop
+            h = self.cell(seq[:, t], h)                          # [B, Ch, G, G] recurrent update
+            det, vel = self.decode(h)                            # per-frame readout
+            dets.append(det)
+            vels.append(vel)
+        det = torch.stack(dets, dim=1)                           # [B, T, 1, G, G]
+        vel = torch.stack(vels, dim=1)                           # [B, T, 2, G, G]
+        return det, vel
+
+
+# ---------------------------------------------------------------------------
+# Dense Gaussian-bump supervision (single target)
+# ---------------------------------------------------------------------------
+def bump_target(pos_xy, grid, sigma=1.0, device="cpu"):
+    """Toroidal Gaussian bump at (sub-cell) position pos_xy. By Claude on 06/21/2026
+    pos_xy: [B, T, 2] (x, y) in torus cells (may be fractional / out of [0,G) — wraps).
+    Returns det bump [B, T, 1, G, G] in [0,1]. Distance uses the WRAPPED (toroidal) metric so
+    a target near the edge still gets a single round bump that straddles the seam.
+    """
+    B, T = pos_xy.shape[0], pos_xy.shape[1]
+    coord = torch.arange(grid, device=device).float()           # [G]
+    gy, gx = torch.meshgrid(coord, coord, indexing='ij')        # [G, G] each
+    gx = gx[None, None]; gy = gy[None, None]                     # [1,1,G,G] broadcast over B,T
+    px = pos_xy[..., 0][..., None, None]                         # [B, T, 1, 1]
+    py = pos_xy[..., 1][..., None, None]                         # [B, T, 1, 1]
+    # wrapped (toroidal) coordinate difference: nearest image around the G-periodic grid
+    dx = (gx - px + grid / 2) % grid - grid / 2                  # [B, T, G, G] in [-G/2, G/2)
+    dy = (gy - py + grid / 2) % grid - grid / 2
+    g = torch.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))    # [B, T, G, G]
+    return g[:, :, None]                                         # [B, T, 1, G, G]
+
+
+def layer2_loss(det_logit, vel, det_t, vel_t, support=0.3, pos_weight=20.0):
+    """Detection BCE (sparse bump -> pos_weight) + velocity MSE on the bump support. By Claude 06/21
+    det_logit: [B,T,1,G,G] raw   det_t: [B,T,1,G,G] in [0,1]
+    vel:       [B,T,2,G,G]       vel_t: [B,T,2,G,G]  (px/level-frame; only used where det_t>support)
+    """
+    pw = torch.tensor(pos_weight, device=det_logit.device)
+    l_det = F.binary_cross_entropy_with_logits(det_logit, det_t, pos_weight=pw)
+    m = (det_t > support)                                        # [B,T,1,G,G] bump core mask
+    if m.any():
+        m2 = m.expand_as(vel)                                    # [B,T,2,G,G]
+        l_vel = F.mse_loss(vel[m2], vel_t[m2])
+    else:
+        l_vel = vel.sum() * 0.0
+    comp = {"det": float(l_det.detach()), "vel": float(l_vel.detach())}   # l_vel is a tensor on both branches
+    return l_det + 0.3 * l_vel, comp
+
+
+# ---------------------------------------------------------------------------
+# Smoke test: fake Layer-1-like field, single target on a wrapping straight line.
+# Verifies the module trains end-to-end (forward + BPTT + loss) BEFORE real Layer-1 fields.
+# This is NOT the real training data — that comes in the next step (trajectory-sequence gen).
+# ---------------------------------------------------------------------------
+def fake_field_batch(rng, B, T, grid, vmax, sigma=1.0, snr=4.0, device="cpu"):
+    """Build a toy 'Layer-1 field' sequence + truth. By Claude on 06/21/2026
+    A single target starts at a random torus cell, moves at constant (vx,vy) px/frame
+    (=> (vx,vy)/4 cells/frame), wrapping. The s-channel is a noisy Gaussian bump at the target;
+    Vx,Vy channels carry the true velocity over the bump core (bump > 0.3), 0 elsewhere; unlike the
+    s-channel they are noise-free. Returns:
+      seq    [B,T,3,G,G]  (s, Vx, Vy)
+      pos    [B,T,2]      target (x,y) in cells
+      veltru [B,T,2]      true (Vx,Vy) px/level-frame
+    """
+    seq = torch.zeros(B, T, 3, grid, grid, device=device)
+    pos = torch.zeros(B, T, 2, device=device)
+    veltru = torch.zeros(B, T, 2, device=device)
+    for b in range(B):
+        x0 = rng.uniform(0, grid); y0 = rng.uniform(0, grid)
+        ang = rng.uniform(0, 2 * np.pi); spd = rng.uniform(0.3, 1.0) * vmax
+        vx = spd * np.cos(ang); vy = spd * np.sin(ang)          # px/level-frame
+        for t in range(T):
+            cx = (x0 + vx / 4.0 * t)                             # cells (stride-4 => /4)
+            cy = (y0 + vy / 4.0 * t)
+            pos[b, t, 0] = cx % grid; pos[b, t, 1] = cy % grid
+            veltru[b, t, 0] = vx; veltru[b, t, 1] = vy
+        # s channel: noisy toroidal bump at the target; vel channels: truth over the bump
+        bump = bump_target(pos[b:b+1], grid, sigma, device)      # pos[b:b+1] is [1,T,2] -> [1,T,1,G,G]
+        bump = bump[0, :, 0]                                     # [T,G,G]
+        noise = torch.from_numpy(rng.standard_normal((T, grid, grid)).astype(np.float32)).to(device)
+        seq[b, :, 0] = (snr * bump + noise).clamp(min=0.0)       # s >= 0, SNR-scaled signal in noise
+        core = (bump > 0.3).float()                              # [T,G,G]
+        seq[b, :, 1] = vx * core; seq[b, :, 2] = vy * core
+    return seq, pos, veltru
+
+
+def smoke_test(steps=400, B=16, T=8, grid=32, vmax=1.4, device=None):
+    """Overfit the toy generator a few hundred steps; det peak should sharpen, vel MSE drop."""
+    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
+    rng = np.random.default_rng(0)
+    net = Layer2Net(ch_in=3, ch_hidden=24, grid=grid, vmax=vmax).to(device)
+    opt = torch.optim.Adam(net.parameters(), 2e-3)
+    nparams = sum(p.numel() for p in net.parameters())
+    print(f"Layer2Net: {nparams} params, grid={grid}, ch_hidden=24, device={device}", flush=True)
+    for step in range(1, steps + 1):
+        seq, pos, veltru = fake_field_batch(rng, B, T, grid, vmax, device=device)
+        det_t = bump_target(pos, grid, sigma=1.0, device=device)         # [B,T,1,G,G]
+        vel_t = torch.zeros(B, T, 2, grid, grid, device=device)
+        core = (det_t[:, :, 0] > 0.3)                                     # [B,T,G,G]
+        for c in range(2):
+            vel_t[:, :, c][core] = veltru[..., c][..., None, None].expand(-1, -1, grid, grid)[core]
+        det_logit, vel = net(seq)
+        loss, comp = layer2_loss(det_logit, vel, det_t, vel_t)
+        opt.zero_grad(); loss.backward(); opt.step()
+        if step % 50 == 0 or step == 1:
+            with torch.no_grad():
+                p = torch.sigmoid(det_logit)
+                peak = float(p[det_t > 0.3].mean())
+                bg = float(p[det_t < 0.05].max())
+            print(f"step {step:4d}  det {comp['det']:.4f}  vel {comp['vel']:.4f}  "
+                  f"peak(s@truth) {peak:.3f}  max-bg {bg:.3f}", flush=True)
+    print("smoke test done.", flush=True)
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--steps", type=int, default=400)
+    ap.add_argument("--grid", type=int, default=32)
+    ap.add_argument("--vmax", type=float, default=1.4)
+    a = ap.parse_args()
+    smoke_test(steps=a.steps, grid=a.grid, vmax=a.vmax)
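Reviewer note (not part of the commit): `bump_target` relies on the wrapped (toroidal) coordinate difference, and the module docstring relies on the drift staying far below the grid size. The sketch below restates both in plain Python so they can be checked without torch. `wrapped_delta`, `torus_bump` and `drift_cells` are illustrative names, not functions in `layer2.py`.

```python
import math


def wrapped_delta(a, b, G):
    """Signed shortest displacement from b to a on a G-periodic axis.

    Same expression as in bump_target; the result lies in [-G/2, G/2).
    """
    return (a - b + G / 2) % G - G / 2


def torus_bump(px, py, G, sigma=1.0):
    """G x G Gaussian bump at (px, py) using the wrapped metric, indexed [y][x]."""
    return [
        [
            math.exp(
                -(wrapped_delta(x, px, G) ** 2 + wrapped_delta(y, py, G) ** 2)
                / (2.0 * sigma * sigma)
            )
            for x in range(G)
        ]
        for y in range(G)
    ]


def drift_cells(v_px_per_frame, n_frames, stride=4):
    """Target drift in torus cells over n_frames when one cell spans `stride` scene px."""
    return v_px_per_frame / stride * n_frames
```

A target at x = 31.5 on a 32-cell axis sits half a cell from both column 31 and column 0, so the bump straddles the seam symmetrically. With the docstring's numbers (1.4 px/frame, stride 4, 8 frames) `drift_cells` gives about 2.8 cells, well below the 32-cell grid.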
--- a/layer2_data.py
+++ b/layer2_data.py
+# layer2_data.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 v1 training-data generator + torus border-crossing verifier. By Claude on 06/21/2026
+
+v1 = absolute minimum, to debug the infrastructure and the TORUS BORDER CROSSING (Andrey 06/21):
+  - one long, steady, STRAIGHT-LINE run; strong target; fade-in onset; single class;
+  - periodic 32x32-px scene torus (target position mod 32) + iid Gaussian noise;
+  - frozen Layer-1 (weighted9_pm, patch=24, N=9, GRID mode) run DENSELY at full resolution
+    (stride-1) via circular-unfold -> a 32x32 wrapped field {s, Vx, Vy} per level-frame;
+  - L2 ingests the full 32x32 field sequence (free-orbit; absolute pos = truth-only).
+
+Full-res rationale (vs stride-4 8x8): pixel-grid == torus-grid, so the truth bump and readout
+sit at the actual pixel — no stride-4 / dx,dy reconstruction to get wrong while we verify the
+seam. Cost is trivial at 32x32. Torus >= L1 patch (32 >= 24) so a wrap-slice holds the target
+ONCE (no L1 self-alias).
+
+L1 dense eval: circular-pad the periodic 32x32 scene by (12,11) per axis, unfold 24x24 stride-1
+-> 32x32 patches; output pixel j <-> target at scene index j (matches training cx0=P/2=12, so no
+half-pixel offset). Grid head decoded: s=sigmoid(det); (Vx,Vy)=softmax-centroid of the 11x11
+velocity grid / vel_decimate (px/level-frame).
+
+Run the verifier:  python layer2_data.py --l1 runs/weighted9_pm/model.pt
+"""
+
+import argparse
+import numpy as np
+import torch
+import torch.nn.functional as F
+import synth
+from model import RawFCN
+
+VEL_RADIUS = 5      # weighted9_pm: 11x11 velocity grid
+VEL_DECIMATE = 4    # 4 cells = 1 px/level-frame
+L1_PATCH = 24
+L1_HALF = L1_PATCH // 2   # 12; patch for output j spans original [j-12, j+11]
+
+
+# ---------------------------------------------------------------------------
+# Periodic scene rendering (numpy)
+# ---------------------------------------------------------------------------
+def halfcos_bump_torus(cx, cy, G):
+    """Half-cosine bump centered at (cx,cy) on a G x G TORUS (wraps both axes). By Claude 06/21
+    Separable cos(pi/3|d|) for |d|<1.5 (same shape as synth.halfcos_bump), but distance uses the
+    wrapped (toroidal) metric so a bump near the seam straddles it as one round bump."""
+    xs = np.arange(G)[None, :] - cx
+    ys = np.arange(G)[:, None] - cy
+    dx = (xs + G / 2) % G - G / 2          # wrapped delta in [-G/2, G/2)
+    dy = (ys + G / 2) % G - G / 2
+    bx = np.where(np.abs(dx) < 1.5, np.cos(np.pi / 3.0 * np.abs(dx)), 0.0)
+    by = np.where(np.abs(dy) < 1.5, np.cos(np.pi / 3.0 * np.abs(dy)), 0.0)
+    return (bx * by).astype(np.float32)    # [G,G]
+
+
+def gauss_blob_torus(cx, cy, G, sigma):
+    """Broad isotropic Gaussian blob on a G x G TORUS (wraps both axes). By Claude 06/22
+    Used for PERSISTENT CLUTTER: a wider-than-target scene feature that L1 lights up as a standing
+    low-freq cloud. Distance uses the wrapped (toroidal) metric (round blob across the seam)."""
+    xs = np.arange(G)[None, :] - cx
+    ys = np.arange(G)[:, None] - cy
+    dx = (xs + G / 2) % G - G / 2
+    dy = (ys + G / 2) % G - G / 2
+    return np.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)).astype(np.float32)   # [G,G]
+
+
+def _sample_velocity(rng, vmax):
+    """One (vx,vy) with 0.3*vmax < |v| <= vmax (px/level-frame), by rejection sampling from the
+    bounding square; falls back to (0.7*vmax, 0) after 50 rejected draws. By Claude 06/22"""
+    for _ in range(50):
+        vx = rng.uniform(-vmax, vmax); vy = rng.uniform(-vmax, vmax)
+        r2 = vx * vx + vy * vy
+        if (0.3 * vmax) ** 2 < r2 <= vmax * vmax:
+            return float(vx), float(vy)
+    return 0.7 * vmax, 0.0
+
+
+def _bandpass_envelope(rng, T, bp_lo, bp_hi, offset):
+    """Per-frame target amplitude multiplier from a limited band-pass filter. By Claude on 06/23/2026
+    white noise -> band-pass [bp_lo,bp_hi] cyc/seq -> unit-std -> +offset -> ReLU.
+    Band-pass removes flicker (HF) and DC (mean->0 so it crosses zero => GAPS where ReLU clips to 0);
+    the offset sets the duty cycle (~fraction present ≈ Phi(offset)). NOT capped: the amplitude is the
+    natural filter output, so it occasionally runs high (>1) and often sits moderate — random is
+    random (Andrey 06/23). The target bump is simply scaled by this per frame; everything else in the
+    scene stays exactly like L1 training (no clutter, single target)."""
+    w = rng.standard_normal(T)
+    Fw = np.fft.rfft(w)
+    fbin = np.arange(Fw.shape[0])                                # frequency in cycles per sequence
+    Fw[(fbin < bp_lo) | (fbin > bp_hi)] = 0.0                    # band-pass
+    x = np.fft.irfft(Fw, n=T).astype(np.float32)
+    x /= (x.std() + 1e-8)                                        # unit std -> offset is the duty knob
+    return np.maximum(0.0, x + offset).astype(np.float32)        # ReLU: 0 in gaps, uncapped when present
+
+
+def render_run(rng, T=64, G=32, vmax=1.4, snr=6.0, noise_prefix=(8, 24), fade=(4, 10),
+               p_abrupt=0.35, p_maneuver=0.0, turn_sigma=0.07,   # p_maneuver=0: CONSTANT velocity for now;
+               #            add maneuvering as a SEPARATE later step (one challenge at a time, Andrey 06/23)
+               gaps=True, bp_lo=3, bp_hi=9, duty_offset=-0.3, starter_len=8,
+               p_death=0.0, return_signal=False):   # p_death=0: targets IMMORTAL (always coast/hope);
+               #            mechanism kept (gated) to re-enable "voluntary death" later. Andrey 06/23
+    """One L2 training run: an L1-INPUT scene (same distribution L1 was trained on) of a SINGLE sharp
+    target, whose amplitude is multiplied PER FRAME by a band-pass envelope. By Claude 06/23 (v3)
+
+    The synthetic scene here is the L1 INPUT (raw scene -> frozen L1 -> L1 output field -> L2). It must
+    stay in L1's training distribution, so: iid-N(0,1) background + ONE half-cosine target, NO clutter,
+    no "bad"/extra targets (Andrey 06/23). The ONLY L2-specific change is the per-frame amplitude
+    multiplier from a limited band-pass filter, which creates fades and hard zero-streaks (gaps).
+
+    Returns:
+      frames  [T,G,G]  iid-N(0,1) background + the per-frame-amplitude-scaled target bump
+      pos     [T,2]    target (x,y) in px mod G; NaN when the target is truth-ABSENT
+      vel     [T,2]    per-frame instantaneous (vx,vy) px/level-frame (0 where absent)
+      present [T]      0/1 TRUTH-present flag (the supervision target) — 1 THROUGH gaps, 0 after death
+      signal  [T]      0/1 rendered-SIGNAL flag, returned only when return_signal=True (0 in a gap)
+
+    The two masks:
+      - TRUTH-present (`present`): what L2 must report. 1 from onset, stays 1 across signal gaps
+        (the target keeps moving), drops to 0 only at death (permanent disappearance).
+      - rendered-SIGNAL (internal `signal[t]`): whether the target bump is actually drawn into
+        `frames`. During a GAP the amplitude envelope is 0 while present stays 1 -> the frozen 9-frame
+        L1 window is starved -> L1's field goes dark THERE -> the ConvGRU must COAST on hidden state
+        + velocity to keep firing at the (moved) truth position.
+    """
+    frames = rng.standard_normal((T, G, G)).astype(np.float32)   # iid N(0,1) background (as L1 training)
+    pos = np.full((T, 2), np.nan, np.float32)
+    vel = np.zeros((T, 2), np.float32)
+    present = np.zeros(T, np.float32)
+    # (clutter / "bad targets" removed 06/23 — the L1-input scene is a SINGLE target on noise.)
+
+    # --- target trajectory: onset, abrupt/fade, straight/maneuvering, gaps, optional death
+    onset = int(rng.integers(noise_prefix[0], noise_prefix[1] + 1))
+    nfade = 1 if rng.random() < p_abrupt else int(rng.integers(fade[0], fade[1] + 1))
+    maneuver = rng.random() < p_maneuver
+    vx, vy = _sample_velocity(rng, vmax)
+    cx = rng.uniform(0, G); cy = rng.uniform(0, G)               # onset position (sub-pixel)
+
+    # death: with some prob the target permanently leaves after it has had time to lock; the tail
+    # is supervised ABSENT so L2 must RELEASE a dead track. Death-absence is long (to end of run);
+    # gaps are short (envelope zero-streaks). The net learns "coast a few frames, then release" from the
+    # contrast: an absence that ends quickly == gap (still present); one that never ends == death.
+    death_t = T
+    if rng.random() < p_death:
+        earliest = onset + nfade + 8                            # only after a real lock window
+        if earliest < T - 2:
+            death_t = int(rng.integers(earliest, T))
+
+    # amplitude envelope: smooth fades + hard zero-streaks (gaps) from ONE band-pass process.
+    # Full amplitude through a "starter" window (clean acquire) then modulated; env==0 => target
+    # bump not drawn => the frozen 9-frame L1 window is starved => the ConvGRU must COAST on hidden
+    # state + velocity. By Claude on 06/22/2026.
+    signal = np.zeros(T, np.float32)
+    env = np.zeros(T, np.float32)
+    if gaps:
+        env_bp = _bandpass_envelope(rng, T, bp_lo, bp_hi, duty_offset)  # [T] >=0, uncapped
+        acquire_end = onset + nfade + starter_len                # clean full-SNR acquire (no gaps)
+        env[onset:death_t] = env_bp[onset:death_t]
+        env[onset:min(death_t, acquire_end)] = 1.0               # starter clamp: focus on MAINTAIN
+    else:
+        env[onset:death_t] = 1.0                                 # no gaps: full amplitude (v1-like)
+
+    for t in range(onset, T):
+        k = t - onset
+        cxw = cx % G; cyw = cy % G
+        if t < death_t:
+            pos[t] = (cxw, cyw); vel[t] = (vx, vy); present[t] = 1.0
+            base = snr * min(1.0, (k + 1) / nfade)             # onset fade-in (nfade=1 => abrupt)
+            amp = base * env[t]                                # envelope: fades + GAPS(env=0) + camels
+            signal[t] = 1.0 if env[t] > 1e-6 else 0.0          # 0 in a gap => L1 starved => coast
+            if amp > 0:
+                frames[t] += amp * halfcos_bump_torus(cxw, cyw, G)
+        # advance the target (it keeps MOVING through gaps; truth pos stays correct)
+        cx += vx; cy += vy
+        if maneuver:                                           # smooth heading/speed random walk
+            ang = np.arctan2(vy, vx) + rng.normal(0.0, turn_sigma)
+            spd = float(np.hypot(vx, vy)) * float(np.exp(rng.normal(0.0, 0.05)))
+            spd = min(max(spd, 0.3 * vmax), vmax)
+            vx = spd * np.cos(ang); vy = spd * np.sin(ang)
+    if return_signal:
+        return frames, pos, vel, present, signal   # signal[t]=1 if bump rendered, 0 in a gap
+    return frames, pos, vel, present
+
+
+# ---------------------------------------------------------------------------
+# Frozen Layer-1 dense eval at full resolution (torch, GPU)
+# ---------------------------------------------------------------------------
+def l1_field_torus(net, window9, G, dev):
+    """Run frozen grid-mode L1 over a periodic 9-frame window -> 32x32 {s,Vx,Vy}. By Claude 06/21
+    window9: [N,G,G] (N=9, newest first to match training). Returns field [3,G,G] = (s, Vx, Vy),
+    Vx,Vy in px/level-frame. Stride-1 dense via circular-pad + unfold (output j <-> scene idx j)."""
+    N = window9.shape[0]
+    x = torch.from_numpy(window9[None]).to(dev)                       # [1,N,G,G]
+    # circular pad (left=12, right=11) both axes so a 24-patch centers on every output pixel
+    xp = F.pad(x, (L1_HALF, L1_PATCH - 1 - L1_HALF, L1_HALF, L1_PATCH - 1 - L1_HALF), mode='circular')
+    cols = F.unfold(xp, kernel_size=L1_PATCH)                        # [1, N*24*24, G*G]
+    L = cols.shape[-1]                                               # = G*G = 1024
+    cols = cols.reshape(N, L1_PATCH, L1_PATCH, L).permute(3, 0, 1, 2).contiguous()   # [L,N,24,24]
+    with torch.no_grad():
+        out = net(cols)[:, :, 0, 0]                                 # [L, 124]
+    det, vel, off = net.split(out.unsqueeze(-1).unsqueeze(-1))       # det[L,1,1], vel[L,121,1,1]...
+    s = torch.sigmoid(det.reshape(L))                               # [L] confidence
+    vdim = net.vdim
+    p = torch.softmax(vel.reshape(L, vdim * vdim), dim=1).reshape(L, vdim, vdim)   # [L, vy, vx]
+    cells = torch.arange(vdim, device=dev).float() - VEL_RADIUS     # -5..5
+    pvx = (p.sum(1) * cells).sum(1) / VEL_DECIMATE                  # px/level-frame (vx inner dim)
+    pvy = (p.sum(2) * cells).sum(1) / VEL_DECIMATE
+    field = torch.stack([s, pvx, pvy], 0).reshape(3, G, G)          # [3,G,G]
+    return field.cpu().numpy()
+
+
+def gen_field_sequence(net, frames, pos, G, N, dev):
+    """Slide the N-frame L1 window over a run -> field seq [T,3,G,G] aligned to each frame.
+    Frame t uses window [t, t-1, ..., t-(N-1)] (newest first); indices below 0 are clamped to 0, so
+    the first N-1 windows repeat frame 0. `pos` is currently unused. By Claude 06/21"""
+    T = frames.shape[0]
+    seq = np.zeros((T, 3, G, G), np.float32)
+    for t in range(T):
+        idx = [max(0, t - i) for i in range(N)]                     # newest first; clamp pre-roll
+        seq[t] = l1_field_torus(net, frames[idx], G, dev)
+    return seq
+
+
+# ---------------------------------------------------------------------------
+# Border-crossing verifier
+# ---------------------------------------------------------------------------
+def wrapped_err(a, b, G):
+    d = (a - b + G / 2) % G - G / 2
+    return float(np.hypot(d[0], d[1]))
+
+
+def verify(l1_path, T=64, G=32, vmax=1.4, snr=6.0, seed=0):
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    net, N, a = _load_l1(l1_path, dev)                               # same load/eval steps, shared helper
+    print(f"L1 {l1_path}: patch={a.get('patch',24)} N={N} grid; dev={dev}", flush=True)
+
+    rng = np.random.default_rng(seed)
+    frames, pos, vel, present = render_run(rng, T=T, G=G, vmax=vmax, snr=snr)
+    seq = gen_field_sequence(net, frames, pos, G, N, dev)            # [T,3,G,G]
+
+    vp = vel[present > 0] if (present > 0).any() else vel[:1]    # per-frame now ([T,2]); summarize
+    print(f"velocity truth (px/level-frame, mean over present): vx={vp[:,0].mean():+.3f} "
+          f"vy={vp[:,1].mean():+.3f}  |v|~{np.hypot(vp[:,0],vp[:,1]).mean():.3f}  (per-frame; maneuver varies)",
+          flush=True)
+    print("frame  present   truth(x,y)     L1 s-peak(x,y)   pos-err  s@peak  Vx,Vy@peak(decoded)  cross?", flush=True)
+    prev_peak = None
+    perr_present = []
+    for t in range(T):
+        s = seq[t, 0]
+        pj = int(np.argmax(s)); py, px = divmod(pj, G)
+        speak = s[py, px]
+        vxp, vyp = seq[t, 1, py, px], seq[t, 2, py, px]
+        cross = ""
+        if prev_peak is not None:
+            # seam-crossing flag: peak jumped across the wrap (large raw jump, small wrapped jump)
+            raw = np.hypot(px - prev_peak[0], py - prev_peak[1])
+            wrp = wrapped_err(np.array([px, py], float), np.array(prev_peak, float), G)
+            if raw - wrp > 4: cross = "  <-- SEAM"
+        prev_peak = (px, py)
+        if present[t]:
+            perr = wrapped_err(np.array([px, py], float), pos[t], G)
+            perr_present.append(perr)
+            tru = f"({pos[t,0]:5.1f},{pos[t,1]:5.1f})"
+        else:
+            perr = float('nan'); tru = "    --       "
+        print(f"{t:4d}    {int(present[t])}     {tru}    ({px:2d},{py:2d})        "
+              f"{perr:5.2f}   {speak:.3f}   {vxp:+.3f},{vyp:+.3f}{cross}", flush=True)
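+    pe = np.array(perr_present)
+    if pe.size == 0:                                                 # np.nanmax would raise on an empty array
+        print("\nno target-present frames in this run; position error not evaluated.", flush=True)
+        return seq, frames, pos, vel, present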
+    print(f"\nposition error over present frames: mean {np.nanmean(pe):.2f} px  max {np.nanmax(pe):.2f} px "
+          f"(over {len(pe)} frames)", flush=True)
+    print("PASS: L1 field peak tracks the target across the seam." if np.nanmean(pe) < 2.0
+          else "CHECK: peak/truth offset > 2px — inspect alignment (half-pixel? velocity sign?).", flush=True)
+    return seq, frames, pos, vel, present
+
+
+def _load_l1(l1_path, dev):
+    ck = torch.load(l1_path, map_location=dev)
+    a = ck.get("args", {})
+    N = a.get("nframes", 9)
+    net = RawFCN(n_frames=N, patch=a.get("patch", 24), velocity_mode="grid",
+                 vel_radius=a.get("vel_radius", VEL_RADIUS)).to(dev)
+    net.load_state_dict(ck["model"]); net.eval()
+    return net, N, a
+
+
+def display_run(l1_path, T=120, G=32, vmax=1.4, snr=6.0, seed=0, out="runs/l2_l1view", render_kw=None):
+    """Run L1 over a long run and write scrubbable TIFF stacks, PNG montages and a GIF. By Claude 06/21
+    Outputs (in `out/`): input.tif, s.tif, Vx.tif, Vy.tif, truth.tif, signal.tif (T-page 32-bit,
+    ImageJ); 2x2-tiled *_2x2.tif versions of all but signal; montage.png (12 evenly-spaced s-frames
+    with the truth position overlaid); montage_2x2.png; watch.gif.
+    render_kw forwards the gap-envelope knobs; prints a per-frame present/signal/L1-s@truth table so
+    the L1 stage can be VERIFIED on gap data IN ISOLATION before trusting L2. By Claude 06/23"""
+    import os
+    os.makedirs(out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    net, N, a = _load_l1(l1_path, dev)
+    rng = np.random.default_rng(seed)
+    frames, pos, vel, present, signal = render_run(rng, T=T, G=G, vmax=vmax, snr=snr,
+                                                   return_signal=True, **(render_kw or {}))
+    seq = gen_field_sequence(net, frames, pos, G, N, dev)            # [T,3,G,G] = s,Vx,Vy
+    vp = vel[present > 0] if (present > 0).any() else vel[:1]        # [.,2] per-frame velocity
+    vmean = float(np.hypot(vp[:, 0], vp[:, 1]).mean())
+    ncross = int(round((vmean * present.sum()) / G))                 # rough seam-crossing count
+    print(f"L1 {l1_path}: patch={a.get('patch',24)} N={N}; T={T}, |v|~{vmean:.3f} px/fr, "
+          f"~{ncross} seam crossings, target present {int(present.sum())}/{T} frames", flush=True)
+
+    truth = np.zeros((T, G, G), np.float32)                         # truth marker (target-shape bump)
+    for t in range(T):
+        if present[t]:
+            truth[t] = halfcos_bump_torus(pos[t, 0], pos[t, 1], G)
+    synth.save_tiff_stack(frames, f"{out}/input.tif")               # raw periodic scene
+    synth.save_tiff_stack(seq[:, 0], f"{out}/s.tif")                # L1 confidence field
+    synth.save_tiff_stack(seq[:, 1], f"{out}/Vx.tif")               # decoded Vx (px/level-frame)
+    synth.save_tiff_stack(seq[:, 2], f"{out}/Vy.tif")
+    synth.save_tiff_stack(truth, f"{out}/truth.tif")                # truth position (compare vs s)
+    synth.save_tiff_stack(np.broadcast_to(signal[:, None, None], (T, G, G)).astype(np.float32),
+                          f"{out}/signal.tif")                      # 1=bump rendered, 0=GAP (L1 starved)
+    print(f"wrote {out}/{{input,s,Vx,Vy,truth,signal}}.tif  ({T} pages each, 32-bit float, ImageJ)", flush=True)
+
+    # --- L1 VERIFICATION on gap data: per-frame present / signal / L1 s@truth + away-FP. The check:
+    #     in GAP frames (signal=0, present=1) L1 s@truth should COLLAPSE (L1 starved); when signal
+    #     returns it should re-lock. If L1 does NOT go dark in gaps, the L2 gap result is meaningless.
+    #     By Claude on 06/23/2026.
+    print("\nframe pres sig  L1_s@truth  L1_max_away   note", flush=True)
+    ng = int((present > 0).sum()); ngap = int(((present > 0) & (signal < 0.5)).sum())
+    for t in range(T):
+        if not present[t]:
+            continue
+        cx, cy = pos[t]
+        ci, cj = int(round(cy)) % G, int(round(cx)) % G
+        s_at = float(seq[t, 0, ci, cj])
+        mask = np.ones((G, G), bool)                                 # away = outside a 3px disk of truth
+        yy, xx = np.ogrid[:G, :G]
+        dwin = ((((xx - cj + G/2) % G - G/2)**2 + ((yy - ci + G/2) % G - G/2)**2) <= 9)
+        mask[dwin] = False
+        s_away = float(seq[t, 0][mask].max())
+        note = "GAP -> want s@truth low" if signal[t] < 0.5 else ""
+        # print sparsely: every gap frame + a few present frames
+        if signal[t] < 0.5 or t % 8 == 0:
+            print(f"{t:4d}  {int(present[t])}    {int(signal[t])}    {s_at:6.3f}      {s_away:6.3f}     {note}", flush=True)
+    print(f"\nsummary: present {ng}/{T} frames, of which {ngap} are GAP frames (signal=0).", flush=True)
+    print("VERIFY: gap-frame s@truth should be markedly LOWER than non-gap s@truth.", flush=True)
+
+    # 2x2-tiled stacks: the torus seam now runs through the CENTER CROSS (x=G, y=G) of the 2Gx2G
+    # image -> any seam discontinuity is glaring there. Clean torus => invisible cross. By Claude 06/21
+    def tile2x2(st): return np.tile(st, (1, 2, 2))                  # [T,G,G] -> [T,2G,2G]
+    for nm, st in [("input", frames), ("s", seq[:, 0]), ("Vx", seq[:, 1]),
+                   ("Vy", seq[:, 2]), ("truth", truth)]:
+        synth.save_tiff_stack(tile2x2(st), f"{out}/{nm}_2x2.tif")
+    print(f"wrote {out}/*_2x2.tif  (2Gx2G tiled; seam = center cross at {G},{G})", flush=True)
+    _tiled_montage(tile2x2(seq[:, 0]), G, f"{out}/montage_2x2.png")
+
+    import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+    idx = np.linspace(0, T - 1, 12).astype(int)
+    fig, axs = plt.subplots(3, 4, figsize=(13, 10))
+    for ax, t in zip(axs.ravel(), idx):
+        ax.imshow(seq[t, 0], vmin=0, vmax=1, cmap="magma", origin="upper")
+        if present[t]:
+            ax.plot(pos[t, 0], pos[t, 1], "c+", ms=12, mew=2)       # truth position (x,y)
+        ax.set_title(f"f{t}  {'tgt' if present[t] else 'noise'}", fontsize=9)
+        ax.set_xticks([]); ax.set_yticks([])
+    fig.suptitle(f"L1 confidence field s — {T}-frame run (cyan + = truth)  |v|~{vmean:.2f}px/fr")
+    fig.tight_layout(); fig.savefig(f"{out}/montage.png", dpi=90); plt.close(fig)  # release the figure, as _tiled_montage does
+    print(f"wrote {out}/montage.png", flush=True)
+
+    _save_gif(frames, seq[:, 0], pos, present, f"{out}/watch.gif")  # [input | s] + truth, animated
+    print(f"wrote {out}/watch.gif  ({T} frames, [input | L1 s] side-by-side, cyan + = truth)", flush=True)
+    return seq, frames, pos, vel, present
+
+
+def _tiled_montage(s2, G, path):
+    """8 frames of the 2x2-tiled s field with seam guide-lines at the center cross. By Claude 06/21
+    Continuity across the cyan guide lines == clean torus (no seam artifact)."""
+    import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+    T = s2.shape[0]; idx = np.linspace(0, T - 1, 8).astype(int)
+    fig, axs = plt.subplots(2, 4, figsize=(13, 7))
+    for ax, t in zip(axs.ravel(), idx):
+        ax.imshow(s2[t], vmin=0, vmax=1, cmap="magma", origin="upper")
+        ax.axvline(G - 0.5, color="cyan", lw=0.6, alpha=0.6)        # seam (center cross)
+        ax.axhline(G - 0.5, color="cyan", lw=0.6, alpha=0.6)
+        ax.set_title(f"f{t}", fontsize=9); ax.set_xticks([]); ax.set_yticks([])
+    fig.suptitle("2x2-tiled L1 s — seam = cyan center cross; continuous across it => clean torus")
+    fig.tight_layout(); fig.savefig(path, dpi=90); plt.close(fig)
+    print(f"wrote {path}", flush=True)
+
+
+def _save_gif(frames, sfield, pos, present, path, up=8, dur=120):
+    """Animated [input | s] with cyan truth marker, upscaled x`up` for watchability. By Claude 06/21"""
+    import matplotlib
+    from PIL import Image, ImageDraw
+    T, G, _ = frames.shape
+    fn = (frames - frames.min()) / (np.ptp(frames) + 1e-9)           # global gray-normalize input
+    gray = matplotlib.colormaps["gray"]; mag = matplotlib.colormaps["magma"]
+    pages = []
+    for t in range(T):
+        li = (gray(fn[t])[..., :3] * 255).astype(np.uint8)          # [G,G,3] input
+        ls = (mag(np.clip(sfield[t], 0, 1))[..., :3] * 255).astype(np.uint8)  # [G,G,3] s in [0,1]
+        sep = np.full((G, 2, 3), 60, np.uint8)
+        row = np.concatenate([li, sep, ls], axis=1)                 # [G, 2G+2, 3]
+        im = Image.fromarray(row).resize((row.shape[1] * up, G * up), Image.NEAREST)
+        if present[t]:
+            d = ImageDraw.Draw(im)
+            for x0 in (0, (G + 2) * up):                            # mark both panels
+                cx = x0 + pos[t, 0] * up; cy = pos[t, 1] * up
+                d.line([(cx - 6, cy), (cx + 6, cy)], fill=(0, 255, 255), width=2)
+                d.line([(cx, cy - 6), (cx, cy + 6)], fill=(0, 255, 255), width=2)
+        pages.append(im)
+    pages[0].save(path, save_all=True, append_images=pages[1:], duration=dur, loop=0)
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--mode", choices=["verify", "display"], default="verify")
+    ap.add_argument("--l1", default="runs/weighted9_pm/model.pt")
+    ap.add_argument("--T", type=int, default=64); ap.add_argument("--G", type=int, default=32)
+    ap.add_argument("--vmax", type=float, default=1.4); ap.add_argument("--snr", type=float, default=6.0)
+    ap.add_argument("--seed", type=int, default=0); ap.add_argument("--out", default="runs/l2_l1view")
+    # gap-envelope knobs for the L1-on-gaps verification (display mode). By Claude 06/23
+    ap.add_argument("--gaps", action="store_true", help="render band-pass amplitude gaps (the fancy data)")
+    ap.add_argument("--bp_lo", type=int, default=3); ap.add_argument("--bp_hi", type=int, default=9)
+    ap.add_argument("--duty_offset", type=float, default=-0.3); ap.add_argument("--starter_len", type=int, default=8)
+    a = ap.parse_args()
+    if a.mode == "verify":
+        verify(a.l1, T=a.T, G=a.G, vmax=a.vmax, snr=a.snr, seed=a.seed)
+    else:
+        render_kw = dict(gaps=a.gaps, bp_lo=a.bp_lo, bp_hi=a.bp_hi,
+                         duty_offset=a.duty_offset, starter_len=a.starter_len)
+        display_run(a.l1, T=a.T, G=a.G, vmax=a.vmax, snr=a.snr, seed=a.seed, out=a.out, render_kw=render_kw)
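The border-crossing verifier in layer2_data.py rests on two small pieces of arithmetic: a distance wrapped onto the G-periodic torus, and a seam flag raised when the raw peak jump exceeds the wrapped jump by more than 4 px. The following is a minimal standalone sketch of both, written in plain Python rather than NumPy so it runs without dependencies. The names `wrap`, `torus_dist` and `crossed_seam` are illustrative and do not exist in the repository.

```python
import math


def wrap(d, G):
    """Map a coordinate difference onto [-G/2, G/2) for a G-periodic axis."""
    return (d + G / 2) % G - G / 2


def torus_dist(a, b, G):
    """Shortest distance between (x, y) points a and b on a G x G torus."""
    return math.hypot(wrap(a[0] - b[0], G), wrap(a[1] - b[1], G))


def crossed_seam(peak, prev_peak, G, margin=4.0):
    """True when the raw jump is much larger than the wrapped jump, i.e. the
    peak moved a short way on the torus but landed on the far side of the image."""
    raw = math.hypot(peak[0] - prev_peak[0], peak[1] - prev_peak[1])
    return raw - torus_dist(peak, prev_peak, G) > margin


G = 32
print(torus_dist((31, 0), (1, 0), G))       # neighbours across the x seam
print(crossed_seam((1, 0), (31, 0), G))     # raw jump 30 px, wrapped jump 2 px
print(crossed_seam((12, 11), (10, 10), G))  # ordinary interior move
```

The `margin` default mirrors the `raw - wrp > 4` test in `verify`; any other value would be a tuning choice, not something the source specifies.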
--- a/layer2_eval.py
+++ b/layer2_eval.py
+# layer2_eval.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 testing/visualization on same-generator input. By Claude 06/22/2026
+
+Loads a trained Layer2Net + frozen L1, generates a same-generator test sequence (render_run ->
+frozen-L1 field), runs L2, and writes 32-bit float TIFF stacks AND 2x2-tiled versions (seam =
+center cross, same check we used for L1) so the L2 output can be watched the same way:
+  L1_s         = L1 input confidence field
+  L2_det       = L2 track-before-detect output (sigmoid)
+  L2_Vx, L2_Vy = L2 velocity fields (px/level-frame)
+  truth        = target-shape bump at truth position
+Also prints per-frame lock / FP metrics. Run on DGX:
+  python layer2_eval.py --l2 runs/l2_v1/model.pt --T 120 --out runs/l2_v1/test
+"""
+
+import argparse
+import numpy as np
+import torch
+import synth
+import layer2_data as L1D
+from layer2 import Layer2Net
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--l1", default="runs/weighted9_pm/model.pt")
+    ap.add_argument("--l2", default="runs/l2_v1/model.pt")
+    ap.add_argument("--T", type=int, default=120); ap.add_argument("--seed", type=int, default=777)
+    ap.add_argument("--snr", type=float, default=6.0); ap.add_argument("--out", default="runs/l2_v1/test")
+    a = ap.parse_args()
+    import os; os.makedirs(a.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+    net1, N, _ = L1D._load_l1(a.l1, dev)
+    ck = torch.load(a.l2, map_location=dev); la = ck.get("args", {})
+    G = la.get("G", 32); vmax = la.get("vmax", 1.4)
+    net2 = Layer2Net(ch_in=3, ch_hidden=la.get("ch", 24), grid=G, vmax=vmax).to(dev)
+    net2.load_state_dict(ck["model"]); net2.eval()
+    print(f"L1={a.l1} (N={N})  L2={a.l2} (ch={la.get('ch',24)}, G={G})  dev={dev}", flush=True)
+
+    rng = np.random.default_rng(a.seed)
+    frames, pos, vel, present = L1D.render_run(rng, T=a.T, G=G, vmax=vmax, snr=a.snr)
+    seq = L1D.gen_field_sequence(net1, frames, pos, G, N, dev)        # [T,3,G,G]
+    with torch.no_grad():
+        det_logit, velo = net2(torch.from_numpy(seq[None]).to(dev))
+        l2det = torch.sigmoid(det_logit)[0, :, 0].cpu().numpy()       # [T,G,G]
+        l2vx = velo[0, :, 0].cpu().numpy(); l2vy = velo[0, :, 1].cpu().numpy()  # [T,G,G] px/level-frame
+    truth = np.zeros((a.T, G, G), np.float32)
+    for t in range(a.T):
+        if present[t]:
+            truth[t] = L1D.halfcos_bump_torus(pos[t, 0], pos[t, 1], G)
+
+    L1_s = seq[:, 0]
+    stacks = {"L1_s": L1_s, "L2_det": l2det, "L2_Vx": l2vx, "L2_Vy": l2vy, "truth": truth}
+    for nm, st in stacks.items():
+        synth.save_tiff_stack(st, f"{a.out}/{nm}.tif")
+        synth.save_tiff_stack(np.tile(st, (1, 2, 2)), f"{a.out}/{nm}_2x2.tif")
+    print(f"wrote {a.out}/{{L1_s,L2_det,L2_Vx,L2_Vy,truth}}{{,_2x2}}.tif ({a.T} pages, 32-bit float)", flush=True)
+
+    # velocity accuracy: L2 (Vx,Vy) read at the detected peak vs truth velocity
+    verr = []
+    for t in range(a.T):
+        if not present[t]:
+            continue
+        pj = int(np.argmax(l2det[t])); py, px = divmod(pj, G)
+        verr.append((l2vx[t, py, px] - vel[t, 0], l2vy[t, py, px] - vel[t, 1]))  # vel is per-frame [T,2]
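+    verr = np.array(verr[3:], dtype=float).reshape(-1, 2)  # skip first 3 present frames (pre-lock); [K,2] even if K=0
+    vp = vel[present > 0]
+    if len(verr):                                          # needs at least 4 target-present frames
+        print(f"truth vel (mean over present) = ({vp[:,0].mean():+.3f},{vp[:,1].mean():+.3f}) px/level-frame;  "
+              f"L2 vel@peak error: mean |dV| {np.hypot(verr[:,0],verr[:,1]).mean():.3f}  "
+              f"(bias {verr[:,0].mean():+.3f},{verr[:,1].mean():+.3f})", flush=True)
+    else:
+        print("too few target-present frames to score L2 velocity.", flush=True)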
+    L1D._tiled_montage(np.tile(l2det, (1, 2, 2)), G, f"{a.out}/L2_det_2x2.png")
+
+    # L1 -> L2 -> truth comparison montage (the cloud->clean-blob transformation)
+    import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+    idx = np.linspace(0, a.T - 1, 7).astype(int)
+    fig, axs = plt.subplots(3, 7, figsize=(15, 6.5))
+    for j, t in enumerate(idx):
+        for r, (img, lab) in enumerate([(L1_s[t], "L1 s"), (l2det[t], "L2 det"), (truth[t], "truth")]):
+            axs[r, j].imshow(img, vmin=0, vmax=1, cmap="magma", origin="upper")
+            axs[r, j].set_xticks([]); axs[r, j].set_yticks([])
+            if j == 0: axs[r, j].set_ylabel(lab, fontsize=11)
+        axs[0, j].set_title(f"f{t} {'tgt' if present[t] else 'noise'}", fontsize=9)
+    fig.suptitle("L1 input (clouds) -> L2 detection (clean) -> truth")
+    fig.tight_layout(); fig.savefig(f"{a.out}/compare.png", dpi=90); plt.close(fig)
+    print(f"wrote {a.out}/compare.png", flush=True)
+
+    # per-frame metrics: s@truth (lock) vs HONEST FP = max bg excluding a radius-R disk around truth
+    yy, xx = np.mgrid[0:G, 0:G]; R = 4.0
+    print("\nframe present  s@truth  FP(>4px from tgt)  note", flush=True)
+    locked = None; onset = int(np.argmax(present)) if present.any() else 0
+    fp_present = []
+    for t in range(a.T):
+        core = truth[t] > 0.3
+        sat = float(l2det[t][core].mean()) if core.any() else float('nan')
+        if present[t]:
+            dx = (xx - pos[t, 0] + G / 2) % G - G / 2; dy = (yy - pos[t, 1] + G / 2) % G - G / 2
+            far = (dx * dx + dy * dy) > R * R                        # exclude target neighborhood
+        else:
+            far = np.ones((G, G), bool)
+        fp = float(l2det[t][far].max()) if far.any() else 0.0
+        if present[t]: fp_present.append(fp)
+        note = ""
+        if present[t] and locked is None and sat > 0.5:
+            locked = t; note = f"<- LOCK (+{t - onset} fr)"
+        if t % 8 == 0 or note:
+            print(f"{t:4d}  {'tgt' if present[t] else 'noise':5s}   {sat:6.3f}        {fp:6.3f}      {note}", flush=True)
+    print(f"\nlock frame: {locked} (onset {onset});  FP on present frames (first 2 skipped): "
+          f"mean {np.mean(fp_present[2:] or [0]):.3f} max {np.max(fp_present[2:] or [0]):.3f}", flush=True)
+
+
+if __name__ == "__main__":
+    main()
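The per-frame false-positive figure in layer2_eval.py is the largest L2 response outside a radius-R disk around the truth position, with the disk measured on the torus so that it wraps across the seam. A hedged sketch of that masking in plain Python (nested lists instead of NumPy arrays) follows. The function name `max_fp_away` is hypothetical, and the sketch assumes a non-negative field such as a sigmoid output.

```python
def max_fp_away(field, pos, R=4.0):
    """Largest value of a G x G field (list of rows, indexed [y][x]) lying more
    than R px from pos = (x, y), with distances wrapped on the G-periodic torus.
    Assumes field values are >= 0 (e.g. sigmoid outputs)."""
    G = len(field)
    best = 0.0
    for y in range(G):
        for x in range(G):
            dx = (x - pos[0] + G / 2) % G - G / 2
            dy = (y - pos[1] + G / 2) % G - G / 2
            if dx * dx + dy * dy > R * R:        # outside the excluded disk
                best = max(best, field[y][x])
    return best


G = 32
field = [[0.0] * G for _ in range(G)]
field[0][0] = 0.9      # response at the truth position (x=0, y=0)
field[31][31] = 0.8    # about 1.4 px from truth once wrapped: inside the disk
field[16][16] = 0.2    # genuine far-field response
print(max_fp_away(field, pos=(0.0, 0.0)))
```

Without the wrap, the response at (31, 31) would be counted as a false positive even though it belongs to a target sitting on the seam.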
--- a/layer2_gapcheck.py
+++ b/layer2_gapcheck.py
+# layer2_gapcheck.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 gap-coast diagnostic — the direct test for the wild-test failure. By Claude 06/22/2026
+
+The 2026-06-22 wild test found L2 drops the track the instant the L1 signal does (v1 had no gaps in
+training, so the recurrence never learned to coast). This script builds a DETERMINISTIC run with a
+single, explicit signal GAP in the middle of an otherwise clean straight track — no clutter, no
+death, no maneuver — runs frozen L1 then trained L2, and reports, frame by frame:
+
+  L1 s@truth   — frozen-L1 confidence AT the (moving) truth position. COLLAPSES inside the gap.
+  L2 s@truth   — trained-L2 confidence AT the truth position.        Should COAST (stay high).
+  L2 pos-err   — wrapped distance from L2 peak to truth (px).        Should stay small in the gap.
+
+PASS = L2 holds the track through the gap while L1 has gone dark. That is the memory doing its job.
+
+Run on DGX:  python layer2_gapcheck.py --l2 runs/l2_v2/model.pt --gap 18 26
+"""
+
+import argparse
+import numpy as np
+import torch
+import synth
+import layer2_data as L1D
+from layer2 import Layer2Net
+
+
+def render_gap(T, G, vmax, snr, gap, onset=8, seed=0):
+    """Clean straight track with ONE explicit signal gap [gap0,gap1). By Claude 06/22
+    Returns frames[T,G,G], pos[T,2] (truth, defined from onset), vel[2], present[T], gapmask[T].
+    The target keeps MOVING through the gap (truth pos advances); only the rendered bump is removed,
+    so the frozen 9-frame L1 window is starved while truth-present stays 1."""
+    rng = np.random.default_rng(seed)
+    frames = rng.standard_normal((T, G, G)).astype(np.float32)
+    pos = np.full((T, 2), np.nan, np.float32)
+    present = np.zeros(T, np.float32)
+    gapmask = np.zeros(T, bool); gapmask[gap[0]:gap[1]] = True
+    vx, vy = L1D._sample_velocity(rng, vmax)
+    x0 = rng.uniform(0, G); y0 = rng.uniform(0, G)
+    for t in range(onset, T):
+        k = t - onset
+        cx = (x0 + vx * k) % G; cy = (y0 + vy * k) % G
+        pos[t] = (cx, cy); present[t] = 1.0
+        if not gapmask[t]:                                    # gap => bump NOT rendered (L1 starves)
+            frames[t] += snr * L1D.halfcos_bump_torus(cx, cy, G)
+    return frames, pos, np.array([vx, vy], np.float32), present, gapmask
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--l1", default="runs/weighted9_pm/model.pt")
+    ap.add_argument("--l2", default="runs/l2_v2/model.pt")
+    ap.add_argument("--T", type=int, default=40); ap.add_argument("--gap", type=int, nargs=2, default=[18, 26])
+    ap.add_argument("--snr", type=float, default=6.0); ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--out", default="runs/l2_v2/gapcheck")
+    a = ap.parse_args()
+    import os; os.makedirs(a.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+    net1, N, _ = L1D._load_l1(a.l1, dev)
+    ck = torch.load(a.l2, map_location=dev); la = ck.get("args", {})
+    G = la.get("G", 32); vmax = la.get("vmax", 1.4)
+    net2 = Layer2Net(ch_in=3, ch_hidden=la.get("ch", 24), grid=G, vmax=vmax).to(dev)
+    net2.load_state_dict(ck["model"]); net2.eval()
+    print(f"L1={a.l1} (N={N})  L2={a.l2} (ch={la.get('ch',24)}, G={G})  gap={a.gap}  dev={dev}", flush=True)
+
+    frames, pos, vel, present, gap = render_gap(a.T, G, vmax, a.snr, a.gap, seed=a.seed)
+    seq = L1D.gen_field_sequence(net1, frames, pos, G, N, dev)        # [T,3,G,G]
+    with torch.no_grad():
+        det_logit, _ = net2(torch.from_numpy(seq[None]).to(dev))
+        l2 = torch.sigmoid(det_logit)[0, :, 0].cpu().numpy()          # [T,G,G]
+    l1s = seq[:, 0]
+
+    truth = np.zeros((a.T, G, G), np.float32)
+    for t in range(a.T):
+        if present[t]: truth[t] = L1D.halfcos_bump_torus(pos[t, 0], pos[t, 1], G)
+    for nm, st in [("L1_s", l1s), ("L2_det", l2), ("truth", truth)]:
+        synth.save_tiff_stack(st, f"{a.out}/{nm}.tif")
+    print(f"wrote {a.out}/{{L1_s,L2_det,truth}}.tif ({a.T} pages)\n", flush=True)
+
+    print("frame  present  in_gap   L1 s@truth   L2 s@truth   L2 pos-err", flush=True)
+    pre, ingap = [], []
+    for t in range(a.T):
+        core = truth[t] > 0.3
+        l1v = float(l1s[t][core].mean()) if core.any() else float('nan')
+        l2v = float(l2[t][core].mean()) if core.any() else float('nan')
+        pj = int(np.argmax(l2[t])); py, px = divmod(pj, G)
+        perr = L1D.wrapped_err(np.array([px, py], float), pos[t], G) if present[t] else float('nan')
+        flag = "GAP" if gap[t] else ("tgt" if present[t] else "noise")
+        print(f"{t:4d}     {int(present[t])}     {flag:5s}    {l1v:7.3f}      {l2v:7.3f}      {perr:6.2f}", flush=True)
+        if present[t] and not gap[t] and t < a.gap[0]:
+            pre.append((l1v, l2v))
+        if gap[t]:
+            ingap.append((l1v, l2v, perr))
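+    pre = np.array(pre, dtype=float).reshape(-1, 2); ingap = np.array(ingap, dtype=float).reshape(-1, 3)
+    if not len(pre) or not len(ingap):                            # empty gap, or gap starts at/before onset
+        raise SystemExit("need at least one locked pre-gap frame and one gap frame; check --gap and --T")
+    l1_drop = ingap[:, 0].mean(); l2_hold = ingap[:, 1].mean(); perr_gap = np.nanmean(ingap[:, 2])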
+    print(f"\nbefore gap (locked):  L1 s@truth {pre[:,0].mean():.3f}   L2 s@truth {pre[:,1].mean():.3f}", flush=True)
+    print(f"inside gap        :  L1 s@truth {l1_drop:.3f}   L2 s@truth {l2_hold:.3f}   L2 pos-err {perr_gap:.2f}px", flush=True)
+    # PASS: L1 collapsed in the gap (signal truly gone) AND L2 coasted (held high, near truth)
+    coasted = (l1_drop < 0.5) and (l2_hold > 0.5) and (perr_gap < 4.0)
+    print("COAST PASS — L2 holds the track through the gap while L1 goes dark." if coasted
+          else "COAST FAIL — L2 did not coast (still follows the live L1 signal). Train w/ stronger gaps.",
+          flush=True)
+
+
+if __name__ == "__main__":
+    main()
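The COAST PASS decision in layer2_gapcheck.py combines three gap-frame averages: L1 confidence at truth must fall below 0.5, L2 confidence at truth must stay above 0.5, and the L2 peak must stay within 4 px of truth. The sketch below restates that rule as a standalone function. The helper names are hypothetical, and skipping NaN in all three averages is an assumption of this sketch (the script itself skips NaN only for the position error). The sample numbers are invented for illustration and are not measured results.

```python
import math


def nanmean(values):
    """Mean that skips NaN entries; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


def coast_verdict(l1_gap, l2_gap, perr_gap, s_thresh=0.5, err_thresh=4.0):
    """True when L1 went dark in the gap while L2 stayed high and near truth."""
    l1_drop = nanmean(l1_gap)     # want low: the signal is really gone
    l2_hold = nanmean(l2_gap)     # want high: the recurrence is coasting
    perr = nanmean(perr_gap)      # want small: the coasted peak follows truth
    return l1_drop < s_thresh and l2_hold > s_thresh and perr < err_thresh


# Toy gap-frame readings (made up): one run that coasts, one that drops the track.
coasting = coast_verdict([0.08, 0.05, 0.04], [0.81, 0.77, 0.72], [0.6, 1.1, 1.4])
dropping = coast_verdict([0.08, 0.05, 0.04], [0.30, 0.12, 0.05], [0.6, 9.0, 14.0])
print(coasting, dropping)
```

Requiring the L1 collapse as part of the verdict guards against a vacuous pass: if L1 never went dark, a high L2 value says nothing about memory.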
--- a/layer2_train.py
+++ b/layer2_train.py
+# layer2_train.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 v1 training: track-before-detect recurrent over frozen-L1 fields. By Claude 06/22/2026
+
+(a) first end-to-end L2 train on the CLEAN steady runs (Andrey 06/21). Layer 1 is FROZEN, so the
+field sequences are PRE-COMPUTED ONCE into a cache (stage2.py pattern), then Layer2Net trains fast
+on the cache via BPTT. Supervision per frame:
+  - target PRESENT  -> det = toroidal Gaussian bump at truth pos; (Vx,Vy) at the bump support;
+  - target ABSENT (noise prefix) -> det = 0 everywhere  == the false-positive-suppression signal.
+The net must (i) stay quiet on the noise prefix + clutter clouds, (ii) lock onto the faded-in
+target as coherent evidence accumulates (survival), (iii) hold + report velocity across seams.
+
+Step rate = raw per-level-frame (decimation / stride-2-overlap parked for later). Run on DGX:
+  python layer2_train.py --l1 runs/weighted9_pm/model.pt --nseq 128 --T 48 --steps 4000 --out runs/l2_v1
+"""
+
+import argparse
+import numpy as np
+import torch
+import torch.nn.functional as F
+import synth
+import layer2_data as L1D
+from layer2 import Layer2Net, bump_target, layer2_loss
+
+
+def build_cache(net, N, nseq, T, G, vmax, snr, dev, seed=0, render_kw=None):
+    """Pre-compute nseq frozen-L1 field sequences + truth. By Claude 06/22
+    Returns a list of nseq tuples (seq[T,3,G,G], pos[T,2], vel[T,2], present[T]), all torch on dev;
+    NaNs in pos are replaced by 0 (np.nan_to_num).
+    render_kw overrides render_run realism knobs (e.g. --v1 zeroes gaps/clutter). By Claude 06/22"""
+    rng = np.random.default_rng(seed)
+    render_kw = render_kw or {}
+    cache = []
+    for k in range(nseq):
+        frames, pos, vel, present = L1D.render_run(rng, T=T, G=G, vmax=vmax, snr=snr, **render_kw)
+        seq = L1D.gen_field_sequence(net, frames, pos, G, N, dev)     # [T,3,G,G]
+        cache.append((torch.from_numpy(seq).to(dev),
+                      torch.from_numpy(np.nan_to_num(pos)).float().to(dev),
+                      torch.from_numpy(vel).to(dev),
+                      torch.from_numpy(present).to(dev)))
+        if (k + 1) % 32 == 0:
+            print(f"  cache {k+1}/{nseq}", flush=True)
+    return cache
+
+
+def make_targets(batch, G, dev, sigma=1.5):
+    """Stack a minibatch -> (seq, det_t, vel_t, present). By Claude 06/22
+    det_t: bump at truth, zeroed on absent frames. vel_t: per-frame velocity broadcast over the grid."""
+    seq = torch.stack([b[0] for b in batch], 0)                      # [B,T,3,G,G]
+    pos = torch.stack([b[1] for b in batch], 0)                      # [B,T,2]
+    vel = torch.stack([b[2] for b in batch], 0)                      # [B,T,2] per-frame (v2: maneuver)
+    present = torch.stack([b[3] for b in batch], 0)                  # [B,T] truth-present (1 thru gaps)
+    det_t = bump_target(pos, G, sigma=sigma, device=dev)             # [B,T,1,G,G]
+    det_t = det_t * present[:, :, None, None, None]                  # zero where no target (prefix/death)
+    B, T = present.shape
+    vel_t = vel[:, :, :, None, None].expand(B, T, 2, G, G).contiguous()      # [B,T,2,G,G]
+    return seq, det_t, vel_t, present
+
+
+def evaluate(net, batch, G, dev, sigma=1.5):
+    """Lock/FP metrics pooled over a held-out batch -> (peak, bg_absent, bg_present). By Claude 06/22
+    sigma must match training supervision so the 'core' mask matches the bump. By Claude 06/22"""
+    seq, det_t, vel_t, present = make_targets(batch, G, dev, sigma=sigma)
+    with torch.no_grad():
+        det_logit, vel = net(seq)
+        p = torch.sigmoid(det_logit)[:, :, 0]                        # [B,T,G,G]
+    pres = present.bool()
+    core = det_t[:, :, 0] > 0.3                                      # [B,T,G,G] truth bump core
+    peak = float(p[core].mean()) if core.any() else 0.0             # s at truth (want ->1)
+    absent = ~pres                                                   # noise-prefix frames
+    bg_absent = float(p[absent].amax()) if absent.any() else 0.0    # worst FP on noise (want ->0)
+    # background on PRESENT frames, away from the target (clutter-cloud FP)
+    bg_present = float((p * (~core) * pres[:, :, None, None]).amax())
+    return peak, bg_absent, bg_present
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--l1", default="runs/weighted9_pm/model.pt")
+    ap.add_argument("--nseq", type=int, default=128); ap.add_argument("--T", type=int, default=48)
+    ap.add_argument("--G", type=int, default=32); ap.add_argument("--vmax", type=float, default=1.4)
+    ap.add_argument("--snr", type=float, default=6.0); ap.add_argument("--steps", type=int, default=4000)
+    ap.add_argument("--bs", type=int, default=8); ap.add_argument("--lr", type=float, default=2e-3)
+    ap.add_argument("--ch", type=int, default=24); ap.add_argument("--out", default="runs/l2_v1")
+    # --- loss / supervision knobs (Option A artifact check). By Claude 06/22 ---
+    ap.add_argument("--sigma", type=float, default=1.5, help="supervision bump sigma (A: try 0.7-0.8)")
+    ap.add_argument("--pos_weight", type=float, default=30.0, help="BCE positive weight (A: try 3-8 AFTER sigma)")
+    ap.add_argument("--v1", action="store_true", help="v1-equivalent data: no gaps/maneuver/abrupt/death/clutter (isolate the sigma sweep)")
+    a = ap.parse_args()
+    import os; os.makedirs(a.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    # --v1 zeroes the v2 realism knobs so Option A varies ONLY the supervision width. By Claude 06/22
+    render_kw = dict(p_gap=0.0, p_maneuver=0.0, p_abrupt=0.0, p_death=0.0, clutter_n=(0, 0)) if a.v1 else {}
+    net1, N, _ = L1D._load_l1(a.l1, dev)
+    print(f"L1 {a.l1} N={N}; precomputing {a.nseq} field sequences (T={a.T}, G={a.G}) "
+          f"sigma={a.sigma} pos_weight={a.pos_weight} v1={a.v1}...", flush=True)
+    cache = build_cache(net1, N, a.nseq, a.T, a.G, a.vmax, a.snr, dev, render_kw=render_kw)
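+    nval = max(8, a.nseq // 8); val = cache[:nval]; train = cache[nval:]
+    if not train:                                                    # --nseq <= 8 would leave nothing to sample
+        raise SystemExit(f"--nseq {a.nseq} leaves no training sequences after the {nval}-sequence val split")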
+    print(f"cache: {len(train)} train / {len(val)} val sequences", flush=True)
+
+    net = Layer2Net(ch_in=3, ch_hidden=a.ch, grid=a.G, vmax=a.vmax).to(dev)
+    opt = torch.optim.Adam(net.parameters(), a.lr)
+    nparams = sum(p.numel() for p in net.parameters())
+    print(f"Layer2Net {nparams} params; training {a.steps} steps bs={a.bs}", flush=True)
+    rng = np.random.default_rng(1)
+    for step in range(1, a.steps + 1):
+        idx = rng.integers(0, len(train), a.bs)
+        seq, det_t, vel_t, present = make_targets([train[i] for i in idx], a.G, dev, sigma=a.sigma)
+        det_logit, vel = net(seq)
+        loss, comp = layer2_loss(det_logit, vel, det_t, vel_t, pos_weight=a.pos_weight)
+        opt.zero_grad(); loss.backward(); opt.step()
+        if step % 250 == 0 or step == 1:
+            peak, bga, bgp = evaluate(net, val, a.G, dev, sigma=a.sigma)
+            print(f"step {step:5d}  det {comp['det']:.4f} vel {comp['vel']:.4f}  | "
+                  f"val: s@truth {peak:.3f}  max-FP(noise) {bga:.3f}  max-FP(clutter) {bgp:.3f}", flush=True)
+    torch.save({"model": net.state_dict(), "args": vars(a)}, f"{a.out}/model.pt")
+    print(f"saved {a.out}/model.pt", flush=True)
+
+    # eval viz: run trained L2 on a fresh sequence, dump tiffs to watch track-before-detect
+    rng2 = np.random.default_rng(777)
+    frames, pos, vel, present = L1D.render_run(rng2, T=120, G=a.G, vmax=a.vmax, snr=a.snr)
+    seq = L1D.gen_field_sequence(net1, frames, pos, a.G, N, dev)
+    with torch.no_grad():
+        det_logit, velo = net(torch.from_numpy(seq[None]).to(dev))
+        l2s = torch.sigmoid(det_logit)[0, :, 0].cpu().numpy()        # [T,G,G] L2 detection
+    truth = np.zeros((120, a.G, a.G), np.float32)
+    for t in range(120):
+        if present[t]: truth[t] = L1D.halfcos_bump_torus(pos[t, 0], pos[t, 1], a.G)
+    synth.save_tiff_stack(seq[:, 0], f"{a.out}/L1_s.tif")            # L1 input field
+    synth.save_tiff_stack(l2s, f"{a.out}/L2_det.tif")               # L2 track-before-detect output
+    synth.save_tiff_stack(truth, f"{a.out}/truth.tif")
+    print(f"wrote {a.out}/{{L1_s,L2_det,truth}}.tif (120 pages) — compare L2_det vs L1_s vs truth", flush=True)
+
+
+if __name__ == "__main__":
+    main()
--- a/layer2_train_A.py
+++ b/layer2_train_A.py
+# layer2_train_A.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 Option A: supervision-width sweep + OUTPUT-FWHM measurement. By Claude 06/22/2026
+
+Same FROZEN-L1 v1 (no-gap) training data as runs/l2_v1 — the ONLY variables are the
+supervision bump width (--sigma) and the BCE positive weight (--pos_weight). Question this
+answers: can L2's *output* detection blob be made as tight as L1's (~FWHM 2 px), or does it
+stay fat regardless of how sharp we supervise (which would mean per-pixel BCE can't
+concentrate mass -> motivates the spatial-softmax readout, Option B)?
+
+  FWHM = 2.355 * sigma  ->  L1-like FWHM ~2 px  ==  --sigma ~0.85
+
+Runs on DGX in the NGC torch container:
+  python layer2_train_A.py --l1 runs/weighted9_pm/model.pt --nseq 128 --T 48 --steps 4000 \
+      --sigma 1.5 --pos_weight 30 --out runs/l2_A_s15
+  python layer2_train_A.py ... --sigma 0.85 --pos_weight 30 --out runs/l2_A_s085
+"""
+
+import argparse
+import numpy as np
+import torch
+import torch.nn.functional as F
+import synth
+import layer2_data as L1D
+from layer2 import Layer2Net, bump_target, layer2_loss
+
+
+def build_cache(net, N, nseq, T, G, vmax, snr, dev, seed=0, render_kw=None):
+    """Pre-compute nseq frozen-L1 field sequences + truth. By Claude 06/22
+    render_kw forwards gaps/band-pass knobs (default gaps=False = no-gap, apples-to-apples). By Claude 06/23"""
+    rng = np.random.default_rng(seed)
+    rkw = dict(gaps=False); rkw.update(render_kw or {})
+    cache = []
+    for k in range(nseq):
+        frames, pos, vel, present, signal = L1D.render_run(rng, T=T, G=G, vmax=vmax, snr=snr,
+                                                           return_signal=True, **rkw)
+        seq = L1D.gen_field_sequence(net, frames, pos, G, N, dev)     # [T,3,G,G]
+        cache.append((torch.from_numpy(seq).to(dev),
+                      torch.from_numpy(np.nan_to_num(pos)).float().to(dev),
+                      torch.from_numpy(vel).to(dev),
+                      torch.from_numpy(present).to(dev),
+                      torch.from_numpy(signal).to(dev)))           # signal: 1=bump rendered, 0=GAP
+        if (k + 1) % 32 == 0:
+            print(f"  cache {k+1}/{nseq}", flush=True)
+    return cache
+
+
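+def make_targets(batch, G, dev, sigma=1.5):
+    """Stack a minibatch -> (seq, det_t, vel_t, present, pos, signal). By Claude 06/22
+    det_t: bump at truth, zeroed on absent frames. vel_t: per-seq (v1) or per-frame (v2) velocity,
+    broadcast over the GxG grid. pos: [B,T,2] truth position. signal: [B,T] 1=bump rendered, 0=GAP."""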
+    seq = torch.stack([b[0] for b in batch], 0)                      # [B,T,3,G,G]
+    pos = torch.stack([b[1] for b in batch], 0)                      # [B,T,2]
+    vel = torch.stack([b[2] for b in batch], 0)                      # [B,2] (v1) or [B,T,2] (v2 per-frame)
+    present = torch.stack([b[3] for b in batch], 0)                  # [B,T]
+    signal = torch.stack([b[4] for b in batch], 0)                   # [B,T] 1=bump rendered, 0=GAP
+    det_t = bump_target(pos, G, sigma=sigma, device=dev)             # [B,T,1,G,G]
+    det_t = det_t * present[:, :, None, None, None]                  # zero where no target
+    B, T = present.shape
+    if vel.dim() == 2:                                               # per-seq -> broadcast over T
+        vel_t = vel[:, None, :, None, None].expand(B, T, 2, G, G).contiguous()
+    else:                                                            # per-frame (maneuver) [B,T,2]
+        vel_t = vel[:, :, :, None, None].expand(B, T, 2, G, G).contiguous()
+    return seq, det_t, vel_t, present, pos, signal
+
+
+def mexhat_loss(det_logit, vel, pos_t, present, vel_t, G,
+                sig_core=0.85, sig_wide=1.5, w_center=30.0, w_ring=15.0, w_bg=1.0,
+                signal=None, gap_boost=1.0):
+    """Center-surround (LoG / Mexican-hat) weighted dense BCE. By Claude on 06/22/2026
+    Encourage firing in the FWHM~2 core (sig_core); HARD-suppress the surrounding ring
+    (sig_wide minus core = the 2<FWHM<4 moat); light background floor elsewhere. A spatial
+    weight map replaces the scalar pos_weight (so it both fixes the imbalance AND carves the
+    skirt). We KNOW the true target is FWHM~2, so the ring is known-empty. Velocity MSE on core.
+    gap_boost (Andrey 06/23): weight GAP frames (present but L1-starved, signal==0 -> the net must
+    COAST from memory) by gap_boost x relative to easy-following frames, so the loss prioritizes
+    getting the hard remembered frames right (ground truth tells us exactly which they are)."""
+    pres = present[:, :, None, None, None]
+    core = bump_target(pos_t, G, sigma=sig_core, device=det_logit.device) * pres   # [B,T,1,G,G] peak 1
+    wide = bump_target(pos_t, G, sigma=sig_wide, device=det_logit.device) * pres
+    ring = (wide - core).clamp_min(0.0)                              # annulus (the moat)
+    target = core
+    W = w_bg + w_center * core + w_ring * ring                      # Mexican-hat per-pixel weight
+    if signal is not None and gap_boost != 1.0:                     # per-FRAME boost on coast (gap) frames
+        gap = (present > 0.5) & (signal < 0.5)                       # [B,T] present AND no L1 signal
+        fw = 1.0 + (gap_boost - 1.0) * gap.float()                  # [B,T]
+        W = W * fw[:, :, None, None, None]                          # broadcast frame weight onto the map
+    bce = F.binary_cross_entropy_with_logits(det_logit, target, reduction='none')
+    l_det = (W * bce).mean()
+    m = (core[:, :, 0] > 0.3)                                       # core disk for velocity supervision
+    if m.any():
+        l_vel = F.mse_loss(vel[m[:, :, None].expand_as(vel)], vel_t[m[:, :, None].expand_as(vel)])
+    else:
+        l_vel = vel.sum() * 0.0
+    return l_det + 0.3 * l_vel, {"det": float(l_det.detach()),
+                                 "vel": float(l_vel.detach() if torch.is_tensor(l_vel) else l_vel)}
+
+
+def evaluate(net, batch, G, dev, sigma=1.5):
+    """Per-seq lock/FP metrics over a held-out batch. By Claude 06/22
+    sigma matches the training supervision so the 'core' mask matches the bump. By Claude 06/22"""
+    seq, det_t, vel_t, present, _, _ = make_targets(batch, G, dev, sigma=sigma)
+    with torch.no_grad():
+        det_logit, vel = net(seq)
+        p = torch.sigmoid(det_logit)[:, :, 0]                        # [B,T,G,G]
+    pres = present.bool()
+    core = det_t[:, :, 0] > 0.3                                      # [B,T,G,G] truth bump core
+    peak = float(p[core].mean()) if core.any() else 0.0             # s at truth (want ->1)
+    absent = ~pres                                                   # noise-prefix frames
+    bg_absent = float(p[absent].amax()) if absent.any() else 0.0    # worst FP on noise (want ->0)
+    # background on PRESENT frames, away from the target (clutter-cloud FP)
+    bg_present = float((p * (~core) * pres[:, :, None, None]).amax())
+    return peak, bg_absent, bg_present
+
+
+# --- Output-blob width measurement (the Option A deliverable). By Claude 06/22 -------------
+def _half_max_width(line):
+    """FWHM (px) of a 1-D profile with its peak at the CENTER index. Linear-interpolated
+    half-maximum crossings on each side. Returns NaN if the center value is ~0 (<= 1e-6);
+    the center is assumed, not verified, to be the local peak."""
+    n = len(line); c = n // 2
+    pk = float(line[c])
+    if pk <= 1e-6:
+        return float("nan")
+    half = pk / 2.0
+    # walk right until below half, interpolate the crossing
+    j = c
+    while j < n - 1 and line[j] >= half:
+        j += 1
+    right = (j - 1) + (line[j - 1] - half) / (line[j - 1] - line[j]) if line[j - 1] > line[j] else float(j - 1)
+    # walk left
+    i = c
+    while i > 0 and line[i] >= half:
+        i -= 1
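+    # crossing lies between i (below half) and i+1 (at/above half): step back from i+1 by the
+    # fraction (line[i+1]-half)/(line[i+1]-line[i]), mirroring the right-side interpolation
+    left = (i + 1) - (line[i + 1] - half) / (line[i + 1] - line[i]) if line[i + 1] > line[i] else float(i + 1)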
+    return right - left
+
+
+def peak_fwhm_at(field, cx, cy, G):
+    """Mean FWHM (px) of the blob at toroidal truth (cx,cy): roll truth to grid center so the
+    bump can't straddle the seam, then average the half-max widths of the center row & column."""
+    f = np.roll(np.roll(field, int(round(G // 2 - cy)), axis=0), int(round(G // 2 - cx)), axis=1)
+    c = G // 2
+    wx = _half_max_width(f[c, :])
+    wy = _half_max_width(f[:, c])
+    return np.nanmean([wx, wy])
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--l1", default="runs/weighted9_pm/model.pt")
+    ap.add_argument("--nseq", type=int, default=128); ap.add_argument("--T", type=int, default=48)
+    ap.add_argument("--G", type=int, default=32); ap.add_argument("--vmax", type=float, default=1.4)
+    ap.add_argument("--snr", type=float, default=6.0); ap.add_argument("--steps", type=int, default=4000)
+    ap.add_argument("--bs", type=int, default=8); ap.add_argument("--lr", type=float, default=2e-3)
+    ap.add_argument("--ch", type=int, default=24); ap.add_argument("--out", default="runs/l2_A")
+    # Option A knobs (this is the whole experiment):
+    ap.add_argument("--sigma", type=float, default=1.5, help="supervision bump sigma; FWHM=2.355*sigma (try 0.85 for ~2px)")
+    ap.add_argument("--pos_weight", type=float, default=30.0, help="BCE positive weight (try lower AFTER sigma)")
+    # Mexican-hat (LoG center-surround) weighting — replaces pos_weight when --mexhat is set.
+    ap.add_argument("--mexhat", action="store_true", help="center-surround weighted BCE (encourage core, suppress ring)")
+    ap.add_argument("--mh_center", type=float, default=30.0); ap.add_argument("--mh_ring", type=float, default=15.0)
+    ap.add_argument("--mh_bg", type=float, default=1.0)
+    ap.add_argument("--mh_sig_core", type=float, default=0.85); ap.add_argument("--mh_sig_wide", type=float, default=1.5)
+    # gap-envelope knobs (default off = no-gap, apples-to-apples with the sharpening test). By Claude 06/23
+    ap.add_argument("--gaps", action="store_true", help="train on band-pass amplitude gaps (coast challenge)")
+    ap.add_argument("--bp_lo", type=int, default=6); ap.add_argument("--bp_hi", type=int, default=18)
+    ap.add_argument("--duty_offset", type=float, default=0.2); ap.add_argument("--starter_len", type=int, default=8)
+    ap.add_argument("--gap_boost", type=float, default=1.0, help="weight GAP (coast) frames Nx vs easy following (Andrey: 30-50)")
+    a = ap.parse_args()
+    import os; os.makedirs(a.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    render_kw = dict(gaps=a.gaps, bp_lo=a.bp_lo, bp_hi=a.bp_hi, duty_offset=a.duty_offset, starter_len=a.starter_len)
+    net1, N, _ = L1D._load_l1(a.l1, dev)
+    print(f"L1 {a.l1} N={N}; mexhat={a.mexhat} gaps={a.gaps} (bp[{a.bp_lo},{a.bp_hi}] off={a.duty_offset}); "
+          f"precomputing {a.nseq} seqs (T={a.T}, G={a.G})...", flush=True)
+    cache = build_cache(net1, N, a.nseq, a.T, a.G, a.vmax, a.snr, dev, render_kw=render_kw)
+    nval = max(8, a.nseq // 8); val = cache[:nval]; train = cache[nval:]
+    print(f"cache: {len(train)} train / {len(val)} val sequences", flush=True)
+
+    net = Layer2Net(ch_in=3, ch_hidden=a.ch, grid=a.G, vmax=a.vmax).to(dev)
+    opt = torch.optim.Adam(net.parameters(), a.lr)
+    nparams = sum(p.numel() for p in net.parameters())
+    print(f"Layer2Net {nparams} params; training {a.steps} steps bs={a.bs}", flush=True)
+    rng = np.random.default_rng(1)
+    for step in range(1, a.steps + 1):
+        idx = rng.integers(0, len(train), a.bs)
+        seq, det_t, vel_t, present, pos_t, signal = make_targets([train[i] for i in idx], a.G, dev, sigma=a.sigma)
+        det_logit, vel = net(seq)
+        if a.mexhat:
+            loss, comp = mexhat_loss(det_logit, vel, pos_t, present, vel_t, a.G,
+                                     sig_core=a.mh_sig_core, sig_wide=a.mh_sig_wide,
+                                     w_center=a.mh_center, w_ring=a.mh_ring, w_bg=a.mh_bg,
+                                     signal=signal, gap_boost=a.gap_boost)
+        else:
+            loss, comp = layer2_loss(det_logit, vel, det_t, vel_t, pos_weight=a.pos_weight)
+        opt.zero_grad(); loss.backward(); opt.step()
+        if step % 250 == 0 or step == 1:
+            peak, bga, bgp = evaluate(net, val, a.G, dev, sigma=a.sigma)
+            print(f"step {step:5d}  det {comp['det']:.4f} vel {comp['vel']:.4f}  | "
+                  f"val: s@truth {peak:.3f}  max-FP(noise) {bga:.3f}  max-FP(clutter) {bgp:.3f}", flush=True)
+    torch.save({"model": net.state_dict(), "args": vars(a)}, f"{a.out}/model.pt")
+    print(f"saved {a.out}/model.pt", flush=True)
+
+    # eval viz + FWHM: run trained L2 on a fresh sequence, dump tiffs, measure output blob width
+    rng2 = np.random.default_rng(777)
+    frames, pos, vel, present = L1D.render_run(rng2, T=120, G=a.G, vmax=a.vmax, snr=a.snr)
+    seq = L1D.gen_field_sequence(net1, frames, pos, a.G, N, dev)
+    with torch.no_grad():
+        det_logit, velo = net(torch.from_numpy(seq[None]).to(dev))
+        l2s = torch.sigmoid(det_logit)[0, :, 0].cpu().numpy()        # [T,G,G] L2 detection
+    truth = np.zeros((120, a.G, a.G), np.float32)
+    for t in range(120):
+        if present[t]: truth[t] = L1D.halfcos_bump_torus(pos[t, 0], pos[t, 1], a.G)
+    synth.save_tiff_stack(seq[:, 0], f"{a.out}/L1_s.tif")            # L1 input field
+    synth.save_tiff_stack(l2s, f"{a.out}/L2_det.tif")               # L2 track-before-detect output
+    synth.save_tiff_stack(truth, f"{a.out}/truth.tif")
+
+    # FWHM + argmax pos-MAE at the target on LOCKED frames (a shared metric across all rungs).
+    fw_l1, fw_l2, perr = [], [], []
+    for t in range(120):
+        if not present[t]:
+            continue
+        fw_l1.append(peak_fwhm_at(seq[t, 0], pos[t, 0], pos[t, 1], a.G))
+        if l2s[t].max() > 0.5:                                       # only where L2 actually locked
+            fw_l2.append(peak_fwhm_at(l2s[t], pos[t, 0], pos[t, 1], a.G))
+            pj = int(np.argmax(l2s[t])); pyk, pxk = divmod(pj, a.G)  # L2 peak cell
+            dx = (pxk - pos[t, 0] + a.G / 2) % a.G - a.G / 2
+            dy = (pyk - pos[t, 1] + a.G / 2) % a.G - a.G / 2
+            perr.append(float(np.hypot(dx, dy)))
+    mu_l1 = float(np.nanmean(fw_l1)) if fw_l1 else float("nan")
+    mu_l2 = float(np.nanmean(fw_l2)) if fw_l2 else float("nan")
+    mu_perr = float(np.mean(perr)) if perr else float("nan")
+    mode = "MEXICAN-HAT" if a.mexhat else f"plain BCE pos_weight={a.pos_weight}"
+    print(f"wrote {a.out}/{{L1_s,L2_det,truth}}.tif (120 pages)", flush=True)
+    print(f"=== dense result ({mode}, sigma_core={a.mh_sig_core if a.mexhat else a.sigma}) ===", flush=True)
+    print(f"    output blob FWHM @ target:  L1_in ~ {mu_l1:.2f} px   L2_out ~ {mu_l2:.2f} px", flush=True)
+    print(f"    pos-MAE(argmax) ~ {mu_perr:.2f} px   (L2 locked on {len(fw_l2)}/{len(fw_l1)} present frames)", flush=True)
+
+
+if __name__ == "__main__":
+    main()
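
Reviewer sketch (not part of the commit): a standalone, pure-Python version of the half-maximum width measurement that `_half_max_width` performs, checked against the docstring relation FWHM = 2.355 * sigma. The names `half_max_width_sketch` and `gaussian_profile` are hypothetical, and the Gaussian test profile is an assumption made for illustration; the real supervision bump comes from `bump_target` in layer2.py, which is not shown here. Unlike the committed function, this sketch returns NaN when the profile never drops below half maximum inside the window.

```python
import math


def half_max_width_sketch(line):
    """FWHM (samples) of a 1-D profile whose peak is assumed to sit at the center index.

    Both half-maximum crossings are found by linear interpolation between the last
    sample at/above half maximum and the first sample below it.
    """
    n = len(line)
    c = n // 2
    pk = float(line[c])
    if pk <= 1e-6:
        return float("nan")
    half = pk / 2.0
    j = c
    while j < n - 1 and line[j] >= half:   # walk right to the first sample below half
        j += 1
    i = c
    while i > 0 and line[i] >= half:       # walk left to the first sample below half
        i -= 1
    if line[j] >= half or line[i] >= half:
        return float("nan")                # no crossing inside the window on one side
    right = (j - 1) + (line[j - 1] - half) / (line[j - 1] - line[j])
    left = (i + 1) - (line[i + 1] - half) / (line[i + 1] - line[i])
    return right - left


def gaussian_profile(sigma, n=33):
    """Unit-peak Gaussian sampled on n integer cells, centered at index n // 2."""
    c = n // 2
    return [math.exp(-((k - c) ** 2) / (2.0 * sigma ** 2)) for k in range(n)]


for sigma in (0.85, 1.5):
    w = half_max_width_sketch(gaussian_profile(sigma))
    print(f"sigma={sigma}: measured {w:.3f} px, 2.355*sigma = {2.355 * sigma:.3f} px")
```

An asymmetric profile such as `[0.0, 0.25, 1.0, 0.75, 0.0]` is a useful extra check, because an error in only one of the two interpolations shows up there as a width different from 2.0.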
--- a/layer2_train_P.py
+++ b/layer2_train_P.py
+# layer2_train_P.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 PARAMETRIC training: position+presence head, known-width output. By Claude 06/22/2026
+
+Trains Layer2NetP (layer2p.py): softmax "where" -> sub-pixel position + scalar presence + expected
+velocity. No dense sigmoid bump, no pos_weight on a spatial map -> the wrong-skirt failure mode is
+impossible and output FWHM = 2.355*sigma_where BY CONSTRUCTION (Andrey 06/22: known width 2.0).
+
+First run is a SHAKEDOWN on the pristine v1 (no-gap) cache already on the DGX -- isolate "does the
+new readout lock + localize + separate presence" before adding gap-coast difficulty. The gap
+(memory) run comes next, on the v2 generator (per-frame velocity), once synced.
+
+Run on DGX (NGC torch container):
+  python layer2_train_P.py --l1 runs/weighted9_pm/model.pt --nseq 128 --T 48 --steps 4000 \
+      --out runs/l2_P_v1shake
+"""
+
+import argparse
+import numpy as np
+import torch
+import synth
+import layer2_data as L1D
+from layer2p import Layer2NetP, layer2p_loss, render_blob, _wrap
+
+
+def build_cache(net, N, nseq, T, G, vmax, snr, dev, seed=0, render_kw=None):
+    """Pre-compute nseq frozen-L1 field sequences + truth. By Claude 06/22
+    render_kw forwards the gap-envelope / realism knobs to render_run."""
+    rng = np.random.default_rng(seed)
+    render_kw = render_kw or {}
+    cache = []
+    for k in range(nseq):
+        frames, pos, vel, present = L1D.render_run(rng, T=T, G=G, vmax=vmax, snr=snr, **render_kw)
+        seq = L1D.gen_field_sequence(net, frames, pos, G, N, dev)
+        cache.append((torch.from_numpy(seq).to(dev),
+                      torch.from_numpy(np.nan_to_num(pos)).float().to(dev),   # [T,2] truth pos
+                      torch.from_numpy(vel).to(dev),                          # [2] per-seq (v1) or [T,2] per-frame (v2)
+                      torch.from_numpy(present).to(dev)))                     # [T]
+        if (k + 1) % 32 == 0:
+            print(f"  cache {k+1}/{nseq}", flush=True)
+    return cache
+
+
+def make_targets(batch, G, dev):
+    """Stack -> (seq, pos_t[B,T,2], vel_t[B,T,2], present[B,T]). By Claude 06/22
+    v1 velocity is per-seq [2] -> broadcast across T."""
+    seq = torch.stack([b[0] for b in batch], 0)                      # [B,T,3,G,G]
+    pos_t = torch.stack([b[1] for b in batch], 0)                    # [B,T,2]
+    vel = torch.stack([b[2] for b in batch], 0)                      # [B,2] (v1) or [B,T,2] (v2 gaps)
+    present = torch.stack([b[3] for b in batch], 0)                  # [B,T]
+    B, T = present.shape
+    if vel.dim() == 2:                                               # per-seq -> broadcast over T
+        vel_t = vel[:, None, :].expand(B, T, 2).contiguous()
+    else:                                                            # per-frame (maneuver) -> as-is
+        vel_t = vel.contiguous()
+    return seq, pos_t, vel_t, present
+
+
+def evaluate(net, batch, G, dev):
+    """Position MAE (px, present frames) + presence separation (on vs off). By Claude 06/22"""
+    seq, pos_t, vel_t, present = make_targets(batch, G, dev)
+    with torch.no_grad():
+        logP, pos, pres_logit, vel = net(seq)
+        pres_p = torch.sigmoid(pres_logit)                           # [B,T]
+    m = present.bool()
+    dxy = _wrap(pos - pos_t, G)                                      # [B,T,2] toroidal error
+    perr = torch.sqrt((dxy ** 2).sum(-1))                           # [B,T] distance, cells
+    pos_mae = float(perr[m].mean()) if m.any() else float("nan")    # localize accuracy (want small)
+    pp_on = float(pres_p[m].mean()) if m.any() else float("nan")    # presence on present (want ->1)
+    pp_off = float(pres_p[~m].mean()) if (~m).any() else float("nan")  # presence on absent (want ->0)
+    return pos_mae, pp_on, pp_off
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--l1", default="runs/weighted9_pm/model.pt")
+    ap.add_argument("--nseq", type=int, default=128); ap.add_argument("--T", type=int, default=48)
+    ap.add_argument("--G", type=int, default=32); ap.add_argument("--vmax", type=float, default=1.4)
+    ap.add_argument("--snr", type=float, default=6.0); ap.add_argument("--steps", type=int, default=4000)
+    ap.add_argument("--bs", type=int, default=8); ap.add_argument("--lr", type=float, default=2e-3)
+    ap.add_argument("--ch", type=int, default=24); ap.add_argument("--out", default="runs/l2_P")
+    # parametric-loss knobs (no spatial pos_weight by design):
+    ap.add_argument("--sigma_where", type=float, default=0.85, help="where-target sigma; render FWHM=2.355*it (~2px)")
+    ap.add_argument("--w_pos", type=float, default=0.2)
+    ap.add_argument("--w_pres", type=float, default=1.0)
+    ap.add_argument("--w_vel", type=float, default=0.3)
+    ap.add_argument("--pres_pos_weight", type=float, default=2.0, help="SCALAR presence BCE weight (not spatial)")
+    # gap-envelope knobs (forwarded to render_run). Without --gaps it's the v1-like no-gap shakedown.
+    ap.add_argument("--gaps", action="store_true", help="band-pass amplitude gaps (the coast/memory test)")
+    ap.add_argument("--bp_lo", type=int, default=3)
+    ap.add_argument("--bp_hi", type=int, default=9, help="HF cutoff ~3-5*bp_lo")
+    ap.add_argument("--duty_offset", type=float, default=-0.3, help="more negative => more/longer gaps")
+    ap.add_argument("--starter_len", type=int, default=8, help="clean full-SNR acquire (yesterday locked +6)")
+    a = ap.parse_args()
+    import os; os.makedirs(a.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    render_kw = dict(gaps=a.gaps, bp_lo=a.bp_lo, bp_hi=a.bp_hi,
+                     duty_offset=a.duty_offset, starter_len=a.starter_len)
+    net1, N, _ = L1D._load_l1(a.l1, dev)
+    print(f"L1 {a.l1} N={N}; PARAMETRIC head, render FWHM={2.355*a.sigma_where:.2f}px; "
+          f"gaps={a.gaps} bp[{a.bp_lo},{a.bp_hi}] off={a.duty_offset}; "
+          f"precomputing {a.nseq} seqs (T={a.T}, G={a.G})...", flush=True)
+    cache = build_cache(net1, N, a.nseq, a.T, a.G, a.vmax, a.snr, dev, render_kw=render_kw)
+    nval = max(8, a.nseq // 8); val = cache[:nval]; train = cache[nval:]
+    print(f"cache: {len(train)} train / {len(val)} val sequences", flush=True)
+
+    net = Layer2NetP(ch_in=3, ch_hidden=a.ch, grid=a.G, vmax=a.vmax).to(dev)
+    opt = torch.optim.Adam(net.parameters(), a.lr)
+    nparams = sum(p.numel() for p in net.parameters())
+    print(f"Layer2NetP {nparams} params; training {a.steps} steps bs={a.bs}", flush=True)
+    rng = np.random.default_rng(1)
+    for step in range(1, a.steps + 1):
+        idx = rng.integers(0, len(train), a.bs)
+        seq, pos_t, vel_t, present = make_targets([train[i] for i in idx], a.G, dev)
+        logP, pos, pres_logit, vel = net(seq)
+        loss, comp = layer2p_loss(logP, pos, pres_logit, vel, pos_t, present, vel_t, a.G,
+                                  sigma_where=a.sigma_where, w_pos=a.w_pos, w_pres=a.w_pres,
+                                  w_vel=a.w_vel, pres_pos_weight=a.pres_pos_weight)
+        opt.zero_grad(); loss.backward(); opt.step()
+        if step % 250 == 0 or step == 1:
+            mae, pon, poff = evaluate(net, val, a.G, dev)
+            print(f"step {step:5d}  where {comp['where']:.3f} pos {comp['pos']:.3f} "
+                  f"pres {comp['pres']:.3f} vel {comp['vel']:.3f}  | "
+                  f"val: pos-MAE {mae:.2f}px  presence on/off {pon:.2f}/{poff:.2f}", flush=True)
+    torch.save({"model": net.state_dict(), "args": vars(a)}, f"{a.out}/model.pt")
+    print(f"saved {a.out}/model.pt", flush=True)
+
+    # eval viz: run on a fresh sequence, RENDER the known-width blob, dump tiffs + metrics.
+    rng2 = np.random.default_rng(777)
+    frames, pos, vel, present = L1D.render_run(rng2, T=120, G=a.G, vmax=a.vmax, snr=a.snr)
+    seq = L1D.gen_field_sequence(net1, frames, pos, a.G, N, dev)
+    with torch.no_grad():
+        logP, pp, pres_logit, velo = net(torch.from_numpy(seq[None]).to(dev))
+        pos_pred = pp[0].cpu().numpy()                               # [T,2]
+        pres_p = torch.sigmoid(pres_logit)[0].cpu().numpy()          # [T]
+    l2render = render_blob(pos_pred, pres_p, a.G, sigma=a.sigma_where)   # [T,G,G] FWHM 2 by construction
+    truth = np.zeros((120, a.G, a.G), np.float32)
+    for t in range(120):
+        if present[t]: truth[t] = L1D.halfcos_bump_torus(pos[t, 0], pos[t, 1], a.G)
+    synth.save_tiff_stack(seq[:, 0], f"{a.out}/L1_s.tif")
+    synth.save_tiff_stack(l2render.astype(np.float32), f"{a.out}/L2P_render.tif")
+    synth.save_tiff_stack(truth, f"{a.out}/truth.tif")
+
+    # metrics: position MAE on present frames where the net says present (locked), presence stats.
+    perr, locked = [], 0
+    for t in range(120):
+        if not present[t]:
+            continue
+        if pres_p[t] > 0.5:
+            locked += 1
+            d = _wrap(torch.tensor(pos_pred[t] - np.array([pos[t, 0], pos[t, 1]])), a.G).numpy()
+            perr.append(float(np.sqrt((d ** 2).sum())))
+    mae = float(np.mean(perr)) if perr else float("nan")
+    npres = int(present.sum())
+    print(f"wrote {a.out}/{{L1_s,L2P_render,truth}}.tif (120 pages)", flush=True)
+    print(f"=== Parametric result (render FWHM {2.355*a.sigma_where:.2f}px BY CONSTRUCTION) ===", flush=True)
+    print(f"    position MAE @ locked frames: {mae:.2f} px   (locked {locked}/{npres} present frames)", flush=True)
+    print(f"    presence prob  present~{pres_p[present.astype(bool)].mean():.2f}  "
+          f"absent~{pres_p[~present.astype(bool)].mean():.2f}", flush=True)
+
+
+if __name__ == "__main__":
+    main()
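
Reviewer sketch (not part of the commit): the position metrics above measure error on a GxG torus, so a prediction at x = 31.5 and a truth at x = 0.5 on a 32-cell grid are 1 cell apart, not 31. The pure-Python sketch below uses the same wrap expression that layer2_train_A.py applies to its argmax error, `(d + G/2) % G - G/2`. The helper names `wrap_delta` and `torus_distance` are hypothetical, and it is only an assumption that `_wrap` in layer2p.py behaves the same way, since its body is not shown here.

```python
import math


def wrap_delta(d, G):
    """Map a coordinate difference onto [-G/2, G/2) for a torus of period G."""
    return (d + G / 2.0) % G - G / 2.0


def torus_distance(p, q, G):
    """Euclidean distance between (x, y) points p and q, taking the shortest way around."""
    dx = wrap_delta(p[0] - q[0], G)
    dy = wrap_delta(p[1] - q[1], G)
    return math.hypot(dx, dy)


G = 32
print(torus_distance((31.5, 0.5), (0.5, 31.5), G))   # both axes cross the seam
print(torus_distance((10.0, 10.0), (13.0, 14.0), G)) # no wrap involved
```

Python's `%` returns a non-negative result for a positive modulus, which is what makes the single expression work for negative differences as well.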
--- a/layer2p.py
+++ b/layer2p.py
+# layer2p.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P Layer-2 PARAMETRIC head (position + presence), known-width readout. By Claude on 06/22/2026
+
+Endpoint of the "we KNOW the target is FWHM 2.0" decision (Andrey 06/22). Instead of the dense
+sigmoid-bump detector (Layer2Net), the net emits, per frame, ONLY:
+  - a spatial-SOFTMAX "where" map over the GxG torus  -> sub-pixel position (toroidal centroid),
+  - a scalar "whether" presence logit,
+  - an expected velocity (softmax-weighted Vx,Vy).
+The detection image, if needed, is RENDERED as a fixed FWHM-2 blob at the predicted position,
+scaled by presence -> the output width is correct BY CONSTRUCTION; a "wrong skirt" cannot occur.
+
+Why softmax (not sigmoid/BCE): a softmax map's total mass is FIXED at 1, so the net cannot lower
+its loss by spreading into a skirt -- it MUST concentrate. That removes BOTH the fat-skirt failure
+of the dense head AND the rare-positive class imbalance that forced pos_weight=30 (Andrey 06/22:
+"it does not spread wide, so ratio of positives is not a concern"). Single-target by construction.
+
+Reuses ConvGRUCellTorus + the toroidal Gaussian from layer2.py; the deployed dense Layer2Net is
+left untouched (the inference server keeps loading it + runs/l2_v1). By Claude on 06/22/2026.
+"""
+
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from layer2 import ConvGRUCellTorus, bump_target
+
+
+def _torus_centroid(P, G):
+    """Sub-pixel toroidal centroid of a probability map. By Claude on 06/22/2026
+    P: [B, G, G] (>=0, sums to 1 over the GxG grid). Returns pos [B, 2] = (x, y) in cells, in
+    [0, G). Uses the CIRCULAR mean (map each index to an angle, average sin/cos, atan2 back) so a
+    peak straddling the wrap seam averages to the seam -- NOT to the middle of the grid as a plain
+    weighted mean would. x = last axis (columns), y = first spatial axis (rows)."""
+    dev = P.device
+    ang = (2.0 * np.pi / G) * torch.arange(G, device=dev, dtype=P.dtype)     # [G] cell index -> angle
+    cos_x = torch.cos(ang)[None, None, :]                                    # [1,1,G] over columns (x)
+    sin_x = torch.sin(ang)[None, None, :]
+    cos_y = torch.cos(ang)[None, :, None]                                    # [1,G,1] over rows (y)
+    sin_y = torch.sin(ang)[None, :, None]
+    Cx = (P * cos_x).sum(dim=(-1, -2)); Sx = (P * sin_x).sum(dim=(-1, -2))   # [B] each
+    Cy = (P * cos_y).sum(dim=(-1, -2)); Sy = (P * sin_y).sum(dim=(-1, -2))
+    x = torch.atan2(Sx, Cx) % (2.0 * np.pi) * (G / (2.0 * np.pi))            # [B] in [0,G)
+    y = torch.atan2(Sy, Cy) % (2.0 * np.pi) * (G / (2.0 * np.pi))
+    return torch.stack([x, y], dim=-1)                                       # [B, 2]
+
+
+class Layer2NetP(nn.Module):
+    """Recurrent track-before-detect with a PARAMETRIC (position + presence) readout. By Claude 06/22
+
+    forward(seq) -> per frame: log-softmax where-map [B,T,G,G], position [B,T,2] (cells, sub-pixel),
+    presence logit [B,T], velocity [B,T,2] (px/level-frame). Same ConvGRU recurrence as Layer2Net.
+    """
+    def __init__(self, ch_in=3, ch_hidden=24, grid=32, vmax=1.4, k=3):
+        super().__init__()
+        self.ch_hidden = ch_hidden
+        self.grid = grid
+        self.vmax = vmax
+        self.cell = ConvGRUCellTorus(ch_in, ch_hidden, k=k)
+        self.score_head = nn.Conv2d(ch_hidden, 1, 1)   # -> spatial logits, softmaxed to "where"
+        self.vel_head   = nn.Conv2d(ch_hidden, 2, 1)   # -> per-cell velocity field
+        self.pres_head  = nn.Linear(ch_hidden, 1)      # -> scalar "whether" from pooled hidden
+
+    def init_hidden(self, B, device, dtype):
+        return torch.zeros(B, self.ch_hidden, self.grid, self.grid, device=device, dtype=dtype)
+
+    def decode(self, h):
+        # h: [B, Ch, G, G]
+        B, _, G, _ = h.shape
+        score = self.score_head(h).view(B, G * G)                # [B, G*G] spatial logits
+        logP = F.log_softmax(score, dim=1).view(B, G, G)         # [B, G, G] log "where" map (exp sums to 1)
+        P = logP.exp()                                           # [B, G, G] probability map
+        pos = _torus_centroid(P, G)                              # [B, 2] sub-pixel (x,y) in cells
+        pooled = h.mean(dim=(-1, -2))                            # [B, Ch] global context for presence
+        pres_logit = self.pres_head(pooled)[:, 0]                # [B] "is a target present this frame"
+        vel_field = self.vmax * torch.tanh(self.vel_head(h))     # [B, 2, G, G] bounded velocity field
+        vel = (P[:, None] * vel_field).sum(dim=(-1, -2))         # [B, 2] expected velocity under "where"
+        return logP, pos, pres_logit, vel
+
+    def forward(self, seq, h=None):
+        # seq: [B, T, Cin, G, G]
+        B, T = seq.shape[0], seq.shape[1]
+        if h is None:
+            h = self.init_hidden(B, seq.device, seq.dtype)
+        logPs, poss, press, vels = [], [], [], []
+        for t in range(T):                                       # BPTT unrolls this loop
+            h = self.cell(seq[:, t], h)
+            logP, pos, pres, vel = self.decode(h)
+            logPs.append(logP); poss.append(pos); press.append(pres); vels.append(vel)
+        return (torch.stack(logPs, 1),                           # [B,T,G,G] log where-map
+                torch.stack(poss, 1),                            # [B,T,2]   position (cells)
+                torch.stack(press, 1),                           # [B,T]     presence logit
+                torch.stack(vels, 1))                            # [B,T,2]   velocity
+
+
+def _wrap(d, G):
+    """Toroidal signed difference into [-G/2, G/2). By Claude on 06/22/2026"""
+    return (d + G / 2.0) % G - G / 2.0
+
+
+def layer2p_loss(logP, pos, pres_logit, vel, pos_t, present, vel_t, G,
+                 sigma_where=0.85, w_pos=0.2, w_pres=1.0, w_vel=0.3, pres_pos_weight=2.0):
+    """Parametric loss. By Claude on 06/22/2026
+      logP   [B,T,G,G] log-softmax where-map      pos      [B,T,2] predicted (x,y) cells
+      pres_logit [B,T] presence logit             vel      [B,T,2] predicted velocity
+      pos_t  [B,T,2] truth position (cells)       present  [B,T]   1=target present
+      vel_t  [B,T,2] truth velocity
+    Terms (NO pos_weight on a spatial map -- softmax already fixes total mass, so no skirt / no
+    class imbalance; the only scalar weight is on the per-frame presence BCE):
+      - WHERE : soft cross-entropy of the softmax map vs a SHARP toroidal-Gaussian truth target
+                Q (sigma_where ~ FWHM 2). Concentrates one peak at truth; cannot win by spreading.
+      - POS   : toroidal MSE on the sub-pixel centroid (sub-cell refinement of WHERE).
+      - PRES  : BCE on the per-frame presence scalar (present vs absent/gap frames).
+      - VEL   : MSE of expected velocity, on present frames only.
+    WHERE/POS/VEL are masked to present frames; PRES is supervised every frame."""
+    B, T = present.shape
+    pres = present.float()
+    m = present.bool()
+
+    # WHERE: build a sharp normalized target distribution Q at truth, cross-entropy = -sum Q*logP.
+    Q = bump_target(pos_t, G, sigma=sigma_where, device=logP.device)[:, :, 0]   # [B,T,G,G] (peak 1)
+    Q = Q / Q.sum(dim=(-1, -2), keepdim=True).clamp_min(1e-8)                    # normalize -> sums to 1
+    ce = -(Q * logP).sum(dim=(-1, -2))                                          # [B,T] cross-entropy
+    l_where = ce[m].mean() if m.any() else logP.sum() * 0.0
+
+    # POS: sub-pixel toroidal MSE on present frames.
+    dxy = _wrap(pos - pos_t, G)                                                 # [B,T,2]
+    l_pos = (dxy[m] ** 2).mean() if m.any() else pos.sum() * 0.0
+
+    # PRES: per-frame presence BCE (scalar imbalance handled by a small pos_weight, NOT spatial).
+    pw = torch.tensor(pres_pos_weight, device=pres_logit.device)
+    l_pres = F.binary_cross_entropy_with_logits(pres_logit, pres, pos_weight=pw)
+
+    # VEL: expected-velocity MSE on present frames.
+    l_vel = F.mse_loss(vel[m], vel_t[m]) if m.any() else vel.sum() * 0.0
+
+    total = l_where + w_pos * l_pos + w_pres * l_pres + w_vel * l_vel
+    return total, {"where": float(l_where.detach()), "pos": float(l_pos.detach()),
+                   "pres": float(l_pres.detach()), "vel": float(l_vel.detach())}
+
+
+def render_blob(pos, pres, G, sigma=0.85):
+    """Render the KNOWN-width detection image from (position, presence). By Claude on 06/22/2026
+    pos [T,2] cells, pres [T] in [0,1] -> [T,G,G] = pres * toroidal Gaussian(FWHM=2.355*sigma) at pos.
+    Width is fixed here, so the output FWHM is correct by construction."""
+    T = pos.shape[0]
+    p = torch.as_tensor(pos, dtype=torch.float32).view(1, T, 2)
+    g = bump_target(p, G, sigma=sigma, device="cpu")[0, :, 0].numpy()           # [T,G,G] peak 1
+    return g * np.asarray(pres, dtype=np.float32)[:, None, None]
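The circular-mean readout and the toroidal wrap used above can be checked on a small 1-D ring. The following is a minimal sketch in pure Python, not the repository code: `circular_centroid`, `plain_centroid` and `wrap` are illustrative names, and the 8-cell ring with half the mass on each side of the seam is an assumed toy case.

```python
import math


def circular_centroid(p):
    """Circular-mean index of a 1-D probability vector laid out on a ring."""
    G = len(p)
    c = sum(w * math.cos(2.0 * math.pi * i / G) for i, w in enumerate(p))
    s = sum(w * math.sin(2.0 * math.pi * i / G) for i, w in enumerate(p))
    # atan2 returns (-pi, pi]; the modulo folds it into [0, 2*pi) before scaling to cells
    return (math.atan2(s, c) % (2.0 * math.pi)) * G / (2.0 * math.pi)


def plain_centroid(p):
    """Ordinary weighted mean of the cell index (ignores the wrap)."""
    return sum(i * w for i, w in enumerate(p))


def wrap(d, G):
    """Signed toroidal difference, same arithmetic as _wrap in the diff."""
    return (d + G / 2.0) % G - G / 2.0


G = 8
p = [0.0] * G
p[0] = 0.5          # half of the peak just right of the seam
p[G - 1] = 0.5      # half of the peak just left of the seam

seam = circular_centroid(p)      # lands between cells 7 and 0
middle = plain_centroid(p)       # lands in the middle of the grid
err = wrap(seam - 0.0, G)        # signed error against a truth position of 0.0
edge = wrap(G / 2.0, G)          # a difference of exactly G/2 maps to -G/2

print(seam, middle, err, edge)
```

In this toy case the circular mean sits at the seam (about 7.5 cells) while the plain mean reports 3.5, and the wrapped error against a truth of 0.0 is about -0.5 cells rather than +7.5.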
--- a/make_testvec.py
+++ b/make_testvec.py
+# make_testvec.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Save a fixed (input, raw-output) pair for Java-vs-PyTorch verification. # By Claude on 06/13/2026
+   python make_testvec.py /work/runs/weighted/model.pt /work/runs/weighted/testvec"""
+import sys, numpy as np, torch
+import synth
+from model import RawFCN
+ck = torch.load(sys.argv[1], map_location="cpu"); a = ck["args"]
+N = a.get("nframes", 8); P = a.get("patch", 24); vr = a.get("vel_radius", 5)
+m = RawFCN(n_frames=N, vel_radius=vr); m.load_state_dict(ck["model"]); m.eval()
+rng = np.random.default_rng(999)
+f, lab = synth.generate_sample(rng, N=N, H=P, W=P, snr=6.0, place="center")
+with torch.no_grad():
+    out = m(torch.from_numpy(f[None])).reshape(-1).numpy()   # [124] raw network output
+f.astype('<f4').tofile(sys.argv[2] + "_in.bin")              # [N,H,W] row-major LE float32
+out.astype('<f4').tofile(sys.argv[2] + "_out.bin")           # [124] LE float32
+print(f"testvec N={N} P={P} outlen={out.size}  true vx={lab['vx']:+.3f} vy={lab['vy']:+.3f}  "
+      f"det_logit={out[0]:.4f}")
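The `_in.bin` / `_out.bin` files written above are flat little-endian float32 arrays, so any consumer can read them back with a fixed-width decode. The sketch below shows one way to do that round trip using only the standard library; the helper names, the `demo_out.bin` file name and the three sample values are assumptions for illustration, not part of the repository.

```python
import os
import struct
import tempfile


def write_f32_le(path, values):
    """Write a flat sequence of floats as little-endian float32 (same layout as '<f4')."""
    with open(path, "wb") as fh:
        fh.write(struct.pack("<%df" % len(values), *values))


def read_f32_le(path):
    """Read a whole file of little-endian float32 values into a list."""
    with open(path, "rb") as fh:
        raw = fh.read()
    if len(raw) % 4:
        raise ValueError("file length is not a multiple of 4 bytes")
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch: %d vs %d" % (len(a), len(b)))
    return max(abs(x - y) for x, y in zip(a, b))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_out.bin")   # hypothetical file name
    reference = [0.5, -1.25, 3.0]              # chosen to be exact in float32
    write_f32_le(path, reference)
    loaded = read_f32_le(path)

worst = max_abs_diff(reference, loaded)
print(len(loaded), worst)
```

When comparing a second implementation against the stored output, an exact match is not generally expected; the acceptable tolerance is a project decision that this sketch does not attempt to set.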
--- a/model.py
+++ b/model.py
+# model.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""
+All-convolutional (FCN) target estimator for the C5P DNN experiment. # By Claude on 06/13/2026
+
+Per Andrey's minimal-segment / siamese-reuse insight: train the per-location operator on
+small patches (one receptive field), deploy the SAME weights slid over the full frame.
+So there are NO fully-connected layers - the patch-net maps [B, N, P, P] -> [B, C, 1, 1]
+on a P=24 patch, and the identical net run on a larger image yields a dense grid (FCN).
+
+Per-patch output channels C = 1 (detection logit) + Vdim*Vdim (velocity logits, default
+121) + 2 (sub-pixel dx,dy offset). At inference the full P(x,y,vx,vy) field is:
+  x,y  <- convolution position + (dx,dy) offset head
+  vx,vy<- softmax(velocity logits) per location
+detection <- sigmoid(det logit) per location.
+
+Raw branch: learned conv encoder on the conditioned frames (frames as input channels).
+Whitened branch (added next): same head, but the first layer is the FIXED matched-filter
+conv (frozen) - so the comparison is learned-front-end vs analytical-front-end, same back.
+"""
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class RawFCN(nn.Module):
+    """Patch -> [det, vel(Vdim^2), offset(2)]. Valid convs + maxpools reduce the PxP patch
+    (P = 24, 32 or 52) -> 1x1, so it slides as an FCN over larger inputs (output grid downsampled
+    by the total pool stride). In "reg" mode the output is [det, Vx, Vy, logvar, offset(2)]."""
+    def __init__(self, n_frames=8, vel_radius=5, patch=24, ch=None, velocity_mode="grid", vmax=1.4):
+        super().__init__()
+        self.vel_radius = vel_radius
+        self.vdim = 2 * vel_radius + 1
+        self.patch = patch
+        self.velocity_mode = velocity_mode  # "grid"=121-cell softmax (legacy) | "reg"=continuous Vx,Vy,logvar. By Claude on 06/17/2026
+        self.vmax = vmax                     # reg velocity bound (px/frame): v = vmax*tanh(raw) -> no grid, no corners
+        # Valid 3x3 convs + 2x2 maxpools reduce patch -> 1x1 (RF = patch). Pools always act on EVEN
+        # sizes so the RF stays centered (preserves the cx0 = P/2 half-pixel alignment). By Claude 06/16/2026:
+        # patch=32 widens the attention area so off-center suppression reaches the full alias distance
+        # (off_max = P/2-margin-1 = 13 > vmax*(N-1) = 11.2 px) -> trains away the trajectory-alias ghosts.
+        if patch == 24:
+            if ch is None: ch = (32, 32, 48, 48, 64)
+            c0, c1, c2, c3, c4 = ch
+            self.features = nn.Sequential(
+                nn.Conv2d(n_frames, c0, 3), nn.ReLU(inplace=True),   # 24->22
+                nn.Conv2d(c0, c1, 3), nn.ReLU(inplace=True),         # 22->20
+                nn.MaxPool2d(2),                                      # 20->10
+                nn.Conv2d(c1, c2, 3), nn.ReLU(inplace=True),         # 10->8
+                nn.Conv2d(c2, c3, 3), nn.ReLU(inplace=True),         # 8->6
+                nn.MaxPool2d(2),                                      # 6->3
+                nn.Conv2d(c3, c4, 3), nn.ReLU(inplace=True),         # 3->1
+            )
+            last = c4
+        elif patch == 32:
+            if ch is None: ch = (32, 32, 48, 48, 64, 64)
+            c0, c1, c2, c3, c4, c5 = ch
+            self.features = nn.Sequential(
+                nn.Conv2d(n_frames, c0, 3), nn.ReLU(inplace=True),   # 32->30
+                nn.Conv2d(c0, c1, 3), nn.ReLU(inplace=True),         # 30->28
+                nn.MaxPool2d(2),                                      # 28->14
+                nn.Conv2d(c1, c2, 3), nn.ReLU(inplace=True),         # 14->12
+                nn.Conv2d(c2, c3, 3), nn.ReLU(inplace=True),         # 12->10
+                nn.MaxPool2d(2),                                      # 10->5
+                nn.Conv2d(c3, c4, 3), nn.ReLU(inplace=True),         # 5->3
+                nn.Conv2d(c4, c5, 3), nn.ReLU(inplace=True),         # 3->1
+            )
+            last = c5
+        elif patch == 52:
+            # v2 Stage-1 MF at the wider velocity range (vmax~2.5-3, N=9 -> reach vmax*(N-1)=24 -> RF 52). By Claude on 06/18/2026
+            # 52->50->48->(pool)24->22->20->(pool)10->8->6->(pool)3->1 : RF=52, 7 conv3 + 3 pool.
+            if ch is None: ch = (32, 32, 48, 48, 64, 64, 96)
+            c0, c1, c2, c3, c4, c5, c6 = ch
+            self.features = nn.Sequential(
+                nn.Conv2d(n_frames, c0, 3), nn.ReLU(inplace=True),   # 52->50
+                nn.Conv2d(c0, c1, 3), nn.ReLU(inplace=True),         # 50->48
+                nn.MaxPool2d(2),                                      # 48->24
+                nn.Conv2d(c1, c2, 3), nn.ReLU(inplace=True),         # 24->22
+                nn.Conv2d(c2, c3, 3), nn.ReLU(inplace=True),         # 22->20
+                nn.MaxPool2d(2),                                      # 20->10
+                nn.Conv2d(c3, c4, 3), nn.ReLU(inplace=True),         # 10->8
+                nn.Conv2d(c4, c5, 3), nn.ReLU(inplace=True),         # 8->6
+                nn.MaxPool2d(2),                                      # 6->3
+                nn.Conv2d(c5, c6, 3), nn.ReLU(inplace=True),         # 3->1
+            )
+            last = c6
+        else:
+            raise ValueError("RawFCN: unsupported patch %d (use 24, 32, or 52)" % patch)
+        # grid: det(1) + Vdim^2 vel logits + offset(2); reg: det(1) + Vx,Vy(2) + logvar(1) + offset(2). By Claude on 06/17/2026
+        self.out_ch = (1 + self.vdim * self.vdim + 2) if (velocity_mode == "grid") else (1 + 2 + 1 + 2)
+        self.head = nn.Conv2d(last, self.out_ch, 1)              # 1x1 -> per-location output
+
+    def forward(self, x):
+        # x: [B, N, P, P]  -> feat [B, last, Hf, Wf] -> out [B, C, Hf, Wf]
+        out = self.head(self.features(x))
+        if self.velocity_mode == "reg":
+            # bound Vx,Vy to +-vmax (no grid, no corners); det/logvar/offset stay raw. By Claude on 06/17/2026
+            v = self.vmax * torch.tanh(out[:, 1:3])
+            out = torch.cat([out[:, 0:1], v, out[:, 3:]], dim=1)
+        return out
+
+    def split(self, out):   # grid mode: (det_logit [B,*], vel_logits [B,Vdim^2,*], off [B,2,*])
+        det = out[:, 0]
+        vel = out[:, 1:1 + self.vdim * self.vdim]
+        off = out[:, 1 + self.vdim * self.vdim:]
+        return det, vel, off
+
+    def split_reg(self, out):  # reg mode: (det_logit [B,*], vel [B,2,*], logvar [B,*], off [B,2,*]). By Claude on 06/17/2026
+        return out[:, 0], out[:, 1:3], out[:, 3], out[:, 4:6]
+
+
+def fcn_loss(out, model, det_t, vel_soft_t, off_t, det_w=None, w_vel=1.0, w_off=1.0):
+    """Combined loss for center-supervised patches (output is [B,C,1,1]).
+    det_t   : [B]            0/1 detection labels
+    vel_soft_t: [B, Vdim^2]  soft P(vx,vy) target (positives only; ignored for negatives)
+    off_t   : [B, 2]         (dx,dy) target (positives only)
+    det_w   : [B] or None    per-sample detection-loss weight - heavier on near-miss off-center
+                             negatives (confusability ~ PSF overlap with center). None = all 1.
+    Returns (total, dict of components)."""
+    det_logit, vel_logits, off = model.split(out)
+    det_logit = det_logit.reshape(det_logit.shape[0])        # [B]
+    vel_logits = vel_logits.reshape(vel_logits.shape[0], -1) # [B, Vdim^2]
+    off = off.reshape(off.shape[0], 2)                       # [B, 2]
+
+    pos = (det_t > 0.5)
+    # detection: BCE over all samples (optionally per-sample weighted by confusability)
+    l_det = F.binary_cross_entropy_with_logits(det_logit, det_t, weight=det_w)
+    # velocity: cross-entropy to the soft target, positives only (KL up to a const)
+    if pos.any():
+        logp = F.log_softmax(vel_logits[pos], dim=1)
+        l_vel = -(vel_soft_t[pos] * logp).sum(dim=1).mean()
+        l_off = F.mse_loss(off[pos], off_t[pos])
+    else:
+        l_vel = vel_logits.sum() * 0.0
+        l_off = off.sum() * 0.0
+    total = l_det + w_vel * l_vel + w_off * l_off
+    return total, {"det": l_det.item(), "vel": l_vel.detach().item(), "off": l_off.detach().item()}
+
+
+def reg_loss(out, model, det_t, vx_t, vy_t, off_t, det_w=None, w_vel=1.0, w_off=1.0,
+             w_bias=0.0, bin_var=None, n_bins=4, mfsum_t=None, w_mfs=0.02):
+    """Loss for the continuous-velocity (reg) head. By Claude on 06/17/2026
+    Velocity = heteroscedastic isotropic-Gaussian NLL: 0.5*||v-vtrue||^2 * exp(-logvar) + logvar
+    (const dropped) -> learns BOTH velocity and its uncertainty sigma=exp(logvar/2). Plus det BCE,
+    offset MSE, and the batch-moment de-bias (pin per-bin mean gain to 1; bin_var = snr or s).
+    vx_t,vy_t,off_t : [B] / [B,2] tensors (px/frame, px).
+
+    MF-S mode (option a, Andrey 2026-06-18): if mfsum_t is given, channel 0 is no longer a det
+    LOGIT but a direct REGRESSION of the matched-filter path-sum S (sum of clean signal along the
+    trajectory) -> MSE, RAW output (no sigmoid; Java reads it raw). S is then the informative vote
+    weight: full at the true head, fading off-center, ~0 on noise - the same quantity the Hough
+    vote needs, so voteScatter weights by S directly (no separate path-sum pass)."""
+    det_logit, v, logvar, off = model.split_reg(out)
+    det_logit = det_logit.reshape(det_logit.shape[0])
+    v = v.reshape(v.shape[0], 2); logvar = logvar.reshape(logvar.shape[0]); off = off.reshape(off.shape[0], 2)
+    pos = (det_t > 0.5)
+    if mfsum_t is not None:
+        l_det = w_mfs * F.mse_loss(det_logit, mfsum_t)   # channel 0 = MF path-sum regression (raw)
+    else:
+        l_det = F.binary_cross_entropy_with_logits(det_logit, det_t, weight=det_w)
+    if pos.any():
+        dvx = v[pos, 0] - vx_t[pos]; dvy = v[pos, 1] - vy_t[pos]
+        sq = dvx * dvx + dvy * dvy
+        l_vel = (0.5 * sq * torch.exp(-logvar[pos]) + logvar[pos]).mean()   # heteroscedastic NLL
+        l_off = F.mse_loss(off[pos], off_t[pos])
+    else:
+        l_vel = v.sum() * 0.0; l_off = off.sum() * 0.0
+    # de-bias: per equal-population bin of bin_var, pooled LSQ gain through origin -> (gain-1)^2
+    l_bias = v.sum() * 0.0; nb = 0
+    if (w_bias > 0) and (bin_var is not None) and (int(pos.sum()) >= 4 * n_bins):
+        vp = v[pos]; tvx = vx_t[pos]; tvy = vy_t[pos]; b = bin_var[pos]
+        q = torch.linspace(0, 1, n_bins + 1, device=b.device, dtype=b.dtype)
+        edges = torch.quantile(b, q); edges[0] = edges[0] - 1e-4; edges[-1] = edges[-1] + 1e-4
+        for i in range(n_bins):
+            m = (b >= edges[i]) & (b < edges[i + 1])
+            if int(m.sum()) < 4:
+                continue
+            num = (vp[m, 0] * tvx[m] + vp[m, 1] * tvy[m]).sum()
+            den = (tvx[m] * tvx[m] + tvy[m] * tvy[m]).sum() + 1e-6
+            l_bias = l_bias + (num / den - 1.0) ** 2; nb += 1
+        if nb > 0:
+            l_bias = l_bias / nb
+    total = l_det + w_vel * l_vel + w_off * l_off + w_bias * l_bias
+    return total, {"det": l_det.item(), "vel": l_vel.detach().item(), "off": l_off.detach().item(),
+                   "bias": float(l_bias.detach()) if nb > 0 else 0.0}
+
+
+def vel_bias_loss(out, model, vx_true, vy_true, det_t, bin_var, n_bins=4, vel_decimate=4):
+    """Batch-moment de-biasing term (positives only). By Claude on 06/15/2026
+
+    Per equal-population bin of `bin_var` (quantile edges), the pooled least-squares gain
+    through the origin of the predicted softmax-centroid velocity vs the true velocity:
+        gain_bin = sum(pred . true) / sum(true . true)
+    penalized as (gain_bin - 1)^2, averaged over bins. Pins the MEAN velocity scale to 1 in
+    every bin - removing the systematic regime-dependent shrink and the ~0.97 clean bias -
+    WITHOUT penalizing per-sample scatter (variance is information-limited; left for the
+    recurrent layer to average out; only the bias, which the recurrent cannot fix, is removed).
+
+    bin_var : [B] quantity to bin by. For this IN-LOSS term, true SNR is the cleaner label
+      (uniform coverage; no coupling with the simultaneously-trained det head; no s-saturation)
+      - and the conditioning var need NOT exist at inference since the correction is baked into
+      the weights. Confidence s=sigmoid(det) is the right variable for a POST-HOC gain(s)
+      calibration instead (the only signal available at runtime). Membership uses bin_var
+      directly (detach s before passing); the gradient flows only through the velocity centroid.
+    vx_true, vy_true, det_t : [B] tensors (px/frame, px/frame, 0/1)."""
+    _, vel_logits, _ = model.split(out)
+    vel_logits = vel_logits.reshape(vel_logits.shape[0], -1)
+    pos = det_t > 0.5
+    if int(pos.sum()) < 4 * n_bins:
+        return out.sum() * 0.0
+    vdim = model.vdim
+    p = torch.softmax(vel_logits[pos], dim=1).reshape(-1, vdim, vdim)   # [Npos, vy, vx]
+    cells = torch.arange(vdim, device=p.device, dtype=p.dtype) - model.vel_radius
+    pvx = (p.sum(1) * cells).sum(1)              # predicted centroid, cells (vx inner)
+    pvy = (p.sum(2) * cells).sum(1)
+    tvx = vx_true[pos] * vel_decimate            # true, cells
+    tvy = vy_true[pos] * vel_decimate
+    b = bin_var[pos]
+    q = torch.linspace(0, 1, n_bins + 1, device=b.device, dtype=b.dtype)
+    edges = torch.quantile(b, q)                 # equal-population bins, robust to skew
+    edges[0] = edges[0] - 1e-4; edges[-1] = edges[-1] + 1e-4
+    loss = pvx.sum() * 0.0; nb = 0
+    for i in range(n_bins):
+        m = (b >= edges[i]) & (b < edges[i + 1])
+        if int(m.sum()) < 4:
+            continue
+        num = (pvx[m] * tvx[m] + pvy[m] * tvy[m]).sum()
+        den = (tvx[m] * tvx[m] + tvy[m] * tvy[m]).sum() + 1e-6
+        loss = loss + (num / den - 1.0) ** 2
+        nb += 1
+    return loss / nb if nb > 0 else loss
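The size comments in the three `RawFCN` stacks (24, 32 and 52) can be cross-checked with a few lines of arithmetic. This is a sketch under the assumption that every conv is a valid 3x3 with stride 1 and every pool is 2x2 with stride 2, as in the diff; the `trace` helper and the string encoding of the stacks are illustrative only.

```python
def trace(patch, layers):
    """Return (output size, receptive field) for a stack of valid layers.

    layers: string of 'c' (valid 3x3 conv, stride 1) and 'p' (2x2 max-pool, stride 2).
    """
    size, rf, jump = patch, 1, 1
    for kind in layers:
        if kind == "c":
            size -= 2            # a valid 3x3 conv trims one pixel per side
            rf += 2 * jump       # and widens the receptive field by (k-1)*jump
        elif kind == "p":
            if size % 2:
                raise ValueError("pool applied to odd size %d" % size)
            size //= 2
            rf += jump           # (2-1)*jump
            jump *= 2            # input-pixel spacing of the outputs doubles
        else:
            raise ValueError("unknown layer kind %r" % kind)
    return size, rf


# Layer order copied from the three nn.Sequential definitions in the diff.
STACKS = {24: "ccpccpc", 32: "ccpccpcc", 52: "ccpccpccpc"}
results = {p: trace(p, s) for p, s in STACKS.items()}

vel_radius = 5
grid_channels = 1 + (2 * vel_radius + 1) ** 2 + 2   # det + velocity logits + offset
reg_channels = 1 + 2 + 1 + 2                        # det + (Vx, Vy) + logvar + offset

for p, (size, rf) in sorted(results.items()):
    print("patch %d -> output %dx%d, receptive field %d" % (p, size, size, rf))
print(grid_channels, reg_channels)
```

Each stack reduces its patch to 1x1 with a receptive field equal to the patch size, and every pool sees an even input, which is the condition the comments cite for keeping the receptive field centered.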
--- a/nettest.py
+++ b/nettest.py
+#!/usr/bin/env python3
+# nettest.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Raw-socket throughput tester (stdlib only). By Claude on 06/20/2026.
+  server: nettest.py server [port]
+  client: nettest.py client HOST PORT MB DIR   (DIR=up client->server | down server->client)
+Receiver times wall-clock from first to last byte and reports MB/s + Gbit/s (no encryption,
+single bulk stream -> clean link throughput)."""
+import socket, struct, sys, time
+
+CHUNK = 1 << 20  # 1 MiB
+
+
+def recv_timed(conn, nbytes):
+    got = 0
+    buf = bytearray(CHUNK)
+    t0 = None
+    while got < nbytes:
+        n = conn.recv_into(buf, min(CHUNK, nbytes - got))
+        if not n:
+            break
+        if t0 is None:
+            t0 = time.perf_counter()
+        got += n
+    dt = time.perf_counter() - t0 if t0 is not None else 0.0
+    return got, dt
+
+
+def send_all(conn, nbytes):
+    block = b"\0" * CHUNK
+    sent = 0
+    while sent < nbytes:
+        sent += conn.send(block[:min(CHUNK, nbytes - sent)])
+    return sent
+
+
+def server(port):
+    s = socket.socket()
+    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+    s.bind(("0.0.0.0", port))
+    s.listen(1)
+    print(f"nettest server on :{port}", flush=True)
+    while True:
+        c, a = s.accept()
+        c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
+        try:
+            hdr = c.recv(16, socket.MSG_WAITALL)   # block for the full 16-byte header (TCP may split it)
+            if len(hdr) < 16:
+                c.close(); continue
+            direction, nbytes = struct.unpack(">qq", hdr)  # 0=up(server recvs) 1=down(server sends)
+            if direction == 0:
+                got, dt = recv_timed(c, nbytes)
+                mbps = got / 1e6 / dt if dt else 0
+                print(f"  UP recv {got/1e6:.0f}MB {dt*1e3:.1f}ms = {mbps:.0f} MB/s ({mbps*8/1000:.2f} Gbit/s)", flush=True)
+                c.sendall(struct.pack(">d", dt))
+            else:
+                send_all(c, nbytes)
+        finally:
+            c.close()
+
+
+def client(host, port, mb, direction):
+    nbytes = mb * (1 << 20)
+    c = socket.socket()
+    c.connect((host, port))
+    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
+    d = 0 if direction == "up" else 1
+    c.sendall(struct.pack(">qq", d, nbytes))
+    if d == 0:
+        send_all(c, nbytes)
+        dt, = struct.unpack(">d", c.recv(8, socket.MSG_WAITALL))   # wait for all 8 bytes of the timing reply
+        mbps = nbytes / 1e6 / dt if dt else 0
+        print(f"UP   client->server {nbytes/1e6:.0f}MB: {mbps:.0f} MB/s = {mbps*8/1000:.2f} Gbit/s (server-timed)")
+    else:
+        got, dt = recv_timed(c, nbytes)
+        mbps = got / 1e6 / dt if dt else 0
+        print(f"DOWN server->client {got/1e6:.0f}MB: {mbps:.0f} MB/s = {mbps*8/1000:.2f} Gbit/s")
+    c.close()
+
+
+if __name__ == "__main__":
+    if sys.argv[1] == "server":
+        server(int(sys.argv[2]) if len(sys.argv) > 2 else 5578)
+    else:
+        client(sys.argv[2], int(sys.argv[3]), int(sys.argv[4]), sys.argv[5])
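The tester's wire format is a 16-byte header (two big-endian int64 values: direction and byte count) followed by the bulk payload. Because TCP is a byte stream, a single `recv` may return only part of that header. The sketch below shows an explicit read-exactly loop as one possible alternative; `recv_exact`, `HEADER` and `mbps_to_gbit` are hypothetical names, and a local `socket.socketpair()` stands in for a real connection.

```python
import socket
import struct

HEADER = struct.Struct(">qq")   # direction, nbytes: two big-endian int64 = 16 bytes


def recv_exact(conn, n):
    """Read exactly n bytes from a stream socket, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed after %d of %d bytes" % (len(buf), n))
        buf += chunk
    return bytes(buf)


def mbps_to_gbit(mbps):
    """Convert MB/s (10^6 bytes per second) to Gbit/s, as in the tester's report lines."""
    return mbps * 8 / 1000


a, b = socket.socketpair()
try:
    packed = HEADER.pack(1, 64 * (1 << 20))     # direction 1 = down, 64 MiB
    a.sendall(packed[:5])                       # deliberately split the header
    a.sendall(packed[5:])
    direction, nbytes = HEADER.unpack(recv_exact(b, HEADER.size))
finally:
    a.close()
    b.close()

print(direction, nbytes, mbps_to_gbit(1250))
```

The loop makes the early-close case an explicit error instead of a short buffer that fails later during unpacking.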
--- a/partial_votes.py
+++ b/partial_votes.py
+# partial_votes.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Single-target partial-vote visualization. By Claude on 2026-06-19 (design Andrey)
+
+One target (Vx,Vy via args, default 1,0), one timestamp, no interference. Runs frozen Stage 1
+(mf_s) -> per-pixel (Vx,Vy,S). Writes an ImageJ hyperstack TIFF:
+  slice 0      = TOTAL vote (sum of all contributions = accS)
+  slice 1..K   = each voter pixel's S-weighted splat at its tail, slice-labeled "X{dx}:Y{dy}:S={s}"
+                 (dx,dy = pixel offset from the true head; voters sorted by descending S, capped)
+Plus the vote-weighted readout at the peak:  (Vx,Vy) = sum(vote*V)/sum(vote),  head = tail + V*(N-1).
+Usage: python partial_votes.py [model.pt] [out.tif] [vx] [vy] [amp] [noise0/1]
+"""
+import sys, numpy as np, torch
+import synth, stage2 as S2
+from model import RawFCN
+import tifffile
+
+dev = "cuda" if torch.cuda.is_available() else "cpu"
+N, vmax, P = 9, 2.8, 52; half, Nm1, HW = P // 2, N - 1, 120; F_ = HW - P + 1
+ckpt  = sys.argv[1] if len(sys.argv) > 1 else "/work/runs/stage1_mfs2/model.pt"
+out   = sys.argv[2] if len(sys.argv) > 2 else "/work/runs/partial_votes.tif"
+Vx    = float(sys.argv[3]) if len(sys.argv) > 3 else 1.0
+Vy    = float(sys.argv[4]) if len(sys.argv) > 4 else 0.0
+amp   = float(sys.argv[5]) if len(sys.argv) > 5 else 5.0
+noise = (len(sys.argv) > 6 and sys.argv[6] == "1")
+THR_FRAC, MAXV = 0.15, 220                      # voter gate (frac of max S) and slice cap
+
+s1 = RawFCN(n_frames=N, patch=P, velocity_mode="reg", vmax=vmax).to(dev)
+s1.load_state_dict(torch.load(ckpt, map_location=dev)["model"]); s1.eval()
+
+# one target, head centered, causal MB (blur_frac 1.0, matches training)
+rng = np.random.default_rng(0)
+fr = rng.standard_normal((N, HW, HW)).astype(np.float32) if noise else np.zeros((N, HW, HW), np.float32)
+Hx = Hy = HW / 2.0; subs = np.arange(4) * 0.25
+for i in range(N):
+    acc = np.zeros((HW, HW))
+    for ss in subs: acc += synth.halfcos_bump(Hx - Vx * (i + ss), Hy - Vy * (i + ss), HW, HW)
+    fr[i] += (amp * acc / 4).astype(np.float32)
+
+s_t, vx_t, vy_t = S2.stage1_dense(s1, fr, dev=dev, mf_s=True)
+accS, accVx, accVy = S2.vote_scatter(s_t, vx_t, vy_t, Nm1)
+s, vx, vy = s_t.cpu().numpy(), vx_t.cpu().numpy(), vy_t.cpu().numpy()
+accS_n, accVx_n, accVy_n = accS.cpu().numpy(), accVx.cpu().numpy(), accVy.cpu().numpy()
+hxf, hyf = Hx - half, Hy - half                 # true head in field coords
+
+# voters (s above gate), strongest first
+thr = THR_FRAC * float(s.max())
+vox = sorted([(i, j) for i in range(F_) for j in range(F_) if s[i, j] > thr], key=lambda p: -s[p])
+if len(vox) > MAXV: vox = vox[:MAXV]
+
+slices = [accS_n.astype(np.float32)]; labels = ["TOTAL VOTE (sum)"]
+for (i, j) in vox:
+    sl = np.zeros((F_, F_), np.float32); sval = float(s[i, j])
+    tx, ty = j - vx[i, j] * Nm1, i - vy[i, j] * Nm1
+    x0, y0 = int(np.floor(tx)), int(np.floor(ty)); fx, fy = tx - x0, ty - y0
+    for dx in (0, 1):
+        for dy in (0, 1):
+            xi, yi = x0 + dx, y0 + dy
+            if 0 <= xi < F_ and 0 <= yi < F_:
+                sl[yi, xi] += sval * (1 - fx if dx == 0 else fx) * (1 - fy if dy == 0 else fy)
+    slices.append(sl)
+    labels.append("X%+d:Y%+d:S=%.2f" % (j - hxf, i - hyf, sval))
+
+tifffile.imwrite(out, np.stack(slices).astype(np.float32), imagej=True, metadata={"Labels": labels})
+
+# vote-weighted readout at the peak
+pk = np.unravel_index(accS_n.argmax(), accS_n.shape); Sp = accS_n[pk]
+vcx, vcy = accVx_n[pk] / Sp, accVy_n[pk] / Sp
+print("voters: %d (gate S>%.2f), wrote %d slices -> %s" % (len(vox), thr, len(slices), out))
+print("peak tail @field (x=%d,y=%d)  vote-sum S=%.1f" % (pk[1], pk[0], Sp))
+print("vote-weighted velocity: Vx=%.3f Vy=%.3f   (true %.2f,%.2f)" % (vcx, vcy, Vx, Vy))
+print("=> head = tail + V*(N-1) = field (x=%.1f,y=%.1f)   true head field (x=%.1f,y=%.1f)" %
+      (pk[1] + vcx * Nm1, pk[0] + vcy * Nm1, hxf, hyf))
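Each per-voter slice above is a bilinear splat: the voter's weight S is shared among the four cells surrounding its fractional tail position, so the slice sums to S whenever the tail lies inside the field. The following is a minimal pure-Python sketch of that step with assumed toy numbers (an 8x8 field, one voter); `splat` is an illustrative helper, not the repository's `vote_scatter`.

```python
import math


def splat(grid, tx, ty, weight):
    """Bilinearly add `weight` at fractional (tx, ty); cells outside the grid are dropped."""
    x0, y0 = math.floor(tx), math.floor(ty)
    fx, fy = tx - x0, ty - y0
    for dx, wx in ((0, 1.0 - fx), (1, fx)):
        for dy, wy in ((0, 1.0 - fy), (1, fy)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= yi < len(grid) and 0 <= xi < len(grid[0]):
                grid[yi][xi] += weight * wx * wy


size = 8
grid = [[0.0] * size for _ in range(size)]

n_minus_1 = 8                       # N - 1 frame intervals, as Nm1 in the script
head_x, head_y = 5.0, 5.5           # voter pixel (where the head is observed)
vx, vy, s = 0.21875, 0.0, 2.0       # toy per-pixel velocity and vote weight

tail_x = head_x - vx * n_minus_1    # tail = head - V*(N-1)
tail_y = head_y - vy * n_minus_1
splat(grid, tail_x, tail_y, s)

total = sum(sum(row) for row in grid)
recovered_head_x = tail_x + vx * n_minus_1   # readout: head = tail + V*(N-1)
print(tail_x, tail_y, total, recovered_head_x)
```

With these numbers the tail falls at (3.25, 5.5), so the weight is split 0.75 / 0.25 between columns 3 and 4 in each of rows 5 and 6.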
--- a/run_infer_server.sh
+++ b/run_infer_server.sh
+#!/usr/bin/env bash
+# run_infer_server.sh - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+# Start/stop the CUAS DGX inference server (PyTorch RawFCN, cuDNN) in the NGC container.
+# By Claude on 06/20/2026.  Run on the DGX (elphel@192.168.0.62).
+#   start|stop|logs|status        env: RUN=runs/<model>  RUN2=runs/<l2>  PORT=5577
+set -euo pipefail
+NAME=cuas_infer
+IMG=nvcr.io/nvidia/pytorch:25.10-py3
+CODE=/home/elphel/c5p_dnn
+RUN="${RUN:-runs/weighted9_pm_s}"
+RUN2="${RUN2:-}"                 # optional Layer-2 run dir; empty -> L1-only. By Claude 06/22/2026
+PORT="${PORT:-5577}"
+case "${1:-start}" in
+  start)
+    docker rm -f "$NAME" >/dev/null 2>&1 || true
+    L2ARG=""; [ -n "$RUN2" ] && L2ARG="--l2run $RUN2"
+    docker run -d --name "$NAME" --gpus all --network host \
+      -v "$CODE":/work -w /work "$IMG" \
+      python infer_server.py --run "$RUN" $L2ARG --port "$PORT" >/dev/null
+    echo "started $NAME (run=$RUN l2=${RUN2:-off} port=$PORT)"; sleep 3; docker logs "$NAME"
+    ;;
+  stop)   docker rm -f "$NAME" >/dev/null 2>&1 && echo "stopped" || echo "not running" ;;
+  logs)   docker logs --tail 60 "$NAME" ;;
+  status) docker ps --filter "name=$NAME" --format "{{.Names}} {{.Status}}" ;;
+  *) echo "usage: $0 {start|stop|logs|status}  (env: RUN=, RUN2=, PORT=)"; exit 1 ;;
+esac
--- a/run_l2A.sh
+++ b/run_l2A.sh
+#!/usr/bin/env bash
+# run_l2A.sh - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+# Option A: baseline sigma=1.5 vs FWHM~2 target sigma=0.85, SAME v1 data. By Claude 06/22/2026
+set -uo pipefail
+cd /work
+COMMON="--l1 runs/weighted9_pm/model.pt --nseq 128 --T 48 --steps 4000 --pos_weight 30"
+echo "=== Option A run 1: sigma=1.5 (baseline)        $(date) ==="
+python layer2_train_A.py $COMMON --sigma 1.5  --out runs/l2_A_s15  2>&1 | tee runs/l2_A_s15.log
+echo "=== Option A run 2: sigma=0.85 (FWHM~2 target)  $(date) ==="
+python layer2_train_A.py $COMMON --sigma 0.85 --out runs/l2_A_s085 2>&1 | tee runs/l2_A_s085.log
+echo "=== Option A DONE $(date) ==="
+echo "----- FWHM summary -----"
+grep -h "output blob FWHM" runs/l2_A_s15.log runs/l2_A_s085.log
--- a/run_l2_chain.sh
+++ b/run_l2_chain.sh
+#!/usr/bin/env bash
+# run_l2_chain.sh - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+# Overnight chain: parametric L2 (position+presence) trained WITH band-pass gaps, swept over
+# HF x duty-offset to find the best coast/maintain regime. By Claude on 06/22/2026.
+# Each run is isolated (own out dir + log); a failing run does NOT abort the rest. Run on the DGX:
+#   tmux new-session -d -s l2chain "docker exec cuas_infer bash /work/run_l2_chain.sh"
+set -uo pipefail
+cd /work
+L1="runs/weighted9_pm/model.pt"
+COMMON="--l1 $L1 --nseq 128 --T 64 --steps 4000 --gaps --starter_len 8"
+
+run () {  # run <name> <extra args...>
+    local name="$1"; shift
+    echo "=== $name  $(date) ==="
+    python layer2_train_P.py $COMMON "$@" --out "runs/$name" 2>&1 | tee "runs/$name.log" \
+        || echo "!! $name FAILED (continuing)"
+}
+
+# HF x offset sweep (LF=3 fixed). gentle camels (HF 9-12), deeper offset => longer clean gaps.
+run l2P_h9_o2      --bp_hi 9  --duty_offset -0.2
+run l2P_h9_o4      --bp_hi 9  --duty_offset -0.4
+run l2P_h12_o3     --bp_hi 12 --duty_offset -0.3
+run l2P_h12_o5     --bp_hi 12 --duty_offset -0.5
+# one longer high-step run on the conservative-gaps setting for a quality model
+run l2P_h9_o3_long --bp_hi 9  --duty_offset -0.3 --steps 8000
+
+echo "=== L2 CHAIN DONE  $(date) ==="
+echo "----- summary (position MAE + presence separation per run) -----"
+for f in runs/l2P_*.log; do
+    echo "## ${f##*/}"
+    grep -hE "position MAE|presence prob" "$f" 2>/dev/null || echo "   (no result line — check $f)"
+done
--- a/run_l2_dense.sh
+++ b/run_l2_dense.sh
+#!/usr/bin/env bash
+# run_l2_dense.sh - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+# Relaunch the two dense rungs (bug fixed). By Claude 06/23/2026
+set -uo pipefail
+cd /work
+DENSE="--l1 runs/weighted9_pm/model.pt --nseq 128 --T 48 --steps 4000 --sigma 0.85"
+echo "=== seq1: dense sigma=0.85 plain BCE   $(date) ==="
+python layer2_train_A.py $DENSE --out runs/seq1_dense_s085 2>&1 | tee runs/seq1_dense_s085.log || echo "!! seq1 FAILED"
+echo "=== seq2: dense Mexican-hat            $(date) ==="
+python layer2_train_A.py $DENSE --mexhat --out runs/seq2_dense_mexhat 2>&1 | tee runs/seq2_dense_mexhat.log || echo "!! seq2 FAILED"
+echo "=== DENSE DONE $(date) ==="
+for f in runs/seq1_dense_s085.log runs/seq2_dense_mexhat.log; do echo "## ${f##*/}"; grep -hE "dense result|FWHM @ target|pos-MAE" "$f"; done
--- a/run_l2_seq.sh
+++ b/run_l2_seq.sh
+#!/usr/bin/env bash
+# run_l2_seq.sh - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+# Overnight ablation ladder — separate single-change models to evaluate & compare. By Claude 06/22/2026.
+#   seq1  dense, sigma=0.85, plain BCE          (sharper target only)
+#   seq2  dense, sigma=0.85, + Mexican-hat      (center-surround / known-empty ring)
+#   seq4  parametric + gap modulation           ("fancy")
+# (#3 parametric-no-gap = the shakedown, run separately.) Each run is isolated, own out dir + log;
+# a failing run does NOT abort the rest. Run on the DGX:
+#   tmux new-session -d -s l2seq "docker exec cuas_infer bash /work/run_l2_seq.sh"
+set -uo pipefail
+cd /work
+DENSE="--l1 runs/weighted9_pm/model.pt --nseq 128 --T 48 --steps 4000 --sigma 0.85"
+
+echo "=== seq1: dense sigma=0.85 plain BCE      $(date) ==="
+python layer2_train_A.py $DENSE --out runs/seq1_dense_s085 2>&1 | tee runs/seq1_dense_s085.log || echo "!! seq1 FAILED"
+
+echo "=== seq2: dense Mexican-hat (core 0.85)   $(date) ==="
+python layer2_train_A.py $DENSE --mexhat --out runs/seq2_dense_mexhat 2>&1 | tee runs/seq2_dense_mexhat.log || echo "!! seq2 FAILED"
+
+echo "=== seq4: parametric + gaps (fancy)       $(date) ==="
+python layer2_train_P.py --l1 runs/weighted9_pm/model.pt --nseq 128 --T 64 --steps 4000 \
+    --gaps --bp_hi 9 --duty_offset -0.3 --starter_len 8 --out runs/seq4_param_gaps \
+    2>&1 | tee runs/seq4_param_gaps.log || echo "!! seq4 FAILED"
+
+echo "=== L2 SEQUENCE DONE  $(date) ==="
+echo "----- comparison summary -----"
+for f in runs/seq1_dense_s085.log runs/seq2_dense_mexhat.log runs/seq4_param_gaps.log; do
+    echo "## ${f##*/}"
+    grep -hE "FWHM @ target|pos-MAE|presence prob|dense result|Parametric result" "$f" 2>/dev/null || echo "   (no result line — check $f)"
+done
--- a/shake.py
+++ b/shake.py
+# shake.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P v2 diagnostic: "shake the stack". By Claude on 06/18/2026
+
+Andrey's intuition: looking through a noisy stack, shake tilt (velocity) and x0,y0 and watch what
+persists. Maps the matched-filter path-sum S(dx, dvx) around a horizontal target and around a noise
+patch. Expectation: a target shows a central peak (all N frames coincide) with a FAN of weaker
+ridges (single-frame aliases, slope dvx=dx/i) all crossing AT the head -> the head is the
+convergence of many ridges (robust), not a lone max. Noise: no coherent crossing.
+"""
+import numpy as np, synth
+N = 9; HW = 140
+
+def render_h(V, amp, noise, blur_frac=1.0, nb=4, seed=1):
+    rng = np.random.default_rng(seed)
+    frames = rng.standard_normal((N, HW, HW)).astype(np.float32) if noise else np.zeros((N, HW, HW), np.float32)
+    Hx = HW / 2 + V * (N - 1) * 0.5; Hy = HW / 2
+    subs = np.arange(nb) * (blur_frac / nb)
+    for i in range(N):
+        acc = np.zeros((HW, HW))
+        for ss in subs: acc += synth.halfcos_bump(Hx - V * (i + ss), Hy, HW, HW)
+        frames[i] += (amp * acc / nb).astype(np.float32)
+    return frames, Hx, Hy
+
+def bilin(img, x, y):
+    ix = int(np.floor(x)); iy = int(np.floor(y)); fx = x - ix; fy = y - iy
+    if ix < 0 or ix >= HW - 1 or iy < 0 or iy >= HW - 1: return 0.0
+    return float((1-fy)*((1-fx)*img[iy,ix]+fx*img[iy,ix+1]) + fy*((1-fx)*img[iy+1,ix]+fx*img[iy+1,ix+1]))
+
+def landscape(frames, x0, y0, V, DX, DVX):
+    L = np.zeros((len(DX), len(DVX)))
+    for a, dx in enumerate(DX):
+        for b, dvx in enumerate(DVX):
+            L[a, b] = sum(bilin(frames[i], x0 + dx - (V + dvx) * i, y0) for i in range(N))
+    return L
+
+CH = " .:-=+*#%@"
+def show(L, DX, DVX, title):
+    mx = L.max()
+    print(title + "  (peak path-sum %.1f; rows=dx, cols=dvx %.1f..%.1f)" % (mx, DVX[0], DVX[-1]))
+    for a, dx in enumerate(DX):
+        row = "".join(CH[min(9, max(0, int(round(9 * L[a, b] / mx))))] if mx > 0 else " " for b in range(len(DVX)))
+        print("  dx=%+3d |%s|%s" % (dx, row, " <- dx=0" if dx == 0 else ""))
+    z = int(np.argmin(np.abs(DVX)))
+    print(" " * (10 + z) + "^dvx=0  (ramp ridge: dvx=dx/%d)" % (N - 1))
+
+DX = np.arange(-12, 13); DVX = np.arange(-1.4, 1.45, 0.1)
+for amp in (5, 3, 2):
+    fr, Hx, Hy = render_h(1.0, amp, True)
+    show(landscape(fr, Hx, Hy, 1.0, DX, DVX), DX, DVX, "\n===== TARGET  V=1.0  amp=%d  (noisy) =====" % amp)
+    show(landscape(fr, 30, 30, 1.0, DX, DVX), DX, DVX, "----- NOISE patch (same stack, off-target) -----")
--- a/stage2.py
+++ b/stage2.py
+# stage2.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P DNN v2 — Stage 2: learned Hough-vote refinement. By Claude on 06/18/2026
+
+Stage 1 (frozen RawFCN reg, patch 52) emits a per-pixel (Vx,Vy,s) field; around a target it forms
+the alias ramp V_P = V_true + (P-H)/(N-1), so every pixel back-projects to the SAME tail
+T = P - V_P*(N-1). Stage 2 splats each pixel's s-weighted vote at T (differentiable bilinear
+scatter), then a small conv REFINES the accumulator into a clean tail-detection + consensus velocity.
+A real target gets many coherent votes -> sharp peak; a ghost/alias has no consensus. This replaces
+the v1 velocity-softmax competition AND the (backwards) ghostbuster. Vote target = tail; head = tail + V*(N-1).
+
+Training: Stage 1 FROZEN. Fields are PRE-COMPUTED once (dense Stage-1 over each field) and cached, then
+the refine net trains fast on the cache (deep-supervision reference; e2e + latent channels come later).
+"""
+import argparse, numpy as np, torch, torch.nn as nn, torch.nn.functional as F
+import synth
+from model import RawFCN
+
+
+def gen_field(rng, HW, ntgt, N, vmax, snr_rng, blur_frac=1.0, nb=4, margin=50):
+    """Render ntgt causal motion-blurred (MB) targets (random head/velocity) + Gaussian noise. By Claude on 06/18/2026
+    Returns frames [N,HW,HW], list of (hx,hy,vx,vy). Heads kept in [margin, HW-margin]."""
+    snr = float(np.exp(rng.uniform(np.log(snr_rng[0]), np.log(snr_rng[1]))))
+    frames = rng.standard_normal((N, HW, HW)).astype(np.float32)
+    subs = np.arange(nb) * (blur_frac / nb)
+    tgts = []
+    for _ in range(ntgt):
+        for _try in range(50):
+            vx = rng.uniform(-vmax, vmax); vy = rng.uniform(-vmax, vmax)
+            if vx*vx + vy*vy > vmax*vmax: continue
+            hx = rng.uniform(margin, HW-margin); hy = rng.uniform(margin, HW-margin)
+            xs = [hx - vx*i for i in range(N)]; ys = [hy - vy*i for i in range(N)]
+            if min(xs) >= 3 and max(xs) <= HW-4 and min(ys) >= 3 and max(ys) <= HW-4: break
+        else:
+            continue
+        for i in range(N):
+            acc = np.zeros((HW, HW), np.float64)
+            for ss in subs: acc += synth.halfcos_bump(hx - vx*(i+ss), hy - vy*(i+ss), HW, HW)
+            frames[i] += (snr * acc / nb).astype(np.float32)
+        tgts.append((hx, hy, vx, vy))
+    return frames, tgts
+
+
+def stage1_dense(net, frames, P=52, dev="cuda", chunk=8192, mf_s=False):
+    """Run frozen Stage 1 stride-1 over a field -> (s,Vx,Vy) maps at field-res (HW-P+1). By Claude on 06/18/2026
+    mf_s=True (option a): channel 0 is the RAW matched-filter path-sum S (clamp>=0), not a det
+    logit -> S is itself the informative vote weight, so no separate mf_sum() pass is needed."""
+    N, HW, _ = frames.shape
+    x = torch.from_numpy(frames[None]).to(dev)                       # [1,N,HW,HW]
+    cols = F.unfold(x, kernel_size=P)                                # [1, N*P*P, L]
+    L = cols.shape[-1]; F_ = HW - P + 1
+    cols = cols.reshape(N, P, P, L).permute(3, 0, 1, 2).contiguous() # [L,N,P,P]
+    outs = []
+    with torch.no_grad():
+        for b in range(0, L, chunk):
+            o = net(cols[b:b+chunk])                                 # [bs,6,1,1]
+            outs.append(o[:, :, 0, 0])
+    o = torch.cat(outs, 0)                                           # [L,6]
+    s = o[:, 0].clamp(min=0.0) if mf_s else torch.sigmoid(o[:, 0]); v = o[:, 1:3]
+    return s.reshape(F_, F_), v[:, 0].reshape(F_, F_), v[:, 1].reshape(F_, F_)
+
+
+def mf_sum(frames, vx, vy, half, N):
+    """Matched-filter response = sum of data along each pixel's trajectory (Andrey 2026-06-18).
+    Informative even WITHOUT noise (full path-sum at the true head, partial at aliases) and kills
+    noise-consensus (no real data along a spurious path -> ~0). frames [N,HW,HW]; vx,vy [F,F] (field
+    coords; scene = field + half). Returns [F,F] = max(sum_i frames[i] sampled at (field+half - V*i), 0)."""
+    dev = frames.device; HW = frames.shape[-1]; F_ = vx.shape[0]
+    fi, fj = torch.meshgrid(torch.arange(F_, device=dev).float(), torch.arange(F_, device=dev).float(), indexing='ij')
+    acc = torch.zeros(F_, F_, device=dev)
+    for i in range(N):
+        sx = (fj + half - vx * i); sy = (fi + half - vy * i)
+        grid = torch.stack([2 * sx / (HW - 1) - 1, 2 * sy / (HW - 1) - 1], dim=-1)[None]
+        acc = acc + F.grid_sample(frames[i][None, None], grid, align_corners=True, padding_mode='zeros')[0, 0]
+    return acc.clamp(min=0.0)
+
+
+def vote_scatter(s, vx, vy, Nm1):
+    """Bilinear vote splat at T = P - V*(N-1); s = vote WEIGHT (MF path-sum), unnormalized. By Claude on 06/18/2026
+    Returns 3 accumulators [F,F]: (sum w, sum w*vx, sum w*vy). Normalize per-field afterward."""
+    F_ = s.shape[0]; dev = s.device
+    ys, xs = torch.meshgrid(torch.arange(F_, device=dev).float(),
+                            torch.arange(F_, device=dev).float(), indexing='ij')
+    tx = xs - vx * Nm1; ty = ys - vy * Nm1
+    accS = torch.zeros(F_*F_, device=dev); accVx = torch.zeros_like(accS); accVy = torch.zeros_like(accS)
+    x0 = torch.floor(tx); y0 = torch.floor(ty)
+    for dx in (0, 1):
+        for dy in (0, 1):
+            xi = (x0 + dx); yi = (y0 + dy)
+            wx = (1 - (tx - x0)) if dx == 0 else (tx - x0)
+            wy = (1 - (ty - y0)) if dy == 0 else (ty - y0)
+            w = (s * wx * wy)   # s = MF path-sum weight (already informative; no squaring) // By Claude 06/18
+            valid = (xi >= 0) & (xi < F_) & (yi >= 0) & (yi < F_)
+            idx = (yi.clamp(0, F_-1) * F_ + xi.clamp(0, F_-1)).long().reshape(-1)
+            wv = (w * valid).reshape(-1)
+            accS.scatter_add_(0, idx, wv)
+            accVx.scatter_add_(0, idx, (wv * vx.reshape(-1)))
+            accVy.scatter_add_(0, idx, (wv * vy.reshape(-1)))
+    return accS.reshape(F_, F_), accVx.reshape(F_, F_), accVy.reshape(F_, F_)
+
+
+class VoteRefine(nn.Module):
+    """Conv refine on the 3-channel vote accumulator -> tail-detection logit + consensus velocity.
+    Reference Stage 2: vote (geometric, physics) is fixed; the conv learns to sharpen/threshold."""
+    def __init__(self, ch=32):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(True),
+            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True),
+            nn.Conv2d(ch, 3, 1))                                     # det logit + (Vx,Vy) consensus
+    def forward(self, accS, accVx, accVy):
+        a = torch.stack([accS, accVx, accVy], 0)[None]               # [1,3,F,F]
+        return self.net(a)[0]                                        # [3,F,F]
+
+
+def tail_label(tgts, F_, P=52, N=9, sigma=1.5):
+    """Gaussian tail-detection target map (field coords) + velocity maps. By Claude on 06/18/2026"""
+    Nm1 = N - 1; half = P // 2
+    det = np.zeros((F_, F_), np.float32); vx = np.zeros_like(det); vy = np.zeros_like(det)
+    ys, xs = np.mgrid[0:F_, 0:F_]
+    for hx, hy, tvx, tvy in tgts:
+        tx = (hx - tvx*Nm1) - half; ty = (hy - tvy*Nm1) - half       # tail in field coords
+        g = np.exp(-((xs-tx)**2 + (ys-ty)**2) / (2*sigma*sigma)).astype(np.float32)
+        det = np.maximum(det, g); m = g > 0.3
+        vx[m] = tvx; vy[m] = tvy
+    return det, vx, vy
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--stage1", default="runs/stage1_mf/model.pt")
+    ap.add_argument("--nframes", type=int, default=9); ap.add_argument("--vmax", type=float, default=2.8)
+    ap.add_argument("--HW", type=int, default=120); ap.add_argument("--ntgt", type=int, default=4)
+    ap.add_argument("--nfields", type=int, default=384); ap.add_argument("--steps", type=int, default=3000)
+    ap.add_argument("--snr", type=float, nargs=2, default=[2.0, 8.0]); ap.add_argument("--out", default="runs/stage2")
+    ap.add_argument("--mf_s", action="store_true")  # Stage 1 emits the MF path-sum directly as S (option a) -> vote weight = S, no mf_sum() pass // By Claude 06/18
+    a = ap.parse_args()
+    import os; os.makedirs(a.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    s1 = RawFCN(n_frames=a.nframes, patch=52, velocity_mode="reg", vmax=a.vmax).to(dev)
+    s1.load_state_dict(torch.load(a.stage1, map_location=dev)["model"]); s1.eval()
+    rng = np.random.default_rng(0); Nm1 = a.nframes - 1; half = 52 // 2
+    print(f"precomputing {a.nfields} Stage-1 fields + MF-sum votes (HW={a.HW}, ntgt={a.ntgt})...", flush=True)
+    cache = []
+    for k in range(a.nfields):
+        nt = 0 if (k % 4 == 0) else a.ntgt   # 25% noise-only fields: learn to suppress noise-consensus // By Claude 06/18
+        fr, tg = gen_field(rng, a.HW, nt, a.nframes, a.vmax, a.snr)
+        s, vx, vy = stage1_dense(s1, fr, dev=dev, mf_s=a.mf_s)
+        F_ = s.shape[0]
+        # option a: S already IS the MF path-sum (learned, denoised) -> use it as the vote weight
+        # directly; legacy path computes the explicit data path-sum from the frames.
+        w = s if a.mf_s else mf_sum(torch.from_numpy(fr).to(dev), vx, vy, half, a.nframes)
+        accS, accVx, accVy = vote_scatter(w, vx, vy, Nm1)
+        nrm = accS.max().clamp(min=1e-6)                                    # per-field normalize (regime-invariant)
+        accS = accS / nrm; accVx = accVx / nrm; accVy = accVy / nrm
+        det_t, tvx, tvy = tail_label(tg, F_, N=a.nframes)
+        cache.append((accS.detach(), accVx.detach(), accVy.detach(),
+                      torch.from_numpy(det_t).to(dev), torch.from_numpy(tvx).to(dev), torch.from_numpy(tvy).to(dev)))
+        if (k+1) % 64 == 0: print(f"  {k+1}/{a.nfields}", flush=True)
+    net = VoteRefine().to(dev); opt = torch.optim.Adam(net.parameters(), 1e-3)
+    print("training Stage-2 refine...", flush=True)
+    for step in range(1, a.steps+1):
+        accS, accVx, accVy, det_t, tvx, tvy = cache[np.random.randint(len(cache))]
+        out = net(accS, accVx, accVy)                               # [3,F,F]
+        l_det = F.binary_cross_entropy_with_logits(out[0], det_t, pos_weight=torch.tensor(8.0, device=out.device))  # tail-Gaussians are sparse
+        m = det_t > 0.3
+        l_vel = (F.mse_loss(out[1][m], tvx[m]) + F.mse_loss(out[2][m], tvy[m])) if m.any() else out.sum()*0
+        loss = l_det + 0.3*l_vel
+        opt.zero_grad(); loss.backward(); opt.step()
+        if step % 300 == 0:
+            with torch.no_grad():
+                p = torch.sigmoid(out[0]); peakhit = float(p[m].mean()) if m.any() else 0
+                bg = float(p[~m].max())
+            print(f"step {step:5d}  det {l_det.item():.4f}  vel {float(l_vel):.4f}  "
+                  f"tail-s(peak) {peakhit:.3f}  max-bg {bg:.3f}", flush=True)
+    torch.save({"model": net.state_dict(), "args": vars(a)}, f"{a.out}/model.pt")
+    print(f"saved {a.out}/model.pt", flush=True)
--- a/stage2_eval3.py
+++ b/stage2_eval3.py
+# stage2_eval3.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
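+# Evaluate the Stage-2 vote-refine net on 15 synthetic 4-target fields (SNR 3..8) at two
+# detection thresholds. A peak within 8 px of a true tail counts as a detection; peaks with no
+# true tail within 8 px are counted as ghosts. Paths below are hard-coded for the /work container.
+import numpy as np, torch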
+import stage2 as S2
+from model import RawFCN
+dev="cuda" if torch.cuda.is_available() else "cpu"; N,vmax,HW=9,2.8,140; Nm1=N-1; half=26
+s1=RawFCN(n_frames=N,patch=52,velocity_mode="reg",vmax=vmax).to(dev)
+s1.load_state_dict(torch.load("/work/runs/stage1_mf/model.pt",map_location=dev)["model"]); s1.eval()
+net=S2.VoteRefine().to(dev); net.load_state_dict(torch.load("/work/runs/stage2c/model.pt",map_location=dev)["model"]); net.eval()
+def peaks(p,th,r=2):
+    out=[]; H_,W_=p.shape
+    for y in range(r,H_-r):
+        for x in range(r,W_-r):
+            if p[y,x]>th and p[y,x]>=p[y-r:y+r+1,x-r:x+r+1].max()-1e-6: out.append((x,y,p[y,x]))
+    return out
+for TH in (0.5,0.7):
+    ndet=0;ntot=0;errs=[];gh=[];rng=np.random.default_rng(123)
+    for t in range(15):
+        fr,tg=S2.gen_field(rng,HW,4,N,vmax,[3.0,8.0]); s,vx,vy=S2.stage1_dense(s1,fr,dev=dev)
+        w=S2.mf_sum(torch.from_numpy(fr).to(dev),vx,vy,half,N)
+        aS,aVx,aVy=S2.vote_scatter(w,vx,vy,Nm1); nrm=aS.max().clamp(min=1e-6); aS=aS/nrm;aVx=aVx/nrm;aVy=aVy/nrm
+        with torch.no_grad(): p=torch.sigmoid(net(aS,aVx,aVy)[0]).cpu().numpy()
+        F_=p.shape[0]; pk=peaks(p,TH)
+        tt=[((hx-tvx*Nm1)-half,(hy-tvy*Nm1)-half) for hx,hy,tvx,tvy in tg]; tt=[(x,y) for x,y in tt if 0<=x<F_ and 0<=y<F_]
+        for tx,ty in tt:
+            ntot+=1; near=[np.hypot(px-tx,py-ty) for px,py,pv in pk if np.hypot(px-tx,py-ty)<8]
+            if near: ndet+=1; errs.append(min(near))
+        for px,py,pv in pk:
+            if all(np.hypot(px-tx,py-ty)>=8 for tx,ty in tt): gh.append(pv)
+    print("th=%.2f: det %d/%d (%.0f%%) locerr %.2f | TRUE ghosts(>8px) %d max %.3f"%(TH,ndet,ntot,100*ndet/max(ntot,1),np.median(errs) if errs else -1,len(gh),max(gh) if gh else 0))
--- a/synth.py
+++ b/synth.py
+# synth.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""
+Synthetic single-target generator for the C5P DNN experiment. # By Claude on 06/13/2026
+
+Goal: produce unlimited exactly-labeled training data for a network that maps a short
+spatio-temporal patch sequence -> P(x, y, vx, vy) (the full 4D target posterior, one shot
+from N frames - strictly more than phase correlation (PC), which is pairwise and velocity-only).
+
+Model of one sample (matches the validated pipeline conventions):
+  - One target: a canonical half-cosine bump of peak amplitude `snr` (noise sigma = 1, so
+    snr = peak SNR), at sub-pixel position (x0, y0) in the NEWEST frame (frame index 0),
+    moving at constant velocity (vx, vy) px/frame. Frame i (i = 0 newest .. N-1 oldest)
+    has the bump centered at (x0 - vx*i, y0 - vy*i)  [the target was at earlier positions
+    in the past], matching the C5P window convention (window[0] = newest).
+  - i.i.d. Gaussian noise, sigma = 1, added to every pixel of every frame. The noise is
+    deliberately pure Gaussian at this stage: PC's optimality is derived for Gaussian noise, so
+    "matches PC under Gaussian noise" is a meaningful, falsifiable target; real clutter comes later.
+
+Labels: (x0, y0, vx, vy) continuous. Velocity is in px/frame; the output grid maps cells
+to px/frame via vel_decimate (4 cells = 1 px/frame at decimate=4, as in the pipeline).
+
+Bump: separable half-cosine cos(pi/3 * |d|) for |d| < 1.5 (same shape as
+TemporalKernelGenerator.halfcos and the existing synthetic generator). Set radial=True for
+an isotropic bump (rotationally symmetric) if we switch the whole system that way.
+"""
+
+import numpy as np
+
+
+def halfcos_bump(cx, cy, H, W, radial=False):
+    """Sample the canonical half-cosine bump centered at (cx, cy) on an HxW grid.
+    Separable cos(pi/3|dx|)*cos(pi/3|dy|) (default, matches the system) or radial."""
+    ys = np.arange(H)[:, None] - cy
+    xs = np.arange(W)[None, :] - cx
+    if radial:
+        r = np.sqrt(xs * xs + ys * ys)
+        b = np.where(r < 1.5, np.cos(np.pi / 3.0 * r), 0.0)
+    else:
+        bx = np.where(np.abs(xs) < 1.5, np.cos(np.pi / 3.0 * np.abs(xs)), 0.0)
+        by = np.where(np.abs(ys) < 1.5, np.cos(np.pi / 3.0 * np.abs(ys)), 0.0)
+        b = bx * by
+    return b
+
+
+def generate_sample(rng, N=8, H=24, W=24, vmax_px=1.0, snr=5.0, radial=False, snr_log=False,
+                    place="center", off_range=0.5, off_min=1.0, off_max=None,
+                    margin=2.0, target=None, motion_blur=False, blur_frac=1.0):
+    """One labeled training patch for the fully-convolutional (FCN) regime. # By Claude on 06/13/2026
+    The patch is ONE receptive field; supervision is for its CENTER output pixel only.
+    Per-pixel output is (Vx, Vy, s) - position is the pixel grid (conv location), so the
+    center output's job is: is a target trajectory centered on ME at t0 (newest frame)?
+
+    place='center'    -> POSITIVE (det=1): target t0 within +-off_range px of patch center
+                         (the cell this output owns). Train Vx,Vy on it.
+    place='offcenter' -> NEGATIVE-WITH-TARGET (det=0): a real target owned by a NEIGHBOR
+                         pixel - t0 offset off_min..off_max from center - whose evidence
+                         reaches this receptive field. Teaches the center output to SUPPRESS
+                         competing line segments that don't pass through it at t0. (Vx,Vy NOT
+                         trained - det=0 masks the velocity loss; stored only for viz.)
+    place='none'      -> NOISE negative (det=0): pure Gaussian noise.
+    target= True/False kept for back-compat (-> 'center'/'none').
+    """
+    if target is not None:
+        place = "center" if target else "none"
+    if isinstance(snr, (tuple, list)):
+        # log-uniform spans low->high (keeps the high-SNR/noiseless sharpness anchor while covering
+        # low SNR) - fixes the reg head's high-SNR overshoot; linear otherwise. By Claude on 06/17/2026
+        snr = (float(np.exp(rng.uniform(np.log(snr[0]), np.log(snr[1])))) if snr_log
+               else rng.uniform(snr[0], snr[1]))
+    # Reference the patch center at index P/2 (= deployment's `half` in CuasDnnInfer.inferROI,
+    # where patch index half maps to the output/ROI pixel), NOT the geometric (W-1)/2. The 0.5
+    # gap between (W-1)/2=11.5 and P/2=12 was the systematic half-pixel registration bias; aligning
+    # the training reference to the deployment reference removes it (even patch kept). By Claude on 06/15/2026
+    cx0 = W / 2.0; cy0 = H / 2.0
+    if off_max is None:
+        off_max = min(W, H) / 2.0 - margin - 1.0   # neighbor targets out to the RF reach
+    if place == "none":
+        frames = rng.standard_normal((N, H, W)).astype(np.float32)
+        return frames, {"det": 0.0, "place": place, "x0": np.nan, "y0": np.nan,
+                        "vx": np.nan, "vy": np.nan, "dx": np.nan, "dy": np.nan, "snr": snr, "mfsum": 0.0}
+    # pick t0 offset from center per class, velocity on the disk, retry until the whole
+    # trajectory (t0 - v*i, i=0..N-1) stays inside the patch within `margin`.
+    for _ in range(200):
+        vx = rng.uniform(-vmax_px, vmax_px); vy = rng.uniform(-vmax_px, vmax_px)
+        if vx * vx + vy * vy > vmax_px * vmax_px:
+            continue
+        if place == "center":
+            dx = rng.uniform(-off_range, off_range); dy = rng.uniform(-off_range, off_range)
+        else:  # offcenter: t0 owned by a neighbor (annulus off_min..off_max around center)
+            ang = rng.uniform(0, 2 * np.pi); rad = rng.uniform(off_min, off_max)
+            dx = rad * np.cos(ang); dy = rad * np.sin(ang)
+        x0 = cx0 + dx; y0 = cy0 + dy
+        xs = [x0 - vx * i for i in range(N)]; ys = [y0 - vy * i for i in range(N)]
+        bmx = blur_frac * abs(vx) if motion_blur else 0.0   # causal streak extends older by up to blur_frac*|v|
+        bmy = blur_frac * abs(vy) if motion_blur else 0.0
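+        if (min(xs) - bmx >= margin and max(xs) + bmx <= W - 1 - margin and
+            min(ys) - bmy >= margin and max(ys) + bmy <= H - 1 - margin):
+            break
+    else:
+        # no in-bounds trajectory in 200 tries: fail loudly instead of rendering a target that leaves the patch
+        raise RuntimeError("generate_sample: no trajectory fits the patch; reduce vmax_px, N, margin or off_max")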
+    frames = np.empty((N, H, W), dtype=np.float32)
+    # motion blur, RT/CAUSAL model (Andrey 2026-06-17): a decimated frame averages the finest
+    # sub-frames AT and BEFORE it (trailing), as realtime must - data[i] = mean of sub-frames at
+    # i, i+1/sub, ... i+(blur_frac - 1/sub) in the OLDER direction (larger index = older here).
+    # The streak is ~|v|*blur_frac long AND its centroid lags by ~0.5*blur_frac*|v| (the RT bias:
+    # the target's apparent position is OLDER than its label time). Flux-conserving (peak drops).
+    # blur_frac = averaging window in frames (1.0 = non-overlap decimation, 4 sub-steps). By Claude on 06/17/2026
+    nb = max(2, int(round(4 * blur_frac))) if motion_blur else 1
+    subs = np.arange(nb) * (blur_frac / nb) if nb > 1 else np.array([0.0])   # causal: s in [0, blur_frac)
+    mfsum = 0.0   # clean-signal matched-filter path-sum = label for the MF-like S head (option a). By Claude on 06/18/2026
+    for i in range(N):
+        if motion_blur and (vx or vy):
+            acc = np.zeros((H, W), dtype=np.float32)
+            for ss in subs:
+                acc += halfcos_bump(x0 - vx * (i + ss), y0 - vy * (i + ss), H, W, radial=radial)
+            sig = snr * acc / nb
+        else:
+            sig = snr * halfcos_bump(x0 - vx * i, y0 - vy * i, H, W, radial=radial)
+        frames[i] = sig + rng.standard_normal((H, W))
+        # accumulate the clean signal sampled (bilinear) along the trajectory tap (x0-vx*i, y0-vy*i):
+        # this is exactly "sum of data along the trajectory" with the noise removed (its expectation).
+        tx = x0 - vx * i; ty = y0 - vy * i
+        ix = int(np.floor(tx)); iy = int(np.floor(ty)); fx = tx - ix; fy = ty - iy
+        if 0 <= ix < W - 1 and 0 <= iy < H - 1:
+            mfsum += float((1 - fy) * ((1 - fx) * sig[iy, ix] + fx * sig[iy, ix + 1])
+                           + fy * ((1 - fx) * sig[iy + 1, ix] + fx * sig[iy + 1, ix + 1]))
+    det = 1.0 if place == "center" else 0.0
+    return frames, {"det": det, "place": place, "x0": x0, "y0": y0,
+                    "vx": vx, "vy": vy, "dx": dx, "dy": dy, "snr": snr, "mfsum": mfsum}
+
+
+def pick_place(rng, frac_pos=0.4, frac_off=0.4):
+    """Three-way class draw: center-positive / off-center-negative / noise-negative. # By Claude on 06/13/2026
+    The off-center class (a target owned by a neighbor pixel) is what stops the net from
+    collapsing to the matched filter - it must learn to SUPPRESS competing line segments."""
+    u = rng.random()
+    if u < frac_pos: return "center"
+    if u < frac_pos + frac_off: return "offcenter"
+    return "none"
+
+
+def generate_batch(rng, B, frac_pos=0.4, frac_off=0.4, **kw):
+    """Batch of B FCN training patches, three classes (center / offcenter / noise). # By Claude on 06/13/2026"""
+    N = kw.get("N", 8); H = kw.get("H", 24); W = kw.get("W", 24)
+    frames = np.empty((B, N, H, W), dtype=np.float32)
+    keys = ("det", "x0", "y0", "vx", "vy", "dx", "dy", "snr", "mfsum")
+    code = {"none": 0, "center": 1, "offcenter": 2}
+    labels = {k: np.empty(B, dtype=np.float32) for k in keys}
+    labels["place_code"] = np.empty(B, dtype=np.int64)  # 0 none / 1 center / 2 offcenter
+    for b in range(B):
+        f, lab = generate_sample(rng, place=pick_place(rng, frac_pos, frac_off), **kw)
+        frames[b] = f
+        for k in keys:
+            labels[k][b] = lab[k]
+        labels["place_code"][b] = code[lab["place"]]
+    return frames, labels
+
+
+def soft_target_vel(label, vel_radius=5, vel_decimate=4, sigma_v=0.9):
+    """Soft P(vx,vy) target for the CENTER output pixel: Gaussian bump at the true velocity # By Claude on 06/13/2026
+    in cell space (cell = vel_decimate*px/frame). Shape [vdim,vdim] (vy outer, vx inner -
+    matches v_out_idx convention). Normalized to sum 1. Use only for positives."""
+    vdim = 2 * vel_radius + 1
+    vcx = label["vx"] * vel_decimate
+    vcy = label["vy"] * vel_decimate
+    vyc = (np.arange(vdim)[:, None] - vel_radius) - vcy
+    vxc = (np.arange(vdim)[None, :] - vel_radius) - vcx
+    vel = np.exp(-(vxc * vxc + vyc * vyc) / (2 * sigma_v * sigma_v))
+    s = vel.sum()
+    return (vel / s).astype(np.float32) if s > 0 else vel.astype(np.float32)
+
+
+def save_tiff_stack(frames, path):
+    """Save [N,H,W] float frames as a multi-page 32-bit TIFF (opens in ImageJ)."""
+    from PIL import Image
+    imgs = [Image.fromarray(np.asarray(f, dtype=np.float32), mode="F") for f in frames]
+    imgs[0].save(path, save_all=True, append_images=imgs[1:])
+
+
+if __name__ == "__main__":
+    import sys, os
+    out = sys.argv[1] if len(sys.argv) > 1 else "/tmp/c5p_dnn_samples"
+    os.makedirs(out, exist_ok=True)
+    rng = np.random.default_rng(12345)
+    # positives at several SNRs (target sub-pixel near patch center)
+    for snr in [2.0, 3.0, 5.0, 8.0]:
+        frames, lab = generate_sample(rng, snr=snr, target=True)
+        p = f"{out}/pos_snr{snr:.0f}.tif"
+        save_tiff_stack(frames, p)
+        print(f"POS snr={snr:.0f}  center-offset dx={lab['dx']:+.2f} dy={lab['dy']:+.2f}  "
+              f"vx={lab['vx']:+.3f} vy={lab['vy']:+.3f} px/fr  -> {p}")
+    # off-center negative (target owned by a neighbor pixel - the key new class)
+    foff, loff = generate_sample(rng, place="offcenter", snr=8.0)
+    save_tiff_stack(foff, f"{out}/offcenter.tif")
+    print(f"OFFCENTER det={loff['det']} at t0-offset dx={loff['dx']:+.2f} dy={loff['dy']:+.2f} "
+          f"(|off|={np.hypot(loff['dx'],loff['dy']):.2f})  vx={loff['vx']:+.3f} vy={loff['vy']:+.3f}  -> {out}/offcenter.tif")
+    # a noise negative
+    fneg, lneg = generate_sample(rng, place="none")
+    save_tiff_stack(fneg, f"{out}/noise.tif")
+    print(f"NOISE det={lneg['det']}  -> {out}/noise.tif")
+    # batch + velocity soft-target sanity (three-class mix)
+    fb, lb = generate_batch(rng, 200, frac_pos=0.4, frac_off=0.4)
+    print(f"batch frames={fb.shape}  positives(det=1)={lb['det'].mean():.2f} (≈frac_pos)")
+    f, lab = generate_sample(rng, snr=5.0, target=True)
+    t = soft_target_vel(lab)
+    print(f"soft_target_vel shape={t.shape} sum={t.sum():.4f} "
+          f"argmax(vy,vx)={np.unravel_index(t.argmax(), t.shape)} "
+          f"(true cells vy={lab['vy']*4:+.2f} vx={lab['vx']*4:+.2f})")
--- a/test_infer_client.py
+++ b/test_infer_client.py
+#!/usr/bin/env python3
+# test_infer_client.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Stateful batched benchmark client for infer_server.py. By Claude on 06/20/2026.
+UPLOAD -> INFER(a chunk of scenes) -> READBACK; reports timing + transfer sizes.
+Usage: test_infer_client.py [host] [port] [T] [H] [W] [roi_w] [roi_h] [count]"""
+import socket, struct, sys, time
+from datetime import datetime
+import numpy as np
+
+host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
+port = int(sys.argv[2]) if len(sys.argv) > 2 else 5577
+T = int(sys.argv[3]) if len(sys.argv) > 3 else 40
+H = int(sys.argv[4]) if len(sys.argv) > 4 else 512
+W = int(sys.argv[5]) if len(sys.argv) > 5 else 640
+RW = int(sys.argv[6]) if len(sys.argv) > 6 else 70
+RH = int(sys.argv[7]) if len(sys.argv) > 7 else 20
+COUNT = int(sys.argv[8]) if len(sys.argv) > 8 else 16
+CMD_BYE, CMD_UPLOAD, CMD_INFER, CMD_READBACK = 0, 1, 2, 3
+
+s = socket.socket(); s.connect((host, port)); s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
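+def rd(n):
+    b = b""
+    while len(b) < n:
+        chunk = s.recv(n - len(b))
+        if not chunk:   # recv() returns b"" once the peer has closed; without this check the loop never ends
+            raise ConnectionError(f"server closed the connection after {len(b)} of {n} bytes")
+        b += chunk
+    return b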
+
+# --- UPLOAD ---
+stack = np.random.randn(T, H, W).astype(">f4").tobytes()
+t = time.perf_counter()
+s.sendall(struct.pack(">iiii", CMD_UPLOAD, T, H, W)); s.sendall(stack)
+nl = struct.unpack(">i", rd(4))[0]
+levs = [struct.unpack(">i", rd(4))[0] for _ in range(nl)]
+N, bms = struct.unpack(">id", rd(12))
+print(f"{datetime.now():%H:%M:%S} UPLOAD {T}x{H}x{W} ({T*H*W*4/1e6:.1f}MB up) -> {nl} levels {levs} N={N} "
+      f"build={bms:.1f}ms  total={(time.perf_counter()-t)*1e3:.0f}ms")
+
+# --- INFER a chunk of scenes at level 0 (newest = start + j*stride) ---
+start = N - 1                                   # first valid newest in level 0
+count = min(COUNT, levs[0] - start)
+for it in range(2):                             # twice: it0 includes autotune
+    t = time.perf_counter()
+    s.sendall(struct.pack(">iiiiiiiii", CMD_INFER, 0, start, count, 1, 200, 250, RW, RH))
+    s.sendall(struct.pack(">d", 1.4 * 4))        # rmax_cells = vmax*vel_decimate
+    gms, oh, ow, cnt, nvel, rh, rw = struct.unpack(">diiiiii", rd(32))
+    o5 = rd(cnt * 5 * oh * ow * 4)
+    rf = rd(cnt * rh * rw * nvel * 4)
+    rt = (time.perf_counter() - t) * 1e3
+    down = (len(o5) + len(rf)) / 1e6
+    print(f"{datetime.now():%H:%M:%S} INFER it{it}: {cnt} scenes -> offset5[{cnt},5,{oh},{ow}] + roi[{cnt},{rh},{rw},{nvel}] "
+          f"{down:.1f}MB down | gpu={gms:.1f}ms ({gms/cnt:.1f}ms/scene) roundtrip={rt:.1f}ms ({rt/cnt:.1f}ms/scene)")
+
+# --- READBACK (debug) ---
+s.sendall(struct.pack(">iii", CMD_READBACK, 0, 0))
+fh, fw = struct.unpack(">ii", rd(8)); _ = rd(fh * fw * 4)
+print(f"{datetime.now():%H:%M:%S} READBACK lev0 f0 -> [{fh},{fw}]")
+s.sendall(struct.pack(">i", CMD_BYE)); s.close()
--- a/train.py
+++ b/train.py
+# train.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""
+Train the C5P FCN on synthetic Gaussian-noise patches. # By Claude on 06/13/2026
+
+Phase 1: all-synthetic, pure Gaussian noise, SNR-swept. The benchmark question is whether
+this matches phase correlation (velocity) and localizes (x,y) at low SNR. Runs inside the
+NGC PyTorch container on the DGX Spark (GB10).
+
+Usage (inside container, with the project dir mounted at /work):
+    python /work/train.py --steps 20000 --batch 256 --snr 1 8 --out /work/runs/raw1
+"""
+
+import argparse, os, time
+import numpy as np
+import torch
+
+import synth
+from model import RawFCN, fcn_loss, vel_bias_loss, reg_loss
+
+
+def batched_soft_vel(vx, vy, vel_radius=5, vel_decimate=4, sigma_v=0.9):
+    """[B] vx,vy (px/frame) -> [B, vdim*vdim] soft P(vx,vy) targets (vy outer, vx inner)."""
+    vdim = 2 * vel_radius + 1
+    cells = np.arange(vdim) - vel_radius                       # [vdim]
+    vcx = (vx * vel_decimate)[:, None, None]                   # [B,1,1]
+    vcy = (vy * vel_decimate)[:, None, None]
+    dvx = cells[None, None, :] - vcx                           # [B,1,vdim]
+    dvy = cells[None, :, None] - vcy                           # [B,vdim,1]
+    g = np.exp(-(dvx * dvx + dvy * dvy) / (2 * sigma_v * sigma_v))  # [B,vdim,vdim]
+    g = np.nan_to_num(g)                                       # negatives have NaN vx/vy
+    s = g.reshape(g.shape[0], -1).sum(1, keepdims=True)
+    flat = g.reshape(g.shape[0], -1)
+    return np.divide(flat, s, out=np.zeros_like(flat), where=s > 0).astype(np.float32)
+
+
+def make_batch(rng, B, dev, frac_pos=0.4, frac_off=0.4,
+               off_w_near=1.5, off_w_far=0.3, off_w_tau=2.5, **kw):
+    vr = kw.pop("vel_radius", 5); vd = kw.pop("vel_decimate", 4); sv = kw.pop("sigma_v", 0.9)
+    frames, lab = synth.generate_batch(rng, B, frac_pos=frac_pos, frac_off=frac_off, **kw)
+    vel_soft = batched_soft_vel(lab["vx"], lab["vy"], vr, vd, sv)
+    off = np.stack([np.nan_to_num(lab["dx"]), np.nan_to_num(lab["dy"])], axis=1).astype(np.float32)
+    # per-sample detection-loss weight: 1 for center/noise, confusability-weighted for off-center
+    det_w = np.ones(frames.shape[0], dtype=np.float32)
+    isoff = lab["place_code"] == 2
+    if isoff.any():
+        offd = np.hypot(lab["dx"][isoff], lab["dy"][isoff])
+        det_w[isoff] = offcenter_weight(offd, w_near=off_w_near, w_far=off_w_far, tau=off_w_tau)
+    x = torch.from_numpy(frames).to(dev)                       # [B,N,P,P]
+    return (x,
+            torch.from_numpy(lab["det"]).to(dev),
+            torch.from_numpy(vel_soft).to(dev),
+            torch.from_numpy(off).to(dev),
+            torch.from_numpy(det_w).to(dev),
+            lab)
+
+
+def offcenter_weight(off, off_min=1.0, w_near=2.0, w_far=0.3, tau=2.5):
+    """Detection-loss weight for off-center negatives: heaviest at the immediate neighbors # By Claude on 06/13/2026
+    (confusability ~ PSF overlap with center), decaying with crossing distance to a floor."""
+    return (w_far + (w_near - w_far) * np.exp(-np.maximum(off - off_min, 0.0) / tau)).astype(np.float32)
+
+
+def s_by_class(out, model, lab, near_r=2.0):
+    """Mean confidence s=sigmoid(det), split center / off-NEAR / off-FAR / noise. # By Claude on 06/13/2026
+    The decisive number is off-NEAR (|off|<=near_r): if the net drives it to 0 it learned
+    fine spatial discrimination (NOT the MF); if it stays high, a small net can't resolve
+    near-misses -> evidence for a deeper CNN."""
+    det, _, _ = model.split(out)
+    s = torch.sigmoid(det.reshape(det.shape[0])).detach().cpu().numpy()
+    pc = lab["place_code"]
+    off = np.hypot(np.nan_to_num(lab["dx"]), np.nan_to_num(lab["dy"]))
+    def mean_of(m):
+        return float(s[m].mean()) if m.any() else float("nan")
+    isoff = pc == 2
+    return [mean_of(pc == 1), mean_of(isoff & (off <= near_r)),
+            mean_of(isoff & (off > near_r)), mean_of(pc == 0)]  # ctr, offNear, offFar, noise
+
+
+def vel_err_px(out, model, lab, vel_decimate=4):
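+    """Mean Euclidean velocity error (px/frame) on positives: softmax-weighted velocity centroid
+    (grid mode) or the direct (Vx,Vy) output (reg mode) vs the true (vx,vy). Logged under the
+    name "velRMSE", but it is the mean of per-sample error magnitudes, not a root-mean-square.
+    A quick training-quality readout (full PC benchmark is separate)."""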
+    pos = lab["det"] > 0.5
+    if pos.sum() == 0:
+        return float("nan")
+    if model.velocity_mode == "reg":                            # T7: direct (Vx,Vy) // By Claude on 06/17/2026
+        _, v, _, _ = model.split_reg(out)
+        v = v.reshape(v.shape[0], 2).detach().cpu().numpy()
+        e = np.sqrt((v[pos, 0] - lab["vx"][pos]) ** 2 + (v[pos, 1] - lab["vy"][pos]) ** 2)
+        return float(e.mean())
+    det, vel, _ = model.split(out)
+    vdim = model.vdim
+    p = torch.softmax(vel.reshape(vel.shape[0], -1), dim=1).reshape(-1, vdim, vdim)
+    cells = torch.arange(vdim, device=p.device) - model.vel_radius
+    evx = (p.sum(1) * cells).sum(1) / vel_decimate             # [B] px/frame
+    evy = (p.sum(2) * cells).sum(1) / vel_decimate
+    evx = evx.detach().cpu().numpy(); evy = evy.detach().cpu().numpy()
+    e = np.sqrt((evx[pos] - lab["vx"][pos]) ** 2 + (evy[pos] - lab["vy"][pos]) ** 2)
+    return float(e.mean())
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--steps", type=int, default=20000)
+    ap.add_argument("--batch", type=int, default=256)
+    ap.add_argument("--lr", type=float, default=1e-3)
+    ap.add_argument("--snr", type=float, nargs=2, default=[1.0, 8.0])
+    ap.add_argument("--frac_pos", type=float, default=0.4)   # center-positive
+    ap.add_argument("--frac_off", type=float, default=0.4)   # off-center negative (the key class)
+    ap.add_argument("--w_vel", type=float, default=1.0)      # velocity loss weight (raise to emphasize velocity)
+    ap.add_argument("--w_bias", type=float, default=1.0)     # batch-moment de-biasing weight (per-SNR mean scale -> 1); 0 disables // By Claude on 06/15/2026
+    ap.add_argument("--bias_bins", type=int, default=4)      # number of bins for the de-biasing term // By Claude on 06/15/2026
+    ap.add_argument("--bias_by", choices=["snr", "s"], default="snr")  # de-bias conditioning var: snr (clean label, in-loss) or s=sigmoid(det) // By Claude on 06/15/2026
+    ap.add_argument("--vmax", type=float, default=1.0)       # training velocity disk radius, px/frame (was hardcoded 1.0; raise to cover the grid +-1.25 on-axis) // By Claude on 06/15/2026
+    ap.add_argument("--velocity_mode", choices=["grid", "reg"], default="grid")  # grid=121-cell softmax | reg=continuous Vx,Vy,logvar (T7) // By Claude on 06/17/2026
+    ap.add_argument("--snr_log", action="store_true")  # sample SNR log-uniform over [snr_lo,snr_hi] (span high for sharpness anchor) // By Claude on 06/17/2026
+    ap.add_argument("--mf_s", action="store_true")  # (reg) channel 0 regresses the MF path-sum S (option a) instead of det BCE -> informative vote weight // By Claude on 06/18/2026
+    ap.add_argument("--w_mfs", type=float, default=0.02)  # MF-S regression loss weight (balances large path-sum MSE vs velocity NLL) // By Claude on 06/18/2026
+    ap.add_argument("--motion_blur", action="store_true")  # render moving targets motion-blurred (streak ~|v|*blur_frac) - higher pyramid levels (T2) // By Claude on 06/17/2026
+    ap.add_argument("--blur_frac", type=float, default=1.0)  # motion-blur window length in frames (1.0 non-overlap decimation; 2.0 ~ 50%-overlap) // By Claude on 06/17/2026
+    ap.add_argument("--w_off", type=float, default=0.3)      # sub-pixel offset loss weight (low: position precision is secondary)
+    ap.add_argument("--off_w_near", type=float, default=1.5) # near-miss suppression weight (lower if +-1px ambiguity is acceptable)
+    ap.add_argument("--off_w_far", type=float, default=0.3)
+    ap.add_argument("--off_w_tau", type=float, default=2.5)
+    ap.add_argument("--nframes", type=int, default=8)
+    ap.add_argument("--patch", type=int, default=24)
+    ap.add_argument("--vel_radius", type=int, default=5)
+    ap.add_argument("--vel_decimate", type=int, default=4)
+    ap.add_argument("--sigma_v", type=float, default=0.9)
+    ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--log_every", type=int, default=200)
+    ap.add_argument("--out", type=str, default="/tmp/c5p_run")
+    args = ap.parse_args()
+
+    os.makedirs(args.out, exist_ok=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"device={dev} torch={torch.__version__} steps={args.steps} batch={args.batch} "
+          f"snr={args.snr} out={args.out}", flush=True)
+    rng = np.random.default_rng(args.seed)
+    torch.manual_seed(args.seed)
+
+    model = RawFCN(n_frames=args.nframes, vel_radius=args.vel_radius, patch=args.patch,
+                   velocity_mode=args.velocity_mode, vmax=args.vmax).to(dev)
+    opt = torch.optim.Adam(model.parameters(), lr=args.lr)
+    nparam = sum(p.numel() for p in model.parameters())
+    print(f"model params={nparam} out_ch={model.out_ch}", flush=True)
+
+    bkw = dict(N=args.nframes, H=args.patch, W=args.patch, snr=tuple(args.snr),
+               vmax_px=args.vmax, snr_log=args.snr_log,
+               motion_blur=args.motion_blur, blur_frac=args.blur_frac,
+               vel_radius=args.vel_radius, vel_decimate=args.vel_decimate, sigma_v=args.sigma_v)
+    print(f"vmax_px={args.vmax}  w_bias={args.w_bias}  de-bias: {args.bias_bins} {args.bias_by}-bins (equal-population)", flush=True)
+    csv_path = f"{args.out}/losses.csv"
+    csv = open(csv_path, "w")
+    csv.write("step,det,vel,off,bias,velRMSE\n"); csv.flush()
+    t0 = time.time()
+    run = {"det": 0.0, "vel": 0.0, "off": 0.0, "bias": 0.0}
+    for step in range(1, args.steps + 1):
+        x, det_t, vel_t, off_t, det_w, lab = make_batch(rng, args.batch, dev,
+                                                 frac_pos=args.frac_pos, frac_off=args.frac_off,
+                                                 off_w_near=args.off_w_near, off_w_far=args.off_w_far,
+                                                 off_w_tau=args.off_w_tau, **bkw)
+        out = model(x)
+        if args.velocity_mode == "reg":                                          # T7 continuous head // By Claude on 06/17/2026
+            vx_t = torch.from_numpy(lab["vx"]).to(dev); vy_t = torch.from_numpy(lab["vy"]).to(dev)
+            bin_var = (torch.from_numpy(lab["snr"]).to(dev) if args.bias_by == "snr"
+                       else torch.sigmoid(model.split_reg(out)[0].reshape(-1)).detach())
+            mfsum_t = torch.from_numpy(lab["mfsum"]).to(dev) if args.mf_s else None
+            loss, comp = reg_loss(out, model, det_t, vx_t, vy_t, off_t, det_w=det_w,
+                                  w_vel=args.w_vel, w_off=args.w_off, w_bias=args.w_bias,
+                                  bin_var=bin_var, n_bins=args.bias_bins,
+                                  mfsum_t=mfsum_t, w_mfs=args.w_mfs)
+        else:
+            loss, comp = fcn_loss(out, model, det_t, vel_t, off_t, det_w=det_w,
+                                  w_vel=args.w_vel, w_off=args.w_off)
+            comp["bias"] = 0.0
+            if args.w_bias > 0:                                                  # By Claude on 06/15/2026
+                vx_t = torch.from_numpy(lab["vx"]).to(dev); vy_t = torch.from_numpy(lab["vy"]).to(dev)
+                if args.bias_by == "snr":
+                    bin_var = torch.from_numpy(lab["snr"]).to(dev)
+                else:  # bin by the network's own confidence s
+                    bin_var = torch.sigmoid(model.split(out)[0].reshape(-1)).detach()
+                lb = vel_bias_loss(out, model, vx_t, vy_t, det_t, bin_var, args.bias_bins, args.vel_decimate)
+                loss = loss + args.w_bias * lb
+                comp["bias"] = lb.detach().item()
+        opt.zero_grad(); loss.backward(); opt.step()
+        for k in run: run[k] += comp[k]
+        if step % args.log_every == 0:
+            n = args.log_every
+            verr = vel_err_px(out, model, lab, args.vel_decimate)
+            sc, son, sof, sn = s_by_class(out, model, lab)
+            sps = step / (time.time() - t0)
+            print(f"step {step:6d}  det {run['det']/n:.4f}  vel {run['vel']/n:.4f}  "
+                  f"off {run['off']/n:.4f}  bias {run['bias']/n:.4f}  velRMSE {verr:.4f}px/fr  "
+                  f"s[ctr/offN/offF/noise]={sc:.2f}/{son:.2f}/{sof:.2f}/{sn:.2f}  {sps:.0f} it/s", flush=True)
+            csv.write(f"{step},{run['det']/n:.5f},{run['vel']/n:.5f},{run['off']/n:.5f},{run['bias']/n:.5f},{verr:.5f}\n")
+            csv.flush()
+            run = {k: 0.0 for k in run}
+    csv.close()
+    torch.save({"model": model.state_dict(), "args": vars(args)}, f"{args.out}/model.pt")
+    print(f"saved {args.out}/model.pt  losses->{csv_path}", flush=True)
+    # ONNX export - the single artifact for BOTH deploy phases: ORT-Java (array-fed test) and # By Claude on 06/13/2026
+    # TensorRT (zero-copy CUDA prod). Dynamic H/W axes so the all-conv FCN slides over any frame.
+    try:
+        model.eval()
+        dummy = torch.zeros(1, args.nframes, args.patch, args.patch, device=dev)
+        onnx_path = f"{args.out}/model.onnx"
+        torch.onnx.export(
+            model, dummy, onnx_path,
+            input_names=["frames"], output_names=["out"],
+            dynamic_axes={"frames": {0: "B", 2: "H", 3: "W"}, "out": {0: "B", 2: "Hout", 3: "Wout"}},
+            opset_version=17)
+        print(f"exported {onnx_path}  (input frames[B,{args.nframes},H,W] -> out[B,{model.out_ch},Hout,Wout])", flush=True)
+    except Exception as e:
+        print(f"ONNX export skipped: {type(e).__name__}: {e}", flush=True)
+
+
+if __name__ == "__main__":
+    main()
--- a/velocity_bias.py
+++ b/velocity_bias.py
+#!/usr/bin/env python3
+# velocity_bias.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""Velocity-bias diagnostic. # By Claude on 06/15/2026
+Predicted vs true per-frame velocity across SNR, on clean CENTER targets, to decide whether the
+underestimate is SYSTEMATIC (clean/no-noise also < true) or SNR-DEPENDENT (only low SNR hedges
+toward 0 as the fan broadens). Fits predicted = gain*true + off for argmax and softmax-centroid;
+gain<1 = underestimate. Pools vx and vy (symmetric)."""
+import argparse, numpy as np, torch, synth
+from model import RawFCN
+
+
+def softmax(z):
+    z = z - z.max(); e = np.exp(z); return e / e.sum()
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("ck")
+    ap.add_argument("--nframes", type=int, default=9)  # must match the checkpoint's training --nframes (train.py defaults to 8)
+    ap.add_argument("--patch", type=int, default=24)  # must match the model's training patch (24 or 32) # By Claude on 06/16/2026
+    ap.add_argument("--vel_radius", type=int, default=5)
+    ap.add_argument("--vel_decimate", type=int, default=4)
+    ap.add_argument("--m", type=int, default=1500)
+    ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--velocity_mode", choices=["grid", "reg"], default="grid")  # T7 // By Claude on 06/17/2026
+    ap.add_argument("--vmax", type=float, default=1.4)  # reg bound (must match training)
+    ap.add_argument("--vmax_px", type=float, default=None)  # test-target velocity disk (defaults to vmax); raise for higher-velocity models // By Claude on 06/17/2026
+    ap.add_argument("--motion_blur", action="store_true")  # test on motion-blurred targets (match blur-trained models) // By Claude on 06/17/2026
+    ap.add_argument("--blur_frac", type=float, default=1.0)
+    a = ap.parse_args()
+    vmax_px = a.vmax_px if a.vmax_px is not None else a.vmax
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+    ck = torch.load(a.ck, map_location=dev)
+    m = RawFCN(n_frames=a.nframes, vel_radius=a.vel_radius, patch=a.patch,
+               velocity_mode=a.velocity_mode, vmax=a.vmax).to(dev)
+    m.load_state_dict(ck["model"]); m.eval()
+    n = 2 * a.vel_radius + 1; step = 1.0 / a.vel_decimate
+    ix = np.arange(n * n); vxc = (ix % n - a.vel_radius) * step; vyc = (ix // n - a.vel_radius) * step
+    rng = np.random.default_rng(a.seed)
+    print(f"model {a.ck}  nframes={a.nframes}  vel grid +/-{a.vel_radius*step:.2f}px step {step}  m={a.m}/snr")
+    print(f"{'snr':>6} {'pairs':>6} {'argmax gain':>12} {'off':>7} {'cen gain':>10} {'off':>7} {'cenRMSE':>8}")
+    for snr in [100.0, 8.0, 4.0, 2.0, 1.0]:
+        tv, pa, pc, sig = [], [], [], []
+        for _ in range(a.m):
+            f, lab = synth.generate_sample(rng, N=a.nframes, H=a.patch, W=a.patch, snr=snr, place="center",
+                                           vmax_px=vmax_px, motion_blur=a.motion_blur, blur_frac=a.blur_frac)
+            x = torch.from_numpy(f[None]).float().to(dev)
+            with torch.no_grad():
+                out = m(x)[0, :, 0, 0].cpu().numpy()
+            tv += [lab["vx"], lab["vy"]]
+            if a.velocity_mode == "reg":          # out = [det, Vx, Vy, logvar, dx, dy]
+                pvx, pvy = float(out[1]), float(out[2])
+                pa += [pvx, pvy]; pc += [pvx, pvy]; sig.append(float(np.exp(0.5 * out[3])))
+            else:
+                vel = softmax(out[1:1 + n * n]); s = vel.sum(); k = int(np.argmax(vel))
+                pa += [vxc[k], vyc[k]]
+                pc += [(vxc * vel).sum() / s, (vyc * vel).sum() / s]
+        tv, pa, pc = np.array(tv), np.array(pa), np.array(pc)
+        ga, ba = np.polyfit(tv, pa, 1); gc, bc = np.polyfit(tv, pc, 1)
+        rmse = float(np.sqrt(np.mean((pc - tv) ** 2)))
+        sigstr = f"  sigma={np.mean(sig):.3f}" if sig else ""
+        print(f"{snr:6.0f} {len(tv):6d} {ga:12.3f} {ba:+7.2f} {gc:10.3f} {bc:+7.2f} {rmse:8.3f}{sigstr}")
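Review note: the script above compares two ways of reading a velocity out of the grid posterior (argmax bin vs. probability-weighted centroid) and summarizes each with a least-squares gain and offset against the true velocity. The sketch below reproduces that bookkeeping without a checkpoint. It is a minimal illustration, not the repository's code: `grid_centers`, `estimates`, and `gain_offset` are hypothetical helper names, and a Gaussian bump in logit space stands in for the network output.

```python
import numpy as np


def grid_centers(vel_radius=5, vel_decimate=4):
    """Velocity-bin centers in px/frame, flattened with vx varying fastest."""
    n = 2 * vel_radius + 1
    step = 1.0 / vel_decimate
    ix = np.arange(n * n)
    return (ix % n - vel_radius) * step, (ix // n - vel_radius) * step


def estimates(logits, vxc, vyc):
    """Return ((vx, vy) at the argmax bin, (vx, vy) centroid) from flattened grid logits."""
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    k = int(np.argmax(p))
    return (vxc[k], vyc[k]), ((vxc * p).sum(), (vyc * p).sum())


def gain_offset(true_v, pred_v):
    """Slope and intercept of the least-squares line pred = gain * true + offset."""
    gain, offset = np.polyfit(true_v, pred_v, 1)
    return float(gain), float(offset)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vxc, vyc = grid_centers()
    tv, pa, pc = [], [], []
    for _ in range(500):
        vx, vy = rng.uniform(-1.0, 1.0, size=2)
        # ASSUMPTION: synthetic stand-in for the model, a Gaussian bump centered on the truth.
        logits = -((vxc - vx) ** 2 + (vyc - vy) ** 2) / (2 * 0.2 ** 2)
        (ax, ay), (cx, cy) = estimates(logits, vxc, vyc)
        tv += [vx, vy]
        pa += [ax, ay]
        pc += [cx, cy]
    print("argmax   gain, offset:", gain_offset(np.array(tv), np.array(pa)))
    print("centroid gain, offset:", gain_offset(np.array(tv), np.array(pc)))
```

A gain below 1 from the centroid estimate would indicate shrinkage toward zero velocity, which is what the per-SNR table in the script is meant to expose.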
--- a/viz_results.py
+++ b/viz_results.py
+# viz_results.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""
+Visualize a trained C5P FCN's OUTPUT on test patches. # By Claude on 06/13/2026
+
+Loads a model.pt, runs a GxG grid of test samples (mixed classes), and writes a velocity-block
+image like our -POST: each cell is the 11x11 softmax P(vx,vy) for that sample, with 1-px NaN
+gaps. Three pages: [0] P(vx,vy)*s (confidence-gated, dark = no detection); [1] for reference,
+the true-velocity one-hot marker (so you can toggle pred vs truth on adjacent pages); [2] the
+raw P(vx,vy). Prints per-cell s + pred/true v.
+Run inside the container:
+    python /work/viz_results.py /work/runs/weighted/model.pt /work/runs/weighted/results
+"""
+import sys
+import numpy as np
+import torch
+import synth
+from model import RawFCN
+
+
+def main():
+    ckpt_path = sys.argv[1]
+    out = sys.argv[2] if len(sys.argv) > 2 else (ckpt_path.rsplit("/", 1)[0] if "/" in ckpt_path else ".") + "/results"
+    grid = int(sys.argv[3]) if len(sys.argv) > 3 else 10
+    ck = torch.load(ckpt_path, map_location="cpu")
+    a = ck["args"]
+    vr = a.get("vel_radius", 5); vd = a.get("vel_decimate", 4)
+    N = a.get("nframes", 8); P = a.get("patch", 24)
+    model = RawFCN(n_frames=N, vel_radius=vr, patch=P); model.load_state_dict(ck["model"]); model.eval()
+    vdim = 2 * vr + 1
+    cell = vdim + 1
+    side = grid * cell - 1
+    pred_img = np.full((side, side), np.nan, dtype=np.float32)   # raw net P(vx,vy)
+    sw_img   = np.full((side, side), np.nan, dtype=np.float32)   # P(vx,vy) * s (confidence-gated)
+    true_img = np.full((side, side), np.nan, dtype=np.float32)   # true velocity one-hot
+    rng = np.random.default_rng(123)
+    print("cell    place      s      pred(vx,vy)        true(vx,vy)")
+    for r in range(grid):
+        for c in range(grid):
+            place = synth.pick_place(rng, 0.5, 0.3)
+            f, lab = synth.generate_sample(rng, N=N, H=P, W=P, snr=(2.0, 8.0), place=place)
+            with torch.no_grad():
+                o = model(torch.from_numpy(f[None]).float())      # [1,C,1,1]
+                det, vel, _ = model.split(o)
+                s = float(torch.sigmoid(det).reshape(-1)[0])
+                pv = torch.softmax(vel.reshape(1, -1), 1).reshape(vdim, vdim).numpy()
+            y0 = r * cell; x0 = c * cell
+            pred_img[y0:y0 + vdim, x0:x0 + vdim] = pv
+            sw_img[y0:y0 + vdim, x0:x0 + vdim] = pv * s   # dark unless the net is confident
+            # true-velocity marker, intensity encodes CLASS so FP are readable in-image:
+            #   center target -> 1.0 dot, offcenter target -> 0.5 dot, noise -> blank.
+            tb = np.zeros((vdim, vdim), np.float32)
+            if place in ("center", "offcenter"):
+                cy = int(round(vr + lab["vy"] * vd)); cx = int(round(vr + lab["vx"] * vd))
+                if 0 <= cy < vdim and 0 <= cx < vdim:
+                    tb[cy, cx] = 1.0 if place == "center" else 0.5
+            true_img[y0:y0 + vdim, x0:x0 + vdim] = tb
+            cells = np.arange(vdim) - vr
+            evx = (pv.sum(0) * cells).sum() / vd; evy = (pv.sum(1) * cells).sum() / vd
+            tv = (f"{lab['vx']:+.2f},{lab['vy']:+.2f}" if place != "none" else "   -   ")
+            print(f"({r},{c})  {place:9s}  {s:.2f}   {evx:+.2f},{evy:+.2f}      {tv}")
+    # order so the two comparison pages are ADJACENT (single scrollwheel toggle 1<->2):
+    stack = np.stack([sw_img, true_img, pred_img])
+    path = out if out.endswith(".tif") else out + "-velblocks.tif"
+    synth.save_tiff_stack(stack, path)
+    print(f"\nwrote {path}  3 pages: [0]=net P(vx,vy)*s (dark=no detection), "
+          f"[1]=truth (1.0=center target, 0.5=offcenter target, blank=noise), [2]=raw P(vx,vy).  "
+          f"grid={grid}x{grid} cell={vdim}x{vdim}, step={1.0 / vd:g}px/fr, center=0")
+
+
+if __name__ == "__main__":
+    main()
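Review note: viz_results.py converts between velocities in px/frame and grid bins in two places: the truth marker (`round(vr + v * vd)`) and the expected velocity taken from the marginals of `pv`. The sketch below isolates those two conversions so they can be checked by hand. It is an illustration under the script's conventions (rows index vy, columns index vx); `velocity_to_bin` and `expected_velocity` are hypothetical names, not functions from the repository.

```python
import numpy as np


def velocity_to_bin(v, vel_radius=5, vel_decimate=4):
    """Nearest grid bin for a velocity in px/frame, or None if it falls off the grid."""
    k = int(round(vel_radius + v * vel_decimate))
    return k if 0 <= k < 2 * vel_radius + 1 else None


def expected_velocity(pv, vel_radius=5, vel_decimate=4):
    """Posterior-mean (vx, vy) in px/frame for pv[vy_bin, vx_bin] that sums to 1."""
    bins = np.arange(pv.shape[0]) - vel_radius
    evx = (pv.sum(axis=0) * bins).sum() / vel_decimate   # marginal over vy, then mean vx
    evy = (pv.sum(axis=1) * bins).sum() / vel_decimate   # marginal over vx, then mean vy
    return float(evx), float(evy)


if __name__ == "__main__":
    pv = np.zeros((11, 11), np.float32)
    row, col = velocity_to_bin(0.5), velocity_to_bin(-0.5)   # vy = +0.5, vx = -0.5
    pv[row, col] = 1.0
    print("bins (row, col):", row, col)
    print("expected (vx, vy):", expected_velocity(pv))
```

For a one-hot posterior the expected velocity lands exactly on the bin center, which makes this a convenient sanity check of the axis order.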
--- a/viz_trainingdata.py
+++ b/viz_trainingdata.py
+# viz_trainingdata.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""
+Visualize a subset of the training data as a time-sliced grid. # By Claude on 06/13/2026
+
+Output: a multi-page 32-bit TIFF where
+  - page (slice) index = time/frame (0 = newest), scrub it to watch targets move;
+  - each page is a GxG grid of independent training samples (one patch per cell),
+    1-px NaN separators between cells (renders as a grid in ImageJ, like -RECT).
+Plus a printed label table per cell (det / vx,vy / dx,dy / snr) so you can correlate
+what you see with the ground truth - the (b) "is the training data right?" check.
+
+Usage:
+    python viz_trainingdata.py /tmp/c5p_train_viz --grid 8 --snr 1 8 --neg_frac 0.3
+"""
+
+import argparse
+import numpy as np
+import synth
+
+
+def make_grid_stack(rng, grid=8, N=8, P=24, snr=(1.0, 8.0),
+                    frac_pos=0.4, frac_off=0.4, radial=False):
+    """Returns (stack [N, gh, gw] with NaN gaps, labels list[grid*grid]).
+    Three classes mixed: center-positive / off-center-negative / noise - so you can SEE the
+    off-center targets the net must learn to suppress (det=0 despite a target in the patch)."""
+    gap = 1
+    cell = P + gap
+    gh = grid * cell - gap
+    gw = grid * cell - gap
+    stack = np.full((N, gh, gw), np.nan, dtype=np.float32)
+    labels = []
+    for r in range(grid):
+        for c in range(grid):
+            place = synth.pick_place(rng, frac_pos, frac_off)
+            frames, lab = synth.generate_sample(rng, N=N, H=P, W=P, snr=snr,
+                                                place=place, radial=radial)
+            y0 = r * cell; x0 = c * cell
+            stack[:, y0:y0 + P, x0:x0 + P] = frames
+            lab["cell"] = (r, c)
+            labels.append(lab)
+    return stack, labels
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("out")
+    ap.add_argument("--grid", type=int, default=8)
+    ap.add_argument("--nframes", type=int, default=8)
+    ap.add_argument("--patch", type=int, default=24)
+    ap.add_argument("--snr", type=float, nargs=2, default=[1.0, 8.0])
+    ap.add_argument("--frac_pos", type=float, default=0.4)
+    ap.add_argument("--frac_off", type=float, default=0.4)
+    ap.add_argument("--radial", action="store_true")
+    ap.add_argument("--seed", type=int, default=0)
+    args = ap.parse_args()
+
+    rng = np.random.default_rng(args.seed)
+    stack, labels = make_grid_stack(rng, args.grid, args.nframes, args.patch,
+                                    tuple(args.snr), args.frac_pos, args.frac_off, args.radial)
+    path = args.out if args.out.endswith(".tif") else args.out + ".tif"
+    synth.save_tiff_stack(stack, path)
+    print(f"grid={args.grid}x{args.grid}  {args.nframes} time slices  patch={args.patch}  "
+          f"-> {path}  (size {stack.shape[2]}x{stack.shape[1]})")
+    print("cell(r,c)  class      det  vx       vy      |off|  snr")
+    for lab in labels:
+        r, c = lab["cell"]
+        pl = lab["place"]
+        if pl == "none":
+            print(f"  ({r},{c})    noise        0     -        -        -    (~{lab['snr']:.1f})")
+        else:
+            off = np.hypot(lab["dx"], lab["dy"])
+            print(f"  ({r},{c})    {pl:9s}  {int(lab['det'])}   {lab['vx']:+.3f}  {lab['vy']:+.3f}  "
+                  f"{off:.2f}   {lab['snr']:.1f}")
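Review note: the time-sliced layout used by `make_grid_stack` (one patch per cell, 1-px NaN separators, frame index as the page) can be exercised without `synth`. The sketch below tiles placeholder arrays the same way and shows how to slice one cell back out of the stack. `tile_stack` and `cell_view` are hypothetical helpers written for this note, and the constant-valued samples are stand-ins for real training patches.

```python
import numpy as np


def tile_stack(samples, grid, gap=1):
    """Tile grid*grid samples of shape [N, P, P] into a [N, side, side] stack with NaN gaps."""
    n, p, _ = samples[0].shape
    pitch = p + gap
    side = grid * pitch - gap
    stack = np.full((n, side, side), np.nan, dtype=np.float32)
    for i, sample in enumerate(samples):
        r, c = divmod(i, grid)                      # row-major cell order
        stack[:, r * pitch:r * pitch + p, c * pitch:c * pitch + p] = sample
    return stack


def cell_view(stack, r, c, patch, gap=1):
    """Return the [N, patch, patch] view of cell (r, c) inside a tiled stack."""
    pitch = patch + gap
    return stack[:, r * pitch:r * pitch + patch, c * pitch:c * pitch + patch]


if __name__ == "__main__":
    # Placeholder samples: cell i is filled with the constant i (3 frames, 4x4 patch).
    samples = [np.full((3, 4, 4), float(i), np.float32) for i in range(4)]
    stack = tile_stack(samples, grid=2)
    print("stack shape:", stack.shape)
    print("cell (1,0) value:", cell_view(stack, 1, 0, patch=4)[0, 0, 0])
```

Because `cell_view` returns a view, it can also be used to overwrite a single cell in place when debugging a suspicious sample.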
--- a/vote_1d.py
+++ b/vote_1d.py
+# vote_1d.py - part of imagej_elphel_dnn (Elphel DNN: tile-processor motion detection / ranging)
+#
+# Copyright (C) 2026 Elphel, Inc.
+#
+# -----------------------------------------------------------------------------
+#  imagej_elphel_dnn is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+# -----------------------------------------------------------------------------
+"""C5P v2 diagnostic: 1-D vote-contribution profile for a single horizontal target. By Claude 06/18/2026
+
+Andrey's question: a disk pixel's local MF can lock the bright target at ANY temporal layer i, not
+only the oldest (i=N-1). With H the head position and P the voting pixel position (both along x),
+the velocity it reports is V_i = V_true + (P-H)/i and it back-projects to
+   T_i = H - V_true*(N-1) - (P-H)*[(N-1)/i - 1]
+solution") but real bright points at other layers smear it. This script renders ONE horizontal
+target (vy=0; the y-axis is symmetric, so 1-D along x is enough), runs frozen Stage 1, and shows,
+along the head row: the velocity ramp V_P(x), where each pixel votes (tail), and the resulting
+1-D vote histogram for three weightings (s, s^2, MF path-sum). It quantifies the fan vs SNR/velocity.
+"""
+import numpy as np
+import torch
+import synth
+import stage2 as S2
+from model import RawFCN
+from model import RawFCN
+
+dev = "cuda" if torch.cuda.is_available() else "cpu"
+N, vmax, P = 9, 2.8, 52
+half, Nm1 = P // 2, N - 1
+HW = 140
+F_ = HW - P + 1
+
+s1 = RawFCN(n_frames=N, patch=P, velocity_mode="reg", vmax=vmax).to(dev)
+s1.load_state_dict(torch.load("/work/runs/stage1_mf/model.pt", map_location=dev)["model"]); s1.eval()
+
+
+def render_h(V, amp, noise, blur_frac=1.0, nb=4, seed=0):
+    """One horizontal target, head centered, velocity (V,0), causal motion blur."""
+    rng = np.random.default_rng(seed)
+    frames = rng.standard_normal((N, HW, HW)).astype(np.float32) if noise else np.zeros((N, HW, HW), np.float32)
+    Hx = HW / 2.0 + V * Nm1 * 0.5        # center the whole track in the field
+    Hy = HW / 2.0
+    subs = np.arange(nb) * (blur_frac / nb)
+    for i in range(N):
+        acc = np.zeros((HW, HW), np.float64)
+        for ss in subs:
+            acc += synth.halfcos_bump(Hx - V * (i + ss), Hy, HW, HW)
+        frames[i] += (amp * acc / nb).astype(np.float32)
+    return frames, Hx, Hy
+
+
+def bar(v, vmax_, width=40):
+    n = int(round(width * v / vmax_)) if vmax_ > 0 else 0
+    return "#" * n
+
+
+def analyze(V, amp, noise, nseed=1, win=24):
+    """Print the head-row ramp + the 1-D vote histogram (weights s, s^2, MF-sum)."""
+    txf_true = (HW / 2.0 + V * Nm1 * 0.5 - V * Nm1) - half     # true tail, field x
+    hxf = (HW / 2.0 + V * Nm1 * 0.5) - half                    # head, field x
+    yf = int(round(HW / 2.0 - half))                           # head row, field y
+    xs_win = np.arange(max(0, int(hxf) - win), min(F_, int(hxf) + win + 1))
+
+    # accumulate vote histograms over seeds; keep seed-0 fields for the per-pixel table
+    BINS = np.arange(-12, 21)                                  # tail offset (field x) relative to true tail
+    hist = {k: np.zeros(len(BINS)) for k in ("s", "s2", "mf")}
+    tab = None
+    for seed in range(nseed):
+        fr, Hx, Hy = render_h(V, amp, noise, seed=seed)
+        s, vx, vy = S2.stage1_dense(s1, fr, dev=dev)
+        mf = S2.mf_sum(torch.from_numpy(fr).to(dev), vx, vy, half, N)
+        s = s.cpu().numpy(); vx = vx.cpu().numpy(); vy = vy.cpu().numpy(); mf = mf.cpu().numpy()
+        sv = s[yf, xs_win]; vxv = vx[yf, xs_win]; mfv = mf[yf, xs_win]
+        tail = xs_win - vxv * Nm1                              # where each pixel votes (field x)
+        toff = tail - txf_true                                 # offset from the true tail
+        for name, w in (("s", sv), ("s2", sv * sv), ("mf", mfv)):
+            for t, wv in zip(toff, w):
+                b = int(round(t)) - BINS[0]
+                if 0 <= b < len(BINS):
+                    hist[name][b] += wv
+        if seed == 0:
+            tab = (xs_win - hxf, vxv, vy[yf, xs_win], toff, sv, mfv)
+
+    print(f"\n===== V={V}  amp={amp}  noise={noise}  (disk |dx|<{(vmax-abs(V))*Nm1:.1f}px; nseed={nseed}) =====")
+    dx, vxv, vyv, toff, sv, mfv = tab
+    print("  per-pixel (head row, seed0):  dx=x-head  V_P  vy   tailΔ(=tail-trueT)   s     MF")
+    for j in range(0, len(dx), 2):
+        if sv[j] > 0.05 or abs(dx[j]) < 6:
+            print(f"   dx={dx[j]:+5.1f}  V_P={vxv[j]:+5.2f}  vy={vyv[j]:+4.2f}  tailΔ={toff[j]:+6.2f}  s={sv[j]:.3f}  MF={mfv[j]:6.2f}")
+    for name in ("s", "s2", "mf"):
+        h = hist[name] / max(nseed, 1); mx = h.max()
+        # stats: peak offset, weighted centroid, weighted std, concentration within +/-1.5 of true tail
+        if h.sum() > 0:
+            c = (BINS * h).sum() / h.sum()
+            sd = np.sqrt(((BINS - c) ** 2 * h).sum() / h.sum())
+            conc = h[np.abs(BINS) <= 1.5].sum() / h.sum()
+            pk = BINS[h.argmax()]
+        else:
+            c = sd = conc = pk = 0
+        print(f"  -- vote histogram (weight={name}):  peak@Δ{pk:+d}  centroidΔ{c:+.2f}  width(σ){sd:.2f}  conc(±1.5){conc*100:.0f}%")
+        for b, hv in zip(BINS, h):
+            mark = " <-trueT" if b == 0 else ""
+            print(f"     Δ{b:+3d} |{bar(hv, mx):40s}| {hv:7.2f}{mark}")
+
+
+if __name__ == "__main__":
+    analyze(V=1.0, amp=5, noise=False)
+    analyze(V=2.5, amp=5, noise=False)
+    analyze(V=1.0, amp=5, noise=True, nseed=16)
+    analyze(V=2.5, amp=5, noise=True, nseed=16)
+    analyze(V=2.5, amp=2, noise=True, nseed=16)
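Review note: the fan described in the vote_1d.py docstring follows from two lines of algebra. A pixel at P that locks the target at layer i sees it at H - V_true*i, so the velocity it reports satisfies P - V_i*i = H - V_true*i, and its vote lands at T_i = P - V_i*(N-1). The sketch below tabulates both for i = 1..N-1. It is a 1-D illustration of the formulas only, not the repository's Stage 2 code; `vote_fan` is a hypothetical name, and layer i = 0 is excluded because the reported velocity is undefined there.

```python
import numpy as np


def vote_fan(head, pixel, v_true, n_frames=9):
    """Reported velocity V_i and back-projected tail T_i for each lock layer i = 1..N-1."""
    nm1 = n_frames - 1
    i = np.arange(1, n_frames)
    d = pixel - head                                     # pixel offset from the head, P - H
    v_i = v_true + d / i                                 # V_i = V_true + (P-H)/i
    t_i = head - v_true * nm1 - d * (nm1 / i - 1.0)      # T_i from the docstring
    return i, v_i, t_i


if __name__ == "__main__":
    i, v_i, t_i = vote_fan(head=70.0, pixel=74.0, v_true=1.0)
    true_tail = 70.0 - 1.0 * 8
    for layer, v, t in zip(i, v_i, t_i):
        print(f"i={layer}  V_i={v:+.3f}  T_i={t:7.2f}  tail offset={t - true_tail:+.2f}")
```

Only the i = N-1 lock votes at the true tail; every other layer is displaced by (P-H)*[(N-1)/i - 1], which is the smear the histogram in `analyze` measures.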