Smith-Waterman from Prolog Facts: GPU Sequence Alignment for the Rest of Us

NVIDIA charges $20,000 per year for Clara Parabricks, their GPU-accelerated genomics pipeline. We built an open-source GPU Smith-Waterman aligner from Prolog facts in one evening. It produces identical scores on CPU and GPU across every test we've run. Here's how, and why it matters.

What we built

A complete Smith-Waterman local sequence alignment pipeline:

  • CPU reference in C with affine gap penalties and CIGAR traceback
  • GPU kernel using anti-diagonal wavefront parallelism (one thread per cell per anti-diagonal)
  • FASTQ → SAM pipeline compatible with samtools and IGV
  • Prolog kernel templates expressing the entire algorithm as queryable facts

The algorithm is described in lib/kernel_templates_genomics.pl as six fact families: scoring function, cell update recurrence, gap model, parallelism strategy, kernel parameters, and traceback operation. The same substrate that runs YOLOv5n image classification and Llama transformer inference now runs sequence alignment.

Verification

We verified on synthetic data, random sequences, and real biological sequences from NCBI and UniProt:

Test Size Score CPU vs GPU
145 synthetic pairs 10-200bp varied all identical
57 random + block sweep up to 300bp, 5 block sizes varied all identical
E. coli vs Salmonella 16S rRNA 1541bp × 1540bp 2926 identical
BRCA1 human self-align 1000bp × 1000bp 2000 identical
Sperm whale vs Human myoglobin 154bp × 154bp 236 identical
Human vs Bovine insulin 110bp × 105bp 152 identical
E. coli 16S vs Human insulin (neg ctrl) 1541bp × 110bp 11 identical

The negative control is important: E. coli 16S rRNA aligned against human insulin scores 11 — essentially noise. The aligner correctly finds no meaningful alignment between unrelated sequences.

The biological results match known phylogenetics: E. coli and Salmonella 16S rRNA are approximately 95% identical (known from decades of molecular systematics), and our aligner independently measures ~94.9% identity. Sperm whale and human myoglobin are known homologs at ~88% identity. The biology validates the math.

Performance

On our Tesla P4 (Pascal, sm_61):

Configuration Time (1541bp × 1540bp) vs CPU
CPU -O2 22.4 ms reference
GPU block=32 15.4 ms 1.5x
GPU block=64 15.0 ms 1.5x
GPU block=128 15.4 ms 1.5x
GPU block=256 17.7 ms 1.3x

The 1.5x GPU speedup is modest because 1500bp is small — anti-diagonal parallelism scales with sequence length. At 10,000bp+ (real genomic contigs or whole-chromosome alignments), the GPU advantage grows dramatically because the anti-diagonals become wider, giving more threads useful work.

What the Prolog facts look like

The scoring function is a Prolog fact:

sw_scoring_function(dna_simple,
    [param(c_type(char), qi), param(c_type(char), rj),
     param(c_type(int), match), param(c_type(int), mismatch)],
    c_ternary(
        c_binop('==', c_var(qi), c_var(rj)),
        c_var(match),
        c_var(mismatch))).

The parallelism strategy is a Prolog fact:

sw_parallelism(anti_diagonal, parallelism_facts(
    parallel_dimension(anti_diagonal),
    gpu_mapping(one_thread_per_cell))).

The kernel parameters are sweepable:

sw_kernel_params(gpu_default, [
    tile_width(32),
    block_size(128),
    shared_mem_rows(2),
    match_score(2),
    mismatch_score(-1),
    gap_open(3),
    gap_extend(1)
]).

To switch from DNA to protein alignment, change one fact — swap dna_simple for protein_blosum62. The substrate handles the rest.

What this is NOT

This is not a production genomics pipeline. It lacks: banded alignment for long sequences, SIMD-striped inner loops, database search (BLAST-like filtering), paired-end read support, and many optimizations that production tools like BWA-MEM2 and minimap2 have accumulated over years.

The kernel fusion optimizer that is the crown jewel of LlamaTov for AI inference does not apply here — Smith-Waterman's cell update is naturally a single fused operation. The anti-diagonal parallelism is spatial, not fusable.

What this IS

It is proof that the BPD substrate is domain-general. The same Prolog-dispatched, bit-verified, parameter-sweepable architecture that runs YOLO image classification and Llama transformer inference also runs computational genomics. Three domains, one substrate, same verification discipline.

And it is free. No $20,000/year license. No vendor lock-in. No opaque binaries. Every kernel fact is queryable. Every output is verified. Every parameter is sweepable.

The code is at github.com/heath-hunnicutt-ruach-tov/bpd-substrate. The aligner is in bench/bpd_smith_waterman.c (CPU), bench/bpd_smith_waterman_gpu.cu (GPU), and lib/kernel_templates_genomics.pl (Prolog facts). Run python3 bench/sw_fastq_sam.py --self-test to verify on your hardware.