Smith-Waterman from Prolog Facts: GPU Sequence Alignment for the Rest of Us

NVIDIA charges $20,000 per year for Clara Parabricks, their GPU-accelerated genomics pipeline. We built an open-source GPU Smith-Waterman aligner from Prolog facts in one evening. It produces identical scores on CPU and GPU across every test we've run. Here's how, and why it matters.

What we built¶

A complete Smith-Waterman local sequence alignment pipeline:

CPU reference in C with affine gap penalties and CIGAR traceback
GPU kernel using anti-diagonal wavefront parallelism (one thread per cell per anti-diagonal)
FASTQ → SAM pipeline compatible with samtools and IGV
Prolog kernel templates expressing the entire algorithm as queryable facts

The algorithm is described in lib/kernel_templates_genomics.pl as six fact families: scoring function, cell update recurrence, gap model, parallelism strategy, kernel parameters, and traceback operation. The same substrate that runs YOLOv5n image classification and Llama transformer inference now runs sequence alignment.

Verification¶

We verified on synthetic data, random sequences, and real biological sequences from NCBI and UniProt:

Test	Size	Score	CPU vs GPU
145 synthetic pairs	10-200bp	varied	all identical
57 random + block sweep	up to 300bp, 5 block sizes	varied	all identical
E. coli vs Salmonella 16S rRNA	1541bp × 1540bp	2926	identical
BRCA1 human self-align	1000bp × 1000bp	2000	identical
Sperm whale vs Human myoglobin	154bp × 154bp	236	identical
Human vs Bovine insulin	110bp × 105bp	152	identical
E. coli 16S vs Human insulin (neg ctrl)	1541bp × 110bp	11	identical

The negative control is important: E. coli 16S rRNA aligned against human insulin scores 11 — essentially noise. The aligner correctly finds no meaningful alignment between unrelated sequences.

The biological results match known phylogenetics: E. coli and Salmonella 16S rRNA are approximately 95% identical (known from decades of molecular systematics), and our aligner independently measures ~94.9% identity. Sperm whale and human myoglobin are known homologs at ~88% identity. The biology validates the math.

Performance¶

On our Tesla P4 (Pascal, sm_61):

Configuration	Time (1541bp × 1540bp)	vs CPU
CPU -O2	22.4 ms	reference
GPU block=32	15.4 ms	1.5x
GPU block=64	15.0 ms	1.5x
GPU block=128	15.4 ms	1.5x
GPU block=256	17.7 ms	1.3x

The 1.5x GPU speedup is modest because 1500bp is small — anti-diagonal parallelism scales with sequence length. At 10,000bp+ (real genomic contigs or whole-chromosome alignments), the GPU advantage grows dramatically because the anti-diagonals become wider, giving more threads useful work.

What the Prolog facts look like¶

The scoring function is a Prolog fact:

sw_scoring_function(dna_simple,
    [param(c_type(char), qi), param(c_type(char), rj),
     param(c_type(int), match), param(c_type(int), mismatch)],
    c_ternary(
        c_binop('==', c_var(qi), c_var(rj)),
        c_var(match),
        c_var(mismatch))).

The parallelism strategy is a Prolog fact:

sw_parallelism(anti_diagonal, parallelism_facts(
    parallel_dimension(anti_diagonal),
    gpu_mapping(one_thread_per_cell))).

The kernel parameters are sweepable:

sw_kernel_params(gpu_default, [
    tile_width(32),
    block_size(128),
    shared_mem_rows(2),
    match_score(2),
    mismatch_score(-1),
    gap_open(3),
    gap_extend(1)
]).

To switch from DNA to protein alignment, change one fact — swap dna_simple for protein_blosum62. The substrate handles the rest.

What this is NOT¶

This is not a production genomics pipeline. It lacks: banded alignment for long sequences, SIMD-striped inner loops, database search (BLAST-like filtering), paired-end read support, and many optimizations that production tools like BWA-MEM2 and minimap2 have accumulated over years.

The kernel fusion optimizer that is the crown jewel of LlamaTov for AI inference does not apply here — Smith-Waterman's cell update is naturally a single fused operation. The anti-diagonal parallelism is spatial, not fusable.

What this IS¶

It is proof that the BPD substrate is domain-general. The same Prolog-dispatched, bit-verified, parameter-sweepable architecture that runs YOLO image classification and Llama transformer inference also runs computational genomics. Three domains, one substrate, same verification discipline.

And it is free. No $20,000/year license. No vendor lock-in. No opaque binaries. Every kernel fact is queryable. Every output is verified. Every parameter is sweepable.

The code is at github.com/heath-hunnicutt-ruach-tov/bpd-substrate. The aligner is in bench/bpd_smith_waterman.c (CPU), bench/bpd_smith_waterman_gpu.cu (GPU), and lib/kernel_templates_genomics.pl (Prolog facts). Run python3 bench/sw_fastq_sam.py --self-test to verify on your hardware.