NVIDIA charges $20,000 per year for Clara Parabricks, their GPU-accelerated genomics pipeline. We built an open-source GPU Smith-Waterman aligner from Prolog facts in one evening. It produces identical scores on CPU and GPU across every test we've run. Here's how, and why it matters.
What we built¶
A complete Smith-Waterman local sequence alignment pipeline:
- CPU reference in C with affine gap penalties and CIGAR traceback
- GPU kernel using anti-diagonal wavefront parallelism (one thread per cell per anti-diagonal)
- FASTQ → SAM pipeline compatible with samtools and IGV
- Prolog kernel templates expressing the entire algorithm as queryable facts
The algorithm is described in lib/kernel_templates_genomics.pl as six fact families: scoring function, cell update recurrence, gap model, parallelism strategy, kernel parameters, and traceback operation. The same substrate that runs YOLOv5n image classification and Llama transformer inference now runs sequence alignment.
Verification¶
We verified on synthetic data, random sequences, and real biological sequences from NCBI and UniProt:
| Test | Size | Score | CPU vs GPU |
|---|---|---|---|
| 145 synthetic pairs | 10-200bp | varied | all identical |
| 57 random + block sweep | up to 300bp, 5 block sizes | varied | all identical |
| E. coli vs Salmonella 16S rRNA | 1541bp × 1540bp | 2926 | identical |
| BRCA1 human self-align | 1000bp × 1000bp | 2000 | identical |
| Sperm whale vs Human myoglobin | 154bp × 154bp | 236 | identical |
| Human vs Bovine insulin | 110bp × 105bp | 152 | identical |
| E. coli 16S vs Human insulin (neg ctrl) | 1541bp × 110bp | 11 | identical |
The negative control is important: E. coli 16S rRNA aligned against human insulin scores 11 — essentially noise. The aligner correctly finds no meaningful alignment between unrelated sequences.
The biological results match known phylogenetics: E. coli and Salmonella 16S rRNA are approximately 95% identical (known from decades of molecular systematics), and our aligner independently measures ~94.9% identity. Sperm whale and human myoglobin are known homologs at ~88% identity. The biology validates the math.
Performance¶
On our Tesla P4 (Pascal, sm_61):
| Configuration | Time (1541bp × 1540bp) | vs CPU |
|---|---|---|
| CPU -O2 | 22.4 ms | reference |
| GPU block=32 | 15.4 ms | 1.5x |
| GPU block=64 | 15.0 ms | 1.5x |
| GPU block=128 | 15.4 ms | 1.5x |
| GPU block=256 | 17.7 ms | 1.3x |
The 1.5x GPU speedup is modest because 1500bp is small — anti-diagonal parallelism scales with sequence length. At 10,000bp+ (real genomic contigs or whole-chromosome alignments), the GPU advantage grows dramatically because the anti-diagonals become wider, giving more threads useful work.
What the Prolog facts look like¶
The scoring function is a Prolog fact:
sw_scoring_function(dna_simple,
[param(c_type(char), qi), param(c_type(char), rj),
param(c_type(int), match), param(c_type(int), mismatch)],
c_ternary(
c_binop('==', c_var(qi), c_var(rj)),
c_var(match),
c_var(mismatch))).
The parallelism strategy is a Prolog fact:
sw_parallelism(anti_diagonal, parallelism_facts(
parallel_dimension(anti_diagonal),
gpu_mapping(one_thread_per_cell))).
The kernel parameters are sweepable:
sw_kernel_params(gpu_default, [
tile_width(32),
block_size(128),
shared_mem_rows(2),
match_score(2),
mismatch_score(-1),
gap_open(3),
gap_extend(1)
]).
To switch from DNA to protein alignment, change one fact — swap dna_simple for protein_blosum62. The substrate handles the rest.
What this is NOT¶
This is not a production genomics pipeline. It lacks: banded alignment for long sequences, SIMD-striped inner loops, database search (BLAST-like filtering), paired-end read support, and many optimizations that production tools like BWA-MEM2 and minimap2 have accumulated over years.
The kernel fusion optimizer that is the crown jewel of LlamaTov for AI inference does not apply here — Smith-Waterman's cell update is naturally a single fused operation. The anti-diagonal parallelism is spatial, not fusable.
What this IS¶
It is proof that the BPD substrate is domain-general. The same Prolog-dispatched, bit-verified, parameter-sweepable architecture that runs YOLO image classification and Llama transformer inference also runs computational genomics. Three domains, one substrate, same verification discipline.
And it is free. No $20,000/year license. No vendor lock-in. No opaque binaries. Every kernel fact is queryable. Every output is verified. Every parameter is sweepable.
The code is at github.com/heath-hunnicutt-ruach-tov/bpd-substrate. The aligner is in bench/bpd_smith_waterman.c (CPU), bench/bpd_smith_waterman_gpu.cu (GPU), and lib/kernel_templates_genomics.pl (Prolog facts). Run python3 bench/sw_fastq_sam.py --self-test to verify on your hardware.