Speech-to-text and k8s on a Jetson

Cezar Cocu · 14 min read

If you want to suffer, buy an NVIDIA Jetson Orin.

If you want to really suffer, put Kubernetes on it and try to bring it into your cluster.

If you are a masochist, try to run SGLang on it.

Alas, we proceed:

TL;DR

  • Running Qwen3-ASR on an 8 GB Jetson Orin Nano via the obvious Python serving stacks (SGLang, vLLM) mostly doesn't work (too memory hungry).
  • What actually produces benchmark numbers is INT8 + a TensorRT-Edge-LLM bundle built on the actual Orin target. Qwen3-ASR-0.6B lands at around 6.5% WER and an aggregate RTF of about 1.2 on a 16-sample slice. Slow, but real.
  • Most of the work ends up being the surrounding stack rather than the model itself: JetPack and BSP, sm_87 quirks, host-vs-container TensorRT, K3s GPU runtime, pod memory limits, all of it.
  • KubeJet (the small K3s-native operator I built along the way) wraps that working path so a coding agent can kubectl apply, watch metrics, and roll back via git revert — instead of SSH-archaeology-ing the Jetson every time it gets moody.
  • Pretty painful process overall, from installing JetPack to running real-ish inference servers.

I wanted to see whether a Jetson Orin Nano could run Qwen/Qwen3-ASR-1.7B as a useful speech-to-text node for real-life tasks without sending that data to cloud models. The Jetson Orin has about 67 TOPS, which on paper should be more than enough.

Decided to K3s-ify it. Because why not?!

The Node Actually Worked

The hardware was a Jetson Orin Nano running JetPack 7.2 / Jetson Linux R39.2:

Ubuntu 24.04
Linux 6.8.12-tegra
aarch64
CUDA 13.2
TensorRT 10.16

Jetson is not just "small NVIDIA GPU, but adorable" — it's an embedded NVIDIA platform with its own BSP, kernel, firmware path, and container runtime assumptions. And the Orin GPU being Ampere-class sm_87 ended up mattering at almost every layer: PyTorch wheels warning about advertised CUDA architectures, FlashInfer doing first-request JIT, TensorRT engine builds caring about the real target, and container TensorRT disagreeing with the host JetPack.

Eventually the node joined K3s as jetson-orin-nano-01, and the NVIDIA device plugin advertised:

nvidia.com/gpu: 1

So far, so good.

Then I tried a Kubernetes GPU smoke test with:

runtimeClassName: nvidia

which failed with:

RuntimeHandler "nvidia" not supported

On this Jetson, K3s was using Docker through cri-dockerd, and Docker was already configured with NVIDIA as the default runtime. There was no RuntimeClass to request. Removing runtimeClassName made CUDA work, and in hindsight that's roughly where the assumption that "Kubernetes GPU stuff just works" quietly died for the rest of this experiment.

SGLang: The Masochism Section

SGLang installed. That alone felt like an event.

The tested path used sglang==0.5.12.post1 from the CUDA wheel index on arm64. It also needed kernels==0.14.1 and kernels-data==0.14.1; a newer kernels==0.15.2 failed during import with:

ValueError: Either a revision or a version must be specified

(And just to be clear: the model itself hadn't even loaded at this point. This was all package resolution.)

The one SGLang path that completed a request was aggressively constrained:

sglang serve \
  --model-path Qwen/Qwen3-ASR-1.7B \
  --trust-remote-code \
  --dtype half \
  --context-length 512 \
  --mem-fraction-static 0.85 \
  --max-total-tokens 128 \
  --max-running-requests 1 \
  --enable-multimodal \
  --skip-server-warmup \
  --disable-cuda-graph \
  --disable-radix-cache

It needed temporary host swap, used a tiny context, served one request at a time, and disabled most of what makes serving actually fast — which by any honest definition isn't really a serving config, more of a "let me get one sentence out without OOMing" config.

The 6-second LibriSpeech smoke request did return the expected text:

{"text":"Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.","usage":{"type":"duration","seconds":6}}

The wall time was about 256s, including first-request FlashInfer/SGLang JIT for sm_87.

That result mattered to me because it proved the stack could execute at all. The numbers themselves were bad, and the path wasn't anything I'd want to actually run.

When I retried SGLang cleanly under Kubernetes after rebooting the Jetson and removing temporary swap, the pod never reached readiness. The node stayed Ready, the device plugin stayed alive, and the container was OOMKilled at the 7Gi memory limit.

The logs got far enough to be useful:

Load weight begin. avail mem=3.77 GB
Using triton_attn as multimodal attention backend
rank 0 scheduler died with exit code -9

(Exit code -9 means the process was killed with SIGKILL, which is what the kernel OOM killer sends, in case you, like me, had forgotten.)
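To make that concrete, here is a minimal sketch of where a -9 comes from. Python reports a child killed by a signal as the negative signal number, so SIGKILL (signal 9) shows up as -9; the same death appears as exit code 137 (128 + 9) in a shell or in a container's exit status. The helper name `describe_exit` is hypothetical, and the snippet assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable cause."""
    if returncode < 0:
        # Negative values mean "terminated by signal number -returncode".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Start a child that would sleep for a while, then kill it with SIGKILL,
# which is the same signal the kernel OOM killer delivers.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
child.kill()   # SIGKILL on POSIX
child.wait()

print(child.returncode, describe_exit(child.returncode))
```

A -9 on its own only says "SIGKILL"; it is the OOMKilled status on the container that ties it to memory.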

So why not just run SGLang and call it a day?

Because the only successful path required swap, a very small serving shape, first-request JIT, one request at a time, and enough memory pressure that the same deployment OOMKilled before readiness once I ran it as the actual Kubernetes target. I'll happily count that as proof the stack can run, but I wouldn't really call it a baseline I'd trust for anything beyond "it executed once, on its best behavior."

vLLM Made A Respectable Attempt

vLLM was next. The tested path installed vllm[audio]==0.22.1 on arm64 in nvcr.io/nvidia/pytorch:25.08-py3, and the wheel recognized the model as Qwen3ASRForConditionalGeneration. Even with an intentionally small config (gpu_memory_utilization=0.30, max_model_len=1024, max_num_seqs=1, enforce_eager=True, bfloat16), the pod was OOMKilled at the 7Gi memory limit during model load. The node itself held fine this time. It was the workload that gave up, which is mostly to say Kubernetes on the Jetson wasn't really the bottleneck anymore. A general-purpose Python serving stack pointed at a multimodal 1.7B model on an 8 GB unified-memory device is just a very tight fit.
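Some back-of-envelope arithmetic shows how tight. The sketch below assumes the nominal parameter counts in the model names (1.7B and 0.6B) and counts weights only; the real checkpoints, KV cache, CUDA context, and the Python runtime all come on top, so treat the results as lower bounds. The function name `weight_gib` is made up for illustration.

```python
GIB = 1024 ** 3


def weight_gib(params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return params * bytes_per_param / GIB


# Nominal parameter counts taken from the model names (an assumption).
models = [("Qwen3-ASR-1.7B", 1.7e9), ("Qwen3-ASR-0.6B", 0.6e9)]
precisions = [("fp16/bf16", 2), ("int8", 1)]

for name, params in models:
    for label, nbytes in precisions:
        print(f"{name:16s} {label:10s} ~{weight_gib(params, nbytes):.2f} GiB")
```

At 16-bit precision the 1.7B weights alone come to roughly 3.4 GB (about 3.2 GiB), which is most of the 3.77 GB the SGLang log reported as available before loading. INT8 halves that, and the 0.6B model shrinks it again.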

The Quantization Strategy That Worked

The path that produced useful numbers was less convenient and much more believable:

  1. use the 4090/x86 host for the heavy quantization and export work,
  2. apply an INT8 strategy around the decoder path,
  3. export a TensorRT-Edge-LLM-style bundle,
  4. build the final TensorRT engines on the Orin target,
  5. benchmark those engines on the Jetson.

The quantization path targeted decoder quantizers with W8A8 SmoothQuant-style INT8 while keeping sensitive parts like embeddings and output projection out of the blast radius. The goal was modest — just shrink the serving path enough that the Jetson could actually run it without every request turning into a memory-pressure incident.
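For readers who haven't met SmoothQuant, the core idea is small enough to show in a toy. This is a sketch of the published smoothing step only, not the recipe used here: the per-channel scale is s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), activations are divided by it and weights multiplied by it, so the layer output is unchanged while activation outliers shrink into a range INT8 can represent. The data, the helper names, and alpha = 0.5 are all illustrative assumptions; a real pipeline takes the activation statistics from calibration data.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [
        (a ** alpha) / (w ** (1.0 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]


def apply_smoothing(x, w, scales):
    """Divide activations and multiply weights by the same per-channel scale."""
    x_s = [xj / s for xj, s in zip(x, scales)]
    w_s = [wj * s for wj, s in zip(w, scales)]
    return x_s, w_s


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Toy data: the third activation channel is a large outlier.
# (No zeros here; a real implementation would clamp to avoid dividing by zero.)
x = [0.5, -0.3, 40.0, 0.2]
w = [0.8, -0.6, 0.05, 0.7]

scales = smooth_scales([abs(v) for v in x], [abs(v) for v in w])
x_s, w_s = apply_smoothing(x, w, scales)

print("output before smoothing:", dot(x, w))
print("output after smoothing: ", dot(x_s, w_s))
print("activation absmax:", max(map(abs, x)), "->", max(map(abs, x_s)))
```

The output stays the same because each product x_j * w_j is untouched; only the split of magnitude between activation and weight moves.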

The Orin-side build also mattered. The working build used host JetPack CUDA/TensorRT rather than the NGC container TensorRT, because the container TensorRT rejected Orin sm_87. The working path mounted host /usr/local/cuda-13.2 and /usr, built engines with TensorRT 10.16.2.10, and needed 16 GB of temporary swap during engine build.

None of which is the "just docker run this model" dream, but I think it's much closer to how edge inference actually goes once you sit down and try it on a real device.

The Result

The usable full-slice INT8 run used Qwen/Qwen3-ASR-0.6B with a TensorRT-Edge-LLM-style host-TensorRT bundle on jetson-orin-nano-01.

samples:        16 / 16
datasets:       LibriSpeech + FLEURS
audio:          132.73 seconds
latency:        157.07 seconds
mean latency:   9.82 seconds
aggregate RTF:  1.18
mean RTF:       1.42
mean WER:       6.55% after rescoring
errors:         0

The historical Qwen/Qwen3-ASR-1.7B INT8 run also completed, but it was slower on the same 16-sample sanity slice: 224.10s total latency, aggregate RTF 1.69, and rescored mean WER 12.81%. Not a publication-quality WER comparison by any means, but enough for me to land on 0.6B as the practical Orin target for now.
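Since aggregate RTF and mean RTF show up side by side in these results, here is how the two differ. Aggregate RTF is total latency over total audio; mean RTF averages the per-sample ratios, so short clips with a fixed startup cost pull it up. The per-sample numbers in the second half are invented purely to illustrate that effect; only the totals come from the runs above.

```python
from statistics import mean


def aggregate_rtf(latencies, durations):
    """Total processing time divided by total audio time."""
    return sum(latencies) / sum(durations)


def mean_rtf(latencies, durations):
    """Average of the per-sample latency/duration ratios."""
    return mean(lat / dur for lat, dur in zip(latencies, durations))


# Reported totals: 0.6B run and 1.7B run on the same 132.73s of audio.
print(round(aggregate_rtf([157.07], [132.73]), 2))
print(round(aggregate_rtf([224.10], [132.73]), 2))

# Illustrative (made-up) samples: a short clip and a long clip.
latencies = [6.0, 9.0]
durations = [3.0, 9.0]
print(aggregate_rtf(latencies, durations), mean_rtf(latencies, durations))
```

That gap is one plausible reading of why mean RTF (1.42) sits above aggregate RTF (1.18) in the table: per-sample overhead weighs more on the shorter clips.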

The build numbers were also useful:

audio engine build:       106s
LLM engine build:         266s
temporary swap:           16 GB during engine build
runtime engine cache:     enough to skip swap for inference profiling

There is a giant caveat baked into those numbers: the full-slice benchmark launched llm_inference once per sample, so the aggregate RTF includes repeated process and model startup. A persistent Edge-LLM server or a batched driver is the obvious next optimization target before judging anything like steady-state throughput.

The First Real Optimization Was Boring

The first profiler result told me, more or less, to stop launching the whole runtime over and over (which honestly would have been a less fun thing to write up than "I tuned a custom kernel," but the numbers really weren't subtle). On the two-sample profiler pass, Nsight reported only about 0.176s of CUDA kernel time. The CUDA API summary, by contrast, was full of module loads, memory copies, stream syncs, and allocations — the per-sample harness was making llm_inference reload engines, tokenizer state, audio runtime state, and decoding graphs for every clip.

So the first optimization wasn't anything glamorous — just one JSON file holding multiple requests, fed to a single llm_inference process with batch_size: 1.

That took the same two German FLEURS clips from 14.83s to 8.16s (the audio duration, RTF, and error count below describe the single-process run):

per-sample subprocess baseline: 14.83s wall time
single-process runner mode:     8.16s total latency
audio duration:                 24.30s
aggregate RTF:                  0.336
errors:                         0

True batching was the next obvious idea. It failed immediately, with a message I couldn't fix by flipping any serving flag:

Requested batch size 2 exceeds maximum supported batch size 1

maxBatch is baked into the engine at compile time. The current engine had been built with maxBatch=1, so the next real experiment was to rebuild a maxBatch=2 or maxBatch=4 INT8 engine and compare against the single-process baseline.

So I did. The maxBatch=2 engine built and ran, and on two samples the latency even improved a little:

batch_size=1, two samples: 8.66s total latency, aggregate RTF 0.356
batch_size=2, two samples: 7.61s total latency, aggregate RTF 0.313

But the transcripts were wrong: the first output roughly matched the second audio clip, the second was corrupted text, and the output JSON reported both responses with request_idx: 0 — exactly the kind of tiny metadata detail that makes you stop trusting the bigger number above it.
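A duplicated request_idx is cheap to catch automatically before anyone looks at latency. The sketch below is a hypothetical sanity check, not part of TensorRT-Edge-LLM: it assumes the output is a list of response objects that each carry a `request_idx` field, as described above, and it reports count mismatches, duplicates, and missing indices.

```python
from collections import Counter


def check_request_indices(responses, expected_count):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen = [r.get("request_idx") for r in responses]

    if len(responses) != expected_count:
        problems.append(f"expected {expected_count} responses, got {len(responses)}")

    duplicates = [idx for idx, n in Counter(seen).items() if n > 1]
    if duplicates:
        problems.append(f"duplicate request_idx values: {duplicates}")

    missing = sorted(set(range(expected_count)) - set(seen))
    if missing:
        problems.append(f"missing request_idx values: {missing}")

    return problems


# The failure mode from the maxBatch=2 run: both rows claim index 0.
print(check_request_indices([{"request_idx": 0}, {"request_idx": 0}], 2))
```

It doesn't prove the transcripts are right, only that the rows can be matched back to their inputs, which is the precondition for scoring them at all.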

Then maxBatch=4 made the point more clearly:

batch_size=1, four samples: 10.14s total latency, aggregate RTF 0.215
batch_size=4, four samples: 11.81s total latency, aggregate RTF 0.250

And the true batch output was still wrong: a mean WER of 1.0 (that is, 100%), cross-contaminated hypotheses, and corrupted later rows. Nothing about the process or the pod looked unhealthy from the outside; the model was just quietly producing bad transcripts.
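For scale, this is what a WER of 1.0 means. The sketch below is plain word error rate (word-level edit distance divided by reference length) with lowercasing as the only normalization; the rescoring used for the numbers in this post may normalize differently, so don't expect it to reproduce them exactly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(wer("the cat sat", "the cat sat"))
print(wer("the cat sat", "a dog ran"))
```

A hypothesis that belongs to a different clip scores about as badly as random text, which is how a healthy-looking batched run ends up at 1.0.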

Both are still useful data points, though. For this Qwen3-ASR TensorRT-Edge-LLM path the win turns out to be runtime reuse instead of true audio batching, so the working rule for now is: keep one llm_inference process warm, and keep each request at batch_size: 1. Slightly annoyingly, the first real performance bug in this whole setup turned out to be orchestration rather than anything happening on the GPU itself.

The Operator Was The Point

At this point the experiment had drifted from "can the Jetson run this model?" to a better question:

Can an agent deploy this without remembering all the ritual?

That became KubeJet: a small K3s-native Jetson inference operator. The first useful resource was deliberately narrow:

apiVersion: kubejet.dev/v1alpha1
kind: EdgeInferenceService
metadata:
  name: qwen3-asr-orin-nano
  namespace: stt-bench
spec:
  task: asr
  runtime:
    name: tensorrt-edgellm
    adapter: trt-edgellm-asr
    runtimeClassName: nvidia
    image: ghcr.io/cezarc1/kubejet-trt-edgellm-asr:0.1.0-cezar-a9da577
  artifact:
    uri: pvc://qwen3-asr-06b-edgellm-work-jp6/bundle
    model: Qwen/Qwen3-ASR-0.6B
    precision: int8
  target:
    deviceClass: jetson-orin-nano-8gb
    nodeSelector:
      accelerator: nvidia-jetson-orin-nano
    memoryBudget: 5200Mi
  serving:
    protocol: openai-compatible
    endpoint: /v1/audio/transcriptions
    port: 8000
  validation:
    contractSmokeTest:
      enabled: true

The operator turned that into:

  • a runtime Deployment,
  • a Service,
  • GPU requests and Jetson node selectors,
  • a PVC mount for the TensorRT-Edge-LLM bundle,
  • /readyz and /metrics,
  • and a contract smoke Job that posts a real audio file to /v1/audio/transcriptions.

The final smoke result was exactly the sort of thing I wanted an agent to be able to verify:

{"ok": true, "text": "He was in a fevered state of mind owing to the blight his wife's action threatened to cast upon his entire future."}

The boring Kubernetes details were not boring in practice. The operator had to learn them:

  • do not set runtimeClassName: nvidia on this Docker-backed Jetson,
  • use Recreate, not rolling updates, because the Orin has one GPU,
  • null stale rollingUpdate and runtimeClassName fields when patching,
  • keep /opt/hpcx/ucx/lib ahead of host-mounted UCX libraries or PyTorch import fails,
  • parse TensorRT-Edge-LLM's responses[].output_text, not the top-level input_file,
  • reject fake smoke success where the returned "transcript" is just a path,
  • strip the model's "language English" prefix with a Python-compatible regex.
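The last three quirks in that list fit into one small parsing helper. This is a sketch under assumptions, not KubeJet's actual code: it takes the runtime output to be a JSON object with a top-level `input_file` and a `responses` list whose items have `output_text`, and it assumes the prefix is literally the word "language", a language name, and whitespace. The function names and the list of audio suffixes are made up for illustration.

```python
import re

# Assumed prefix shape: "language English " at the start of the text.
_LANG_PREFIX = re.compile(r"^\s*language\s+[A-Za-z]+\s*", re.IGNORECASE)
_AUDIO_SUFFIXES = (".wav", ".flac", ".mp3", ".ogg")


def looks_like_path(text, input_file=None):
    """Heuristic: a 'transcript' that is really just a file path."""
    t = text.strip()
    if input_file and t == input_file.strip():
        return True
    return " " not in t and ("/" in t or t.lower().endswith(_AUDIO_SUFFIXES))


def extract_transcript(payload):
    """Read responses[0].output_text, strip the prefix, reject fake success."""
    responses = payload.get("responses") or []
    if not responses:
        raise ValueError("no responses[] in runtime output")
    text = responses[0].get("output_text") or ""
    text = _LANG_PREFIX.sub("", text, count=1).strip()
    if not text or looks_like_path(text, payload.get("input_file")):
        raise ValueError("runtime returned no usable transcript")
    return text


sample = {
    "input_file": "/data/clip.wav",  # hypothetical path
    "responses": [{"output_text": "language English He was in a fevered state of mind."}],
}
print(extract_transcript(sample))
```

One known weakness of a prefix regex this loose: a genuine transcript that starts with "language" followed by another word would lose its first two words, so a stricter pattern is worth it if the real prefix has a more specific delimiter.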

Honestly, that list of quirks is most of the reason I built this as an operator in the first place. What I really wanted was a Kubernetes API that captures the weird parts of running things on this Jetson once, so that future agents (and future me) could just do normal agent things: apply a manifest, watch status, read logs, inspect metrics, roll back through GitOps, and skip the SSH archaeology I'd otherwise be doing every time the Orin got moody about something.
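Two of those quirks, switching to Recreate and nulling stale fields, come down to how JSON merge patches work: a null value deletes the key. Kubernetes rejects a Deployment whose strategy type is Recreate while rollingUpdate is still set, so the null matters. The sketch below builds such a patch and applies it to a toy object with a small RFC 7386-style merge function so the effect is visible; the function names and the toy Deployment are illustrative, and dropping runtimeClassName only applies to the Docker-backed setup.

```python
import json


def recreate_patch(drop_runtime_class=True):
    """Merge patch: switch to Recreate and delete stale fields via null."""
    patch = {"spec": {"strategy": {"type": "Recreate", "rollingUpdate": None}}}
    if drop_runtime_class:
        patch["spec"]["template"] = {"spec": {"runtimeClassName": None}}
    return patch


def apply_merge_patch(target, patch):
    """JSON merge patch semantics: dicts merge, None deletes, the rest replaces."""
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = apply_merge_patch(result.get(key), value)
    return result


# Toy "live" object carrying the stale fields.
live = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": "25%", "maxUnavailable": "25%"},
        },
        "template": {
            "spec": {"runtimeClassName": "nvidia", "containers": [{"name": "runtime"}]}
        },
    }
}

patched = apply_merge_patch(live, recreate_patch())
print(json.dumps(patched, indent=2))
```

Serialized with json.dumps, a body like this is the kind of thing kubectl patch accepts with --type merge.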

The Recovery Release

Then reality added one more footnote.

The Arducam xISP camera path did not have a clean JetPack 7 driver story, so the useful system became the boringly conservative one: JetPack 6.2 / L4T 36.5, CUDA 12.6, TensorRT 10.3, K3s, containerd, NVIDIA runtime class, and a GitOps deployment that an agent can inspect without SSHing into the Jetson.

That recovery release did three things I now consider table stakes:

  • staged the INT8 bundle onto an Orin-local PVC,
  • built the TensorRT engines on the actual Orin target with temporary host swap,
  • served the result through EdgeInferenceService with a real transcription smoke test and Prometheus metrics.

The final JP6 benchmark was not flashy, but it was reportable:

samples:        16 / 16
audio:          132.73 seconds
latency:        138.42 seconds
mean latency:   8.65 seconds
aggregate RTF:  1.04
mean RTF:       1.42
mean WER:       7.50%
errors:         0

The operator also caught a bug that a normal "curl the Service" test would have missed. During a runtime rollout, the Kubernetes Service can still point at the old ready pod while the new image is starting; if the contract smoke job runs too early, it ends up validating yesterday's pod and then cheerfully declaring today's deployment ready (which is the kind of bug that makes a green dashboard mean basically nothing).

So KubeJet now waits for the desired Deployment template to be observed, updated, ready, and available before it creates the smoke Job. After the final rollout, the evidence lined up:
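The gate can be expressed against standard Deployment fields. This is a sketch of the general technique, roughly what a rollout-status check looks at, rather than KubeJet's actual reconciler: it takes a Deployment as a plain dict and checks that the controller has observed the latest generation and that every replica is updated, ready, and available with no old pods left. The function name and the sample objects are made up.

```python
def rollout_complete(deployment: dict) -> bool:
    """True only when the newest template is fully rolled out."""
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)

    return (
        # The controller has seen the spec we just wrote.
        status.get("observedGeneration", 0) >= meta.get("generation", 0)
        # All pods are from the new template, and none are left over.
        and status.get("updatedReplicas", 0) == desired
        and status.get("replicas", 0) == desired
        # Those pods pass readiness and count as available.
        and status.get("readyReplicas", 0) == desired
        and status.get("availableReplicas", 0) == desired
    )


stale = {
    "metadata": {"generation": 3},
    "spec": {"replicas": 1},
    # Old pod still ready, but the new spec has not been observed yet.
    "status": {"observedGeneration": 2, "replicas": 1, "updatedReplicas": 1,
               "readyReplicas": 1, "availableReplicas": 1},
}
done = {
    "metadata": {"generation": 3},
    "spec": {"replicas": 1},
    "status": {"observedGeneration": 3, "replicas": 1, "updatedReplicas": 1,
               "readyReplicas": 1, "availableReplicas": 1},
}
print(rollout_complete(stale), rollout_complete(done))
```

The `stale` case is the trap: every replica count looks healthy, and only the generation comparison reveals that the healthy pod is yesterday's.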

runtime image:  ghcr.io/cezarc1/kubejet-trt-edgellm-asr:0.1.0-cezar-a9da577
ServingReady:   True
MetricsReady:   True
requests_total: 1
errors_total:   0
last latency:   ~18.2s

If I were going to pitch KubeJet now, that'd be the shape of it: give a coding agent a Kubernetes API it already understands, make it wait for the right pod, make it run a real contract request, make it publish metrics, and make rollback a git revert. The Jetson itself is allowed to stay as weird as it needs to be — I just don't really want to have to remember that weirdness every time I deploy something to it.

What This Actually Says

My short version of all this is: yes, the Orin Nano can be a useful speech-to-text edge node, but probably not in the shape a desktop GPU user is going to expect it to be. Pointing a modern Python serving stack at a multimodal 1.7B model and hoping Kubernetes handles the rest just does not work on an 8 GB Jetson — you get a short note from the OOM killer for your trouble. Doing the slow, annoying systems-level work upfront (quantize, build for the actual target, keep the runtime lean, wrap the result in something you can deploy without remembering the ritual) does work, and produces numbers I'd actually be willing to report.

Honestly, the model itself was the easy part of this. What ate most of the time was the surrounding stack:

  • JetPack and BSP compatibility,
  • sm_87 CUDA behavior,
  • host versus container TensorRT,
  • Docker-backed K3s GPU runtime behavior,
  • Kubernetes scheduling and taints,
  • device plugin registration,
  • pod memory limits,
  • SGLang kernel pinning,
  • vLLM multimodal engine memory,
  • INT8 quantization choices,
  • benchmark scoring,
  • operator reconciliation details.

At each of those layers something had its own opinion, and most of those opinions only really showed up after a run had failed in some new way. I don't think that ever fully goes away on a platform like this, which is part of why I now think wrapping it in an operator is the only way I'd actually want to operate the thing day to day.