How swift works

swift replaces the stackmat with a webcam. Here's the actual pipeline, from raw camera frame to a recorded solve time — and what the app does (and doesn't do) with your video.

The pipeline at a glance

  1. Camera capture. The browser opens your webcam via the standard getUserMedia API. The video stream is rendered into a hidden <video> element and never sent over the network.
  2. Hand landmark detection. Each frame is passed to Google's MediaPipe HandLandmarker running in your browser (WASM + WebGL). The model returns up to two hands, each as 21 3D landmark points.
  3. Gesture classification. A small TypeScript classifier reads finger bend angles at the PIP joints and the palm-normal vector from the world-space landmarks to decide if each hand is palms-down, gripping a cube, or neither.
  4. State machine. A 5-state machine (idle → inspecting → ready → solving → stopped) reacts to gesture changes — with a 2-frame debounce on every transition so the timer doesn't twitch on a single bad frame.
  5. Session logging. When a solve ends, the time, scramble, and any +2 / DNF penalty are written to localStorage. Stats (best, ao5, ao12, session mean) are recomputed.
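
As a rough sketch of the stats in step 5: the article does not say how swift computes ao5 and ao12, so the code below assumes the common cubing convention of dropping the single best and single worst of the last n solves and averaging the rest. The `Solve` shape and the function names are illustrative, not swift's actual code.

```typescript
// Hypothetical shape of a logged solve; field names are illustrative.
type Penalty = "none" | "+2" | "DNF";

interface Solve {
  ms: number;       // raw timer reading in milliseconds
  penalty: Penalty;
}

// Effective time: +2 adds two seconds, a DNF ranks worse than any finite time.
function effectiveMs(s: Solve): number {
  if (s.penalty === "DNF") return Infinity;
  return s.penalty === "+2" ? s.ms + 2000 : s.ms;
}

// Trimmed average of the last n solves (assumed convention: drop the best
// and the worst, take the mean of the rest).
// Returns null with fewer than n solves, and Infinity (a DNF average)
// when more than one of the n solves is a DNF.
function averageOf(solves: Solve[], n: number): number | null {
  if (n < 3 || solves.length < n) return null;
  const times = solves
    .slice(-n)
    .map(effectiveMs)
    .sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
  const kept = times.slice(1, -1);
  return kept.reduce((sum, t) => sum + t, 0) / kept.length;
}
```

Calling `averageOf(session, 5)` and `averageOf(session, 12)` after each solve would give ao5 and ao12. A single DNF is absorbed as the dropped worst time, while two or more make the whole average a DNF.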

Why no video leaves the device

There's no upload step. The MediaPipe model is downloaded once from a CDN and cached; inference runs locally on every frame. The only data that crosses the network is the page itself, the model files (once), and anonymous product analytics that capture page interactions, never the camera feed. The <video> element is also explicitly excluded from session replay, so even the region of the UI that shows the video isn't captured visually.

There's a practical reason beyond privacy: streaming raw video would be slow, expensive, and add latency to the gesture pipeline. Running inference client-side is the right architecture for this kind of app.

What about the optional clip recording?

swift has an opt-in toggle that records a short video clip of each solve (ready → solving → stopped) and attaches it to that solve's row in the session table. It exists for cubers who want an easier way to share a solve with a friend, a coach, or a forum thread without screen-recording the whole window.

The clip lives entirely in your browser's IndexedDB, capped at 20 clips or 200 MB (whichever you hit first). Nothing about the clip is uploaded automatically. The only way a clip leaves your device is if you hit "download clip" in the kebab menu and share the resulting file yourself. Clearing your session, deleting the solve, or hitting the "delete clip" action removes the clip immediately. The toggle is off by default every session. Separately, Safari may evict locally stored clips after 7 days of inactivity (a browser limitation, not ours).
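
One way the 20-clip / 200 MB cap could be enforced is sketched below. The article does not say what happens when the cap is hit, so this assumes oldest-first eviction; it also assumes 200 MB means 200 × 1024 × 1024 bytes. `ClipMeta` and `clipsToEvict` are hypothetical names, and the function only decides which clips to drop; the actual IndexedDB deletes are left out.

```typescript
// Hypothetical metadata kept alongside each stored clip.
interface ClipMeta {
  id: string;
  bytes: number;     // size of the recorded blob
  createdAt: number; // epoch milliseconds
}

const MAX_CLIPS = 20;
const MAX_BYTES = 200 * 1024 * 1024; // assumed meaning of "200 MB"

// Returns the ids of clips to delete so the rest fit both caps.
// Assumption: newest clips are kept, and once one clip does not fit,
// every older clip is evicted too.
function clipsToEvict(
  clips: ClipMeta[],
  maxClips: number = MAX_CLIPS,
  maxBytes: number = MAX_BYTES,
): string[] {
  const newestFirst = [...clips].sort((a, b) => b.createdAt - a.createdAt);
  const evict: string[] = [];
  let kept = 0;
  let total = 0;
  let full = false;
  for (const clip of newestFirst) {
    if (!full && kept < maxClips && total + clip.bytes <= maxBytes) {
      kept += 1;
      total += clip.bytes;
    } else {
      full = true;
      evict.push(clip.id);
    }
  }
  return evict;
}
```

Running this after each new recording and deleting the returned ids would keep storage under whichever limit is reached first.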

Why a debounce window matters

Hand-landmark inference is noisy. A single frame can drop one of your hands, mis-classify a curled finger, or briefly mis-orient the palm normal — especially as your hands cross or rotate. Without debouncing, the state machine would flicker between inspecting and ready several times a second during the natural setup before a solve.

swift waits for 2 consecutive frames of the same target gesture before transitioning. At 60 fps that's about 33 ms — invisible to you, but long enough to absorb a single misclassification. The same debounce applies on the way out (palms-down to stop the solve), so a single noisy frame doesn't end a solve early.
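
A minimal sketch of that 2-frame debounce, assuming the classifier emits one gesture label per frame. The `Gesture` labels mirror the three classes described earlier; the `GestureDebouncer` class and its method names are illustrative, not swift's actual code.

```typescript
// Per-frame classifier output; label names are illustrative.
type Gesture = "palms-down" | "gripping" | "none";

// Reports a new gesture only after it has been seen on `frames`
// consecutive frames; until then it keeps returning the previous one.
class GestureDebouncer {
  private stable: Gesture;
  private candidate: Gesture | null = null;
  private streak = 0;
  private readonly frames: number;

  constructor(initial: Gesture, frames: number = 2) {
    this.stable = initial;
    this.frames = frames;
  }

  // Feed one per-frame classification; returns the debounced gesture.
  push(g: Gesture): Gesture {
    if (g === this.stable) {
      // Back on the current gesture: discard any pending candidate.
      this.candidate = null;
      this.streak = 0;
    } else if (g === this.candidate) {
      this.streak += 1;
    } else {
      // A different gesture starts a new streak.
      this.candidate = g;
      this.streak = 1;
    }
    if (this.candidate !== null && this.streak >= this.frames) {
      this.stable = this.candidate;
      this.candidate = null;
      this.streak = 0;
    }
    return this.stable;
  }
}
```

The state machine would react only to changes in the value returned by `push`, so one stray frame between two agreeing frames resets the streak instead of triggering a transition.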

The world-landmark trick for palm orientation

Image-space landmarks shift around when your hand moves across the frame, so the same physical pose can produce different finger-bend readings depending on where your hand sits. MediaPipe also returns world landmarks: metric 3D coordinates whose origin is the hand's geometric center, so they don't change with the hand's position in the frame. swift uses the world-space palm normal to detect "palms-down" (the absolute y-component of the normal, |ny|, above a threshold) and falls back to image-space bend angles when world landmarks are unavailable.
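
The palm-normal check can be sketched as follows. This is an illustration under assumptions, not swift's classifier: it takes the normal as the cross product of the wrist-to-index-MCP and wrist-to-pinky-MCP vectors (landmarks 0, 5 and 17 in MediaPipe's hand model), and the 0.7 threshold is a made-up value, since the article does not state the real one.

```typescript
interface Vec3 { x: number; y: number; z: number; }

// MediaPipe hand landmark indices: wrist, index-finger MCP, pinky MCP.
const WRIST = 0;
const INDEX_MCP = 5;
const PINKY_MCP = 17;

// Assumed cutoff for illustration; the real threshold is not given here.
const PALM_DOWN_THRESHOLD = 0.7;

function sub(a: Vec3, b: Vec3): Vec3 {
  return { x: a.x - b.x, y: a.y - b.y, z: a.z - b.z };
}

function cross(a: Vec3, b: Vec3): Vec3 {
  return {
    x: a.y * b.z - a.z * b.y,
    y: a.z * b.x - a.x * b.z,
    z: a.x * b.y - a.y * b.x,
  };
}

// Unit normal of the plane through wrist, index MCP and pinky MCP,
// or null if the landmarks are missing or degenerate.
function palmNormal(world: Vec3[]): Vec3 | null {
  const w = world[WRIST];
  const i = world[INDEX_MCP];
  const p = world[PINKY_MCP];
  if (!w || !i || !p) return null;
  const n = cross(sub(i, w), sub(p, w));
  const len = Math.hypot(n.x, n.y, n.z);
  if (len === 0) return null;
  return { x: n.x / len, y: n.y / len, z: n.z / len };
}

// The normal's sign flips between left and right hands,
// so only the magnitude |ny| is compared against the threshold.
function isPalmsDown(world: Vec3[], threshold: number = PALM_DOWN_THRESHOLD): boolean {
  const n = palmNormal(world);
  return n !== null && Math.abs(n.y) > threshold;
}
```

A `null` from `palmNormal` is the natural place to switch to the image-space fallback described above.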

What's deliberately not in swift

If you need any of those, csTimer is a much deeper app and is what we'd point you to.