Repository: Diixtra/diixtra-forge
Author: jameskazie
## Symptoms
Two Talos amd64 workers went offline simultaneously at ~21:42 BST on 2026-04-20 and have not come back:
- `k8-worker-1` (10.2.0.41): `Ready: Unknown`, "Kubelet stopped posting node status", no ping replies
- `k8-worker-2` (10.2.0.42): same
Both share schematic `c9078f9419961640c712a8bf2bb9174933dfcf1da383fd8ea2b7dc21493f8bac` (standard amd64 worker, `siderolabs/iscsi-tools`).
User note: the ISO is too large to boot on these two workers.
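To confirm which nodes the control plane has lost contact with, the node list can be filtered on the `Ready` condition. The following is a minimal sketch, assuming the input is the parsed JSON from `kubectl get nodes -o json`. The function name and the sample data (including the healthy node name `k8-cp-1`) are hypothetical and only illustrate the shape.

```python
def unreachable_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated the same as Unknown.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "k8-cp-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "k8-worker-1"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        {"metadata": {"name": "k8-worker-2"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

if __name__ == "__main__":
    print(unreachable_nodes(SAMPLE_NODES))
```

In practice the dictionary would come from `json.loads` on the real `kubectl` output rather than from the sample.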
## Impact (downstream)
- **Forgejo** stuck: HelmRelease has been in a rolling `upgrade` for ~8 h. Old pod `forgejo-755457498-6c9sm` is trapped in `Terminating` on a dead node; the new pod hits `Multi-Attach error` on the iSCSI PVC because the dead node's kubelet cannot confirm the unmount, so the volume is never detached (`ControllerUnpublishVolume` is never issued).
- **`cnpg-forgejo-1`** evicted by TaintManager; it cannot reschedule until the PVC detaches.
- **#1261** is the symptom, not the cause — closing it as a duplicate of this is probably right once this is fixed.
- **#1345** (Forgejo Crossplane provider work) is soft-blocked until Forgejo is serving traffic again.
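The stuck detach described above should be visible as `VolumeAttachment` objects that still point at the dead nodes. The following is a rough sketch, assuming the input is the parsed JSON from `kubectl get volumeattachment -o json`. The function name and the PV names in the sample are hypothetical.

```python
def stale_attachments(attachment_list: dict, dead_nodes: set[str]) -> list[tuple[str, str]]:
    """Return (persistentVolumeName, nodeName) pairs still attached to dead nodes."""
    stale = []
    for item in attachment_list.get("items", []):
        spec = item.get("spec", {})
        node = spec.get("nodeName")
        attached = item.get("status", {}).get("attached", False)
        if attached and node in dead_nodes:
            pv = spec.get("source", {}).get("persistentVolumeName", "<unknown>")
            stale.append((pv, node))
    return stale


# Hypothetical sample shaped like `kubectl get volumeattachment -o json` output.
SAMPLE_ATTACHMENTS = {
    "items": [
        {"spec": {"nodeName": "k8-worker-1",
                  "source": {"persistentVolumeName": "pv-example-a"}},
         "status": {"attached": True}},
        {"spec": {"nodeName": "k8-worker-3",
                  "source": {"persistentVolumeName": "pv-example-b"}},
         "status": {"attached": True}},
    ]
}

if __name__ == "__main__":
    print(stale_attachments(SAMPLE_ATTACHMENTS, {"k8-worker-1", "k8-worker-2"}))
```

An empty result after the nodes return would be one indication that the PVC has detached and the blocked pods can schedule.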
## Root cause (hypothesis, pending confirmation)
The Talos boot image with the current extension set appears to exceed the boot-loader / VM-disk size limit on these two Proxmox VMs. Both went down together, which points to a hypervisor-layer problem rather than a pod-level one. Kubelet didn't just hang: the VMs themselves stopped responding (no ICMP).
Things that can cause "ISO too large":
- Proxmox VM's configured ISO disk is smaller than the factory image we're trying to boot.
- BIOS seabios boot 0x7E00 limit (unlikely on modern Proxmox, but worth a glance).
- UEFI firmware image size limit if the VMs are configured with OVMF.
- iPXE boot: TFTP payload size caps.
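Whichever of these limits applies, the check reduces to comparing the image's byte size against a cap. The following is a small sketch, assuming the image has already been downloaded to a local path and the relevant limit is known in bytes. Both function names are hypothetical.

```python
from pathlib import Path


def image_fits(image_path: str, limit_bytes: int) -> bool:
    """True if the file at image_path is no larger than limit_bytes."""
    return Path(image_path).stat().st_size <= limit_bytes


def overshoot_bytes(image_path: str, limit_bytes: int) -> int:
    """Bytes by which the image exceeds the limit, or 0 if it fits."""
    size = Path(image_path).stat().st_size
    return max(0, size - limit_bytes)
```

The overshoot figure is useful when deciding how much to grow a disk, since it gives the minimum increase needed before any headroom is added.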
## Fix path (ordered)
1. **Verify at the hypervisor**: what does Proxmox say about VM state for these two nodes? Are they "stopped"/"internal-error"/"paused"? If they're running but unreachable, console output should show the boot failure message.
2. **Confirm the ISO size**: download the factory image for schematic `c9078f9419961640c712a8bf2bb9174933dfcf1da383fd8ea2b7dc21493f8bac` at `v1.12.6` from `factory.talos.dev` and check bytes. Compare against the Proxmox VM's configured disk for the OS/ISO.
3. **Pick a path**:
   - (a) Resize the Proxmox VM ISO disk to fit the new image (least intrusive, may require powering down the VM).
   - (b) Trim extensions on the worker schematic. iscsi-tools is the only non-default extension on these workers, so there's not much to cut; if that's the bloater, trimming is not an option and we fall back to (a) or (c).
   - (c) Switch from Factory ISO boot to a smaller disk-image boot flow.
4. **Bring nodes back**: once the images fit, boot the VMs. Kubelet rejoins, PVC detach completes, trapped pods gracefully terminate, scheduler places new pods.
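Step 1 can be sketched as a small triage over the hypervisor's reported state. This assumes the input is the `key: value` text printed by `qm status <vmid> --verbose` on the Proxmox host, and that a `qmpstatus` field, when present, is more specific than `status`. The function names and the wording of the returned hints are hypothetical.

```python
def parse_qm_status(output: str) -> dict[str, str]:
    """Parse 'key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def next_step(fields: dict[str, str]) -> str:
    """Map the reported VM state to the follow-up suggested in the fix path."""
    # Prefer qmpstatus when available (assumption: it carries the finer state).
    state = fields.get("qmpstatus") or fields.get("status", "unknown")
    if state == "running":
        return "running but unreachable: read the console for a boot failure"
    if state in {"stopped", "paused", "internal-error"}:
        return f"{state}: compare image size against the VM disk before starting"
    return f"{state}: unrecognised state, investigate manually"


if __name__ == "__main__":
    sample = "status: running\nqmpstatus: internal-error\n"
    print(next_step(parse_qm_status(sample)))
```

Running this once per affected VM gives a quick record of what the hypervisor reported before anything is restarted.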
## What NOT to do
- Do NOT `kubectl delete pod --force --grace-period=0` the Terminating Forgejo or CNPG pods while the nodes might still be recoverable. Force-deleting tricks the control plane into thinking the pod is gone while the kubelet (on node boot) may resume running it. Dual-mount of an iSCSI RWO volume is a known data-corruption vector.
- Do NOT restore the Forgejo PVC from a ZFS snapshot until the iSCSI mount is definitively released.
## Related
- **#1261** — downstream symptom (Forgejo stuck). Likely close as duplicate after this fix lands.
- **#1192** — ERPNext/CRM 500s (unaffected by this, separate bug).
- **#1345** — soft-blocked until Forgejo is healthy.
- Memory note: "Never reboot/upgrade multiple nodes simultaneously; TrueNAS overload crashes Proxmox". This is relevant precedent for the dual-failure mode.
🤖 Generated with [Claude Code](https://claude.com/claude-code)