Upgrading a 3-Node Proxmox VE Cluster with Ceph to Proxmox VE 9.x
Upgrade your 3-node Proxmox VE cluster with Ceph to 9.x! This guide details the essential steps for a smooth, rolling transition.
Comprehensive operational runbook — focused on Proxmox VE 9.x (current stable: 9.1)
Last verified against: Proxmox VE 9.1 (Debian 13.2 “Trixie”, kernel 6.17.2, Ceph Squid 19.2.3, QEMU 10.1.2, LXC 6.0.5, ZFS 2.3.4). Authoritative sources: the official Proxmox wiki pages Upgrade from 8 to 9 and Ceph Reef to Squid.
Table of Contents
- Scope and Audience
- Background: What Changes in PVE 9.x
- The Mandatory Upgrade Order
- Pre-Upgrade Planning
- Backups — The Only Real Rollback
- Phase 1 — Bring All Nodes to the Latest PVE 8.4
- Phase 2 — Upgrade Ceph from Reef (18.2) to Squid (19.2)
- Phase 3 — Run the pve8to9 Readiness Checker
- Phase 4 — Upgrade Each Node from PVE 8.4 to PVE 9.0
- Phase 5 — Post-Cluster Validation
- Phase 6 — Optional: Move to PVE 9.1
- Known Issues and Their Workarounds
- Troubleshooting Reference
- Rollback Strategy
- Appendix A — Repository Reference (deb822)
- Appendix B — Per-Node Operator Checklist
- Appendix C — Useful One-Liner Reference
1. Scope and Audience
This document describes the end-to-end procedure for performing an in-place, rolling upgrade of a 3-node hyper-converged Proxmox VE cluster running Ceph as the primary storage to Proxmox VE 9.x. It assumes:
- Three physical (or bare-metal-equivalent) nodes joined in a single Proxmox cluster (Corosync quorum = 2 of 3).
- A hyper-converged Ceph deployment: monitors, managers, and OSDs are all on the same nodes that run virtual guests.
- The starting point is Proxmox VE 8.x with Ceph Quincy (17.2.x) or Ceph Reef (18.2.x).
- The goal is the latest stable Proxmox VE 9.x release with Ceph Squid (19.2.x).
- The administrator is comfortable on the Linux command line and with apt.
The procedure is designed to keep workloads running throughout the upgrade by using live migration between nodes. Brief reboots of individual nodes are required, but with proper HA and migration planning the cluster as a whole experiences no service interruption for the guests.
2. Background: What Changes in PVE 9.x
Proxmox VE 9.0 was released on 5 August 2025 (its public beta appeared on 18 July 2025) and is based on Debian 13 “Trixie”. The 9.1 point release followed on 19 November 2025 and is the current recommended target. Notable platform changes:
- Base distribution: Debian 12 (Bookworm) → Debian 13 (Trixie / 13.2).
- Kernel: 6.8/6.11 (PVE 8) → 6.14.8 in PVE 9.0, 6.17.2 in PVE 9.1.
- Ceph: bundles Squid 19.2.3 (Quincy and Reef are unsupported on PVE 9).
- QEMU: 9.x → 10.0.2 (9.0) / 10.1.2 (9.1).
- LXC: → 6.0.5; ZFS: → 2.3.3 / 2.3.4.
- HA: HA groups are deprecated in favor of HA rules (node and resource affinity). Existing groups are auto-migrated after all nodes are on PVE 9.
- Firewall: Proxmox firewall now uses nftables by default (replaces iptables).
- SDN: EVPN improvements, Fabrics (OpenFabric/OSPF) as a managed feature.
- Snapshots on thick LVM as volume chains (relevant for iSCSI/FC SANs — not Ceph, but worth knowing).
- GlusterFS storage support is removed.
- cgroup v1 is removed — containers running systemd ≤ 230 (e.g., CentOS 7, Ubuntu 16.04) will not start.
- /tmp is now a tmpfs by Debian default and is periodically cleaned along with /var/tmp.
- /etc/sysctl.conf is no longer read by systemd-sysctl. Migrate settings to /etc/sysctl.d/<NN>-<name>.conf.
PVE 8.4 receives security and bug fixes until August 2026, giving roughly one year of overlap with PVE 9.
3. The Mandatory Upgrade Order
The single most important rule of this upgrade is the order. Reordering these steps will break the cluster.
┌─────────────────────────────────────────────────────────────────┐
│ 1. All nodes → latest PVE 8.4.x (≥ 8.4.1, ideally newest) │
│ 2. Ceph cluster → Ceph Squid (19.2.x), still on PVE 8.4 │
│ 3. pve8to9 --full clean on every node │
│ 4. Node-by-node: PVE 8.4 → PVE 9.0 (one node at a time) │
│ 5. After all 3 nodes are on PVE 9.0 → optional point upgrade │
│ to PVE 9.1 │
└─────────────────────────────────────────────────────────────────┘
Why Ceph first? PVE 9 ships only the Squid Ceph packages built for Debian Trixie. There is no Ceph Reef package set for Trixie. If you upgrade Proxmox first, the Ceph daemons will be left without compatible packages and the cluster will degrade. Conversely, Ceph Squid is fully supported on PVE 8.4, so upgrading Ceph in advance is safe. It is not reversible, though: Ceph does not support downgrading to an earlier major release, so treat Phase 2 as a one-way step.
Why one node at a time? A 3-node cluster has Corosync quorum of 2. Taking down two nodes simultaneously freezes the cluster. Likewise, a Ceph pool with size=3, min_size=2 only tolerates a single OSD-host outage at a time.
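The quorum arithmetic behind this rule can be checked with a few lines of shell. This is a minimal sketch: the function names quorum_needed and tolerated_outages are illustrative and not part of any Proxmox tool, and the calculation assumes one vote per node.

```shell
# Votes needed for a majority with one vote per node: floor(n/2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Nodes that may be offline while the remaining nodes still hold quorum.
tolerated_outages() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_needed 3       # prints 2
tolerated_outages 3   # prints 1
```

For three nodes the result is one tolerated outage, which is why the whole procedure is serialized per node.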
4. Pre-Upgrade Planning
4.1 Hardware and Console Access
Before touching any package, ensure you have out-of-band console access (IPMI, iLO, iDRAC, or physical KVM) to every node. Major-version upgrades occasionally leave a node unable to come up on the network — for example, due to interface-name changes in the new kernel — and SSH alone is not enough to recover.
If only SSH is available, run the upgrade inside tmux or screen so a dropped session does not interrupt apt dist-upgrade:
apt install tmux
tmux new -s pve-upgrade
# detach with Ctrl-b d, re-attach with: tmux attach -t pve-upgrade
Never run the upgrade from the browser-based “noVNC/xterm.js” console of the node you are upgrading — that session terminates partway through.
4.2 Cluster Health Baseline
A healthy starting cluster is non-negotiable. Confirm all of the following from one of the nodes:
# Proxmox cluster quorum and node states
pvecm status
pvecm nodes
# Per-node Proxmox version (run on each)
pveversion -v
# Ceph cluster health, OSD tree, monitor map
ceph -s
ceph osd tree
ceph mon dump | grep min_mon_release
ceph versions
# Free space on root filesystem (≥ 10 GB strongly recommended)
df -h /
ceph -s must report HEALTH_OK before you start. Address any HEALTH_WARN first — running an upgrade on top of an existing warning is asking for compounded failures.
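If you script the baseline, a small gate function makes the HEALTH_OK requirement explicit. The sketch below is illustrative only: health_gate is a hypothetical helper name, and it expects the first word of ceph health output as its argument.

```shell
# Succeeds only when the status word passed in is exactly HEALTH_OK.
# Intended input: the first word of `ceph health` output.
health_gate() {
    if [ "$1" = "HEALTH_OK" ]; then
        return 0
    fi
    echo "refusing to continue: Ceph reports ${1:-no status}" >&2
    return 1
}

# On a cluster node (not executed here):
#   health_gate "$(ceph health | awk '{print $1}')" || exit 1
```

Treating anything other than HEALTH_OK as a stop keeps an existing warning from being carried into the upgrade unnoticed.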
4.3 Inventory of What Is on Each Node
Document, per node:
- VMs and CTs (IDs, HA state, current node, RAM/CPU footprint)
- Ceph daemons present (mon / mgr / mds / osd IDs)
- Local-only resources that cannot be live-migrated (PCIe passthrough, USB passthrough, local-only storage, raw device mappings)
- Custom changes in /etc that you might be prompted about during dist-upgrade
qm list
pct list
ha-manager status
ceph osd tree | grep -E "host|osd\\."
VMs with PCI passthrough or local LVM/ZFS-only disks must be shut down rather than live-migrated; plan the maintenance window accordingly.
4.4 Compatibility Items to Verify
- Proxmox Backup Server: if you use PBS, check the PVE↔PBS compatibility. PBS 4 is required for full feature parity with PVE 9; PBS 3 still works for backups.
- Third-party backup vendors (Veeam, etc.): verify they support PVE 9 / QEMU 10 before upgrading. Veeam in particular had issues with VMs at QEMU machine version 10.0+; a workaround is to pin affected VMs to 9.2+pve1.
- Third-party storage plugins: any out-of-tree plugin must be rebuilt for PVE 9.
- NVIDIA vGPU: requires GRID 18.3+ (driver 570.158.02+) for the 6.14 kernel of PVE 9.0, and 19.4+ for the 6.17 kernel of PVE 9.1.
- FreeBSD-based guests (pfSense, OPNsense, TrueNAS Core): no functional impact, but the GUI may show inflated memory percentages — it is cosmetic.
- CentOS 7 / Ubuntu 16.04 containers will not start on PVE 9 because cgroup v1 is removed. Migrate them before upgrading the host.
5. Backups — The Only Real Rollback
There is no in-place downgrade from PVE 9 to PVE 8. Your rollback path is restoring from backup.
5.1 Backup the Configuration
On each node, capture a tarball of /etc and the key config directories:
NODE=$(hostname)
mkdir -p /root/preupgrade
tar czf /root/preupgrade/${NODE}-etc-$(date +%F).tgz \
/etc /var/lib/pve-cluster /var/lib/ceph/ 2>/dev/null
The Proxmox cluster filesystem (/etc/pve) is shared via Corosync; backing it up from any one node is sufficient, but doing it on each is a cheap insurance policy.
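Because the tar command above discards error output, it is worth confirming that the archive is readable and actually contains the paths you care about. A minimal sketch, assuming GNU tar (which stores member names without the leading slash); archive_has is a hypothetical helper name:

```shell
# Succeeds if the archive is a readable gzip tarball containing at least
# one entry whose name starts with the given prefix (no leading slash).
# The prefix is used as a regular expression, so keep it to plain paths.
archive_has() {
    tar tzf "$1" 2>/dev/null | grep -q "^$2"
}

# On a node (not executed here):
#   archive_has "/root/preupgrade/$(hostname)-etc-$(date +%F).tgz" etc/pve/ \
#       || echo "archive is missing /etc/pve"
```

A missing or corrupt archive produces no listing, so the check fails in that case as well.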
5.2 Backup All Guests
If you have Proxmox Backup Server, run a full backup of every VM and CT:
vzdump --all 1 --compress zstd --storage <pbs-storage-id>
For environments without PBS, write to a local or NFS storage — but be aware that local backups stored on the same Ceph pool you are about to upgrade are not really off-site. Ideally, push the backups to storage that is independent of the cluster.
5.3 Filesystem-Level Snapshots
If your root filesystem is on ZFS, take snapshots of rpool/ROOT/pve-1 (or equivalent) on every node before starting:
zfs snapshot rpool/ROOT/pve-1@preupgrade-pve9
zfs list -t snapshot | grep preupgrade
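A stock Proxmox boot menu does not list ZFS snapshots as boot entries, so recovering from an unbootable upgrade normally means booting a live or rescue environment, importing the pool, and rolling the root dataset back with zfs rollback.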
For LVM-thin root, equivalent snapshots are possible but must be sized carefully and are not as automatic.
5.4 Verify Backups Before You Proceed
A backup you have not tested is a wish, not a backup. Restore at least one VM to a scratch ID on a non-production storage and confirm it boots, before you trust the entire cluster’s safety to your backups:
qmrestore /var/lib/vz/dump/<dumpfile>.vma.zst 9999 --storage local-zfs
qm start 9999
qm stop 9999 && qm destroy 9999
6. Phase 1 — Bring All Nodes to the Latest PVE 8.4
Proxmox VE 8.4.1 or newer is the minimum starting point. Older 8.x releases lack the pve8to9 checker and the new repository hooks. On every node, in sequence:
apt update
apt dist-upgrade -y
pveversion
pveversion must report 8.4.x with x ≥ 1. If an updated kernel was installed, reboot the node. Coordinate the reboots so that only one node is offline at a time:
- On each node, in turn, place it in maintenance and reboot.
- Wait for pvecm status to show the node online again before moving on.
- Wait for ceph -s to return to HEALTH_OK before rebooting the next node.
Maintenance-mode reboot pattern (used throughout this document)
# 1. Enter maintenance mode: HA-managed guests are migrated away; migrate or shut down any non-HA guests yourself
ha-manager crm-command node-maintenance enable <node>
# 2. Stop Ceph from rebalancing data while the node is briefly offline
ceph osd set noout
# 3. Reboot
reboot
# 4. After reboot, wait until all OSDs are up and PGs are active+clean (health stays HEALTH_WARN while noout is set)
watch -n 5 'ceph -s; echo; pvecm status'
# 5. Take the node out of maintenance and unset noout
ha-manager crm-command node-maintenance disable <node>
ceph osd unset noout
Repeat for the other two nodes. At the end of Phase 1, all three nodes should be on the same PVE 8.4.x version with a healthy Corosync quorum and a HEALTH_OK Ceph cluster.
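The waiting steps in this pattern can be scripted instead of watched by eye. The following is a sketch of a generic polling helper; wait_until is a hypothetical name, and the timeout and interval values in the usage comment are arbitrary examples.

```shell
# Poll a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$(( elapsed + interval_s ))
    done
}

# On a node, after noout has been unset (not executed here):
#   wait_until 900 5 sh -c 'ceph health | grep -q "^HEALTH_OK"' \
#       || echo "Ceph did not return to HEALTH_OK within 15 minutes"
```

A bounded wait is deliberate: if the cluster does not recover in time, the script stops and a human decides what to do next.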
7. Phase 2 — Upgrade Ceph from Reef (18.2) to Squid (19.2)
This phase upgrades Ceph in place on PVE 8.4. PVE itself is not yet touched.
If your starting Ceph version is Quincy (17.2.x), run the Ceph Quincy → Reef upgrade first. Skipping Reef and going straight from Quincy to Squid is technically supported but not recommended — the procedure below assumes you are at Reef.
The official guide is the Ceph Reef to Squid wiki page. The summary below mirrors it with the cluster-specific commentary for a 3-node setup.
7.1 Verify the Pre-Upgrade Ceph State
ceph -s
ceph versions
ceph osd tree
ceph fs ls # only if you use CephFS
You should see all daemons reporting Reef (18.2.x) and HEALTH_OK.
7.2 Switch the Ceph APT Repository on Every Node
Replace reef with squid in the Ceph repository list on each of the three nodes:
sed -i 's/reef/squid/' /etc/apt/sources.list.d/ceph.list
cat /etc/apt/sources.list.d/ceph.list
You should now see one of these (depending on subscription):
deb https://enterprise.proxmox.com/debian/ceph-squid bookworm enterprise
# or
deb http://download.proxmox.com/debian/ceph-squid bookworm no-subscription
Note that we are still on Bookworm at this stage — that is correct. The Trixie repository line will be set later, in Phase 4.
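To see what the substitution does before editing the real file, you can apply it to a sample line. This is only an illustration; preview_repo_switch is a hypothetical helper name.

```shell
# Apply the same substitution to a sample line instead of the real file.
preview_repo_switch() {
    printf '%s\n' "$1" | sed 's/reef/squid/'
}

preview_repo_switch 'deb http://download.proxmox.com/debian/ceph-reef bookworm no-subscription'
# prints: deb http://download.proxmox.com/debian/ceph-squid bookworm no-subscription

# Dry run against the real file (prints only the lines that would change):
#   sed -n 's/reef/squid/p' /etc/apt/sources.list.d/ceph.list
```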
7.3 Set the noout Flag
ceph osd set noout
This prevents Ceph from rebalancing while OSDs restart. It is set once, cluster-wide; you do not run it per node.
7.4 Install the Squid Packages on All Three Nodes
On every node, in any order:
apt update
apt full-upgrade -y
After the package upgrade, Ceph daemons are still running the old Reef binaries — packages are upgraded, but daemons are not restarted automatically.
7.5 Restart Monitor Daemons (One Node at a Time)
The 3-node cluster has 3 monitors. Restart them sequentially, waiting for quorum to reform between each:
# On node 1
systemctl restart ceph-mon.target
ceph -s # wait for HEALTH_OK / HEALTH_WARN(noout) and 3-of-3 quorum
# On node 2
systemctl restart ceph-mon.target
ceph -s
# On node 3
systemctl restart ceph-mon.target
ceph -s
Once all three monitors are running Squid, verify the monmap:
ceph mon dump | grep min_mon_release
# Expected: min_mon_release 19 (squid)
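For a scripted check, the release number can be pulled out of that line and compared numerically. A minimal sketch; mon_release is a hypothetical helper that reads ceph mon dump output on standard input and assumes the line format shown above.

```shell
# Extract the numeric release from a line such as "min_mon_release 19 (squid)".
mon_release() {
    awk '$1 == "min_mon_release" { print $2 }'
}

# On a node (not executed here):
#   [ "$(ceph mon dump 2>/dev/null | mon_release)" -ge 19 ] \
#       || echo "monitors are not yet on Squid"
```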
7.6 Restart Manager Daemons
# Run on each node where a mgr daemon exists
systemctl restart ceph-mgr.target
ceph -s # confirm one mgr active, others as standby
7.7 Restart OSD Daemons (One Node at a Time)
This is the longest step and the most sensitive in a 3-node cluster.
# Node 1
systemctl restart ceph-osd.target
# Wait for all OSDs back up and PGs active+clean
watch -n 5 'ceph -s'
# Only when HEALTH_OK (or HEALTH_WARN noout), continue:
# Node 2
systemctl restart ceph-osd.target
# Wait...
# Node 3
systemctl restart ceph-osd.target
# Wait...
If your cluster has many OSDs per node, restarting ceph-osd.target will bounce all of them at once. With noout set this is safe, but the cluster will go briefly into degraded state until the OSDs come back. Watch placement-group recovery in ceph -s — do not move to the next node until PGs are clean.
After all OSDs are restarted, you may see this warning:
HEALTH_WARN: all OSDs are running squid or later but require_osd_release < squid
This is expected and is cleared in the next step.
7.8 Promote require-osd-release to Squid
Only after every OSD reports a Squid version (ceph versions):
ceph osd require-osd-release squid
This activates Squid-only on-disk features and clears the warning above.
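Since this step should only run once no Reef daemon is left, a guard on the ceph versions output is a cheap safety net. The sketch below is stricter than the text requires, because it checks every daemon type rather than only OSDs. all_squid is a hypothetical helper, and it assumes the version strings have the form "ceph version 19.2.x (...)".

```shell
# Reads `ceph versions` output on stdin.
# Succeeds only if at least one version string is present and all are 19.2.x.
all_squid() {
    found=$(grep -o 'ceph version [0-9.]*') || return 1
    if printf '%s\n' "$found" | grep -qv '^ceph version 19\.2\.'; then
        return 1
    fi
    return 0
}

# On a node (not executed here):
#   ceph versions | all_squid && ceph osd require-osd-release squid
```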
7.9 Upgrade CephFS MDS Daemons (Skip if You Don’t Use CephFS)
For each filesystem listed by ceph fs ls:
FS=<your-fs-name>
# 1. Save current settings, then disable standby_replay
ceph fs get $FS | grep -o allow_standby_replay
ceph fs set $FS allow_standby_replay false
# 2. Reduce ranks to 1 (note original max_mds first)
ceph fs get $FS | grep max_mds
ceph fs set $FS max_mds 1
# 3. Wait until only one MDS is active per FS
watch -n 5 'ceph status'
# 4. Stop standby MDS daemons (do this on hosts running standby MDS)
systemctl stop ceph-mds.target
# 5. Confirm only one MDS is rank 0
ceph status
# 6. Restart the remaining MDS
systemctl restart ceph-mds.target
# 7. Restart the previously stopped standby MDS daemons
systemctl start ceph-mds.target
# 8. Restore original max_mds and allow_standby_replay
ceph fs set $FS max_mds <original_max_mds>
ceph fs set $FS allow_standby_replay <original_value>
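Steps 1, 2 and 8 depend on remembering the original values. One way to capture them is sketched below. The helper names are hypothetical, and the sketch assumes ceph fs get prints a max_mds line with the value as its second field and lists allow_standby_replay among the flags only when it is enabled.

```shell
# Both helpers read `ceph fs get <fs>` output on stdin.

# Print the current max_mds value.
fs_max_mds() {
    awk '$1 == "max_mds" { print $2 }'
}

# Print "true" if allow_standby_replay appears in the output, else "false".
fs_standby_replay() {
    if grep -q 'allow_standby_replay'; then
        echo true
    else
        echo false
    fi
}

# On a node, before step 1 (not executed here):
#   ORIG_MAX_MDS=$(ceph fs get "$FS" | fs_max_mds)
#   ORIG_STANDBY_REPLAY=$(ceph fs get "$FS" | fs_standby_replay)
```

The saved values can then be passed back to the two ceph fs set commands in step 8.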
7.10 Unset noout and Confirm Clean
ceph osd unset noout
ceph -s # must be HEALTH_OK
ceph versions # all daemons should report 19.2.x
At this point Ceph is on Squid and the cluster is fully healthy. Stop and verify before continuing. Do not begin Phase 3 if Ceph is anything other than HEALTH_OK.
8. Phase 3 — Run the pve8to9 Readiness Checker
Proxmox ships a built-in checklist program that scans for known upgrade blockers. It only reports; it does not change anything.
On each of the three nodes:
pve8to9 --full
Read the entire output. Categories include repositories, storage, network, guests, certificates, kernel/bootloader, Ceph, and HA. Common items it flags and how to address them:
- proxmox-ve package is too old → finish Phase 1; you are not on the latest 8.4.
- systemd-boot meta-package should be removed → apt remove systemd-boot (only if systemd-boot-efi and systemd-boot-tools remain installed; the checker tells you).
- LVM/LVM-thin storage has guest volumes with autoactivation enabled → on shared LVM (iSCSI/FC) this is important. Run the migration script if suggested:
/usr/share/pve-manager/migrations/pve-lvm-disable-autoactivation
For local LVM-thin only (no shared LVM), this is optional.
- Running guests detected → not an error; just a reminder that you will migrate or shut them down before each per-node reboot.
- Old Ceph version → return to Phase 2; you skipped a step.
- Bookworm-only repositories present → expected at this stage; will be addressed in Phase 4.
Re-run pve8to9 --full after each fix until the FAIL lines are gone. WARN lines that you understand and accept (such as “running guests”) may be left as-is.
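When repeating the checker across three nodes, counting the blocking results saves rereading the whole report each time. This sketch assumes that blocking results are printed on lines beginning with FAIL; checker_failures is a hypothetical helper name.

```shell
# Reads checker output on stdin and prints the number of lines starting with FAIL.
checker_failures() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^FAIL' || true
}

# On a node (not executed here):
#   [ "$(pve8to9 --full 2>&1 | checker_failures)" -eq 0 ] || echo "blockers remain"
```

Still read the WARN lines yourself; the count only tells you whether hard blockers are left.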
9. Phase 4 — Upgrade Each Node from PVE 8.4 to PVE 9.0
This is the per-node rolling upgrade. It must be done one node at a time, fully completing all steps on a node before starting the next.
Assume below that we are upgrading pve01 first, then pve02, then pve03.
9.1 Drain the Node
Migrate every running guest off the node. Live migration from PVE 8 → PVE 9 is supported (the reverse is not generally supported).
# List what is running on this node
qm list
pct list
# Migrate VMs (online)
qm migrate <vmid> pve02 --online
# Migrate CTs (containers usually require restart-migration)
pct migrate <ctid> pve02 --restart
For VMs with PCIe/USB passthrough, you must shut them down and start them on another node manually (or accept that they will be off for the duration of the node’s upgrade).
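With more than a handful of guests, the per-VM migrate commands are easier to drive from the qm list output. A minimal sketch, assuming the usual column layout (VMID, NAME, STATUS, ...) with one header line; running_vmids is a hypothetical helper and pve02 is the example target used above.

```shell
# Reads `qm list` output on stdin and prints the VMID of every running VM.
running_vmids() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

# On the node being drained (not executed here):
#   qm list | running_vmids | while read -r vmid; do
#       qm migrate "$vmid" pve02 --online
#   done
```

Passthrough VMs will fail the online migration and still need the manual handling described above.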
9.2 Enter Maintenance Mode and Set noout
ha-manager crm-command node-maintenance enable pve01
ceph osd set noout
9.3 Update the Debian Base Repositories to Trixie
Edit the repository files to switch from Bookworm to Trixie:
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null
Inspect /etc/apt/sources.list and every file in /etc/apt/sources.list.d/:
grep -r '' /etc/apt/sources.list /etc/apt/sources.list.d/
Comment out (#) any line still referencing bookworm for which no Trixie equivalent exists. Remove any backports line — the upgrade is not tested with backports installed.
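Rather than reading every file, you can list only the active lines that still mention bookworm. A sketch with a hypothetical helper name, leftover_bookworm; it treats any line with a # before the word as commented out.

```shell
# Print active (uncommented) lines that still mention bookworm.
# Accepts files and directories; exits non-zero when nothing is found.
leftover_bookworm() {
    grep -rnE '^[^#]*bookworm' "$@"
}

# On the node (not executed here):
#   leftover_bookworm /etc/apt/sources.list /etc/apt/sources.list.d/
```

An empty result with a non-zero exit status is the desired outcome here. At this point the Ceph repository file is still expected to show up, since it is switched in a later step.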
9.4 Add the PVE 9 Repository (deb822 Style)
PVE 9 prefers the new deb822 source format. For the enterprise repository:
cat > /etc/apt/sources.list.d/pve-enterprise.sources << 'EOF'
Types: deb
URIs: https://enterprise.proxmox.com/debian/pve
Suites: trixie
Components: pve-enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF
For the no-subscription repository:
cat > /etc/apt/sources.list.d/proxmox.sources << 'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF
After adding the new file, verify and remove the old .list file:
apt update
apt policy
# If the new repo is correctly listed:
rm -f /etc/apt/sources.list.d/pve-enterprise.list
rm -f /etc/apt/sources.list.d/pve-install-repo.list
apt update && apt policy
9.5 Update the Ceph Repository to Trixie
Replace the existing ceph.list with a deb822 ceph.sources file pointing to the Trixie Ceph-Squid repo. Enterprise:
cat > /etc/apt/sources.list.d/ceph.sources << 'EOF'
Types: deb
URIs: https://enterprise.proxmox.com/debian/ceph-squid
Suites: trixie
Components: enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF
No-subscription:
cat > /etc/apt/sources.list.d/ceph.sources << 'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF
Then:
apt update
apt policy
rm -f /etc/apt/sources.list.d/ceph.list
apt update
If apt update returns 401 Unauthorized against the enterprise repo, refresh the subscription token:
pvesubscription update --force
9.6 Optional: Quiet the Audit Log During the Upgrade
A Debian Trixie default change re-enables kernel audit messages, which can flood the journal during dist-upgrade. To suppress them:
systemctl disable --now systemd-journald-audit.socket
9.7 Run the Distribution Upgrade
This is the big step. On the node, with a stable session (tmux or console):
apt update
apt dist-upgrade
You will be prompted about a number of configuration files. Recommended responses:
- /etc/issue → keep yours (default “No”). It is regenerated.
- /etc/lvm/lvm.conf → install the maintainer’s version unless you have local edits.
- /etc/ssh/sshd_config → install the maintainer’s version unless you have local edits (the change replaces the deprecated ChallengeResponseAuthentication directive with KbdInteractiveAuthentication).
- /etc/default/grub → keep yours (default “No”); only diff non-comment lines and re-apply by hand if needed.
- /etc/chrony/chrony.conf → install the maintainer’s version. Move local sources to /etc/chrony/sources.d/.
If you see apt-listchanges, press q to exit. For service-restart prompts, use the default — the reboot afterwards restarts everything cleanly anyway.
This step typically takes 5–15 minutes on SSD-backed nodes and considerably longer on rotational disks.
9.8 Address Boot-Loader Items Before the Reboot
If your node boots UEFI from LVM, install the fixed GRUB metapackage to avoid the disk lvmid/... not found boot bug:
[ -d /sys/firmware/efi ] && apt install grub-efi-amd64
If pve8to9 previously suggested removing the systemd-boot meta-package and you forgot, do it now (only if systemd-boot-efi and systemd-boot-tools remain installed):
apt remove systemd-boot
For ZFS-on-root systems using proxmox-boot-tool:
proxmox-boot-tool refresh
9.9 Re-run the Checker, Then Reboot
pve8to9 --full
reboot
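Reboot is mandatory, even if you were already running an opt-in 6.14 kernel under PVE 8: the PVE 9 kernel packages are rebuilt with the newer compiler and ABI versions, and only a reboot guarantees that the running kernel matches the upgraded userland.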
9.10 After Reboot — Validate the Single Node
When the node is back up:
pveversion # expect 9.x
uname -r # expect 6.14.x (PVE 9.0) or 6.17.x (PVE 9.1)
ceph -s # HEALTH_OK or HEALTH_WARN noout
pvecm status # all 3 nodes shown, this one online
systemctl --failed # should be empty
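The kernel expectation from the checks above can be expressed as a simple pattern match if you want it in a script. kernel_series_ok is a hypothetical helper; the two accepted series are the ones named in this section.

```shell
# Succeeds if the given kernel release belongs to the 6.14 or 6.17 series.
kernel_series_ok() {
    case "$1" in
        6.14.*|6.17.*) return 0 ;;
        *) return 1 ;;
    esac
}

# On the node (not executed here):
#   kernel_series_ok "$(uname -r)" || echo "unexpected kernel: $(uname -r)"
```

A node that came back on a 6.8 kernel most likely booted an old entry and needs its bootloader configuration checked.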
Then exit maintenance:
ha-manager crm-command node-maintenance disable pve01
ceph osd unset noout
Only unset noout once this node’s OSDs are all back up. If you have more nodes left to upgrade, you may prefer to leave noout set across the entire cluster upgrade to avoid backfill churn between nodes: set it once before the first node, unset it after the last.
9.11 Migrate Some Guests Back, Then Move to the Next Node
qm migrate <vmid> pve01 --online
You can rebalance guests onto the freshly-upgraded node. Do not start Phase 4 on the next node until this node has been fully validated and is healthy in the cluster. Upgrading two nodes in parallel risks losing Corosync quorum and freezing the cluster.
Repeat sections 9.1 through 9.11 for pve02, then pve03.
10. Phase 5 — Post-Cluster Validation
Once all three nodes report PVE 9.x, perform a full cluster validation.
10.1 Versions and Quorum
# On any node:
pvecm status
pvecm nodes
for n in pve01 pve02 pve03; do
ssh $n "hostname; pveversion -v | head -3"
done
All nodes must report the same pve-manager major.minor.
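The comparison can be automated by reducing each node's pveversion line to its major.minor part. A sketch under the assumption that the line has the form pve-manager/<version>/<hash> (...); both helper names are hypothetical.

```shell
# Reduce a line such as "pve-manager/9.1.1/<hash> (...)" to "9.1".
pve_major_minor() {
    printf '%s\n' "$1" | cut -d/ -f2 | cut -d. -f1,2
}

# Succeeds when every argument reduces to the same major.minor.
same_major_minor() {
    first=$(pve_major_minor "$1")
    for line in "$@"; do
        [ "$(pve_major_minor "$line")" = "$first" ] || return 1
    done
}

# From any node (not executed here):
#   same_major_minor "$(ssh pve01 pveversion)" "$(ssh pve02 pveversion)" \
#       "$(ssh pve03 pveversion)" || echo "nodes are on different releases"
```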
10.2 Ceph
ceph -s # HEALTH_OK
ceph versions # every daemon at 19.2.x
ceph osd tree # all OSDs up + in
ceph mon dump | grep min_mon_release # squid
10.3 HA Groups Auto-Migrated to Rules
After all nodes are on PVE 9, the HA manager automatically converts legacy HA groups into HA rules. Verify:
ha-manager status
ha-manager rules list
journalctl -eu pve-ha-crm | tail -50
If you see errors, the active CRM node’s log will show them.
10.4 Guests
Check that every VM and CT can be:
- Started and stopped
- Live-migrated between any pair of nodes
- Snapshotted
- Backed up
Pay special attention to:
- FreeBSD-based guests: the reported VM memory percentage may appear inflated. This is a host-side accounting change in PVE 9 and is cosmetic.
- VMs using QEMU machine version 10.0+: confirm your backup tool supports them. Veeam users may need to pin affected machines to 9.2+pve1.
- VMs with PCI passthrough: a kernel 6.14 regression has been reported by some users; if a passthrough VM fails to start, pin the older kernel as a workaround (see §12.7).
10.5 Network and Firewall
ip -br link
brctl show 2>/dev/null || bridge link
pve-firewall status
nft list ruleset | head -50
PVE 9 migrates the firewall from iptables to nftables. The configuration in /etc/pve/firewall/ is unchanged; the underlying enforcement engine is.
10.6 Browser-Side Cleanup
After upgrading the GUI nodes, force-reload the browser to flush cached JavaScript:
- Linux/Windows: Ctrl + Shift + R
- macOS: ⌘ + Alt + R
10.7 Optionally Modernize APT Sources
PVE 9 ships an apt modernize-sources helper that converts any remaining .list files to deb822 .sources:
apt modernize-sources # answer 'n' to preview, then re-run with 'Y'
The original files are kept with a .list.bak suffix and can be removed once you have validated the new layout.
11. Phase 6 — Optional: Move to PVE 9.1
PVE 9.1 (released November 2025) is a refinement release on top of 9.0:
- Linux kernel 6.17.2 (newer hardware support, may affect some Dell PowerEdge servers — see §12.5)
- QEMU 10.1.2, LXC 6.0.5, ZFS 2.3.4
- OCI/Docker images as LXC application containers (technology preview)
- vTPM state in qcow2 (full snapshots for Windows VMs with vTPM)
- SDN GUI: Fabrics, EVPN learned IPs/MACs in resource tree
- Many bug fixes
The upgrade from 9.0 to 9.1 is a minor upgrade — no repository changes, no per-node rituals beyond the standard maintenance/reboot pattern:
# On each node, one at a time:
ha-manager crm-command node-maintenance enable <node>
ceph osd set noout
apt update && apt dist-upgrade -y
reboot
# After reboot:
ceph osd unset noout
ha-manager crm-command node-maintenance disable <node>
Wait for ceph -s to be HEALTH_OK and pvecm status to show the node online before moving to the next one.
12. Known Issues and Their Workarounds
The following list compiles the issues most likely to bite you in a hyper-converged Ceph cluster. The full list is on the Roadmap wiki page.
12.1 Network Interface Renaming
The 6.14/6.17 kernel can rename network interfaces because of changes in how PCIe addresses or VFs are detected. If your /etc/network/interfaces references eno1, enp3s0, etc., these names may be different after the upgrade — and the host may come up without networking.
Mitigation: use the new helper to pin all interfaces to stable nicX names before rebooting:
pve-network-interface-pinning generate
(Always have IPMI/console access available as a fallback.)
12.2 Ceph Full-Mesh Networks Failing to Boot
Earlier versions of the Full Mesh Network for Ceph Server guide configured frr like this:
post-up /usr/bin/systemctl restart frr.service
Under PVE 9, frr now depends on networking.service, which deadlocks the boot. Change it to:
post-up /usr/bin/systemctl is-active --quiet frr.service && /usr/bin/systemctl restart frr.service || true
If you only realize this after a node won’t boot, use the Rescue Boot option from the PVE installation ISO (Advanced menu) and edit /etc/network/interfaces.
12.3 GRUB Failure on UEFI + LVM
A GRUB bug in PVE 8 may leave UEFI-booting LVM-root systems unbootable with disk 'lvmid/...' not found. Install the fixed metapackage before the upgrade reboot:
[ -d /sys/firmware/efi ] && apt install grub-efi-amd64
ZFS-on-root and legacy-BIOS systems are not affected.
12.4 systemd-boot Meta-Package Misconfigures the Bootloader
If systemd-boot is installed as a meta-package (it was on PVE 8.1–8.4 ISO installs), it now installs hooks that change the bootloader on package upgrades. Proxmox manages booting via proxmox-boot-tool, so this is harmful. Remove the meta-package when pve8to9 says so:
apt remove systemd-boot
12.5 Kernel 6.17 on Some Dell PowerEdge Servers
Some users report machine-check exceptions or boot failures on certain Dell PowerEdge models with kernel 6.17 (PVE 9.1). Reported workarounds: enable SR-IOV Global and I/OAT DMA in BIOS, or pin the 6.14 kernel:
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.14.8-2-pve
proxmox-boot-tool refresh
12.6 Veeam Backup with QEMU 10+
Veeam Backup & Replication had failures with VMs at QEMU machine version 10.0+ (the new default in PVE 9). Either pin the VM machine version to 9.2+pve1:
# in /etc/pve/qemu-server/<vmid>.conf
machine: pc-q35-9.2+pve1
… or wait for a Veeam patch before upgrading the affected VMs’ machine versions.
12.7 PCI Passthrough Sometimes Broken on Kernel 6.14
A subset of users have reported VMs with PCI passthrough failing to start on kernel 6.14. Workaround: pin an older kernel (the 6.8 LTS series shipped with PVE 8 is not available on PVE 9; pin 6.14 minor variants if multiple are present, or accept downtime until a future kernel update fixes it).
12.8 cgroup v1 Removed — Old Containers Will Not Start
LXC containers running systemd ≤ 230 (CentOS 7, Ubuntu 16.04, older Debian) will fail to boot on PVE 9. Either upgrade the container’s OS or migrate the workload off LXC before upgrading the host.
12.9 /etc/sysctl.conf Is No Longer Honored
Move every entry from /etc/sysctl.conf to a numbered file in /etc/sysctl.d/, e.g. /etc/sysctl.d/90-local.conf. Common things to migrate:
- net.ipv4.ip_forward
- net.ipv6.conf.all.forwarding
- net.ipv4.conf.all.rp_filter (matters for EVPN exit nodes)
After moving, apply with sysctl --system.
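Extracting only the active settings keeps old comments and blank lines out of the new file. A minimal sketch; active_sysctl_lines is a hypothetical helper, and 90-local.conf is the example file name used above.

```shell
# Print only the active settings from a sysctl-style file:
# drops blank lines and comment lines starting with '#' or ';'.
active_sysctl_lines() {
    grep -Ev '^[[:space:]]*([#;]|$)' "$1"
}

# On the node (not executed here):
#   active_sysctl_lines /etc/sysctl.conf > /etc/sysctl.d/90-local.conf
#   sysctl --system
```

Review the generated file before applying it; settings that were only needed on PVE 8 are better dropped than carried over.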
12.10 /tmp is Now a tmpfs
Debian 13 mounts /tmp as a tmpfs (up to 50 % of RAM) and periodically cleans /tmp and /var/tmp. If any application or backup script relies on long-lived files in /tmp, move them. Most Proxmox-native processes are unaffected.
12.11 GlusterFS Storage Removed
If you have any GlusterFS storage definitions, remove them from /etc/pve/storage.cfg or convert them to Directory storage with a manual mount. The upgrade will warn but continue.
12.12 Custom OpenFabric / OSPF FRR Configurations
Custom FRR daemons in /etc/frr/frr.conf.local are now disabled the next time SDN config is applied. To keep your custom config working independently of SDN, create /etc/default/frr with ospfd=yes (or fabricd=yes).
13. Troubleshooting Reference
13.1 apt dist-upgrade Wants to Remove proxmox-ve
This means a Bookworm-only repository is still active, leaving some packages without a Trixie counterpart. Re-check every file in /etc/apt/sources.list.d/:
grep -r '' /etc/apt/sources.list /etc/apt/sources.list.d/
Correct the offending repository, run apt update, and try again. If a package truly has no Trixie version (a third-party plugin, for example), uninstall it before continuing:
apt purge <package>
apt -f install
apt dist-upgrade
13.2 apt update Returns 401 Unauthorized on the Ceph Enterprise Repo
pvesubscription update --force
apt update
If still 401, confirm pve-manager ≥ 8.2.8 (PVE 8 path) or 9.0.x (PVE 9 path) and that your subscription covers the Ceph add-on.
13.3 Ceph Not Going Back to HEALTH_OK After OSD Restart
ceph -s
ceph health detail
ceph osd tree
ceph osd df tree
Common causes: an OSD that did not start (check journalctl -u ceph-osd@<id>), a disk that failed during restart, or PGs stuck peering because a monitor is missing. Resolve before continuing.
13.4 Node Won’t Boot After Upgrade
Boot from the PVE 9 ISO → Advanced → Rescue Boot. This mounts your existing root filesystem and gives you a shell. From there:
# Common fixes:
apt -f install # finish a partial dist-upgrade
update-grub # rebuild GRUB config
proxmox-boot-tool refresh # rebuild systemd-boot/EFI entries
nano /etc/network/interfaces # fix renamed NICs
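For ZFS-root systems, the preupgrade-pve9 snapshot you took in §5.3 is the fallback. A stock Proxmox boot menu does not offer snapshots as boot entries, so boot a live or rescue environment that imports the pool without running from it, then roll back with zfs rollback rpool/ROOT/pve-1@preupgrade-pve9.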
13.5 Cluster Loses Quorum During Upgrade
If you accidentally took two nodes offline at once, the cluster will be read-only (pvecm status shows Activity blocked). Restore quorum by bringing one of the offline nodes back online. Never force quorum on a 3-node cluster as a casual workaround — it can cause split-brain in /etc/pve.
For a true emergency on a single surviving node:
pvecm expected 1 # only as last resort, with the other two truly dead
13.6 LVM Thin Pool Needs Repair After Upgrade
Some systems show:
Check of pool pve/data failed (status:64). Manual repair required!
Repair with:
lvconvert --repair pve/data
13.7 HA Rules Errors After Upgrade
If HA groups did not convert cleanly:
journalctl -eu pve-ha-crm | tail -100
ha-manager rules list
Common fix: edit a rule via the GUI (which forces a re-validation), or recreate it.
14. Rollback Strategy
There is no in-place downgrade path from PVE 9 back to PVE 8. If the upgrade has gone wrong on one node, your options are:
- Filesystem-level rollback (if you took ZFS snapshots in §5.3). Roll the root dataset back to the pre-upgrade snapshot from a rescue or live environment with ZFS support, then reboot; the node returns to its PVE 8.4 state. The cluster will then be running mixed versions, which is supported only temporarily, so fix the underlying problem and retry the upgrade for that node.
- Reinstall the node from the PVE 8.4 ISO and rejoin the cluster. This is invasive but recovers a node fully. The remaining two PVE 8.4 nodes (or two PVE 9 nodes) accept the rejoin. Ceph OSDs on local disks can usually be re-attached without re-replicating data: run ceph-volume lvm activate --all after the reinstall.
- Restore VMs to a fresh PVE 8.4 cluster. This is the worst-case fallback and depends entirely on the backups you took in §5.
The single best thing you can do to make rollback unnecessary is to never start the next node until the current one is fully healthy. A failed upgrade on one node out of three is recoverable. A failed upgrade on two nodes out of three is a long night.
Appendix A — Repository Reference (deb822)
All entries below target Debian Trixie (PVE 9). The legacy .list format still works, but Proxmox recommends migrating to .sources.
/etc/apt/sources.list (Debian base):
deb http://deb.debian.org/debian trixie main contrib
deb http://deb.debian.org/debian trixie-updates main contrib
deb http://security.debian.org/debian-security trixie-security main contrib
/etc/apt/sources.list.d/pve-enterprise.sources (with subscription):
Types: deb
URIs: https://enterprise.proxmox.com/debian/pve
Suites: trixie
Components: pve-enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
/etc/apt/sources.list.d/proxmox.sources (no subscription):
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
/etc/apt/sources.list.d/ceph.sources (enterprise):
Types: deb
URIs: https://enterprise.proxmox.com/debian/ceph-squid
Suites: trixie
Components: enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
/etc/apt/sources.list.d/ceph.sources (no subscription):
Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
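A quick structural check can catch a half-edited stanza before `apt update` does. The sketch below is only illustrative: `check_sources_file` is a hypothetical helper, it checks for the five fields used in this appendix (APT itself does not require all of them), and it matches field names case-sensitively even though deb822 does not.

```shell
# check_sources_file: hypothetical helper for this runbook.
# Verifies that a deb822 .sources file contains the fields shown in
# this appendix and that its Suites line mentions trixie.
# It does not validate each stanza separately in multi-stanza files.
check_sources_file() {
    file=$1
    for field in Types URIs Suites Components Signed-By; do
        if ! grep -q "^${field}:" "$file"; then
            echo "missing field: $field"
            return 1
        fi
    done
    if ! grep -q '^Suites:.*trixie' "$file"; then
        echo "Suites does not mention trixie"
        return 1
    fi
    echo "ok"
}
```

For example, `check_sources_file /etc/apt/sources.list.d/ceph.sources` prints `ok` for either Ceph stanza above.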
Appendix B — Per-Node Operator Checklist
Print this and tick boxes as you go on each of the three nodes.
NODE: __________________ OPERATOR: __________________ DATE: __________
Pre-flight
[ ] Out-of-band console (IPMI / iLO / iDRAC) verified working
[ ] tmux / screen session active for SSH
[ ] Backups of /etc and all guests on this node verified
[ ] ZFS snapshot taken (if applicable): ________________________________
[ ] pveversion reports 8.4.1 or newer (8.4.x)
[ ] ceph -s = HEALTH_OK
[ ] ceph versions = Squid (19.2.x) cluster-wide
[ ] pve8to9 --full output reviewed; FAIL items resolved
Drain
[ ] All VMs migrated off this node (or shut down for passthrough VMs)
[ ] All CTs migrated or shut down
[ ] ha-manager crm-command node-maintenance enable <node>
[ ] ceph osd set noout
Upgrade
[ ] Debian repos switched bookworm -> trixie
[ ] PVE 9 deb822 .sources file present and verified by apt policy
[ ] Old PVE .list file removed
[ ] Ceph deb822 .sources file present (Trixie + ceph-squid)
[ ] Old ceph.list file removed
[ ] (optional) systemctl disable --now systemd-journald-audit.socket
[ ] apt update succeeded with no errors
[ ] apt dist-upgrade completed (configuration-file prompts answered)
[ ] (UEFI+LVM) grub-efi-amd64 installed
[ ] systemd-boot meta-package removed if pve8to9 said so
[ ] proxmox-boot-tool refresh (ZFS-on-root)
Reboot
[ ] pve8to9 --full re-run, no FAIL items
[ ] reboot
[ ] Node back online; pveversion shows 9.x
[ ] uname -r matches expected kernel
[ ] pvecm status shows this node online
[ ] ceph -s healthy from this node's perspective
[ ] systemctl --failed is empty
Re-attach
[ ] ha-manager crm-command node-maintenance disable <node>
[ ] ceph osd unset noout (only after final node, OR re-set before next node)
[ ] At least one test VM live-migrated back to this node
[ ] Test VM started, network OK, console OK
Sign-off: ____________________________
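The "Squid cluster-wide" box can be backed by a small check on the `ceph versions` output. This sketch assumes the command prints pretty-printed JSON with one version string per line under an "overall" key; `count_overall_versions` is a hypothetical helper, and a real JSON parser would be more robust.

```shell
# count_overall_versions: hypothetical helper for this runbook.
# Reads `ceph versions` JSON on stdin and prints how many distinct
# version strings are listed in the "overall" section.
# Assumes pretty-printed JSON with one entry per line.
count_overall_versions() {
    awk '
        /"overall" *:/                { in_overall = 1; next }
        in_overall && index($0, "}")  { in_overall = 0 }
        in_overall && /"ceph version/ { n++ }
        END { print n + 0 }
    '
}
```

A result of 1 suggests every daemon runs the same release, for example `[ "$(ceph versions | count_overall_versions)" -eq 1 ] && echo "single Ceph version cluster-wide"`.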
Appendix C — Useful One-Liner Reference
# Cluster health
pvecm status
pvecm nodes
ha-manager status
# Per-node version
pveversion -v
# Ceph health
ceph -s
ceph health detail
ceph versions
ceph osd tree
ceph osd df tree
ceph mon dump | grep min_mon_release
ceph mgr services
ceph fs ls
# Maintenance flags
ceph osd set noout
ceph osd unset noout
ha-manager crm-command node-maintenance enable <node>
ha-manager crm-command node-maintenance disable <node>
# Migration
qm migrate <vmid> <target-node> --online
pct migrate <ctid> <target-node> --restart
# Repository sanity
apt update && apt policy
grep -r '' /etc/apt/sources.list /etc/apt/sources.list.d/
# Bootloader
proxmox-boot-tool status
proxmox-boot-tool refresh
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <version>
# Subscription / repo auth
pvesubscription get
pvesubscription update --force
# Backups
vzdump --all 1 --compress zstd --storage <storage-id>
qmrestore <dump-file> <new-vmid> --storage <storage-id>
# Pre-upgrade checker
pve8to9
pve8to9 --full
# Modernize APT
apt modernize-sources
End of document. When in doubt, the canonical reference is the Proxmox wiki: Upgrade from 8 to 9, Ceph Reef to Squid, and Roadmap / known issues. Re-check those pages before any production upgrade — they are updated as new issues are discovered.
Madalin