Madalin
Development enhanced by AI

Upgrading a 3-Node Proxmox VE Cluster with Ceph to Proxmox VE 9.x

Upgrade your 3-node Proxmox VE cluster with Ceph to 9.x! This guide details the essential steps for a smooth, rolling transition.


Comprehensive operational runbook — focused on Proxmox VE 9.x (current stable: 9.1)

Last verified against: Proxmox VE 9.1 (Debian 13.2 “Trixie”, kernel 6.17.2, Ceph Squid 19.2.3, QEMU 10.1.2, LXC 6.0.5, ZFS 2.3.4). Authoritative sources: the official Proxmox wiki pages Upgrade from 8 to 9 and Ceph Reef to Squid.


Table of Contents

  1. Scope and Audience
  2. Background: What Changes in PVE 9.x
  3. The Mandatory Upgrade Order
  4. Pre-Upgrade Planning
  5. Backups — The Only Real Rollback
  6. Phase 1 — Bring All Nodes to the Latest PVE 8.4
  7. Phase 2 — Upgrade Ceph from Reef (18.2) to Squid (19.2)
  8. Phase 3 — Run the pve8to9 Readiness Checker
  9. Phase 4 — Upgrade Each Node from PVE 8.4 to PVE 9.0
  10. Phase 5 — Post-Cluster Validation
  11. Phase 6 — Optional: Move to PVE 9.1
  12. Known Issues and Their Workarounds
  13. Troubleshooting Reference
  14. Rollback Strategy
  15. Appendix A — Repository Reference (deb822)
  16. Appendix B — Per-Node Operator Checklist
  17. Appendix C — Useful One-Liner Reference

1. Scope and Audience

This document describes the end-to-end procedure for performing an in-place, rolling upgrade of a 3-node hyper-converged Proxmox VE cluster running Ceph as the primary storage to Proxmox VE 9.x. It assumes:

  • Three physical (or bare-metal-equivalent) nodes joined in a single Proxmox cluster (Corosync quorum = 2 of 3).
  • A hyper-converged Ceph deployment: monitors, managers, and OSDs are all on the same nodes that run virtual guests.
  • The starting point is Proxmox VE 8.x with Ceph Quincy (17.2.x) or Ceph Reef (18.2.x).
  • The goal is the latest stable Proxmox VE 9.x release with Ceph Squid (19.2.x).
  • The administrator is comfortable on the Linux command line and with apt.

The procedure is designed to keep workloads running throughout the upgrade by using live migration between nodes. Brief reboots of individual nodes are required, but with proper HA and migration planning the cluster as a whole experiences no service interruption for the guests.


2. Background: What Changes in PVE 9.x

Proxmox VE 9.0 was released on 18 July 2025 and is based on Debian 13 “Trixie”. The 9.1 point release followed on 19 November 2025 and is the current recommended target. Notable platform changes:

  • Base distribution: Debian 12 (Bookworm) → Debian 13 (Trixie / 13.2).
  • Kernel: 6.8/6.11 (PVE 8) → 6.14.8 in PVE 9.0, 6.17.2 in PVE 9.1.
  • Ceph: bundles Squid 19.2.3 (Quincy and Reef are unsupported on PVE 9).
  • QEMU: 9.x → 10.0.2 (9.0) / 10.1.2 (9.1).
  • LXC: → 6.0.5; ZFS: → 2.3.3 / 2.3.4.
  • HA: HA groups are deprecated in favor of HA rules (node and resource affinity). Existing groups are auto-migrated after all nodes are on PVE 9.
  • Firewall: Proxmox firewall now uses nftables by default (replaces iptables).
  • SDN: EVPN improvements, Fabrics (OpenFabric/OSPF) as a managed feature.
  • Snapshots on thick LVM as volume chains (relevant for iSCSI/FC SANs — not Ceph, but worth knowing).
  • GlusterFS storage support is removed.
  • cgroup v1 is removed — containers running systemd ≤ 230 (e.g., CentOS 7, Ubuntu 16.04) will not start.
  • /tmp is now a tmpfs by Debian default and is periodically cleaned along with /var/tmp.
  • /etc/sysctl.conf is no longer read by systemd-sysctl. Migrate settings to /etc/sysctl.d/<NN>-<name>.conf.

PVE 8.4 receives security and bug fixes until August 2026, giving roughly one year of overlap with PVE 9.


3. The Mandatory Upgrade Order

The single most important rule of this upgrade is the order. Reordering these steps will break the cluster.

┌─────────────────────────────────────────────────────────────────┐
│  1. All nodes → latest PVE 8.4.x (≥ 8.4.1, ideally newest)      │
│  2. Ceph cluster → Ceph Squid (19.2.x), still on PVE 8.4        │
│  3. pve8to9 --full clean on every node                          │
│  4. Node-by-node: PVE 8.4 → PVE 9.0 (one node at a time)        │
│  5. After all 3 nodes are on PVE 9.0 → optional point upgrade   │
│     to PVE 9.1                                                  │
└─────────────────────────────────────────────────────────────────┘

Why Ceph first? PVE 9 ships only the Squid Ceph packages built for Debian Trixie. There is no Ceph Reef package set for Trixie. If you upgrade Proxmox first, the Ceph daemons will be left without compatible packages and the cluster will degrade. Conversely, Ceph Squid is fully supported on PVE 8.4, so upgrading Ceph in advance is safe. It is not reversible, though: Ceph does not support downgrading to an earlier release, so treat Phase 2 as a one-way step.

Why one node at a time? A 3-node cluster has Corosync quorum of 2. Taking down two nodes simultaneously freezes the cluster. Likewise, a Ceph pool with size=3, min_size=2 only tolerates a single OSD-host outage at a time.
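
The arithmetic behind this rule can be sketched in a few lines of shell. This is only an illustration: the function names quorum_for and tolerated_failures are made up for this example, and the formula assumes every node carries exactly one vote.

```shell
#!/bin/sh
# Majority quorum for a cluster in which every node has one vote.
# quorum_for and tolerated_failures are illustrative names, not Proxmox tools.
quorum_for() {
    local nodes="$1"
    echo $(( nodes / 2 + 1 ))
}

# Number of nodes that may be offline while quorum is still met.
tolerated_failures() {
    local nodes="$1"
    echo $(( nodes - $(quorum_for "$nodes") ))
}
```

Calling quorum_for 3 and tolerated_failures 3 shows why a 3-node cluster needs two nodes online and can lose only one at a time.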


4. Pre-Upgrade Planning

4.1 Hardware and Console Access

Before touching any package, ensure you have out-of-band console access (IPMI, iLO, iDRAC, or physical KVM) to every node. Major-version upgrades occasionally leave a node unable to come up on the network — for example, due to interface-name changes in the new kernel — and SSH alone is not enough to recover.

If only SSH is available, run the upgrade inside tmux or screen so a dropped session does not interrupt apt dist-upgrade:

apt install tmux
tmux new -s pve-upgrade
# detach with Ctrl-b d, re-attach with: tmux attach -t pve-upgrade

Never run the upgrade from the browser-based “noVNC/xterm.js” console of the node you are upgrading — that session terminates partway through.

4.2 Cluster Health Baseline

A healthy starting cluster is non-negotiable. Confirm all of the following from one of the nodes:

# Proxmox cluster quorum and node states
pvecm status
pvecm nodes

# Per-node Proxmox version (run on each)
pveversion -v

# Ceph cluster health, OSD tree, monitor map
ceph -s
ceph osd tree
ceph mon dump | grep min_mon_release
ceph versions

# Free space on root filesystem (≥ 10 GB strongly recommended)
df -h /

ceph -s must report HEALTH_OK before you start. Address any HEALTH_WARN first — running an upgrade on top of an existing warning is asking for compounded failures.
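
If you script the pre-flight checks, a small gate like the one below can refuse to continue unless Ceph is fully healthy. It is a sketch, not part of any Proxmox tooling: health_is_ok is a hypothetical helper name, and it expects the one-line summary printed by ceph health.

```shell
#!/bin/sh
# Succeeds only when the first word of the health summary is HEALTH_OK.
# health_is_ok is an illustrative helper name.
health_is_ok() {
    local status="$1"
    [ "${status%% *}" = "HEALTH_OK" ]
}

# Intended use on a cluster node (not executed here):
#   health_is_ok "$(ceph health)" || echo "Fix Ceph warnings before upgrading" >&2
```

Passing the status in as an argument keeps the check easy to test away from the cluster.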

4.3 Inventory of What Is on Each Node

Document, per node:

  • VMs and CTs (IDs, HA state, current node, RAM/CPU footprint)
  • Ceph daemons present (mon / mgr / mds / osd IDs)
  • Local-only resources that cannot be live-migrated (PCIe passthrough, USB passthrough, local-only storage, raw device mappings)
  • Custom changes in /etc that you might be prompted about during dist-upgrade

These commands collect most of that inventory:

qm list
pct list
ha-manager status
ceph osd tree | grep -E 'host|osd\.'

VMs with PCI passthrough or local LVM/ZFS-only disks must be shut down rather than live-migrated; plan the maintenance window accordingly.
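
To find passthrough VMs quickly, you can scan the VM configuration files for hostpciN and usbN entries. The helper below is a minimal sketch: list_passthrough_vms is a hypothetical name, and the default directory assumes the usual per-node VM config location under /etc/pve. Entries inside snapshot sections of a config file also match, so review the result by hand.

```shell
#!/bin/sh
# Print the VMIDs whose config contains a hostpciN or usbN entry.
# list_passthrough_vms is an illustrative name; the directory argument
# defaults to this node's VM configs.
list_passthrough_vms() {
    local dir="${1:-/etc/pve/qemu-server}" conf
    for conf in "$dir"/*.conf; do
        [ -e "$conf" ] || continue
        if grep -Eq '^(hostpci|usb)[0-9]+:' "$conf"; then
            basename "$conf" .conf
        fi
    done
}
```

Run it on each node in turn, since every node only lists its own guests in that directory.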

4.4 Compatibility Items to Verify

  • Proxmox Backup Server: if you use PBS, check the PVE↔PBS compatibility. PBS 4 is required for full feature parity with PVE 9; PBS 3 still works for backups.
  • Third-party backup vendors (Veeam, etc.): verify they support PVE 9 / QEMU 10 before upgrading. Veeam in particular had issues with VMs at QEMU machine version 10.0+ — a workaround is to pin affected VMs to 9.2+pve1.
  • Third-party storage plugins: any out-of-tree plugin must be rebuilt for PVE 9.
  • NVIDIA vGPU: requires GRID 18.3+ (driver 570.158.02+) for the 6.14 kernel of PVE 9.0, and 19.4+ for the 6.17 kernel of PVE 9.1.
  • FreeBSD-based guests (pfSense, OPNsense, TrueNAS Core): no functional impact, but the GUI may show inflated memory percentages — it is cosmetic.
  • CentOS 7 / Ubuntu 16.04 containers will not start on PVE 9 because cgroup v1 is removed. Migrate them before upgrading the host.

5. Backups — The Only Real Rollback

There is no in-place downgrade from PVE 9 to PVE 8. Your rollback path is restoring from backup.

5.1 Backup the Configuration

On each node, capture a tarball of /etc and the key config directories:

NODE=$(hostname)
mkdir -p /root/preupgrade
tar czf /root/preupgrade/${NODE}-etc-$(date +%F).tgz \
    /etc /var/lib/pve-cluster /var/lib/ceph/ 2>/dev/null

The Proxmox cluster filesystem (/etc/pve) is shared via Corosync; backing it up from any one node is sufficient, but doing it on each is a cheap insurance policy.

5.2 Backup All Guests

If you have Proxmox Backup Server, run a full backup of every VM and CT. vzdump only backs up the guests on the node where it runs, so repeat the command on each node:

vzdump --all 1 --compress zstd --storage <pbs-storage-id>

For environments without PBS, write to a local or NFS storage — but be aware that local backups stored on the same Ceph pool you are about to upgrade are not really off-site. Ideally, push the backups to storage that is independent of the cluster.

5.3 Filesystem-Level Snapshots

If your root filesystem is on ZFS, take snapshots of rpool/ROOT/pve-1 (or equivalent) on every node before starting:

zfs snapshot rpool/ROOT/pve-1@preupgrade-pve9
zfs list -t snapshot | grep preupgrade

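A stock Proxmox VE installation does not add boot-menu entries for ZFS snapshots. If the upgrade leaves the OS unbootable, boot a rescue environment with ZFS support, import the pool, and return to the snapshot with zfs rollback.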

For LVM-thin root, equivalent snapshots are possible but must be sized carefully and are not as automatic.

5.4 Verify Backups Before You Proceed

A backup you have not tested is a wish, not a backup. Restore at least one VM to a scratch ID on a non-production storage and confirm it boots, before you trust the entire cluster’s safety to your backups:

qmrestore /var/lib/vz/dump/<dumpfile>.vma.zst 9999 --storage local-zfs
qm start 9999
qm stop 9999 && qm destroy 9999

6. Phase 1 — Bring All Nodes to the Latest PVE 8.4

Proxmox VE 8.4.1 or newer is the minimum starting point. Older 8.x releases lack the pve8to9 checker and the new repository hooks. On every node, in sequence:

apt update
apt dist-upgrade -y
pveversion

pveversion must report 8.4.x with x ≥ 1. If an updated kernel was installed, reboot the node. Coordinate the reboots so that only one node is offline at a time:

  1. On each node, in turn, place it in maintenance and reboot.
  2. Wait for pvecm status to show the node online again before moving on.
  3. Wait for ceph -s to return to HEALTH_OK before rebooting the next node.

Maintenance-mode reboot pattern (used throughout this document)

# 1. Enable maintenance mode (HA-managed guests migrate away; move or shut down the rest manually)
ha-manager crm-command node-maintenance enable <node>

# 2. Stop Ceph from rebalancing data while the node is briefly offline
ceph osd set noout

# 3. Reboot
reboot

# 4. After reboot, wait until all OSDs are up and PGs are active+clean (health stays HEALTH_WARN while noout is set)
watch -n 5 'ceph -s; echo; pvecm status'

# 5. Take the node out of maintenance and unset noout
ha-manager crm-command node-maintenance disable <node>
ceph osd unset noout

Repeat for the other two nodes. At the end of Phase 1, all three nodes should be on the same PVE 8.4.x version with a healthy Corosync quorum and a HEALTH_OK Ceph cluster.
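
The version requirement can be checked mechanically. The sketch below parses the single line printed by pveversion; pve_is_latest_84 is a hypothetical helper name, and it assumes the line starts with pve-manager/<version>/ as it does on current releases.

```shell
#!/bin/sh
# Succeeds when a `pveversion` line reports 8.4.x with x >= 1.
# pve_is_latest_84 is an illustrative helper name.
pve_is_latest_84() {
    local line="$1" ver major rest minor patch
    ver="${line#pve-manager/}"   # drop the package name
    ver="${ver%%/*}"             # keep only the version field, e.g. 8.4.1
    major="${ver%%.*}"
    rest="${ver#*.}"
    minor="${rest%%.*}"
    case "$rest" in
        *.*) patch="${rest#*.}" ;;
        *)   patch=0 ;;
    esac
    [ "$major" = "8" ] && [ "$minor" = "4" ] && [ "$patch" -ge 1 ] 2>/dev/null
}

# Intended use on each node (not executed here):
#   pve_is_latest_84 "$(pveversion)" || echo "Node is not on 8.4.1 or newer" >&2
```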


7. Phase 2 — Upgrade Ceph from Reef (18.2) to Squid (19.2)

This phase upgrades Ceph in place on PVE 8.4. PVE itself is not yet touched.

If your starting Ceph version is Quincy (17.2.x), run the Ceph Quincy → Reef upgrade first. Skipping Reef and going straight from Quincy to Squid is technically supported but not recommended — the procedure below assumes you are at Reef.

The official guide is the Ceph Reef to Squid wiki page. The summary below mirrors it with the cluster-specific commentary for a 3-node setup.

7.1 Verify the Pre-Upgrade Ceph State

ceph -s
ceph versions
ceph osd tree
ceph fs ls       # only if you use CephFS

You should see all daemons reporting Reef (18.2.x) and HEALTH_OK.

7.2 Switch the Ceph APT Repository on Every Node

Replace reef with squid in the Ceph repository list on each of the three nodes:

sed -i 's/reef/squid/' /etc/apt/sources.list.d/ceph.list
cat /etc/apt/sources.list.d/ceph.list

You should now see one of these (depending on subscription):

deb https://enterprise.proxmox.com/debian/ceph-squid bookworm enterprise
# or
deb http://download.proxmox.com/debian/ceph-squid bookworm no-subscription

Note that we are still on Bookworm at this stage — that is correct. The Trixie repository line will be set later, in Phase 4.

7.3 Set the noout Flag

ceph osd set noout

This prevents Ceph from rebalancing while OSDs restart. It is set once, cluster-wide; you do not run it per node.

7.4 Install the Squid Packages on All Three Nodes

On every node, in any order:

apt update
apt full-upgrade -y

After the package upgrade, Ceph daemons are still running the old Reef binaries — packages are upgraded, but daemons are not restarted automatically.

7.5 Restart Monitor Daemons (One Node at a Time)

The 3-node cluster has 3 monitors. Restart them sequentially, waiting for quorum to reform between each:

# On node 1
systemctl restart ceph-mon.target
ceph -s          # wait for HEALTH_OK / HEALTH_WARN(noout) and 3-of-3 quorum

# On node 2
systemctl restart ceph-mon.target
ceph -s

# On node 3
systemctl restart ceph-mon.target
ceph -s

Once all three monitors are running Squid, verify the monmap:

ceph mon dump | grep min_mon_release
# Expected: min_mon_release 19 (squid)

7.6 Restart Manager Daemons

# Run on each node where a mgr daemon exists
systemctl restart ceph-mgr.target
ceph -s   # confirm one mgr active, others as standby

7.7 Restart OSD Daemons (One Node at a Time)

This is the longest step and the most sensitive in a 3-node cluster.

# Node 1
systemctl restart ceph-osd.target
# Wait for all OSDs back up and PGs active+clean
watch -n 5 'ceph -s'

# Only when HEALTH_OK (or HEALTH_WARN noout), continue:

# Node 2
systemctl restart ceph-osd.target
# Wait...

# Node 3
systemctl restart ceph-osd.target
# Wait...

If your cluster has many OSDs per node, restarting ceph-osd.target will bounce all of them at once. With noout set this is safe, but the cluster will go briefly into degraded state until the OSDs come back. Watch placement-group recovery in ceph -s and do not move to the next node until PGs are clean.
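
The wait between nodes can be scripted instead of watched. The sketch below treats the cluster as settled when the health summary is HEALTH_OK or when noout is the only warning. only_noout_warning is a hypothetical helper name, and the exact wording "noout flag(s) set" is an assumption about what your Ceph release prints, so compare it with your own ceph health output first.

```shell
#!/bin/sh
# Succeeds when health is HEALTH_OK, or HEALTH_WARN caused only by noout.
# only_noout_warning is an illustrative helper name; the warning text
# matched here is an assumption to verify against your own cluster.
only_noout_warning() {
    case "$1" in
        HEALTH_OK*) return 0 ;;
        "HEALTH_WARN noout flag(s) set") return 0 ;;
        *) return 1 ;;
    esac
}

# Poll until the cluster has settled (intended use, not executed here):
#   until only_noout_warning "$(ceph health)"; do sleep 10; done
```

Use the loop only between nodes. Once the last node's OSDs have restarted, additional warnings can appear, and the loop would then wait forever.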

After all OSDs are restarted, you may see this warning:

HEALTH_WARN: all OSDs are running squid or later but require_osd_release < squid

This is expected and is cleared in the next step.

7.8 Promote require-osd-release to Squid

Only after every OSD reports a Squid version (ceph versions):

ceph osd require-osd-release squid

This activates Squid-only on-disk features and clears the warning above.

7.9 Upgrade CephFS MDS Daemons (Skip if You Don’t Use CephFS)

For each filesystem listed by ceph fs ls:

FS=<your-fs-name>

# 1. Save current settings, then disable standby_replay
ceph fs get $FS | grep -o allow_standby_replay
ceph fs set $FS allow_standby_replay false

# 2. Reduce ranks to 1 (note original max_mds first)
ceph fs get $FS | grep max_mds
ceph fs set $FS max_mds 1

# 3. Wait until only one MDS is active per FS
watch -n 5 'ceph status'

# 4. Stop standby MDS daemons (do this on hosts running standby MDS)
systemctl stop ceph-mds.target

# 5. Confirm only one MDS is rank 0
ceph status

# 6. Restart the remaining MDS
systemctl restart ceph-mds.target

# 7. Restart the previously stopped standby MDS daemons
systemctl start ceph-mds.target

# 8. Restore original max_mds and allow_standby_replay
ceph fs set $FS max_mds <original_max_mds>
ceph fs set $FS allow_standby_replay <original_value>

7.10 Unset noout and Confirm Clean

ceph osd unset noout
ceph -s          # must be HEALTH_OK
ceph versions    # all daemons should report 19.2.x

At this point Ceph is on Squid and the cluster is fully healthy. Stop and verify before continuing. Do not begin Phase 3 if Ceph is anything other than HEALTH_OK.
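
One way to automate the version check is to confirm that ceph versions lists a single version string and that it names the expected release. This is a sketch under stated assumptions: all_daemons_on_release is a hypothetical helper name, and it relies on the JSON keys having the form "ceph version <number> (<hash>) <release> (stable)".

```shell
#!/bin/sh
# Read `ceph versions` JSON on stdin and succeed only if every daemon
# reports the same version string and that string names the given release.
# all_daemons_on_release is an illustrative helper name.
all_daemons_on_release() {
    local release="$1" versions count
    versions=$(grep -o '"ceph version [^"]*"' | sort -u)
    count=$(printf '%s\n' "$versions" | grep -c .)
    [ "$count" -eq 1 ] && printf '%s\n' "$versions" | grep -q " $release "
}

# Intended use on a cluster node (not executed here):
#   ceph versions | all_daemons_on_release squid || echo "Mixed Ceph versions" >&2
```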


8. Phase 3 — Run the pve8to9 Readiness Checker

Proxmox ships a built-in checklist program that scans for known upgrade blockers. It only reports; it does not change anything.

On each of the three nodes:

pve8to9 --full

Read the entire output. Categories include repositories, storage, network, guests, certificates, kernel/bootloader, Ceph, and HA. Common items it flags and how to address them:

  • proxmox-ve package is too old → finish Phase 1; you are not on the latest 8.4.
  • systemd-boot meta-package should be removed → apt remove systemd-boot (only if systemd-boot-efi and systemd-boot-tools remain installed; the checker tells you).
  • LVM/LVM-thin storage has guest volumes with autoactivation enabled → on shared LVM (iSCSI/FC) this is important. Run the migration script if suggested:
    /usr/share/pve-manager/migrations/pve-lvm-disable-autoactivation
    
    For local LVM-thin only (no shared LVM), this is optional.
  • Running guests detected → not an error; just a reminder that you will migrate or shut them down before each per-node reboot.
  • Old Ceph version → return to Phase 2; you skipped a step.
  • Bookworm-only repositories present → expected at this stage; will be addressed in Phase 4.

Re-run pve8to9 --full after each fix until the FAIL lines are gone. WARN lines that you understand and accept (such as “running guests”) may be left as-is.
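
When repeating the checker across three nodes, counting result lines in a saved report is quicker than rereading everything. The sketch below assumes each result line begins with its level followed by a colon (for example FAIL: or WARN:) when the output is not a terminal. count_level is a hypothetical helper name.

```shell
#!/bin/sh
# Count the lines in a saved pve8to9 report that start with the given
# level (FAIL, WARN, PASS, ...). Reads the report on stdin.
# count_level is an illustrative helper name.
count_level() {
    grep -c "^$1:" || true
}

# Intended use on a node (not executed here):
#   pve8to9 --full | tee /root/pve8to9.log
#   [ "$(count_level FAIL < /root/pve8to9.log)" -eq 0 ] && echo "no blockers"
```

A zero FAIL count does not replace reading the WARN lines; it only tells you when a re-run is worth a full read.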


9. Phase 4 — Upgrade Each Node from PVE 8.4 to PVE 9.0

This is the per-node rolling upgrade. It must be done one node at a time, fully completing all steps on a node before starting the next.

Assume below that we are upgrading pve01 first, then pve02, then pve03.

9.1 Drain the Node

Migrate every running guest off the node. Live migration from PVE 8 → PVE 9 is supported (the reverse is not generally supported).

# List what is running on this node
qm list
pct list

# Migrate VMs (online)
qm migrate <vmid> pve02 --online

# Migrate CTs (containers usually require restart-migration)
pct migrate <ctid> pve02 --restart

For VMs with PCIe/USB passthrough, you must shut them down and start them on another node manually (or accept that they will be off for the duration of the node’s upgrade).

9.2 Enter Maintenance Mode and Set noout

ha-manager crm-command node-maintenance enable pve01
ceph osd set noout

9.3 Update the Debian Base Repositories to Trixie

Edit the repository files to switch from Bookworm to Trixie:

sed -i 's/bookworm/trixie/g' /etc/apt/sources.list
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null

Inspect /etc/apt/sources.list and every file in /etc/apt/sources.list.d/:

grep -r '' /etc/apt/sources.list /etc/apt/sources.list.d/

Comment out (#) any line still referencing bookworm for which no Trixie equivalent exists. Remove any backports line — the upgrade is not tested with backports installed.
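
To spot leftovers without reading every file, you can list only the files that still carry an uncommented reference to the old suite. files_with_suite is a hypothetical helper name used for this sketch.

```shell
#!/bin/sh
# List APT source files under the given paths that still contain an
# uncommented reference to the given suite name.
# files_with_suite is an illustrative helper name.
files_with_suite() {
    local suite="$1"
    shift
    grep -rlE "^[^#]*${suite}" "$@" 2>/dev/null
}

# Intended use on a node (not executed here):
#   files_with_suite bookworm /etc/apt/sources.list /etc/apt/sources.list.d/
```

An empty result means no active bookworm entries remain in the paths you passed.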

9.4 Add the PVE 9 Repository (deb822 Style)

PVE 9 prefers the new deb822 source format. For the enterprise repository:

cat > /etc/apt/sources.list.d/pve-enterprise.sources << 'EOF'
Types: deb
URIs: https://enterprise.proxmox.com/debian/pve
Suites: trixie
Components: pve-enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF

For the no-subscription repository:

cat > /etc/apt/sources.list.d/proxmox.sources << 'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF

After adding the new file, verify and remove the old .list file:

apt update
apt policy
# If the new repo is correctly listed:
rm -f /etc/apt/sources.list.d/pve-enterprise.list
rm -f /etc/apt/sources.list.d/pve-install-repo.list
apt update && apt policy
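
A typo in a hand-written .sources file is easy to miss, so a quick field check before deleting the old .list file can help. The sketch below treats Types, URIs, Suites and Components as mandatory, which matches the stanzas used in this guide. missing_deb822_fields is a hypothetical helper name, and the check does not validate the field values.

```shell
#!/bin/sh
# Print each expected deb822 field that is absent from a .sources file.
# missing_deb822_fields is an illustrative helper name; it checks only
# that the field names are present, not that their values are correct.
missing_deb822_fields() {
    local file="$1" field
    for field in Types URIs Suites Components; do
        grep -q "^${field}:" "$file" || echo "$field"
    done
}

# Intended use on a node (not executed here):
#   missing_deb822_fields /etc/apt/sources.list.d/proxmox.sources
```

No output means all four fields were found.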

9.5 Update the Ceph Repository to Trixie

Replace the existing ceph.list with a deb822 ceph.sources file pointing to the Trixie Ceph-Squid repo. Enterprise:

cat > /etc/apt/sources.list.d/ceph.sources << 'EOF'
Types: deb
URIs: https://enterprise.proxmox.com/debian/ceph-squid
Suites: trixie
Components: enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF

No-subscription:

cat > /etc/apt/sources.list.d/ceph.sources << 'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF

Then:

apt update
apt policy
rm -f /etc/apt/sources.list.d/ceph.list
apt update

If apt update returns 401 Unauthorized against the enterprise repo, refresh the subscription token:

pvesubscription update --force

9.6 Optional: Quiet the Audit Log During the Upgrade

A Debian Trixie default change re-enables kernel audit messages, which can flood the journal during dist-upgrade. To suppress them:

systemctl disable --now systemd-journald-audit.socket

9.7 Run the Distribution Upgrade

This is the big step. On the node, with a stable session (tmux or console):

apt update
apt dist-upgrade

You will be prompted about a number of configuration files. Recommended responses:

  • /etc/issue → keep yours (default “No”). It is regenerated.
  • /etc/lvm/lvm.conf → install the maintainer’s version unless you have local edits.
  • /etc/ssh/sshd_config → install the maintainer’s version unless you have local edits (the change replaces the deprecated ChallengeResponseAuthentication directive with KbdInteractiveAuthentication).
  • /etc/default/grub → keep yours (default “No”) — only diff non-comment lines and re-apply by hand if needed.
  • /etc/chrony/chrony.conf → install the maintainer’s version. Move local sources to /etc/chrony/sources.d/.

If you see apt-listchanges, press q to exit. For service-restart prompts, use the default — the reboot afterwards restarts everything cleanly anyway.

This step typically takes 5–15 minutes on SSD-backed nodes and considerably longer on rotational disks.

9.8 Address Boot-Loader Items Before the Reboot

If your node boots UEFI from LVM, install the fixed GRUB metapackage to avoid the disk lvmid/... not found boot bug:

[ -d /sys/firmware/efi ] && apt install grub-efi-amd64

If pve8to9 previously suggested removing the systemd-boot meta-package and you forgot, do it now (only if systemd-boot-efi and systemd-boot-tools remain installed):

apt remove systemd-boot

For ZFS-on-root systems using proxmox-boot-tool:

proxmox-boot-tool refresh

9.9 Re-run the Checker, Then Reboot

pve8to9 --full
reboot

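Reboot is mandatory, even if you were already running an opt-in 6.14 kernel under PVE 8: the PVE 9 kernel packages are rebuilt with the newer compiler and ABI versions, so only a reboot guarantees that the running kernel matches the rest of the upgraded system.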

9.10 After Reboot — Validate the Single Node

When the node is back up:

pveversion         # expect 9.x
uname -r           # expect 6.14.x (PVE 9.0) or 6.17.x (PVE 9.1)
ceph -s            # HEALTH_OK or HEALTH_WARN noout
pvecm status       # all 3 nodes shown, this one online
systemctl --failed # should be empty

Then exit maintenance:

ha-manager crm-command node-maintenance disable pve01
ceph osd unset noout

Only unset noout once this node’s OSDs are all back up. If you have more nodes left to upgrade, you may prefer to leave noout set across the entire cluster upgrade to avoid backfill churn between nodes — set it once before the first node, unset it after the last.

9.11 Migrate Some Guests Back, Then Move to the Next Node

qm migrate <vmid> pve01 --online

You can rebalance guests onto the freshly-upgraded node. Do not start Phase 4 on the next node until this node has been fully validated and is healthy in the cluster. Upgrading two nodes in parallel risks losing Corosync quorum and freezing the cluster.

Repeat sections 9.1 through 9.11 for pve02, then pve03.


10. Phase 5 — Post-Cluster Validation

Once all three nodes report PVE 9.x, perform a full cluster validation.

10.1 Versions and Quorum

# On any node:
pvecm status
pvecm nodes
for n in pve01 pve02 pve03; do
    ssh $n "hostname; pveversion -v | head -3"
done

All nodes must report the same pve-manager major.minor.
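
This comparison can be automated by collecting one pveversion line per node and checking that only one major.minor value appears. same_major_minor is a hypothetical helper name, and the sketch assumes each line starts with pve-manager/<major>.<minor>.<patch>/.

```shell
#!/bin/sh
# Read `pveversion` lines on stdin (one per node) and succeed only if
# all of them share the same major.minor.
# same_major_minor is an illustrative helper name.
same_major_minor() {
    local distinct
    distinct=$(sed -n 's|^pve-manager/\([0-9]*\.[0-9]*\)\..*|\1|p' | sort -u | grep -c .)
    [ "$distinct" -eq 1 ]
}

# Intended use from any node (not executed here):
#   for n in pve01 pve02 pve03; do ssh "$n" pveversion; done | same_major_minor \
#       || echo "Nodes report different major.minor versions" >&2
```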

10.2 Ceph

ceph -s              # HEALTH_OK
ceph versions        # every daemon at 19.2.x
ceph osd tree        # all OSDs up + in
ceph mon dump | grep min_mon_release      # squid

10.3 HA Groups Auto-Migrated to Rules

After all nodes are on PVE 9, the HA manager automatically converts legacy HA groups into HA rules. Verify:

ha-manager status
ha-manager rules list
journalctl -eu pve-ha-crm | tail -50

If you see errors, the active CRM node’s log will show them.

10.4 Guests

Check that every VM and CT can be:

  • Started and stopped
  • Live-migrated between any pair of nodes
  • Snapshotted
  • Backed up

Pay special attention to:

  • FreeBSD-based guests: the memory usage shown for the VM may appear inflated. This is a host-side accounting change in PVE 9 and is cosmetic.
  • VMs using QEMU machine version 10.0+: confirm your backup tool supports them. Veeam users may need to pin affected machines to 9.2+pve1.
  • VMs with PCI passthrough: a kernel 6.14 regression has been reported by some users; if a passthrough VM fails to start, pin the older kernel as a workaround (see §12.7).

10.5 Network and Firewall

ip -br link
brctl show 2>/dev/null || bridge link
pve-firewall status
nft list ruleset | head -50

PVE 9 migrates the firewall from iptables to nftables. The configuration in /etc/pve/firewall/ is unchanged; the underlying enforcement engine is.

10.6 Browser-Side Cleanup

After upgrading the GUI nodes, force-reload the browser to flush cached JavaScript:

  • Linux/Windows: Ctrl + Shift + R
  • macOS: ⌘ + Alt + R

10.7 Optionally Modernize APT Sources

PVE 9 ships an apt modernize-sources helper that converts any remaining .list files to deb822 .sources:

apt modernize-sources       # answer 'n' to preview, then re-run with 'Y'

The original files are kept with a .list.bak suffix and can be removed once you have validated the new layout.


11. Phase 6 — Optional: Move to PVE 9.1

PVE 9.1 (released November 2025) is a refinement release on top of 9.0:

  • Linux kernel 6.17.2 (newer hardware support, may affect some Dell PowerEdge servers — see §12.5)
  • QEMU 10.1.2, LXC 6.0.5, ZFS 2.3.4
  • OCI/Docker images as LXC application containers (technology preview)
  • vTPM state in qcow2 (full snapshots for Windows VMs with vTPM)
  • SDN GUI: Fabrics, EVPN learned IPs/MACs in resource tree
  • Many bug fixes

The upgrade from 9.0 to 9.1 is a minor upgrade — no repository changes, no per-node rituals beyond the standard maintenance/reboot pattern:

# On each node, one at a time:
ha-manager crm-command node-maintenance enable <node>
ceph osd set noout
apt update && apt dist-upgrade -y
reboot
# After reboot:
ceph osd unset noout
ha-manager crm-command node-maintenance disable <node>

Wait for ceph -s to be HEALTH_OK and pvecm status to show the node online before moving to the next one.


12. Known Issues and Their Workarounds

The following list compiles the issues most likely to bite you in a hyper-converged Ceph cluster. The full list is on the Roadmap wiki page.

12.1 Network Interface Renaming

The 6.14/6.17 kernel can rename network interfaces because of changes in how PCIe addresses or VFs are detected. If your /etc/network/interfaces references eno1, enp3s0, etc., these names may be different after the upgrade — and the host may come up without networking.

Mitigation: use the new helper to pin all interfaces to stable nicX names before rebooting:

pve-network-interface-pinning generate

(Always have IPMI/console access available as a fallback.)

12.2 Ceph Full-Mesh Networks Failing to Boot

Earlier versions of the Full Mesh Network for Ceph Server guide configured frr like this:

post-up /usr/bin/systemctl restart frr.service

Under PVE 9, frr now depends on networking.service, which deadlocks the boot. Change it to:

post-up /usr/bin/systemctl is-active --quiet frr.service && /usr/bin/systemctl restart frr.service || true

If you only realize this after a node won’t boot, use the Rescue Boot option from the PVE installation ISO (Advanced menu) and edit /etc/network/interfaces.

12.3 GRUB Failure on UEFI + LVM

A GRUB bug in PVE 8 may leave UEFI-booting LVM-root systems unbootable with disk 'lvmid/...' not found. Install the fixed metapackage before the upgrade reboot:

[ -d /sys/firmware/efi ] && apt install grub-efi-amd64

ZFS-on-root and legacy-BIOS systems are not affected.

12.4 systemd-boot Meta-Package Misconfigures the Bootloader

If systemd-boot is installed as a meta-package (it was on PVE 8.1–8.4 ISO installs), it now installs hooks that change the bootloader on package upgrades. Proxmox manages booting via proxmox-boot-tool, so this is harmful. Remove the meta-package when pve8to9 says so:

apt remove systemd-boot

12.5 Kernel 6.17 on Some Dell PowerEdge Servers

Some users report machine-check exceptions or boot failures on certain Dell PowerEdge models with kernel 6.17 (PVE 9.1). Reported workarounds: enable SR-IOV Global and I/OAT DMA in BIOS, or pin the 6.14 kernel:

proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.14.8-2-pve
proxmox-boot-tool refresh

12.6 Veeam Backup with QEMU 10+

Veeam Backup & Replication had failures with VMs at QEMU machine version 10.0+ (the new default in PVE 9). Either pin the VM machine version to 9.2+pve1:

# in /etc/pve/qemu-server/<vmid>.conf
machine: pc-q35-9.2+pve1

… or wait for a Veeam patch before upgrading the affected VMs’ machine versions.

12.7 PCI Passthrough Sometimes Broken on Kernel 6.14

A subset of users have reported VMs with PCI passthrough failing to start on kernel 6.14. Workaround: pin a different kernel build. The 6.8 series shipped with PVE 8 is not available on PVE 9, so pin another 6.14 build if several are installed, or accept downtime until a later kernel update fixes it.

12.8 cgroup v1 Removed — Old Containers Will Not Start

LXC containers running systemd ≤ 230 (CentOS 7, Ubuntu 16.04, older Debian) will fail to boot on PVE 9. Either upgrade the container’s OS or migrate the workload off LXC before upgrading the host.

12.9 /etc/sysctl.conf Is No Longer Honored

Move every entry from /etc/sysctl.conf to a numbered file in /etc/sysctl.d/, e.g. /etc/sysctl.d/90-local.conf. Common things to migrate:

  • net.ipv4.ip_forward
  • net.ipv6.conf.all.forwarding
  • net.ipv4.conf.all.rp_filter (matters for EVPN exit nodes)

After moving, apply with sysctl --system.
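
The move itself can be done with a small filter that keeps only the active settings. This is a sketch: active_sysctl_entries is a hypothetical helper name, and the target file name 90-local.conf is simply the example used above. Review the generated file before applying it.

```shell
#!/bin/sh
# Print the active settings from a sysctl file, skipping blank lines and
# comments (lines starting with # or ;).
# active_sysctl_entries is an illustrative helper name.
active_sysctl_entries() {
    grep -Ev '^[[:space:]]*([#;]|$)' "$1"
}

# Intended use on a node (not executed here):
#   active_sysctl_entries /etc/sysctl.conf > /etc/sysctl.d/90-local.conf
#   sysctl --system
```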

12.10 /tmp is Now a tmpfs

Debian 13 mounts /tmp as a tmpfs (up to 50 % of RAM) and periodically cleans /tmp and /var/tmp. If any application or backup script relies on long-lived files in /tmp, move them. Most Proxmox-native processes are unaffected.

12.11 GlusterFS Storage Removed

If you have any GlusterFS storage definitions, remove them from /etc/pve/storage.cfg or convert them to Directory storage with a manual mount. The upgrade will warn but continue.

12.12 Custom OpenFabric / OSPF FRR Configurations

Custom FRR daemons in /etc/frr/frr.conf.local are now disabled the next time SDN config is applied. To keep your custom config working independently of SDN, create /etc/default/frr with ospfd=yes (or fabricd=yes).


13. Troubleshooting Reference

13.1 apt dist-upgrade Wants to Remove proxmox-ve

This means a Bookworm-only repository is still active, leaving some packages without a Trixie counterpart. Re-check every file in /etc/apt/sources.list.d/:

grep -r '' /etc/apt/sources.list /etc/apt/sources.list.d/

Correct the offending repository, run apt update, and try again. If a package truly has no Trixie version (a third-party plugin, for example), uninstall it before continuing:

apt purge <package>
apt -f install
apt dist-upgrade

13.2 apt update Returns 401 Unauthorized on the Ceph Enterprise Repo

pvesubscription update --force
apt update

If still 401, confirm pve-manager ≥ 8.2.8 (PVE 8 path) or 9.0.x (PVE 9 path) and that your subscription covers the Ceph add-on.

13.3 Ceph Not Going Back to HEALTH_OK After OSD Restart

ceph -s
ceph health detail
ceph osd tree
ceph osd df tree

Common causes: an OSD that did not start (check journalctl -u ceph-osd@<id>), a disk that failed during restart, or PGs stuck peering because a monitor is missing. Resolve before continuing.

13.4 Node Won’t Boot After Upgrade

Boot from the PVE 9 ISO → Advanced → Rescue Boot. This boots your installed system using the kernel from the ISO, which bypasses a broken bootloader. From there:

# Common fixes:
apt -f install                       # finish a partial dist-upgrade
update-grub                          # rebuild GRUB config
proxmox-boot-tool refresh            # rebuild systemd-boot/EFI entries
nano /etc/network/interfaces         # fix renamed NICs

For ZFS-root systems, boot from a previous ZFS dataset by selecting it in the GRUB menu (the preupgrade-pve9 snapshot you took in §5.3).

13.5 Cluster Loses Quorum During Upgrade

If you accidentally took two nodes offline at once, the cluster will be read-only (pvecm status shows Activity blocked). Restore quorum by bringing one of the offline nodes back online. Never force quorum on a 3-node cluster as a casual workaround — it can cause split-brain in /etc/pve.

For a true emergency on a single surviving node:

pvecm expected 1     # only as last resort, with the other two truly dead
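
Quorum can also be confirmed from a script before any node is taken offline. This is a minimal sketch that assumes pvecm status prints a line of the form Quorate: Yes when the cluster has quorum; the function name is_quorate is hypothetical.

```shell
# Sketch: succeed only when `pvecm status` reports the cluster as quorate.
# Assumes the status output contains a line of the form "Quorate: Yes".
is_quorate() {
    pvecm status 2>/dev/null | awk '
        $1 == "Quorate:" { found = ($2 == "Yes") }
        END { exit found ? 0 : 1 }
    '
}
```

A guard such as is_quorate || { echo "No quorum, stopping" >&2; exit 1; } at the top of any per-node upgrade script helps avoid taking a second node down by accident.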

13.6 LVM Thin Pool Needs Repair After Upgrade

Some systems show:

Check of pool pve/data failed (status:64). Manual repair required!

Repair with:

lvconvert --repair pve/data

13.7 HA Rules Errors After Upgrade

If HA groups did not convert cleanly:

journalctl -eu pve-ha-crm | tail -100
ha-manager rules list

Common fix: edit a rule via the GUI (which forces a re-validation), or recreate it.


14. Rollback Strategy

There is no in-place downgrade path from PVE 9 back to PVE 8. If the upgrade has gone wrong on one node, your options are:

  1. Filesystem-level rollback (if you took ZFS snapshots in §5.3). Reboot the node, select the pre-upgrade ZFS dataset from the GRUB menu, and the node returns to its PVE 8.4 state. The cluster will then be running mixed versions, which is supported temporarily — fix the underlying problem and retry the upgrade for that node.
  2. Reinstall the node from the PVE 8.4 ISO and rejoin the cluster. This is invasive but recovers a node fully. The remaining two PVE 8.4 nodes (or two PVE 9 nodes) accept the rejoin. Ceph OSDs on local disks can usually be re-attached without re-replicating data by running ceph-volume lvm activate --all after the reinstall.
  3. Restore VMs to a fresh PVE 8.4 cluster. This is the worst-case fallback and depends entirely on the backups you took in §5.

The single best thing you can do to make rollback unnecessary is to never start the next node until the current one is fully healthy. A failed upgrade on one node out of three is recoverable. A failed upgrade on two nodes out of three is a long night.


Appendix A — Repository Reference (deb822)

All entries below target Debian Trixie (PVE 9). The legacy .list format still works, but Proxmox recommends migrating to .sources.

/etc/apt/sources.list (Debian base):

deb http://deb.debian.org/debian trixie main contrib
deb http://deb.debian.org/debian trixie-updates main contrib
deb http://security.debian.org/debian-security trixie-security main contrib

/etc/apt/sources.list.d/pve-enterprise.sources (with subscription):

Types: deb
URIs: https://enterprise.proxmox.com/debian/pve
Suites: trixie
Components: pve-enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg

/etc/apt/sources.list.d/proxmox.sources (no subscription):

Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg

/etc/apt/sources.list.d/ceph.sources (enterprise):

Types: deb
URIs: https://enterprise.proxmox.com/debian/ceph-squid
Suites: trixie
Components: enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg

/etc/apt/sources.list.d/ceph.sources (no subscription):

Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
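
As a quick sanity check on the deb822 files above, the sketch below confirms that a .sources file contains the five fields used in these stanzas and that every Suites line names trixie. The function name check_sources_file is hypothetical, and the check is a plain text match rather than a full deb822 parser.

```shell
# Sketch: verify that a deb822 .sources file has the fields used in the
# stanzas above and that its Suites lines reference trixie.
check_sources_file() {
    local file="$1" field bad=0
    for field in Types URIs Suites Components Signed-By; do
        if ! grep -q "^${field}:" "$file"; then
            echo "$file: missing field ${field}" >&2
            bad=1
        fi
    done
    # Flag any Suites line that does not mention trixie
    if grep '^Suites:' "$file" | grep -qv 'trixie'; then
        echo "$file: Suites does not reference trixie" >&2
        bad=1
    fi
    return "$bad"
}
```

Looping over /etc/apt/sources.list.d/*.sources with this helper, followed by apt update and apt policy, gives a reasonable first pass before the dist-upgrade.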

Appendix B — Per-Node Operator Checklist

Print this and tick boxes as you go on each of the three nodes.

NODE: __________________     OPERATOR: __________________     DATE: __________

Pre-flight
  [ ] Out-of-band console (IPMI / iLO / iDRAC) verified working
  [ ] tmux / screen session active for SSH
  [ ] Backups of /etc and all guests on this node verified
  [ ] ZFS snapshot taken (if applicable): ________________________________
  [ ] pveversion reports 8.4.x (8.4.1 or later)
  [ ] ceph -s = HEALTH_OK
  [ ] ceph version = Squid (19.2.x) cluster-wide
  [ ] pve8to9 --full output reviewed; FAIL items resolved

Drain
  [ ] All VMs migrated off this node (or shut down for passthrough VMs)
  [ ] All CTs migrated or shut down
  [ ] ha-manager crm-command node-maintenance enable <node>
  [ ] ceph osd set noout

Upgrade
  [ ] Debian repos switched bookworm -> trixie
  [ ] PVE 9 deb822 .sources file present and verified by apt policy
  [ ] Old PVE .list file removed
  [ ] Ceph deb822 .sources file present (Trixie + ceph-squid)
  [ ] Old ceph.list file removed
  [ ] (optional) systemctl disable --now systemd-journald-audit.socket
  [ ] apt update succeeded with no errors
  [ ] apt dist-upgrade completed (configuration-file prompts answered)
  [ ] (UEFI+LVM) grub-efi-amd64 installed
  [ ] systemd-boot meta-package removed if pve8to9 said so
  [ ] proxmox-boot-tool refresh (ZFS-on-root)

Reboot
  [ ] pve8to9 --full re-run, no FAIL items
  [ ] reboot
  [ ] Node back online; pveversion shows 9.x
  [ ] uname -r matches expected kernel
  [ ] pvecm status shows this node online
  [ ] ceph -s healthy from this node's perspective
  [ ] systemctl --failed is empty

Re-attach
  [ ] ha-manager crm-command node-maintenance disable <node>
  [ ] ceph osd unset noout (only after final node, OR re-set before next node)
  [ ] At least one test VM live-migrated back to this node
  [ ] Test VM started, network OK, console OK

Sign-off: ____________________________

Appendix C — Useful One-Liner Reference

# Cluster health
pvecm status
pvecm nodes
ha-manager status

# Per-node version
pveversion -v

# Ceph health
ceph -s
ceph health detail
ceph versions
ceph osd tree
ceph osd df tree
ceph mon dump | grep min_mon_release
ceph mgr services
ceph fs ls

# Maintenance flags
ceph osd set noout
ceph osd unset noout
ha-manager crm-command node-maintenance enable  <node>
ha-manager crm-command node-maintenance disable <node>

# Migration
qm migrate <vmid> <target-node> --online
pct migrate <ctid> <target-node> --restart

# Repository sanity
apt update && apt policy
grep -r '' /etc/apt/sources.list /etc/apt/sources.list.d/

# Bootloader
proxmox-boot-tool status
proxmox-boot-tool refresh
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <version>

# Subscription / repo auth
pvesubscription get
pvesubscription update --force

# Backups
vzdump --all 1 --compress zstd --storage <storage-id>
qmrestore <dump-file> <new-vmid> --storage <storage-id>

# Pre-upgrade checker
pve8to9
pve8to9 --full

# Modernize APT
apt modernize-sources

End of document. When in doubt, the canonical reference is the Proxmox wiki: Upgrade from 8 to 9, Ceph Reef to Squid, and Roadmap / known issues. Re-check those pages before any production upgrade — they are updated as new issues are discovered.

Madalin
