Best practices for transitioning VM clusters to KubeVirt 1.0

Ambassador post by Zou Nengren, September 22, 2023


The KubeVirt community is thrilled to announce the highly-anticipated release of KubeVirt v1.0! This momentous release signifies the remarkable achievements and widespread adoption within the community, marking a significant milestone for all stakeholders involved. This project became part of CNCF as a sandbox project in September 2019 and attained incubation status in April 2022. KubeVirt has evolved into a production-ready virtual machine management project that seamlessly operates as a native Kubernetes API. We have also chosen KubeVirt as our ultimate solution for virtual machine orchestration. Currently, we are utilizing AlmaLinux as the virtualization foundation and cgroup v2 as the resource isolation mechanism. Throughout the process of implementing KubeVirt, we encountered certain challenges. Therefore, we aim to share some of the practical experiences and insights we’ve gained from working with KubeVirt in this article.

Why KubeVirt?

While OpenStack has seen widespread adoption, its architecture is relatively complex. KubeVirt streamlines virtual machine management and integrates more smoothly with the rest of the stack. As a CNCF project that fits into the wider CNCF ecosystem, KubeVirt extends the Kubernetes API with custom resource definitions (CRDs) so that VMs can be operated natively within Kubernetes.

KubeVirt containerizes the trusted virtualization layer of QEMU and libvirt, allowing VMs to be handled just like any other Kubernetes resource. This approach provides users with a more flexible, scalable, and modern virtual machine management solution, offering the following key advantages:

Simplified Architecture and Management: Compared to OpenStack, KubeVirt has a simpler architecture and lighter management requirements. OpenStack can be unwieldy and costly to maintain, while KubeVirt leverages Kubernetes for the automated lifecycle management of VMs. It eliminates separate processes for VMs and containers, facilitating the integration of workflows for both virtualization and containerization. This simplifies the underlying infrastructure stack and reduces management costs.
Modern, Scalable, Kubernetes-Based Solution: KubeVirt is a modern, scalable, Kubernetes-based virtual machine management solution. By standardizing automated testing and deployment of all applications using Kubernetes, and unifying metadata within Kubernetes, it reduces the risk of deployment errors and enables faster iteration. This minimizes the operational workload for DevOps teams and accelerates day-to-day operations.
Tight Integration with the Kubernetes Ecosystem: KubeVirt integrates seamlessly with the Kubernetes ecosystem, offering improved scalability and performance. Migrating VMs to Kubernetes can reduce software and application costs and minimize performance overhead at the virtualization layer.
Ideal for Lightweight, Flexible, and Modern VM Management: KubeVirt is well-suited for scenarios requiring lightweight, flexible, and modern virtual machine management. Users can run their virtual workloads alongside container workloads, managing them in the same manner. They can also leverage familiar cloud-native tools such as Tekton, Istio, ArgoCD, and more, which are already favored by cloud-native users.

What's new in KubeVirt 1.0?

In the POC phase, we used two RPM-based packages and six container images, which together extend Kubernetes with virtual machine management:

kubevirt-virtctl: This package can be installed on any machine with administrator access to the cluster. It contains the virtctl tool, which simplifies virtual machine management using kubectl. While kubectl can be used for this purpose, managing VMs can be complex due to their stateful nature. The virtctl tool abstracts this complexity, enabling operations like starting, stopping, pausing, unpausing, and migrating VMs. It also provides access to the virtual machine’s serial console and graphics server.
kubevirt-manifests: This package contains manifests for installing KubeVirt. Key files include kubevirt-cr.yaml, representing the KubeVirt Custom Resource definition, and kubevirt-operator.yaml, which deploys the KubeVirt operator responsible for managing the KubeVirt service within the cluster.

The container images are as follows:

virt-api: Provides a Kubernetes API extension for virtual machine resources.
virt-controller: Watches for new or updated objects created via virt-api and ensures object states match the requested state.
virt-handler: A DaemonSet and node component that keeps cluster-level virtual machine objects in sync with libvirtd domains running in virt-launcher. It can also perform node-centric operations like configuring networking and storage.
virt-launcher: A node component that runs libvirt and QEMU to provide the virtual machine environment.
virt-operator: Implements the Kubernetes operator pattern for managing the KubeVirt application.
libguestfs-tools: Provides utilities for accessing and modifying VM disk images.

The v1.0 release marks significant growth for the KubeVirt community, which has progressed from an idea to a production-ready virtual machine management solution over the past six years. This release emphasizes maintaining APIs while expanding the project. The release cadence has shifted to align with Kubernetes practices, enabling better stability, compatibility, and support.

The project has embraced Kubernetes community practices, including SIGs for test and review responsibilities, a SIG release repo for release-related tasks, and regular SIG meetings covering areas like scale, performance, and storage.

Notable features in v1.0 include memory over-commit support, persistent vTPM for easier BitLocker usage on Windows, initial CPU hotplug support, hot plug and hot unplug of network interfaces (in alpha), and further progress on API stabilization and SR-IOV interface support.

The focus is on aligning KubeVirt with Kubernetes and fostering community collaboration to enhance virtual machine management within the Kubernetes ecosystem.

What issues remain in KubeVirt 1.0?

cgroup v2 support

When using cgroup v2, starting a VM with a non-hotpluggable volume can be problematic because cgroup v2 doesn’t provide information about the currently allowed devices for a container. KubeVirt addresses this issue by tracking device rules internally using a global variable:

/kubevirt/7/pkg/virt-handler/cgroup/cgroup_v2_manager.go#L10

var rulesPerPid = make(map[string][]*devices.Rule)

However, this approach has some drawbacks:

The variable won’t survive a crash or restart of the virt-handler pod, resulting in data loss.
The state is stored in a dynamic structure (a map), and stale data is not removed, causing memory consumption to continuously increase.

A potential solution is to store the state in a file, for example, /var/run/kubevirt-private/devices.list. This file should be updated each time a device is added or removed. Additionally, it should be removed when the corresponding VM is destroyed, or periodic cleanup can be performed. The file can follow the same data format as devices.list on cgroup v1 hosts, allowing the same code to parse the current state for both v1 and v2.
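To make the proposed file format concrete, here is a minimal sketch of writing and reading rules in the cgroup v1 devices.list line format ("type major:minor access", with "*" as a wildcard). The deviceRule type and the helper names are hypothetical and simplified for illustration; they are not the types KubeVirt uses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// deviceRule is a hypothetical, simplified stand-in for a device rule.
type deviceRule struct {
	Type   string // "a" (all), "b" (block) or "c" (char)
	Major  int64  // -1 stands for the wildcard "*"
	Minor  int64  // -1 stands for the wildcard "*"
	Access string // any combination of "r", "w" and "m"
}

// formatNum renders a device number, mapping negative values to "*".
func formatNum(n int64) string {
	if n < 0 {
		return "*"
	}
	return strconv.FormatInt(n, 10)
}

// parseNum is the inverse of formatNum.
func parseNum(s string) (int64, error) {
	if s == "*" {
		return -1, nil
	}
	return strconv.ParseInt(s, 10, 64)
}

// formatRule renders a rule as one devices.list line, e.g. "c 1:3 rwm".
func formatRule(r deviceRule) string {
	return fmt.Sprintf("%s %s:%s %s", r.Type, formatNum(r.Major), formatNum(r.Minor), r.Access)
}

// parseRule reads one devices.list line back into a rule.
func parseRule(line string) (deviceRule, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return deviceRule{}, fmt.Errorf("malformed rule %q", line)
	}
	majStr, minStr, ok := strings.Cut(fields[1], ":")
	if !ok {
		return deviceRule{}, fmt.Errorf("malformed device numbers in %q", line)
	}
	major, err := parseNum(majStr)
	if err != nil {
		return deviceRule{}, err
	}
	minor, err := parseNum(minStr)
	if err != nil {
		return deviceRule{}, err
	}
	return deviceRule{Type: fields[0], Major: major, Minor: minor, Access: fields[2]}, nil
}
```

Because the same parser would work on a real devices.list from a cgroup v1 host, the code that reads the current state would not need to branch on the cgroup version.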

However, managing the file introduces the challenge of performing transactions, i.e., applying actual device rules and writing the state to the file atomically.

You can find more details and discussions about this issue in GitHub issue #7710.

Cilium Support

Cilium multi-homing

In Kubernetes, each pod typically has only one network interface (aside from a loopback interface). Cilium-native multi-homing aims to enable the attachment of additional network interfaces to pods. This functionality is similar to what the Multus CNI offers, which allows the attachment of multiple network interfaces to pods. However, Cilium-native multi-homing distinguishes itself by relying exclusively on the Cilium CNI as the sole CNI installed.

This feature should provide robust support for all existing Cilium datapath capabilities, including network policies, observability, datapath acceleration, and service discovery. Furthermore, it aims to offer a straightforward developer experience that aligns with the simplicity and usability that Cilium already provides today.

Multus

When using Cilium version 1.14.0 alongside multus-cni, the secondary interface does not become visible. These are the files found under the /etc/cni directory after installing Multus in a Cilium 1.14 environment:

$ ls -l
/etc/cni/net.d/05-cilium.conflist
/etc/cni/net.d/00-multus.conf.cilium_bak
/etc/cni/net.d/100-crio-bridge.conflist.cilium_bak
/etc/cni/net.d/200-loopback.conflist.cilium_bak
/etc/cni/net.d/multus.d/multus.kubeconfig

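The issue with Multus installation in Cilium 1.14 was resolved by setting cni.exclusive to false. When the option is enabled, Cilium renames every non-Cilium CNI configuration to *.cilium_bak, which is why 00-multus.conf appears in the listing above only as a backup file.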

cni:
  # Make Cilium take ownership over the `/etc/cni/net.d` directory on the
  # node, renaming all non-Cilium CNI configurations to `*.cilium_bak`.
  # This ensures no Pods can be scheduled using other CNI plugins during Cilium
  # agent downtime.
  exclusive: false

Harbor limit

We encountered an issue when attempting to push a container image to our Harbor instance, hosted on an EC2 instance, when a single layer containing a win.qcow2 image exceeds 10.25GB. Our Harbor version is v2.1.2, and we are using S3 as the storage backend.

Our system has successfully handled containers with total sizes exceeding 15GB in the past. However, this specific container with a single layer size of 13.5GB repeatedly fails to push. On the client side, we receive limited feedback:

Sep 10 22:29:19 backend-ci dockerd[934]: time="2023-09-10T22:29:19.628869277+02:00" level=error msg="Upload failed, retrying: blob upload unknown"

The upload itself appears to complete, and the client-side error shows up only afterward. In registry.log, we noticed the following error:

registry[885]: time="2023-09-10T08:47:25.330317861Z" level=error msg="upload resumed at wrong offest: 10485760000 != 12341008872"
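The two offsets in that log line allow a small worked calculation. This is one possible reading and not a confirmed diagnosis: it assumes the registry uploads the layer to S3 in 10 MiB parts and computes the resume offset from a single listing of parts, and S3 returns at most 1000 parts per listing response. Under those assumptions the smaller offset is exactly what a truncated listing would produce. The constant and function names are illustrative only.

```go
package main

const (
	// Assumption: size of each part uploaded to S3 (10 MiB).
	assumedChunkSize int64 = 10 << 20
	// Assumption: only one page of the part listing is read (1000 parts).
	assumedPartsPerPage int64 = 1000
)

// partsNeeded returns how many parts of chunkSize bytes cover size bytes.
func partsNeeded(size, chunkSize int64) int64 {
	return (size + chunkSize - 1) / chunkSize
}

// offsetFromListedParts returns the upload offset implied by counting only
// the parts that were listed, assuming every listed part is full-sized.
func offsetFromListedParts(listed, chunkSize int64) int64 {
	return listed * chunkSize
}
```

With these values, offsetFromListedParts(assumedPartsPerPage, assumedChunkSize) is 10485760000, the first offset in the log, while partsNeeded(12341008872, assumedChunkSize) is 1177, which is more than one listing page can hold.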

We would greatly appreciate any insights or advice on this matter. Others may have encountered similar issues with very large layers, especially with S3 as the storage backend, where pushing layers larger than 10GB appears to fail. We have also come across a potential fix proposed in this GitHub pull request:

https://github.com/goharbor/harbor/pull/16322

Conclusion

KubeVirt simplifies running virtual machines on Kubernetes, making it as easy as managing containers. It provides a cloud-native approach to managing virtual machines. KubeVirt addresses the challenge of unifying the management of virtual machines and containers, effectively harnessing the strengths of both. There is still a long way to go in practice, but the release of version 1.0 is significant for the community and users. We look forward to the widespread adoption of KubeVirt and its full support for cgroup v2.

https://documentation.suse.com/container/kubevirt/html/SLE-kubevirt/index.html

https://kubevirt.io/2023/KubeVirt-v1-has-landed.html

CFP: Cilium-native multi-homing · Issue #20129

https://github.com/goharbor/harbor/issues/15719

https://github.com/kubevirt/kubevirt/issues/398

https://github.com/k8snetworkplumbingwg/multus-cni/issues/1132