Longhorn v1.6.x: High Memory Usage on Instance Managers

by Lucas

Hey guys, let's dive into a tricky issue some of us are facing with Longhorn, specifically around memory usage after upgrading to version 1.6.x. This is something that's been hitting our production environment, and I want to share the details to help the community and hopefully find a solution or workaround. Let's break it down!

The Memory Monster: Instance Manager Woes

So, the core problem is pretty straightforward: we've observed a significant increase in memory consumption by our Longhorn instance managers after upgrading to version 1.6.x. Before the upgrade, the instance managers were generally well-behaved, resource-wise, using around 70% of the available memory on each node. After the upgrade, though, they now consistently consume around 90% of the available memory, even when idling. That's a big jump, and it's causing us some headaches.

Our setup involves a good number of replicas per instance manager; specifically, we have 439 replicas per instance manager across three nodes. In our environment, the average memory consumption per instance manager now hovers around 11.3 Gi, and it can spike even higher, sometimes exceeding 13 Gi while backup jobs are running. We're on an EKS cluster, and each of our worker nodes has 16 Gi of memory, so at this consumption rate the nodes are seriously strained, and some other pods running on the same nodes have been OOMKilled. We haven't seen a node crash yet, but it's definitely something we're keeping a close eye on. The instance managers are crucial for Longhorn's operation: they manage the volumes, handle data synchronization, and generally keep everything running smoothly, so when they use this much memory it can lead to performance bottlenecks and, in the worst case, node instability. Longhorn has always been more CPU-bound than memory-bound for us, yet we never ran into CPU problems from v1.4.x onward; now memory has become the critical factor.
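If you want to check whether your own instance managers are doing the same thing, here's a minimal sketch in Python using the official `kubernetes` client and the metrics API. It assumes a default install (the `longhorn-system` namespace, instance-manager pods named `instance-manager-*`, and a working metrics-server), so adjust those bits for your setup.

```python
# Minimal sketch: print memory usage of Longhorn instance-manager pods via the
# metrics API. Assumes a default install in the "longhorn-system" namespace,
# pods named "instance-manager-*", and a working metrics-server.
from kubernetes import client, config


def to_gib(quantity: str) -> float:
    """Rough conversion of a Kubernetes memory quantity (Ki/Mi/Gi or bytes) to GiB."""
    units = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor / 1024 ** 3
    return float(quantity) / 1024 ** 3


def main() -> None:
    config.load_kube_config()
    metrics = client.CustomObjectsApi().list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", "longhorn-system", "pods"
    )
    for pod in metrics["items"]:
        name = pod["metadata"]["name"]
        if not name.startswith("instance-manager"):
            continue
        used = sum(to_gib(c["usage"]["memory"]) for c in pod["containers"])
        print(f"{name}: {used:.1f} GiB")


if __name__ == "__main__":
    main()
```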

Before and After: A Memory Consumption Comparison

The upgrade from v1.5.x to v1.6.x seems to be the tipping point. Before the upgrade, the instance managers were, as mentioned, using around 70% of the available memory; that was acceptable and didn't cause any significant problems. After upgrading to v1.6.x, that number jumped to approximately 90%, leaving far less memory for everything else on the node. The increase is particularly concerning because the number of replicas (439) remained constant across both versions: the workload didn't change, but the memory usage jumped significantly. That suggests the issue is related to a change in how the instance managers handle their work, perhaps a bug or an inefficiency introduced in version 1.6.x.
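To double-check that the workload really did stay flat across the upgrade, something like the following can count Longhorn Replica custom resources per node (one instance manager per node in recent releases). The `longhorn.io/v1beta2` group/version and the `spec.nodeID` field match the CRDs as we understand them, but treat them as assumptions and verify against your installed version.

```python
# Minimal sketch: count Longhorn Replica custom resources per node.
# Assumes Longhorn's v1beta2 CRDs and the default "longhorn-system" namespace;
# check `kubectl api-resources | grep longhorn` if your versions differ.
from collections import Counter

from kubernetes import client, config


def main() -> None:
    config.load_kube_config()
    replicas = client.CustomObjectsApi().list_namespaced_custom_object(
        "longhorn.io", "v1beta2", "longhorn-system", "replicas"
    )
    per_node = Counter(
        r.get("spec", {}).get("nodeID") or "<unscheduled>" for r in replicas["items"]
    )
    for node, count in sorted(per_node.items()):
        print(f"{node}: {count} replicas")
    print(f"total: {sum(per_node.values())} replicas")


if __name__ == "__main__":
    main()
```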

This higher memory consumption has several potential consequences. First, it reduces the amount of memory available for other pods running on the same node. If the instance managers are using most of the memory, it can lead to those other pods getting OOMKilled, as the system struggles to allocate enough resources. It can also impact the overall performance of the Kubernetes cluster. Memory is a crucial resource for any application, and when it's scarce, everything slows down. This can lead to increased latency, slower response times, and a general degradation of the user experience. So this situation with the instance managers is something we need to address quickly to ensure that our production environment remains stable and performant.

Diving into the Technical Details: Environment and Configuration

Let's get a bit more technical and look at the environment where this issue is occurring. We're running Longhorn v1.6.4 on AWS EKS, using the Amazon EKS AL2 AMI. Our worker nodes have 16 Gi of memory and 2 CPUs and run kernel 5.10.238. The nodes use SSDs and have up to 25 Gbps of network bandwidth between them. Longhorn was installed through the Rancher catalog, and we've got about 439 Longhorn volumes running in the cluster.
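For reference, here's a small sketch that pulls the same environment facts straight from the cluster (node capacity, kernel version, and the Longhorn volume count). As above, the `longhorn.io/v1beta2` group/version for the Volume CRD is an assumption; adjust it for your release.

```python
# Minimal sketch: print node capacity/kernel and the number of Longhorn volumes.
# The longhorn.io/v1beta2 group/version for the Volume CRD is an assumption.
from kubernetes import client, config


def main() -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    for node in core.list_node().items:
        cap = node.status.capacity
        print(
            f"{node.metadata.name}: memory={cap['memory']}, cpu={cap['cpu']}, "
            f"kernel={node.status.node_info.kernel_version}"
        )
    volumes = client.CustomObjectsApi().list_namespaced_custom_object(
        "longhorn.io", "v1beta2", "longhorn-system", "volumes"
    )
    print(f"Longhorn volumes: {len(volumes['items'])}")


if __name__ == "__main__":
    main()
```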

This information matters because it gives a clear picture of the environment and helps narrow down potential root causes. Knowing the specific versions, hardware, and configuration lets us focus the troubleshooting on the most relevant areas and, hopefully, get to a solution faster.

Reproduction, Expected Behavior, and the Workaround

To reproduce the issue, simply upgrade to the latest v1.6.x version of Longhorn. The expected behavior is what we saw on v1.5.x: each instance manager consumed about 70% of the memory on the node. After upgrading, it constantly consumes about 90%. As a workaround, we had to bump the instance type from large to xlarge to give the instance managers more memory.
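Before paying for the bigger instance type, it can help to quantify how close each instance manager actually is to its node's allocatable memory. Here's a rough sketch that does that comparison; it makes the same default-install and metrics-server assumptions as the earlier snippet.

```python
# Rough sketch: compare each instance-manager pod's memory usage to its node's
# allocatable memory, as a signal for whether the node size needs bumping.
# Assumes metrics-server and a default Longhorn install in "longhorn-system".
from kubernetes import client, config


def to_bytes(quantity: str) -> float:
    """Convert a Kubernetes quantity like '16Gi' or '123456Ki' to bytes."""
    units = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)


def main() -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    allocatable = {
        n.metadata.name: to_bytes(n.status.allocatable["memory"])
        for n in core.list_node().items
    }
    usage = client.CustomObjectsApi().list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", "longhorn-system", "pods"
    )
    used_by_pod = {
        m["metadata"]["name"]: sum(to_bytes(c["usage"]["memory"]) for c in m["containers"])
        for m in usage["items"]
    }
    for pod in core.list_namespaced_pod("longhorn-system").items:
        if not pod.metadata.name.startswith("instance-manager"):
            continue
        node = pod.spec.node_name
        pct = 100.0 * used_by_pod.get(pod.metadata.name, 0.0) / allocatable[node]
        print(f"{pod.metadata.name} on {node}: {pct:.0f}% of allocatable memory")


if __name__ == "__main__":
    main()
```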

The Impact and Mitigation Steps

The primary impact of this increased memory consumption is that other pods running on the same node can get OOMKilled and enter CrashLoopBackOff. Since most of the memory is being used by a single pod, there isn't enough left for other processes to function correctly. While we haven't seen a node go down or reboot, the risk is still there, and it's something we need to address. The workaround has been to bump up the instance type of the nodes, but that obviously isn't a scalable or cost-effective long-term solution: we're essentially paying for resources we didn't need before, which drives up our infrastructure costs and could limit our ability to scale the cluster efficiently. We need a more sustainable fix.
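To keep an eye on the collateral damage, a quick sweep like the one below lists pods whose containers were last terminated with reason OOMKilled. It scans every namespace, so narrow it down if your cluster is large.

```python
# Minimal sketch: list pods whose containers were last terminated as OOMKilled.
from kubernetes import client, config


def main() -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    for pod in core.list_pod_for_all_namespaces().items:
        for status in pod.status.container_statuses or []:
            terminated = status.last_state.terminated if status.last_state else None
            if terminated and terminated.reason == "OOMKilled":
                print(
                    f"{pod.metadata.namespace}/{pod.metadata.name} ({status.name}) "
                    f"OOMKilled at {terminated.finished_at} on {pod.spec.node_name}"
                )


if __name__ == "__main__":
    main()
```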

Digging Deeper: The Attached Profile and Additional Context

We've attached a profile of one of the instance managers with high memory consumption to provide a deeper look into the issue. The profile should show where the memory is going and which processes or components within the instance manager are responsible for the spike, which should help us figure out what to optimize.
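If you want to capture a similar profile yourself, one option is to pull a heap profile from the instance manager, assuming it exposes a standard Go pprof endpoint; whether it does, and on which port, depends on your Longhorn build, so the port-forward target below is purely illustrative.

```python
# Illustrative sketch: download a heap profile from an instance-manager pod,
# ASSUMING it exposes a standard Go net/http/pprof endpoint. The port (6060)
# and the port-forward step are assumptions, not documented Longhorn behavior:
#   kubectl -n longhorn-system port-forward pod/<instance-manager-pod> 6060:6060
import urllib.request

PPROF_URL = "http://localhost:6060/debug/pprof/heap"  # assumed endpoint


def main() -> None:
    with urllib.request.urlopen(PPROF_URL, timeout=30) as resp:
        data = resp.read()
    with open("instance-manager-heap.pb.gz", "wb") as out:
        out.write(data)
    print(f"wrote {len(data)} bytes; inspect with `go tool pprof instance-manager-heap.pb.gz`")


if __name__ == "__main__":
    main()
```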

We are still working on a long-term solution, so any help or input is greatly appreciated. If you've encountered a similar problem or have any ideas on how to address it, please feel free to share your insights. Together, we can work towards a solution to keep our Longhorn deployments running smoothly!

Conclusion

In summary, the upgrade to Longhorn v1.6.x has introduced a significant memory usage issue for instance managers, leading to higher resource consumption and potential performance bottlenecks. This issue impacts the stability and efficiency of our Kubernetes cluster and requires a solution. By providing detailed information about the environment, the steps to reproduce the issue, and the attached profile, we hope to find a resolution and help the community.

We hope to update this as soon as we have a solution or additional information.