
VictoriaMetrics k8s stack setup gotchas


Use case

We want to monitor our Kubernetes cluster using the VictoriaMetrics k8s stack in its single-node (vmsingle) mode. The stack includes the VictoriaMetrics Operator, Grafana dashboards, ServiceScrapes and VMRules; for a full overview of the stack see the official documentation.

Before diving into the setup I want to point out the difference between monitoring and observability. Monitoring is about collecting metrics and logs from the system, while observability is about understanding the system’s behavior and performance at any given state, even unknown states1. While the former is mainly reactive and focused on alerting for known/past issues, the latter aims to provide insights into complex systems, e.g. distributed systems, allowing us to proactively identify root causes. In this case, we are focusing on monitoring, as a complement to a more comprehensive observability solution.

Moreover, I would like to stress that we have been using the stack in production and are greatly satisfied with it: it is easy to set up and maintain, and it provides a good overview of the cluster’s health and performance while being cost-effective.
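For reference, the stack is typically installed with Helm. A minimal sketch, assuming the chart repository URL from the VictoriaMetrics Helm charts documentation and using placeholder release and namespace names:

# 1. add the VictoriaMetrics charts repository (aliased as "vm", as used later in this post)
$ helm repo add vm https://victoriametrics.github.io/helm-charts/
$ helm repo update

# 2. install the stack with your own overrides ("vmks" and "monitoring" are placeholders)
$ helm install vmks vm/victoria-metrics-k8s-stack -f values.yaml -n monitoring --create-namespace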

Scraping is done explicitly

Whether you are migrating from Prometheus or using other components that rely on prometheus annotations for auto-discovery (such as prometheus.io/port, prometheus.io/scrape, prometheus.io/path, etc.), you will need to explicitly define the VMPodScrape or VMServiceScrape resources for each service/workload you want to scrape. The official documentation provides examples on how to mimic/migrate the auto-discovery annotations by creating the corresponding VMPodScrape or VMServiceScrape resources.
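For example, a minimal VMServiceScrape for a hypothetical service labeled app: my-app that exposes metrics on a port named http could look like the following sketch (name, namespace and labels are placeholders):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: my-app            # placeholder
  namespace: my-namespace # placeholder
spec:
  selector:
    matchLabels:
      app: my-app         # must match the labels of the target Service
  endpoints:
    - port: http          # name of the Service port exposing the metrics endpoint
      path: /metrics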

Disable control-plane metrics collection and alerting for cloud managed clusters

When using cloud-managed Kubernetes clusters (such as GKE, EKS or AKS), you may want to disable the collection of control-plane metrics and the corresponding alerting rules: the control plane is operated by the provider, and components such as etcd, the scheduler and the controller manager are usually not reachable for scraping, so the default targets and alerts only add noise. This can be achieved by overriding the default Helm values of the victoriametrics-k8s-stack chart:

kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
kubeScheduler:
  enabled: false
kubeApiServer:
  enabled: false

Set up the retention period, storage, CPU and memory resources

By default, the victoriametrics-k8s-stack chart sets a retention period of 1 month for the metrics collected by VictoriaMetrics, together with a storage request of 20Gi. Depending on your ingestion rate and the number of metrics collected, you may want to adjust these values to fit your needs. In our case, we noticed that the persistent volume was filling up quickly, which caused the storage to stop accepting new data. Similarly, you will want to measure and adjust the CPU and memory resources. To modify the retention period, storage request and resources, you can use the following values in your values.yaml file:

vmsingle:
  spec:
    # default is 30 days
    retentionPeriod: "7d" # depends on your use case
    resources:
      limits:
        memory: 5Gi # depends on your use case
      requests:
        memory: 1Gi # depends on your use case
    storage:
      resources:
        requests:
          storage: 20Gi # depends on your use case

Upgrade the stack regularly and watch out for CRD changes

I think the stack is quite stable, but the pace of development is fast, so I recommend upgrading the stack regularly to benefit from the latest features and bug fixes. However, be aware that some releases change the custom resource definitions (CRDs), which Helm does not upgrade automatically. To check for differences between the CRDs currently installed in your cluster and the ones shipped with a given chart version, you can use the following commands:

# 1. check the changes in CRD
$ helm show crds vm/victoria-metrics-k8s-stack --version [YOUR_CHART_VERSION] | kubectl diff -f -

# 2. apply the changes (update CRD)
$ helm show crds vm/victoria-metrics-k8s-stack --version [YOUR_CHART_VERSION] | kubectl apply -f - --server-side

Configure VMAlertmanager using a secret

The victoriametrics-k8s-stack chart comes with a bunch of dashboards, recording and alerting rules, but to get the most out of it you will need to configure VMAlertmanager to send alerts to your preferred notification channels. The recommended way to do this is to create a secret with the alerting configuration and then reference it in the values.yaml file. The secret should contain the configuration under an alertmanager.yaml key, in a format that is essentially the same as the one used by Prometheus Alertmanager. Here is an example of how to reference the secret in the values.yaml file:

alertmanager:
  spec:
    configSecret: <name_of_your_secret>
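
The referenced secret itself could look like the following minimal sketch, where the secret name, namespace and receiver are placeholders and the actual configuration depends on your notification channels:

apiVersion: v1
kind: Secret
metadata:
  name: vmalertmanager-config # placeholder, referenced as configSecret above
  namespace: monitoring       # placeholder, use the namespace of the stack
stringData:
  alertmanager.yaml: |
    route:
      receiver: default
    receivers:
      - name: default
        # add your notification channel here, e.g. slack_configs or email_configs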

Given the reactive nature of monitoring, you will likely need to tweak the routing and inhibition rules to avoid alert fatigue and ensure that you are only notified of the most relevant alerts.
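
For instance, here is a sketch of a routing tree with grouping plus an inhibition rule that mutes warning alerts while a critical alert with the same alertname and namespace is firing (receiver names are placeholders, and the matchers syntax assumes a recent Alertmanager release):

route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - receiver: oncall # placeholder receiver for urgent notifications
      matchers:
        - severity = "critical"
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: [alertname, namespace]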

Mind the relabelConfigs option of the vmScrape resource that scrapes kubelet metrics

By default the victoriametrics-k8s-stack chart creates a scrape resource for the kubelet (configured under kubelet.vmScrape in the values) and adds a relabelConfigs rule that maps all node labels onto the collected metrics. This is useful to have the node labels available on the metrics, but it can also add a large number of labels, increasing the cardinality of the series and potentially leading to performance issues. If you are not interested in having all node labels in the metrics, you can disable this rule by overriding the values.yaml file:

kubelet:
  vmScrape:
    spec:
      relabelConfigs:
        # Uncomment if you want to map all node labels to the collected metrics
        # - action: labelmap
        #   regex: __meta_kubernetes_node_label_(.+)
        - sourceLabels: [__metrics_path__]
          targetLabel: metrics_path
        - targetLabel: job
          replacement: kubelet

Conclusion

In this post, we have covered some of the gotchas and best practices when setting up the VictoriaMetrics k8s stack. The stack is easy to set up and maintain, and it provides a good overview of the cluster’s health and performance while being cost-effective. However, it is important to keep in mind the difference between monitoring and observability, and to configure the stack according to your needs. Upgrading the stack regularly and watching out for CRD changes is also recommended to benefit from the latest features and bug fixes.

  1. Honeycomb’s O’Reilly book Observability Engineering: Achieving Production Excellence by Charity Majors, Liz Fong-Jones, and George Miranda ↩︎

This post is licensed under CC BY 4.0 by the author.