I wanted to use HPA on my home Raspberry Pi Kubernetes cluster, so I installed metrics-server. It failed to start, so this post walks through the troubleshooting.
Environment
- Kubernetes v1.30.0
- metrics-server v0.7.1
Installation
Following the documented procedure, run the install command:

```
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
Checking startup
Deployment
READY stays stuck at 0/1...
```
k get deployment metrics-server -n kube-system
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   0/1     1            0           10h
```
Pod
Only the metrics-server pod never becomes READY:
```
k get po -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
coredns-7db6d8ff4d-67gfj               1/1     Running   0          8d
coredns-7db6d8ff4d-xlznj               1/1     Running   0          8d
etcd-k8s-master01                      1/1     Running   0          8d
kube-apiserver-k8s-master01            1/1     Running   0          8d
kube-controller-manager-k8s-master01   1/1     Running   0          8d
kube-proxy-dww28                       1/1     Running   0          8d
kube-proxy-lb9h2                       1/1     Running   0          8d
kube-proxy-t9lsn                       1/1     Running   0          8d
kube-scheduler-k8s-master01            1/1     Running   0          8d
metrics-server-7ffbc6d68-pqtdt         0/1     Running   0          10h
```
Describe Pods
```
kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
```

The readiness probe appears to be failing with an HTTP 500:
```
k describe po metrics-server-7ffbc6d68-pqtdt -n kube-system
Name:                 metrics-server-7ffbc6d68-pqtdt
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      metrics-server
Node:                 k8s-worker01/192.168.1.111
Start Time:           Wed, 29 May 2024 23:51:51 +0900
Labels:               k8s-app=metrics-server
                      pod-template-hash=7ffbc6d68
Annotations:          <none>
Status:               Running
IP:                   10.244.1.73
IPs:
  IP:  10.244.1.73
Controlled By:  ReplicaSet/metrics-server-7ffbc6d68
Containers:
  metrics-server:
    Container ID:  containerd://603b5f91968f0afb410fa3a49786c563bd26f87de59c11ce1d4822e743a7fa29
    Image:         registry.k8s.io/metrics-server/metrics-server:v0.7.1
    Image ID:      registry.k8s.io/metrics-server/metrics-server@sha256:db3800085a0957083930c3932b17580eec652cfb6156a05c0f79c7543e80d17a
    Port:          10250/TCP
    Host Port:     0/TCP
    Args:
      --cert-dir=/tmp
      --secure-port=10250
      --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
      --kubelet-use-node-status-port
      --metric-resolution=15s
    State:          Running
      Started:      Wed, 29 May 2024 23:52:00 +0900
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  200Mi
    Liveness:   http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /tmp from tmp-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6szcf (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  tmp-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-6szcf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4m41s (x4114 over 10h)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
```
Logs
```
k logs metrics-server-7ffbc6d68-6srfk -n kube-system
I0530 03:08:07.651554 1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0530 03:08:08.554701 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0530 03:08:08.686371 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0530 03:08:08.686459 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0530 03:08:08.686466 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0530 03:08:08.686518 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0530 03:08:08.686517 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0530 03:08:08.687032 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0530 03:08:08.689211 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
E0530 03:08:08.689974 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.1.101:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.1.101 because it doesn't contain any IP SANs" node="k8s-master01"
I0530 03:08:08.690832 1 secure_serving.go:213] Serving securely on [::]:10250
I0530 03:08:08.691791 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
E0530 03:08:08.692009 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.1.112:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.1.112 because it doesn't contain any IP SANs" node="k8s-worker02"
E0530 03:08:08.701807 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.1.111:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.1.111 because it doesn't contain any IP SANs" node="k8s-worker01"
I0530 03:08:08.786953 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0530 03:08:08.787115 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0530 03:08:08.788043 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E0530 03:08:23.685957 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.1.112:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.1.112 because it doesn't contain any IP SANs" node="k8s-worker02"
E0530 03:08:23.692988 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.1.111:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.1.111 because it doesn't contain any IP SANs" node="k8s-worker01"
E0530 03:08:23.695203 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.1.101:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.1.101 because it doesn't contain any IP SANs" node="k8s-master01"
I0530 03:08:36.493054 1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
```
The logs make the cause clear: the kubelet serving certificates contain no IP SANs (Subject Alternative Names), so metrics-server cannot verify TLS when scraping the nodes. With no successful scrapes, the "metric-storage-ready" probe fails with "no metrics to serve", which is why the readiness probe returns 500. Since this is a home lab, I'll disable TLS verification.
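The failure mode can be reproduced locally with openssl: a certificate that carries only a CN and no SAN extension cannot be validated against an IP address. This is a sketch with a hypothetical CN and file names; against the real cluster you would instead inspect the kubelet's serving cert, e.g. with `openssl s_client -connect <node-ip>:10250`.

```shell
# Create a self-signed cert with only a CN and no SAN extension, mimicking a
# kubelet serving cert that was never signed by the cluster CA (demo only)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/kubelet-demo.key -out /tmp/kubelet-demo.crt \
  -subj "/CN=k8s-worker01"

# The SAN extension is absent, so a client cannot match this cert to an IP;
# the grep finds nothing and the fallback message is printed
openssl x509 -in /tmp/kubelet-demo.crt -noout -text \
  | grep "Subject Alternative Name" || echo "no SANs in certificate"
```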
Disabling TLS verification
Since the kubelet serving certificates in my home Raspberry Pi cluster are not signed by the cluster CA, I'll pass --kubelet-insecure-tls in the container args to skip certificate verification. (This is fine for a home lab; for production the kubelets should serve CA-signed certificates instead.)
```
k edit deploy metrics-server -n kube-system
```

```yaml
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=10250
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        - --kubelet-insecure-tls  # ← add this line
```
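Editing the live Deployment works, but the change is lost the next time components.yaml is re-applied. A more repeatable alternative (a sketch I have not run against this cluster) is to manage the flag with a kustomization that patches the upstream manifest:

```yaml
# kustomization.yaml — apply with: kubectl apply -k .
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
patches:
  - target:
      kind: Deployment
      name: metrics-server
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --kubelet-insecure-tls
```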
Verifying again
Pods
```
k get po -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
coredns-7db6d8ff4d-67gfj               1/1     Running   0          12d
coredns-7db6d8ff4d-xlznj               1/1     Running   0          12d
etcd-k8s-master01                      1/1     Running   0          12d
kube-apiserver-k8s-master01            1/1     Running   0          12d
kube-controller-manager-k8s-master01   1/1     Running   0          12d
kube-proxy-dww28                       1/1     Running   0          12d
kube-proxy-lb9h2                       1/1     Running   0          12d
kube-proxy-t9lsn                       1/1     Running   0          12d
kube-scheduler-k8s-master01            1/1     Running   0          12d
metrics-server-d994c478f-8f7hh         1/1     Running   0          23m
```
Deployments
```
k get deploy -n kube-system
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
coredns          2/2     2            2           12d
metrics-server   1/1     1            1           3d10h
```
And with that, metrics-server is up and running!
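With resource metrics now being served (`kubectl top nodes` and `kubectl top pods` should return data), the original goal is unblocked: an HPA can consume these metrics. A minimal sketch, where the target Deployment `demo-app` is hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app       # hypothetical Deployment to scale
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # scale out above 50% average CPU
```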