Troubleshoot Cartographer Conventions

This topic describes how you can troubleshoot Cartographer Conventions.

No server in the cluster

Symptoms

When a PodIntent is submitted, no convention is applied.

Cause

When no convention servers (ClusterPodConvention) are deployed in the cluster, or none of the existing convention servers apply any conventions, the PodIntent is not mutated.

Solution

Deploy a convention server (ClusterPodConvention) in the cluster.

Server with wrong certificates configured

Symptoms

When a PodIntent is submitted, the conventions are not applied.

The convention-controller logs report a failed to get CABundle error as follows:

{
"level": "error",
"ts": 1638222343.6839523,
"logger": "controllers.PodIntent.PodIntent.ResolveConventions",
"msg": "failed to get CABundle",
"ClusterPodConvention": "base-convention",
"error": "unable to find valid certificaterequests for certificate \"convention-template/webhook-certificate\"",
"stacktrace": "reflect.Value.Call\n\treflect/value.go:339\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*SyncReconciler).sync\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:287\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*SyncReconciler).Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:276\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.Sequence.Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:815\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*ParentReconciler).reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:146\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*ParentReconciler).Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:120\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.10.3/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.10.3/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.10.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.10.3/pkg/internal/controller/controller.go:227"
}

Cause

The convention server (ClusterPodConvention) is configured with the wrong certificates. The convention-controller cannot determine which CA bundle to use for requests to the server.

Solution

Ensure that the convention server (ClusterPodConvention) is configured with the correct certificates. To do so, verify that the conventions.carto.run/inject-ca-from annotation is set to the Certificate that the convention server uses.

Important
Do not set the conventions.carto.run/inject-ca-from annotation if no certificate is used.
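
As a minimal sketch, a ClusterPodConvention that references its certificate might look like the following. The resource name, Service name, and namespace are taken from the log output earlier in this section, and the annotation value is the Certificate named in the error message. Treat all of them as assumptions and adjust them to match your deployment.

```yaml
apiVersion: conventions.carto.run/v1alpha1
kind: ClusterPodConvention
metadata:
  name: base-convention
  annotations:
    # <namespace>/<name> of the Certificate that secures the convention server.
    # Assumed here from the error message; use your own Certificate.
    conventions.carto.run/inject-ca-from: convention-template/webhook-certificate
spec:
  webhook:
    clientConfig:
      service:
        # Service that fronts the convention server Deployment.
        name: webhook
        namespace: convention-template
```

If the annotation names a Certificate that does not exist or has not been issued, the controller has no CA bundle to inject, which matches the failed to get CABundle symptom.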

Server fails when processing a request

Symptoms

When a PodIntent is submitted, the convention is not applied.

The convention-controller logs report a failed to apply convention error as follows:

{"level":"error","ts":1638205387.8813763,"logger":"controllers.PodIntent.PodIntent.ApplyConventions","msg":"failed to apply convention","Convention":{"Name":"base-convention","Selectors":null,"Priority":"Normal","ClientConfig":{"service":{"namespace":"convention-template","name":"webhook","port":443},"caBundle":"..."}},"error":"Post \"https://webhook.convention-template.svc:443/?timeout=30s\": EOF","stacktrace":"reflect.Value.call\n\treflect/value.go:543\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*SyncReconciler).sync\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:287\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*SyncReconciler).Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:276\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.Sequence.Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:815\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*ParentReconciler).reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:146\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*ParentReconciler).Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:120\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227"
}

The PodIntent status message is updated with failed to apply convention from source base-convention: Post "https://webhook.convention-template.svc:443/?timeout=30s": EOF.

Cause

An unhandled error occurs in the convention server when it processes a request.

Solution

Identify the error and deploy a fixed version of the convention server:

Inspect the convention server logs to identify the cause of the error. Retrieve the logs by running:
```
kubectl -n convention-template logs deployment/webhook
```
Where:
- The convention server was deployed as a Deployment.
- webhook is the name of the convention server Deployment.
- convention-template is the namespace where the convention server is deployed.
After you identify the cause, deploy a fixed version of the convention server.

The new deployment is not applied to existing PodIntents. It is applied only to new PodIntent resources. To apply the new deployment to an existing PodIntent, update the PodIntent so that the reconciler re-applies the conventions if the PodIntent matches the criteria.

Connection refused because of an unsecured connection

Symptoms

When a PodIntent is submitted, the convention is not applied.

The convention-controller logs report a connection-refused error as follows:

{"level":"error","ts":1638202791.5734537,"logger":"controllers.PodIntent.PodIntent.ApplyConventions","msg":"failed to apply convention","Convention":{"Name":"base-convention","Selectors":null,"Priority":"Normal","ClientConfig":{"service":{"namespace":"convention-template","name":"webhook","port":443},"caBundle":"..."}},"error":"Post \"https://webhook.convention-template.svc:443/?timeout=30s\": dial tcp 10.56.13.206:443: connect: connection refused","stacktrace":"reflect.Value.call\n\treflect/value.go:543\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*SyncReconciler).sync\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:287\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*SyncReconciler).Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:276\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.Sequence.Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:815\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*ParentReconciler).reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:146\ngithub.com/vmware-labs/reconciler-runtime/reconcilers.(*ParentReconciler).Reconcile\n\tgithub.com/vmware-labs/reconciler-runtime@v0.3.0/reconcilers/reconcilers.go:120\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227"}

The convention server fails to start, and its Pod events report server gave HTTP response to HTTPS client.

Check the convention server events by running:

kubectl -n convention-template describe pod webhook-594d75d69b-4w4s8

Where:

- The convention server was deployed as a Deployment.
- webhook-594d75d69b-4w4s8 is the name of the convention server Pod.
- convention-template is the namespace where the convention server is deployed.

For example:

$ kubectl -n convention-template describe pod webhook-594d75d69b-4w4s8

Name:         webhook-594d75d69b-4w4s8
Namespace:    convention-template
...
Containers:
  webhook:
...
Events:
Type     Reason     Age                   From               Message
----     ------     ----                  ----               -------
Normal   Scheduled  14m                   default-scheduler  Successfully assigned convention-template/webhook-594d75d69b-4w4s8 to pool
Normal   Pulling    14m                   kubelet            Pulling image "awesome-repo/awesome-user/awesome-convention-..."
Normal   Pulled     14m                   kubelet            Successfully pulled image "awesome-repo/awesome-user/awesome-convention..." in 1.06032653s
Normal   Created    13m (x2 over 14m)     kubelet            Created container webhook
Normal   Started    13m (x2 over 14m)     kubelet            Started container webhook
Warning  Unhealthy  13m (x9 over 14m)     kubelet            Readiness probe failed: Get "https://10.52.2.74:8443/healthz": http: server gave HTTP response to HTTPS client
Warning  Unhealthy  13m (x6 over 14m)     kubelet            Liveness probe failed: Get "https://10.52.2.74:8443/healthz": http: server gave HTTP response to HTTPS client
Normal   Pulled     9m13s (x6 over 13m)   kubelet            Container image "awesome-repo/awesome-user/awesome-convention" already present on machine
Warning  BackOff    4m22s (x32 over 11m)  kubelet            Back-off restarting failed container

Cause

When a convention server is provided without Transport Layer Security (TLS) but the Deployment is configured to use TLS, the Pod never becomes ready: the liveness and readiness probes use HTTPS, fail against the HTTP server, and Kubernetes restarts the container repeatedly.

Solution

Deploy the convention server with TLS and create a matching ClusterPodConvention resource:

Deploy a convention server with TLS enabled.
Create a ClusterPodConvention resource for the convention server, with the conventions.carto.run/inject-ca-from annotation pointing to the deployed Certificate resource.
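
The Certificate that the annotation points to can be issued by cert-manager. The following is a sketch only, assuming cert-manager is installed in the cluster. The Issuer name selfsigned-issuer and the Secret name webhook-cert are hypothetical, and the DNS names are derived from the Service URL shown in the log output earlier in this section.

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer          # hypothetical name
  namespace: convention-template
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: webhook-certificate
  namespace: convention-template
spec:
  # Hypothetical Secret name; the convention server Deployment must mount
  # this Secret and serve HTTPS with the key pair it contains.
  secretName: webhook-cert
  dnsNames:
    - webhook.convention-template.svc
    - webhook.convention-template.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: selfsigned-issuer
```

With these names, the ClusterPodConvention would set conventions.carto.run/inject-ca-from to convention-template/webhook-certificate.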

A self-signed certificate authority (CA) is not propagated to Convention Service

Symptoms

The self-signed CA for a registry is not propagated to the Convention Service.

Cause

When you provide the self-signed CA for a registry through convention-controller.ca_cert_data, the self-signed CA cannot be propagated to the Convention Service.

Solution

Define the CA by using the available .shared.ca_cert_data top-level key to supply the CA to the Convention Service.
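
As a sketch, the corresponding tap-values.yaml fragment might look like the following. The certificate body is a placeholder, not a real certificate; paste the PEM-encoded CA for your registry in its place.

```yaml
shared:
  # PEM-encoded CA certificate for the registry (placeholder content).
  ca_cert_data: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```

After changing the value, update the Tanzu Application Platform installation with the modified tap-values.yaml so that the change reaches the Convention Service.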

No imagePullSecrets configured

Symptoms

When a PodIntent is submitted:

No convention is applied.
You see an unauthorized to access repository or fetching metadata for Images failed error when you inspect the workload.

Cause

The errors appear when a workload is created in a developer namespace where imagePullSecrets are not defined on the default serviceAccount or on the preferred serviceAccount.

Solution

Add the imagePullSecrets name to the default serviceAccount or the preferred serviceAccount.

For example:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: my-workload-namespace
imagePullSecrets:
  - name: registry-credentials # ensure this secret is defined
secrets:
  - name: registry-credentials
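
The referenced Secret must exist in the same namespace as the serviceAccount. One way to define it, shown here as a sketch, is a Secret of type kubernetes.io/dockerconfigjson. The registry host, user name, and password are placeholders, and your installation might create this Secret by other means.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  namespace: my-workload-namespace
type: kubernetes.io/dockerconfigjson
stringData:
  # Placeholder values: replace the registry host and credentials with your own.
  .dockerconfigjson: |
    {"auths": {"my.image.registry.com": {"username": "REGISTRY-USERNAME", "password": "REGISTRY-PASSWORD"}}}
```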

OOMKilled convention controller

Symptoms

While processing workloads with a large SBOM, the Cartographer Conventions controller manager pod can fail with the status CrashLoopBackOff or OOMKilled.

To work around this problem, increase the memory limit to 512Mi.

Symptom example:

NAME                                                          READY   STATUS             RESTARTS          AGE
cartographer-conventions-controller-manager-ff4cdf59d-5nzl5   0/1     CrashLoopBackOff   1292 (109s ago)   5d3h

The following is an example controller pod status:

containerStatuses:
  - containerID: containerd://b7b7159a9e00ef726944d642a1b649108bba610b34d8d10f9b5270ea25d3db94
    image: sha256:9827e8e5b30d47c9373a1907dc5e7e15a76d2a4581e803eb6f2cb24e3a9ea62e
    imageID: my.image.registry.com/tanzu-application-platform/tap-packages@sha256:3cd1ae92f534ff935fbaf992b8308aa3dac3d1b6cbc8cf8a856451c8c92540f66
    lastState:
      terminated:
        containerID: containerd://b7b7159a9e00ef726944d642a1b649108bba610b34d8d10f9b5270ea25d3db94
        exitCode: 137
        finishedAt: "2023-11-06T21:02:56Z"
        reason: OOMKilled
        startedAt: "2023-11-06T21:02:10Z"
    name: manager

Cause

This error usually occurs when a workload image, built by the supply chain, contains a large SBOM. The default resource limit set during installation might not be large enough to process the pod conventions, which can cause the controller pod to crash.

Solution

Increase the memory limit through tap-values.yaml:

To increase the memory limit for the Cartographer Conventions controller manager, see Increase the memory limit for the convention server.
To increase the memory limit for convention webhook servers, such as app-live-view-conventions, spring-boot-webhook, and developer-conventions/webhook, see Increase the memory limit for convention webhook servers.

Increase the memory limit for the convention server

To increase the memory limit for the convention server:

Add the desired resource limit under the cartographer_conventions key in tap-values.yaml:
```
cartographer_conventions:
  resources:
    limits:
      memory: 512Mi
```

Update Tanzu Application Platform by running:

tanzu package installed update tap -p tap.tanzu.vmware.com -v 1.11.0  \
--values-file tap-values.yaml -n tap-install

For information about the package customization, see Customize your package installation.

Increase the memory limit for convention webhook servers

You might need to increase the memory limit for the following convention webhook servers:

app-live-view-conventions
spring-boot-webhook
developer-conventions/webhook

Use this procedure to increase the memory limit:

Create a Secret that contains a ytt overlay for each convention webhook server that you want to patch:

apiVersion: v1
kind: Secret
metadata:
  name: patch-app-live-view-conventions
  namespace: tap-install
stringData:
  patch-conventions-controller.yaml: |
    #@ load("@ytt:overlay", "overlay")

    #@overlay/match by=overlay.subset({"kind":"Deployment", "metadata":{"name":"appliveview-webhook", "namespace": "app-live-view-conventions"}})
    ---
    spec:
      template:
        spec:
          containers:
            #@overlay/match by=overlay.subset({"name": "webhook"})
            - name: webhook
              resources:
                limits:
                  memory: 512Mi
---
apiVersion: v1
kind: Secret
metadata:
  name: patch-spring-boot-conventions
  namespace: tap-install
stringData:
  patch-conventions-controller.yaml: |
    #@ load("@ytt:overlay", "overlay")

    #@overlay/match by=overlay.subset({"kind":"Deployment", "metadata":{"name":"spring-boot-webhook", "namespace": "spring-boot-convention"}})
    ---
    spec:
      template:
        spec:
          containers:
            #@overlay/match by=overlay.subset({"name": "webhook"})
            - name: webhook
              resources:
                limits:
                  memory: 512Mi
---
apiVersion: v1
kind: Secret
metadata:
  name: patch-developer-conventions
  namespace: tap-install
stringData:
  patch-conventions-controller.yaml: |
    #@ load("@ytt:overlay", "overlay")

    #@overlay/match by=overlay.subset({"kind":"Deployment", "metadata":{"name":"webhook", "namespace": "developer-conventions"}})
    ---
    spec:
      template:
        spec:
          containers:
            #@overlay/match by=overlay.subset({"name": "webhook"})
            - name: webhook
              resources:
                limits:
                  memory: 512Mi

Update tap-values.yaml to include a package_overlays field as follows:

package_overlays:
- name: appliveview-conventions
  secrets:
  - name: patch-app-live-view-conventions
- name: spring-boot-conventions
  secrets:
  - name: patch-spring-boot-conventions
- name: developer-conventions
  secrets:
  - name: patch-developer-conventions

Update Tanzu Application Platform by running:

tanzu package installed update tap -p tap.tanzu.vmware.com -v 1.11.0  \
--values-file tap-values.yaml -n tap-install

For information about the package customization, see Customize your package installation.

Failed to call webhook - x509: certificate signed by unknown authority

Symptoms

An error similar to the following appears when processing a workload with a config-provider step:

message: >-
  unable to apply object [workload-name] for resource [config-provider] in supply chain
  [source-test-scan-to-url]: create: Internal error occurred: failed calling webhook
  "podintents.conventions.carto.run": failed to call webhook: Post
  "https://cartographer-conventions-webhook-service.cartographer-system.svc:443/mutate-conventions-carto-run-v1alpha1-podintent?timeout=10s":
  x509: certificate signed by unknown authority

Cause

The CA certificate used to secure TLS communications with the Cartographer Conventions webhook pod might be out of sync between the running webhook pod and the CA bundle configured in the MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources.

Solution

Force cert-manager to re-create the certificates and ensure that they are in sync across the different places they are used:

Delete the Cartographer Conventions webhook configurations by running:

kubectl delete mutatingwebhookconfiguration cartographer-conventions-mutating-webhook-configuration
kubectl delete validatingwebhookconfiguration cartographer-conventions-validating-webhook-configuration

The two webhook configurations are re-created, but their caBundle fields might be empty. If the caBundle fields are empty then cert-manager might be failing. If cert-manager is failing, force cert-manager deployments to restart by running:
```
kubectl rollout restart deployment cert-manager -n cert-manager
kubectl rollout restart deployment cert-manager-cainjector -n cert-manager
kubectl rollout restart deployment cert-manager-webhook -n cert-manager
```

Force the Cartographer Conventions deployment to restart and detect any new certificates by running:

kubectl rollout restart deployment cartographer-conventions-controller-manager -n cartographer-system

Re-create the workload.