
Troubleshooting

The dashboard is empty

Check your access rights

Make sure you have the access rights for the dashboard you are trying to view. Your access rights are derived from the group you belong to.

  • adminsysteme: has access to all dashboards
  • adminsecurite: has access to dashboards in the AUDIT and AUDIT-IAAS folders
  • admintrait, admindata, adminappli: have access to the dashboards Logs (for applications), Rapport Ingestion, and Rapport Traitement

Check that clickhouse is running

With Kubectl

kubectl get pod -n kosmos-logs
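
To narrow the output to the ClickHouse pods only, a simple filter works (pod names start with chi-clickhouse, as in the restore procedure further down this page):

```shell
# Show only the ClickHouse pods; their STATUS column should read Running
kubectl get pod -n kosmos-logs | grep chi-clickhouse
```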

With Rancher

  • Log in to Rancher
  • Go to the Workload > Pods menu on the left
  • Select namespace kosmos-logs
  • Check the status of the clickhouse pods

With Grafana

  • Go to folder KUBE-STATE-METRICS
  • Select dashboard Kubernetes / Compute Resources / Namespace (Workloads)
  • Select namespace kosmos-logs
  • Check that clickhouse pods are running

Check that the logs-oidc-proxy is running

With Kubectl

kubectl get pod -n kosmos-logs
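
Beyond the pod status, the proxy's recent logs usually show whether it is serving requests. A quick check, using the deployment name from the restart command later in this page:

```shell
# Check the proxy pod status, then tail its recent log output
kubectl get pod -n kosmos-logs | grep logs-oidc-proxy
kubectl logs -n kosmos-logs deploy/clickhouse-proxy-logs-oidc-proxy --tail=50
```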

Newer logs don't show up

Check that clickhouse PVC has free space left

Look for the dashboard Kubernetes / Persistent Volumes in folder Home / Dashboards / KUBE-STATE-METRICS. Then select namespace kosmos-logs and the PVC starting with storage-vc-template-chi-clickhouse-cluster-cluster0-x-x-x. If the PVC is full, increase its size with this command (update the PVC name):

kubectl patch pvc storage-vc-template-chi-clickhouse-cluster-cluster0-0-0-0 -n kosmos-logs -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'
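
Before patching, you can confirm how much space ClickHouse itself sees. This sketch assumes the standard clickhouse-client binary is available inside the pod; system.disks is a built-in ClickHouse system table:

```shell
# Ask ClickHouse how much disk space is left on its data volume
kubectl -n kosmos-logs exec chi-clickhouse-cluster-cluster0-0-0-0 -- \
  clickhouse-client --query "SELECT name, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total FROM system.disks"
```

Note that the patch only takes effect if the StorageClass backing the PVC has allowVolumeExpansion enabled.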

Restart vector

When clickhouse is unreachable, vector first logs repeatedly that it cannot store logs, then it gives up. Once clickhouse comes back, restarting vector makes it resume sending logs to clickhouse.

kubectl rollout restart deploy vector -n kosmos-logs
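
To confirm vector picked up again after the restart:

```shell
# Wait for the new pods to be ready, then check the latest log lines for errors
kubectl rollout status deploy vector -n kosmos-logs
kubectl logs -n kosmos-logs deploy/vector --tail=20
```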

Restart clickhouse-proxy-logs-oidc-proxy

When clickhouse has been unreachable for some time, the proxy seems to cache an error response and keeps returning it, even after clickhouse comes back online. Restarting the proxy makes it resume forwarding requests to clickhouse.

kubectl rollout restart deploy clickhouse-proxy-logs-oidc-proxy -n kosmos-logs

Error Code 231

If the main message is Code: 231. DB::Exception: Suspiciously many (112 parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 100., you need to try a data restore.

Force Data Restore

attention

This procedure may result in potential data loss.

  • Connect to the chi-clickhouse-cluster-cluster0-0-0-0 pod
  • Create the flag file to force the restore
su -c 'touch /var/lib/clickhouse/flags/force_restore_data' clickhouse
  • Delete the pod to restart the service
kubectl -n kosmos-logs delete pod chi-clickhouse-cluster-cluster0-0-0-0
  • After restarting, the Kosmos Logs dashboard should have access to the logs again.
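
The steps above can be run directly with kubectl, without opening an interactive shell in the pod:

```shell
# Create the force_restore_data flag as the clickhouse user, then recycle the pod
kubectl -n kosmos-logs exec chi-clickhouse-cluster-cluster0-0-0-0 -- \
  su -c 'touch /var/lib/clickhouse/flags/force_restore_data' clickhouse
kubectl -n kosmos-logs delete pod chi-clickhouse-cluster-cluster0-0-0-0
```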

Table is in readonly mode

If logs show DB::Exception: Table is in readonly mode followed by logs like

Checking if anyone has a part all_1759749_1759749_0 and Found the missing part all_1759749_1759749_0

There might be an issue with the metadata stored in the keeper.

Connect to clickhouse and check the replication queue.

SELECT *
FROM system.replication_queue
WHERE (type = 'GET_PART')
LIMIT 1
FORMAT Vertical

The partition_id should match the part name from the logs, and num_tries may exceed 8.

Recreate metadata

This procedure does not delete data; it only recreates the keeper metadata.

SELECT concat(database,'.', table) as table, replica_name, zookeeper_path
FROM system.replicas
WHERE is_readonly = '1'

ALTER TABLE TABLE_NAME DROP DETACHED PARTITION ALL SETTINGS allow_drop_detached = 1; -- replace TABLE_NAME
DETACH TABLE TABLE_NAME;
SYSTEM DROP REPLICA 'REPLICA_NAME' FROM ZKPATH 'ZK_PATH'; -- replace REPLICA_NAME and ZK_PATH with data returned from first select

Wait a few seconds.

ATTACH TABLE TABLE_NAME;
SYSTEM RESTORE REPLICA TABLE_NAME;
SYSTEM SYNC REPLICA TABLE_NAME; -- command may timeout due to long time syncing

Periodically check for unsynced tables; absolute_delay must be equal to 0 for every table.

SELECT absolute_delay, table FROM system.replicas;
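
A non-interactive variant of this check, using the pod name from the restore procedure above; it prints nothing once every replica is caught up:

```shell
# List only tables that still lag; empty output means all replicas are synced
kubectl -n kosmos-logs exec chi-clickhouse-cluster-cluster0-0-0-0 -- \
  clickhouse-client --query "SELECT table, absolute_delay FROM system.replicas WHERE absolute_delay > 0"
```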