Troubleshooting
The dashboard is empty
Check your access rights
Make sure you have the access rights for the dashboard you are trying to open. Your access rights are derived from the group you belong to.
- adminsysteme: has access to all dashboards
- adminsecurite: has access to dashboards in the AUDIT & AUDIT-IAAS folders
- admintrait, admindata, adminappli: have access to the Logs (for applications), Rapport Ingestion, and Rapport Traitement dashboards
Check that clickhouse is running
With Kubectl
kubectl get pod -n kosmos-logs
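When the namespace holds many pods, a small filter helps spot unhealthy clickhouse pods in the listing. A minimal sketch; the sample listing below is illustrative, not real cluster output:

```shell
#!/bin/sh
# Keep clickhouse pods whose STATUS column is not "Running".
# Feed it the output of: kubectl get pod -n kosmos-logs
unhealthy_clickhouse_pods() {
  # skip the header line, match pod name on "clickhouse", test column 3 (STATUS)
  awk 'NR > 1 && $1 ~ /clickhouse/ && $3 != "Running" { print $1, $3 }'
}

# illustrative sample of a kubectl pod listing (not real output)
sample='NAME                                      READY   STATUS             RESTARTS   AGE
chi-clickhouse-cluster-cluster0-0-0-0     1/1     Running            0          3d
chi-clickhouse-cluster-cluster0-0-1-0     0/1     CrashLoopBackOff   12         3d
vector-6f7c9d8b4-abcde                    1/1     Running            0          3d'

printf '%s\n' "$sample" | unhealthy_clickhouse_pods
```

In practice, pipe `kubectl get pod -n kosmos-logs` into the function instead of the sample text.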
With Rancher
- Login to rancher
- Go to the Workload > Pod menu on the left
- Select the namespace kosmos-logs
- Check the status of the clickhouse pod
With Grafana
- Go to the folder KUBE-STATE-METRICS
- Select dashboard Kubernetes / Compute Resources / Namespace (Workloads)
- Select namespace kosmos-logs
- Check that clickhouse pods are running
Check that the logs-oidc-proxy is running
With Kubectl
kubectl get pod -n kosmos-logs
Newer logs don't show up
Check that clickhouse PVC has free space left
Look for the dashboard Kubernetes / Persistent Volumes in folder Home / Dashboards / KUBE-STATE-METRICS. Then select the namespace kosmos-logs and the PVC whose name starts with storage-vc-template-chi-clickhouse-cluster-cluster0-x-x-x. If the PVC is full, increase its size with this command (update the PVC name):
kubectl patch pvc storage-vc-template-chi-clickhouse-cluster-cluster0-0-0-0 -n kosmos-logs -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'
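To avoid mistyping the JSON patch, the new size can be computed and the full command printed for review before running it. A minimal sketch, assuming the current capacity is 20Gi; in practice read it first with `kubectl get pvc <name> -n kosmos-logs -o jsonpath='{.status.capacity.storage}'`:

```shell
#!/bin/sh
# Double a PVC size expressed in Gi and print the matching patch command.
pvc="storage-vc-template-chi-clickhouse-cluster-cluster0-0-0-0"
current="20Gi"                     # assumption: current capacity, read via kubectl in practice
new="$(( ${current%Gi} * 2 ))Gi"   # strip the Gi suffix, double, re-append -> 40Gi

# print the command for review instead of executing it
echo "kubectl patch pvc $pvc -n kosmos-logs" \
  "-p '{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$new\"}}}}'"
```

Copy-paste the printed command once it looks right.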
Restart vector
When clickhouse is unreachable, vector first logs frantically about being unable to store logs, then stops. When clickhouse comes back, restarting vector signals it to resume sending logs to clickhouse.
kubectl rollout restart deploy vector -n kosmos-logs
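The restart can be followed by a wait on the rollout to confirm vector came back up. A sketch with a dry-run guard: DRY_RUN defaults to 1 so the commands are only printed for review; set DRY_RUN=0 to execute them against the cluster (the --timeout value is an arbitrary choice):

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default) or execute (DRY_RUN=0) each command.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run kubectl rollout restart deploy vector -n kosmos-logs
run kubectl rollout status deploy vector -n kosmos-logs --timeout=120s
```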
Restart clickhouse-proxy-logs-oidc-proxy
When clickhouse has been unreachable for some time, the proxy seems to cache an error message and keeps returning it even after clickhouse comes back up. Restarting the proxy makes it resume forwarding requests to clickhouse.
kubectl rollout restart deploy clickhouse-proxy-logs-oidc-proxy -n kosmos-logs
Error Code 231
If the main message is Code: 231. DB::Exception: Suspiciously many (112 parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 100.
You need to try a data restore.
Force Data Restore
This procedure may result in data loss.
- Connect to the chi-clickhouse-cluster-cluster0-0-0-0 pod
- Create the flag file to force the restore
su -c 'touch /var/lib/clickhouse/flags/force_restore_data' clickhouse
- Delete the pod to restart the service
kubectl -n kosmos-logs delete pod chi-clickhouse-cluster-cluster0-0-0-0
- After restarting, the Kosmos Logs dashboard should have access to the logs again.
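The steps above can be sketched as one script, again with a dry-run guard (DRY_RUN defaults to 1 so the commands are only printed; set DRY_RUN=0 to actually run them). Note that the printed form drops the inner quoting around the su command, so it is for review only:

```shell
#!/bin/sh
# Force a clickhouse data restore: create the flag file, then delete the pod.
# Reminder: this procedure may result in data loss.
pod="chi-clickhouse-cluster-cluster0-0-0-0"
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# 1. create the flag file inside the clickhouse pod
run kubectl -n kosmos-logs exec "$pod" -- \
  su -c 'touch /var/lib/clickhouse/flags/force_restore_data' clickhouse
# 2. delete the pod so it gets recreated and the service restarts
run kubectl -n kosmos-logs delete pod "$pod"
```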
Table is in readonly mode
If the logs show DB::Exception: Table is in readonly mode followed by lines like
Checking if anyone has a part all_1759749_1759749_0 and Found the missing part all_1759749_1759749_0
there might be an issue with the metadata stored in the keeper.
Connect to clickhouse and check the replication queue.
SELECT *
FROM system.replication_queue
WHERE (type = 'GET_PART')
LIMIT 1
FORMAT Vertical
The partition_id must match the one in the logs, and num_tries may exceed 8.
Recreate metadata
This procedure does not delete data; it only recreates the keeper metadata.
SELECT concat(database,'.', table) as table, replica_name, zookeeper_path
FROM system.replicas
WHERE is_readonly = '1'
ALTER TABLE TABLE_NAME DROP DETACHED PARTITION ALL SETTINGS allow_drop_detached = 1; -- replace TABLE_NAME
DETACH TABLE TABLE_NAME;
SYSTEM DROP REPLICA 'REPLICA_NAME' FROM ZKPATH 'ZK_PATH'; -- replace REPLICA_NAME and ZK_PATH with data returned from first select
Wait a few seconds
ATTACH TABLE TABLE_NAME;
SYSTEM RESTORE REPLICA TABLE_NAME;
SYSTEM SYNC REPLICA TABLE_NAME; -- command may timeout due to long time syncing
Periodically check for unsynced tables; absolute_delay must be equal to 0.
SELECT absolute_delay, table FROM system.replicas;