在一个由3个节点组成的kafka集群里,有2个kafka节点一直重启失败,下面是其日志
[2023-09-15 06:16:30,240] INFO [Log partition=poc_alarm_queue-2, dir=/bitnami/kafka/data] Recovering unflushed segment 0 (kafka.log.Log)
[2023-09-15 06:16:30,240] INFO [Log partition=poc_alarm_queue-2, dir=/bitnami/kafka/data] Loading producer state till offset 0 with message format version 2 (kafka.log.Log)
[2023-09-15 06:16:31,981] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2023-09-15 06:16:31,983] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2023-09-15 06:16:31,987] ERROR [KafkaServer id=1] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
at kafka.server.KafkaServer.shutdown(KafkaServer.scala:602)
at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:54)
at kafka.Kafka$.$anonfun$main$3(Kafka.scala:80)
at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
at java.base/java.lang.Thread.run(Thread.java:834)
[2023-09-15 06:16:31,990] ERROR Halting Kafka. (kafka.server.KafkaServerStartable)
此外,kafka集群连接zk集群,zk集群看起来是正常的,用zkServer.sh status查看3个节点基本ok, 不过也有其客户端连接超时和失败的日志信息:
2023-09-15 06:13:03,026 [myid:3] - WARN [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.4.12:31708, session = 0x3000004f3460053
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2023-09-15 06:13:06,011 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x10000043eed0061
2023-09-15 06:13:06,188 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530054
2023-09-15 06:13:08,440 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x10000043eed0060, timeout of 18000ms exceeded
2023-09-15 06:13:08,440 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x10000043eed0060
2023-09-15 06:13:20,441 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3000004f3460053, timeout of 18000ms exceeded
2023-09-15 06:13:20,441 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460053
2023-09-15 06:13:38,441 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530053, timeout of 18000ms exceeded
2023-09-15 06:13:38,441 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530053
2023-09-15 06:13:50,441 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530054, timeout of 18000ms exceeded
2023-09-15 06:13:50,441 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530054
2023-09-15 06:14:04,748 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3000004f3460054
2023-09-15 06:14:04,791 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460054
2023-09-15 06:14:04,903 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530055
2023-09-15 06:14:28,637 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x10000043eed0062
2023-09-15 06:14:28,807 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3000004f3460055
2023-09-15 06:14:48,441 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530055, timeout of 18000ms exceeded
2023-09-15 06:14:48,441 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530055
2023-09-15 06:14:53,043 [myid:3] - WARN [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.4.12:60734, session = 0x3000004f3460055
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2023-09-15 06:14:56,129 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530056
2023-09-15 06:14:56,306 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530057
2023-09-15 06:15:06,441 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3000004f3460055, timeout of 18000ms exceeded
2023-09-15 06:15:06,441 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460055
2023-09-15 06:15:40,441 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530057, timeout of 18000ms exceeded
2023-09-15 06:15:40,441 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530057
请问这种情况一般如何定位解决?
收到了停机信号,kafka还在初始化中,导致关闭失败。
排查思路:
基于k8s安装的kafka需要注意容器健康探测验证,否则一直被k8s认为不健康而重启。
告诉 kubelet 在第一次执行 probe 之前要的等待120秒钟
参考:kubernetes配置Liveness和Readiness探针
谢谢回复~。目前我的配置如下:
livenessProbe: failureThreshold: 2 initialDelaySeconds: 10 periodSeconds: 10 successThreshold: 1 tcpSocket: port: 9092 timeoutSeconds: 5 readinessProbe: failureThreshold: 6 initialDelaySeconds: 5 periodSeconds: 10 successThreshold: 1 tcpSocket: port: 9092 timeoutSeconds: 5
这里只需要调整initialDelaySeconds为120秒么?调整为这个值的一般会参考什么指标?谢谢
此外,上面的配置还有没有其它问题呢
就是kafka启动,你5秒后开始探测是否健康,太快了,kafka肯定还没有启来。
没有标准的指标,这个指标就有点复杂,越大给它充足的启动时间最好。
比如你设置了2核cpu给kafka程序,那么启动时间是60秒完成,如果是4核就30秒就ok了。
嗯嗯,了解了。您说的是readinessProbe中的initialDelaySeconds应该改为120吧,存活探针中的initialDelaySeconds是否也要相应改到120?
嗯,都改
我试试,多谢
我将上述initalDelaySeconds改为120后,发现pod至少要等120秒才能就绪,这个时间有点久。
是不是可以这样理解:如果:
pod将在连连续6次失败总共等待720秒后pod状态变为crashloopbackoff。
如果将initialDelaySeconds设置为10,则总共等待1分钟pod状态就变为crashloopbackoff呢?
你的答案