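Recently our production cluster had several failures in which Kafka was stuck in a fake "running" state: the pid still showed up in jps, but the listening port 9092 was gone, logs were still being written, other brokers could not connect, and the broker's registration path in zk was still there. A restart fails because that path still exists in zk. This puzzled me for a long time, so I searched the Apache JIRA and found https://issues.apache.org/jira/browse/KAFKA-3410, which looks a lot like my problem. The gist: a follower is dropped from the leader's ISR for some reason; if the broker hosting the leader partition then restarts, it may lose cached data that was never flushed, so that leader LEO < follower LEO, and the broker hosting the follower partition exits. I checked the source code and it does indeed exit: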
// we should never encounter this situation since a non-ISR leader cannot be elected if disallowed by the broker configuration.
if (!LogConfig.fromProps(brokerConfig.originals, AdminUtils.fetchEntityConfig(replicaMgr.zkUtils,
  ConfigType.Topic, topicPartition.topic)).uncleanLeaderElectionEnable) {
  // Log a fatal error and shutdown the broker to ensure that data loss does not unexpectedly occur.
  fatal("Exiting because log truncation is not allowed for partition %s,".format(topicPartition) +
    " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
    .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId, replica.logEndOffset.messageOffset))
  System.exit(1)
}
Related logs:
[2016-08-30 16:51:03,374] FATAL [ReplicaFetcherThread-0-1], Exiting because log truncation is not allowed for topic test, Current leader 1's latest offset 0 is less than replica 2's latest offset 1 (kafka.server.ReplicaFetcherThread)
[2016-08-30 16:51:03,374] INFO [Kafka Server 2], shutting down (kafka.server.KafkaServer)
[2016-08-30 16:51:03,375] INFO [Kafka Server 2], Starting controlled shutdown (kafka.server.KafkaServer)
[2016-08-30 16:51:03,397] INFO [Kafka Server 2], Controlled shutdown succeeded (kafka.server.KafkaServer)
[2016-08-30 16:51:03,399] INFO [Socket Server on Broker 2], Shutting down (kafka.network.SocketServer)
[2016-08-30 16:51:03,403] INFO [Socket Server on Broker 2], Shutdown completed (kafka.network.SocketServer)
[2016-08-30 16:51:03,404] INFO [Kafka Request Handler on Broker 2], shutting down (kafka.server.KafkaRequestHandlerPool)
So here is my question:
System.exit(1) should make the JVM exit, so why does the pid still exist, and why is the path still present in zk?
Argh, is there any fix for this? We have hit the same problem in production. What a trap!!
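A fatal error like this can have many causes, for example:
1. The disk is full
2. The data is corrupted
The core error is that the leader has less data than the replica, so the data is already inconsistent. Try restarting the brokers one at a time, or discard the replica's data.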
unclean.leader.election.enable=true
With this setting you may lose data, so use it with caution.
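As a rough sketch of where that setting could go: the source snippet quoted above merges the broker config with the topic's config fetched from zk, so the flag can presumably be set either broker-wide or for a single topic. The file name, the zk address zk-host:2181 and the topic name test (taken from the log above) are assumptions for illustration, not values confirmed for your cluster.

```
# server.properties on each broker (broker-wide default; needs a broker restart)
# WARNING: allows an out-of-sync replica to become leader, which can lose data.
unclean.leader.election.enable=true

# Alternative (assumed zk address and topic name): override for one topic only
#   bin/kafka-configs.sh --zookeeper zk-host:2181 --alter \
#     --entity-type topics --entity-name test \
#     --add-config unclean.leader.election.enable=true
```

If you go this way, it is probably worth limiting the change to the affected topic and setting it back to false once the replicas have caught up, since the data-loss risk stays for as long as the flag is on.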
So how do I fix it? I have tried restarting zk, restarting kafka, and restarting the server, and none of them worked.