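Environment
- Local environment:
- Six-node Kafka cluster
- Version 1.1.0
Background
kafka-2 went down during a cluster expansion, so kafka-2 is absent from the AR (assigned replicas) and ISR (in-sync replicas) lists of every topic. In other words, kafka-2 had been a zero-load node the whole time, and nobody noticed.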
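A broker that holds no replicas can be spotted by counting how often each broker ID appears in the "Replicas:" lists printed by kafka-topics.sh --describe. The following is a minimal sketch, not an official tool. It assumes the six nodes kafka-0 to kafka-5 use broker IDs 0 to 5, and the helper names (replica_counts, idle_brokers) are made up for illustration. The sample text is three partition lines in the describe format.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches the "Replicas: 3,0,1" field of a `kafka-topics.sh --describe` partition line.
REPLICAS_RE = re.compile(r"Replicas:\s*([\d,]+)")


def replica_counts(describe_output: str) -> Counter:
    """Count how many partition replicas are assigned to each broker ID."""
    counts: Counter = Counter()
    for match in REPLICAS_RE.finditer(describe_output):
        for broker_id in match.group(1).split(","):
            if broker_id:
                counts[int(broker_id)] += 1
    return counts


def idle_brokers(describe_output: str, expected_ids: Iterable[int]) -> List[int]:
    """Return the expected broker IDs that hold no replica at all."""
    counts = replica_counts(describe_output)
    return sorted(b for b in expected_ids if counts[b] == 0)


# Assumption: broker IDs 0..5 correspond to kafka-0..kafka-5.
SAMPLE = """\
Topic: viid_snapface_topic Partition: 0 Leader: 3 Replicas: 3,0,1 Isr: 3,0,1
Topic: viid_snapface_topic Partition: 1 Leader: 4 Replicas: 4,1,3 Isr: 4,1,3
Topic: viid_snapface_topic Partition: 2 Leader: 5 Replicas: 5,3,4 Isr: 5,4,3
"""

print(idle_brokers(SAMPLE, range(6)))
```

Running a check like this over the describe output of all topics after an expansion would flag a broker that never received any partitions.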
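Timeline:
- 18:06 kafka-0 restarted because of an OOM
- 18:13 kafka-1 restarted because of an OOM and was healthy after the restart
- 18:06 to 01:29 after its restart, the kafka-0 broker kept logging this error: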
While recording the replica LEO, the partition ... hasn't been created.
(These topics definitely all exist.)
- 18:06 to 01:29 after kafka-0 restarted, the kafka-3, kafka-4 and kafka-5 brokers kept logging this error:
UnknownTopicOrPartitionException: This server does not host this topic-partition.
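- 01:29 the clients were found to be reporting this error:
UnknownTopicOrPartitionException: This server does not host this topic-partition.
Because kafka-2 did not appear in any AR or ISR, we mistakenly assumed kafka-2 was the problem and restarted it. That fixed the problem.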
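Questions:
- Why did kafka-0 start logging errors after its restart, and why did that trigger errors on kafka-3, 4 and 5?
- Why did restarting the seemingly unrelated kafka-2 fix everything?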
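The brokers in a cluster discover one another. When kafka-0 restarts, the topic partitions on the other nodes go through leader changes to rebalance the load, and when a partition is found to be unhealthy, that triggers the errors.
Once a rebalance has been triggered, the brokers keep checking partition health in a loop.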
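Right, the topic partitions do go through leader changes, but why would the partitions be unhealthy? All the topic partitions look fine.
How is a problem like this usually resolved?
What is the mechanism behind restarting kafka-2 fixing it? Not a single topic partition has its leader on kafka-2.
I still don't really understand the cause. Could you explain in more detail?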
Check which topic partitions are on kafka-2. Those topic partitions are what is causing it.
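Not a single topic partition has its leader on kafka-2. The only topic with 2 in its AR and ISR is "kafka-health-check-topic", which is used for health checks and holds no data, and its leader is on 0.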
Query the cluster description:
bin/kafka-topics.sh --describe --zookeeper
Post the output and let's take a look.
Take the topic below as an example. Some of its partitions have already elected kafka-0 as leader, but when the followers of those partitions fetch data from the leader (kafka-0), they get UnknownTopicOrPartitionException, while kafka-0 itself logs:
While recording the replica LEO, the partition ... hasn't been created.
It looks as if data preparation was taking too long; this went on for more than six hours. It seems to be stuck in some kind of infinite loop.
Topic:viid_snapface_topic PartitionCount:60 ReplicationFactor:3 Configs:
Topic: viid_snapface_topic Partition: 0 Leader: 3 Replicas: 3,0,1 Isr: 3,0,1
Topic: viid_snapface_topic Partition: 1 Leader: 4 Replicas: 4,1,3 Isr: 4,1,3
Topic: viid_snapface_topic Partition: 2 Leader: 5 Replicas: 5,3,4 Isr: 5,4,3
Topic: viid_snapface_topic Partition: 3 Leader: 0 Replicas: 0,4,5 Isr: 0,5,4
Topic: viid_snapface_topic Partition: 4 Leader: 1 Replicas: 1,5,0 Isr: 5,1,0
Topic: viid_snapface_topic Partition: 5 Leader: 3 Replicas: 3,1,4 Isr: 4,3,1
Topic: viid_snapface_topic Partition: 6 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
Topic: viid_snapface_topic Partition: 7 Leader: 5 Replicas: 5,4,0 Isr: 5,4
Topic: viid_snapface_topic Partition: 8 Leader: 0 Replicas: 0,5,1 Isr: 0,5,1
Topic: viid_snapface_topic Partition: 9 Leader: 1 Replicas: 1,0,3 Isr: 3,1,0
Topic: viid_snapface_topic Partition: 10 Leader: 3 Replicas: 3,4,5 Isr: 4,3,5
Topic: viid_snapface_topic Partition: 11 Leader: 4 Replicas: 4,5,0 Isr: 4,5
Topic: viid_snapface_topic Partition: 12 Leader: 5 Replicas: 5,0,1 Isr: 5,0,1
Topic: viid_snapface_topic Partition: 13 Leader: 0 Replicas: 0,1,3 Isr: 0,3,1
Topic: viid_snapface_topic Partition: 14 Leader: 1 Replicas: 1,3,4 Isr: 4,3,1
Topic: viid_snapface_topic Partition: 15 Leader: 3 Replicas: 3,5,0 Isr: 3,5
Topic: viid_snapface_topic Partition: 16 Leader: 4 Replicas: 4,0,1 Isr: 4,0,1
Topic: viid_snapface_topic Partition: 17 Leader: 5 Replicas: 5,1,3 Isr: 5,3,1
Topic: viid_snapface_topic Partition: 18 Leader: 0 Replicas: 0,3,4 Isr: 0
Topic: viid_snapface_topic Partition: 19 Leader: 1 Replicas: 1,4,5 Isr: 5,4,1
Topic: viid_snapface_topic Partition: 20 Leader: 3 Replicas: 3,0,1 Isr: 3,0,1
Topic: viid_snapface_topic Partition: 21 Leader: 4 Replicas: 4,1,3 Isr: 4,1,3
Topic: viid_snapface_topic Partition: 22 Leader: 5 Replicas: 5,3,4 Isr: 5,4,3
Topic: viid_snapface_topic Partition: 23 Leader: 0 Replicas: 0,4,5 Isr: 0,5,4
Topic: viid_snapface_topic Partition: 24 Leader: 1 Replicas: 1,5,0 Isr: 5,1,0
Topic: viid_snapface_topic Partition: 25 Leader: 3 Replicas: 3,1,4 Isr: 4,3,1
Topic: viid_snapface_topic Partition: 26 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
Topic: viid_snapface_topic Partition: 27 Leader: 5 Replicas: 5,4,0 Isr: 5,4
Topic: viid_snapface_topic Partition: 28 Leader: 0 Replicas: 0,5,1 Isr: 0,5,1
Topic: viid_snapface_topic Partition: 29 Leader: 1 Replicas: 1,0,3 Isr: 3,1,0
Topic: viid_snapface_topic Partition: 30 Leader: 3 Replicas: 3,4,5 Isr: 4,3,5
Topic: viid_snapface_topic Partition: 31 Leader: 4 Replicas: 4,5,0 Isr: 4,5
Topic: viid_snapface_topic Partition: 32 Leader: 5 Replicas: 5,0,1 Isr: 5,0,1
Topic: viid_snapface_topic Partition: 33 Leader: 0 Replicas: 0,1,3 Isr: 0,3,1
Topic: viid_snapface_topic Partition: 34 Leader: 1 Replicas: 1,3,4 Isr: 1,4,3
Topic: viid_snapface_topic Partition: 35 Leader: 3 Replicas: 3,5,0 Isr: 3,5
Topic: viid_snapface_topic Partition: 36 Leader: 4 Replicas: 4,0,1 Isr: 4,0,1
Topic: viid_snapface_topic Partition: 37 Leader: 5 Replicas: 5,1,3 Isr: 5,3,1
Topic: viid_snapface_topic Partition: 38 Leader: 0 Replicas: 0,3,4 Isr: 0,4,3
Topic: viid_snapface_topic Partition: 39 Leader: 1 Replicas: 1,4,5 Isr: 5,4,1
Topic: viid_snapface_topic Partition: 40 Leader: 3 Replicas: 3,0,1 Isr: 3,0,1
Topic: viid_snapface_topic Partition: 41 Leader: 4 Replicas: 4,1,3 Isr: 4,1,3
Topic: viid_snapface_topic Partition: 42 Leader: 5 Replicas: 5,3,4 Isr: 5,4,3
Topic: viid_snapface_topic Partition: 43 Leader: 0 Replicas: 0,4,5 Isr: 0,5,4
Topic: viid_snapface_topic Partition: 44 Leader: 1 Replicas: 1,5,0 Isr: 5,1,0
Topic: viid_snapface_topic Partition: 45 Leader: 3 Replicas: 3,1,4 Isr: 4,3,1
Topic: viid_snapface_topic Partition: 46 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
Topic: viid_snapface_topic Partition: 47 Leader: 5 Replicas: 5,4,0 Isr: 5,4
Topic: viid_snapface_topic Partition: 48 Leader: 0 Replicas: 0,5,1 Isr: 0,5,1
Topic: viid_snapface_topic Partition: 49 Leader: 1 Replicas: 1,0,3 Isr: 3,1,0
Topic: viid_snapface_topic Partition: 50 Leader: 3 Replicas: 3,4,5 Isr: 4,3,5
Topic: viid_snapface_topic Partition: 51 Leader: 4 Replicas: 4,5,0 Isr: 4,5
Topic: viid_snapface_topic Partition: 52 Leader: 5 Replicas: 5,0,1 Isr: 5,0,1
Topic: viid_snapface_topic Partition: 53 Leader: 0 Replicas: 0,1,3 Isr: 0,3,1
Topic: viid_snapface_topic Partition: 54 Leader: 1 Replicas: 1,3,4 Isr: 4,3,1
Topic: viid_snapface_topic Partition: 55 Leader: 3 Replicas: 3,5,0 Isr: 3,5
Topic: viid_snapface_topic Partition: 56 Leader: 4 Replicas: 4,0,1 Isr: 4,0,1
Topic: viid_snapface_topic Partition: 57 Leader: 5 Replicas: 5,1,3 Isr: 5,3,1
Topic: viid_snapface_topic Partition: 58 Leader: 0 Replicas: 0,3,4 Isr: 0,4,3
Topic: viid_snapface_topic Partition: 59 Leader: 1 Replicas: 1,4,5 Isr: 5,4,1
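With 60 partitions it is easy to miss which ones are under-replicated. As a rough sketch (the PartitionInfo type and the function names are hypothetical, not part of Kafka), the describe output can be parsed and each partition's Replicas list compared with its Isr list. The sample uses partitions 0, 7 and 18 from the output above.

```python
import re
from typing import Dict, List, NamedTuple


class PartitionInfo(NamedTuple):
    topic: str
    partition: int
    leader: int
    replicas: List[int]
    isr: List[int]


# One partition entry of `kafka-topics.sh --describe`; the summary line
# ("PartitionCount:...") does not match because it has no "Partition:" field.
ENTRY_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _ids(csv: str) -> List[int]:
    """Turn "5,4,0" into [5, 4, 0]; an empty string gives an empty list."""
    return [int(x) for x in csv.split(",") if x]


def parse_describe(text: str) -> List[PartitionInfo]:
    """Parse partition entries, whether one per line or flattened into one line."""
    return [
        PartitionInfo(
            m.group("topic"),
            int(m.group("partition")),
            int(m.group("leader")),
            _ids(m.group("replicas")),
            _ids(m.group("isr")),
        )
        for m in ENTRY_RE.finditer(text)
    ]


def missing_from_isr(parts: List[PartitionInfo]) -> Dict[int, List[int]]:
    """Map partition number -> assigned brokers that are not in the ISR."""
    result: Dict[int, List[int]] = {}
    for p in parts:
        missing = sorted(set(p.replicas) - set(p.isr))
        if missing:
            result[p.partition] = missing
    return result


SAMPLE = """\
Topic: viid_snapface_topic Partition: 0 Leader: 3 Replicas: 3,0,1 Isr: 3,0,1
Topic: viid_snapface_topic Partition: 7 Leader: 5 Replicas: 5,4,0 Isr: 5,4
Topic: viid_snapface_topic Partition: 18 Leader: 0 Replicas: 0,3,4 Isr: 0
"""

print(missing_from_isr(parse_describe(SAMPLE)))
```

For this sample, partition 7 is missing broker 0 from its ISR, and partition 18 (led by broker 0) is missing brokers 3 and 4. Both cases involve kafka-0, either as a lagging follower or as a leader whose followers have dropped out.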
"While recording the replica LEO, the partition ... hasn't been created." means that the follower's partition has not been created yet.
https://issues.apache.org/jira/browse/KAFKA-6221
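Yes, UnknownTopicOrPartitionException is usually a transient error. A brief occurrence can be caused by metadata not being refreshed in time and is harmless. But kafka-0 had been logging this message for six hours, and kafka-3, 4 and 5 kept reporting errors as well.
Also, these topics were created long ago, so this is not a problem that occurred during topic creation.
My guess is that after kafka-0 restarted, for some unknown reason, the partitions of each topic whose leader is on kafka-0 became unavailable. Can you tell what caused this error?
Any updates?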
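When kafka-0 starts, it checks the whole environment. It finds that kafka-2, with which it shares a topic, is missing, so it keeps checking to verify whether its own topic partitions are up to date.
kafka-0 was already running when kafka-2 went down, so kafka-0 was not affected at that point. This time, however, it was kafka-0 that restarted.
kafka-0 cannot confirm that its own topics and partitions are up to date.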