Kafka cluster: every 30-odd days, replicas get kicked out of the ISR

ighack   Posted: 2020-03-24   Last updated: 2020-03-24 22:21:52   3,007 views

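Roughly every 30-odd days, some replicas in my Kafka cluster get kicked out of the ISR.

The cluster has 5 machines, and every topic is configured with 3 replicas.
Version: kafka_2.11-0.10.2.1

I don't know what is causing this or where to start looking.

The only thing I have noticed is that on machine 23,
Kafka Topic Byte In (byte/sec) is clearly higher than on the other 4 machines.
For Kafka Topic Byte Out (byte/sec), machines 23 and 27 are higher than the other 3.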


You should start by checking the logs on each broker.

ighack -> 半兽人, 4 years ago

On machine 27 I see many files like controller.log.2020-03-25-08, with content such as:

[2020-03-25 08:00:01,455] DEBUG [Controller 4]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] TRACE [Controller 4]: leader imbalance ratio for broker 0 is 0.000000 (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] DEBUG [Controller 4]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] TRACE [Controller 4]: leader imbalance ratio for broker 1 is 0.000000 (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] DEBUG [Controller 4]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] TRACE [Controller 4]: leader imbalance ratio for broker 2 is 0.000000 (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] DEBUG [Controller 4]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] TRACE [Controller 4]: leader imbalance ratio for broker 3 is 0.000000 (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] DEBUG [Controller 4]: topics not in preferred replica Map() (kafka.controller.KafkaController)
[2020-03-25 08:00:01,455] TRACE [Controller 4]: leader imbalance ratio for broker 4 is 0.000000 (kafka.controller.KafkaController)
[2020-03-25 08:05:01,450] TRACE [Controller 4]: checking need to trigger partition rebalance (kafka.controller.KafkaController)
[2020-03-25 08:05:01,454] DEBUG [Controller 4]: preferred replicas by broker Map(0 -> Map([gtp_data_log,1] -> List(0, 3, 4), [wlpt_to_mdb,2] -> List(0, 2, 3), [JLP_TO_LMIS_CHONGQ1,3] -> List(0, 4, 1), [JLP_TO_LMIS_SHANGH,5] -> List(0, 1, 2), [mdb_Fd_Route_NM,4] -> List(0, 2, 3), [TMP_TO_LMIS_SD,1] -> List(0, 3, 4), [JLP_TO_LMIS_GD,0] -> List(0, 1, 2), [__consumer_offsets,30] -> List(0, 2, 3), [JLP_TO_LMIS_HEN,7] -> List(0, 2, 3), [TMP_TO_LMIS_GD,7] -> List(0, 4, 1), [TMP_TO_LMIS_LZ,9] -> List(0, 2, 3), [gtp_data_log,6] -> List(0, 4, 1), [TMP_TO_LMIS_CHONGQ,4] -> List(0, 4, 1), [TMP_TO_LMIS_HAIN,7] -> List(0, 2, 3), [JTmdb_Fd_Good,2] -> List(0, 3, 4), [JLP_TO_LMIS_FJ,6] -> List(0, 2, 3), [sendemail,2] -> List(0, 4), [JLP_TO_LMIS_SHANGH,0] -> List(0, 4, 1), [mdb_Fd_Route_LZ,0] -> List(0, 2, 3), [__consumer_offsets,10] -> List(0, 2, 3), [JLP_TO_LMIS_FJ1,2] -> List(0, 3, 4), [mdb_Fd_Route_HAIN,2] -> List(0, 1, 2), [JLP_TO_LMIS_HEN,2] -> List(0, 1, 2), [TMP_TO_LMIS_FJ,6] -> List(0, 4, 1), [TMP_TO_LMIS_XM,0] -> List(0, 1, 2), [JLP_TO_LMIS_SD,2] -> List(0, 3, 4), [TMP_TO_LMIS_JIANGX,1] -> List(0, 1, 2), [__consumer_offsets,40] -> List(0, 4, 1), [TMP_TO_LMIS_BEIJ,4] -> List(0, 3, 4), [Parallel_Computing_Stock,0]

The other machines also have controller.log.2020-03-* files, but they are not generated every hour, and their content does not look like the above.
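For anyone reading the "leader imbalance ratio" lines above, here is a minimal sketch of how I understand that number: for each broker, it is the share of partitions whose preferred replica (the first entry in the replica list) is that broker but whose current leader is some other broker. The function name and the last sample row are hypothetical, and the real controller also takes broker liveness and in-progress reassignments into account, so treat this as an illustration only.

```python
from collections import defaultdict

def leader_imbalance_ratio(partitions):
    """partitions: iterable of (topic, partition, leader, replicas)."""
    preferred = defaultdict(int)    # broker -> partitions where it is the preferred leader
    not_leading = defaultdict(int)  # broker -> of those, how many it does not currently lead
    for _topic, _partition, leader, replicas in partitions:
        broker = replicas[0]        # preferred replica = first in the assignment
        preferred[broker] += 1
        if leader != broker:
            not_leading[broker] += 1
    return {b: not_leading[b] / preferred[b] for b in preferred}

sample = [
    ("mdb_Fd_Route_GD", 0, 2, [2, 3, 4]),
    ("mdb_Fd_Route_GD", 1, 3, [3, 4, 0]),
    ("mdb_Fd_Route_GD", 2, 4, [4, 0, 1]),
    # Hypothetical row: broker 2 is preferred, but broker 3 currently leads.
    ("demo_topic", 0, 3, [2, 3, 4]),
]
ratios = leader_imbalance_ratio(sample)
print(ratios)
```

A ratio of 0.000000 for every broker, as in the log above, would mean every partition is currently led by its preferred replica. The five-minute gap between the 08:00:01 and 08:05:01 entries is consistent with the default `leader.imbalance.check.interval.seconds` of 300, which would suggest these lines are the active controller's routine check rather than a symptom.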

[2020-03-03 10:57:28,711] INFO [Controller 1]: Controller startup complete (kafka.controller.KafkaController)
[2020-03-03 10:57:31,354] DEBUG [Controller 1]: Controller resigning, broker id 1 (kafka.controller.KafkaController)
[2020-03-03 10:57:31,354] DEBUG [Controller 1]: De-registering IsrChangeNotificationListener (kafka.controller.KafkaController)
[2020-03-03 10:57:31,356] INFO [Partition state machine on Controller 1]: Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2020-03-03 10:57:31,357] INFO [Replica state machine on controller 1]: Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2020-03-03 10:57:31,358] INFO [Controller 1]: Broker 1 resigned as the controller (kafka.controller.KafkaController)
[2020-03-03 10:57:33,325] INFO [Controller 1]: Controller starting up (kafka.controller.KafkaController)
[2020-03-03 10:57:33,342] INFO [Controller 1]: Controller startup complete (kafka.controller.KafkaController)

These look fairly normal. Only around the time a replica is kicked out of the ISR do some machines show logs like this:

[2020-03-03 10:59:54,553] DEBUG [Controller 2]: Removing replica 1 from ISR 3,0 for partition [TMP_TO_LMIS_SHANGH,6]. (kafka.controller.KafkaController)
[2020-03-03 10:59:54,554] WARN [Controller 2]: Cannot remove replica 1 from ISR of partition [TMP_TO_LMIS_SHANGH,6] since it is not in the ISR. Leader = 3 ; ISR = List(3, 0) (kafka.controller.KafkaController)
[2020-03-03 10:59:54,554] DEBUG The stop replica request (delete = true) sent to broker 1 is  (kafka.controller.ControllerBrokerRequestBatch)
[2020-03-03 10:59:54,554] DEBUG The stop replica request (delete = false) sent to broker 1 is [Topic=TMP_TO_LMIS_SHANGH,Partition=6,Replica=1] (kafka.controller.ControllerBrokerRequestBatch)
[2020-03-03 10:59:54,554] DEBUG The stop replica request (delete = true) sent to broker 1 is  (kafka.controller.ControllerBrokerRequestBatch)
[2020-03-03 10:59:54,554] DEBUG The stop replica request (delete = false) sent to broker 1 is [Topic=__consumer_offsets,Partition=17,Replica=1] (kafka.controller.ControllerBrokerRequestBatch)
[2020-03-03 10:59:54,554] INFO [Replica state machine on controller 2]: Invoking state change to OfflineReplica for replicas [Topic=__consumer_offsets,Partition=17,Replica=1] (kafka.controller.ReplicaStateMachine)
[2020-03-03 10:59:54,554] DEBUG [Controller 2]: Removing replica 1 from ISR 2,0 for partition [__consumer_offsets,17]. (kafka.controller.KafkaController)
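To see which broker is being removed and when, the "Removing replica ... from ISR" entries can be pulled out of controller.log and tallied. This is a rough sketch that assumes the message format matches the lines quoted above; the function name is mine, and the pattern may need adjusting for other Kafka versions. It searches the whole text rather than line by line, so entries that got fused onto one line are still found.

```python
import re
from collections import Counter

# Matches entries shaped like the DEBUG "Removing replica" lines quoted above.
REMOVAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} [\d:,]+)\] DEBUG \[Controller (?P<controller>\d+)\]: "
    r"Removing replica (?P<replica>\d+) from ISR (?P<isr>[\d,]*) "
    r"for partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\]"
)

def parse_isr_removals(text):
    """Return (timestamp, removed_replica, topic, partition) for each removal entry."""
    return [
        (m["ts"], int(m["replica"]), m["topic"], int(m["partition"]))
        for m in REMOVAL_RE.finditer(text)
    ]

SAMPLE_LOG = """\
[2020-03-03 10:59:54,553] DEBUG [Controller 2]: Removing replica 1 from ISR 3,0 for partition [TMP_TO_LMIS_SHANGH,6]. (kafka.controller.KafkaController)
[2020-03-03 10:59:54,554] WARN [Controller 2]: Cannot remove replica 1 from ISR of partition [TMP_TO_LMIS_SHANGH,6] since it is not in the ISR. Leader = 3 ; ISR = List(3, 0) (kafka.controller.KafkaController)
[2020-03-03 10:59:54,554] DEBUG [Controller 2]: Removing replica 1 from ISR 2,0 for partition [__consumer_offsets,17]. (kafka.controller.KafkaController)
"""

events = parse_isr_removals(SAMPLE_LOG)
removals_per_broker = Counter(replica for _ts, replica, _topic, _part in events)
print(removals_per_broker)
```

If the same broker id dominates the tally each time, that broker's server.log around those timestamps is probably the next place to look.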
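ZooKeeper:

/controller_epoch records how many times the controller has changed, that is, how many times it has failed over. A large value means the cluster is unstable and the controller keeps being re-elected.
Mine is at 225, but I don't know where the instability is coming from.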
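One way to narrow down when the controller moved is to count controller start and resignation messages per day and per broker across the controller.log files. The sketch below only recognises the two INFO messages that appear in the logs quoted in this thread; the function name is hypothetical. Note that "Controller startup complete" marks the controller component starting on a broker, not necessarily that broker winning the election, so the counts are a hint rather than an exact election history.

```python
import re
from collections import Counter

# Recognises only the two INFO messages seen in the controller.log excerpts in this thread.
EVENT_RE = re.compile(
    r"\[(?P<date>\d{4}-\d{2}-\d{2}) [\d:,]+\] INFO \[Controller (?P<broker>\d+)\]: "
    r"(?P<event>Controller startup complete|Broker \d+ resigned as the controller)"
)

def controller_events_per_day(text):
    """Count (date, broker, kind) occurrences, where kind is 'startup' or 'resigned'."""
    counts = Counter()
    for m in EVENT_RE.finditer(text):
        kind = "startup" if m["event"].startswith("Controller startup") else "resigned"
        counts[(m["date"], int(m["broker"]), kind)] += 1
    return counts

SAMPLE_LOG = """\
[2020-03-03 10:57:28,711] INFO [Controller 1]: Controller startup complete (kafka.controller.KafkaController)
[2020-03-03 10:57:31,354] DEBUG [Controller 1]: Controller resigning, broker id 1 (kafka.controller.KafkaController)
[2020-03-03 10:57:31,358] INFO [Controller 1]: Broker 1 resigned as the controller (kafka.controller.KafkaController)
[2020-03-03 10:57:33,342] INFO [Controller 1]: Controller startup complete (kafka.controller.KafkaController)
"""

counts = controller_events_per_day(SAMPLE_LOG)
for key, n in sorted(counts.items()):
    print(key, n)
```

Days with a burst of these events would be the ones to compare against the broker server.log and the ZooKeeper logs.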


My partitions should be fairly well balanced:

    Topic: mdb_Fd_Route_GD    Partition: 0    Leader: 2    Replicas: 2,3,4    Isr: 2,4,3
    Topic: mdb_Fd_Route_GD    Partition: 1    Leader: 3    Replicas: 3,4,0    Isr: 4,3,0
    Topic: mdb_Fd_Route_GD    Partition: 2    Leader: 4    Replicas: 4,0,1    Isr: 4,1,0
    Topic: mdb_Fd_Route_GD    Partition: 3    Leader: 0    Replicas: 0,1,2    Isr: 2,1,0
    Topic: mdb_Fd_Route_GD    Partition: 4    Leader: 1    Replicas: 1,2,3    Isr: 2,3,1
    Topic: mdb_Fd_Route_GD    Partition: 5    Leader: 2    Replicas: 2,4,0    Isr: 2,4,0
    Topic: mdb_Fd_Route_GD    Partition: 6    Leader: 3    Replicas: 3,0,1    Isr: 3,1,0
    Topic: mdb_Fd_Route_GD    Partition: 7    Leader: 4    Replicas: 4,1,2    Isr: 4,2,1
    Topic: mdb_Fd_Route_GD    Partition: 8    Leader: 0    Replicas: 0,2,3    Isr: 2,3,0
    Topic: mdb_Fd_Route_GD    Partition: 9    Leader: 1    Replicas: 1,3,4    Isr: 4,3,1

Most of the topics look like this.
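Checking every topic by eye is tedious, so here is a small sketch that parses describe output in the format shown above, lists partitions whose ISR is missing one of the assigned replicas, and counts leaders per broker. The function names are mine, and the last sample row is hypothetical: its leader and ISR follow the WARN line in the controller log, but the replica order is assumed.

```python
import re
from collections import Counter

# One partition row of describe output; case-insensitive and tolerant of spacing.
ROW_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)",
    re.IGNORECASE,
)

def parse_describe(text):
    """Parse partition rows into dicts with integer broker ids."""
    return [
        {
            "topic": m["topic"],
            "partition": int(m["partition"]),
            "leader": int(m["leader"]),
            "replicas": [int(x) for x in m["replicas"].split(",")],
            "isr": [int(x) for x in m["isr"].split(",") if x],
        }
        for m in ROW_RE.finditer(text)
    ]

def under_replicated(rows):
    """Return (topic, partition, missing_brokers) where the ISR lacks an assigned replica."""
    result = []
    for r in rows:
        missing = sorted(set(r["replicas"]) - set(r["isr"]))
        if missing:
            result.append((r["topic"], r["partition"], missing))
    return result

def leaders_per_broker(rows):
    return Counter(r["leader"] for r in rows)

SAMPLE = """\
    Topic: mdb_Fd_Route_GD    Partition: 0    Leader: 2    Replicas: 2,3,4    Isr: 2,4,3
    Topic: mdb_Fd_Route_GD    Partition: 1    Leader: 3    Replicas: 3,4,0    Isr: 4,3,0
    Topic: mdb_Fd_Route_GD    Partition: 2    Leader: 4    Replicas: 4,0,1    Isr: 4,1,0
    Topic: TMP_TO_LMIS_SHANGH    Partition: 6    Leader: 3    Replicas: 3,0,1    Isr: 3,0
"""

rows = parse_describe(SAMPLE)
print(under_replicated(rows))
print(leaders_per_broker(rows))
```

For a quick check without any scripting, `kafka-topics.sh --describe` also accepts `--under-replicated-partitions`, which should print only the affected partitions.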
