Did a controller restart corrupt Kafka metadata?

Posted: 2021-03-18   Last updated: 2021-03-18 16:31:17   2,402 views

At 3 PM I took controller node 1 offline, and node 2 was elected as the new controller. I then noticed that the metadata for partition 1 of topic A was wrong: its three replicas were originally on 2, 3, 4 (with 2 as the leader). But the logs show that the new controller 2 told broker 6 that the replicas of topic A partition 1 are 3, 2, 6, while no metadata change was propagated to 2, 3, 4. As a result, broker 6 added a replica out of nowhere and kept failing to fetch from the leader.

The logs are below. Could someone help me figure out why this happened? It looks as if taking the controller offline corrupted the metadata.

Logs from the broker 6 machine:

(The logs at 2 PM were still normal)
[2021-03-16 14:12:39.979] TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=29, leader=2, leaderEpoch=13, isr=[2, 3, 4], zkVersion=25, replicas=[2, 3, 4], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 1 epoch 29 with correlation id 0 (state.change.logger)

(These are the logs after the controller restart)
[2021-03-16 15:00:46.600] TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=29, leader=2, leaderEpoch=13, isr=[2, 3, 4], zkVersion=25, replicas=[3, 2, 6], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 0 (state.change.logger)

One more thing that puzzles me about this log: the controller epoch in the request is 30, but the controllerEpoch inside UpdateMetadataPartitionState is 29, the same as in the 2 PM log. Could something be wrong there?

Logs from the broker 2 machine:
[2021-03-16 15:07:26,164] WARN [ReplicaManager broker=2] Leader 2 failed to record follower 6's position 11679162 since the replica is not recognized to be one of the assigned replicas 3,4,2 for partition A-1. Empty records will be returned for this partition. (kafka.server.ReplicaManager)


By default in Kafka, once a topic and its partitions (including replicas) are created, the brokers they are assigned to do not change on their own.
So, has any operation or test been done that could have affected this? For example (a quick check for both is sketched after the list):

  • another Kafka cluster mistakenly connected to this ZooKeeper
  • a duplicate broker.id
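One way to check both of these, assuming Kafka's bundled zookeeper-shell.sh and a ZooKeeper at zk:2181 (adjust the address for your cluster):

# List the broker ids currently registered in this ZooKeeper
bin/zookeeper-shell.sh zk:2181 ls /brokers/ids

# Show which host/port currently owns broker id 6, to confirm it belongs to this cluster
bin/zookeeper-shell.sh zk:2181 get /brokers/ids/6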
-> 半兽人 3 years ago

No, nothing like that. I only restarted the controller at the time, and then saw this problem in the logs.

半兽人 -> 3 years ago

That broker 6 is a mystery; it's beyond my understanding.
In the end, I still stand by the points above.

PS: I suggest you clean up the dirty data.

-> 半兽人 3 years ago

I just checked the logs again. After the controller node was restarted at 3 PM, the replica state machine started up, and the log printed when startup completed shows the replicas as 3, 2, 6. Can I take that to mean the metadata in ZooKeeper was already 3, 2, 6? What I don't understand is where the earlier 2, 3, 4 assignment came from. The cluster had been running for a long time without problems, and neither kafka-manager nor our monitoring metrics showed under-replicated partitions or any other errors. Also, at 2 PM we restarted an ordinary broker and the cluster stayed healthy. The anomaly only appeared after restarting the controller at 3 PM.

Also, how exactly do I clean up the dirty data you mentioned? Which data counts as dirty?
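One way to confirm what assignment ZooKeeper actually holds for topic A is to read the topic znode directly, e.g. with Kafka's zookeeper-shell.sh (a sketch, assuming a ZooKeeper at zk:2181):

# The controller reads the replica assignment from this znode on startup;
# for partition 1 it should show something like "1":[3,2,6]
bin/zookeeper-shell.sh zk:2181 get /brokers/topics/A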

半兽人 -> 3 years ago

Check topic A's partition assignment with this command:

bin/kafka-topics.sh --describe --topic A --bootstrap-server localhost:9092

and see what it shows.

Is the topic otherwise working normally?

-> 半兽人 3 years ago

It's normal now. Since this is a production cluster, we did an emergency recovery at the time; the metadata is currently 3, 2, 6.

-> 半兽人 3 years ago

And the monitoring shows this topic was still producing and consuming normally after the controller restart at 3 PM, with followers 3 and 4 syncing fine. Only replica 6 could not sync; the log reported: Leader 2 failed to record follower 6's position 11679162 since the replica is not recognized to be one of the assigned replicas 3,4,2 for partition A-1. Empty records will be returned for this partition. (kafka.server.ReplicaManager)

It was only after another broker was restarted later that the cluster started showing under-replicated partitions.

半兽人 -> 3 years ago

How did you recover it?
Leader 2 ignored follower 6 because 6 is not among partition A-1's assigned replicas (3, 4, 2).
What I'm curious about is broker 6: how could it have data for A-1 that shouldn't exist there? Did it have any before?

-> 半兽人 3 years ago

It was after broker 3 was restarted at around 3:10 PM that partition A-1's metadata began to change, which is how I learned 6 was a follower. There was nothing before that; the directories on broker 6 were all created around 3 PM. Here are broker 6's logs:

[2021-03-16 15:00:46.829] INFO Created log for partition A-1 in /data4/kafka-data/A-1 with properties {compression.type -> producer, message.downconversion.enable -> true, min.insync.replicas -> 2, segment.jitter.ms -> 0, cleanup.policy -> [delete], flush.ms -> 9223372036854775807, segment.bytes -> 1073741824, retention.ms -> 172800000, flush.messages -> 9223372036854775807, message.format.version -> 2.0-IV1, file.delete.delay.ms -> 60000, max.compaction.lag.ms -> 9223372036854775807, max.message.bytes -> 4194304, min.compaction.lag.ms -> 0, message.timestamp.type -> CreateTime, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> true, retention.bytes -> -1, delete.retention.ms -> 86400000, segment.ms -> 604800000, message.timestamp.difference.max.ms -> 9223372036854775807, segment.index.bytes -> 10485760}. (kafka.log.LogManager)

(These are the logs after broker 3 was restarted at around 3:10 PM; you can see the leaderEpoch changed)

[2021-03-16 15:11:04.092]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 4], zkVersion=26, replicas=[3, 2, 6], offlineReplicas=[3]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 23 (state.change.logger)
[2021-03-16 15:11:10.331]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 4], zkVersion=26, replicas=[3, 2, 6], offlineReplicas=[3]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 29 (state.change.logger)
[2021-03-16 15:11:15.076]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 4, 6], zkVersion=27, replicas=[3, 2, 6], offlineReplicas=[3]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 34 (state.change.logger)
[2021-03-16 15:11:15.236]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 4, 6], zkVersion=27, replicas=[3, 2, 6], offlineReplicas=[3]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 35 (state.change.logger)
[2021-03-16 15:11:23.389]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 6], zkVersion=28, replicas=[3, 2, 6], offlineReplicas=[3]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 38 (state.change.logger)
[2021-03-16 15:11:32.727]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 6], zkVersion=28, replicas=[3, 2, 6], offlineReplicas=[3]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 40 (state.change.logger)
[2021-03-16 15:11:34.809]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 6], zkVersion=28, replicas=[3, 2, 6], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 41 (state.change.logger)
[2021-03-16 15:11:34.884]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 6], zkVersion=28, replicas=[3, 2, 6], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 42 (state.change.logger)
[2021-03-16 15:11:55.426]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=2, leaderEpoch=14, isr=[2, 6, 3], zkVersion=29, replicas=[3, 2, 6], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 46 (state.change.logger)
[2021-03-16 15:15:54.197]  TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=30, leader=3, leaderEpoch=15, isr=[2, 6, 3], zkVersion=30, replicas=[3, 2, 6], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 50 (state.change.logger)
-> 半兽人 3 years ago

What actually puzzles me now is this: at these two points in time the leaderEpoch and zkVersion are identical, so how did only the replicas change? Isn't that suspicious?

[2021-03-16 14:12:39.979] metis-main2-sg2-kafka TRACE [Broker id=10040] Cached leader info UpdateMetadataPartitionState(topicName='metis_mobile_live_stat', partitionIndex=1, controllerEpoch=29, leader=10045, leaderEpoch=13, isr=[10045, 10042, 10043], zkVersion=25, replicas=[10045, 10042, 10043], offlineReplicas=[]) for partition metis_mobile_live_stat-1 in response to UpdateMetadata request sent by controller 10041 epoch 29 with correlation id 0 (state.change.logger)
[2021-03-16 15:00:46.600] metis-main2-sg2-kafka TRACE [Broker id=10040] Cached leader info UpdateMetadataPartitionState(topicName='metis_mobile_live_stat', partitionIndex=1, controllerEpoch=29, leader=10045, leaderEpoch=13, isr=[10045, 10042, 10043], zkVersion=25, replicas=[10042, 10045, 10040], offlineReplicas=[]) for partition metis_mobile_live_stat-1 in response to UpdateMetadata request sent by controller 10045 epoch 30 with correlation id 0 (state.change.logger)
-> 3 years ago

I pasted the wrong logs; here are the correct ones:

[2021-03-16 14:12:39.979] TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=29, leader=2, leaderEpoch=13, isr=[2, 3, 4], zkVersion=25, replicas=[2, 3, 4], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 1 epoch 29 with correlation id 0 (state.change.logger)
[2021-03-16 15:00:46.600] TRACE [Broker id=6] Cached leader info UpdateMetadataPartitionState(topicName='A', partitionIndex=1, controllerEpoch=29, leader=2, leaderEpoch=13, isr=[2, 3, 4], zkVersion=25, replicas=[3, 2, 6], offlineReplicas=[]) for partition A-1 in response to UpdateMetadata request sent by controller 2 epoch 30 with correlation id 0 (state.change.logger)
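Regarding the zkVersion question above: as far as I understand, the replicas list and the leader/isr/zkVersion fields come from different znodes, which may be why zkVersion stayed the same while replicas changed. Comparing both directly could help pin down when each piece was last written (a sketch, assuming zookeeper-shell.sh and a ZooKeeper at zk:2181):

# Replica assignment (the replicas=[...] part) lives here
bin/zookeeper-shell.sh zk:2181 get /brokers/topics/A

# Leader, isr and the zkVersion shown in the log come from the per-partition state znode
bin/zookeeper-shell.sh zk:2181 get /brokers/topics/A/partitions/1/state

# stat shows that znode's dataVersion and last-modified time (mtime)
bin/zookeeper-shell.sh zk:2181 stat /brokers/topics/A/partitions/1/state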
半兽人 -> 3 years ago

Partition replicas do not change after they are created, so this is beyond me.
1. Use the command I gave above for getting the topic's partition assignment, take a snapshot yourself, then run the same check a few days later and see whether anything has changed (see the sketch below).
2. As for the logs, they are all INFO level, with no ERROR-level entries. Some of the output does look odd, but if it is only being cited as evidence and does not affect the use of the topic, it can be ignored.
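A minimal way to take that snapshot and compare later (assuming the broker at localhost:9092; the file names are just examples):

# Snapshot topic A's assignment today
bin/kafka-topics.sh --describe --topic A --bootstrap-server localhost:9092 > topicA-$(date +%F).txt

# A few days later, take another snapshot and diff the two files
bin/kafka-topics.sh --describe --topic A --bootstrap-server localhost:9092 > topicA-$(date +%F).txt
diff topicA-2021-03-18.txt topicA-2021-03-21.txt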

-> 半兽人 3 years ago

A new finding: the PartitionReassignmentRateAndTimeMs metric in our monitoring actually has a value. Yet when broker 2 became controller, its log said there were no partitions being reassigned (this is read from /admin/reassign_partitions):
[2021-03-16 15:00:46,246] INFO [Controller id=2] Partitions being reassigned: Map() (kafka.controller.KafkaController)

We did not call any API to trigger a reassignment at the time either, so I wonder whether some internal Kafka mechanism could have caused this.
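For what it's worth, in 2.1.x a reassignment in progress should also be visible in the /admin/reassign_partitions znode itself, so one way to cross-check whenever that metric fires might be (assuming zookeeper-shell.sh and a ZooKeeper at zk:2181):

# This znode exists only while a reassignment is in flight;
# "Node does not exist" means none is running
bin/zookeeper-shell.sh zk:2181 get /admin/reassign_partitions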

I'm digging into this because I want to understand the root cause; otherwise I won't be able to continue the rolling upgrade, and I don't dare restart the other machines.
Any suggestions on how to handle restarting the Kafka brokers one by one for the upgrade?

半兽人 -> 3 years ago

Which Kafka version? If it's a major version upgrade, you need to read the upgrade notes.
If you're not using the newer features (e.g. Streams, Connect), the old version is actually quite stable (because it does less).
The upgrade doc hasn't been updated in a while; I'll find time to update it: https://www.orchome.com/505

-> 半兽人 3 years ago

Upgrading from 2.1.1 to 2.6.1. The official docs just say to do a rolling upgrade, and the procedure they describe is basically the same for every version.
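For reference, the documented flow for this kind of upgrade is roughly a two-pass rolling restart; the sketch below assumes the standard config names and should be checked against the official 2.6 upgrade notes for your exact setup:

# Pass 1: install the 2.6.1 binaries but pin the protocol to the old version
# in server.properties, then restart brokers one at a time:
#   inter.broker.protocol.version=2.1

# Before moving to the next broker, confirm nothing is under-replicated
bin/kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092

# Pass 2: once every broker runs 2.6.1, bump the protocol and do a second rolling restart:
#   inter.broker.protocol.version=2.6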
