Kafka consumer offset suddenly reset to the smallest value and consumption restarted from the beginning. How do I troubleshoot this?

化暗影~锁   Posted: 2018-04-17   Last updated: 2018-04-18 09:32:01   7,745 views

Kafka version: 0.8

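One of our production clusters suddenly built up a large backlog while running, even though read and write QPS had not changed at all. We then found that data was being consumed a second time: the consumer had restarted from the beginning. We have auto.offset.reset = smallest, so when the offset cannot be read from ZooKeeper the consumer starts over from the earliest offset. Why would the offset recorded in ZooKeeper be missing? Or is something else causing the restart? And how should I investigate this?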
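For reference, the documented behavior of auto.offset.reset in the 0.8 high-level consumer covers two cases: no initial offset in ZooKeeper, and an offset that is out of range on the broker. The following is a minimal sketch of that decision in Python, not Kafka's actual code; the function name resolve_start_offset and its parameters are hypothetical and exist only for illustration.

```python
from typing import Optional


def resolve_start_offset(
    committed: Optional[int],
    log_start: int,
    log_end: int,
    auto_offset_reset: str = "smallest",
) -> int:
    """Model where a consumer starts fetching for one partition.

    committed  -- offset read from ZooKeeper, or None if nothing is stored
    log_start  -- first offset the broker still has for the partition
    log_end    -- end of the broker's log for the partition
    """
    # A stored offset inside the broker's range is used as-is.
    if committed is not None and log_start <= committed <= log_end:
        return committed

    # Otherwise (missing OR out of range) the reset policy applies.
    if auto_offset_reset == "smallest":
        return log_start
    if auto_offset_reset == "largest":
        return log_end
    raise ValueError("no valid offset and no usable auto.offset.reset policy")
```

With the numbers from the last log line below, resolve_start_offset(500978559, 500521997, 500978520) returns 500521997. In other words, under this model a restart from the beginning does not require the ZooKeeper offset to be missing; a stored offset that the broker considers out of range leads to the same result.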

The production broker logs contain entries like these:

[2018-04-15 04:48:26,574] INFO Partition [communication.dmSingle.admin,11] on broker 16127: Shrinking ISR for partition [communication.dmSingle.admin,11] from 16127,20241,2229,2253 to 16127 (kafka.cluster.Partition)
[2018-04-15 04:48:26,577] INFO Partition [communication.dmSingle.admin,11] on broker 16127: Cached zkVersion [1012] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-04-15 04:48:26,610] INFO Partition [communication.dmSingle.admin,3] on broker 16127: Shrinking ISR for partition [communication.dmSingle.admin,3] from 16127,2229,2253,20241 to 16127 (kafka.cluster.Partition)
[2018-04-15 04:48:26,614] INFO Partition [communication.dmSingle.admin,3] on broker 16127: Cached zkVersion [60] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-04-15 04:48:27,043] INFO Partition [communication.dmSingle.admin,1] on broker 16127: Shrinking ISR for partition [communication.dmSingle.admin,1] from 16127,20241,2229,2253 to 16127 (kafka.cluster.Partition)
[2018-04-15 04:48:27,047] INFO Partition [communication.dmSingle.admin,1] on broker 16127: Cached zkVersion [5457] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-04-15 04:48:27,079] INFO Partition [communication.dmSingle.admin,14] on broker 16127: Shrinking ISR for partition [communication.dmSingle.admin,14] from 16127,20241,2229,2253 to 16127 (kafka.cluster.Partition)
[2018-04-15 04:48:27,083] INFO Partition [communication.dmSingle.admin,14] on broker 16127: Cached zkVersion [64] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-04-15 04:48:27,350] INFO Partition [communication.dmSingle.admin,13] on broker 16127: Shrinking ISR for partition [communication.dmSingle.admin,13] from 16127,20241,2229,2253 to 16127 (kafka.cluster.Partition)
[2018-04-15 04:48:27,354] INFO Partition [communication.dmSingle.admin,13] on broker 16127: Cached zkVersion [5455] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-04-15 04:48:27,433] INFO Partition [communication.dmSingle.admin,15] on broker 16127: Shrinking ISR for partition [communication.dmSingle.admin,15] from 16127,2229,2253,20241 to 16127 (kafka.cluster.Partition)
[2018-04-15 04:48:28,152] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,2] offset 152413846 from consumer with correlation id 0. Possible cause: Attempt to read with a maximum offset (152413628) less than the start offset (152413846). (kafka.server.ReplicaManager)
[2018-04-15 04:48:28,153] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,1] offset 500978291 from consumer with correlation id 0. Possible cause: Attempt to read with a maximum offset (500978075) less than the start offset (500978291). (kafka.server.ReplicaManager)
[2018-04-15 04:48:28,230] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,3] offset 500978415 from consumer with correlation id 0. Possible cause: Attempt to read with a maximum offset (500978194) less than the start offset (500978415). (kafka.server.ReplicaManager)
[2018-04-15 04:48:28,280] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,19] offset 500979645 from consumer with correlation id 0. Possible cause: Attempt to read with a maximum offset (500979429) less than the start offset (500979645). (kafka.server.ReplicaManager)
[2018-04-15 04:48:28,365] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,2] offset 152413846 from consumer with correlation id 2. Possible cause: Attempt to read with a maximum offset (152413628) less than the start offset (152413846). (kafka.server.ReplicaManager)
[2018-04-15 04:48:28,365] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,1] offset 500978291 from consumer with correlation id 2. Possible cause: Attempt to read with a maximum offset (500978075) less than the start offset (500978291). (kafka.server.ReplicaManager)
[2018-04-15 04:48:28,592] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,19] offset 500979645 from consumer with correlation id 4. Possible cause: Attempt to read with a maximum offset (500979429) less than the start offset (500979645). (kafka.server.ReplicaManager)
[2018-04-15 04:48:29,888] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,7] offset 500979609 from consumer with correlation id 0. Possible cause: Attempt to read with a maximum offset (500979393) less than the start offset (500979609). (kafka.server.ReplicaManager)
[2018-04-15 04:57:58,145] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,15] offset 500979544 from consumer with correlation id 0. Possible cause: Request for offset 500979544 but we only have log segments in the range 500515658 to 500979497. (kafka.server.ReplicaManager)
[2018-04-15 04:57:59,404] ERROR [Replica Manager on Broker 16127]: Error when processing fetch request for partition [communication.dmSingle.admin,3] offset 500978559 from consumer with correlation id 0. Possible cause: Request for offset 500978559 but we only have log segments in the range 500521997 to 500978520. (kafka.server.ReplicaManager)

A related bug I found: https://issues.apache.org/jira/browse/KAFKA-725

Posted on 2018-04-17

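The last error means that the consumer asked to start from offset 500978559, but your log segments do not cover that offset (the broker only has 500521997 to 500978520). This shows that the offset held by the consumer does not match what is stored in Kafka.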
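To check every partition rather than just the last line, the out-of-range errors can be pulled out of the broker log with a small script. This is only a sketch: the regular expression is written against the exact message format quoted in the question, and the names OUT_OF_RANGE and find_out_of_range are made up for this example.

```python
import re

# Matches the "only have log segments in the range" errors quoted above.
OUT_OF_RANGE = re.compile(
    r"partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"offset (?P<requested>\d+) .*?"
    r"only have log segments in the range (?P<start>\d+) to (?P<end>\d+)"
)


def find_out_of_range(lines):
    """Return one dict per out-of-range fetch error found in the log lines."""
    results = []
    for line in lines:
        match = OUT_OF_RANGE.search(line)
        if match is None:
            continue  # not an out-of-range error line
        requested = int(match.group("requested"))
        log_end = int(match.group("end"))
        results.append(
            {
                "topic": match.group("topic"),
                "partition": int(match.group("partition")),
                "requested": requested,
                "log_start": int(match.group("start")),
                "log_end": log_end,
                # Positive: the consumer is asking beyond the end of the log.
                "ahead_by": requested - log_end,
            }
        )
    return results
```

For the two matching lines in the question, the requested offsets are past the end of the broker's range (by 47 for partition 15 and by 39 for partition 3), not before its start.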
