Kafka 2.12-2.3.1
9 VMs: 16 cores, 32 GB RAM, 12 TB disk
1 disk per host
Configuration:
broker.id=5
num.network.threads=8
transaction.state.log.replication.factor=3
num.partitions=6
offsets.topic.replication.factor=3
default.replication.factor=3
num.io.threads=8
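One broker suddenly dropped out of the cluster and has been re-syncing its out-of-sync partitions ever since. It has been syncing for a full day, and the kafka-zk instance on the same host also reports that it cannot serve requests: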
This ZooKeeper instance is not currently serving requests
Running commands on that broker host is very sluggish, state-change.log is being written continuously, and server.log contains the following exceptions and warnings:
WARN Attempting to send response via channel for which there is no open connection, connection id xxx
WARN [ReplicaFetcher replicaId=5, leaderId=1, fetcherId=0] Error in response for fetch request
WARN [ReplicaFetcher replicaId=5, leaderId=9, fetcherId=0] Partition xxx marked as failed (kafka.server.ReplicaFetcherThread)
Shrinking ISR from 2,5 to 5. Leader: (highWatermark: 0, endOffset: 0). Out of sync replicas: (brokerId: 2, endOffset: -1). (kafka.cluster.Partition)
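My initial suspicion is that, with only 1 disk, the I/O subsystem is fully loaded and the disk is the bottleneck. After I shut this broker down, ZK recovered and the I/O load went back to normal.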
The monitoring charts are as follows:
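If I stop all the applications, wait until every partition has finished syncing, and then start the applications again, will the cluster recover?
If that really does not work, I plan to provide a larger cluster with multiple disks per host.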
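One way to check the disk-saturation suspicion above is to sample the kernel's per-device busy time twice and convert the difference into a utilization percentage, which is roughly what `iostat -x` reports as `%util`. The following is only a minimal sketch, assuming a Linux host; the function names are made up for illustration, and the device name you pass in (for example `sda`) depends on your host.

```python
def read_diskstats(path="/proc/diskstats"):
    """Return the raw text of /proc/diskstats (Linux only)."""
    with open(path) as f:
        return f.read()


def io_ticks_ms(diskstats_text, device):
    """Return the 'milliseconds spent doing I/O' counter for one device.

    In /proc/diskstats the device name is the 3rd whitespace-separated
    field and the busy-time counter is the 13th.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def utilization_percent(before, after, device, interval_s):
    """Percentage of the sampling interval during which the device was busy."""
    busy_ms = io_ticks_ms(after, device) - io_ticks_ms(before, device)
    return 100.0 * busy_ms / (interval_s * 1000.0)
```

To use it, call `read_diskstats()`, sleep for a few seconds, call it again, and pass both snapshots plus the sleep time to `utilization_percent`. A value that stays close to 100 while the broker is catching up would support the single-disk bottleneck theory; a low value would point elsewhere.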
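The "This ZooKeeper instance is not currently serving requests" message appears when the ZooKeeper ensemble is down to a single node, or more generally when the nodes that are still reachable no longer form a majority.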
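The majority rule can be written out as a couple of lines of arithmetic. This is only an illustration of the rule; the function names are made up, and the ensemble size in this thread is not stated, so the numbers in the usage note are examples rather than the poster's setup.

```python
def quorum_size(ensemble_size):
    """Smallest number of ZooKeeper servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def can_serve(ensemble_size, reachable):
    """True if the reachable servers are enough to form a quorum."""
    return reachable >= quorum_size(ensemble_size)
```

For example, a 3-node ensemble needs 2 reachable nodes, so `can_serve(3, 1)` is false. An ensemble with exactly half of its nodes left, such as `can_serve(4, 2)`, also cannot serve.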
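Has the throughput really reached the bottleneck at this volume? Once the broker comes back, the partition leaders have all moved to other brokers, so in theory the only traffic left on it is replication traffic. If you still think it is hitting the I/O limit, consider using Kafka's replication throttling.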
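As a rough sketch of what that throttling could look like, the helper below only builds a `kafka-configs.sh` command line that sets the broker-level `leader.replication.throttled.rate` and `follower.replication.throttled.rate`; it does not run anything. The helper name, the ZooKeeper address, and the rate are placeholders, and the exact invocation should be checked against the tool shipped with your Kafka version.

```python
def throttle_command(broker_id, bytes_per_sec, zookeeper="localhost:2181"):
    """Build (but do not run) a kafka-configs.sh call that caps replication rate.

    The zookeeper address and the rate are placeholders for illustration.
    """
    config = (
        f"leader.replication.throttled.rate={bytes_per_sec},"
        f"follower.replication.throttled.rate={bytes_per_sec}"
    )
    return [
        "bin/kafka-configs.sh",
        "--zookeeper", zookeeper,
        "--alter",
        "--add-config", config,
        "--entity-type", "brokers",
        "--entity-name", str(broker_id),
    ]


# Example: cap broker 5 at a hypothetical 50 MB/s.
print(" ".join(throttle_command(5, 50 * 1024 * 1024)))
```

Note that the rate only applies to replicas listed in the topic-level `leader.replication.throttled.replicas` and `follower.replication.throttled.replicas` settings, so those have to be set as well for the throttle to take effect.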