Server information:
- Kafka version: kafka_2.11-0.10.1.0
- JVM version: java version "1.7.0_67"
- Kafka max open files limit: 65535
Kafka configuration:
broker.id=23
listeners=PLAINTEXT://192.168.1.100:9092
num.network.threads=33
num.io.threads=64
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/disk01/,/data/disk02/,/data/disk03/,/data/disk04/,/data/disk05/,/data/disk06/,/data/disk07/,/data/disk08/,/data/disk09/,/data/disk10/,/data/disk11/,/data/disk12/
num.partitions=20
num.recovery.threads.per.data.dir=10
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.flush.scheduler.interval.ms=2000
log.retention.hours=24
log.roll.hours=24
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleanup.interval.mins=10
log.cleaner.enable=false
message.max.bytes=20000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
replica.fetch.max.bytes=104857600
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
auto.leader.rebalance.enable=true
delete.topic.enable=true
log.message.format.version=0.8.2
zookeeper.connect=192.168.1.101:2181,192.168.1.102:2181,192.168.1.103:2181,192.168.1.104:2181,192.168.1.105:2181
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000
Problem description:
Recently, brokers in our production Kafka cluster have been shutting down frequently because of too many open files.
Each incident affects exactly one broker, but not always the same one, and the logs contain a large number of errors.
Before the too-many-open-files errors there are many INFO messages like the following; the ISR shrinks from several brokers to just the failing broker itself:
Shrinking ISR for partition [some_topic,13] from 23,17 to 17
Shrinking ISR for partition [some_topic2,12] from 22,23,17 to 17
Shrinking ISR for partition [some_topic3,11] from 20,23,17 to 17
Shrinking ISR for partition [some_topic4,12] from 19,25,17 to 17
The too-many-open-files errors are also preceded by a flood of entries like:
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic2,18] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,5] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic19,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
Then come the too-many-open-files errors themselves, in large numbers:
[2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
        at kafka.network.Acceptor.accept(SocketServer.scala:326)
        at kafka.network.Acceptor.run(SocketServer.scala:269)
        at java.lang.Thread.run(Thread.java:745)
(this entry repeats many times)
Preliminary analysis:
- Checked the file system with smartctl; the file system is healthy
- Monitored Kafka's open file count and found a huge number of sockets in CLOSE_WAIT: roughly 1,000 client IPs with CLOSE_WAIT connections, some IPs holding 500+ connections each, and around 50 IPs with more than 1,000 connections
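For reference, the per-IP CLOSE_WAIT breakdown above can be produced with a one-liner like this (a sketch; the column positions assume GNU netstat output, where $5 is the foreign address and $6 the state):

```shell
# Count CLOSE_WAIT sockets per remote IP, highest counts first.
netstat -ant \
  | awk '$6 == "CLOSE_WAIT" { n = split($5, a, ":"); print a[n-1] }' \
  | sort | uniq -c | sort -rn | head -20
```

Splitting the foreign address on ":" and taking the next-to-last field keeps the IP intact even for IPv6-mapped addresses like ::ffff:192.39.24.19:54740.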
Problem solved. It was a Kafka bug:
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
Upgrading to kafka_2.11-0.10.1.1 fixes it.
[2018-10-09 11:10:00,304] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:33414-440063 (kafka.network.Processor)
[2018-10-09 11:35:00,105] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:56600-440875 (kafka.network.Processor)
[2018-10-09 12:05:00,098] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54642-441768 (kafka.network.Processor)
[2018-10-09 13:25:00,266] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:37534-444049 (kafka.network.Processor)
[2018-10-09 14:18:37,630] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34310-445956 (kafka.network.Processor)
[2018-10-09 14:25:00,190] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:39004-446107 (kafka.network.Processor)
[2018-10-09 15:40:02,721] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54002-448589 (kafka.network.Processor)
[2018-10-09 15:50:00,153] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34168-448891 (kafka.network.Processor)
I'm seeing these warnings too. Not sure whether you ever resolved them?
Has this warning been resolved?
OP, did you solve the problem?
Hello, is the problem solved now? I've run into it recently as well.
I'm on the CDH distribution, CDK 3.0.0, which corresponds to Kafka 0.11.0. We've also been hitting this problem frequently lately. Do I need to upgrade to 1.0?
Upgrading to kafka_2.11-0.10.1.1 fixes it.
Hello, is the problem solved now? I've run into it recently as well.
Problem solved. It was a Kafka bug:
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
Upgrading to kafka_2.11-0.10.1.1 fixes it.
My version is kafka_2.11-0.10.1.0.
Could you share your contact info and help take a look?
1. Raise the system open-file limit.
2. A client should normally hold just one long-lived connection. It looks like your connection-acquisition code may be inside a loop; check that.
Thanks, the open-file limit has already been raised twice.
lsof -p pid|wc -l
That's how I counted. Still failing after raising the limit?
If the open-file limit has been raised, check that it applies to the user that starts Kafka.
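One way to verify that, sketched below; the pgrep pattern for finding the broker process is an assumption, adjust for your setup:

```shell
# Check the open-file limit of the *running* broker process: raising
# limits.conf does not affect a process that was started earlier.
PID=$(pgrep -f kafka.Kafka | head -1)
grep 'Max open files' "/proc/$PID/limits"   # effective soft/hard limit
ls "/proc/$PID/fd" | wc -l                  # file descriptors currently open
```

If the limit shown here is still the old value, the broker needs to be restarted under the new limits.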
500 connections per IP?
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189987 -> socket:[3517032353]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189988 -> socket:[3517032354]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189989 -> socket:[3517032355]
lrwx------ 1 kafka kafka 64 Aug 20 07:39 18999 -> socket:[3516466241]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189990 -> socket:[3517032356]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189991 -> socket:[3517032357]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189992 -> socket:[3517032358]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189993 -> socket:[3517032359]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189994 -> socket:[3517032360]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189995 -> socket:[3517032361]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189996 -> socket:[3517032362]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189997 -> socket:[3517032363]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189998 -> socket:[3517032364]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189999 -> socket:[3517032365]
Yes. Each IP has more than 1,000 connections in TIME_WAIT.
Counting file descriptors under /proc/pid/fd also shows a very large number. Partial netstat -aupt output:
tcp 1481 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.25.42:35271 CLOSE_WAIT 126050/java
tcp 770 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.24.19:54740 CLOSE_WAIT 126050/java
tcp 232 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.110.10:18666 CLOSE_WAIT 126050/java
tcp 509 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.73:52718 CLOSE_WAIT 126050/java
tcp 204 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.31.79:23604 CLOSE_WAIT 126050/java
tcp 377 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.115:49461 CLOSE_WAIT 126050/java
tcp 958 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.18.44:46041 CLOSE_WAIT -
tcp 877 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.27:32511 CLOSE_WAIT 126050/java
tcp 226 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.80:63607 CLOSE_WAIT 126050/java
tcp 30207 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:172.21.31.56:53851 CLOSE_WAIT 126050/java
tcp 854 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.34.56:12484 CLOSE_WAIT 126050/java
tcp 447 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.13.25:32355 CLOSE_WAIT 126050/java
Upgrade Java to 1.8, and turn the firewall off.
Watch a particular connection on one machine and see whether it drops on its own after a while.
Thanks, the firewall is already off.
Thanks for the ideas.
Did you find the cause?
Not yet. It's similar to the following issue; probably a Kafka bug:
https://issues.apache.org/jira/browse/KAFKA-4477
kafka020 [2018-08-24 08:39:04,559] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1b37b18 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka025 [2018-08-24 08:39:04,582] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@544c05bb (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka024 [2018-08-24 08:32:24,253] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7059a711 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka019 [2018-08-24 08:39:04,566] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@37f0b8a1 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka016 [2018-08-24 08:39:04,571] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7af80ad4 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka017 [2018-08-24 08:39:04,572] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@3caa6934 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka018 [2018-08-24 08:39:04,583] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@659c8509 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka022 [2018-08-24 08:32:24,088] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@74478c1b (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
Can you reproduce the scenario?
It now recurs about every two days; we monitor the file count and restart the affected node.
I don't know how to reproduce it manually.
What about the earlier suggestion of monitoring whether a particular connection on a particular machine gets reconnected?
Also, I'd suggest upgrading Kafka; I've hit two disaster-level problems on 0.10.
We only monitor Kafka itself; the producer side isn't monitored.
Can you elaborate?
1. Clients open many connections to the Kafka cluster for consuming and producing. Watch some of the client-to-Kafka connections: normally they are long-lived and stable, so check periodically whether any connections churn.
2. The error details aren't on this machine. One problem was consumers failing en masse when committing offsets; the other was continuous consumer-group rebalancing.
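The periodic check in point 1 could be scripted roughly like this (a sketch; the pgrep pattern and snapshot paths are assumptions), run for example from cron every minute:

```shell
# Snapshot the broker's ESTABLISHED peer addresses and diff against the
# previous snapshot; any churn (drop + reconnect) shows up in the diff.
PID=$(pgrep -f kafka.Kafka | head -1)
CUR=/tmp/kafka_conns.cur
PREV=/tmp/kafka_conns.prev
netstat -antp 2>/dev/null \
  | awk -v p="$PID/" '$6 == "ESTABLISHED" && index($7, p) == 1 { print $5 }' \
  | sort > "$CUR"
if [ -f "$PREV" ] && ! diff -q "$PREV" "$CUR" > /dev/null; then
  echo "connection set changed at $(date)"
  diff "$PREV" "$CUR"
fi
mv "$CUR" "$PREV"
```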
I've set up monitoring of the client connections.
It happened again recently.
Monitoring result: the client-side connection count shows no obvious change, and those connections are all ESTABLISHED, but the affected Kafka server has a large number of CLOSE_WAIT connections.
Comparing against normal, the connection count during the incident is roughly 7-8x the usual level, and the monitoring shows it climbing from about 20,000 to 30,000 in around two minutes.
My analysis: when a client can't reach the server, it disconnects and reconnects, repeating until it finally succeeds, but the server never releases the earlier failed connections.
Is it solved?
No; it's probably a version issue.
We set up connection monitoring that automatically restarts a node once it exceeds 30,000 connections.
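That kind of threshold check might look like the following (a sketch; the restart command is a placeholder for whatever service manager is in use):

```shell
# Restart the broker once total TCP connections pass a threshold
# (30000, matching the figure mentioned above).
THRESHOLD=30000
COUNT=$(ss -tan | tail -n +2 | wc -l)   # skip the header line
if [ "$COUNT" -gt "$THRESHOLD" ]; then
  echo "$(date): $COUNT connections, restarting broker"
  # systemctl restart kafka   # placeholder restart command
fi
```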
Our nodes can't just be restarted at will; real-time jobs are running on them, which is another problem. Following 半兽人's advice, I'm going to try upgrading the clients.
I see. Ours is mainly log data; a restart causes message backlog.
Upgrade the clients? Can you elaborate?
I don't think the problem is on the client side; it's the server that never closes the connections.
Well, my version is already 2.11_1.0.0, and I don't have much of a lead either.
I've already tuned reconnect_backoff_ms, max_poll_records, and heartbeat_interval_ms, but the problem still occurs. This cluster also has another issue: one consumer group keeps rebalancing, once every 2-4 minutes, which is far too frequent.
Using Kafka 2.1.0, and seeing a similar situation:
The cluster can't reach one of the brokers, and its traffic drops to zero;
but the abnormal node is never kicked out: describe on the topics shows all ISRs in-sync, and the machine's file-handle count keeps growing.
The node has to be restarted to recover.
Hello, is this problem solved now?
Hello, is this problem solved now?