Server information:
- Kafka version: kafka_2.11-0.10.1.0
- JVM version: java version "1.7.0_67"
- Kafka max open files limit: 65535
Kafka configuration:
broker.id=23
listeners=PLAINTEXT://192.168.1.100:9092
num.network.threads=33
num.io.threads=64
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/disk01/,/data/disk02/,/data/disk03/,/data/disk04/,/data/disk05/,/data/disk06/,/data/disk07/,/data/disk08/,/data/disk09/,/data/disk10/,/data/disk11/,/data/disk12/
num.partitions=20
num.recovery.threads.per.data.dir=10
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.flush.scheduler.interval.ms=2000
log.retention.hours=24
log.roll.hours=24
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleanup.interval.mins=10
log.cleaner.enable=false
message.max.bytes=20000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
replica.fetch.max.bytes=104857600
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
auto.leader.rebalance.enable=true
delete.topic.enable=true
log.message.format.version=0.8.2
zookeeper.connect=192.168.1.101:2181,192.168.1.102:2181,192.168.1.103:2181,192.168.1.104:2181,192.168.1.105:2181
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000
Problem description:
Recently, brokers in our production Kafka cluster have been shutting down frequently because of too many open files.
Each incident affects exactly one broker, but not always the same one, and the logs contain a large number of errors.
Before the too-many-open-files errors there are many INFO messages like the following; the ISR shrinks from several brokers to just the failing broker itself:
Shrinking ISR for partition [some_topic,13] from 23,17 to 17
Shrinking ISR for partition [some_topic2,12] from 22,23,17 to 17
Shrinking ISR for partition [some_topic3,11] from 20,23,17 to 17
Shrinking ISR for partition [some_topic4,12] from 19,25,17 to 17
The too-many-open-files errors are also preceded by a flood of entries like:
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic2,18] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,5] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic19,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
Then come the too-many-open-files errors themselves, in large numbers:
[2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
        at kafka.network.Acceptor.accept(SocketServer.scala:326)
        at kafka.network.Acceptor.run(SocketServer.scala:269)
        at java.lang.Thread.run(Thread.java:745)
(this entry repeats many times)
Preliminary analysis:
- Checked the file system with smartctl; the file system is healthy
- Monitored Kafka's open file count and found a huge number of sockets in CLOSE_WAIT: roughly 1,000 client IPs with CLOSE_WAIT connections, some IPs holding 500+ connections each, and around 50 IPs with more than 1,000 connections
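For reference, the per-IP CLOSE_WAIT breakdown above can be produced with a one-liner like this (a sketch; the column positions assume GNU netstat output, where $5 is the foreign address and $6 the state):

```shell
# Count CLOSE_WAIT sockets per remote IP, highest counts first.
netstat -ant \
  | awk '$6 == "CLOSE_WAIT" { n = split($5, a, ":"); print a[n-1] }' \
  | sort | uniq -c | sort -rn | head -20
```

Splitting the foreign address on ":" and taking the next-to-last field keeps the IP intact even for IPv6-mapped addresses like ::ffff:192.39.24.19:54740.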
Problem solved. It was a Kafka bug:
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
Upgrading to kafka_2.11-0.10.1.1 fixes it.
[2018-10-09 11:10:00,304] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:33414-440063 (kafka.network.Processor)
[2018-10-09 11:35:00,105] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:56600-440875 (kafka.network.Processor)
[2018-10-09 12:05:00,098] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54642-441768 (kafka.network.Processor)
[2018-10-09 13:25:00,266] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:37534-444049 (kafka.network.Processor)
[2018-10-09 14:18:37,630] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34310-445956 (kafka.network.Processor)
[2018-10-09 14:25:00,190] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:39004-446107 (kafka.network.Processor)
[2018-10-09 15:40:02,721] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54002-448589 (kafka.network.Processor)
[2018-10-09 15:50:00,153] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34168-448891 (kafka.network.Processor)
I'm seeing these warnings too. Not sure whether you ever resolved them?
Has this warning been resolved?
OP, did you solve the problem?
Hello, is the problem solved now? I've run into it recently as well.
I'm on the CDH distribution, CDK 3.0.0, which corresponds to Kafka 0.11.0. We've also been hitting this problem frequently lately. Do I need to upgrade to 1.0?
Upgrading to kafka_2.11-0.10.1.1 fixes it.
Hello, is the problem solved now? I've run into it recently as well.
Problem solved. It was a Kafka bug:
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
Upgrading to kafka_2.11-0.10.1.1 fixes it.
My version is kafka_2.11-0.10.1.0.
Could you share your contact info and help take a look?
1. Raise the system open-file limit.
2. A client should normally hold just one long-lived connection. It looks like your connection-acquisition code may be inside a loop; check that.
Thanks, the open-file limit has already been raised twice.
lsof -p pid|wc -l
That's how I counted. Still failing after raising the limit?
If the open-file limit has been raised, check that it applies to the user that starts Kafka.
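One way to verify that, sketched below; the pgrep pattern for finding the broker process is an assumption, adjust for your setup:

```shell
# Check the open-file limit of the *running* broker process: raising
# limits.conf does not affect a process that was started earlier.
PID=$(pgrep -f kafka.Kafka | head -1)
grep 'Max open files' "/proc/$PID/limits"   # effective soft/hard limit
ls "/proc/$PID/fd" | wc -l                  # file descriptors currently open
```

If the limit shown here is still the old value, the broker needs to be restarted under the new limits.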
500 connections per IP?
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189987 -> socket:[3517032353]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189988 -> socket:[3517032354]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189989 -> socket:[3517032355]
lrwx------ 1 kafka kafka 64 Aug 20 07:39 18999 -> socket:[3516466241]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189990 -> socket:[3517032356]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189991 -> socket:[3517032357]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189992 -> socket:[3517032358]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189993 -> socket:[3517032359]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189994 -> socket:[3517032360]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189995 -> socket:[3517032361]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189996 -> socket:[3517032362]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189997 -> socket:[3517032363]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189998 -> socket:[3517032364]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189999 -> socket:[3517032365]
Yes. Each IP has more than 1,000 connections in TIME_WAIT.
Counting file descriptors under /proc/pid/fd also shows a very large number. Partial netstat -aupt output:
tcp 1481 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.25.42:35271 CLOSE_WAIT 126050/java
tcp 770 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.24.19:54740 CLOSE_WAIT 126050/java
tcp 232 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.110.10:18666 CLOSE_WAIT 126050/java
tcp 509 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.73:52718 CLOSE_WAIT 126050/java
tcp 204 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.31.79:23604 CLOSE_WAIT 126050/java
tcp 377 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.115:49461 CLOSE_WAIT 126050/java
tcp 958 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.18.44:46041 CLOSE_WAIT -
tcp 877 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.27:32511 CLOSE_WAIT 126050/java
tcp 226 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.80:63607 CLOSE_WAIT 126050/java
tcp 30207 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:172.21.31.56:53851 CLOSE_WAIT 126050/java
tcp 854 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.34.56:12484 CLOSE_WAIT 126050/java
tcp 447 0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.13.25:32355 CLOSE_WAIT 126050/java
Upgrade Java to 1.8, and turn the firewall off.
Watch a particular connection on one machine and see whether it drops on its own after a while.
Thanks, the firewall is already off.
Thanks for the ideas.
Did you find the cause?
Not yet. It's similar to the following issue; probably a Kafka bug:
https://issues.apache.org/jira/browse/KAFKA-4477
kafka020 [2018-08-24 08:39:04,559] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1b37b18 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka025 [2018-08-24 08:39:04,582] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@544c05bb (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka024 [2018-08-24 08:32:24,253] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7059a711 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka019 [2018-08-24 08:39:04,566] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@37f0b8a1 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka016 [2018-08-24 08:39:04,571] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7af80ad4 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka017 [2018-08-24 08:39:04,572] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@3caa6934 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka018 [2018-08-24 08:39:04,583] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@659c8509 (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
kafka022 [2018-08-24 08:32:24,088] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@74478c1b (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 23 was disconnected before the response was read
Can you reproduce the scenario?
It now recurs about every two days; we monitor the file count and restart the affected node.
I don't know how to reproduce it manually.
What about the earlier suggestion of monitoring whether a particular connection on a particular machine gets reconnected?
Also, I'd suggest upgrading Kafka; I've hit two disaster-level problems on 0.10.
We only monitor Kafka itself; the producer side isn't monitored.
Can you elaborate?
1. Clients open many connections to the Kafka cluster for consuming and producing. Watch some of the client-to-Kafka connections: normally they are long-lived and stable, so check periodically whether any connections churn.
2. The error details aren't on this machine. One problem was consumers failing en masse when committing offsets; the other was continuous consumer-group rebalancing.
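The periodic check in point 1 could be scripted roughly like this (a sketch; the pgrep pattern and snapshot paths are assumptions), run for example from cron every minute:

```shell
# Snapshot the broker's ESTABLISHED peer addresses and diff against the
# previous snapshot; any churn (drop + reconnect) shows up in the diff.
PID=$(pgrep -f kafka.Kafka | head -1)
CUR=/tmp/kafka_conns.cur
PREV=/tmp/kafka_conns.prev
netstat -antp 2>/dev/null \
  | awk -v p="$PID/" '$6 == "ESTABLISHED" && index($7, p) == 1 { print $5 }' \
  | sort > "$CUR"
if [ -f "$PREV" ] && ! diff -q "$PREV" "$CUR" > /dev/null; then
  echo "connection set changed at $(date)"
  diff "$PREV" "$CUR"
fi
mv "$CUR" "$PREV"
```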
I've set up monitoring of the client connections.
It happened again recently.
Monitoring result: the client-side connection count shows no obvious change, and those connections are all ESTABLISHED, but the affected Kafka server has a large number of CLOSE_WAIT connections.
Comparing against normal, the connection count during the incident is roughly 7-8x the usual level, and the monitoring shows it climbing from about 20,000 to 30,000 in around two minutes.
My analysis: when a client can't reach the server, it disconnects and reconnects, repeating until it finally succeeds, but the server never releases the earlier failed connections.
Is it solved?
No; it's probably a version issue.
We set up connection monitoring that automatically restarts a node once it exceeds 30,000 connections.
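That kind of threshold check might look like the following (a sketch; the restart command is a placeholder for whatever service manager is in use):

```shell
# Restart the broker once total TCP connections pass a threshold
# (30000, matching the figure mentioned above).
THRESHOLD=30000
COUNT=$(ss -tan | tail -n +2 | wc -l)   # skip the header line
if [ "$COUNT" -gt "$THRESHOLD" ]; then
  echo "$(date): $COUNT connections, restarting broker"
  # systemctl restart kafka   # placeholder restart command
fi
```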
Our nodes can't just be restarted at will; real-time jobs are running on them, which is another problem. Following 半兽人's advice, I'm going to try upgrading the clients.
I see. Ours is mainly log data; a restart causes message backlog.
Upgrade the clients? Can you elaborate?
I don't think the problem is on the client side; it's the server that never closes the connections.
Well, my version is already 2.11_1.0.0, and I don't have much of a lead either.
I've already tuned reconnect_backoff_ms, max_poll_records, and heartbeat_interval_ms, but the problem still occurs. This cluster also has another issue: one consumer group keeps rebalancing, once every 2-4 minutes, which is far too frequent.
Using Kafka 2.1.0, and seeing a similar situation:
The cluster can't reach one of the brokers, and its traffic drops to zero;
but the abnormal node is never kicked out: describe on the topics shows all ISRs in-sync, and the machine's file-handle count keeps growing.
The node has to be restarted to recover.
Hello, is this problem solved now?
Hello, is this problem solved now?