kafka报错打开文件数过多导致kafka关闭?

micmouse521 发表于: 2018-08-20   最后更新时间: 2021-09-16 10:46:48   20,856 游览

服务器信息:

  1. kafka版本:kafka_2.11-0.10.1.0
  2. jvm版本:java version "1.7.0_67"
  3. kafka最大文件数配置为: 65535

kafka配置如下

broker.id=23
listeners=PLAINTEXT://192.168.1.100:9092
num.network.threads=33
num.io.threads=64
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/disk01/,/data/disk02/,/data/disk03/,/data/disk04/,/data/disk05/,/data/disk06/,/data/disk07/,/data/disk08/,/data/disk09/,/data/disk10/,/data/disk11/,/data/disk12/
num.partitions=20
num.recovery.threads.per.data.dir=10
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.flush.scheduler.interval.ms=2000
log.retention.hours=24
log.roll.hours=24
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleanup.interval.mins=10
log.cleaner.enable=false
message.max.bytes=20000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
replica.fetch.max.bytes=104857600
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
auto.leader.rebalance.enable=true
delete.topic.enable=true
log.message.format.version=0.8.2
zookeeper.connect=192.168.1.101:2181,192.168.1.102:2181,192.168.1.103:2181,192.168.1.104:2181,192.168.1.105:2181
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

问题描述:

近期生产环境kafka集群经常发生由于打开文件数过多导致kafka关闭的情况。
每次出现问题的kafka数量为1个,每次出现问题的kafka不固定,日志中有大量报错:

  1. 打开文件过多报错前有大量如下info信息,kafka isr由多个变成出错设备自己一个

     Shrinking ISR for partition [some_topic,13] from 23,17 to 17
     Shrinking ISR for partition [some_topic2,12] from 2223,17 to 17
     Shrinking ISR for partition [some_topic3,11] from 2023,17 to 17
     Shrinking ISR for partition [some_topic4,12] from 1925,17 to 17
    
  2. 打开文件过多报错前有大量

     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic2,18] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,5] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic19,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
    
  3. 大量打开文件数过多报错

     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: 打开的文件过多
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: 打开的文件过多
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: 打开的文件过多
    

问题初步分析:

  1. smartacl检查了文件系统,文件系统正常
  2. 检控kafka打开文件数,发现是大量close-wait状态socket连接,有1000个左右close-wait状态连接的ip,每个ip有500以上的连接,其中有1000以上连接有50个左右
发表于 2018-08-20

问题已经解决.是kafka的bug
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
升级到kafka kafka_2.11-0.10.1.1版本就可以解决

2018-10-09 11:10:00,304] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:33414-440063 (kafka.network.Processor)
[2018-10-09 11:35:00,105] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:56600-440875 (kafka.network.Processor)
[2018-10-09 12:05:00,098] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54642-441768 (kafka.network.Processor)
[2018-10-09 13:25:00,266] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:37534-444049 (kafka.network.Processor)
[2018-10-09 14:18:37,630] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34310-445956 (kafka.network.Processor)
[2018-10-09 14:25:00,190] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:39004-446107 (kafka.network.Processor)
[2018-10-09 15:40:02,721] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54002-448589 (kafka.network.Processor)
[2018-10-09 15:50:00,153] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34168-448891 (kafka.network.Processor)
ighack -> 小无名者 5年前

我也出现这些警告。不知道你解决了没有

冰海落花 -> ighack 4年前

请问这个告警解决了吗

博主,问题解决了吗?

您好,现在问题解决了吗?最近我这边也遇到了

我是cdh版本的开发,CDK3.0.0,对应kafka的0.11.0版本,最近也经常出现这个问题,需要升级到1.0版本吗

升级到kafka kafka_2.11-0.10.1.1版本就可以解决

您好,现在问题解决了吗?最近我这边也遇到了

micmouse521 -> 星空 5年前

问题已经解决.是kafka的bug
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
升级到kafka kafka_2.11-0.10.1.1版本就可以解决

星空 -> micmouse521 5年前

我的版本是kafka_2.11-0.10.1.0

星空 -> 星空 5年前

大佬方便留下联系吗,帮忙看下

1、调整下系统的openfile
2、一个客户端正常的话是只建立一个长连接,看样子你是不是获取连接那块写到循环里了,检查下

micmouse521 -> 半兽人 6年前

谢谢,openfile调整过两次

  1. 另外一个报错的节点调整到65350,还是会报打开文件数过多错误
  2. 没有重复,是通过lsof -p pid|wc -l统计的。
半兽人 -> micmouse521 6年前

调整后还是不行吗?
如果openfile调整过了,看下是否和启动kafka的用户对应。

每个ip有500个连接?

micmouse521 -> 半兽人 6年前
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189987 -> socket:[3517032353]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189988 -> socket:[3517032354]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189989 -> socket:[3517032355]
lrwx------ 1 kafka kafka 64 Aug 20 07:39 18999 -> socket:[3516466241]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189990 -> socket:[3517032356]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189991 -> socket:[3517032357]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189992 -> socket:[3517032358]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189993 -> socket:[3517032359]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189994 -> socket:[3517032360]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189995 -> socket:[3517032361]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189996 -> socket:[3517032362]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189997 -> socket:[3517032363]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189998 -> socket:[3517032364]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189999 -> socket:[3517032365]
micmouse521 -> 半兽人 6年前

嗯。每个ip都有1000多个time-wait的链接
/proc/pid/fd下统计文件数,也是很多。

micmouse521 -> 半兽人 6年前

部分netstat -aupt结果

tcp     1481      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.25.42:35271    CLOSE_WAIT  126050/java
tcp      770      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.24.19:54740    CLOSE_WAIT  126050/java
tcp      232      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.110.10:18666 CLOSE_WAIT  126050/java
tcp      509      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.73:52718    CLOSE_WAIT  126050/java
tcp      204      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.31.79:23604    CLOSE_WAIT  126050/java
tcp      377      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.115:49461   CLOSE_WAIT  126050/java
tcp      958      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.18.44:46041    CLOSE_WAIT  -
tcp      877      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.27:32511    CLOSE_WAIT  126050/java
tcp      226      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.80:63607    CLOSE_WAIT  126050/java
tcp    30207      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:172.21.31.56:53851   CLOSE_WAIT  126050/java
tcp      854      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.34.56:12484    CLOSE_WAIT  126050/java
tcp      447      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.13.25:32355    CLOSE_WAIT  126050/java
半兽人 -> micmouse521 6年前

把java升级到1.8,防火墙关了吧。
你观察某台机器的某个连接,看看过段时间,连接会不会自动断开。

micmouse521 -> 半兽人 6年前

谢谢,防火墙是关闭的。
谢谢你提供的思路。

半兽人 -> micmouse521 6年前

找到问题了吗?

micmouse521 -> 半兽人 6年前

没有,和以下这个问题类似,应该是kafka的bug。
https://issues.apache.org/jira/browse/KAFKA-4477

kafka020
[2018-08-24 08:39:04,559] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1b37b18 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka025
[2018-08-24 08:39:04,582] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@544c05bb (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka024
[2018-08-24 08:32:24,253] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7059a711 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka019
[2018-08-24 08:39:04,566] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@37f0b8a1 
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read

kafka016
[2018-08-24 08:39:04,571] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7af80ad4 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka017
[2018-08-24 08:39:04,572] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@3caa6934 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka018
[2018-08-24 08:39:04,583] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@659c8509 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka022
[2018-08-24 08:32:24,088] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@74478c1b (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
半兽人 -> micmouse521 6年前

场景能重现吗?

micmouse521 -> 半兽人 6年前

现在每隔两天会重现一次,监控文件数,然后重启对应节点。
不知道怎么人工重现

半兽人 -> micmouse521 6年前

那之前提到监控某台机器的某个连接是否被重连过呢?
另外,建议你升级kafka版本,0.10版本我遇到了2次灾难级的问题了。

micmouse521 -> 半兽人 6年前

只对kafka做了监控,发送端没做监控。
能具体说明下吗。

半兽人 -> micmouse521 6年前

1、客户端大量连接kafka集群进行消费和发送,你监管一些客户端与kafka的连接,正常的话是长连接,不会变的,所以你定时监控下是否有连接变动。
2、错误详情不在这台电脑上。1个是消费者客户端commit消息的时候大量失败。2、持续的消费者群组进行重新分配。

micmouse521 -> 半兽人 6年前

对客户端监控进行了监控。
最近又出现了一次。
监控结果是:客户端连接数没有明显变化,状态都是ESTABLISHED,但是出问题的kafka服务器上有大量的CLOSE_WAIT状态连接。
我对比了一下出问题时的连接数大约是正常情况连接数的7到8倍,并且从监控来看大约2分钟连接数由20000左右涨到30000

我分析是,客户端连接不上服务器,就断开连接重新连接,这样重复几次,就成功连上服务器了,但是服务器一直不释放之前失败的连接。

问题解决了么?

没有,应该是版本问题。
做了连接监控超过3万就自动重启节点。

我们这边的节点,还不能随便重启。有实时任务在跑 这也是个问题。听了半兽人的建议,准备把客户端升级下看看

哦。我们主要是日志,重启会导致消息积压。
客户端升级?详细说明一下

感觉问题不在客户端,是服务端一直没有断开连接。

嗯 我的 版本已经是 2.11_1.0.0 了,目前也没有多大的头绪。reconnect_backoff_ms, max_poll_records, heartbeat_interval_ms 都做过调整 问题依旧出现

我的这个集群 还有另外一个 问题 就是有个 consumer  group 不断的 在做 rebalance group,每隔2,4分钟就做一次 频率太高了 

呆木头 -> 半兽人 5年前

用的kafka 2.1.0版本,出现类似的情况
集群内部连不上其中一台broker,流量掉零;
但不会将异常的节点踢出去,命令行describe看topic信息ISR都是在线的,机器的句柄数也在持续增长
需要重启节点才能恢复

你好,现在问题解决了吗

你好,现在问题解决了吗

你的答案

查看kafka相关的其他问题或提一个您自己的问题