kafka报错打开文件数过多导致kafka关闭?

micmouse521 发表于: 2018-08-20   最后更新时间: 2021-09-16 10:46:48   21,798 游览

服务器信息:

  1. kafka版本:kafka_2.11-0.10.1.0
  2. jvm版本:java version "1.7.0_67"
  3. kafka最大文件数配置为: 65535

kafka配置如下

broker.id=23
listeners=PLAINTEXT://192.168.1.100:9092
num.network.threads=33
num.io.threads=64
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/disk01/,/data/disk02/,/data/disk03/,/data/disk04/,/data/disk05/,/data/disk06/,/data/disk07/,/data/disk08/,/data/disk09/,/data/disk10/,/data/disk11/,/data/disk12/
num.partitions=20
num.recovery.threads.per.data.dir=10
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.flush.scheduler.interval.ms=2000
log.retention.hours=24
log.roll.hours=24
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleanup.interval.mins=10
log.cleaner.enable=false
message.max.bytes=20000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
replica.fetch.max.bytes=104857600
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
auto.leader.rebalance.enable=true
delete.topic.enable=true
log.message.format.version=0.8.2
zookeeper.connect=192.168.1.101:2181,192.168.1.102:2181,192.168.1.103:2181,192.168.1.104:2181,192.168.1.105:2181
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

问题描述:

近期生产环境kafka集群经常发生由于打开文件数过多导致kafka关闭的情况。
每次出现问题的kafka数量为1个,每次出现问题的kafka不固定,日志中有大量报错:

  1. 打开文件过多报错前有大量如下info信息,kafka isr由多个变成出错设备自己一个

     Shrinking ISR for partition [some_topic,13] from 23,17 to 17
     Shrinking ISR for partition [some_topic2,12] from 2223,17 to 17
     Shrinking ISR for partition [some_topic3,11] from 2023,17 to 17
     Shrinking ISR for partition [some_topic4,12] from 1925,17 to 17
    
  2. 打开文件过多报错前有大量

     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic2,18] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,5] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic19,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
    
  3. 大量打开文件数过多报错

     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: 打开的文件过多
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: 打开的文件过多
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: 打开的文件过多
    

问题初步分析:

  1. smartacl检查了文件系统,文件系统正常
  2. 检控kafka打开文件数,发现是大量close-wait状态socket连接,有1000个左右close-wait状态连接的ip,每个ip有500以上的连接,其中有1000以上连接有50个左右
发表于 2018-08-20

问题已经解决.是kafka的bug
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
升级到kafka kafka_2.11-0.10.1.1版本就可以解决

2018-10-09 11:10:00,304] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:33414-440063 (kafka.network.Processor)
[2018-10-09 11:35:00,105] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:56600-440875 (kafka.network.Processor)
[2018-10-09 12:05:00,098] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54642-441768 (kafka.network.Processor)
[2018-10-09 13:25:00,266] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:37534-444049 (kafka.network.Processor)
[2018-10-09 14:18:37,630] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34310-445956 (kafka.network.Processor)
[2018-10-09 14:25:00,190] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:39004-446107 (kafka.network.Processor)
[2018-10-09 15:40:02,721] WARN Attempting to send response via channel for which there] WARN Attempting to send response via channel for Attempting] WARN Attempting to send responseRN Attempting] WARN Attempting to send] WARN Attempting to send response] WARN Attempting to send response] WARN Attempting to send response via channel for which] WARNempting to send response via channel for which there is Attempting] WARN Attempting to send response via channel for which there Attempting] WARN Attempting to send response via channel] WARNRN Attempting to send response via channel] WARN Attempting to send response via] WARN Attempting to send response via] WARN Attempting to send response via channel for which there] WARNempting to send] WARN Attempting to send response via channel for which there is no open connection, connection] WARN Attempting to send response via channel for which there is no] WARN Attempting] WARN Attempting to send response via channel forRN Attempting] WARN Attempting to send response via] WARN Attempting to send response via channel for] WARN Attempting to send response via] WARN Attempting to send response via channel for which there is] WARN Attempting to send response via channel for which there Attempting] WARN Attempting to send response via channel for which there] WARN Attempting to send response via channel forRN Attempting to send response via] WARN Attempting to send response via] WARN Attempting to send response via channel for which there is no open connection, connection idempting to] WARN Attempting to send response via channel for whichWARN Attempting to send response via channel for which there is no] WARN Attempting to send response via channel] WARNWARN Attempting to send response via] WARN Attempting to send response via] WARN Attempting to send response via channel] WARN Attempting to send response via channel for which there is] WARNempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54002-448589 (kafka.network.Processor)
[2018-10-09 15:50:00,153] WARN] WARN Attempting to send response via channel for] WARN Attempting to send response via channel for which there] WARN Attempting to send response via channel for] WARN Attempting to send response via channel for which there] WARN Attempting to send response via channel for which there Attempting to send response via channel for which there is no open connection, connection Attempting to send response via channel for which there is no open] WARNRN Attempting] WARN Attempting to send response via channel for] WARN Attempting to send response via channel] WARN Attempting to send response via channel for which there is no open connection, connection id] WARNempting to send response via channel for which there is no Attempting] WARN Attempting to send response via channel for which Attempting] WARN Attempting to send response via channel forRN Attempting to send response via] WARN Attempting to send] WARN Attempting to send response via channel for which there] WARN] WARN Attempting to send response via channel] WARN Attempting to send response via] WARN Attempting] WARN Attempting to send responseRN Attempting] WARN Attempting to send response via] WARN Attempting to send response via] WARN Attempting to send response via] WARN Attempting to send response via channel for which thereempting to] WARN Attempting to send response via channel for whichWARN Attempting to send response via channel for which there is] WARN Attempting to send response via channel for which] WARN Attempting to send response via channel] WARN Attempting to send response via] WARN Attempting to send response via channel] WARN Attempting to send response via channel for which] WARN Attempting] WARN Attempting to send response via channel for which] WARN Attempting to send response via channel for which there is no open Attempting] WARN Attempting to send response via channelRN Attempting] WARN Attempting to send response via] WARN Attempting to send response via] WARN Attempting to send response via channel for which there is no open connection, connection idempting to] WARN Attempting to send response via channel for which there] WARN Attempting to send response via channel for which Attempting] WARN Attempting to send response via channelRN Attempting] WARN Attempting to send response via] WARN Attempting to send response via] WARN Attempting to send response via channel] WARN Attempting to send response via channel for which thereempting to] WARN Attempting to send response] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34168-448891 (kafka.network.Processor)
ighack -> 小无名者 6年前

我也出现这些警告。不知道你解决了没有

冰海落花 -> ighack 5年前

请问这个告警解决了吗

博主,问题解决了吗?

您好,现在问题解决了吗?最近我这边也遇到了

我是cdh版本的开发,CDK3.0.0,对应kafka的0.11.0版本,最近也经常出现这个问题,需要升级到1.0版本吗

升级到kafka kafka_2.11-0.10.1.1版本就可以解决

您好,现在问题解决了吗?最近我这边也遇到了

micmouse521 -> 星空 5年前

问题已经解决.是kafka的bug
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
升级到kafka kafka_2.11-0.10.1.1版本就可以解决

星空 -> micmouse521 5年前

我的版本是kafka_2.11-0.10.1.0

星空 -> 星空 5年前

大佬方便留下联系吗,帮忙看下

1、调整下系统的openfile
2、一个客户端正常的话是只建立一个长连接,看样子你是不是获取连接那块写到循环里了,检查下

micmouse521 -> 半兽人 7年前

谢谢,openfile调整过两次

  1. 另外一个报错的节点调整到65350,还是会报打开文件数过多错误
  2. 没有重复,是通过lsof -p pid|wc -l统计的。
半兽人 -> micmouse521 7年前

调整后还是不行吗?
如果openfile调整过了,看下是否和启动kafka的用户对应。

每个ip有500个连接?

micmouse521 -> 半兽人 7年前
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189987 -> socket:[3517032353]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189988 -> socket:[3517032354]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189989 -> socket:[3517032355]
lrwx------ 1 kafka kafka 64 Aug 20 07:39 18999 -> socket:[3516466241]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189990 -> socket:[3517032356]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189991 -> socket:[3517032357]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189992 -> socket:[3517032358]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189993 -> socket:[3517032359]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189994 -> socket:[3517032360]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189995 -> socket:[3517032361]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189996 -> socket:[3517032362]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189997 -> socket:[3517032363]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189998 -> socket:[3517032364]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189999 -> socket:[3517032365]
micmouse521 -> 半兽人 7年前

嗯。每个ip都有1000多个time-wait的链接
/proc/pid/fd下统计文件数,也是很多。

micmouse521 -> 半兽人 7年前

部分netstat -aupt结果

tcp     1481      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.25.42:35271    CLOSE_WAIT  126050/java
tcp      770      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.24.19:54740    CLOSE_WAIT  126050/java
tcp      232      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.110.10:18666 CLOSE_WAIT  126050/java
tcp      509      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.73:52718    CLOSE_WAIT  126050/java
tcp      204      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.31.79:23604    CLOSE_WAIT  126050/java
tcp      377      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.115:49461   CLOSE_WAIT  126050/java
tcp      958      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.18.44:46041    CLOSE_WAIT  -
tcp      877      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.27:32511    CLOSE_WAIT  126050/java
tcp      226      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.80:63607    CLOSE_WAIT  126050/java
tcp    30207      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:172.21.31.56:53851   CLOSE_WAIT  126050/java
tcp      854      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.34.56:12484    CLOSE_WAIT  126050/java
tcp      447      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.13.25:32355    CLOSE_WAIT  126050/java
半兽人 -> micmouse521 7年前

把java升级到1.8,防火墙关了吧。
你观察某台机器的某个连接,看看过段时间,连接会不会自动断开。

micmouse521 -> 半兽人 7年前

谢谢,防火墙是关闭的。
谢谢你提供的思路。

半兽人 -> micmouse521 7年前

找到问题了吗?

micmouse521 -> 半兽人 7年前

没有,和以下这个问题类似,应该是kafka的bug。
https://issues.apache.org/jira/browse/KAFKA-4477

kafka020
[2018-08-24 08:39:04,559] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1b37b18 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka025
[2018-08-24 08:39:04,582] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@544c05bb (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka024
[2018-08-24 08:32:24,253] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7059a711 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka019
[2018-08-24 08:39:04,566] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@37f0b8a1 
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read

kafka016
[2018-08-24 08:39:04,571] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7af80ad4 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka017
[2018-08-24 08:39:04,572] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@3caa6934 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka018
[2018-08-24 08:39:04,583] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@659c8509 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka022
[2018-08-24 08:32:24,088] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@74478c1b (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
半兽人 -> micmouse521 7年前

场景能重现吗?

micmouse521 -> 半兽人 7年前

现在每隔两天会重现一次,监控文件数,然后重启对应节点。
不知道怎么人工重现

半兽人 -> micmouse521 7年前

那之前提到监控某台机器的某个连接是否被重连过呢?
另外,建议你升级kafka版本,0.10版本我遇到了2次灾难级的问题了。

micmouse521 -> 半兽人 7年前

只对kafka做了监控,发送端没做监控。
能具体说明下吗。

半兽人 -> micmouse521 7年前

1、客户端大量连接kafka集群进行消费和发送,你监管一些客户端与kafka的连接,正常的话是长连接,不会变的,所以你定时监控下是否有连接变动。
2、错误详情不在这台电脑上。1个是消费者客户端commit消息的时候大量失败。2、持续的消费者群组进行重新分配。

micmouse521 -> 半兽人 7年前

对客户端监控进行了监控。
最近又出现了一次。
监控结果是:客户端连接数没有明显变化,状态都是ESTABLISHED,但是出问题的kafka服务器上有大量的CLOSE_WAIT状态连接。
我对比了一下出问题时的连接数大约是正常情况连接数的7到8倍,并且从监控来看大约2分钟连接数由20000左右涨到30000

我分析是,客户端连接不上服务器,就断开连接重新连接,这样重复几次,就成功连上服务器了,但是服务器一直不释放之前失败的连接。

问题解决了么?

没有,应该是版本问题。
做了连接监控超过3万就自动重启节点。

我们这边的节点,还不能随便重启。有实时任务在跑 这也是个问题。听了半兽人的建议,准备把客户端升级下看看

哦。我们主要是日志,重启会导致消息积压。
客户端升级?详细说明一下

感觉问题不在客户端,是服务端一直没有断开连接。

嗯 我的 版本已经是 2.11_1.0.0 了,目前也没有多大的头绪。reconnect_backoff_ms, max_poll_records, heartbeat_interval_ms 都做过调整 问题依旧出现

我的这个集群 还有另外一个 问题 就是有个 consumer  group 不断的 在做 rebalance group,每隔2,4分钟就做一次 频率太高了 

呆木头 -> 半兽人 6年前

用的kafka 2.1.0版本,出现类似的情况
集群内部连不上其中一台broker,流量掉零;
但不会将异常的节点踢出去,命令行describe看topic信息ISR都是在线的,机器的句柄数也在持续增长
需要重启节点才能恢复

你好,现在问题解决了吗

你好,现在问题解决了吗

你的答案

查看kafka相关的其他问题或提一个您自己的问题