Kafka shuts down because of "too many open files" errors?

  1. Kafka version: kafka_2.11-0.10.1.0
  2. JVM version: java version "1.7.0_67"
  3. Maximum open-file limit configured for Kafka: 65535
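
The 65535 limit is what was configured, but limits set in `/etc/security/limits.conf` do not always apply to daemons started by init/systemd. A minimal sketch for confirming the limit the broker process actually runs with (`/proc/self` is used here as a stand-in so the command runs anywhere; substitute the broker's PID in practice):

```shell
# Read the effective soft/hard open-file limits from /proc.
# /proc/self stands in for /proc/<kafka-pid> here.
awk '/Max open files/ {print "soft:", $4, "hard:", $5}' /proc/self/limits
```

If the broker runs under systemd, the persistent fix is usually `LimitNOFILE=` in the service unit rather than `ulimit` in a login shell.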

Kafka configuration:

=================

broker.id=23
listeners=PLAINTEXT://192.168.1.100:9092
num.network.threads=33
num.io.threads=64
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/disk01/,/data/disk02/,/data/disk03/,/data/disk04/,/data/disk05/,/data/disk06/,/data/disk07/,/data/disk08/,/data/disk09/,/data/disk10/,/data/disk11/,/data/disk12/
num.partitions=20
num.recovery.threads.per.data.dir=10
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.flush.scheduler.interval.ms=2000
log.retention.hours=24
log.roll.hours=24
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleanup.interval.mins=10
log.cleaner.enable=false
message.max.bytes=20000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
replica.fetch.max.bytes=104857600
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
auto.leader.rebalance.enable=true
delete.topic.enable=true
log.message.format.version=0.8.2
zookeeper.connect=192.168.1.101:2181,192.168.1.102:2181,192.168.1.103:2181,192.168.1.104:2181,192.168.1.105:2181
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

=================

Problem description:

Recently, brokers in our production Kafka cluster have repeatedly shut down because of "too many open files" errors.
Each incident affects exactly one broker, and it is not always the same broker. The logs contain large numbers of errors:

  1. Before the too-many-open-files errors there are many INFO messages like the following; the ISR shrinks from several brokers down to only the failing broker itself:

     Shrinking ISR for partition [some_topic,13] from 23,17 to 17
     Shrinking ISR for partition [some_topic2,12] from 2223,17 to 17
     Shrinking ISR for partition [some_topic3,11] from 2023,17 to 17
     Shrinking ISR for partition [some_topic4,12] from 1925,17 to 17
    
  2. Before the too-many-open-files errors there are also many replica-fetcher errors:

     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic2,18] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,5] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic19,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
    
  3. Then large numbers of too-many-open-files errors:

     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: Too many open files
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: Too many open files
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: Too many open files
    

Preliminary analysis:

  1. Checked the file system (smartctl disk check); the file system is healthy.
  2. Monitored Kafka's open file count and found a huge number of sockets in CLOSE_WAIT state: roughly 1,000 remote IPs with CLOSE_WAIT connections, each with 500 or more connections, and about 50 of those IPs with more than 1,000 connections each.
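
The CLOSE_WAIT distribution above can be reproduced with a short pipeline. A hedged sketch, assuming `netstat -ant` output where column 5 is the foreign address and column 6 the state; the `sub()` strips the trailing `:port` so IPv6-mapped addresses such as `::ffff:192.168.25.42` group correctly:

```shell
# Count CLOSE_WAIT connections per remote IP, worst offenders first.
netstat -ant 2>/dev/null | awk '$6 == "CLOSE_WAIT" {
    ip = $5; sub(/:[0-9]+$/, "", ip); print ip
}' | sort | uniq -c | sort -rn | head -20
```

On newer systems `ss -tan state close-wait` yields the same data faster.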

Posted: 1 month ago   Last updated: 1 month ago   Views: 312

  • 1. Raise the system open-file limit (nofile).
    2. A well-behaved client normally keeps just one long-lived connection. It looks like your client code may be opening connections inside a loop; check for that.
    • Thanks.
      The open-file limit has already been raised twice:
      1. Another broker that hit this error was raised to 65350 and still ran into the too-many-open-files error.
      2. There is no double counting; the number comes from lsof -p pid|wc -l.
        • lrwx------ 1 kafka kafka 64 Aug 20 07:45 189987 -> socket:[3517032353]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189988 -> socket:[3517032354]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189989 -> socket:[3517032355]
          lrwx------ 1 kafka kafka 64 Aug 20 07:39 18999 -> socket:[3516466241]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189990 -> socket:[3517032356]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189991 -> socket:[3517032357]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189992 -> socket:[3517032358]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189993 -> socket:[3517032359]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189994 -> socket:[3517032360]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189995 -> socket:[3517032361]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189996 -> socket:[3517032362]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189997 -> socket:[3517032363]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189998 -> socket:[3517032364]
          lrwx------ 1 kafka kafka 64 Aug 20 07:45 189999 -> socket:[3517032365]
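
The lsof total above mixes sockets with log-segment and index files. A hedged sketch that classifies a process's descriptors straight from `/proc`, to confirm that the growth is network connections rather than log files (`$$`, the current shell, stands in for the broker's PID so the command runs as-is):

```shell
# Count open descriptors by kind: "socket:" links vs. everything else.
# Substitute the Kafka broker's PID for $$ in practice.
pid=$$
ls -l "/proc/$pid/fd" | awk 'NR > 1 {
    kind = ($NF ~ /^socket:/) ? "socket" : "other"
    n[kind]++
} END {for (k in n) print n[k], k}'
```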
            • Partial netstat -aupt output:
              tcp     1481      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.25.42:35271    CLOSE_WAIT  126050/java
              tcp      770      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.24.19:54740    CLOSE_WAIT  126050/java
              tcp      232      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.110.10:18666 CLOSE_WAIT  126050/java
              tcp      509      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.73:52718    CLOSE_WAIT  126050/java
              tcp      204      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.31.79:23604    CLOSE_WAIT  126050/java
              tcp      377      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.115:49461   CLOSE_WAIT  126050/java
              tcp      958      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.18.44:46041    CLOSE_WAIT  -
              tcp      877      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.27:32511    CLOSE_WAIT  126050/java
              tcp      226      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.80:63607    CLOSE_WAIT  126050/java
              tcp    30207      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:172.21.31.56:53851   CLOSE_WAIT  126050/java
              tcp      854      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.34.56:12484    CLOSE_WAIT  126050/java
              tcp      447      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.13.25:32355    CLOSE_WAIT  126050/java
                • No. It looks like the same problem as the issue below; it is probably a Kafka bug:
                  https://issues.apache.org/jira/browse/KAFKA-4477

                  kafka020
                  [2018-08-24 08:39:04,559] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1b37b18 (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read
                  kafka025
                  [2018-08-24 08:39:04,582] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@544c05bb (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read
                  kafka024
                  [2018-08-24 08:32:24,253] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7059a711 (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read
                  kafka019
                  [2018-08-24 08:39:04,566] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@37f0b8a1 (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read

                  kafka016
                  [2018-08-24 08:39:04,571] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7af80ad4 (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read
                  kafka017
                  [2018-08-24 08:39:04,572] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@3caa6934 (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read
                  kafka018
                  [2018-08-24 08:39:04,583] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@659c8509 (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read
                  kafka022
                  [2018-08-24 08:32:24,088] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@74478c1b (kafka.server.ReplicaFetcherThread)
                  java.io.IOException: Connection to 23 was disconnected before the response was read
                    • 1. With many clients connecting to the cluster to produce and consume, monitor some clients' connections to Kafka: a healthy client keeps a long-lived connection that does not change, so check periodically whether connections are churning.
                      2. The error details are not on this machine. One symptom was consumers failing in large numbers when committing offsets; the other was continuous consumer-group rebalancing.
                        • We set up monitoring of the client connections.
                          The problem occurred again recently.
                          The monitoring shows that the clients' connection counts did not change noticeably and all of their connections were in ESTABLISHED state, but the affected Kafka broker had a large number of CLOSE_WAIT connections.
                          Compared with normal operation, the connection count during the incident was roughly 7 to 8 times higher, and the monitoring shows it climbed from about 20,000 to about 30,000 in roughly 2 minutes.

                          My analysis: when a client cannot reach the broker, it drops the connection and reconnects; after a few retries it connects successfully, but the broker never releases the earlier, failed connections.
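
If that analysis is right, logging the broker's CLOSE_WAIT count over time should make the ramp-up (about 20,000 to 30,000 in two minutes here) visible. A hedged sketch; `count_close_wait` is a hypothetical helper, and the `126050/java` process column matches the `netstat -aupt` output earlier in the thread:

```shell
# Count CLOSE_WAIT sockets belonging to one process, identified by PID.
count_close_wait() {
    netstat -antp 2>/dev/null | awk -v pid="$1" \
        '$6 == "CLOSE_WAIT" && $7 ~ pid"/" {n++} END {print n + 0}'
}

# Example usage: sample every 30 seconds (substitute the broker's PID):
# while sleep 30; do echo "$(date '+%T') $(count_close_wait 126050)"; done
```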
