Kafka version: 2.11-0.10.1.0
1. The production Kafka cluster gets scanned regularly by NSFOCUS (绿盟). Reproducing the scan locally with nmap against the Kafka port triggers an OOM almost immediately. The likely cause is that nmap probes the port with its built-in service-detection payloads, and none of the bytes it sends form a valid Kafka request; still, it feels like this may be a Kafka bug.
2. Without changing buffer.memory, a single scan is enough to cause the OOM. Increasing buffer.memory lets the broker survive a few more scans, but that is not a real fix, and putting a firewall in front is too expensive because a lot of the security-scanning traffic legitimately needs to reach these hosts.
nmap scan
nmap -p 9092 -T4 -A -v 172.17.1.6
Kafka error log
[2021-08-04 22:00:01,092] ERROR Closing socket for 172.17.1.6:6667-172.17.1.1:47865 because of error (kafka.network.Processor)
org.apache.kafka.common.errors.InvalidRequestException: Error parsing request header. Our best guess of the apiKey is: 27265
Caused by: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'client_id': Error reading string of length 513, only 103 bytes available
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
at org.apache.kafka.common.requests.RequestHeader.parse(RequestHeader.java:80)
at kafka.network.RequestChannel$Request.liftedTree1$1(RequestChannel.scala:82)
at kafka.network.RequestChannel$Request.<init>(RequestChannel.scala:82)
at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:492)
at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:487)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at kafka.network.Processor.processCompletedReceives(SocketServer.scala:487)
at kafka.network.Processor.run(SocketServer.scala:417)
at java.lang.Thread.run(Thread.java:748)
[2021-08-04 22:00:01,094] ERROR Closing socket for 172.17.1.6:6667-172.17.1.1:47867 because of error (kafka.network.Processor)
org.apache.kafka.common.errors.InvalidRequestException: Error getting request for apiKey: -173 and apiVersion: 19778
Caused by: java.lang.IllegalArgumentException: Unexpected ApiKeys id `-173`, it should be between `0` and `20` (inclusive)
at org.apache.kafka.common.protocol.ApiKeys.forId(ApiKeys.java:73)
at org.apache.kafka.common.requests.AbstractRequest.getRequest(AbstractRequest.java:39)
at kafka.network.RequestChannel$Request.liftedTree2$1(RequestChannel.scala:96)
at kafka.network.RequestChannel$Request.<init>(RequestChannel.scala:91)
at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:492)
at kafka.network.Processor$$anonfun$processCompletedReceives$1.apply(SocketServer.scala:487)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at kafka.network.Processor.processCompletedReceives(SocketServer.scala:487)
at kafka.network.Processor.run(SocketServer.scala:417)
at java.lang.Thread.run(Thread.java:748)
[2021-08-04 22:00:39,516] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:694)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
at sun.nio.ch.IOUtil.read(IOUtil.java:195)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:110)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:154)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:135)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:343)
at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
at kafka.network.Processor.poll(SocketServer.scala:476)
at kafka.network.Processor.run(SocketServer.scala:416)
at java.lang.Thread.run(Thread.java:748)
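Reading the OOM trace, the allocation happens on the request-receive path: the broker reads a 4-byte size prefix from the socket and then allocates a buffer of whatever size those bytes declare, and when the sender is nmap rather than a Kafka client those 4 bytes are effectively random. Below is a much-simplified sketch of that framing pattern, just to illustrate the suspicion; it is not the actual Kafka source, and the class and method names are made up (socket.request.max.bytes, however, is the real broker setting that caps the declared size, 100 MB by default).

import java.io.EOFException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Much-simplified sketch (NOT Kafka's actual source) of the size-prefixed framing
// used on the wire: 4 bytes of length, then the request payload.
public class SizePrefixedReceive {

    static ByteBuffer readRequest(SocketChannel channel) throws Exception {
        ByteBuffer sizeBuf = ByteBuffer.allocate(4);
        while (sizeBuf.hasRemaining()) {
            if (channel.read(sizeBuf) < 0) {
                throw new EOFException("connection closed mid-frame");
            }
        }
        sizeBuf.flip();

        // From a real client this is the request size; from an nmap probe these are
        // arbitrary bytes that can decode to a huge number (hundreds of MB).
        int declaredSize = sizeBuf.getInt();

        // The real broker rejects sizes above socket.request.max.bytes (100 MB by
        // default), but anything under that cap is allocated up front in one piece.
        ByteBuffer payload = ByteBuffer.allocate(declaredSize);

        // Reading into a heap buffer through an NIO socket channel goes via a
        // temporary direct buffer of the same size inside the JDK (sun.nio.ch.Util),
        // which is the allocation that hits -XX:MaxDirectMemorySize when several
        // such connections arrive in one burst.
        channel.read(payload);
        return payload;
    }
}

With the scan opening many connections within a couple of seconds, several network processor threads can each be holding one of these oversized temporary direct buffers at the same moment, which is enough to exhaust the JVM's direct-memory limit even while heap usage looks fine.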
Did you adjust -XX:MaxDirectMemorySize? You also need to make sure the host has enough memory. Reference: kafka内存溢出 java.lang.OutOfMemoryError: Direct buffer memory
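To confirm that direct memory really is what fills up, and to pick a sensible value for the flag, the broker's direct buffer pool can be watched over JMX. A minimal sketch, assuming remote JMX is enabled on the broker (e.g. started with JMX_PORT=9999; the address and port below are placeholders, not your actual setup):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DirectBufferWatch {
    public static void main(String[] args) throws Exception {
        // Placeholder address; assumes the broker exposes remote JMX on port 9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://172.17.1.6:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Standard JDK MBean for the direct buffer pool.
            ObjectName direct = new ObjectName("java.nio:type=BufferPool,name=direct");
            for (int i = 0; i < 120; i++) {
                long used = (Long) conn.getAttribute(direct, "MemoryUsed");
                long capacity = (Long) conn.getAttribute(direct, "TotalCapacity");
                long count = (Long) conn.getAttribute(direct, "Count");
                System.out.printf("direct buffers: count=%d used=%dMB capacity=%dMB%n",
                        count, used >> 20, capacity >> 20);
                Thread.sleep(5_000);
            }
        }
    }
}

If the used value jumps by tens or hundreds of MB exactly when the scan runs, it is the scan-triggered allocations that are hitting the -XX:MaxDirectMemorySize limit rather than a slow leak.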
We already adjusted that in production. After raising it the cluster holds up for roughly 1-2 months, but then one day a scan comes in and a broker immediately goes into a zombie state (the whole cluster stops being consumable), and the only way to recover is to kill the hung node.
I just tried a newer version and it does not have this problem: it throws the access exception cleanly instead of OOMing. Sigh.
[Log from a surviving node below]
[2020-09-09 16:58:59,103] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7a8b66ef (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
A hang like that is usually caused either by very high resource utilization inside the Kafka broker, or by a thread waiting forever on a resource it can never acquire (a deadlock). Start by monitoring CPU and memory usage on the host to rule out a plain resource problem; there is also a small deadlock-check sketch after this reply.
Also, you are probably right that this is most likely a bug at Kafka's network entry point; my guess is that when certain requests are handled at the same moment, the parallelism leads to resource contention and a deadlock.
I don't have a better answer (0.10.1.0 is simply too old and no longer maintained), and I could not find this issue reported anywhere...
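For the deadlock suspicion, besides taking a jstack thread dump, the same kind of JMX connection can ask the ThreadMXBean directly whether any threads are deadlocked. A minimal sketch, again assuming remote JMX is enabled on the broker and using a placeholder address:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DeadlockCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder address; assumes the broker exposes remote JMX on port 9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://172.17.1.6:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            long[] deadlocked = threads.findDeadlockedThreads();
            if (deadlocked == null) {
                System.out.println("No deadlocked threads found.");
                return;
            }
            // Print the name and a short stack for every thread stuck in the cycle.
            for (ThreadInfo info : threads.getThreadInfo(deadlocked, 10)) {
                System.out.println(info);
            }
        }
    }
}

If this reports nothing while the broker is hung, the threads are more likely blocked on something external (disk, network, memory pressure) than on each other.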
Thanks. We only dropped down to this version to stay consistent with what the customer runs.
The nmap scans basically all happen in the middle of the night (resource usage should be very low then), and they open 16 bad connections within a few seconds. Looks like upgrading is the only way out, haha.