Kafka hangs: the process is still listed by jps, but data cannot be produced or consumed

小头針╮︶♡   Posted: 2019-04-11   Last updated: 2019-04-13 23:20:15   7,185 views


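My problem is similar to the one described at https://www.orchome.com/516 (a so-called Kafka "fake death": the process is alive, but no data can be consumed). That post did not include any error messages, however, so I cannot tell whether the root cause is the same. I am posting my details here in the hope that someone can take a look.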

Kafka version: kafka_2.11-0.10.1.0

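Details: our production environment has 5 Kafka nodes. The program that consumes from Kafka suddenly stopped receiving data. The Kafka process was still running on every node, but the partitions on one node were unusable, and no data could be consumed from the cluster as a whole. In the logs I found that Kafka started reporting the following error at around 22:00 on 2019-03-06. (This log is from the node whose partitions were unusable.)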

[2019-03-06 22:35:12,306] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@4e9b2495 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2019-03-06 22:35:44,314] INFO [Group Metadata Manager on Broker 4]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2019-03-06 22:35:44,342] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@19d368a4 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2019-03-06 22:36:16,351] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@44510d44 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
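
One way to see which broker the failing fetches point at is to tally the WARN lines. The sketch below is an illustration only: the regular expression is fitted to the log lines quoted above, it assumes the trailing number in the thread name is the source broker id (consistent with "Connection to 1" in the exception), and `count_fetch_errors` and `SAMPLE` are hypothetical names.

```python
import re
from collections import Counter

# Header line of a replica fetch failure, e.g.
# [2019-03-06 22:35:12,306] WARN [ReplicaFetcherThread-0-1], Error in fetch ...
# Assumption: the thread name is ReplicaFetcherThread-<fetcher id>-<source broker id>.
FETCH_WARN = re.compile(
    r"^\[(?P<ts>[\d\- :,]+)\] WARN "
    r"\[ReplicaFetcherThread-(?P<fetcher>\d+)-(?P<broker>\d+)\], Error in fetch"
)


def count_fetch_errors(lines):
    """Count 'Error in fetch' warnings per source broker id."""
    counts = Counter()
    for line in lines:
        match = FETCH_WARN.match(line)
        if match:
            counts[int(match.group("broker"))] += 1
    return counts


# A few lines copied from the log above (stack frames shortened to one).
SAMPLE = [
    "[2019-03-06 22:35:12,306] WARN [ReplicaFetcherThread-0-1], Error in fetch "
    "kafka.server.ReplicaFetcherThread$FetchRequest@4e9b2495 (kafka.server.ReplicaFetcherThread)",
    "java.io.IOException: Connection to 1 was disconnected before the response was read",
    "    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)",
    "[2019-03-06 22:35:44,314] INFO [Group Metadata Manager on Broker 4]: "
    "Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)",
    "[2019-03-06 22:35:44,342] WARN [ReplicaFetcherThread-0-1], Error in fetch "
    "kafka.server.ReplicaFetcherThread$FetchRequest@19d368a4 (kafka.server.ReplicaFetcherThread)",
]

if __name__ == "__main__":
    for broker, n in sorted(count_fetch_errors(SAMPLE).items()):
        print(f"broker {broker}: {n} fetch errors")
```

Passing an open log file (for example the broker's server.log) instead of `SAMPLE` works the same way, since the function only iterates over lines.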

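After a restart Kafka could be consumed normally, but before long the same problem appeared on another node, and the timing is irregular. Here are the three most recent occurrences I recorded:

2019/03/07
Kafka hung: the node's partitions were unusable; producing and consuming both failed

2019/04/03
kafka125 hung: data could not be written to node 125; consumption worked again after a restart

2019/04/07
kafka126 hung

Thanks for your attention!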


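OP, did you ever solve this? I'm running into a similar problem, and the load on our environment isn't heavy.

Which version of Kafka are you using?

The same version as you, kafka_2.11-0.10.1.0. I wonder whether it's this Kafka bug: https://issues.apache.org/jira/browse/KAFKA-4399

How are you handling this error at the moment? Since we don't know the exact cause, we just restart Kafka to restore service, but the underlying problem is still there.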

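We also just restart when we hit this error, but lately the hangs have become very frequent. We are monitoring it now: this version of Kafka has a deadlock, and we have already caught it three times, so we plan to switch versions. You could start by monitoring Kafka's threads with JVisualVM and JConsole, which ship with the JDK.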
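
The thread check suggested above can also be scripted with jstack, the command-line thread dump tool from the JDK, instead of the GUI tools. A minimal sketch, assuming jstack is on the PATH and is run as the same user as the broker; the function names are made up for illustration, and the check relies on jstack printing a "Java-level deadlock" section when it detects one.

```python
import subprocess
import sys

# jstack reports detected deadlocks under a heading containing this text.
DEADLOCK_MARKER = "Java-level deadlock"


def has_deadlock(thread_dump: str) -> bool:
    """Return True if a jstack thread dump reports a deadlock."""
    return DEADLOCK_MARKER in thread_dump


def dump_threads(pid: int) -> str:
    """Run `jstack <pid>` and return its output (raises if jstack fails)."""
    result = subprocess.run(
        ["jstack", str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python check_deadlock.py <broker pid, e.g. as shown by jps>
    broker_pid = int(sys.argv[1])
    if has_deadlock(dump_threads(broker_pid)):
        print("deadlock reported by jstack")
    else:
        print("no deadlock reported")
```

Run periodically, a check like this could save the full dump whenever a deadlock is reported, before the broker is restarted.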

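Ours is a newly launched project, and it hit this problem after running for barely a day or two. We are still discussing how to handle it and also plan to try moving to 0.10.2.0. Beyond that we will set up proper monitoring and restart first whenever something goes wrong. Thanks for your reply.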

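  1. Check the GC information. Long GC pauses may be causing clients and the server to disconnect. If GC is taking too long, consider increasing the JVM memory settings.
  2. Run netstat -anp |grep 9092 | wc -l to check the number of connections. A large count means many clients are connected and the Kafka workload is fairly heavy.
  3. This tends to happen right after the system starts, when many clients connect at once. If memory is sufficient, check whether CPU resources are adequate.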
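
For point 2, note that grep 9092 | wc -l also counts the LISTEN socket, outbound connections to other brokers' port 9092, and any other line that happens to contain 9092. A rough sketch of a stricter count, assuming the Linux netstat -an column layout; the addresses in `SAMPLE` are invented and `connections_by_peer` is a hypothetical helper.

```python
from collections import Counter


def connections_by_peer(netstat_output: str, port: int = 9092) -> Counter:
    """Count ESTABLISHED TCP connections to the local port, grouped by remote host."""
    peers = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # Keep only sockets whose local port is the broker port.
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        peers[remote.rsplit(":", 1)[0]] += 1
    return peers


# Invented sample in the shape of `netstat -an` output.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:9092            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:9092           10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:9092           10.0.0.9:51240          ESTABLISHED
tcp        0      0 10.0.0.5:9092           10.0.0.7:40022          ESTABLISHED
tcp        0      0 10.0.0.5:45678          10.0.0.6:9092           ESTABLISHED
"""

if __name__ == "__main__":
    peers = connections_by_peer(SAMPLE)
    print("total inbound:", sum(peers.values()))
    for host, n in peers.most_common():
        print(host, n)
```

Grouping by remote host makes it easier to tell whether a high count comes from many clients or from a few hosts that open many connections each.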

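Thanks for the reply. Following your suggestions I collected the relevant figures for the cluster. Could you take a look at which of these might be causing the problem described?
Each node has 256G of memory and 48 processors.

Node                  1        2        3        4        5
Kafka memory usage    32G      32G      32G      32G      32G
Connections           2063     1260     18911    4769     4868
Total GC time         406.178  124.346  581.161  613.546  242.607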
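
As a quick way to read the connection figures above, each node can be compared against the cluster median. This is plain arithmetic on the numbers in the table, not a diagnosis; the variable names are illustrative.

```python
from statistics import median

# Connection counts per node, copied from the figures above.
connections = {1: 2063, 2: 1260, 3: 18911, 4: 4769, 5: 4868}

mid = median(connections.values())
ratios = {node: count / mid for node, count in connections.items()}

for node, ratio in sorted(ratios.items()):
    print(f"node {node}: {connections[node]:>6} connections, {ratio:.2f}x median")
```

Node 3 stands out at roughly four times the median connection count.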
