Kafka appears hung: the process still shows up in jps, but no data can be produced or consumed.

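My problem looks similar to the one described at http://www.orchome.com/516 (the so-called Kafka "fake death": the process is alive, but no data can be consumed). That post did not include any error messages, though, so I cannot tell whether the root cause is the same. I am posting my details here in the hope that someone can take a look.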

Kafka version: kafka_2.11-0.10.1.0

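Details: our production environment has 5 Kafka nodes. The program that consumes from Kafka suddenly stopped receiving data. The Kafka process was still running on every node, but the partitions on one node were unavailable, and the cluster as a whole could not be consumed from. The Kafka log shows the following errors from around 22:00 on 2019-03-06. (This log is from the node whose partitions were unavailable.)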

[2019-03-06 22:35:12,306] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@4e9b2495 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2019-03-06 22:35:44,314] INFO [Group Metadata Manager on Broker 4]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2019-03-06 22:35:44,342] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@19d368a4 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2019-03-06 22:36:16,351] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@44510d44 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
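
To see whether the fetch failures follow a pattern, the WARN lines can be pulled out of the broker log and the gaps between them measured. The following is a minimal Python sketch, not part of the original post. The function names and the regular expression are my own assumptions based only on the log format shown above, and SAMPLE holds just the three WARN header lines from that log.

```python
import re
from datetime import datetime

# Matches the WARN header line of a replica fetch failure, in the format
# seen in the log above. The pattern is an assumption derived from that log.
WARN_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] WARN "
    r"\[(ReplicaFetcherThread-\d+-\d+)\], Error in fetch"
)

# The three WARN header lines copied from the log above (stack traces omitted).
SAMPLE = """\
[2019-03-06 22:35:12,306] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@4e9b2495 (kafka.server.ReplicaFetcherThread)
[2019-03-06 22:35:44,342] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@19d368a4 (kafka.server.ReplicaFetcherThread)
[2019-03-06 22:36:16,351] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@44510d44 (kafka.server.ReplicaFetcherThread)
"""


def fetch_error_times(log_text):
    """Return {fetcher thread name: [datetime, ...]} for every fetch WARN line."""
    times = {}
    for line in log_text.splitlines():
        m = WARN_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            times.setdefault(m.group(2), []).append(ts)
    return times


def intervals_seconds(timestamps):
    """Gaps in seconds between consecutive timestamps."""
    return [
        round((later - earlier).total_seconds(), 3)
        for earlier, later in zip(timestamps, timestamps[1:])
    ]


if __name__ == "__main__":
    for thread, stamps in fetch_error_times(SAMPLE).items():
        print(thread, len(stamps), "warnings, gaps (s):", intervals_seconds(stamps))
```

For the three warnings above, the gaps work out to 32.036 s and 32.009 s. Such a steady spacing might point to a timeout-driven retry loop rather than random network drops, but that is only a guess and would need to be checked against the broker's timeout settings.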

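After a restart, Kafka could be consumed from normally again, but before long another node ran into the same problem. The failures do not occur at regular intervals. Here are the three most recent occurrences I recorded: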

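2019/03/07
Kafka hung: the node's partitions were unavailable, and neither producing nor consuming worked.

2019/04/03
kafka125 hung: data could not be written to node 125; consuming worked again after a restart.

2019/04/07
kafka126 hung.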

Thanks for taking a look!

Posted: 14 days ago   Last updated: 12 days ago   Views: 147

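  • 1. Check the GC information. Long GC pauses may be causing the connections between clients and the broker to drop. If the pauses are too long, consider increasing the JVM memory settings.
    2. Run netstat -anp |grep 9092 | wc -l to see how many connections there are. A high count means many clients are connected and Kafka is under heavy load.
    3. This is most likely to happen right after startup, when a large number of clients connect at once. If memory is sufficient, check whether there is enough CPU.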
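    • Thanks for the reply. Following your suggestions, I collected the figures below from the cluster. Could you help me work out which of them might be causing the problem?
      Each node has 256G of memory and 48 processors.
      Node                  1        2        3        4        5
      Kafka memory usage    32G      32G      32G      32G      32G
      Connections           2063     1260     18911    4769     4868
      Total GC time         406.178  124.346  581.161  613.546  242.607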
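
One rough way to read the figures in the table above is to compare each node with the cluster median. The sketch below (Python, not part of the original thread) does that for the connection counts and the GC totals. The 2x-median threshold and the `outliers` helper are arbitrary assumptions chosen for illustration, and the GC figures are used exactly as posted, with no assumption about their unit or the uptime they cover.

```python
from statistics import median

# Figures copied from the table above, keyed by node number 1-5.
connections = {1: 2063, 2: 1260, 3: 18911, 4: 4769, 5: 4868}
gc_total = {1: 406.178, 2: 124.346, 3: 581.161, 4: 613.546, 5: 242.607}


def outliers(values, factor=2.0):
    """Return {node: ratio to median} for nodes above factor x the cluster median.

    The factor is a hypothetical cut-off, not a Kafka recommendation.
    """
    mid = median(values.values())
    return {
        node: round(value / mid, 2)
        for node, value in values.items()
        if value > factor * mid
    }


if __name__ == "__main__":
    print("connection outliers:", outliers(connections))
    print("GC total outliers:", outliers(gc_total))
```

With these numbers, only node 3 stands out on connection count (roughly 4 times the median), while no node exceeds the threshold on total GC time. That would make the connection count on node 3 a reasonable first thing to look into, though a cumulative GC total says little about individual pause lengths.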