Kafka Monitoring

半兽人, posted 2015-03-10, last updated 2017-02-09

6.6 Monitoring

Kafka uses Yammer Metrics for metrics reporting in both the server and the client. It can be configured to report stats through pluggable stats reporters that hook up to your monitoring system.


The easiest way to see the available metrics is to fire up jconsole and point it at a running Kafka client or server; this allows browsing all metrics with JMX.
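Each metric browsed through JMX is identified by an MBean object name of the form domain:key=value,key=value. As a minimal sketch, such a name can be split into its domain and key properties. The helper name parse_mbean_name is hypothetical, and the sketch assumes unquoted values that contain no commas.

```python
def parse_mbean_name(name):
    """Split 'domain:k1=v1,k2=v2' into (domain, {k1: v1, k2: v2})."""
    domain, _, props = name.partition(":")
    if not domain or not props:
        raise ValueError(f"not an MBean object name: {name!r}")
    properties = {}
    for pair in props.split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed key property: {pair!r}")
        properties[key] = value
    return domain, properties


domain, props = parse_mbean_name(
    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"
)
print(domain)         # kafka.server
print(props["name"])  # MessagesInPerSec
```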


We do graphing and alerting on the following metrics:


Description
Mbean name
Normal value

Message in rate
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec

Byte in rate
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec

Request rate
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}

Byte out rate
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec

Log flush rate and time
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs

# of under replicated partitions (|ISR| < |all replicas|)
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Normal value: 0

Is controller active on broker
kafka.controller:type=KafkaController,name=ActiveControllerCount
Normal value: only one broker in the cluster should have 1

Leader election rate
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
Normal value: non-zero when there are broker failures

Unclean leader election rate
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
Normal value: 0

Partition counts
kafka.server:type=ReplicaManager,name=PartitionCount
Normal value: mostly even across brokers

Leader replica counts
kafka.server:type=ReplicaManager,name=LeaderCount
Normal value: mostly even across brokers

ISR shrink rate
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
Normal value: If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.

ISR expansion rate
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
Normal value: see ISR shrink rate above

Max lag in messages between follower and leader replicas
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
Normal value: < replica.lag.max.messages

Lag in messages per follower replica
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
Normal value: < replica.lag.max.messages

Requests waiting in the producer purgatory
kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize
Normal value: non-zero if ack=-1 is used

Requests waiting in the fetch purgatory
kafka.server:type=FetchRequestPurgatory,name=PurgatorySize
Normal value: size depends on fetch.wait.max.ms in the consumer

Request total time
kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: broken into queue, local, remote and response send time

Time the request waits in the request queue
kafka.network:type=RequestMetrics,name=QueueTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request is processed at the leader
kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request waits for the follower
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: non-zero for produce requests when ack=-1

Time to send the response
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}

Number of messages the consumer lags behind the producer by
kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)

The average fraction of time the network processors are idle
kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
Normal value: between 0 and 1, ideally > 0.3

The average fraction of time the request handler threads are idle
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
Normal value: between 0 and 1, ideally > 0.3
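The normal values listed for the broker metrics translate fairly directly into alert rules. The following is only a sketch under stated assumptions: the snapshot format (a dict per broker, keyed by the MBean name property) and the function name check_broker_metrics are hypothetical, and collecting the values over JMX is outside its scope.

```python
IDLE_METRICS = ("NetworkProcessorAvgIdlePercent", "RequestHandlerAvgIdlePercent")
ZERO_METRICS = ("UnderReplicatedPartitions", "UncleanLeaderElectionsPerSec")


def check_broker_metrics(brokers, min_idle=0.3):
    """Return alert strings for a cluster snapshot.

    brokers maps a broker id to a dict of values keyed by the MBean
    'name' property (a hypothetical snapshot format for this sketch).
    """
    alerts = []
    for broker_id in sorted(brokers):
        values = brokers[broker_id]
        # The normal value for these metrics is 0.
        for name in ZERO_METRICS:
            if values[name] != 0:
                alerts.append(f"broker {broker_id}: {name} = {values[name]}, expected 0")
        # Idle fractions lie between 0 and 1 and should ideally exceed 0.3.
        for name in IDLE_METRICS:
            if values[name] <= min_idle:
                alerts.append(f"broker {broker_id}: {name} = {values[name]}, want > {min_idle}")
    # Only one broker in the cluster should report ActiveControllerCount = 1.
    controllers = sum(v["ActiveControllerCount"] for v in brokers.values())
    if controllers != 1:
        alerts.append(f"cluster: {controllers} active controllers, expected 1")
    return alerts


snapshot = {
    1: {"UnderReplicatedPartitions": 0, "UncleanLeaderElectionsPerSec": 0,
        "NetworkProcessorAvgIdlePercent": 0.8, "RequestHandlerAvgIdlePercent": 0.7,
        "ActiveControllerCount": 1},
    2: {"UnderReplicatedPartitions": 3, "UncleanLeaderElectionsPerSec": 0,
        "NetworkProcessorAvgIdlePercent": 0.9, "RequestHandlerAvgIdlePercent": 0.2,
        "ActiveControllerCount": 0},
}
for alert in check_broker_metrics(snapshot):
    print(alert)
```

The 0.3 idle threshold is the "ideally > 0.3" figure from the table; treat it as a starting point for tuning rather than a hard limit.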


Common monitoring metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.


METRIC/ATTRIBUTE NAME
DESCRIPTION
MBEAN NAME

connection-close-rate
Connections closed per second in the window.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

connection-creation-rate
New connections established per second in the window.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

network-io-rate
The average number of network operations (reads or writes) on all connections per second.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

outgoing-byte-rate
The average number of outgoing bytes sent per second to all servers.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

request-rate
The average number of requests sent per second.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

request-size-avg
The average size of all requests in the window.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

request-size-max
The maximum size of any request sent in the window.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

incoming-byte-rate
Bytes/second read off all sockets.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

response-rate
Responses received per second.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

select-rate
Number of times the I/O layer checked for new I/O to perform per second.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

io-wait-time-ns-avg
The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

io-wait-ratio
The fraction of time the I/O thread spent waiting.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

io-time-ns-avg
The average length of time for I/O per select call in nanoseconds.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

io-ratio
The fraction of time the I/O thread spent doing I/O.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)

connection-count
The current number of active connections.
kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
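The MBean names in this table are written as patterns: the bracketed alternatives stand for one of producer, consumer or connect, and ([-.\w]+) stands for the client-id. A minimal sketch of turning that notation into a regular expression and pulling the client-id out of a concrete name follows. It assumes the key properties appear in the order shown and that the domain and the type prefix name the same client kind; the function name client_id_of and the sample client-id are hypothetical.

```python
import re

# The documentation's pattern, rewritten as a real regular expression.
CLIENT_METRICS = re.compile(
    r"kafka\.(?P<kind>producer|consumer|connect):"
    r"type=(?P=kind)-metrics,"
    r"client-id=(?P<client_id>[-.\w]+)"
)


def client_id_of(mbean_name):
    """Return (kind, client_id) for a client metrics MBean name, else None."""
    match = CLIENT_METRICS.fullmatch(mbean_name)
    if match is None:
        return None
    return match.group("kind"), match.group("client_id")


print(client_id_of("kafka.producer:type=producer-metrics,client-id=order-service.1"))
# ('producer', 'order-service.1')
print(client_id_of("kafka.server:type=ReplicaManager,name=LeaderCount"))
# None
```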


Common Per-broker metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.

METRIC/ATTRIBUTE NAME
DESCRIPTION
MBEAN NAME

outgoing-byte-rate
The average number of outgoing bytes sent per second for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

request-rate
The average number of requests sent per second for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

request-size-avg
The average size of all requests in the window for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

request-size-max
The maximum size of any request sent in the window for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

incoming-byte-rate
The average number of bytes received per second for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

request-latency-avg
The average request latency in ms for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

request-latency-max
The maximum request latency in ms for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

response-rate
Responses received per second for a node.
kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
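Because these metrics are reported once per node-id, a common use is comparing nodes against each other. As an illustrative sketch only, the helper below picks the node with the highest request-latency-avg from values that have already been collected. The nested-dict input format and the name slowest_node are assumptions of this example.

```python
def slowest_node(node_metrics):
    """Return (node_id, latency_ms) for the highest request-latency-avg.

    node_metrics maps a node-id to a dict of per-node metric values
    (a hypothetical format). Returns None when no nodes are given.
    """
    if not node_metrics:
        return None
    node_id = max(node_metrics, key=lambda n: node_metrics[n]["request-latency-avg"])
    return node_id, node_metrics[node_id]["request-latency-avg"]


sample = {
    "0": {"request-latency-avg": 4.2, "request-rate": 120.0},
    "1": {"request-latency-avg": 37.5, "request-rate": 118.0},
    "2": {"request-latency-avg": 5.1, "request-rate": 121.0},
}
print(slowest_node(sample))  # ('1', 37.5)
```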



Producer monitoring

The following metrics are available on producer instances.


METRIC/ATTRIBUTE NAME
DESCRIPTION
MBEAN NAME

waiting-threads
The number of user threads blocked waiting for buffer memory to enqueue their records.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

buffer-total-bytes
The maximum amount of buffer memory the client can use (whether or not it is currently used).
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

buffer-available-bytes
The total amount of buffer memory that is not being used (either unallocated or in the free list).
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

bufferpool-wait-time
The fraction of time an appender waits for space allocation.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

batch-size-avg
The average number of bytes sent per partition per request.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

batch-size-max
The max number of bytes sent per partition per request.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

compression-rate-avg
The average compression rate of record batches.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-queue-time-avg
The average time in ms record batches spent in the record accumulator.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-queue-time-max
The maximum time in ms record batches spent in the record accumulator.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

request-latency-avg
The average request latency in ms.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

request-latency-max
The maximum request latency in ms.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-send-rate
The average number of records sent per second.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

records-per-request-avg
The average number of records per request.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-retry-rate
The average per-second number of retried record sends.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-error-rate
The average per-second number of record sends that resulted in errors.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-size-max
The maximum record size.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-size-avg
The average record size.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

requests-in-flight
The current number of in-flight requests awaiting a response.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

metadata-age
The age in seconds of the current producer metadata being used.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

record-send-rate
The average number of records sent per second for a topic.
kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)

byte-rate
The average number of bytes sent per second for a topic.
kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)

compression-rate
The average compression rate of record batches for a topic.
kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)

record-retry-rate
The average per-second number of retried record sends for a topic.
kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)

record-error-rate
The average per-second number of record sends that resulted in errors for a topic.
kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)

produce-throttle-time-max
The maximum time in ms a request was throttled by a broker.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)

produce-throttle-time-avg
The average time in ms a request was throttled by a broker.
kafka.producer:type=producer-metrics,client-id=([-.\w]+)
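Several producer metrics are easier to act on in combination. For example, buffer-total-bytes and buffer-available-bytes together give the fraction of buffer memory in use, and a non-zero waiting-threads value means user threads are currently blocked on that buffer. The sketch below assumes the three values have already been read into a dict; the function name producer_buffer_status and the sample numbers are illustrative only.

```python
def producer_buffer_status(metrics):
    """Summarise producer buffer pressure from already-collected values."""
    total = metrics["buffer-total-bytes"]
    available = metrics["buffer-available-bytes"]
    return {
        # Fraction of the buffer currently in use, between 0 and 1.
        "utilization": (total - available) / total,
        # True when at least one user thread is blocked waiting for memory.
        "blocked": metrics["waiting-threads"] > 0,
    }


sample = {
    "buffer-total-bytes": 32 * 1024 * 1024,
    "buffer-available-bytes": 8 * 1024 * 1024,
    "waiting-threads": 2,
}
print(producer_buffer_status(sample))  # {'utilization': 0.75, 'blocked': True}
```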

New consumer monitoring

The following metrics are available on new consumer instances.

Consumer Group Metrics

METRIC/ATTRIBUTE NAME
DESCRIPTION
MBEAN NAME

commit-latency-avg
The average time taken for a commit request.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

commit-latency-max
The max time taken for a commit request.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

commit-rate
The number of commit calls per second.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

assigned-partitions
The number of partitions currently assigned to this consumer.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

heartbeat-response-time-max
The max time taken to receive a response to a heartbeat request.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

heartbeat-rate
The average number of heartbeats per second.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

join-time-avg
The average time taken for a group rejoin.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

join-time-max
The max time taken for a group rejoin.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

join-rate
The number of group joins per second.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

sync-time-avg
The average time taken for a group sync.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

sync-time-max
The max time taken for a group sync.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

sync-rate
The number of group syncs per second.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

last-heartbeat-seconds-ago
The number of seconds since the last coordinator heartbeat.
kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)


Consumer Fetch Metrics

METRIC/ATTRIBUTE NAME
DESCRIPTION
MBEAN NAME

fetch-size-avg
The average number of bytes fetched per request.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

fetch-size-max
The maximum number of bytes fetched per request.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

bytes-consumed-rate
The average number of bytes consumed per second.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

records-per-request-avg
The average number of records in each request.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

records-consumed-rate
The average number of records consumed per second.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

fetch-latency-avg
The average time taken for a fetch request.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

fetch-latency-max
The max time taken for a fetch request.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

fetch-rate
The number of fetch requests per second.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

records-lag-max
The maximum lag in terms of number of records for any partition in this window.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

fetch-throttle-time-avg
The average throttle time in ms.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

fetch-throttle-time-max
The maximum throttle time in ms.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)


Topic-level Fetch Metrics

METRIC/ATTRIBUTE NAME
DESCRIPTION
MBEAN NAME

fetch-size-avg
The average number of bytes fetched per request for a specific topic.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

fetch-size-max
The maximum number of bytes fetched per request for a specific topic.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

bytes-consumed-rate
The average number of bytes consumed per second for a specific topic.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

records-per-request-avg
The average number of records in each request for a specific topic.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

records-consumed-rate
The average number of records consumed per second for a specific topic.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

Others

We recommend monitoring GC time and other stats, along with various server stats such as CPU utilization and I/O service time. On the client side, we recommend monitoring the message/byte rate (global and per topic) and the request rate/size/time. On the consumer side, we also recommend monitoring the max lag in messages among all partitions and the min fetch request rate. For a consumer to keep up, max lag needs to be less than a threshold and min fetch rate needs to be larger than 0.
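The keep-up condition for a consumer can be written down directly. In this sketch the function name consumer_keeping_up is hypothetical, and the lag threshold is an assumption that has to be chosen per application, since the text does not give a value.

```python
def consumer_keeping_up(max_lag, min_fetch_rate, lag_threshold):
    """True when max lag is below the threshold and fetches are still happening."""
    return max_lag < lag_threshold and min_fetch_rate > 0


print(consumer_keeping_up(max_lag=120, min_fetch_rate=4.5, lag_threshold=1000))   # True
print(consumer_keeping_up(max_lag=5000, min_fetch_rate=4.5, lag_threshold=1000))  # False
print(consumer_keeping_up(max_lag=120, min_fetch_rate=0.0, lag_threshold=1000))   # False
```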


Audit

The final alerting we do is on the correctness of the data delivery. We audit that every message that is sent is consumed by all consumers and measure the lag for this to occur. For important topics we alert if a certain completeness is not achieved in a certain time period. The details of this are discussed in KAFKA-260.
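One way to picture the completeness check is to count messages per time bucket on the sending and consuming side and compare the two. The sketch below is not the design discussed in KAFKA-260, which is not reproduced here; the bucketed counts, the function name incomplete_buckets and the 0.999 requirement are all assumptions made for illustration.

```python
def incomplete_buckets(produced, consumed, required=0.999):
    """Return the time buckets whose completeness is below `required`.

    produced and consumed map a time bucket label to a message count.
    Completeness of a bucket is consumed / produced.
    """
    failing = []
    for bucket in sorted(produced):
        sent = produced[bucket]
        if sent == 0:
            continue  # nothing was sent, so nothing can be missing
        seen = consumed.get(bucket, 0)
        if seen / sent < required:
            failing.append(bucket)
    return failing


produced = {"10:00": 1000, "10:10": 1000, "10:20": 0}
consumed = {"10:00": 1000, "10:10": 990}
print(incomplete_buckets(produced, consumed))  # ['10:10']
```

In practice the check would only be run on a bucket after the allowed time period has passed, so that messages still in flight are not reported as missing.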










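  • I have two questions:
    1. Is it feasible to monitor the maximum producer response wait time for each topic?
    2. request-latency-max

        The maximum request latency in ms.

        kafka.producer:type=producer-metrics,client-id=([-.\w]+)
       If I want to monitor this metric, what is the client-id when a program accesses it, and how do I obtain it? Thanks.

    • About client-id: it is an id string passed to the server when making requests. Its purpose is to let the server include this logical application name in its request logging, so that the source of a request can be traced by more than just ip/port.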