kafka消息传递保障

半兽人 发表于: 2015-03-10   最后更新时间: 2016-12-24  
  •   44 订阅,3475 游览

消息传递语义


Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:
现在我们了解一些关于生产者和消费者是如何工作的,让我们来讨论kafka提供了生产者和消费者之间的语义的保证。显然,有多种可能的消息传递保证可以提供:

  • At most once—Messages may be lost but are never redelivered.
    最多一次 --- 消息可能丢失,但永远不会重发。
  • At least once—Messages are never lost but may be redelivered.
    至少一次 --- 消息绝不会丢失,但有可能重新发送。
  • Exactly once—this is what people actually want, each message is delivered once and only once.
    正好一次 --- 这是人们真正想要的,每个消息传递一次且仅一次。


It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.

可分解成两个问题:发布消息的耐久性保障和消费消息的保障。


Many systems claim to provide "exactly once" delivery semantics, but it is important to read the fine print, most of these claims are misleading (i.e. they don't translate to the case where consumers or producers can fail, or cases where there are multiple consumer processes, or cases where data written to disk can be lost).

许多系统声称提供“正好一次”发送语义,声称的很好,但是大部分这些说法都是误导(即:他们没解释消费者和生产者可能失败的情况。如多个消费处理,数据写入磁盘丢失的情况)。


Kafka's semantics are straight-forward. When publishing a message we have a notion of the message being "committed" to the log. Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive". The definition of alive as well as a description of which types of failures we attempt to handle will be described in more detail in the next section. For now let's assume a perfect, lossless broker and try to understand the guarantees to the producer and consumer. If a producer attempts to publish a message and experiences a network error it cannot be sure if this error happened before or after the message was committed. This is similar to the semantics of inserting into a database table with an autogenerated key.

kafka的语义是很直接的,我们有一个概念,当发布一条消息时,该消息 “committed(承诺)” 到了日志,一旦发布的消息是”承诺“的,只要副本分区写入了此消息的一个broker仍然"活着”,它就不会丢失。“活着”的定义以及描述的类型,我们处理失败的情况将在下一节中详细描述。现在让我们假设一个完美的无损的broker,并去了解如何保障生产者和消费者的,如果一个生产者发布消息并且正好遇到网络错误,就不能确定已提交的消息是否是在这个错误发生之前或之后。这类似于用自动生成key插入到一个数据库表。


These are not the strongest possible semantics for publishers. Although we cannot be sure of what happened in the case of a network error, it is possible to allow the producer to generate a sort of "primary key" that makes retrying the produce request idempotent. This feature is not trivial for a replicated system because of course it must work even (or especially) in the case of a server failure. With this feature it would suffice for the producer to retry until it receives acknowledgement of a successfully committed message at which point we would guarantee the message had been published exactly once. We hope to add this in a future Kafka version.

对于生产者,这些都不是最重要的语义,虽然我们不能确定网络错误在什么情况下发生,但可以让生产者生成“主键”,来达到重试的幂等性,这个功能不是简单的重复,因为它甚至必须在服务器故障的情况下工作。此功能将满足生产者重试,直到收到确认成功消息,将保证消息已被“正好一次”发布。我们希望添加这个功能在以后的kafka版本。


Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait on the message being committed this can take on the order of 10 ms. However the producer can also specify that it wants to perform the send completely asynchronously or that it wants to wait only until the leader (but not necessarily the followers) have the message.

并不是所有的情况都需要强有力的保障,对于延迟,我们允许生产者指定它想要的耐用性水平。如生产者可以指定它获取需等待10毫秒量级上的响应。生产者也可以指定异步发送,或只等待leader(不需要副本的响应)有响应,


Now let's describe the semantics from the point-of-view of the consumer. All replicas have the exact same log with the same offsets. The consumer controls its position in this log. If the consumer never crashed it could just store this position in memory, but if the consumer fails and we want this topic partition to be taken over by another process the new process will need to choose an appropriate position from which to start processing. Let's say the consumer reads some messages -- it has several options for processing the messages and updating its position.

现在让我们从消费者的角度描述语义。所有的副本都有相同的日志相同的偏移量。消费者控制offset在日志中的位置。如果消费者永不宕机它可能只是在内存中存储这个位置,但是如果消费者故障,我们希望这个topic分区被另一个进程接管,新进程需要选择一个合适的位置开始处理。我们假设消费者读取了一些消息,几种选项用于处理消息和更新它的位置。

  1. It can read the messages, then save its position in the log, and finally process the messages. In this case there is a possibility that the consumer process crashes after saving its position but before saving the output of its message processing. In this case the process that took over processing would start at the saved position even though a few messages prior to that position had not been processed. This corresponds to "at-most-once" semantics as in the case of a consumer failure messages may not be processed.
    读取消息,然后在日志中保存它的位置,最后处理消息。在这种情况下,有可能消费者保存了位置之后,但是处理消息输出之前崩溃了。在这种情况下,接管处理的进程会在已保存的位置开始,即使该位置之前有几个消息尚未处理。这对应于“最多一次” ,在消费者处理失败消息的情况下,不进行处理。

  2. It can read the messages, process the messages, and finally save its position. In this case there is a possibility that the consumer process crashes after processing messages but before saving its position. In this case when the new process takes over the first few messages it receives will already have been processed. This corresponds to the "at-least-once" semantics in the case of consumer failure. In many cases messages have a primary key and so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
    读取消息,处理消息,最后保存消息的位置。在这种情况下,可能消费进程处理消息之后,但保存它的位置之前崩溃了。在这种情况下,当新的进程接管了它,这将接收已经被处理的前几个消息。这就符合了“至少一次”的语义。在多数情况下消息有一个主键,以便更新冥等(其任意多次执行所产生的影响均与一次执行的影响相同)。

  3. So what about exactly once semantics (i.e. the thing you actually want)? The limitation here is not actually a feature of the messaging system but rather the need to co-ordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage for the consumer position and the storage of the consumers output. But this can be handled more simply and generally by simply letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
    那么什么是“正好一次”语义(也就是你真正想要的东西)? 实际上这里的限制不是消息发送系统的功能,而是需要去协调消费者的位置与实际存储输出的。实现这一目的经典的方式是分两阶段来提交存储消费者位置和存储消费者输出的位置。我们有更简单的办法,只需要让消费者的存储和输出的偏移量用同一个位置。这样最好了,因为很多输出系统消费者不支持两阶段提交。

So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.

kafka默认是保证“至少一次”传递,并允许用户通过禁止生产者重试和处理一批消息前提交它的偏移量来实现 “最多一次”传递。而“正好一次”传递需要与目标存储系统合作,但kafka提供了偏移量,所以实现这个很简单。







发表于: 1年前   最后更新时间: 2月前   游览量:3475
上一条: kafka消费者
下一条: kafka副本和leader选举
评论…

  • 评论…
    • in this conversation
      提问