rakhesh sasidharan's mostly techie somewhat purpley blog

I had been using Event Hubs + Azure Functions pretty naively for the past few months. Mainly coz I just assumed how some of the things work, and also coz I guess when working with the cloud you have this mindset that things just work and don’t really care about the details.

Anyways.

The first thing is that I have this Function that does some processing, and if it fails I was pushing the item to an event hub thus:

try {
  Push-OutputBinding -Name eventHubMessages -Value $body -ErrorAction Stop
} catch {
  Write-Host "=== Error pushing ==="
  # do something about it...
}

The expectation was that if the push fails I can log it and also do something like email myself the item. But this doesn’t work, coz wrapping Push-OutputBinding in a try/catch block doesn’t catch send failures. I never tested whether this works or not, and always assumed it does, until I was testing something this weekend and realized the exceptions when pushing weren’t being caught. That’s because Push-OutputBinding only hands the value over; the output bindings themselves are executed after the Function exits, by the Function host/worker and not by the Function itself.

The way I encountered this was because I copy pasted some event hub bindings between two of my Functions without realizing I was copying the wrong code. The Function I was copying from had an event hub trigger, while the Function I copied to had the event hub as an output, and as you can see the two differ:

Output binding

"bindings": [
  {
    "name": "eventHubMessagesOut",
    "direction": "out",
    "type": "eventHub",
    "connection": "xxx_Function_EVENTHUB",
    "eventHubName": "yyyy"
  }
]

Input binding

"bindings": [
  {
    "type": "eventHubTrigger",
    "name": "eventHubMessages",
    "direction": "in",
    "eventHubName": "yyyy",
    "connection": "xxx_Function_EVENTHUB",
    "cardinality": "many",
    "consumerGroup": "$Default"
  }
]

Because of this mismatch I was getting an error: No binding found for attribute 'Microsoft.Azure.WebJobs.EventHubTriggerAttribute'.

I couldn’t find out why this was so until I realized the mistake I made.

The biggest thing I learnt though was about retries. To begin with, check out this link on how Azure Functions consumes Event Hubs. I am going to copy paste it here.

Azure Functions consumes Event Hub events while cycling through the following steps:

1. A pointer is created and persisted in Azure Storage for each partition of the event hub.
2. When new messages are received (in a batch by default), the host attempts to trigger the function with the batch of messages.
3. If the function completes execution (with or without exception) the pointer advances and a checkpoint is saved to the storage account.
4. If conditions prevent the function execution from completing, the host fails to progress the pointer. If the pointer isn’t advanced, then later checks end up processing the same messages.
5. Repeat steps 2–4

This behavior reveals a few important points:

Unhandled exceptions may cause you to lose messages. Executions that result in an exception will continue to progress the pointer. Setting a retry policy will delay progressing the pointer until the entire retry policy has been evaluated.
Functions guarantees at-least-once delivery. Your code and dependent systems may need to account for the fact that the same message could be received twice.

The last two points are super important. Unless a Function crashes, if an Event Hub message is read and the Function doesn’t process it for some reason, it may not see it again. That is to say, if I read 10 messages from the Hub, process 6 and run into some error for the remaining 4 – which I may or may not catch via a try/ catch block – the Event Hub & Function don’t care and I may not see those messages again. So it is up to me, the developer, to ensure I handle failed messages.

I sort of knew this, but I also assumed that if the Function runs into an exception then it magically knows to re-read those messages again from the Event Hub. ☺️ My bad, of course!
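As an aside, the retry policy mentioned in the quote above is something you have to opt into. A minimal sketch of what that might look like in function.json, reusing the trigger binding from earlier; the strategy, count and delay values here are just illustrative assumptions, not recommendations:

```json
{
  "bindings": [
    {
      "type": "eventHubTrigger",
      "name": "eventHubMessages",
      "direction": "in",
      "eventHubName": "yyyy",
      "connection": "xxx_Function_EVENTHUB",
      "cardinality": "many",
      "consumerGroup": "$Default"
    }
  ],
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 3,
    "delayInterval": "00:00:10"
  }
}
```

As far as I understand it, a retry only kicks in when the invocation fails with an unhandled exception, so code that swallows every error in a try/catch never triggers it. And a retry re-runs the whole batch, so messages that were already processed in the failed attempt get processed again.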

Here, once again, we encounter the Function host/worker. It doesn’t know what the Function is doing, which is why it doesn’t know to re-read the messages. The only signal it has is whether the Function completed or crashed, and it’s on that basis that it re-reads messages if needed.

The second point is that a Function may read the same message more than once. Because, if the Function crashes like we said above, subsequent executions might read already processed messages. So I as the developer must expect this and do something to ensure I can handle duplicates.

Every function must have try/ catch blocks to handle messages that didn’t process.
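Here is a rough sketch of what that could look like in run.ps1, assuming the trigger binding is named eventHubMessages with cardinality "many" as above. Invoke-MessageWork is a made-up placeholder for the real processing, and failedMessagesOut is a hypothetical Event Hub output binding that would need to exist in function.json:

```powershell
# run.ps1 for an Event Hub triggered Function (cardinality "many").
param($eventHubMessages, $TriggerMetadata)

# Hypothetical per-message work; replace with the real processing.
function Invoke-MessageWork {
    param($Message)
    if ($null -eq $Message) { throw "Empty message" }
    # ... real work goes here ...
}

# Messages that failed in this batch, kept so they are not silently lost.
$failed = [System.Collections.Generic.List[object]]::new()

foreach ($message in $eventHubMessages) {
    try {
        Invoke-MessageWork -Message $message
    } catch {
        # One bad message should not stop the rest of the batch.
        Write-Host "=== Error processing message: $($_.Exception.Message) ==="
        $failed.Add($message)
    }
}

if ($failed.Count -gt 0) {
    # 'failedMessagesOut' is an assumed output binding name, not from the original setup.
    # The value is only handed over here; the host does the sending after the Function exits.
    Push-OutputBinding -Name failedMessagesOut -Value $failed.ToArray()
}
```

Two caveats. The catch only sees terminating errors, so cmdlets called inside the processing need -ErrorAction Stop (or a throw) to land there. And if the host later fails to send the failed messages, that failure happens after the Function has exited, so nothing in this script can catch it.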

Next, this article on checkpointing. Again, I’ll copy paste:

Checkpoints mark or commit reader positions in a partition event sequence. It’s the responsibility of the Functions host to checkpoint as events are processed and the setting for the batch checkpoint frequency is met. For more information about checkpointing, see Features and terminology in Azure Event Hubs.

The following concepts can help you understand the relationship between checkpointing and the way that your function processes events:

Exceptions still count towards success: If the function process doesn’t crash while processing events, the completion of the function is considered successful, even if exceptions occurred. When the function completes, the Functions host evaluates batchCheckpointFrequency. If it’s time for a checkpoint, it creates one, regardless of whether there were exceptions. The fact that exceptions don’t affect checkpointing shouldn’t affect your proper use of exception checking and handling.
Batch frequency matters: In high-volume event streaming solutions, it can be beneficial to change the batchCheckpointFrequency setting to a value greater than 1. Increasing this value can reduce the rate of checkpoint creation and, as a consequence, the number of storage I/O operations.
Replays can happen: Each time a function is invoked with the Event Hubs trigger binding, it uses the most recent checkpoint to determine where to resume processing. The offset for every consumer is saved at the partition level for each consumer group. Replays happen when a checkpoint doesn’t occur during the last invocation of the function, and the function is invoked again. For more information about duplicates and deduplication techniques, see Idempotency.

Understanding checkpointing becomes critical when you consider best practices for error handling and retries, a topic that’s discussed later in this article.

The first and last points we already know. But batchCheckpointFrequency is something new. What is this setting?

Several configuration settings in the host.json file play a key role in the performance characteristics of the Event Hubs trigger binding for Functions:

maxEventBatchSize: This setting represents the maximum number of events that the function can receive when it’s invoked. If the number of events received is less than this amount, the function is still invoked with as many events as are available. You can’t set a minimum batch size.
prefetchCount: The prefetch count is one of the most important settings when you optimize for performance. The underlying AMQP channel references this value to determine how many messages to fetch and cache for the client. The prefetch count should be greater than or equal to the maxEventBatchSize value and is commonly set to a multiple of that amount. Setting this value to a number less than the maxEventBatchSize setting can hurt performance.
batchCheckpointFrequency: As your function processes batches, this value determines the rate at which checkpoints are created. The default value is 1, which means that there’s a checkpoint whenever a function successfully processes a batch. A checkpoint is created at the partition level for each reader in the consumer group. For information about how this setting influences replays and retries of events, see Event hub triggered Azure function: Replays and Retries (blog post).

Do several performance tests to determine the values to set for the trigger binding. We recommend that you change settings incrementally and measure consistently to fine-tune these options. The default values are a reasonable starting point for most event processing solutions.

The default value of 1 means a checkpoint is written as each batch is processed. And maxEventBatchSize sets the most messages that are pulled each time. (There is no minimum amount, and also notice there is no setting that tells a Function to query an Event Hub for new messages – say when you are troubleshooting, or coz your Function crashed and now you want it to check for new messages. The only way I know of to do that is to send something to the Event Hub, which causes the Function to be triggered).
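For reference, a sketch of where these three settings might sit in host.json. This assumes the newer 5.x Event Hubs extension (older versions nest them differently and use different names), and the numbers are placeholders to tune, not advice:

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "maxEventBatchSize": 100,
      "prefetchCount": 300,
      "batchCheckpointFrequency": 1
    }
  }
}
```

Here prefetchCount is kept at a multiple of maxEventBatchSize, as the quoted docs suggest, and batchCheckpointFrequency is left at 1 so a checkpoint is written after every batch.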

Here’s some good info on how you can get duplicate messages.

More later, I am still learning stuff. 🧑🏽‍💻