Skip to content

fix: TOCTOU race in and_compute_with under multi-threaded tokio#1

Open
Squadrick wants to merge 1 commit intoanthropics:anthropic-0.12.13from
Squadrick:squadrick/and-compute-with-toctou
Open

fix: TOCTOU race in and_compute_with under multi-threaded tokio#1
Squadrick wants to merge 1 commit intoanthropics:anthropic-0.12.13from
Squadrick:squadrick/and-compute-with-toctou

Conversation

@Squadrick
Copy link
Collaborator

In try_compute and try_compute_if_nobody_else, the waiter was removed from the waiter map (via set_waiter_value(ReadyNone)) BEFORE the actual cache mutation (insert_with_hash/invalidate_with_hash). This created a window where a concurrent and_compute_with caller could:

  1. Successfully insert its own waiter (the first was already removed)
  2. Read the cache and see the OLD value (first caller hasn't written yet)
  3. Both callers execute Op::Put based on the same stale data

This caused silent data loss under concurrent access with a multi-threaded tokio runtime (e.g. 10-15% of increments lost in a concurrent counter test with 8 writers).

The fix defers set_waiter_value(ReadyNone) to after the cache mutation in each match arm, ensuring the waiter remains in the map (blocking concurrent callers) until the cache is updated.

Contrast with try_init_or_read (used by get_with), which already does this correctly: it calls insert_with_hash BEFORE set_waiter_value.

In try_compute and try_compute_if_nobody_else, the waiter was removed
from the waiter map (via set_waiter_value(ReadyNone)) BEFORE the actual
cache mutation (insert_with_hash/invalidate_with_hash). This created a
window where a concurrent and_compute_with caller could:

1. Successfully insert its own waiter (the first was already removed)
2. Read the cache and see the OLD value (first caller hasn't written yet)
3. Both callers execute Op::Put based on the same stale data

This caused silent data loss under concurrent access with a
multi-threaded tokio runtime (e.g. 10-15% of increments lost in a
concurrent counter test with 8 writers).

The fix defers set_waiter_value(ReadyNone) to after the cache mutation
in each match arm, ensuring the waiter remains in the map (blocking
concurrent callers) until the cache is updated.

Contrast with try_init_or_read (used by get_with), which already does
this correctly: it calls insert_with_hash BEFORE set_waiter_value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant