@caiomcbr (Contributor) commented Aug 19, 2025

In cases where channels are created in a circular order across ranks, such as:
creating channel 0 <-> 1
creating channel 1 <-> 2
creating channel 2 <-> 3
creating channel 3 <-> 0
creating channel 0 <-> 3
creating channel 1 <-> 0
creating channel 2 <-> 1
creating channel 3 <-> 2

This setup can deadlock while each rank is creating its first channel. The current code requires the semaphore for the first channel to be shared before moving on to the next one, which leads to the following sequence:
creating channel 0 <-> 1
creating channel 1 <-> 2
creating channel 2 <-> 3
creating channel 3 <-> 0
<-- HANG ISSUE -->

The process hangs because, for example, rank 0 will only share the semaphore with rank 3 after receiving it from rank 1. However, rank 1 is waiting for a semaphore from rank 2, rank 2 is waiting for one from rank 3, and rank 3 is waiting for one from rank 0.
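
The circular wait described above can be checked with a small wait-for graph. This is only an illustrative sketch, not code from the PR: `find_wait_cycle` and `first_peer` are hypothetical names, and it assumes 4 ranks where rank `r` blocks on rank `(r + 1) % 4` while creating its first channel.

```python
def find_wait_cycle(waits_for):
    """Return the ranks forming a wait-for cycle, or None if there is none.

    waits_for maps each rank to the single rank it is currently blocked on.
    """
    for start in waits_for:
        seen = []
        rank = start
        # Follow the chain of "is waiting on" edges until it ends or repeats.
        while rank in waits_for and rank not in seen:
            seen.append(rank)
            rank = waits_for[rank]
        # Coming back to the starting rank means nobody in the chain can progress.
        if rank == start and seen:
            return seen
    return None


# Assumption: during its first channel creation, rank r waits on rank (r + 1) % 4.
first_peer = {rank: (rank + 1) % 4 for rank in range(4)}
cycle = find_wait_cycle(first_peer)
```

Here `cycle` contains all four ranks, which matches the hang: every rank is waiting on a peer that is itself waiting.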

The solution is to make semaphore creation asynchronous: request all semaphores first, and only retrieve them once every request has been issued.
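
A minimal sketch of this two-phase ordering, using threads and queues to stand in for ranks and the transport. It is a simulation under stated assumptions, not the Executor code: `mailboxes`, `received`, and `setup_channels` are hypothetical names, and each rank is assumed to connect to its next and previous neighbor in the same order as in the example above.

```python
import queue
import threading

NUM_RANKS = 4

# Hypothetical transport: one mailbox per (sender, receiver) pair.
mailboxes = {
    (src, dst): queue.Queue()
    for src in range(NUM_RANKS)
    for dst in range(NUM_RANKS)
    if src != dst
}
received = {rank: {} for rank in range(NUM_RANKS)}


def setup_channels(rank):
    # Same creation order as in the description: next rank first, then previous.
    peers = [(rank + 1) % NUM_RANKS, (rank - 1) % NUM_RANKS]
    # Phase 1: post the local semaphore to every peer without waiting for a reply.
    for peer in peers:
        mailboxes[(rank, peer)].put(f"semaphore {rank}->{peer}")
    # Phase 2: only now block on the semaphores coming from the peers.
    for peer in peers:
        received[rank][peer] = mailboxes[(peer, rank)].get(timeout=5)


threads = [threading.Thread(target=setup_channels, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no rank blocks until all of its own semaphores have been posted, every `get` in phase 2 eventually finds a message and the circular wait cannot form.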

@caiomcbr caiomcbr requested a review from Binyang2014 August 19, 2025 22:13

@Binyang2014 (Contributor) left a comment

Could you update the PR title to make it clearer?

@caiomcbr caiomcbr changed the title from "Adjusting Setup Channels Executor" to "Fix deadlock in Executor channel setup" Aug 19, 2025
@caiomcbr caiomcbr merged commit f839184 into main Aug 19, 2025
14 checks passed
@caiomcbr caiomcbr deleted the caiorocha/fix_setup_channels branch August 19, 2025 23:44