moidhassan committed
Commit 794ffcf
1 Parent(s): ad85cab

Resolve - 196 [rank0]: triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 180224, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.

Files changed (1)
  1. triton_flash_blocksparse_attn.py +1 -1
triton_flash_blocksparse_attn.py CHANGED
@@ -1020,7 +1020,7 @@ def blocksparse_flash_attn_padded_fwd(
         BLOCK_M_LOADING = 16 if q_len == 1 else block_size, # smaller for decoding
         EVEN_D = block_d == head_size,
         num_warps = 1 if q_len == 1 else 4,
-        num_stages = 3
+        num_stages = 1 # <---- instead of 3
     )
 
     return out
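
For context: `num_stages` sets the depth of Triton's software pipelining, and each in-flight stage buffers its own input tiles in shared memory, so the shared-memory requirement grows roughly in proportion to the stage count. Here three stages required 180224 bytes against the 101376-byte hardware limit, while a single stage fits. Below is a minimal sketch of a defensive fallback; the `launch_with_stage_fallback` helper and its `kernel_call` closure are hypothetical illustrations, not part of this file:

from triton.runtime.autotuner import OutOfResources  # exception class named in the traceback

def launch_with_stage_fallback(kernel_call, preferred_stages=3):
    """Hypothetical helper: retry a Triton kernel launch with fewer
    pipeline stages until it fits in shared memory.

    `kernel_call` is assumed to be a closure that launches the kernel
    and accepts `num_stages` as a keyword argument.
    """
    for stages in range(preferred_stages, 0, -1):
        try:
            return kernel_call(num_stages=stages)
        except OutOfResources:
            # Each pipeline stage keeps its own tile buffers resident in
            # shared memory, so lowering num_stages shrinks the footprint.
            continue
    raise RuntimeError("kernel exceeds shared memory even with num_stages = 1")

Pinning `num_stages = 1` as this commit does trades pipelining overlap for a guaranteed fit; a fallback like the sketch above would keep the deeper pipeline on GPUs with a larger shared-memory limit.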