+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100S-32Q       On  | 00000000:02:01.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Same problem. In the end I found that the KV cache is constructed with torch.empty, so it is filled with uninitialized values, and for some reason those uninitialized values end up being involved in the computation.
Here, in vllm/vllm/worker/cache_engine.py:
def allocate_gpu_cache(self) -> List[KVCache]:
    gpu_cache: List[KVCache] = []
    key_block_shape = self.get_key_block_shape()
    value_block_shape = self.get_value_block_shape()
    for _ in range(self.num_layers):
        key_blocks = torch.empty(
            size=(self.num_gpu_blocks, *key_block_shape),
            dtype=self.dtype,
            device="cuda",
        )
        value_blocks = torch.empty(
            size=(self.num_gpu_blocks, *value_block_shape),
            dtype=self.dtype,
            device="cuda",
        )
        gpu_cache.append((key_blocks, value_blocks))
    return gpu_cache
When I change torch.empty to torch.zeros, the model no longer outputs nan, but I believe the real problem is a bug elsewhere: because of the uninitialized values, the texts generated for gpt2 by vLLM and by HuggingFace are different.
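For reference, a minimal sketch of the kind of comparison I mean (greedy decoding on both sides; the prompt and lengths are arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

prompt = "The capital of France is"

# HuggingFace reference: greedy decoding.
tok = AutoTokenizer.from_pretrained("gpt2")
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
hf_ids = hf_model.generate(**tok(prompt, return_tensors="pt"),
                           do_sample=False, max_new_tokens=32)
hf_text = tok.decode(hf_ids[0], skip_special_tokens=True)

# vLLM: temperature=0 also gives greedy decoding.
llm = LLM(model="gpt2")
vllm_text = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=32))[0].outputs[0].text

print(hf_text)
print(prompt + vllm_text)  # vLLM returns only the continuation; with the uninitialized cache these can disagree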
I am running gpt2 in the docker image kevinng77/vllm with a single T4-8C GPU.
I believe single_query_cached_kv_attention_kernel does not properly check the boundary in the context-length dimension when the block size is larger than one. Setting the block size to one suppresses this bug.
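To make the concern concrete, here is a small sketch (plain PyTorch, not the actual CUDA kernel) of the boundary handling I would expect: with block_size > 1, the last cache block contains slots past context_len that hold uninitialized values, and those slots have to be masked out of the attention softmax.

import torch

block_size = 16
context_len = 20   # fills 2 blocks; only 4 slots of the second block are valid
num_slots = ((context_len + block_size - 1) // block_size) * block_size

# Stand-in for the q.k^T scores over all cached slots of one sequence.
scores = torch.randn(num_slots)

# Without a boundary check, slots [context_len, num_slots) contribute whatever
# torch.empty left in memory. Masking them keeps the softmax well-defined.
valid = torch.arange(num_slots) < context_len
probs = torch.softmax(scores.masked_fill(~valid, float("-inf")), dim=-1)
assert probs[context_len:].sum() == 0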
When using the same prompt and greedy sampling params, the output is not the same across two runs.
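A minimal sketch of the kind of repro I mean (vLLM Python API; the model and prompt here are just placeholders):

from vllm import LLM, SamplingParams

llm = LLM(model="gpt2")
greedy = SamplingParams(temperature=0.0, max_tokens=32)

out1 = llm.generate(["Hello, my name is"], greedy)[0].outputs[0].text
out2 = llm.generate(["Hello, my name is"], greedy)[0].outputs[0].text
print(out1 == out2)  # expected True for greedy decoding, but the runs disagree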
🌡 Have you tried increasing the temperature?
Well, try increasing the temperature value. I had a very low temperature along with other parameters such as top_k and top_p, which made the next-token distribution too steep; beam search needs multiple candidate tokens available, and with such a low temperature I didn't have them (that's just how temperature works). So I increased the temperature and it worked.
Try increasing the temperature value and it should just work, if there are no other complexities involved.
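Roughly the kind of change being suggested (a sketch using transformers' generate; the model name and parameter values are only illustrative, not taken from this thread):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Once upon a time", return_tensors="pt")

# Very low temperature plus tight top_k/top_p sharpens the next-token
# distribution, leaving beam-sample with very few viable candidates.
too_sharp = dict(do_sample=True, num_beams=2, temperature=0.05, top_k=3, top_p=0.3)

# Relaxed settings in the spirit of the advice above: a higher temperature
# keeps enough probability mass spread over candidate tokens.
relaxed = dict(do_sample=True, num_beams=2, temperature=0.8, top_k=50, top_p=0.95)

for name, cfg in {"too sharp": too_sharp, "relaxed": relaxed}.items():
    out = model.generate(**inputs, **cfg, max_new_tokens=30)
    print(name, "->", tok.decode(out[0], skip_special_tokens=True))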
“RuntimeError: probability tensor contains either inf, nan or element < 0” when use llama2-70B #1448
I tried changing empty to zeros in both allocate_gpu_cache and allocate_cpu_cache, but it doesn't help. Actually, I am using the LLM "internlm/internlm-chat-20b", so I'd appreciate any other suggestions.
Thank you!
“RuntimeError: probability tensor contains either inf, nan or element < 0” when use llama2-70B #1448
Hi WoosukKwon, so for llama2-70B, should I do the same operation to avoid 'nan'?
[Bug] When more than 1 is used for num_beams: probability tensor contains either inf, nan or element < 0 coqui-ai/TTS#3232
RuntimeError: probability tensor contains either inf, nan or element < 0
I'm facing the same issue while using the Mistral LLM "filipealmeida/Mistral-7B-Instruct-v0.1-sharded":
result = map_reduce_chain.invoke(split_docs, return_only_outputs=True)
return result['output_text']
Please help me out with this ASAP.