
Hello everyone, I always get this error for Baichuan and LLaMA models, and I found it's caused by the single_query_cached_kv_attention method in vllm/model_executor/layers/attention.py. After this method is called, the hidden output has some rows of "nan". How can I fix this? Thanks!

I still get these errors even after installing xformers from source.

This is my code:

from vllm import LLM, SamplingParams
#from vllm.transformers_utils.configs.baichuan import BaiChuanConfig

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=1, top_p=0.95)
llm = LLM(
    model="/.../Baichuan-7b",
    trust_remote_code=True,
    dtype='float16',
    gpu_memory_utilization=0.85,
    tokenizer_mode="slow",
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

and this is my Python environment:

accelerate                0.21.0
aiofiles                  23.1.0
aiohttp                   3.8.5
aiosignal                 1.3.1
altair                    5.0.1
annotated-types           0.5.0
anyio                     3.7.1
appdirs                   1.4.4
argon2-cffi               21.3.0
argon2-cffi-bindings      21.2.0
arrow                     1.2.3
asttokens                 2.2.1
async-lru                 2.0.3
async-timeout             4.0.2
attrs                     23.1.0
Babel                     2.12.1
backcall                  0.2.0
beautifulsoup4            4.12.2
bleach                    6.0.0
blinker                   1.6.2
boltons                   23.0.0
brotlipy                  0.7.0
certifi                   2022.12.7
cffi                      1.15.1
charset-normalizer        2.0.4
click                     8.1.6
cmake                     3.27.0
comm                      0.1.3
conda                     23.3.1
conda-content-trust       0.1.3
conda-package-handling    2.0.2
conda_package_streaming   0.7.0
contourpy                 1.1.0
cryptography              39.0.1
cycler                    0.11.0
datasets                  2.14.0
debugpy                   1.6.7
decorator                 5.1.1
defusedxml                0.7.1
dill                      0.3.7
distlib                   0.3.7
docker-pycreds            0.4.0
editables                 0.5
exceptiongroup            1.1.2
executing                 1.2.0
fastapi                   0.100.0
fastjsonschema            2.18.0
ffmpy                     0.3.1
filelock                  3.12.2
Flask                     2.3.2
fonttools                 4.41.1
fqdn                      1.5.1
frozenlist                1.4.0
fsspec                    2023.6.0
gitdb                     4.0.10
GitPython                 3.1.32
gradio                    3.35.2
gradio_client             0.2.10
grpcio                    1.56.2
h11                       0.14.0
hatchling                 1.18.0
httpcore                  0.17.3
httpx                     0.24.1
huggingface-hub           0.16.4
idna                      3.4
ipykernel                 6.24.0
ipython                   8.14.0
ipython-genutils          0.2.0
ipywidgets                8.0.7
isoduration               20.11.0
itsdangerous              2.1.2
jedi                      0.18.2
jieba                     0.42.1
Jinja2                    3.1.2
joblib                    1.3.1
json5                     0.9.14
jsonpatch                 1.32
jsonpointer               2.1
jsonschema                4.18.4
jsonschema-specifications 2023.7.1
jupyter                   1.0.0
jupyter_client            8.3.0
jupyter-console           6.6.3
jupyter_core              5.3.1
jupyter-events            0.6.3
jupyter-lsp               2.2.0
jupyter_server            2.7.0
jupyter_server_terminals  0.4.4
jupyterlab                4.0.3
jupyterlab-pygments       0.2.2
jupyterlab_server         2.24.0
jupyterlab-widgets        3.0.8
kiwisolver                1.4.4
linkify-it-py             2.0.2
lit                       16.0.6
markdown-it-py            2.2.0
markdown2                 2.4.10
MarkupSafe                2.1.3
matplotlib                3.7.2
matplotlib-inline         0.1.6
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mistune                   3.0.1
mpmath                    1.3.0
msgpack                   1.0.5
multidict                 6.0.4
multiprocess              0.70.15
mypy-extensions           1.0.0
nbclient                  0.8.0
nbconvert                 7.7.2
nbformat                  5.9.1
nest-asyncio              1.5.6
networkx                  3.1
nh3                       0.2.14
ninja                     1.11.1
nltk                      3.8.1
notebook                  7.0.0
notebook_shim             0.2.3
numpy                     1.25.1
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.2.10.91
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusparse-cu11      11.7.4.91
nvidia-nccl-cu11          2.14.3
nvidia-nvtx-cu11          11.7.91
orjson                    3.9.2
overrides                 7.3.1
packaging                 23.0
pandas                    2.0.3
pandocfilters             1.5.0
parso                     0.8.3
pathspec                  0.11.1
pathtools                 0.1.2
peft                      0.4.0
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    10.0.0
pip                       23.0.1
platformdirs              3.9.1
pluggy                    1.0.0
prometheus-client         0.17.1
prompt-toolkit            3.0.39
protobuf                  4.23.4
psutil                    5.9.5
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   12.0.1
pycosat                   0.6.4
pycparser                 2.21
pydantic                  1.10.12
pydantic_core             2.3.0
pydub                     0.25.1
Pygments                  2.15.1
pyOpenSSL                 23.0.0
pyparsing                 3.0.9
pyre-extensions           0.0.29
PySocks                   1.7.1
python-dateutil           2.8.2
python-json-logger        2.0.7
python-multipart          0.0.6
pytz                      2023.3
PyYAML                    6.0.1
pyzmq                     25.1.0
qtconsole                 5.4.3
QtPy                      2.3.1
ray                       2.6.1
referencing               0.30.0
regex                     2023.6.3
requests                  2.28.1
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      13.4.2
rouge-chinese             1.0.3
rpds-py                   0.9.2
ruamel.yaml               0.17.21
ruamel.yaml.clib          0.2.6
safetensors               0.3.1
semantic-version          2.10.0
Send2Trash                1.8.2
sentencepiece             0.1.99
sentry-sdk                1.28.1
setproctitle              1.3.2
setuptools                65.6.3
shortuuid                 1.0.11
six                       1.16.0
smmap                     5.0.0
sniffio                   1.3.0
soupsieve                 2.4.1
stack-data                0.6.2
starlette                 0.27.0
svgwrite                  1.4.3
sympy                     1.12
terminado                 0.17.1
tinycss2                  1.2.1
tokenizers                0.13.3
tomli                     2.0.1
toolz                     0.12.0
torch                     2.0.1
tornado                   6.3.2
tqdm                      4.65.0
traitlets                 5.9.0
transformers              4.31.0
triton                    2.0.0
trl                       0.4.7
trove-classifiers         2023.7.6
typing_extensions         4.7.1
typing-inspect            0.9.0
tzdata                    2023.3
uc-micro-py               1.0.2
uri-template              1.3.0
urllib3                   1.26.15
uvicorn                   0.23.1
virtualenv                20.24.2
vllm                      0.1.2       /.../feng/OpenSource/vllm
wandb                     0.15.7
wavedrom                  2.0.3.post3
wcwidth                   0.2.6
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.6.1
websockets                11.0.3
Werkzeug                  2.3.6
wheel                     0.38.4
widgetsnbextension        4.0.8
xformers                  0.0.20
xxhash                    3.2.0
yarl                      1.9.2
zstandard                 0.19.0

and my GPU info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100S-32Q      On   | 00000000:02:01.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
          

Same problem. In the end I found that the KVCache is constructed with torch.empty and therefore filled with uninitialized values, and for some reason those uninitialized values get involved in the computation.
Here in vllm/vllm/worker/cache_engine.py:

def allocate_gpu_cache(self) -> List[KVCache]:
    gpu_cache: List[KVCache] = []
    key_block_shape = self.get_key_block_shape()
    value_block_shape = self.get_value_block_shape()
    for _ in range(self.num_layers):
        key_blocks = torch.empty(
            size=(self.num_gpu_blocks, *key_block_shape),
            dtype=self.dtype,
            device="cuda",
        )
        value_blocks = torch.empty(
            size=(self.num_gpu_blocks, *value_block_shape),
            dtype=self.dtype,
            device="cuda",
        )
        gpu_cache.append((key_blocks, value_blocks))
    return gpu_cache

When I change torch.empty to torch.zeros, the model no longer outputs nan, but I believe it is a bug: because of the uninitialized values, the texts generated by gpt2 with vLLM and Hugging Face are different.
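As a minimal illustration of the workaround, here is a pure-Python sketch (the names and shapes are hypothetical stand-ins; the real vLLM code allocates torch tensors on the GPU): zero-filling each block at allocation time means stale memory can never leak into the attention computation.

```python
from typing import List, Tuple

# Hypothetical stand-in for a (key_blocks, value_blocks) pair;
# real vLLM uses CUDA tensors, not Python lists.
KVCache = Tuple[List[float], List[float]]

def allocate_gpu_cache_zeroed(num_layers: int, num_blocks: int,
                              block_numel: int) -> List[KVCache]:
    # Equivalent of swapping torch.empty for torch.zeros: every slot
    # starts at 0.0 instead of whatever bytes were left in memory.
    gpu_cache: List[KVCache] = []
    for _ in range(num_layers):
        key_blocks = [0.0] * (num_blocks * block_numel)
        value_blocks = [0.0] * (num_blocks * block_numel)
        gpu_cache.append((key_blocks, value_blocks))
    return gpu_cache

cache = allocate_gpu_cache_zeroed(num_layers=2, num_blocks=4, block_numel=8)
```

Note that zero-filling only masks the symptom: if the kernel reads past the valid context length, it now reads zeros instead of garbage, which hides the out-of-bounds access rather than fixing it.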

I am running gpt2 in the docker image kevinng77/vllm with one T4-8C GPU.

I believe single_query_cached_kv_attention_kernel does not properly check the boundary in the context-length dimension when the block size is larger than one.

Setting the block size to one suppresses this bug.
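If your vLLM build forwards engine arguments through the LLM constructor (an assumption; check EngineArgs in your version), the workaround might look like this configuration sketch:

```python
from vllm import LLM

# block_size=1 is the workaround described above; whether LLM accepts
# this keyword depends on your vLLM version (assumption, not verified).
llm = LLM(
    model="/.../Baichuan-7b",
    trust_remote_code=True,
    dtype="float16",
    block_size=1,
)
```

A block size of one costs paging efficiency, so treat this as a diagnostic aid rather than a production setting.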

When using the same prompt and greedy sampling params, the output is not the same across the two runs.

🌡 Have you tried increasing the temperature?

Try increasing the temperature value. I had a very low temperature along with other parameters such as top_k and top_p, which made the next-token distribution too steep. Beam search's logic needs multiple candidate tokens available, and with such a low temperature I couldn't get them (because we know how temperature works, right?).

So I increased the temperature and it worked.

Try increasing the temperature value and it should just work, if there are no other complexities involved.
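To make the "too steep" point concrete, here is a small self-contained sketch of temperature scaling (plain Python, not vLLM's implementation): logits are divided by the temperature before the softmax, so a low temperature concentrates almost all probability mass on the top token, leaving beam search with effectively one candidate.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before the softmax: low temperature
    # sharpens the distribution, high temperature flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.1)  # nearly one-hot
flat = softmax_with_temperature(logits, 2.0)   # much more uniform
```

With temperature 0.1, the top token gets essentially all of the probability; with temperature 2.0, the mass is spread across all three tokens.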

“RuntimeError: probability tensor contains either inf, nan or element < 0” when using llama2-70B #1448

I tried to change empty to zeros for both allocate_gpu_cache and allocate_cpu_cache, but it doesn't help.

Actually, I am using the LLM "internlm/internlm-chat-20b", so I'd appreciate any other suggestions.

Thank you!

“RuntimeError: probability tensor contains either inf, nan or element < 0” when use llama2-70B #1448

Hi WoosukKwon, so for llama2-70B, should I do the same operation to avoid 'nan'?

[Bug] When more than 1 is used for num_beams : probability tensor contains either inf, nan or element < 0 coqui-ai/TTS#3232

RuntimeError: probability tensor contains either inf, nan or element < 0

I'm facing the same issue while using the Mistral LLM
"filipealmeida/Mistral-7B-Instruct-v0.1-sharded"

result = map_reduce_chain.invoke(split_docs, return_only_outputs=True)
return result['output_text']


Please help me out with this ASAP.