You can't run ChatGPT on a single GPU, but you can run some far less complex text generation large language models on your own PC. We tested oobabooga's text-generation-webui on several cards to see how fast it is and what sort of results you can expect.
How to Run a ChatGPT Alternative on Your Local PC: Read more
@jarred
, thanks for the ongoing forays into generative-AI uses and the HW requirements. Some questions:
1. What's the qualitative difference between 4-bit and 8-bit answers?
2. How does the tokens/sec perf number translate to speed of response (output)? I asked ChatGPT about this and it only gives me speed of processing input (e.g. input length / tokens/sec).
I'm building a box specifically to play with AI-in-a-box setups like you're doing, so it's helpful to have trailblazers in front. I'll likely go with a baseline GPU, i.e. a 3060 w/ 12GB VRAM, as I'm not after performance, just learning.
Looking forward to seeing an open-source ChatGPT alternative. IIRC, StabilityAI CEO has intimated that such is in the works.
The 8-bit and 4-bit versions are supposed to be virtually the same quality, according to what I've read, and having tried both I didn't see a massive change. Basically, the weights either trend toward a larger number or toward zero, so 4-bit is enough, or something like that. A "token" is just a word, more or less (things like parts of a URL also qualify as a "token," I think, which is why it's not strictly a one-to-one equivalence).
For the GPUs, a 3060 is a good baseline, since it has 12GB and can thus run up to a 13b model. I suspect long-term, a lot of stuff will want at least 24GB to get better results.
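Putting rough numbers on both of those questions, here's a hedged back-of-the-envelope sketch (not a measurement; actual VRAM use also includes activations, the KV cache, and framework overhead, so these are lower bounds):

```python
# Back-of-the-envelope sizing and latency estimates for a quantized
# LLM. Rough lower bounds only: real memory use also includes
# activations, the KV cache, and framework overhead.

def model_vram_gb(params_billion, bits_per_weight):
    """Approximate VRAM needed just to hold the weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def response_seconds(output_tokens, tokens_per_sec):
    """Generation is autoregressive (one token per step), so the time
    to receive a full reply is output length / generation rate."""
    return output_tokens / tokens_per_sec

# A 13b model at 4 bits needs ~6.5 GB for the weights, which is why
# it can fit on a 12GB card; at 16 bits it would need ~26 GB.
print(model_vram_gb(13, 4))        # -> 6.5
print(model_vram_gb(13, 16))       # -> 26.0
# A 200-token reply at 25 tokens/s takes about 8 seconds to finish.
print(response_seconds(200, 25))   # -> 8.0
```

This also answers the tokens/sec question: since output is produced one token at a time, tokens/sec divides directly into the length of the reply you're waiting for.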
I dream of a future when I could host an AI on a computer at home and connect it to the smart home systems. I would call her "EVA" as a tribute to the AI from Command & Conquer.
"Hey EVA, did anything out of the ordinary happen since I left home?"
Or when I'm away from home: "Hey EVA, has my son returned from school yet? If so, open a voice chat with him."
And of course the basic stuff like lights and temperature control.
It's just the coolest name for an AI.
UPDATE: I've managed to test Turing GPUs now, and I retested everything else just to be sure the new build didn't screw with the numbers. Try as I might, at least under Windows I can't get performance to scale beyond about 25 tokens/s on the responses with llama-13b-4bit. What's really weird is that the Titan RTX and RTX 2080 Ti come very close to that number, but all of the Ampere GPUs are about 20% slower.
Is the code somehow better optimized for Turing? Maybe, or maybe it's something else. I created a new conda environment and went through all the steps again on an RTX 3090 Ti, and that environment was used for all of the Ampere GPUs. Using the same environment as the Turing or Ada GPUs (yes, I have three separate environments now) didn't change the results by more than the margin of error (~3%).
So, obviously there's room for optimizations and improvements to extract more throughput. At least, that's my assumption based on the RTX 2080 Ti humming along at a respectable 24.6 tokens/s. Meanwhile, the RTX 3090 Ti couldn't get above 22 tokens/s. Go figure.
Again, these are all preliminary results, and the article text should make that very clear. Linux might run faster, or perhaps there's just some specific code optimizations that would boost performance on the faster GPUs. Given a 9900K was noticeably slower than the 12900K, it seems to be pretty CPU limited, with a high dependence on single-threaded performance.
I suspect long-term, a lot of stuff will want at least 24GB to get better results.
Given Nvidia's current stranglehold on the GPU market as well as AI accelerators, I have no illusion that 24GB cards will be affordable to the average user any time soon. I'm wondering if offloading to system RAM is a possibility, not for this particular software, but for future models.
it seems to be pretty CPU limited, with a high dependence on single-threaded performance.
So CPU would need to be a benchmark? Does CPU make a difference for Stable Diffusion? Would X3D's larger L3 cache matter?
Because of the Microsoft/Google competition, we'll have access to free high-quality general-purpose chatbots. (BTW, no more waitlist for Bing Chat.) I'm hoping to see more niche bots limited to specific knowledge fields (e.g. programming, health questions, etc.) that can have lighter HW requirements, and thus be more viable running on consumer-grade PCs. That, and the control/customization aspect of having your own AI.
Looking around, I see there are several open-source projects in the offing. I'll be following their progress, e.g.:
OpenChatKit
I'm building a box specifically to play with AI-in-a-box setups like you're doing, so it's helpful to have trailblazers in front. I'll likely go with a baseline GPU, i.e. a 3060 w/ 12GB VRAM, as I'm not after performance, just learning.
If you're intending to work specifically with large models, you'll be extremely limited on a single-GPU consumer desktop. You might instead set aside the $ for renting time on Nvidia's A100 or H100 cloud instances. Or possibly Amazon's or Google's - not sure how well they scale to such large models. I haven't actually run the numbers on this - just something to consider.
If you're really serious about the DIY route, Tim Dettmers has been one of the leading authorities on the subject for many years.
What's really weird is that the Titan RTX and RTX 2080 Ti come very close to that number, but all of the Ampere GPUs are about 20% slower.
Is the code somehow better optimized for Turing? Maybe, or maybe it's something else.
I'd start reading up on tips to optimize PyTorch performance in Windows. It seems like others should've already spent a lot of time on this subject.
Also, when I've compiled deep learning frameworks in the past, you had to tell it which CUDA capabilities to use. Maybe specifying a common baseline will fail to utilize capabilities present only on the newer hardware. That said, I don't know to what extent this applies to PyTorch. I'm pretty sure there's some precompiled code, but then a hallmark of Torch is that it compiles your model for the specific hardware at runtime.
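As an illustration of how those capability targets are expressed, here's a small hypothetical helper (not part of any real build system) that generates nvcc `-gencode` flags of the kind that show up in PyTorch extension builds; compute_86/sm_86 is Ampere, and a `code=compute_XX` entry embeds PTX that the driver can JIT-compile for newer GPUs:

```python
# Hypothetical helper: build nvcc "-gencode" flags for a list of CUDA
# compute capabilities (e.g. 75 = Turing, 86 = Ampere). A build that
# only lists an older arch still runs on newer GPUs via the PTX
# fallback, but may miss newer hardware features.

def gencode_flags(capabilities, ptx_fallback=True):
    flags = []
    for cap in sorted(capabilities):
        # code=sm_XX emits native machine code (SASS) for that arch
        flags.append(f"-gencode=arch=compute_{cap},code=sm_{cap}")
    if ptx_fallback and capabilities:
        newest = max(capabilities)
        # code=compute_XX embeds PTX so the driver can JIT for
        # architectures newer than anything listed here
        flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags

print(gencode_flags([75, 86]))
```

For PyTorch extensions specifically, the `TORCH_CUDA_ARCH_LIST` environment variable controls which architectures end up in these flags.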
I'm wondering if offloading to system RAM is a possibility, not for this particular software, but future models.
Not really. Inferencing is massively bandwidth-intensive. If we make a simplistic assumption that the entire network needs to be applied for each token, and your model is too big to fit in GPU memory (e.g. trying to run a 24 GB model on a 12 GB GPU), then you might be left in a situation of trying to pull in the remaining 12 GB per iteration. Considering PCIe 4.0 x16 has a theoretical limit of 32 GB/s, you'd only be able to read in the other half of the model about 2.5 times per second. So, your throughput would drop by at least an order of magnitude.
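That arithmetic can be sketched as follows (same simplistic assumptions as above; the function name is just for illustration):

```python
# Sketch of the PCIe-offload bandwidth argument: if the part of the
# model that doesn't fit in VRAM must be streamed over the bus for
# every generated token, the bus caps throughput regardless of
# compute speed.

PCIE4_X16_GBPS = 32.0  # theoretical PCIe 4.0 x16 bandwidth, GB/s

def offload_tokens_per_sec(model_gb, vram_gb, pcie_gbps=PCIE4_X16_GBPS):
    """Upper bound on tokens/s when the overflow portion of the model
    is re-read from system RAM for each token."""
    overflow_gb = max(model_gb - vram_gb, 0)
    if overflow_gb == 0:
        return float('inf')  # model fits; PCIe is not the limit
    return pcie_gbps / overflow_gb

# 24 GB model on a 12 GB card: 12 GB streamed per token, so at most
# ~2.7 tokens/s even before counting any compute time.
print(offload_tokens_per_sec(24, 12))
```

Compared with the ~25 tokens/s reported earlier in the thread for models that fit entirely in VRAM, that's the order-of-magnitude drop described above.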
Those are indeed simplistic assumptions, but I think they're not too far off the mark. A better way to scale would be multi-GPU, where each card contains a part of the model. As data passes from the early layers of the model to the latter portion, it's handed off to the second GPU. This is essentially pipeline parallelism, akin to a dataflow architecture, and it's becoming a very popular way to scale AI processing.
I'm hoping to see more niche bots limited to specific knowledge fields (eg programming, health questions, etc) that can have lighter HW requirements, and thus be more viable running on consumer-grade PCs.
Don't count on it. Though the tech is advancing so fast that maybe someone will figure out a way to squeeze these models down enough that you can do it.
The 8-bit and 4-bit are supposed to be virtually the same quality, according to what I've read.
If today's models still work on the same general principles as what I saw in an AI class I took a long time ago, signals usually pass through sigmoid-like functions that help them converge toward 0/1 (or whatever numerical range the layer operates on). More resolution would only matter in cases where rounding at higher precision causes enough nodes to snap the other way and affect the output layer's outcome. When you have hundreds of inputs, most of the rounding noise should cancel itself out and not make much of a difference.
A bit weird by traditional math standards but it works.
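A toy demonstration of that rounding-noise argument. This is a pure-Python sketch, not how real 4-bit schemes like GPTQ actually work (those use per-group scales and error compensation), so it only illustrates the trend that lower precision adds noise, but much of it cancels in a large dot product:

```python
# Quantize a layer's weights to 8 and 4 bits and compare the
# resulting dot products against the full-precision result.
import random

def quantize(ws, bits):
    """Symmetric uniform quantization: snap each weight to the nearest
    of 2**(bits-1) - 1 evenly spaced levels on either side of zero."""
    scale = max(abs(w) for w in ws) / (2 ** (bits - 1) - 1)
    return [round(w / scale) * scale for w in ws]

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(512)]
inputs = [random.gauss(0, 1) for _ in range(512)]

exact = sum(w * x for w, x in zip(weights, inputs))
for bits in (8, 4):
    approx = sum(w * x for w, x in zip(quantize(weights, bits), inputs))
    print(f"{bits}-bit dot-product error: {abs(approx - exact):.4f}")
```

With hundreds of terms, the individual rounding errors are roughly zero-mean and largely cancel, which is the intuition for why 4-bit answers stay close to 8-bit ones.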
This process worked until I got to the setup_cuda.py step, although I had to run vcvars64.bat from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\ rather than the provided directory.
The error I get when I attempt to build the CUDA extension is as follows:
(llama4bit) E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa>python setup_cuda.py install
running install
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py:388: UserWarning: The detected CUDA version (11.4) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'quant_cuda' extension
Emitting ninja build file E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc --generate-dependencies-with-compile --dependency-output E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\TH -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\ProgramData\miniconda3\envs\llama4bit\include -IC:\ProgramData\miniconda3\envs\llama4bit\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -c E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ 
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
FAILED: E:/llmRunner/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc --generate-dependencies-with-compile --dependency-output E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\TH -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\ProgramData\miniconda3\envs\llama4bit\include -IC:\ProgramData\miniconda3\envs\llama4bit\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -c E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ 
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\pybind11\cast.h(717): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/core/TensorImpl.h(77): here
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\ATen/core/qualified_name.h(73): here
2 errors detected in the compilation of "e:/llmrunner/text-generation-webui/repositories/gptq-for-llama/quant_cuda_kernel.cu".
quant_cuda_kernel.cu
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\setup_cuda.py", line 4, in <module>
setup(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\__init__.py", line 108, in setup
return distutils.core.setup(**attrs)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
return run_commands(dist)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install.py", line 74, in run
self.do_egg_install()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install.py", line 123, in do_egg_install
self.run_command('bdist_egg')
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install_lib.py", line 11, in run
self.build()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\install_lib.py", line 111, in build
self.run_command('build_ext')
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\build_ext.py", line 84, in run
_build_ext.run(self)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 345, in run
self.build_extensions()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 467, in build_extensions
self._build_extensions_serial()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 493, in _build_extensions_serial
self.build_extension(ext)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 548, in build_extension
objects = self.compiler.compile(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 815, in win_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
(llama4bit) E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa>
This process worked until I got to setup-py, although I had to run vcvars64.bat from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\ rather than the provided directory.
The error I get when I attempt to setup cuda is as follows:
(llama4bit) E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa>python setup_cuda.py install
running install
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py:388: UserWarning: The detected CUDA version (11.4) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'quant_cuda' extension
Emitting ninja build file E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc --generate-dependencies-with-compile --dependency-output E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\TH -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\ProgramData\miniconda3\envs\llama4bit\include -IC:\ProgramData\miniconda3\envs\llama4bit\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -c E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ 
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
FAILED: E:/llmRunner/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc --generate-dependencies-with-compile --dependency-output E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\TH -IC:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\ProgramData\miniconda3\envs\llama4bit\include -IC:\ProgramData\miniconda3\envs\llama4bit\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -c E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ 
-D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std:

air, Ts=<T1, T2>]"
(721): here
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\pybind11\cast.h(717): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std:

air, Ts=<T1, T2>]"
(721): here
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/core/TensorImpl.h(77): here
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
C:/ProgramData/miniconda3/envs/llama4bit/lib/site-packages/torch/include\ATen/core/qualified_name.h(73): here
2 errors detected in the compilation of "e:/llmrunner/text-generation-webui/repositories/gptq-for-llama/quant_cuda_kernel.cu".
quant_cuda_kernel.cu
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa\setup_cuda.py", line 4, in <module>
setup(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\__init__.py", line 108, in setup
return distutils.core.setup(**attrs)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
return run_commands(dist)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install.py", line 74, in run
self.do_egg_install()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install.py", line 123, in do_egg_install
self.run_command('bdist_egg')
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\install_lib.py", line 11, in run
self.build()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\install_lib.py", line 111, in build
self.run_command('build_ext')
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\dist.py", line 1221, in run_command
super().run_command(command)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\build_ext.py", line 84, in run
_build_ext.run(self)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 345, in run
self.build_extensions()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 467, in build_extensions
self._build_extensions_serial()
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 493, in _build_extensions_serial
self.build_extension(ext)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\command\build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 548, in build_extension
objects = self.compiler.compile(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 815, in win_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\utils\cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
(llama4bit) E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa>
Interesting read
[1/1] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc
The article says it uses CUDA 11.7.
And one of the linked guides also says to use CUDA 11.7:
https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/
This:
C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple" detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]" (721): here
...is a C++ error that implies there's probably a version mismatch between two packages. That supports the idea that your CUDA version is too low, which we can also see by the fact that it happens while compiling a CUDA kernel:
2 errors detected in the compilation of "e:/llmrunner/text-generation-webui/repositories/gptq-for-llama/quant_cuda_kernel.cu".
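For anyone hitting the same wall, a quick sanity check is to compare the CUDA release that nvcc reports against the version the instructions expect. This is a minimal, hypothetical helper (not part of the webui) that pulls the release number out of `nvcc --version` output:

```python
import re
import subprocess


def nvcc_release(text: str) -> str:
    """Extract the CUDA release (e.g. '11.7') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else "unknown"


def installed_nvcc_release() -> str:
    """Run the nvcc found on PATH and report its CUDA release, or 'unknown' on failure."""
    try:
        out = subprocess.run(["nvcc", "--version"],
                             capture_output=True, text=True).stdout
    except FileNotFoundError:
        return "unknown"  # no CUDA toolkit on PATH at all
    return nvcc_release(out)
```

If the release this reports isn't the one your guide calls for, the CUDA kernel build is likely to fail exactly as above.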
If you're intending to work specifically with large models, you'll be extremely limited on a single-GPU consumer desktop. You might instead set aside the money for renting time on Nvidia's A100 or H100 cloud instances.
Thanks for the advice, but I'm afraid the real bottleneck in this case is the human operator, aka yours truly. I'm still in the wading-pool phase, far, far from the point of needing timeshares on cloud services. It'll be a while.
Until then, hopefully the concept of edge computing will come to apply to AI, and AI-in-a-box will become more mainstream. As well, I'm confident that Nvidia won't be able to hog the whole AI accelerator market to itself for too long.
Given the competition between the tech giants for market share, there'll be several high-quality general-purpose chatbots with free access. No need to reinvent the wheel there, so my hope is for DIY bots to be aimed at niche (area-specific) uses.
I'm confident that Nvidia won't be able to hog the whole AI accelerator market to itself for too long.
Intel GPUs have pretty decent AI performance, at least for what they cost. Much more competitive on that front than gaming.
I also expect AMD to redouble their efforts to try to capture some meaningful AI market share. It's a big, rapidly growing market that they've barely tapped, and they can't afford to ignore it. It's also one of the few markets where they haven't made meaningful traction in the past few years.
This process worked until I got to the setup_cuda.py step, although I had to run vcvars64.bat from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\ rather than the provided directory.
The error I get when I attempt to setup cuda is as follows:
(llama4bit) E:\llmRunner\text-generation-webui\repositories\GPTQ-for-LLaMa>python setup_cuda.py install
If Visual Studio is in Program Files (x86), that probably means you installed the 32-bit version. You might consider uninstalling that and getting the 64-bit version, as I never tested whether this stuff works in a 32-bit environment. CUDA, meanwhile, indicates a version mismatch, which was one of the problems I encountered early on as well. If you run "conda list" it should show all the installed packages for your current environment (i.e., after running "conda activate [whatever you called the environment]"). Here's what I show on a working install, using the Reddit instructions (which specify CUDA 11.3):
(textgen) C:\Users\jwalt>conda list
# packages in environment at C:\Users\jwalt\miniconda3\envs\textgen:
# Name Version Build Channel
accelerate 0.17.1 pypi_0 pypi
aiofiles 23.1.0 pypi_0 pypi
aiohttp 3.8.4 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
altair 4.2.2 pypi_0 pypi
anyio 3.6.2 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 22.2.0 pypi_0 pypi
bitsandbytes 0.37.1 pypi_0 pypi
bzip2 1.0.8 he774522_0
ca-certificates 2023.01.10 haa95532_0
certifi 2022.12.7 py310haa95532_0
charset-normalizer 3.1.0 pypi_0 pypi
click 8.1.3 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
contourpy 1.0.7 pypi_0 pypi
cuda 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cuda-command-line-tools 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cuda-compiler 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cuda-cudart 11.3.58 h24ea3a4_0 nvidia/label/cuda-11.3.0
cuda-cuobjdump 11.3.58 h9c7f84a_0 nvidia/label/cuda-11.3.0
cuda-cupti 11.3.58 h0481b1b_0 nvidia/label/cuda-11.3.0
cuda-cuxxfilt 11.3.58 hb382750_0 nvidia/label/cuda-11.3.0
cuda-libraries 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cuda-libraries-dev 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cuda-memcheck 11.3.58 h0838ec0_0 nvidia/label/cuda-11.3.0
cuda-nvcc 11.3.58 hb8d16a4_0 nvidia/label/cuda-11.3.0
cuda-nvdisasm 11.3.58 h028471b_0 nvidia/label/cuda-11.3.0
cuda-nvml-dev 11.3.58 hbc9c638_0 nvidia/label/cuda-11.3.1
cuda-nvprof 11.3.58 h45e7c35_0 nvidia/label/cuda-11.3.0
cuda-nvprune 11.3.58 h42e8f5f_0 nvidia/label/cuda-11.3.0
cuda-nvrtc 11.3.58 h5d15f37_0 nvidia/label/cuda-11.3.0
cuda-nvtx 11.3.58 h607cf41_0 nvidia/label/cuda-11.3.0
cuda-runtime 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cuda-samples 11.3.58 h9a5194a_0 nvidia/label/cuda-11.3.1
cuda-sanitizer-api 11.3.58 h5192ad9_0 nvidia/label/cuda-11.3.0
cuda-thrust 11.3.58 hc445dc0_0 nvidia/label/cuda-11.3.0
cuda-toolkit 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cuda-tools 11.3.0 hd997d6f_0 nvidia/label/cuda-11.3.0
cycler 0.11.0 pypi_0 pypi
entrypoints 0.4 pypi_0 pypi
fastapi 0.94.1 pypi_0 pypi
ffmpy 0.3.0 pypi_0 pypi
filelock 3.10.0 pypi_0 pypi
flexgen 0.1.7 pypi_0 pypi
fonttools 4.39.2 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.3.0 pypi_0 pypi
git 2.34.1 haa95532_0
gradio 3.18.0 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
httpcore 0.16.3 pypi_0 pypi
httpx 0.23.3 pypi_0 pypi
huggingface-hub 0.13.2 pypi_0 pypi
idna 3.4 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
jsonschema 4.17.3 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
libcublas 11.4.2.10064 hdce621a_0 nvidia/label/cuda-11.3.0
libcufft 10.4.2.58 ha8d0324_0 nvidia/label/cuda-11.3.0
libcurand 10.2.4.58 h205e5ba_0 nvidia/label/cuda-11.3.0
libcusolver 11.1.1.58 h8fce944_0 nvidia/label/cuda-11.3.0
libcusparse 11.5.0.58 h26ccba6_0 nvidia/label/cuda-11.3.0
libffi 3.4.2 hd77b12b_6
libnpp 11.3.3.44 h8a18219_0 nvidia/label/cuda-11.3.0
libnvjpeg 11.4.1.58 h1234a80_0 nvidia/label/cuda-11.3.0
linkify-it-py 2.0.0 pypi_0 pypi
markdown 3.4.1 pypi_0 pypi
markdown-it-py 2.2.0 pypi_0 pypi
markupsafe 2.1.2 pypi_0 pypi
matplotlib 3.7.1 pypi_0 pypi
mdit-py-plugins 0.3.5 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
networkx 3.0 pypi_0 pypi
ninja 1.11.1 pypi_0 pypi
numpy 1.24.2 pypi_0 pypi
openssl 1.1.1t h2bbff1b_0
orjson 3.8.7 pypi_0 pypi
packaging 23.0 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
peft 0.2.0 pypi_0 pypi
pillow 9.4.0 pypi_0 pypi
pip 23.0.1 py310haa95532_0
psutil 5.9.4 pypi_0 pypi
pulp 2.7.0 pypi_0 pypi
pycryptodome 3.17 pypi_0 pypi
pydantic 1.10.6 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.19.3 pypi_0 pypi
python 3.10.9 h966fe2a_2
python-dateutil 2.8.2 pypi_0 pypi
python-multipart 0.0.6 pypi_0 pypi
pytz 2022.7.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
quant-cuda 0.0.0 pypi_0 pypi
regex 2022.10.31 pypi_0 pypi
requests 2.28.2 pypi_0 pypi
rfc3986 1.5.0 pypi_0 pypi
safetensors 0.3.0 pypi_0 pypi
sentencepiece 0.1.97 pypi_0 pypi
setuptools 65.6.3 py310haa95532_0
six 1.16.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
sqlite 3.41.1 h2bbff1b_0
starlette 0.26.1 pypi_0 pypi
sympy 1.11.1 pypi_0 pypi
tk 8.6.12 h2bbff1b_0
tokenizers 0.13.2 pypi_0 pypi
toolz 0.12.0 pypi_0 pypi
torch 1.12.0+cu113 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
transformers 4.28.0.dev0 pypi_0 pypi
typing-extensions 4.5.0 pypi_0 pypi
tzdata 2022g h04d1e81_0
uc-micro-py 1.0.1 pypi_0 pypi
urllib3 1.26.15 pypi_0 pypi
uvicorn 0.21.1 pypi_0 pypi
vc 14.2 h21ff451_1
vs2015_runtime 14.27.29016 h5e58377_2
websockets 10.4 pypi_0 pypi
wheel 0.38.4 py310haa95532_0
wincertstore 0.2 py310haa95532_2
xz 5.2.10 h8cc25b3_1
yarl 1.8.2 pypi_0 pypi
zlib 1.2.13 h8cc25b3_0
Here's the same list, only for an environment using my version of the instructions:
(llama4bit) C:\Users\jwalt>conda list
# packages in environment at C:\Users\jwalt\miniconda3\envs\llama4bit:
# Name Version Build Channel
accelerate 0.17.1 pypi_0 pypi
aiofiles 23.1.0 pypi_0 pypi
aiohttp 3.8.4 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
altair 4.2.2 pypi_0 pypi
anyio 3.6.2 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 22.2.0 pypi_0 pypi
bitsandbytes 0.37.1 pypi_0 pypi
blas 1.0 mkl
brotlipy 0.7.0 py310h2bbff1b_1002
bzip2 1.0.8 he774522_0
ca-certificates 2023.01.10 haa95532_0
cchardet 2.1.7 pypi_0 pypi
certifi 2022.12.7 py310haa95532_0
cffi 1.15.1 py310h2bbff1b_3
chardet 4.0.0 py310haa95532_1003
charset-normalizer 3.1.0 pypi_0 pypi
click 8.1.3 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
contourpy 1.0.7 pypi_0 pypi
cryptography 39.0.1 py310h21b164f_0
cuda 11.7.0 0 nvidia/label/cuda-11.7.0
cuda-cccl 12.1.55 0 nvidia
cuda-command-line-tools 11.7.0 0 nvidia/label/cuda-11.7.0
cuda-compiler 11.7.0 0 nvidia/label/cuda-11.7.0
cuda-cudart 11.7.99 0 nvidia
cuda-cudart-dev 11.7.99 0 nvidia
cuda-cuobjdump 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-cupti 11.7.101 0 nvidia
cuda-cuxxfilt 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-demo-suite 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-documentation 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-libraries 11.7.1 0 nvidia
cuda-libraries-dev 11.7.1 0 nvidia
cuda-memcheck 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-nsight-compute 11.7.0 0 nvidia/label/cuda-11.7.0
cuda-nvcc 11.7.64 0 nvidia/label/cuda-11.7.0
cuda-nvdisasm 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-nvml-dev 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-nvprof 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-nvprune 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-nvrtc 11.7.99 0 nvidia
cuda-nvrtc-dev 11.7.99 0 nvidia
cuda-nvtx 11.7.91 0 nvidia
cuda-nvvp 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-runtime 11.7.1 0 nvidia
cuda-sanitizer-api 11.7.50 0 nvidia/label/cuda-11.7.0
cuda-toolkit 11.7.0 0 nvidia/label/cuda-11.7.0
cuda-tools 11.7.0 0 nvidia/label/cuda-11.7.0
cuda-visual-tools 11.7.0 0 nvidia/label/cuda-11.7.0
cycler 0.11.0 pypi_0 pypi
entrypoints 0.4 pypi_0 pypi
fastapi 0.94.1 pypi_0 pypi
ffmpy 0.3.0 pypi_0 pypi
filelock 3.9.1 pypi_0 pypi
flexgen 0.1.7 pypi_0 pypi
flit-core 3.6.0 pyhd3eb1b0_0
fonttools 4.39.0 pypi_0 pypi
freetype 2.12.1 ha860e81_0
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.3.0 pypi_0 pypi
giflib 5.2.1 h8cc25b3_3
git 2.34.1 haa95532_0
gradio 3.18.0 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
httpcore 0.16.3 pypi_0 pypi
httpx 0.23.3 pypi_0 pypi
huggingface-hub 0.13.2 pypi_0 pypi
idna 3.4 py310haa95532_0
intel-openmp 2021.4.0 haa95532_3556
jinja2 3.1.2 py310haa95532_0
jpeg 9e h2bbff1b_1
jsonschema 4.17.3 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
lerc 3.0 hd77b12b_0
libcublas 11.10.3.66 0 nvidia
libcublas-dev 11.10.3.66 0 nvidia
libcufft 10.7.2.124 0 nvidia
libcufft-dev 10.7.2.124 0 nvidia
libcurand 10.3.2.56 0 nvidia
libcurand-dev 10.3.2.56 0 nvidia
libcusolver 11.4.0.1 0 nvidia
libcusolver-dev 11.4.0.1 0 nvidia
libcusparse 11.7.4.91 0 nvidia
libcusparse-dev 11.7.4.91 0 nvidia
libdeflate 1.17 h2bbff1b_0
libffi 3.4.2 hd77b12b_6
libnpp 11.7.4.75 0 nvidia
libnpp-dev 11.7.4.75 0 nvidia
libnvjpeg 11.8.0.2 0 nvidia
libnvjpeg-dev 11.8.0.2 0 nvidia
libpng 1.6.39 h8cc25b3_0
libtiff 4.5.0 h6c2663c_2
libuv 1.44.2 h2bbff1b_0
libwebp 1.2.4 hbc33d0d_1
libwebp-base 1.2.4 h2bbff1b_1
linkify-it-py 2.0.0 pypi_0 pypi
lz4-c 1.9.4 h2bbff1b_0
markdown 3.4.1 pypi_0 pypi
markdown-it-py 2.2.0 pypi_0 pypi
markupsafe 2.1.2 pypi_0 pypi
matplotlib 3.7.1 pypi_0 pypi
mdit-py-plugins 0.3.5 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mkl 2021.4.0 haa95532_640
mkl-service 2.4.0 py310h2bbff1b_0
mkl_fft 1.3.1 py310ha0764ea_0
mkl_random 1.2.2 py310h4ed8f06_0
mpmath 1.2.1 py310haa95532_0
multidict 6.0.4 pypi_0 pypi
networkx 2.8.4 py310haa95532_0
ninja 1.10.2 haa95532_5
ninja-base 1.10.2 h6d14046_5
nsight-compute 2022.2.0.13 0 nvidia/label/cuda-11.7.0
numpy 1.24.2 pypi_0 pypi
numpy-base 1.23.5 py310h04254f7_0
openssl 1.1.1t h2bbff1b_0
orjson 3.8.7 pypi_0 pypi
packaging 23.0 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
pillow 9.4.0 pypi_0 pypi
pip 23.0.1 py310haa95532_0
psutil 5.9.4 pypi_0 pypi
pulp 2.7.0 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
pycryptodome 3.17 pypi_0 pypi
pydantic 1.10.6 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pyopenssl 23.0.0 py310haa95532_0
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.19.3 pypi_0 pypi
pysocks 1.7.1 py310haa95532_0
python 3.10.9 h966fe2a_2
python-dateutil 2.8.2 pypi_0 pypi
python-multipart 0.0.6 pypi_0 pypi
pytorch 2.0.0 py3.10_cuda11.7_cudnn8_0 pytorch
pytorch-cuda 11.7 h16d0643_3 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2022.7.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
quant-cuda 0.0.0 pypi_0 pypi
regex 2022.10.31 pypi_0 pypi
requests 2.28.2 pypi_0 pypi
rfc3986 1.5.0 pypi_0 pypi
rwkv 0.4.2 pypi_0 pypi
safetensors 0.3.0 pypi_0 pypi
sentencepiece 0.1.97 pypi_0 pypi
setuptools 65.6.3 py310haa95532_0
six 1.16.0 pyhd3eb1b0_1
sniffio 1.3.0 pypi_0 pypi
sqlite 3.41.1 h2bbff1b_0
starlette 0.26.1 pypi_0 pypi
sympy 1.11.1 py310haa95532_0
tk 8.6.12 h2bbff1b_0
tokenizers 0.13.2 pypi_0 pypi
toolz 0.12.0 pypi_0 pypi
torch 1.13.1 pypi_0 pypi
torchaudio 2.0.0 pypi_0 pypi
torchvision 0.15.0 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
transformers 4.27.0.dev0 pypi_0 pypi
typing-extensions 4.5.0 pypi_0 pypi
typing_extensions 4.4.0 py310haa95532_0
tzdata 2022g h04d1e81_0
uc-micro-py 1.0.1 pypi_0 pypi
urllib3 1.26.15 pypi_0 pypi
uvicorn 0.21.0 pypi_0 pypi
vc 14.2 h21ff451_1
vs2015_runtime 14.27.29016 h5e58377_2
websockets 10.4 pypi_0 pypi
wheel 0.38.4 py310haa95532_0
win_inet_pton 1.1.0 py310haa95532_0
wincertstore 0.2 py310haa95532_2
xz 5.2.10 h8cc25b3_1
yarl 1.8.2 pypi_0 pypi
zlib 1.2.13 h8cc25b3_0
zstd 1.5.2 h19a0ad4_0
If you have the wrong CUDA versions, you could try "conda remove [library]" and then go back to the "conda install cuda [etc.]" step and see if that helps.
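To spot a mismatch without eyeballing the whole list, here's a small, hypothetical filter for `conda list` output that pulls out just the CUDA- and torch-related rows:

```python
def cuda_related(conda_list_text: str) -> dict[str, str]:
    """Map CUDA/torch-related package names from `conda list` output to their versions."""
    versions = {}
    for line in conda_list_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header comments and blank lines
        parts = line.split()
        if len(parts) < 2:
            continue  # malformed row; conda list is normally "name version build channel"
        name = parts[0]
        if "cuda" in name or name in ("torch", "pytorch", "torchvision", "torchaudio"):
            versions[name] = parts[1]
    return versions
```

Feed it the captured output of "conda list" and compare the cuda-toolkit row against the torch/pytorch build (e.g. 1.12.0+cu113 wants the 11.3 toolkit).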
Hello,
Does anyone have an idea what caused this error? I tried reinstalling everything, but I always get to the end and hit this error.
(llama4bit) PS C:\AIStuff\text-generation-webui> python server.py --gptq-bits 4 --model llama-7b
Loading llama-7b...
Traceback (most recent call last):
File "C:\AIStuff\text-generation-webui\server.py", line 241, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "C:\AIStuff\text-generation-webui\modules\models.py", line 101, in load_model
model = load_quantized(model_name)
File "C:\AIStuff\text-generation-webui\modules\GPTQ_loader.py", line 56, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
TypeError: load_quant() missing 1 required positional argument: 'groupsize'
(llama4bit) PS C:\AIStuff\text-generation-webui>
If you have an idea how I can fix this, please let me know.
Looks like potentially either the download was corrupted, or maybe the configuration file has an error. Have you tried the Reddit steps? If not, create a new Conda environment and see if those work. (You can copy over the llama-7b files, or if you want just redownload to see if the error persists.)
Hello,
Today I downloaded it at least five times, so it's probably not download corruption. I tried different folders, reinstalled conda, and tried both the 7b and 13b files, but every time I try to run the 4-bit version it gives me this error. In another location on my disk I have the non-4-bit version, and it works. I can't find anything on the internet, so I'm writing here: following that tutorial, everything worked with zero errors except actually starting it.
No, it's more like a version mismatch. GPTQ_loader.py clearly added a groupsize parameter at some point, and models.py simply isn't setting it.
You might do better by going back a couple of revisions. Unfortunately, I don't know enough about conda to provide any further guidance.
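One way to see (and, in a pinch, paper over) that kind of signature drift is to inspect the callable before invoking it. This is a hypothetical shim, not part of the webui, assuming load_quant either does or doesn't take a groupsize argument:

```python
import inspect


def call_load_quant(load_quant, model_path, pt_path, wbits, groupsize=-1):
    """Call load_quant with or without 'groupsize', matching whichever signature it has."""
    params = inspect.signature(load_quant).parameters
    if "groupsize" in params:
        # Newer GPTQ-for-LLaMa revisions require groupsize (-1 means "no grouping")
        return load_quant(model_path, pt_path, wbits, groupsize)
    # Older revisions only take (model, checkpoint, wbits)
    return load_quant(model_path, pt_path, wbits)
```

The cleaner fix is still to keep text-generation-webui and GPTQ-for-LLaMa at revisions from the same date, but this illustrates why the TypeError appears.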
Hello sir,
So basically you're suggesting getting an older GPTQ version? I'll try that and update my post.
That could work. What GPU are you trying to run this on? Anyway, you might try this command (from the folder where you cloned https://github.com/oobabooga/text-generation-webui.git):
git checkout 'master@{2023-03-18 18:30:00}'
In theory, that will get you the versions of the files from last Saturday.
I tried it several times, with the exact instructions, and I always get:
"running build_ext
error: [WinError 2]..."
when I run 'python setup_cuda.py install'. A shame, as I was hoping to finally get 4-bit running.
I have the 2019 Build Tools version installed, and miniconda, etc. CUDA says True when queried from Python, so everything seems OK, but I still get the error. T_T
Hello, thank you for your reply.
I tried your command from all the main folders, but it says "fatal: not a git repository (or any of the parent directories): .git". Could you please provide more information?
By the way, I'm trying to run it on an RTX 2070.
Tom's Hardware is part of Future plc, an international media group and leading digital publisher.
Visit our corporate site.
© Future Publishing Limited Quay House, The Ambury, Bath BA1 1UA.
All rights reserved. England and Wales company registration number 2008885.