Hindsight model errors #44

Open

opened 2026-06-03 15:59:08 +01:00 by apb · 0 comments

apb commented

2026-06-03 15:59:08 +01:00

Owner

get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
0.00.748.647 I device_info:
0.00.748.827 I   - SYCL0   : Intel(R) UHD Graphics 750 (28851 MiB, 28851 MiB free)
0.00.748.846 I   - CPU     : 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz (31171 MiB, 31171 MiB free)
0.00.748.959 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.748.971 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.749.152 I srv          init: running without SSL
0.00.749.219 I srv          init: using 15 threads for HTTP server
0.00.749.534 I srv         start: binding port with default address family
0.00.750.756 I srv  llama_server: loading model
0.00.750.767 I srv    load_model: loading model '/root/.cache/huggingface/hub/models--bartowski--google_gemma-4-E2B-it-GGUF/snapshots/b5e99bd964eaacc27ba484bb2eb3e9f6160b9143/google_gemma-4-E2B-it-Q4_K_M.gguf'
0.01.200.774 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 1200.06 MiB
0.01.200.793 I common_init_result: fitting params to device memory ...
0.01.200.793 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
0.01.979.826 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.980.182 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.998.024 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
0.02.589.936 W llama_context: n_ctx_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.02.610.187 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.03.861.327 W init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
0.03.861.382 I srv    load_model: loaded multimodal model, '/root/.cache/huggingface/hub/models--bartowski--google_gemma-4-E2B-it-GGUF/snapshots/b5e99bd964eaacc27ba484bb2eb3e9f6160b9143/mmproj-google_gemma-4-E2B-it-bf16.gguf'
0.03.861.392 I srv    load_model: initializing slots, n_slots = 4
0.04.216.377 W common_speculative_init: no implementations specified for speculative decoding
0.04.216.387 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 16384
0.04.216.392 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 16384
0.04.216.393 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 16384
0.04.216.393 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 16384
0.04.216.463 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.04.216.464 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.04.216.464 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.04.216.465 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.04.216.481 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.04.222.252 I init: chat template, example_format: '<|turn>system
      
>
You are a helpful assistant<turn|>
<|turn>user
      
Hello<turn|>
      
<|turn>model
Hi there<turn|>
<|turn>user
How are you?<turn|>
      
<|turn>model
'
0.04.222.833 I srv          init: init: chat template, thinking = 1
0.04.222.852 I srv  llama_server: model loaded
0.04.222.855 I srv  llama_server: server is listening on http://0.0.0.0:8080
0.04.222.858 I srv  update_slots: all slots are idle
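
The `get_memory_info` warning that recurs through the log names its own remedy: setting `ZES_ENABLE_SYSMAN=1` in the environment of the server process. Below is a minimal sketch, assuming the server is started from a POSIX shell. The launch command is not shown in this issue, so it is left as a placeholder comment. Going by the warning text, this only allows free memory to be reported instead of total memory being used as free memory; whether it has any bearing on the model errors in the title is untested.

```shell
# Variable name taken from the warning text in the log above.
export ZES_ENABLE_SYSMAN=1

# Confirm the variable is set in the current shell before launching.
echo "ZES_ENABLE_SYSMAN=${ZES_ENABLE_SYSMAN}"

# Assumption: start the server from this same shell, with the same
# arguments as before (the actual command is not shown in this issue).
```

If the server runs inside a container or under a service manager, the variable would need to be set in that environment instead of an interactive shell.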
