Meta released Spirit LM, a multimodal language model that freely mixes text and speech. NVIDIA announced a new reward model for aligning LLMs more closely with human preferences, and IBM launched its new Granite 3.0 generative AI models, which combine high accuracy with efficiency. Mistral AI introduced Ministral 3B and 8B, high-performance models for edge computing, and Microsoft released bitnet.cpp, the official inference framework for 1-bit LLMs. Neural Magic ran an extensive evaluation of quantized LLMs and found that the resulting accuracy loss is negligible. NVIDIA also published research on upcycling existing large language models into Mixture of Experts (MoE) models, while Google DeepMind proposed a new process reward model for improving LLM reasoning. In addition, simular-ai introduced Agent S, an open agentic framework that uses computers like a human, and Apple released GSM-Symbolic, a benchmark for understanding the limitations of mathematical reasoning in LLMs.

Meta, SPIRIT-LM: Interleaved Spoken and Written Language Model

Link, October 22, 2024

  • Introduces Spirit LM, a multimodal language model that freely mixes text and speech
  • Extends a 7B pretrained text language model to the speech modality
  • Trained by concatenating text and speech sequences into a single stream of tokens
  • Uses a word-level interleaving method based on a speech-text parallel corpus
  • Provided in two versions: Base and Expressive
    • The Base version uses speech phonetic units (HuBERT)
    • The Expressive version models expressivity by adding pitch and style units on top of the phonetic units
  • Can learn a variety of tasks few-shot (ASR, TTS, speech classification, etc.)
  • Meta hopes the release will help the research community advance new approaches to text and speech integration

NVIDIA, New Reward Model Helps Improve LLM Alignment with Human Preferences

Link, October 3, 2024

  • Emphasizes the importance of reinforcement learning from human feedback (RLHF) for aligning AI systems with human preferences
  • NVIDIA released Llama 3.1-Nemotron-70B-Reward, a state-of-the-art reward model
    • Ranked #1 overall on the Hugging Face RewardBench leaderboard (94.1% accuracy)
    • Strong performance across categories (Safety 95.1%, Reasoning 98.1%)
  • The reward model was used to train the Llama 3.1-Nemotron-70B-Instruct model
    • Among the top models on the Arena Hard leaderboard
  • The reward model was trained only on CC-BY-4.0-licensed HelpSteer2 data
  • Easily deployable as an NVIDIA NIM inference microservice
  • Model and data released so they can be applied to enterprise use cases

Mistral AI, Un Ministral, des Ministraux

Link, October 16, 2024

  • Released Ministral 3B and Ministral 8B, high-performance models for edge computing
  • Deliver strong knowledge, commonsense, and reasoning despite their small size
  • Support a 128k context length (currently 32k on vLLM)
  • Ministral 8B uses an interleaved sliding-window attention pattern for fast, memory-efficient inference
  • Applicable to a wide range of use cases (agentic workflows, specialist task workers, etc.)
  • Improved performance and efficiency over Mistral 7B
  • API and license details provided; the Ministral 8B Instruct model weights are available for research use

IBM, IBM’s New Granite 3.0 Generative AI Models Are Small, Yet Highly Accurate and Efficient

Link, October 21, 2024

  • IBM released the Granite 3.0 series of models
    • Dense, text-only LLMs: Granite 3.0 8B, Granite 3.0 2B
    • Mixture of Experts (MoE) LLMs: Granite 3.0 3B-A800M, Granite 3.0 1B-A400M
  • Designed as accurate, efficient workhorse models that serve as the primary building block of enterprise workflows
  • Support function calling, enabling a variety of tool-based use cases
  • Use an optimized architecture with group-query attention (GQA), RoPE, and more
  • Apply speculative decoding to accelerate inference
    • Speeds up inference while optimizing resource usage
  • Packaged as an NVIDIA NIM inference microservice for easy deployment

Microsoft, bitnet.cpp

Link, October 18, 2024

  • Released bitnet.cpp, the official inference framework for 1-bit LLMs (e.g., BitNet b1.58)
  • Provides a suite of optimized kernels supporting fast and lossless inference on CPU
  • NPU and GPU support planned
  • Achieves speedups of up to 5.07x on ARM CPUs and up to 6.17x on x86 CPUs, with 55-82% lower energy consumption
  • Can run a 100B BitNet b1.58 model on a single CPU at human reading speed (5-7 tokens per second)
  • Released under the MIT license

Neural Magic, We Ran Over Half a Million Evaluations on Quantized LLMs: Here’s What We Found

Link, October 17, 2024

  • Investigated whether quantized LLMs can improve efficiency without degrading quality
  • Evaluated the Llama 3.1 models under a variety of quantization schemes
  • Found that accuracy loss from quantization is negligible, with up to 2.4x faster inference and 3.5x smaller models
  • Maintained over 99% accuracy recovery on average across academic and real-world benchmarks
  • W8A8-FP8 dynamic quantization yielded the best results
  • Models are available on Hugging Face

NVIDIA, Upcycling Large Language Models into Mixture of Experts

Link, October 11, 2024

  • Published research on upcycling existing large language models into Mixture of Experts (MoE) models
  • Proposes a novel "virtual group" initialization scheme and a weight scaling method
  • Upcycling outperforms continued dense training on the same data
  • Upcycling Nemotron-4 15B on 1T tokens improved MMLU performance (65.3% → 67.6%)
  • Offers best practices and insights for building MoE language models

Google DeepMind, Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Link, October 11, 2024

  • Proposes new process reward models (PRMs) to improve LLM reasoning
  • Presents a method for training PRMs on automatically generated data, without human labels
  • Rewards measured progress at each step, improving the efficiency of reinforcement learning
  • Uses a "Prover" LLM distinct from the base model to boost performance
  • Over 8% higher accuracy than outcome-based reward models, with up to 6x better sample efficiency

simular-ai, Agent S: An Open Agentic Framework that Uses Computers Like a Human

Link, October 10, 2024

  • Released Agent S, an open agentic framework that interacts with computers like a human
  • Automates complex, multi-step tasks through a graphical user interface (GUI)
  • Tackles domain-specific knowledge acquisition, long-horizon task planning, and dynamic interface handling
  • Introduces experience-augmented hierarchical planning for efficient task planning and subtask execution
  • Uses an Agent-Computer Interface (ACI) built on multimodal large language models (MLLMs)
  • Achieves an 83.6% relative improvement over the baseline on the OSWorld benchmark
  • Demonstrates generalization on the WindowsAgentArena benchmark

Apple, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Link, October 2024

  • Published research on understanding the limits of LLMs' mathematical reasoning
  • Introduces the GSM-Symbolic benchmark to overcome limitations of the GSM8K benchmark
  • Uses symbolic templates to generate and evaluate a diverse set of math questions
  • Finds that performance degrades when only the numerical values of a question are changed
  • Analyzes the tendency to replicate reasoning steps from training data rather than perform genuine logical reasoning
  • Confirms the fragility of reasoning: adding an extra clause causes large performance drops
  • Provides a more nuanced understanding of LLMs' mathematical reasoning capabilities
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://github.com/facebookresearch/spiritlm
META
SPIRIT-LM: Interleaved Spoken and Written Language Model

We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. Spirit LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that Spirit LM can learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).

Meta Spirit LM is our first open source multimodal language model that freely mixes text and speech.
Details, code and model weights ➡️
https://go.fb.me/2jooyy

Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text — but these approaches compromise the expressive aspects of speech in inputs and outputs. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations to better understand and generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification.
We hope that sharing this work will enable the research community to further new approaches for text and speech integration.
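To make the interleaving idea concrete, here is a toy sketch (not Meta's code) of how word-aligned text BPE tokens and HuBERT speech units might be merged into a single token stream. The special tokens ([TEXT], [SPEECH]), the alignment format, and the switching rule are assumptions for illustration only.

```python
import random

def interleave(words, word_bpe_tokens, word_speech_units, p_switch=0.3):
    """Build one token stream, switching modality only at word boundaries."""
    stream, modality = ["[TEXT]"], "text"
    for i, _ in enumerate(words):
        if random.random() < p_switch:  # switch modality at a word boundary
            modality = "speech" if modality == "text" else "text"
            stream.append("[SPEECH]" if modality == "speech" else "[TEXT]")
        stream.extend(word_speech_units[i] if modality == "speech" else word_bpe_tokens[i])
    return stream

# Toy aligned corpus: three words, each with BPE tokens and HuBERT unit ids.
words = ["the", "cat", "sat"]
bpe = [["_the"], ["_c", "at"], ["_s", "at"]]
units = [["hu_12", "hu_7"], ["hu_44"], ["hu_3", "hu_3", "hu_9"]]
print(interleave(words, bpe, units))
```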

###
https://developer.nvidia.com/blog/new-reward-model-helps-improve-llm-alignment-with-human-preferences/
NVIDIA
New Reward Model Helps Improve LLM Alignment with Human Preferences
Oct 03, 2024
By Zhilin Wang and Chintan Patel

Reinforcement learning from human feedback (RLHF) is essential for developing AI systems that are aligned with human values and preferences. RLHF enables the most capable LLMs, including ChatGPT, Claude, and Nemotron families, to generate exceptional responses.

By integrating human feedback into the training process, RLHF enables models to learn more nuanced behaviors and make decisions that better reflect user expectations. This approach enhances the quality of AI-generated responses and fosters trust and reliability in AI applications.

To help the AI community easily adopt RLHF to build and customize models, NVIDIA has released Llama 3.1-Nemotron-70B-Reward, a state-of-the-art reward model that scores the responses generated by LLMs. Such scores can be used to improve LLM response quality, making a more positive and impactful interaction between humans and AI.

NVIDIA researchers leveraged the reward model to train Llama 3.1-Nemotron-70B-Instruct model, which is among the top models on Arena Hard leaderboard.

#1 reward model
The Llama 3.1-Nemotron-70B-Reward model is currently in first place on the Hugging Face RewardBench leaderboard for evaluating the capabilities, safety, and pitfalls of reward models.

The model scored 94.1% on Overall RewardBench, meaning that it can identify responses that align with human preferences 94% of the time.

Figure 1: Llama-3.1-Nemotron-70B-Reward tops the RewardBench leaderboard across various categories
The model scores well across all four categories: Chat, Chat-Hard, Safety, and Reasoning. It has an impressive performance for Safety and Reasoning, achieving 95.1% and 98.1% accuracy, respectively. This means that the model can safely reject potential unsafe responses and support RLHF in domains like math and code.

With just a fifth the size of Nemotron-4 340B Reward, this model delivers high compute efficiency coupled with superior accuracy. This model is also trained only on CC-BY-4.0-licensed HelpSteer2 data, which makes it feasible for enterprise use cases.

Implementation
To train this model, we combined two popular approaches to make the best of both worlds:

Regression-style reward models
Bradley-Terry reward model
We trained with both approaches using data that we released in HelpSteer2. An important contributor to the model performance is high data quality, which we meticulously curated and then released to advance AI for all.
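As a rough illustration of these two objectives, the PyTorch sketch below (not NVIDIA's implementation) shows a regression-style loss on scalar quality ratings, as in HelpSteer2, and a Bradley-Terry loss on chosen/rejected preference pairs. The mixing weight alpha and the toy batch are assumptions.

```python
import torch
import torch.nn.functional as F

def regression_loss(scores, target_ratings):
    # Regression-style: fit the scalar reward to human ratings (e.g., helpfulness 0-4).
    return F.mse_loss(scores, target_ratings)

def bradley_terry_loss(r_chosen, r_rejected):
    # Bradley-Terry: maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of reward-head outputs.
r = torch.randn(4, requires_grad=True)        # scores for rated responses
ratings = torch.tensor([3.0, 1.0, 4.0, 2.0])  # human quality ratings
r_c = torch.randn(4, requires_grad=True)      # scores for chosen responses
r_r = torch.randn(4, requires_grad=True)      # scores for rejected responses

alpha = 0.5  # assumed mixing weight between the two objectives
loss = alpha * regression_loss(r, ratings) + (1 - alpha) * bradley_terry_loss(r_c, r_r)
loss.backward()
print(loss.item())
```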

Leading large language model
Using the trained reward model and HelpSteer2-Preference prompts for RLHF training (specifically with the REINFORCE algorithm) produces a model that scores 85 on Arena Hard, a popular automatic evaluation tool for instruction-tuned LLMs. This makes it the leading model on the Arena Hard leaderboard among models that do not require additional test-time compute.

The Llama-3.1-Nemotron-70B-Instruct model comes with Llama-3.1 License, making it easy for research and enterprises to customize and integrate this model in their applications.

Easy deployment with NVIDIA NIM
The Nemotron Reward model is packaged as an NVIDIA NIM inference microservice to streamline and accelerate the deployment of generative AI models across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations.

NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand.

Getting started
Experience the Llama 3.1-Nemotron-70B-Reward model from a browser today or test it at scale and build a proof of concept (PoC) with the NVIDIA-hosted API endpoint running on a fully accelerated stack.

Get started at ai.nvidia.com with free NVIDIA cloud credits or download the model from Hugging Face.

For more information about how the model was trained and can be used for RLHF, see HelpSteer2-Preference: Complementing Ratings with Preferences.

We're so back!: NVIDIA Nemotron 70B beats Llama 3.1 405B, GPT4o & Claude 3.5 Sonnet! 🔥
Evals (Nemotron 70B vs Claude 3.5 vs GPT4o)
> Arena Hard - 85.0 vs 79.2 vs 79.3
> AlpacaEval 2 LC - 57.6 vs 52.4 vs 57.5
> MT Bench - 8.98 vs 8.81 vs 8.74
Secret Sauce?
RLHF (REINFORCE) with Llama-3.1-Nemotron-70B-Reward and HelpSteer2-Preference prompts
They release the Instruct model, reward model and the dataset all on Hugging Face! 🤗

###
https://mistral.ai/news/ministraux/
Mistral AI
Un Ministral, des Ministraux
Introducing the world’s best edge models.

October 16, 2024 Mistral AI team
On the first anniversary of the release of Mistral 7B, the model that revolutionized independent frontier AI innovation for millions, we are proud to introduce two new state-of-the-art models for on-device computing and at-the-edge use cases. We call them les Ministraux: Ministral 3B and Ministral 8B.

These models set a new frontier in knowledge, commonsense, reasoning, function-calling, and efficiency in the sub-10B category, and can be used or tuned to a variety of uses, from orchestrating agentic workflows to creating specialist task workers. Both models support up to 128k context length (currently 32k on vLLM) and Ministral 8B has a special interleaved sliding-window attention pattern for faster and memory-efficient inference.

Use cases
Our most innovative customers and partners have increasingly been asking for local, privacy-first inference for critical applications such as on-device translation, internet-less smart assistants, local analytics, and autonomous robotics. Les Ministraux were built to provide a compute-efficient and low-latency solution for these scenarios. From independent hobbyists to global manufacturing teams, les Ministraux deliver for a wide variety of use cases.

Used in conjunction with larger language models such as Mistral Large, les Ministraux are also efficient intermediaries for function-calling in multi-step agentic workflows. They can be tuned to handle input parsing, task routing, and calling APIs based on user intent across multiple contexts at extremely low latency and cost.
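For readers who want to try les Ministraux in such a routing role, here is a minimal sketch of a request to Mistral's chat completions endpoint using the ministral-8b-latest model name from the announcement. The endpoint shape follows Mistral's public API, but treat the exact request fields as assumptions and consult the official documentation before relying on them.

```python
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "ministral-8b-latest",
        "messages": [
            {"role": "system",
             "content": "Route the user request to one of: search, calendar, none."},
            {"role": "user",
             "content": "Book a meeting with Anna next Tuesday at 10am."},
        ],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```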

Benchmarks
We demonstrate the performance of les Ministraux across multiple tasks where they consistently outperform their peers. We re-evaluated all models with our internal framework for fair comparison.

Pretrained Models
Table 1: Ministral 3B and 8B models compared to Gemma 2 2B, Llama 3.2 3B, Llama 3.1 8B and Mistral 7B on multiple categories

Figure 1: Ministral 3B and 8B base models compared to Gemma 2 2B, Llama 3.2 3B, Llama 3.1 8B and Mistral 7B

Instruct Models
Table 2: Ministral 3B and 8B Instruct models compared to Gemma 2 2B, Llama 3.2 3B, Llama 3.1 8B, Gemma 2 9B and Mistral 7B on different evaluation categories.

Figure 2: A comparison of the 3B family of Instruct models - Gemma 2 2B, Llama 3.2 3B and Ministral 3B. The figure showcases the improvements of Ministral 3B over the much larger Mistral 7B.

Figure 3: A comparison of the 8B family of Instruct models - Gemma 2 9B, Llama 3.1 8B, Mistral 7B and Ministral 8B.

Availability and pricing
Both models are available starting today.

Ministral 8B (API: ministral-8b-latest): $0.1 / M tokens (input and output); Mistral Commercial License and Mistral Research License
Ministral 3B (API: ministral-3b-latest): $0.04 / M tokens (input and output); Mistral Commercial License
For self-deployed use, please reach out to us for commercial licenses. We will also assist you in lossless quantization of the models for your specific use-cases to derive maximum performance.

The model weights for Ministral 8B Instruct are available for research use. Both models will be available from our cloud partners shortly.

###
https://developer.nvidia.com/blog/ibms-new-granite-3-0-generative-ai-models-are-small-yet-highly-accurate-and-efficient/?ncid=so-link-654107
IBM
IBM’s New Granite 3.0 Generative AI Models Are Small, Yet Highly Accurate and Efficient
Oct 21, 2024
By Maryam Ashoori and Chintan Patel


Today, IBM released the third generation of IBM Granite, a collection of open language models and complementary tools. Prior generations of Granite focused on domain-specific use cases; the latest IBM Granite models meet or exceed the performance of leading similarly sized open models across both academic and enterprise benchmarks.

The developer-friendly Granite 3.0 generative AI models are designed for function calling, supporting tool-based use cases. They were developed as workhorse enterprise models capable of serving as the primary building block of sophisticated workflows across use cases including text generation, agentic AI, classification, tool calling, summarization, entity extraction, customer service chatbots, and more.

Introducing IBM’s Granite Generation 3 family

IBM developed the Granite series, available as an NVIDIA NIM microservice, for enterprise use, prioritizing industry-leading trust, safety and cost efficiency without compromising performance.

In its entirety, the Granite 3.0 release comprises:

Dense, text-only LLMs: Granite 3.0 8B, Granite 3.0 2B
Mixture of Experts (MoE) LLMs: Granite 3.0 3B-A800M, Granite 3.0 1B-A400M
LLM-based input-output guardrail models: Granite Guardian 8B, Granite Guardian 2B
Core components of Granite’s architecture are: Group-query attention (GQA) and Rotary Position Encodings (RoPE) for positional information, multilayer perceptron (MLP) with SwiGLU activation, RMSNorm, and shared input/output embeddings.

Optimized performance with speculative decoding

Trained on over 12 trillion tokens of carefully curated enterprise data, the new 8B and 2B models demonstrate significant improvements over their predecessors in both performance and speed.

Speculative decoding is an optimization technique for accelerating model inference speed, helping LLMs generate text faster while using the same (or less) compute resources, and allowing more users to utilize a model at the same time. For example, in a recent IBM Research breakthrough, speculative decoding was used to cut the latency of Granite Code 20B in half while quadrupling its throughput.

In standard inferencing, LLMs process each previous token they’ve generated thus far, then generate one token at a time. In speculative decoding, LLMs also evaluate several prospective tokens that might come after the token they’re about to generate—if these “speculated” tokens are verified as sufficiently accurate, one pass can produce two or more tokens for the computational “price” of one.
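The toy sketch below illustrates this draft-then-verify loop with a greedy acceptance rule; production systems use a rejection-sampling acceptance test and verify all drafted positions in one batched forward pass. ToyLM is a stand-in model, not IBM's implementation.

```python
class ToyLM:
    """Stand-in LM: the next token is a deterministic function of the last token."""
    def __init__(self, step):
        self.step = step
    def next_token(self, ctx):
        return (ctx[-1] + self.step) % 50

def speculative_decode(target, draft, prompt, k=4, max_tokens=16):
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1. The cheap draft model proposes k candidate tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft.next_token(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies the drafts position by position.
        ctx = list(tokens)
        for t in proposal:
            expected = target.next_token(ctx)
            if t != expected:
                tokens.append(expected)  # first mismatch: keep the target's token
                break
            tokens.append(t)             # match: drafted token accepted "for free"
            ctx.append(t)
    return tokens

# With an accurate draft model, most drafted tokens are accepted per target pass.
print(speculative_decode(ToyLM(3), ToyLM(3), prompt=[1]))
```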

###
https://github.com/microsoft/BitNet
Microsoft
10/18/2024

bitnet.cpp
License: MIT

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next).

The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. More details will be provided soon.
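As background on the representation such kernels operate on, here is a small NumPy sketch of the absmean ternary quantization described in the BitNet b1.58 paper: weights are scaled by their mean absolute value, then rounded and clipped to {-1, 0, +1}. This is an illustration, not bitnet.cpp code.

```python
import numpy as np

def absmean_ternarize(w, eps=1e-8):
    scale = np.abs(w).mean() + eps           # per-tensor scale (gamma)
    q = np.clip(np.round(w / scale), -1, 1)  # ternary weights in {-1, 0, +1}
    return q.astype(np.int8), scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = absmean_ternarize(w)
w_hat = q * scale                            # dequantized approximation
print(q)
print("mean abs error:", float(np.abs(w - w_hat).mean()))
```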

###
https://neuralmagic.com/blog/we-ran-over-half-a-million-evaluations-on-quantized-llms-heres-what-we-found/
Neural Magic
We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found

Oct 17, 2024

Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference. However, there has been a persistent question of whether these quantized models retain the same level of accuracy and quality. Recently, the machine learning (ML) community has raised significant concerns about whether quantized large language models (LLMs) can truly compete with their uncompressed counterparts in accuracy and the general quality of generated responses.


An example tweet from the community to highlight the concerns around quantized models.
In this blog, we address these concerns directly to answer a key question: How much accuracy do we sacrifice when quantizing LLMs? To find out, we conducted over half a million evaluations across various benchmarks, including academic datasets, real-world tasks, and manual inspections, rigorously testing our latest quantized models. Our results revealed several likely sources for the community’s concerns, such as overly sensitive evaluations, models susceptible to the formatting of the chat template, and insufficient hyperparameter tuning in widely used quantization algorithms. By addressing these issues, we have produced highly accurate quantized models that, on average, show no discernible differences from their full-precision counterparts.


Figure 1: Accuracy recovery of the Llama 3.1 models at FP8 precision compared to their full-precision baselines across various academic and real-world evaluation benchmarks.
As seen in Figure 1, we achieved full accuracy recovery across various academic and real-world tasks, including the ArenaHard benchmark the community found issues with. Let’s take a closer look!

Exploring Our Approach and Rationale
Our evaluation focused on extensive testing of the Llama 3.1 series of models, which have gained significant traction in research and deployment contexts. With its streamlined and efficient base architecture, Llama 3.1 is an ideal candidate for assessing various quantization schemes.

For each Llama 3.1 size (8B, 70B, and 405B), we tested three distinct quantization schemes alongside the baseline 16-bit model. These schemes were selected to accommodate different hardware and deployment requirements, and all performance claims were validated using vLLM (0.6.2):

W8A8-INT: This quantization scheme reduces weights and activations to 8-bit integer values, making it ideal for server or throughput-based deployments on Nvidia Ampere (A100 GPUs) and older hardware. It provides approximately 2x model size compression and delivers an average 1.8x performance speedup across various server (multi-request) scenarios.
W8A8-FP: This quantization scheme uses an 8-bit floating point format for weights and activations rather than integer values. This simplifies the compression process but is supported only on the latest Nvidia Hopper (H100 GPUs) and Ada Lovelace hardware. It provides approximately 2x model size compression and delivers an average 1.8x performance speedup across various server (multi-request) scenarios.
W4A16-INT: In this scheme, weights are quantized to 4-bit integers while the activations remain at 16-bit precision. This approach is optimal for latency-critical applications and edge use cases where model size and single request response time are key factors. This means that the model inference is dominated by memory access for loading the weights instead of compute-intensive operations. In this regime, W4A16 provides approximately 3.5x model size compression and delivers an average speedup of 2.4x for single-stream scenarios.
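To make the weight side of these schemes concrete, the NumPy sketch below shows plain symmetric, per-channel, round-to-nearest int8 quantization; as the next paragraph notes, production pipelines layer techniques such as SmoothQuant or GPTQ on top of this basic representation.

```python
import numpy as np

def quantize_int8_per_channel(w):
    # One scale per output channel (row), mapping max |w| to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(16, 64).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
w_hat = q.astype(np.float32) * scale         # dequantize for comparison
print("max abs error:", float(np.abs(w - w_hat).max()))
```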
Each quantized model was created by optimizing hyperparameter and algorithmic choices on the OpenLLM Leaderboard v1 benchmarks and then evaluated across many other benchmarks to ensure it generalizes across diverse scenarios. The best choices varied by model and scheme but comprised some combination of SmoothQuant, GPTQ, and/or standard round-to-nearest quantization algorithms. Detailed documentation for each model, including the specific approaches used, can be found in the model cards available in our HuggingFace collection.

We designed our evaluation suite to cover a broad spectrum of inference scenarios and use cases, providing a comprehensive analysis across multiple model sizes and quantization schemes:

Academic Benchmarks: These benchmarks, such as OpenLLM Leaderboard v1 and v2, are key for evaluating research developments and model improvements. They focus on structured tasks like question-answering and reasoning, providing consistent and easily validated accuracy scores. However, they often fail to reflect real-world scenarios where semantics, variability, and context are critical.
Real-World Benchmarks: Unlike academic benchmarks, real-world benchmarks test models in scenarios that mimic human usage, such as instruction following, chat, and code generation. These benchmarks include ArenaHard and HumanEval, which offer a broader range of tasks with higher variation but better reflect real-world model performance. These benchmarks provide a more comprehensive view of models' performance in live environments.
Text Similarity: Text similarity measures how closely quantized models’ outputs match their unquantized counterparts. Metrics such as ROUGE, BERTScore, and Semantic Textual Similarity (STS) evaluate the semantic and structural consistency, ensuring that the generated text's intended meaning and quality are preserved.
With this extensive evaluation framework, we ensured that deployment scenarios ranging from structured, research-driven tasks to open-ended, real-world applications were covered, providing a holistic view of the performance and capabilities of quantized LLMs.

Academic Benchmark Performance
Academic benchmarks are an excellent starting point for evaluating language models’ accuracy and reasoning capabilities. They provide structured tasks, making them essential for comparing models on well-defined criteria. Our evaluations focused on OpenLLM Leaderboard v1 and v2, ensuring consistent results across both older and newer, more challenging benchmarks. Additionally, testing on both allowed us to prevent overfitting to v1, where we optimized our quantization hyperparameters.

We evaluated OpenLLM Leaderboard v1 by utilizing Meta’s prompts for the Llama-3.1 models. We base our comparisons and recovery percentages on the average score and report a full per-task breakdown of results in our HuggingFace model collection. The Leaderboard v1 benchmark consists of a diverse range of topics, including:

Grade school math: GSM8k
World knowledge and reasoning: MMLU, ARC-Challenge
Language understanding: Winogrande, HellaSwag
Truthfulness: TruthfulQA.
As illustrated in Figure 2 (left) below, all quantization schemes—regardless of model size—recover over 99% of the average score achieved by the unquantized baseline.


Figure 2: OpenLLM Leaderboard v1 (left) and v2 (right) average scores for baseline (BF16) and various quantized versions of Llama 3.1 (405B, 70B, and 8B).
The community has evolved, and with it, so have the benchmarks. As scores began to plateau on v1, OpenLLM Leaderboard v2 was introduced to push models further, offering more challenging tasks that test deeper reasoning and knowledge capabilities. Like v1, we measured the recovery percentages based on the average scores across the v2 benchmarks (full results in our HuggingFace model collection). The benchmarks in v2 include more complex topics, such as:

Expert knowledge and reasoning: MMLU-Pro, GPQA, Big Bench Hard
Multistep reasoning: MuSR
Advanced math problems: MATH Level 5
Instruction following: IFEval.
As illustrated in Figure 2 (right) above, the quantized models recover close to 99% of the baseline’s average score on average, with all models maintaining at least 96% recovery. However, the increased difficulty of these tasks, especially for smaller models, resulted in higher variance for benchmarks like GPQA and MuSR, where scores approached the random guessing threshold even for the full-precision baseline. This led to more variability in the quantized versions' scores and a lack of a clear signal for accuracy recovery.

Real-World Benchmark Performance
While academic benchmarks provide structured evaluations, real-world open-ended benchmarks better represent how models perform in dynamic environments like human-chat interactions or coding tasks. These benchmarks test models on varied prompts with longer generations and multiple potential solutions, focusing on responses' correctness and semantic quality. Our evaluations targeted three key real-world benchmarks: Arena-Hard, HumanEval, and HumanEval+, which measure performance in chat, instruction-following, and code generation.

The LMSYS Chatbot Arena has established itself as a leading benchmark for LLMs, assessing how models align with human preferences. Arena-Hard Auto is an automated extension of this benchmark, where an LLM judges responses to 500 complex prompts on various topics. It has demonstrated a strong correlation with human evaluations, achieving a state-of-the-art 89% agreement with human preference rankings.


Figure 4: Arena-Hard-Auto average scores for baseline (BF16) and various quantized versions of Llama 3.1 (405B, 70B, and 8B).
Figure 4 shows how quantized models compare to their full-precision counterparts on the Arena-Hard-Auto benchmark, averaging results from two evaluation runs per model. The results illustrate that the response quality of quantized models remains highly competitive with their unquantized counterparts. As shown in the detailed results on our HuggingFace Hub, the 95% confidence intervals overlap for all model sizes and quantization schemes, highlighting the minimal impact on accuracy.

In addition to chat-based interactions, LLMs are widely deployed as coding assistants. To evaluate the performance of quantized models in code generation, we tested them on HumanEval and its more challenging variant, HumanEval+. These benchmarks measure a model’s ability to generate correct and functional code based on programming problems, with HumanEval+ introducing more complex, multi-step tasks requiring deeper reasoning and problem-solving. Figure 5 below presents the pass@1 scores obtained using the EvalPlus library.


Figure 5: HumanEval and HumanEval+ pass@1 score for baseline (BF16) and various quantized versions of Llama 3.1 (405B, 70B, and 8B).
As illustrated in Figure 5, quantized models demonstrate exceptional performance on both HumanEval and HumanEval+, with 8-bit models achieving 99.9% accuracy recovery and 4-bit models recovering 98.9%. These results highlight that quantized models not only maintain high performance in simpler coding tasks but also excel in more complex scenarios, proving their reliability for real-world coding applications with minimal loss in accuracy.

Text Similarity and Manual Inspection
After evaluating quantized models across various academic and real-world benchmarks, we put them through the final test: How similar is the text generated by quantized models compared to their unquantized counterparts?

We used four key metrics to answer this:

ROUGE-1 measures word-level overlap between outputs of quantized and unquantized models.
ROUGE-L captures structural similarity by focusing on the longest common subsequence.
BERTScore evaluates the contextual similarity at the token-level.
STS assesses overall semantic similarity at the sentence level.
These metrics were computed across responses generated from the ArenaHard prompts, allowing us to analyze how well quantized models preserve the meaning and structure of outputs compared to full-precision models. The results are summarized in Figure 6 below.


Figure 6: Various text similarity metrics comparing the outputs of quantized Llama 3.1 models (405B, 70B, and 8B) to their full-precision baselines.
The results show that larger quantized models (70B and 405B) maintain a high degree of text similarity to their full-precision counterparts, with ROUGE-1 and ROUGE-L scores indicating strong preservation of word choice and structure. BERTScore and STS further confirm that the overall meaning remains consistent, even with slight token variations introduced by quantization. While 8B models exhibit more variability in word selection, they still preserve the core semantic meaning as shown in the BERTScore and STS results. This demonstrates that quantized models maintain high-quality output across all model sizes and quantization schemes.
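As a concrete example of how such similarity scores can be produced, here is a small sketch using the rouge-score and sentence-transformers packages to compare a quantized model's answer against a full-precision answer. The embedding model chosen here is an assumption for illustration, not necessarily what Neural Magic used.

```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

ref = "Quantization reduces model size with little loss in accuracy."          # full-precision output
hyp = "Quantizing models shrinks them while accuracy stays nearly the same."   # quantized output

# ROUGE-1 / ROUGE-L: word-level overlap and longest-common-subsequence structure.
scores = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(ref, hyp)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})

# STS via sentence embeddings: semantic similarity at the sentence level.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([ref, hyp], convert_to_tensor=True)
print("cosine similarity:", float(util.cos_sim(emb[0], emb[1])))
```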

So far, we’ve evaluated the performance of quantized models using a variety of benchmarks and comparison metrics distilled into raw numbers. Now, it’s time to see the results for yourself. Our interactive demo app (built on top of the fantastic HuggingFace Spaces) lets you select different models and quantization schemes to compare generated responses side-by-side with their full-precision counterparts. This tool offers an intuitive way to visually assess how quantization affects model outputs and the quality of the generated text.


If the interactive demo isn't rendering nicely in your browser, visit our HuggingFace Space for a smoother UI experience: https://huggingface.co/spaces/neuralmagic/quant-llms-text-generation-comparison.

Why Quantization is Here to Stay
In conclusion, our comprehensive evaluation demonstrates that quantized models maintain impressive accuracy and quality compared to their full-precision counterparts, making them an essential tool for optimizing LLMs in real-world deployments.

Consistent Performance: 8-bit and 4-bit quantized LLMs show very competitive accuracy recovery across diverse benchmarks, including Arena-Hard, OpenLLM Leaderboards v1 and v2, and coding benchmarks like HumanEval and HumanEval+.
Minimal Trade-offs: Larger models (70B, 405B) show negligible performance degradation. In comparison, smaller models (8B) may experience slight variability but still preserve their outputs' core semantic meaning and structural coherence.
Efficiency and Scalability: Quantization provides significant computational savings and faster inference speeds while maintaining the semantic quality and reliability of responses.
These findings confirm that quantization offers large benefits in terms of cost, energy, and performance without sacrificing the integrity of the models. As LLMs grow in size and complexity, quantization will play a pivotal role in enabling organizations to deploy state-of-the-art models efficiently.

Ready to explore how quantized LLMs can enhance your business's efficiency and performance? Connect with our experts to discuss enterprise solutions tailored to your needs.

How does quantization impact the performance of LLMs? Only minimally! 🤯 A new study ran 500,000 different evaluations on Meta Llama using different quantization strategies. The impact is <1%, but the benefits are up to 2.4x faster inference and 3.5x model size reduction! 🔥
TL;DR:
💯 Quantized models achieve 99% accuracy recovery compared to full-precision
🚀 Up to 2.4x speedup and 3.5x model size reduction with quantization.
📊 Tested Llama 3.1 8B, 70B, and 405B models on OpenLLM Leaderboard, ArenaHard, HumanEval, and text similarity metrics.
🥇W8A8-FP8 dynamic yields the best results
🤗 Quantized models available on Hugging Face.



###
https://arxiv.org/abs/2410.07524
NVIDIA
Upcycling Large Language Models into Mixture of Experts
Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro
Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel "virtual group" initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over the topK-then-softmax approach and that higher-granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuously trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.
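The PyTorch sketch below shows the basic upcycling recipe the abstract implies: every expert starts as a copy of the trained dense MLP, a router is added, and tokens are routed with softmax-then-topK. The paper's "virtual group" initialization and weight scaling are omitted, so this is a conceptual illustration rather than the authors' method.

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    def __init__(self, dense_mlp, d_model, num_experts=8, top_k=2):
        super().__init__()
        # Upcycling: initialize each expert as an exact copy of the dense MLP.
        self.experts = nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                  # x: [tokens, d_model]
        probs = torch.softmax(self.router(x), dim=-1)      # softmax first...
        weights, idx = probs.topk(self.top_k, dim=-1)      # ...then top-K routing
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe = UpcycledMoE(dense, d_model=64)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```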

Our new #NVIDIAResearch paper presents training recipes and mechanisms to consistently upcycle billion-parameter scale #LLMs, resulting in models that outperform the original dense models. 👀 See "Upcycling Large Language Models into Mixture of Experts" with Megatron-LM

###
https://huggingface.co/papers/2410.08146
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Published on Oct 11
Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar
Abstract
A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is >8% more accurate, and 1.5-5x more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with a 5-6x gain in sample efficiency, and >6% gain in accuracy, over ORMs.

Process Reward Models (PRM) can provide feedback on each step of LLM reasoning but normally require a huge amount of human label data. Google DeepMind is trying to solve this by using progress (likelihood) improvements after each reasoning step and a “prover” LLM to correctly predict the answer, leading to 8% higher accuracy and up to 6x better data efficiency compared to standard outcome-based Reward Models.
Implementation
1️⃣ Select a base LLM and a distinct prover LLM (can be weaker).
2️⃣ Generate reasoning traces using the base LLM on a reasoning dataset with correct answers.
3️⃣ Use the prover to solve the problem multiple times before and after the step.
4️⃣ Calculate Advantage/progress for each step by subtracting the "before" success rate from the "after" success rate.
5️⃣ Create Training Data for each step with the problem, steps taken so far (the prefix), the current step, and calculated advantage (target label for RM).
6️⃣ Train a Reward Model to predict these advantage values and use it during RLHF
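A hedged sketch of steps 3 and 4 above: the per-step advantage is estimated as the change in the prover's Monte-Carlo success rate before versus after the step. Here prover_solves is a stand-in for rolling out the prover LLM and checking the final answer.

```python
import random

def prover_success_rate(prover_solves, problem, prefix, n_rollouts=16):
    # Monte-Carlo estimate of P(correct final answer | prefix) under the prover.
    return sum(prover_solves(problem, prefix) for _ in range(n_rollouts)) / n_rollouts

def step_advantage(prover_solves, problem, steps_so_far, new_step):
    before = prover_success_rate(prover_solves, problem, steps_so_far)
    after = prover_success_rate(prover_solves, problem, steps_so_far + [new_step])
    return after - before  # the target label for training the verifier

# Toy prover: each accepted step raises the chance of reaching the answer.
def toy_prover(problem, prefix):
    return random.random() < 0.2 + 0.1 * len(prefix)

print("estimated advantage:",
      step_advantage(toy_prover, "2 + 3 * 4 = ?", ["compute 3 * 4 = 12"], "add 2: 14"))
```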
Insights
🎯 Achieve >8% higher accuracy and 1.5-5x better compute efficiency than traditional outcome reward models
🔄 Weaker prover can help improve stronger base models through better exploration
⚡ More efficient exploration by rewarding intermediate progress steps


###
https://github.com/simular-ai/Agent-S
[Submitted on 10 Oct 2024]
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at this https URL.
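The structure-only sketch below outlines the loop the abstract describes: retrieve past experience and external knowledge, plan hierarchically, then act through an ACI. It defines the control flow only; every object and method here (mllm, memory, aci and their calls) is a hypothetical stand-in, not the Agent S API.

```python
def run_task(task, mllm, memory, aci):
    # 1. Experience-augmented planning: mix internal retrieval with web knowledge.
    context = memory.retrieve(task) + mllm.web_search(task)
    subtasks = mllm.plan(task, context)              # high-level plan
    # 2. Execute each subtask through the Agent-Computer Interface.
    for subtask in subtasks:
        done = False
        while not done:
            obs = aci.observe()                      # e.g., screenshot + accessibility tree
            action = mllm.next_action(subtask, obs)  # e.g., click / type / scroll
            aci.execute(action)
            done = mllm.subtask_complete(subtask, aci.observe())
    # 3. Store the trajectory so future runs can retrieve it as experience.
    memory.store(task, subtasks)
```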


💡 Introduction


Welcome to Agent S, an open-source framework designed to enable autonomous interaction with computers through Agent-Computer Interface. Our mission is to build intelligent GUI agents that can learn from past experiences and perform complex tasks autonomously on your computer.

Whether you're interested in AI, automation, or contributing to cutting-edge agent-based systems, we're excited to have you here!

###
https://machinelearning.apple.com/research/gsm-symbolic
Apple
Paper | Published October 2024
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Authors: Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar


Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
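A minimal sketch of the symbolic-template idea: the names and numbers of a GSM8K-style question are resampled while the ground-truth answer is recomputed for each instantiation. The template text is invented for illustration and is not taken from the benchmark.

```python
import random

TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")

def instantiate(seed):
    rng = random.Random(seed)
    name = rng.choice(["Sophia", "Liam", "Mina"])
    x, y = rng.randint(2, 30), rng.randint(2, 30)
    question = TEMPLATE.format(name=name, x=x, y=y)
    return question, x + y  # the answer is recomputed, not stored

for seed in range(3):
    q, a = instantiate(seed)
    print(q, "->", a)
```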

Write with as much technical detail as possible. There are 10 articles; include every one of them without omission.