Summary

Anthropic has launched the Claude app for Android, bringing the capabilities of its powerful Claude 3.5 Sonnet model to Android users. Mistral AI announced Codestral Mamba, a model specialized in code generation, and Mathstral, a model for mathematical reasoning. Microsoft introduced SpreadsheetLLM, which brings an efficient encoding method for processing spreadsheet data. Hugging Face released SmolLM, a family of small language models, and Alibaba published the technical report for its Qwen2 series. Gartner released its 2024 AI Hype Cycle, highlighting the growing importance of Sovereign AI. Neural Magic added FP8 quantization support to vLLM, enabling more efficient LLM inference. Research addressing the problem of AI hallucination also continues.

Claude Android App Launch,

Anthropic, Claude Android App Launch

Link, July 17, 2024,

  • The new Claude Android app brings the powerful capabilities of the Claude 3.5 Sonnet model to Android users.
  • The app is free and available on all plans, with the same features as iOS and the web.
  • Multi-platform support: continue conversations across the web, iOS, and Android apps.
  • Vision capabilities: take photos or upload files for real-time image analysis.
  • Multilingual processing: real-time language translation to support communication and translation.
  • Advanced reasoning: handles complex problems such as contract analysis and market research.
  • Diverse use cases: drafting business proposals, translating menus while traveling, brainstorming gift ideas while shopping, writing a speech while waiting for a flight, and more.

Codestral Mamba and Mathstral Model Releases,

Mistral AI, Codestral Mamba and Mathstral Model Releases

Link, July 16, 2024,

  • Codestral Mamba is a code-generation model built on the Mamba2 architecture.
  • Released for free under the Apache 2.0 license; achieves 75% on HumanEval.
  • Offers linear-time inference, enabling modeling of very long sequences.
  • Tested on in-context retrieval with up to 256k tokens.
  • Mathstral, specialized in mathematical reasoning, scores 56.6% on MATH and 63.47% on MMLU (a usage sketch follows below).
  • Mathstral was released as part of Mistral's broader effort to support academic projects.
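
For readers who want to try the models, below is a minimal sketch of loading the Mathstral weights from Hugging Face with transformers. The repository ID is an assumption based on Mistral's naming; verify the exact name on the model card before running.

# Minimal sketch: run Mathstral with Hugging Face transformers.
# Assumption: the weights are published as "mistralai/Mathstral-7B-v0.1";
# confirm the exact repository ID on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mathstral-7B-v0.1"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "What is the derivative of x^3 + 2x?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))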

SpreadsheetLLM: A Spreadsheet Encoding Method,

Microsoft, SpreadsheetLLM

Link, July 12, 2024,

  • Introduces an efficient method for encoding spreadsheet data (a minimal serialization sketch follows below).
  • Develops SheetCompressor, a novel encoding framework.
  • Outperforms the vanilla serialization approach by 25.6% in GPT-4's in-context learning setting.
  • Achieves an average compression ratio of 25x and a 78.9% F1 score, 12.3% above the best existing models.
  • Shows strong performance across a range of spreadsheet-understanding tasks.
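
To make the encoding idea concrete, here is a minimal, hypothetical sketch of the "vanilla serialization" baseline described in the paper (cell address, value, format). It is not SheetCompressor itself, and all names are illustrative.

# Sketch of the vanilla serialization baseline: each non-empty cell becomes an
# "address,value,format" record concatenated into a prompt string.
from typing import List, Optional

def cell_address(row: int, col: int) -> str:
    """Convert 0-based (row, col) indices to an A1-style address."""
    letters = ""
    col += 1
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{row + 1}"

def serialize_sheet(grid: List[List[Optional[str]]], fmt: str = "General") -> str:
    """Serialize a 2D grid into 'address,value,format' lines for an LLM prompt."""
    records = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is not None and value != "":
                records.append(f"{cell_address(r, c)},{value},{fmt}")
    return "\n".join(records)

sheet = [["Year", "Revenue"], ["2023", "120"], ["2024", "150"]]
print(serialize_sheet(sheet))
# A1,Year,General
# B1,Revenue,General
# A2,2023,General
# ...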

SmolLM Model Release,

Hugging Face, SmolLM Release

Link, July 16, 2024,

  • Releases the SmolLM series of small language models with 135M, 360M, and 1.7B parameters.
  • Trained efficiently on a newly curated high-quality dataset (SmolLM-Corpus).
  • Demonstrates strong results across benchmarks testing common sense reasoning and world knowledge.
  • Designed to run well on local and mobile devices (see the loading sketch below).
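
As a hedged example, the snippet below loads one of the SmolLM checkpoints with transformers; the repository ID is assumed from the naming used in the blog post's demo links, so verify it on the Hub.

# Minimal sketch: run a SmolLM checkpoint locally with transformers.
# Assumption: the 360M checkpoint lives at "HuggingFaceTB/SmolLM-360M";
# confirm the exact repository ID on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-360M"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("Small language models are useful because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))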

Qwen2 Technical Report Release,

Alibaba, Qwen2 Technical Report Release

Link, July 12, 2024,

  • The Qwen2 series spans models from 0.5B to 72B parameters.
  • Delivers strong performance in multilingual capability, coding, math, and reasoning.
  • Qwen2-72B scores 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH.
  • The Qwen2 series is released with open weights, available on Hugging Face and ModelScope (a loading sketch follows below).
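
A short hedged example of using one of the open-weight checkpoints with transformers; the repository ID is an assumption, so check the Hugging Face or ModelScope listing.

# Minimal sketch: load an open-weight Qwen2 instruct checkpoint with transformers.
# Assumption: the 0.5B instruct model is published as "Qwen/Qwen2-0.5B-Instruct".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "Summarize the Qwen2 model family in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))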

AI Hype Cycle 2024 Announcement,

Gartner, AI Hype Cycle 2024 Announcement

Link, July 12, 2024,

  • Sovereign AI emerges as a new keyword.
  • Sovereign AI refers to AI services that reflect a country's language, culture, and social context.
  • Naver has launched a Sovereign AI chatbot built on its self-developed HyperCLOVA X.
  • Governments and companies around the world are stepping up investment in Sovereign AI.

FP8 Quantization Support Added,

Neural Magic, FP8 Quantization Support in vLLM

Link, July 15, 2024,

  • FP8 quantization maximizes the efficiency of LLM inference.
  • Up to 2x latency reduction on NVIDIA H100 GPUs.
  • Over 99% accuracy preservation.
  • Reduced memory usage and improved performance across a range of models (quickstart sketch below).
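
The quickstart from the Neural Magic post (reproduced in full in the source article further below) boils down to the following, assuming vLLM 0.5.1 or later and one of Neural Magic's pre-quantized FP8 checkpoints.

# FP8 quickstart with vLLM, as shown in the Neural Magic post.
# pip install vllm==0.5.1
from vllm import LLM

# Pre-quantized FP8 checkpoint from Neural Magic's Hugging Face hub
model = LLM("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")

# Alternatively, dynamic FP8 quantization of an existing FP16/BF16 model:
# model = LLM("meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)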

AI Hallucination Research,

Intel, AI Hallucination Research

Link, July 3, 2024,

  • Intel's Neural Chat 7B model placed on the AI hallucination-rate leaderboard.
  • The AI hallucination problem remains unsolved, with multiple research efforts under way.
  • Oxford researchers developed a new hallucination detection method that improves the reliability of AI responses.

Context Embeddings for Efficient Answer Generation in RAG

Research team, Context Embeddings for Efficient RAG

Link, July 12, 2024,

  • Proposes a method that sharply speeds up answer generation in RAG by efficiently compressing long contexts.
  • COCOM, a context compression method, reduces long inputs to a small number of context embeddings.
  • Achieves up to a 5.69x speed-up over existing methods while reaching higher performance.

AI Paper Recommendations,

Research team, Notable AI Paper Recommendations

Link, Link, Link, Link, Link,

  • RankRAG: a new instruction fine-tuning framework that handles both context ranking and answer generation effectively.
  • Mixture of A Million Experts: an efficient expert-retrieval mechanism over a million tiny experts.
  • Contextual Hallucinations Mitigation in LLMs: proposes a new method to detect and reduce contextual hallucinations in LLMs.
  • RouteLLM: an efficient router model that dynamically chooses between stronger and weaker LLMs to balance cost and performance.
  • Internet of Agents: a new framework for integrating diverse third-party agents and adapting to dynamic task requirements.
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

Title,

company name, 제목

링크, date,

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

Title,

company name, 제목

링크, date,

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://www.anthropic.com/news/android-app
Claude Android app
2024년 7월 17일

1 min read
An illustration of a person engaging with a phone app
The new Claude Android app brings the power of Claude—including our most powerful model, Claude 3.5 Sonnet—to Android users. The app is free and accessible with all plans, including Pro and Team.

The Claude Android app works just like Claude on iOS and the web, meaning you get access to:

Multi-platform support: Pick up and continue conversations with Claude across web, iOS, and Android apps
Vision capabilities: Take new pictures or upload files for real-time image analysis
Multilingual processing: Real-time language translation to help communicate or translate aspects of the world around you
Advanced reasoning: Claude can help you tackle complex problems, like analyzing contracts while traveling or conducting market research to prepare for a meeting
Four examples of use cases on Android devices
Talk to Claude from anywhere
Use Claude for work or for fun. Whether you're drafting a business proposal between meetings, translating menus while traveling, brainstorming gift ideas while shopping, or composing a speech while waiting for a flight, Claude is ready to assist you.

Get started
To get started with the Claude Android app, download it on Google Play.

###
https://mistral.ai/news/codestral-mamba/
Mistral releases their first Mamba Model! 🐍 Codestral Mamba 7B is a Code LLM based on the Mamba2 architecture. Released under Apache 2.0 and achieves 75% on HumanEval for Python Coding. 👀
They also released a Math fine-tuning base on Mistral 7B that achieves 56.6% on MATH and 63.47% on MMLU.
Codestral Mamba
As a tribute to Cleopatra, whose glorious destiny ended in tragic snake circumstances, we are proud to release Codestral Mamba, a Mamba2 language model specialised in code generation, available under an Apache 2.0 license.

July 16, 2024 Mistral AI team
Following the publishing of the Mixtral family, Codestral Mamba is another step in our effort to study and provide new architectures. It is available for free use, modification, and distribution, and we hope it will open new perspectives in architecture research. Codestral Mamba was designed with help from Albert Gu and Tri Dao.

Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length. It allows users to engage with the model extensively with quick responses, irrespective of the input length. This efficiency is especially relevant for code productivity use cases—this is why we trained this model with advanced code and reasoning capabilities, enabling it to perform on par with SOTA transformer-based models.

Detailed Codestral Mamba benchmarks
We have tested Codestral Mamba on in-context retrieval capabilities up to 256k tokens. We expect it to be a great local code assistant!

You can deploy Codestral Mamba using the mistral-inference SDK, which relies on the reference implementations from Mamba’s GitHub repository. The model can also be deployed through TensorRT-LLM. For local inference, keep an eye out for support in llama.cpp. You may download the raw weights from HuggingFace.

For easy testing, we made Codestral Mamba available on la Plateforme (codestral-mamba-2407), alongside its big sister, Codestral 22B. While Codestral Mamba is available under the Apache 2.0 license, Codestral 22B is available under a commercial license for self-deployment or a community license for testing purposes.

Important: This is an instructed model, with 7,285,403,648 parameters.

MathΣtral
As a tribute to Archimedes, whose 2311th anniversary we’re celebrating this year, we are proud to release our first Mathstral model, a specific 7B model designed for math reasoning and scientific discovery. The model has a 32k context window published under the Apache 2.0 license.

July 16, 2024 Mistral AI team
We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning. The Mathstral release is part of our broader effort to support academic projects—it was produced in the context of our collaboration with Project Numina.

Akin to Isaac Newton in his time, Mathstral stands on the shoulders of Mistral 7B and specializes in STEM subjects. It achieves state-of-the-art reasoning capacities in its size category across various industry-standard benchmarks. In particular, it achieves 56.6% on MATH and 63.47% on MMLU, with the following MMLU performance difference by subject between Mathstral 7B and Mistral 7B.

Mathstral 7B breakdown by subject
Mathstral is another example of the excellent performance/speed tradeoffs achieved when building models for specific purposes – a development philosophy we actively promote in la Plateforme, particularly with its new fine-tuning capabilities.

Mathstral 7B detailed benchmarks
Mathstral can achieve significantly better results with more inference-time computation: Mathstral 7B scores 68.37% on MATH with majority voting and 74.59% with a strong reward model among 64 candidates.

Mathstral is an instructed model – use it or fine-tune it as such, referring to our documentation. Weights are hosted on HuggingFace. You can try Mathstral now with mistral-inference and adapt it with mistral-finetune.

We thank Professor Paul Bourdon for curating the GRE Math Subject Test problems used in our evaluation.

###
https://huggingface.co/papers/2407.09025
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
Microsoft presents SpreadsheetLLM
Encoding Spreadsheets for Large Language Models

Spreadsheets, with their extensive two-dimensional grids, various layouts, and diverse formatting options, present notable challenges for large language models (LLMs). In response, we introduce SpreadsheetLLM, pioneering an efficient encoding method designed to unleash and optimize LLMs' powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach was limited by LLMs' token constraints, making it impractical for most applications. To tackle this challenge, we develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs. It comprises three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance in spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4's in-context learning setting. Moreover, fine-tuned LLM with SheetCompressor has an average compression ratio of 25 times, but achieves a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. Finally, we propose Chain of Spreadsheet for downstream tasks of spreadsheet understanding and validate in a new and demanding spreadsheet QA task. We methodically leverage the inherent layout and structure of spreadsheets, demonstrating that SpreadsheetLLM is highly effective across a variety of spreadsheet tasks.

Published on Jul 12
Authors: Yuzhang Tian, Jianbo Zhao, Haoyu Dong, Junyu Xiong, Shiyu Xia, Mengyu Zhou, Yun Lin, José Cambronero, Yeye He, Shi Han, Dongmei Zhang

###
https://huggingface.co/blog/smollm
Smol Model 🚨: Danube 3 0.5B & 4B LLMs by H2o! 🔥
> Apache 2.0 licensed, Beats Qwen 2 0.5B and is competitive with Phi3 4B
> Uses Llama architecture w/ Mistral tokenizer (32K vocabulary)
> 8192 context length along with Grouped Query Attention
> 4B trained on 6T tokens and 0.5B on 4T tokens with multiple stages
> Performs quite strongly on chat benchmarks for the smol model
> Quite ripe for fine-tuning, in most of cases, beats fine-tuned Phi3 4B, too
> Bonus: Works Out of the Box in Transformers!
Quite excited to see the on-device space heat up recently with Meta's MobileLLM, Qwen 2, and so on.
Smol LLMs for the win! 🤗
SmolLM - blazingly fast and remarkably powerful
Published July 16, 2024
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch
TL;DR
This blog post introduces SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage.

Introduction
There is increasing interest in small language models that can operate on local devices. This trend involves techniques such as distillation or quantization to compress large models, as well as training small models from scratch on large datasets. These approaches enable novel applications while dramatically reducing inference costs and improving user privacy.

Microsoft's Phi series, Alibaba's Qwen2 (less than 2B), and Meta's MobileLLM demonstrate that small models can achieve impressive results when designed and trained thoughtfully. However, most of the details about the data curation and training of these models are not publicly available.

In this blog post, we're excited to introduce SmolLM, a series of state-of-the-art small language models available in three sizes: 135M, 360M, and 1.7B parameters. These models are built on a meticulously curated high-quality training corpus, which we are releasing as SmolLM-Corpus. Smollm Corpus includes:

Cosmopedia v2: A collection of synthetic textbooks and stories generated by Mixtral (28B tokens)
Python-Edu: educational Python samples from The Stack (4B tokens)
FineWeb-Edu (deduplicated): educational web samples from FineWeb (220B tokens)
Our evaluations demonstrate that SmolLM models outperform other models in their size categories across a diverse set of benchmarks, testing common sense reasoning and world knowledge. In this blog post, we will go over the curation of each subset in the training corpus and then discuss the training and evaluation of SmolLM models.


Evaluation of SmolLM models on different reasoning and common knowledge benchmarks.

Data curation
From Cosmopedia v1 to v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 30 million textbooks, blog posts, and stories generated by Mixtral-8x7B-Instruct-v0.1. Most of the samples are generated by prompting the model to generate content on specific topics using a web page referred to as a "seed sample", as shown in Figure 1. We use web samples to increase diversity and expand the range of prompts. You can find more details in this blog post.


Figure 1. Example of a Cosmopedia prompt.

To improve the dataset in v2, we tried two strategies:

Using more capable models with the same prompts
Optimizing the prompts themselves
For the first strategy, we experimented with llama3-70B-Instruct, Mixtral-8x22B-Instruct-v0.1, and Qwen1.5-72B-Chat but found no significant improvements when training models on textbooks generated by these alternatives. Therefore, in the remainder of this section, we will focus on the second strategy: how we improved the prompts.

The search for better topics and seed samples
Each prompt has three main components: the topic, the seed sample, and the generation style, which specifies the intended audience and the type of content we want the model to generate.

To ensure consistent generations, we need seed samples that are closely related to the given topic. In Cosmopedia v1, we ran clustering on FineWeb samples to identify both the topics and the corresponding web samples, as shown in Figure 2. This approach has two main limitations:

The topic list reflects the web/FineWeb clusters, which, while comprehensive, may limit our control over the topics.
The web samples within each cluster are not further filtered, potentially including some low-quality samples.

Figure 2. FineWeb clusters.

Instead of this unsupervised clustering approach, in v2 we started with a predefined list of 34,000 topics using the BISAC book classification, a standard used to categorize books by subject that is both comprehensive and educationally focused. We started with 5,000 topics belonging to 51 categories and asked Mixtral to generate subtopics for certain topics. Below is the final distribution of subtopics in each category:


Figure 3. Distribution of topics per top categories used for the prompts.

After defining the topics, we still needed to find web pages related to them. Just like using a search engine to find content on a specific topic, we implemented a search tool to retrieve the most relevant pages for each topic. We ran this tool using our BISAC categories and their subtopics as queries on the FineWeb CC-MAIN-2024-10 and CC-MAIN-2023-50 dumps, which together consist of over 520 million samples. For each query, we retrieved 1,000 pages, ensuring we retrieved only the most relevant content. The code for deploying and running the search tool is available here.

As a result, we compiled 34 million web pages across 34,000 topics. The next step was to determine which generation style worked best.


Figure 4. Topics and their retrieved samples in the category “Medical”.

Generation Style
To determine the most effective generation style, we conducted ablation studies by training 1.8B models on 8B tokens from different subsets of Cosmopedia v1. For newly generated data, we only generated 2B tokens and trained for 4 epochs to save time (it takes approximately 1000 GPU hours to generate 2B tokens with Mixtral). We used the same training and evaluation setup as FineWeb ablation models. We ran each experiment twice with two different seeds and averaged the scores between the two runs.

We compared the performance of the following subsets of Cosmopedia v1:

The web textbooks subset
The stories subset
The Stanford & OpenStax subset
We found that textbooks based on topics and seed samples from curated sources such as Stanford and OpenStax provided the best overall performance, leading to gains on MMLU and ARC benchmarks compared to web-based textbooks. Stories seemed to help with common sense benchmarks. After implementing the new topics and seed sample retrieval methods in v2, we were able to match the performance of curated sources using web seeds, confirming the quality of the new prompts.

Next, we explored which audience style worked best. We generated textbooks using the same web textbook prompts but targeted two different audiences: middle school students and college students. We found that models trained on textbooks aimed primarily at middle school students gave the best score on all benchmarks except MMLU. This can be explained by the fact that most of these benchmarks test basic common sense and elementary to intermediate science knowledge, while MMLU contains some questions that require advanced knowledge and expertise.


Evaluation of textbooks for different audiences.



For v2, we decided to generate 40% of the content for middle school students, 30% for college students, and 30% as a mix of other audiences and styles, including subsets we borrowed from Cosmopedia v1 such as stories and textbooks based on Stanford courses. Additionally, we generated 1B tokens of code textbooks based on Python seed samples from the AutoMathText dataset.

Ultimately, we produced 39 million synthetic documents consisting of 28B tokens of textbooks, stories, articles, and code, with a diverse range of audiences and over 34,000 topics.

FineWeb-Edu
FineWeb-Edu is a dataset we released a few months ago with FineWeb’s technical report. It consists of 1.3T tokens of educational web pages filtered from 🍷 FineWeb dataset.

We developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then used this classifier to retain only the most educational web pages from FineWeb. FineWeb-Edu outperforms FineWeb on popular benchmarks and shows the power of classifiers trained on synthetic data.
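
As a rough illustration of this filtering step (not the team's exact pipeline), one could score pages with a text-classification model and keep only high-scoring ones. The model ID and label below are hypothetical placeholders.

# Rough illustration of classifier-based filtering, not the exact FineWeb-Edu pipeline.
# The model ID and label name are hypothetical placeholders.
from transformers import pipeline

classifier = pipeline("text-classification", model="your-org/edu-quality-classifier")  # hypothetical

pages = [
    "Photosynthesis converts light energy into chemical energy in plants...",
    "Click here for the best casino bonuses!!!",
]

kept = []
for page in pages:
    result = classifier(page, truncation=True)[0]
    # keep only pages the classifier confidently marks as educational
    if result["label"] == "educational" and result["score"] >= 0.9:
        kept.append(page)

print(f"Kept {len(kept)} of {len(pages)} pages")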


Comparison of FineWeb-Edu to other open web datasets.

In Smollm-Corpus we include 220B deduplicated tokens from FineWeb.

Stack-Edu-Python
We applied the same idea of FineWeb-Edu to code. We used Llama3 to annotate 500,000 Python samples from The Stack dataset and used them to train an educational classifier using the same recipe as the FineWeb-Edu classifier. We then applied this classifier to the Python subset of the StarCoder models' training corpus. From the 40B Python tokens available, we retained only the samples with a score of 4 or higher, resulting in a refined dataset of 4B tokens.

The plot below compares Python-Edu to the unfiltered Python code and to using a less strict threshold of 3. We can see that the model trained on Python-Edu converges more than 3 times faster than the model trained on unfiltered Python code, achieving 16% pass@1 after only 12B tokens.


Comparison of Python-Edu to unfiltered Python code.

Training
SmolLM models are available in three sizes and were trained on the data mixture below:

135M and 360M models, each trained on 600B tokens from Smollm-Corpus
1.7B model, trained on 1T tokens from Smollm-Corpus

Training mixture of SmolLM models.

Hyperparameters choice
We used a trapezoidal learning rate scheduler with a cooldown phase equal to 20% of the total training time. It's important to note that the original experiments with this schedule were conducted at a smaller scale, and we've adapted it for our larger models.
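
Below is a minimal sketch of such a trapezoidal (warmup-stable-decay) schedule. The 20% cooldown comes from the description above; the warmup fraction is an assumption for illustration.

# Minimal sketch of a trapezoidal (warmup-stable-decay) learning rate schedule.
# The cooldown equal to 20% of training matches the description; the warmup
# fraction here is an assumption for illustration.
def trapezoidal_lr(step: int, total_steps: int, peak_lr: float,
                   warmup_frac: float = 0.01, cooldown_frac: float = 0.2) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    cooldown_steps = int(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:              # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < cooldown_start:            # constant plateau
        return peak_lr
    # linear cooldown to zero over the final cooldown_frac of training
    remaining = total_steps - step
    return peak_lr * remaining / max(cooldown_steps, 1)

lrs = [trapezoidal_lr(s, total_steps=1000, peak_lr=3e-3) for s in range(0, 1001, 100)]
print([round(lr, 5) for lr in lrs])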

For the architecture of our 135M and 360M parameter models, we adopted a design similar to MobileLLM, incorporating Grouped-Query Attention (GQA) and prioritizing depth over width. The 1.7B parameter model uses a more traditional architecture. For all three models we use embedding tying and a context length of 2048 tokens. This context length can be further extended with some long context fine-tuning.

The detailed architecture specifications for each model size are as follows:


Architecture details of SmolLM models.

We used a tokenizer trained on the Smollm Corpus with a vocab size of 49152.

Experiments
One advantage of using the trapezoidal scheduler is that it can reduce the time needed to perform scaling law experiments, as shown in Hägele et al. We illustrate this with a small scaling law study on our smallest model, SmolLM-125M. We observed that performance continues to improve with longer training, even beyond the Chinchilla optimal point. Therefore, we decided to train the 1.7B model on 1 trillion tokens and the 135M and 360M models on 600B tokens, as the performance gains after 400B tokens begin to slow on some benchmarks for these smaller models.


Evaluation of 125M SmolLM models trained on different numbers of tokens.

We experimented with adding instruct datasets and upsampling the curated Cosmopedia subsets during the cooldown phase, but found no significant improvements. This may be because the primary data mixture is already of high quality, limiting the impact of these changes.

To track our training progress, we evaluate our two smallest models every 2B tokens. The following plot shows their performance on several benchmarks:


Intermediate evaluation of SmolLM-135M and SmolLM-360M on different benchmarks.

Evaluation
In this section, we evaluate the performance of SmolLM models across different parameter sizes and compare them with the best models in their respective categories. We evaluate on a diverse set of benchmarks testing common sense reasoning and world knowledge, using the same evaluation setup for all models with the lighteval library. For HumanEval, we use bigcode-evaluation-harness with temperature 0.2, top-p 0.95, and 20 samples. For MobileLLM, which isn't publicly available, we use the numbers reported in the paper whenever possible.

We find that:

SmolLM-135M outperforms the current best model with less than 200M parameters, MobileLLM-125M, despite being trained on only 600B tokens compared to MobileLLM's 1T tokens.
SmolLM-360M outperforms all models with less than 500M parameters, including MobileLLM-350M and Qwen2-500M, despite being trained on only 600B tokens.
SmolLM-1.7B outperforms all other models with less than 2B parameters, including Phi1.5 from Microsoft, MobileLLM-1.5B, and Qwen2-1.5B.
SmolLM-1.7B shows strong Python coding performance with 24 pass@1. We note that the evaluation score for Qwen2-1.5B differs from the 31.1 pass@1 reported by the Qwen team; we use temperature 0.2 and top-p 0.95 with 20 samples.

Comparison of SmolLM models to other SLMs. We evaluate all models on the same setup, except for MobileLLM, which isn't publicly available.


Evaluation of SmolLM models on HumanEval.

We also instruction tuned the models using publicly available permissive instruction datasets. We trained all three models for one epoch on the permissive subset of the WebInstructSub dataset, combined with StarCoder2-Self-OSS-Instruct. Following this, we performed DPO (Direct Preference Optimization) for one epoch: using HelpSteer for the 135M and 1.7B models, and argilla/dpo-mix-7k for the 360M model. We followed the training parameters from the Zephyr-Gemma recipe in the alignment handbook, but adjusted the SFT (Supervised Fine-Tuning) learning rate to 3e-4.

The table below shows the performance of SmolLM-Instruct and other models on the IFEval benchmark (Prompt Strict Accuracy). The Qwen2-1.5B-Instruct model scores the highest with 29.94, while the SmolLM-Instruct models provide a good balance between model size and performance, using only publicly available permissive datasets.


Evaluation of SmolLM-Instruct models on IFEval.

How to run locally?
Our models are designed to be small and can run locally on various hardware configurations. For reference, an iPhone 15 has 6GB of DRAM, while an iPhone 15 Pro has 8GB. These memory requirements make our models suitable for deployment on a wide range of devices, from smartphones to laptops. We benchmarked the memory footprint of our three model sizes:


Memory footprint of SmolLM models.

Along with the transformers checkpoints, we released ONNX checkpoints and plan to add a GGUF version compatible with llama.cpp. You can find WebGPU demos for SmolLM-135M and SmolLM-360M at https://huggingface.co/spaces/HuggingFaceTB/SmolLM-135M-Instruct-WebGPU and https://huggingface.co/spaces/HuggingFaceTB/SmolLM-360M-Instruct-WebGPU.

Conclusion
In this blog post we introduced SmolLM models, a new state-of-the-art family of small LLMs. They demonstrate that small language models can achieve high performance with efficient training on high-quality datasets, providing a strong balance between size and performance.


###
https://huggingface.co/papers/2407.10671
Alibaba presents Qwen2 Technical Report

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

###
https://www-newstheai-com.cdn.ampproject.org/c/s/www.newstheai.com/news/articleViewAmp.html?idxno=5913
Gartner announces 'Hype Cycle 2024'... "Is the Sovereign AI Boom Coming"
With the increased use of generative AI, the importance of 'Sovereign AI' is rising Generative AI is entering the stage of real-world validation and use cases
2024-07-12 Yelim Seo
Gartner has announced the AI Hype Cycle 2024. Generative AI is moving past the hype and entering a stage where real-world validated use cases are emerging, and Sovereign AI has been added. / Gartner.
Sovereign AI Emerges as a New Keyword in Gartner's Hype Cycle 2024.

Sovereign AI has emerged as a new keyword in the Hype Cycle announced by Gartner this year. According to the 'Hype Cycle for AI 2024' released by the American IT research company Gartner on the 2nd, Sovereign AI, which did not exist before, has newly appeared.

Sovereign AI is a compound word of 'sovereign,' meaning sovereignty, and 'AI,' meaning artificial intelligence. It refers to AI services that reflect a country's language, culture, social context, and values based on its own data and infrastructure. Governments and companies concerned about the dependence on values imposed by US-centric big tech companies are particularly strengthening their investments in Sovereign AI.

Gartner's Hype Cycle is a graphical representation that visually reflects the market's expectations and realities regarding technological trends and innovations. It is used to visually explain the maturity stages of technological innovations.

To build Sovereign AI, it is necessary to have data centers equipped with high-performance GPUs, a supporting power grid, data acquisition, and the process of applying it to actual services. Currently, Naver is actively pursuing business in Sovereign AI domestically. Regarding Sovereign AI, Naver has released 'HyperCLOVA X,' a generative AI chatbot utilizing its self-developed large language model (LLM).

On the 4th, Lee Hae-jin, Naver's founder and Global Investment Officer (GIO), met with Jensen Huang, CEO of Nvidia, to discuss Sovereign AI. Although Naver and Nvidia's core businesses differ, they have both emphasized the importance of Sovereign AI for a long time. Naver emphasizes the importance of Sovereign AI for the global expansion of the core technology of its hyper-scale AI 'HyperCLOVA X,' while Nvidia emphasizes it to secure new markets where it can supply AI semiconductors and other infrastructure.

Governments around the world are also enthusiastic about strengthening Sovereign AI. In April last year, Mistral AI, founded by former Google DeepMind and Meta researchers, developed its own AI model 'Le Chat.' Samsung Electronics, Nvidia, and Naver, among others, have invested in it, evaluating it as a rival to 'ChatGPT.' The investment from major global companies alone is reported to be around 1 trillion won.

Chinese AI startup Moonshot AI has also introduced 'Kimi,' a chatbot specialized in processing Chinese sentences, with Alibaba holding about 36% of the shares. Indian AI startup Krutrim has developed 'Krutrim,' an LLM that has learned local Indian languages, supporting more than 10 local languages, including Hindi, Tamil, and Telugu. Finnish AI startup Silo has also developed 'Poro' and 'Viking,' LLMs based on Nordic languages.

Japan is also recently supporting companies with about 72.5 billion yen (about 620 billion won) and cooperating with Nvidia to develop LLMs specialized in Japanese to reduce its dependence on American technology. The developing LLM analyzes responses to natural disasters or climate change specialized in regional construction and geography.

However, as the Sovereign AI market currently focuses more on understanding and processing national languages, there are opinions that it will take time for AI models to fully grasp cultural and historical contexts. An IT industry official stated, "To strengthen Sovereign AI, learning is essential," adding, "The government needs to make efforts to provide a lot of quality public data." Additionally, he emphasized that "companies should significantly increase their investments to acquire not only open data but also quality copyrighted information."

Meanwhile, in the Hype Cycle announced by Gartner this time, generative AI has just entered the 'Trough of Disillusionment' phase. Gartner defines the Trough of Disillusionment as a stage where the hype fades and the trend diminishes, receiving less media attention.

Last year, generative AI was at the 'Peak of Inflated Expectations.' The Peak of Inflated Expectations is a phase where some technology leaders succeed in promotion due to excessive enthusiasm and unrealistic predictions, but in reality, there are significant failures. Gartner criticized this stage by saying, "The only companies making money at this stage are conference organizers and content publishers." Gartner's definition of generative AI as being in the Trough of Disillusionment is analyzed as interpreting it as a stage where the hype is over and real, validated use cases are emerging.

©THE AI

###
https://neuralmagic.com/blog/vllm-brings-fp8-inference-to-the-open-source-community/
FP8 quantization support to vLLM, making LLM inference even more efficient. FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation. Thank you to NVIDIA AI for validating our benchmarks!
🔍 What is FP8? FP8 is a modern quantization format that balances precision and efficiency with hardware acceleration on newer GPUs. It reduces memory usage significantly, enabling more cost-effective LLM deployments and higher throughput.
📈 Performance gains: FP8 delivers up to 2x Inter Token Latency (ITL) improvement for Llama 3 70B, 1.6x ITL improvement for Mixtral 8x7B, and up to 3x throughput improvement on 2 NVIDIA H100 GPUs. Memory savings allow for larger batch sizes, boosting performance across various models. Our blog contains specific accuracy details.
✅ Model accuracy: We validated the accuracy preservation of FP8 in vLLM through lm-evaluation-harness comparisons on Open LLM Leaderboard v1 tasks. Most models experience over 99% accuracy preservation compared to the unquantized baseline.
🛠️ Get Started: You can now try out FP8 support in vLLM using a quantized FP8 checkpoint. Access Neural Magic's growing list of accuracy-verified quantized FP8 checkpoints of popular LLMs on our Hugging Face Model Hub. Ready to use with vLLM:


🗓️ Learn more: See our blog for more detailed FP8 insights and join our bi-weekly vLLM Office Hours to regularly hear from and give feedback to the vLLM committer community.


🙏 Thank you for reading and please spread the word about FP8 in vLLM by sharing this post.

vLLM Brings FP8 Inference to the Open-Source Community

Jul 15, 2024

vLLM Now Supports FP8 on NVIDIA GPUs
vLLM, a leading open-source LLM serving engine, has taken a significant leap forward in its recent 0.5 release by incorporating FP8 quantization support. This cutting-edge format promises to revolutionize LLM deployment by dramatically improving efficiency without sacrificing model quality.

The implementation of FP8 support is the result of development efforts from Neural Magic and Anyscale. This integration allows vLLM to utilize specialized hardware units, such as the fourth-generation Tensor Cores on NVIDIA H100 and L40s GPUs, which are designed to accelerate matrix multiplication in FP8 precision.

With FP8, vLLM deployments may receive up to a 2x reduction in latency with minimal accuracy degradation.

This blog post explores the integration of FP8 in vLLM, its benefits, and what it means for the future of LLM inference.

What is FP8?
Traditionally, FP32 (32-bit floating point) and FP16 (16-bit floating point) have been the go-to formats for machine learning models. However, as LLMs grow larger and more complex, there's an increasing need for more efficient formats that can maintain accuracy while reducing computational and memory requirements.

FP8, or 8-bit floating point, is a modern quantization format that strikes a balance between precision and efficiency. It provides a non-uniform range representation and per-tensor scaling factors with hardware acceleration on modern GPUs, allowing for significant performance gains and 2x reduced memory usage without sacrificing model quality.

FP8 Performance in vLLM
Before diving into the performance gains, let’s briefly explain three crucial metrics for LLM serving:

Inter-Token Latency (ITL): The average time between generating each token in the output per user. Lower ITL means smoother, more responsive text generation.
Throughput: The number of output tokens per second an inference server can generate across all users and requests. Higher throughput allows for serving more requests simultaneously.
Time-to-First-Token (TTFT): The time it takes for the model to generate the first token of the response after receiving the input prompt. Lower TTFT reduces the initial wait time for users.
These metrics are vital for assessing and optimizing the real-world performance of LLM serving systems, directly impacting user experience and system efficiency.
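
As a small worked example of how these metrics fall out of raw timestamps (illustrative arithmetic only, not vLLM's internal accounting):

# Illustrative computation of TTFT, ITL, and throughput from raw timestamps.
request_start = 0.00                      # prompt received (seconds)
token_times = [0.35, 0.40, 0.45, 0.50]    # wall-clock time each output token finished

ttft = token_times[0] - request_start               # time to first token
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
itl = sum(gaps) / len(gaps)                          # average inter-token latency
throughput = len(token_times) / (token_times[-1] - request_start)  # tokens per second

print(f"TTFT: {ttft:.2f}s, ITL: {itl:.3f}s/token, throughput: {throughput:.1f} tok/s")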

The integration of FP8 in vLLM has yielded impressive performance gains across various models and use cases:

Up to 2x ITL improvement for serving dense models (Llama 3 70B)
Up to 1.6x ITL improvement for serving Mixture of Experts (MoE) models (Mixtral 8x7B)
Up to 3x throughput improvement in scenarios where the significant memory savings lead to increasing batch sizes.
Inter-Token Latency (ITL) benchmarks for Llama 3 70B and Mixtral 8x7B on 2xH100. Note that FP8 MoE support currently requires Triton version 2.3.1 or higher.
Intensive serving benchmark for Llama 3 70B on 2xH100. Notice that with large requests and more requests per second, the FP16 server does not have enough memory to process requests in parallel, choking the utilization of the GPU due to small batch sizes and leading to degraded TTFT.
Minimal Quality Degradation
Accuracy preservation of FP8 in vLLM has been validated through lm-evaluation-harness comparisons on Open LLM Leaderboard v1 tasks. Most models experience over 99% accuracy preservation compared to the unquantized baseline.

Open LLM Leaderboard v1 Evaluations for BF16 and FP8 checkpoints of common models. All FP8 models were quantized with a calibration set of 2048 samples from UltraChat 200k. Accuracy metrics are reported for instruction-fine tuned checkpoints.
FP8 Inference Quickstart
Try out FP8 support in vLLM immediately using a quantized FP8 checkpoint:

# pip install vllm==0.5.1
from vllm import LLM
model = LLM("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
result = model.generate("Hello, my name is")
There is also support for dynamic FP8 quantization for existing FP16/BF16 models within vLLM by specifying the quantization="fp8" argument. Note that this will not provide the same performance uplift due to the dynamic scale calculations required.

from vllm import LLM
model = LLM("meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
result = model.generate("Hello, my name is")
For easy performant FP8 inference, Neural Magic has produced a growing list of accuracy-verified quantized FP8 checkpoints of popular LLMs ready to use with vLLM. You can reproduce these results or calibrate with your dataset using our open-source tool llm-compressor.


Overview of FP8 Architecture in vLLM
This section goes into detail over several key features of the FP8 architecture in vLLM, along with easy steps for you to get started adopting the features.

Performant FP8 Kernels
vLLM's implementation of FP8 draws inspiration from PyTorch, initially adopting torch.float8_e4m3fn and torch._scaled_mm to enable runtime quantization of existing FP16/BF16 checkpoints. This straightforward approach allows users to enable FP8 quantization by simply specifying quantization="fp8". Building on this foundation, we extended FP8 support to Mixture of Experts (MoE) models, starting with a Mixtral implementation in Triton. Since then, we have significantly enhanced the FP8 implementation for performant inference:

Utilization of static activation scales to reduce quantization overhead
Development of custom CUTLASS kernels for FP8 matrix multiplication, surpassing PyTorch's FP8 performance
Optimization of Triton and CUTLASS parameters for improved performance
These advancements collectively contribute to vLLM's state-of-the-art FP8 inference support.

Memory Reduction
FP8 quantization offers substantial memory benefits. Both weights and activations are stored more efficiently, occupying only half the space required by their original precision. This reduction in memory footprint allows for longer context lengths and accommodates more concurrent requests. Additionally, vLLM extended FP8 quantization to the KV Cache. By specifying kv_cache_dtype="fp8", users can further reduce the memory footprint of in-flight requests, potentially doubling the number of requests that can be processed simultaneously or allowing larger models to fit into GPU memory.
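
Based on that description, enabling the FP8 KV cache should look roughly like the following sketch, reusing the checkpoint from the quickstart above.

# Sketch: enable the FP8 KV cache alongside FP8 weights, per the description above.
from vllm import LLM

model = LLM(
    "neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
    kv_cache_dtype="fp8",   # store the KV cache in FP8 to fit more concurrent requests
)
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)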

FP8 Checkpoint Compatibility
vLLM now supports direct ingestion of FP8 model checkpoints, streamlining the use of pre-quantized models. When creating FP8 checkpoints for your models, vLLM offers two approaches:

Static per-tensor scales for weights with dynamic per-tensor scales for activations
Pros: Easy to use
Cons: Sub-optimal performance due to cost of scale calculation
Static per-tensor scales for both weights and activations
Pros: Optimal performance
Cons: Requires a calibration step
The following table illustrates the structure of an FP8 checkpoint, using the neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 model as an example:


The FP8 checkpoint contains static per-tensor scales for both weights and activations.
For optimal inference performance, we recommend using llm-compressor or AutoFP8 with relevant calibration data to generate appropriate per-tensor static scales for both weights and activations. Here's a step-by-step guide to quantize your model using AutoFP8:

# pip install git+https://github.com/neuralmagic/AutoFP8.git
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
# Load and tokenize 2048 dataset samples for calibration of activation scales
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
ds = load_dataset("neuralmagic/ultrachat_2k", split="train_sft").select(range(2048))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", quantize_config)
model.quantize(examples)
model.save_quantized("Meta-Llama-3-8B-Instruct-FP8/")
After executing this script, your quantized model checkpoint will be available at Meta-Llama-3-8B-Instruct-FP8/. You can then load this checkpoint directly in vLLM:

from vllm import LLM
model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
result = model.generate("Hello, my name is")
For a more comprehensive understanding of FP8 in vLLM, please read our documentation on FP8 here.

The Future of FP8 in vLLM
The integration of FP8 in vLLM is a great step forward, and is just the beginning. The development team is actively working on several exciting enhancements:

More Advanced Quantization: Through the recent integration of llm-compressor, we will be applying more advanced quantization techniques like SmoothQuant and GPTQ from integer quantization methods to reduce outliers and preserve accuracy. Development is ongoing to support scaling factors of a finer granularity (e.g., per-channel, per-token), which will further improve quantization accuracy. We will also be pushing for INT8 W8A8 quantization to provide similar performance benefits on hardware without support for FP8, such as A100 GPUs.
FP8 Attention: We will extend FP8 computation to the attention mechanism as well by leveraging kernels from FlashInfer, greatly improving performance at large context lengths.
Expanded MoE FP8 Support: While FP8 support for Mixture of Experts (MoE) models like Mixtral is already available, work is in progress to extend this support to a broader range of MoE architectures like Qwen2 and DeepSeek-V2.
Operation Fusion: We are exploring ways to fuse linear layers with surrounding operations to reduce the impact of quantization and dequantization. This is primarily focused on utilizing torch.compile with custom passes for layer fusion.
As these features progress, we can expect vLLM to continue pushing the boundaries of LLM inference efficiency, making advanced AI models more accessible and deployable in a wide range of applications.

If you are interested in helping these developments, please join the bi-weekly vLLM Open Office Hours where you can ask questions, meet the community, and learn how to contribute!

###
https://www.digit.in/features/general/ai-hallucination-in-llm-and-beyond-can-it-be-fixed-completely.html?linkId=100000273930317
Kudos to Haihao Shen, Kaokao Lv, and Huma Abidi for #Intel Neural Chat 7B making the leaderboard for AI hallucination rates! This dashboard evaluates factual consistency and hallucination rates, crucial for trustworthy AI outputs. Our commitment to innovation is setting new benchmarks in the AI landscape

AI hallucination in LLM and beyond: Will it ever be fixed?
By Jayesh Shinde | Updated on 03-Jul-2024
Despite going mainstream two years ago, Generative AI products and services are arguably still in their infancy, and you just can’t stop marvelling at their potent, transformative power. Even this early in its adoption curve, GenAI continues to impress. With broad consensus on GenAI as the next best thing since sliced bread, capable of responding to our whims and fancies better than our own wildest imagination, the honeymoon period is well and truly on. It seems these AI chatbots or text-to-image generators can do no wrong. Unless, of course, they do – at which point the honeymoon ends rather abruptly.

Just like us mere mortals, GenAI isn’t without its flaws. Sometimes subtle, sometimes glaringly obvious. In its myriad attempts to conjure up text and images out of thin air, AI can have a tendency to make factual mistakes. In other words, hallucinate. These are instances where GenAI models produce incorrect, illogical or purely nonsensical output amounting to beautifully wrapped gibberish.


From Google Gemini’s historically inaccurate images to Meta AI’s gender biased pictures, whether it’s ChatGPT’s imaginary academic citations for generative text or Microsoft Edge’s Bing Copilot giving erroneous information, these mistakes are noteworthy. Call it inference failure or Woke AI, they’re all shades of AI hallucinations on display. Needless to say these AI hallucinations have been shocking, embarrassing and deeply concerning, giving even the most ardent of GenAI evangelists and gung-ho AI fans some serious pause. In fact, take any LLM (one of the pillars of GenAI currently) out there, it’s guaranteed to make mistakes in something as simple as document summarisation. No jokes!


LLM Hallucination Leaderboard (as of June 28, 2024)
Researchers have created a public leaderboard on GitHub to track the hallucination rates in popular LLMs. They built an AI model to detect hallucinations in LLM outputs, feeding 1000 short documents to various AI models and measuring the rate of factual consistency and hallucination in their output. The models were also measured by their answer rate and average summary length. According to their leaderboard, some of the LLMs with the lowest hallucination rates are GPT-4 Turbo, Snowflake Arctic, and Intel Neural Chat 7B. They’re also in the process of building a leaderboard on citation accuracy of LLMs – a crucial hurdle to overcome in terms of improving factual consistency.

Why does AI hallucinate?
AI hallucinations in popular LLMs like Llama 2 (70 billion parameters), GPT-3.5 (175 billion parameters), Claude Sonnet (70 billion parameters), etc, are all ultimately linked to their training data. Despite its gigantic size, if the training data of these LLMs had built-in bias of some kind, the generative AI output of these LLMs can have hallucinated facts that try to reinforce and transfer that bias in some form or another – similar to the Google Gemini blunders, for example. On the other end of the spectrum, absence of enough variety of data on any given subject can also lead to AI hallucinations every time the LLM is prompted on a topic it isn’t well-versed to answer with authority.


If an LLM is trained on a mix of code and natural language-based data, it’s very likely to hallucinate nonsensical code if it encounters a programming concept outside its training dataset. If the initial training data of image generation models like Midjourney or Stable Diffusion, which were trained on hundreds of billions of parameters, had a majority of images of Western architecture, for instance, their output will struggle to generate realistic or believable images of traditional Indian architecture, leading their models to invent or hallucinate a mish-mash of architectural variations that don’t pass muster.


Generative AI video models like MovieGAN and OpenAI Sora, which aim to generate realistic videos from text input, suffer from similar issues right now. If their training data doesn’t capture the full range of human motion, it will generate human forms capable of performing physically impossible movements – as these AI generated videos of human gymnasts very well emphasise. Last year, a TikTok user self-released a song called “Heart On My Sleeve,” where the vocals sounded eerily similar to Drake and The Weeknd.

The song clocked over 15 million views on TikTok, not to mention hundreds of thousands more on Spotify and YouTube. If the viral hit was generated using an AI-based sound generation tool, chances are it might have been heavily trained on Western hip-hop music as part of its dataset, and that it won’t be great at generating vocals that sound like Lata Mangeshkar. Probably why we haven’t heard an AI generated song of a famous Indian singer yet, because of the lack of quality training data.

Could AI hallucinations also be linked to a lack of effort? Because the AI genie is well and truly out of the bottle, and there’s no going back to non-GenAI times, companies and startups are locked in a furious race to release half-baked AI products to gain first mover’s advantage and cover market share. These are some of the key findings of a recent report from Aporia, which surveyed about 1000 AI and ML professionals from North America and UK – individuals working in companies ranging from 500 to 7,000 employees, across various important sectors such as finance, insurance, healthcare and travel, among others.

Aporia’s findings reveal a noteworthy trend among engineers working with LLMs and Generative AI. A shocking 93-percent of machine learning engineers reported encountering issues with AI-based production models either on a daily or weekly basis, while 89-percent of these professionals also acknowledge encountering hallucinations within these AI systems. According to the survey findings, these AI distortions often materialise as factual inaccuracies, biases, or potentially harmful content, underscoring the critical importance of implementing robust monitoring and control mechanisms to mitigate such AI hallucination issues effectively.

Can AI hallucination be detected and stopped?
University of Oxford researchers seem to have made significant progress in ensuring the reliability of information generated by AI, one that addresses the issue of AI hallucination fair and square. Their study, published in Nature, introduces a novel method for detecting instances when LLMs hallucinate by inventing plausible-sounding but imaginary facts. The new method proposed by Oxford researchers analyses the statistics behind any given AI model's answer, specifically looking at the uncertainty in the meaning of a phrase in a generated sentence rather than just its grammatical structure, allowing it to determine if the model is genuinely unsure about the answer it generates for any given prompt. According to the researchers, their new method outperformed existing ones in detecting incorrect answers in GenAI based LLMs, leading to more secure deployment of GenAI in contexts where errors can have serious consequences, such as legal or medical question-answering.


Microsoft also claims to tackle AI hallucinations through new tools in its Azure AI Studio suite for enterprise customers, according to a report by The Verge. The tools block malicious prompts designed to trick a customer's AI into deviating from its training data, analyse the model's responses to check for fabricated information, and assess potential vulnerabilities in the AI model itself. These features integrate readily with popular GenAI models like GPT-4 and Llama, according to Microsoft, giving Azure cloud users more control in preventing unintended and potentially damaging AI outputs.

Other big tech players aren't sitting idle in the face of AI hallucinations. Beyond recognising the importance of high-quality training data for LLMs, Google Cloud Platform employs techniques like regularisation, which penalises models for making extreme predictions, preventing overfitting to the training data and reducing the chance of hallucinated outputs. Amazon takes a similar approach in its cloud empire: AWS (Amazon Web Services) is also exploring Retrieval-Augmented Generation (RAG), which pairs the LLM's text generation capabilities with a retrieval system that looks up relevant information, helping the model stay grounded in factual sources while generating text and reducing the chances of AI hallucination.
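
For readers unfamiliar with the pattern, here is a minimal sketch of a RAG loop of the kind described above. It assumes hypothetical `embed`, `vector_store`, and `llm` helpers and is not tied to any particular vendor's implementation.

```python
def rag_answer(question, vector_store, embed, llm, top_k=4):
    """Minimal RAG loop: retrieve supporting passages, then generate an
    answer that is instructed to stay grounded in them."""
    # 1. Retrieve the passages most relevant to the question.
    query_vec = embed(question)                          # hypothetical embedding function
    passages = vector_store.search(query_vec, k=top_k)   # hypothetical vector index

    # 2. Build a prompt that grounds the model in the retrieved text.
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate; the retrieved evidence reduces the chance of hallucination.
    return llm(prompt)
```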


AI hallucination: A glass half full or empty?
Long story short, it appears there's no single solution for stopping AI hallucinations. With GenAI deployments across various industries still accelerating, AI hallucination remains an ongoing area of research for all major tech players and academia. In fact, one research paper from the National University of Singapore argues that AI hallucination is inevitable due to an innate limitation of LLMs. The study provides a mathematical proof that hallucination is an inherent challenge for these models: no matter how advanced an LLM may be, it cannot learn everything, and it will inevitably generate inaccurate outputs or hallucinate when faced with certain real-world scenarios.
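
The inevitability claim rests on a diagonalisation-style argument; the following is an informal paraphrase of that style of argument (an assumed reconstruction of the proof strategy, not the paper's exact statement).

```latex
% Informal paraphrase of a diagonalisation-style inevitability argument
% (an assumed reconstruction, not the paper's exact formalism).
Model every LLM as a computable function $h$ from input strings to output strings.
The set of all such models is countable: $h_1, h_2, h_3, \dots$
Construct a ground truth $f$ by diagonalisation: for each $i$, choose an input $s_i$
and define $f(s_i)$ to be a correct answer with $f(s_i) \neq h_i(s_i)$.
Then every computable LLM $h_i$ disagrees with the ground truth on at least one input,
i.e.\ it hallucinates on $s_i$, no matter how large or well-trained it is.
```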

If it's a feature rather than just a bug, there's an argument to be made that AI hallucination is actually good for some use cases, according to IBM. AI can create dreamlike visuals and inspire new artistic styles, and it can reveal hidden connections and offer fresh perspectives on complex information, which can be valuable for data analysis. Mind-bending virtual worlds hallucinated by AI can enrich gaming and VR experiences as well.

Depending on how you look at it, the phenomenon of AI hallucination seems to be both a curse and a blessing in disguise (but it’s mostly a curse). It mirrors the complexities of the human brain and cognitive thought, in a process shrouded in mystery that both medical researchers and computer scientists don’t fully understand. Just as our brains can sometimes misinterpret or fill gaps in information, creating illusions or mistaken perceptions, AI systems too encounter limitations in interpreting data. While efforts are underway to enhance their accuracy and reliability, these occasional AI hallucinations also present opportunities for creativity and innovation, for thinking out of the box – similar to how our minds can unexpectedly spark new ideas.

This realisation should make you appreciate your LLM's output even more: GenAI isn't too dissimilar from us when it comes to brainfarts. Until the experts lobotomise the problem, keep double- and triple-checking your favourite LLM's responses.

###
https://arxiv.org/abs/2407.09252
[Submitted on 12 Jul 2024]
Context Embeddings for Efficient Answer Generation in RAG
David Rau, Shuai Wang, Hervé Déjean, Stéphane Clinchant
Retrieval-Augmented Generation (RAG) allows overcoming the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer which slows down decoding time directly translating to the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method, reducing long contexts to only a handful of Context Embeddings speeding up the generation time by a large margin. Our method allows for different compression rates trading off decoding time for answer quality. Compared to earlier methods, COCOM allows for handling multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69 × while achieving higher performance compared to existing efficient context compression methods.
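
To make the idea concrete, the sketch below compresses a long context into a small, fixed number of embeddings using learned pooling queries. It is only a rough illustration of the compression idea, not the COCOM architecture itself; the module choices and hyperparameters here are invented for the example.

```python
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    """Illustrative sketch of context compression in the spirit of COCOM
    (an assumption, not the authors' implementation). Long retrieved contexts
    are encoded and pooled into a small, fixed number of "context embeddings"
    that the decoder attends to instead of thousands of context tokens."""

    def __init__(self, hidden_dim=768, n_ctx_embeddings=16):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True),
            num_layers=2,
        )
        # Learned queries that pool the long context into a few vectors.
        self.queries = nn.Parameter(torch.randn(n_ctx_embeddings, hidden_dim))
        self.pool = nn.MultiheadAttention(hidden_dim, num_heads=12, batch_first=True)

    def forward(self, context_token_embeddings):
        # context_token_embeddings: (batch, ctx_len, hidden_dim), e.g. ctx_len = 4096
        encoded = self.encoder(context_token_embeddings)
        queries = self.queries.unsqueeze(0).expand(encoded.size(0), -1, -1)
        compressed, _ = self.pool(queries, encoded, encoded)
        # compressed: (batch, n_ctx_embeddings, hidden_dim); the decoder now attends
        # to 16 vectors instead of 4096 context tokens, cutting decoding time.
        return compressed
```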

###
After going through hundreds of AI papers in the past couple of weeks, I am noticing both the deeper integration of ideas (e.g., Mixture of A Million Experts and Internet of Agents) and the utility of simple yet very effective methods (e.g., RouteLLM and RankRAG).
If you are looking for some weekend reads, here are a few notable AI papers I read this week:
- RankRAG: introduces a new instruction fine-tuning framework that performs effective context ranking and answer generation to enhance an LLM's RAG capabilities. It leverages a small ranking dataset to outperform existing expert ranking models. Shows that Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks.
https://arxiv.org/abs/2407.02485v1
- Mixture of A Million Experts: introduces a parameter-efficient expert retrieval mechanism that leverages the product key technique for sparse retrieval from a million tiny experts. It aims to decouple computational cost from parameter count by efficiently routing to a very large number of tiny experts through a learned index structure.
https://arxiv.org/abs/2407.04153

- Contextual Hallucinations Mitigation in LLMs: proposes a new method that detects and significantly reduces contextual hallucinations in LLMs (e.g., by 10% on the XSum summarization task). It builds a hallucination detection model on input features given by the ratio of attention weights on the context versus the newly generated tokens, computed for each attention head (a rough sketch of this feature appears after this list). The hypothesis is that contextual hallucinations are related to the extent to which an LLM attends to the provided contextual information.
https://arxiv.org/abs/2407.07071

- RouteLLM: proposes efficient router models that dynamically select between stronger and weaker LLMs during inference to balance cost and performance. The training framework leverages human preference data and data augmentation techniques to boost performance. It is shown to reduce costs by over 2x in certain cases while maintaining response quality.
https://arxiv.org/abs/2406.18665v2

- Internet of Agents: a new framework that addresses several limitations of multi-agent frameworks, such as integrating diverse third-party agents and adapting to dynamic task requirements. It introduces an agent integration protocol, an instant-messaging architecture design, and dynamic mechanisms for effective collaboration among heterogeneous agents.
https://arxiv.org/abs/2407.07061v2
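
As promised above, here is a rough sketch of the attention-ratio feature behind the contextual-hallucination detector. It assumes access to per-head attention weights at each decoding step and annotated hallucination labels for training; the function and variable names are hypothetical, and this is not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lookback_ratios(attn_weights, n_context_tokens):
    """Per-head ratio of attention mass on the provided context versus the
    newly generated tokens, for one decoding step.

    attn_weights: (n_heads, seq_len) attention weights for the token being
    generated (an assumption about the available instrumentation)."""
    attn = np.asarray(attn_weights)
    ctx_mass = attn[:, :n_context_tokens].sum(axis=1)
    new_mass = attn[:, n_context_tokens:].sum(axis=1)
    return ctx_mass / (ctx_mass + new_mass + 1e-9)  # one ratio per attention head

# A simple linear detector over these per-head ratios. In practice X would be a
# (n_examples, n_heads) matrix of ratios averaged over a generated span, and y a
# 0/1 label saying whether that span was annotated as hallucinated.
detector = LogisticRegression(max_iter=1000)
# detector.fit(X, y)
# detector.predict_proba(new_ratios.reshape(1, -1))
```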