Summary

Kyutai announced Moshi, a real-time native multimodal model that can express and understand emotions and can both listen to and generate speech. InternLM released IXC-2.5 (InternLM-XComposer-2.5), a new vision-language model that supports ultra-high-resolution image understanding and multi-turn dialogue and performs strongly across a wide range of benchmarks. NVIDIA introduced DoRA, a new fine-tuning method positioned as an alternative to LoRA, and Meta published a new training approach based on multi-token prediction. Finally, Hugging Face added support for RT-DETR, a real-time object detection model.

Kyutai Announces Moshi

Kyutai, July 3, 2024,
Kyutai

  • Kyutai announced Moshi, a real-time native multimodal model
  • Moshi can express and understand emotions
  • Can speak with emotion, for example with a “French accent”
  • Listens to and generates audio/speech
  • Jointly pre-trained on a mix of text and audio data
  • Uses synthetic text data from Helium, a 7B LLM built by Kyutai
  • Fine-tuned on 100k “oral-style” synthetic conversations
  • Learned its voice from synthetic data generated by a separate TTS model
  • Achieves an end-to-end latency of about 200ms
  • A smaller variant runs on a MacBook or a consumer-grade GPU
  • Includes watermarking to detect AI-generated audio
  • Will be released as open source
  • Moshi is intended to contribute to open research and the broader AI ecosystem

InternLM Releases InternLM-XComposer-2.5

arXiv, July 3, 2024,
InternLM

  • IXC-2.5 excels at a wide range of text-image comprehension and composition applications
  • Vision-language model built on a 7B LLM backend
  • Trained with 24K interleaved image-text contexts
  • Extends to 96K long contexts via RoPE extrapolation (a minimal RoPE sketch follows this list)
  • Supports ultra-high-resolution image and fine-grained video understanding
  • Supports multi-turn, multi-image dialogue
  • Uses extra LoRA parameters for text-image composition applications
  • Can be applied to webpage crafting and high-quality text-image article writing
  • Evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 of them
  • Surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks
  • Uses specially designed Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques to raise the quality of its text-image articles
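
The long-context claim above rests on rotary position embeddings (RoPE): positions are encoded as rotations whose angles grow with position, so the same formula can be evaluated at positions beyond the training length. The Python sketch below is a generic illustration of RoPE under assumed toy shapes (head_dim=128), not IXC-2.5's actual code; the paper's extrapolation recipe may additionally rescale frequencies.

import torch

def rope_angles(seq_len, head_dim, base=10000.0):
    # one frequency per channel pair: inv_freq[i] = base^(-2i / head_dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)            # (seq_len, head_dim // 2)

def apply_rope(x, angles):
    # rotate the two halves of each head dimension by the position-dependent angles
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(96, 128)                                # toy query states, head_dim = 128
q_rot = apply_rope(q, rope_angles(seq_len=96, head_dim=128))
print(q_rot.shape)                                      # torch.Size([96, 128])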

NVIDIA Introduces DoRA, a High-Performing Alternative to LoRA for Fine-Tuning

NVIDIA, June 28, 2024,
NVIDIA

  • DoRA is a fine-tuning method proposed as an alternative to LoRA
  • Improves both the learning capacity and stability of LoRA
  • Improves accuracy without any additional inference cost
  • Outperforms LoRA across a wide variety of language and vision model tasks
  • Shows consistent gains on both LLM and VLM tasks
  • Decomposes each pretrained weight into magnitude and direction components and fine-tunes both (see the sketch after this list)
  • Accepted as an oral paper at ICML 2024
  • Applicable to a wide range of model architectures
  • Exhibits learning behavior closer to full fine-tuning (FT) than LoRA does
  • Can be combined with QLoRA to reduce memory demands
  • Delivers strong text-to-image personalization with Hugging Face's DreamBooth training scripts
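
A minimal sketch of the weight decomposition referenced in the list above, written in plain PyTorch with assumed shapes and initialization; it illustrates the idea and is not NVIDIA's implementation or the PEFT library's DoRA code. The frozen weight is split into a per-output-channel magnitude and a direction, the direction receives a LoRA-style low-rank update, and the product replaces the original weight.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    # Sketch of a DoRA-style layer wrapping a frozen pretrained nn.Linear.
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.register_buffer("w0", base.weight.detach().clone())   # frozen pretrained weight (out, in)
        self.bias = None if base.bias is None else nn.Parameter(base.bias.detach().clone(), requires_grad=False)
        # trainable magnitude: one scalar per output channel, initialised to ||W0||
        self.m = nn.Parameter(self.w0.norm(p=2, dim=1, keepdim=True))
        # trainable LoRA factors for the directional update
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        v = self.w0 + self.B @ self.A                               # low-rank update of the direction
        v = v / v.norm(p=2, dim=1, keepdim=True)                    # normalise rows to unit directions
        return F.linear(x, self.m * v, self.bias)                   # rescale by the learned magnitude

layer = DoRALinear(nn.Linear(1024, 1024))
print(layer(torch.randn(2, 1024)).shape)                            # torch.Size([2, 1024])

Because the learned magnitude and direction can be merged back into a single weight matrix after training, inference cost is unchanged, which is the "no additional inference cost" point above.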

Meta Releases a Multi-Token Prediction Approach

Meta, July 4, 2024,
Meta

  • Published a new LLM training approach based on multi-token prediction (a minimal sketch follows this list)
  • Improves model capability and training efficiency while enabling faster inference
  • Released pre-trained models for code completion
  • Models are available on Hugging Face
  • Includes 7B models trained on 200B and 1T tokens of code
  • Uses the standard Llama 2 SentencePiece tokenizer
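
A minimal sketch of the multi-token prediction setup described above: a shared causal trunk feeds several output heads, and head i is trained to predict the token i+1 positions ahead. Everything here (toy sizes, plain linear heads, a small TransformerEncoder trunk) is an assumption for illustration; it is not Meta's released training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenLM(nn.Module):
    # Toy multi-token predictor: shared trunk, one head per future offset.
    def __init__(self, vocab_size=32000, d_model=256, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def forward(self, tokens):                                      # tokens: (batch, seq)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.trunk(self.embed(tokens), mask=causal)             # shared hidden states
        return [head(h) for head in self.heads]                     # one logits tensor per offset

def multi_token_loss(logits_per_head, tokens):
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        shift = i + 1                                               # head i predicts the token shift steps ahead
        pred, target = logits[:, :-shift], tokens[:, shift:]
        loss = loss + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return loss / len(logits_per_head)

tokens = torch.randint(0, 32000, (2, 16))
model = MultiTokenLM()
print(multi_token_loss(model(tokens), tokens).item())

At inference time the extra heads can be dropped, falling back to ordinary next-token decoding, or used to propose several tokens at once, which is where the faster-inference claim comes from.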

Hugging Face Adds Support for RT-DETR, a Real-Time Object Detection Model

Hugging Face, July 5, 2024,
Hugging Face

  • RT-DETR provides real-time object detection and is now supported in Hugging Face Transformers (a usage sketch follows this list)
  • The authors claim better speed and accuracy than YOLO models
  • Released under the Apache 2.0 license, so it can be used freely for commercial purposes
  • Follow-up to DETR, the Transformer-based detection model developed by AI at Meta
  • Efficient hybrid encoder design processes multi-scale features quickly
  • Uncertainty-minimal query selection provides high-quality initial queries, improving accuracy
  • Supports flexible speed tuning to adapt to various scenarios without retraining
  • RT-DETR-R50/R101 reach 53.1%/54.3% AP on COCO at 108/74 FPS on a T4 GPU
  • After pre-training with Objects365, RT-DETR-R50/R101 achieve 55.3%/56.2% AP
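
A short usage sketch for the newly added Transformers support. It assumes a recent transformers release that ships the RT-DETR classes and the PekingU/rtdetr_r50vd checkpoint on the Hugging Face Hub; adjust the model id and threshold as needed.

import torch
import requests
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"      # sample COCO image
image = Image.open(requests.get(url, stream=True).raw)

processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(v, 1) for v in box.tolist()])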
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is:

(today’s date in year-month-day format) AI News,

Summary

(overall short summary with good details; in the Summary section, explain the details starting with the company name, e.g. “OpenAI announced ~~~.”)

Title,

Korean title

link, date,
company name

  • detailed summary 1, (concise, bullet-point style)
  • detailed summary 2, (concise, bullet-point style)
  • detailed summary N, (concise, bullet-point style)

Title,

Korean title

link, date,
company name

  • detailed summary 1, (concise, bullet-point style)
  • detailed summary 2, (concise, bullet-point style)
  • detailed summary N, (concise, bullet-point style)
###
https://kyutai.org/
kyutai
July 3, 2024
Did Open Science just beat OpenAI? 🤯 Kyutai just announced Moshi, a real-time native multimodal foundation model that can listen and speak, similar to what OpenAI demoed with GPT-4o in May. 👀
Moshi:
> Expresses and understands emotions, e.g. speak with a “French accent”
> Listens and generates Audio/Speech
> thinks as it speaks (textual thoughts)
> Supports 2 streams of audio to listen and speak at the same time
> Used Joint pre-training on mix of text and audio
> Used synthetic data text data from Helium a 7B LLM (Kyutai created)
> Is fine-tuned on 100k “oral-style” synthetic conversations converted with TTS
> Learned its voice from synthetic data generated by a separate TTS model
> Achieves an end-to-end latency of 200ms
> Has a smaller variant that runs on a MacBook or consumer-size GPU.
> Uses watermarking to detect AI-generated audio (WIP)
> Will be released open source!
1. It’s small: a 7B model (~14GB VRAM in bf16/fp16, ~7GB in fp8/int8; see the quick check after this list) - can be quantised further to run in even more constrained environments. Massive win for accessibility.
2. 160-200ms latency speech-in, speech-out - you can iterate quickly and prototype.
3. Upcoming technical report + code + model weights release - just by code and report alone we can learn so much about scaling such models for further use-cases.
4. Remember that while the model itself is important there’s a lot more artefacts behind it - The LLM (Helium 7B), Audio Codec (Mimi), Inference stack (based on Rust, possibly candle), Watermarking (possibly audioseal or a variant) and lots more.
5. It’s a v1 - this is the worst this tech will ever be! This team is less than 6 months old and they’ve managed to ship a world class open demo.
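A quick back-of-the-envelope check of the memory figures in point 1 (weights only; activations, the audio codec and KV cache add more):

params = 7e9                                     # ~7B parameters
print(f"bf16/fp16: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes per parameter -> ~14 GB
print(f"fp8/int8:  ~{params * 1 / 1e9:.0f} GB")  # 1 byte per parameter  -> ~7 GB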
It’s easy to dunk on those which build in public and be open about the shortcomings/ tidbits about the model. Tell me when your fav ClosedAI company does something similar.
Congratulations again to the Kyutai team! 🤗
PRESS RELEASE
Paris, July 3, 2024
Kyutai unveils today the very first voice-enabled AI openly accessible to all
In just 6 months, with a team of 8, the Kyutai research lab developed from scratch an
artificial intelligence (AI) model with unprecedented vocal capabilities called Moshi.
The team publicly unveiled its experimental prototype today in Paris. At the end of the
presentation, the participants – researchers, developers, entrepreneurs, investors and journalists
– were themselves able to interact with Moshi. The interactive demo of the AI will be accessible
from the Kyutai website at the end of the day. It can therefore be freely tested online as from
today, which constitutes a world first for a generative voice AI.
This new type of technology makes it possible for the first time to communicate in a smooth,
natural and expressive way with an AI. During the presentation, the Kyutai team interacted with
Moshi to illustrate its potential as a coach or companion for example, and its creativity through the
incarnation of characters in roleplays.
More broadly, Moshi has the potential to revolutionize the use of speech in the digital world.
For instance, its text-to-speech capabilities are exceptional in terms of emotion and interaction
between multiple voices.
Compact, Moshi can also be installed locally and therefore run safely on an unconnected
device.
With Moshi, Kyutai intends to contribute to open research in AI and to the development of the
entire ecosystem. The code and weights of the models will soon be freely shared, which is also
unprecedented for such technology. They will be useful both to researchers in the field and to
developers working on voice-based products and services. This technology can therefore be
studied in depth, modified, extended or specialized according to needs. The community will in
particular be able to extend Moshi's knowledge base and factuality, which are currently deliberately
limited in such a lightweight model, while exploiting its unparalleled voice interaction capabilities.
-----------------------------
About Kyutai
Kyutai is a non-profit laboratory dedicated to open research in AI, founded in November 2023 by the iliad Group,
CMA CGM and Schmidt Sciences. Launched with an initial team of six leading scientists, who have all worked with
Big Tech labs in the USA, Kyutai continues to recruit at the highest level, and also offers internships to research
Master’s degree students. Now comprising a dozen members, the team will launch its first PhD theses at the end
of the year. The research undertaken explores new general-purpose models with high capabilities. The lab is
currently working in particular on multimodality, i.e., the possibility for a model to exploit different types of content
(text, sound, images, etc.) both for learning and for inference. All the models developed are intended to be freely
shared, as are the software and know-how that enabled their creation. To carry out its work and train its models,
Kyutai relies in particular for its compute on the Nabu 23 superpod made available by Scaleway, a subsidiary of the
iliad Group.
Follow us on:
www.kyutai.org
X: @kyutai_labs
Contacts
For any requests for interviews and/or photos of the Kyutai team, please send an email to presse@kyutai.org

###
https://arxiv.org/abs/2407.03320
[Submitted on 3 Jul 2024]
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at this https URL.


InternLM2.5 has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

Outstanding reasoning capability: State-of-the-art performance on Math reasoning, surpassing models like Llama3 and Gemma2-9B.

1M Context window: Nearly perfect at finding needles in the haystack with 1M-long context, with leading performance on long-context tasks like LongBench. Try it with LMDeploy for 1M-context inference.

Stronger tool use: InternLM2.5 supports gathering information from more than 100 web pages, corresponding implementation will be released in Lagent soon. InternLM2.5 has better tool utilization-related capabilities in instruction following, tool selection and reflection. See examples.


InternLM-XComposer-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. IXC-2.5 is trained with 24K interleaved image-text contexts and can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to perform exceptionally well in tasks requiring extensive input and output contexts.

Ultra-High Resolution Understanding: IXC-2.5 enhances the dynamic resolution solution proposed in IXC2-4KHD with a native 560 × 560 ViT vision encoder, supporting high-resolution images with any aspect ratio.

Fine-Grained Video Understanding: IXC-2.5 treats videos as an ultra-high-resolution composite picture consisting of tens to hundreds of frames, allowing it to capture fine details through dense sampling and higher resolution for each frame.

Multi-Turn Multi-Image Dialogue: IXC-2.5 supports free-form multi-turn multi-image dialogue, allowing it to naturally interact with humans in multi-round conversations.

Webpage Crafting: IXC-2.5 can be readily applied to create webpages by composing source code (HTML, CSS, and JavaScript) following text-image instructions.

Composing High-Quality Text-Image Articles: IXC-2.5 leverages specially designed Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques to significantly enhance the quality of its written content.

Awesome performance: IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.

###
https://developer.nvidia.com/blog/introducing-dora-a-high-performing-alternative-to-lora-for-fine-tuning/
Introducing DoRA, a High-Performing Alternative to LoRA for Fine-Tuning
Jun 28, 2024
By Min-Hung Chen

Full fine-tuning (FT) is commonly employed to tailor general pretrained models for specific downstream tasks. To reduce the training cost, parameter-efficient fine-tuning (PEFT) methods have been introduced to fine-tune pretrained models with a minimal number of parameters. Among these, Low-Rank Adaptation (LoRA) and its variants have gained considerable popularity because they avoid additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning.

NVIDIA Research Taiwan and the NVIDIA Learning and Perception Research Group developed Weight-Decomposed Low-Rank Adaptation (DoRA), which could be the default replacement for LoRA. DoRA improves both the learning capacity and stability of LoRA, without introducing any additional inference overhead.

DoRA consistently outperforms LoRA across a wide variety of large language model (LLM) and vision language model (VLM) tasks, such as common-sense reasoning (+3.7/+1.0 on Llama 7B/13B, +2.9 on Llama 2 7B, and +4.4 on Llama 3 8B), Multi-Turn (MT) Benchmark (+0.4/+0.3 for Llama/Llama 2 7B), image/video-text understanding (+0.9/+1.9 on VL-BART), and visual instruction tuning (+0.6 on LLaVA 7B). DoRA has also been demonstrated in other tasks, including compression-aware LLM and text-to-image generation. This work has been accepted to ICML 2024 as an oral paper (1.5% acceptance rate).

Figure 1. Comparison of DoRA and LoRA on various tasks and backbones
How does DoRA work?
DoRA begins by decomposing the pretrained weight into its magnitude and directional components and then fine-tunes both. Given the substantial size of the directional component in terms of parameters, DoRA exploits LoRA for directional adaptation to enable efficient fine-tuning, as illustrated in Figure 2. Finally, DoRA can be merged with the pretrained weight before inference, thereby avoiding the introduction of additional latency.

Figure 2. An overview of DoRA
How does DoRA affect model training?
To investigate how DoRA affects model training, the magnitude and directional differences (∆M, ∆D) between the DoRA weight W’ and the pretrained weight W0 are visualized in Figure 3, alongside those of FT and LoRA. The regression lines for (∆D, ∆M) show a distinct negative slope for both DoRA and FT, in contrast to the clear positive correlation shown by LoRA. Different markers represent matrices at different training steps and different colors represent the matrices of each layer.

Figure 3. Magnitude and direction updates of FT, LoRA, and DoRA
DoRA demonstrates the ability to make only substantial directional adjustments with relatively minimal changes in magnitude or the reverse, while showing learning patterns closer to FT. This signifies its superior learning capacity over LoRA. For more qualitative and mathematical analyses, see DoRA: Weight-Decomposed Low-Rank Adaptation.

Performance
DoRA outperforms LoRA across a wide variety of models, including LLM, VLM, compressed LLM, and diffusion models.

Large language models
DoRA significantly outperforms LoRA in terms of the overall commonsense reasoning ability, as shown in Table 1. Moreover, DoRA can provide better conversation and instruction-following capabilities than LoRA, as demonstrated by the MT Benchmark in Table 2.

Model # Params (%) BoolQ PIQA SIQA HellaSwag WinoGrande ARC-e ARC-c OBQA Avg.
ChatGPT-3.5 – 73.1 85.4 68.5 78.5 66.1 89.8 79.9 74.8 77.0
Llama-LoRA 0.83 68.9 80.7 77.4 78.1 78.8 77.8 61.3 74.8 74.7
Llama-DoRA (Ours) 0.84 69.7 83.4 78.6 87.2 81.0 81.9 66.2 79.2 78.4
Llama 2-LoRA 0.83 69.8 79.9 79.5 83.6 82.6 79.8 64.7 81.0 77.6
Llama 2-DoRA (Ours) 0.84 72.0 83.1 79.9 89.1 83.0 84.5 71.0 81.2 80.5
Llama 3-LoRA 0.83 70.8 85.2 79.9 91.7 84.3 84.2 71.2 79.0 80.8
Llama 3-DoRA (Ours) 0.84 74.6 89.3 79.9 95.5 85.6 90.5 80.4 85.8 85.2
Table 1. Comparison of LoRA and DoRA on the commonsense reasoning benchmark
Model # Params (%) Score
Llama-LoRA 2.31 5.1
Llama-DoRA (Ours) 2.33 5.5
Llama-VeRA 0.02 4.3
Llama-DVoRA (Ours) 0.04 5.0
Llama 2-LoRA 2.31 5.7
Llama 2-DoRA (Ours) 2.33 6.0
Llama 2-VeRA 0.02 5.5
Llama 2-DVoRA (Ours) 0.04 6.0
Table 2. Comparison of LoRA and DoRA on MT-Bench (scored by GPT-4). DVoRA is obtained by integrating DoRA on VeRA
Vision language models
In addition to pure natural language processing (NLP), DoRA also outperforms LoRA in terms of image-text understanding (Table 3), video-text understanding (Table 4), and visual instruction tuning (Table 5) abilities.

Model # Params (%) VQAv2 GQA NVLR2 COCO Cap. Avg.
VLBART-LoRA 5.93 65.2 53.6 71.9 115.3 76.5
VLBART-DoRA (Ours) 5.96 65.8 54.7 73.1 115.9 77.4
Table 3. Comparison of LoRA and DoRA on image-text understanding tasks
Model # Params (%) TVQA How2QA TVC YC2C Avg.
VLBART-LoRA 5.17 75.5 72.9 44.6 140.9 83.5
VLBART-DoRA (Ours) 5.19 76.3 74.1 45.8 145.4 85.4
Table 4. Comparison of LoRA and DoRA on video-text understanding tasks
Model # Params (%) VQAv2 GQA Vis-Wiz SQA VQAT POPE MMBench Avg.
LLaVA-LoRA 4.61 79.1 62.9 47.8 68.4 58.2 86.4 66.1 66.9
LLaVA-DoRA (Ours) 4.63 78.6 62.9 52.2 69.9 57.0 87.2 66.1 67.6
Table 5. Comparison of LoRA and DoRA on visual instruction tuning tasks
Compression-aware LLMs
To further decrease the memory demands of PEFT fine-tuning, QLoRA suggests quantizing the pretrained model to 4-bit and fine-tuning LoRA on top of the frozen low-bit backbone. With DoRA, which narrows the gap between LoRA and FT, it is natural to also explore whether DoRA can enhance the accuracy of LoRA within the QLoRA framework.

Recently, our team collaborated with several researchers at Answer.AI on their QDoRA project, which substitutes the LoRA component in QLoRA with DoRA. The results show that QDoRA outperforms both FT and QLoRA on Llama 2 and Llama 3 (Figure 4).

Figure 4. Accuracy comparison of QDoRA and other methods on the Orca-Math dataset including 100K training samples
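
A minimal configuration sketch of the QDoRA recipe described above (4-bit quantized backbone with DoRA adapters on top). It assumes recent transformers and bitsandbytes releases and a peft version whose LoraConfig exposes use_dora and supports it on 4-bit layers; the base model id is a placeholder, and this is not Answer.AI's QDoRA code.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                          # quantize the frozen backbone to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",               # placeholder base model id
    quantization_config=bnb_config,
    device_map="auto",
)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,                              # swap the plain LoRA update for DoRA's decomposed update
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()              # only magnitudes + low-rank factors are trainable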
Text-to-image generation
DoRA can also be applied on DreamBooth for text-to-image personalization with the advanced training scripts developed by Hugging Face. Testing results on the challenging 3d_icon and lego_set datasets show that DoRA can obtain significantly better personalization results than LoRA under the same training configurations (Figure 5).

Figure 5. Personalization results using DreamBooth plus DoRA on the challenging 3D Icon (top) and Lego (bottom) datasets
Summary
DoRA is a generally efficient and effective training technique and will be supported soon by various NVIDIA services, platforms, and frameworks. DoRA is a fine-tuning method that is compatible with LoRA and its variants and exhibits a closer resemblance to FT learning behavior. DoRA consistently outperforms LoRA across various fine-tuning tasks and model architectures. Moreover, DoRA can be considered a costless replacement for LoRA, as its decomposed magnitude and direction components can be merged back into the pretrained weight after the training, ensuring that there is no extra inference overhead. We hope DoRA can help NVIDIA effectively adapt various foundation models to diverse applications in NVIDIA Metropolis, NVIDIA NeMo, NVIDIA NIM, NVIDIA TensorRT, audiovisual, robotics, generative AI, and more.

###
https://huggingface.co/facebook/multi-token-prediction
META
July 4, 2024

In April, we published a research paper on a new approach for building better and faster LLMs by using multi-token prediction. Using this approach, we can train language models to predict multiple future words at once, improving model capabilities and training efficiency while allowing for faster inference.
In the spirit of responsible open science, we’ve released pre-trained models for code completion using this approach to enable further exploration in the research community.
Get the model on Hugging Face ➡️
https://go.fb.me/dm1giu
More on this approach ➡️
https://go.fb.me/x1zhdq
Multi-token prediction models and baselines
Models accompanying the research paper "Better & Faster Large Language Models via Multi-token Prediction" (https://arxiv.org/abs/2404.19737).

Included are the following four 7B parameter models trained on code:

baseline model (n=1) trained on 200B tokens of code: 7B_200B_1/
multi-token prediction model (n=4) trained on 200B tokens of code: 7B_200B_4/
baseline model (n=1) trained on 1T tokens of code: 7B_1T_1/
multi-token prediction model (n=4) trained on 1T tokens of code: 7B_1T_4/
Tokenizer: standard Llama 2 SentencePiece tokenizer in tokenizer.model.

###
https://huggingface.co/papers/2407.01370
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
Published on Jul 2 · Submitted by philschmid on Jul 3 · #1 Paper of the day
Authors:

Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
Abstract
LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific insights repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects - Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and corresponding 50 RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56\%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.
How good are LLMs in a long context, and do we need RAG? 🤔 Summary of a Haystack (SummHay) tries to solve the limitations of “Needle in a Haystack” by focusing on challenging information extraction. Google DeepMind Gemini 1.5 pro performs the best with and without RAG (37-44%), while OpenAI GPT-4o and Anthropic Claude 3 Opus are below 20%. 👀
SummHay includes 92 subtopics for evaluating long-context LLMs and RAG. It was curated by synthesizing "Haystacks" with specific insights repeated across documents. LLMs need to generate summaries that identify relevant insights and accurately cite source documents. Performance is measured using Coverage (how well the summary captures the important insights) and Citation (how accurately the summary cites the source documents).
Insights
💡 RAG always improves the performance of LLMs if correct information is retrieved
📊 Evaluated 10 LLMs and 50 RAG systems, including GPT-4o, Claude 3 Opus, and Gemini-1.5-pro
🏆 Claude 3 Opus achieved the highest Coverage; Gemini-1.5-pro highest citation
🎯 Gemini-1.5-pro is the best LLM without RAG with 37.8; Claude 3 Sonnet 18.3; GPT-4o 11.4;
⚙️ Gemini-1.5-pro + Oracle RAG achieves 44.6, whereas humans achieved 56.1.
🔢 Full input is around 100,000 tokens, while Oracle RAG is reduced to 15,000 tokens
📈 Smaller Models like Claude 3 Haiku or Gemini 1.5 Flash outperform bigger LLMs (GPT-4o, Claude 3 Opus) with RAG

###
https://arxiv.org/abs/2407.01219
[Submitted on 1 Jul 2024]
Searching for Best Practices in Retrieval-Augmented Generation
Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a "retrieval as generation" strategy.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2407.01219 [cs.CL]
(or arXiv:2407.01219v1 [cs.CL] for this version)


###
https://huggingface.co/spaces/merve/RT-DETR-tracking-coco
RT-DETR is now supported in Hugging Face Transformers! 🙌
RT-DETR, short for “Real-Time DEtection TRansformer”, is a computer vision model developed at Peking University and Baidu, Inc. capable of real-time object detection. The authors claim better performance than YOLO models in both speed and accuracy. The model comes with an Apache 2.0 license, meaning people can freely use it for commercial applications. 🔥
RT-DETR is a follow-up work of DETR, a model developed by AI at Meta that successfully used Transformers for the first time for object detection. The latter has been in the Transformers library since 2020. After this, lots of improvements have been made to enable faster convergence and inference speed. RT-DETR is an important example of that as it unlocks real-time inference at high accuracy!
https://huggingface.co/papers/2304.08069
DETRs Beat YOLOs on Real-time Object Detection
Published on Apr 17, 2023
Authors:
Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen
Abstract
The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps, drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy. Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR.

###
https://github.com/andrewyng/translation-agent
Translation Agent: Agentic translation using reflection workflow
This is a Python demonstration of a reflection agentic workflow for machine translation (sketched in minimal form right after this list). The main steps are:

Prompt an LLM to translate a text from source_language to target_language;
Have the LLM reflect on the translation to come up with constructive suggestions for improving it;
Use the suggestions to improve the translation.
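
A minimal sketch of the three-step loop above, using a hypothetical complete() helper standing in for whichever LLM client you use; the package itself exposes translate(), shown under Usage below, so this is only an illustration of the workflow.

def complete(prompt: str) -> str:
    # Hypothetical helper: send a prompt to an LLM and return the text reply.
    raise NotImplementedError("wire this up to your LLM client of choice")

def agentic_translate(source_text: str, source_lang: str, target_lang: str) -> str:
    # 1) initial translation
    draft = complete(f"Translate this {source_lang} text into {target_lang}:\n{source_text}")
    # 2) reflection: ask the model to critique its own draft
    suggestions = complete(
        f"Source ({source_lang}):\n{source_text}\n\nTranslation ({target_lang}):\n{draft}\n\n"
        "List concrete suggestions to improve accuracy, fluency, style and terminology."
    )
    # 3) improvement: rewrite the draft using the suggestions
    return complete(
        f"Source:\n{source_text}\n\nDraft translation:\n{draft}\n\nSuggestions:\n{suggestions}\n\n"
        "Rewrite the translation, applying the suggestions."
    )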
Customizability
By using an LLM as the heart of the translation engine, this system is highly steerable. For example, by changing the prompts, it is easier with this workflow than with a traditional machine translation (MT) system to:

Modify the output's style, such as formal/informal.
Specify how to handle idioms and special terms like names, technical terms, and acronyms. For example, including a glossary in the prompt lets you make sure particular terms (such as open source, H100 or GPU) are translated consistently.
Specify specific regional use of the language, or specific dialects, to serve a target audience. For example, Spanish spoken in Latin America is different from Spanish spoken in Spain; French spoken in Canada is different from how it is spoken in France.
This is not mature software, and is the result of Andrew playing around with translations on weekends the past few months, plus collaborators (Joaquin Dominguez, Nedelina Teneva, John Santerre) helping refactor the code.

According to our evaluations using BLEU score on traditional translation datasets, this workflow is sometimes competitive with, but also sometimes worse than, leading commercial offerings. However, we’ve also occasionally gotten fantastic results (superior to commercial offerings) with this approach. We think this is just a starting point for agentic translations, and that this is a promising direction for translation, with significant headroom for further improvement, which is why we’re releasing this demonstration to encourage more discussion, experimentation, research and open-source contributions.

If agentic translations can generate better results than traditional architectures (such as an end-to-end transformer that inputs a text and directly outputs a translation) -- which are often faster/cheaper to run than our approach here -- this also provides a mechanism to automatically generate training data (parallel text corpora) that can be used to further train and improve traditional algorithms. (See also this article in The Batch on using LLMs to generate training data.)

Comments and suggestions for how to improve this are very welcome!

Getting Started
To get started with translation-agent, follow these steps:

Installation:
The Poetry package manager is required for installation (see the Poetry installation docs). Depending on your environment, this might work:
pip install poetry
A .env file with an OPENAI_API_KEY is required to run the workflow. See the .env.sample file as an example.
git clone https://github.com/andrewyng/translation-agent.git
cd translation-agent
poetry install
poetry shell # activates virtual environment
Usage:
import translation_agent as ta
source_lang, target_lang, country = "English", "Spanish", "Mexico"
source_text = "Some text to translate."  # placeholder input string
translation = ta.translate(source_lang, target_lang, source_text, country)
See examples/example_script.py for an example script to try out.

License
Translation Agent is released under the MIT License. You are free to use, modify, and distribute the code for both commercial and non-commercial purposes.

Ideas for extensions
Here are ideas we haven’t had time to experiment with but that we hope the open-source community will:

Try other LLMs. We prototyped this primarily using gpt-4-turbo. We would love for others to experiment with other LLMs as well as other hyperparameter choices and see if some do better than others for particular language pairs.
Glossary Creation. What’s the best way to efficiently build a glossary -- perhaps using an LLM -- of the most important terms that we want translated consistently? For example, many businesses use specialized terms that are not widely used on the internet and that LLMs thus don’t know about, and there are also many terms that can be translated in multiple ways. For example, “open source” in Spanish can be “Código abierto” or “Fuente abierta”; both are fine, but it’d be better to pick one and stick with it for a single document.
Glossary Usage and Implementation. Given a glossary, what’s the best way to include it in the prompt?
Evaluations on different languages. How does its performance vary in different languages? Are there changes that make it work better for particular source or target languages? (Note that for very high levels of performance, which MT systems are approaching, we’re not sure if BLEU is a great metric.) Also, its performance on lower resource languages needs further study.
Error analysis. We’ve found that specifying a language and a country/region (e.g., “Spanish as colloquially spoken in Mexico”) does a pretty good job for our applications. Where does the current approach fall short? We’re also particularly interested in understanding its performance on specialized topics (like law, medicine) or special types of text (like movie subtitles) to understand its limitations.
Better evals. Finally, we think better evaluations (evals) is a huge and important research topic. As with other LLM applications that generate free text, current evaluation metrics appear to fall short. For example, we found that even on documents where our agentic workflow captures context and terminology better, resulting in translations that our human raters prefer over current commercial offerings, evaluation at the sentence level (using the FLORES dataset) resulted in the agentic system scoring lower on BLEU. Can we design better metrics (perhaps using an LLM to evaluate translations?) that capture translation quality at a document level that correlates better with human preferences?
Related work
A few academic research groups are also starting to look at LLM-based and agentic translation. We think it’s early days for this field!

ChatGPT MT: Competitive for High- (but not Low-) Resource Languages, Robinson et al. (2023), https://arxiv.org/pdf/2309.07423
How to Design Translation Prompts for ChatGPT: An Empirical Study, Gao et al. (2023), https://arxiv.org/pdf/2304.02182v2
Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts, Wu et al. (2024), https://arxiv.org/pdf/2405.11804