Meta released Llama 3.1, available in 8B, 70B, and 405B sizes; the models are multilingual, carry a commercially usable license, and ship with quantized versions for efficient inference. DeepSeek improved DeepSeek-V2-Chat with the 0628 release, which now ranks highly on the LMSYS Chatbot Arena. OpenAI announced the cost-efficient GPT-4o mini model, and Apple released a 7B open-source LLM. Mistral released a 12B model, and Salesforce published its xLAM models. NVIDIA announced the Minitron models, which improve performance while reducing training cost. Google published a new RLHF method, and Apple introduced the LazyLLM method. A range of recent AI research papers is also covered.

Meta, Llama 3.1 Release

Link, July 24, 2024

  • Available in 8B, 70B, and 405B sizes
  • Supports 8 languages
  • Trained on more than 15T tokens; fine-tuned on more than 25M human and synthetic samples
  • Commercially usable license
  • FP8, AWQ, and GPTQ quantized versions for efficient inference
  • Available via the Hugging Face Inference API and HuggingChat (a Transformers usage sketch follows this list)
  • Supports a 128K-token context window
  • GPT-4o-level performance on a range of benchmarks
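
The checkpoints are published on Hugging Face, so the instruct models can also be run locally with the transformers library. A minimal sketch follows; the repo id, chat contents, and generation settings are assumptions rather than part of the announcement, and the 8B Instruct checkpoint is gated, so access must be requested and `huggingface-cli login` run first.

```python
# Minimal text-generation sketch for Llama 3.1 8B Instruct via transformers.
# Assumes transformers >= 4.43, accelerate installed, and access to the gated repo.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize what is new in Llama 3.1 in one sentence."},
]
out = generator(messages, max_new_tokens=128)
# The pipeline returns the chat history with the assistant turn appended.
print(out[0]["generated_text"][-1]["content"])
```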

DeepSeek, DeepSeek-V2-Chat-0628 Release

Link, July 19, 2024

  • Ranked #11 overall on the LMSYS Chatbot Arena, ahead of all other open-source models
  • #3 in the Coding Arena and #3 on Hard Prompts
  • +3.7 points on HumanEval over the previous version
  • +17.1 points on the MATH benchmark
  • +13.8 points on IFEval
  • +26.7 points on Arena-Hard
  • +7 points on the (internal) JSON output evaluation
  • Instruction following in the "system" area optimized, improving the user experience

OpenAI, GPT-4o mini Announcement

Link, July 18, 2024

  • Cost-efficient small model; scores 82% on MMLU
  • More than 60% cheaper than GPT-3.5 Turbo
  • 128K-token context window; supports up to 16K output tokens
  • Strong performance on text and vision, with multimodal reasoning
  • Built-in safety measures and comprehensive safety evaluations
  • Available in the Assistants API, Chat Completions API, and Batch API (a minimal call is sketched after this list)
  • Helps developers build and scale AI applications more efficiently and affordably
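
Since the model is exposed through the standard Chat Completions API, calling it only requires switching the model name. A minimal sketch with the official openai Python client (v1+) is shown below; the prompt contents are illustrative and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal Chat Completions call against the cost-efficient gpt-4o-mini model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=200,
    messages=[
        {"role": "system", "content": "You answer briefly."},
        {"role": "user", "content": "Extract the total amount from: 'Invoice #123, total due: $42.50'."},
    ],
)
print(response.choices[0].message.content)
```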

Apple, 7B Open-Source LLM Release

Link, July 16, 2024

  • 7B base model trained on 2.5T tokens
  • Primarily English data; 2048-token context window
  • MMLU score of 0.6372, ahead of Mistral (though below Llama 3)
  • Trained with PyTorch using the OpenLM framework
  • Available on Hugging Face and in Transformers

Mistral, Nemo 12B Model Release

Link, July 24, 2024

  • Supports a 128K context window and a new tokenizer, Tekken
  • Multilingual across 9 languages; released under Apache 2.0
  • The Instruct version supports function calling
  • Developed in collaboration with NVIDIA; trained on 3,072 H100 80GB GPUs
  • Available on Hugging Face

Salesforce, xLAM Models Announcement

Link, July 2024

  • 1.35B and 7B models, with up to 16K context length
  • Autonomously plan and execute tasks to achieve specific goals
  • Performance competitive with GPT-4 and Claude 3.5 on function calling
  • A 60K-example function-calling dataset generated with DeepSeek Coder is also released
  • Compatible with Transformers, with GGUF files provided (a hedged loading sketch follows this list)
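
Because the function-calling checkpoints (xLAM-1b-fc-r, xLAM-7b-fc-r) are stated to work with Transformers, a generic load-and-generate sketch is given below. The repo id and the prompt layout carrying the tool schema are assumptions; the official model card defines the exact format the models expect.

```python
# Hedged sketch: loading an xLAM function-calling checkpoint with transformers.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/xLAM-7b-fc-r"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical tool schema and user query for illustration only.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"city": {"type": "string", "description": "City name"}},
}]
query = "What's the weather in Seoul right now?"

# Plain prompt carrying the tool schema as JSON; the official template may differ.
prompt = (
    "You can call the following tools:\n"
    f"{json.dumps(tools, indent=2)}\n"
    f"User: {query}\nRespond with a JSON tool call."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```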

NVIDIA, Minitron 4B and 8B Models Release

Link, July 24, 2024

  • Models 2-4x smaller pruned and distilled from a large LLM
  • Up to 40x fewer training tokens, with up to a 16% improvement on MMLU
  • 94B training tokens; 256K vocabulary
  • Iterative pruning + distillation approach (a distillation-loss sketch follows this list)
  • Integrated with Hugging Face
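
The source notes further down recommend a plain KL-divergence loss over teacher and student logits for the distillation step. The snippet below is an illustrative version of that loss, not NVIDIA's training code; tensor shapes and the temperature are assumptions.

```python
# Illustrative logit-distillation loss: the pruned student is trained to match
# the frozen teacher's output distribution with a forward KL over the vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), summed over the vocabulary and batch-averaged.
    Logits are [batch, seq_len, vocab_size]."""
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Conventional T^2 scaling for temperature-softened distillation.
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

# Usage sketch: teacher logits come from the frozen large model on the same batch.
# loss = distillation_loss(student(**batch).logits, teacher(**batch).logits.detach())
# loss.backward(); optimizer.step()
```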

Google, J-BOND RLHF Method Announcement

Link, July 20, 2024

  • Introduces the Best-of-N Distillation (BOND) algorithm
  • Uses Monte Carlo sampling to estimate reward quantiles
  • Balances mode-covering and mode-seeking behavior via the Jeffreys divergence (a sketch of this objective follows the list)
  • Effectiveness demonstrated on several benchmarks
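
The core of the method is a distribution-matching objective: the policy is pulled toward the estimated Best-of-N distribution under the Jeffreys divergence, a weighted mix of forward and backward KL. The snippet below is only a conceptual illustration of that objective, not DeepMind's implementation; the tensor shapes, names, and the weighting beta are assumptions.

```python
# Conceptual sketch of the Jeffreys-divergence objective used in BOND/J-BOND.
import torch

def jeffreys_divergence(policy_logprobs: torch.Tensor,
                        target_logprobs: torch.Tensor,
                        beta: float = 0.5) -> torch.Tensor:
    """beta * KL(target || policy) + (1 - beta) * KL(policy || target).

    `target_logprobs` stands in for the estimated Best-of-N distribution;
    both inputs are log-probabilities over the same support, shape [..., K].
    """
    p = policy_logprobs.exp()
    q = target_logprobs.exp()
    forward_kl = (q * (target_logprobs - policy_logprobs)).sum(-1)   # mode-covering term
    backward_kl = (p * (policy_logprobs - target_logprobs)).sum(-1)  # mode-seeking term
    return (beta * forward_kl + (1.0 - beta) * backward_kl).mean()

# In J-BOND the target side is defined against a slowly moving anchor policy,
# which is itself updated as an exponential moving average of the trained policy.
```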

Apple, LazyLLM Method Announcement

Link, July 19, 2024

  • Dynamic token pruning for efficient long-context LLM inference (a conceptual sketch follows this list)
  • 2.34x speedup of the prefilling stage on the Llama 2 7B model
  • Shorter generation time while maintaining accuracy
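
Conceptually, the method recomputes at every generation step which prompt tokens are worth keeping in the KV computation, so tokens pruned earlier can be revived later. The helper below is a rough illustration of that selection step under assumed inputs, not Apple's implementation.

```python
# Conceptual illustration of dynamic token pruning in the spirit of LazyLLM.
import torch

def select_tokens(importance: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """importance: [num_prompt_tokens] scores for the current step
    (e.g., attention from the last token). Returns a boolean mask of the
    prompt tokens whose KV entries are computed at this step."""
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, k).indices
    mask = torch.zeros_like(importance, dtype=torch.bool)
    mask[keep] = True
    return mask

# Because the mask is recomputed at every decoding step, a token that was
# pruned at step t can re-enter the KV computation at step t+1.
```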

AI Research Papers

A Survey on Employing LLMs for Text-to-SQL Tasks

Link, July 21, 2024

  • Highlights the importance of text-to-SQL conversion for easier database access
  • Reviews new LLM-based methods
  • Discusses prompt engineering and fine-tuning approaches (an example prompt of this kind follows the list)
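
To make the prompt-engineering side concrete, here is a minimal zero-shot text-to-SQL prompt of the kind such methods build on. The schema, question, and the commented `generate` call are placeholders, not taken from the survey.

```python
# Minimal zero-shot text-to-SQL prompt (illustrative only).
schema = """CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_name TEXT,
    total_amount REAL,
    created_at DATE
);"""

question = "What is the total revenue per customer in 2023, highest first?"

prompt = f"""You are a text-to-SQL assistant. Given the schema below, write one SQLite
query that answers the question. Return only SQL.

Schema:
{schema}

Question: {question}
SQL:"""

# sql = generate(prompt)  # hypothetical call to any instruction-tuned LLM
print(prompt)
```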

A Survey of Prompt Engineering Methods for LLMs

Link, July 17, 2024

  • Discusses advances in prompt engineering techniques for LLMs
  • Organizes prompting methods across a range of NLP tasks
  • Surveys 44 research papers covering 39 prompting methods and 29 NLP tasks

Open Artificial Knowledge Dataset Released

Link, July 19, 2024

  • Highlights the need for high-quality, diverse, and ethically sourced datasets
  • Provides a dataset of more than 500 million tokens built around Wikipedia's main categories
  • Uses an ensemble of LLMs to maintain broad knowledge coverage, coherence, and factual accuracy

This edition of the AI newsletter covered major model releases and their technical details, along with the latest research papers. As AI continues to advance rapidly, we hope this information proves widely useful.

Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is:

(today's date, in year-month-day format) AI News,

Summary

(overall short summary with good detail; for the Summary section, explain the details starting with the company name, e.g., "OpenAI announced ~~~.")

Title,

company name, title

link, date,

  • detailed summary 1, (use concise point-form style)
  • detailed summary 2, (use concise point-form style)
  • detailed summary N, (use concise point-form style)

Title,

company name, title

link, date,

  • detailed summary 1, (use concise point-form style)
  • detailed summary 2, (use concise point-form style)
  • detailed summary N, (use concise point-form style)
###
https://llama.meta.com/
7/24/24
META
Llama 405B is here, and it comes with more than expected! 🚨 Meta Llama 3.1 comes in 3 sizes, 8B, 70B, and 405B, and speaks 8 languages! 🌍 Llama 3.1 405B matches or beats OpenAI's GPT-4o across many text benchmarks.
New and improvements of 3.1✨:
🧮 8B, 70B & 405B versions as Instruct and Base with 128k context
🌐 Multilingual, supports 8 languages, including English, German, French, and more.
🔠 Trained on >15T Tokens & fine-tuned on 25M human and synthetic samples
📃 Commercial friendly license with allowance to use model outputs to improve other LLMs
⚖️ Quantized versions in FP8, AWQ, and GPTQ for efficient inference.
🚀 Llama 3 405B matches and beats GPT-4o on many benchmarks
🧑🏻‍💻 8B & 70B improved coding and instruction following by up to 12%
⚒️ Supports Tool use and Function Calling
🤖 Llama 3.1 405B available on Hugging Face Inference API and in HuggingChat
🤗 Available on @huggingface
🔜 1-click deployments on Hugging Face, Amazon SageMaker, Google Cloud


Big Kudos to Meta for releasing Llama 3.1, including 405B. This will help everyone accelerate and adopt AI more easily and faster. ❤️
Llama 3.1 is here!
8B, 70B, and 405B versions are available.
Results on common benchmarks suggest that Llama 3.1 405B is a GPT-4o level model. Closes the gap on both GPT-4o and Claude 3.5 Sonnet.
Here is my full video with an overview of Llama 3.1, takeaways, first impressions, and test cases:

More results:
128K tokens context window supported. "Pre-trains a model with 405B parameters on 15.6T tokens using a context window of 8K tokens. This standard pre-training stage is followed by a continued pre-training stage that increases the supported context window to 128K tokens."
Llama 3.1 405B shows strong performance on a variety of proficiency exams. "We observe that the performance of our Llama 3 405B model is very similar to Claude 3.5 Sonnet and GPT-4o. Our 70B model has an even more impressive performance. It is significantly better than GPT-3.5 Turbo and beats Nemotron 4 340B on many tests."
The 405B results are comparable to Claude 3.5 Sonnet and GPT-4o on common code generation benchmarks.
Uses a five-stage compositional training approach to add multimodal capabilities. That's right, this model has strong vision and video recognition capabilities too.
The 405B model was quantized from 16-bit (BF16) to 8-bit (FP8) which helps to reduce the compute requirements.
Llama 3.1 405B is trained on up to 16K H100 GPUs!

Model Information
The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.

Model developer: Meta

Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Llama 3.1 (text only); training data: a new mix of publicly available online data.

Params  Input modalities   Output modalities           Context length  GQA  Token count  Knowledge cutoff
8B      Multilingual Text  Multilingual Text and code  128k            Yes  15T+         December 2023
70B     Multilingual Text  Multilingual Text and code  128k            Yes  15T+         December 2023
405B    Multilingual Text  Multilingual Text and code  128k            Yes  15T+         December 2023
Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.1 family of models. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Model Release Date: July 23, 2024.

Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

License: A custom commercial license, the Llama 3.1 Community License, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE

Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

Intended Use
Intended Use Cases Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases.

Out-of-scope Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card.

Note: Llama 3.1 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.1 models for languages beyond the 8 supported languages provided they comply with the Llama 3.1 Community License and the Acceptable Use Policy and in such cases are responsible for ensuring that any uses of Llama 3.1 in additional languages is done in a safe and responsible manner.

Hardware and Software
Training Factors We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.

Training Energy Use Training utilized a cumulative of 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

Training Greenhouse Gas Emissions Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

Model           Training Time (GPU hours)  Training Power Consumption (W)  Location-Based GHG Emissions (tons CO2eq)  Market-Based GHG Emissions (tons CO2eq)
Llama 3.1 8B    1.46M                      700                             420                                        0
Llama 3.1 70B   7.0M                       700                             2,040                                      0
Llama 3.1 405B  30.84M                     700                             8,930                                      0
Total           39.3M                                                      11,390                                     0
The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

Training Data
Overview: Llama 3.1 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.

Data Freshness: The pretraining data has a cutoff of December 2023.

###
https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628
Deepseek changed the game with v2 chat 0628 - The best open LLM on LMSYS arena right now - 236B parameter model with 21B active parameters. It also excels at coding (rank #3) and arena hard problems (rank #3)
7/19/24
DeepSeek-V2-Chat-0628
1. Introduction
DeepSeek-V2-Chat-0628 is an improved version of DeepSeek-V2-Chat. For model details, please visit DeepSeek-V2 page for more information.

DeepSeek-V2-Chat-0628 has achieved remarkable performance on the LMSYS Chatbot Arena Leaderboard:

Overall Ranking: #11, outperforming all other open-source models.


Coding Arena Ranking: #3, showcasing exceptional capabilities in coding tasks.


Hard Prompts Arena Ranking: #3, demonstrating strong performance on challenging prompts.


2. Improvement
Compared to the previous version DeepSeek-V2-Chat, the new version has made the following improvements:

Benchmark DeepSeek-V2-Chat DeepSeek-V2-Chat-0628 Improvement
HumanEval 81.1 84.8 +3.7
MATH 53.9 71.0 +17.1
BBH 79.7 83.4 +3.7
IFEval 63.8 77.6 +13.8
Arena-Hard 41.6 68.3 +26.7
JSON Output (Internal) 78 85 +7
Furthermore, the instruction following capability in the "system" area has been optimized, significantly enhancing the user experience for immersive translation, RAG, and other tasks.

###
https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
July 18, 2024

GPT-4o mini: advancing cost-efficient intelligence
Introducing our most cost-efficient small model

OpenAI is committed to making intelligence as broadly accessible as possible. Today, we're announcing GPT-4o mini, our most cost-efficient small model. We expect GPT-4o mini will significantly expand the range of applications built with AI by making intelligence much more affordable. GPT-4o mini scores 82% on MMLU and currently outperforms GPT-4 on chat preferences on the LMSYS leaderboard. It is priced at 15 cents per million input tokens and 60 cents per million output tokens, an order of magnitude more affordable than previous frontier models and more than 60% cheaper than GPT-3.5 Turbo.

GPT-4o mini enables a broad range of tasks with its low cost and latency, such as applications that chain or parallelize multiple model calls (e.g., calling multiple APIs), pass a large volume of context to the model (e.g., full code base or conversation history), or interact with customers through fast, real-time text responses (e.g., customer support chatbots).

Today, GPT-4o mini supports text and vision in the API, with support for text, image, video and audio inputs and outputs coming in the future. The model has a context window of 128K tokens, supports up to 16K output tokens per request, and has knowledge up to October 2023. Thanks to the improved tokenizer shared with GPT-4o, handling non-English text is now even more cost effective.

A small model with superior textual intelligence and multimodal reasoning
GPT-4o mini surpasses GPT-3.5 Turbo and other small models on academic benchmarks across both textual intelligence and multimodal reasoning, and supports the same range of languages as GPT-4o. It also demonstrates strong performance in function calling, which can enable developers to build applications that fetch data or take actions with external systems, and improved long-context performance compared to GPT-3.5 Turbo.

GPT-4o mini has been evaluated across several key benchmarks.

Reasoning tasks: GPT-4o mini is better than other small models at reasoning tasks involving both text and vision, scoring 82.0% on MMLU, a textual intelligence and reasoning benchmark, as compared to 77.9% for Gemini Flash and 73.8% for Claude Haiku.

Math and coding proficiency: GPT-4o mini excels in mathematical reasoning and coding tasks, outperforming previous small models on the market. On MGSM, measuring math reasoning, GPT-4o mini scored 87.0%, compared to 75.5% for Gemini Flash and 71.7% for Claude Haiku. GPT-4o mini scored 87.2% on HumanEval, which measures coding performance, compared to 71.5% for Gemini Flash and 75.9% for Claude Haiku.

Multimodal reasoning: GPT-4o mini also shows strong performance on MMMU, a multimodal reasoning eval, scoring 59.4% compared to 56.1% for Gemini Flash and 50.2% for Claude Haiku.

Model evaluation scores (accuracy, %):

Benchmark   GPT-4o mini  Gemini Flash  Claude Haiku  GPT-3.5 Turbo  GPT-4o
MMLU        82.0         77.9          73.8          69.8           88.7
GPQA        40.2         38.6          35.7          30.8           53.6
DROP        79.7         78.4          78.4          70.2           83.4
MGSM        87.0         75.5          71.7          56.3           90.5
MATH        70.2         40.9          40.9          43.1           76.6
HumanEval   87.2         71.5          75.9          68.0           90.2
MMMU        59.4         56.1          50.2          0.0            69.1
MathVista   56.7         58.4          46.4          0.0            63.8
As part of our model development process, we worked with a handful of trusted partners to better understand the use cases and limitations of GPT-4o mini. We partnered with companies like Ramp and Superhuman who found GPT-4o mini to perform significantly better than GPT-3.5 Turbo for tasks such as extracting structured data from receipt files or generating high quality email responses when provided with thread history.

Built-in safety measures
Safety is built into our models from the beginning, and reinforced at every step of our development process. In pre-training, we filter out information that we do not want our models to learn from or output, such as hate speech, adult content, sites that primarily aggregate personal information, and spam. In post-training, we align the model’s behavior to our policies using techniques such as reinforcement learning with human feedback (RLHF) to improve the accuracy and reliability of the models’ responses.

GPT-4o mini has the same safety mitigations built-in as GPT-4o, which we carefully assessed using both automated and human evaluations according to our Preparedness Framework and in line with our voluntary commitments. More than 70 external experts in fields like social psychology and misinformation tested GPT-4o to identify potential risks, which we have addressed and plan to share the details of in the forthcoming GPT-4o system card and Preparedness scorecard. Insights from these expert evaluations have helped improve the safety of both GPT-4o and GPT-4o mini.

Building on these learnings, our teams also worked to improve the safety of GPT-4o mini using new techniques informed by our research. GPT-4o mini in the API is the first model to apply our instruction hierarchy method, which helps to improve the model’s ability to resist jailbreaks, prompt injections, and system prompt extractions. This makes the model’s responses more reliable and helps make it safer to use in applications at scale.

We’ll continue to monitor how GPT-4o mini is being used and improve the model’s safety as we identify new risks.

Availability and pricing
GPT-4o mini is now available as a text and vision model in the Assistants API, Chat Completions API, and Batch API. Developers pay 15 cents per 1M input tokens and 60 cents per 1M output tokens (roughly the equivalent of 2500 pages in a standard book). We plan to roll out fine-tuning for GPT-4o mini in the coming days.

In ChatGPT, Free, Plus and Team users will be able to access GPT-4o mini starting today, in place of GPT-3.5. Enterprise users will also have access starting next week, in line with our mission to make the benefits of AI accessible to all.

What’s Next
Over the past few years, we’ve witnessed remarkable advancements in AI intelligence paired with substantial reductions in cost. For example, the cost per token of GPT-4o mini has dropped by 99% since text-davinci-003, a less capable model introduced in 2022. We’re committed to continuing this trajectory of driving down costs while enhancing model capabilities.

We envision a future where models become seamlessly integrated in every app and on every website. GPT-4o mini is paving the way for developers to build and scale powerful AI applications more efficiently and affordably. The future of AI is becoming more accessible, reliable, and embedded in our daily digital experiences, and we’re excited to continue to lead the way.



###
https://huggingface.co/apple/DCLM-7B
Apple
7/16/24
Apple has entered the game! Apple just released a 7B open-source LLM, weights, training code, and dataset! 👀
TL;DR:
🧠 7B base model, trained on 2.5T tokens from open datasets
🌐 Primarily English data and a 2048 context window
📈 Combined DCLM-BASELINE, StarCoder, and ProofPile2 data
🏆 MMLU 0.6372 > Mistral & < Llama3
🔓 Open License with Apple Sample Code License
📊 Matches closed-dataset models like Mistral
🔬 Trained using PyTorch with OpenLM framework
🤗 Available on Hugging Face and in Transformers
Paper: DataComp-LM: In search of the next generation of training sets for language models
Model Card for DCLM-Baseline-7B
DCLM-Baseline-7B is a 7 billion parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. This model is designed to showcase the effectiveness of systematic data curation techniques for improving language model performance.

###
https://huggingface.co/mistralai?search_models=nemo
Mistral releases 12B open LLM! 🤯 Mistral Nemo comes as a base and instruct version with a 128k context window and is multilingual in 9 languages with a new tokenizer. 👀
TL;DR:
🧠 12B Base and Instruct drop-in-replacement for Mistral 7B
🪟 Supports 128k context window and new tokenizer, Tekken, based on Tiktoken
🌍 Base Model multilingual in English, French, German, Spanish, Italian and more
🔓 Released under Apache 2.0
🏆 Base MMLU 68.0%; Instruct 53.4% MixEval Hard;
⚒️ Instruct versions support function calling
🤯 Quantization-aware training for FP8 inference without any performance loss
🤝 Created as a Collaboration between NVIDIA and Mistral AI
🚀 Trained on 3,072 H100 80GB on DGX Cloud
🤗 Available on Hugging Face


###
https://github.com/SalesforceAIResearch/xLAM
Salesforce
07.2024
Missed it, Salesforce released xLAM - 1.35B & 7B Large Action Models (up to 16K context length) ⚡️
LAMs autonomously plan and execute tasks to achieve specific goals!
> Competitive with GPT4 & Claude 3.5 on BFCL (function calling leaderboard)
Beats pretty much all open access models (command r plus, Mixtral 8x22B etc)
> 7B scores 88.24% whilst the 2B scores 78.94% on BFCL
> They release a function calling dataset with 60K entries created with DeepSeek Coder
> Each datapoint is verified through three hierarchical stages: format checking, actual function executions, and semantic verification
> Works out of the box with Transformers 🤗
> They also ship GGUFs compatible with llama.cpp 🦙
Kudos to Salesforce, the LAMs look quite powerful and more so thanks for releasing the dataset too! 🤗
[07.2024]: We are excited to announce the release of our two function-calling models: xLAM-1b-fc-r and xLAM-7b-fc-r. These models have achieved impressive rankings, placing #3 and #25 on the Berkeley Function-Calling Leaderboard, outperforming many significantly larger models. Stay tuned for more powerful models coming soon.
Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories.

This repo introduces xLAM that aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training.

###
https://huggingface.co/collections/nvidia/minitron-669ac727dc9c86e6ab7f0f3e
NVIDIA
7/24/24
Nvidia releases Minitron 4B & 8B - iteratively pruning and distilling 2-4x smaller models from large LLMs, requiring 40x fewer training tokens and with 16% improvement on MMLU! 🔥
Distilled model (w/ pruning + retraining) beats teacher!
> Competitive with L3 8B/ Mistral 7B with fractional compute + training tokens.
> 94B training tokens only.
> 256K vocab.
> Integrated with transformers.
Best practices:
1. Train a big LLM, iteratively prune + distil + retrain.
2. Use KL Divergence as the loss function for distillation.
3. Logit loss is sufficient for retraining/ distilling, so there is no need for CLM loss.
4. Iterative (instead of one-shot) pruning results in the student model outperforming the teacher.
5. Depth + Width pruning results in the best performance.
6. Lightweight Neural Architecture Search for distilled checkpoints.
And many more in the paper..
They released the base checkpoints for 8B and 4B; it would be cool to see the instruct checkpoints, too!
Kudos Nvidia! 🤗
Now... who's going to do this to L3.1 405B? ;)
Minitron is a family of small language models (SLMs) obtained by pruning NVIDIA's Nemotron-4 15B model. We prune model embedding size, attention heads, and MLP intermediate dimension, following which, we perform continued training with distillation to arrive at the final models.

Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our arXiv paper for more details.

Minitron models are for research and development only.

###
https://huggingface.co/papers/2407.14622
Google
The RLHF method behind Google Gemma 1.1 has been released. 👀 J-BOND, a new RLHF method from Google DeepMind, was used to fine-tune the Gemma 1.1 2B and 7B models. J-BOND introduces a Best-of-N Distillation (BOND) algorithm that emulates Best-of-N sampling using Monte Carlo sampling to estimate reward quantiles.
Implementation:
1️⃣ Collect datasets of prompts and a Reward Model
2️⃣ Generate 1 sample from current policy and 2 samples from anchor (ref model) for each prompt
3️⃣ Compute forward KL gradient using the best of the 2 anchor samples
4️⃣ Compute backward KL gradient using the policy sample and a reward function rJ-BOND
5️⃣ Update policy weights using a combined gradient (Jeffreys divergence + KL regularization)
6️⃣ Update anchor model using Exponential Moving Average (EMA)
Insights:
📌 The Anchor Model in J-BOND is the “reference” model initialized from the SFT model
🏆 J-BOND outperforms REINFORCE baselines in terms of reward/KL trade-off
🤖 Was used to RLHF Gemma 1.1 2B and 7B, no mention of Gemma2
🐢 Slower updates of the anchor model improve the stability of the training
🎯 The anchor model serves as a moving target for the policy to improve upon
📉 No mention of common benchmarks like MT Bench, Alpaca Eval, or Arena Hard
BOND: Aligning LLMs with Best-of-N Distillation
Published on Jul 20

Abstract
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.

###
https://huggingface.co/papers/2407.14057
Published on Jul 19
Apple
Apple presents LazyLLM
Dynamic Token Pruning for Efficient Long Context LLM Inference

The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34x while maintaining accuracy.

###
https://arxiv.org/abs/2407.15186
[Submitted on 21 Jul 2024]
A Survey on Employing Large Language Models for Text-to-SQL Tasks
Liang Shi, Zhengju Tang, Zhi Yang
The increasing volume of data stored in relational databases has led to the need for efficient querying and utilization of this data in various sectors. However, writing SQL queries requires specialized knowledge, which poses a challenge for non-professional users trying to access and query databases. Text-to-SQL parsing solves this issue by converting natural language queries into SQL queries, thus making database access more accessible for non-expert users. To take advantage of the recent developments in Large Language Models (LLMs), a range of new methods have emerged, with a primary focus on prompt engineering and fine-tuning. This survey provides a comprehensive overview of LLMs in text-to-SQL tasks, discussing benchmark datasets, prompt engineering, fine-tuning methods, and future research directions. We hope this review will enable readers to gain a broader understanding of the recent advances in this field and offer some insights into its future trajectory.

###
https://arxiv.org/abs/2407.14371
[Submitted on 19 Jul 2024]
Open Artificial Knowledge
Vadim Borisov, Richard H. Schreiber
The tremendous success of chat-based AI systems like ChatGPT, Claude, and Gemini stems from Large Language Models (LLMs) trained on vast amounts of data. However, acquiring high-quality, diverse, and ethically sourced training data remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the moment of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs, including GPT4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data scarcity and privacy in LLM training, and it is freely available on this http URL.

###
https://arxiv.org/abs/2407.12994
[Submitted on 17 Jul 2024]
A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks
Shubham Vatsal, Harsh Dubey
Large language models (LLMs) have shown remarkable performance on many different Natural Language Processing (NLP) tasks. Prompt engineering plays a key role in adding more to the already existing abilities of LLMs to achieve significant performance gains on various NLP tasks. Prompt engineering requires composing natural language instructions called prompts to elicit knowledge from LLMs in a structured way. Unlike previous state-of-the-art (SoTA) models, prompt engineering does not require extensive parameter re-training or fine-tuning based on the given NLP task and thus solely operates on the embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently extract LLMs' knowledge through a basic natural language conversational exchange or prompt engineering, allowing more and more people even without deep mathematical machine learning background to experiment with LLMs. With prompt engineering gaining popularity in the last two years, researchers have come up with numerous engineering techniques around designing prompts to improve accuracy of information extraction from the LLMs. In this paper, we summarize different prompting techniques and club them together based on different NLP tasks that they have been used for. We further granularly highlight the performance of these prompting strategies on various datasets belonging to that NLP task, talk about the corresponding LLMs used, present a taxonomy diagram and discuss the possible SoTA for specific datasets. In total, we read and present a survey of 44 research papers which talk about 39 different prompting methods on 29 different NLP tasks of which most of them have been published in the last two years.