Summary

Today's AI news covers the latest technology and research announced by PyTorch, Microsoft, OpenAI, Amazon, Apple, and several research groups. PyTorch announced FlashAttention-3, delivering notable speedups in attention computation, and Microsoft introduced AgentInstruct, a new way to teach LLMs with agent-generated synthetic data. OpenAI and Los Alamos National Laboratory announced a collaboration on the safe use of AI in laboratory settings, and Amazon shared how to fine-tune Anthropic's Claude 3 Haiku model. Apple introduced an RLAIF framework that improves code generation in lightweight LLMs, and research teams published findings on the visual limitations of VLMs and new approaches to solving math problems.

Top News

PyTorch Announces FlashAttention-3

Link, July 11, 2024,

  • FlashAttention-3 introduces techniques that exploit asynchrony and low precision to make attention faster and more accurate.
  • Designed to overlap data movement and computation using Hopper's Tensor Cores and TMA.
  • 1.5-2.0x faster than FlashAttention-2 in FP16, reaching 75% of the H100 GPU's theoretical peak FLOPS.
  • Reaches close to 1.2 PFLOPS in FP8, with 2.6x smaller numerical error than baseline FP8 attention.
  • FlashAttention-3 is available on GitHub.

Microsoft Announces AgentInstruct

Link, July 4, 2024,

  • AgentInstruct is a new method for teaching an LLM new skills or behaviors using synthetic data generated by LLM agents.
  • Improved the Orca-3 model by roughly 20% across all benchmarks and matched GPT-4 on RAG.
  • Transforms raw data into high-quality training data through multi-agent workflows.
  • Uses raw unstructured text as input to generate diverse training data.

OpenAI Partners with Los Alamos National Laboratory

Link, July 10, 2024,

  • OpenAI and Los Alamos National Laboratory are jointly developing evaluations for the safe use of AI models in laboratory settings.
  • A first-of-its-kind experiment will assess whether the GPT-4o model can support bioscience research.
  • The work explores how AI's multimodal capabilities could assist both experts and novices.
  • The partnership is expected to contribute to state-of-the-art research on AI biosecurity evaluations.

Amazon Shares How to Fine-Tune Anthropic Claude 3 Haiku

Link, July 10, 2024,

  • Fine-tuning the Anthropic Claude 3 Haiku model in Amazon Bedrock delivers optimal performance on specific domains or tasks.
  • Fine-tuning is useful for classification, structured outputs, industry knowledge, and tool/API usage.
  • In early testing, classification accuracy improved from 81.5% to 99.6% and tokens per query dropped by 89%.
  • Amazon Bedrock is the only fully managed service that lets you fine-tune Anthropic Claude models.

Apple Presents an RLAIF Framework for Code Generation

Link, July 2024,

  • Apple introduced an RLAIF framework to improve the code generation abilities of lightweight LLMs.
  • Feedback from a larger LLM is used to train a reward model that improves the smaller LLM's performance.
  • Achieved a 4.5% improvement in code executability rate; a 780M-parameter model outperformed a 7B-parameter fine-tuned baseline.
  • Experiments on the Gorilla dataset evaluated code quality across multiple metrics.

Research News

The Visual Limits of Vision Language Models

Link, July 11, 2024,

  • A new study finds that vision language models (VLMs) struggle with simple visual tasks and show low accuracy.
  • BlindTest, a suite of simple visual tasks, is used to probe the limits of VLMs.
  • Four state-of-the-art VLMs averaged only 56.20% accuracy, with the best model (Sonnet-3.5) at 73.77%.
  • The study highlights that VLMs struggle with tasks requiring precise spatial information and counting.

NuminaMath 7B TIR Released

Link, July 11, 2024,

  • NuminaMath 7B TIR presents a new approach to solving complex math problems and won the first progress prize of the AI Math Olympiad (AIMO).
  • Solves problems using Chain-of-Thought reasoning combined with a Python REPL.
  • Fine-tuned in two stages of supervised fine-tuning to improve math problem solving.
  • Capable of solving problems at the AMC 12 level.

New Decoding Technique DoLa Merged into Transformers

Link, July 11, 2024,

  • DoLa decoding substantially reduces hallucinations in transformer models.
  • Selects the next token using how logits change between the lower and higher layers of the transformer.
  • Improves results by 5-20 points across various benchmarks.
  • The runtime increase is negligible (1-8%), making it practical to use.
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary with good details; for the Summary section, explain the details starting with the company name, e.g. "OpenAI에서는 ~~~를 발표하였습니다.", i.e. "OpenAI announced ~~~.")

Title,

company name, 제목

링크, date,

  • detailed summary1, (use a concise, bullet-point style)
  • detailed summary2, (use a concise, bullet-point style)
  • detailed summary N, (use a concise, bullet-point style)

Title,

company name, 제목

링크, date,

  • detailed summary1, (use a concise, bullet-point style)
  • detailed summary2, (use a concise, bullet-point style)
  • detailed summary N, (use a concise, bullet-point style)
###
https://pytorch.org/blog/flashattention-3/
July 11, 2024
PyTorch

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

by Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and is now used by most libraries to accelerate Transformer training and inference. This has contributed to a massive increase in LLM context length in the last two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4), or even 1M (Llama 3). However, despite its success, FlashAttention has yet to take advantage of new capabilities in modern hardware, with FlashAttention-2 achieving only 35% utilization of theoretical max FLOPs on the H100 GPU. In this blogpost, we describe three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) incoherent processing that leverages hardware support for FP8 low-precision.

We’re excited to release FlashAttention-3 that incorporates these techniques. It’s 1.5-2.0x faster than FlashAttention-2 with FP16, up to 740 TFLOPS, i.e., 75% utilization of H100 theoretical max FLOPS. With FP8, FlashAttention-3 reaches close to 1.2 PFLOPS, with 2.6x smaller error than baseline FP8 attention.

FlashAttention-3 is available at: https://github.com/Dao-AILab/flash-attention
Paper
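
For readers who want to try the kernels, the sketch below shows the Python-level call pattern of the flash-attn package. It follows the FlashAttention-2 interface (flash_attn.flash_attn_func), which is assumed here for illustration; the FlashAttention-3 Hopper build in the repository may expose a slightly different entry point:

import torch
from flash_attn import flash_attn_func  # FlashAttention-2 interface; FA3's Hopper entry point may differ

batch, seqlen, nheads, headdim = 2, 8192, 16, 128
# FlashAttention expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on a CUDA device.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # fused attention; the full S = QK^T matrix is never materialized
print(out.shape)  # torch.Size([2, 8192, 16, 128])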

FLASHATTENTION RECAP
FlashAttention is an algorithm that reorders the attention computation and leverages tiling and recomputation to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. We use tiling to load blocks of inputs from HBM (GPU memory) to SRAM (fast cache), perform attention with respect to that block, and update the output in HBM. By not writing the large intermediate attention matrices to HBM, we reduce the amount of memory reads/writes, which brings 2-4x wallclock time speedup.

Here we show a diagram of FlashAttention forward pass: with tiling and softmax rescaling, we operate by blocks and avoid having to read/write from HBM, while obtaining the correct output with no approximation.

[Figure: FlashAttention forward pass with tiling and softmax rescaling]
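
To make the recap concrete, here is a minimal PyTorch reference of the blockwise forward pass with online softmax rescaling (single head, no masking, float32). This is a readability sketch of the algorithm described above, not the fused CUDA kernel:

import torch

def blockwise_attention(q, k, v, block_size=128):
    """Reference sketch of the FlashAttention forward pass: iterate over K/V
    tiles, keep a running row-max and row-sum, and rescale the partial output
    so the full attention matrix is never materialized."""
    seqlen_q, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                        # running (unnormalized) output
    row_max = torch.full((seqlen_q, 1), float("-inf"))
    row_sum = torch.zeros(seqlen_q, 1)

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]             # load one K/V tile
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                       # scores for this tile only
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)    # rescale previously accumulated partials
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum                             # final softmax normalization

q, k, v = (torch.randn(1024, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), reference, atol=1e-4)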

NEW HARDWARE FEATURES ON HOPPER GPUS - WGMMA, TMA, FP8
While FlashAttention-2 can achieve up to 70% theoretical max FLOPS on Ampere (A100) GPUs, it does not yet take advantage of new features on Hopper GPUs to maximize performance. We describe some of the new Hopper-specific features here, and why they are important.

1. WGMMA (Warpgroup Matrix Multiply-Accumulate). This new feature makes use of the new Tensor Cores on Hopper, with much higher throughput than the older mma.sync instruction in Ampere (image from the H100 white paper).


2. TMA (Tensor Memory Accelerator). This is a special hardware unit that accelerates the transfer of data between global memory and shared memory, taking care of all index calculation and out-of-bound predication. This frees up registers, which is a valuable resource to increase tile size and efficiency.


3. Low-precision with FP8. This doubles the Tensor Core throughput (e.g. 989 TFLOPS with FP16 and 1978 TFLOPS with FP8), but trades off accuracy by using fewer bits to represent floating point numbers.


FlashAttention-3 makes use of all of these new features of Hopper, using powerful abstractions from NVIDIA’s CUTLASS library.

By rewriting FlashAttention to use these new features, we can already significantly speed it up (e.g., from 350 TFLOPS in FlashAttention-2 FP16 forward pass to around 540-570 TFLOPS). However, the asynchronous nature of the new instructions on Hopper (WGMMA and TMA) opens up additional algorithmic opportunities to overlap operations and thereby extract even greater performance. For this blogpost, we’ll explain two such techniques specific to attention. The generic technique of warp specialization, with separate producer and consumer warps doing TMA and WGMMA, is well-covered elsewhere in the context of GEMM and works the same here.

ASYNCHRONY: OVERLAPPING GEMM AND SOFTMAX
Why overlap?

Attention has GEMMs (those matmuls between Q and K and between attention probability P and V) and softmax as its two main operations. Why do we need to overlap them? Isn’t most of the FLOPS in the GEMMs anyway? As long as the GEMMs are fast (e.g., computed using WGMMA instructions), shouldn’t the GPU be going brrrr?

The problem is that non-matmul operations are much slower than matmul operations on modern accelerators. Special functions such as exponential (for the softmax) have even lower throughput than floating point multiply-add; they are evaluated by the multi-function unit, a unit separate from floating point multiply-add or matrix multiply-add. As an example, the H100 GPU SXM5 has 989 TFLOPS of FP16 matrix multiply, but only 3.9 TFLOPS (256x less throughput) for special functions! For head dimension 128, there are 512x more matmul FLOPS than exponential, which means that exponential can take 50% of the time compared to matmul. The situation is even worse for FP8, where the matmul FLOPS are twice as fast yet exponential FLOPS stay the same speed. Ideally we want matmul and softmax to operate in parallel. While the Tensor Cores are busy with matmul, the multi-function units should be calculating exponential!
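
A quick back-of-the-envelope script reproduces the accounting in this paragraph; the 2 * 2 * head_dim factor counts the QK^T and PV multiply-adds associated with each exponential (one exp per attention score):

matmul_tflops = 989        # H100 SXM5 FP16 tensor-core throughput, from the text above
mufu_tflops = 3.9          # special-function (exp) throughput
head_dim = 128

matmul_flops_per_exp = 2 * 2 * head_dim   # QK^T and PV multiply-adds per exponential = 512
time_ratio = (1 / mufu_tflops) / (matmul_flops_per_exp / matmul_tflops)
print(f"{matmul_flops_per_exp}x more matmul FLOPS than exponentials")
print(f"serial exp time is about {time_ratio:.2f}x the matmul time")  # ~0.5, i.e. ~50%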

Inter-warpgroup overlapping with pingpong scheduling
The first and easiest way to overlap GEMM and softmax is to do nothing at all! The warp schedulers already try to schedule warps so that if some warps are blocked (e.g., waiting for GEMM results), other warps can run. That is, the warp schedulers do some of this overlapping for us, for free.

However, we can improve on this by doing some of the scheduling manually. As an example, if we have 2 warpgroups (labeled 1 and 2 – each warpgroup is a group of 4 warps), we can use synchronization barriers (bar.sync) so that warpgroup 1 first does its GEMMs (e.g., GEMM1 of one iteration and GEMM0 of the next iteration), and then warpgroup 2 does its GEMMs while warpgroup 1 does its softmax, and so on. This “pingpong” schedule is illustrated in the figure below, where the same color denotes the same iteration.

[Figure: pingpong scheduling between two warpgroups; the same color denotes the same iteration]

This would allow us to perform the softmax in the shadow of the GEMMs of the other warpgroup. Of course, this figure is just a caricature; in practice the scheduling is not really this clean. Nevertheless, pingpong scheduling can improve FP16 attention forward pass from around 570 TFLOPS to 620 TFLOPS (head dim 128, seqlen 8K).

Intra-warpgroup overlapping of GEMM and Softmax
Even within one warpgroup, we can have some part of softmax running while the GEMMs of that warpgroup are running. This is illustrated in this figure, where the same color denotes the same iteration.

[Figure: intra-warpgroup overlapping of GEMM and softmax; the same color denotes the same iteration]

This pipelining increases throughput from around 620 TFLOPS to around 640-660 TFLOPS for FP16 attention forward, at the cost of higher register pressure. We need more registers to hold both accumulators of the GEMMs, and the input/output of softmax. Overall, we find this technique to offer a favorable tradeoff.

LOW-PRECISION: REDUCE QUANTIZATION ERROR WITH INCOHERENT PROCESSING
LLM activation can have outliers with much larger magnitude than the rest of the features. These outliers make it difficult to quantize, producing much larger quantization errors. We leverage incoherent processing, a technique used in the quantization literature (e.g. from QuIP) that multiplies the query and key with a random orthogonal matrix to “spread out” the outliers and reduce quantization error. In particular, we use the Hadamard transform (with random signs), which can be done per attention head in O(d log d) instead of O(d^2) time, where d is the head dimension. Since the Hadamard transform is memory-bandwidth bound, it can be fused with previous operations such as rotary embedding (also memory-bandwidth bound) “for free”.

In our experiment where Q, K, V are generated from a standard normal distribution but 0.1% of the entries have large magnitudes (to simulate outliers), we found that incoherent processing can reduce the quantization error by 2.6x. We show numerical error comparison in the table below. Please see the paper for details.

[Table: numerical error of FP8 attention with and without incoherent processing]
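
The toy experiment below reproduces the spirit of that result. It uses SciPy's Hadamard matrix with random signs and a simple 8-bit symmetric per-tensor quantizer as a stand-in for FP8 (both are assumptions for illustration, not the paper's kernel), and shows that spreading out the outliers shrinks the quantization error:

import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal((4096, d)).astype(np.float32)
x[rng.random(x.shape) < 0.001] *= 50.0       # 0.1% of entries become large-magnitude outliers

def quantization_error(t, bits=8):
    """Simple symmetric per-tensor quantizer (illustrative stand-in for FP8)."""
    scale = np.abs(t).max() / (2 ** (bits - 1) - 1)
    return np.abs(np.round(t / scale) * scale - t).mean()

# Random-sign Hadamard transform: orthogonal, and O(d log d) in a real fused kernel.
H = hadamard(d).astype(np.float32) / np.sqrt(d)
signs = rng.choice([-1.0, 1.0], size=d).astype(np.float32)
x_rotated = x @ (H * signs)                  # "spread out" the outliers across the head dimension

print("error without rotation:", quantization_error(x))
print("error with rotation:   ", quantization_error(x_rotated))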

ATTENTION BENCHMARK
We show some results with FlashAttention-3, and compare it to FlashAttention-2, as well as the implementation in Triton and cuDNN (both of which already use new hardware features of Hopper GPUs).

For FP16, we see about 1.6x-1.8x speedup over FlashAttention-2.

[Figures: FP16 attention forward and backward speed, FlashAttention-3 vs FlashAttention-2, Triton, and cuDNN]

For FP8, we can reach close to 1.2 PFLOPS!

[Figure: FP8 attention forward speed]

DISCUSSION
This blogpost highlights some of the optimizations for FlashAttention available on Hopper GPUs. Other optimizations (e.g., variable length sequences, persistent kernel, and in-kernel transpose for FP8) are covered in the paper.

We have seen that designing algorithms that take advantage of the hardware they run on can bring significant efficiency gains and unlock new model capabilities such as long context. We look forward to future work on optimization for LLM inference, as well as generalizing our techniques to other hardware architectures.

We also look forward to FlashAttention-3 being integrated in a future release of PyTorch.

###
https://huggingface.co/papers/2407.03502
A recipe for Synthetic Data 2.0? Microsoft introduced “AgentInstruct” a new way to teach an LLM a new skill or behavior from synthetic data generated by LLM Agents. AgentInstruct improved a 7B (Orca-3) model by ~20% across all benchmarks and matched GPT-4 on RAG. 🚀
AgentInstruct employs a multi-agent workflow of LLMs and tools to transform raw data into high-quality instructional data (a minimal sketch follows the list):
1️⃣ Data Collection: Gather raw unstructured text documents and source code files from various sources.
2️⃣ Content Transformation Flow: Transform and improve formatting and quality of raw data for generating instructional content using specialized agents, e.g. convert raw text into a meeting text or technical document.
3️⃣ Seed Instruction Generation Flow: Generate diverse instructional tasks from the transformed text, leveraging a comprehensive taxonomy with 100+ subcategories, e.g. coding, reading comprehension.
4️⃣ Instruction Refinement Flow: Evolve the quality and complexity of generated instructions through iterative refinement by suggester-editor pairs.
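
A minimal sketch of flows 2-4 is shown below, with a hypothetical call_llm(system, user) helper standing in for any chat-completion client; the role prompts are illustrative, not the agents used in the paper:

def call_llm(system: str, user: str) -> str:
    # Hypothetical helper: plug in any chat-completion client here.
    raise NotImplementedError

def content_transformation(raw_text: str) -> str:
    # Flow 2: reshape raw text into well-formed source material.
    return call_llm("Rewrite the raw text as a clean technical document.", raw_text)

def seed_instruction(document: str, task_type: str) -> str:
    # Flow 3: generate a task of a chosen type grounded in the document.
    return call_llm(f"Write one {task_type} question answerable from the document.", document)

def refine_instruction(instruction: str, rounds: int = 2) -> str:
    # Flow 4: suggester-editor pair that iteratively raises complexity.
    for _ in range(rounds):
        suggestion = call_llm("Suggest one way to make this task more complex.", instruction)
        instruction = call_llm(
            "Rewrite the instruction to incorporate the suggestion.",
            f"Instruction: {instruction}\nSuggestion: {suggestion}",
        )
    return instruction

def agentinstruct_pipeline(raw_text: str, task_type: str = "reading comprehension") -> str:
    document = content_transformation(raw_text)
    return refine_instruction(seed_instruction(document, task_type))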
Insights:
🐋 Orca-3 is a Mistral 7B trained on 22M data pairs covering 17 different capabilities
📈 Orca-3 +40% on AGIEval, +19% on MMLU; +54% on GSM8K; +38% on BBH; +45% AlpacaEval
📉 Orca-3 achieves 31.34% reduction in hallucinations for summarization tasks
📝 AgentInstruct uses raw unstructured text as inputs
🧮 AgentInstruct can be used to teach Math, Reasoning, RAG
🚀 Agents can generate data that surpasses the capabilities of the underlying LLMs
AgentInstruct: Toward Generative Teaching with Agentic Flows
Published on Jul 4 · Submitted by ari9dam on Jul 10
Authors: Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah
Abstract
Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers also raised concerns around model collapse and drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically creating data by powerful models to teach a new skill or behavior to another model, we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model Orca-3 to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.

###
https://openai.com/index/openai-and-los-alamos-national-laboratory-work-together/
July 10, 2024

OpenAI and Los Alamos National Laboratory announce bioscience research partnership
OpenAI and Los Alamos National Laboratory are developing evaluations to understand how multimodal AI models can be used safely by scientists in laboratory settings.

OpenAI and Los Alamos National Laboratory (LANL) – one of the United States’ leading national laboratories – are working together to study how artificial intelligence can be used safely by scientists in laboratory settings to advance bioscientific research. This partnership follows a long tradition of the U.S. public sector, and in particular the national labs, working with the U.S. private sector to ensure advances in innovation translate to advancements in essential areas like health care and bioscience.

The recent White House Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence tasks the U.S. Department of Energy’s national labs to help evaluate the capabilities of frontier AI models, including biological capabilities. This is important to OpenAI because we believe AI has the potential to multiply the speed and impact of science for good. Already, Moderna is leveraging OpenAI’s technology to augment clinical trial development by building a data-analysis assistant designed to help analyze large data sets. Color Health built a new copilot using GPT-4o to assist healthcare providers to make evidence-based decisions about cancer screening and treatment.

“As a private company dedicated to serving the public interest, we’re thrilled to announce a first-of-its-kind partnership with Los Alamos National Laboratory to study bioscience capabilities,” said Mira Murati, OpenAI’s Chief Technology Officer. “This partnership marks a natural progression in our mission, advancing scientific research, while also understanding and mitigating risks.”

“AI is a powerful tool that has the potential for great benefits in the field of science, but, as with any new technology, comes with risks,” said Nick Generous, deputy group leader for Information Systems and Modeling. "At Los Alamos this work will be led by the laboratory's new AI Risks Technical Assessment Group, which will help assess and better understand those risks.”

OpenAI and Los Alamos National Laboratory’s Bioscience Division are working on an evaluation study to assess how frontier models like GPT-4o can assist humans with performing tasks in a physical laboratory setting through multimodal capabilities like vision and voice. This includes biological safety evaluations for GPT-4o and its currently unreleased real-time voice systems to understand how they could be used to support research in bioscience. We believe our upcoming evaluation will be the first of its kind and contribute to state-of-the-art research on AI biosecurity evaluations. It will build upon our existing work on biothreat risks and follow our Preparedness Framework, which outlines our approach to tracking, evaluating, forecasting, and protecting against model risks, and is consistent with our commitments to Frontier AI Safety agreed at the 2024 AI Seoul Summit.

Our upcoming evaluation with Los Alamos will be the first experiment to test multimodal frontier models in a lab setting by assessing the abilities of both experts and novices to perform and troubleshoot a safe protocol consisting of standard laboratory experimental tasks. These tasks are intended to serve as a proxy for more complex tasks that pose a dual use concern. Tasks may include transformation (e.g., introducing foreign genetic material into a host organism), cell culture (e.g., maintaining and propagating cells in vitro), and cell separation (e.g., through centrifugation). By examining the uplift in task completion and accuracy enabled by GPT-4o, we aim to quantify and assess how frontier models can upskill both existing professionals / PhDs as well as novices in real-world biological tasks.

These new evaluations extend our previous work in several new dimensions:

Incorporating wet lab techniques. Written tasks and responses for synthesizing and disseminating compounds were indicative, but do not fully capture the skills required to actually conduct biological benchwork. For example, it may be easy to know one must conduct mass spectrometry or even detail the steps in writing; it is much harder to perform correctly, with real samples.

Incorporating multiple modalities. Our previous work focused on GPT-4, which involved written outputs. GPT-4o’s ability to reason across modalities and take voice and visual inputs can potentially expedite learning. For example, a user less familiar with all the components of a wet lab setup can simply show their setup to GPT-4o and prompt it with questions, and troubleshoot scenarios visually through the camera instead of needing to convey the situation as a written question.

Los Alamos National Laboratory has been a pioneer in safety research and we look forward to working together on novel and robust safety evaluations for frontier AI models as capabilities continue to rapidly improve. This cooperative effort not only underscores the potential of multimodal AI models like GPT-4o to support scientific research, but also emphasizes the critical importance of private and public sector collaboration in both leveraging innovation and ensuring safety. As we look forward to the results of these evaluations, we hope that this partnership will help set new standards for AI safety and efficacy in the sciences, paving the way for future innovations that benefit humanity.


###
https://aws.amazon.com/blogs/machine-learning/fine-tune-anthropics-claude-3-haiku-in-amazon-bedrock-to-boost-model-accuracy-and-quality/
Fine-tune Anthropic’s Claude 3 Haiku in Amazon Bedrock to boost model accuracy and quality
by Yanyan Zhang, Fang Liu, Sovik Nath, and Carrie Wu | on 10 JUL 2024 | in Amazon Bedrock, Artificial Intelligence, Generative AI, Intermediate (200)
Frontier large language models (LLMs) like Anthropic Claude on Amazon Bedrock are trained on vast amounts of data, allowing Anthropic Claude to understand and generate human-like text. Fine-tuning Anthropic Claude 3 Haiku on proprietary datasets can provide optimal performance on specific domains or tasks. As a deep level of customization, fine-tuning with your own unique data is a key differentiating factor.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) along with a broad set of capabilities to build generative artificial intelligence (AI) applications, simplifying development with security, privacy, and responsible AI. With Amazon Bedrock custom models, you can customize FMs securely with your data. According to Anthropic, Claude 3 Haiku is the fastest and most cost-effective model on the market for its intelligence category. You can now fine-tune Anthropic Claude 3 Haiku in Amazon Bedrock in a preview capacity in the US West (Oregon) AWS Region. Amazon Bedrock is the only fully managed service that provides you with the ability to fine-tune Anthropic Claude models.

This post introduces the workflow of fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock. We first introduce the general concept of fine-tuning and then focus on the important steps, including setting up permissions, preparing the data, starting the fine-tuning jobs, and evaluating and deploying the fine-tuned models.

Solution overview
Fine-tuning is a technique in natural language processing (NLP) where a pre-trained language model is customized for a specific task. During fine-tuning, the weights of the pre-trained Anthropic Claude 3 Haiku model will get updated to enhance its performance on a specific target task. Fine-tuning allows the model to adapt its knowledge to the task-specific data distribution and vocabulary. Hyperparameters like learning rate and batch size need to be tuned for optimal fine-tuning.

Fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock offers significant advantages for enterprises. This process enhances task-specific model performance, allowing the model to handle custom use cases with task-specific performance metrics that meet or surpass more powerful models like Anthropic Claude 3 Sonnet or Anthropic Claude 3 Opus. As a result, businesses can achieve improved performance with reduced costs and latency. Essentially, fine-tuning Anthropic Claude 3 Haiku provides you with a versatile tool to customize Anthropic Claude, enabling you to meet specific performance and latency goals efficiently.

You can benefit from fine-tuning Anthropic Claude 3 Haiku in different use cases, using your own data. The following use cases are well-suited for fine-tuning the Anthropic Claude 3 Haiku model:

Classification – For example, when you have 10,000 labeled examples and want Anthropic Claude to do really well at this task
Structured outputs – For example, when you need Anthropic Claude’s response to always conform to a given structure
Industry knowledge – For example, when you need to teach Anthropic Claude how to answer questions about your company or industry
Tools and APIs – For example, when you need to teach Anthropic Claude how to use your APIs really well
In the following sections, we go through the steps of fine-tuning and deploying Anthropic Claude 3 Haiku in Amazon Bedrock using the Amazon Bedrock console and the Amazon Bedrock API.
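
As a rough illustration of the API path, the snippet below starts a customization job with boto3's Bedrock client. The role ARN and S3 URIs are placeholders, and the base-model identifier, hyperparameter names, and training-data format are indicative; confirm them against the Amazon Bedrock documentation before use:

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")  # preview Region per this post

response = bedrock.create_model_customization_job(
    jobName="claude3-haiku-moderation-ft",
    customModelName="claude3-haiku-moderation",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",   # placeholder
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",   # check the exact fine-tunable model ID
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},     # placeholder bucket and file
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "2", "batchSize": "8", "learningRateMultiplier": "1.0"},  # names indicative
)
print(response["jobArn"])  # track the job, then deploy the resulting custom model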


In testing, we fine-tuned Haiku to moderate comments on internet forums. Fine-tuning improved classification accuracy from 81.5% to 99.6% and reduced tokens per query by 89%.
Early customers, like SK Telecom, have used fine-tuning to create custom Claude 3 models. These models deliver more effective responses across a range

###
https://machinelearning.apple.com/research/applying-rlaif
Apple Machine Learning Research
Research areas: Methods and Algorithms; Speech and Natural Language Processing | Conference: ACL
Paper | Published July 2024
Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Authors: Sujan Dutta, Sayantan Mahinder, Raviteja Anantha, Bortik Bandyopadhyay


This paper was accepted at the Natural Language Reasoning and Structured Explanations workshop at ACL 2024.

Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs. We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model towards better alignment from smaller LLMs. We run our experiments on the Gorilla dataset and meticulously assess the quality of the model-generated code across various metrics, including AST, ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate accurately. Our approach significantly enhances the fine-tuned LLM baseline's performance, achieving a 4.5% improvement in executability rate. Notably, a smaller LLM model (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate.
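
The abstract maps onto a simple recipe: collect AI feedback from a larger model on the small model's generations, then fit a reward model on that feedback for a later RL stage. A minimal sketch follows, with judge_llm and embed as hypothetical placeholders; the paper's prompting strategy, reward-model architecture, and RL algorithm are not reproduced here:

import torch
import torch.nn as nn

def judge_llm(prompt: str, code: str) -> float:
    """Hypothetical: ask a larger LLM (e.g. GPT-3.5) whether `code` answers
    `prompt` with a correct API call; return a score in [0, 1]."""
    raise NotImplementedError

def embed(text: str) -> torch.Tensor:
    """Hypothetical text encoder producing a fixed-size feature vector."""
    raise NotImplementedError

class RewardModel(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

def train_reward_model(samples, epochs: int = 3) -> RewardModel:
    """`samples` is a list of (prompt, generated_code) pairs from the small LLM."""
    rm, loss_fn = RewardModel(), nn.MSELoss()
    opt = torch.optim.AdamW(rm.parameters(), lr=1e-4)
    for _ in range(epochs):
        for prompt, code in samples:
            target = torch.tensor(judge_llm(prompt, code), dtype=torch.float32)  # AI feedback as label
            pred = rm(embed(prompt + "\n" + code))
            loss = loss_fn(pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
    return rm  # this reward model would then drive the RL fine-tuning of the small LLM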

###
https://arxiv.org/pdf/2407.06581
Vision language models are blind
Pooyan Rahmanzadehgervi (1) (pooyan.rmz@gmail.com), Logan Bolton (1) (logan.bolton@auburn.edu), Mohammad Reza Taesiri (2) (mtaesiri@gmail.com), Anh Totti Nguyen (1) (anh.ng8@gmail.com)
1 Auburn University, AL, USA; 2 University of Alberta, Canada
Abstract. Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. We propose BlindTest, a suite of 7 visual tasks absurdly easy to humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo. Surprisingly, four state-of-the-art VLMs are, on average, only 56.20% accurate on our benchmark, with Sonnet-3.5 being the best (73.77% accuracy). On BlindTest, VLMs struggle with tasks that require precise spatial information and counting (from 0 to 10), sometimes giving the impression of a person with myopia seeing fine details as blurry and making educated guesses. Code is available at: https://vlmsareblind.github.io/
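
For intuition about how cheap such tasks are to generate and score, here is an illustrative generator for one BlindTest-style item (two circles, overlap yes/no) using Pillow; the authors' actual benchmark code lives at the link above:

import random
from PIL import Image, ImageDraw

def two_circle_item(size=512, radius=60):
    """Draw two circles and record whether they overlap, so a VLM's answer
    can be checked against exact geometry."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    centers = [(random.randint(radius, size - radius),
                random.randint(radius, size - radius)) for _ in range(2)]
    for cx, cy in centers:
        draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                     outline="black", width=4)
    (x1, y1), (x2, y2) = centers
    overlap = (x1 - x2) ** 2 + (y1 - y2) ** 2 < (2 * radius) ** 2
    return img, overlap  # image + ground-truth answer to "do the circles overlap?"

img, label = two_circle_item()
img.save("two_circles.png")
print("ground truth overlap:", label)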

###
https://huggingface.co/AI-MO/NuminaMath-7B-TIR
7/11/24
NuminaMath 7B TIR released! A 7B task-specific LLM that can solve complex math problems better than most high school students! It uses tool-integrated reasoning to solve problems by applying Chain-of-Thought reasoning and Python REPLs in an agentic flow with self-healing.🤯
NuminaMath 7B TIR solves math problems by:
1️⃣ Generating a Chain-of-Thought rationale on how to approach the problem.
2️⃣ Translating the CoT into Python code.
3️⃣ Executing the Python code in a REPL.
4️⃣ If execution fails, attempting to self-heal by repeating steps 1️⃣-3️⃣ using the erroneous output.
If it succeeds, it generates a final response with the result (a minimal sketch of this loop follows).
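
In the sketch below, generate is a hypothetical stand-in for calling the AI-MO/NuminaMath-7B-TIR checkpoint through any text-generation API, and the prompt wording and code-extraction convention are illustrative, not the model's actual template:

import re, io, contextlib

def generate(prompt: str) -> str:
    # Hypothetical: call the model (e.g. AI-MO/NuminaMath-7B-TIR) here.
    raise NotImplementedError

def run_python(code: str) -> str:
    """Toy stand-in for a sandboxed REPL; do NOT exec untrusted code like this in production."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as err:
        return f"ERROR: {err}"
    return buffer.getvalue()

def solve(problem: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        completion = generate(
            f"Problem: {problem}\n{feedback}"
            "Reason step by step, then give Python code in a ```python block.\n"
        )
        match = re.search(r"```python(.*?)```", completion, re.DOTALL)
        output = run_python(match.group(1)) if match else "ERROR: no code block"
        if not output.startswith("ERROR"):
            return generate(f"Problem: {problem}\nCode output: {output}\nFinal answer:")
        feedback = f"Previous attempt failed with: {output}\n"   # self-healing step
    return "unsolved"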
Model TL;DR:
🔬 Fine-tuned from deepseek-math-7b-base
🏆 Won the first progress prize in the AI Math Olympiad (AIMO)
🧬 Built a large synthetic dataset following the ToRA paper
🧠 Trained in two stages using Supervised Fine-Tuning on the Hugging Face cluster
🐍 Utilizes tool-integrated reasoning with Python REPL
🤗 Available on Hugging Face under Apache 2.0 license
📊 Capable of solving problems at AMC 12 level
Model Card for NuminaMath 7B TIR
NuminaMath is a series of language models that are trained to solve math problems using tool-integrated reasoning (TIR). NuminaMath 7B TIR won the first progress prize of the AI Math Olympiad (AIMO), with a score of 29/50 on the public and private test sets.


This model is a fine-tuned version of deepseek-ai/deepseek-math-7b-base with two stages of supervised fine-tuning:

Stage 1: fine-tune the base model on a large, diverse dataset of natural language math problems and solutions, where each solution is templated with Chain of Thought (CoT) to facilitate reasoning.
Stage 2: fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning, where each math problem is decomposed into a sequence of rationales, Python programs, and their outputs. Here we followed Microsoft’s ToRA paper and prompted GPT-4 to produce solutions in the ToRA format with code execution feedback. Fine-tuning on this data produces a reasoning agent that can solve mathematical problems via a mix of natural language reasoning and use of the Python REPL to compute intermediate results.
Model description
Model type: A 7B parameter math LLM fine-tuned in two stages of supervised fine-tuning, first on a dataset with math problem-solution pairs and then on a synthetic dataset with examples of multi-step generations using tool-integrated reasoning.
Language(s) (NLP): Primarily English
License: Apache 2.0
Finetuned from model: deepseek-ai/deepseek-math-7b-base
Model Sources
Repository: Coming soon!
Demo: https://huggingface.co/spaces/AI-MO/math-olympiad-solver


###
https://huggingface.co/papers/2309.03883
New decoding technique in transformers significantly reduces hallucinations 👏
DoLa decoding, which made a conference paper at ICLR '24, has just been merged in Transformers by João Gante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!
Reminder: Decoder LLMs (the GPT kind of LLM, the most common one) generate their outputs one token at a time: at each step, given a current text, they compute, for each token in their vocabulary, a "logit" that should represent the probability that this token is coming next.
Then the decoder either picks the highest-logit token (greedy decoding) or samples one with a probability defined by the logits (sampling). The token gets appended to the current text, the decoder computes logits again, and the cycle continues.
The authors of DoLa wanted to improve that simple method.
They built on the established fact that transformer LMs encode low-level information (like basic syntax) in early layers and higher-level information, like factual knowledge, in later layers.
💡 This gave them their key idea: during decoding, rather than picking the token with the highest logit, why not pick the token with the most impressive increase in logit across layers?
Implementation is actually quite simple: at each step, you find the earlier layer whose logits diverge most from your final layer; this chosen layer becomes the premature layer. Then you subtract the premature layer's logits from the final layer's logits, in order to reward tokens whose logits progressed most across layers. And this lets you pick your next token.
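
A stripped-down sketch of that contrast, using the hidden states of any Hugging Face decoder LM (GPT-2 here as a small stand-in; the paper evaluates Llama-1, and the merged implementation also restricts candidate layers and adds an adaptive plausibility constraint):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

inputs = tok("The capital of Washington state is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

lm_head = model.get_output_embeddings()   # GPT-2's (tied) LM head
final_norm = model.transformer.ln_f       # applied before the head when reading intermediate layers

final_logp = torch.log_softmax(out.logits[:, -1], dim=-1)   # mature (final-layer) distribution

def early_logprobs(h):
    """Next-token log-probs read off an intermediate hidden state (early exit)."""
    return torch.log_softmax(lm_head(final_norm(h[:, -1])), dim=-1)

def divergence(p_logp, q_logp):
    """Symmetric KL for brevity; the paper uses Jensen-Shannon divergence."""
    p, q = p_logp.exp(), q_logp.exp()
    return 0.5 * ((p * (p_logp - q_logp)).sum() + (q * (q_logp - p_logp)).sum())

# Premature layer = the intermediate layer whose distribution diverges most from the final one.
candidates = [early_logprobs(h) for h in out.hidden_states[2:-1:2]]
premature_logp = max(candidates, key=lambda lp: divergence(final_logp, lp))

contrast = final_logp - premature_logp    # reward tokens whose log-probability grew across layers
print(tok.decode(contrast.argmax(dim=-1)))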
Their test settings:
➤ Test 4 sizes of Llama-1 models (7b to 65B)
➤ Benchmarks on multiple choice QA (TruthfulQA, FACTOR) and open-ended QA (TruthfulQA-open-ended, GSM8K)
✨ Results are extremely impressive!
🚀 5% - 20% base points increase across the benchmarks
🚀 For instance on TruthfulQA / open-ended, across all model sizes the increase in truthfulness is 14 base points, which is around 40% improvement compared to standard decoding!
🤔 Wouldn't decoding take longer because of this added contrasting step? 👉 The runtime increase is negligible, only 1 to 8%.
The paper has additional insights, such as how token confidence evolves across layers for different types of tokens: I recommend reading it!