In AI today, alongside the announcement of the 2024 Nobel Prizes in Physics and Chemistry, major companies including OpenAI, Meta, Rhymes AI, and Microsoft made notable advances with new state-of-the-art AI models and techniques. OpenAI released a new benchmark for evaluating machine learning engineering and a framework for multi-agent systems, while Meta announced Movie Gen, a model that generates high-definition video from text. Rhymes AI strengthened its position by releasing a new multimodal Mixture-of-Experts (MoE) model, and Microsoft drew attention with a new Transformer architecture built on a differential attention mechanism. Beyond these, a range of AI models made innovative progress in text-to-video generation, protein structure prediction, and text-to-speech, with a significant impact on AI research and practical applications.

The Royal Swedish Academy of Sciences, 2024 Nobel Prize in Physics Announced

Link, October 8, 2024

  • The 2024 prize in physics was awarded to John J. Hopfield and Geoffrey E. Hinton for foundational discoveries and inventions enabling machine learning with artificial neural networks
  • Hopfield proposed the Hopfield Network, an associative memory model that stores and reconstructs images and other patterns using an energy function equivalent to that of spin systems in physics (summarized in the equations below)
  • Hinton built the Boltzmann Machine on top of the Hopfield network, advancing pattern recognition; using tools from statistical physics, it made key contributions to autonomously learning properties of data and recognizing patterns
  • From the 1980s onward, the two laureates laid the foundations of neural-network-based artificial intelligence, enabling the development of today's deep learning and machine learning models
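
For context, the mechanism these bullets describe can be summarized with the textbook Hopfield equations below; this is a standard formulation added for illustration, not text from the Nobel announcement. Node states are $s_i \in \{-1,+1\}$, stored patterns are $\xi^{\mu}$, and $w_{ij}$ are the connection weights:

\[
E = -\tfrac{1}{2}\sum_{i \neq j} w_{ij}\, s_i s_j, \qquad
w_{ij} = \frac{1}{N}\sum_{\mu=1}^{P} \xi_i^{\mu} \xi_j^{\mu}, \qquad
s_i \leftarrow \operatorname{sign}\Big(\sum_{j} w_{ij}\, s_j\Big)
\]

Storing patterns with the Hebbian rule and repeatedly applying the sign update lowers the energy step by step, so a distorted input settles into the closest stored pattern, which is the "store and reconstruct" behavior described above.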

The Royal Swedish Academy of Sciences, 2024 Nobel Prize in Chemistry Announced

Link, October 9, 2024

  • David Baker was awarded for computational protein design, a technique that has been used to create entirely new kinds of proteins and to expand their range of applications
  • Demis Hassabis and John Jumper were awarded for solving protein structure prediction with the AlphaFold2 model, a problem regarded as the biggest open challenge in protein research for the past 50 years
  • AlphaFold2 predicts a protein's three-dimensional structure from its sequence of the 20 amino acids; since 2020 it has been used by more than two million researchers for work such as understanding antibiotic resistance and studying enzymes that can decompose plastic
  • AlphaFold2 is making transformative contributions worldwide to protein function research, drug development, and bioengineering

OpenAI, MLE-bench Announced

Link, October 10, 2024

  • OpenAI announced MLE-bench, a benchmark for measuring machine learning engineering capabilities
  • MLE-bench evaluates real-world ML engineering skills on 75 Kaggle competitions, testing practically important skills such as preparing datasets, training models, and running experiments
  • The best-performing setup, OpenAI's o1-preview model with AIDE scaffolding, was used to measure how effectively AI agents perform compared with human competitors, achieving at least Kaggle bronze-medal level in 16.9% of competitions
  • The study additionally analyzes how resource scaling and contamination from pre-training affect agent performance, providing insights into optimizing AI agents

OpenAI, Swarm Library Released

Link, October 10, 2024

  • Swarm is a lightweight library for building multi-agent systems; it provides a stateless abstraction for managing interactions and control flow between multiple agents
  • Each agent defines its own role and set of available functions, and control can be handed off dynamically to another agent based on the conversation flow or specific criteria
  • Context variables maintain conversation state and enable information sharing between agents
  • Swarm supports streaming responses for real-time interaction and offers flexibility in coordinating and controlling multiple agents
  • The library is experimental and is designed to make multi-agent systems easy to build and test (see the minimal sketch after this list)
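
A minimal sketch of the agent/handoff pattern described above, based on the public Swarm README; the agent names, instructions, and the transfer function are illustrative examples, not an official sample:

```python
# Minimal Swarm sketch: a triage agent that hands off to a refund agent.
# The library is experimental; install it from the GitHub repository.
from swarm import Swarm, Agent

client = Swarm()

refund_agent = Agent(
    name="Refund Agent",
    instructions="Handle refund requests politely and concisely.",
)

def transfer_to_refunds():
    """Hand off by returning the next agent from a function."""
    return refund_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the user to the right agent based on the request.",
    functions=[transfer_to_refunds],
)

response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I want a refund for my last order."}],
    context_variables={"user_id": "demo-123"},  # shared state across agents
)
print(response.agent.name)               # last active agent after the handoff
print(response.messages[-1]["content"])  # final reply
```

client.run() loops over agent turns, function calls, and handoffs, then returns the updated messages, context variables, and the last active agent, mirroring the stateless, Chat Completions-style flow described above.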

Meta, Movie Gen Model Announced

Link, October 4, 2024

  • Meta announced Movie Gen, a cast of foundation models that can generate high-quality 1080p HD video from text
  • The largest video model is a 30B-parameter transformer trained with a maximum context length of 73K video tokens, corresponding to 16-second videos at 16 fps, allowing it to handle high resolution and long context
  • Beyond video generation, it supports personalized video generation from a user's image and text-instruction-based video editing, covering global edits such as background or style changes as well as precise localized edits such as adding or removing individual elements
  • The models set a new state of the art in text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation, and can be applied to a wide range of research and creative work

Pyramid Flow, SD3 Video Generation Model Released

Link, October 11, 2024

  • Pyramid Flow SD3 is a 2B-parameter Diffusion Transformer (DiT) text-to-video model that can generate 10-second videos at 768p and 24 fps
  • The model is trained efficiently with Flow Matching, enabling faster and more efficient video generation than previous video generation models (a generic flow-matching objective is sketched below)
  • Two variants are provided (384p for 5-second and 768p for 10-second videos), released under the MIT license for collaboration with the open-source community
  • It was trained only on open-source datasets within 20.7K A100 GPU hours, maximizing training efficiency
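
As background for the flow-matching training mentioned above, a generic conditional flow-matching objective looks like the following; this is the standard textbook form with a linear interpolation path, added for illustration, and not the paper's specific pyramidal formulation:

\[
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2
\]

Here $x_0$ is noise, $x_1$ a data sample (video latents in this setting), and $v_\theta$ the learned velocity field; generation integrates $\dot{x}_t = v_\theta(x_t, t)$ from $t=0$ to $t=1$.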

Rhymes AI, Aria Model Announced

Link, October 10, 2024

  • Rhymes AI announced Aria, the first open multimodal native Mixture-of-Experts (MoE) model, which handles text, image, video, and code inputs with 3.9B activated parameters (25.3B total)
  • It handles a long multimodal context of 64K tokens and can caption a 256-frame video in 10 seconds
  • Aria outperforms open multimodal models such as Pixtral-12B and Llama 3.2 11B, and surpasses GPT-4o mini on long-video understanding and Gemini-1.5 Flash on long-document understanding, processing multimodal data efficiently
  • The model is released under the Apache 2.0 license and can easily be extended by the open-source community (a sketch of its shared-plus-routed expert pattern follows this list)
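
According to the source post, each MoE layer in Aria has 66 experts, 2 of which are shared across all inputs while a router activates 6 more per token. The toy layer below illustrates that shared-plus-routed pattern in general terms; the class name, dimensions, and implementation details are made up for illustration and are not the released Aria code:

```python
# Toy shared + routed MoE layer (illustrative only; written for clarity, not speed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRoutedMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=66, n_shared=2, top_k=6):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router scores only the non-shared experts.
        self.router = nn.Linear(d_model, n_experts - n_shared)
        self.n_shared, self.top_k = n_shared, top_k

    def forward(self, x):                                    # x: (tokens, d_model)
        shared = sum(self.experts[i](x) for i in range(self.n_shared))
        gates = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts - n_shared)
        top_vals, top_idx = gates.topk(self.top_k, dim=-1)   # per-token routed experts
        routed = torch.zeros_like(x)
        for e in range(self.n_shared, len(self.experts)):
            mask = top_idx == (e - self.n_shared)            # which tokens chose expert e
            if mask.any():
                weight = (top_vals * mask).sum(dim=-1, keepdim=True)
                routed = routed + weight * self.experts[e](x)
        return shared + routed

layer = SharedRoutedMoE()
tokens = torch.randn(4, 64)
print(layer(tokens).shape)  # torch.Size([4, 64]); only 2 shared + 6 routed experts act per token
```

This is how a few billion activated parameters can coexist with a much larger total parameter count: every token only runs through the shared experts plus its top-k routed experts.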

Microsoft, Diff Transformer Announced

Link, October 8, 2024

  • Diff Transformer is a new Transformer architecture that introduces a differential attention mechanism to overcome limitations of the standard self-attention mechanism
  • Differential attention computes attention scores as the difference between two separate softmax attention maps, canceling noise and promoting sparse attention patterns that concentrate on the relevant information (see the sketch after this list)
  • The model outperforms the standard Transformer on long-context modeling and key information retrieval
  • It matches a comparable Transformer while using 35-40% fewer parameters or training tokens, and it helps mitigate hallucination and strengthen in-context learning
  • It can be implemented directly on existing hardware using FlashAttention kernels
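
A minimal numerical sketch of the "difference of two softmax maps" idea described above; it follows the published description at a high level, while the paper's full version splits attention heads and learns the scaling factor λ through a reparameterization, which is omitted here:

```python
# Toy differential attention: subtract one softmax attention map from another (illustrative only).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """x: (seq, d_model); each W*: (d_model, d_head)."""
    d = Wq1.shape[1]
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))  # first attention map
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))  # second attention map
    return (a1 - lam * a2) @ (x @ Wv)                   # noise-canceling difference, then value mixing

rng = np.random.default_rng(0)
seq, d_model, d_head = 6, 16, 8
x = rng.normal(size=(seq, d_model))
Wq1, Wk1, Wq2, Wk2, Wv = [rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(5)]
print(diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv).shape)  # (6, 8)
```

Because attention "noise" common to both maps is subtracted away, the result concentrates on a few relevant positions, which is the intuition behind the reported gains in retrieval and hallucination mitigation.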

F5-TTS, Text-to-Speech Model Announced

Link, October 9, 2024

  • F5-TTS is a fully non-autoregressive text-to-speech system based on flow matching with a Diffusion Transformer (DiT), providing fast and efficient speech synthesis
  • It models the input with ConvNeXt to refine the text representation, making it easier to align with the speech
  • The model trains faster than prior TTS systems and reaches an inference RTF of 0.15 with an inference-time Sway Sampling strategy, and it supports emotion-based synthesis, code-switching, and speed control
  • Trained on 100K hours of multilingual data, it delivers highly natural, expressive zero-shot synthesis and is released under the commercially permissive CC-BY license

Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is:

(today's date, written as year, month, day in Korean) AI News,

Summary

(overall short summary with good detail; for the Summary section, explain the details starting with the company name, e.g. "OpenAI announced ~~~.")

company name, Title

Link, date

  • detailed summary 1 (concise, bullet-point style)
  • detailed summary 2 (concise, bullet-point style)
  • detailed summary N (concise, bullet-point style)

company name, Title

Link, date

  • detailed summary 1 (concise, bullet-point style)
  • detailed summary 2 (concise, bullet-point style)
  • detailed summary N (concise, bullet-point style)
###
https://www.nobelprize.org/prizes/physics/2024/press-release/
8 October 2024

The Royal Swedish Academy of Sciences has decided to award the Nobel Prize in Physics 2024 to

John J. Hopfield
Princeton University, NJ, USA

Geoffrey E. Hinton
University of Toronto, Canada

“for foundational discoveries and inventions that enable machine learning with artificial neural networks”

They trained artificial neural networks using physics
This year’s two Nobel Laureates in Physics have used tools from physics to develop methods that are the foundation of today’s powerful machine learning. John Hopfield created an associative memory that can store and reconstruct images and other types of patterns in data. Geoffrey Hinton invented a method that can autonomously find properties in data, and so perform tasks such as identifying specific elements in pictures.

When we talk about artificial intelligence, we often mean machine learning using artificial neural networks. This technology was originally inspired by the structure of the brain. In an artificial neural network, the brain’s neurons are represented by nodes that have different values. These nodes influence each other through con­nections that can be likened to synapses and which can be made stronger or weaker. The network is trained, for example by developing stronger connections between nodes with simultaneously high values. This year’s laureates have conducted important work with artificial neural networks from the 1980s onward.

John Hopfield invented a network that uses a method for saving and recreating patterns. We can imagine the nodes as pixels. The Hopfield network utilises physics that describes a material’s characteristics due to its atomic spin – a property that makes each atom a tiny magnet. The network as a whole is described in a manner equivalent to the energy in the spin system found in physics, and is trained by finding values for the connections between the nodes so that the saved images have low energy. When the Hopfield network is fed a distorted or incomplete image, it methodically works through the nodes and updates their values so the network’s energy falls. The network thus works stepwise to find the saved image that is most like the imperfect one it was fed with.

Geoffrey Hinton used the Hopfield network as the foundation for a new network that uses a different method: the Boltzmann machine. This can learn to recognise characteristic elements in a given type of data. Hinton used tools from statistical physics, the science of systems built from many similar components. The machine is trained by feeding it examples that are very likely to arise when the machine is run. The Boltzmann machine can be used to classify images or create new examples of the type of pattern on which it was trained. Hinton has built upon this work, helping initiate the current explosive development of machine learning.

“The laureates’ work has already been of the greatest benefit. In physics we use artificial neural networks in a vast range of areas, such as developing new materials with specific properties,” says Ellen Moons, Chair of the Nobel Committee for Physics.


###
https://www.nobelprize.org/prizes/chemistry/2024/press-release/
9 October 2024

The Royal Swedish Academy of Sciences has decided to award the Nobel Prize in Chemistry 2024

with one half to

David Baker
University of Washington, Seattle, WA, USA
Howard Hughes Medical Institute, USA

“for computational protein design”

and the other half jointly to

Demis Hassabis
Google DeepMind, London, UK

John M. Jumper
Google DeepMind, London, UK

“for protein structure prediction”

They cracked the code for proteins’ amazing structures
The Nobel Prize in Chemistry 2024 is about pro­teins, life’s ingenious chemical tools. David Baker has succeeded with the almost impossible feat of building entirely new kinds of proteins. Demis Hassabis and John Jumper have developed an AI model to solve a 50-year-old problem: predicting proteins’ complex structures. These discoveries hold enormous potential.

The diversity of life testifies to proteins’ amazing capacity as chemical tools. They control and drive all the chemi­cal reactions that together are the basis of life. Proteins also function as hormones, signal substances, antibodies and the building blocks of different tissues.

“One of the discoveries being recognised this year concerns the construction of spectacular proteins. The other is about fulfilling a 50-year-old dream: predicting protein structures from their amino acid sequences. Both of these discoveries open up vast possibilities,” says Heiner Linke, Chair of the Nobel Committee for Chemistry.

Proteins generally consist of 20 different amino acids, which can be described as life’s building blocks. In 2003, David Baker succeeded in using these blocks to design a new protein that was unlike any other protein. Since then, his research group has produced one imaginative protein creation after another, including proteins that can be used as pharmaceuticals, vaccines, nanomaterials and tiny sensors.

The second discovery concerns the prediction of protein structures. In proteins, amino acids are linked together in long strings that fold up to make a three-dimensional structure, which is decisive for the protein’s function. Since the 1970s, researchers had tried to predict protein structures from amino acid sequences, but this was notoriously difficult. However, four years ago, there was a stunning breakthrough.

In 2020, Demis Hassabis and John Jumper presented an AI model called AlphaFold2. With its help, they have been able to predict the structure of virtually all the 200 million proteins that researchers have identified. Since their breakthrough, AlphaFold2 has been used by more than two million people from 190 countries. Among a myriad of scientific applications, researchers can now better understand antibiotic resistance and create images of enzymes that can decompose plastic.

Life could not exist without proteins. That we can now predict protein structures and design our own proteins confers the greatest benefit to humankind.

###
https://openai.com/index/mle-bench/
OpenAI
October 10, 2024

MLE-bench
Evaluating Machine Learning Agents on Machine Learning Engineering

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup — OpenAI's o1-preview with AIDE scaffolding — achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code to facilitate future research in understanding the ML engineering capabilities of AI agents.

###
https://github.com/openai/swarm/tree/main
OpenAI
OpenAI released Swarm, a lightweight library for building multi-agent systems. Swarm provides a stateless abstraction to manage interactions and handoffs between multiple agents and does not use the Assistants API. 👀
How it works:
1️⃣ Define Agents, each with its own instructions, role (e.g., "Sales Agent"), and available functions (will be converted to JSON structures).
2️⃣ Define logic for transferring control to another agent based on conversation flow or specific criteria within agent functions. This handoff is achieved by simply returning the next agent to call within the function.
3️⃣ Context Variables provide initial context and update them throughout the conversation to maintain state and share information between agents.
4️⃣ Client.run() initiates and manages the multi-agent conversation. It needs an initial agent, user messages, and context, and returns a response containing updated messages, context variables, and the last active agent.
Insights
🔄 Swarm manages a loop of agent interactions, function calls, and potential handoffs.
🧩 Agents encapsulate instructions, available functions (tools), and handoff logic.
🔌 The framework is stateless between calls, offering transparency and fine-grained control.
🛠️ Swarm supports direct Python function calling within agents.
📊 Context variables enable state management across agent interactions.
🔄 Agent handoffs allow for dynamic switching between specialized agents.
📡 Streaming responses are supported for real-time interaction.
🧪 The framework is experimental. Maybe to collect feedback?
🔧 Flexible and works with any OpenAI client, e.g. Hugging Face TGI or vLLM hosted models.


Overview
Swarm focuses on making agent coordination and execution lightweight, highly controllable, and easily testable.

It accomplishes this through two primitive abstractions: Agents and handoffs. An Agent encompasses instructions and tools, and can at any point choose to hand off a conversation to another Agent.

These primitives are powerful enough to express rich dynamics between tools and networks of agents, allowing you to build scalable, real-world solutions while avoiding a steep learning curve.

Note

Swarm Agents are not related to Assistants in the Assistants API. They are named similarly for convenience, but are otherwise completely unrelated. Swarm is entirely powered by the Chat Completions API and is hence stateless between calls.

Why Swarm
Swarm explores patterns that are lightweight, scalable, and highly customizable by design. Approaches similar to Swarm are best suited for situations dealing with a large number of independent capabilities and instructions that are difficult to encode into a single prompt.

The Assistants API is a great option for developers looking for fully-hosted threads and built in memory management and retrieval. However, Swarm is an educational resource for developers curious to learn about multi-agent orchestration. Swarm runs (almost) entirely on the client and, much like the Chat Completions API, does not store state between calls.


###
https://ai.meta.com/static-resource/movie-gen-research-paper
META
Date: October 4, 2024

Movie Gen: A Cast of Media Foundation Models
The Movie Gen team @ Meta [1]
[1] A detailed contributor list can be found in the appendix of this paper.
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos
with different aspect ratios and synchronized audio. We also show additional capabilities such as
precise instruction-based video editing and generation of personalized videos based on a user’s image.
Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization,
video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation
model is a 30B parameter transformer trained with a maximum context length of 73K video tokens,
corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical
innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data
curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to
reap the benefits of scaling pre-training data, model size, and training compute for training large scale
media generation models. We hope this paper helps the research community to accelerate progress
and innovation in media generation models.
All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.

Meta Movie Gen: the most advanced media foundation models to-date.
Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.
More details and examples of what Movie Gen can do ➡️
https://go.fb.me/00mlgt
Movie Gen Research Paper ➡️
https://go.fb.me/zfa8wf

🛠️ Movie Gen models and capabilities
• Movie Gen Video: A 30B parameter transformer model that can generate high-quality and high-definition images and videos from a single text prompt.
• Movie Gen Audio: A 13B parameter transformer model can take a video input along with optional text prompts for controllability to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound — delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.
• Precise video editing: Using a generated or existing video and accompanying text instructions as an input it can perform localized edits such as adding, removing or replacing elements — or global changes like background or style changes.
• Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement in video.
We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.

###
https://pyramid-flow.github.io/
2024.10.11
First real good open-source text-to-video model with MIT license! Pyramid Flow SD3 is a 2B Diffusion Transformer (DiT) that can generate 10-second videos at 768p with 24fps! 🤯 🎥✨
TL;DR;
🎬 Can Generate 10-second videos at 768p/24FPS
🍹 2B parameter single unified Diffusion Transformer (DiT)
🖼️ Supports both text-to-video AND image-to-video
🧠 Uses Flow Matching for efficient training
💻 Two model variants: 384p (5s) and 768p (10s)
📼 example videos on project page
🛠️ Simple two-step implementation process
📚 MIT License and available on
Hugging Face
✅ Trained only on open-source datasets
🔜 Training code coming soon!

Pyramidal Flow Matching for Efficient Video Generative Modeling
Yang Jin1, Zhicheng Sun1, Ningyuan Li3, Kun Xu2, Kun Xu2, Hao Jiang1, Nan Zhuang2, Quzhe Huang2, Yang Song, Yadong Mu1†, Zhouchen Lin1
1Peking University, 2Kuaishou Technology, 3Beijing University of Posts and Telecommunications



The following videos are from a training-efficient Autoregressive Video Generation model based on Flow Matching. It is trained only on open-source datasets within 20.7k A100 GPU hours.

###
https://rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model
🚨 Rhymes.AI released Aria - Multimodal MoE (3.9B active), 64K tokens, caption 256 frames in 10 sec, Apache 2.0 licensed! Beats GPT4o & Gemini Flash ⚡
> 3.9B Active, 25.3B Total parameters
> Significantly better than Pixtral 12B, Llama Vision 11B & Qwen VL
> Trained on 7.5T tokens
> Four stage training:
- 6.4T language pre-training
- 1.4T multimodal pre-training
- 35B long context training
- 20B high quality post-training
Architecture:
> Aria consists of a vision encoder and a mixture-of-experts (MoE) decoder
> Vision encoder:
- Produces visual tokens for images/videos in native aspect ratio
- Operates in three resolution modes: medium, high, and ultra-high
- Medium-resolution: 128 visual tokens
- High-resolution: 256 visual tokens
- Ultra-high resolution: Dynamically decomposed into multiple high-resolution sub-images
> MoE decoder:
- Multimodal native, conditioned on both language and visual input tokens
- 66 experts per MoE layer
- 2 experts shared among all inputs to capture common knowledge
- 6 additional experts activated per token by a router module
> Models on the Hub & Integrated with Transformers!
October 10, 2024
10 min read
Rhymes AI Team



Aria: First Open Multimodal Native MoE Model


Aria Multimodal Native MoE - An Open Model for ALL Modalities


Rhymes AI is proud to introduce Aria, the world’s first open-source, multimodal native Mixture-of-Experts (MoE) model.

In short, Aria features:

Multimodal native understanding:

State-of-the-art performance on a wide range of multimodal and language tasks
Pre-trained from scratch on a mixture of multimodal and language data
Lightweight and fast:

Fine-grained mixture-of-expert model with 3.9B activated parameters per token
Efficient and informative visual encoding of variable image sizes and aspect ratios
Long context window:

Long multimodal context window of 64K tokens, captioning a 256-frame video in 10 seconds
Open:

Open model weights 🤗, code repository 💻, technical report 📝 for collaborative development.
License: Apache 2.0


Multimodal Native Performance

Figure 1. Aria is a multimodal native model that excels at understanding text, vision, code.

Aria processes text, images, video, and code all at once, without needing separate setups for each type, demonstrating the advantages of a multimodal native model.

We provide a quantifiable definition for the term multimodal native:


A multimodal native model refers to a single model with strong understanding capabilities across multiple input modalities (e.g., text, code, image, video) that matches or exceeds the modality-specialized models of similar capacities.


We compared Aria against the best open and closed multimodal native models across established benchmarks, highlighting the following key observations:

Best-in-Class Performance: Aria is the leading multimodal native model, demonstrating clear advantages over Pixtral-12B and Llama3.2-11B across a range of multimodal, language, and coding tasks.
Competitive Against Proprietary Models: Aria performs competitively against proprietary models like GPT-4o and Gemini-1.5 on multimodal tasks, including document understanding, chart reading, scene text recognition, and video understanding.
Parameter Efficiency: Aria is the most parameter-efficient open model. Thanks to the MoE framework, Aria activates only 3.9 billion parameters, compared to the full activation in models like Pixtral-12B and Llama3.2-11B.

Figure 2. Aria shows best-in-class benchmark performance on multimodal, language, coding tasks.


Long Multimodal Input Understanding

Multimodal data is often complex, involving long sequences that combine visuals and text, like videos with subtitles or long documents. For a model to be effective in real-world applications, it must be capable of understanding and processing such data efficiently.

Aria excels in this area, demonstrating superior long multimodal input understanding. It outperforms larger open models, proving its efficiency and effectiveness despite its size. When compared to proprietary models, Aria surpasses GPT-4o mini in long video understanding and outperforms Gemini-1.5-Flash in long document understanding. This makes Aria a preferred choice for processing extensive multimodal data in a compute-and-time-efficient manner, delivering faster and more accurate results in real-world scenarios.


Figure 3. Aria excels in long multimodal input understanding, such as long video understanding.


Instruction Following

Aria is highly effective at understanding and following instructions on both multimodal and language inputs, performing better than top open-source models on both MIA-Bench and MT-Bench.

Figure 4. Aria is highly effective at instruction following on multimodal and language inputs.

###
https://huggingface.co/papers/2410.05258
Microsoft

Differential Transformer
Published on Oct 8 · #1 Paper of the day
Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei

Abstract
Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.

⚡️ Most important breakthrough this month: Differential Transformer vastly improves attention ⇒ better retrieval and fewer hallucinations!
Thought that self-attention could not be improved anymore?
Researchers from Microsoft Research and Tsinghua University have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it does really seem to vastly improve LLM performance!
Key insights:
🧠 Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns
🔥 DIFF Transformer outperforms standard Transformers while using 35-40% fewer parameters or training tokens
📏 Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively
🔎 Dramatically improves key information retrieval, enhancing in-context learning, and possibly reducing risk of hallucinations 🤯
🔢 Reduces activation outliers, potentially enabling lower-bit quantization without performance drop!
⚙️ Can be directly implemented using existing FlashAttention kernels
This new architecture could lead to much more capable LLMs, with vastly improved strengths in long-context understanding and factual accuracy.
But they didn’t release weights on the Hub: let’s wait for the community to train the first open-weights DiffTransformer! 🚀

###
https://github.com/SWivid/F5-TTS
[Submitted on 9 Oct 2024]
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at this https URL. We release all code and checkpoints to promote community development.

Let's goo! F5-TTS 🔊
> Trained on 100K hours of data
> Zero-shot voice cloning
> Speed control (based on total duration)
> Emotion based synthesis
> Long-form synthesis
> Supports code-switching
> Best part: CC-BY license (commercially permissive)🔥
Diffusion based architecture:
> Non-Autoregressive + Flow Matching with DiT
> Uses ConvNeXt to refine text representation, alignment
Synthesised: I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right? (Happy emotion)
The TTS scene is on fire! 🐐

기술적으로 최대한 자세하게 적어. 9개의 기사가 있고 하나도 빼먹지 말고 적어.