Anthropic에서는 AI 비서 Claude가 사람처럼 컴퓨터를 직접 사용할 수 있는 새로운 기능을 개발하여, 사용자 명령에 따라 화면에서 커서를 이동하고 클릭하며, 가상 키보드를 통해 정보를 입력할 수 있게 되었다고 발표하였습니다. 또한 Claude.ai에 새로운 분석 도구를 도입하여 Claude가 JavaScript 코드를 작성하고 실행할 수 있게 되어, 데이터 처리와 실시간 인사이트 제공이 가능해졌습니다. META에서는 엣지 및 모바일 장치에 적합한 경량 텍스트 전용 모델(1B 및 3B)과 비전 LLM(11B 및 90B)을 포함하는 Llama 3.2를 출시하였다고 발표하였습니다. Cohere에서는 23개 언어를 지원하는 Aya Expanse 8B 및 32B 모델을 출시하여 멀티링구얼 모델의 성능을 크게 향상시켰습니다. Hugging Face에서는 Diffusers 라이브러리에 Stable Diffusion 3.5 Large 모델을 추가하여 개선된 이미지 생성 기능을 제공하며, AutoTrain Advanced를 소개하여 다양한 작업에 대해 모델을 훈련하거나 미세 조정할 수 있는 오픈 소스, 코드 없는 도구를 제공한다고 발표하였습니다. 또한 Inference Endpoints를 통해 Speech-to-Speech 기능을 배포할 수 있게 되어, 고성능 애플리케이션을 위한 솔루션을 제공합니다. Genmo에서는 고품질 비디오 생성 모델인 Mochi 1을 공개하여 일관된 인간의 동작과 표현을 생성할 수 있게 되었습니다. Microsoft에서는 UI 스크린샷을 구조화된 요소로 해석하는 OmniParser를 발표하여 GPT-4V와 같은 강력한 LLM이 인터페이스의 관심 영역을 정확하게 파악할 수 있게 되었습니다. OpenAI에서는 연속시간 일관성 모델(sCM)에 대한 새로운 접근 방식을 공유하여 두 단계의 샘플링만으로도 선도적인 확산 모델과 견줄 만한 샘플 품질을 달성할 수 있다고 발표하였습니다. 연구자들은 LLM을 위한 데이터 합성과 증강에 대한 종합적인 서베이 논문을 발표하여 데이터 효율성을 높이고 새로운 데이터 소스를 탐색하는 방법을 논의하였습니다.

Anthropic, “Developing a computer use model”

링크, 2024년 10월 23일

  • Claude 3.5 Sonnet은 이제 컴퓨터 화면을 보고 마우스와 키보드를 사용하여 사람처럼 컴퓨터를 조작할 수 있음
  • 이 기능은 현재 공개 베타 버전으로 제공되며, AI의 중요한 진전으로 간주됨
  • 컴퓨터 사용 모델 개발을 위한 연구 과정에서 이미지 이해와 도구 사용 능력을 결합함
  • 간단한 소프트웨어에 대한 훈련으로부터 Claude가 빠르게 일반화함을 발견함
  • OSWorld 평가에서 14.9%를 기록하여, 사람처럼 화면을 보고 조작하는 방식의 모델 중 현재 최고 수준임(차순위 모델 7.7%, 사람은 일반적으로 70~75%)
  • 안전성 측면에서 프롬프트 인젝션 공격과 같은 새로운 위험을 식별하고 완화 조치를 마련함
  • 미래에는 더 빠르고 신뢰할 수 있으며 유용한 컴퓨터 사용 기능을 개발할 계획임

Anthropic, “Introducing the analysis tool in Claude.ai”

링크, 2024년 10월 25일

  • Claude.ai에 분석 도구를 도입하여 Claude가 JavaScript 코드를 작성하고 실행할 수 있게 함
  • 이제 Claude는 데이터 처리, 분석 수행 및 실시간 인사이트 제공이 가능함
  • 분석 도구는 모든 Claude.ai 사용자에게 기능 미리보기로 제공됨
  • 내장된 코드 샌드박스로서 복잡한 수학 계산, 데이터 분석 및 아이디어 반복이 가능함
  • CSV 파일에서 데이터 분석 및 시각화 지원
  • 마케팅, 영업, 제품 관리, 엔지니어링, 금융 팀이 활용할 수 있는 다양한 예시 제공
  • 시작하려면 Claude.ai에 로그인하여 기능 미리보기를 활성화하면 됨

META, “Llama 3.2: Revolutionizing edge AI and vision with open, customizable models”

링크, 2024년 9월 25일

  • Llama 3.2를 출시하여 엣지 및 모바일 장치에 적합한 경량 텍스트 모델(1B 및 3B)과 비전 LLM(11B 및 90B)을 포함함
  • 1B 및 3B 모델은 128K 토큰의 컨텍스트 길이를 지원하며 온디바이스 요약, 지시 따르기, 재작성 작업에 적합함
  • 11B 및 90B 비전 모델은 텍스트 모델과 호환되며 이미지 이해 작업에서 우수한 성능을 보임
  • Llama Stack 배포판을 처음으로 공개하여 다양한 환경에서 Llama 모델 작업을 단순화함
  • AWS, Databricks, Dell Technologies 등과 협력하여 Llama Stack 배포판을 구축함
  • 모델 평가에서 Llama 3.2 비전 모델이 Claude 3 Haiku와 같은 폐쇄형 모델과 경쟁함을 보여줌
  • 안전성을 위해 Llama Guard 3 11B Vision 및 Llama Guard 3 1B를 도입하여 안전 조치를 강화함

Cohere, “Cohere releases Aya 8B & 32B: SOTA multilingual models for 23 languages!”

링크, 2024년 10월 25일

  • Cohere는 Aya Expanse 8B 및 32B 모델을 출시하여 23개 언어에 대한 최첨단 멀티링구얼 모델을 제공함
  • 합성 데이터 사용 시 발생할 수 있는 모델 붕괴를 방지하기 위해 “데이터 중재(data arbitrage)” 기법을 도입함
  • 여러 교사 모델로부터 전략적으로 샘플링하여 성능 향상
  • 언어별로 모델을 훈련한 후 병합하는 모델 병합 기술 사용
  • Aya Expanse 8B는 Gemma 2 9B, Llama 3.1 8B, Ministral 8B 등을 능가함(승률 60.4%~70.6%)
  • Aya Expanse 32B는 Gemma 2 27B, Mistral 8x22B, Llama 3.1 70B보다 우수한 성능을 보임
  • 모델은 오픈 가중치로 제공되지만 CC-by-NC 비상업용 라이선스를 따름

Hugging Face, “🧨 Diffusers welcomes Stable Diffusion 3.5 Large”

링크, 2024년 10월 22일

  • Stable Diffusion 3.5는 Stable Diffusion 3의 개선된 버전으로 Hugging Face Hub에서 사용할 수 있음
  • 두 개의 체크포인트를 제공함: 대형(8B) 모델과 시간 단계 증류된 대형(8B) 모델로, 몇 단계의 추론을 가능하게 함
  • SD3.5의 트랜스포머 아키텍처는 SD3(중형)와 유사하지만 QK 정규화 및 이중 어텐션 레이어 등 몇 가지 변경 사항이 있음
  • 나머지 세부 사항은 SD3 Medium과 동일함

Genmo, “Introducing Mochi 1: The best open source video generation model”

링크, 2024년 10월 22일

  • Mochi 1은 세계 최고의 오픈 비디오 생성 모델의 연구 프리뷰로 공개됨
  • 고품질의 비디오와 강력한 프롬프트 준수를 제공함
  • 일관된 인간의 동작과 표현을 생성하여 불쾌한 골짜기를 넘어섬
  • Apache 2.0 라이선스로 제공되며 Hugging Face Hub에서 모델 사용 가능

Hugging Face, “AutoTrain: No-code training for state-of-the-art models”

링크, 2024년 10월 21일

  • AutoTrain Advanced를 소개하여 최첨단 모델을 위한 코드 없는 훈련을 가능하게 함
  • LLM 미세 조정, 텍스트 분류/회귀, 토큰 분류, 시퀀스-투-시퀀스 작업, 문장 변환기 미세 조정, VLM 미세 조정, 이미지 분류/회귀, 표 형식 데이터의 분류 및 회귀 작업 지원
  • AutoTrain Advanced는 https://github.com/huggingface/autotrain-advanced 에서 오픈 소스로 제공됨
  • 완전히 로컬 모드 또는 클라우드 머신에서 사용할 수 있으며 Hugging Face Hub에 공유된 모델들과 함께 작동함

Hugging Face, “Deploying Speech-to-Speech on Hugging Face”

링크, 2024년 10월 22일

  • Hugging Face의 Inference Endpoints를 사용하여 Speech-to-Speech를 배포할 수 있게 됨
  • L4 GPU에서 시간당 $0.80으로 초저지연 성능을 제공함
  • 낮은 지연 시간을 위한 맞춤형 Docker 이미지를 생성하고 솔루션을 오픈 소스로 공개함
  • 고성능 애플리케이션을 위한 솔루션을 제공하여 복잡한 서버 설정이나 클라우드 제공자 문제 없이 빠른 지연 시간을 달성함

Microsoft, “OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent”

링크, 2024년 10월 25일

  • OmniParser는 사용자 인터페이스 스크린샷을 구조화된 요소로 파싱하는 도구로 순수 비전 기반의 GUI 에이전트를 위해 개발됨
  • OpenAI의 GPT-4V가 컴퓨터를 조작할 수 있도록 지원함
  • 강력한 LLM이 인터페이스의 관심 영역을 정확하게 파악할 수 있게 함
  • YOLOv8와 BLIP-2 모델을 미세 조정한 버전을 포함함
  • 다양한 스크린샷을 구조화된 형식으로 변환하여 AI 에이전트의 효율성을 향상시킴

연구자 그룹, “A Survey on Data Synthesis and Augmentation for Large Language Models”

링크, 2024년 10월 16일

  • LLM을 위한 데이터 합성과 증강에 대한 종합적인 서베이 논문을 발표함
  • 훈련 데이터셋의 확장이 고품질 데이터의 성장 속도를 앞지르면서 데이터 고갈 위기가 다가오고 있음을 강조함
  • LLM의 수명 주기 전반에 걸친 데이터 생성 기술을 검토하고 요약함
  • 이러한 방법이 직면한 현재의 제약과 향후 개발 및 연구를 위한 잠재적 경로를 조사함
  • 연구자들이 LLM 구축에서 적절한 데이터 생성 전략을 신속하게 식별할 수 있도록 돕고자 함

OpenAI, “Simplifying, stabilizing, and scaling continuous-time consistency models”

링크, 2024년 10월 23일

  • OpenAI는 연속시간 일관성 모델(sCM)에 대한 새로운 접근 방식을 공유함
  • 이 접근 방식은 이론적 공식화를 단순화하고 훈련 과정을 안정화하여 대규모 데이터셋으로의 확장을 가능하게 함
  • 두 단계의 샘플링만으로 선도적인 확산 모델과 견줄 만한 샘플 품질을 달성함(아래 개략 수식 참고)
  • sCM은 사전 훈련된 확산 모델로부터 지식을 증류함
  • 이미지, 오디오, 비디오 등 다양한 도메인에서 실시간 생성의 가능성을 열어줌
  • 더 나은 추론 속도와 샘플 품질을 가진 생성 모델 개발을 지속할 계획임
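
아래는 일관성 모델(consistency model)의 2단계 샘플링 절차를 개략적으로 나타낸 것으로, Song et al.(2023)의 표준적인 정식화를 따른 스케치이며 OpenAI가 발표한 sCM의 정확한 수식과는 다를 수 있음. f_θ는 노이즈가 섞인 샘플을 같은 확률 흐름 ODE 궤적의 깨끗한 시작점으로 사상하는 일관성 함수이고, σ_max > σ_1 > σ_min은 노이즈 스케일임.

```latex
\begin{aligned}
\text{Step 1: } & x \sim \mathcal{N}(0, \sigma_{\max}^2 I), \qquad x_0 \leftarrow f_\theta(x, \sigma_{\max}) \\
\text{Step 2: } & z \sim \mathcal{N}(0, I), \qquad \tilde{x} \leftarrow x_0 + \sqrt{\sigma_1^2 - \sigma_{\min}^2}\, z, \qquad x_0 \leftarrow f_\theta(\tilde{x}, \sigma_1)
\end{aligned}
```
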
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://www.anthropic.com/research/developing-computer-use
Anthropic
Developing a computer use model
2024년 10월 23일

An abstract representation of AI computer use, with a computer cursor clicking on a stylized representation of a neural network
Claude can now use computers. The latest version of Claude 3.5 Sonnet can, when run through the appropriate software setup, follow a user’s commands to move a cursor around their computer’s screen, click on relevant locations, and input information via a virtual keyboard, emulating the way people interact with their own computer.

We think this skill—which is currently in public beta—represents a significant breakthrough in AI progress. Below, we share some insights from the research that went into developing computer use models—and into making them safer.

Why computer use?
Why is this new capability important? A vast amount of modern work happens via computers. Enabling AIs to interact directly with computer software in the same way people do will unlock a huge range of applications that simply aren’t possible for the current generation of AI assistants.

Over the last few years, many important milestones have been reached in the development of powerful AI—for example, the ability to perform complex logical reasoning and the ability to see and understand images. The next frontier is computer use: AI models that don’t have to interact via bespoke tools, but that instead are empowered to use essentially any piece of software as instructed.

The research process
Our previous work on tool use and multimodality provided the groundwork for these new computer use skills. Operating computers involves the ability to see and interpret images—in this case, images of a computer screen. It also requires reasoning about how and when to carry out specific operations in response to what’s on the screen. Combining these abilities, we trained Claude to interpret what’s happening on a screen and then use the software tools available to carry out tasks.

When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands—similar to how models often struggle with simple-seeming questions like “how many A’s in the word ‘banana’?”.
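
To make this concrete, the sketch below shows roughly what a computer-use request looks like through the public beta of the Messages API with the Anthropic Python SDK. It is a minimal, hedged example: the tool schema and beta flag follow the launch documentation, but exact names and fields may change during the beta, and the agent loop that executes Claude's tool calls and returns screenshots is only indicated in a comment.

```python
# Minimal sketch of a computer-use request (public beta). Assumes the
# Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",   # virtual mouse/keyboard + screenshots
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        }
    ],
    messages=[{"role": "user", "content": "Open the calculator and compute 12 * 34."}],
    betas=["computer-use-2024-10-22"],     # beta flag from the launch docs
)

# In a real agent loop, the caller executes each tool_use block that Claude
# returns (moving the cursor, clicking, taking a screenshot) and sends the
# results back as tool_result messages until the task is done.
print(response.content)
```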

We were surprised by how rapidly Claude generalized from the computer-use training we gave it on just a few pieces of simple software, such as a calculator and a text editor (for safety reasons we did not allow the model to access the internet during training). In combination with Claude’s other skills, this training granted it the remarkable ability to turn a user’s written prompt into a sequence of logical steps and then take actions on the computer. We observed that the model would even self-correct and retry tasks when it encountered obstacles.

Although the subsequent advances came quickly once we made the initial breakthrough, it took a great deal of trial and error to get there. Some of our researchers noted that developing computer use was close to the “idealized” process of AI research they’d pictured when they first started in the field: constant iteration and repeated visits back to the drawing board until there was progress.

The research paid off. At present, Claude is state-of-the-art for models that use computers in the same way as a person does—that is, from looking at the screen and taking actions in response. On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.

Making computer use safe
Every advance in AI brings with it new safety challenges. Computer use is mainly a way of lowering the barrier to AI systems applying their existing cognitive skills, rather than fundamentally increasing those skills, so our chief concerns with computer use focus on present-day harms rather than future ones. We confirmed this by assessing whether computer use increases the risk of frontier threats as outlined in our Responsible Scaling Policy. We found that the updated Claude 3.5 Sonnet, including its new computer use skill, remains at AI Safety Level 2—that is, it doesn’t require a higher standard of safety and security measures than those we currently have in place.

When future models require AI Safety Level 3 or 4 safeguards because they present catastrophic risks, computer use might exacerbate those risks. We judge that it’s likely better to introduce computer use now, while models still only need AI Safety Level 2 safeguards. This means we can begin grappling with any safety issues before the stakes are too high, rather than adding computer use capabilities for the first time into a model with much more serious risks.

In this spirit, our Trust & Safety teams have conducted extensive analysis of our new computer-use models to identify potential vulnerabilities. One concern they've identified is “prompt injection”—a type of cyberattack where malicious instructions are fed to an AI model, causing it to either override its prior directions or perform unintended actions that deviate from the user's original intent. Since Claude can interpret screenshots from computers connected to the internet, it’s possible that it may be exposed to content that includes prompt injection attacks.

Those using the computer-use version of Claude in our public beta should take the relevant precautions to minimize these kinds of risks. As a resource for developers, we have provided further guidance in our reference implementation.

As with any AI capability, there’s also the potential for users to intentionally misuse Claude’s computer skills. Our teams have developed classifiers and other methods to flag and mitigate these kinds of abuses. Given the upcoming U.S. elections, we’re on high alert for attempted misuses that could be perceived as undermining public trust in electoral processes. While computer use is not sufficiently advanced or capable of operating at a scale that would present heightened risks relative to existing capabilities, we've put in place measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites. We will continuously evaluate and iterate on these safety measures to balance Claude's capabilities with responsible use during the public beta.

Consistent with our standard approach to data privacy, by default we don’t train our generative AI models on user-submitted data, including any of the screenshots Claude receives.

The future of computer use
Computer use is a completely different approach to AI development. Up until now, LLM developers have made tools fit the model, producing custom environments where AIs use specially-designed tools to complete various tasks. Now, we can make the model fit the tools—Claude can fit into the computer environments we all use every day. Our goal is for Claude to take pre-existing pieces of computer software and simply use them as a person would.

There’s still a lot to do. Even though it’s the current state of the art, Claude’s computer use remains slow and often error-prone. There are many actions that people routinely do with computers (dragging, zooming, and so on) that Claude can’t yet attempt. The “flipbook” nature of Claude’s view of the screen—taking screenshots and piecing them together, rather than observing a more granular video stream—means that it can miss short-lived actions or notifications.

Even while we were recording demonstrations of computer use for today’s launch, we encountered some amusing errors. In one, Claude accidentally clicked to stop a long-running screen recording, causing all footage to be lost. In another, Claude suddenly took a break from our coding demo and began to peruse photos of Yellowstone National Park.

We expect that computer use will rapidly improve to become faster, more reliable, and more useful for the tasks our users want to complete. It’ll also become much easier to implement for those with less software-development experience. At every stage, our researchers will be working closely with our safety teams to ensure that Claude’s new capabilities are accompanied by the appropriate safety measures.

We invite developers who try computer use in our public beta to contact us with their feedback using this form, so that our researchers can continue to improve the usefulness and safety of this new capability.

###
https://www.anthropic.com/news/analysis-tool
Anthropic
Introducing the analysis tool in Claude.ai
2024년 10월 25일

Visual of two hands engaging with a chart
We're introducing the analysis tool, a new built-in feature for Claude.ai that enables Claude to write and run JavaScript code. Claude can now process data, conduct analysis, and produce real-time insights. The analysis tool is available for all Claude.ai users in feature preview.

Think of the analysis tool as a built-in code sandbox, where Claude can do complex math, analyze data, and iterate on different ideas before sharing an answer. The ability to process information and run code means you get more accurate answers—building on Claude 3.5 Sonnet’s state-of-the-art coding and data skills.

Analyzing and visualizing data from CSV files
When you need precise, verifiable answers from data, Claude now works more like a real data analyst. Instead of relying on abstract analysis alone, it can systematically process your data—cleaning, exploring, and analyzing it step-by-step until it reaches the correct result.

While Claude could always write code, it can now run that code within Claude.ai to support all types of analysis tasks. With the analysis tool, you get answers that are not just well-reasoned, but are mathematically precise and reproducible.

The analysis tool in Claude.ai can expand capabilities across teams. For example:

Marketers can upload customer interactions across the full funnel and Claude will surface opportunities to improve conversions.
Sales teams can upload global sales data and Claude will provide country-specific performance analysis.
Product managers can upload customer engagement data and Claude will help inform sprint planning and development priorities.
Engineers can upload performance logs from various servers and Claude will identify areas for better resource utilization.
Finance teams can upload monthly financial data and Claude will create a financial dashboard to convey key trends and inform decision making.
Getting started
You can turn on the analysis tool feature preview by logging into Claude.ai. To manage all feature previews, click on your name in the bottom left corner.

###
https://huggingface.co/models?other=arxiv:2405.16406
META
Large Language Model
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
September 25, 2024


Takeaways:

Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions.
The Llama 3.2 1B and 3B models support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge. These models are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors.
Supported by a broad ecosystem, the Llama 3.2 11B and 90B vision models are drop-in replacements for their corresponding text model equivalents, while exceeding on image understanding tasks compared to closed models, such as Claude 3 Haiku. Unlike other open multimodal models, both pre-trained and aligned models are available to be fine-tuned for custom applications using torchtune and deployed locally using torchchat. They’re also available to try using our smart assistant, Meta AI.
We’re sharing the first official Llama Stack distributions, which will greatly simplify the way developers work with Llama models in different environments, including single-node, on-prem, cloud, and on-device, enabling turnkey deployment of retrieval-augmented generation (RAG) and tooling-enabled applications with integrated safety.
We’ve been working closely with partners like AWS, Databricks, Dell Technologies, Fireworks, Infosys, and Together AI to build Llama Stack distributions for their downstream enterprise clients. On-device distribution is via PyTorch ExecuTorch, and single-node distribution is via Ollama.
We continue to share our work because we believe openness drives innovation and is good for developers, Meta, and the world. Llama is already leading the way on openness, modifiability, and cost efficiency—enabling more people to have creative, useful, and life-changing breakthroughs using generative AI.
We’re making Llama 3.2 models available for download on llama.com and Hugging Face, as well as available for immediate development on our broad ecosystem of partner platforms, including AMD, AWS, Databricks, Dell, Google Cloud, Groq, IBM, Intel, Microsoft Azure, NVIDIA, Oracle Cloud, Snowflake, and more.

We’ve been excited by the impact the Llama 3.1 herd of models have made in the two months since we announced them, including the 405B—the first open frontier-level AI model. While these models are incredibly powerful, we recognize that building with them requires significant compute resources and expertise. We’ve also heard from developers who don’t have access to these resources and still want the opportunity to build with Llama. As Meta Founder and CEO Mark Zuckerberg shared today at Connect, they won’t have to wait any longer. Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices.

It’s only been a year and a half since we first announced Llama, and we’ve made incredible progress in such a short amount of time. This year, Llama has achieved 10x growth and become the standard for responsible innovation. Llama also continues to lead on openness, modifiability, and cost efficiency, and it’s competitive with closed models—even leading in some areas. We believe that openness drives innovation and is the right path forward, which is why we continue to share our research and collaborate with our partners and the developer community.

We’re making Llama 3.2 models available for download on llama.com and Hugging Face, as well as available for immediate development on our broad ecosystem of partner platforms. Partners are an important part of this work, and we’ve worked with over 25 companies, including AMD, AWS, Databricks, Dell, Google Cloud, Groq, IBM, Intel, Microsoft Azure, NVIDIA, Oracle Cloud, and Snowflake, to enable services on day one. For the Llama 3.2 release, we’re also working with on-device partners Arm, MediaTek, and Qualcomm to offer a broad range of services at launch. Starting today, we’re also making Llama Stack available to the community. More details on the latest release, including information on the multimodal availability in Europe, can be found in our acceptable use policy.

Meet Llama 3.2

The two largest models of the Llama 3.2 collection, 11B and 90B, support image reasoning use cases, such as document-level understanding including charts and graphs, captioning of images, and visual grounding tasks such as directionally pinpointing objects in images based on natural language descriptions. For example, a person could ask a question about which month in the previous year their small business had the best sales, and Llama 3.2 can then reason based on an available graph and quickly provide the answer. In another example, the model could reason with a map and help answer questions such as when a hike might become steeper or the distance of a particular trail marked on the map. The 11B and 90B models can also bridge the gap between vision and language by extracting details from an image, understanding the scene, and then crafting a sentence or two that could be used as an image caption to help tell the story.

The lightweight 1B and 3B models are highly capable with multilingual text generation and tool calling abilities. These models empower developers to build personalized, on-device agentic applications with strong privacy where data never leaves the device. For example, such an application could help summarize the last 10 messages received, extract action items, and leverage tool calling to directly send calendar invites for follow-up meetings.

Running these models locally comes with two major advantages. First, prompts and responses can feel instantaneous, since processing is done locally. Second, running models locally maintains privacy by not sending data such as messages and calendar information to the cloud, making the overall application more private. Since processing is handled locally, the application can clearly control which queries stay on the device and which may need to be processed by a larger model in the cloud.
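
As a quick illustration of the on-device-style workflow described above, the following is a minimal sketch using the Transformers text-generation pipeline with one of the lightweight instruction-tuned checkpoints. The model ID and generation settings are illustrative, and access to the gated repository on the Hugging Face Hub may be required.

```python
# Minimal sketch: chat with a lightweight Llama 3.2 instruct model via Transformers.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",  # gated repo; accept the license first
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You summarize messages into short action items."},
    {"role": "user", "content": "Summarize these notes into three action items: ..."},
]

outputs = pipe(messages, max_new_tokens=128)
# The pipeline returns the full chat history; the last message is the assistant reply.
print(outputs[0]["generated_text"][-1]["content"])
```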

Model evaluations

Our evaluation suggests that the Llama 3.2 vision models are competitive with leading foundation models, Claude 3 Haiku and GPT4o-mini on image recognition and a range of visual understanding tasks. The 3B model outperforms the Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, summarization, prompt rewriting, and tool-use, while the 1B is competitive with Gemma.

We evaluated performance on over 150 benchmark datasets that span a wide range of languages. For the vision LLMs, we evaluated performance on benchmarks for image understanding and visual reasoning.




Vision models

As the first Llama models to support vision tasks, the 11B and 90B models required an entirely new model architecture that supports image reasoning.

To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models.

Our training pipeline consists of multiple stages, starting from pretrained Llama 3.1 text models. First, we add image adapters and encoders, then pretrain on large-scale noisy (image, text) pair data. Next, we train on medium-scale high quality in-domain and knowledge-enhanced (image, text) pair data.

In post-training, we use a similar recipe as the text models by doing several rounds of alignment on supervised fine-tuning, rejection sampling, and direct preference optimization. We leverage synthetic data generation by using the Llama 3.1 model to filter and augment questions and answers on top of in-domain images, and use a reward model to rank all the candidate answers to provide high quality fine-tuning data. We also add safety mitigation data to produce a model with a high level of safety while retaining the helpfulness of the model.

The end result is a set of models that can take in both image and text prompts, and deeply understand and reason on the combination. This is another step toward Llama models having even richer agentic capabilities.


Lightweight models

As we talked about with Llama 3.1, powerful teacher models can be leveraged to create smaller models that have improved performance. We used two methods—pruning and distillation—on the 1B and 3B models, making them the first highly capable lightweight Llama models that can fit on devices efficiently.

Pruning enabled us to reduce the size of extant models in the Llama herd while recovering as much knowledge and performance as possible. For the 1B and 3B models, we took the approach of using structured pruning in a single shot manner from the Llama 3.1 8B. This involved systematically removing parts of the network and adjusting the magnitude of the weights and gradients to create a smaller, more efficient model that retains the performance of the original network.

Knowledge distillation uses a larger network to impart knowledge on a smaller network, with the idea that a smaller model can achieve better performance using a teacher than it could from scratch. For the 1B and 3B in Llama 3.2, we incorporated logits from the Llama 3.1 8B and 70B models into the pre-training stage of the model development, where outputs (logits) from these larger models were used as token-level targets. Knowledge distillation was used after pruning to recover performance.
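
The logit-based distillation described here can be sketched with a standard token-level KD loss: the student's temperature-softened distribution is pushed toward the teacher's. This is a textbook formulation, not Meta's training code, and the tensor shapes below are illustrative.

```python
# Generic token-level knowledge distillation loss (KL between softened
# teacher and student distributions). Not Meta's implementation.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, temperature: float = 2.0) -> torch.Tensor:
    # Both tensors have shape (batch, seq_len, vocab_size).
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy example with random logits.
student = torch.randn(2, 8, 128)
teacher = torch.randn(2, 8, 128)
print(kd_loss(student, teacher))
```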



In post-training, we use a similar recipe as Llama 3.1 and produce final chat models by doing several rounds of alignment on top of the pre-trained model. Each round involves supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO).

In post-training, we scale context length support to 128K tokens, while maintaining the same quality as the pre-trained model. We also engage in synthetic data generation that goes through careful data processing and filtering to ensure high quality. We carefully blend the data to optimize for high quality across multiple capabilities like summarization, rewriting, instruction following, language reasoning, and tool use.

To enable the community to innovate on these models, we worked closely with Qualcomm and Mediatek, the top two mobile system on a chip (SoC) companies in the world, and Arm, who provides the foundational compute platform for 99% of mobile devices. The weights being released today are based on BFloat16 numerics. Our teams are actively exploring quantized variants that will run even faster, and we hope to share more on that soon.


This demo is based on an unreleased quantized model.
Llama Stack distributions

In July, we released a request for comment on the Llama Stack API, a standardized interface for canonical toolchain components (fine-tuning, synthetic data generation) to customize Llama models and build agentic applications. The engagement has been great.

Since then, we have been working hard to make the API real. We built a reference implementation of the APIs for inference, tool use, and RAG. In addition, we have been working with partners to adapt them to become providers for the APIs. Finally, we have introduced Llama Stack Distribution as a way to package multiple API Providers that work well together to provide a single endpoint for developers. We are now sharing with the community a simplified and consistent experience that will enable them to work with Llama models in multiple environments, including on-prem, cloud, single-node, and on-device.



The full set of releases includes:

Llama CLI (command line interface) to build, configure, and run Llama Stack distributions
Client code in multiple languages, including python, node, kotlin, and swift
Docker containers for Llama Stack Distribution Server and Agents API Provider
Multiple distributions
Single-node Llama Stack Distribution via Meta internal implementation and Ollama
Cloud Llama Stack distributions via AWS, Databricks, Fireworks, and Together
On-device Llama Stack Distribution on iOS implemented via PyTorch ExecuTorch
On-prem Llama Stack Distribution supported by Dell
We look forward to working with developers and partners to simplify all aspects of building with Llama models and welcome feedback.



System level safety

Taking an open approach has many benefits. It helps ensure that more people around the world can access the opportunities that AI provides, guards against concentrating power in the hands of a small few, and deploys technology more equitably and safely across society. As we continue to innovate, we also want to make sure we’re empowering developers to build safe and responsible systems.

Building on our previous release and continuous effort to support responsible innovation, today we’re adding new updates to our family of safeguards:

First, we’re releasing Llama Guard 3 11B Vision, which is designed to support Llama 3.2’s new image understanding capability and filter text+image input prompts or text output responses to these prompts.
Second, as we released 1B and 3B Llama models to be used in more constrained environments like on-device, we also optimized Llama Guard to drastically reduce its deployment cost. Llama Guard 3 1B is based on the Llama 3.2 1B model and has been pruned and quantized bringing its size from 2,858 MB down to 438 MB, making it more efficient than ever to deploy.
These new solutions are integrated into our reference implementations, demos, and applications and are ready for the open source community to use on day one.


Try Llama 3.2 today

Llama 3.2 is poised to reach more people than ever before and enable exciting new use cases. We believe sharing these models with the open source community isn’t enough. We want to make sure developers also have the tools they need to build with Llama responsibly. As part of our continued responsible release efforts, we’re offering developers new tools and resources, and as always, we’ll update best practices in our Responsible Use Guide.

We continue to share the latest advancements in the Llama ecosystem because we believe openness drives innovation and is good for developers, Meta, and the world. We’re excited to continue the conversations we’re having with our partners and the open source community, and as always, we can’t wait to see what the community builds using Llama 3.2 and Llama Stack.

This work was supported by our partners across the AI community. We’d like to thank and acknowledge (in alphabetical order): Accenture, AMD, Arm, AWS, Cloudflare, Databricks, Dell, Deloitte, Fireworks.ai, Google Cloud, Groq, Hugging Face, IBM watsonx, Infosys, Intel, Kaggle, Lenovo, LMSYS, MediaTek, Microsoft Azure, NVIDIA, OctoAI, Ollama, Oracle Cloud, PwC, Qualcomm, Sarvam AI, Scale AI, Snowflake, Together AI, and UC Berkeley - vLLM Project.
Blazing-fast LLMs on your edge devices? 🚀 Meta's just dropped quantized Llama 3.2 1B and 3B, ideal for on-device and edge deployments, prioritizing privacy and speed while maintaining almost full-precision performance! 🦙
TL;DR:
📚 Initialized from the Llama 3.2 1B and 3B parameter models
💡 2-3x faster inference compared to the original models.
📉 45-60% reduction in model size and memory usage.
🏆 Maintains almost full precision accuracy
⚙️ Utilizes 4-bit groupwise quantization and 8-bit dynamic activation for optimal performance.
🤏🏻 Quantization scheme created with PyTorch’s ExecuTorch with Arm CPU in mind
🥇 Best for knowledge retrieval, summarization, and instruction following.
🤗 Available on Hugging Face

###
https://huggingface.co/blog/aya-expanse
Cohere
10/25/24
Cohere releases Aya 8B & 32B: SOTA multilingual models for 23 languages!
Cohere just dropped two great models that beat top contenders while also adding 23 languages! How did they pull that off?
🔄 𝗧𝗿𝗮𝗶𝗻 𝗼𝗻 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮:
• Synthetic data has been said to cause model-collapse after too much training
• Cohere has introduced "data arbitrage" to prevent this by strategically sampling from a pool of several teacher models instead of one single teacher
• First train a model pool for each different groups of languages, and employ an internal Reward Model named "Arbiter" to evaluate and select the optimal generation. Then only the best generation is kept as the final completion for each prompt
➡️ This process is particularly effective for the multilingual setting, where no single teacher model performs well in all languages: here "Multilingual Arbitrage" singlehandedly improves win rates of the 8B model vs Gemma-2-9B by 10 points!
🧩 𝗨𝘀𝗲 𝗺𝗼𝗱𝗲𝗹 𝗺𝗲𝗿𝗴𝗶𝗻𝗴: Rather than struggling to find the right mix of data in training a single model for multilingual use, just train language specific models then merge them!
• Maximize diversity between merged checkpoints by training each on different language families.
• Experimented with fancy techniques (SLERP, TIES, DARE-TIES) but found weighted averaging to be the most consistent!
➡️ Merging had 3x more gains at high 35B scale vs the 8B scale - consistent with literature findings that merging is more effective at scale
⚡️ 𝗚𝗿𝗲𝗮𝘁 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: Automatic evaluations on Arena-Hard-Auto dataset:
➡️ Aya Expanse 8B beats models from its weight class such as Gemma 2 9B, Llama 3.1 8B, and the recent Ministral 8B, with win rates ranging from 60.4% to 70.6%
➡️ Aya Expanse 32B outperforms Gemma 2 27B, Mistral 8x22B, and Llama 3.1 70B (2x its size)
• ⚠️ But this performance eval comes from only one benchmark! Let's wait for Open-LLM-Leaderboard
🔒 Open weights, but CC-by-NC non-commercial license.

A Deepdive into Aya Expanse: Advancing the Frontier of Multilinguality
Published October 24, 2024
John Dang, Shivalika Singh, Daniel D'souza, Arash Ahmadian (Cohere For AI)

This is a guest blog post by the Cohere For AI team. Cohere For AI is Cohere's research lab that seeks to solve complex machine learning problems.
With the release of the Aya Expanse family, featuring 8B and 32B parameter models, we are addressing one of the most urgent challenges in AI: the lack of highly performant multilingual models that can rival the capabilities of monolingual ones. While AI has made tremendous progress, there remains a stark gap in the performance of models across multiple languages. Aya Expanse is the result of several years of dedicated research at C4AI: data arbitrage, multilingual preference training, safety tuning, and model merging.

These combined breakthroughs have resulted in new state-of-the-art multilingual performance. We evaluate our models on a set of evaluations including the Arena-Hard-Auto dataset (paper), translated to the 23 languages which we release for others to use here. In pairwise comparison, Aya Expanse 32B outperforms Gemma 2 27B, Mistral 8x22B, and Llama 3.1 70B, a model more than 2x its size, setting a new state-of-the-art for multilingual performance. We also release Aya Expanse 8B, which outperforms the leading open-weights models in its parameter class such as Gemma 2 9B, Llama 3.1 8B, and the recently released Ministral 8B with win rates ranging from 60.4% to 70.6%. We observe even larger gains across less challenging evals.

Aya Expanse 8B win rates · Aya Expanse 8B language-specific win rates vs Gemma 2 9B

We release both models as open weights for the research community, and hope it will further accelerate multilingual progress. In this blog post, we share technical details behind each of the key algorithmic components used in the training pipeline.
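
For readers who want to try the released weights before digging into the pipeline details, here is a minimal usage sketch with Transformers. The repository ID and chat-template call follow the published model card, but treat the exact generation settings as illustrative.

```python
# Minimal sketch: generate with Aya Expanse 8B via Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"  # open weights, CC-BY-NC license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "한국어로 자기소개를 한 문장으로 해줘."}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100, do_sample=True, temperature=0.3)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```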

Aya Expanse 32B win rates

Avoiding model collapse in synthetic data
The use of synthetic data – data generated by an expert or “teacher” model to train another model – has become increasingly central to the development of LLMs, particularly as model training has exhausted current data sources. However, for multilingual data, especially with low-resource languages, there are few good examples of teacher models, creating an extra added challenge to leveraging synthetic data. Furthermore, recent research has suggested that an over-reliance on synthetic data leads to model collapse.

In our recent work we demonstrate that these limitations can be addressed through “data arbitrage” – strategically sampling from a pool of teacher models. This approach has important implications as it challenges the traditional reliance on a single-teacher model for generating synthetic data. Instead, data arbitrage leverages performance variations among a pool of models. Although this technique is applicable to any domain, it is particularly suited to the multilingual setting, where the absence of a universally effective teacher that excels across all languages presents significant challenges. In the creation of high-quality synthetic multilingual datasets, multilingual arbitrage proves valuable by utilizing a diverse pool of models to strategically sample different parts of the data distribution for improved multilingual generations.

We first train a model pool for groups of languages and employ an Arbiter to evaluate and select the optimal generation. The Arbiter here is an internal reward model (RM) to score the model generations. In Reward-Based Routing, for each prompt in a given language, we generate completions from all models in the pool and score them using the reward model. The completion with the highest score is chosen as the final completion for that prompt. Our 8B model, even at the SFT stage trained with Multilingual Arbitrage, had over 9.1% improvement in win-rate measured against Gemma 2 9B compared to the previous Aya 23 model, demonstrating the effectiveness of this approach in leveraging diverse model strengths across languages.
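
A minimal sketch of the reward-based routing step, assuming you already have a pool of teacher models and a reward model exposed as Python callables (the stand-ins below are dummies, not Cohere's internal Arbiter): generate a completion from every teacher for each prompt, score them, and keep only the highest-scoring one.

```python
# Reward-based routing sketch: keep the best-scoring completion per prompt.
# The teacher and reward callables are placeholders, not real models.
from typing import Callable, Dict, List

def route_with_reward(
    prompts: List[str],
    teachers: Dict[str, Callable[[str], str]],
    reward: Callable[[str, str], float],
) -> List[Dict[str, str]]:
    curated = []
    for prompt in prompts:
        candidates = {name: generate(prompt) for name, generate in teachers.items()}
        best_teacher, best_completion = max(
            candidates.items(), key=lambda item: reward(prompt, item[1])
        )
        curated.append({"prompt": prompt, "completion": best_completion, "teacher": best_teacher})
    return curated

# Toy usage with dummy teachers and a length-based "reward model".
teachers = {
    "teacher_a": lambda p: p + " -> short answer",
    "teacher_b": lambda p: p + " -> a longer, more detailed answer",
}
reward = lambda prompt, completion: float(len(completion))
print(route_with_reward(["Translate 'hello' into Korean."], teachers, reward))
```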

Step by Step improvements in win rates against Gemma 2 9B

Iteratively Improving with Global Preferences
Following supervised fine-tuning, alignment to human preferences is a key step for training today’s state-of-the-art LLMs. Although heavily adopted, it is known that preference training is already challenging in a monolingual setting. Maximizing gains from preference training in a multilingual setting introduces even more challenges. The vast majority of existing preference datasets are exclusively English and the few existing multilingual preference datasets are often of low quality. Moreover, modeling many diverse languages simultaneously is known to be a difficult optimization problem where naively optimizing for performance in some languages often leads to regressions in performance in other languages.

In our recent work, we leverage a novel synthetic data generation technique to construct high-quality multilingual preference data pairs by contrasting in-language completions from a highly performant multilingual LLM with lower quality completions translated from English which were generated by a weaker model. This steers our model away from generating low-quality multilingual completions which often contain undesirable artifacts, such as those introduced by poor translation. We show that this method unlocks substantial gains in performance across all languages and often also results in gains for languages not included in the preference training data.

While this work also shows that preference training with online data outperforms its offline variant, during training of Aya Expanse, we found that the combination of first preference-training with offline data followed by preference-training with online data to be better than either online or offline training alone. In the first preference training stage, we train on data curated by taking the highest and lowest reward responses from the Arbitrage stage as the chosen and rejected completions, which makes the first stage of DPO training offline.

After offline preference training, we run online iterative DPO, where we sample multiple online generations for each prompt from the model trained during the last iteration, rank these generations with a Reward Model, and then further train on these preference pairs. For both models, we repeat this process for 3 iterations as we found that going beyond 3 iterations led to minimal gains at the cost of additional re-tuning parameters like regularization coefficient (beta) and sometimes introduced reward hacking behavior. Overall, for Aya Expanse 8B, the combination of offline and online preference training on top of the model trained with arbitrage, led to 7.1% additional gains in win rate against Gemma 2 9B.

Maximizing Performance through Model Merging
A reappearing problem throughout any post-training (and pre-training) pipeline, whether it consists of a single stage such as SFT, or a more complex multi-stage optimization pipeline, such as our pipeline above, is choosing the right data mixtures for training. The intricacies of this process demand considerable effort in fine-tuning hyperparameters and data combinations. Merging multiple models is an alternative approach for enabling complex multi-tasking at a reduced aggregate computational cost. In Aya Expanse, we directly build on the findings of our recent research paper and apply merging in both the Arbitrage phase, and at each iteration of preference training.

When training multiple separate models with the goal of merging, it is important to maximize diversity between checkpoints. However, this should be balanced with ensuring that each individual model within the pool achieves high performance. To balance these objectives, we maximize diversity between checkpoints by training models for different language families. This takes advantage of cross-lingual transfer which often provides significant performance benefits while ensuring that linguistic differences provide sufficient differentiation between checkpoints.

Naively, one could split-train a model for each language and then merge, but this does not achieve the same benefits we observe from cross-lingual transfer. To improve robustness in merging, we include some shared languages across each cluster (here English, Spanish, and French). In the final recipe, we used multiple stages of merging runs trained on different clusters of data, and checkpoints within the same run.

In addition to weighted linear averaging, we experiment with multiple merging techniques, namely SLERP, TIES-merging, and DARE-TIES. However, we found weighted averaging to be the most consistent method. As a result, we use weighted averaging throughout the pipeline. Interestingly, we observed significantly larger gains from merging at the 35B scale compared to the 8B scale – up to 3x. This is inline with recent work suggesting merging to be more effective at scale.
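
A schematic of the weighted linear averaging that the team found most consistent, written against plain PyTorch state dicts. The checkpoints and weights below are toy values; this is not the exact merging code used for Aya Expanse.

```python
# Weighted linear averaging of model checkpoints (toy sketch).
import torch

def weighted_average(state_dicts, weights):
    assert len(state_dicts) == len(weights) and len(state_dicts) > 0
    total = float(sum(weights))
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights)) / total
    return merged

# Example: merge two language-family checkpoints, weighting the first higher.
checkpoint_a = {"lm_head.weight": torch.ones(4, 4)}
checkpoint_b = {"lm_head.weight": torch.zeros(4, 4)}
merged = weighted_average([checkpoint_a, checkpoint_b], weights=[0.7, 0.3])
print(merged["lm_head.weight"][0, 0])  # 0.7
```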

Bringing it all Together
Components

These diagrams show our end-to-end post-training pipeline, which resulted in the step-by-step gains discussed earlier. It is truly special to look back and see how far the Aya model series has come, since its inception with Aya 101 accompanied by the Aya Collection, which stretched the limits of open-source collaboration, to now which combines steady progress in key open fundamental research questions to set a new standard for multilingual performance.

Combined

Acknowledgements
This work wouldn’t have been possible without the core Aya Expanse team: Madeline Smith, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis, Sara Hooker, John Dang, Shivalika Singh, Arash Ahmadian, Daniel D'souza, Alejandro Salamanca, Aidan Peppin, Arielle Bailey, Meor Amer, Sungjin Hong, Manoj Govindassamy, Sandra Kublik.

It also wouldn’t have been possible without the wider Cohere For AI and Cohere team. Special thanks to Acyr Locatelli, Adrien Morisot, Jon Ander Campos, Sara Elsharkawy, Eddie Kim, Julia Kreutzer, Nick Frosst, Aidan Gomez, Ivan Zhang.

A huge thanks also goes to our research community – the 220 language ambassadors from around the world who have been part of this release. Thank you to Sree Harsha Nelaturu, Bhavnick Minhas, Christopher Klamm, Isabella Bicalho Frazeto who contributed notebooks that are accessible on the model Hugging Face cards.

Special thank you to Hugging Face for helping make this come together: Omar Sanseviero, Pedro Cuenca, Vaibhav Srivastav, Lysandre Debut, Aritra Roy Gosthipaty.

###
https://huggingface.co/blog/sd3-5
🧨 Diffusers welcomes Stable Diffusion 3.5 Large
Published October 22, 2024
YiYi Xu, Aryan V S, Dhruv Nair, Sayak Paul, Linoy Tsaban, Apolinário (multimodalart), Alvaro Somoza, Aritra Roy Gosthipaty

Stable Diffusion 3.5 is the improved variant of its predecessor, Stable Diffusion 3. As of today, the models are available on the Hugging Face Hub and can be used with 🧨 Diffusers.
The release comes with two checkpoints:

A large (8B) model
A large (8B) timestep-distilled model enabling few-step inference
In this post, we will focus on how to use Stable Diffusion 3.5 (SD3.5) with Diffusers, covering both inference and training.

Table Of Contents
Architectural changes
Using SD3.5 with Diffusers
Performing inference with quantization
Training LoRAs with quantization
Using single-file loading
Important links
Architectural changes
The transformer architecture of SD3.5 (large) is very similar to SD3 (medium), with the following changes:

QK normalization: For training large transformer models, QK normalization has now become a standard, and SD3.5 Large is no exception.
Dual attention layers: Instead of using single attention layers for each stream of modality in the MMDiT blocks, SD3.5 uses double attention layers.
The rest of the details in terms of the text encoders, VAE, and noise scheduler stay exactly the same as in SD3 Medium. For more on SD3, we recommend checking out the original paper.
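
A minimal inference sketch with 🧨 Diffusers follows. The pipeline class and repository ID match the Hub release, but treat the sampling settings as illustrative defaults and expect the large checkpoint to need a GPU with substantial memory (or quantization).

```python
# Minimal SD3.5 Large inference sketch with Diffusers.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("sd35_large.png")
```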

###
https://huggingface.co/genmo/mochi-1-preview#running
10/22/24
Mochi 1
Dramatically closes the gap between closed and open video generation models. ✅
Apache 2.0 license 🤯
High-fidelity videos
Strong prompt adherence
Model available on 🤗 Hugging Face Hub
Mochi is a state-of-the-art video generation model and is released by Genmo.
You can start a Gradio 5 app by following the instructions on the model card on the Hugging Face Hub.
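
Since the initial release, the model has also gained a Diffusers integration; the sketch below assumes a recent diffusers version that ships the MochiPipeline class and follows the card's suggested memory-saving flags. The prompt, frame count, and fps are illustrative.

```python
# Sketch of Mochi 1 inference via Diffusers (assumes MochiPipeline is available
# in the installed diffusers version).
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # reduce peak GPU memory
pipe.enable_vae_tiling()

prompt = "A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl."
frames = pipe(prompt, num_frames=84).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```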

Introducing Mochi 1
The best open source video generation model

Mochi 1 is a research preview of the world's best open video generation model.
We're solving fundamental problems with AI video today.

Unmatched motion quality
Realistic motion that respects the laws of physics, down to the tiniest of details.

Superior prompt adherence
Get detailed control over characters, settings, and actions with exceptional alignment of videos with textual prompts.

Prompt: A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.

Crossing the uncanny valley
Mochi 1 generates consistent, fluid human action and expression.




###
https://huggingface.co/papers/2410.15735
AutoTrain: No-code training for state-of-the-art models
Published on Oct 21 · Submitted by derek-thomas on Oct 22
Authors:

Abhishek Thakur
Abstract
With the advancements in open-source models, training (or finetuning) models on custom datasets has become a crucial part of developing solutions which are tailored to specific industrial or open-source applications. Yet, there is no single tool which simplifies the process of training across different types of modalities or tasks. We introduce AutoTrain (aka AutoTrain Advanced) -- an open-source, no code tool/library which can be used to train (or finetune) models for different kinds of tasks such as: large language model (LLM) finetuning, text classification/regression, token classification, sequence-to-sequence task, finetuning of sentence transformers, visual language model (VLM) finetuning, image classification/regression and even classification and regression tasks on tabular data. AutoTrain Advanced is an open-source library providing best practices for training models on custom datasets. The library is available at https://github.com/huggingface/autotrain-advanced. AutoTrain can be used in fully local mode or on cloud machines and works with tens of thousands of models shared on Hugging Face Hub and their variations.

🤗 AutoTrain Advanced
AutoTrain Advanced: faster and easier training and deployments of state-of-the-art machine learning models. AutoTrain Advanced is a no-code solution that allows you to train machine learning models in just a few clicks. Please note that you must upload data in the correct format for a project to be created. For help regarding proper data format and pricing, check out the documentation.

NOTE: AutoTrain is free! You only pay for the resources you use in case you decide to run AutoTrain on Hugging Face Spaces. When running locally, you only pay for the resources you use on your own infrastructure.

###
https://huggingface.co/blog/s2s_endpoint
Hugging Face
October 22, 2024
Deploying Speech-to-Speech on Hugging Face

🔥 Announcing our latest development with Hugging Face's Speech-to-Speech! 🔥
🚀 You asked for lightning-fast latency without the hassle of server setups or cloud provider issues. We listened!
💰 Dive into our new blog post, where we show how to use Hugging Face's Inference Endpoints to deliver ultra-low latency on an L4 GPU—for just $0.80/hour!
🛠️ We created a custom Docker image for low latency, and we're open-sourcing the entire solution for everyone to use!
🎥 The video shows a word game played against Llama 3.1 8B; the latency is so low that the game flows seamlessly!
👉 Read all about how we did it in our blog post:
https://lnkd.in/e-4sKJnB
💻 Want to try it yourself? Here's the code to get you started:
https://lnkd.in/eVEQYHWA
🚀
We can't wait to see what you'll build with this! Feedback is more than appreciated. 🙌

Deploying Speech-to-Speech on Hugging Face
Published October 22, 2024
Andres Marafioti, Derek Thomas, Diego Maniloff, Eustache Le Bihan
Introduction
Speech-to-Speech (S2S) is an exciting new project from Hugging Face that combines several advanced models to create a seamless, almost magical experience: you speak, and the system responds with a synthesized voice.

The project implements a cascaded pipeline leveraging models available through the Transformers library on the Hugging Face hub. The pipeline consists of the following components:

Voice Activity Detection (VAD)
Speech to Text (STT)
Language Model (LM)
Text to Speech (TTS)
What's more, S2S has multi-language support! It currently supports English, French, Spanish, Chinese, Japanese, and Korean. You can run the pipeline in single-language mode or use the auto flag for automatic language detection. Check out the repo for more details here.

> 👩🏽‍💻: That's all amazing, but how do I run S2S?
> 🤗: Great question!

Running Speech-to-Speech requires significant computational resources. Even on a high-end laptop you might encounter latency issues, particularly when using the most advanced models. While a powerful GPU can mitigate these problems, not everyone has the means (or desire!) to set up their own hardware.

This is where Hugging Face's Inference Endpoints (IE) come into play. Inference Endpoints allow you to rent a virtual machine equipped with a GPU (or other hardware you might need) and pay only for the time your system is running, providing an ideal solution for deploying performance-heavy applications like Speech-to-Speech.

In this blog post, we’ll guide you step by step to deploy Speech-to-Speech to a Hugging Face Inference Endpoint. This is what we'll cover:

Understanding Inference Endpoints and a quick overview of the different ways to setup IE, including a custom container image (which is what we'll need for S2S)
Building a custom docker image for S2S
Deploying the custom image to IE and having some fun with S2S!
Inference Endpoints
Inference Endpoints provide a scalable and efficient way to deploy machine learning models. These endpoints allow you to serve models with minimal setup, leveraging a variety of powerful hardware. Inference Endpoints are ideal for deploying applications that require high performance and reliability, without the need to manage underlying infrastructure.

Here are a few key features, and be sure to check out the documentation for more:

Simplicity: You can be up and running in minutes thanks to IE's direct support of models on the Hugging Face hub.
Scalability: You don't have to worry about scale, since IE scales automatically, including to zero, to handle varying loads and save costs.
Customization: You can customize the setup of your IE to handle new tasks. More on this below.
Inference Endpoints supports all of the Transformers and Sentence-Transformers tasks, but can also support custom tasks. These are the IE setup options:

Pre-built Models: Quickly deploy models directly from the Hugging Face hub.
Custom Handlers: Define custom inference logic for more complex pipelines.
Custom Docker Images: Use your own Docker images to encapsulate all dependencies and custom code.
For simpler models, options 1 and 2 are ideal and make deploying with Inference Endpoints super straightforward. However, for a complex pipeline like S2S, you will need the flexibility of option 3: deploying your IE using a custom Docker image.

This method not only provides more flexibility but also improved performance by optimizing the build process and gathering necessary data. If you’re dealing with complex model pipelines or want to optimize your application deployment, this guide will offer valuable insights.

Deploying Speech-to-Speech on Inference Endpoints
Let's get into it!

Building the custom Docker image
To begin creating a custom Docker image, we started by cloning Hugging Face’s default Docker image repository. This serves as a great starting point for deploying machine learning models in inference tasks.

git clone https://github.com/huggingface/huggingface-inference-toolkit

Why Clone the Default Repository?
Solid Foundation: The repository provides a pre-optimized base image designed specifically for inference workloads, which gives a reliable starting point.
Compatibility: Since the image is built to align with Hugging Face’s deployment environment, this ensures smooth integration when you deploy your own custom image.
Ease of Customization: The repository offers a clean and structured environment, making it easy to customize the image for the specific requirements of your application.
You can check out all of our changes here.

Customizing the Docker Image for the Speech-to-Speech Application
With the repository cloned, the next step was tailoring the image to support our Speech-to-Speech pipeline.

Adding the Speech-to-Speech Project
To integrate the project smoothly, we added the speech-to-speech codebase and any required datasets as submodules. This approach offers better version control, ensuring the exact version of the code and data is always available when the Docker image is built.

By including data directly within the Docker container, we avoid having to download it each time the endpoint is instantiated, which significantly reduces startup time and ensures the system is reproducible. The data is stored in a Hugging Face repository, which provides easy tracking and versioning.

git submodule add https://github.com/huggingface/speech-to-speech.git
git submodule add https://huggingface.co/andito/fast-unidic

Optimizing the Docker Image
Next, we modified the Dockerfile to suit our needs:

Streamlining the Image: We removed packages and dependencies that weren’t relevant to our use case. This reduces the image size and cuts down on unnecessary overhead during inference.
Installing Requirements: We moved the installation of requirements.txt from the entry point to the Dockerfile itself. This way, the dependencies are installed when building the Docker image, speeding up deployment since these packages won’t need to be installed at runtime.
Deploying the Custom Image
Once the modifications were in place, we built and pushed the custom image to Docker Hub:

DOCKER_DEFAULT_PLATFORM="linux/amd64" docker build -t speech-to-speech -f dockerfiles/pytorch/Dockerfile .
docker tag speech-to-speech andito/speech-to-speech:latest
docker push andito/speech-to-speech:latest

With the Docker image built and pushed, it’s ready to be used in the Hugging Face Inference Endpoint. By using this pre-built image, the endpoint can launch faster and run more efficiently, as all dependencies and data are pre-packaged within the image.
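If you prefer to create the endpoint programmatically instead of through the UI, the huggingface_hub library exposes create_inference_endpoint, which accepts a custom_image dictionary pointing at a Docker image. The sketch below is illustrative only: the endpoint name, the attached repository, the task value, the health route, the environment variables, and the L4 instance strings (instance_type / instance_size) are assumptions you should check against the Inference Endpoints catalog and the blog's actual configuration.

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "speech-to-speech-demo",                      # hypothetical endpoint name
    repository="andito/fast-unidic",              # placeholder Hub repo to attach the endpoint to
    framework="pytorch",
    task="custom",                                # assumed: S2S is not a built-in task
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",                           # assumed L4 sizing; verify in the IE catalog
    instance_type="nvidia-l4",
    custom_image={
        "health_route": "/health",                # assumed health-check route of the container
        "env": {},                                # any environment variables the container needs
        "url": "andito/speech-to-speech:latest",  # the image pushed to Docker Hub above
    },
)
endpoint.wait()       # block until the endpoint reports it is running
print(endpoint.url)   # base URL to send requests to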

###
https://huggingface.co/microsoft/OmniParser
Microsoft
October 25, 2024
OmniParser
🤩 Microsoft has casually dropped this gem to enable GPT-4V to navigate your computer! It looks like 'computer use' is the next battleground.
More details👇
> Screen Parsing tool for Pure Vision Based GUI Agent
> A method for parsing user interface screenshots into structured and easy-to-understand elements.
> This significantly enhances the ability of OpenAI's GPT-4V to generate actions 🤯
> Makes it possible for powerful LLMs to accurately ground the corresponding regions of interest in an interface.
> 🚀 Understanding the user interfaces like never before!

Model Summary
OmniParser is a general screen-parsing tool that interprets/converts UI screenshots into a structured format to improve existing LLM-based UI agents. Training datasets include: 1) an interactable icon detection dataset, curated from popular web pages and automatically annotated to highlight clickable and actionable regions, and 2) an icon description dataset, designed to associate each UI element with its corresponding function.

This model hub includes a fine-tuned version of YOLOv8 and a fine-tuned BLIP-2 model, trained on the above datasets respectively. For more details on the models used and the fine-tuning procedure, please refer to the paper.
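As a rough illustration of that two-stage parse, the sketch below runs a fine-tuned YOLOv8 detector to propose interactable regions and then captions each crop with a fine-tuned BLIP-2 model. It is a hedged approximation rather than the official inference code, and the checkpoint paths are placeholders; download the actual weights from the model hub and adjust accordingly.

from PIL import Image
from ultralytics import YOLO
from transformers import Blip2Processor, Blip2ForConditionalGeneration

detector = YOLO("weights/icon_detect/best.pt")                      # placeholder path to the fine-tuned YOLOv8 weights
processor = Blip2Processor.from_pretrained("weights/icon_caption")  # placeholder path to the fine-tuned BLIP-2 weights
captioner = Blip2ForConditionalGeneration.from_pretrained("weights/icon_caption")

screenshot = Image.open("screenshot.png")
boxes = detector(screenshot)[0].boxes.xyxy.tolist()  # one [x1, y1, x2, y2] box per detected element

elements = []
for x1, y1, x2, y2 in boxes:
    crop = screenshot.crop((int(x1), int(y1), int(x2), int(y2)))
    inputs = processor(images=crop, return_tensors="pt")
    out_ids = captioner.generate(**inputs, max_new_tokens=20)
    caption = processor.batch_decode(out_ids, skip_special_tokens=True)[0].strip()
    elements.append({"bbox": [x1, y1, x2, y2], "caption": caption})

print(elements)  # a structured list an LLM agent such as GPT-4V can ground its actions on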

Responsible AI Considerations
Intended Use
OmniParser is designed to convert an unstructured screenshot image into a structured list of elements, including the locations of interactable regions and captions describing each icon's potential functionality.
OmniParser is intended to be used in settings where users are already trained on responsible analytic approaches and critical reasoning is expected. OmniParser can extract information from a screenshot, but human judgement is still needed to review its output.
OmniParser is intended to be used on a variety of screenshots, including both PC and phone screens, and across various applications.

###
https://arxiv.org/abs/2410.12896
[Submitted on 16 Oct 2024]
A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang
The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2410.12896 [cs.CL]
(or arXiv:2410.12896v1 [cs.CL] for this version)

###
https://openai.com/index/simplifying-stabilizing-and-scaling-continuous-time-consistency-models/
October 23, 2024
OpenAI

Simplifying, stabilizing, and scaling continuous-time consistency models
Continuous-time consistency models with sample quality comparable to leading diffusion models in just two sampling steps.

Diffusion models have revolutionized generative AI, enabling remarkable advances in generating realistic images, 3D models, audio, and video. However, despite their impressive results, these models are slow at sampling.

We are sharing a new approach, called sCM, which simplifies the theoretical formulation of continuous-time consistency models, allowing us to stabilize and scale their training for large scale datasets. This approach achieves comparable sample quality to leading diffusion models, while using only two sampling steps. We are also sharing our research paper to support further progress in this field.


Figure: Sampling procedure of consistency models. Sampling time measured on a single A100 GPU with batch size = 1.
Figure: Sampling procedure of diffusion models. Sampling time measured on a single A100 GPU with batch size = 1.
Introduction
Current sampling approaches of diffusion models often require dozens to hundreds of sequential steps to generate a single sample, which limits their efficiency and scalability for real-time applications. Various distillation techniques have been developed to accelerate sampling, but they often come with limitations, such as high computational costs, complex training, and reduced sample quality.

Extending our previous research on consistency models 1,2, we have simplified the formulation and further stabilized the training process of continuous-time consistency models. Our new approach, called sCM, has enabled us to scale the training of continuous-time consistency models to an unprecedented 1.5 billion parameters on ImageNet at 512×512 resolution. sCMs can generate samples with quality comparable to diffusion models using only two sampling steps, resulting in a ~50x wall-clock speedup. For example, our largest model, with 1.5 billion parameters, generates a single sample in just 0.11 seconds on a single A100 GPU without any inference optimization. Additional acceleration is easily achievable through customized system optimization, opening up possibilities for real-time generation in various domains such as image, audio, and video.

For rigorous evaluation, we benchmarked sCM against other state-of-the-art generative models by comparing both sample quality, using the standard Fréchet Inception Distance (FID) scores (where lower is better), and effective sampling compute, which estimates the total compute cost for generating each sample. As shown below, our 2-step sCM produces samples with quality comparable to the best previous methods while using less than 10% of the effective sampling compute, significantly accelerating the sampling process.
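For reference (not spelled out in the post), FID fits a Gaussian to Inception-v3 features of real and generated images and measures the distance between the two distributions:

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the feature means and covariances of the real and generated samples; lower values indicate that the generated distribution is closer to the real one.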

Figure: Scatter plot comparing FID (lower is better) against effective sampling compute for sCM (ours), BigGAN, StyleGAN-XL, ADM-G, U-ViT-H/4, MaskGIT, and DiT-XL/2.
How it works
Consistency models offer a faster alternative to traditional diffusion models for generating high-quality samples. Unlike diffusion models, which generate samples gradually through a large number of denoising steps, consistency models aim to convert noise directly into noise-free samples in a single step. This difference is visualized by paths in the diagram: the blue line represents the gradual sampling process of a diffusion model, while the red curve illustrates the more direct, accelerated sampling of a consistency model. Using techniques like consistency training or consistency distillation 1,2, consistency models can be trained to generate high-quality samples with significantly fewer steps, making them appealing for practical applications that require fast generation.

Figure: Diagram illustrating ODE trajectories between data and noise, with points labeled x_0, x_θ, x_{t−Δt}, and x_t connected by curved and straight paths.
Illustration of diffusion model sampling (blue) and consistency model sampling (red).
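To make the two-step sampling procedure concrete, here is a minimal sketch under generic consistency-model conventions: a trained consistency function f_theta(x, t) maps a noisy sample at noise level t directly to a clean estimate, and the second step re-noises that estimate to an intermediate level before mapping it back to data. The function f_theta, the noise scales sigma_max and t_mid, and the conditioning interface are assumptions for illustration, not OpenAI's exact implementation.

import torch

@torch.no_grad()
def sample_two_step(f_theta, shape, sigma_max=80.0, t_mid=1.0, device="cuda"):
    # Step 1: draw pure noise at the maximum noise level and map it straight to a clean sample
    x = torch.randn(shape, device=device) * sigma_max
    t_max = torch.full((shape[0],), sigma_max, device=device)
    x0 = f_theta(x, t_max)
    # Step 2: perturb the estimate to an intermediate noise level, then map back to data once more
    x_mid = x0 + torch.randn_like(x0) * t_mid
    t = torch.full((shape[0],), t_mid, device=device)
    return f_theta(x_mid, t)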

We've trained a continuous-time consistency model with 1.5B parameters on ImageNet 512x512, and provided two-step samples from this model to demonstrate its capabilities.

Selected 2-step samples from a continuous-time consistency model trained on ImageNet 512x512.



Image gallery: 2-step samples spanning animals, insects, sea life, landscapes, and everyday objects (e.g., a snake, a white cockatoo, a jellyfish, a snow leopard, a monarch butterfly, a cheeseburger, snow-capped mountains, and a coastal bay).
Our sCM distills knowledge from a pre-trained diffusion model. A key finding is that sCMs improve proportionally with the teacher diffusion model as both scale up. Specifically, the relative difference in sample quality, measured by the ratio of FID scores, remains consistent across several orders of magnitude in model sizes, causing the absolute difference in sample quality to diminish at scale. Additionally, increasing the sampling steps for sCMs further reduces the quality gap. Notably, two-step samples from sCMs are already comparable (with less than a 10% relative difference in FID scores) to samples from the teacher diffusion model, which requires hundreds of steps to generate.

Figure (FID scaling): FID versus single-forward-pass FLOPs for 1-step sCM, 2-step sCM, and the teacher diffusion model across model sizes (S, M, L, XL, XXL); FID decreases as compute increases, with diffusion performing best.
Figure (FID ratio scaling): FID ratio relative to the teacher diffusion model (constant at 1.0) versus single-forward-pass FLOPs for 1-step and 2-step sCM across the same model sizes.
sCM scales commensurately with teacher diffusion models.

Limitations
The best sCMs still rely on pre-trained diffusion models for initialization and distillation, resulting in a small but consistent gap in sample quality compared to the teacher diffusion model. Additionally, FID as a metric for sample quality has its own limitations; being close in FID scores does not always reflect actual sample quality, and vice versa. Therefore, the quality of sCMs may need to be assessed differently depending on the requirements of specific applications.

What's next
We will continue to work toward developing better generative models with both improved inference speed and sample quality. We believe these advancements will unlock new possibilities for real-time, high-quality generative AI across a wide range of domains.

