Google DeepMind에서는 AlphaFold 3의 추론코드를 공개하여 신약 개발 분야에 혁신을 가져왔습니다. Alibaba Cloud는 Qwen2.5-Coder-32B-Instruct 모델을 공개하여 Anthropic Claude Sonnet 3.5와 경쟁할 수 있는 오픈 LLM을 선보였습니다. Tencent는 새로운 대형 MoE 모델인 Hunyuan-Large를 출시하여 Meta Llama 3.1 405B를 능가하는 성능을 보였습니다. FishAudio는 70만 시간의 다국어 오디오로 훈련된 음성-음성 모델인 Fish Agent v0.1 3B를 공개하였습니다. AMD는 1B 파라미터의 언어 모델인 AMD OLMo를 출시하여 OpenELM과 Tiny Llama를 능가하는 성능을 보였습니다. Standard Intelligence는 음성 전용 베이스 모델인 hertz-dev를 발표하였습니다. GitHub는 AI를 활용한 마이크로 앱 생성 도구인 GitHub Spark를 공개하였습니다. GitHub의 Octoverse 2024 보고서에 따르면, AI의 확산으로 인해 Python이 가장 인기 있는 언어로 부상하였습니다. Google은 제로샷으로 개인화된 인물 이미지를 생성하는 새로운 AI 모델을 발표하였습니다. Mixture of In-Context Learners 논문에서는 In-Context Learning의 효율성을 높이는 새로운 방법을 제안하였습니다. NVIDIA는 TensorRT-LLM MultiShot을 활용하여 NVSwitch에서 AllReduce 속도를 3배 향상시켰습니다. IBM은 문서 파싱 도구인 Docling을 공개하여 다양한 문서 형식을 손쉽게 처리할 수 있게 하였습니다.

Google DeepMind, AlphaFold 3 출시

링크, 11/11/24

  • DeepMind는 AlphaFold 3의 추론 코드베이스, 모델 가중치 및 온디맨드 서버를 공개함
  • AlphaFold 3는 단백질, DNA, RNA, 리간드, 이온 등의 고정밀 생체분자 구조 예측을 단일 플랫폼에서 가능하게 함
  • 모델은 화학적 수정 및 생체분자 복합체의 3차원 구조를 정확히 예측할 수 있음
  • AlphaFold 3는 신약 개발 분야에서 새로운 패러다임을 제시하며 과학계의 큰 관심을 받음
  • 논문 발표 후 한 달도 안 되어 인용 수가 25회에 달하는 등 높은 주목을 받음
  • AlphaFold 3는 Evoformer에서 Pairformer로 내부 모듈을 변경하여 계산 자원과 시간을 단축함
  • 생성형 AI 기술인 Diffusion을 구조 예측 네트워크에 도입하여 원자 단위의 3차원 좌표를 예측함
  • AlphaFold 3의 한계점으로 chirality 문제와 hallucination 문제가 언급되었으며, 추가적인 개선의 여지가 있음

Alibaba Cloud, Qwen2.5-Coder-32B-Instruct 공개

링크, 11/11/24

  • Anthropic Claude Sonnet 3.5와 경쟁할 수 있는 오픈 LLM을 공개함
  • Qwen2.5-Coder-32B는 여러 코딩 벤치마크에서 Claude Sonnet 3.5와 대등한 성능을 보임
  • HumanEval에서 92.7, EvalPlus에서 86.3의 점수로 Claude 3.5 Sonnet을 능가함
  • 코드 생성, 코드 추론 및 코드 수리에 있어 성능이 크게 향상됨
  • 40개 이상의 프로그래밍 언어를 지원하며 128K의 컨텍스트 길이를 가짐
  • Apache 2.0 라이선스로 공개되어 Hugging Face에서 이용 가능함

Tencent, Hunyuan-Large 모델 출시

링크, 11/3/24

  • Tencent는 1.5조 개의 합성 데이터로 훈련된 새로운 대형 MoE 모델을 공개함
  • 389B-A52B MoE 모델로 Meta Llama 3.1 405B를 능가하는 성능을 보임
  • 총 389B 파라미터 중 생성 시 52B 파라미터가 활성화됨
  • 영어와 중국어 데이터를 중심으로 훈련됨
  • HumanEval, MBPP 등 코드 생성 벤치마크에서도 우수한 성능을 보임
  • 7조 개의 토큰으로 훈련되었으며, 그 중 1.5조 개는 합성 데이터임
  • 월간 활성 사용자 1억 명 미만에서는 상업적 이용이 가능한 커스텀 라이선스로 공개되었으나, EU 내의 시민과 기업은 사용이 제한됨

FishAudio, Fish Agent v0.1 3B 공개

링크, 11/1/24

  • 70만 시간의 다국어 오디오로 훈련된 음성-음성 모델을 공개함
  • Qwen-2.5-3B-Instruct를 기반으로 2000억 개의 오디오 및 텍스트 토큰으로 추가 훈련됨
  • 제로샷 음성 복제를 지원함
  • 텍스트 및 오디오 입력/오디오 출력을 지원함
  • 첫 오디오 출력까지 약 200ms(TTFA)의 초고속 추론이 가능함
  • 모델은 Hugging Face에서 이용 가능하며, 파인튜닝 코드도 곧 공개 예정임

AMD, AMD OLMo 1B 언어 모델 발표

링크, 10/31/24

  • 1B 파라미터의 언어 모델인 AMD OLMo를 공개함
  • OpenELM과 Tiny Llama를 능가하는 성능을 보이며, Apache 2.0 라이선스로 공개됨
  • 16개의 노드, 각 노드에 4개의 MI250 GPU를 사용하여 1.3조 개의 토큰으로 훈련됨
  • 세 가지 체크포인트 공개: Pre-trained, SFT, SFT DPO
  • SFT는 Tulu V2, OpenHermes-2.5, WebInstructSub, Code-Feedback 데이터셋으로 진행됨
  • DPO를 통해 UltraFeedback 데이터셋으로 인간의 선호도에 맞게 정렬됨
  • MT Bench, Alpaca Eval에서 OpenELM, Tiny Llama보다 우수한 성능을 보임

Standard Intelligence, hertz-dev 발표

링크, 11/6/24

  • 8.5B 파라미터의 음성 전용 베이스 모델인 hertz-dev를 공개함
  • 2000만 시간의 오디오 데이터로 훈련됨
  • 음성-음성, 번역, 분류, 음성 인식, 텍스트-음성 변환 등 다양한 다운스트림 작업에 활용 가능함
  • Apache 2.0 라이선스로 공개되어 모델 체크포인트를 이용할 수 있음

GitHub, GitHub Spark 공개

링크, 11/1/24

  • AI를 활용하여 마이크로 앱(“sparks”)을 생성하고 공유할 수 있는 도구인 GitHub Spark를 발표함
  • 코드 작성 없이 자연어 기반 편집기로 아이디어를 표현하고 앱을 생성할 수 있음
  • 관리형 런타임 환경을 제공하여 데이터 저장, 테마, LLM 접근을 지원함
  • 대시보드를 통해 데스크톱 및 모바일 기기에서 스파크를 관리하고 실행할 수 있음
  • 사용자 정의 및 개인화된 소프트웨어 생성이 용이해짐

GitHub, Octoverse 2024 보고서 발표

링크, 10/29/24

  • AI의 확산으로 인해 Python이 GitHub에서 가장 인기 있는 언어로 부상함
  • 전 세계 개발자 수가 급증하였으며, 특히 아프리카, 라틴 아메리카, 아시아에서 두드러짐
  • Generative AI 프로젝트에 대한 글로벌 활동이 증가하였으며, 미국 외 지역에서의 기여도가 높아짐
  • 오픈 소스 활동이 전통적인 소프트웨어 개발을 넘어 확장되고 있음
  • Jupyter Notebooks의 사용이 92% 증가하여 데이터 과학 및 머신러닝 분야의 성장을 반영함

Google, 제로샷 개인화된 인물 이미지 생성 모델 발표

링크, 11/11/24

  • 입력된 셀피를 다양한 예술적 스타일로 변환하는 새로운 AI 모델을 공개함
  • 이미지 어댑터와 컨트롤 어댑터를 사용하여 얼굴의 세부 특징과 포즈, 표정을 정확히 캡처함
  • 사용자는 원하는 스타일과 표정을 텍스트 프롬프트로 지정하여 이미지를 생성할 수 있음
  • 모델은 다양한 스타일(3D 카툰, 수채화, 애니메이션, 연필 스케치 등)을 지원함
  • Imagen on Vertex AI를 통해 모델에 접근 가능함

Mixture of In-Context Learners 논문 발표

링크, 11/5/24

  • In-Context Learning에서 데모를 하위 집합으로 나누어 전문가로 취급하고, 가중치 함수를 학습하여 출력 분포를 결합하는 새로운 접근법 제안 (아래 개념 스케치 참고)
  • 블랙박스 LLM에 적용 가능하며, 데이터, 메모리, 계산 효율성이 높음
  • 노이즈가 있는 데모와 레이블 불균형에 강인함
  • 간단한 방법으로 현재 LLM의 In-Context Learning 성능을 향상시킴
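
다음은 논문의 공개 구현이 아니라, 위에서 설명한 아이디어(데모 하위 집합별 전문가의 다음 토큰 분포를 학습된 가중치로 결합)를 가정해 작성한 최소한의 개념 스케치입니다. 함수명과 어휘 크기 등은 설명을 위한 임의 설정입니다.

```python
import numpy as np

def mixture_of_icl_experts(expert_logits: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """데모 하위 집합(전문가)별 다음 토큰 로짓을 학습된 가중치로 결합한다.

    expert_logits: (num_experts, vocab_size) - 각 데모 부분집합으로 조건화한 LLM의 다음 토큰 로짓
    weights: (num_experts,) - 학습된 가중치 함수의 출력 (소프트맥스 전 값)
    """
    # 전문가 가중치를 확률로 정규화
    w = np.exp(weights - weights.max())
    w /= w.sum()
    # 각 전문가의 출력 분포(소프트맥스)를 가중 평균하여 최종 분포를 얻는다
    probs = np.exp(expert_logits - expert_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return (w[:, None] * probs).sum(axis=0)

# 사용 예: 블랙박스 LLM을 3개의 데모 부분집합으로 각각 호출해 얻은 로짓을 결합 (값은 임의)
logits = np.random.randn(3, 32000)    # 가정: 전문가 3명, 어휘 크기 32000
weights = np.array([0.2, 1.5, -0.3])  # 가정: 학습된 가중치 함수의 출력
next_token_dist = mixture_of_icl_experts(logits, weights)
```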

NVIDIA, TensorRT-LLM MultiShot 발표

링크, 11/1/24

  • NVSwitch와 TensorRT-LLM MultiShot을 활용하여 AllReduce 통신 속도를 최대 3배 향상시킴
  • 멀티 GPU 환경에서의 통신 병목 현상을 개선하여 저지연 추론 성능을 향상시킴
  • 기존의 링 기반 AllReduce 알고리즘의 통신 단계를 2단계로 줄여 지연 시간을 감소시킴 (아래 개념 스케치 참고)
  • NVSwitch의 멀티캐스트 기능을 활용하여 데이터 전송 효율을 높임
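
아래 코드는 NVIDIA의 실제 구현이 아니라, 위 설명의 2단계 방식(각 GPU가 자신이 담당한 구간만 합산하는 reduce-scatter, 이어서 NVSwitch 멀티캐스트에 해당하는 브로드캐스트)을 numpy로 흉내 낸 개념 스케치입니다.

```python
import numpy as np

def multishot_allreduce_sim(tensors: list[np.ndarray]) -> list[np.ndarray]:
    """2단계 AllReduce 개념 시뮬레이션: reduce-scatter 후 멀티캐스트 브로드캐스트.

    tensors: GPU 수만큼의 동일 크기 1차원 배열 목록 (각 GPU가 가진 데이터라고 가정)
    """
    world_size = len(tensors)
    chunks = [np.array_split(t, world_size) for t in tensors]

    # 1단계(reduce-scatter): GPU r는 자신이 담당한 r번째 구간만 모아 합산
    reduced = [sum(chunks[src][r] for src in range(world_size)) for r in range(world_size)]

    # 2단계(멀티캐스트 브로드캐스트에 해당): 합산된 구간들을 한 번에 모든 GPU로 전파
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(world_size)]

# 사용 예: GPU 8개. 링 방식이라면 2*(8-1)=14 스텝이 필요하지만 여기서는 논리적으로 2단계
outs = multishot_allreduce_sim([np.ones(1024) * r for r in range(8)])
assert np.allclose(outs[0], sum(range(8)))
```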

IBM, Docling 도구 공개

링크, 11/1/24

  • 문서를 파싱하고 원하는 형식으로 빠르고 쉽게 내보낼 수 있는 도구인 Docling을 발표함 (아래 사용 예시 참고)
  • PDF, DOCX, PPTX, 이미지, HTML, Markdown 등 인기 있는 문서 형식을 읽고 Markdown과 JSON으로 내보낼 수 있음
  • 고급 PDF 문서 이해를 지원하여 페이지 레이아웃, 읽기 순서, 테이블 구조를 파악함
  • LlamaIndex 및 LangChain과의 쉬운 통합으로 강력한 RAG/QA 애플리케이션에 활용 가능
  • OCR을 통한 스캔된 PDF 지원 및 간단한 CLI 제공
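
아래는 Docling의 기본 사용 흐름을 보여주는 최소 예시 스케치입니다. 파일 경로는 임의의 값이며, 세부 API는 버전에 따라 다를 수 있습니다.

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")              # PDF, DOCX, PPTX, HTML 등 지원 (경로는 예시)
markdown_text = result.document.export_to_markdown()  # 파싱 결과를 Markdown으로 내보내기
print(markdown_text[:500])
```
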
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://github.com/google-deepmind/alphafold3
Google Deepmind
11/11/24

Wohoo! DeepMind released AlphaFold 3 inference codebase, model weights and an on-demand server! ⚡
AlphaFold can generate highly accurate biomolecular structure predictions containing proteins, DNA, RNA, ligands, ions, and also model chemical modifications for proteins and nucleic acids in one platform.
AlphaFold 3
This package provides an implementation of the inference pipeline of AlphaFold 3. See below for how to access the model parameters. You may only use AlphaFold 3 model parameters if received directly from Google. Use is subject to these terms of use.

Any publication that discloses findings arising from using this source code, the model parameters or outputs produced by those should cite the Accurate structure prediction of biomolecular interactions with AlphaFold 3 paper.

Please also refer to the Supplementary Information for a detailed description of the method.

AlphaFold 3 is also available at alphafoldserver.com for non-commercial use, though with a more limited set of ligands and covalent modifications.

If you have any questions, please contact the AlphaFold team at alphafold@google.com.

Obtaining Model Parameters
This repository contains all necessary code for AlphaFold 3 inference. To request access to the AlphaFold 3 model parameters, please complete this form. Access will be granted at Google DeepMind’s sole discretion. We will aim to respond to requests within 2–3 business days. You may only use AlphaFold 3 model parameters if received directly from Google. Use is subject to these terms of use.

AlphaFold3 리뷰 - Google DeepMind, 신약 개발의 새로운 패러다임을 제시하다
AlphaFold 3가 약 9개월간 베일에 싸여있다가 이번 논문을 통해 드디어 세상에 공개됐습니다. AlphaFold 2와는 어떤 점이 크게 달라지고, 발전했을까요? 이번 포스트에서는 AlphaFold3에 대해 최대한 쉽고 간략하게 알려드립니다.

지난 5월 8일, AI 및 신약 개발 분야를 비롯한 과학계 전체를 흥분 시킬 만 한 소식이 전해졌습니다. 바로 구글 딥마인드(Google DeepMind)에서 차세대 AlphaFold, 속칭 AlphaFold3에 대한 논문을 발표한 것 입니다. AlphaFold3는 지난 2023년 8월에 딥마인드에서 그 등장을 처음 예고한 직후 단일 단백질뿐만 아니라 단백질-단백질, 단백질-핵산, 단백질-리간드에 이르기까지 거의 모든 유형의 단백질 기반 생체 분자 복합체의 3차원 구조를 정확히 예측할 수 있다는 점에서 많은 화제를 모으고 있습니다. 이 놀라운 기술이 약 9개월간 베일에 싸여있다가 이번 논문을 통해 드디어 세상에 공개된 것입니다.

해당 논문은 발표된 지 한 달이 약간 안 된 현재 시점(2024.06.05)에 벌써 인용 수가 25회에 달할 정도로 AlphaFold3를 필두로 한 AI 기반 신약 개발 분야는 현시점 가장 주목 받는 신기술 중 하나입니다. 따라서 저희 히츠 AI 연구팀과 같은 AI 연구자들은 물론이고 제약바이오 분야에 종사하시는 많은 분들이 이번에 공개된 AlphaFold3에 대해 관심이 많으실 거라고 생각되는데요. 하지만 AlphaFold3는 현재 전 세계의 AI 기술 발전을 주도하고 있는 그룹 중 하나인 딥마인드의 작품답게 매우 복잡하고 정교하게 설계된 모델이라 AI 기술에 익숙하지 않으시면 이해하기 어려우실 수 있습니다. 이번 포스트에서는 여러분들의 이해를 돕고자 AlphaFold3에 대해 최대한 쉽고 간략하게 리뷰하고자 합니다.

전체적으로 AlphaFold3는 이전 AlphaFold2를 기반으로 재설계된 모델이라고 할 수 있을 정도로 내부 구조나 알고리즘적으로 유사한 점이 많기에, 이번 리뷰는 AlphaFold3에서 AlphaFold2와 비교했을 때 바뀐 부분을 위주로 설명하며 진행하고자 합니다. 따라서 AlphaFold2에 대한 사전 지식이 없으신 분들은 이를 리뷰한 이전 포스트를 먼저 보시고 오시는 것을 추천해 드립니다. 그럼, 본격적으로 AlphaFold3 리뷰를 시작하겠습니다.

AlphaFold3, 무엇이 달라졌나
이번 AlphaFold3리뷰도 이전 포스트와 마찬가지로 전체 내부 구조와 흐름을 입력 데이터 구성 → 내부 업데이트 모듈 → 최종 3차원 구조 예측 알고리즘 순으로 알아보는 형식으로 진행하겠습니다.



AlphaFold3의 내부 네트워크 구조 및 데이터 흐름 요약 (Adapted from Abramson et al. Nature, 2024, 1-3.)
AlphaFold3 변화점 - 1. 입력 데이터 구성 및 업데이트
먼저 입력 데이터부터 살펴보겠습니다.

이전 포스트에서 설명해 드렸듯이 AlphaFold2는 구조를 예측하고자 하는 단백질 서열에 대한 진화론적인 힌트를 주는 다중 서열 정렬 (Multiple Sequence Alignment, MSA) 데이터와 구조적인 힌트를 제공하는 template 데이터를 기반으로 형성한 pair representation, 이 두 가지 데이터를 입력으로 받습니다. 이때 AlphaFold2는 오직 단일 단백질에 대한 구조를 예측하는 모델이기 때문에 이 두 데이터의 기본 단위는 단백질 서열의 구성 요소인 아미노산이 됩니다.

반면에 AlphaFold3는 앞서 언급한 것처럼 예측하고자 하는 대상이 단백질뿐만 아니라 핵산과 리간드도 포함할 수 있어 입력 데이터의 구성이 달라지며, 입력 서열을 각 구성 성분의 유형에 따라 서로 다른 입력 단위로 표시합니다. 즉 입력 서열 내에 단백질은 아미노산, 핵산은 뉴클레오타이드, 그리고 리간드는 원자 단위로 표시하여 초기 입력 데이터를 구성합니다. 이렇게 구성된 입력 데이터는 Input Embedder에서 각 요소가 합쳐진 복합체 단위로 업데이트를 진행합니다.

그렇다면 서로 다른 단위를 가지는 입력 데이터를 어떻게 하나의 복합체로 통합한 뒤 업데이트할 수 있까요? 정답은 아미노산이나 뉴클레오타이드 단위들을 최소 입력 단위인 원자 단위로 쪼갠 뒤 연산을 수행하는 것입니다. 이렇게 원자 단위로 입력 단위를 세분화하면 임의의 생체 분자 복합체가 입력으로 들어오더라도 원자라는 공통된 입력 단위를 토대로 전체 복합체를 표시할 수 있고 추후 하나의 일관된 연산을 적용할 수 있게 됩니다. 이때 reference conformer라는 추가 데이터가 사용되는데, 이것은 아미노산이나 뉴클레오타이드와 같은 분자 단위의 입력을 원자 단위로 쪼갤 때 필요한 각 분자의 원자 구성 및 구조 정보를 제공합니다. 이렇게 재구성된 입력 데이터는 Input Embedder 네트워크 내에서 원자 단위의 연산을 통해 업데이트됩니다. 이때 각 원자마다 상관관계(attention)를 고려하는 연산을 수행하기 때문에 이 연산을 AtomAttention이라고 부릅니다.

AtomAttention 연산 결과값은 다시 원래 입력 단위로 재구성되어 아래 figure에서 single로 표현되는 업데이트 된 입력 서열을 형성합니다. 이는 AlphaFold3가 AlphaFold2와 마찬가지로 MSA 및 template 데이터 사용하여 추가적인 업데이트를 진행하기 때문입니다. MSA 와 template 데이터는 아미노산 혹은 뉴클레오타이드 단위로 표현되기 때문에 AtomAttention 연산 결과값 내에 원자 단위로 쪼개진 아미노산/뉴클레오타이드 표현을 다시 분자 단위 표현으로 합쳐주는 것이죠. 그리고 업데이트된 입력 서열에 외적 (outer product) 연산을 적용하여 pair representation을 형성한 뒤, 이를 순차적으로 Template module 과 MSA module을 통해 각각 template, MSA 데이터의 정보와 결합하여 업데이트 해줍니다.



AlphaFold3의 입력 데이터 구성 및 업데이트 과정 (Adapted from Abramson et al. Nature, 2024, 1-3.)
AlphaFold3의 입력 데이터 구성과 업데이트 과정은 아래와 같이 요약할 수 있습니다.

입력 : 입력 서열
단백질은 아미노산, 핵산은 뉴클레오타이드, 그리고 리간드는 원자 단위로 입력 단위를 구성하여 초기 입력 데이터를 구성
입력 데이터를 reference conformer 데이터를 활용하여 Input Embedder 내에서 원자 단위로 쪼갠 뒤 원자 단위로 상관관계를 고려하며 업데이트하는 AtomAttention 연산을 수행
AtomAttention 결과 업데이트된 single representation을 기반으로 pair를 형성한 뒤, Template & MSA module 내에서 template & MSA 데이터의 정보를 결합하여 업데이트
출력 : 업데이트 된 single 및 pair representation

AlphaFold3 변화점 - 2. 내부 모듈의 변화 : Evoformer에서 Pairformer로
AlphaFold2에서는 MSA 데이터와 pair representation 데이터를 Evoformer라는 네트워크를 통해 업데이트하였습니다. Evoformer는 attention을 사용하여 MSA 내에서는 서로 다른 단백질 서열 간의 진화론적인 상관관계를 고려하고, pair representation 내에서는 입력 단백질 서열 내의 아미노산 간의 상관관계를 고려하며 각 데이터를 업데이트했습니다. 추가로 Evoformer는 연산 도중 MSA 와 pair representation 사이에 서로 정보를 한번씩 교환하는 연산이 존재하여 두 입력 데이터가 서로의 정보를 반영하면서 업데이트되도록 유도하였습니다.



Evoformer 내부 구조 (Adapted from Jumper et al. Nature 596: 583 (2021), CC BY 4.0)
AlphaFold3에서는 Evoformer와 유사한 Pairformer네트워크를 사용합니다. Pairformer와 Evoformer간의 가장 큰 차이점은 바로 입력으로 받는 데이터의 종류와 내부 연산입니다. Evoformer가 입력 서열과 진화론적으로 유사한 다른 단백질 서열까지 포함된 MSA 데이터를 입력으로 받는 대신에, Pairformer는 앞선 Input Embedder에서 업데이트된 입력 서열(single representation)을 받습니다. 따라서 Evoformer와 달리 MSA 내부에서 서열 간 attention 연산과 MSA 와 pair representation 간의 정보 교환 연산이 각각 업데이트된 서열(single representation) 내의 요소들끼리의 attention 연산과 pair에서 single representation으로의 하나의 정보 교환 연산으로 간소화됩니다. 이때 Pairformer에서 빠진 MSA와 pair representation간의 정보 교환 연산은 앞서 언급한 MSA module로 대체합니다.




Pairformer 내부 구조 (Adapted from Abramson et al. Nature, 2024, 1-3.)

MSA module 내부 구조 상세. Pairformer에서 빠진 MSA ↔ pair representation간 정보 교환 연산이 수행됨을 알 수 있다. (Adapted from Abramson et al. Nature, 2024, 1-3.)
Pairformer내의 데이터 흐름과 연산을 요약하자면 아래와 같습니다.

입력 : Input Embedder에서 업데이트한 single & pair representation
Pair representation 내에 요소들끼리의 attention을 통한 연산을 통해 업데이트
업데이트된 pair representation의 정보를 single로 전달
pair의 정보를 받은 single representation을 업데이트
출력 : 업데이트된 single & pair representation
정리하자면 Pairformer는 Evoformer를 토대로 입력으로 받는 데이터의 크기 (여러 서열로 이루어진 MSA 데이터에서 입력 서열로만 이루어진 single representation으로)와 내부 연산 (MSA ↔ pair 정보 교환이 pair → single로)이 간소화 된 네트워크라고 할 수 있습니다. Pairformer와 Evoformer는 각각 AlphaFold 3와 AlphaFold2의 내부 구조에서 가장 큰 비중을 차지하는 네트워크인데요. 간소화된 Pairformer를 통해 AlphaFold3는 AlphaFold2에 비해 전체 연산에 쓰이는 계산 자원과 시간을 단축할 수 있습니다.
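
이해를 돕기 위해, pair representation이 single representation의 attention에 bias로 더해지는 흐름만을 아주 단순화한 개념 스케치를 덧붙입니다. 실제 Pairformer의 triangle update 등은 생략했으며, 함수명과 차원 설정은 설명을 위한 가정입니다.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_attention_with_pair_bias(single, pair_bias, w_q, w_k, w_v):
    """pair representation을 bias로 사용하는 single representation attention (단순화 버전).

    single:    (L, d) - 입력 서열 토큰별 표현
    pair_bias: (L, L) - 토큰 쌍별 표현을 스칼라 bias로 축약했다고 가정
    """
    q, k, v = single @ w_q, single @ w_k, single @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1]) + pair_bias  # pair 정보가 attention score에 더해짐
    return softmax(scores) @ v

L, d = 16, 32
single = np.random.randn(L, d)
pair_bias = np.random.randn(L, L) * 0.1
w = [np.random.randn(d, d) / np.sqrt(d) for _ in range(3)]
updated_single = single_attention_with_pair_bias(single, pair_bias, *w)
```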


AlphaFold3 변화점 - 3. 새로운 구조 예측 네트워크, Diffusion
AlphaFold2는 Evoformer에서 업데이트된 MSA & pair representation을 기반으로 Structure module 을 통해 최종적으로 단백질의 구조를 예측했습니다. Structure module은 먼저 입력으로 받은 MSA & pair representation으로부터 아미노산들의 유클리디언 변환 (Euclidean Transformation) 행렬을 예측하여 backbone의 위치를 구한 뒤, 각 아미노산 마다 뒤틀림각을 예측하여 아미노산 내의 개별 원자들의 좌표를 예측해 내는 식으로 단백질 내 전체 원자들의 3차원 좌표를 예측했습니다. 그리고 현재까지 예측한 단백질 구조 정보를 다시 Structure module의 입력으로 넣어주는 과정을 반복하여 단백질 내 원자들이 원점에서 점진적으로 실제 좌값으로 이동하게 만들었습니다.

AlphaFold3는 생체 분자의 구조를 예측하기 위해 이미지 생성에 널리 쓰이는 AI 모델 중 하나인 Diffusion을 사용하였습니다. Diffusion이란 원본 데이터(이미지)에 점진적으로 노이즈를 준 뒤, 그것을 제거하는 과정을 네트워크를 통해 학습시켜 최종적으로 완전한 노이즈에서 학습한 데이터와 비슷한 새로운 데이터를 생성할 수 있는 생성형 AI 모델 입니다.


Diffusion 개요 (Adapted from Yang et al. ACM Computing Surveys, 2023, 56.4: 1-39.)
AlphaFold3에서는 이 Diffusion을 기반으로 노이즈에서 생체 분자 내의 원자들의 3차원 좌표를 생성하는 Diffusion module로 Structure module을 대체했습니다. 먼저 Pairformer의 출력인 업데이트 된 single & pair representation을 입력으로 받아서 Diffusion Conditioning연산을 수행하여 각 원자의 3차원 내 공간적 조건을 계산합니다. 이때 single & pair representation 내에서 단백질/핵산의 단위는 아미노산/뉴클레오타이드이기 때문에 각 원자마다 조건을 할당하기 위해서 Input Embedder처럼 데이터 내 입력 단위를 모두 원자 단위로 쪼개주는 과정이 포함됩니다. 이렇게 계산된 공간적 조건들은 노이즈가 추가된 원자들의 3차원 좌표 정보와 결합하여 노이즈가 추가되기 전 올바른 3차원 좌표를 예측합니다.


Diffusion module 내부 구조 (Adapted from Abramson et al. Nature, 2024, 1-3.)
Diffusion module이 최종적으로 원자들의 3차원 좌표를 예측하는 과정을 정리하면 아래와 같습니다.

입력 : Pairformer에서 업데이트한 single & pair representation, 노이즈가 추가된 원자들의 3차원 좌표
업데이트된 single & pair representation을 원자 단위로 세분화한 뒤 각 원자마다 3차원 공간적 조건을 담도록 업데이트
업데이트된 각 원자별 공간적 조건과 노이즈가 추가된 원자들의 3차원 좌표를 결합하여 노이즈가 제거된 원본 좌푯값을 예측
출력 : 노이즈가 제거된 원자들의 3차원 좌표



AlphaFold3는 학습 시 생체 분자의 실체 3차원 구조 (Ground truth) 하나당 여러 개의 노이즈가 추가된 샘플을 만들어 내고 각 샘플을 Diffusion module의 입력으로 넣어줘서 Ground truth 내의 3차원 좌표들을 예측하도록 합니다. 즉, Diffusion module은 학습 과정에서는 노이즈를 한 번만 제거하는 single step of the diffusion을 학습하는 것이지요. 그러나 추론 과정에서는 single step of the diffusion을 열거하여 각 step마다 나온 노이즈가 제거된 좌푯값을 다시 Diffusion module의 입력으로 넣어 주는 과정을 반복합니다. 이것은 AlphaFold2에서 Structure module이 이전 구조 예측값을 입력으로 받으며 점진적으로 개선하는 과정과 비슷하며 이 과정을 논문에서는 mini-rollout이라고 언급합니다. 이러한 mini-rollout 과정을 통해 Diffusion module은 최종적으로 완전한 노이즈를 점진적으로 원자들의 3차원 좌표들로 변환할 수 있게 됩니다.
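
아래는 논문의 실제 Diffusion module 구현이 아니라, '학습 시에는 노이즈 제거 1스텝만 학습하고 추론 시에는 이를 반복(mini-rollout)한다'는 위 설명을 일반적인 diffusion 형태로 단순화한 개념 스케치입니다. 모델, 노이즈 스케줄, 변수명은 모두 설명을 위한 가정입니다.

```python
import numpy as np

def denoise_step(model, noisy_coords, conditioning, noise_level):
    # 한 번의 노이즈 제거: 조건(업데이트된 표현)을 받아 원래 좌표를 직접 예측한다고 가정
    return model(noisy_coords, conditioning, noise_level)

def train_step(model, gt_coords, conditioning, rng):
    # 학습: ground truth 좌표에 노이즈를 더한 샘플 하나에 대해 '1스텝 복원'만 학습
    sigma = rng.uniform(0.1, 10.0)
    noisy = gt_coords + sigma * rng.standard_normal(gt_coords.shape)
    pred = denoise_step(model, noisy, conditioning, sigma)
    return float(np.mean((pred - gt_coords) ** 2))  # 좌표 복원 손실

def mini_rollout(model, conditioning, num_atoms, steps, rng):
    # 추론: 완전한 노이즈에서 시작해 1스텝 복원을 반복하며 좌표를 점진적으로 정제
    sigmas = np.linspace(10.0, 0.1, steps)
    coords = sigmas[0] * rng.standard_normal((num_atoms, 3))
    for sigma in sigmas:
        denoised = denoise_step(model, coords, conditioning, sigma)
        coords = denoised + 0.5 * (coords - denoised)  # 가정: 단순화한 부분 갱신 스케줄
    return coords

rng = np.random.default_rng(0)
dummy_model = lambda x, cond, sigma: x / (1.0 + sigma)  # 자리표시자 모델 (가정)
coords = mini_rollout(dummy_model, conditioning=None, num_atoms=8, steps=20, rng=rng)
```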


AlphaFold3의 등장이 우리에게 시사하는 점
지금까지 AlphaFold3의 내부 구조 및 알고리즘에 대해 알아봤습니다. AlphaFold3는 AlphaFold2와 비교했을 때 크게 아래 3가지 요소들에서 특징적임을 알 수 있습니다.

임의의 생체 분자 복합체를 입력으로 받을 수 있게 입력 서열의 구성과 업데이트 방식 변화
Evoformer보다 간소화된 Pairformer를 통해 single & pair representation을 업데이트
원자들의 3차원 좌표를 예측하기 위해서 생성형 AI 기술인 Diffusion을 접목
이 외에도 AlphaFold 3에는 하나의 포스트로는 차마 다 다루지 못할 정도로 많은 변화와 신기술들이 녹아 있습니다. 이것들을 다룬 Supplementary Information을 보면 저자들이 AlphaFold3를 설계하기 위해 얼마나 많은 고민과 연구를 했는지 여실히 알 수 있습니다. 그 결과 탄생한 AlphaFold3는 단일 AI 모델로서 핵산, 리간드를 포함한 여러 유형의 단백질 기반 생체 분자 복합체의 구조를 정확하게 예측하며 현시대 AI 기반 신약 개발 분야의 게임 체인저로 부상하는 기염을 토해냈습니다.




AlphaFold3가 예측한 여러 유형의 단백질 기반 생체 분자들의 구조 (Adapted from Abramson et al. Nature, 2024, 1-3.)
이번에 공개된 AlphaFold3가 굉장히 놀랍고 혁신적인 기술임에는 틀림없지만 여전히 한계는 존재합니다. 저희 AI연구팀 황상연 팀장님이 AlphaFold-latest라는 이름으로 AlphaFold3의 등장을 소개한 지난 포스트에서 언급하셨듯이 현재 AlphaFold3가 보여주는 구조 예측 성능이 지금까지의 State-Of-The-Art (SOTA)일 뿐, 절대적으로 봤을 때 아직 만점에는 한참 못 미치는 수준입니다. 또한 논문 내에서도 단백질-리간드 예측 구조에서 chirality가 제대로 반영되지 않거나 atom clashing이 일어나는 문제, Diffusion을 포함한 생성형 AI 모델의 고질적인 문제로 언급되는 hallucination으로 인해 왜곡된 구조가 예측되는 문제 등 여러 한계점이 있음을 언급하였습니다. 즉, 현재 AlphaFold3는 추가적인 개선이 될 여지가 많으며, 이를 필두로 더 많은 진보가 예고된 것입니다.


AlphaFold3의 한계점으로 언급된 리간드의 chirality 문제와 (좌) 예측 구조의 hallucination 문제 (우) (Adapted from Abramson et al. Nature, 2024, 1-3.)
AI 신약 개발이라는 기술적 격변의 시기에 대응하기 위해 전 세계적으로도 분주히 움직이고 있습니다. 실제로 이번 AlphaFold3 논문이 코드가 첨부되지 않은 불완전한 오픈소스임이 들어나자 많은 연구자 이에 대해 문의하였고, 해당 논문의 저널인 Nature에서 이례적으로 해명 기고를 낼 정도로 이번 AlphaFold3의 등장과 AI 신약 개발이라는 혁신은 학계의 뜨거운 관심을 받고 있습니다. 이제 이 혁신이 실제 업계에 반영되는 미래도 멀지 않아 보입니다. 빠르게 변화하는 기술적 흐름에 압도되어 AI 신약 개발이 미지의 우주처럼 느껴지시는 분도 있으실 겁니다. 하지만 걱정하지 마시길. 국내외에서 인정받는 기술력을 보유한 AI 신약 개발 플랫폼인 히츠의 하이퍼랩과 함께라면 AI 신약 개발은 무궁무진한 가능성을 담은 기회의 장이 될 것입니다. 저희와 함께 AI 신약 개발이라는 미지의 영역을 마음껏 탐사해 보시죠.

###
https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
11/11/24
Alibaba Cloud

An open LLM that competes with Anthropic Claude Sonnet 3.5 impossible? No, Qwen2.5-Coder-32B entered the game matching Claude Sonnet 3.5 across multiple coding benchmarks. Early testers say, “doing the things Sonnet didn't want to,” “Trying out a preview of Qwen2.5 Coder 32B, and it feels like Claude 3.5 Sonnet”., “I am blown away how well it does in long context work.” 👀
TL;DR:
💪 Nearly matches Claude in overall coding capabilities with just 32B parameters
🚀 Beats Claude 3.5 Sonnet on HumanEval (92.7 vs. 92.1) and EvalPlus (86.3 vs. 85.9).
📈 Outperforms open models across coding benchmarks and Fill-in-the-Middle tasks.
🌐 Released as Base and Instruct version in over 40+ languages with 128K context length.
🤖 High performance in code repair benchmarks 73.7 on Aider.
📝 Use long system prompts (over 16k) with examples, docs etc.
🤗 Licensed under Apache 2.0 and available on Hugging Face.

Qwen2.5-Coder-32B-Instruct
Introduction
Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder has covered six mainstream model sizes, 0.5, 1.5, 3, 7, 14, 32 billion parameters, to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

Significant improvements in code generation, code reasoning and code fixing. Based on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
A more comprehensive foundation for real-world applications such as Code Agents. Not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.
Long-context Support up to 128K tokens.
This repo contains the instruction-tuned 32B Qwen2.5-Coder model, which has the following features:

Type: Causal Language Models
Training Stage: Pretraining & Post-training
Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
Number of Parameters: 32.5B
Number of Parameters (Non-Embedding): 31.0B
Number of Layers: 64
Number of Attention Heads (GQA): 40 for Q and 8 for KV
Context Length: Full 131,072 tokens
Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.
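
For orientation, here is a minimal text-generation sketch using the standard Hugging Face transformers chat-template workflow; the prompt and generation settings are illustrative only, not an official quickstart.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a quicksort function in Python."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate up to 512 new tokens and decode only the newly generated part
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```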

###
https://huggingface.co/tencent/Tencent-Hunyuan-Large
11/3/24
Synthetic data is all you need? New large MoE from Tencent was trained on 1.5 trillion tokens of synthetic data. The 389B-A52B MoE outperforms Meta Llama 3.1 405B across academic benchmarks. 👀
TL;DR
🧮 389B parameters with 52B activated during generation
👨‍🏫 160 experts with 6 active in generation
😍 Detailed technical report with scaling experiments
🪟 Released Pretrain, Instruct, and FP8 version
🌱 Trained on 7 trillion tokens with 1.5T synthetic tokens
🌎 Trained mostly English and Chinese data
🏎️ Should fit on a single H100 Node (8x) in FP8
📜 Custom License, commercially useable below 100MAU
🇪🇺 License forbids use for citizens and companies in the EU
🧬 Post Training used SFT > DPO
🤗 Available on Hugging Face

GITHUB | 🖥️ official website | 🕖 HunyuanAPI| 🐳 Gitee

Technical Report | Demo | Tencent Cloud TI


Download Models
Models Huggingface Download URL Tencent Cloud Download URL
Hunyuan-A52B-Instruct-FP8 Hunyuan-A52B-Instruct-FP8 Hunyuan-A52B-Instruct-FP8
Hunyuan-A52B-Instruct Hunyuan-A52B-Instruct Hunyuan-A52B-Instruct
Hunyuan-A52B-Pretrain Hunyuan-A52B-Pretrain Hunyuan-A52B-Pretrain
Model Introduction
With the rapid development of artificial intelligence technology, large language models (LLMs) have made significant progress in fields such as natural language processing, computer vision, and scientific tasks. However, as the scale of these models increases, optimizing resource consumption while maintaining high performance has become a key challenge. To address this challenge, we have explored Mixture of Experts (MoE) models. The currently unveiled Hunyuan-Large (Hunyuan-MoE-A52B) model is the largest open-source Transformer-based MoE model in the industry, featuring a total of 389 billion parameters and 52 billion active parameters.

By open-sourcing the Hunyuan-Large model and revealing related technical details, we hope to inspire more researchers with innovative ideas and collectively advance the progress and application of AI technology. We welcome you to join our open-source community to explore and optimize future AI models together!

Introduction to Model Technical Advantages
Model
High-Quality Synthetic Data: By enhancing training with synthetic data, Hunyuan-Large can learn richer representations, handle long-context inputs, and generalize better to unseen data.

KV Cache Compression: Utilizes Grouped Query Attention (GQA) and Cross-Layer Attention (CLA) strategies to significantly reduce memory usage and computational overhead of KV caches, improving inference throughput.
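
As a rough illustration of the GQA part of this idea (a generic sketch, not Hunyuan-Large's actual attention code), the snippet below shares each KV head across a group of query heads, which is what shrinks the KV cache; the head counts are made up for the example.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_kv_heads):
    """Minimal GQA sketch: many query heads attend over a smaller set of shared KV heads.

    q: (num_q_heads, seq, d)   k, v: (num_kv_heads, seq, d)
    The KV cache only needs num_kv_heads entries per layer instead of num_q_heads.
    """
    num_q_heads = q.shape[0]
    group = num_q_heads // num_kv_heads
    outs = []
    for h in range(num_q_heads):
        kv = h // group  # query heads in the same group share this KV head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        probs = np.exp(scores - scores.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        outs.append(probs @ v[kv])
    return np.stack(outs)

# e.g. 32 query heads sharing 8 KV heads -> 4x smaller KV cache for this layer
out = grouped_query_attention(np.random.randn(32, 16, 64),
                              np.random.randn(8, 16, 64),
                              np.random.randn(8, 16, 64), num_kv_heads=8)
```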

Expert-Specific Learning Rate Scaling: Sets different learning rates for different experts to ensure each sub-model effectively learns from the data and contributes to overall performance.

Long-Context Processing Capability: The pre-trained model supports text sequences up to 256K, and the Instruct model supports up to 128K, significantly enhancing the ability to handle long-context tasks.

Extensive Benchmarking: Conducts extensive experiments across various languages and tasks to validate the practical effectiveness and safety of Hunyuan-Large.



Benchmark Evaluation
Hunyuan-Large pre-trained model achieves the best overall performance compared to both Dense and MoE based competitors having similar activated parameter sizes. For aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best performance, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and classical NLP tasks such as QA and reading comprehension tasks (e.g., CommonsenseQA, PIQA and TriviaQA).
For the mathematics capability, Hunyuan-Large outperforms all baselines in math datasets of GSM8K and MATH, and also gains the best results on CMATH in Chinese. We also observe that Hunyuan-Large achieves the overall best performance in all Chinese tasks (e.g., CMMLU, C-Eval).

Model LLama3.1-405B LLama3.1-70B Mixtral-8x22B DeepSeek-V2 Hunyuan-Large
MMLU 85.2 79.3 77.8 78.5 88.4
MMLU-Pro 61.6 53.8 49.5 - 60.2
BBH 85.9 81.6 78.9 78.9 86.3
HellaSwag - - 88.7 87.8 86.8
CommonsenseQA 85.8 84.1 82.4 - 92.9
WinoGrande 86.7 85.3 85.0 84.9 88.7
PIQA - - 83.6 83.7 88.3
NaturalQuestions - - 39.6 38.7 52.8
DROP 84.8 79.6 80.4 80.1 88.9
ARC-C 96.1 92.9 91.2 92.4 95.0
TriviaQA - - 82.1 79.9 89.2
CMMLU - - 60.0 84.0 90.2
C-Eval - - 59.6 81.7 91.9
C3 - - 71.4 77.4 82.3
GSM8K 89.0 83.7 83.7 79.2 92.8
MATH 53.8 41.4 42.5 43.6 69.8
CMATH - - 72.3 78.7 91.3
HumanEval 61.0 58.5 53.1 48.8 71.4
MBPP 73.4 68.6 64.2 66.6 72.6
Hunyuan-Large-Instruct achieves consistent improvements on most types of tasks compared to LLMs having similar activated parameters, indicating the effectiveness of our post-training. Delving into the model performance in different categories of benchmarks, we find that our instruct model achieves the best performance on MMLU and MATH dataset.
Notably, on the MMLU dataset, our model demonstrates a significant improvement, outperforming the LLama3.1-405B model by 2.6%.
This enhancement is not just marginal but indicative of the Hunyuan-Large-Instruct’s superior understanding and reasoning capabilities across a wide array of language understanding tasks. The model’s prowess is further underscored in its performance on the MATH dataset, where it surpasses the LLama3.1-405B by a notable margin of 3.6%.
Remarkably, this leap in accuracy is achieved with only 52 billion activated parameters, underscoring the efficiency of our model.

Model LLama3.1 405B Inst. LLama3.1 70B Inst. Mixtral 8x22B Inst. DeepSeekV2.5 Chat Hunyuan-Large Inst.
MMLU 87.3 83.6 77.8 80.4 89.9
CMMLU - - 61.0 - 90.4
C-Eval - - 60.0 - 88.6
BBH - - 78.4 84.3 89.5
HellaSwag - - 86.0 90.3 88.5
ARC-C 96.9 94.8 90.0 - 94.6
GPQA_diamond 51.1 46.7 - - 42.4
MATH 73.8 68.0 49.8 74.7 77.4
HumanEval 89.0 80.5 75.0 89.0 90.0
AlignBench 6.0 5.9 6.2 8.0 8.3
MT-Bench 9.1 8.8 8.1 9.0 9.4
IFEval strict-prompt 86.0 83.6 71.2 - 85.0
Arena-Hard 69.3 55.7 - 76.2 81.8
AlpacaEval-2.0 39.3 34.3 30.9 50.5 51.8

###
https://huggingface.co/fishaudio/fish-agent-v0.1-3b
11/1/24

Speech to Speech model - Fish Agent v0.1 3B by @FishAudio 🔥
> Trained on 700K hours of multilingual audio
> Continue-pretrained version of Qwen-2.5-3B-Instruct for 200B audio & text tokens
> Zero-shot voice cloning
> Text + audio input/ Audio output
> Ultra-fast inference w/ 200ms TTFA
> Models on the Hub & Finetuning code on its way! 🚀
What an amazing time to be alive 🤗

Fish Agent V0.1 3B
Fish Agent V0.1 3B is a groundbreaking Voice-to-Voice model capable of capturing and generating environmental audio information with unprecedented accuracy. What sets it apart is its semantic-token-free architecture, eliminating the need for traditional semantic encoders/decoders like Whisper and CosyVoice.

Additionally, it stands as a state-of-the-art text-to-speech (TTS) model, trained on an extensive dataset of 700,000 hours of multilingual audio content.

This model is a continue-pretrained version of Qwen-2.5-3B-Instruct for 200B voice & text tokens.

Supported Languages
The model supports the following languages with their respective training data sizes:

English (en): ~300,000 hours
Chinese (zh): ~300,000 hours
German (de): ~20,000 hours
Japanese (ja): ~20,000 hours
French (fr): ~20,000 hours
Spanish (es): ~20,000 hours
Korean (ko): ~20,000 hours
Arabic (ar): ~20,000 hours

###
https://huggingface.co/amd/AMD-OLMo
AMD
Introducing the First AMD 1B Language Models: AMD OLMo
Oct 31, 2024

Core Contributors: Jiang Liu, Jialian Wu, Prakamya Mishra, Zicheng Liu
Contributors: Sudhanshu Ranjan, Pratik Prabhanjan Brahma, Yusheng Su, Gowtham Ramesh, Peng Sun, Zhe Li, Dong Li, Lu Tian, Emad Barsoum

Introduction
In recent years, the rapid development of artificial intelligence technology, especially the progress in large language models (LLMs), has garnered significant attention and discussion. From the emergence of ChatGPT to subsequent models like GPT-4 and Llama, these language models have demonstrated remarkable capabilities in natural language processing, generation, understanding and reasoning. Continuing AMD tradition of open-sourcing models and code to help the community advance together, we are excited to release our first series of fully open 1 billion parameter language models, AMD OLMo.

Why Build Your Own Language Models
The ability to pre-train and fine-tune your own LLM helps towards the incorporation of domain-specific knowledge, ensuring better alignment with unique use cases. This approach allows organizations to tailor the model’s architecture and training process to meet their unique requirements, achieving a balance between scalability and specialization that off-the-shelf models may not provide. As the demand for customized AI solutions continues to grow, the ability to pre-train LLMs unlocks unprecedented opportunities for innovation and product differentiation across industries.

The AMD in-house trained series of language models (LMs), AMD OLMo, are 1 billion parameter LMs trained from scratch using trillions of tokens on a cluster of AMD Instinct™ MI250 GPUs. Aligned with the goal of advancing accessible AI research, AMD has open-sourced its complete training details and released the checkpoints for the first series of AMD OLMo models. This initiative empowers a diverse community of users, developers, and researchers to explore, utilize, and train state-of-the-art large language models. By demonstrating the capabilities of AMD Instinct™ GPUs in demanding AI workloads, AMD aims to highlight its potential for running large-scale multi-node LM training jobs with trillions of tokens to achieving improved reasoning and instruction-following performance compared to other fully open similar size LMs. In addition, the community can run such models on AMD Ryzen™ AI PCs that are equipped with Neural Processing Units (NPUs) utilizing the AMD Ryzen™ AI Software to enable easier local access without privacy concerns, efficient AI inference, and lower power consumption.

Unveiling AMD OLMo Language Models
AMD OLMo are a series of 1 billion parameter language models pre-trained with 1.3 trillion tokens on 16 nodes, each with four (4) AMD Instinct™ MI250 GPUs. Along with complete details to reproduce, we are releasing three (3) checkpoints corresponding to the various stages of training:

AMD OLMo 1B: Pre-trained on a subset of Dolma v1.7 that consists of 1.3 trillion tokens.
AMD OLMo 1B SFT: Supervised fine-tuned (SFT) on Tulu V2 dataset (1st phase) and then OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets (2nd phase).
AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on UltraFeedback dataset.
AMD OLMo 1B is based on the model architecture and training set up of fully open source 1 billion version of OLMo, with some key differences. We pre-train with less than half the tokens used for OLMo-1B (effectively cutting the compute budget by half while maintaining comparable performance) and execute post-training comprising of a two-phase SFT and DPO alignment to enhance performance in general reasoning, instruction-following and chat capabilities (OLMo-1B does not carry-out any post-training steps). For the two-phase SFT, we create a data mix of high quality and diverse instructional datasets that are publicly available. Overall, our training recipe helps to produce a series of models that achieve better performance over various types of benchmarks as compared to other similar sized fully open-source models trained on publicly available data.

AMD OLMo
The AMD OLMo models are decoder-only transformer language models that are trained using next-token prediction. The key model architecture and training hyperparameter details are provided in our model card here.

Data and Training Recipe
We trained the AMD OLMo series of models in three stages as shown in Figure 1.

AMD Pretraining Pipeline
Figure 1: AMD OLMo training stages.
Stage 1: Pre-training

The pre-training stage comprised of training on a large corpus of general-purpose text data for teaching the model to learn the language structure and gain general world knowledge by performing next-token prediction task. We used a subset of 1.3 trillion tokens from the publicly available Dolma v1.7 dataset. Scripts for extracting the exact subset can be found in our Hugging Face model card here.

Stage 2: Supervised Fine-tuning (SFT)

Next, we fine-tuned the pre-trained model on instructional datasets to enable instruction following capabilities in our model. This stage comprises of two phases:

Phase 1: First, we fine-tune the model on TuluV2 dataset, which is a publicly available high-quality instruction dataset consisting of 0.66 billion tokens.
Phase 2: To further improve the instruction following capabilities, we subject the model to be fine-tuned on a relatively larger instruction dataset OpenHermes 2.5. In this phase, we also use Code-Feedback and WebInstructSub dataset to improve the model’s capability along the dimensions of coding, science and mathematical problem solving. These datasets consist of ~7 billion tokens in total.
We conducted multiple fine-tuning experiments with different ordering of datasets along the two phases and found the above sequencing to be most helpful. We use a relatively smaller sized but high-quality dataset in Phase 1 to provide a good foundation, and then leverage more diverse and bigger dataset combination in Phase 2 to further improve model’s capabilities.

Stage 3: Alignment

At the end, we further tune our SFT model with Direct Preference Optimization (DPO) using the UltraFeedback dataset, which is a large-scale, fine-grained, and diverse preference dataset. This helps the model align better and produce outputs that are consistent with human values and preferences.
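
For readers unfamiliar with DPO, here is a generic sketch of the DPO objective (not AMD's training code): it widens the gap between the policy's preference for the chosen response and the rejected one, measured relative to the frozen SFT reference model. The log-probabilities in the toy example are made-up numbers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO loss: inputs are summed log-probs of chosen/rejected responses
    under the policy being trained and under the frozen reference (SFT) model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with a batch of 4 preference pairs (values are illustrative)
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3]),
                torch.tensor([-14.2, -11.0, -19.8, -9.9]),
                torch.tensor([-12.5, -10.0, -20.0, -8.0]),
                torch.tensor([-13.0, -10.5, -20.5, -9.0]))
print(loss.item())
```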

Results
We compare AMD OLMo models with other similarly sized fully open-source models that have publicly released their data, model weights and training code. The pre-trained baseline models that we used for comparison include: TinyLLaMA-v1.1 (1.1B), MobiLLaMA-1B (1.2B), OLMo-1B-hf (1.2B), OLMo-1B-0724-hf (1.2B), and OpenELM-1_1B (1.1B).

AMD Pretraining Results on Standard Benchmarks
Figure 2: Pre-trained model results on standard benchmarks for general reasoning capabilities and multi-task understanding. Top markers represent the performance gain of the best performing AMD OLMo 1B model compared to the next best model.
Figure 2 compares pre-trained models across various standard benchmarks for general reasoning capabilities (see here for exact numbers). We use Language Model Evaluation Harness for evaluating common sense reasoning, multi-task understanding and responsible AI benchmarks. Of the 11 benchmarks, we evaluate GSM8k in 8-shot and BBH in 3-shot setting, and rest in zero-shot setting.

With AMD OLMo 1B:

The average overall general reasoning tasks (48.77%) is comparable to that of the latest OLMo-0724-hf model (49.3%) with less than half of its pre-training compute budget and better than all the other baseline models.
Accuracy gains over the next best models on ARC-Easy (+6.36%), ARC-Challenge (+1.02%), and SciQ (+0.50%) benchmarks.
For evaluating the chat capabilities, we used the following instruction-tuned chat counterparts of the pre-trained baselines: TinyLlama-1.1B-Chat-v1.0 , MobiLlama-1B-Chat, and OpenELM-1_1B-Instruct. Along with Language Model Evaluation Harness for evaluating common sense reasoning, multi-task understanding and responsible AI benchmarks, we used Alpaca Eval for evaluating instruction-following capabilities, and MT-Bench for evaluating multi-turn chat capabilities.

On comparing our fine-tuned and aligned models with other instruction-tuned baselines:

AMD Instruction Tuning Results on Standard Benchmarks
Figure 3: Instruction tuning results on standard benchmarks for general reasoning capabilities and multi-task understanding. Top markers represent the performance gain of the best performing AMD OLMo 1B SFT/SFT DPO models compared to the next best baseline model.
Two phased SFT helped raise the model accuracy from the pre-trained checkpoint across almost all benchmarks on average, specifically MMLU by +5.09% and GSM8k by +15.32%.
AMD OLMo 1B SFT performance on GSM8k (18.2%) is significantly better (+15.39%) than the next best baseline model (TinyLlama-1.1B-Chat-v1.0 at 2.81%).
Average accuracy over standard benchmarks (Figure 3) for our SFT model beats baseline chat models by minimum +2.65%. Alignment (DPO) boosts it by further +0.46%.
AMD Instruction Tuning Results On Chat Benchmarks
Figure 4: SFT and DPO model results on chat benchmarks. *MT-Bench evaluation was done using max_new_tokens=2048 while the context length for OpenELM-1_1B-Instruct restricts this generation resulting in an unfair comparison. Top markers represent the performance gain of the best performing AMD OLMo 1B SFT/SFT DPO model compared to the next best baseline model.
Our SFT model also exceeds the next best model on chat benchmarks AlpacaEval 2 (+2.29%) and MT-Bench (+0.97%) as shown in Figure 4.
AMD Instruction Tuning Results On Responsible AI Benchmarks
Figure 5: SFT and DPO model results on responsible AI benchmarks. Here for ToxiGen a lower score is better. Top markers represent the performance gain of the best performing AMD OLMo 1B SFT/SFT DPO model compared to the next best baseline model.
Alignment training helps our AMD OLMo 1B SFT DPO model perform on par with other chat baselines on responsible AI evaluation benchmarks, as shown in Figure 5.
Furthermore, AMD OLMo models were also able to run inference on AMD Ryzen™ AI PCs that are equipped with Neural Processing Units (NPUs). Developers can easily run Generative AI models locally by utilizing the AMD Ryzen™ AI Software. Local deployment of such models on edge devices provides a sustainable and secure approach by optimizing energy efficiency and safeguarding data privacy while enabling various types of AI applications.

Conclusion
Using an end-to-end training pipeline running on AMD Instinct™ GPUs that consists of a pre-training stage with 1.3 trillion tokens (which is half the pre-training compute budget as compared to OLMo-1B), a two-phase supervised fine-tuning stage, and DPO based human preference alignment stage, AMD OLMo models are comparable to or outperform the other similar sized fully open models across general reasoning and chat capabilities, while performing at par on responsible AI benchmarks. Also, the language model was deployed onto AMD Ryzen™ AI PCs with NPUs that can potentially help enable a diverse set of edge use cases. Open sourcing the data, weights, training recipes and code is primarily aimed at helping developers to reproduce as well as innovate further on top. AMD remains committed to providing the open-source community with a steady stream of new AI models and eagerly anticipates the innovations that will emerge from their collaborative efforts.


Smol models ftw! @AMD released AMD OLMo 1B - beats OpenELM, tiny llama on MT Bench, Alpaca Eval - Apache 2.0 licensed 🔥
> Trained with 1.3 trillion (dolma 1.7) tokens on 16 nodes, each with 4 MI250 GPUs
> Three checkpoints:
- AMD OLMo 1B: Pre-trained model
- AMD OLMo 1B SFT: Supervised fine-tuned on Tulu V2, OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets
- AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on UltraFeedback dataset
Key Insights:
> Pre-trained with less than half the tokens of OLMo-1B
> Post-training steps include two-phase SFT and DPO alignment
> Data for SFT:
- Phase 1: Tulu V2
- Phase 2: OpenHermes-2.5, WebInstructSub, and Code-Feedback
> Model checkpoints on the Hub & Integrated with Transformers ⚡️
Congratulations & kudos to @AMD on a brilliant smol model release! 🤗

###
https://si.inc/hertz-dev/
11/6/24
Standard Intelligence

Hertz-dev - 8.5 billion parameters, full-duplex, audio-only base model, APACHE 2.0 licensed 🔥
> Trained on 20 million hours of audio
Train on any down-stream task, speech-to-speech, translation, classification, speech recognition, text-to-speech and more!
Model checkpoints in the comments below

Introducing hertz-dev, the first open-source base model for conversational audio generation

For the last few months, the team at Standard Intelligence has been researching scalable cross-modality learning. We're excited to announce that we're open-sourcing current checkpoints of our full-duplex, audio-only base model, hertz-dev, with a total of 8.5 billion parameters and three primary parts:

Hertz-codec: a convolutional audio VAE which takes mono, 16kHz speech and encodes a 8Hz latent representation with a KL-regularized 1kbps bitrate. The codec latents have no residuals, just a single 32-dim latent per 125ms frame. The codec outperforms Soundstream and Encodec at 6kbps and is on par with DAC at 8kbps in subjective evaluations, while having lower tokens per second than any popular tokenizer, critical for language modeling. Hertz-codec has 5 million encoder parameters and 95 million decoder parameters.
Hertz-lm: a 6.6 billion parameter, 32-layer decoder-only transformer with a context of 2048 input tokens (~4.5 mins). Hertz-lm receives as input the full latent history, but predicts a series of quantized representations, which are 15-bit compressed versions of the hertz-codec tokens. It acts like a typical language model trained on next-token prediction loss.
We're releasing two versions of the one-channel stack, both trained on 20 million hours of audio data. The primary checkpoint has weights initialized from the weights of a pretrained language model trained on 2T text tokens. The second is an ablation that was trained purely on audio, with no text pretraining. While the one that began with text training had higher coherence in subjective evaluations, both exhibit similar linguistic understanding, and we're excited to learn that audio alone contains sufficient grounding for the model to learn language.
The two-channel version of hertz-lm predicts two quantized latents, which are used as input for two separate instances of hertz-vae and hertz-codec.
Hertz-vae: a 1.8 billion parameter, 8 layer decoder-only transformer. The first four layers receive as input the latent history. During training, layer 5 receives the ground-truth, 15-bit quantized representation of the next latent. During inference, we directly sample hertz-lm's next token prediction and provide it to hertz-vae as the quantized representation.
We evaluate hertz-vae during training by doing autoregressive generation while holding the ground-truth quantized latents static, and measure the quality of resynthesis: how well the model is able to reconstruct the original speech from just quantized latents, prompt, and generation history. From transcript evaluations, hertz-vae is near-perfect at reconstructing the semantics of speech from just 120 bits per second of information.
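
As a back-of-the-envelope check on the numbers above (my own arithmetic, not from the announcement): at an 8 Hz latent rate, 15-bit quantized tokens work out to exactly the 120 bits per second cited, and the ~1 kbps codec budget leaves about 125 bits per 125 ms frame for the continuous latent.

```python
latent_rate_hz = 8             # one 32-dim latent per 125 ms frame
bits_per_quantized_token = 15  # hertz-lm predicts 15-bit quantized codes

print(latent_rate_hz * bits_per_quantized_token, "bits/s")  # 120 bits/s, matching the resynthesis figure
print(1000 / latent_rate_hz, "bits per frame at the ~1 kbps codec budget")  # 125.0
```
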
Hertz-dev is the first publicly released audio base model of its kind. Base models accurately predict the distribution of the data that they were trained on, as opposed to models that have had substantial RL tuning done to collapse their generation distributions. This makes base models the best starting point to fine-tune for a large number of different tasks. We're currently training a larger, more advanced version of Hertz, which will use a scaled base model recipe and RL tuning to substantially improve the raw capabilities and final coherence of the model. Hertz-dev is a glimpse at the future of real-time voice interaction, and is the easiest conversational audio model in the world for researchers to fine-tune and build on top of.

###
https://githubnext.com/projects/github-spark
Github
11/1/24
GitHub Spark
Can we enable anyone to create or adapt software for themselves, using AI and a fully-managed runtime?

What's it for?
Building and sharing personalized micro apps (“sparks”)
Stage
Technical Preview
Who made it?
Devon Rifkin
Terkel Gjervig Nielsen
Cole Bemis
Alice Li
👋 If you’d like to try out GitHub Spark, then sign up for the technical preview.

As developers, we love to customize our environment, and to build tools that fit our unique preferences and workflows. We do this not just because it improves productivity and ergonomics, but also, because it makes our daily routine feel more personal. And when things feel personal, they’re typically more fun.

However, while we may invest in things like managing dotfiles, writing automation scripts, or configuring editor settings, how often do we pass up ideas for making our own apps? Not necessarily because we couldn’t build them, but because they seem too short-lived, niche, or time-consuming to prioritize? 😩

And in this lies the irony with software today: we have powerful computers on our desks and in our pockets, but they aren’t nearly as personalized as they could be. Instead, we rely on general-purpose tools that were designed by and for someone else, because the complexity of creating bespoke apps is too high.

Which raises two interesting questions: how could we make personalizing our software as easy as personalizing our dev environment? And then enable those around us to do the same? Not because that should be necessary—but because it could be fun 🙌

Introducing GitHub Spark
GitHub Spark is an AI-powered tool for creating and sharing micro apps (“sparks”), which can be tailored to your exact needs and preferences, and are directly usable from your desktop and mobile devices. Without needing to write or deploy any code.

And it enables this through a combination of three tightly-integrated components:

An NL-based editor, which allows easily describing your ideas, and then refining them over time
A managed runtime environment, which hosts your sparks, and provides them access to data storage, theming, and LLMs
A PWA-enabled dashboard, which lets you manage and launch your sparks from anywhere
Additionally, GitHub Spark allows you to share your sparks with others, and control whether they get read-only or read-write permissions. They can then choose to favorite the spark—and use it directly—or remix it, in order to further adapt it to their preferences. Because…ya know…personalization!

So let’s take a look at how it works 🎬



What are “micro apps”?
GitHub Spark subscribes to the Unix philosophy for apps, where software can be unapologetic about doing one thing, and doing it well–specifically for you, and the duration of time that it’s useful. So “micro” doesn’t refer to the size of the app’s value, but rather, the size of its intended feature complexity.

For example, here are some sparks that the team made (and use!), during the process of creating GitHub Spark. These range from life management tools, learning aids, silly animations, and news clients. But the common thread across them all is: they look and feel exactly how the creator wanted them to. Nothing more and absolutely nothing less ❤️

Allowance tracker app: An allowance tracker for kids, which can be shared in either read-only or read-write mode (for parents), and uses an LLM to generate a celebratory message when an earning goal is reached

Vehicle world app: An animated world of vehicles, as envisioned–and created–by a six year old

Karaoke night app: An app for tracking a weekly karaoke night, along with the status of each invited guest

Find my City app: A maps app that allows searching for cities by name, and then using an LLM to generate a fun tldr description of it. Created and used by a 10 year old for school

Spark news app: A custom HackerNews client that shows the top 20 posts, and uses an LLM to summarize the comment threads (which is really useful!). This is the daily HN driver for the team

So with that context in mind, let’s talk about the “what?” and “why?” behind the major components of GitHub Spark 👍

NL-based toolchain
When creating an app, you have to know what you want. And not just the general idea, but also the exact set of features, detailed interaction behaviors, and the overall look and feel of it. Unfortunately, this can get quite complicated, and may be overwhelming enough to prevent some from even trying. Which is exactly the problem we’re looking to solve!

GitHub Spark mitigates this, by enabling you to start with a simple idea (“An app to track my kid’s allowance”), and then allowing complexity to slowly emerge through “assisted exploration”. In particular, its NL-based editor is designed to make forward progress feel easy—and playful!—using four core iteration capabilities:

Interactive previews
Revision variants
Automatic history
Model selection
Interactive previews
When you type an NL expression into GitHub Spark, it doesn’t just generate code–it immediately runs and displays it via an interactive preview. This “app-centric feedback loop” allows you to specify as little or as much detail as you want, and then iterate as you visually learn more about your intent (“Hmm, I guess I wanted a toggle button here!”).

Spark editor preview
Revision variants
When you create or iterate on a spark, you can optionally request a set of variants. This will generate 3-6 different versions of your request, each with subtle yet meaningful deviations. And since you might know you want a feature, but not quite know how it should look or behave, it can be helpful to get ideas that inform and expand on your thinking. Like an AI thought partner!

Spark editor variants: Asking for variants on an ambiguous revision ("Make the UI look really silly")

Automatic history
As you iterate on a spark, every revision is automatically saved and can be restored in a single click. This allows you to explore ideas (and variants) without worrying about losing any progress. And more importantly, without requiring you to manage version control yourself. This enables a sort of “curiosity-driven development”, where you can have an idea, and then try it out, without any fear of negative consequences (e.g. messing up your app).

Spark editor history
From a collaboration perspective, history is also compelling because it provides a form of “semantic view source” whenever someone shares a spark with you. While creating GitHub Spark, we found that we’d naturally share new ideas with each other, and then immediately look at the history to see how they made it. It’s almost like being able to peek into the minds of others, and see their serialized thought process.

Model selection
When you create or revise a spark, you can choose from one of four AI models: Claude Sonnet 3.5, GPT-4o, o1-preview, and o1-mini. This is neat because it allows you to try an idea, and if you don’t get what you expected, you can undo and try again with an entirely different model. Additionally, the history tracks which model you used for each revision, which allows you to see how your sparks evolve over time.

New spark model picker
Selecting a model when creating a new spark

Spark revision model picker
Selecting a model when revising an existing spark

Managed runtime environment
We refer to GitHub Spark as an “app centric” tool (vs. a “code centric” tool). Not because it doesn’t allow you to see or edit the code (it does!), but because it’s designed for creating apps that are meant to be seen, felt, and used—as opposed to simply generating code, and then expecting you to do something with it (build, deploy, provision a database, etc.).

And it enables this by complementing its toolchain with a managed runtime environment that is built around four core capabilities:

Deployment-free hosting
Themable design system
Persistent data storage
Integrated model prompting
Deployment-free hosting
When you create or revise a spark, the changes are automatically deployed, and can be run and installed on your desktop, tablet, or mobile device (via a PWA). In this sense, GitHub Spark is kind of like a micro app cloud, which collapses the act of creating, deploying, and using software into a single gesture: expressing your ideas through natural language 🚀

Spark dashboard on mobile; Spark app in fullscreen mode on mobile
Viewing your dashboard of sparks and then opening one on your phone

Themable design system
To ensure that your apps look and feel nice, GitHub Spark includes a set of built-in UI components, and a themable design system. So whenever you create a new app, things like form controls, layout, and icons should seem polished out-of-the-box. And if you want to tweak anything further, you can use the theme editor to change the default accent color, border radius, app spacing, and color theme (light/dark).

Spark theme editor; Spark app after modifying its theme properties
Before and after modifying the theme properties of a spark

Persistent data storage
Whether you’re making a todo list, a gardening planner, or a tic-tac-toe game, most interesting apps need to store data. And the GitHub Spark runtime has you covered, by providing a managed key-value store, and automatically knowing when to use it. Additionally, GitHub Spark provides a data editor, which lets you easily see and edit the data your spark is using. That way you have full control over any state, but without needing to worry about any of the details.

Spark data editor; Editing a key in the Spark data editor
Viewing the data that a spark is storing, and then editing a specific key/value

Integrated model prompting
The GitHub Spark runtime is integrated with GitHub Models, and allows you to add generative AI features to your sparks, without any knowledge of LLMs (e.g. summarizing a document, generating stories for a children’s bedtime app). Additionally, it provides a prompt editor, which lets you see the prompts that GitHub Spark generates, and enables you to tweak them if needed—without needing to edit any code.

Spark prompt editor; Editing a prompt in the Spark prompt editor
Viewing the AI prompts that your spark is using, and then editing one manually

Phew! That was a lot. But in order for GitHub Spark to enable the aspiration we have (reducing the cost of app creation to zero), we felt like this toolchain and runtime were absolutely necessary. And we think that users are going to love the way it feels 🥰

What’s next?
As a technical preview, GitHub Spark is still very early, and has a loooong list of TODOs. But over the next few months, we’re looking forward to admitting users off the waitlist, and iterating closely with them every week. So if you’re interested in taking this journey with us, then check out the FAQ and then join in on the fun over at the GitHub Next Discord server 👋

That said, if you’re curious about what things are top of mind, you can expect to see us exploring into the following directions:

- Expanding the collaboration modalities (e.g. a public gallery, allowing users to perform a semantic merge of changes that someone made in a fork of their spark, multi-player)
- Expanding the editor surface (e.g. providing an “x-ray mode” that allows summarizing and adjusting precise behaviors of the app)
- Expanding the runtime environment (e.g. more built-in components, better integration with 3rd party services, enabling file storage and vector search)
- Lots of other cool stuff that we haven't even thought of!

###
https://github.blog/news-insights/octoverse/octoverse-2024/
GitHub
Octoverse: AI leads Python to top language as the number of global developers surges
In this year’s Octoverse report, we study how public and open source activity on GitHub shows how AI is expanding as the global developer community surges in size.

GitHub Staff (@github) · October 29, 2024 (updated October 31, 2024)
Remember when people said AI would replace developers? Our data tells a different story. As AI rapidly expands, developers are increasingly building AI models into applications and engaging with AI projects on GitHub in large numbers. At the same time, we’re seeing an unprecedented number of developers join GitHub from across the globe, and many of these developers are contributing to open source projects for the first time.

In 2024, Python overtook JavaScript as the most popular language on GitHub, while Jupyter Notebooks skyrocketed—both of which underscore the surge in data science and machine learning on GitHub. We’re also seeing increased interest in AI agents and smaller models that require less computational power, reflecting a shift across the industry as more people focus on new use cases for AI.

Our data also shows a lot more people are joining the global developer community. In the past year, more developers joined GitHub and engaged with open source and public projects (in some cases, empowered by AI). And since tools like GitHub Copilot started going mainstream in early 2023, the number of developers on GitHub has rapidly grown with significant gains in the global south. While we see signals that AI is driving interest in software development, we can’t fully explain the surge in global growth our data reflects (but we’ll keep studying it).

At GitHub, we know the critical role open source plays in bridging early experimentation and widespread adoption. In this year’s Octoverse report, we’ll explore how AI and a rapidly growing global developer community are coming together with compounding results.

[Graphic] GitHub Octoverse 2024 top-line metrics: 518 million total projects (25% year-over-year growth), nearly 1 billion contributions to public and open source projects, 5.6 billion contributions to all projects on GitHub, 137,000 public generative AI projects (98% year-over-year growth), more than 1 million maintainers, teachers, and students who have used GitHub Copilot for free, and Python as the new top language on GitHub.

We uncover three big trends:

A surge in global generative AI activity. AI is growing and evolving fast, and developers globally are going far beyond code generation with today’s tools and models. While the United States leads in contributions to generative AI projects on GitHub, we see more absolute activity outside the United States. In 2024, there was a 59% surge in the number of contributions to generative AI projects on GitHub and a 98% increase in the number of projects overall—and many of those contributions came from places like India, Germany, Japan, and Singapore.

A rapidly growing number of developers worldwide—especially in Africa, Latin America, and Asia. Notable growth is occurring in India, which is expected to have the world’s largest developer population on GitHub by 2028, as well as across Africa and Latin America. We also see Brazil’s developer community growing fast. Some of this is attributable to students. The GitHub Education program, for instance, has had more than 7 million verified participants. We’ve also seen 100% year-over-year growth among students, teachers, and open source maintainers adopting GitHub Copilot as part of our complimentary access program. This suggests AI isn’t just helping more people learn to write code or build software faster—it’s also attracting and helping more people become developers. First-time open source contributors continue to show wide-scale interest in AI projects. But we aren’t seeing signs that AI has hurt open source with low-quality contributions.

Python is now the most used language on GitHub as global open source activity continues to extend beyond traditional software development. We saw Python emerge for the first time as the most used language on GitHub (more on that later). Python is used heavily across machine learning, data science, scientific computing, hobbyist, and home automation fields among others. The rise in Python usage correlates with large communities of people joining the open source community from across the STEM world rather than the traditional community of software developers. This year, we also saw a 92% spike in usage across Jupyter Notebooks. This could indicate people in data science, AI, machine learning, and academia increasingly use GitHub. Systems programming languages, like Rust, are also on the rise, even as Python, JavaScript, TypeScript, and Java remain the most widely used languages on GitHub.

###
https://research.google/blog/generating-zero-shot-personalized-portraits/
Google
Generating zero-shot personalized portraits
November 11, 2024

Suraj Kothawade, Software Engineer, Core ML, and Sherry Ben, Staff Software Engineer, Google DeepMind

A new AI model that can transform selfies into different artistic styles while keeping facial features recognizable.

Recent advances in text-to-image and image-to-image (I2I) models have led to significant improvements in image quality and prompt adherence. However, existing I2I models often struggle with generating fine-grained details, particularly in challenging domains like human faces where preserving image likeness is crucial.

This post introduces a novel zero-shot I2I model specifically designed for personalized and stylized selfie generation. It effectively addresses the challenge of fine-grained image manipulation by combining two key capabilities: (1) personalization, which accurately preserves the similarity of the facial image in the input selfie, and (2) stylization, which faithfully applies the artistic style specified in the text prompt. This allows users to transform their selfies into a variety of styles while maintaining their unique facial features. We demonstrate the effectiveness of the model by showcasing its ability to generate high-quality, personalized, and stylized selfies.

Model process: (a) The input provides the reference image. (b) The text prompt specifies the desired artistic style (e.g., "A portrait of watercolor style using pastel colors"). (c) The generated output image exhibits the specified style while preserving the subject's likeness.

Using adapters to capture face nuances
While text prompts work well for many image generation tasks (like "a cat wearing a hat"), they can be limiting when it comes to generating images with specific and nuanced details. This is particularly true for faces, where capturing individual features and expressions with just words is incredibly difficult.

The model takes two inputs: an image of a person’s face, and a text prompt describing the desired style. We then use two kinds of "adapters," which are like mini-AI assistants, to help the foundation model understand the nuances of faces.

- Image adapter: This assistant focuses on unique features of a selfie image. This ensures the generated images truly look like the selfie image.
- Control adapter: This assistant analyzes the face’s pose and expression.
These assistants communicate with the foundation model using a technique called cross-attention, which allows them to blend information from the reference image, the desired style, and any expression seamlessly. This teamwork ensures the creation of a stylized image that's still unmistakably recognizable as the input image.
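
The post does not include code, but the adapter-plus-cross-attention pattern it describes can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the class name, dimensions, token counts, and residual blend are assumptions, not Google's released implementation. It simply shows backbone hidden states attending over adapter tokens (identity features from the image adapter concatenated with pose/expression features from the control adapter).

```python
import torch
import torch.nn as nn

class AdapterCrossAttention(nn.Module):
    """Hypothetical sketch of adapter conditioning via cross-attention
    (illustrative only; not the model described in the post)."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, backbone_states: torch.Tensor, adapter_tokens: torch.Tensor):
        # Query = backbone latents; Key/Value = adapter tokens
        # (image-adapter identity tokens + control-adapter pose/expression tokens).
        attended, _ = self.attn(backbone_states, adapter_tokens, adapter_tokens)
        return self.norm(backbone_states + attended)  # residual blend of both sources

# Toy shapes: 64 backbone latent tokens; 4 identity tokens + 4 control tokens.
backbone = torch.randn(1, 64, 768)
identity_tokens = torch.randn(1, 4, 768)   # stand-in for "image adapter" output
control_tokens = torch.randn(1, 4, 768)    # stand-in for "control adapter" output
out = AdapterCrossAttention()(backbone, torch.cat([identity_tokens, control_tokens], dim=1))
print(out.shape)  # torch.Size([1, 64, 768])
```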

[Figure] The model encodes two inputs: a reference image and a text prompt; the control encoder is generated automatically. This encoded information guides the diffusion process, generating a stylized output image that retains the features depicted in the reference image while adhering to the style specified in the text prompt and incorporating the facial pose and expression analysis from the control adapter.

Creativity with a breadth of styles
Our model can generate faces in a plethora of styles. Below are some examples:

- 3D cartoon: Transform yourself into a 3D animated character.
- Watercolor painting: Capture the delicate beauty of a hand-painted portrait.
- Anime: Become the star of your own anime adventure.
- Pencil sketch: Embrace the classic elegance of a sketched portrait.

[Figure] Model outputs, left to right: (a) input portrait with the prompt “A person portrait stylized as 3D cartoon character.” (b-e) Generated outputs showing prompt adherence, with minor adjustments to head pose.

[Figure] Model outputs, left to right: (a) input portrait with the prompt “A portrait of watercolor style using pastel colors.” (b-e) Generated outputs showing prompt adherence, with minor adjustments to head pose.

[Figure] Model outputs, left to right: (a) input portrait with the prompt “A person portrait in a detailed anime style.” (b-e) Generated outputs showing prompt adherence, with minor adjustments to head pose.

[Figure] Model outputs, left to right: (a) input portrait with the prompt “A 4B pencil sketch of a portrait.” (b-e) Generated outputs showing prompt adherence, with minor adjustments to head pose.

Additionally, a user can prompt the model to modify the expression — to smiling, crying, or looking angry — while maintaining the image likeness and the chosen style.

[Figure] Left: input image. Right three images: model outputs. (a) Top row caption, “An image of smiling face in watercolor painting style”; (b) middle row caption, “An image of crying face in watercolor painting style”; (c) bottom row caption, “An image of angry face in watercolor painting style”.

Applying portrait stylization
This model is accessible on Imagen on Vertex AI. Detailed instructions for utilizing the model can be found in the accompanying user guide and in guidance for using Imagen responsibly. This framework enables personalized image stylization, allowing users to explore diverse artistic expressions while preserving the similarity of input facial images.

What’s next
Personalization of AI-generated images goes beyond generating headshots. Often, users want to personalize the full person, including features like body pose. Stay tuned for additional innovation that will enable further artistic expression.

###
https://arxiv.org/abs/2411.02830
Mixtures of In-Context Learners
[Submitted on 5 Nov 2024]
Mixtures of In-Context Learners
Giwon Hong, Emile van Krieken, Edoardo Ponti, Nikolay Malkin, Pasquale Minervini
In-context learning (ICL) adapts LLMs by providing demonstrations without fine-tuning the model parameters; however, it does not differentiate between demonstrations and quadratically increases the complexity of Transformer LLMs, exhausting the memory. As a solution, we propose Mixtures of In-Context Learners (MoICL), a novel approach to treat subsets of demonstrations as experts and learn a weighting function to merge their output distributions based on a training set. In our experiments, we show performance improvements on 5 out of 7 classification datasets compared to a set of strong baselines (up to +13% compared to ICL and LENS). Moreover, we enhance the Pareto frontier of ICL by reducing the inference time needed to achieve the same performance with fewer demonstrations. Finally, MoICL is more robust to out-of-domain (up to +11%), imbalanced (up to +49%), or noisy demonstrations (up to +38%) or can filter these out from datasets. Overall, MoICL is a more expressive approach to learning from demonstrations without exhausting the context window or memory.

Uses subsets of demonstrations to train experts via in-context learning. A trainable weighting function is used to combine the experts' next-token predictions.
This approach applies to black-box LLMs since access to the internal parameters of the LLM is not required.
Good properties include the following:
- competitive with standard ICL while being significantly more data, memory, and computationally efficient
- resilient to noisy demonstrations and label imbalance
Overall, it is a very cool and simple approach to making better use of in-context demonstrations, which remains one of the more important methods for getting the most out of LLMs today.
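
To make the mechanism concrete, here is a minimal sketch of the core idea as described in the abstract: each "expert" is the same LLM conditioned on a different subset of demonstrations, and a small set of trainable scalar weights mixes their next-token distributions. The expert logits are stubbed with random tensors below (in practice they come from one LLM forward pass per subset, which is also why only output distributions are needed and black-box LLMs work); the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

def moicl_log_probs(expert_logits: torch.Tensor, mixture_logits: torch.Tensor):
    """Mix per-expert next-token distributions with learned scalar weights.

    expert_logits:  (k, vocab) next-token logits, one row per demonstration-subset expert.
    mixture_logits: (k,) trainable parameters; softmax gives the mixture weights.
    """
    weights = torch.softmax(mixture_logits, dim=0)       # (k,)
    probs = torch.softmax(expert_logits, dim=-1)         # (k, vocab)
    mixed = (weights.unsqueeze(-1) * probs).sum(dim=0)   # (vocab,)
    return torch.log(mixed + 1e-12)

# Toy example: 3 experts over a 5-token vocabulary; only the 3 mixture weights are trained.
expert_logits = torch.randn(3, 5)                        # stand-in for per-expert LLM outputs
mixture_logits = torch.zeros(3, requires_grad=True)
target = torch.tensor([2])                               # gold next token on a training example
loss = F.nll_loss(moicl_log_probs(expert_logits, mixture_logits).unsqueeze(0), target)
loss.backward()                                          # gradients reach only mixture_logits
print(loss.item(), mixture_logits.grad)
```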

###
https://developer.nvidia.com/blog/3x-faster-allreduce-with-nvswitch-and-tensorrt-llm-multishot/?ncid=so-link-567977
NVIDIA
3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot
Nov 01, 2024
By Anton Korzh, Brian Pharris, Nick Comly, Ashraf Eassa and Amr Elmeleegy

Deploying generative AI workloads in production environments where user numbers can fluctuate from hundreds to hundreds of thousands – and where input sequence lengths differ with each request – poses unique challenges. To achieve low latency inference in these environments, multi-GPU setups are a must – irrespective of the GPU generation or its memory capacity. To enhance inference performance in production-grade setups, we’re excited to introduce TensorRT-LLM Multi-shot, a new multi-GPU communication protocol that leverages the NVIDIA NVLink Switch to significantly increase communication speeds by up to 3x. This blog outlines this new feature and how it helps developers and solution architects address the limitations of traditional multi-GPU communication methods.

Challenges with traditional AllReduce algorithms
For low latency inference, multi-GPU is critical, regardless of the memory capacity of a single GPU. However, at low concurrency, the time GPUs spend exchanging data can outweigh the time spent on compute. For optimal performance, an efficient AllReduce operation – a collective operation that combines partial results from each participating GPU – is critical.

Traditional approaches use ring-based algorithms, where the partial values are passed around a ring of GPUs. Each GPU contributes its values and passes the result to its neighbor. This process is repeated 2N-2 times where N is the number of GPUs working together, and by the end of the process, every GPU has the same summed value. A second pass over the ring is required to propagate summed values from the last GPU to the rest.

The Ring approach makes efficient use of available GPU-to-GPU bandwidth per communication step, but as the number of GPUs increases, so does the number of steps. This increases latency, as all GPUs need to stay synchronized at every step of the ring. ‌These synchronization latencies add significant latency overhead and can make it difficult to meet more stringent latency targets.

The Ring AllReduce algorithm is described below:

- Ring algorithm: GPU-1 → GPU-2 → … → GPU-N → GPU-1 → GPU-2 → … → GPU-(N-1)
- 2N-2 steps, with a full tensor send/recv each step
- Latency: 2N-2 communication steps (N = number of GPUs)
- Traffic: (4N-4)/N tensor bytes of send/recv
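
To make the 2N-2 step count concrete, here is a toy NumPy simulation of the ring pattern described above: a reduce-scatter pass followed by an all-gather pass around the ring. It models only the data movement on simulated buffers, not real GPU communication or the NCCL/TensorRT-LLM implementation.

```python
import numpy as np

def ring_allreduce(gpu_data):
    """Toy simulation of ring AllReduce over N simulated GPUs.

    gpu_data: list of N equal-length 1-D arrays (one partial result per GPU).
    After 2N-2 simulated communication steps, every buffer holds the element-wise sum.
    """
    n = len(gpu_data)
    bufs = [np.array_split(np.asarray(g, dtype=float), n) for g in gpu_data]
    steps = 0
    # Pass 1 (reduce-scatter, N-1 steps): in step s, GPU i sends chunk (i - s) mod N
    # to its neighbor GPU i+1, which accumulates it into its own copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            bufs[(i + 1) % n][c] = bufs[(i + 1) % n][c] + bufs[i][c]
        steps += 1
    # Pass 2 (all-gather, N-1 steps): GPU i now owns the fully reduced chunk (i + 1) mod N
    # and forwards reduced chunks around the ring until every GPU has all of them.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            bufs[(i + 1) % n][c] = bufs[i][c].copy()
        steps += 1
    assert steps == 2 * n - 2
    return [np.concatenate(b) for b in bufs]

gpus = [np.arange(8.0) * (g + 1) for g in range(4)]     # 4 simulated GPUs, 8 elements each
result = ring_allreduce(gpus)
assert all(np.allclose(r, sum(gpus)) for r in result)   # every GPU holds the full sum
```
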
Addressing AllReduce communication challenges with TensorRT-LLM MultiShot
TensorRT-LLM MultiShot is a new algorithm that reduces the O(N) latency of Ring AllReduce by up to 3x by leveraging multicast in NVSwitch. Multicast is a hardware acceleration feature in NVSwitch that allows a GPU to send data once and have that data delivered simultaneously to all other GPUs, minimizing the number of communication steps to two inter-GPU synchronizations while remaining bandwidth efficient. Without NVSwitch, this would take N times the communication bandwidth.

TensorRT-LLM Multishot separates the AllReduce into a ReduceScatter operation followed by an AllGather operation (for more detailed descriptions of collective operations, see this documentation).

Each GPU is responsible for accumulating only a portion of the result tensor.

The first step (or “shot”) involves each GPU sending the different slices of the tensor to the respective GPU responsible for accumulating that slice of the tensor.

After accumulating locally, each GPU now has the correct sum accumulators for its unique slice of the output.

In the second step (or “shot”), each GPU broadcasts the result slice to all other GPUs using the NVSwitch multicast capability. This minimizes the per GPU bandwidth required as the NVSwitch itself performs data amplification; each GPU sends 1/N the data and receives the full result tensor in one step.

The entire operation only takes two communication steps, regardless of the number of GPUs performing tensor parallel inference.

- TensorRT-LLM MultiShot algorithm: each GPU sends its slices (ReduceScatter), computes its slice sum, then broadcasts the result in a single multicast operation
- Latency: 2 communication steps (regardless of the number of GPUs)
- Traffic: 2 tensor bytes of send/recv (regardless of the number of GPUs)
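
For contrast, here is the same toy framing for the two-shot data flow described above: a ReduceScatter in which each GPU accumulates the slice it owns, followed by an AllGather modeled as a single multicast of each reduced slice. This is a conceptual illustration of the data movement only, not the TensorRT-LLM or NVSwitch implementation.

```python
import numpy as np

def multishot_allreduce(gpu_data):
    """Toy model of the two-step MultiShot data flow (not the real implementation)."""
    n = len(gpu_data)
    slices = [np.array_split(np.asarray(g, dtype=float), n) for g in gpu_data]
    # Shot 1 (ReduceScatter): GPU i receives slice i from every peer and reduces it locally.
    reduced = [sum(slices[src][i] for src in range(n)) for i in range(n)]
    # Shot 2 (AllGather via multicast): each owner sends its reduced slice once; the switch
    # fans it out, so every GPU ends up with the full result tensor in a single step.
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(n)], 2            # always 2 communication steps

gpus = [np.arange(8.0) * (g + 1) for g in range(8)]      # 8 simulated GPUs
result, steps = multishot_allreduce(gpus)
assert steps == 2 and all(np.allclose(r, sum(gpus)) for r in result)
```
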
Why this matters
Since this algorithm requires only two communication steps rather than 2N-2 (where N is the number of GPUs), MultiShot can be nearly 3x faster than Ring AllReduce. The benefits of this algorithm are particularly evident with smaller message sizes and high parallelism – the scenario needed when minimum latency is required for a great user experience.

This can be used to either reduce minimum latency, or increase throughput at a given latency. In scenarios with more aggressive latency thresholds, this can lead to super-linear scaling with the number of GPUs.

Figure 1. With TensorRT-LLM MultiShot, AllReduce latency is reduced by up to 3x across message sizes.

Achieving optimal inference performance requires careful workload analysis and a deep understanding of performance bottlenecks. By gaining that understanding – both through internal engineering work as well as through close collaboration with external developers and researchers – we can quickly and frequently optimize many aspects of our platform to deliver great performance for users.

As we continue to identify and implement new performance optimizations – some may be extensive, others might be narrower in scope – we will be providing regular updates on these optimizations, providing both technical motivation and quantified benefits.

###
https://ds4sd.github.io/docling/
IBM
11/1/24
Docling

Docling parses documents and exports them to the desired format with ease and speed.

Features
🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
📑 Advanced PDF document understanding incl. page layout, reading order & table structures
🧩 Unified, expressive DoclingDocument representation format
🤖 Easy integration with LlamaIndex 🦙 & LangChain 🦜🔗 for powerful RAG / QA applications
🔍 OCR support for scanned PDFs
💻 Simple and convenient CLI
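
A minimal usage sketch based on Docling's documented Python API at the time of this release (the DocumentConverter interface with Markdown export); treat the exact names as potentially version-dependent and check the project docs if they have since changed.

```python
from docling.document_converter import DocumentConverter

source = "path/to/document.pdf"   # local path or URL; PDF, DOCX, PPTX, HTML, etc.
converter = DocumentConverter()
result = converter.convert(source)

# Export the unified DoclingDocument representation to Markdown (JSON export is also available).
print(result.document.export_to_markdown())
```
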
Coming soon
♾️ Equation & code extraction
📝 Metadata extraction, including title, authors, references & language
🦜🔗 Native LangChain extension
IBM ❤️ Open Source AI
Docling has been brought to you by IBM.
