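OpenAI announced the o1-preview and o1-mini models, which are built for complex problem solving, and Google introduced DataGemma, a set of models that ground AI responses in public statistical data, along with NotebookLM. Microsoft released the Eureka framework for evaluating AI models and used it to analyze the strengths and weaknesses of a range of models. Mistral AI released the Small Instruct 22B model with 128K context support, and Alibaba unveiled Qwen 2.5, which is strong in multilingual support and structured data handling. Hugging Face published the Reflective MAGLLAMA dataset and the FineVideo video-understanding dataset to support model training and video analysis. A survey on the role of small models examined their practical value in the era of large language models (LLMs), and another paper analyzed LLM memory in depth, proposing the concept of "Schrödinger's memory" to explain how an LLM exhibits memory only in response to particular inputs.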

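OpenAI, o1-preview and o1-mini announced

Link, September 13, 2024

  • OpenAI announced two models, o1-preview and o1-mini, aimed at stronger complex problem solving
  • o1-preview analyzes and solves complex problems step by step and performs especially well in science, coding, and math
  • The models use chain-of-thought reasoning, working through logical steps before producing an answer
  • Through **reinforcement learning**, the models learn to refine their reasoning and to correct errors that arise along the way
  • o1-preview targets high-end problem solving, while o1-mini gives developers and researchers a practical, cost-efficient option
  • Web browsing and file and image upload are planned for later
  • The models apply safety and alignment guidelines to avoid inappropriate responses
  • Benchmark results: 73.3% accuracy on GPQA Diamond, and 83.3% on the AIME math exam with multiple attempts (74.4% in a single attempt)
  • In coding, an Elo rating of 1,673 on Codeforces, far above GPT-4o
  • Available to ChatGPT Plus and Team users, with ChatGPT Enterprise and Edu access to follow; developers on API usage tier 5 can also use the models through the API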

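Google, DataGemma released

Link, September 12, 2024

  • Google DeepMind announced DataGemma, open models connected to Data Commons that use public statistical data to improve the factual accuracy of AI responses
  • DataGemma RIG (Retrieval-Interleaved Generation) queries Data Commons while it generates a response and checks statistics against it; DataGemma RAG (Retrieval-Augmented Generation) retrieves relevant context from Data Commons before generation begins, bringing in information beyond the training data
  • DataGemma focuses on reducing LLM "hallucination" by grounding responses in trustworthy statistics; RIG raised factual accuracy from 5-17% to 58%
  • Trained with JAX on TPUv5e and available on Hugging Face under the Gemma license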

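Google, NotebookLM announced

Link, July 12, 2023

  • Google Labs announced NotebookLM, an experimental AI-first notebook designed to help users get insights from their documents faster
  • Unlike conventional document-based note-taking, NotebookLM uses AI to summarize documents, explain complex ideas, and surface new connections
  • The language model is "grounded" in the Google Docs the user selects and supports summaries, question answering, and idea generation over those documents
  • Designed for students, researchers, and creators who need to synthesize material from multiple sources; more document formats are planned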

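Mistral AI, Small Instruct 22B model released

Link, September 17, 2024

  • Mistral AI announced Small Instruct 22B (Mistral Small v24.09), a new multilingual model with 128K context and function calling support
  • With 22B parameters, the model sits between Mistral NeMo 12B and Mistral Large 123B and is aimed at tasks such as translation, summarization, and sentiment analysis
  • Model weights are available on Hugging Face for non-commercial use under the Mistral Research License
  • Pixtral 12B, a vision model with image understanding, is also available and is distributed under the Apache 2.0 license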

Alibaba, Qwen 2.5 announced

Link, September 19, 2024

  • Alibaba announced the Qwen 2.5 model series; the largest model, at 72B parameters, rivals Llama 3.1 405B and beats Mistral Large 2 (123B)
  • Qwen2.5 was trained on up to 18 trillion tokens and scores 85+ on MMLU, 85+ on HumanEval, and 80+ on MATH
  • Handles up to 128K tokens, supports more than 29 languages, and performs well on structured data tasks such as JSON generation
  • Specialized models Qwen2.5-Coder and Qwen2.5-Math were released alongside it for coding and math tasks

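Microsoft, Eureka announced

Link, September 18, 2024

  • Microsoft announced Eureka, an open-source framework for evaluating AI models, together with an in-depth analysis of 12 state-of-the-art models
  • The evaluation focuses on multimodal and language capabilities and goes beyond a single score per model to show where each model is strong or weak
  • Beyond model-to-model comparison, the analysis stresses that basic capabilities important for real-world use remain challenging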

Hugging Face, Reflective-MAGLLAMA dataset released

Link, September 13, 2024

  • Reflective MAGLLAMA is a synthetic dataset of 10,000 samples built with reflection prompting, a technique that encourages deeper reasoning and analysis
  • The reflective responses were generated with the LLaMa 3.1 model, and the dataset can be used to train and evaluate models for analytical problem solving

Jina AI, Reader-LM released

Link, September 14, 2024

  • Jina AI announced Reader-LM, a model that handles the whole pipeline of converting an HTML web page to Markdown
  • Two models were released, Reader-LM-0.5B and Reader-LM-1.5B, to handle a wide range of HTML input and automate Markdown extraction

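Hugging Face, FineVideo dataset released

Link, September 15, 2024

  • Hugging Face announced FineVideo, a dataset for video understanding with 43,751 videos across 122 categories
  • It provides about 3,425 hours of content for multimodal tasks such as emotion analysis, storytelling, and media editing, with detailed annotations of scenes, characters, and audio-visual interactions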

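Survey paper on the role of small models published

Link, September 10, 2024

  • A survey paper systematically analyzes the role of **small models** in the era of LLMs (large language models)
  • It argues that small models (SMs) are practical and especially useful in academic research and in resource-constrained business settings
  • It examines how SMs collaborate or compete with LLMs and offers insight into using compute efficiently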

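LLM memory research paper published

Link, September 16, 2024

  • A paper presents an in-depth study of the memory capability of LLMs
  • It introduces the concept of **Schrödinger's memory**: an LLM's memory can be observed only when a specific query is posed
  • It explores similarities and differences between human memory and LLM memory, arguing that the Transformer architecture acts as a dynamically adaptive model that can recall full content from minimal input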
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

링크, date,

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://openai.com/o1/
OpenAI
9/13/24


OpenAI has announced a new series of AI models, **o1-preview** and **o1-mini**. The models break complex problems into steps and work through them in a way that resembles deliberate human reasoning, and they perform strongly in science, coding, and math. They reason through a "chain of thought" and were trained with reinforcement learning to refine that reasoning and correct their own errors. They also follow safety and alignment guidelines to avoid inappropriate responses. o1-preview offers the most advanced problem solving, while o1-mini prioritizes cost efficiency and speed, giving developers and researchers a practical option. Web browsing and file and image upload are planned for future updates.

- Related link: [OpenAI o1](https://openai.com/o1/)

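## How it works

The o1 series is designed to go through a **deliberate reasoning process** before answering, much as a person does when solving a hard problem. The following techniques are used:

1. **Chain of thought**: The model works through a long internal reasoning process before producing its final answer. It decomposes the problem into steps and draws a logical conclusion at each one. For a complex math problem, for example, it considers intermediate steps such as applying formulas, substituting variables, and carrying out calculations.

2. **Reinforcement learning**: The model's reasoning is improved through reinforcement learning. Successful problem solving is reinforced, and mistakes are corrected. As a result, the model learns to try different strategies and to find the best solution.

3. **Error recognition and correction**: The model has learned to recognize errors in its own reasoning and to fix them. This ability is essential for complex problems and makes the model more reliable.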

## Performance

The o1 series has posted strong results across several fields:

- **Science**:
  - **GPQA Diamond**: On this test of scientific knowledge, o1-preview reached **73.3%** accuracy, exceeding human experts with PhDs.
  - **By subject**:
    - **Physics**: 89.5% accuracy, well ahead of GPT-4o's 68.6%.
    - **Chemistry**: 60.2% accuracy, ahead of GPT-4o's 43%.

- **Math**:
  - **AIME (American Invitational Mathematics Examination)**: The o1 model solved **74.4%** of problems in a single attempt and **83.3%** with multiple attempts, a score that places it among the top 500 students in the United States.
  - **MATH benchmark**: The o1 model reached **94.8%** accuracy, far above GPT-4o's 60.3%.

- **Coding**:
  - **Codeforces**: The o1 model reached an Elo rating of **1,673**, placing it in the 89th percentile. That is more than twice GPT-4o's rating of 808.
  - **HumanEval benchmark**: o1-mini reached **92.4%** accuracy on coding problems.
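To put the Codeforces ratings above in perspective, the sketch below applies the standard Elo expected-score formula, E_A = 1 / (1 + 10^((R_B - R_A) / 400)), to the two ratings quoted in the text. This is only an illustration under an assumption: Codeforces uses its own Elo-like rating system, so the standard formula is an approximation, and the function name `elo_expected_score` is made up for this example.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model (assumption)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the text above.
O1_RATING = 1673
GPT4O_RATING = 808

# Expected score of the higher-rated entrant in a head-to-head comparison.
p_o1 = elo_expected_score(O1_RATING, GPT4O_RATING)
print(f"rating gap: {O1_RATING - GPT4O_RATING}")
print(f"expected score for the 1673-rated entrant: {p_o1:.4f}")
```

Because Elo is a logistic scale, the rating gap matters rather than the ratio of the two numbers; "twice the rating" is not the same as "twice as strong".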

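## Example uses

- **Solving math problems**:
  - **Education**: The o1 model can work through complex math problems step by step, which makes its output usable as study material. In an integration problem, for example, it can explain the intermediate differentiation steps and the analysis of the function in detail so that students understand the underlying concepts.

- **Code generation and debugging**:
  - **Software development**: Developers can use the o1 model for help implementing complex algorithms or fixing bugs. For example, it can generate code that addresses concurrency problems in parallel programs or distributed systems and identify and fix potential synchronization issues.

- **Scientific research assistance**:
  - **Bioinformatics**: The model's reasoning can be applied to annotating gene sequencing data and to protein structure prediction.
  - **Chemical reaction prediction**: It can propose synthesis routes for new compounds and analyze reaction mechanisms, shortening research time.

- **Data analysis and interpretation**:
  - **Big data**: It can extract and interpret patterns in large datasets to derive business insights.
  - **Statistical modeling**: It can help build complex statistical models and interpret the results to support decisions.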

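## Safety considerations

OpenAI took the following steps so that the model's advanced reasoning can be used safely:

1. **A new safety training approach**: The model was trained to reason about **safety and alignment guidelines** in context and to apply them. When it handles a request, it takes the safety rules into account and avoids inappropriate responses.

2. **Stronger resistance to jailbreaking**: The model is much better at resisting attempts to bypass its safety guidelines. On one of the hardest jailbreak tests, for example, o1-preview scored **84**, compared with 22 for GPT-4o.

3. **Stronger internal governance and external collaboration**: To ensure safety, OpenAI strengthened its internal evaluation process and is working with AI safety institutes in the US and UK on pre-release and post-release testing of the model.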

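## Who it is for

The o1 series can be used in many fields that involve complex problem solving:

- **Medical researchers**:
  - **Genetic analysis**: Identifying the genes behind hereditary diseases and analyzing gene expression patterns, in support of personalized treatments.
  - **Drug development**: Predicting the efficacy of drug candidates and analyzing side effects to make research more efficient.

- **Physicists**:
  - **Quantum computing**: Helping simulate and optimize complex quantum algorithms.
  - **Astrophysics**: Supporting the modeling of cosmic phenomena and the interpretation of data.

- **Developers**:
  - **AI assistants**: Building conversational interfaces with the model's natural language capabilities and tailoring services to user needs.
  - **Automation tools**: Generating scripts or programs that automate repetitive work and improve productivity.

- **Educators and students**:
  - **Educational content**: Explaining complex concepts in plain language and writing teaching material with examples.
  - **Assignment and research help**: Helping students work through assignments and develop research ideas.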

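## OpenAI o1-mini

**o1-mini** is a lighter version of the o1 series, optimized for strong performance in specific areas.

- **Efficiency**:
  - **Lower cost**: It is **80% cheaper** than o1-preview, which makes it accessible to more users.
  - **Higher speed**: The smaller model responds faster, which suits applications where response time matters.

- **Areas of focus**:
  - **Coding and debugging**: It specializes in complex code generation and debugging, helping developers write code faster with fewer errors.
  - **Math and logical reasoning**: It performs very well on advanced mathematical calculation and logic problems.

- **Performance examples**:
  - **Coding**:
    - **HumanEval benchmark**: o1-mini scored **92.4%**, showing that it can solve complex programming problems effectively.
    - **Cybersecurity CTF**: It reached **43%** accuracy on high-school-level CTF challenges.
  - **Math**:
    - **AIME**: o1-mini scored **70%**, far above GPT-4o's 13.4%.

- **Limitations and future improvements**:
  - **Limited world knowledge**: On tasks that need broad general knowledge, it may perform worse than GPT-4o.
  - **Plans**: Future versions are expected to address these limitations and widen the range of fields the model can be applied to.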

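## How to use OpenAI o1

- **ChatGPT Plus and Team users**:
  - **Access**: Select o1-preview or o1-mini in the model picker in the ChatGPT interface.
  - **Message limits**: Weekly message limits apply at launch (for example, 30 messages per week for o1-preview). The limits are there to keep the service stable while infrastructure is scaled up, and they are expected to increase.

- **ChatGPT Enterprise and Edu users**:
  - **Timing**: Access to both models begins the week after the announcement, and additional benefits may be offered depending on team size.

- **Developers**:
  - **API**: Developers on API usage tier 5 can start prototyping with both models in the API as of the announcement.
  - **Feature limits**: The API does not currently support function calling, streaming, or system messages for these models; they are expected to be added based on developer feedback.
  - **Getting started**: See OpenAI's [API documentation](http://platform.openai.com/docs/guides/reasoning) for integration and usage.

- **General users**:
  - **Plans**: OpenAI plans to give ChatGPT Free users access to o1-mini so that more people can try the new models.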
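The developer notes above say the API does not support system messages or streaming for these models at launch. The sketch below shows one way a request could be shaped under those limits: any instructions are folded into the single user message, and no `stream` or tool fields are set. The helper names `build_o1_request` and `send` are hypothetical; `send` assumes a client object from the OpenAI Python SDK (v1 style, `client.chat.completions.create`) and is not called here, so the block runs without a network connection or API key.

```python
from typing import Optional


def build_o1_request(prompt: str,
                     instructions: Optional[str] = None,
                     model: str = "o1-preview") -> dict:
    """Build a chat request that uses only a user message.

    System messages are not supported for o1 models at launch (per the text),
    so optional instructions are prepended to the user content instead.
    """
    content = prompt if not instructions else f"{instructions}\n\n{prompt}"
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }


def send(client, request: dict) -> str:
    """Send the request. `client` is assumed to be an `openai.OpenAI()` instance."""
    response = client.chat.completions.create(**request)
    return response.choices[0].message.content


request = build_o1_request(
    "Find the bug in this function: def add(a, b): return a - b",
    instructions="Answer concisely.",
    model="o1-mini",
)
print(request)
```

Keeping request construction separate from the network call makes it easy to check that no unsupported fields slip in before anything is sent.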

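## Future plans

The released models are early versions, and OpenAI plans the following features and improvements in later updates:

1. **New features**:
   - **Web browsing**: Letting the model look up current information on the internet so that answers are more accurate and timely.
   - **File and image upload**: Letting users upload files or images for the model to analyze and process.

2. **Model performance**:
   - **Better reinforcement learning**: Continued improvement of the reinforcement learning algorithms to strengthen the model's reasoning.
   - **Multimodal support**: Research on handling images, audio, and other data in addition to text.

3. **The GPT series**:
   - **New GPT models**: Alongside the o1 series, OpenAI will keep developing and releasing new GPT models to meet different user needs.
   - **Integration and compatibility**: Improving compatibility across models so that users can freely choose the model they want.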

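## LangChain compared with OpenAI o1

| Criterion | LangChain | o1-preview |
| ----------------- | ------------------------------------------------------ | -------------------------------------------------------------- |
| **Main function** | Framework for connecting models and tools into chains | Reasoning model for complex problem solving |
| **Usability** | Flexible, depending on the models and APIs the user chooses | Reasoning model trained for specific problem types; some API features limited |
| **Performance** | Limited for complex computation and multi-step reasoning | Strong at math, coding, and science problems |
| **Reasoning** | Depends on the connected models and tools | Uses chain of thought for higher-level reasoning |
| **Stability** | Varies with the connected tools and models | Trained with reinforcement learning; strong results in safety evaluations |
| **Extensibility** | Easy to integrate with many tools and APIs | Optimized for complex reasoning; limited general-purpose use |
| **Cost** | Depends on the resources the user configures | Relatively high compared with other high-performance models |
| **Coding** | Depends on the model; no coding-specific optimization | Very strong on complex coding problems (Codeforces 89th percentile) |
| **Safety** | Depends on the model; no built-in safety mechanism | Safety strengthened through reinforcement learning; passes demanding safety tests |
| **Applications** | Broad (different models can be chosen for each situation) | Suited to complex problems in math, science, and coding |
| **Web browsing** | Possible with whichever model the user chooses | Not yet supported (planned) |
| **File upload** | Possible (depends on the model) | Not yet supported (planned) |
| **Model updates** | Latest versions of many models can be used | Regular updates planned to improve performance |

**Overall comparison**:

- LangChain is a flexible framework that connects tools and models and can be applied in many situations.
- o1-preview is a model specialized for complex problems in math, science, and coding, and it offers strong reasoning and safety through chain of thought. Some common features, such as browsing and file upload, are not yet supported.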

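## Conclusion

OpenAI's o1-preview and o1-mini mark a significant step forward and open new possibilities for complex problem solving. Reinforcement learning and chain-of-thought reasoning have substantially improved the models' reasoning, which is expected to have an impact across many fields.

Future updates are expected to bring more features, in particular web browsing and file and image upload. These will widen the range of uses and give users more ways to apply the models.

New models in the GPT series will also continue to be developed and released, so OpenAI's progress looks set to continue. Users and developers can use these advances to find new opportunities and build new AI-based solutions.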

###
https://blog.google/technology/ai/google-datagemma-ai-llm/
Google

New Gemma 2 models! 🚀
Google DeepMind just released two Gemma 2 models fine-tuned for Data Commons (DC). DataGemma RAG turns a user query with context into a DC query. DataGemma RIG generates responses with interleaved DC queries.

TL;DR;

🔍 Incorporates public statistical data to answer user queries on data commons.

📚 Trained on synthetically generated data generated by Gemini 1.5

🚀 DataGemma RIG improved factual accuracy from 5-17% to 58%.

🥇 DataGemma RAG 99% for statistical claims when citing directly from the table.

🔥 DataGemma was trained on TPUv5e, using JAX

🤗 Available on Hugging Face under Gemma license


Key takeaway: fine-tuning your LLMs improves performance even for Retrieval Augmented Generation (RAG) and tool use.

DataGemma: Using real-world data to address AI hallucinations
Sep 12, 2024

DataGemma are the world’s first open models designed to help address the challenges of hallucination by grounding LLMs in the vast, real-world statistical data of Google's Data Commons.

Prem Ramaswami, Head of Data Commons
James Manyika, SVP, Technology & Society

Large language models (LLMs) powering today’s AI innovations are becoming increasingly sophisticated. These models can comb through vast amounts of text and generate summaries, suggest new creative directions and even draft code. However, as impressive as these capabilities are, LLMs sometimes confidently present information that is inaccurate. This phenomenon, known as "hallucination," is a key challenge in generative AI.

Today we're sharing promising research advancements that tackle this challenge directly, helping reduce hallucination by anchoring LLMs in real-world statistical information. Alongside these research advancements, we are excited to announce DataGemma, the first open models designed to connect LLMs with extensive real-world data drawn from Google's Data Commons.

Data Commons: A vast repository of publicly available, trustworthy data
Data Commons is a publicly available knowledge graph containing over 240 billion rich data points across hundreds of thousands of statistical variables. It sources this public information from trusted organizations like the United Nations (UN), the World Health Organization (WHO), Centers for Disease Control and Prevention (CDC) and Census Bureaus. Combining these datasets into one unified set of tools and AI models empowers policymakers, researchers and organizations seeking accurate insights.

Think of Data Commons as a vast, constantly expanding database filled with reliable, public information on a wide range of topics, from health and economics to demographics and the environment, which you can interact with in your own words using our AI-powered natural language interface. For example, you can explore which countries in Africa have had the greatest increase in electricity access, how income correlates with diabetes in US counties or your own data-curious query.

How Data Commons can help tackle hallucination
As generative AI adoption is increasing, we’re aiming to ground those experiences by integrating Data Commons within Gemma, our family of lightweight, state-of-the art open models built from the same research and technology used to create the Gemini models. These DataGemma models are available to researchers and developers starting now.

DataGemma will expand the capabilities of Gemma models by harnessing the knowledge of Data Commons to enhance LLM factuality and reasoning using two distinct approaches:

1. RIG (Retrieval-Interleaved Generation) enhances the capabilities of our language model, Gemma 2, by proactively querying trusted sources and fact-checking against information in Data Commons. When DataGemma is prompted to generate a response, the model is programmed to identify instances of statistical data and retrieve the answer from Data Commons. While the RIG methodology is not new, its specific application within the DataGemma framework is unique.

Example query: ''Has the use of renewables increased in the world?'' applying DataGemma RIG methodology leverages Data Commons (DC) for authoritative data.

2. RAG (Retrieval-Augmented Generation) enables language models to incorporate relevant information beyond their training data, absorb more context, and enable more comprehensive and informative outputs. With DataGemma, this was made possible by leveraging Gemini 1.5 Pro’s long context window. DataGemma retrieves relevant contextual information from Data Commons before the model initiates response generation, thereby minimizing the risk of hallucinations and enhancing the accuracy of responses.

Example query: ''Has the use of renewables increased in the world?'' applying DataGemma RAG methodology showcases greater reasoning and inclusion of footnotes.
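The difference between the two approaches can be sketched in a few lines. This is a toy illustration, not DataGemma's implementation: the in-memory `TOY_STATS` table stands in for Data Commons, its value is a placeholder rather than a real statistic, and the `[DC:...]` marker syntax and the function names are assumptions made for this example. In the RIG-style function the draft already contains interleaved query markers that are resolved against the store; in the RAG-style function retrieval happens first and the retrieved facts are placed in the prompt before generation.

```python
import re
from typing import List

# Hypothetical stand-in for Data Commons. The value is a placeholder, not real data.
TOY_STATS = {"renewable_share_world": "42% (placeholder)"}


def lookup(variable: str) -> str:
    """Return the stored value for a statistical variable, or a visible gap marker."""
    return TOY_STATS.get(variable, "[no data]")


def rig_answer(draft: str) -> str:
    """RIG-style: resolve query markers that are interleaved in the generated draft."""
    return re.sub(r"\[DC:(\w+)\]", lambda m: lookup(m.group(1)), draft)


def rag_prompt(question: str, variables: List[str]) -> str:
    """RAG-style: retrieve first, then prepend the retrieved facts to the prompt."""
    context = "\n".join(f"- {v}: {lookup(v)}" for v in variables)
    return f"Context from data store:\n{context}\n\nQuestion: {question}"


question = "Has the use of renewables increased in the world?"
draft = "Renewables now supply [DC:renewable_share_world] of world energy."

print(rig_answer(draft))
print(rag_prompt(question, ["renewable_share_world"]))
```

In both cases the number in the final text comes from the data store rather than from the model's parameters, which is the property the article credits with reducing hallucination.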

Promising results and future directions
Our preliminary findings using RIG and RAG are early, but encouraging. We've observed notable enhancements in the accuracy of our language models when handling numerical facts. This suggests that users will experience fewer hallucinations for use cases across research, decision-making or simply satisfying curiosity. Explore these results in our research paper.

Illustration of a RAG query and response. Supporting ground truth statistics are referenced as tables served from Data Commons. *Partial response shown for brevity.

Our research is ongoing, and we’re committed to refining these methodologies further as we scale up this work, subject it to rigorous testing, and ultimately integrate this enhanced functionality into both Gemma and Gemini models, initially through a phased, limited-access approach.

By sharing our research and making this latest Gemma model variant an “open” model once again, we aspire to facilitate the broader adoption of these Data Commons-led techniques for grounding LLMs in factual data. Making LLMs more reliable and trustworthy is key to ensuring they are indispensable tools for everyone, and building a future where AI empowers people with accurate information, fostering informed decisions, and a deeper understanding of the world around us.

Researchers and developers can also get started with DataGemma using these quickstart notebooks for both the RIG and RAG approaches. To learn more about how Data Commons and Gemma work together, read our Research post.


###
https://notebooklm.google/
Google
Introducing NotebookLM
Jul 12, 2023

An AI-first notebook, grounded in your own documents, designed to help you gain insights faster.

Raiza Martin, Product Manager, Google Labs
Steven Johnson, Editorial Director, Google Labs

At Google I/O this year we introduced a number of AI-first experiments in development, including Project Tailwind — a new kind of notebook designed to help people learn faster.

Today we’re beginning to roll out Project Tailwind with its new name: NotebookLM, an experimental offering from Google Labs. It’s our endeavor to reimagine what notetaking software might look like if you designed it from scratch knowing that you would have a powerful language model at its core: hence the LM. It will be immediately available to a small group of users in the U.S. as we continue to refine the product and make it more helpful.

It’s hard to go from information to insight
We know people are struggling with the rapid growth of information — it's everywhere and it’s overwhelming. As we've been talking with students, professors and knowledge workers, one of the biggest challenges is synthesizing facts and ideas from multiple sources. You often have the sources you want, but it's time consuming to make the connections.

We started to explore what we could build that would help people make connections faster in the midst of all this data, especially using sources they care most about.

NotebookLM automatically generates a document guide to help you get a better understanding of the material

NotebookLM: an AI notebook for everyone
NotebookLM is an experimental product designed to use the power and promise of language models paired with your existing content to gain critical insights, faster. Think of it as a virtual research assistant that can summarize facts, explain complex ideas, and brainstorm new connections — all based on the sources you select.

A key difference between NotebookLM and traditional AI chatbots is that NotebookLM lets you “ground” the language model in your notes and sources. Source-grounding effectively creates a personalized AI that’s versed in the information relevant to you. Starting today, you can ground NotebookLM in specific Google Docs that you choose, and we’ll be adding additional formats soon.

Once you’ve selected your Google Docs, you can do three things:

Get a summary: When you first add a Google Doc into NotebookLM, it will automatically generate a summary, along with key topics and questions to ask so you get a better understanding of the material.
Ask questions: When you’re ready for a deeper dive, you can ask questions about the documents you’ve uploaded. For example:
A medical student could upload a scientific article about neuroscience and tell NotebookLM to “create a glossary of key terms related to dopamine”
An author working on a biography could upload research notes and ask a question like: “Summarize all the times Houdini and Conan Doyle interacted.”
Generate ideas: NotebookLM isn’t just for Q&A. We’ve found some of its more delightful and useful capabilities are when it’s able to help people come up with creative new ideas. For example:
A content creator could upload their ideas for new videos and ask: “Generate a script for a short video on this topic.”
Or an entrepreneur raising money could upload their pitch and ask: “What questions would potential investors ask?”
While NotebookLM’s source-grounding does seem to reduce the risk of model “hallucinations,” it’s always important to fact-check the AI’s responses against your original source material. When you're drawing on multiple sources, we make that fact-checking easy by accompanying each response with citations, showing you the most relevant original quotes from your sources.

Learning and building, together
NotebookLM is an experimental product, built by a small team in Google Labs.

Our team has two goals in mind:

Build a product with our users: We’ll be talking to people and communities often to learn about what’s working well and where the gaps are, with the intent of making NotebookLM a truly useful product.
Roll out this technology responsibly: Getting feedback directly from you is a critical part of developing AI responsibly. We will also use a strict set of safety criteria in alignment with our AI Principles and implement appropriate safeguards before expanding to more users and launching new functionality.
We’ve built NotebookLM such that the model only has access to the source material that you’ve chosen to upload, and your files and dialogue with the AI are not visible to other users. We do not use any of the data collected to train new AI models.

We hope that in these early days you give NotebookLM a shot. Sign up to the waitlist to try it out!

###
https://mistral.ai/news/september-24-release/
Mistral AI
9/17/24
Mistral drops improved Small Instruct 22B - Multilingual, 128K context, supports tool use/ function calling! 🔥
Seats comfortably between Mistral NeMo 12B and Mistral Large 123B
> 22B parameters
> Vocabulary to 32768
> Supports function calling
> 128k sequence length
> Model weights (non commercial) on the Hub
> Works out of the box with Transformers 🤗
Congrats Mistral AI for a brilliant open release! GG!
AI in abundance. Big pricing improvements across the board, and a new 12B vision model.

AI in abundance
Introducing a free API, improved pricing across the board, a new enterprise-grade Mistral Small, and free vision capabilities on le Chat.

September 17, 2024 Mistral AI team
We’re taking new steps in our mission to bring frontier AI in the hands of everyone. Today, we are releasing:

A free tier on la Plateforme
A pricing update over our entire family of models
A new, better Mistral Small
Free vision capabilities on le Chat with Pixtral 12B
Free tier on la Plateforme
La Plateforme, the serverless platform to tune and build with Mistral models as API endpoints, now offers a free tier enabling developers to get started with experimentation, evaluation, and prototyping at no cost. Users can seamlessly evolve their endpoints into a commercial tier, and benefit from full data isolation (with a free zero-retention option) and higher rate limits. Users can also choose to deploy our models to different infrastructure: whether using our cloud partners (Azure / AWS / GCP), or choosing to deploy our solutions on their own tenant.

Reduced prices across the board
We’ve worked hard on making our endpoints faster and more efficient. This enables us to reduce prices across the board, with the following prices

| Model | New price (input / output, per M tokens) | Old price (input / output, per M tokens) | Price drop |
| --- | --- | --- | --- |
| Mistral Nemo | $0.15 / $0.15 | $0.3 / $0.3 | 50% |
| Pixtral 12B | $0.15 / $0.15 | | |
| Mistral Small | $0.2 / $0.6 | $1 / $3 | 80% |
| Codestral | $0.2 / $0.6 | $1 / $3 | 80% |
| Mistral Large | $2 / $6 | $3 / $9 | 33% |

This price update makes Mistral Large 2 the most cost-efficient frontier model, makes our smaller models extremely cost efficient, and allows customers to realize significantly faster returns on their AI investments. Updated pricing will also reflect on our cloud platform partner offerings (Azure AI Studio, Amazon Bedrock, Google Vertex AI).
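As a quick check on the price list above, the sketch below recomputes the quoted price drops from the old and new per-million-token prices and estimates the cost of a sample workload. The prices are the ones listed above; the workload size (10M input tokens, 2M output tokens) and the names `PRICES`, `cost_usd`, and `price_drop_percent` are assumptions chosen for illustration.

```python
# Per-million-token prices in USD as (input, output), taken from the list above.
PRICES = {
    "Mistral Nemo": {"new": (0.15, 0.15), "old": (0.30, 0.30)},
    "Mistral Small": {"new": (0.20, 0.60), "old": (1.00, 3.00)},
    "Codestral": {"new": (0.20, 0.60), "old": (1.00, 3.00)},
    "Mistral Large": {"new": (2.00, 6.00), "old": (3.00, 9.00)},
}


def cost_usd(prices: tuple, input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload given (input, output) prices per million tokens."""
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def price_drop_percent(model: str) -> int:
    """Percentage drop in the input-token price, rounded to the nearest integer."""
    old_input = PRICES[model]["old"][0]
    new_input = PRICES[model]["new"][0]
    return round(100 * (1 - new_input / old_input))


# Hypothetical workload: 10M input tokens and 2M output tokens.
for model, tiers in PRICES.items():
    old_cost = cost_usd(tiers["old"], 10_000_000, 2_000_000)
    new_cost = cost_usd(tiers["new"], 10_000_000, 2_000_000)
    print(f"{model}: ${old_cost:.2f} -> ${new_cost:.2f} "
          f"({price_drop_percent(model)}% drop)")
```

Because input and output prices fall by the same ratio for every listed model, the percentage drop in total cost does not depend on the input/output mix of the workload.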

Small gets a big update
We are proud to unveil Mistral Small v24.09, our latest enterprise-grade small model, an upgrade of Mistral Small v24.02. Available under the Mistral Research License, this model offers customers the flexibility to choose a cost-efficient, fast, yet reliable option for use cases such as translation, summarization, sentiment analysis, and other tasks that do not require full-blown general purpose models.

With 22 billion parameters, Mistral Small v24.09 offers customers a convenient mid-point between Mistral NeMo 12B and Mistral Large 2, providing a cost-effective solution that can be deployed across various platforms and environments. As shown below, the new small model delivers significant improvements in human alignment, reasoning capabilities, and code over the previous model.

We’re releasing Mistral Small v24.09 under the MRL license. You may self-deploy it for non-commercial purposes, using e.g. vLLM

Eye of the Tiger - Pixtral on le Chat
Following our latest Apache model release, Pixtral 12B, a vision-capable model with image understanding capabilities, is now freely available on le Chat. Pixtral 12B is the first open source model to support images of any size without degradation in text-based performance, and you can now use it on le Chat to scan, analyze, search, caption, and better understand your personal or enterprise knowledge files.

Importantly, the model is available under the Apache 2.0 license, so you can bring visual understanding capabilities to your own environment without having to upload your files to a third-party provider. This is a critical capability for customers that operate with sensitive or proprietary information.

Do more with less
All the above announcements are now available. Head over to le Chat to try the new image understanding capabilities. To try the free tier of la Plateforme, sign in at console.mistral.ai. To learn more about Mistral Small v24.09, Pixtral 12B, and other Mistral models and pricing, click here.

###
https://github.com/QwenLM/Qwen2.5
2024.09.19
Open Source AI/ML is on fire today! 🔥 Multilingual (29) Qwen 2.5 just dropped w/ 128K context too! The 72B rivals Llama 3.1 405B and beats Mistral Large 2 (123B) ⚡
> Trained on an extensive dataset containing up to 18 trillion tokens
> It surpasses its predecessor, Qwen2, with significantly higher scores on MMLU (85+), HumanEval (85+), and MATH (80+) benchmarks
> Excels in instruction following, generating lengthy texts (over 8K tokens), and understanding structured data like tables. It also shows significant progress in generating structured outputs, particularly JSON.
> Supports over 29 languages, including major global languages, and can handle up to 128K tokens, with a text generation capacity of 8K tokens.

They release specialised models as well:

1. Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B

2. Qwen2.5-Coder: 1.5B, 7B, and 32B on the way

3. Qwen2.5-Math: 1.5B, 7B, and 72B.

Kudos to Alibaba Qwen team for shipping high quality model

Qwen2.5: A Party of Foundation Models!
September 19, 2024
· 9 min · 1739 words · Qwen Team | Translations:
简体中文
GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD

Introduction
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started!

Our latest release features the LLMs Qwen2.5, along with specialized models for coding, Qwen2.5-Coder, and mathematics, Qwen2.5-Math. All open-weight models are dense, decoder-only language models, available in various sizes, including:

Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B
Qwen2.5-Coder: 1.5B, 7B, and 32B on the way
Qwen2.5-Math: 1.5B, 7B, and 72B.

All our open-source models, except for the 3B and 72B variants, are licensed under Apache 2.0. You can find the license files in the respective Hugging Face repositories. In addition to these models, we offer APIs for our flagship language models: Qwen2.5-Plus and Qwen2.5-Turbo through Model Studio, and we encourage you to explore them! Furthermore, we have also open-sourced the Qwen2-VL-72B, which features performance enhancements compared to last month’s release.

For more details about Qwen2.5, Qwen2.5-Coder, and Qwen2.5-Math, feel free to visit the following links:

Qwen2.5 LLM Qwen2.5-Coder Qwen2.5-Math


Get ready to unlock a world of possibilities with our extensive lineup of models! We’re excited to share these cutting-edge models with you, and we can’t wait to see the incredible things you’ll achieve with them!

Takeaways
In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens. Compared to Qwen2, Qwen2.5 has acquired significantly more knowledge (MMLU: 85+) and has greatly improved capabilities in coding (HumanEval 85+) and mathematics (MATH 80+). Additionally, the new models achieve significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON. Qwen2.5 models are generally more resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. Like Qwen2, the Qwen2.5 language models support up to 128K tokens and can generate up to 8K tokens. They also maintain multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more. Below, we provide basic information about the models and details of the supported languages.

The specialized expert language models, namely Qwen2.5-Coder for coding and Qwen2.5-Math for mathematics, have undergone substantial enhancements compared to their predecessors, CodeQwen1.5 and Qwen2-Math. Specifically, Qwen2.5-Coder has been trained on 5.5 trillion tokens of code-related data, enabling even smaller coding-specific models to deliver competitive performance against larger language models on coding evaluation benchmarks. Meanwhile, Qwen2.5-Math supports both Chinese and English and incorporates various reasoning methods, including Chain-of-Thought (CoT), Program-of-Thought (PoT), and Tool-Integrated Reasoning (TIR).

Qwen2.5 Specification
Performance
Qwen2.5
To showcase Qwen2.5’s capabilities, we benchmark our largest open-source model, Qwen2.5-72B - a 72B-parameter dense decoder-only language model - against leading open-source models like Llama-3.1-70B, Mistral-Large-V2, and DeepSeek-V2.5. We present comprehensive results from instruction-tuned versions across various benchmarks, evaluating both model capabilities and human preferences.

Qwen2.5-72B Instruct Performance
Besides the instruction-tuned language models, we figure out that the base language model of our flagship opensource model Qwen2.5-72B reaches top-tier performance even against larger models like Llama-3-405B.

Qwen2.5-72B Base Model Performance
Furthermore, we benchmark the latest version of our API-based model, Qwen-Plus, against leading proprietary and open-source models, including GPT4-o, Claude-3.5-Sonnet, Llama-3.1-405B, and DeepSeek-V2.5. This comparison showcases Qwen-Plus’s competitive standing in the current landscape of large language models. We show that Qwen-Plus significantly outcompetes DeepSeek-V2.5 and demonstrates competitive performance against Llama-3.1-405B, while still underperforming compared to GPT4-o and Claude-3.5-Sonnet in some aspects. This benchmarking not only highlights Qwen2.5-Plus’s strengths but also identifies areas for future improvement, reinforcing our commitment to continuous enhancement and innovation in the field of large language models.

Qwen2.5-Plus Instruct Performance
A significant update in Qwen2.5 is the reintroduction of our 14B and 32B models, Qwen2.5-14B and Qwen2.5-32B. These models outperform baseline models of comparable or larger sizes, such as Phi-3.5-MoE-Instruct and Gemma2-27B-IT, across diverse tasks. They achieve an optimal balance between model size and capability, delivering performance that matches or exceeds some larger models. Additionally, our API-based model, Qwen2.5-Turbo, offers highly competitive performance compared to the two open-source models, while providing a cost-effective and rapid service.

Qwen2.5-32B Instruct Performance
In recent times, there has been a notable shift towards small language models (SLMs). Although SLMs have historically trailed behind their larger counterparts (LLMs), the performance gap is rapidly diminishing. Remarkably, even models with just 3 billion parameters are now delivering highly competitive results. The accompanying figure illustrates a significant trend: newer models achieving scores above 65 in MMLU are increasingly smaller, underscoring the accelerated growth in knowledge density among language models. Notably, our Qwen2.5-3B stands out as a prime example, achieving impressive performance with only around 3 billion parameters, showcasing its efficiency and capability compared to its predecessors.

Qwen2.5 Small Model
In addition to the notable enhancements in benchmark evaluations, we have refined our post-training methodologies. Our four key updates include support for long text generation of up to 8K tokens, significantly improved comprehension of structured data, more reliable generation of structured outputs, particularly in JSON format, and enhanced performance across diverse system prompts, which facilitates effective role-playing. Check the LLM blog for details about how to leverage these capabilities.

Qwen2.5-Coder
Since the launch of CodeQwen1.5, we have attracted numerous users who rely on this model for various coding tasks, such as debugging, answering coding-related questions, and providing code suggestions. Our latest iteration, Qwen2.5-Coder, is specifically designed for coding applications. In this section, we present the performance results of Qwen2.5-Coder-7B-Instruct, benchmarked against leading open-source models, including those with significantly larger parameter sizes.

Qwen2.5-Coder Instruct Performance
We believe that Qwen2.5-Coder is an excellent choice as your personal coding assistant. Despite its smaller size, it outperforms many larger language models across a range of programming languages and tasks, demonstrating its exceptional coding capabilities.

Qwen2.5-Math
In terms of the math specific language models, we released the first models, Qwen2-Math, last month, and this time, compared to Qwen2-Math, Qwen2.5-Math has been pretrained larger-scale of math related data, including the synthetic data generated by Qwen2-Math. Additionally we extend the support of Chinese this time and we also strengthen its reasoning capabilities by endowing it with the abilities to perform CoT, PoT, and TIR. The general performance of Qwen2.5-Math-72B-Instruct surpasses both Qwen2-Math-72B-Instruct and GPT4-o, and even very small expert model like Qwen2.5-Math-1.5B-Instruct can achieve highly competitive performance against large language models.

Qwen2.5 Math Performance Across All Sizes
Develop with Qwen2.5
The simplest way to use is through Hugging Face Transfomer as demonstrated in the model card:

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
To use Qwen2.5 with vLLM, running the following command can deploy an OpenAI API compatible service:

python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct
or use vllm serve if you use vllm>=0.5.3. Then you can communicate with Qwen2.5 via curl:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Tell me something about large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"repetition_penalty": 1.05,
"max_tokens": 512
}'
Furthermore, Qwen2.5 supports vllm’s built-in tool calling. This functionality requires vllm>=0.6. If you want to enable this functionality, please start vllm’s OpenAI-compatible service with:

vllm serve Qwen/Qwen2.5-7B-Instruct --enable-auto-tool-choice --tool-call-parser hermes
You can then use it in the same way you use GPT’s tool calling.

Qwen2.5 also supports Ollama’s tool calling. You can use it by starting Ollama’s OpenAI-compatible service and using it in the same way you use GPT’s tool calling.

Qwen2.5’s chat template also includes a tool calling template, meaning that you can use Hugging Face transformers’ tool calling support.

The vllm / Ollama / transformers tool calling support uses a tool calling template inspired by Nous’ Hermes. Historically, Qwen-Agent provided tool calling support using Qwen2’s own tool calling template (which is harder to be integrated with vllm and Ollama), and Qwen2.5 maintains compatibility with Qwen2’s template and Qwen-Agent as well.


Friends of Qwen
💗 Qwen is nothing without its friends! So many thanks to the support of these old buddies and new friends :

Hugging Face Transformers

Finetuning: Peft, ChatLearn, Llama-Factory, Axolotl, Firefly, Swift, XTuner, Unsloth, Liger Kernel

Quantization: AutoGPTQ, AutoAWQ, Neural Compressor

Deployment: vLLM, SGL, SkyPilot, TensorRT-LLM, OpenVino, TGI, Xinference

API Platforms: Together, Fireworks, OpenRouter, Sillicon Flow

Local Run: MLX, Llama.cpp, Ollama, LM Studio, Jan

Agent and RAG Frameworks: Dify, LlamaIndex, CrewAI

Evaluation: LMSys, OpenCompass, Open LLM Leaderboard

Model Training: Arcee AI, Sailor, Dolphin, Openbuddy

We would like to extend our heartfelt gratitude to the numerous teams and individuals who have contributed to Qwen, even if they haven’t been specifically mentioned. Your support is invaluable, and we warmly invite more friends to join us in this exciting journey. Together, we can enhance collaboration and drive forward the research and development of the open-source AI community, making it stronger and more innovative than ever before.

What’s Next?
While we are thrilled to launch numerous high-quality models simultaneously, we recognize that significant challenges remain. Our recent releases demonstrate our commitment to developing robust foundation models across language, vision-language, and audio-language domains. However, it is crucial to integrate these different modalities into a single model to enable seamless end-to-end processing of information across all three. Additionally, although we have made strides in enhancing reasoning capabilities through data scaling, we are inspired by the recent advancements in reinforcement learning (e.g., o1) and are dedicated to further improving our models’ reasoning abilities by scaling inference compute. We look forward to introducing you to the next generation of models soon! Stay tuned for more exciting developments!




###
https://www.microsoft.com/en-us/research/blog/eureka-evaluating-and-understanding-progress-in-ai/
Microsoft
9/18/24
Excited to announce the release of Eureka, an open-source framework for evaluating and understanding large foundation models! 🌟 Eureka offers: 🔍 In-depth analysis of 12 cutting-edge models 🧠 Multimodal & language capability testing beyond single-score reporting and rankings 📈 Insights into model strengths, weaknesses, determinism, and backward compatibility.
Join us in exploring the next AI frontier and contribute to open-source evaluations & insights!
Blog:


In a fast-paced discipline like AI, where every model release promises the next big leap in intelligence, how often have we found ourselves puzzling about questions like:
- Is a fresh model release competitive with the most capable models known to date?
- If most models rank similarly in known leaderboards, are these models comparable, if not the same, in terms of capabilities?
- If they are not the same, what are the strengths and weaknesses of each model?
- Are there capabilities that are fundamental for making AI useful in the real world but also universally challenging for most models?


We will be sharing one Eureka insight per day in the following days, but here is a spoiler.
💡 *Eureka Insight Day 0*: In contrast to recent trends in evaluation reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for several capabilities. Despite the many observed improvements, it also becomes obvious that current models still struggle with a number of fundamental capabilities including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over refusals.
This first version of Eureka is a joint team effort from several amazing AI scientists at #MicrosoftResearch AI Frontiers: Vidhisha Balachandran Jingya Chen Neel J. Hamid Palangi Eduardo Salinas Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi + endless guidance and support from Ahmed Awadallah Ece Kamar Eric Horvitz John Langford Rafah Aboul Hosn Saleema Amershi. Thanks everyone for the shared curiosity and passion for understanding AI in depth and for making this release possible! We welcome contributions, questions, and suggestions from everyone in the community.


###
https://huggingface.co/datasets/thesven/Reflective-MAGLLAMA-v0.1.1
thesven
9/13/24

<thinking>
Magpie is a great method for creating synthetic datasets; we can prompt LLMs with "empty" user inputs.
<thinking>
<reflection>
To enhance reasoning, we add self-reflection tags and scale it using Argilla
distilabel.
<reflection>
<output>
New open dataset with 10k synthetic generated <thinking> samples, available on Hugging Face
https://lnkd.in/eXQq4Zi5
</output>

Dataset Card for Reflective-MAGLLAMA-v0.1
Please Use v0.1.1
This dataset has been created with distilabel.

Overview
Reflective MAGLLAMA is a dataset created using MAGPIE-generated prompts in combination with reflection pattern responses generated by the LLaMa 3.1 70B model. This dataset is tailored specifically to encourage reflection-based analysis through reflection prompting, a technique that enhances deeper thinking and learning. It was curated using the Distilabel framework, and the full pipeline used for building the dataset is available in the associated repository.

Purpose of the Dataset
The primary goal of the Reflective MAGLLAMA dataset is to provide a robust resource for training, fine-tuning, or evaluating models in the context of reflective thinking and analytical problem-solving. By simulating human-like reflective reasoning, the dataset encourages models to generate more thoughtful and insightful outputs.

Key Aspects of Reflection Prompting
Purpose and Benefits:

Reflection prompting is a strategic technique designed to:

Enhance learning and understanding by encouraging the model (or human) to think beyond the surface level.
Promote critical thinking, fostering an environment where decision-making becomes more thorough and insightful.
Improve the decision-making process by revealing underlying thought patterns and providing space for reflection on multiple perspectives.
Reveal insights into both individual and collective thought processes, offering a clearer understanding of how information is processed and conclusions are reached.
Data Collection Process
This dataset was constructed by feeding MAGPIE-generated prompts to LLaMa 3.1 70B and collecting the reflection-based responses. These responses were subsequently labeled using the Distilabel framework to ensure consistent quality and relevance of the outputs.

Intended Use
The Reflective MAGLLAMA dataset is intended for:

Researchers aiming to explore or enhance models in reflective thinking, decision-making, and reasoning tasks.
Developers looking to fine-tune models for applications involving education, critical thinking, or complex problem-solving.
Evaluation purposes, particularly in contexts where a model’s ability to reflect on its own reasoning and improve its decision-making is of value.

###
https://huggingface.co/jinaai/reader-lm-1.5b
Jinaai
9/14/24

𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗻𝗴 𝘆𝗼𝘂𝗿 𝗛𝗧𝗠𝗟 𝘄𝗲𝗯𝗽𝗮𝗴𝗲𝘀 𝘁𝗼 𝗺𝗮𝗿𝗸𝗱𝗼𝘄𝗻 𝗶𝘀 𝗻𝗼𝘄 𝗽𝗼𝘀𝘀𝗶𝗯𝗹𝗲 𝗲𝗻𝗱-𝘁𝗼-𝗲𝗻𝗱 𝘄𝗶𝘁𝗵 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 𝗟𝗟𝗠! 👏
Jina just released Reader-LM, that handles the whole pipeline of extracting markdown from HTML webpages.
A while ago, Jina had released a completely code-based deterministic program to do this extraction, based on some heuristics : e.g., “if the text is in a <p> tag, keep it, but if it’s hidden behind another, remove it”.
🤔 But they received complaints from readers: some found it too detailed, other not enough, depending on the pages.
➡️ So they decided, 𝗺𝗮𝘆𝗯𝗲 𝗵𝗲𝘂𝗿𝗶𝘀𝘁𝗶𝗰𝘀 𝘄𝗲𝗿𝗲 𝗻𝗼𝘁 𝗲𝗻𝗼𝘂𝗴𝗵: 𝗶𝗻𝘀𝘁𝗲𝗮𝗱, 𝘁𝗵𝗲𝘆 𝘁𝗿𝗶𝗲𝗱 𝘁𝗼 𝘁𝗿𝗮𝗶𝗻 𝗮 𝗟𝗟𝗠 𝘁𝗼 𝗱𝗼 𝘁𝗵𝗲 𝗰𝗼𝗺𝗽𝗹𝗲𝘁𝗲 𝗲𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻. This LLM does not need to be very strong,but it should handle a very long context: it’s a challenging, “shallow-but-wide” architecture.
𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
2️⃣ models: Reader-LM-0.5B and 1.5B
⚙️ Two stages of training: first, short and simple HTML to get the basics, then ramp up to longer and harder HTML up to 128k tokens
🔎 Use contrastive search for decoding: this empirically reduces “repeating output” issues
➡️ Their models beat much larger models at HTML extraction 🔥
🤗 Weights available on HF (sadly cc-by-nc license):

Trained by Jina AI.

Intro
Jina Reader-LM is a series of models that convert HTML content to Markdown content, which is useful for content conversion tasks. The model is trained on a curated collection of HTML content and its corresponding Markdown content.

Models
Name Context Length Download
reader-lm-0.5b 256K 🤗 Hugging Face
reader-lm-1.5b 256K 🤗 Hugging Face

###
https://huggingface.co/datasets/HuggingFaceFV/finevideo
HuggingFace
9/15/24

🚨 FineVideo is here: 66M words across 43K videos spanning 3.4K hours - CC-BY licensed video understanding dataset 🔥
> It enables advanced video understanding, focusing on mood analysis, storytelling, and media editing in multimodal settings
> Provides detailed annotations on scenes, characters, and audio-visual interactions
> Dataset's unique focus on emotional journey and narrative flow enable context-aware video analysis models
Dataset stats:
> 43,751 videos
> An average video length of 4.7 minutes with approximately 3,425 hours of content
> 122 categories w/ 358.61 videos per category on average
100% commercially permissive - use it as you like for whatever! 🐐

Description
This dataset opens up new frontiers in video understanding, with special focus on the tricky tasks of mood analysis, storytelling and media edition in multimodal settings.

It's packed with detailed notes on scenes, characters, plot twists, and how audio and visuals play together, making it a versatile tool for everything from beefing up pre-trained models to fine-tuning AI for specific video tasks.

What sets this dataset apart is its focus on capturing the emotional journey and narrative flow of videos - areas where current multimodal datasets fall short - giving researchers the ingredients to cook up more context-savvy video analysis models.

Dataset Explorer
You can explore the dataset directly from your browser in the FineVideo Space.

FineVideo Explorer
Dataset Distribution
This comprehensive dataset includes:

43,751 videos
An average video length of 4.7 minutes with approximately 3,425 hours of content
Content from 122 categories with 358.61 videos per category on average
Content categories
The videos were originally shared on YouTube under Creative Commons Attribution (CC-BY) licenses. FineVideo obtained these videos along with their speech-to-text transcriptions from YouTube-Commons, a project that aggregates audio transcripts of CC-BY licensed YouTube videos.

###
https://arxiv.org/abs/2409.06857
[Submitted on 10 Sep 2024 (v1), last revised 12 Sep 2024 (this version, v2)]
What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen, Gaël Varoquaux
Large Language Models (LLMs) have made significant progress in advancing artificial general intelligence (AGI), leading to the development of increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up model sizes results in exponentially higher computational costs and energy consumption, making these models impractical for academic researchers and businesses with limited resources. At the same time, Small Models (SMs) are frequently used in practical settings, although their significance is currently underestimated. This raises important questions about the role of small models in the era of LLMs, a topic that has received limited attention in prior research. In this work, we systematically examine the relationship between LLMs and SMs from two key perspectives: Collaboration and Competition. We hope this survey provides valuable insights for practitioners, fostering a deeper understanding of the contribution of small models and promoting more efficient use of computational resources. The code is available at this https URL https://github.com/tigerchen52/awesome_role_of_small_models

###
https://arxiv.org/abs/2409.10482
[Submitted on 16 Sep 2024 (v1), last revised 17 Sep 2024 (this version, v2)]
Schrodinger's Memory: Large Language Models
Wei Wang, Qing Li
Memory is the foundation of all human activities; without memory, it would be nearly impossible for people to perform any task in daily life. With the development of Large Language Models (LLMs), their language capabilities are becoming increasingly comparable to those of humans. But do LLMs have memory? Based on current performance, LLMs do appear to exhibit memory. So, what is the underlying mechanism of this memory? Previous research has lacked a deep exploration of LLMs' memory capabilities and the underlying theory. In this paper, we use Universal Approximation Theorem (UAT) to explain the memory mechanism in LLMs. We also conduct experiments to verify the memory capabilities of various LLMs, proposing a new method to assess their abilities based on these memory ability. We argue that LLM memory operates like Schrödinger's memory, meaning that it only becomes observable when a specific memory is queried. We can only determine if the model retains a memory based on its output in response to the query; otherwise, it remains indeterminate. Finally, we expand on this concept by comparing the memory capabilities of the human brain and LLMs, highlighting the similarities and differences in their operational mechanisms.

This paper provides a closer look at the memory capabilities of LLMs.
It uses the Universal Approximation Theorem to explain the memory mechanism of LLMs. It also proposes a new approach to evaluate LLM performance by comparing the memory capacities of different models.
The Transformer architecture "functions as a dynamic fitting UAT model, with a strong ability to adaptively fit inputs. As a result, LLMs can recall entire content based on minimal input information. Since this memory can only be confirmed when triggered by input, we refer to it as ”Schrodinger’s memory.”