OpenAI에서는 ChatGPT의 웹 검색 기능을 대폭 개선한 ‘ChatGPT search’를 발표하여 실시간으로 관련 웹 소스를 제공하고 사용자 경험을 향상시켰습니다. 또한 macOS와 Windows 데스크톱 앱에서 Advanced Voice 기능을 출시하여 음성 대화 기능을 확장하였고, Playground에는 작업 설명만으로 프롬프트와 스키마를 생성하는 기능을 도입하였으며, 언어 모델의 사실성을 평가하는 벤치마크 ‘SimpleQA’를 오픈 소스로 공개하였습니다. Stability AI는 ‘Stable Diffusion 3.5’를 공개하여 고품질 이미지 생성을 지원하고, 다양한 모델 변형을 제공하여 사용자 커스터마이징을 용이하게 하였습니다. Google은 Gemini API와 Google AI Studio에 ‘Grounding with Google Search’ 기능을 도입하여 모델의 응답 정확성과 최신성을 향상시켰습니다. Meta는 모바일 환경에서 효율적인 LLM 활용을 위한 ‘MobileLLM’과 연구자들을 위한 LLM 훈련 및 추론 라이브러리 ‘Meta Lingua’를 출시하였습니다. Hugging Face는 소형이지만 강력한 LLM 세트인 ‘SmolLM 2’를 공개하여 온디바이스 최적화를 달성하였습니다. Inspirai와 칭화대학교는 이미지, 음성, 텍스트 입력을 모두 이해하는 비주얼-오디오 어시스턴트 ‘Mini-Omni 2’를 발표하였고, Microsoft는 마커 없이도 고품질 전신 퍼포먼스 캡처가 가능한 ‘SynthMoCap’ 기술을 소개하였습니다. 또한 새로운 최첨단 텍스트-투-스피치(TTS) 모델 ‘MaskGCT’가 공개되었고, 연구자들은 대형 언어 모델이 개발자의 이데올로기를 반영한다는 연구 결과를 발표하였습니다.

OpenAI, ChatGPT search 소개

링크, 2024년 10월 31일

  • ChatGPT의 웹 검색 기능 대폭 개선: 이제 ChatGPT는 이전보다 훨씬 향상된 웹 검색 기능을 통해 신속하고 시기적절한 답변과 관련 웹 소스 링크를 제공합니다.
  • Plus, Team, SearchGPT 대기자 명단 사용자 우선 제공: 현재 Plus, Team, 그리고 SearchGPT 대기자 명단 사용자들에게 롤아웃 중이며, 모바일 및 데스크톱 앱과 chatgpt.com에서 이용 가능합니다.
  • Enterprise와 Edu 사용자 제공 예정: Enterprise와 Edu 사용자들은 몇 주 내에 접근 가능하며, 무료 사용자들에게는 향후 몇 달에 걸쳐 제공될 예정입니다.
  • 자동 및 수동 웹 검색 기능: ChatGPT는 사용자의 질문에 따라 자동으로 웹을 검색하거나, 사용자가 웹 검색 아이콘을 클릭하여 수동으로 검색할 수 있습니다.
  • 새로운 시각적 디자인 추가: 날씨, 주식, 스포츠, 뉴스, 지도 등의 카테고리에 대해 최신 정보와 새로운 시각적 디자인을 제공합니다.
  • 신뢰할 수 있는 뉴스 및 데이터 제공자와의 파트너십: AP News, Financial Times, Reuters 등과의 협력을 통해 최신 정보와 시각 자료를 추가하였습니다.
  • 출처 링크 제공으로 투명성 강화: 답변에 출처 링크를 포함하여 사용자가 추가 정보를 얻을 수 있도록 지원합니다.
  • 검색 모델의 기술적 개선: GPT-4o의 미세 조정 버전을 사용하였으며, o1-preview 출력 증류를 포함한 새로운 합성 데이터 생성 기술을 적용하였습니다.
  • 향후 계획: Advanced Voice 및 캔버스에 새로운 검색 경험을 도입할 예정이며, 무료 및 로그아웃 사용자들에게도 기능을 확대할 계획입니다.

OpenAI, 데스크톱 앱에 Advanced Voice 기능 추가

링크, 2024년 10월 31일

  • macOS와 Windows 데스크톱 앱에서 Advanced Voice 기능 제공: 데스크톱 환경에서 음성 대화 기능을 사용할 수 있게 되었습니다.
  • 최신 버전의 앱 필요: Advanced Voice 기능을 이용하려면 최신 버전의 데스크톱 앱을 다운로드해야 합니다.
  • 모바일 및 데스크톱 지원: ChatGPT는 이제 모바일과 데스크톱에서 모두 Advanced Voice 기능을 제공합니다.

OpenAI, Playground에서 프롬프트 생성 기능 도입

링크, 2024년 10월 31일

  • Playground의 Generate 버튼 소개: 간단한 작업 설명만으로 프롬프트, 함수, 스키마를 생성할 수 있는 기능을 제공합니다.
  • 메타 프롬프트와 스키마 사용: 최상의 프롬프트와 스키마를 생성하기 위해 메타 프롬프트와 메타 스키마를 사용합니다.
  • 생성 과정 간소화: 프롬프트와 스키마를 처음부터 작성하는 데 소요되는 시간을 절약하고 빠르게 시작할 수 있습니다.
  • 향후 발전된 기술 통합 예정: 앞으로 DSPy와 “Gradient Descent”와 같은 더 발전된 기술을 통합할 계획입니다.
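
아래는 이러한 메타 프롬프트 방식의 동작을 개념적으로 보여주는 최소한의 Python 스케치입니다. META_PROMPT 문구와 모델명은 설명을 위한 가정이며, 실제 Playground가 사용하는 메타 프롬프트와는 다를 수 있습니다.

# 메타 프롬프트 개념 스케치: 작업 설명으로부터 시스템 프롬프트 초안을 생성합니다.
# META_PROMPT 문구와 모델명(gpt-4o)은 설명용 가정입니다.
from openai import OpenAI

client = OpenAI()  # 환경 변수 OPENAI_API_KEY 필요

META_PROMPT = (
    "You are a prompt engineer. Given a task description, write a clear, "
    "detailed system prompt an LLM can follow, including output format and edge cases."
)

task_description = "고객 리뷰를 긍정/부정/중립으로 분류하고 근거를 한 문장으로 제시"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": META_PROMPT},
        {"role": "user", "content": task_description},
    ],
)
print(response.choices[0].message.content)  # 생성된 프롬프트 초안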

OpenAI, SimpleQA 벤치마크 공개

링크, 2024년 10월 30일

  • 사실성 측정을 위한 벤치마크 ‘SimpleQA’ 오픈 소스화: 언어 모델의 사실성을 평가하기 위한 새로운 벤치마크를 공개하였습니다.
  • 단답형 사실 질문에 초점: 복잡한 사실성 측정 문제를 단순화하여 단답형 질문으로 모델의 정확성을 평가합니다.
  • 높은 정확도와 다양성: 다양한 주제와 높은 정확도를 가진 질문으로 구성되어 있습니다.
  • 최신 모델에 대한 도전성: GPT-4o 등 프런티어 모델도 높은 정확도를 달성하기 어려운 도전적인 데이터셋입니다(GPT-4o 정확도 40% 미만).
  • 연구자 친화적 사용자 경험: 4,326개의 간결한 질문으로 구성되어 빠르고 간단하게 실행할 수 있으며, 평가 변동성이 낮습니다.
  • 모델 간 비교 및 교정 측정: 다양한 언어 모델의 성능 비교와 모델의 교정 능력을 평가하는 데 활용됩니다.
  • 오픈 소스 데이터셋 제공: 연구자들이 자유롭게 사용하고 피드백을 제공할 수 있도록 데이터셋을 공개하였습니다.
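
SimpleQA는 예측 답안을 정답과 비교해 ‘correct/incorrect/not attempted’로 분류하는 프롬프트 기반 분류기로 채점합니다(Sources 원문 참조). 아래는 이를 단순화한 Python 스케치로, 채점 프롬프트 문구와 모델명은 설명을 위한 가정이며 실제 공개된 채점 코드와 다를 수 있습니다.

# SimpleQA식 채점 루프의 개념 스케치입니다. 프롬프트 문구와 모델명은 가정입니다.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = (
    "You are grading a factual QA benchmark. Given a question, a gold answer, and a "
    "predicted answer, reply with exactly one of: CORRECT, INCORRECT, NOT_ATTEMPTED."
)

def grade(question: str, gold: str, predicted: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": GRADER_PROMPT},
            {"role": "user",
             "content": f"Question: {question}\nGold answer: {gold}\nPredicted answer: {predicted}"},
        ],
    )
    return response.choices[0].message.content.strip()

# 사용 예: 정답과 모순되는 답안은 INCORRECT로 분류되어야 합니다.
print(grade(
    "Which Dutch player scored an open-play goal in the 2022 Netherlands vs Argentina game?",
    "Wout Weghorst",
    "Virgil van Dijk",
))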

Stability AI, Stable Diffusion 3.5 출시

링크, 2024년 10월 29일 업데이트

  • Stable Diffusion 3.5 Medium 모델 공개: 25억 개(2.5B)의 파라미터로 구성된 이 모델은 소비자 하드웨어에서 바로 실행 가능하도록 설계되었습니다.
  • 고품질 이미지 생성: 크기에 비해 최고의 이미지 생성 품질을 제공하며, 고급 다중 해상도 기능을 갖추고 있습니다.
  • 여러 모델 변형 제공: Large, Large Turbo, Medium 등 다양한 모델 변형을 통해 사용자 요구에 맞게 선택 가능하도록 하였습니다.
  • 상업적 및 비상업적 사용 허가: Stability AI 커뮤니티 라이선스 하에 상업적 및 비상업적 용도로 무료로 사용 가능합니다.
  • 사용자 정의 및 효율성 강조: 모델 아키텍처와 훈련 방법을 개선하여 품질, 일관성, 다중 해상도 생성 능력을 향상시켰습니다.
  • 안전한 AI 개발 준수: 안전하고 책임감 있는 AI 관행을 준수하며, 잠재적인 오용을 방지하기 위한 조치를 취하였습니다.
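
아래는 Hugging Face diffusers로 Stable Diffusion 3.5 Medium을 불러와 이미지를 생성하는 최소한의 Python 스케치입니다. 모델 ID와 파이프라인 클래스, 추론 파라미터는 공개 시점의 관례를 가정한 것이므로, 실제 사용 전에는 모델 카드의 안내(라이선스 동의, 토큰 설정 등)를 확인해야 합니다.

# Stable Diffusion 3.5 Medium 추론 스케치 (모델 ID와 권장 파라미터는 가정입니다)
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",  # 가정한 모델 ID
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")  # 약 10GB VRAM의 소비자용 GPU를 가정

image = pipe(
    prompt="a red panda reading a newspaper, soft morning light, photo",
    num_inference_steps=28,   # 가정한 예시 값
    guidance_scale=4.5,       # 가정한 예시 값
).images[0]
image.save("sd35_medium_sample.png")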

Google, Gemini API와 Google AI Studio에 Grounding with Google Search 도입

링크, 2024년 10월 31일

  • Grounding with Google Search 기능 제공: Gemini API와 Google AI Studio에서 모델 응답의 정확성과 최신성을 높이기 위해 이 기능을 도입하였습니다.
  • 지원 링크 및 검색 제안 제공: 모델 응답에 근거 출처 링크와 관련 검색 제안을 포함하여 투명성을 높였습니다.
  • 모든 Gemini 1.5 모델에서 지원: 일반적으로 사용 가능한 모든 Gemini 1.5 모델 버전에서 이 기능을 사용할 수 있습니다.
  • 개발자 설정 가능: Google AI Studio에서 개발자가 기능을 활성화하거나 API에서 ‘google_search_retrieval’ 도구를 통해 사용할 수 있습니다.
  • 동적 검색 설정 제공: 추가 비용과 지연을 최소화하기 위해 동적 검색 구성으로 검색이 필요한 쿼리를 판단합니다.
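
아래는 원문(Sources 참조)의 예제 코드에 동적 검색 구성을 더한 Python 스케치입니다. dynamic_retrieval_config의 키 이름과 값 형식은 공개 문서의 표기를 따른 가정이므로, 실제 사용 시 Gemini API 문서를 확인해야 합니다.

# Grounding with Google Search + 동적 검색 임계값 설정 스케치
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["API_KEY"])
model = genai.GenerativeModel("models/gemini-1.5-flash-002")

response = model.generate_content(
    contents="Who won Wimbledon this year?",
    tools={
        "google_search_retrieval": {
            "dynamic_retrieval_config": {
                "mode": "MODE_DYNAMIC",    # 예측 점수에 따라 검색 수행 여부를 결정
                "dynamic_threshold": 0.3,  # 기본 임계값 0.3 (0~1)
            }
        }
    },
)
print(response.text)
# 원문 예제에 따르면 응답에는 groundingMetadata(근거 출처, 신뢰도, 검색 제안)가 포함됩니다.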

Inspirai, 칭화대학교, Mini-Omni 2 발표

링크, 2024년 10월 25일

  • 이미지, 음성, 텍스트 입력을 이해하는 비주얼-오디오 어시스턴트 공개: GPT-4o의 기능과 유사한 멀티모달 입력 처리가 가능한 모델을 선보였습니다.
  • 실시간 음성 응답 및 대화 중단 기능 지원: 사용자와의 실시간 음성 대화 중에도 중단이 가능하여 유연한 상호작용을 제공합니다.
  • 기술적 혁신: 이미지, 오디오, 텍스트 피처를 입력으로 결합하고, 텍스트 기반의 지연된 병렬 출력을 사용하여 실시간 음성 생성을 구현하였습니다.
  • 세 가지 단계의 학습 과정: 인코더 적응, 모달 정렬, 멀티모달 파인튜닝을 통해 모델을 학습시켰습니다.
  • MIT 라이선스 하에 공개: 오픈 소스로 공개되어 연구자들과 개발자들이 자유롭게 활용할 수 있습니다.
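
아래는 “이미지·오디오·텍스트 피처를 이어 붙여 하나의 입력 시퀀스로 사용한다”는 설명을 시각화한 개념적 PyTorch 스케치입니다. 실제 Mini-Omni 2 구현이 아니며, 인코더 출력은 임의의 난수 텐서로, 차원 수는 가정한 값으로 대체했습니다.

# 멀티모달 피처 결합의 개념 스케치 (차원과 토큰 수는 설명용 가정)
import torch

hidden = 896                                # 가정: 언어 모델 히든 차원
img_feats = torch.randn(1, 49, hidden)      # 비전 인코더 출력(패치 토큰)을 대신하는 난수
audio_feats = torch.randn(1, 120, hidden)   # 오디오 인코더 출력을 대신하는 난수
text_embeds = torch.randn(1, 16, hidden)    # 텍스트 프롬프트 임베딩을 대신하는 난수

# 세 모달리티 피처를 시퀀스 차원(dim=1)으로 이어 붙여 언어 모델 입력을 구성합니다.
inputs = torch.cat([img_feats, audio_feats, text_embeds], dim=1)
print(inputs.shape)  # torch.Size([1, 185, 896])
# 출력 측에서는 텍스트 토큰 생성을 기준으로 오디오 토큰 스트림을 몇 스텝 지연시켜
# 병렬로 생성(text-guided delayed parallel decoding)한다는 점이 핵심입니다.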

Microsoft, SynthMoCap 소개

링크, 2024년 10월 25일

  • 마커 없이 고품질 전신 퍼포먼스 캡처 기술 발표: 복잡한 하드웨어나 수동 개입 없이 얼굴, 신체, 손의 동작을 동시에 캡처할 수 있는 기술을 선보였습니다.
  • 기술적 하이라이트:
    • 신체 형태와 자세, 얼굴 형태와 표정을 동시에 캡처,
    • 손과 혀의 움직임, 눈의 시선까지 추적 가능,
    • 단일 및 다중 뷰 시나리오에 적합,
    • 합성 데이터를 활용한 머신러닝 모델 훈련,
  • 합성 데이터셋 공개: SynthBody, SynthFace, SynthHand 데이터셋을 공개하여 연구자들이 다양한 작업에 활용할 수 있도록 하였습니다.

Meta, MobileLLM 출시

링크, 2024년 10월 30일

  • 125M, 350M, 600M, 1B 모델 체크포인트 공개: 모바일 디바이스에서 효율적으로 동작하는 LLM을 제공합니다.
  • 소형 LLM의 성능 최적화: 깊이와 폭의 조정을 통해 작은 모델에서도 높은 성능을 달성하였습니다.
  • 임베딩 공유와 그룹 쿼리 어텐션 사용: 모델의 효율성을 높이기 위해 임베딩 공유 및 최적화된 어텐션 메커니즘을 도입하였습니다.
  • 즉각적인 블록별 가중치 공유: 지연 시간을 줄이기 위해 가중치 이동을 피하면서도 최소한의 오버헤드로 구현하였습니다.
  • 성능 향상:
    • 제로샷 작업에서 이전 SOTA 125M/350M 모델보다 2.7%/4.3% 향상,
    • API 호출에서 더 큰 LLaMA-v2 7B 모델과 유사한 정확도 달성,
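
아래는 Hugging Face Hub에 공개된 MobileLLM 체크포인트를 transformers로 불러오는 최소한의 Python 스케치입니다. 모델 ID는 공개 시점의 저장소 이름을 가정한 것이며, 체크포인트 구성에 따라 trust_remote_code 옵션이 필요할 수 있습니다.

# MobileLLM-125M 로드·생성 스케치 (모델 ID와 옵션은 가정입니다)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/MobileLLM-125M"  # 350M/600M/1B 변형도 동일한 방식이라고 가정
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))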

Meta, Meta Lingua 공개

링크, 2024년 10월 25일

  • 연구를 위한 최소한의 빠른 LLM 훈련 및 추론 라이브러리: 새로운 아키텍처, 손실 함수, 데이터 등을 실험하기 위한 PyTorch 기반의 컴포넌트를 제공합니다.
  • 엔드 투 엔드 훈련, 추론 및 평가 지원: 모델의 속도와 안정성을 이해하고 개선하기 위한 도구를 제공합니다.
  • 오픈 소스 코드베이스: 현재 개발 중이며, 다양한 앱을 통해 사용법을 시연하고 있습니다.
  • 연구자 친화적 설계: 실험과 연구를 용이하게 하기 위해 최소한의 복잡성으로 설계되었습니다.

Hugging Face, SmolLM 2 출시

링크, 2024년 11월 1일

  • 새로운 소형 LLM 세트 공개: 온디바이스에서 최적화된 작은 크기의 LLM을 출시하였습니다.
  • 세 가지 크기로 제공: 135M, 360M, 1.7B 파라미터로 구성된 모델을 제공합니다.
  • Apache 2.0 라이선스 하에 공개: 자유롭게 사용하고 수정할 수 있습니다.
  • 성능 향상:
    • Meta Llama 3.2 1B 및 Qwen2.5 1B와 동등하거나 능가하는 성능,
    • IFEval 56.7, MT-Bench 6.13, MMLU-Pro 19.3, GSM8K 48.2 등 다양한 평가에서 높은 점수 달성,
  • 온디바이스 실행 지원: llama.cpp 또는 Transformers.js를 통해 디바이스나 브라우저에서 실행 가능합니다.
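
아래는 transformers로 SmolLM2 instruct 모델을 실행하는 최소한의 Python 스케치입니다. 모델 ID는 Hugging Face 컬렉션의 명명 관례를 가정한 것입니다.

# SmolLM2-1.7B-Instruct 채팅 추론 스케치 (모델 ID는 가정입니다)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "한 문장으로 요약해줘: SmolLM2는 온디바이스 실행에 최적화된 소형 언어 모델이다."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))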

MaskGCT, 새로운 최첨단 TTS 모델 발표

링크, 2024년 10월 24일

  • Zero-shot 음성 클로닝 및 감정 TTS 지원: 새로운 텍스트-투-스피치 모델로 음성 클로닝과 감정 표현이 가능합니다.
  • 10만 시간의 데이터로 훈련: 대규모 데이터로 훈련되어 장문 합성과 가변 속도 합성이 가능합니다.
  • 이중언어 지원: 중국어와 영어를 모두 지원합니다.
  • 완전한 비자동회귀 아키텍처:
    • 단계 1: 텍스트로부터 음성 SSL 모델에서 추출한 시맨틱 토큰 예측,
    • 단계 2: 시맨틱 토큰을 기반으로 음향 토큰 예측,
  • 허깅페이스에서 사용 가능: 모델과 코드를 공개하여 연구자들이 활용할 수 있도록 하였습니다.
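
아래는 위에서 설명한 2단계 비자동회귀 구조(텍스트→시맨틱 토큰→음향 토큰→파형)의 데이터 흐름만을 보여주는 개념적 Python 스케치입니다. 실제 공개 구현의 API가 아니며, 함수들은 흐름 설명을 위한 가상의 자리표시자(stub)입니다.

# MaskGCT 2단계 파이프라인의 데이터 흐름 스케치 (함수는 모두 가상의 stub)
from typing import List

def predict_semantic_tokens(text: str) -> List[int]:
    """1단계: 텍스트로부터 음성 SSL 모델 기반의 시맨틱 토큰을 마스크-예측(병렬 생성)."""
    return [0] * len(text)  # 자리표시자

def predict_acoustic_tokens(semantic_tokens: List[int]) -> List[int]:
    """2단계: 시맨틱 토큰을 조건으로 음향(코덱) 토큰을 마스크-예측."""
    return [0] * len(semantic_tokens)  # 자리표시자

def decode_to_waveform(acoustic_tokens: List[int]) -> bytes:
    """코덱 디코더로 음향 토큰을 파형으로 복원."""
    return bytes(len(acoustic_tokens))  # 자리표시자

text = "Okay everybody, that's all for today's video!"
audio = decode_to_waveform(predict_acoustic_tokens(predict_semantic_tokens(text)))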

연구, 대형 언어 모델이 개발자의 이데올로기를 반영한다는 결과 발표

링크, 2024년 10월 24일 제출

  • LLM의 이데올로기적 편향 연구: 대형 언어 모델이 개발자의 세계관을 반영하여 다양한 이데올로기적 스탠스를 보임을 발견하였습니다.
  • 다양한 모델과 언어에서 실험 수행: 인기 있는 여러 LLM을 대상으로 영어와 중국어에서 실험을 진행하였습니다.
  • 응답의 이념적 차이 강조: 동일한 모델이라도 사용하는 언어와 설계에 따라 응답의 이념적 차이가 발생함을 확인하였습니다.
  • 편향 제거 노력에 대한 우려 제기: LLM의 이념적 ‘편향’을 제거하려는 기술 및 규제 노력에 대한 중요한 문제를 제기하였습니다.
  • 정치적 도구화의 위험성 논의: LLM이 정치적 목적에 이용될 수 있는 위험성에 대해 강조하였습니다.
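
아래는 이 연구의 방법론(여러 LLM에 동일 인물에 대한 설명을 영어와 중국어로 요청한 뒤 규범적 평가를 비교)을 단순화한 Python 스케치입니다. 실제 논문의 프롬프트·인물 목록·평가 절차가 아니라 설명을 위한 가정입니다.

# 질의 언어에 따른 평가 차이를 비교하는 개념 스케치 (프롬프트와 모델명은 가정)
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "en": "Describe {name} and state whether their impact on society was mainly positive or negative.",
    "zh": "请描述{name}，并说明其对社会的影响主要是正面的还是负面的。",
}

def describe(name: str, lang: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPTS[lang].format(name=name)}],
    )
    return response.choices[0].message.content

# 동일 모델이라도 질의 언어에 따라 서술의 규범적 어조가 달라지는지 비교해 볼 수 있습니다.
for lang in ("en", "zh"):
    print(lang, describe("Winston Churchill", lang)[:200])
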
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://openai.com/index/introducing-chatgpt-search/
OpenAI

🌐 Introducing ChatGPT search 🌐
ChatGPT can now search the web in a much better way than before, so you get fast, timely answers with links to relevant web sources.
We’re rolling out now to Plus, Team, and SearchGPT waitlist users on our mobile and desktop apps and chatgpt.com.
Enterprise and Edu users will have access in the next few weeks. We’ll roll out to Free users over the coming months.

October 31, 2024

Introducing ChatGPT search
Get fast, timely answers with links to relevant web sources.



ChatGPT can now search the web in a much better way than before. You can get fast, timely answers with links to relevant web sources, which you would have previously needed to go to a search engine for. This blends the benefits of a natural language interface with the value of up-to-date sports scores, news, stock quotes, and more.

ChatGPT will choose to search the web based on what you ask, or you can manually choose to search by clicking the web search icon.

Close-up of a search bar with the text ‘Message ChatGPT’ and a blinking text entry cursor. Below, a globe icon with ‘Search’ is shown, and a cursor hovers over the search button. A paperclip icon is also visible.
Search will be available at chatgpt.com⁠(opens in a new window), as well as on our desktop and mobile apps. All ChatGPT Plus and Team users, as well as SearchGPT waitlist users, will have access today. Enterprise and Edu users will get access in the next few weeks. We’ll roll out to all Free users over the coming months.

Designed to get you to a better answer
Getting useful answers on the web can take a lot of effort. It often requires multiple searches and digging through links to find quality sources and the right information for you.

Now, chat can get you to a better answer: Ask a question in a more natural, conversational way, and ChatGPT can choose to respond with information from the web. Go deeper with follow-up questions, and ChatGPT will consider the full context of your chat to get a better answer for you.

A conversation about the weather forecast for Positano, Italy, on November 2-3, 2024, showing mild temperatures and rain. The user then asks for dinner recommendations in Positano on Friday night, with responses listing local restaurants.
We also partnered with news and data providers to add up-to-date information and new visual designs for categories like weather, stocks, sports, news, and maps.

Example cards for each category:
Weather: AccuWeather forecast for New York, NY, with current conditions and the week ahead, plus a ‘Sources’ button.
Stocks: NVIDIA Corp (NVDA) one-month stock chart with a note on its recent market-cap surge.
Sports: Golden State Warriors 2024-25 schedule highlighting a November 6 game against the Boston Celtics, with ticket advice.
News: Headlines on Cuba’s energy crisis from AP News, Financial Times, and Reuters, with a ‘Sources’ button.
Maps: Map of Morningside Heights and Harlem showing Absolute Bagels, with options for directions, website, and call.
“ChatGPT search promises to better highlight and attribute information from trustworthy news sources, benefiting audiences while expanding the reach of publishers like ourselves who produce premium journalism.”
Pam Wasserstein, President, Vox Media
Go straight to the source
Chats now include links to sources, such as news articles and blog posts, giving you a way to learn more. Click the Sources button below the response to open a sidebar with the references.

Screenshot of backyard improvement suggestions, including cozy seating, outdoor lighting, and fire pits, with images of stylish backyard setups. A sidebar lists citations from sources like The Spruce, Family Handyman, and Better Homes & Gardens.
“We are convinced that AI search will be, in a near future and for the next generations, a primary way to access information, and partnering with OpenAI positions Le Monde at the forefront of this shift. It allows us to test innovations at an early stage while safeguarding journalism’s core values and integrity.”
Louis Dreyfus, CEO & Publisher of Le Monde
ChatGPT search connects people with original, high-quality content from the web and makes it part of their conversation. By integrating search with a chat interface, users can engage with information in a new way, while content owners gain new opportunities to reach a broader audience. We hope to help users discover publishers and websites, while bringing more choice to search.

“As AI reshapes the media landscape, Axel Springer’s partnership with OpenAI opens up tremendous opportunities for innovative advancements. Together, we're driving new business models that ensure journalism remains both trustworthy and profitable.”
Mathias Sanchez, SVP Global Strategic Partnerships Axel Springer SE
We collaborated extensively with the news industry and carefully listened to feedback from our global publisher partners, including Associated Press, Axel Springer, Condé Nast, Dotdash Meredith, Financial Times, GEDI, Hearst, Le Monde, News Corp, Prisa (El País), Reuters, The Atlantic, Time, and Vox Media. Any website or publisher can choose to appear⁠(opens in a new window) in ChatGPT search. If you’d like to share feedback, please email us at publishers-feedback@openai.com⁠.

How it works and what comes next
The search model is a fine-tuned version of GPT-4o, post-trained using novel synthetic data generation techniques, including distilling outputs from OpenAI o1-preview. ChatGPT search leverages third-party search providers, as well as content provided directly by our partners, to provide the information users are looking for. Learn more here⁠(opens in a new window).

Thanks to feedback from the SearchGPT prototype, we brought the best of the SearchGPT experience into ChatGPT. We plan to keep improving search, particularly in areas like shopping and travel, and leverage the reasoning capabilities of the OpenAI o1 series to do deeper research. We also plan to bring our new search experience to Advanced Voice and canvas, as well as to Free and logged out users in the future.

ChatGPT Plus and Team users can try it out today at chatgpt.com⁠(opens in a new window).

OpenAI brings a new web search tool to ChatGPT
The new tool puts OpenAI squarely in competition with the search giants, and will help fuel its next generation of AI agents

By Melissa Heikkilä & Mat Honan
October 31, 2024

OpenAI

ChatGPT can now search the web for up-to-date answers to a user’s queries, OpenAI announced today.

Until now, ChatGPT was mostly restricted to generating answers from its training data, which is current up to October 2023 for GPT-4o, and had limited web search capabilities. Searches about generalized topics will still draw on this information from the model itself, but now ChatGPT will automatically search the web in response to queries about recent information such as sports, stocks, or news of the day, and can deliver rich multi-media results. Users can also manually trigger a web search, but for the most part, the chatbot will make its own decision about when an answer would benefit from information taken from the web, says Adam Fry, OpenAI’s product lead for search.

“Our goal is to make ChatGPT the smartest assistant, and now we’re really enhancing its capabilities in terms of what it has access to from the web,” Fry tells MIT Technology Review. The feature is available today for the chatbot’s paying users.



ChatGPT triggers a web search when the user asks about local restaurants in this example
While ChatGPT search, as it is known, is initially available to paying customers, OpenAI intends to make it available for free later, even when people are logged out. The company also plans to combine search with its voice features and Canvas, its interactive platform for coding and writing, although these capabilities will not be available in today’s initial launch.

The company unveiled a standalone prototype of web search in July. Those capabilities are now built directly into the chatbot. OpenAI says it has “brought the best of the SearchGPT experience into ChatGPT.”

OpenAI is the latest tech company to debut an AI-powered search assistant, challenging similar tools from competitors such as Google, Microsoft, and startup Perplexity. Meta, too, is reportedly developing its own AI search engine. As with Perplexity’s interface, users of ChatGPT search can interact with the chatbot in natural language, and it will offer an AI-generated answer with sources and links to further reading. In contrast, Google’s AI Overviews offer a short AI-generated summary at the top of the website, as well as a traditional list of indexed links.

These new tools could eventually challenge Google’s 90% market share in online search. AI search is a very important way to draw more users, says Chirag Shah, a professor at the University of Washington, who specializes in online search. But he says it is unlikely to chip away at Google’s search dominance. Microsoft’s high-profile attempt with Bing barely made a dent in the market, Shah says.

Instead, OpenAI is trying to create a new market for more powerful and interactive AI agents, which can take complex actions in the real world, Shah says.

The new search function in ChatGPT is a step toward these agents.

It can also deliver highly contextualized responses that take advantage of chat histories, allowing users to go deeper in a search. Currently, ChatGPT search is able to recall conversation histories and continue the conversation with questions on the same topic.

ChatGPT itself can also remember things about users that it can use later —sometimes it does this automatically, or you can ask it to remember something. Those “long-term” memories affect how it responds to chats. Search doesn’t have this yet—a new web search starts from scratch— but it should get this capability in the “next couple of quarters,” says Fry. When it does, OpenAI says it will allow it to deliver far more personalized results based on what it knows.

“Those might be persistent memories, like ‘I’m a vegetarian,’ or it might be contextual, like ‘I’m going to New York in the next few days,’” says Fry. “If you say ‘I’m going to New York in four days,’ it can remember that fact and the nuance of that point,” he adds.

To help develop ChatGPT’s web search, OpenAI says it leveraged its partnerships with news organizations such as Reuters, the Atlantic, Le Monde, the Financial Times, Axel Springer, Condé Nast, and Time. However, its results include information not only from these publishers, but any other source online that does not actively block its search crawler.

It’s a positive development that ChatGPT will now be able to retrieve information from these reputable online sources and generate answers based on them, says Suzan Verberne, a professor of natural-language processing at Leiden University, who has studied information retrieval. It also allows users to ask follow-up questions.

But despite the enhanced ability to search the web and cross-check sources, the tool is not immune from the persistent tendency of AI language models to make things up or get it wrong. When MIT Technology Review tested the new search function and asked it for vacation destination ideas, ChatGPT suggested “luxury European destinations” such as Japan, Dubai, the Caribbean islands, Bali, the Seychelles, and Thailand. It offered as a source an article from the Times, a British newspaper, which listed these locations as well as those in Europe as luxury holiday options.

“Especially when you ask about untrue facts or events that never happened, the engine might still try to formulate a plausible response that is not necessarily correct,” says Verberne. There is also a risk that misinformation might seep into ChatGPT’s answers from the internet if the company has not filtered its sources well enough, she adds.

Another risk is that the current push to access the web through AI search will disrupt the internet’s digital economy, argues Benjamin Brooks, a fellow at Harvard University’s Berkman Klein Center, who previously led public policy for Stability AI, in an op-ed published by MIT Technology Review today.

“By shielding the web behind an all-knowing chatbot, AI search could deprive creators of the visits and ‘eyeballs’ they need to survive,” Brooks writes.


###
https://openai.com/chatgpt/download/
OpenAI
10/31/24
Big day for desktops.
Advanced Voice is now available in the macOS and Windows desktop apps.
To access Advanced Voice on desktop, remember to download the latest version of the app.

Download ChatGPT
Get ChatGPT on mobile or desktop.

For Mobile
Chat on the go, have voice conversations, and ask about photos.
Available on the App Store and Google Play.

For Desktop
Chat about email, screenshots, files, and anything on your screen.
Download for macOS (see the macOS app page for details), or test an early version of the Windows app.

###
https://platform.openai.com/docs/guides/prompt-generation
OpenAI
10/31/24

Prompt generation
Generate prompts and schemas in Playground.
The Generate button in the Playground lets you generate prompts, functions, and schemas from just a description of your task. This guide will walk through exactly how it works.

Overview
Creating prompts and schemas from scratch can be time-consuming, so generating them can help you get started quickly. The Generate button uses two main approaches:

Prompts: We use meta-prompts that incorporate best practices to generate or improve prompts.
Schemas: We use meta-schemas that produce valid JSON and function syntax.
While we currently use meta prompts and schemas, we may integrate more advanced techniques in the future like DSPy and "Gradient Descent".

###
https://openai.com/index/introducing-simpleqa/
OpenAI
October 30, 2024

Introducing SimpleQA
A factuality benchmark called SimpleQA that measures the ability for language models to answer short, fact-seeking questions.

Read paper
An open problem in artificial intelligence is how to train models that produce responses that are factually correct. Current language models sometimes produce false outputs or answers unsubstantiated by evidence, a problem known as “hallucinations”. Language models that generate more accurate responses with fewer hallucinations are more trustworthy and can be used in a broader range of applications. To measure the factuality of language models, we are open-sourcing⁠(opens in a new window) a new benchmark called SimpleQA.

About the SimpleQA benchmark
Factuality is a complicated topic because it is hard to measure—evaluating the factuality of any given arbitrary claim is challenging, and language models can generate long completions that contain dozens of factual claims. In SimpleQA, we will focus on short, fact-seeking queries, which reduces the scope of the benchmark but makes measuring factuality much more tractable.

With SimpleQA, our goal was to create a dataset with the following properties:

High correctness. Reference answers to questions are supported by sources from two independent AI trainers, and questions were written in such a way that the predicted answers are easy to grade.

Diversity. SimpleQA covers a wide range of topics, from science and technology to TV shows and video games.

Challenging for frontier models. Compared to older benchmarks such as TriviaQA⁠(opens in a new window) (2017) or NQ⁠(opens in a new window) (2019), which have become saturated, SimpleQA was created to be a greater challenge for frontier models (e.g., GPT-4o scores less than 40%).

Good researcher UX. SimpleQA is intended to be fast and simple to run due to its concise questions and answers. Grading is also efficient whether through the OpenAI API or another frontier model API. Additionally, with 4,326 questions, SimpleQA should have relatively low variance as an evaluation benchmark.

We hired AI trainers to browse the web and create short, fact-seeking questions and corresponding answers. To be included in the dataset, each question had to meet a strict set of criteria: it must have a single, indisputable answer for easy grading; the answer to the question should not change over time; and most questions had to induce hallucinations from either GPT-4o or GPT-3.5. To further improve the quality of the dataset, a second, independent AI trainer answered each question without seeing the original response. Only questions where both AI trainers’ answers agreed were included.

As a final verification of quality, we had a third AI trainer answer a random sample of 1,000 questions from the dataset. We found that the third AI trainer’s answer matched the original agreed answers 94.4% of the time, with a 5.6% disagreement rate. We then manually inspected these examples, and found that 2.8% of the 5.6% of disagreements were due to grader false negatives or human errors from the third trainer (e.g., incomplete answers or misinterpreting sources), and the remaining 2.8% were due to real issues with the question (e.g., ambiguous questions, or different websites giving conflicting answers). Hence, we estimate the inherent error rate of this dataset to be approximately 3%.

Question diversity in SimpleQA
The SimpleQA benchmark covers a diverse range of topics. Distribution of questions per category:
Science and technology: 858
Politics: 709
Art: 550
Other: 475
Geography: 424
Sports: 368
Music: 341
TV Shows: 293
History: 173
Video games: 135
Using SimpleQA to compare language models
To grade questions, we use a prompted ChatGPT classifier that sees both the predicted answer from the model and the ground-truth answer, and then grades the predicted answer as either “correct”, “incorrect”, or “not attempted”.

A definition and corresponding examples for each grade are shown in the table below.

Grades are defined with respect to the example question “Which Dutch player scored an open-play goal in the 2022 Netherlands vs Argentina game in the men’s FIFA World Cup?” (Answer: Wout Weghorst):

“Correct”: the predicted answer fully contains the ground-truth answer without contradicting the reference answer. Examples: “Wout Weghorst”; “Wout Weghorst scored at 83’ and 90+11’ in that game”.

“Incorrect”: the predicted answer contradicts the ground-truth answer in any way, even if the contradiction is hedged. Examples: “Virgil van Dijk”; “Virgil van Dijk and Wout Weghorst”; “Wout Weghorst and I think van Dijk scored, but I am not totally sure”.

“Not attempted”: the ground-truth target is not fully given in the answer, and there are no contradictions with the reference answer. Examples: “I don’t know the answer to that question”; “To find which Dutch player scored in that game, please browse the internet yourself”.
A model will ideally answer as many questions as possible (highest number of correct), while minimizing the number of incorrect answers.

Using this classification, we can then measure the performance of several OpenAI models without browsing, including gpt-4o-mini, o1-mini, gpt-4o, and o1-preview. As expected, gpt-4o-mini and o1-mini answer fewer questions correctly compared to gpt-4o and o1-preview, likely because smaller models typically have less world knowledge. We also see that o1-mini and o1-preview, which are designed to spend more time thinking, choose to "not attempt" questions more often than gpt-4o-mini and gpt-4o. This may be because they can use their reasoning capacity to recognize when they don’t know the answer to a question, instead of hallucinating.

Figure: per-model share of “correct”, “not attempted”, and “incorrect” answers for GPT-4o mini, o1-mini, GPT-4o, and o1-preview.
Using SimpleQA to measure the calibration of large language models
A factuality benchmark like SimpleQA also allows us to measure the scientific phenomenon known as calibration, or whether language models “know what they know.” One way to measure calibration is to directly ask the language model to state its confidence in its answer using the prompt: “Please give your best guess, along with your confidence as a percentage that that is the correct answer.” Then we can plot the correlation between the stated confidence of the model, and how accurate the model actually was. A perfectly calibrated model would have the same actual accuracy as stated confidence. For instance, on all prompts where the model stated a confidence of 75%, the accuracy would be 75% for a perfectly calibrated model.

This result is shown in the figure below. The positive correlation between stated confidence and accuracy is a reassuring sign that models have some notion of confidence. We see that o1-preview is more calibrated than o1-mini, and gpt4o is more calibrated than gpt4o-mini, which is consistent with prior work⁠(opens in a new window) showing that larger models are more calibrated. However, the fact that performance is well below the line y=x means that models consistently overstate their confidence. Hence, there is a lot of room to improve the calibration of large language models in terms of stated confidence.

Figure: Calibration (uniform bins), plotting average stated confidence (x-axis) against accuracy (y-axis) for GPT-4o, GPT-4o-mini, o1-preview, and o1-mini, with a perfect-calibration reference line.
Another way to measure calibration is to ask the language model the same question 100 times. Since language models may produce different answers upon repeated attempts, we can assess whether the frequency of a particular answer corresponds to its correctness. Higher frequency typically indicates that the model is more confident in its answers, as the model is giving the same answer repeatedly. A well-calibrated model would have the same actual accuracy as frequency.

In the plot below, we show the calibration of language models as measured by the frequency of their responses. Here, we simply use string match to group together different answers from the language model. We see across all models that accuracy increases with frequency, and that o1-preview has the highest level of calibration, where the frequency of the response is roughly equivalent to the accuracy of the response. Similar to calibration via stated confidence plot above, we again see o1-preview is more calibrated than o1-mini, and gpt4o is more calibrated than o1-mini.
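
A minimal sketch of this frequency-based calibration measurement is shown below, assuming the OpenAI Python client; the model name, sampling settings, and answer-normalization prompt are illustrative assumptions, and answers are grouped by plain string match as described above.

# Sketch: sample the same question many times and compare answer frequency to correctness.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_answers(question: str, n: int = 100, model: str = "gpt-4o-mini") -> Counter:
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            temperature=1.0,  # allow variation across repeated attempts
            messages=[{"role": "user", "content": question + " Answer with the name only."}],
        )
        answers.append(response.choices[0].message.content.strip())
    return Counter(answers)  # identical strings are grouped together

counts = sample_answers(
    "Which Dutch player scored an open-play goal in the 2022 Netherlands vs Argentina World Cup game?"
)
total = sum(counts.values())
for answer, freq in counts.most_common(5):
    print(f"{freq / total:.0%}  {answer}")
# For a well-calibrated model, an answer given X% of the time should be correct about X% of the time.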

Figure: Accuracy vs. consistency (string match, quantile, n=30), plotting answer frequency (x-axis) against accuracy (y-axis) for GPT-4o, GPT-4o-mini, o1-preview, and o1-mini, with a perfect-calibration reference line.
Conclusions
SimpleQA is a simple but challenging benchmark for evaluating the factuality of frontier models. A main limitation in SimpleQA is its scope—while SimpleQA is accurate it only measures factuality under the constrained setting of short, fact-seeking queries with a single, verifiable answer. Whether the ability to provide factual short answers correlates with the ability to write lengthy responses filled with numerous facts remains an open research question. We hope that open-sourcing SimpleQA drives the research on more trustworthy and reliable AI forward, and we invite researchers to evaluate the factuality of language models with it, and to provide us feedback.



###
https://stability.ai/news/introducing-stable-diffusion-3-5
Stability AI
Introducing Stable Diffusion 3.5
Updated October 29th with release of Stable Diffusion 3.5 Medium

Stable Diffusion 3.5 Medium is here! Whether a startup or creator, access to this technology shouldn’t be restricted by hardware limitations. With 2.5 billion parameters, this model is designed to run “out of the box” on consumer hardware.
This model delivers best-in-class image generation for its size, with advanced multi-resolution capabilities. It surpasses other medium-sized models with its prompt adherence and image quality, making it a top choice for efficient, high-quality performance.
This model is available for both commercial and non-commercial use under the Stability AI Community License.

Key Takeaways:

Today we are introducing Stable Diffusion 3.5. This open release includes multiple model variants, including Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Large Turbo, and as of October 29th, Stable Diffusion 3.5 Medium.

These models are highly customizable for their size, run on consumer hardware, and are free for both commercial and non-commercial use under the permissive Stability AI Community License.

You can download all Stable Diffusion 3.5 models from Hugging Face and the inference code on GitHub now.

Today we are releasing Stable Diffusion 3.5, our most powerful models yet. This open release includes multiple variants that are customizable, run on consumer hardware, and are available for use under the permissive Stability AI Community License. You can download Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Large Turbo models from Hugging Face and the inference code on GitHub now.

In June, we released Stable Diffusion 3 Medium, the first open release from the Stable Diffusion 3 series. This release didn't fully meet our standards or our communities’ expectations. After listening to the valuable community feedback, instead of a quick fix, we took the time to further develop a version that advances our mission to transform visual media.

Stable Diffusion 3.5 reflects our commitment to empower builders and creators with tools that are widely accessible, cutting-edge, and free for most use cases. We encourage the distribution and monetization of work across the entire pipeline - whether it's fine-tuning, LoRA, optimizations, applications, or artwork.

What’s being released

Stable Diffusion 3.5 offers a variety of models developed to meet the needs of scientific researchers, hobbyists, startups, and enterprises alike:

Stable Diffusion 3.5 Large: At 8.1 billion parameters, with superior quality and prompt adherence, this base model is the most powerful in the Stable Diffusion family. This model is ideal for professional use cases at 1 megapixel resolution.

Stable Diffusion 3.5 Large Turbo: A distilled version of Stable Diffusion 3.5 Large generates high-quality images with exceptional prompt adherence in just 4 steps, making it considerably faster than Stable Diffusion 3.5 Large.

Stable Diffusion 3.5 Medium: At 2.5 billion parameters, with improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It is capable of generating images ranging between 0.25 and 2 megapixel resolution.

Developing the models

In developing the models, we prioritized customizability to offer a flexible base to build upon. To achieve this, we integrated Query-Key Normalization into the transformer blocks, stabilizing the model training process and simplifying further fine-tuning and development.

To support this level of downstream flexibility, we had to make some trade-offs. Greater variation in outputs from the same prompt with different seeds may occur, which is intentional as it helps preserve a broader knowledge-base and diverse styles in the base models. However, as a result, prompts lacking specificity might lead to increased uncertainty in the output, and the aesthetic level may vary.

For the Medium model specifically, we made several adjustments to the architecture and training protocols to enhance quality, coherence, and multi-resolution generation abilities.

Where the models excel

The Stable Diffusion 3.5 version excels in the following areas, making it one of the most customizable and accessible image models on the market, while maintaining top-tier performance in prompt adherence and image quality:

Customizability: Easily fine-tune the model to meet your specific creative needs, or build applications based on customized workflows.

Efficient Performance: Optimized to run on standard consumer hardware without heavy demands, especially the Stable Diffusion 3.5 Medium and Stable Diffusion 3.5 Large Turbo models.

We took a look at the hardware compatibility for running Stable Diffusion 3.5 Medium alongside other open-image base models. This model only requires 9.9 GB of VRAM (excluding text encoders) to unlock its full performance, making it highly accessible and compatible with most consumer GPUs.

Diverse Outputs: Creates images representative of the world, not just one type of person, with different skin tones and features, without the need for extensive prompting.

Versatile Styles: Capable of generating a wide range of styles and aesthetics like 3D, photography, painting, line art, and virtually any visual style imaginable.

Additionally, our analysis shows that Stable Diffusion 3.5 Large leads the market in prompt adherence and rivals much larger models in image quality.

Stable Diffusion 3.5 Large Turbo offers some of the fastest inference times for its size, while remaining highly competitive in both image quality and prompt adherence, even when compared to non-distilled models of similar size

Stable Diffusion 3.5 Medium outperforms other medium-sized models, offering a balance of prompt adherence and image quality, making it a top choice for efficient, high-quality performance.

The Stability AI Community license at a glance

We are pleased to release this model under our permissive community license. Here are the key components of the license:

Free for non-commercial use: Individuals and organizations can use the model free of charge for non-commercial use, including scientific research.

Free for commercial use (up to $1M in annual revenue): Startups, small to medium-sized businesses, and creators can use the model for commercial purposes at no cost, as long as their total annual revenue is less than $1M.

Ownership of outputs: Retain ownership of the media generated without restrictive licensing implications.

For organizations with annual revenue more than $1M, please contact us here to inquire about an Enterprise License.

More ways to access the models

While the model weights are available on Hugging Face now for self-hosting, you can also access the model through the following platforms:

Stability AI API

Replicate

DeepInfra

ComfyUI

Our commitment to safety

We believe in safe, responsible AI practices and take deliberate measures to ensure Integrity starts at the early stages of development. This means we have taken and continue to take reasonable steps to prevent the misuse of Stable Diffusion 3.5 by bad actors. For more information about our approach to Safety please visit our Stable Safety page.

###
https://developers.googleblog.com/en/gemini-api-and-ai-studio-now-offer-grounding-with-google-search/
Google
Gemini API and Google AI Studio now offer Grounding with Google Search
OCT 31, 2024
Shrestha Basu Mallick
Group Product Manager
Gemini API
Logan Kilpatrick
Senior Product Manager
Gemini API and Google AI Studio

Today, we are rolling out Grounding with Google Search in Google AI Studio and the Gemini API, enabling developers to get more accurate and fresh responses from the Gemini models aided by Google Search. In addition to more accurate responses, the model returns grounding sources (in-line supporting links) and Search Suggestions that point users to the search results corresponding to the grounded response.

Model response with Grounding sources and Search Suggestions when Grounding with Google Search is turned on
Grounding with Google Search is supported with all generally available versions of Gemini 1.5 models. Developers can turn it on in Google AI Studio under the “Tools” section or in the API by enabling the 'google_search_retrieval' tool. Grounding is available to test for free in Google AI Studio. In the API, developers can access the tool with the paid tier for $35 per 1,000 grounded queries.


When should developers use grounding?
Developers should enable Grounding with Google Search for queries and applications which could benefit from any of the following:

Reduced hallucinations: Grounding helps ensure that AI applications provide users with more factual information.
More up-to-date information: With grounding, models can access real-time information, making AI applications relevant and applicable to a wider range of scenarios.
Enhanced trustworthiness and traffic to publishers: By providing supporting links, grounding brings transparency to AI applications, making them more trustworthy and encouraging users to click on the underlying sources to find out more.
Richer information: By drawing information from Google Search to enhance the model response, grounding is able to provide richer color on many queries.

Grounding with Google Search in action
We show a couple of examples below, using AI Studio’s new Compare Mode, where the model response benefits from Grounding with Google Search. In the first example, the model provides an out of date answer based on its knowledge cut-off (on the left) but answers more accurately based on the latest available sources (on the right) when grounding is turned on.

Model response in Google AI Studio compare mode, without grounding (left) and with grounding (right)
In this example, without grounding enabled (on the left), the model intentionally presents a minimal response by default. With grounding (on the right), the model comes back with a richer response including supporting links.

Richer response by the latest Gemini 1.5 Flash model using Grounding with Google Search (right)
How does Grounding with Google Search work?
When a user makes a query with grounding turned on, the service uses Google’s search engine to find up-to-date and comprehensive information that is relevant to the query, and sends it to the model. The model then responds with higher accuracy and freshness, providing in-line grounding sources (supporting links) and Search Suggestions.

import google.generativeai as genai
import os

genai.configure(api_key=os.environ["API_KEY"])
model = genai.GenerativeModel('models/gemini-1.5-flash-002')


response = model.generate_content(
    contents="Who won Wimbledon this year?",
    tools='google_search_retrieval',
)

print(response)
# Response contains `groundingMetadata` with grounding sources, confidence scores, and search suggestions
Refer to the documentation for complete code.


Even when Grounding with Google Search is on, not every query in a session necessarily requires grounding, which results in additional cost and latency. This is where developers have a second layer of control with dynamic retrieval.

When developers request a grounded answer, the dynamic retrieval configuration assigns the prompt a prediction score, which is a floating point value between 0 and 1. The value is higher when a prompt is more likely to benefit from grounding. In their requests, developers can set a threshold for what scores should result in grounding (the default threshold value is 0.3). Developers should test various options for the threshold value to see what best works for their applications.

Dynamic retrieval for Grounding with Google Search in Google AI Studio
By using Google’s search results to ground Gemini-based applications, developers can provide their users with more accurate, relevant, and trustworthy information. Refer to our documentation for detailed code examples and step-by-step instructions.


We look forward to your feedback and are excited to see what you build with this new capability!

###
https://huggingface.co/gpt-omni/mini-omni2
10/25/24
Inspirai, Tsinghua University

Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 🔥
> Understands and processes images, speech, and text
> Generates real-time speech responses
> Supports interruptions during speech
Technical Overview:
> Concatenates image, audio, and text features for input.
> Uses text-guided delayed parallel output for real-time speech
> Involves encoder adaptation, modal alignment, and multimodal fine-tuning
Best part: MIT licensed ⚡

Zhifei Xie, Changqiao Wu
GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, with a similar form of functionality, and we hope it can offer valuable insights for subsequent research.
Mini-Omni2 is an omni-interactive model. It can understand image, audio and text inputs and has end-to-end voice conversations with users. Featuring real-time voice output, omni-capable multimodal understanding and flexible interaction ability with interruption mechanism while speaking.



Updates
2024.10: Release the model, technical report, inference and chat demo code.
Features
✅ Multimodal interaction: with the ability to understand images, speech and text, just like GPT-4o.

✅ Real-time speech-to-speech conversational capabilities. No extra ASR or TTS models required, just like Mini-Omni.

###
https://microsoft.github.io/SynthMoCap/
Microsoft
10/25/24

Look Ma, no markers
Holistic performance capture without the hassle

ACM Transactions on Graphics

SIGGRAPH Asia 2024

Charlie Hewitt Fatemeh Saleh Sadegh Aliakbarian Lohit Petikam Shideh Rezaeifar Louis Florentin Zafiirah Hosenie Thomas J Cashman Julien Valentin Darren Cosker Tadas Baltrušaitis

Microsoft
unveils the first technique for marker-free, HQ reconstruction of COMPLETE human body, including eyes & tongue, without requiring any calibration, manual intervention or custom hardware. Impressive results! Repo for training & Dataset💙

𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬:

✅Novel SOTA holistic 3D human reconstruction

✅Body shape/pose + face shape/expression

✅Hand & tongue articulation + eye gaze

✅Suitable for mono & multi-view scenario

✅Data pipeline for generating synthetic data

✅Body, face and hands dataset released!



Abstract
We tackle the problem of highly-accurate, holistic performance capture for the face, body and hands simultaneously. Motion-capture technologies used in film and game production typically focus only on face, body or hand capture independently, involve complex and expensive hardware and a high degree of manual intervention from skilled operators. While machine-learning-based approaches exist to overcome these problems, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts. In this work, we introduce the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. Our approach produces stable world-space results from arbitrary camera rigs as well as supporting varied capture environments and clothing. We achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. We evaluate our method on a number of body, face and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize on diverse datasets.
Holistic Performance Capture
Our approach combines machine-learning models for dense-landmark and parameter prediction with model-fitting to provide a robust, accurate and adaptable system. Our method supports registration of the face, body and hands; in isolation, and together in a single take.

Our parametric model captures body and hand pose, body and face shape, and facial expression.

We can also track tongue articulation and eye gaze.

Our method achieves state-of-the-art results on a number of 3D reconstruction benchmarks.

No Hassle
Motion capture shoots typically require specialist hardware, skilled experts and a lot of time to get right. This can make them expensive and challenging to manage in a tight production schedule. Our method aims to eliminate this inconvenience by providing a marker-less, calibration-free solution that can be used with off-the-shelf hardware. This allows for quick and easy capture of high-quality motion data in a variety of environments.

Using just two uncalibrated mobile-phone cameras we can achieve high quality results in world-space.

Our method even works with a single, moving camera in an unconstrained environment with arbitrary clothing.

Synthetic Datasets
Our method is trained exclusively on synthetic data, generated using a conventional computer graphics pipeline. The three datasets used in the paper are available to download here.


SynthBody can be used for tasks such as skeletal tracking and body pose prediction.


SynthFace can be used for tasks such as facial landmark and head pose prediction or face parsing.


SynthHand can be used for tasks such as hand pose prediction or landmark regression.

###
https://github.com/facebookresearch/MobileLLM
META
10/30/24

🚨 Meta released MobileLLM - 125M, 350M, 600M, and 1B model checkpoints! 🔥
Notes on the release:
Depth vs. Width: Contrary to the scaling law (Kaplan et al., 2020), depth is more critical than width for small LLMs, enhancing abstract concept capture and final performance
Embedding Sharing: Revisited and implemented embedding sharing methods to maximize weight utilization
Grouped Query Attention: Adopted from Ainslie et al. (2023) to optimize attention mechanisms
Immediate Block-wise Weight Sharing: Reduces latency by avoiding weight movement with minimal overhead
Performance:
> Zero-Shot Tasks: MobileLLM outperforms previous SOTA 125M/350M models by 2.7%/4.3%.
> API Calling: Comparable exact-match score to the larger LLaMA-v2 7B model
Models are available on the Hub & integrated with Transformers! 🔥

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% than MobileLLM 125M/350M. Moreover, MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.

###
https://github.com/facebookresearch/lingua
META
10/25/24
Meta Lingua
Mathurin Videau*, Badr Youbi Idrissi*, Daniel Haziza, Luca Wehrstedt, Jade Copet, Olivier Teytaud, David Lopez-Paz. *Equal and main contribution

Meta Lingua is a minimal and fast LLM training and inference library designed for research. Meta Lingua uses easy-to-modify PyTorch components in order to try new architectures, losses, data, etc. We aim for this code to enable end to end training, inference and evaluation as well as provide tools to better understand speed and stability. While Meta Lingua is currently under development, we provide you with multiple apps to showcase how to use this codebase.

###
https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9
Huggingface
11/1/24
Pushing the boundaries of Small LLMs! We just released SmolLM 2, a new set of small, powerful LLMs optimized for on-device, outperforming Meta Llama 3.2 1B! SmolLM comes in three sizes, 0.1B, 0.3B, and 1.7B and under Apache 2.0. 🚀

TL;DR;
🔢 Comes in 3 sizes with 135M, 360M, and 1.7B parameters.
📚 Trained on 11 trillion mostly English tokens from FineWeb-Edu, DCLM, the Stack…
🚀 Trained for text rewriting, summarization, and function calling (27% on BFCL Leaderboard)
🥇 Matches or outperforms Llama 3.2 1B and Qwen2.5 1B
🏆 56.7 IFEval; 6.13 MT Bench; 19.3 MMLU-Pro; 48.2 GSM8K
🔧 Post-Training with SFT → DPO
📱 Runs on-device with llama.cpp or in browser with Transformers.js
🤗 Available on Hugging Face and under Apache 2.0

###
https://maskgct.github.io/
10/24/24
MaskGCT - New open SoTA Text to Speech model! 🔥
> Zero-shot voice cloning
> Emotional TTS
> Trained on 100K hours of data
> Long form synthesis
> Variable speed synthesis
> Bilingual - Chinese & English
> Available on Hugging Face

Fully non-autoregressive architecture:
> Stage 1: Predicts semantic tokens from text, using tokens extracted from a speech self-supervised learning (SSL) model
> Stage 2: Predicts acoustic tokens conditioned on the semantic tokens.

Synthesised: "Would you guys personally like to have a fake fireplace, an electric one, in your house? Or would you rather have a real fireplace? Let me know down below. Okay everybody, that's all for today's video and I hope you guys learned a bunch of furniture vocabulary!"

TTS scene keeps getting lit! 🐐https://maskgct.github.io/
Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Abstract The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. The autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack of duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g. phone), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility.

###
https://arxiv.org/abs/2410.18417


[Submitted on 24 Oct 2024]
Large Language Models Reflect the Ideology of their Creators
Maarten Buyl, Alexander Rogiers, Sander Noels, Iris Dominguez-Catena, Edith Heiter, Raphael Romero, Iman Johary, Alexandru-Cristian Mara, Jefrey Lijffijt, Tijl De Bie
Large language models (LLMs) are trained on vast amounts of data to generate natural language, enabling them to perform tasks like text summarization and question answering. These models have become popular in artificial intelligence (AI) assistants like ChatGPT and already play an influential role in how humans access information. However, the behavior of LLMs varies depending on their design, training, and use.
In this paper, we uncover notable diversity in the ideological stance exhibited across different LLMs and languages in which they are accessed. We do this by prompting a diverse panel of popular LLMs to describe a large number of prominent and controversial personalities from recent world history, both in English and in Chinese. By identifying and analyzing moral assessments reflected in the generated descriptions, we find consistent normative differences between how the same LLM responds in Chinese compared to English. Similarly, we identify normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts. Furthermore, popularly hypothesized disparities in political goals among Western models are reflected in significant normative differences related to inclusion, social inequality, and political scandals.
Our results show that the ideological stance of an LLM often reflects the worldview of its creators. This raises important concerns around technological and regulatory efforts with the stated aim of making LLMs ideologically `unbiased', and it poses risks for political instrumentalization.

기술적으로 최대한 자세하게 적어. 13개의 기사가 있고 하나도 빼먹지 말고 적어.