Today's AI news covers the latest technologies and products announced by major companies including Google, DeepMind, Meta, Microsoft, Apple, NousResearch, and Hugging Face. Google unveiled cutting-edge models such as Gemini 2.0, Veo 2, and Imagen 3, while DeepMind showcased an innovation in browser control with Project Mariner. Meta announced the Apollo Multimodal Models and Meta Motivo, strengthening its position in multimodal understanding and humanoid control. Microsoft released Phi-4, boosting complex reasoning in small language models, and Apple introduced new Apple Intelligence features that significantly improve the user experience. In addition, NousResearch released Hermes 3 and the ProcessBench benchmark was published on Hugging Face, raising the efficiency and accuracy of AI evaluation, and Microsoft also released MarkItDown, a library that converts a wide range of file formats to Markdown.

Google

Gemini 2.0 Announcement

Link, December 11, 2024

  • Gemini 2.0 Flash Experimental released: available via the Gemini API in Google AI Studio and Vertex AI, it is twice as fast as Gemini 1.5 Pro while delivering stronger performance and enhanced multimodal capabilities (a minimal API sketch follows this list).
  • Multimodal Live API introduced: a new Multimodal Live API supports real-time audio and video streaming inputs, letting developers build more dynamic and interactive applications.
  • Jules AI coding agent: Jules, an experimental AI coding agent equipped with code-execution tools, improves developer workflows by handling bug fixes and other time-consuming coding tasks asynchronously within the GitHub workflow.
  • Colab data science agent integration: Colab's Data Science Agent uses Gemini 2.0 to create notebooks from natural-language instructions and automate data analysis, sharply reducing research and development time.
  • Multimodal output support: Gemini 2.0 Flash can generate integrated responses combining text, audio, and images in a single API call, with SynthID invisible watermarks in image and audio outputs making AI-generated content easier to identify.
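
To illustrate how a developer might call the new model from Python during the experimental phase, here is a minimal sketch using the google-generativeai SDK; the model id "gemini-2.0-flash-exp" reflects the announcement but should be confirmed in Google AI Studio before use.

```python
import os
import google.generativeai as genai

# Configure the SDK with an API key from Google AI Studio.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# "gemini-2.0-flash-exp" is assumed to be the experimental model id exposed
# during the preview; confirm the current name in AI Studio before relying on it.
model = genai.GenerativeModel("gemini-2.0-flash-exp")

response = model.generate_content(
    "In two sentences, what changed between Gemini 1.5 Flash and Gemini 2.0 Flash?"
)
print(response.text)
```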

Veo 2 and Imagen 3 Updates

Link, December 16, 2024

  • Veo 2 announced: the latest video-generation model supports resolutions up to 4K and shows an improved understanding of real-world physics and human movement; it understands the language of cinematography, producing high-quality video across genres, lenses, and cinematic effects.
  • Imagen 3 improved: the updated image-generation model produces brighter, better-composed images, renders a wider range of art styles more accurately, and delivers richer detail and texture.
  • Whisk experiment introduced: the new Whisk tool lets users visualize and remix image-based ideas, creating things like custom digital plushies, enamel pins, and stickers.
  • Veo 2 and Imagen 3 integration: the latest Veo 2 and Imagen 3 capabilities are available in VideoFX, ImageFX, and Whisk, rolling out first to early users via Google Labs and expanding to more products later.

DeepMind

Project Mariner Launch

Link, December 13, 2024

  • Project Mariner announced: DeepMind introduced Project Mariner, a research-prototype Chrome extension that uses Gemini 2.0 to control the browser.
  • Automated browser interaction: it automates website interactions such as typing URLs, scrolling pages, and clicking buttons, carrying out user instructions accurately.
  • Multimodal understanding and reasoning: it understands and reasons over pixels and web elements such as text, code, images, and forms, performing reliably even on complex websites.
  • WebVoyager benchmark results: Project Mariner scored 90.5% on the WebVoyager benchmark with tree search (83.5% single-agent), demonstrating strong reliability on real-world websites.
  • Still a research prototype: access is limited to a small group of trusted testers, and no information about an API or programmatic use has been provided.

Meta

Apollo Multimodal Models Announcement

Link, December 17, 2024

  • Apollo Multimodal Models released: Meta announced the Apollo Multimodal Models under the Apache 2.0 license, strengthening collaboration with the open-source community.
  • Model performance: Apollo-7B scores 61.2 on Video-MME, 70.9 on MLVU, and 66.3 on ApolloBench, rivaling and surpassing 30B+ parameter models such as Oryx-34B and VILA1.5-40B.
  • Model checkpoints available: 1.5B, 3B, and 7B checkpoints are provided on the Hugging Face Hub and work with the transformers library via custom code, making them easy to use across development environments (see the loading sketch after this list).
  • Research with Stanford University: a systematic study of video-understanding mechanisms analyzes the design factors behind efficient, high-performing models.
  • ApolloBench benchmark introduced: a new benchmark, ApolloBench, enables efficient, systematic evaluation of video-language models.
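
Because the checkpoints ship with custom modeling code, loading them through transformers requires trust_remote_code. The sketch below is minimal and the repository id is a placeholder guess; check the Apollo model cards on the Hub for the exact 1.5B/3B/7B names and the video preprocessing they require.

```python
from transformers import AutoModelForCausalLM

# Placeholder repo id -- look up the exact Apollo checkpoint names on the Hub.
repo_id = "Apollo-LMMs/Apollo-7B"

# Apollo ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
print(model.config)
```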

Meta Motivo Announcement

Link, December 12, 2024

  • Meta Motivo announced: Meta introduced Meta Motivo, a behavioral foundation model for zero-shot humanoid control.
  • Algorithmic innovation: the Forward-Backward Representations with Conditional-Policy Regularization (FB-CPR) algorithm learns flexible policies through unsupervised reinforcement learning.
  • Model training: trained on the AMASS motion-capture dataset and 30 million online interaction samples to control a high-dimensional virtual humanoid agent.
  • Wide range of tasks: performs whole-body tasks such as motion tracking, goal reaching, and reward optimization zero-shot, expressing human-like behaviors with strong results.
  • New humanoid benchmark: on a new humanoid benchmark, Meta Motivo outperforms state-of-the-art unsupervised RL and model-based baselines.
  • Open-source release: the pre-trained model, humanoid benchmark, and training code are released to accelerate community research (a download sketch follows this list).
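
As a minimal sketch of grabbing the released artifacts, the snippet below downloads a checkpoint snapshot from the Hugging Face Hub; the repository id is an assumption, so confirm the exact name on Meta's release page and use the accompanying training code to instantiate and prompt the model.

```python
from huggingface_hub import snapshot_download

# Assumed repo id -- confirm the exact checkpoint name on Meta's release page.
local_dir = snapshot_download(repo_id="facebook/metamotivo-M-1")
print("Meta Motivo checkpoint downloaded to:", local_dir)
```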

Microsoft

Phi-4 Release

Link, December 13, 2024

  • Phi-4 announced: Microsoft introduced Phi-4, a 14-billion-parameter small language model specializing in complex reasoning, particularly math.
  • Benchmark results: Phi-4 outperforms much larger models, including Gemini Pro 1.5, on math competition problems and performs strongly on math-focused reasoning benchmarks.
  • Available on Azure AI Foundry: Phi-4 is currently available on Azure AI Foundry under the Microsoft Research License Agreement (MSRLA) and will be available on Hugging Face the following week.
  • Responsible AI development: Azure AI Content Safety features such as prompt shields, protected-material detection, and groundedness detection support AI risk management and content filtering.
  • Data quality improvements: Phi-4 achieves outstanding quality and complex reasoning for its size through a mix of high-quality synthetic and organic data plus post-training innovations.
  • Technical approach: with only minimal changes to the Phi-3 architecture, Phi-4 substantially improves STEM-focused QA, surpassing its teacher model GPT-4 in that area thanks to its data-generation and post-training techniques (see the hedged loading sketch after this list).
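
Once the Hugging Face weights land, a standard transformers chat pipeline should be enough to try the model locally. This is a hedged sketch: the "microsoft/phi-4" id is an assumption until the release is live, and before then the model is only served from Azure AI Foundry.

```python
from transformers import pipeline

# "microsoft/phi-4" is the expected Hub id once the weights are published;
# until then Phi-4 is available only through Azure AI Foundry under MSRLA.
generator = pipeline(
    "text-generation", model="microsoft/phi-4",
    torch_dtype="auto", device_map="auto",
)

messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x? Show your steps."}]
print(generator(messages, max_new_tokens=256)[0]["generated_text"])
```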

MarkItDown Library Release

Link, December 17, 2024

  • MarkItDown announced: Microsoft released MarkItDown, a utility library that converts a wide range of file formats to Markdown.
  • Supported formats: PDF, PowerPoint (.pptx), Word (.docx), Excel (.xlsx), images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, CSV, JSON, XML, and more.
  • Installation and usage: installable via pip and usable as a command-line utility or Docker image; for example, `markitdown path-to-file.pdf > document.md` converts a PDF to Markdown.
  • LLM integration: large language models can be plugged in to describe images, for example by pairing MarkItDown with OpenAI's GPT-4o.
  • Flexible usage: a Python API exposes various options, and a Docker image makes it easy to run in container environments (a minimal Python sketch follows this list).
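
A minimal Python sketch of the usage described above; the LLM-hook keyword arguments (llm_client, llm_model) follow the project README at the time of writing and may change, and the file paths are placeholders.

```python
# pip install markitdown openai
from markitdown import MarkItDown
from openai import OpenAI

# Basic conversion: any supported file is turned into Markdown text.
md = MarkItDown()
result = md.convert("report.pdf")          # placeholder path
print(result.text_content)

# Optional LLM integration for image descriptions (kwargs per the README,
# subject to change between versions).
md_llm = MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o")
print(md_llm.convert("diagram.png").text_content)
```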

NousResearch

Hermes 3 Announcement

Link, December 15, 2024

  • Hermes 3 announced: NousResearch released Hermes 3 3B, its latest instruct-tuned model, built on the Llama-3.2 3B foundation model.
  • Model features: advanced agentic capabilities, better roleplaying, multi-turn conversation, long-context coherence, and stronger code generation.
  • Training details: a full-parameter fine-tune of Llama-3.2 3B, trained on H100s on LambdaLabs GPU Cloud.
  • Benchmark results: competitive with, and in places superior to, Llama-3.1 Instruct models on general capabilities, with varying strengths and weaknesses between the two.
  • Expanded capabilities: the Hermes 3 series builds on Hermes 2 with more reliable function calling, structured output capabilities, generalist assistant skills, and improved code generation.
  • Open-source availability: Hermes 3 is published on Hugging Face so developers can easily access and use it (a loading sketch follows this list).
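
A minimal sketch of loading the published checkpoint with transformers and running one chat turn through the model's chat template; the generation settings are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Hermes-3-Llama-3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are Hermes 3, a helpful assistant."},
    {"role": "user", "content": "Write a Python one-liner that reverses a string."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```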

Hugging Face

ProcessBench Release

Link, December 10, 2024

  • ProcessBench announced: ProcessBench, a new benchmark for identifying errors in mathematical reasoning processes, was published on Hugging Face.
  • Benchmark composition: 3,400 competition- and olympiad-level math problems, with the location of errors in step-by-step solutions annotated by experts.
  • Model evaluation: extensive evaluation of Process Reward Models (PRMs) and critic models; PRMs struggle to generalize to harder problems, while the QwQ-32B-Preview model is competitive with GPT-4o.
  • Key findings:
    • Existing PRMs generalize poorly to complex math problems beyond GSM8K and MATH.
    • Critic models (general-purpose language models) outperform PRMs at error detection.
    • Fine-tuning PRMs on the PRM800K dataset improves their performance.
  • Open-source contribution: ProcessBench is expected to help researchers improve how they evaluate language-model reasoning and to spur work on better error identification in future AI models.
  • Future plans: the PRM models are expected to be released on Hugging Face soon, making them easy for researchers and developers to use (see the dataset-loading sketch after this list).
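
A small sketch of pulling the benchmark with the datasets library for inspection; the Hub id "Qwen/ProcessBench" and the split layout are assumptions, so verify the exact dataset name on Hugging Face before use.

```python
from datasets import load_dataset

# Assumed Hub id -- confirm the exact dataset name and its configurations.
bench = load_dataset("Qwen/ProcessBench")

print(bench)                          # list the available splits/configs
first_split = next(iter(bench.values()))
print(first_split[0])                 # peek at one expert-annotated example
```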

Apple

New Apple Intelligence Features Announced

Link, December 11, 2024

  • Apple Intelligence update: new Apple Intelligence features arrive with the iOS 18.2, iPadOS 18.2, and macOS Sequoia 15.2 updates.
  • Image Playground: create playful images using concepts such as themes, costumes, accessories, and places; images can resemble family or friends based on photos from the user's library, with Animation and Illustration styles supported.
  • Genmoji introduced: typing a text description generates multiple Genmoji options for more fun, creative emoji in conversations; personalized Genmoji based on the user's photos are also supported.
  • Writing Tools enhancements: Rewrite, Proofread, and Summarize gain a Describe Your Change option, letting users specify exactly how they want their text modified.
  • ChatGPT integration: ChatGPT is integrated into Siri and Writing Tools so users can get AI help without switching apps; the Compose feature taps ChatGPT's content- and image-generation capabilities while writing.
  • Visual intelligence: Camera Control on the iPhone 16 lineup analyzes surroundings in real time, offering text summarization, translation, phone-number and email detection, Google search integration, and more.
  • Language expansion: localized English support extends to Australia, Canada, Ireland, New Zealand, South Africa, and the U.K., with Chinese, English (India), English (Singapore), French, German, Italian, Japanese, Korean, Portuguese, Spanish, Vietnamese, and more coming later.
  • Privacy protection: Apple Intelligence relies on on-device processing to protect user data, and Private Cloud Compute delivers cloud-based AI features without storing or sharing user data.
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each piece of content with detailed points, and write a report. The report format is:

(today's date, in 년 월 일 format) AI News,

Summary

(overall short summary with good detail; in the Summary section, explain the details starting with the company name, e.g., OpenAI announced ~~~.)

company name, Title

Link, date

  • detailed summary 1 (in concise bullet style)
  • detailed summary 2 (in concise bullet style)
  • detailed summary N (in concise bullet style)

company name, Title

Link, date

  • detailed summary 1 (in concise bullet style)
  • detailed summary 2 (in concise bullet style)
  • detailed summary N (in concise bullet style)
###
https://developers.googleblog.com/en/the-next-chapter-of-the-gemini-era-for-developers/
Gemini
The next chapter of the Gemini era for developers
DEC 11, 2024
Shrestha Basu Mallick
Group Product Manager
Gemini API
Kathy Korevec
Director of Product
Google Labs

Gemini 2.0
We're giving developers the power to build the future of AI with cutting-edge models, intelligent tools to write code faster, and seamless integration across platforms and devices. Since last December when we launched Gemini 1.0, millions of developers have used Google AI Studio and Vertex AI to build with Gemini across 109 languages.

Today, we are announcing Gemini 2.0 Flash Experimental to enable even more immersive and interactive applications, as well as new coding agents that will enhance workflows by taking action on behalf of the developer.


Build with Gemini 2.0 Flash
Building on the success of Gemini 1.5 Flash, Flash 2.0 is twice as fast as 1.5 Pro while achieving stronger performance, includes new multimodal outputs, and comes with native tool use. We’re also introducing a Multimodal Live API for building dynamic applications with real-time audio and video streaming.

Starting today, developers can test and explore Gemini 2.0 Flash via the Gemini API in Google AI Studio and Vertex AI during its experimental phase, with general availability coming early next year.

With Gemini 2.0 Flash, developers have access to:

1. Better performance
Gemini 2.0 Flash is more powerful than 1.5 Pro while still delivering on the speed and efficiency that developers expect from Flash. It also features improved multimodal, text, code, video, spatial understanding and reasoning performance on key benchmarks. Improved spatial understanding enables more accurate bounding boxes generation on small objects in cluttered images, and better object identification and captioning. Learn more in the spatial understanding video or read the Gemini API docs.


2. New output modalities
Developers will be able to use Gemini 2.0 Flash to generate integrated responses that can include text, audio, and images — all through a single API call. These new output modalities are available to early testers, with wider rollout expected next year. SynthID invisible watermarks will be enabled in all image and audio outputs, helping decrease misinformation and misattribution concerns.

Multilingual native audio output: Gemini 2.0 Flash features native text-to-speech audio output that provides developers fine-grained control over not just what the model says, but how it says it, with a choice of 8 high-quality voices and a range of languages and accents. Hear native audio output in action or read more in the developer docs.
Native image output: Gemini 2.0 Flash now natively generates images and supports conversational, multi-turn editing, so you can build on previous outputs and refine them. It can output interleaved text and images, making it useful in multimodal content such as recipes. See more in the native image output video.

3. Native tool use
Gemini 2.0 has been trained to use tools–a foundational capability for building agentic experiences. It can natively call tools like Google Search and code execution in addition to custom third-party functions via function calling. Using Google Search natively as a tool leads to more factual and comprehensive answers and increases traffic to publishers. Multiple searches can be run in parallel leading to improved information retrieval by finding more relevant facts from multiple sources simultaneously and combining them for accuracy. Learn more in the native tool use video or start building from a notebook.


4. Multimodal Live API
Developers can now build real-time, multimodal applications with audio and video-streaming inputs from cameras or screens. Natural conversational patterns like interruptions and voice activity detection are supported. The API supports the integration of multiple tools together to accomplish complex use cases with a single API call. See more in the multimodal live streaming video, try the web console, or starter code (Python).


We’re thrilled to see startups making impressive progress with Gemini 2.0 Flash, prototyping new experiences like tldraw's visual playground, Viggle's virtual character creation and audio narration, Toonsutra's contextual multilingual translation, and Rooms' adding real-time audio.

To jumpstart building, we’ve released three starter app experiences in Google AI Studio along with open source code for spatial understanding, video analysis and Google Maps exploration so you can begin building with Gemini 2.0 Flash.


Enabling the evolution of AI code assistance
As AI code assistance rapidly evolves from simple code searches to AI-powered assistants embedded in developer workflows, we want to share the latest advancement that will use Gemini 2.0: coding agents that can execute tasks on your behalf.

In our latest research, we've been able to use 2.0 Flash equipped with code execution tools to achieve 51.8% on SWE-bench Verified, which tests agent performance on real-world software engineering tasks. The cutting edge inference speed of 2.0 Flash allowed the agent to sample hundreds of potential solutions, selecting the best based on existing unit tests and Gemini's own judgment. We're in the process of turning this research into new developer products.


Meet Jules, your AI-powered code agent
Imagine your team has just finished a bug bash, and now you’re staring down a long list of bugs. Starting today, you can offload Python and Javascript coding tasks to Jules, an experimental AI-powered code agent that will use Gemini 2.0. Working asynchronously and integrated with your GitHub workflow, Jules handles bug fixes and other time-consuming tasks while you focus on what you actually want to build. Jules creates comprehensive, multi-step plans to address issues, efficiently modifies multiple files, and even prepares pull requests to land fixes directly back into GitHub.

Jules tackling an issue, developing a plan, and executing it (Sequences shortened. Results for illustrative purposes. Jules may make mistakes.)
It’s early, but from our internal experience using Jules, it’s giving developers:

More productivity. Assign issues and coding tasks to Jules for asynchronous coding efficiency.
Progress tracking. Stay informed and prioritize tasks that require your attention with real-time updates.
Full developer control. Review the plans Jules creates along the way, and provide feedback or request adjustments as you see fit. Easily review and, if appropriate, merge the code Jules writes into your project.
We’re making Jules available for a select group of trusted testers today, and we’ll make it available for other interested developers in early 2025. Sign up to get updates about Jules on labs.google.com/jules.


Colab's data science agent will create notebooks for you
At I/O this year, we launched an experimental Data Science Agent on labs.google/code that allows anyone to upload a dataset and get insights within minutes, all grounded in a working Colab notebook. We were thrilled to receive such positive feedback from the developer community and see the impact. For example, with the help of Data Science Agent, a scientist at Lawrence Berkeley National Laboratory working on a global tropical wetland methane emissions project has estimated their analysis and processing time was reduced from one week to five minutes.

Colab has started to integrate these same agentic capabilities, using Gemini 2.0. Simply describe your analysis goals in plain language, and watch your notebook take shape automatically, helping accelerate your ability to conduct research and data analysis. Developers can get early access to this new feature by joining the trusted tester program before it rolls out more widely to Colab users in the first half of 2025.

Colab’s data science agent uses Gemini 2.0 to create a notebook from natural language instructions
Developers are building the future
Our Gemini 2.0 models can empower you to build more capable AI apps faster and easier, so you can focus on great experiences for your users. We'll be bringing Gemini 2.0 to our platforms like Android Studio, Chrome DevTools and Firebase in the coming months. Developers can sign up to use Gemini 2.0 Flash in Gemini Code Assist, for enhanced coding assistance capabilities in popular IDEs such as Visual Studio Code, IntelliJ, PyCharm and more. Visit ai.google.dev to get started and follow Google AI for Developers for future updates.

###
https://deepmind.google/technologies/project-mariner/
Project Mariner
A research prototype exploring the future of human-agent interaction, starting with your browser
12/13/24

Yesterday Google DeepMind released its version of “computer use” with Project Mariner. Project Mariner is a Chrome Extension that uses Gemini 2.0 to control your browser based on human instructions. It scored 90.5% with a tree search on WebVoyager.
It can interact with websites by typing URLs into the address bar, scrolling through web pages to find relevant information, and clicking buttons, links, and other interactive elements.
No information about an API or programmatic use.

Overview
Native multimodality
Browser interaction
Reasoning
A new way to use your browser
Built with Gemini 2.0, Project Mariner combines strong multimodal understanding and reasoning capabilities to automate tasks using your browser.

Native multimodality
Project Mariner can understand and reason across everything on your browser screen, including pixels and web elements like text, code, images and forms.


Understands and seamlessly reasons across websites.


Understands and responds to voice instructions.


Keeps you informed on progress with visual feedback and updates.

Browser interaction
Project Mariner understands and navigates complex websites in real time—automating tasks in your browser while keeping you in control.



Navigates and interacts with websites on your behalf.


Automates repetitive tasks to help save you time.


Asks for clarification if it doesn't understand an instruction.

Reasoning
Project Mariner can follow complex instructions and reason across websites — and shows its work.


Interprets complex instructions, breaking them down into actionable steps.


Understands the relationships between different web elements and their functions.


Provides a clear view of its plan and actions, enabling you to understand its decision-making process.

Benchmarks
| Benchmark | Description | Project Mariner (Single-agent) | Project Mariner (Tree-search) |
|-----------|-------------|--------------------------------|-------------------------------|
| ScreenSpot | Multimodal screen understanding and grounding benchmark over graphical user interfaces (GUIs) across different platforms | 84.0% | - |
| WebVoyager | A benchmark to evaluate autonomous browser agents interacting with real-world websites.* | 83.5% | 90.5% |

*We updated the dates for outdated tasks and removed obsolete ones. For evaluation, we submitted the outputs to human reviewers and used majority voting among three evaluators.

Building responsibly in the agentic era
As we develop these new technologies, we recognize the responsibility it entails, and aim to prioritize safety and security in all our efforts.

Learn more

Experience Project Mariner
Project Mariner is a research prototype, being used only by a small group of trusted testers. If you're interested in becoming a tester, please share a few details to join the waitlist.

###
https://blog.google/technology/google-labs/video-image-generation-update-december-2024/
State-of-the-art video and image generation with Veo 2 and Imagen 3
Dec 16, 2024

7 min read

We’re announcing new versions of Veo and Imagen, and introducing our latest experiment in image generation: Whisk.

Aäron van den Oord
Research Scientist, Google DeepMind
Elias Roman
Senior Director, Product Management, Google Labs
Three different AI generated images in front of an abstract background
Earlier this year, we introduced our video generation model, Veo, and our latest image generation model, Imagen 3. Since then, it’s been exciting to watch people bring their ideas to life with help from these models: YouTube creators are exploring the creative possibilities of video backgrounds for their YouTube Shorts, enterprise customers are enhancing creative workflows on Vertex AI and creatives are using VideoFX and ImageFX to tell their stories. Together with collaborators ranging from filmmakers to businesses, we’re continuing to develop and evolve these technologies.

Today we're introducing a new video model, Veo 2, and the latest version of Imagen 3, both of which achieve state-of-the-art results. These models are now available in VideoFX, ImageFX and our newest Labs experiment, Whisk.

Veo 2: state-of-the-art video generation
Veo 2 creates incredibly high-quality videos in a wide range of subjects and styles. In head-to-head comparisons judged by human raters, Veo 2 achieved state-of-the-art results against leading models.

It brings an improved understanding of real-world physics and the nuances of human movement and expression, which helps improve its detail and realism overall. Veo 2 understands the unique language of cinematography: ask it for a genre, specify a lens, suggest cinematic effects and Veo 2 will deliver — at resolutions up to 4K, and extended to minutes in length. Ask for a low-angle tracking shot that glides through the middle of a scene, or a close-up shot on the face of a scientist looking through her microscope, and Veo 2 creates it. Suggest “18mm lens” in your prompt and Veo 2 knows to craft the wide angle shot that this lens is known for, or blur out the background and focus on your subject by putting "shallow depth of field" in your prompt.

Cinematic shot of a female doctor in a dark yellow hazmat suit, illuminated by the harsh fluorescent light of a laboratory. The camera slowly zooms in on her face, panning gently to emphasize the worry and anxiety etched across her brow. She is hunched over a lab table, peering intently into a microscope, her gloved hands carefully adjusting the focus. The muted color palette of the scene, dominated by the sickly yellow of the suit and the sterile steel of the lab, underscores the gravity of the situation and the weight of the unknown she is facing. The shallow depth of field focuses on the fear in her eyes, reflecting the immense pressure and responsibility she bears.
Examples of Veo 2's high-quality video generation capabilities. All videos were generated by Veo 2 and have not been modified.

This medium shot, with a shallow depth of field, portrays an adorable cartoon girl with wavy brown hair and lots of character, sitting upright in a 1980s kitchen. Her hair is medium length and wavy. She has a small, slightly upturned nose, and small, rounded ears. She is very animated and excited as she talks to the camera and lighting and giggling with a huge grin.
Examples of Veo 2's high-quality video generation capabilities. All videos were generated by Veo 2 and have not been modified.

The camera floats gently through rows of pastel-painted wooden beehives, buzzing honeybees gliding in and out of frame. The motion settles on the refined farmer standing at the center, his pristine white beekeeping suit gleaming in the golden afternoon light. He lifts a jar of honey, tilting it slightly to catch the light. Behind him, tall sunflowers sway rhythmically in the breeze, their petals glowing in the warm sunlight. The camera tilts upward to reveal a retro farmhouse with mint-green shutters, its walls dappled with shadows from swaying trees. Shot with a 35mm lens on Kodak Portra 400 film, the golden light creates rich textures on the farmer’s gloves, marmalade jar, and weathered wood of the beehives.
Examples of Veo 2's high-quality video generation capabilities. All videos were generated by Veo 2 and have not been modified.

A low-angle shot captures a flock of pink flamingos gracefully wading in a lush, tranquil lagoon. The vibrant pink of their plumage contrasts beautifully with the verdant green of the surrounding vegetation and the crystal-clear turquoise water. Sunlight glints off the water's surface, creating shimmering reflections that dance on the flamingos' feathers. The birds' elegant, curved necks are submerged as they walk through the shallow water, their movements creating gentle ripples that spread across the lagoon. The composition emphasizes the serenity and natural beauty of the scene, highlighting the delicate balance of the ecosystem and the inherent grace of these magnificent birds. The soft, diffused light of early morning bathes the entire scene in a warm, ethereal glow.
Examples of Veo 2's high-quality video generation capabilities. All videos were generated by Veo 2 and have not been modified.

A perfect cube rotates in the center of a soft, foggy void. The surface shifts between different hyper-real textures—smooth marble, velvety suede, hammered brass, and raw concrete. Each material reveals subtle details: marble veins slowly spreading, suede fibers brushing with wind, brass tarnishing in slow motion, and concrete crumbling to reveal polished stone inside. Ends with a soft glow surrounding the cube as it transitions to a smooth mirrored surface, reflecting infinity.
Examples of Veo 2's high-quality video generation capabilities. All videos were generated by Veo 2 and have not been modified.

A cinematic shot captures a fluffy Cockapoo, perched atop a vibrant pink flamingo float, in a sun-drenched Los Angeles swimming pool. The crystal-clear water sparkles under the bright California sun, reflecting the playful scene. The Cockapoo's fur, a soft blend of white and apricot, is highlighted by the golden sunlight, its floppy ears gently swaying in the breeze. Its happy expression and wagging tail convey pure joy and summer bliss. The vibrant pink flamingo adds a whimsical touch, creating a picture-perfect image of carefree fun in the LA sunshine.
Examples of Veo 2's high-quality video generation capabilities. All videos were generated by Veo 2 and have not been modified.

The sun rises slowly behind a perfectly plated breakfast scene. Thick, golden maple syrup pours in slow motion over a stack of fluffy pancakes, each one releasing a soft, warm steam cloud. A close-up of crispy bacon sizzles, sending tiny embers of golden grease into the air. Coffee pours in smooth, swirling motion into a crystal-clear cup, filling it with deep brown layers of crema. Scene ends with a camera swoop into a fresh-cut orange, revealing its bright, juicy segments in stunning macro detail.
Examples of Veo 2's high-quality video generation capabilities. All videos were generated by Veo 2 and have not been modified.


While video models often “hallucinate” unwanted details — extra fingers or unexpected objects, for example — Veo 2 produces these less frequently, making outputs more realistic.

Our commitment to safety and responsible development has guided Veo 2. We have been intentionally measured in growing Veo’s availability, so we can help identify, understand and improve the model’s quality and safety while slowly rolling it out via VideoFX, YouTube and Vertex AI.

Just like the rest of our image and video generation models, Veo 2 outputs include an invisible SynthID watermark that helps identify them as AI-generated, helping reduce the chances of misinformation and misattribution.

Today, we're bringing our new Veo 2 capabilities to our Google Labs video generation tool, VideoFX, and expanding the number of users who can access it. Visit Google Labs to sign up for the waitlist. We also plan to expand Veo 2 to YouTube Shorts and other products next year.

Note: Prompts for all videos (Scientist, Cartoon character, Bees, Flamingos, Cube, Dog, Pancakes) are listed alongside the examples above.

Imagen 3: state-of-the-art image generation
We've also improved our Imagen 3 image-generation model, which now generates brighter, better composed images. It can now render more diverse art styles with greater accuracy — from photorealism to impressionism, from abstract to anime. This upgrade also follows prompts more faithfully, and renders richer details and textures. In side-by-side comparisons of outputs by human raters against leading image generation models, Imagen 3 achieved state-of-the-art results.

Starting today, the latest Imagen 3 model will globally roll out in ImageFX, our image generation tool from Google Labs, to more than 100 countries. Visit ImageFX to get started.

A close-up shot captures a winter wonderland scene – soft snowflakes fall on a snow-covered forest floor. Behind a frosted pine branch, a red squirrel sits, its bright orange fur a splash of color against the white. It holds a small hazelnut. As it enjoys its meal, it seems oblivious to the falling snow.
Examples of Imagen 3's rich detail and image quality composition

An extreme close-up of a craftsperson's hands shaping a glowing piece of pottery on a wheel. Threads of golden, luminous energy connect the potter’s hands to the clay, swirling dynamically with their movements.
Examples of Imagen 3's rich detail and image quality composition

A foggy 1940s European train station at dawn, framed by intricate wrought-iron arches and misted glass windows. Steam rises from the tracks, blending with dense fog. Two lovers stand in an emotional embrace near the train, backlit by the warm, amber glow of dim lanterns. The departing train is partially visible, its red tail lights fading into the mist. The woman wears a faded red coat and clutches a small leather diary, while the man is dressed in a weathered soldier’s uniform. Dust motes float in the air, illuminated by the soft golden backlight. The atmosphere is melancholic and timeless, evoking the bittersweet farewell of wartime cinema.
Examples of Imagen 3's rich detail and image quality composition

A portrait of an Asian woman with neon green lights in the background, shallow depth of field.
Examples of Imagen 3's rich detail and image quality composition

A close-up, macro photography stock photo of a strawberry intricately sculpted into the shape of a hummingbird in mid-flight, its wings a blur as it sips nectar from a vibrant, tubular flower. The backdrop features a lush, colorful garden with a soft, bokeh effect, creating a dreamlike atmosphere. The image is exceptionally detailed and captured with a shallow depth of field, ensuring a razor-sharp focus on the strawberry-hummingbird and gentle fading of the background. The high resolution, professional photographers style, and soft lighting illuminate the scene in a very detailed manner, professional color grading amplifies the vibrant colors and creates an image with exceptional clarity. The depth of field makes the hummingbird and flower stand out starkly against the bokeh background.
Examples of Imagen 3's rich detail and image quality composition


Note: Prompts for all images (Potter, Squirrel, Train station, Woman, Strawberry bird) are listed alongside the examples above.

Whisk: a fun new tool that lets you prompt with images to visualize your ideas
Whisk, our newest experiment from Google Labs, lets you input or create images that convey the subject, scene and style you have in mind. Then, you can bring them together and remix them to create something uniquely your own, from a digital plushie to an enamel pin or sticker.

Under the hood, Whisk combines our latest Imagen 3 model with Gemini’s visual understanding and description capabilities. The Gemini model automatically writes a detailed caption of your images, and it then feeds those descriptions into Imagen 3. This process allows you to easily remix your subjects, scenes and styles in fun, new ways.

###
https://apollo-lmms.github.io/

Let's gooo! AI at Meta released Apollo Multimodal Models Apache 2.0 licensed - 7B SoTA & beats 30B+ checkpoints🔥
Key insights:
> 1.5B, 3B and 7B model checkpoints
> Can comprehend up-to 1 hour of video 🤯
> Temporal reasoning & complex video question-answering
> Multi-turn conversations grounded in video content
> Apollo-3B outperforms most existing 7B models, achieving scores of 58.4, 68.7, and 62.7 on Video-MME, MLVU, and ApolloBench, respectively
> Apollo-7B rivals and surpasses models with over 30B parameters, such as Oryx-34B and VILA1.5-40B, on benchmarks like MLVU
> Apollo-1.5B: Outperforms models larger than itself, including Phi-3.5-Vision and some 7B models like LongVA-7B
> Apollo-3B: Achieves scores of 55.1 on LongVideoBench, 68.7 on MLVU, and 62.7 on ApolloBench
> Apollo-7B: Attains scores of 61.2 on Video-MME, 70.9 on MLVU, and 66.3 on ApolloBench
> Model checkpoints on the Hub & works w/ transformers (custom code)
Congrats Meta for such a brilliant release and thanks again for ensuring their commitment to Open Source! 🤗

Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia

1Meta GenAI 2Stanford University

We investigate the mechanisms that drive video understanding in large multimodal models and provide actionable insights for the community. Our work includes:

Systematic exploration of the design space of video-LMMs, uncovering critical factors that drive performance.
Investigation of training schedules and data mixtures, providing practical insights for optimizing model performance.
Discovery of "Scaling Consistency," enabling efficient design decisions on smaller LMMs that generalize to larger scales.
A novel benchmark, ApolloBench, for efficient evaluation.
Introducing Apollo, a family of state-of-the-art video-LMMs.

We introduce Apollo, a new family of state-of-the-art video-LMMs. In developing Apollo, we uncover Scaling Consistency, enabling us to reliably make design decisions on smaller models and datasets, dramatically cutting computational costs. Guided by these principles, we train hundreds of model variants—systematically exploring video sampling strategies, token integration, training schedules, and data mixtures. Leveraging these insights, Apollo sets a new benchmark in efficient, high-performance video-language modeling.

###
https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B
Hermes 3 - Llama-3.2 3B
nousresearch

[Submitted on 15 Aug 2024]

Ryan Teknium, Jeffrey Quesnelle, Chen Guang
Instruct (or "chat") tuned models have become the primary way in which most people interact with large language models. As opposed to "base" or "foundation" models, instruct-tuned models are optimized to respond to imperative statements. We present Hermes 3, a neutrally-aligned generalist instruct and tool use model with strong reasoning and creative abilities. Its largest version, Hermes 3 405B, achieves state of the art performance among open weight models on several public benchmarks.

Model Description
Hermes 3 3B is a small but mighty new addition to the Hermes series of LLMs by Nous Research, and is Nous's first fine-tune in this parameter class.

For details on Hermes 3, please see the Hermes 3 Technical Report.

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.

Hermes 3 3B is a full parameter fine-tune of the Llama-3.2 3B foundation model, focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.

The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

Hermes 3 3B was trained on H100s on LambdaLabs GPU Cloud. Check out LambdaLabs' cloud offerings here.

Benchmarks
Hermes 3 is competitive, if not superior, to Llama-3.1 Instruct models at general capabilities, with varying strengths and weaknesses attributable between the two.

GPT4All:
| Task |Version| Metric |Value | |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge| 0|acc |0.5529|± |0.0145|
| | |acc_norm|0.5870|± |0.0144|
|arc_easy | 0|acc |0.8371|± |0.0076|
| | |acc_norm|0.8144|± |0.0080|
|boolq | 1|acc |0.8599|± |0.0061|
|hellaswag | 0|acc |0.6133|± |0.0049|
| | |acc_norm|0.7989|± |0.0040|
|openbookqa | 0|acc |0.3940|± |0.0219|
| | |acc_norm|0.4680|± |0.0223|
|piqa | 0|acc |0.8063|± |0.0092|
| | |acc_norm|0.8156|± |0.0090|
|winogrande | 0|acc |0.7372|± |0.0124|

Average: 72.59

AGIEval:
| Task |Version| Metric |Value | |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat | 0|acc |0.2441|± |0.0270|
| | |acc_norm|0.2441|± |0.0270|
|agieval_logiqa_en | 0|acc |0.3687|± |0.0189|
| | |acc_norm|0.3840|± |0.0191|
|agieval_lsat_ar | 0|acc |0.2304|± |0.0278|
| | |acc_norm|0.2174|± |0.0273|
|agieval_lsat_lr | 0|acc |0.5471|± |0.0221|
| | |acc_norm|0.5373|± |0.0221|
|agieval_lsat_rc | 0|acc |0.6617|± |0.0289|
| | |acc_norm|0.6357|± |0.0294|
|agieval_sat_en | 0|acc |0.7670|± |0.0295|
| | |acc_norm|0.7379|± |0.0307|
|agieval_sat_en_without_passage| 0|acc |0.4417|± |0.0347|
| | |acc_norm|0.4223|± |0.0345|
|agieval_sat_math | 0|acc |0.4000|± |0.0331|
| | |acc_norm|0.3455|± |0.0321|

Average: 44.05

BigBench:

| Task |Version| Metric |Value | |Stderr|
|------------------------------------------------|------:|---------------------|-----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|0.6000|± |0.0356|
|bigbench_date_understanding | 0|multiple_choice_grade|0.6585|± |0.0247|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3178|± |0.0290|
|bigbench_geometric_shapes | 0|multiple_choice_grade|0.2340|± |0.0224|
| | |exact_str_match |0.0000|± |0.0000|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2980|± |0.0205|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2057|± |0.0153|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5367|± |0.0288|
|bigbench_movie_recommendation | 0|multiple_choice_grade|0.4040|± |0.0220|
|bigbench_navigate | 0|multiple_choice_grade|0.4970|± |0.0158|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.7075|± |0.0102|
|bigbench_ruin_names | 0|multiple_choice_grade|0.4821|± |0.0236|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2295|± |0.0133|
|bigbench_snarks | 0|multiple_choice_grade|0.6906|± |0.0345|
|bigbench_sports_understanding | 0|multiple_choice_grade|0.5375|± |0.0159|
|bigbench_temporal_sequences | 0|multiple_choice_grade|0.6270|± |0.0153|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2216|± |0.0118|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1594|± |0.0088|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5367|± |0.0288|

Average: 44.13

###
https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090
Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning
ecekamar
Microsoft
Dec 13, 2024
Learn about Phi-4, the latest small language model in Phi family, that offers high quality results at a small size (14B parameters).
Today we are introducing Phi-4, our 14B parameter state-of-the-art small language model (SLM) that excels at complex reasoning in areas such as math, in addition to conventional language processing. Phi-4 is the latest member of our Phi family of small language models and demonstrates what’s possible as we continue to probe the boundaries of SLMs. Phi-4 is currently available on Azure AI Foundry under a Microsoft Research License Agreement (MSRLA) and will be available on Hugging Face next week.

Phi-4 Benchmarks

Phi-4 outperforms comparable and larger models on math related reasoning due to advancements throughout the processes, including the use of high-quality synthetic datasets, curation of high-quality organic data, and post-training innovations. Phi-4 continues to push the frontier of size vs quality.


Phi-4 is particularly good at math problems, for example here are the benchmarks for Phi-4 on math competition problems:

Phi-4 performance on math competition problems


Phi-4 outperforms much larger models, including Gemini Pro 1.5, on math competition problems (https://maa.org/student-programs/amc/)
To see more benchmarks read the newest technical paper released on arxiv.

Enabling AI innovation safely and responsibly

Building AI solutions responsibly is at the core of AI development at Microsoft. We have made our robust responsible AI capabilities available to customers building with Phi models, including Phi-3.5-mini optimized for Windows Copilot+ PCs.

Azure AI Foundry provides users with a robust set of capabilities to help organizations measure, mitigate, and manage AI risks across the AI development lifecycle for traditional machine learning and generative AI applications. Azure AI evaluations in AI Foundry enable developers to iteratively assess the quality and safety of models and applications using built-in and custom metrics to inform mitigations.

Additionally, Phi users can use Azure AI Content Safety features such as prompt shields, protected material detection, and groundedness detection. These capabilities can be leveraged as content filters with any language model included in our model catalog and developers can integrate these capabilities into their application easily through a single API. Once in production, developers can monitor their application for quality and safety, adversarial prompt attacks, and data integrity, making timely interventions with the help of real-time alerts.

Phi-4 in action

One example of the mathematical reasoning Phi-4 is capable of is demonstrated in this problem.

We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.

###
https://www.apple.com/newsroom/2024/12/apple-intelligence-now-features-image-playground-genmoji-and-more/
PRESS RELEASE
December 11, 2024
Apple Intelligence now features Image Playground, Genmoji, Writing Tools enhancements, seamless support for ChatGPT, and visual intelligence
Today also marks the beginning of Apple Intelligence language expansion with localized English support for Australia, Canada, Ireland, New Zealand, South Africa, and the U.K.




https://nr.apple.com/dA2t5F6hg2

Rewrite, Genmoji creation, and Image Playground displayed on MacBook Pro, iPhone 16, and iPad Pro.
Today’s release of iOS 18.2, iPadOS 18.2, and macOS Sequoia 15.2 brings new features to Apple Intelligence on iPhone, iPad, and Mac.
CUPERTINO, CALIFORNIA Apple today announced the release of iOS 18.2, iPadOS 18.2, and macOS Sequoia 15.2, introducing a brand-new set of Apple Intelligence features that will elevate users’ experience with iPhone, iPad, and Mac, and builds on the first set of capabilities already introduced. Apple Intelligence is the easy-to-use personal intelligence system that delivers helpful and relevant intelligence while taking an extraordinary step forward for privacy in AI. Now users can explore creative new ways to express themselves visually with Image Playground, create the perfect emoji for any situation with Genmoji, and make their writing even more dynamic with new enhancements to Writing Tools. Building on Apple Intelligence, users with an iPhone 16 or iPhone 16 Pro can instantly learn more about their surroundings with visual intelligence with Camera Control. And now with ChatGPT integrated into Writing Tools and Siri, users can tap into ChatGPT’s expertise without having to switch between apps, helping them get things done faster and easier than ever before.
Today, Apple Intelligence also begins language expansion with localized English support for Australia, Canada, Ireland, New Zealand, South Africa, and the U.K., giving even more users around the world powerful new ways to use their iPhone, iPad, and Mac. Additional languages, including Chinese, English (India), English (Singapore), French, German, Italian, Japanese, Korean, Portuguese, Spanish, and Vietnamese will be coming throughout the year, with an initial set arriving in a software update in April.
Design Fun, Original Images with Image Playground
The Image Playground experience allows users to easily create fun and unique images, with concepts like themes, costumes, accessories, and places. Users can add their own text descriptions, and can even create images in the likeness of a family member or friend using photos from their photo library. Image Playground generates images in distinct styles, including Animation — a modern, 3D-animated look — and Illustration, which offers images with simple shapes, clear lines, and colorblocking.
The experience is integrated right into Messages, making it easier than ever to create images for conversations, as well as into apps like Freeform, Keynote, and many others. Image Playground is also available as a brand-new dedicated app.
The Image Playground app displayed in macOS Sequoia.
An Image Playground Animation style creation in Keynote displayed in iPadOS 18.
Image Playground allows users to easily create fun and unique images, with concepts like themes, costumes, accessories, and places.
Create Genmoji to Fit Any Moment
With the power of Apple Intelligence, emoji is taken to the next level with Genmoji, making conversations with family and friends more fun and playful, and opening up entirely new ways to communicate.
By simply typing a description into the emoji keyboard, a Genmoji will appear, including multiple options to choose from. With images from their photo library, users can take Genmoji even further by creating one that is inspired by a friend or family member. Personalized Genmoji can be customized with accessories, like a hat or sunglasses, and can reflect themes or activities to make them even more personal and unique. Just like emoji, Genmoji can be added inline to messages, or shared as a sticker or reaction in a Tapback.1
Genmoji creation displayed on iPhone 16 Pro.
A Genmoji octopus acting as a DJ, displayed on iPhone 16 Pro.
Users can create their own unique emoji with Genmoji, making conversations with family and friends more fun and playful.
Take Notes to the Next Level with Image Wand
The Notes app gets new tools to make note-taking more visual and dynamic. With Image Wand in the tool palette, users can quickly create images in their note using the written or visual context already captured within the note.
Image Wand transforms a rough sketch into a polished image by simply circling it. Users can even circle empty space within a note, and Image Wand will gather context from the surrounding area — using on-device generative models to analyze the handwritten or typed text — to create a relevant image that complements the note and makes it more visual. Users can create images with the Animation, Illustration, and an additional Sketch style in Image Wand.
Image Wand displayed on iPad Pro.
With Image Wand, users can transform a rough sketch into a polished image by simply circling it.
Describe Changes in Writing Tools
Writing Tools build on the existing options of Rewrite, Proofread, and Summarize with the new ability for users to specify the change they’d like to make, using the new Describe Your Change option. Describe Your Change gives users even more flexibility and control when they’d like to make their writing sound more expressive, such as to add more dynamic action words to their resume or even rewrite a dinner party invitation in the form of a poem, and more. Just like all of the features with Writing Tools, this new Describe Your Change option is available systemwide across Apple and many third-party apps.
The Describe Your Change experience with the Writing Tools feature in Mail, displayed in macOS Sequoia.
Describe Your Change in Writing Tools gives users even more flexibility and control by allowing them to specify the change they’d like to make to their text.
Learn More About Surroundings in One Click with Visual Intelligence
A new visual intelligence experience builds on Apple Intelligence and helps users learn about objects and places instantly, thanks to the new Camera Control on the iPhone 16 lineup. Visual intelligence can summarize and copy text, translate text between languages, detect phone numbers or email addresses with the option to add to contacts, and more. Camera Control also allows users to search Google so they can see where they can buy an item, or benefit from ChatGPT’s problem-solving skills to ask for an explanation about a complex diagram, such as from class notes. Users are in control of when third-party tools are used and what information is shared.
Tap into ChatGPT with Siri and Writing Tools
Apple is enabling ChatGPT access in Siri and Writing Tools experiences within iOS, iPadOS, and macOS, allowing users to access its expertise — as well as its image- and document-understanding capabilities — without needing to jump between applications. With the ChatGPT integration, Siri can suggest a user access ChatGPT for certain requests, and Siri can provide the response directly.
With Compose, users can ask ChatGPT to generate content for anything they are writing about from the systemwide Writing Tools. They can also use ChatGPT’s image-generation capabilities to add images alongside their written content.
Users can choose whether to enable ChatGPT integration, and are in full control of when to use it and what information is shared with ChatGPT. By default, a ChatGPT account is not required to use this integration. When using ChatGPT without an account, OpenAI will not store requests, and will not use the data for model training. Additionally, users’ IP addresses are obscured to prevent their sessions from being linked together. For those who choose to connect their account, OpenAI’s data-use policies apply.
Writing Tools with ChatGPT integration, displayed in Pages in macOS Sequoia.
Compose with ChatGPT helps users generate content for anything they are writing about.
Even More Capabilities Coming Soon
Additional Apple Intelligence capabilities will be available in the months to come. Siri will be even more capable, with the ability to draw on a user’s personal context to deliver intelligence that’s tailored to them. Siri will also gain onscreen awareness, and will be able to take hundreds of new actions in and across Apple and third-party apps. Priority Notifications will also surface what’s most important. In addition, users will be able to create images in Image Playground in a Sketch style, an academic and highly detailed style that uses a vibrant color palette combined with technical lines to produce realistic drawings.
A Breakthrough for Privacy in AI
Designed to protect users’ privacy at every step, Apple Intelligence uses on-device processing, meaning that many of the models that power it run entirely on device. For requests that require access to larger models, Private Cloud Compute extends the privacy and security of iPhone into the cloud to unlock even more intelligence. When using Private Cloud Compute, users’ data is never stored or shared with Apple; it is used only to fulfill their request. Independent experts can inspect the code that runs on Apple silicon servers to continuously verify this privacy promise, and are already doing so. This is an extraordinary step forward for privacy in AI.
Availability
Apple Intelligence is available now as a free software update with iOS 18.2, iPadOS 18.2, and macOS Sequoia 15.2, and can be accessed in most regions around the world when the device and Siri language are set to localized English for Australia, Canada, Ireland, New Zealand, South Africa, the U.K., or the U.S.
Mac users in the EU can access Apple Intelligence when using a compatible device with supported settings and languages. This April, Apple Intelligence features will start to roll out to iPhone and iPad users in the EU. This will include many of the core features of Apple Intelligence, including Writing Tools, Genmoji, a redesigned Siri with richer language understanding, ChatGPT integration, and more.
Apple Intelligence is available on iPhone 16, iPhone 16 Plus, iPhone 16 Pro, iPhone 16 Pro Max, iPhone 15 Pro, iPhone 15 Pro Max, iPad with A17 Pro or M1 and later, and Mac with M1 and later.

###
https://ai.meta.com/research/publications/zero-shot-whole-body-humanoid-control-via-behavioral-foundation-models/

Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models
December 12, 2024

Abstract
Unsupervised reinforcement learning (RL) aims at pre-training agents that can solve a wide range of downstream tasks in complex environments. Despite recent advancements, existing approaches suffer from several limitations: they may still require running an RL process on each downstream task to achieve a satisfactory performance, they may need access to datasets with good coverage or well-curated task-specific samples, or they may pre-train policies with unsupervised losses that are poorly correlated with the downstream tasks of interest. In this paper, we introduce a novel algorithm regularizing unsupervised RL towards imitating trajectories from unlabeled behavior dataset. The key technical novelty of our method, called Forward-Backward Representations with Conditional-Policy Regularization, is to train forward-backward representations to embed the unlabeled trajectories to the same latent space used to represent states, rewards, and policies, and use a latent-conditional discriminator to encourage policies to ``cover'' the states in the unlabeled behavior dataset. As a result, we can learn policies that are well aligned with the behaviors in the dataset, while retaining zero-shot generalization capabilities for reward-based and imitation tasks. We demonstrate the effectiveness of this new approach in a challenging humanoid control problem: leveraging observation-only motion capture datasets, we train Meta Motivo, the first humanoid behavioral foundation model that can be prompted to solve a variety of whole-body tasks, including motion tracking, goal reaching, and reward optimization. The resulting model is capable of expressing human-like behaviors and it achieves competitive performance with task-specific methods while outperforming state-of-the-art unsupervised RL and model-based baselines.

A Meta FAIR release
Introducing Meta Motivo
A first-of-its-kind behavioral foundation model to control a virtual physics-based humanoid agent for a wide range of whole-body tasks.
Meta Motivo is a behavioral foundation model pre-trained with a novel unsupervised reinforcement learning algorithm to control the movements of a complex virtual humanoid agent. At test time, our model can be prompted to solve unseen tasks such as motion tracking, pose reaching, and reward optimization without any additional learning or fine-tuning.
Physics-based environment
The model has learned to control the agent, subject to the physics of its body and environment. Its behaviors are robust to variations and perturbations.
Different prompts for behaviors
The model can be prompted with motions to track, poses to reach, and rewards to optimize.
Zero-shot capability
The model computes the best behavior for each prompt without any additional learning or fine-tuning.
Explore the Research
We are releasing the pre-trained model together with the new humanoid benchmark and the training code. We hope this will encourage the community to further develop research towards building behavioral foundation models that can generalize to more complex tasks, and potentially different types of agents.
Key takeaways
We introduce a new algorithm grounding the forward-backward unsupervised reinforcement learning method with an imitation objective that leverages a dataset of unlabeled trajectories.
With this new approach, we train Meta Motivo, a behavioral foundation model that controls a high-dimensional virtual humanoid agent to solve a wide range of tasks.
We evaluated our model using a new humanoid benchmark across motion tracking, pose reaching, and reward optimization tasks. Meta Motivo achieved competitive performance with task-specific methods, while outperforming state-of-the-art unsupervised RL and model-based baselines.
The Algorithm
Forward-Backward representations with Conditional Policy Regularization (FB-CPR) is a novel algorithm combining unsupervised forward-backward representations [1, 2, 3] with an imitation learning loss regularizing policies to cover states observed in a dataset of unlabeled trajectories. Our algorithm is trained online through direct access to the environment and it crucially learns a representation that aligns the embedding of states, motions, and rewards into the same latent space. As a result, we can train models whose policies are grounded towards useful behaviors, while being capable of zero-shot inference across a wide range of tasks, such as goal-based RL, imitation learning, reward optimization, and tracking.
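To make the FB-CPR idea more concrete, here is a minimal conceptual sketch in PyTorch of the components described above: a backward network that embeds states into the latent space, a forward network and latent-conditioned policy, and a latent-conditional discriminator whose score rewards the policy for covering the states of an unlabeled trajectory embedded as z. All names, sizes, and the GAN-style loss below are illustrative assumptions based on this description, not Meta's released training code.

# Conceptual FB-CPR-style sketch (illustrative assumptions, not Meta's released code).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 358, 69, 64   # sizes are placeholders

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(), nn.Linear(256, n_out))

backward_net = mlp(STATE_DIM, LATENT_DIM)                            # B(s): state embedding
forward_net = mlp(STATE_DIM + ACTION_DIM + LATENT_DIM, LATENT_DIM)   # F(s, a, z): used in the full forward-backward objective (not shown here)
policy_net = mlp(STATE_DIM + LATENT_DIM, ACTION_DIM)                 # pi_z(s): latent-conditioned policy
discriminator = mlp(STATE_DIM + LATENT_DIM, 1)                       # D(s, z): latent-conditional discriminator

def embed_trajectory(states):
    # Unlabeled trajectories are embedded into the same latent space used for policies.
    z = backward_net(states).mean(dim=0)
    return z / z.norm()

# Conditional-policy regularization: the discriminator separates dataset states from
# on-policy states for the same latent z; its score becomes an extra reward that
# pushes pi_z to "cover" the dataset states.
dataset_states = torch.randn(32, STATE_DIM)   # placeholder batch from unlabeled motions
policy_states = torch.randn(32, STATE_DIM)    # placeholder batch from rollouts of pi_z
z = embed_trajectory(dataset_states).expand(32, LATENT_DIM)

d_real = discriminator(torch.cat([dataset_states, z], dim=-1))
d_fake = discriminator(torch.cat([policy_states, z], dim=-1))
disc_loss = nn.functional.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
            nn.functional.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
imitation_reward = nn.functional.logsigmoid(d_fake).detach()  # added to the RL objective for pi_z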
[Figure: diagram of the FB-CPR pre-training approach]
The final model includes two components: 1) an embedding network that takes the state of the agent as input and returns its embedding; 2) a policy network, parameterized by the same embedding, that takes the state as input and returns the action to take.
[Figure: diagram of what the model has learned]
Inference from various types of prompts
Our algorithm learns a representation that aligns states, rewards, and policies into the same latent space. We can then leverage this representation to perform zero-shot inference for different tasks.
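At inference time, each prompt type is mapped to a latent vector z in this shared space and the same frozen policy is conditioned on it. The hedged sketch below reuses the hypothetical networks from the previous sketch; the exact prompt-to-latent rules are assumptions based on the forward-backward literature, not the released Meta Motivo API.

# Illustrative zero-shot prompting (assumptions, not the released Meta Motivo API).
import torch

def prompt_to_latent(prompt_type, prompt, backward_net, dataset_states=None):
    if prompt_type == "tracking":        # motion prompt: embed the trajectory's states
        z = backward_net(prompt).mean(dim=0)
    elif prompt_type == "goal":          # pose prompt: embed the single target state
        z = backward_net(prompt.unsqueeze(0)).squeeze(0)
    elif prompt_type == "reward":        # reward prompt: reward-weighted state embedding
        rewards = prompt(dataset_states)                     # prompt is a function r(s)
        z = (backward_net(dataset_states) * rewards.unsqueeze(-1)).mean(dim=0)
    else:
        raise ValueError(f"unknown prompt type: {prompt_type}")
    return z / z.norm()

def act(policy_net, state, z):
    # The same frozen policy solves every task; only the latent prompt z changes.
    return policy_net(torch.cat([state, z], dim=-1))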
Performance improvement during pre-training
Meta Motivo is a behavioral foundation model trained on a SMPL-based humanoid [4] simulated with the Mujoco simulator [5] using a subset of the AMASS motion capture dataset [6] and 30 million online interaction samples.
The videos below illustrate the behaviors corresponding to one motion tracking task (a cartwheel motion), one pose reaching task (an arabesque pose), and one reward optimization task (running) at different stages of the pre-training process. Despite the model not being explicitly trained to optimize any of these tasks, we see the performance improving during training and more human-like behaviors emerge.
Evaluation Results
For evaluation, we have developed a new humanoid benchmark including motions to track, stable poses to reach, and reward functions to optimize. We consider several different baselines including 1) methods that are retrained for each task separately; 2) behavioral foundation models and model-based algorithms. We are releasing the code with the specification files needed to use the simulator and evaluate the model performance on the tasks that are used in the paper [7].
Quantitative
Our model achieves between 61% and 88% of the performance of top-line methods retrained for each task, while outperforming all other algorithms except on tracking, where it is second best behind Goal-TD3, which cannot be used for reward-based tasks.
[Figure: quantitative evaluation results]
Qualitative
To further analyze the performance gap between Meta Motivo and single-task TD3 on reward-based and goal-based tasks, we ran a human evaluation to qualitatively assess the human-likeness of the learned behaviors. This evaluation reveals that policies purely optimized for performance (TD3) produce much less natural behaviors than Meta Motivo, which better trades off raw performance and natural-looking behavior.
[Figure: qualitative (human) evaluation results]
Understanding the behavioral latent space
A crucial aspect of our new algorithm is that it uses a single representation to embed states, rewards, and motions in the same latent space. We then investigated the structure of this learned behavioral latent space.
[Figure: visualization of the behavioral latent space]
In the image above, we visualize the embedding of motions classified by their activity (e.g., jumping, running, crawling) and reward-based tasks. Not only does the representation capture semantically similar motions in similar clusters, but it creates a latent space where rewards and motions are well aligned.
Limitations
Meta Motivo is our first attempt to train behavioral foundation models with zero-shot capabilities across several different prompt types. While the model achieved strong quantitative and qualitative results, it still suffers from several limitations.
Fast movements and motions on the ground are poorly tracked. The model also exhibits unnatural jittering.

###
https://huggingface.co/papers/2412.06559
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Published on Dec 10 · Submitted by chujiezheng on Dec 10 · #2 Paper of the day
Authors: Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Abstract
As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated the critique capability competitive with the proprietary model GPT-4o, despite that it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
Chain of Thought (CoT) is a key ingredient behind models like OpenAI o1. In CoT, the LLM breaks down complex problems into a series of intermediate reasoning steps, mimicking human thought processes and leading to more accurate and reliable solutions. But how can we make sure that those intermediate steps are correct? 👀
ProcessBench from Qwen is a new benchmark for identifying erroneous steps in mathematical reasoning. It consists of 3,400 test cases with step-by-step solutions to competition- and Olympiad-level math problems, with error locations annotated by human experts. ProcessBench can be used to evaluate Process Reward Models (PRMs) or LLMs as judges (critic models).
Insights:
📈 We need more complex and challenging data to improve PRMs
👀 Surprisingly, general LLMs (when prompted as critics) outperform specialized PRMs at error detection
📉 Existing Process Reward Models (PRMs) struggle with complex problems
💪🏻 Fine-tuning PRMs on larger and more complex data boosts their performance
🏆 QwQ-32B-Preview matches GPT-4o as critic model
🤗 PRM models will be released soon on Hugging Face
Excerpt:
Process Reward Models (PRMs) can be used in RLHF to score the quality of each intermediate step in a CoT, helping ensure that the reasoning process is accurate. However, creating PRMs is complex: it requires high-quality data, and the resulting models are challenging to evaluate effectively.
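To make the critic-model setup concrete, the sketch below prompts a general LLM to return the index of the earliest erroneous step for a ProcessBench-style test case and scores predictions against the human annotation. The prompt wording, field names, the "-1 means all steps correct" convention, and the simple exact-match aggregate are illustrative assumptions, not the benchmark's official evaluation script.

# Hedged sketch of an LLM-as-critic evaluation on ProcessBench-style cases.
# Prompt wording and field names are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()

def critique(problem, steps, model="gpt-4o"):
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps))
    prompt = (
        "Here is a math problem and a step-by-step solution.\n"
        f"Problem: {problem}\n{numbered}\n"
        "Identify the index of the earliest incorrect step. "
        "If every step is correct, answer -1. Reply with the index only."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"-?\d+", reply)
    return int(match.group()) if match is not None else None

def accuracy(cases):
    # A prediction counts only if it matches the annotated earliest-error index exactly.
    hits = sum(critique(c["problem"], c["steps"]) == c["label"] for c in cases)
    return hits / len(cases)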

###
https://github.com/microsoft/markitdown
Microsoft
12/17/24

MarkItDown

The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)

It presently supports:

PDF (.pdf)
PowerPoint (.pptx)
Word (.docx)
Excel (.xlsx)
Images (EXIF metadata, and OCR)
Audio (EXIF metadata, and speech transcription)
HTML (special handling of Wikipedia, etc.)
Various other text-based formats (csv, json, xml, etc.)
ZIP (Iterates over contents and converts each file)

Installation
You can install markitdown using pip:

pip install markitdown
or from the source:

pip install -e .
Usage
The API is simple:

from markitdown import MarkItDown

markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
print(result.text_content)
To use this as a command-line utility, install it and then run it like this:

markitdown path-to-file.pdf
This will output Markdown to standard output. You can save it like this:

markitdown path-to-file.pdf > document.md
You can pipe content to standard input by omitting the argument:

cat path-to-file.pdf | markitdown
You can also configure markitdown to use Large Language Models to describe images. To do so, you must provide the llm_client and llm_model parameters to the MarkItDown object, according to your specific client.

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
You can also use the project as a Docker image:

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
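Because the same convert() call handles every supported format, batch conversion is just a short loop. An illustrative script (the folder name and extension filter are placeholders, not part of the library) might look like this:

# Illustrative batch conversion with MarkItDown (paths and filters are placeholders).
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
for path in Path("docs").iterdir():
    if path.suffix.lower() in {".pdf", ".docx", ".pptx", ".xlsx"}:
        result = md.convert(str(path))
        # Write the Markdown next to the source file.
        path.with_suffix(".md").write_text(result.text_content, encoding="utf-8")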
