On October 1, 2024, OpenAI unveiled its new Realtime API, which makes real-time speech-to-speech interaction faster and more natural to build. Alongside it, OpenAI announced Vision Fine-Tuning in the fine-tuning API, Prompt Caching, and Model Distillation, all aimed at improving performance and cost when building on GPT-4o. OpenAI also secured a new $4 billion credit facility on top of $6.6 billion in new funding, giving it substantially more financial flexibility for research and infrastructure expansion. Black Forest Labs announced FLUX1.1 [pro], a text-to-image model that delivers six times faster generation and higher image quality. Google Cloud introduced the public preview of Vertex AI Prompt Optimizer, which automates prompt optimization. The Beijing Academy of Artificial Intelligence (BAAI) announced Emu3, a new multimodal model, and Salesforce released SFR-RAG, a model built for retrieval-augmented generation (RAG).

OpenAI, Realtime API Announcement

Link, October 1, 2024

  • Realtime API overview: OpenAI announced the Realtime API, which enables real-time speech-to-speech interaction. It works much like ChatGPT's Advanced Voice Mode and supports low-latency, multimodal voice experiences.
  • Technical implementation: The API is built around a persistent WebSocket connection for continuous message exchange with GPT-4o. Its core feature is streaming audio input and output for natural conversation, and it automatically detects and handles interruptions when the user cuts in mid-conversation (a minimal connection sketch follows below).
  • Use cases: Early partners such as Healthify and Speak have used the Realtime API to build natural AI-driven voice interactions. Healthify, for example, supports conversations with the AI coach in its nutrition and fitness app, while Speak powers a voice role-play feature in its language-learning application.
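A minimal sketch of opening a Realtime API session over WebSocket, assuming the third-party `websockets` package; the endpoint URL, headers, and event names follow OpenAI's public documentation at announcement time and should be treated as assumptions here, with audio streaming and interruption handling omitted.

python

# Sketch only: connect to the Realtime API and request a single text response.
# Assumptions: endpoint URL, headers, and event names as documented at launch.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer releases of `websockets` name this parameter `additional_headers`.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Ask the model to respond; a real app would stream audio frames instead.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello."},
        }))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))
            if event.get("type") == "response.done":
                break

asyncio.run(main())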

OpenAI, Vision Added to the Fine-Tuning API

Link, October 1, 2024

  • Vision fine-tuning overview: This extends fine-tuning on GPT-4o so the model can be tuned on images as well as text. Developers can combine image and text data to fine-tune the model, significantly improving performance on tasks such as visual recognition, object detection, and visual search.
  • Technical implementation: The process mirrors text fine-tuning: developers prepare image datasets in the required format and upload them to the platform. Performance can improve with as few as 100 images, and larger volumes of text and image data drive even higher gains (a data-format and job-creation sketch follows below).
  • Use cases: Grab used vision fine-tuning on street-level imagery to automate its mapping data, improving lane-count accuracy by 20% and speed-limit-sign localization by 13% over base GPT-4o. Automat raised the success rate of its UI-element-locating RPA agent from 16.60% to 61.67% (a 272% uplift) with a dataset of screenshots, and gained 7% in F1 on insurance-document extraction with just 200 images.
  • Pricing: Vision fine-tuning is available on GPT-4o (gpt-4o-2024-08-06), with 1M free training tokens per day through October 31, 2024.
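A short sketch, assuming the official `openai` Python SDK: write one training example in the image-capable message format (the same example OpenAI shows in the source article below), upload it, and start a fine-tuning job on the GPT-4o snapshot.

python

# Sketch: prepare a vision fine-tuning dataset (JSONL) and create a job.
import json
from openai import OpenAI

client = OpenAI()

example = {
    "messages": [
        {"role": "system", "content": "You are an assistant that identifies uncommon cheeses."},
        {"role": "user", "content": "What is this cheese?"},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"}}
        ]},
        {"role": "assistant", "content": "Danbo"},
    ]
}
with open("vision_train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")  # one JSON object per line; repeat for ~100+ examples

training_file = client.files.create(file=open("vision_train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-4o-2024-08-06")
print(job.id)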

OpenAI, Prompt Caching in the API

Link, October 1, 2024

  • Prompt Caching overview: OpenAI introduced Prompt Caching, which reuses recently seen prompt prefixes across API calls to cut processing cost and latency. Cached input tokens are billed at a 50% discount relative to uncached input tokens.
  • Technical implementation: Prompt Caching is applied automatically to prompts longer than 1,024 tokens; the API caches the longest previously computed prefix, starting at 1,024 tokens and growing in 128-token increments. It is supported on the latest GPT-4o, GPT-4o mini, o1-preview, and o1-mini models, including fine-tuned versions (see the usage-inspection sketch below).
  • Use cases: Applications with long conversations or prompts that share a common prefix see faster prompt processing and lower cost, with no changes to the API integration required.
  • Pricing: Cached input tokens are billed at 50% of the uncached input-token price; output tokens are billed at the normal rate.
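A small sketch, assuming the official `openai` Python SDK: send two requests that share a long prefix so the second can hit the cache, then read `cached_tokens` from the usage field (the field shown in the announcement's usage example); the access pattern is hedged in case the field is absent for a given model.

python

# Sketch: reuse an identical long prefix and inspect the cached-token count.
from openai import OpenAI

client = OpenAI()
long_system_prompt = "You are a support agent. " * 200  # shared prefix past the 1,024-token threshold

for question in ["How do I reset my password?", "How do I change my email?"]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": long_system_prompt},  # identical prefix across calls
            {"role": "user", "content": question},
        ],
    )
    details = response.usage.prompt_tokens_details
    print(question, "cached_tokens:", getattr(details, "cached_tokens", 0))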

OpenAI, Model Distillation in the API

Link, October 1, 2024

  • Model Distillation overview: OpenAI announced Model Distillation, which fine-tunes a smaller, cost-efficient model on the outputs of a large frontier model, allowing the smaller model to approach the larger model's performance on specific tasks.
  • Technical implementation: The Stored Completions feature captures and stores input-output pairs from a large model such as GPT-4o or o1-preview, which can then be used as training data to fine-tune a smaller model. The Evals feature (beta) automatically evaluates model performance, so the results of the fine-tuning process can be monitored continuously (a Stored Completions sketch follows below).
  • Use cases: Distillation transfers the performance of a large model to a small one for cost-efficient operation; applied to a small model such as GPT-4o mini, it optimizes the cost-to-performance ratio.
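A minimal sketch of opting a call into Stored Completions with the `store` flag, following the example in OpenAI's announcement (reproduced in the sources below); the metadata keys are illustrative tags for later filtering.

python

# Sketch based on OpenAI's example: store a GPT-4o completion so it can later be
# filtered and tagged into a distillation dataset for fine-tuning GPT-4o mini.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "what's the capital of the USA?"}],
    store=True,  # opt this completion into Stored Completions
    metadata={"username": "user123", "user_id": "123", "session_id": "123"},  # illustrative tags
)
print(response.choices[0].message.content)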

OpenAI, New $4 Billion Credit Facility Secured

Link, October 3, 2024

  • Credit facility: Alongside $6.6 billion in new funding, OpenAI established a new $4 billion revolving credit facility (undrawn at closing), giving it access to over $10 billion in liquidity for AI research and infrastructure expansion.
  • Significance: The facility gives OpenAI greater financial flexibility to fund AI research and development and to invest more aggressively in infrastructure expansion and new initiatives.

Black Forest Labs, FLUX1.1 [pro] Release

Link, October 2, 2024

  • FLUX1.1 [pro] overview: Black Forest Labs announced FLUX1.1 [pro], which generates images six times faster than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and output diversity.
  • Technical implementation: FLUX1.1 [pro] is a text-to-image model tuned for a better trade-off between generation speed and image quality. Tested in the Artificial Analysis image arena under the codename "blueberry", it achieved the highest overall Elo score on the leaderboard.
  • API support: FLUX1.1 [pro] is available through the beta BFL API at $0.04 per image, and also via Together.ai, Replicate, fal.ai, and Freepik (a request sketch follows below).
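A rough sketch of calling the BFL API with Python `requests`; the endpoint paths, the `x-key` header, and the payload/response field names are assumptions drawn from the docs.bfl.ml documentation referenced in the announcement and may differ, so treat this as illustrative only.

python

# Illustrative only: submit a FLUX1.1 [pro] generation request and poll for the result.
# Endpoint paths, headers, and field names are assumptions; see docs.bfl.ml.
import os
import time

import requests

API = "https://api.bfl.ml"
headers = {"x-key": os.environ["BFL_API_KEY"]}

task = requests.post(
    f"{API}/v1/flux-pro-1.1",
    headers=headers,
    json={"prompt": "a misty Black Forest valley at sunrise", "width": 1024, "height": 768},
).json()

while True:
    result = requests.get(f"{API}/v1/get_result", headers=headers, params={"id": task["id"]}).json()
    if result.get("status") == "Ready":
        print(result["result"]["sample"])  # URL of the generated image (assumed field name)
        break
    time.sleep(1)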

Google Cloud, Vertex AI Prompt Optimizer Public Preview

Link, September 27, 2024

  • Vertex AI Prompt Optimizer overview: Google Cloud announced the public preview of Vertex AI Prompt Optimizer, which automatically optimizes prompts (instructions and demonstrations) for any preferred model on Vertex AI.
  • Technical implementation: Prompt Optimizer automates migrating or optimizing prompts from one LLM to another. It is based on Google Research's automatic prompt optimization (APO) method accepted at NeurIPS 2024 and runs an iterative, LLM-based loop in which an optimizer model generates candidate instructions and an evaluator model scores them against the evaluation metrics the user wants to optimize (a conceptual sketch of this loop follows below).
  • Use cases: AdVon Commerce used the feature with Gemini 1.5 Flash to automate populating product attributes and product-page content, a roughly 100x productivity gain, and Augmedix improved LLM output quality scores for medical note generation from 66% to 86%.
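A conceptual sketch of the propose-and-evaluate loop described above, not the Vertex AI SDK; the candidate-generation and scoring functions are hypothetical stand-ins for the optimizer and evaluator models.

python

# Conceptual APO-style loop (not the Vertex AI Prompt Optimizer API):
# an "optimizer" proposes instruction variants, an "evaluator" scores them on
# labeled examples, and the best-scoring instruction is kept.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)

def optimize_prompt(
    seed_instruction: str,
    examples: List[Example],
    propose: Callable[[str], List[str]],            # stand-in for the optimizer model
    score: Callable[[str, List[Example]], float],   # stand-in for the evaluator model/metric
    rounds: int = 3,
) -> str:
    best, best_score = seed_instruction, score(seed_instruction, examples)
    for _ in range(rounds):
        for candidate in propose(best):
            s = score(candidate, examples)
            if s > best_score:
                best, best_score = candidate, s
    return best

# Toy stand-ins so the sketch runs end to end.
def propose(instruction: str) -> List[str]:
    return [instruction + " Answer concisely.", instruction + " Cite the context."]

def score(instruction: str, examples: List[Example]) -> float:
    return float(len(instruction) % 7)  # placeholder metric; real use: a model-based eval

print(optimize_prompt("Summarize the document.", [("doc", "summary")], propose, score))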

Beijing Academy of Artificial Intelligence, Emu3 Announcement

Link, September 30, 2024

  • Emu3 overview: BAAI announced Emu3, an 8B multimodal model that handles text, images, and video within a single model. It uses plain next-token prediction, treating every data type as part of one token sequence.
  • Technical implementation: Emu3 uses a dedicated tokenizer (SBER-MoVQGAN) to convert images and video clips into sequences of 4,096 discrete tokens, which are then concatenated with text tokens into a single sequence and predicted autoregressively, with no separate per-modality processing (a conceptual sketch follows below).
  • Results: Emu3 matches image-generation models such as SDXL and is the first to achieve video generation without relying on complicated diffusion techniques.
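A conceptual sketch (not the Emu3 code) of the unified next-token objective: text tokens and discrete visual tokens share one vocabulary and one causal language-modeling loss. Vocabulary sizes, special tokens, and the tiny model below are illustrative.

python

# Conceptual sketch: unified next-token prediction over text + image tokens.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000           # hypothetical text vocabulary size
IMAGE_VOCAB = 4_096           # discrete visual tokens, as described for Emu3
BOI, EOI = TEXT_VOCAB + IMAGE_VOCAB, TEXT_VOCAB + IMAGE_VOCAB + 1  # begin/end-of-image markers
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + 2

# Text tokens keep their ids; image tokens are offset into the shared vocabulary.
text_ids = torch.randint(0, TEXT_VOCAB, (12,))                  # stand-in for a caption
image_ids = torch.randint(0, IMAGE_VOCAB, (64,)) + TEXT_VOCAB   # stand-in for a tokenized image
sequence = torch.cat([text_ids, torch.tensor([BOI]), image_ids, torch.tensor([EOI])])

class TinyCausalLM(nn.Module):
    """A tiny causal LM; in practice this is a large decoder-only transformer."""
    def __init__(self, vocab, dim=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        x = self.embed(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.encoder(x, mask=mask))

model = TinyCausalLM(VOCAB)
inputs, targets = sequence[:-1].unsqueeze(0), sequence[1:].unsqueeze(0)
logits = model(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
loss.backward()  # the same next-token objective covers text, image, and video tokens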

Salesforce, SFR-RAG Model Announcement

Link, September 30, 2024

  • SFR-RAG overview: Salesforce announced SFR-RAG, a 9B-parameter LLM trained with an emphasis on contextual comprehension for retrieval-augmented generation (RAG), specializing in faithful use of retrieved context and multi-hop question answering.
  • Technical implementation: SFR-RAG is fine-tuned (with DPO) using a new extended chat template that adds Thought and Observation roles to control how retrieved context is used, and is trained on data covering multi-hop questions, reliable citations, and hallucination minimization, including identifying unanswerable questions (an illustrative template follows below). Salesforce also introduced ContextualBench, a compilation of seven contextual question-answering benchmarks for evaluating LLMs in RAG settings.
  • Results: Across these RAG benchmarks it matches much larger models such as OpenAI GPT-4o and Cohere Command-R+ (104B), showing that a small, task-specific model can stand in for far larger ones.
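An illustrative rendering of an extended chat template with Thought and Observation roles, in the spirit of what the article describes; the exact SFR-RAG template has not been released, so the field contents and formatting here are assumptions.

python

# Hypothetical illustration of Thought/Observation roles in a RAG chat template.
messages = [
    {"role": "system", "content": "Answer using only the retrieved context. Cite sources."},
    {"role": "user", "content": "Which year was the university attended by the author of X founded?"},
    {"role": "thought", "content": "Find the author of X, then their university, then its founding year."},
    {"role": "observation", "content": "[doc 1] X was written by ... [doc 2] The author studied at ..."},
    {"role": "assistant", "content": "It was founded in ... [doc 2]."},
]

def render(messages):
    """Render the conversation into a single training/inference string (illustrative format)."""
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)

print(render(messages))
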
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://openai.com/devday/
OpenAI DevDay
Oct 1, 2024
OpenAI

San Francisco, London, and Singapore
We're bringing developers together to explore new tools and exchange ideas.
Product announcements

Introducing the Realtime API (Oct 1, 2024)
Introducing vision to the fine-tuning API (Oct 1, 2024)
Prompt Caching in the API (Oct 1, 2024)
Model Distillation in the API (Oct 1, 2024)
Documentation
Explore guides to start building

Realtime API: Developers can now build fast speech-to-speech experiences into their applications.
Vision Fine-Tuning: Developers can now fine-tune GPT-4o with images and text to improve vision capabilities.
Prompt Caching: Get automatic discounts on inputs that the model has recently seen.
Distillation: Fine-tune a cost-efficient model with the outputs of a large frontier model.

###
https://openai.com/index/introducing-the-realtime-api/
OpenAI
October 1, 2024

Introducing the Realtime API
Developers can now build fast speech-to-speech experiences into their applications

Today, we're introducing a public beta of the Realtime API, enabling all paid developers to build low-latency, multimodal experiences in their apps. Similar to ChatGPT’s Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices(opens in a new window) already supported in the API.

We’re also introducing audio input and output in the Chat Completions API(opens in a new window) to support use cases that don’t require the low-latency benefits of the Realtime API. With this update, developers can pass any text or audio inputs into GPT-4o and have the model respond with their choice of text, audio, or both.

From language apps and educational software to customer support experiences, developers have already been leveraging voice experiences to connect with their users. Now with the Realtime API and soon with audio in the Chat Completions API, developers no longer have to stitch together multiple models to power these experiences. Instead, you can build natural conversational experiences with a single API call.

How it works
Previously, to create a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model like Whisper, pass the text to a text model for inference or reasoning, and then play the model’s output using a text-to-speech(opens in a new window) model. This approach often resulted in loss of emotion, emphasis and accents, plus noticeable latency. With the Chat Completions API, developers can handle the entire process with a single API call, though it remains slower than human conversation. The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.

Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling(opens in a new window), which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context. For example, a voice assistant could place an order on behalf of the user or retrieve relevant customer information to personalize its responses.

Powering customer support agents, language learning assistants, and more
As part of our iterative deployment strategy, we’ve been testing the Realtime API with a handful of partners to gather feedback while we build. A couple of promising early use cases include:


Healthify, a nutrition and fitness coaching app, uses the Realtime API to enable natural conversations with its AI coach Ria, while involving human dietitians when needed for personalized support.


Speak, a language learning app, uses Realtime API to power its role-play feature, encouraging users to practice conversations in a new language.

Availability & pricing
The Realtime API will begin rolling out today in public beta to all paid developers. Audio capabilities in the Realtime API are powered by the new GPT-4o model gpt-4o-realtime-preview.

Audio in the Chat Completions API will be released in the coming weeks, as a new model gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.

The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.

Safety & privacy
The Realtime API uses multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs. The Realtime API is built on the same version of GPT-4o that powers Advanced Voice Mode in ChatGPT, which we carefully assessed using both automated and human evaluations, including evaluations according to our Preparedness Framework, detailed in the GPT-4o System Card. The Realtime API also leverages the same audio safety infrastructure we built for Advanced Voice Mode, which our testing shows has helped to reduce the potential for harm.

It is against our usage policies to repurpose or distribute output from our services to spam, mislead, or otherwise harm others – and we actively monitor for potential abuse. Our policies also require developers to make it clear to their users that they are interacting with AI, unless it's obvious from the context.

Prior to launch, we tested the Realtime API with our external red teaming network and found that the Realtime API didn’t introduce any high-risk gaps not covered by our existing mitigations. As with all API services, the Realtime API is subject to our Enterprise privacy commitments. We do not train our models on the inputs or outputs used in this service without your explicit permission.

Getting started
Developers can start building with the Realtime API over the coming days in the Playground(opens in a new window), or by using our docs(opens in a new window) and the reference client(opens in a new window).

We’ve also worked with LiveKit(opens in a new window) and Agora(opens in a new window) to create client libraries of audio components like echo cancellation, reconnection, and sound isolation, and Twilio(opens in a new window) to integrate the Realtime API with Twilio’s Voice APIs(opens in a new window) which enable developers to seamlessly build, deploy and connect AI virtual agents to customers via voice calls.

What’s next
As we work towards general availability, we’re actively collecting feedback to improve the Realtime API. Some of the capabilities we plan to introduce include:

More modalities: To start, the Realtime API will support voice, and we plan to add additional modalities like vision and video over time.

Increased rate limits: Today the API is rate limited to approximately 100 simultaneous sessions for Tier 5 developers, with lower limits for Tiers 1-4. We will increase these limits over time to support larger deployments.

Official SDK support: We will integrate support for Realtime API into the OpenAI Python and Node.js SDKs.

Prompt Caching: We will add support for Prompt Caching(opens in a new window) so previous conversation turns can be reprocessed at a discount.

Expanded model support: The Realtime API will also support GPT-4o mini in upcoming versions of that model.

We're looking forward to seeing how developers leverage these new capabilities to create compelling new audio experiences for their users across a variety of use cases from education to translation, customer service, accessibility and beyond.

###
https://openai.com/index/introducing-vision-to-the-fine-tuning-api/
OpenAI
October 1, 2024

Introducing vision to the fine-tuning API
Developers can now fine-tune GPT-4o with images and text to improve vision capabilities

Today, we’re introducing vision fine-tuning(opens in a new window) on GPT-4o1, making it possible to fine-tune with images, in addition to text. Developers can customize the model to have stronger image understanding capabilities which enables applications like enhanced visual search functionality, improved object detection for autonomous vehicles or smart cities, and more accurate medical image analysis.

Since we first introduced fine-tuning on GPT-4o, hundreds of thousands of developers have customized our models using text-only datasets to improve performance on specific tasks. However, for many cases, fine-tuning models on text alone doesn’t provide the performance boost expected.

How it works
Vision fine-tuning follows a similar process to fine-tuning with text—developers can prepare their image datasets to follow the proper format(opens in a new window) and then upload that dataset to our platform. They can improve the performance of GPT-4o for vision tasks with as few as 100 images, and drive even higher performance with larger volumes of text and image data.

json

{
  "messages": [
    { "role": "system", "content": "You are an assistant that identifies uncommon cheeses." },
    { "role": "user", "content": "What is this cheese?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
          }
        }
      ]
    },
    { "role": "assistant", "content": "Danbo" }
  ]
}

We collaborated with a small group of trusted partners to better understand the real-world applications of vision fine-tuning. We’re excited to showcase a few examples of what they built.

Grab improves image detection and understanding on the road
Grab(opens in a new window), a leading food delivery and rideshare company, turns street-level imagery collected from their drivers into mapping data used to power GrabMaps(opens in a new window), a mapping service enabling all of their Southeast Asia operations. Using vision fine-tuning with only 100 examples, Grab taught GPT-4o to correctly localize traffic signs and count lane dividers to refine their mapping data. As a result, Grab was able to improve lane count accuracy by 20% and speed limit sign localization by 13% over a base GPT-4o model, enabling them to better automate their mapping operations from a previously manual process.

Example of a speed limit sign tagged successfully by a vision fine-tuned GPT-4o model that was incorrectly tagged by the GPT-4o base model.

Automat improves success rate of desktop bots automating business processes
Automat(opens in a new window), an enterprise automation company, builds desktop and web agents that process documents and take UI-based actions to automate business processes. With vision fine-tuning and a dataset of screenshots, Automat trained GPT-4o to locate UI elements on a screen given a natural language description, improving the success rate of their RPA agent from 16.60% to 61.67%—a 272% uplift in performance compared to base GPT-4o. Additionally, Automat trained GPT-4o on just 200 images of unstructured insurance documents to achieve a 7% lift in F1 score on information extraction tasks.

Example of a desktop bot successfully identifying the center of UI elements via vision fine-tuning with website screenshots.

Coframe enhances quality of digital content creation
Coframe(opens in a new window) is building an AI growth engineering assistant that helps businesses continuously create and test variations of their websites and UIs to optimize business metrics. A key part of this task is autonomously generating new branded sections of a website, based on the rest of the website. Coframe tasked GPT-4o with generating code for the next section of a website based on images and existing code. By fine-tuning GPT-4o with images and code, they improved the model’s ability to generate websites with consistent visual style and correct layout by 26% compared to base GPT-4o.

Existing website that the model is meant to match.

Output with the GPT-4o base model.

Output with GPT-4o fine-tuned with vision and text, more closely matching the style of the page.

Safety & privacy
We continuously run automated safety evals on fine-tuned models and monitor usage to ensure applications adhere to our usage policies. As with all API services, vision fine-tuning is subject to our Enterprise privacy commitments. Fine-tuned models remain entirely under your control, with full ownership of your business data. We do not train our models on the inputs or outputs used in this service without your explicit permission.

Availability & pricing
Vision fine-tuning capabilities are available today for all developers on paid usage tiers(opens in a new window). These capabilities are supported on the latest GPT-4o model snapshot, gpt-4o-2024-08-06. Developers can extend existing fine-tuning training data for images using the same format as our Chat endpoints(opens in a new window).

We’re offering 1M training tokens per day for free through October 31, 2024 to fine-tune GPT-4o with images. After October 31, 2024, GPT-4o fine-tuning training will cost $25 per 1M tokens and inference will cost $3.75 per 1M input tokens and $15 per 1M output tokens. Image inputs are first tokenized based on image size, and then priced at the same per-token rate as text inputs. More details can be found on the API Pricing page.

To get started, visit the fine-tuning dashboard(opens in a new window), click ‘create’ and select gpt-4o-2024-08-06 from the base model drop-down. To learn how to fine-tune GPT-4o with images, visit our docs(opens in a new window).

###
https://openai.com/index/api-prompt-caching/
OpenAI

October 1, 2024

Prompt Caching in the API
Offering automatic discounts on inputs that the model has recently seen

Many developers use the same context repeatedly across multiple API calls when building AI applications, like when making edits to a codebase or having long, multi-turn conversations with a chatbot. Today, we’re introducing Prompt Caching, allowing developers to reduce costs and latency. By reusing recently seen input tokens, developers can get a 50% discount and faster prompt processing times.

Prompt Caching Availability & Pricing
Starting today, Prompt Caching is automatically applied on the latest versions of GPT-4o, GPT-4o mini, o1-preview and o1-mini, as well as fine-tuned versions of those models. Cached prompts are offered at a discount compared to uncached prompts.

Here's an overview of pricing:

Model                                    Uncached Input Tokens   Cached Input Tokens   Output Tokens
GPT-4o (gpt-4o-2024-08-06)               $2.50                   $1.25                 $10.00
GPT-4o fine-tuning                       $3.75                   $1.875                $15.00
GPT-4o mini (gpt-4o-mini-2024-07-18)     $0.15                   $0.075                $0.60
GPT-4o mini fine-tuning                  $0.30                   $0.15                 $1.20
o1 (o1-preview)                          $15.00                  $7.50                 $60.00
o1 mini                                  $3.00                   $1.50                 $12.00

(Prices are per 1M tokens.)

Monitoring Cache Usage
API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens. The API caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. If you reuse prompts with common prefixes, we will automatically apply the Prompt Caching discount without requiring you to make any changes to your API integration.

Requests using Prompt Caching have a cached_tokens value within the usage field in the API response:

javascript

usage: {
  total_tokens: 2306,
  prompt_tokens: 2006,
  completion_tokens: 300,
  prompt_tokens_details: {
    cached_tokens: 1920,
    audio_tokens: 0,
  },
  completion_tokens_details: {
    reasoning_tokens: 0,
    audio_tokens: 0,
  }
}
Caches are typically cleared after 5-10 minutes of inactivity and are always removed within one hour of the cache's last use. As with all API services, Prompt Caching is subject to our Enterprise privacy commitments. Prompt caches are not shared between organizations.

Prompt Caching is one of a variety of tools for developers to scale their applications in production while balancing performance, cost and latency. For more information, check out the Prompt Caching docs(opens in a new window).

###
https://openai.com/index/api-model-distillation/
OpenAI
October 1, 2024

Model Distillation in the API
Fine-tune a cost-efficient model with the outputs of a large frontier model–all on the OpenAI platform

We’re introducing a new Model Distillation offering to provide developers with an integrated workflow to manage the entire distillation pipeline directly within the OpenAI platform. This lets developers easily use the outputs of frontier models like o1-preview and GPT-4o to fine-tune and improve the performance of more cost-efficient models like GPT-4o mini.

Model distillation involves fine-tuning smaller, cost-efficient models using outputs from more capable models, allowing them to match the performance of advanced models on specific tasks at a much lower cost. Until now, distillation has been a multi-step, error-prone process, which required developers to manually orchestrate multiple operations across disconnected tools, from generating datasets to fine-tuning models and measuring performance improvements. Since distillation is inherently iterative, developers needed to repeatedly run each step, adding significant effort and complexity.

Our new Model Distillation suite includes:

Stored Completions(opens in a new window): Developers can now easily generate datasets for distillation by automatically capturing and storing the input-output pairs generated by one of our models, like GPT-4o or o1-preview through our API. With Stored Completions, you can easily build datasets with your production data to evaluate and fine-tune models. Developers can review this integration guide(opens in a new window) to learn how to opt-in to storing completions.

Evals(opens in a new window) (beta): Developers can now create and run custom evaluations on our platform to measure model performance on specific tasks. Instead of manually creating evaluation scripts and integrating disparate logging tools, Evals provides an integrated way to measure model performance. You can either use data from Stored Completions or upload existing datasets to set up your evaluations. Evals can also be used independently of fine-tuning to quantitatively evaluate model performance for your use cases.

Fine-tuning(opens in a new window): Stored Completions and Evals are fully integrated with our existing fine-tuning offering. This means that developers can use datasets created with Stored Completions in their fine-tuning jobs and run evaluations on fine-tuned models using Evals, all within our platform.

How to use Model Distillation
First, create an evaluation(opens in a new window) to measure the performance of the model you want to distill into, which in this example will be GPT-4o mini. This evaluation will be used to continuously test the distilled model’s performance, to help you decide whether to deploy it.

Example of evaluation used for model distillation
Next, use Stored Completions to create a distillation dataset of real-world examples using GPT-4o’s outputs for the tasks on which you want to fine-tune GPT-4o mini. You can do this by setting the ‘store:true’ flag in the Chat Completions API to automatically store these input-output pairs without any latency impact. These stored completions can be reviewed, filtered, and tagged to create high-quality datasets for fine-tuning or evaluation.

python

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "what's the capital of the USA?"
                }
            ]
        }
    ],
    store=True,
    metadata={"username": "user123", "user_id": "123", "session_id": "123"}
)
Finally, use this dataset to fine-tune GPT-4o mini. Stored Completions can be used as a training file when creating a fine-tuned model. Once the model is fine-tuned, you can go back to Evals to test whether the fine-tuned GPT-4o mini model meets your performance criteria when compared to GPT-4o.

Fine-tuning is an iterative process. If the initial results aren’t satisfactory, you may need to refine the dataset, adjust the training parameters, or capture more specific examples where the model is underperforming. The goal is to incrementally improve the distilled model until it performs well enough for production use.

Availability & Pricing
Model Distillation is available today to all developers and can be used to distill any of our models, including GPT-4o and o1-preview. As a reminder, we’re also offering 2M free training tokens per day on GPT-4o mini and 1M free training tokens per day on GPT-4o until October 31 to help developers get started with distillation. Beyond that limit, the cost of training and running a distilled model is the same as our standard fine-tuning prices, which you can find on our API pricing page.

Stored Completions is available for free. Evals, which are available in beta, are charged at standard model prices based on the tokens used. Through the end of the year, developers can run evaluations for free (up to 7 per week) when they opt in(opens in a new window) to share their Evals with OpenAI. Evals shared with us will be used to help us improve and evaluate our future models.

For more information, check out our Model Distillation docs(opens in a new window).

###
https://openai.com/index/introducing-canvas/
OpenAI
October 3, 2024

Introducing canvas
A new way of working with ChatGPT to write and code

The image shows a vertical toolbar featuring five icons arranged in a column on a soft pastel background. The third icon from the top, depicting an open book, is highlighted with a label next to it reading "Reading Level."
We’re introducing canvas, a new interface for working with ChatGPT on writing and coding projects that go beyond simple chat. Canvas opens in a separate window, allowing you and ChatGPT to collaborate on a project. This early beta introduces a new way of working together—not just through conversation, but by creating and refining ideas side by side.

Canvas was built with GPT-4o and can be manually selected in the model picker while in beta. Starting today we’re rolling out canvas to ChatGPT Plus and Team users globally. Enterprise and Edu users will get access next week. We also plan to make canvas available to all ChatGPT Free users when it’s out of beta.

Better collaboration with ChatGPT
People use ChatGPT every day for help with writing and code. Although the chat interface is easy to use and works well for many tasks, it’s limited when you want to work on projects that require editing and revisions. Canvas offers a new interface for this kind of work.

With canvas, ChatGPT can better understand the context of what you’re trying to accomplish. You can highlight specific sections to indicate exactly what you want ChatGPT to focus on. Like a copy editor or code reviewer, it can give inline feedback and suggestions with the entire project in mind.

You control the project in canvas. You can directly edit text or code. There’s a menu of shortcuts for you to ask ChatGPT to adjust writing length, debug your code, and quickly perform other useful actions. You can also restore previous versions of your work by using the back button in canvas.

Canvas opens automatically when ChatGPT detects a scenario in which it could be helpful. You can also include “use canvas” in your prompt to open canvas and use it to work on an existing project.

Writing shortcuts include:

Suggest edits: ChatGPT offers inline suggestions and feedback.

Adjust the length: Edits the document length to be shorter or longer.

Change reading level: Adjusts the reading level, from Kindergarten to Graduate School.

Add final polish: Checks for grammar, clarity, and consistency.

Add emojis: Adds relevant emojis for emphasis and color.



Coding in canvas
Coding is an iterative process, and it can be hard to follow all the revisions to your code in chat. Canvas makes it easier to track and understand ChatGPT’s changes, and we plan to continue improving transparency into these kinds of edits.

Coding shortcuts include:

Review code: ChatGPT provides inline suggestions to improve your code.

Add logs: Inserts print statements to help you debug and understand your code.

Add comments: Adds comments to the code to make it easier to understand.

Fix bugs: Detects and rewrites problematic code to resolve errors.

Port to a language: Translates your code into JavaScript, TypeScript, Python, Java, C++, or PHP.



Training the model to become a collaborator
We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, make targeted edits, and fully rewrite. It also understands broader context to provide precise feedback and suggestions.

To support this, our research team developed the following core behaviors:

Triggering the canvas for writing and coding

Generating diverse content types

Making targeted edits

Rewriting documents

Providing inline critique

We measured progress with over 20 automated internal evaluations. We used novel synthetic data generation techniques, such as distilling outputs from OpenAI o1-preview, to post-train the model for its core behaviors. This approach allowed us to rapidly address writing quality and new user interactions, all without relying on human-generated data.

A key challenge was defining when to trigger a canvas. We taught the model to open a canvas for prompts like “Write a blog post about the history of coffee beans” while avoiding over-triggering for general Q&A tasks like “Help me cook a new recipe for dinner.” For writing tasks, we prioritized improving “correct triggers” (at the expense of “correct non-triggers”), reaching 83% compared to a baseline zero-shot GPT-4o with prompted instructions.

It’s worth noting that the quality of such baselines is highly sensitive to the specific prompt used. With different prompts, the baseline may still perform poorly but in a different manner—for instance, by being evenly inaccurate across coding and writing tasks, resulting in a different distribution of errors and alternative forms of suboptimal performance. For coding, we intentionally biased the model against triggering to avoid disrupting our power users. We'll continue refining this based on user feedback.

[Chart: Canvas Decision Boundary Trigger - Writing & Coding; correct trigger and correct non-trigger rates for prompted GPT-4o vs. GPT-4o with canvas.]
For writing and coding tasks, we improved correctly triggering the canvas decision boundary, reaching 83% and 94% respectively compared to a baseline zero-shot GPT-4o with prompted instructions.

A second challenge involved tuning the model's editing behavior once the canvas was triggered—specifically deciding when to make a targeted edit versus rewriting the entire content. We trained the model to perform targeted edits when users explicitly select text through the interface, otherwise favoring rewrites. This behavior continues to evolve as we refine the model.

[Chart: Canvas Edits Boundary - Writing & Coding; full-rewrite vs. targeted-edit rates for prompted GPT-4o vs. GPT-4o with canvas.]
For writing and coding tasks, we prioritized improving canvas targeted edits. GPT-4o with canvas performs better than a baseline prompted GPT-4o by 18%.

Finally, training the model to generate high-quality comments required careful iteration. Unlike the first two cases, which are easily adaptable to automated evaluation with thorough manual reviews, measuring quality in an automated way is particularly challenging. Therefore, we used human evaluations to assess comment quality and accuracy. Our integrated canvas model outperforms the zero-shot GPT-4o with prompted instructions by 30% in accuracy and 16% in quality, showing that synthetic training significantly enhances response quality and behavior compared to zero-shot prompting with detailed instructions.

[Chart: Canvas Suggested Comments; comment-triggering correctness and quality of suggested comments for prompted GPT-4o vs. GPT-4o with canvas.]
Human evaluations assessed canvas comment quality and accuracy functionality. Our canvas model outperforms the zero-shot GPT-4o with prompted instructions by 30% in accuracy and 16% in quality.

What’s next
Making AI more useful and accessible requires rethinking how we interact with it. Canvas is a new approach and the first major update to ChatGPT’s visual interface since we launched two years ago.

Canvas is in early beta, and we plan to rapidly improve its capabilities.

###
https://huggingface.co/datasets/openai/MMMLU
Open Dataset release by OpenAI! 👀 OpenAI just released a Multilingual Massive Multitask Language Understanding (MMMLU) dataset on Hugging Face to more easily evaluate multilingual LLMs!
🌍 MMLU test set available in 14 languages, including Arabic, German, Spanish, French,….
🧠 Covers 57 categories from elementary to advanced professional subjects
🎓 translated by professional human translators
🔬 Evaluates AI models' general knowledge across diverse cultures, used in openai/simple-evals
🤔 License unclear
Multilingual Massive Multitask Language Understanding (MMMLU)
The MMLU is a widely recognized benchmark of general knowledge attained by AI models. It covers a broad range of topics from 57 different categories, covering elementary-level knowledge up to advanced professional subjects like law, physics, history, and computer science.

We translated the MMLU’s test set into 14 languages using professional human translators. Relying on human translators for this evaluation increases confidence in the accuracy of the translations, especially for low-resource languages like Yoruba. We are publishing the professional human translations and the code we use to run the evaluations.

This effort reflects our commitment to improving the multilingual capabilities of AI models, ensuring they perform accurately across languages, particularly for underrepresented communities. By prioritizing high-quality translations, we aim to make AI technology more inclusive and effective for users worldwide.

Locales
MMMLU contains the MMLU test set translated into the following locales:

AR_XY (Arabic)
BN_BD (Bengali)
DE_DE (German)
ES_LA (Spanish)
FR_FR (French)
HI_IN (Hindi)
ID_ID (Indonesian)
IT_IT (Italian)
JA_JP (Japanese)
KO_KR (Korean)
PT_BR (Brazilian Portuguese)
SW_KE (Swahili)
YO_NG (Yoruba)
ZH_CH (Simplified Chinese)

###
https://openai.com/index/new-credit-facility-enhances-financial-flexibility/
October 3, 2024

New Credit Facility Enhances Financial Flexibility
In addition to securing $6.6 billion in new funding from leading investors, we have established a new $4 billion credit facility with JPMorgan Chase, Citi, Goldman Sachs, Morgan Stanley, Santander, Wells Fargo, SMBC, UBS, and HSBC. This is a revolving credit facility that is undrawn at closing.

This means we now have access to over $10 billion in liquidity, which gives us the flexibility to invest in new initiatives and operate with full agility as we scale. It also reaffirms our partnership with an exceptional group of financial institutions, many of whom are also OpenAI customers.

“This credit facility further strengthens our balance sheet and provides flexibility to seize future growth opportunities,” said Sarah Friar, CFO of OpenAI. “We are proud to have the strongest banks and investors in the world supporting us.”

The support of our investors and financial partners enables us to continue investing in groundbreaking research and products that bring AI to the world, expand our infrastructure to meet growing demand, and attract top talent from around the world. As we embark on this next phase, we remain focused on delivering helpful tools that contribute to people’s lives.

October 2, 2024

New funding to scale the benefits of AI
We are making progress on our mission to ensure that artificial general intelligence benefits all of humanity. Every week, over 250 million people around the world use ChatGPT to enhance their work, creativity, and learning. Across industries, businesses are improving productivity and operations, and developers are leveraging our platform to create a new generation of applications. And we’re only getting started.

We’ve raised $6.6B in new funding at a $157B post-money valuation to accelerate progress on our mission. The new funding will allow us to double down on our leadership in frontier AI research, increase compute capacity, and continue building tools that help people solve hard problems.

We aim to make advanced intelligence a widely accessible resource. We’re grateful to our investors for their trust in us, and we look forward to working with our partners, developers, and the broader community to shape an AI-powered ecosystem and future that benefits everyone. By collaborating with key partners, including the U.S. and allied governments, we can unlock this technology's full potential.

###
https://blackforestlabs.ai/announcing-flux-1-1-pro-and-the-bfl-api/
FLUX just got a big update! Black Forest Labs released FLUX1.1 [pro] an improved text-to-image model! 🤯
FLUX1.1 [pro]:
💡 6x faster than its predecessor and boasts enhanced image quality.
🏆 Achieves highest Elo score on artificialanalysis leaderboard
🛠️ Supported via BFL API with $0.04/img.
🤝 Available also through together.ai, Replicate, fal.ai, and Freepik
🤔 No mentions if it comes to Grok and X (prev. Twitter)
😕 Only [pro] updates, no weights for [dev] or [schnell]

Announcing FLUX1.1 [pro] and the BFL API
Oct 2, 2024


by

BlackForestLabs
in Uncategorized
Today, we release FLUX1.1 [pro], our most advanced and efficient model yet, alongside the general availability of the beta BFL API. This release marks a significant step forward in our mission to empower creators, developers, and enterprises with scalable, state-of-the-art generative technology.

FLUX1.1 [pro]: Faster & Better

FLUX1.1 [pro] provides six times faster generation than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and diversity. At the same time, we updated FLUX.1 [pro] to generate the same output as before, but two times faster.

Superior Speed and Efficiency: Faster generation times and reduced latency, enabling more efficient workflows. FLUX1.1 [pro] provides an ideal tradeoff between image quality and inference speed. FLUX1.1 [pro] is three times faster than the currently available FLUX.1 [pro].
Improved Performance: FLUX1.1 [pro] has been introduced and tested under the codename “blueberry” into the Artificial Analysis image arena (https://artificialanalysis.ai/text-to-image), a popular benchmark for text-to-image models. It surpasses all other models on the leaderboard, achieving the highest overall Elo score.

All metrics from artificialanalysis.ai as of Oct 1, 2024.



All metrics from artificialanalysis.ai as of Oct 1, 2024, except FLUX.1 inference speeds (benchmarked internally).

Fast High-res coming soon: FLUX1.1 [pro], natively set up for fast ultra high-resolution generation coming soon to the API. Generate up to 2k images without sacrificing any of the prompt following.
We are excited to announce that FLUX1.1 [pro] will also be available through Together.ai, Replicate, fal.ai, and Freepik.

Building with the BFL API
Our new beta BFL API brings FLUX’s capabilities directly to developers and businesses looking to integrate state-of-the-art image generation into their own applications. Our API stands out with key advantages over competitors:

Advanced Customization: Tailor the API outputs to your specific needs with customization options on model choice, image resolution, and content moderation.
Scalability: Seamlessly scale your applications, whether you are building small projects or enterprise-level applications.
Competitive pricing: The API offers superior image quality at a lower cost. The pricing for our FLUX.1 model suite is as follows:
FLUX.1 [dev]: 2.5 cts/img
FLUX.1 [pro]: 5 cts/img
FLUX1.1 [pro]: 4 cts/img
Get started with the BFL API today at: docs.bfl.ml.

We are eager to see the creative applications that will emerge from users of the BFL API.

###
https://cloud.google.com/blog/products/ai-machine-learning/announcing-vertex-ai-prompt-optimizer?hl=en
AI & Machine Learning
Announcing Public Preview of Vertex AI Prompt Optimizer
September 27, 2024
George Lee
Product Manager, Cloud AI Research

Ivan Nardini
Developer Relations Engineer

Prompt design and engineering stands out as one of the most approachable methods to drive meaningful output from a Large Language Model (LLM). ​​However, prompting large language models can feel like navigating a complex maze. You must experiment with various combinations of instructions and examples to achieve the desired output. Moreover, even if you find the ideal prompt template, there is no guarantee that it will continue to deliver optimal results for a different LLM.

Migrating or translating prompts from one LLM to another is challenging because different language models behave differently. Simply reusing prompts is ineffective, so users need an intelligent prompt optimizer to generate useful outputs.

To help mitigate the "prompt fatigue" experienced by users while they build LLM-based applications, we are announcing Vertex AI Prompt Optimizer in Public Preview.

What is Vertex AI Prompt Optimizer?
Vertex AI Prompt Optimizer helps you find the best prompt (instruction and demonstrations) for any preferred model on Vertex AI. It is based on Google Research’s publication (accepted by NeurIPS 2024) on automatic prompt optimization (APO) methods, and employs an iterative LLM-based optimization algorithm where the optimizer model [responsible for generating paraphrased instructions] and evaluator model [responsible for evaluating the selected instruction and demonstration] work together to generate and evaluate candidate prompts. Prompt Optimizer subsequently selects the best instructions and demonstrations based on the evaluation metrics the user wants to optimize against. Instructions include the system instruction, context, and task of your prompt template. Demonstrations are the few-shot examples you provide in your prompt to elicit a specific style or tone from the model response.

With just a few labeled examples and configured optimization settings, Vertex AI Prompt Optimizer finds the best prompt (instruction and demonstrations) for the target model and removes the need for manually optimizing existing prompts every time for a new LLM. You can now easily craft a new prompt for a particular task or translate a prompt from one model to another model on Vertex AI. Here are the key characteristics:

Easy optimization: Quickly optimize prompts for any target Google model, including migration and translation of prompts from any source model.

Versatile task handling: Accommodates any text-based task (such as question answering, summarization, classification, and entity extraction); expanded support for multimodal tasks is coming soon.

Comprehensive evaluation: Supports a wide array of evaluation metrics, including model-based, computation-based, and custom metrics, to ensure optimal prompt performance against the metrics you care about.

Flexible and customizable: Tailor the optimization process and latency with advanced settings and utilize various notebook versions to suit your expertise level and needs.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_dBmMX6I.gif
Why use Vertex AI Prompt Optimizer?
Data-driven optimization: Many existing prompt optimization tools focus on tailoring your prompts to your preferred style and tone and oftentimes still require human verification. However, Vertex AI Prompt Optimizer goes beyond this by optimizing your prompts based on specific evaluation metrics, ensuring the best possible performance for your target model.

Built for Gemini: If you’re using Gemini, Vertex AI Prompt Optimizer is designed to keep Gemini’s underlying characteristics in mind. It is specifically designed to adapt to the unique attributes of the Gemini and other Google models. This tailored approach allows you to unlock the full potential of Gemini and achieve superior results.

Getting started with Vertex AI Prompt Optimizer
To start using Vertex AI Prompt Optimizer, you can use the Colab notebook available in the Google Cloud Generative AI repository on Github which contains sample code and notebooks for Generative AI on Google Cloud. Refer to the UI version for basic settings and the SDK version for more advanced settings. More versions of the notebook to support custom metrics and multimodal input will be added in the coming weeks. You can also access it via the Vertex AI Studio console. Look for entry points in the console that indicate “prompt optimizer” or “optimizer your prompt further” (refer to screencasts below).

To either optimize or translate prompts using Vertex AI Prompt Optimizer, follow these steps:

Configure your prompt template

Input your data (labeled examples)

Configure your optimization settings (target model, evaluation metrics, etc.)

Run the optimization job

Inspect the results

Vertex AI Prompt Optimizer supports any Google models and evaluation metrics supported by the Generative AI Evaluation Service.

Entry points from Vertex AI Studio to Vertex AI Prompt Optimizer Colab Enterprise Notebook

A. The Saved prompts page will include a new Prompt optimizer button.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_oH9kBwq.gif
B. The Prompt assist dialog pop-up will include a new Optimize your prompt further button.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_PfCSq4n.gif
How AdVon Commerce and Augmedix enhanced Gemini prompts using Vertex AI Prompt Optimizer
AdVon Commerce - a digital commerce platform - partnered with Google Cloud to create quality content at scale for retailers using tailored AI solutions. AdVon Commerce utilizes LLMs to generate accurate and engaging product page content at scale, which incorporates the right keywords and represents the products accurately. When optimizing retail pages, there’s a lot of missing or incorrect data to work through. Creating shopper-first content means accurately completing missing product attributes that are essential for product searchability and customer journey.

Vertex AI Prompt Optimizer streamlined the creation and refinement of AI prompts, resulting in higher accuracy and relevance. AdVon Commerce observed a 10% increase in attribute accuracy and are able to maintain their commitment to high-quality content while significantly reducing the time they spend on human verification, saving substantial costs. Coupled with Gemini Flash, they have received impressive results with a reduction in incorrect specs and better quality in product page content. For example, AdVon Commerce has recently helped one of the largest global retailers by using Vertex AI Prompt Optimizer and Gemini 1.5 Flash to automate the process of populating product attributes for hundreds of millions of items. This resulted in a 100x increase in productivity for the retailer, as it would have taken 100 times longer if they had tried to do that manually.

Vlad Barshai, Chief Technology Officer at AdVon Commerce, stated “Vertex AI Prompt Optimizer allows us to optimize our prompts for Gemini Flash with 10% incremental improvements for problematic Product Attributes and PDP (Product Detail Page) for retailer listings, significantly surpassing results from all other leading AI models on the market. With Vertex AI Prompt Optimizer, we save time on human verification, allowing us to enrich millions of products in a loop where we optimize prompts and generate AI attributes and PDP content at scale. Coupled with a solid human-in-the-loop process, Vertex AI Prompt Optimizer will help us produce high quality enrichment every time.”

Augmedix – a leader in ambient AI medical documentation and data solutions that has generated over 10 million medical notes to date – partnered with Google Cloud to enhance medical documentation for healthcare providers. Augmedix utilizes LLMs to improve efficiency and accuracy in capturing patient interactions, reduce clinician administrative burdens, and ultimately, improve patient care. Augmedix adopted a hybrid approach – models are fine-tuned and inputs are prompt-tuned. For many parts of note generation, fine-tuning is best, and basic prompts work well. In other parts of their system, where there may be hundreds of rules that instruct the LLM, prompt-tuning these rules is optimal.

Augmedix employed Vertex AI Prompt Optimizer to enhance medical note generation from doctor-patient conversations. The feature improved LLM output quality scores from 66% to 86%. In addition, with Vertex AI Prompt Optimizer, Augmedix can test prompt variations quickly, allowing for faster iteration and optimization. The optimized prompts run in 6 seconds, compared to 20 seconds for prompts without Vertex AI Prompt Optimizer.

Ian Shakil, Founder, Director, and Chief Strategy Officer at Augmedix, stated, “Our partnership with Google Cloud AI enabled us to pioneer the frontier of the LLM wave. With MedLM and Gemini, we have achieved revolutionary advancements, driving cutting-edge innovation in the digital health space. This collaboration empowers us to deliver higher quality outputs, reduced turnaround times, and a richer feature set.”


###
https://github.com/baaivision/Emu3
9/30/24
Emu3: Next-token prediction conquers multimodal tasks 🔥
This is the most important research in months: we’re now very close to having a single architecture to handle all modalities. The folks at Beijing Academy of Artificial Intelligence(BAAI) just released Emu3, a single model that handles text, images, and videos all at once.
𝗪𝗵𝗮𝘁'𝘀 𝘁𝗵𝗲 𝗯𝗶𝗴 𝗱𝗲𝗮𝗹?
🌟 Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token.
And it’s only 8B, but really strong:
🖼️ For image generation, it's matching the best specialized models out there, like SDXL.
👁️ In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
🎬 It's the first to nail video generation without using complicated diffusion techniques.
𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗶𝘁 𝘄𝗼𝗿𝗸?
🧩 Emu3 uses a special tokenizer (SBER-MoVQGAN) to turn images and video clips into sequences of 4,096 tokens.
🔗 Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
🔮 During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame.
To build their multimodal dataset, the team:
🎨 Tossed out low-res and ugly images using an aesthetic model (LAION-AI aesthetic filter) to score pictures and videos
✍️ Got GPT-4V to write captions for 1 million images, same for frames of video, to finally get a truly multimodal dataset.
𝗖𝗮𝘃𝗲𝗮𝘁𝘀 𝗼𝗻 𝘁𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁𝘀:
👉 In image generation, Emu3 beats SDXL, but it’s also much bigger (8B vs 3.5B). It would be more difficult to beat the real diffusion GOAT FLUX-dev.
👉 In vision, authors also don’t show a comparison against all the current SOTA models like Qwen-VL or Pixtral.
On the positive side, this approach is exciting because it's simple (next token prediction) and scalable (handles all sorts of data).
On the other hand, it's once again the Bitter lesson: no matter your architecture, just throw good data and compute, and you’ll get the best model out there. 😬

###
https://github.com/SalesforceAIResearch/SFR-RAG
Salesforce
9/30/24
SFR-RAG
We introduce SFR-RAG, a 9B LLM trained with an emphasis in contextual comprehension and retrieval augmented generation (RAG) use case.

ContextualBench
We also introduce ContextualBench - a compilation of 7 popular contextual question answering benchmarks to evaluate LLMs in RAG application.


Reasoning will be the future of AI, but RAG is the present! How can we improve the faithfulness of LLMs? Salesforce released ContextualBench, a leaderboard and evaluation framework combining multiple academic RAG benchmarks such as HotpotQA, and announced SFR-RAG 9B, a fine-tuned LLM for RAG that matches Cohere Command-R+ (104B) and OpenAI GPT-4o. 👀
SFR-RAG is an unreleased 9B LLM fine-tuned and DPOed using a new extended chat template with Thought and Observation roles to control retrieval. It is trained on Multi-Hop Questions, Reliable Citations, and Hallucination Minimization by identifying unanswerable questions. There are no details on whether the model or dataset will be released. ❌
This research emphasizes how important task- and domain-specific models can be for Companies, especially for search or RAG. It should not be the first step, but gathering data, cleaning it, and fine-tuning can achieve superior results with smaller models, which might be easier and cheaper to run. ✅
P.S. Academic benchmarks are not representative of real-world use cases, but they give us a first understanding.

Write with as much technical detail as possible. There are 12 articles; include every one of them without omission.