Summary

Today's AI news covers the latest announcements and research results from several AI companies. Anthropic released the Claude 3.5 Sonnet model, raising the industry bar, while Ilya Sutskever, a co-founder of OpenAI, founded Safe Superintelligence Inc., a new safety-focused AI lab. A new leaderboard, BigCodeBench, was announced, presenting a way to evaluate large language models on real-world programming tasks. Open-Sora released version 1.2 of its open-source video generation model, and Character.AI shared how it optimizes AI inference for efficiency. Finally, the outlook for AI automation in the financial industry was also discussed.

Claude 3.5 Sonnet Release

Anthropic Releases Claude 3.5 Sonnet

Link, June 21, 2024, Anthropic

  • Claude 3.5 Sonnet model released
  • Improved intelligence and performance over previous models
  • Available for free on Claude.ai and the Claude iOS app
  • Claude Pro and Team plan subscribers get higher usage limits
  • Also available via Amazon Bedrock and Google Cloud's Vertex AI
  • Excels at code generation, code translation, and advanced content writing

Safe Superintelligence Inc. Announced

New AI Lab Safe Superintelligence Inc. Announced

Link, June 19, 2024, TIME

  • Ilya Sutskever, co-founder of OpenAI, announced a new AI lab
  • Its goal is to develop a safe "superintelligence"
  • Offices planned in Palo Alto and Tel Aviv
  • The company's sole objective is building a safe superintelligence system
  • Its funding approach and business model remain unclear

BigCodeBench Announced

BigCodeBench: Evaluating Practical and Challenging Programming Tasks

Link, June 18, 2024, Hugging Face

  • New benchmark BigCodeBench announced to address HumanEval's limitations
  • Consists of 1,140 function-level tasks
  • Evaluates realistic programming tasks involving diverse libraries and function calls
  • Measures LLMs' real-world programming ability more accurately
  • Confirms a performance gap between open and closed LLMs

Open-Sora 1.2 Report Released

Open-Sora: Open-Source Video Generation AI

Link, Open-Sora

  • Version 1.2 of the open-source video generation AI Open-Sora released
  • 1.1B model trained on more than 30M clips
  • Introduces a video compression network and multi-stage training
  • Provides image-to-video generation and video extension
  • Supports various resolutions and video lengths

Character.AI Inference Optimization

Optimizing AI Inference at Character.AI

Link, June 20, 2024, Character.AI

  • Shares optimization techniques for efficient AI inference
  • Adopts Multi-Query Attention to reduce KV cache size
  • Uses hybrid attention horizons
  • Cross-layer KV sharing improves memory efficiency
  • Developed a system to efficiently cache dialogue history

Outlook for AI Automation in Finance

Finance industry fears losing jobs to AI automation... "growing possibility of a 3.5-day workweek"

Link, June 20, 2024, Bloomberg

  • 54% of banking tasks could be automated with AI
  • Automation projected across banking, insurance, energy, and other industries
  • Major global banks are experimenting with AI adoption
  • JPMorgan Chase's CEO said AI could shorten the workweek to 3.5 days
  • Generative AI speeds up review of banking regulations and boosts productivity
Sources
This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each piece of content with detailed points, and write a report. The report format is:

(today's date, in 년 월 일 format) AI 소식 (AI News),

Summary

(overall short summary with good detail; in the Summary section, start each point with the company name, e.g. "OpenAI에서는 ~~~를 발표하였습니다.")

Title,

Korean title (한글제목)

Link, date,
company name

  • detailed summary1, (use concise bullet-point style)
  • detailed summary2, (use concise bullet-point style)
  • detailed summary N, (use concise bullet-point style)

Title,

Korean title (한글제목)

Link, date,
company name

  • detailed summary1, (use concise bullet-point style)
  • detailed summary2, (use concise bullet-point style)
  • detailed summary N, (use concise bullet-point style)
###
https://www.anthropic.com/news/claude-3-5-sonnet
Claude 3.5 Sonnet
June 21, 2024

Today, we’re launching Claude 3.5 Sonnet—our first release in the forthcoming Claude 3.5 model family. Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations, with the speed and cost of our mid-tier model, Claude 3 Sonnet.

Claude 3.5 Sonnet is now available for free on Claude.ai and the Claude iOS app, while Claude Pro and Team plan subscribers can access it with significantly higher rate limits. It is also available via the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. The model costs $3 per million input tokens and $15 per million output tokens, with a 200K token context window.
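For readers who want to try the API availability mentioned above, here is a minimal sketch using the Anthropic Python SDK. The client usage follows the SDK's documented pattern, but the exact model identifier ("claude-3-5-sonnet-20240620") and parameters are assumptions to verify against the current documentation.

# Minimal sketch: calling Claude 3.5 Sonnet via the Anthropic Python SDK.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment;
# the model ID below follows the June 2024 naming and should be verified.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this announcement in two sentences."}],
)
print(message.content[0].text)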

[Image: Claude model family]
Frontier intelligence at 2x the speed
Claude 3.5 Sonnet sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval). It shows marked improvement in grasping nuance, humor, and complex instructions, and is exceptional at writing high-quality content with a natural, relatable tone.

Claude 3.5 Sonnet operates at twice the speed of Claude 3 Opus. This performance boost, combined with cost-effective pricing, makes Claude 3.5 Sonnet ideal for complex tasks such as context-sensitive customer support and orchestrating multi-step workflows.

In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%. Our evaluation tests the model’s ability to fix a bug or add functionality to an open source codebase, given a natural language description of the desired improvement. When instructed and provided with the relevant tools, Claude 3.5 Sonnet can independently write, edit, and execute code with sophisticated reasoning and troubleshooting capabilities. It handles code translations with ease, making it particularly effective for updating legacy applications and migrating codebases.

[Image: Claude 3.5 Sonnet benchmark results]
State-of-the-art vision
Claude 3.5 Sonnet is our strongest vision model yet, surpassing Claude 3 Opus on standard vision benchmarks. These step-change improvements are most noticeable for tasks that require visual reasoning, like interpreting charts and graphs. Claude 3.5 Sonnet can also accurately transcribe text from imperfect images—a core capability for retail, logistics, and financial services, where AI may glean more insights from an image, graphic or illustration than from text alone.


[Image: Claude 3.5 Sonnet vision evaluations]
Artifacts—a new way to use Claude
Today, we’re also introducing Artifacts on Claude.ai, a new feature that expands how users can interact with Claude. When a user asks Claude to generate content like code snippets, text documents, or website designs, these Artifacts appear in a dedicated window alongside their conversation. This creates a dynamic workspace where they can see, edit, and build upon Claude’s creations in real-time, seamlessly integrating AI-generated content into their projects and workflows.

This preview feature marks Claude’s evolution from a conversational AI to a collaborative work environment. It’s just the beginning of a broader vision for Claude.ai, which will soon expand to support team collaboration. In the near future, teams—and eventually entire organizations—will be able to securely centralize their knowledge, documents, and ongoing work in one shared space, with Claude serving as an on-demand teammate.


Commitment to safety and privacy
Our models are subjected to rigorous testing and have been trained to reduce misuse. Despite Claude 3.5 Sonnet’s leap in intelligence, our red teaming assessments have concluded that Claude 3.5 Sonnet remains at ASL-2. More details can be found in the model card addendum.

As part of our commitment to safety and transparency, we’ve engaged with external experts to test and refine the safety mechanisms within this latest model. We recently provided Claude 3.5 Sonnet to the UK’s Artificial Intelligence Safety Institute (UK AISI) for pre-deployment safety evaluation. The UK AISI completed tests of 3.5 Sonnet and shared their results with the US AI Safety Institute (US AISI) as part of a Memorandum of Understanding, made possible by the partnership between the US and UK AISIs announced earlier this year.

We have integrated policy feedback from outside subject matter experts to ensure that our evaluations are robust and take into account new trends in abuse. This engagement has helped our teams scale up our ability to evaluate 3.5 Sonnet against various types of misuse. For example, we used feedback from child safety experts at Thorn to update our classifiers and fine-tune our models.

One of the core constitutional principles that guides our AI model development is privacy. We do not train our generative models on user-submitted data unless a user gives us explicit permission to do so. To date we have not used any customer or user-submitted data to train our generative models.

Coming soon
Our aim is to substantially improve the tradeoff curve between intelligence, speed, and cost every few months. To complete the Claude 3.5 model family, we’ll be releasing Claude 3.5 Haiku and Claude 3.5 Opus later this year.

In addition to working on our next-generation model family, we are developing new modalities and features to support more use cases for businesses, including integrations with enterprise applications. Our team is also exploring features like Memory, which will enable Claude to remember a user’s preferences and interaction history as specified, making their experience even more personalized and efficient.

We’re constantly working to improve Claude and love hearing from our users. You can submit feedback on Claude 3.5 Sonnet directly in-product to inform our development roadmap and help our teams to improve your experience. As always, we look forward to seeing what you build, create, and discover with Claude.

###
https://time.com/6990076/safe-superintelligence-inc-announced/
Former OpenAI Chief Scientist Announces New Safety-Focused Company
[Photo: Ilya Sutskever speaks at Tel Aviv University in Tel Aviv on June 5, 2023. Jack Guez—AFP via Getty Images]
By Harry Booth, June 19, 2024, 5:05 PM EDT
Ilya Sutskever, a co-founder and former chief scientist of OpenAI, announced on Wednesday that he’s launching a new venture dubbed Safe Superintelligence Inc. Sutskever said on X that the new lab will focus solely on building a safe “superintelligence”—an industry term for a hypothetical system that’s smarter than humans.

Sutskever is joined at Safe Superintelligence Inc. by co-founders Daniel Gross, an investor and engineer who worked on AI at Apple until 2017, and Daniel Levy, another former OpenAI employee. The new American-based firm will have offices in Palo Alto, Calif., and Tel Aviv, according to a description Sutskever shared.

I am starting a new company: https://t.co/BG3K3SI3A1

— Ilya Sutskever (@ilyasut) June 19, 2024
Sutskever was one of OpenAI’s founding members, and was chief scientist during the company’s meteoric rise following the release of ChatGPT. In November, Sutskever took part in the infamous attempt to oust OpenAI CEO Sam Altman, only to later change his mind and support Altman’s return. When Sutskever announced his resignation in May, he said he was “confident that OpenAI will build AGI that is both safe and beneficial” under Altman’s leadership.

Safe Superintelligence Inc. says it will only aim to release one product: the system in its name. This model will insulate the company from commercial pressures, its founders wrote. However, it’s currently unclear who will fund the new venture's development or what exactly its business model will eventually be.

“Our singular focus means no distraction by management overhead or product cycles,” the announcement reads, perhaps subtly taking aim at OpenAI. In May, another senior OpenAI member, Jan Leike, who co-led a safety team with Sutskever, accused the company of prioritizing “shiny products” over safety. Leike’s accusations came around the time that six other safety-conscious employees left the company. Altman and OpenAI’s President, Greg Brockman, responded to Leike’s accusations by acknowledging there was more work to be done, saying “we take our role here very seriously and carefully weigh feedback on our actions.”

Read more: A Timeline of All the Recent Accusations Leveled at OpenAI and Sam Altman

In an interview with Bloomberg, Sutskever elaborated on Safe Superintelligence Inc.’s approach, saying, “By safe, we mean safe like nuclear safety as opposed to safe as in ‘trust and safety’”; one of OpenAI’s core safety principles is to “be a pioneer in trust and safety.”

While many details about the new company remain to be revealed, its founders have one message for those in the industry who are intrigued: They’re hiring.

###
https://huggingface.co/blog/leaderboard-bigcodebench
BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks
Published June 18, 2024
By Terry Yue Zhuo (terryyz), Jiawei Liu (ganler), Qian Liu (SivilTaram), Binyuan Hui (huybery), Niklas Muennighoff (Muennighoff), Daniel Fried (dpfried), Harm de Vries (harmdevries), Leandro von Werra (lvwerra), and Clémentine Fourrier (clefourrier), with BigCode
HumanEval is a reference benchmark for evaluating large language models (LLMs) on code generation tasks, as it makes the evaluation of compact function-level code snippets easy. However, there are growing concerns about its effectiveness in evaluating the programming capabilities of LLMs, and the main concern is that tasks in HumanEval are too simple and may not be representative of real-world programming tasks. Compared to the algorithm-oriented tasks in HumanEval, real-world software development often involves diverse libraries and function calls. Furthermore, LLMs' performance on HumanEval is subject to contamination and overfitting issues, making it less reliable for evaluating the generalization of LLMs.
While there have been some efforts to address these issues, they are either domain-specific, deterministic, or agent-centric (sorry DS-1000, ODEX, and SWE-bench 💔). We feel that the community still lacks an easy-to-use benchmark that can broadly evaluate the programming capabilities of LLMs, and that's what we focused on.

We are excited to announce the release of BigCodeBench, which evaluates LLMs on solving practical and challenging programming tasks without contamination. Specifically, BigCodeBench contains 1,140 function-level tasks to challenge LLMs to follow instructions and compose multiple function calls as tools from 139 libraries. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases with an average branch coverage of 99%.

Ready to dive into BigCodeBench? Let's get started! 🚀

What do the tasks in BigCodeBench look like? 🕵️‍♂️
[Image: an example BigCodeBench task]
BigCodeBench features complex, user-oriented instructions for each task, including clear functionality descriptions, input/output formats, error handling, and verified interactive examples. We avoid step-by-step task instructions, believing capable LLMs should understand and solve tasks from the user's perspective in an open-ended manner. We verify specific features using test cases.

# We elaborate the above task with some test cases:

# Requirements SetUp
import unittest
from unittest.mock import patch
import http.client
import ssl
import socket

# `task_func(host, port, path)` is the solution to the task above, assumed to be defined.

# Start the test
class TestCases(unittest.TestCase):

    # Mock the successful connection and assess the response content
    @patch('http.client.HTTPSConnection')
    def test_response_content(self, mock_conn):
        """Test the content of the response."""
        mock_conn.return_value.getresponse.return_value.read.return_value = b'Expected Content'
        result = task_func('www.example.com', 443, '/content/path')
        self.assertEqual(result, 'Expected Content')

    # Mock the failed connection and assess the error handling
    @patch('socket.create_connection')
    @patch('http.client.HTTPSConnection')
    def test_ssl_handshake_error_handling(self, mock_conn, mock_socket):
        """Test handling of SSL handshake errors."""
        mock_socket.side_effect = ssl.SSLError('SSL handshake failed')
        with self.assertRaises(ssl.SSLError):
            task_func('badssl.com', 443, '/test/path')

    # More test cases...

Tasks in BigCodeBench utilize diverse function calls from popular libraries. We don't restrict the function calls LLMs can use, expecting them to choose appropriate functions and combine them flexibly to solve tasks. Test cases are designed as test harnesses to examine expected program behaviors during runtime.

To assess LLM performance, we use Pass@1 with greedy decoding, measuring the percentage of tasks correctly solved with the first generated code snippet via curated test cases. This approach aligns with benchmarks like HumanEval and MBPP. We address LLMs' tendency to skip long code prompts by adding missing setups (e.g., import statements, global constants) during Pass@1 evaluation, referred to as calibrated Pass@1.
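To make the metric concrete, the following is a minimal sketch (not the official harness) of calibrated Pass@1 under greedy decoding. Here generate_code and run_test_cases are hypothetical stand-ins for the model call and the sandboxed unittest execution.

# Minimal sketch of calibrated Pass@1 under greedy decoding (not the official harness).
# `generate_code(prompt)` and `run_test_cases(program, task)` are hypothetical stand-ins
# for one greedy model sample and the sandboxed unittest run over the curated test cases.

def calibrated_pass_at_1(tasks, generate_code, run_test_cases):
    solved = 0
    for task in tasks:
        completion = generate_code(task["prompt"])       # one greedy sample per task
        # Calibration: prepend the task's known setup (imports, global constants)
        # so a model that skipped it in the long prompt is not unfairly penalized.
        program = task["missing_setup"] + "\n" + completion
        if run_test_cases(program, task):                # all curated test cases must pass
            solved += 1
    return solved / len(tasks)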

[Image: task complexity and tool-use comparison with other benchmarks]
To better understand implementation complexity and tool-use diversity, we compare the tasks in BigCodeBench with those in representative benchmarks, including APPS, DS-1000, ODEX, APIBench, MBPP, NumpyEval, PandasEval, HumanEval, and TorchDataEval. We find that BigCodeBench requires more complex reasoning and problem-solving skills to implement comprehensive functionalities.

[Image: BigCodeBench-Complete vs. BigCodeBench-Instruct prompt formats]
As shown in the task figure, the main target scenario is code completion (denoted as BigCodeBench-Complete), where LLMs are required to finish the implementation of a function based on detailed instructions in the docstring. However, considering downstream applications such as multi-turn dialogue, users may describe requirements in a more conversational and less verbose manner. This is where instruction-tuned LLMs are beneficial, as they are trained to follow natural-language instructions and generate code snippets accordingly. To test if models can truly understand human intents and translate them into code, we create BigCodeBench-Instruct, a more challenging variant of BigCodeBench designed to evaluate instruction-tuned LLMs.

Where do the tasks come from? 🤔
[Image: Human-LLM collaboration process]
We guarantee the quality of the tasks in BigCodeBench through a systematic "Human-LLM collaboration process." We start with ODEX as the "seed dataset," which contains short but realistic human intents and corresponding Python one-liners from Stack Overflow. We use GPT-4 to expand these one-liners into comprehensive function-level tasks.

Next, 20 human experts—most with over 5 years of Python programming experience—voluntarily guide GPT-4 in an execution-based sandbox. They continually instruct it to refine the synthesized tasks and add test cases. The tasks and test cases are then examined in a local environment, pre-evaluated on other LLMs, and cross-checked by 7 additional human experts to ensure their quality.

To assess overall quality, the authors sample tasks for 11 human experts to solve, achieving an average human performance of 97%.

How well do LLMs perform on BigCodeBench? 📊
We host the BigCodeBench leaderboard on both Hugging Face Space and GitHub Pages. Here, we use the Hugging Face leaderboard as an example.

[Embedded leaderboard: bigcode/bigcodebench-leaderboard, built with Gradio and hosted on Hugging Face Spaces]

Interestingly, we observe that instruction-tuned LLMs like GPT-4 can omit essential import statements in the long prompts of BigCodeBench-Complete, leading to task failures due to missing modules and constants. This behavior, called "model laziness", is discussed in the community.

Compared to human performance, LLMs perform significantly lower on BigCodeBench-Complete and even lower on BigCodeBench-Instruct. The best model (GPT-4o) achieves a calibrated Pass@1 of 61.1% on BigCodeBench-Complete and 51.1% on BigCodeBench-Instruct. Additionally, there is a notable performance gap between closed and open LLMs.

While Pass@1 is a good metric for overall performance, it is not detailed enough to compare models directly. Inspired by Chatbot Arena, we use Elo rating to rank models on BigCodeBench-Complete. This method, originally used in chess, ranks players based on their game performance. We adapt it to programming tasks, treating each task as a game and each model as a player. The Elo rating updates are based on game outcomes and expectations, using task-level calibrated Pass@1 (0% or 100%) and excluding ties. Starting with an initial Elo rating of 1000, we fit it using maximum likelihood estimation and bootstrap with 500 iterations to get final scores. We find that GPT-4o outperforms other models by a large margin, with DeepSeekCoder-V2 in the second tier.
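The sketch below illustrates this task-level Elo idea in simplified form. The authors fit ratings by maximum likelihood with 500 bootstrap rounds; here a classic sequential Elo update over bootstrap-resampled task outcomes is used instead, purely to show the mechanics.

import random

# Simplified task-level Elo sketch: each (model A, model B, task) outcome is a "game".
# The authors fit ratings by maximum likelihood with 500 bootstrap iterations; this
# sequential Elo update over bootstrap resamples is only an illustration of the idea.

def elo_ratings(outcomes, models, k=4.0, rounds=500, init=1000.0, seed=0):
    # outcomes: list of (model_a, model_b, score_a), where score_a is 1.0 if only A
    # solved the task and 0.0 if only B did; ties are excluded beforehand.
    rng = random.Random(seed)
    totals = {m: 0.0 for m in models}
    for _ in range(rounds):
        ratings = {m: init for m in models}
        for a, b, score_a in rng.choices(outcomes, k=len(outcomes)):  # bootstrap resample
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            ratings[a] += k * (score_a - expected_a)
            ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
        for m in models:
            totals[m] += ratings[m]
    return {m: totals[m] / rounds for m in models}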

To help the community understand model performance on each task, we track solve rates, measured by calibrated Pass@1. On BigCodeBench-Complete, 149 tasks remain unsolved by all models, while 6 tasks are completely solved. For BigCodeBench-Instruct, 278 tasks remain unsolved and 14 tasks are fully solved by all models. The significant number of unsolved tasks and the small number of fully solved tasks show that BigCodeBench is a challenging benchmark for LLMs.

Great! So, how can I evaluate my model on BigCodeBench? 🛠️
We make BigCodeBench easily accessible to the community by providing a simple and user-friendly evaluation framework, which can be downloaded via PyPI. The prototype of the evaluation framework is based on EvalPlus for the HumanEval+ and MBPP+ benchmarks. However, as our benchmark has tasks with much more diverse library dependencies than EvalPlus, we build a less resource-constrained execution environment and adapt it for unittest in the test harness of BigCodeBench.

To facilitate the evaluation, we provide pre-built Docker images for code generation and code execution. Check out our GitHub repository to find more details on how to use the evaluation framework.

###
https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_03.md
Open-Sora: Democratizing Efficient Video Production for All
We design and implement Open-Sora, an initiative dedicated to efficiently producing high-quality video. We hope to make the model, tools and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation. With Open-Sora, our goal is to foster innovation, creativity, and inclusivity within the field of content creation.

Open-Sora 1.2 Report
Video compression network
Rectified flow and model adaptation
More data and better multi-stage training
Easy and effective model conditioning
Evaluation
Sequence parallelism
In the Open-Sora 1.2 release, we train a 1.1B model on >30M clips (80k hours) at a training cost of 35k H100 GPU hours, supporting 0s~16s, 144p to 720p, various-aspect-ratio video generation. Our configurations are listed below. Following our 1.1 version, Open-Sora 1.2 can also do image-to-video generation and video extension.

resolution image 2s 4s 8s 16s
240p ✅ ✅ ✅ ✅ ✅
360p ✅ ✅ ✅ ✅ ✅
480p ✅ ✅ ✅ ✅ 🆗
720p ✅ ✅ ✅ 🆗 🆗
Here ✅ means the configuration is seen during training, and 🆗 means that, although not trained on it, the model can run inference at that configuration. Inference for 🆗 requires more than one 80GB GPU and sequence parallelism.

Besides features introduced in Open-Sora 1.1, Open-Sora 1.2 highlights:

Video compression network
Rectified-flow training
More data and better multi-stage training
Easy and effective model conditioning
Better evaluation metrics
All implementations (both training and inference) of the above improvements are available in the Open-Sora 1.2 release. The following sections introduce the details of these improvements. We also refine our codebase and documentation to make them easier to use and develop, and add an LLM to refine input prompts and support more languages.

Video compression network
For Open-Sora 1.0 & 1.1, we used Stability AI's 83M 2D VAE, which compresses the video only in the spatial dimension, by 8x8. To reduce the temporal dimension, we extracted one frame out of every three. However, this method led to low fluency in the generated video, as the generated fps is sacrificed. Thus, in this release, we introduce a video compression network, as OpenAI's Sora does. With 4x compression in the temporal dimension, we no longer need to extract frames and can generate videos at the original fps.

Considering the high computational cost of training a 3D VAE, we hope to reuse the knowledge learned in the 2D VAE. We notice that after the 2D VAE's compression, features adjacent in the temporal dimension are still highly correlated. Thus, we propose a simple video compression network, which first compresses the video in the spatial dimension by 8x8, then compresses it in the temporal dimension by 4x. The network is shown below:

[Image: video compression network architecture]
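As a rough sketch of the shapes involved in this stacked design (module internals and channel counts here are placeholders, not the released Open-Sora code):

import torch
import torch.nn as nn

# Illustrative shape walk-through of the stacked compression described above:
# a pretrained 2D VAE compresses each frame 8x8 spatially, then a 3D VAE
# compresses 4x temporally. The encoders are placeholders, not the real modules.

class StackedVideoCompressor(nn.Module):
    def __init__(self, spatial_vae_2d: nn.Module, temporal_vae_3d: nn.Module):
        super().__init__()
        self.spatial = spatial_vae_2d    # e.g. an SDXL-style image VAE encoder
        self.temporal = temporal_vae_3d  # e.g. a Magvit-v2-style causal 3D encoder

    def encode(self, video: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = video.shape                  # [B, T, 3, H, W]
        frames = video.reshape(b * t, c, h, w)
        z2d = self.spatial(frames)                   # [B*T, C', H/8, W/8]
        z2d = z2d.reshape(b, t, *z2d.shape[1:])      # [B, T, C', H/8, W/8]
        z3d = self.temporal(z2d.transpose(1, 2))     # [B, C'', T/4, H/8, W/8]
        return z3d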

We initialize the 2D VAE with SDXL's VAE, which is better than our previously used one. For the 3D VAE, we adopt the structure of the VAE in Magvit-v2, which contains 300M parameters. Along with the 83M 2D VAE, the video compression network has 384M parameters in total. We train the 3D VAE for 1.2M steps with a local batch size of 1. The training data consists of videos from Pexels and Pixabay, and the training clips are mainly 17 frames at 256x256 resolution. Causal convolutions are used in the 3D VAE to make image reconstruction more accurate.

Our training involves three stages:

1. For the first 380k steps, we train on 8 GPUs and freeze the 2D VAE. The training objective includes reconstructing the compressed features from the 2D VAE (the pink path in the figure), plus a loss that makes the 3D VAE's features similar to the 2D VAE's (pink and green, called the identity loss). We find the latter loss quickly brings the whole VAE to good image performance and makes the next stage converge much faster.
2. For the next 260k steps, we remove the identity loss and train only the 3D VAE.
3. For the last 540k steps, since we find that only reconstructing the 2D VAE's features cannot yield further improvement, we remove that loss and train the whole VAE to reconstruct the original videos. This stage is trained on 24 GPUs.
For both stage 1 and stage 2 training, we use 20% images and 80% videos. Following Magvit-v2, we train videos using 17 frames, while zero-padding the first 16 frames for images. However, we find that this setting leads to blurring for videos whose length differs from 17 frames. Thus, in stage 3, we use a random number of frames up to 34 for mixed-video-length training (i.e., zero-pad the first 34-n frames to train an n-frame video), to make our VAE more robust to different video lengths. Our training and inference code is available in the Open-Sora 1.2 release.

When used with the diffusion model, our stacked VAE requires little memory, as the 3D VAE's input is already compressed. We also split the input videos into several 17-frame clips to make inference more efficient. The performance of our VAE is on par with another open-source 3D VAE from Open-Sora-Plan.

Model SSIM↑ PSNR↑
Open-Sora-Plan 1.1 0.882 29.890
Open-Sora 1.2 0.880 30.590
Rectified flow and model adaptation
The latest diffusion models, such as Stable Diffusion 3, adopt rectified flow instead of DDPM for better performance. Unfortunately, SD3's rectified-flow training code is not open-sourced. However, Open-Sora 1.2 provides training code following SD3's paper, including:

Basic rectified flow training (original rectified flow paper)
Logit-norm sampling for training acceleration (SD3 paper Section 3.1, intuitively it is more likely to sample timesteps at middle noise level)
Resolution and video length aware timestep sampling (SD3 paper Section 5.3.2, intuitively it is more likely to sample timesteps with more noise for larger resolution, and we extend it to longer video)
With resolution-aware timestep sampling, we use more noise for images with larger resolution. We extend this idea to video generation and use more noise for longer videos.
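The sketch below shows a single rectified-flow training step with logit-normal timestep sampling, as described above. The resolution/length-aware shift is written as a simple assumption (larger token counts get noisier timesteps), and the model signature is a placeholder; the exact schedules used by SD3 and Open-Sora may differ.

import torch

# Minimal rectified-flow training-step sketch with logit-normal timestep sampling.
# The resolution/length-aware shift is an assumption for illustration; the call
# model(x_t, t, cond) is also a placeholder signature.

def rectified_flow_loss(model, x0, cond, num_tokens, base_tokens=1024):
    b = x0.shape[0]
    t = torch.sigmoid(torch.randn(b, device=x0.device))   # logit-normal: mid-noise emphasis
    shift = (num_tokens / base_tokens) ** 0.5              # larger inputs -> noisier timesteps
    t = shift * t / (1.0 + (shift - 1.0) * t)
    t_ = t.view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = (1.0 - t_) * x0 + t_ * noise                     # straight-line interpolation
    v_target = noise - x0                                  # rectified-flow velocity target
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)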

Open-Sora 1.2 starts from the PixArt-Σ 2K checkpoint. Note that this model is trained with DDPM, the SDXL VAE, and a much higher resolution. We find that finetuning on a small dataset can easily adapt the model to our video generation setting. The adaptation process is as follows; all training is done on 8 GPUs (the adaptation of the diffusion model is quite fast and straightforward):

1. Multi-resolution image generation: we train the model to generate resolutions ranging from 144p to 2K for 20k steps.
2. QK-norm: we add QK-norm to the model and train for 18k steps.
3. Rectified flow: we switch from discrete-time DDPM to continuous-time rectified flow and train for 10k steps.
4. Rectified flow with logit-norm sampling and resolution-aware timestep sampling: we train for 33k steps.
5. Smaller AdamW epsilon: following SD3, with QK-norm we can use a smaller epsilon (1e-15) for AdamW; we train for 8k steps.
6. New VAE and fps conditioning: we replace the original VAE with ours and add fps conditioning to the timestep conditioning; we train for 25k steps. Note that normalizing each channel is important for rectified-flow training.
7. Temporal attention blocks: we add temporal attention blocks with zero-initialized projection layers and train on images for 3k steps.
8. Temporal blocks only, for video with the mask strategy: we train the temporal attention blocks only, on videos, for 38k steps.
After the above adaptation, we are ready to train the model on videos. The adaptation above maintains the original model's ability to generate high-quality images, and brings multiple benefits for video generation:

With rectified flow, we can accelerate the training and reduce the number of sampling steps for video from 100 to 30, which greatly reduces the waiting time for inference.
With QK-norm, training is more stable and a more aggressive optimizer can be used.
With new VAE, the temporal dimension is compressed by 4 times, which makes the training more efficient.
With multi-resolution image generation ability, the model can generate videos with different resolutions.
More data and better multi-stage training
Due to a limited computational budget, we carefully arrange the training data from low to high quality and split our training into three stages. Our training involves 12x8 GPUs, and the total training time is about 2 weeks for about 70k steps.

First stage
We first train the model on the Webvid-10M dataset (40k hours) for 30k steps (2 epochs). Since its videos are all below 360p resolution and contain watermarks, we train on this dataset first. Training mainly happens at 240p and 360p, with video lengths of 2s~16s. We use the original captions in the dataset for training. The training config is located in stage1.py.

Second stage
Then we train the model on the Panda-70M dataset. This dataset is large, but its quality varies. We use the official 30M subset, whose clips are more diverse, and filter out videos with an aesthetic score lower than 4.5. This yields a 20M subset with 41k hours. The captions in the dataset are used directly for training. The training config is located in stage2.py.

Training mainly happens at 360p and 480p. We train the model for 23k steps, which is 0.5 epoch. The training is not fully finished, since we hope our new model can reach you earlier.

Third stage
In this stage, we collect ~2M video clips with a total length of 5K hours from all kinds of sources, including:

Free-license videos, sourced from Pexels, Pixabay, Mixkit, etc.
MiraData: a high-quality dataset with long videos, mainly from games and city/scenic exploration.
Vript: a densely annotated dataset.
And some other datasets.
While MiraData and Vript have captions from GPT, we use PLLaVA to caption the rest. Compared with LLaVA, which is capable only of single-frame/image captioning, PLLaVA is specially designed and trained for video captioning. The accelerated PLLaVA is released in our tools/. In practice, we use the pretrained PLLaVA 13B model and select 4 frames from each video for captioning, with a spatial pooling shape of 2x2.

Some statistics of the video data used in this stage are shown below. We present basic statistics of duration and resolution, as well as the aesthetic score and optical flow score distributions. We also extract tags for objects and actions from video captions and count their frequencies. [Images: dataset statistics, object counts, action counts]

We mainly train on 720p and 1080p videos in this stage, aiming to extend the model's ability to larger resolutions. We use a mask ratio of 25% during training. The training config is located in stage3.py. We train the model for 15k steps, which is approximately 2 epochs.

Easy and effective model conditioning
For stage 3, we calculate an aesthetic score and a motion score for each video clip. However, since the number of video clips is small, we do not want to filter out clips with low scores, which would shrink the dataset further. Instead, we append the scores to the captions and use them as conditioning. We find this makes the model aware of the scores and lets it follow them to generate higher-quality videos.

For example, for a video with an aesthetic score of 5.5, a motion score of 10, and a detected camera motion of "pan left", the caption will be:

[Original Caption] aesthetic score: 5.5, motion score: 10, camera motion: pan left.
During inference, we can also use the scores to condition the model. For camera motion, we only label 13k clips with high confidence, and the camera motion detection module is released in our tools.
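A tiny sketch of this caption-based conditioning is shown below; the formatting mirrors the example above, and the helper name is hypothetical.

# Hypothetical helper mirroring the caption format shown above:
# "[Original Caption] aesthetic score: 5.5, motion score: 10, camera motion: pan left."
def condition_caption(caption, aesthetic=None, motion=None, camera_motion=None):
    tags = []
    if aesthetic is not None:
        tags.append(f"aesthetic score: {aesthetic}")
    if motion is not None:
        tags.append(f"motion score: {motion}")
    if camera_motion is not None:
        tags.append(f"camera motion: {camera_motion}")
    return caption if not tags else f"{caption} {', '.join(tags)}."

# condition_caption("A boat sails past a lighthouse", 5.5, 10, "pan left")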

Evaluation
Previously, we monitored the training process only by human evaluation, as the DDPM training loss is not well correlated with the quality of generated videos. However, for rectified flow, we find the training loss is well correlated with generated video quality, as stated in SD3. Thus, we keep track of the rectified-flow evaluation loss on 100 images and 1k videos.

We sampled 1k videos from Pixabay as a validation dataset. We calculate the evaluation loss for images and for videos of different lengths (2s, 4s, 8s, 16s) at different resolutions (144p, 240p, 360p, 480p, 720p). For each setting, we equidistantly sample 10 timesteps, then average all the losses. We also provide a video showing samples generated with a fixed prompt at different training steps.

[Images: evaluation loss and video evaluation loss curves]

In addition, we also keep track of VBench scores during training. VBench is an automatic video evaluation benchmark for short video generation. We calculate the VBench score with 240p, 2s videos. The two metrics verify that our model continues to improve during training.

[Image: VBench score during training]

All the evaluation code is released in eval folder. Check the README for more details.

Model Total Score Quality Score Semantic Score
Open-Sora V1.0 75.91% 78.81% 64.28%
Open-Sora V1.2 79.23% 80.71% 73.30%
Sequence parallelism
We use sequence parallelism to support long-sequence training and inference. Our implementation is based on Ulysses and the workflow is shown below. When sequence parallelism is enabled, we only need to apply the all-to-all communication to the spatial block in STDiT as only spatial computation is dependent on the sequence dimension.

[Image: sequence parallelism workflow]

Currently, we have not used sequence parallelism for training, as the data resolution is small; we plan to do so in the next release. For inference, sequence parallelism can be used in case your GPU runs out of memory. A simple benchmark shows that sequence parallelism achieves a speedup.
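Below is a minimal sketch of the Ulysses-style all-to-all used for this kind of sequence parallelism. It is an illustration under simplifying assumptions (dimensions divisible by the world size), not the Open-Sora implementation: each rank starts with a sequence shard and all heads, and ends with the full sequence and a shard of the heads.

import torch
import torch.distributed as dist

# Ulysses-style all-to-all sketch: swap which dimension (sequence vs. heads) is
# sharded across ranks. Assumes the scattered dimension is divisible by world size.

def seq_parallel_all_to_all(x, scatter_dim, gather_dim, group=None):
    world_size = dist.get_world_size(group)
    chunks = [c.contiguous() for c in x.chunk(world_size, dim=scatter_dim)]
    outputs = [torch.empty_like(c) for c in chunks]
    dist.all_to_all(outputs, chunks, group=group)
    return torch.cat(outputs, dim=gather_dim)

# Before attention: [B, S/P, H, D] -> [B, S, H/P, D]
#   q = seq_parallel_all_to_all(q, scatter_dim=2, gather_dim=1)
# After attention:  [B, S, H/P, D] -> [B, S/P, H, D]
#   out = seq_parallel_all_to_all(out, scatter_dim=1, gather_dim=2)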

###
https://research.character.ai/optimizing-inference/
June 20, 2024 · Efficiency
Optimizing AI Inference at Character.AI
At Character.AI, we're building toward AGI. In that future state, large language models (LLMs) will enhance daily life, providing business productivity and entertainment and helping people with everything from education to coaching, support, brainstorming, creative writing and more.

To make that a reality globally, it's critical to achieve highly efficient “inference” – the process by which LLMs generate replies. As a full-stack AI company, Character.AI designs its model architecture, inference stack and product from the ground up, enabling unique opportunities to optimize inference to be more efficient, cost-effective and scalable to a rapidly growing, global audience.

Today we serve more than 20,000 inference queries per second. To put this in perspective, this is roughly 20% of the request volume served by Google Search, which processes around 105,000 queries per second according to third party estimates (Statista, 2024).

We can sustainably serve LLMs at this scale because we have developed a number of key innovations across our serving stack. In this blog post, we share some of the techniques and optimizations we have developed over the past two years and recently employed.

Memory-efficient Architecture Design
The key bottleneck of LLM inference throughput is the size of the cache of attention keys and values (KV). It not only determines the maximum batch size that can fit on a GPU, but also dominates the I/O cost on attention layers. We use the following techniques to reduce KV cache size by more than 20X without regressing quality. With these techniques, GPU memory is no longer a bottleneck for serving large batch sizes.

1. Multi-Query Attention. We adopt Multi-Query Attention (Shazeer, 2019) in all attention layers. This reduces KV cache size by 8X compared to the Grouped-Query Attention adopted in most open source models.

2. Hybrid Attention Horizons. We interleave local attention (Beltagy et al., 2020) with global attention layers. Local attention is trained with sliding windows, and reduces the complexity from O(length²) to O(length). We found that reducing the attention horizon to 1024 on most attention layers does not have a significant impact on evaluation metrics, including the long-context needle-in-haystack benchmark. In our production model, only 1 out of every 6 layers uses global attention.

3. Cross Layer KV-sharing. We tie the KV cache across neighboring attention layers, which further reduces KV cache size by a factor of 2-3x. For global attention layers, we tie the KV cache of multiple global layers across blocks, since the global attention layers dominate the KV cache size under long context use cases. Similar to a recent publication (Brandon et al., 2024), we find that sharing KV across layers does not regress quality.


Figure 1. Left: Standard transformer design where every attention is global attention. Right: The attention design in our production model. Blue boxes indicate global attention, green boxes indicate local attention, and curves indicate KV-sharing. For global attention layers, we share KV across multiple non-adjacent layers. This illustration depicts only a subset of the layers in the full model.
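To give a feel for how these three choices compound, here is a back-of-the-envelope KV-cache sizing sketch. The layer count, head dimension, context length, and sharing factor below are illustrative assumptions, not Character.AI's actual configuration.

# Back-of-the-envelope KV-cache sizing for the three techniques above; all
# configuration numbers are illustrative assumptions, not the production model.

def kv_cache_gb(layers, kv_heads, head_dim, tokens_per_layer, layer_share, bytes_per_val):
    # K and V per cached token; layers sharing a cache are counted once.
    values = 2 * (layers / layer_share) * kv_heads * head_dim * tokens_per_layer
    return values * bytes_per_val / 1e9

context = 8192
# Baseline: grouped-query attention, every layer global, no cross-layer sharing.
base = kv_cache_gb(48, kv_heads=8, head_dim=128, tokens_per_layer=context,
                   layer_share=1, bytes_per_val=2)
# MQA (1 KV head), 5 of 6 layers local with a 1024-token horizon, 2-way KV sharing.
opt = kv_cache_gb(48, kv_heads=1, head_dim=128,
                  tokens_per_layer=(context + 5 * 1024) / 6,
                  layer_share=2, bytes_per_val=2)
print(f"baseline ≈ {base:.2f} GB/sequence, optimized ≈ {opt:.3f} GB/sequence")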
Stateful Caching
One of our key innovations is an efficient system for caching attention KV on host memory between chat turns. On Character.AI, the majority of chats are long dialogues; the average message has a dialogue history of 180 messages. As dialogues grow longer, continuously refilling KV caches on each turn would be prohibitively expensive.

To solve this problem, we developed an inter-turn caching system. For every prefilled prefix and generated message, we cache the KV values on host memory and retrieve them for future queries. Similar to RadixAttention (Zheng et al., 2023), we organize cached KV tensors in an LRU cache with a tree structure. The cached KV values are indexed by a rolling hash of prefix tokens. For each new query, a rolling hash is calculated for each prefix of the context, and the cache is retrieved for the longest match. This allows reusing the cache even for partially matched messages.

At a fleet level, we use sticky sessions to route the queries from the same dialogue to the same server. Since our KV cache size is small, each server can cache thousands of dialogues concurrently. Our system achieves a 95% cache rate, further reducing inference cost.


Figure 2. Blue boxes indicate cached tensors on host memory. Green and yellow boxes indicate KV cache on CUDA memory. When a new query arrives, it retrieves the KV cache for the longest matched prefix. Our rolling hash system allows retrieving cache for partially matched messages.
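The sketch below illustrates the rolling-hash prefix lookup in simplified form. The real system stores KV tensors in an LRU-managed tree on host memory; this illustration only maps prefix hashes to cached entries, and the hash parameters are arbitrary.

# Simplified rolling-hash prefix lookup; the production system uses an LRU tree
# over KV tensors on host memory, which is not modeled here.

def rolling_hashes(token_ids, base=1000003, mod=(1 << 61) - 1):
    h, hashes = 0, []
    for tok in token_ids:
        h = (h * base + tok + 1) % mod
        hashes.append(h)          # hash of token_ids[:i+1]
    return hashes

def longest_cached_prefix(token_ids, cache):
    # `cache` maps a prefix hash -> KV tensors cached for that exact prefix.
    best = None
    for length, h in enumerate(rolling_hashes(token_ids), start=1):
        if h in cache:
            best = (length, cache[h])
    return best  # reuse KV for the longest matched prefix; recompute the rest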
Quantization for Training and Serving
We use int8 quantization on model weights, activations, and attention KV cache. To support this, we implemented customized int8 kernels for matrix multiplications and attention. Different from commonly adopted "post-training quantization" techniques, we natively train our models in int8 precision, eliminating the risk of training/serving mismatch while also significantly improving training efficiency. Quantized training is a complex topic on its own, and we will address it in future posts.
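As a simple illustration of the storage format (not the custom kernels or the native int8 training described above), here is a toy symmetric per-tensor int8 quantizer.

import torch

# Toy symmetric per-tensor int8 quantization, to illustrate the storage format only;
# Character.AI trains natively in int8 with custom kernels, which is not shown here.

def quantize_int8(x: torch.Tensor):
    scale = x.abs().amax() / 127.0                       # symmetric per-tensor scale
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print((dequantize_int8(q, s) - w).abs().max())           # max error is about scale / 2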

Building the Future Together
Efficient inference is crucial for scaling AI systems and integrating them seamlessly into our daily lives. Taken together, the innovations discussed above achieve unprecedented efficiency and reduce inference costs to a level that makes it far easier to serve LLMs at scale. We have reduced serving costs by a factor of 33 compared to when we began in late 2022. Today, if we were to serve our traffic using leading commercial APIs, it would cost at least 13.5X more than with our systems.

Yet this is just the beginning. At Character.AI, we're excited to continue building a future where LLMs are driving innovation and enhancing experiences for everyone worldwide. Join us on this exciting journey as we continue to push the limits of what's possible with AI. Together, we are creating a future where efficient and scalable AI systems are at the heart of every interaction.

###
https://n.news.naver.com/article/050/0000076482?cds=news_edit
Finance industry fears losing jobs to AI automation... "growing possibility of a 3.5-day workweek"
Published June 20, 2024, 9:26 AM; updated June 20, 2024, 10:05 AM
By 정유진 (reporter)

A report finds that, among all industries, the financial sector is the most likely to have its work replaced by artificial intelligence (AI).

According to Bloomberg on the 19th (local time), Citigroup's AI report estimates that 54% of banking tasks could be automated, and a further 12% of roles could see improvements such as productivity gains from AI.

After banking, the report projects the largest degree of work automation in insurance (48%), energy (43%), capital markets (40%), travel (38%), software and platforms (36%), retail (34%), communications and media (33%), public services (30%), and automotive (30%).

The report also notes that major global banks, expecting AI to raise employee productivity and help cut costs, have been gradually adopting it and running various experiments since last year.

Citigroup, for its part, has equipped its developers to experiment with a range of AI technologies, and is using generative AI, which can produce sentences or essays from simple questions or commands, to quickly review regulations running to hundreds of pages.

JPMorgan Chase has begun recruiting AI talent, and its CEO Jamie Dimon has said that with this technology, employers could shorten the workweek to 3.5 days.

Citigroup CTO David Griffiths said, "Generative AI has the potential to transform the banking industry and improve profitability," adding, "At Citi, we are focused on implementing generative AI in a safe and responsible way to empower the company and its employees."