Summary

OpenAI에서는 박사급 AI 연구원의 초봉이 업계 최고 수준인 86만5000달러에 달하는 것으로 나타났습니다. Apple은 4M(Massively Multimodal Masked Modeling)이라는 새로운 멀티모달 학습 프레임워크를 발표했고, DeepSeek는 새로운 코드 언어 모델 DeepSeek-Coder-V2를 공개했습니다. Microsoft는 멀티 에이전트 워크플로우 구축을 위한 저코드 인터페이스인 AutoGen Studio를 소개했으며, Google DeepMind는 비디오 생성 모델과 연동해 동영상을 위한 오디오를 생성하는 V2A 기술을 발표했습니다. Google Research는 PaLM2가 다국어 작업에서 사전 번역 없이 직접 추론할 때 더 나은 성능을 보인다는 연구 결과를 공개했고, 마지막으로 Runway는 고충실도 비디오 생성 모델 Gen-3 Alpha를 발표했습니다.

4M: Massively Multimodal Masked Modeling,

대규모 멀티모달 마스크 모델링

링크, 2024-06-17,
Apple

  • Apple과 EPFL은 4M이라는 새로운 멀티모달 학습 프레임워크를 발표
  • 4M-7과 4M-21 모델 체크포인트 공개
  • 모델 체크포인트는 RGB, Edge, Geometric, Text, Semantic, Feature map 등의 모달리티 포함
  • Apache 2.0 라이선스로 코드와 가중치 배포
  • 단일 Transformer 인코더-디코더 모델을 사용한 학습
  • 다양한 비전 작업을 수행할 수 있는 다재다능한 모델 구현

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence,

코드 인텔리전스에서 폐쇄형 모델의 장벽을 허물다

링크, 2024-06-17,
DeepSeek

  • DeepSeek-Coder-V2는 GPT-4 Turbo와 유사한 성능을 자랑하는 오픈소스 코드 언어 모델
  • DeepSeek-Coder-V2-Base에서 6조 개의 토큰을 추가로 학습하여 성능 향상
  • 코딩 및 수학적 추론 능력 대폭 강화
  • 지원 프로그래밍 언어를 86개에서 338개로 확장
  • 컨텍스트 길이를 16K에서 128K로 확장
  • 연구 및 상업적 사용을 허용하는 허용적(permissive) 라이선스 적용

“오픈AI, 박사급 연구원 초봉 11억”…급여 순위 공개,

오픈AI의 박사급 연구원 초봉 11억 공개

링크, 2024-01-03,
로라

  • 오픈AI의 박사급 AI 연구원 초봉이 86만5000달러로 업계 최고 수준
  • 앤트로픽이 85만5000달러로 두 번째로 높은 초봉 제공
  • 인플렉션 AI, 테슬라, 아마존, 구글 브레인 등의 기업도 높은 초봉 제공
  • AI 기술 수요가 공급을 초과하여 초봉이 높아짐
  • 박사 학위 논문 출판 기록이 중요한 평가 요소로 작용

Generating audio for video,

비디오를 위한 오디오 생성

링크, 2024-06-17,
Google DeepMind

  • Google은 비디오 픽셀과 텍스트 프롬프트를 사용하여 풍부한 사운드트랙을 생성하는 V2A 기술 발표
  • V2A는 비디오 생성 모델과 결합하여 영화의 사운드트랙, 현실적인 사운드 효과 또는 대화를 생성 가능
  • 다양한 비디오 자료에 사운드트랙 생성 가능
  • 오디오 출력의 품질을 높이기 위해 추가 정보로 훈련 과정 개선

Introducing AutoGen Studio: A low-code interface for building multi-agent workflows,

멀티 에이전트 워크플로우 구축을 위한 저코드 인터페이스 AutoGen Studio 소개

링크, 2024-06-17,
Microsoft Research

  • AutoGen Studio는 멀티 에이전트 애플리케이션을 구축하기 위한 저코드 인터페이스 제공
  • 사용자는 간단한 그래픽 인터페이스를 통해 에이전트를 구성하고 워크플로우 작성 가능
  • 에이전트 워크플로우를 테스트하고 디버그할 수 있는 기능 제공
  • 워크플로우를 JSON 파일로 내보내어 다른 애플리케이션에서 사용 가능

Pre-translation vs. direct inference in multilingual LLM applications,

다국어 LLM 애플리케이션에서 사전 번역 대 직접 추론

링크, 2024-06-14,
Google Research

  • PaLM2는 다국어 작업에서 사전 번역 없이 직접 추론이 더 나은 성능을 보임
  • 108개 언어 중 94개 언어에서 직접 추론이 사전 번역보다 우수한 결과
  • 다국어 LLM의 효율성과 효과성을 향상시키기 위한 연구 지속

Introducing Gen-3 Alpha,

Gen-3 Alpha 소개

링크, 2024-06-17,
Runway

  • Gen-3 Alpha는 높은 충실도와 일관성을 갖춘 비디오 생성 모델
  • 텍스트에서 비디오, 이미지에서 비디오, 텍스트에서 이미지 도구 제공
  • 사용자 정의 버전 제공, 예술적 및 내러티브 요구사항에 맞춘 모델 생성 가능
  • 새로운 인프라를 통해 대규모 멀티모달 학습 가능
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each piece of content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

Title,

한글제목

링크, date,
company name

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

Title,

한글제목

링크, date,
company name

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
    ###
    https://arxiv.org/abs/2312.06647

    Apple dropped 4M: Massively Multimodal Masked Modeling! 🔥
    Is this what powers the on-device vision-text backbone?
    > A framework for training any-to-any multimodal foundational models. Training/ Finetuning/ Inference.
    > Release 4M-7 and 4M-21 model checkpoints (trained across tens of tasks and modalities).
    > 198M, 705M and 2.8B model checkpoints.
    > Release specialised Text to Image and image super-resolution specialist model checkpoints.
    > Apache 2.0 license for code and weights!
    > A unified transformer encoder-decoder model is trained on a masked modelling objective.
    > Spread across RGB, Edge, Geometric, Text, Semantic, Feature map, and more modalities.
    > Model checkpoints on the Hub 🤗

    Kudos to EPFL and Apple. I especially liked the any-to-any generation bit paired with multimodal chained generation! ⚡

    4M: Massively Multimodal Masked Modeling
    David Mizrahi, Roman Bachmann, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir
    Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities - including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens.
    4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box, (2) they excel when fine-tuned for unseen downstream tasks or new input modalities, and (3) they can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility.
    Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.
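
    A minimal PyTorch sketch of the objective the abstract describes: a single encoder-decoder predicting a randomly masked subset of discrete multimodal tokens. Everything here (class names, dimensions, the toy data) is illustrative and assumes modality-specific tokenizers have already produced token ids in a shared vocabulary; it is not the released 4M code.

    # Illustrative sketch of 4M-style multimodal masked modeling (not the released implementation).
    import torch
    import torch.nn as nn

    class MaskedMultimodalModel(nn.Module):
        def __init__(self, vocab_size: int, dim: int = 512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.backbone = nn.Transformer(d_model=dim, batch_first=True)  # unified encoder-decoder
            self.mask_query = nn.Parameter(torch.zeros(1, 1, dim))         # learned [MASK] placeholder
            self.head = nn.Linear(dim, vocab_size)

        def forward(self, visible_tokens: torch.Tensor, num_masked: int) -> torch.Tensor:
            src = self.embed(visible_tokens)  # visible subset drawn from all modalities
            # Decoder queries stand in for the masked positions; the real model also adds
            # position and modality embeddings so the decoder knows what to reconstruct.
            tgt = self.mask_query.expand(visible_tokens.size(0), num_masked, -1)
            return self.head(self.backbone(src, tgt))  # logits over the shared token vocabulary

    def split_tokens(tokens: torch.Tensor, visible_ratio: float = 0.25):
        """Randomly keep a small visible subset; the rest become prediction targets."""
        perm = torch.randperm(tokens.size(1))
        cut = max(1, int(tokens.size(1) * visible_ratio))
        return tokens[:, perm[:cut]], tokens[:, perm[cut:]]

    # One toy optimization step on random ids standing in for concatenated RGB/caption/depth tokens.
    model = MaskedMultimodalModel(vocab_size=1024)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    tokens = torch.randint(0, 1024, (2, 256))
    visible, masked = split_tokens(tokens)
    loss = nn.functional.cross_entropy(model(visible, masked.size(1)).reshape(-1, 1024), masked.reshape(-1))
    loss.backward()
    optimizer.step()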

    ###
    https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
    1. Introduction
    We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from DeepSeek-Coder-V2-Base with 6 trillion tokens sourced from a high-quality and multi-source corpus. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-Coder-V2-Base, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K.

    It scores 90.2% on HumanEval and 75.7% on MATH. These are higher numbers than GPT-4-Turbo-0409 according to their technical report.
    More:
    > includes 16B and 236B parameter models
    > further pretrained from DeepSeek-V2 checkpoint
    > uses an additional 6 trillion tokens
    > expands to 338 programming languages
    > context length extended from 16K to 128K
    > permissive license allows for both research and unrestricted commercial use
    Still not quite there for instruction-following capabilities as compared to GPT-4 Turbo but has huge potential to improve.
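
    Since the repo links downloadable checkpoints, a plausible way to try an instruct variant is the standard Hugging Face transformers generation flow below. The repository id and generation settings are assumptions; check the repo's Model Download section for the actual names. The 236B variant would need multi-GPU serving, so the smaller checkpoint is the practical local starting point.

    # Hedged sketch: standard transformers inference; the model id below is an assumption.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"  # assumed name of the 16B instruct checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )

    prompt = "Write a Python function that returns the n-th Fibonacci number."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))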

    ###
    https://www-aitimes-com.cdn.ampproject.org/c/s/www.aitimes.com/news/articleViewAmp.html?idxno=156265
    “오픈AI, 박사급 연구원 초봉 11억”…급여 순위 공개
    급여 협상 서비스 기업 로라 집계
    2024-01-03 박찬 기자
    박사급 AI 연구원의 초기 보상 제안과 최종 보상 제안 비교(사진=로라)
    오픈AI의 박사급 인공지능(AI) 연구원 초봉이 86만5000달러(약 11억3000만원)로 업계 최고 수준인 것으로 나타났다. 최고급 스타트업과 빅테크의 초봉도 9억~10억원에 달하는 것으로 알려졌다. 그만큼 AI 연구원이 부족하다는 설명이다.

    리드라이트는 2일(현지시간) 급여 협상 서비스 기업인 로라의 집계를 인용, 신규 박사급 AI 연구원을 채용한 600여개 기업 중 오픈AI와 앤트로픽이 각각 86만5000달러와 85만5000달러(약 11억2000만원)로 가장 높은 초봉을 제공했다고 보도했다. 초봉에는 기본급과 보너스, 주식 등이 포함된다.

    이에 따르면 오픈AI와 앤트로픽의 라이벌로 꼽히는 인플렉션 AI가 82만5000달러(약 10억8000만원)로 3위를 차지했다.

    이어 테슬라 78만달러(약 10억2000만원), 아마존 71만9000달러(약 9억4000만원), 구글 브레인 69만5000달러(약 9억1000만원) 등으로 빅테크보다 전문 스타트업의 인재 확보 경쟁이 더 치열한 것으로 나타났다.

    그러나 초기 제안과 최종 제안 사이의 협상폭은 구글 리서치가 평균 77%로 가장 높았으며, 마이크로소프트 리서치, 블룸버그 AI, IBM 리서치, 틱톡 등의 순이었다. 구글 리서치의 한 연구원은 초기 제안으로 21만6000달러(약 2억8000만원)를 받았으나, 협상을 통해 243% 증가한 최종 52만6000달러(약 6억9000만원)의 연봉을 받게 됐다.

    박사급 AI 연구원의 초봉 순위(사진=로라)
    이처럼 박사급 AI 연구원의 연봉 수준이 높은 이유는 AI 기술에 대한 전 세계 수요가 실제 공급보다 훨씬 더 크기 때문이다.

    톨비 서베이의 설문조사에 따르면 2021년에는 컴퓨팅 연구 분야에서 수여된 박사 학위가 1691명에 불과했다. 미국에서만 3만3500명의 컴퓨터 및 정보 연구원이 필요하며 수요는 연간 21% 증가하고 있다. 즉 매년 필요한 연구원보다 일자리가 5000개 이상 많다는 것을 의미한다.

    현재 가장 수요가 높은 분야는 컴퓨터 비전, 로봇공학, 자연어 처리(NLP), 생물학, 신경과학 등에 AI를 적용하는 분야다. '챗GPT'가 도입되면서 대형언어모델(LLM)에 대한 전문성은 최고 인기 기술이 됐다.

    리드라이트는 AI 연구원에게는 검증된 연구 능력이 무엇보다 중요하다고 지적했다. 이를 입증하는 것 중 하나를 논문 출판 기록으로 꼽았다.

    업계 최고 수준의 연구원들은 박사 학위 논문만으로 최대 2000번의 인용과 'H-지수(H-index) 10'을 보유하게 된다고 전했다. H-지수 10은 논문 인용횟수가 10이 넘는 논문이 적어도 10편이 된다는 것을 의미한다.

    이 정도 능력이면 높은 직위와 최고 보상을 요구할 수 있는 최고 연구원급 영향력을 가진다는 설명이다.

    ###
    https://deepmind.google/discover/blog/generating-audio-for-video/
    Google DeepMind

    Generating audio for video
    Published
    17 JUNE 2024
    Authors
    Generative Media team

    Video-to-audio research uses video pixels and text prompts to generate rich soundtracks

    Video generation models are advancing at an incredible pace, but many current systems can only generate silent output. One of the next major steps toward bringing generated movies to life is creating soundtracks for these silent videos.

    Today, we're sharing progress on our video-to-audio (V2A) technology, which makes synchronized audiovisual generation possible. V2A combines video pixels with natural language text prompts to generate rich soundscapes for the on-screen action.

    Our V2A technology is pairable with video generation models like Veo to create shots with a dramatic score, realistic sound effects or dialogue that matches the characters and tone of a video.

    It can also generate soundtracks for a range of traditional footage, including archival material, silent films and more — opening a wider range of creative opportunities.


    Example clips and their audio prompts:
    Prompt for audio: Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete
    Prompt for audio: Cute baby dinosaur chirps, jungle ambience, egg cracking
    Prompt for audio: jellyfish pulsating under water, marine life, ocean
    Prompt for audio: A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd
    Prompt for audio: cars skidding, car engine throttling, angelic electronic music
    Prompt for audio: a slow mellow harmonica plays as the sun goes down on the prairie
    Prompt for audio: Wolf howling at the moon

    Enhanced creative control
    Importantly, V2A can generate an unlimited number of soundtracks for any video input. Optionally, a ‘positive prompt’ can be defined to guide the generated output toward desired sounds, or a ‘negative prompt’ to guide it away from undesired sounds.

    This flexibility gives users more control over V2A’s audio output, making it possible to rapidly experiment with different audio outputs and choose the best match.


    Example clips and their audio prompts:
    Prompt for audio: A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi
    Prompt for audio: Ethereal cello atmosphere
    Prompt for audio: A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi

    How it works
    We experimented with autoregressive and diffusion approaches to discover the most scalable AI architecture, and the diffusion-based approach for audio generation gave the most realistic and compelling results for synchronizing video and audio information.

    Our V2A system starts by encoding video input into a compressed representation. Then, the diffusion model iteratively refines the audio from random noise. This process is guided by the visual input and natural language prompts given to generate synchronized, realistic audio that closely aligns with the prompt. Finally, the audio output is decoded, turned into an audio waveform and combined with the video data.


    Diagram of our V2A system, taking video pixel and audio prompt input to generate an audio waveform synchronized to the underlying video. First, V2A encodes the video and audio prompt input and iteratively runs it through the diffusion model. Then it generates compressed audio, which is decoded into an audio waveform.

    To generate higher quality audio and add the ability to guide the model towards generating specific sounds, we added more information to the training process, including AI-generated annotations with detailed descriptions of sound and transcripts of spoken dialogue.

    By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts.
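
    V2A itself is not released, so the snippet below is only a conceptual restatement of the pipeline described above (encode the video, iteratively denoise a compressed audio representation conditioned on the video and an optional text prompt, then decode to a waveform). Every component name is a hypothetical stand-in, not a real API.

    # Conceptual pseudocode of the described V2A pipeline; all components are hypothetical stand-ins.
    import torch

    def generate_audio_for_video(video_frames, prompt, video_encoder, diffusion_model,
                                 audio_decoder, steps: int = 50) -> torch.Tensor:
        video_repr = video_encoder(video_frames)   # compressed representation of the video input
        audio_latent = torch.randn(1, 128, 256)    # start from random noise in a compressed audio space
        for t in reversed(range(steps)):
            # Each diffusion step refines the latent, guided by the visuals and the (optional) prompt.
            audio_latent = diffusion_model.denoise(audio_latent, t, video=video_repr, text=prompt)
        return audio_decoder(audio_latent)         # decoded waveform, ready to be muxed with the video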

    Further research underway
    Our research stands out from existing video-to-audio solutions because it can understand raw pixels and adding a text prompt is optional.

    Also, the system doesn't need manual alignment of the generated sound with the video, which involves tediously adjusting different elements of sounds, visuals and timings.


    Still, there are a number of other limitations we’re trying to address and further research is underway.

    Since the quality of the audio output is dependent on the quality of the video input, artifacts or distortions in the video, which are outside the model’s training distribution, can lead to a noticeable drop in audio quality.

    We’re also improving lip synchronization for videos that involve speech. V2A attempts to generate speech from the input transcripts and synchronize it with characters' lip movements. But the paired video generation model may not be conditioned on transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript.


    Prompt for audio: Music, Transcript: “this turkey looks amazing, I’m so hungry”

    Our commitment to safety and transparency
    We’re committed to developing and deploying AI technologies responsibly. To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development.

    We’ve also incorporated our SynthID toolkit into our V2A research to watermark all AI-generated content to help safeguard against the potential for misuse of this technology.

    Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing. Initial results are showing this technology will become a promising approach for bringing generated movies to life.

    Note: All examples are generated by our V2A technology, which is paired with Veo, our most capable generative video model.

    ###
    https://www.microsoft.com/en-us/research/blog/introducing-autogen-studio-a-low-code-interface-for-building-multi-agent-workflows/
    Microsoft Research Blog
    Introducing AutoGen Studio: A low-code interface for building multi-agent workflows
    Published June 17, 2024

    By Victor Dibia , Principal Research Software Engineer Gagan Bansal , Senior Researcher Jingya Chen , UX Designer Suff Syed , Principal Design Director Adam Fourney , Principal Researcher Erkang (Eric) Zhu , Senior Researcher Chi Wang , Principal Researcher Saleema Amershi , Senior Principal Research Manager

    Multi-agent approaches to AI applications, where multiple foundation model-based agents collaborate to solve problems, are emerging as a powerful paradigm for accomplishing increasingly complex tasks. In September 2023, we released AutoGen – a flexible and open-source Python-based framework for defining, configuring, and composing AI agents to drive multi-agent applications. Today, we are introducing AutoGen Studio (version 0.1.0) – a low-code interface for rapidly building, testing, and sharing multi-agent solutions. AutoGen Studio is built on AutoGen and inherits its features and functionalities, while providing a user-friendly and intuitive interface to create and customize agents, with little to no coding required.

    During the nine months since it was released, AutoGen has been widely adopted by researchers, developers, and enthusiasts who have created a variety of novel and exciting applications – from market research to interactive educational tools to data analysis pipelines in the medical domain. With more than 290 community contributors on GitHub and 890,000 downloads of the Python package (as of May 2024), AutoGen continues to be a leading framework for building and researching multi-agent AI applications.

    AutoGen Studio user interface: PDF Book Gen Session
    A screenshot of the AutoGen Studio interface shows results when two agents are used to address the task, “Create a 4-page kids’ .pdf book with details and pictures about weather patterns in Seattle”.
    AutoGen Studio is the next step forward in enabling developers to advance the multi-agent paradigm. We want to make multi-agent solutions responsibly available to diverse audiences – from academic researchers to professional developers across industries – who want to build multi-agent applications to solve real-world problems. Imagine having access to agents that can automate your vacation planning and grocery shopping, manage your personal finances, help you accomplish your learning goals, or perform any other task you care about. How would you build such agents? What capabilities would you give them? How would you make them work together? How would you ensure they are working as intended?

    These questions motivated us to build AutoGen Studio. With AutoGen Studio, developers can rapidly build, test, deploy, and share agents and agent teams (workflows) with the community.

    Note: AutoGen is primarily a developer tool to enable rapid prototyping and research. It is not a production ready tool. Please see the GitHub repository and documentation for instructions on how to get started.

    What can you do with AutoGen Studio right now?
    We built AutoGen Studio with the following goals in mind:

    Lower the barrier to entry in building multi-agent applications
    Facilitate rapid prototyping and testing of multi-agent solutions
    Cultivate expertise and community by allowing users to share and re-use this technology
    With AutoGen Studio’s early release (v 0.1.0), users can rapidly author agent workflows via a user interface, interactively test and debug agents, reuse artifacts, and deploy workflows.


    The video above shows how users can create skills and models, attach them to agents, create agent workflows, test and deploy them in AutoGen Studio. All in a few clicks.
    Rapidly author agent workflows
    AutoGen Studio provides a “Build” section where users can choose from a library of pre-defined agents and compose them into teams (workflows) that can address tasks in minutes. Furthermore, users can customize agents and agent teams with foundation models, prompts, skills (Python functions that accomplish a specific task, e.g., fetching the weather from a weather provider), and workflows via a graphical user interface. Workflows may be sequential (where agents act in a predefined order) or autonomous chat (where the order in which agents act may be driven by a large language model or custom logic, based on the state of the task).
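
    The workflows authored in the Build section map onto agents defined with the underlying AutoGen Python framework. As a rough illustration, a minimal two-agent autonomous-chat setup using the v0.2-style AutoGen API might look like the sketch below; the model name and API key are placeholders, and AutoGen Studio composes an equivalent configuration for you without writing this code.

    # Minimal two-agent AutoGen sketch of the kind of workflow AutoGen Studio composes graphically.
    from autogen import AssistantAgent, UserProxyAgent

    llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}  # placeholder credentials

    assistant = AssistantAgent(name="assistant", llm_config=llm_config)
    user_proxy = UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        code_execution_config={"work_dir": "coding", "use_docker": False},
    )

    # The user proxy drives the chat and executes any code the assistant writes.
    user_proxy.initiate_chat(
        assistant,
        message="Plot average daily temperatures in Seattle for the last week and save the chart as a PNG.",
    )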

    AutoGen Studio user interface: agent configuration
    In AutoGen Studio, agents can be configured via the user interface. Models and skills can be associated with agents, and agents can be composed into autonomous chat and sequential workflows.
    Debug and test agents
    AutoGen Studio allows developers to immediately test workflows on a variety of tasks and review resulting artifacts (such as images, code, and documents). Developers can also review the “inner monologue” of agent workflows as they address tasks, and view profiling information such as costs associated with the run (such as number of turns and number of tokens), and agent actions (such as whether tools were called and the outcomes of code execution).

    AutoGen Studio user interface: profile sample workflow
    AutoGen Studio user interface: sample workflow
    In AutoGen Studio, users can test workflows, see results, and view visualizations that profile agent actions (such as how often tools were used or code was executed).
    Artifact reuse and deployment
    Users can download the skills, agents, and workflow configurations they create, as well as share and reuse these artifacts. AutoGen Studio also offers a seamless process to export workflows and deploy them as application programming interfaces (APIs) that can be consumed in other applications.

    Specifically, workflows can be exported as JavaScript Object Notation (JSON) files and loaded into any python application, launched as an API endpoint from the command line or wrapped into a Dockerfile that can be deployed on cloud services like Azure Container Apps or Azure Web Apps.
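
    A small hedged sketch of the “load into any python application” path described above: the file name and the inspection below are illustrative, and the exact schema of the export depends on the AutoGen Studio version that produced it.

    # Hedged sketch: inspecting an exported AutoGen Studio workflow before wiring it into a host app.
    import json

    with open("workflow.json") as f:   # file exported from the Studio “Build” section
        workflow = json.load(f)

    # The export bundles the agents, their models/skills, and how they are composed;
    # print the top-level structure to see what this particular version emitted.
    print(json.dumps({key: type(value).__name__ for key, value in workflow.items()}, indent=2))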

    AutoGen Studio user interface: export workflow
    In AutoGen Studio, users can export agent workflows as a JSON configuration file and then reuse them in any python application, launch it as an API from the command line or deploy on a cloud service like Azure Container Apps and Azure Web Apps.
    What is the community creating with AutoGen Studio?
    Over the last few months, we have shared an early version of AutoGen Studio, which has been downloaded more than 154,000 times on PyPI (January – May 2024). Our observations of early usage patterns (based on feedback from social platforms like GitHub discussions, Discord and YouTube) suggest that AutoGen Studio is driving a new group of users who have basic technical capabilities (that is, they can install the tool) and are interested in rapidly testing out ideas but have limited programming skills.

    We have seen these users prototype examples covering tasks like travel planning, pdf brochure generation, market research, structured data extraction, video generation, and visualization generation among others. Importantly, these tasks are accomplished simply by defining agents, giving them access to large language models and skills, adding agents to a workflow, and running tasks with these workflows.


    Users are exploring early use cases such as report/book generation, as seen in the screenshot above. Here, two agents are defined and given access to skills for generating images. The agents are then composed into a workflow where messages and actions are exchanged to solve the task of generating a pdf report.
    Open research questions and next steps
    Orchestrating teams of agents that can explore plans, reflect on actions, and collaborate offers opportunities to build tools that address challenging tasks. We believe that we are just scratching the surface of what may be possible with the multi-agent paradigm, and much is unknown about how best to harness foundation models, let alone foundation model-based agents and multi-agent solutions.

    This leaves open many opportunities for further research.

    For example, the sophisticated interplay between agents in multi-agent paradigms, particularly for increasingly more complex and dynamic domains, highlights many opportunities for multi-agent evaluation and tooling. Open questions include:

    How can we measure the performance, reliability, and reusability of agents across tasks?
    How can we better understand the strengths and limitations of agents?
    How can we explore alternative scenarios and outcomes?
    How can we compare different agent architectures and collaboration protocols?
    These questions require novel methods and metrics that can capture the multi-faceted aspects of multi-agent paradigms and provide actionable insights for developers and users.

    As our understanding of the multi-agent paradigm matures, another opportunity is in distilling design patterns and best practices for building effective agent teams for different types of tasks. For instance:

    What are the optimal number and composition of agents for a given problem?
    What is the best way to distribute responsibilities and coordinate actions among agents?
    What are the trade-offs between centralized and decentralized control, or between homogeneous and heterogeneous agents?
    How can we leverage human oversight and feedback to improve agent reliability and safety?
    These questions require systematic studies and empirical evaluations to discover the key dimensions and principles for designing multi-agent solutions.

    Finally, as agents become more long-lived and ubiquitous in our digital world, an open challenge is in automating and optimizing the agent-creation process itself. For example:

    How can we dynamically spawn agents based on the task requirements and available resources?
    How can we tune agent parameters and workflow configurations to achieve the best performance?
    How can we adapt agent teams to changing environments and user preferences?
    Future design improvements
    Naturally, we see AutoGen Studio as a potential vehicle to study many of these research questions – from improvements in the user experience of authoring workflows to a gallery of shareable artifacts to advanced tools for making sense of agent behaviors.

    We are currently working on a new drag-and-drop experience in AutoGen Studio, designed to transform how users author multi-agent workflows. Our new visual canvas allows users to easily orchestrate and connect agents, providing an intuitive interface for defining collaboration dynamics.

    AutoGen Studio user interface: visual workflow design
    A new visual canvas interface for AutoGen allows users to easily orchestrate and connect agents, providing an intuitive interface for defining collaboration dynamics. Entities such as skills and models can be associated with agents via drag-and-drop interactions.
    Visual workflow design: The heart of our enhanced user interface is a visual canvas where you can literally see your workflow come to life. Drag and drop different agents onto the canvas to build complex conversation patterns. This graphical approach not only simplifies the initial setup but also makes the process of modifying agents and workflows more intuitive.

    A new visual canvas interface for AutoGen allows users to both visualize agent interactions and update properties of each agent in the same view pane.
    Configurable agents, models, and skills: Customize each agent’s role and skills through simple, direct interactions on the canvas. Whether you’re adding new capabilities or tweaking existing ones, the process is straightforward and user-friendly.

    AutoGen Studio user interface: dynamic prototyping and testing
    The proposed visual canvas interface for AutoGen will explore updated visualization of agent internal monologues for improved debugging.
    Dynamic prototyping and testing: Experimentation is key to perfecting agent workflows. With our new interface, you can prototype various agent configurations and immediately test them in a live environment. This real-time interaction allows you to chat with the workflow, observe all agent messages, and pinpoint areas for improvement on the fly.

    AutoGen Studio community gallery
    The new proposed design explores a gallery of curated workflows and entities (such as skills and agents) that can be reused.
    Finally, we are developing a community gallery within AutoGen Studio where users can share, discover, and learn from one another. This gallery will allow you to publish your workflows, agents, and skills, fostering a collaborative environment where everyone can benefit from shared knowledge and innovations.

    Note on responsible AI: Promoting safe and ethical multi-agent solutions
    AutoGen Studio is designed to provide a low-code environment for rapidly prototyping and testing multi-agent workflows. Our goal is to responsibly advance research and practice in solving problems with multiple agents and to develop tools that contribute to human well-being. Along with AutoGen, AutoGen Studio is committed to implementing features that promote safe and reliable outcomes. For example, AutoGen Studio offers profiling tools to make sense of agent actions, and safeguards such as support for Docker environments for code execution. This feature helps ensure that agents operate within controlled and secure environments, reducing the risk of unintended or harmful actions. For more information on our approach to responsible AI in AutoGen, please refer to the transparency FAQs here: https://github.com/microsoft/autogen/blob/main/TRANSPARENCY_FAQS.md. Finally, AutoGen Studio is not production ready; i.e., it does not implement authentication and other security measures that are required for production-ready deployments.

    ###
    https://research.google/blog/pre-translation-vs-direct-inference-in-multilingual-llm-applications/
    Google Research
    Pre-translation vs. direct inference in multilingual LLM applications
    June 14, 2024

    Roman Goldenberg, Research Scientist, Verily AI, and Natalie Aizenberg, Research Software Engineer, Google Research & Verily AI

    A comprehensive evaluation comparing pre-translation with direct inference of PaLM2 on multilingual tasks demonstrates improved performance using direct inference in the source language compared to pre-translation to English: PaLM2 models do not need pre-translation to excel in multilingual tasks.

    Large language models (LLMs) are becoming omnipresent tools for solving a wide range of problems. However, their effectiveness in handling diverse languages has been hampered by inherent limitations in training data, which are often skewed towards English. To address this, pre-translation, where inputs are translated to English before feeding them to the LLM, has become a standard practice.

    Previous research has demonstrated the effectiveness of pre-translation for optimal LLM performance for GPT-3/3.5/4, ChatGPT, PaLM and other models. While pre-translation helps address the language bias issue, it introduces complexities and inefficiencies, and it may lead to information loss. With the introduction of new powerful LLMs trained on massive multilingual datasets, it is time to revisit the assumed necessity of pre-translation.

    In our recent work “Breaking the Language Barrier: Can Direct Inference Outperform Pre-Translation in Multilingual LLM Applications?”, to be presented at NAACL’24, we re-evaluate the need for pre-translation using PaLM2, which has been established as highly performant in multilingual tasks. Our findings challenge the pre-translation paradigm established in prior research and highlight the advantages of direct inference in PaLM2. Specifically, we demonstrate that PaLM2-L consistently outperforms pre-translation in 94 out of 108 languages, offering a more efficient and effective application in multilingual settings while unlocking linguistic authenticity and alleviating the limitations of pre-translation.

    Rethinking multilingual LLM evaluation
    Prior research on evaluating the impact of pre-translation mainly focused on discriminative (close-ended) tasks, such as multiple choice question answering (QA), for which the language of the answer is mostly insignificant. For evaluating generative (open-ended) tasks, such as text summarization or attributed QA, the output needs to be in the source language to compare it to the ground truth (GT). This requires adding an extra post-inference translation step. While for source language inference evaluation (a in the figure below), inference is directly compared to GT in the source language, for pre-translation evaluation (b), LLM inference is translated back to source language (c.1).

    Comparative evaluation of direct inference vs. pre-translation in source language.

    One of the drawbacks of this evaluation scheme is that comparing model output to GT in different languages using standard lexical metrics, such as ROUGE and F1, is language dependent and introduces inconsistencies. Another problem with this approach is that GT answers in open-ended tasks rely primarily on information present within the provided context. Specifically, in reading comprehension Q&A benchmarks, it is common to have the GT be a substring of the original context. This presents a potential disadvantage for pre-translation, which lacks access to the original context from which the GT was extracted.

    To address both these caveats, we perform a complementary evaluation in English by translating the GT and direct inference results to English. Here, instead of translating the pre-translated inference back to the source language, we translate the direct inference output and GT to English (as illustrated below in panels c.2 and c.3, respectively). Then the evaluation against GT is performed in English.

    Comparative evaluation of direct inference vs. pre-translation in English.

    In addition, we found that averaging LLM accuracy metrics across languages, as done in the prior approaches, can be misleading, masking crucial details. To gain a more nuanced understanding, we introduced the Language Ratio metric as an alternative aggregation over commonly used lexical metrics. It is defined as the percentage of languages for which direct inference yields better results than pre-translation.

    The Language Ratio can be computed for any accuracy score of choice (such as F1 or Rouge) over a single inference mode (direct and pre-translation) and language. By inspecting the proportion of languages where one method outperforms the other, rather than averaging language bias scores, a fairer overall comparison and more detailed understanding of relative strengths and weaknesses across languages is possible.
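
    As a concrete reading of the definition above, a short sketch (with made-up per-language scores) of how the Language Ratio would be computed for any chosen accuracy metric:

    # Minimal sketch of the Language Ratio: the percentage of languages for which direct
    # inference beats pre-translation on a chosen score (F1, ROUGE-L, accuracy, ...).
    def language_ratio(direct: dict, pretranslated: dict) -> float:
        langs = direct.keys() & pretranslated.keys()
        wins = sum(direct[lang] > pretranslated[lang] for lang in langs)
        return 100.0 * wins / len(langs)

    # Illustrative, made-up F1 scores keyed by language code.
    direct_f1 = {"sw": 61.2, "th": 58.4, "bn": 55.0, "fi": 70.3}
    pretranslated_f1 = {"sw": 57.9, "th": 52.1, "bn": 56.4, "fi": 66.8}
    print(f"Language Ratio: {language_ratio(direct_f1, pretranslated_f1):.1f}%")  # -> 75.0%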

    Direct inference takes the lead
    Our analysis encompassed a variety of tasks and languages. We employed six publicly available benchmarks to evaluate PaLM2's performance in both discriminative (XCOPA, XStoryCloze and BeleBele benchmarks) and generative tasks (XLSum, TyDiQA-GP and XQuAD) across 108 languages. Two variants of PaLM2 were evaluated: PaLM2-S (Small - Bison) and PaLM2-L (Large - Unicorn), while using the Google Translation API for pre- and post-translation.

    PaLM2-S (left) and PaLM2-L (right) evaluation results, comparing pre-translation (blue) and direct inference (red). Model performance for generative (open-ended) tasks is evaluated both in the source language and in English. Top: Accuracy metrics (accuracy, Rouge-L, F1) measured on various benchmarks. Bottom: Language Ratio metric.

    The results were strikingly different from those reported in prior literature for other models.

    PaLM2-L consistently achieved better performance with direct inference in 94 out of 108 languages evaluated. The advantage was observed for both close- and open-ended tasks, on all benchmarks. The results were consistent across all evaluations — in source language and in English, using standard metrics (Accuracy/F1/Rouge) and the Language Ratio.
    PaLM2-S also favors direct inference in all but the XQuAD benchmark, where the result is less conclusive. A better average F1 score is achieved using direct inference (due to significant improvements in Chinese and Thai), but the Language Ratio is better for pre-translation, which emphasizes the complementary value of this metric.
    Direct inference yielded superior results even in low-resource languages (LRL). This is particularly significant for fostering communication and information access in under-represented languages.
    Language-focused analysis
    While PaLM2-L clearly performs better using direct inference for the majority of languages, pre-translation shows consistent superiority (across benchmarks) for 7 languages: Bambara, Cusco-Collao Quechua, Lingala, Oromo, Punjabi, Tigrinya, and Tsonga. All 7 are LRL, 4 out of 7 are African, with Lingala, the largest, spoken by over 40 million people. Interestingly, the majority (85%) of LRL benefit from direct inference with PaLM2.

    PaLM2-L average direct inference lift over pre-translation inference on LRL. The majority of languages (over 85%) benefit from direct inference with PaLM2, with lifts exceeding 5% (dashed line) in 63% of languages.

    The future of multilingual communication
    The comprehensive comparative analysis we performed in this study suggests that the new generation of LLMs, trained on massive multilingual datasets, can better handle information and communication across languages, eliminating the need for pre-translation for certain languages.

    We are committed to ongoing research in this area, focusing on improving LLM performance for all languages and fostering a more inclusive future for multilingual communication.

    ###
    https://runwayml.com/blog/introducing-gen-3-alpha/
    Introducing Gen-3 Alpha
    A new frontier for high-fidelity, controllable video generation.
    Anastasis Germanidis | June 17th, 2024
    Gen-3 Alpha is the first of an upcoming series of models trained by Runway on a new infrastructure built for large-scale multimodal training. It is a major improvement in fidelity, consistency, and motion over Gen-2, and a step towards building General World Models.
    All of the videos on this page were generated with Gen-3 Alpha with no modifications.

    Prompt: Subtle reflections of a woman on the window of a train moving at hyper-speed in a Japanese city.
    Prompt: An astronaut running through an alley in Rio de Janeiro.
    Prompt: FPV flying through a colorful coral lined streets of an underwater suburban neighborhood.
    Prompt: Handheld tracking shot at night, following a dirty blue balloon floating above the ground in an abandoned old European street.

    Trained jointly on videos and images, Gen-3 Alpha will power Runway's Text to Video, Image to Video and Text to Image tools, existing control modes such as Motion Brush, Advanced Camera Controls, Director Mode as well as upcoming tools for more fine-grained control over structure, style, and motion.

    Gen-3 Alpha will be released with a new set of safeguards, including our new and improved in-house visual moderation system and C2PA provenance standards.
    Prompt: An empty warehouse dynamically transformed by flora that explode from the ground.
    Prompt: Close up shot of a living flame wisp darting through a bustling fantasy market at night.
    Prompt: Handheld tracking shot, following a red balloon floating above the ground in an abandoned street.
    Prompt: A FPV shot zooming through a tunnel into a vibrant underwater space.
    Prompt: A wide symmetrical shot of a painting in a museum. The camera zooms in close to the painting.
    Prompt: Ultra-fast disorienting hyperlapse racing through a tunnel into a labyrinth of rapidly growing vines.
    Prompt: FPV, internal locomotive cab of a train moving at hyper-speed in an old European city.
    Prompt: Zooming in hyper-fast to a dandelion to reveal macro dream-like abstract world.
    Fine-grained temporal control
    Gen-3 Alpha has been trained with highly descriptive, temporally dense captions, enabling imaginative transitions and precise key-framing of elements in the scene.


    Prompt: An extreme close-up shot of an ant emerging from its nest. The camera pulls back revealing a neighborhood beyond the hill.
    Prompt: A tsunami coming through an alley in Bulgaria, dynamic movement.
    Prompt: A FPV drone shot through a castle on a cliff.
    Prompt: Internal window of a train moving at hyper-speed in an old European city.
    Prompt: Handheld camera moving fast, flashlight light, on a white old wall in an old alley at night, a black graffiti that spells ‘Runway’.
    Prompt: Super fast zoom out from the peak of a frozen mountain where a lonely hiker is arriving at the summit.
    Prompt: A first-person POV shot rapidly flies through open doors to reveal a surreal waterfall cascading in the middle of the living room.
    Prompt: A first-person POV shot rapidly flies towards a house's front door at 10x speed.
    Prompt: A pencil drawing an architectural plan.
    Photorealistic Humans
    Gen-3 Alpha excels at generating expressive human characters with a wide range of actions, gestures, and emotions, unlocking new storytelling opportunities.


    Prompt: A cinematic wide portrait of a man with his face lit by the glow of a TV.
    Prompt: A close up portrait of a woman lit by the side, the camera pulls back.
    Prompt: Zoom in shot to the face of a young woman sitting on a bench in the middle of an empty school gym.
    Prompt: A close up of an older man in a warehouse, camera zoom out.
    Prompt: An older man playing piano, lit from the side.
    Prompt: Macro shot to the face freckles of a young woman trying to look for something.
    Prompt: An astronaut walking between stone buildings.
    Prompt: A middle-aged sad bald man becomes happy as a wig of curly hair and sunglasses fall suddenly on his head.
    For artists, by artists
    Training Gen-3 Alpha was a collaborative effort from a cross-disciplinary team of research scientists, engineers, and artists. It was designed to interpret a wide range of styles and cinematic terminology.


    Prompt: View out a window of a giant strange creature walking in rundown city at night, one single street lamp dimly lighting the area.
    Prompt: A man made of rocks walking in the forest, full-body shot.
    Prompt: A slow cinematic push in on an ostrich standing in a 1980s kitchen.
    Prompt: A giant humanoid, made of fluffy blue cotton candy, stomping on the ground, and roaring to the sky, clear blue sky behind them.
    Prompt: Zooming through a dark forest with neon light flora lighting up.
    Prompt: A cyclone of broken glass in an urban alleyway. dynamic movement.
    Prompt: A man standing in front of a burning building giving the 'thumbs up' sign.
    Prompt: Highly detailed close up of a bacteria.
    Prompt: An ultra-wide shot of a giant stone hand reaching out of a pile of rocks at the base of a mountain.
    Prompt: Aerial view shot of a cloaked figure elevating in the sky between skyscrapers.
    Prompt: An oil painting of a natural forest environment with colorful maple trees and cinematic parallax animation.
    Prompt: A Japanese animated film of a young woman standing on a ship and looking back at camera.
    Prompt: A close-up shot of a young woman driving a car, looking thoughtful, blurred green forest visible through the rainy car window.
    Prompt: Aerial shot of a drone moving fast in a dense green jungle.
    Prompt: Hyperlapse shot through a corridor with flashing lights. A silver fabric flies through the entire corridor.
    Prompt: Aerial shot of the ocean. a maelstrom forms in the water swirling around until it reveals the fiery depths below.
    Prompt: A push through an ocean research outpost.
    Prompt: A woman singing and standing in a concert stage with a bright light in the background.
    Industry Customization
    As part of the family of Gen-3 models, we have been collaborating and partnering with leading entertainment and media organizations to create custom versions of Gen-3.
    Customization of Gen-3 models allows for more stylistically controlled and consistent characters, and targets specific artistic and narrative requirements, among other features.
    For companies interested in fine-tuning and custom models, reach out to us using the contact form linked from the original post.