LG AI Research는 EXAONE 3.0을 공개하였고, OpenAI는 Structured Outputs 기능을 도입하였습니다. Meta는 Self-Taught Evaluators 접근법을 소개하였으며, Hugging Face는 Idefics3-8B를 출시했습니다. Black Forest Labs는 FLUX.1 모델을 발표하였고, BC카드는 K-금융 특화 AI를 무상 공개했습니다.

LG AI Research, EXAONE 3.0 발표

링크, 2024년 8월 7일

  • EXAONE 3.0 7.8B Instruction Tuned 모델 공개
    • 7.8B 파라미터와 8조 개의 토큰 데이터로 훈련된 디코더 전용 트랜스포머 아키텍처 기반
  • 영어와 한국어에서 글로벌 최상위 수준의 성능 달성
    • 영어: 실세계 사용 사례에서 평균 1위, 수학·코딩 등 벤치마크에서도 우수한 성능 기록
    • 한국어: 실세계 사용 사례와 일반 성능에서 모두 최상위 결과
  • 경제성 확보: 3년간의 연구개발로 비용을 초기 모델(EXAONE 1.0)의 6% 수준으로 절감
    • EXAONE 2.0 대비 추론 처리 시간 56% 단축, 비용 72% 절감
  • AI 윤리와 투명성 강조
    • Red Teaming 과정을 거쳐 윤리성과 보안 평가 수행
    • 비차별적이고 법적 문제 없는 답변 제공, 개선 필요 영역 투명하게 공개

OpenAI, Structured Outputs 기능 도입

링크, 2024년 8월 6일

  • API에 Structured Outputs 기능 추가
    • 개발자가 제공한 JSON 스키마에 맞게 모델 출력 보장
    • 복잡한 JSON 스키마 준수 평가에서 100%의 신뢰성 달성
  • 새로운 모델 gpt-4o-2024-08-06 출시
    • 복잡한 JSON 스키마 준수 평가에서 40% 미만에 그친 기존 모델(gpt-4-0613)보다 크게 높은 점수(100%) 기록

OpenAI, 주요 인사 변동

링크, 2024년 8월 6일

  • 공동 창업자 John Schulman, Greg Brockman, Peter Deng 등 주요 인사 이탈
    • John Schulman은 경쟁사 Anthropic으로 이동
    • Greg Brockman은 안식년 계획
    • Peter Deng은 퇴사
  • 올해 초에도 주요 인사 이탈
    • 공동 창업자 Andrej Karpathy, Jan Leike, Ilya Sutskever 퇴사
  • OpenAI의 새로운 음성 기능에 대한 긍정적인 초기 평가

Meta, Self-Taught Evaluators 발표

링크, 2024년 8월 5일

  • 인간의 선호 데이터 없이 모델 평가자를 향상시키는 접근법 소개
    • 대조적 모델 출력을 생성하고 LLM-as-a-Judge를 훈련하여 최종 판단 생성
    • 개선된 예측을 사용하여 반복적으로 훈련 수행
  • Llama3-70B-Instruct 모델 성능 향상
    • RewardBench에서 75.4에서 88.3으로 성능 향상 (다수결로 88.7)
    • GPT-4와 같은 기존 평가자를 능가하는 성능 달성

Hugging Face, Idefics3-8B 발표

링크, 2024년 8월 4일

  • 텍스트와 이미지를 모두 처리할 수 있는 멀티모달 모델
    • SigLip 비전 백본과 Llama 3.1 8B 텍스트 백본 통합
    • 문서 질문 응답 성능(DocVQA) 87.7, MMStar 55.9 달성
    • 최대 10K 컨텍스트 지원
  • OCR, 문서 이해 및 시각적 추론 능력 향상
  • Apache 2.0 라이선스로 공개
  • Transformers 라이브러리와 통합

Black Forest Labs, FLUX.1 모델 발표

링크, 2024년 8월 1일

  • 텍스트-이미지 생성 모델 FLUX.1 시리즈 발표
    • FLUX.1 [pro], FLUX.1 [dev], FLUX.1 [schnell] 세 가지 변형 제공
    • 다양한 해상도와 종횡비 지원(0.1~2.0 메가픽셀)
    • 12B 파라미터 하이브리드 아키텍처 사용
    • Latent adversarial diffusion distillation 기법 적용
  • 시드 펀딩으로 3100만 달러 확보
    • 주요 투자자: Andreessen Horowitz, Brendan Iribe, Michael Ovitz 등
  • 높은 품질의 텍스트-이미지 생성 능력
    • Midjourney v6.0, DALL·E 3 (HD) 등보다 우수한 성능

BC카드, K-금융 특화 AI 무상 공개

링크, 2024년 7월 25일

  • 한국 금융권에 최적화된 거대 언어 모델 (LLM) 공개
    • Llama 3 기반, 200억 개의 파라미터 사용
    • 한국어 학습 능력 및 다양한 금융 지식 정보 탑재
  • 2만여 개의 금융 지식 학습 데이터와 함께 공개
  • 금융 AX 분야 발전에 기여

Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct
LG AI Research
8/7/24

EXAONE 3.0 : Showcasing Our First Open-Source LLM with Global Top-Level Performance
The three-year journey from version 1.0 to 3.0 of LG AI Research’s EXAONE has not been an easy one, as we have continued our research and released upgraded models and external commercialization results every year. From proving the development potential of AI technology in various industries one by one, we’ve created models that users can better utilize between the two pillars of performance and cost and developed expert-level AI that can be applied to real-world industrial fields.

[Figure: EXAONE Milestone]

Release of the EXAONE 3.0 7.8B Instruction Tuned Language Model

August 2024. Finally, we are excited to announce EXAONE 3.0. Among various EXAONE 3.0 language model lineups, we are releasing the 7.8B Instruction Tuned model as an open source for research. We hope that this model will help AI researchers in Korea and abroad to conduct more meaningful research and help the AI ecosystem move forward.


The 7.8B model released this time is based on the Decoder-only Transformer Architecture in line with recent trends, with 7.8B parameters and 8T training data (tokens). This post will introduce the main features, performance evaluation results, and insights of EXAONE 3.0 7.8B Instruction Tuned language model. For our performance evaluation, we utilized a combination of publicly available datasets and our own benchmark datasets to compare the performance of the 7.8B model with the latest AI models that support English and Korean, which are similar in size to the 7.8B model.


Key Takeaways

■ Achieved global top level in English : Ranked 1st average in real-world use cases and excellent performance in benchmarks

The English performance of the 7.8B model is at the Global Top-level compared to other models. EXAONE is aiming to be a high-level Expert AI that can be utilized in specialized industries. In order for AI models to be utilized in specialized industries and fields of expertise, they must perform well in real-world use cases, i.e., in a complex manner so humans can trust and use them. To evaluate this aspect, the Chatbot Arena method has recently been widely used, which is a method of directly using and evaluating models based on features that humans often use. While this evaluation is time-consuming, an accurate assessment of the real-world utility of the model is an advantage it provides. To confirm the English performance of the 7.8B model, we selected four key benchmarks that are similar to how Chatbot Arena is evaluated and evaluated the model on items with high human utilization. The results showed that EXAONE 7.8B model ranked first in most benchmarks, with the highest average score.

It also demonstrated superior performance on benchmarks. It ranked first in average scores for math and coding, demonstrating superiority over other models. And it also achieved strong performance results in reasoning.

[Figure: Evaluation Results of Real-world Use Cases (English)]

[Figure: Benchmark – Math, Coding, Reasoning (English)]

■ Clearly outstanding Korean language performance : Ranked first in average scores for both real-world use cases and benchmarks

EXAONE 7.8B model is a bilingual model that targets both English and Korean languages. For the Korean performance evaluation, we used two benchmarks to check the performance for real-world use cases, and configured multiple benchmarks to check general performance. As a result, we were able to see top overall results in both real-world use cases and general performance.

[Figure: Evaluation Results of Real-world Use Cases (Korean)]

[Figure: Benchmark (Korean)]

■ Securing economic feasibility : Reduced to 6% of the cost of the initially released model through three years of research and development

In order for AI to be applied to our lives, it is essential to improve performance as well as enhance economic feasibility. Since the release of EXAONE 1.0 in 2021, we have spent the past three years focusing on research and development in AI model compression technologies to achieve cost efficiency. As a result, the 7.8B model released shows a 56% reduction in inference processing time and a 72% reduction in cost compared to EXAONE 2.0. In particular, it is a significant reduction in cost, bringing it down to just 6% of the cost of the initially released EXAONE 1.0.

[Figure: EXAONE 3.0 Performance Improvement]

■ Ethical transparency : In addition to excellent results, disclosure of areas requiring improvement

LG AI Research always considers AI ethics in the research and development process of AI models. EXAONE 3.0 7.8B Instruction Tuned language model also underwent a Red Teaming process to assess its ethics and security and was evaluated using both internal and external third-party datasets.

While the model released this time is excellent at providing non-sexually discriminatory and legal answers, there are areas that need to be improved. We disclosed the evaluation results as they are because we believe that transparent disclosure of information is a prerequisite for the development of AI ethics. We hope that researchers will conduct more active research on AI ethics based on this disclosure, and LG AI Research will also continue to research AI ethics.

[Figure: Evaluation Results of Harmlessness (Korean Large Language Model Trustworthiness Benchmark Data)]

You can view the detailed information, including the model's performance evaluation results, through the link below and directly download and use the 7.8B model. We hope that the release of this model will contribute to assisting various research and development by AI researchers and enhancing technological competitiveness.
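
Below is a minimal, unofficial usage sketch for the released checkpoint, assuming the Hugging Face repo id linked above and that the repository ships custom modeling code (hence `trust_remote_code=True`); the dtype, device mapping, and decoding settings are illustrative choices, not from the post.

```python
# Hedged sketch: loading the EXAONE 3.0 7.8B Instruct checkpoint with Transformers.
# Repo id from the link above; generation settings here are my own illustrative defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # assumption: the repo provides custom model code
    device_map="auto",
)

messages = [{"role": "user", "content": "EXAONE 3.0을 한 문장으로 소개해 줘."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```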

###
https://openai.com/index/introducing-structured-outputs-in-the-api/
OpenAI
August 6, 2024

Introducing Structured Outputs in the API
We are introducing Structured Outputs in the API—model outputs now reliably adhere to developer-supplied JSON Schemas.

Last year at DevDay, we introduced JSON mode—a useful building block for developers looking to build reliable applications with our models. While JSON mode improves model reliability for generating valid JSON outputs, it does not guarantee that the model’s response will conform to a particular schema. Today we’re introducing Structured Outputs in the API, a new feature designed to ensure model-generated outputs will exactly match JSON Schemas provided by developers.

Generating structured data from unstructured inputs is one of the core use cases for AI in today’s applications. Developers use the OpenAI API to build powerful assistants that have the ability to fetch data and answer questions via function calling, extract structured data for data entry, and build multi-step agentic workflows that allow LLMs to take actions. Developers have long been working around the limitations of LLMs in this area via open source tooling, prompting, and retrying requests repeatedly to ensure that model outputs match the formats needed to interoperate with their systems. Structured Outputs solves this problem by constraining OpenAI models to match developer-supplied schemas and by training our models to better understand complicated schemas.

On our evals of complex JSON schema following, our new model gpt-4o-2024-08-06 with Structured Outputs scores a perfect 100%. In comparison, gpt-4-0613 scores less than 40%.

With Structured Outputs, gpt-4o-2024-08-06 achieves 100% reliability in our evals, perfectly matching the output schemas.
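
A minimal sketch of the new `response_format` option described above; the example schema (`calendar_event` and its fields) is made up for illustration, so check OpenAI's API reference for the authoritative shape.

```python
# Hedged sketch: asking gpt-4o-2024-08-06 to emit JSON matching a developer-supplied schema.
# The schema name and fields below are illustrative, not from the announcement.
from openai import OpenAI

client = OpenAI()

event_schema = {
    "name": "calendar_event",  # hypothetical schema for the example
    "strict": True,            # enforce exact schema adherence
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "date": {"type": "string"},
            "attendees": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "date", "attendees"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Alice and Bob have lunch on Friday at noon."}],
    response_format={"type": "json_schema", "json_schema": event_schema},
)
print(response.choices[0].message.content)  # JSON conforming to the schema above
```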

###
OpenAI
August 6, 2024


• Cofounder John Schulman is heading to rival Anthropic
• Cofounder Greg Brockman is taking a sabbatical
• Product leader Peter Deng is also departing

This is after other key members left earlier this year:

• Cofounder Andrej Karpathy left in Feb
• Jan Leike, who led OpenAI safety team, left in May
• Chief Scientist and co-founder Ilya Sutskever also left in May

It seems like the company is in free fall as many key employees are leaving – some going directly to rivals like Anthropic and Google.

This is happening as Google's Gemini overtook GPT-4o last week. OpenAI is also finding its business model under attack from Meta's open source AI model strategy.

Tough times lie ahead, but there may be some light at the end of the tunnel. Early testers of OpenAI's new voice feature are sharing rave reviews, and it may just be the next big thing in AI.

###
https://arxiv.org/abs/2408.02666
META
[Submitted on 5 Aug 2024]
Self-Taught Evaluators
Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
Meta presents Self-Taught Evaluators, an approach to improve model-based evaluators using synthetic training data only.
It first generates contrasting outputs (good and bad model responses) and trains an LLM-as-a-Judge to produce reasoning traces and final judgments.
The self-improvement scheme repeats the training process in an iterative way using its improved predictions.
Keep in mind that this doesn't use any labeled preference data so no human preference judgements are required.
They claim to outperform LLM-judges such as GPT-4 and match top-performing reward models trained on labeled examples.
"Self-Taught Evaluator can improve a strong LLM (Llama3-70BInstruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench."
This is another interesting way to use synthetic data to iteratively improve the evaluation capabilities of the LLM.
This sounds like a good application for small language models but I read in the paper that these were not tried. The seed model needs to have the capability to generate reasonable evaluations (i.e., already instruction-tuned to human preferences).
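
To make the data flow concrete, here is a rough, unofficial sketch of the iterative loop described in the abstract; `generate_pair`, `judge`, and `finetune` stand in for real model calls and training jobs, and the filtering rule is simplified.

```python
# Hedged sketch of the Self-Taught Evaluator loop: synthesize contrasting outputs,
# let the current judge produce reasoning traces, keep judgments whose verdict picks
# the known-good response, and retrain the judge on them. Placeholders, not Meta's code.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Example:
    instruction: str
    chosen: str         # synthetic "good" response
    rejected: str       # synthetic contrasting "bad" response
    judgment: str = ""  # reasoning trace + verdict from the judge

def self_taught_evaluator(
    instructions: List[str],
    generate_pair: Callable[[str], Tuple[str, str]],  # instruction -> (chosen, rejected)
    judge: Callable[[Example], str],                  # current LLM-as-a-Judge
    finetune: Callable[[List[Example]], None],        # SFT on the judge's own traces
    iterations: int = 3,
) -> None:
    for _ in range(iterations):
        keep: List[Example] = []
        for inst in instructions:
            chosen, rejected = generate_pair(inst)
            ex = Example(inst, chosen, rejected)
            ex.judgment = judge(ex)
            # Simplified filter: keep traces whose final verdict matches the known label.
            if ex.judgment.strip().endswith("A"):  # assume "A" denotes the chosen response
                keep.append(ex)
        finetune(keep)  # the improved judge is reused in the next iteration
```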

###
https://huggingface.co/datasets/argilla/magpie-ultra-v0.1
META
8/1/24

The first synthetic dataset created with Meta Llama 3.1 405B has been released. 🎏 MagPie-Ultra is the first open dataset using Llama 3.1 405B-Instruct FP8 to generate 50,000 synthetic instruction pairs using the MagPie recipe and Argilla distilabel. It includes challenging instructions for coding, math, data analysis, creative writing, advice seeking, and brainstorming. ⚗️
MagPie datasets are created by prompting LLMs with "empty" prompts that consist only of starting special tokens, allowing the model to auto-regressively generate user queries and corresponding responses, which are then filtered to select high-quality data. 👨‍🎓
Note: The dataset is unfiltered but includes quality & difficulty scores, embeddings, topics, and safety scores from ArmorRM and LlamaGuard. 🛡️
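
A rough sketch of the "empty prompt" trick using a smaller Llama 3.1 Instruct model (the dataset itself was built with the 405B model via distilabel); the header string follows the Llama 3 chat template, and both it and the sampling settings are assumptions.

```python
# Hedged sketch of the MagPie recipe: feed only the chat-template header for a user
# turn, let the model invent the user query, then answer it as a normal chat turn.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # smaller stand-in for 405B-Instruct
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# 1) "Empty" prompt: begin-of-text plus the user header, nothing else (Llama 3 template).
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
enc = tok(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
gen = model.generate(**enc, max_new_tokens=128, do_sample=True, temperature=1.0)
user_query = tok.decode(gen[0][enc["input_ids"].shape[1]:], skip_special_tokens=True)

# 2) Feed the self-generated query back to obtain the paired response.
chat_ids = tok.apply_chat_template(
    [{"role": "user", "content": user_query}], add_generation_prompt=True, return_tensors="pt"
).to(model.device)
resp = model.generate(chat_ids, max_new_tokens=256)
print(user_query)
print(tok.decode(resp[0][chat_ids.shape[1]:], skip_special_tokens=True))
```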


###
https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3
8/4/24
HuggingFaceM4 is the multimodal team at Hugging Face, working on vision-language models.
Introducing Idefics 3 8B Llama 3, Apache 2.0 licensed VLM with enhanced Document QA capabilities! 🔥
> Vision backbone: SigLip, Text backbone: Llama 3.1 8B
> Text + Image input w/ text output
> 8.5B parameter model
> Supports up to 10K context
> Apache 2.0 licensed
> DocVQA 87.7; MMStar 55.9 (massive increase over Idefics 2)
> Integrated with Transformers
Memory-wise, with 4-bit, you should be able to run it < 5GB VRAM ⚡
Open datasets and open models. Kudos to Hugo Laurençon
& Andi for sprinting and shipping; it's such a brilliant checkpoint!
Transformers version: until the next Transformers pypi release, please install Transformers from source and use this PR to be able to use Idefics3. TODO: change when new version.

Idefics3
Idefics3 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon Idefics1 and Idefics2, significantly enhancing capabilities around OCR, document understanding and visual reasoning.

We release the checkpoints under the Apache 2.0.

Model Summary
Developed by: Hugging Face
Model type: Multi-modal model (image+text)
Language(s) (NLP): en
License: Apache 2.0
Parent Models: google/siglip-so400m-patch14-384 and meta-llama/Meta-Llama-3.1-8B-Instruct
Resources for more information:
Idefics1 paper: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Idefics2 paper: What matters when building vision-language models?
Idefics3 paper: Coming soon (TODO)
Uses
Idefics3-8B can be used to perform inference on multimodal (image + text) tasks in which the input is composed of a text query along with one (or multiple) image(s). Text and images can be arbitrarily interleaved. That includes image captioning, visual question answering, etc. This model does not support image generation.

The post-training of Idefics3-8B involves only a supervised fine-tuning stage, without RLHF alignment. As a result, the model may produce short answers or require prompt iterations to fully address the user's request. Adding a prefix to the assistant's response, such as "Let's fix this step by step" has been found to effectively influence the generated output.

To fine-tune Idefics3-8B on a specific task, we provide fine-tuning codes for Idefics2 that can be adapted (with almost no changes) to Idefics3:

With the TRL library: Script
With the Hugging Face Trainer: Tutorial notebook
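
For reference, a minimal inference sketch with Transformers (per the note above, a source install may be required); the image URL and question are placeholders I made up.

```python
# Hedged sketch: document question answering with Idefics3-8B via Transformers.
# Model id from the card above; the image URL below is a placeholder, not a real file.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = load_image("https://example.com/sample_invoice.png")  # placeholder document image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the total amount on this invoice?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```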

###
https://huggingface.co/black-forest-labs/FLUX.1-schnell
Announcing Black Forest Labs
Aug 1, 2024


by BlackForestLabs, in News

Today, we are excited to announce the launch of Black Forest Labs. Deeply rooted in the generative AI research community, our mission is to develop and advance state-of-the-art generative deep learning models for media such as images and videos, and to push the boundaries of creativity, efficiency and diversity. We believe that generative AI will be a fundamental building block of all future technologies. By making our models available to a wide audience, we want to bring its benefits to everyone, educate the public and enhance trust in the safety of these models. We are determined to build the industry standard for generative media. Today, as the first step towards this goal, we release the FLUX.1 suite of models that push the frontiers of text-to-image synthesis.


The Black Forest Team

We are a team of distinguished AI researchers and engineers with an outstanding track record in developing foundational generative AI models in academic, industrial, and open-source environments. Our innovations include creating VQGAN and Latent Diffusion, The Stable Diffusion models for image and video generation (Stable Diffusion XL, Stable Video Diffusion, Rectified Flow Transformers), and Adversarial Diffusion Distillation for ultra-fast, real-time image synthesis.

Our core belief is that widely accessible models not only foster innovation and collaboration within the research community and academia, but also increase transparency, which is essential for trust and broad adoption. Our team strives to develop the highest quality technology and to make it accessible to the broadest audience possible.

Funding

We are excited to announce the successful closing of our Series Seed funding round of $31 million. This round was led by our main investor, Andreessen Horowitz, including notable participation from angel investors Brendan Iribe, Michael Ovitz, Garry Tan, Timo Aila and Vladlen Koltun and other renowned experts in AI research and company building. We have received follow-up investments from General Catalyst and MätchVC to support us on our mission to bring state-of-the-art AI from Europe to everyone around the world.

Furthermore, we are pleased to announce our advisory board, including Michael Ovitz, bringing extensive experience in the content creation industry, and Prof. Matthias Bethge, pioneer of neural style transfer and leading expert in open European AI research.

FLUX.1 Model Family


We release the FLUX.1 suite of text-to-image models that define a new state-of-the-art in image detail, prompt adherence, style diversity and scene complexity for text-to-image synthesis.

To strike a balance between accessibility and model capabilities, FLUX.1 comes in three variants: FLUX.1 [pro], FLUX.1 [dev] and FLUX.1 [schnell]:

FLUX.1 [pro]: The best of FLUX.1, offering state-of-the-art image generation with top-of-the-line prompt following, visual quality, image detail and output diversity. Sign up for FLUX.1 [pro] access via our API here. FLUX.1 [pro] is also available via Replicate and fal.ai. Moreover, we offer dedicated and customized enterprise solutions – reach out via flux@blackforestlabs.ai to get in touch.
FLUX.1 [dev]: FLUX.1 [dev] is an open-weight, guidance-distilled model for non-commercial applications. Directly distilled from FLUX.1 [pro], FLUX.1 [dev] obtains similar quality and prompt adherence capabilities, while being more efficient than a standard model of the same size. FLUX.1 [dev] weights are available on HuggingFace and can be directly tried out on Replicate or Fal.ai. For applications in commercial contexts, get in touch via flux@blackforestlabs.ai.
FLUX.1 [schnell]: our fastest model is tailored for local development and personal use. FLUX.1 [schnell] is openly available under an Apache 2.0 license. Similar to FLUX.1 [dev], weights are available on Hugging Face and inference code can be found on GitHub and in HuggingFace’s Diffusers. Moreover, we’re happy to have day-1 integration for ComfyUI.

Transformer-powered Flow Models at Scale

All public FLUX.1 models are based on a hybrid architecture of multimodal and parallel diffusion transformer blocks and scaled to 12B parameters. We improve over previous state-of-the-art diffusion models by building on flow matching, a general and conceptually simple method for training generative models, which includes diffusion as a special case. In addition, we increase model performance and improve hardware efficiency by incorporating rotary positional embeddings and parallel attention layers. We will publish a more detailed tech report in the near future.

A new Benchmark for Image Synthesis

FLUX.1 defines the new state-of-the-art in image synthesis. Our models set new standards in their respective model class. FLUX.1 [pro] and [dev] surpass popular models like Midjourney v6.0, DALL·E 3 (HD) and SD3-Ultra in each of the following aspects: Visual Quality, Prompt Following, Size/Aspect Variability, Typography and Output Diversity. FLUX.1 [schnell] is the most advanced few-step model to date, outperforming not only its in-class competitors but also strong non-distilled models like Midjourney v6.0 and DALL·E 3 (HD). Our models are specifically finetuned to preserve the entire output diversity from pretraining. Compared to the current state-of-the-art they offer drastically improved possibilities, as shown below.



All FLUX.1 model variants support a diverse range of aspect ratios and resolutions between 0.1 and 2.0 megapixels, as shown in the following example.


Up Next: SOTA Text-to-Video for All

Today we release the FLUX.1 text-to-image model suite. With their strong creative capabilities, these models serve as a powerful foundation for our upcoming suite of competitive generative text-to-video systems. Our video models will unlock precise creation and editing at high definition and unprecedented speed. We are committed to continue pioneering the future of generative media.

FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post.

Key Features
Cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives.
Trained using latent adversarial diffusion distillation, FLUX.1 [schnell] can generate high-quality images in only 1 to 4 steps.
Released under the apache-2.0 licence, the model can be used for personal, scientific, and commercial purposes.
Usage
We provide a reference implementation of FLUX.1 [schnell], as well as sampling code, in a dedicated github repository. Developers and creatives looking to build on top of FLUX.1 [schnell] are encouraged to use this as a starting point.
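
Besides the reference repository, the weights can also be run through Diffusers; the sketch below assumes the Hugging Face repo id above and a recent Diffusers release with the FLUX pipeline, and the prompt and settings are illustrative only.

```python
# Hedged sketch: few-step text-to-image with FLUX.1 [schnell] via Diffusers.
# guidance_scale=0.0 reflects the distilled, guidance-free sampling (assumption).
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit the 12B model on smaller GPUs

image = pipe(
    prompt="a watercolor fox reading a newspaper in the Black Forest",
    num_inference_steps=4,  # schnell is distilled for 1-4 step generation
    guidance_scale=0.0,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux_schnell_sample.png")
```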

###
https://github.com/Alpha-VLLM/Lumina-mGPT?tab=readme-ov-file#local-gradio-demos
Lumina-mGPT
A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.
[2024-07-08] 🎉🎉🎉 Lumina-mGPT is released! 🎉🎉🎉
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Ominiponent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.
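
As a toy illustration of the mGPT idea only (not the released code), the following shows a decoder-style transformer trained with plain next-token prediction over an interleaved sequence of text tokens and discretized image tokens sharing one vocabulary; all sizes are made up and positional embeddings are omitted for brevity.

```python
# Toy illustration of multimodal next-token prediction over interleaved text/image tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

V_TEXT, V_IMG, D = 32_000, 8_192, 512  # made-up vocabulary sizes and model width
V = V_TEXT + V_IMG                      # unified token space for text + image codes

class TinyMGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)  # causal mask makes it decoder-like
        self.lm_head = nn.Linear(D, V)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T)
        T = tokens.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(self.emb(tokens), mask=causal)
        return self.lm_head(h)

# Interleaved sequence: text ids in [0, V_TEXT), image ids (e.g. from a VQ tokenizer) above that.
text = torch.randint(0, V_TEXT, (1, 16))
image = torch.randint(V_TEXT, V, (1, 64))
seq = torch.cat([text, image], dim=1)

logits = TinyMGPT()(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, V), seq[:, 1:].reshape(-1))  # next-token prediction
print(loss.item())
```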

###
https://github.com/THUDM/CogVideo/tree/main?tab=readme-ov-file
CogVideo && CogVideoX
News: 2024/8/6: We have also open-sourced the 3D Causal VAE used in CogVideoX-2B, which can reconstruct videos almost losslessly.
CogVideoX-2B is the latest open-source video generation model from ZhiPu AI, renowned for its powerful video creation capabilities. By simply inputting text or images, users can effortlessly generate high-quality video content. CogVideoX-2B is the first in the CogVideoX series, featuring 2 billion parameters and sharing the same lineage as ZhiPu AI's AI video generation product, "Qingying."

CogVideoX-2B integrates several cutting-edge technologies, making it a leader in the video generation field.

3D Variational Autoencoder (3D VAE): Utilizing an innovative three-dimensional convolution approach, the 3D VAE compresses video data across both spatial and temporal dimensions, achieving unprecedented compression rates and superior reconstruction quality. The model architecture includes an encoder, decoder, and a latent space regularizer, ensuring coherent and logical information processing through causal convolution mechanisms.

End-to-End Video Understanding Model: This enhancement improves the model's comprehension of text and adherence to instructions, ensuring the generated videos meet user requirements, even with long and complex prompts.

Expert Transformer Technology: This technology allows for deep parsing of encoded video data, integrating textual inputs to create high-quality, narrative-rich video content.
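
For context, a hedged sketch of running the released 2B checkpoint through the Diffusers pipeline that accompanies it; the prompt and sampling settings are illustrative and not taken from the README.

```python
# Hedged sketch: text-to-video with CogVideoX-2B via Diffusers.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # keep VRAM usage modest

frames = pipe(
    prompt="a panda playing a small guitar by a quiet forest stream",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "cogvideox_sample.mp4", fps=8)
```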

###
https://huggingface.co/BCCard
BC카드, 국내 최적화 거대언어모델 무상공개…‘금융 GPT’ 제공한다
구현주 기자, 2024. 7. 25. 10:48
[마이데일리 = 구현주 기자] 국내 금융에 최적화된 거대언어모델(이하 LLM)이 나왔다. LLM은 대용량 인간 언어를 이해하고 생성하도록 훈련된 AI(인공지능) 모델로 생성형 AI 핵심 기술이다.

BC카드는 25일 국내 금융권에서 처음으로 개발한 K-금융 특화 AI를 무상 공개한다고 밝혔다.

이번에 개발된 ‘K-금융 특화 AI’는 BC카드 IT기획본부가 KT 기술혁신부문 산하 KT컨설팅그룹 AI 리드와 협업해 지난 6개월간 연구 끝에 국내에 최적화한 LLM이다.

K-금융 특화 AI는 메타(페이스북)의 거대 언어모델(Llama 3)을 기반으로 한국어 학습 능력은 물론 다양한 금융 지식 정보까지 탑재했다.

현재 국내에서 공개된 대부분 LLM은 80억개 수준 파라미터를 갖추고 있지만 ‘K-금융 특화 AI’는 200억개 파라미터를 활용할 수 있다. 파라미터는 생성형 AI가 정보를 학습하고 기억하기 위해 필요한 기본 단위다. 파라미터가 많을수록 축적된 자료를 바탕으로 복잡한 학습을 통해 학습하지 않았던 문제를 해결할 수 있을 뿐만 아니라 정교한 예측과 분석도 가능해진다.

K-금융 특화 AI 정확도는 91%로 범용 AI 대비 높은 정확도를 기록하며 한국 금융에 대한 LLM 지식수준을 한 단계 더 끌어올렸다. 이는 한국은행 등 다양한 국책기관과 금융기관의 검증된 데이터만을 활용했기 때문이다.

BC카드 측은 K-금융 특화 AI 도입을 기점으로 기업 내부 프로세스 개선 및 효율화는 물론 왜곡된 금융 정보로 인한 2차 피해를 예방하는 등 다양한 분야에서 긍정적인 역할을 할 수 있을 것으로 내다봤다.

7월 초 AI 모델 허브 플랫폼 허깅페이스를 통해 K-금융 특화 AI LLM 모델과 2만여개 금융지식 학습 데이터를 무상으로 공개했다. 향후 K-금융 특화 AI 지속적인 고도화 작업을 통해 금융 AX 분야 발전에 이바지함은 물론, BC카드에 카드 운영을 맡기고 있는 금융사를 위한 맞춤형 ‘금융 GPT’ 등을 통해 차별화된 서비스를 지속 제공해 나갈 계획이다.

강대일 BC카드 상무는 “글로벌 AI 시장에서도 경쟁할 수 있는 한국산 금융 지식 모델을 선보일 수 있게 되어 의미가 남다르다”며 “앞으로도 KT AI 기술력을 적극 활용해 국내 여러 산업 분야에서 다양한 시너지를 낼 수 있도록 지속적으로 협업해 나갈 계획”이라고 말했다.