Summary

Today's news covers how the GPT-4'o' model works and how to build a similar AI. It also covers the development of the OpenGPT 4o model, the Falcon2-11B model, and the SimPO preference-optimization method.

Decoding GPT-4’o’: In-Depth Exploration of Its Mechanisms and Creating Similar AI

Decoding GPT-4’o’

  • Date: May 21, 2024
  • Author: KingNish (Nishith Jain)
  • Key points:
    • GPT-4'o' is an innovative AI model built as a mixture of several models, packing video chat, emotion-aware voice chat, text and image generation, document and video QnA, and image-to-3D generation into a single model.
    • SuperChat: combines text generation, image generation, image and document classification, and video classification.
    • Voice Chat: a module combining TTS and STT that analyzes the user's tone in real time and responds with emotive audio.
    • Video Chat: captures an image at the start of the conversation, takes additional images as needed, and uses zero-shot image classification to answer user queries.
    • How to build such a model:
      • MultiModalification Method: combine two or more models by functionality into one multifunctional model (requires further training).
      • Duct Tape Method: use different models or APIs for different tasks, without any additional training.
    • Recommended models:
      • Text generation: Llama 3 70B
      • Image generation: PixArt Sigma or RealVisXL
      • Zero-shot image classification: SigLIP
      • Video classification: X-CLIP
      • 3D generation: InstantMesh
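The Duct Tape Method described above is essentially a prompt router: classify each request, then hand it to a task-specific model. A minimal sketch of that idea (the keyword rules and the model registry are illustrative assumptions, not the article's actual implementation):

```python
# Minimal sketch of the "Duct Tape Method": route each request to a
# task-specific model without any training. The keyword rules and the
# registry below are illustrative, not a production router.

TASK_MODELS = {
    "image_generation": "PixArt-Sigma",   # from the recommended models above
    "video_classification": "X-CLIP",
    "3d_generation": "InstantMesh",
    "text_generation": "Llama-3-70B",     # base/fallback model
}

def route(prompt: str) -> str:
    """Return the task name a prompt should be dispatched to."""
    p = prompt.lower()
    if any(k in p for k in ("draw", "generate an image", "picture of")):
        return "image_generation"
    if any(k in p for k in ("classify this video", "what happens in the video")):
        return "video_classification"
    if any(k in p for k in ("3d model", "turn this image into 3d")):
        return "3d_generation"
    return "text_generation"  # default: plain chat

def dispatch(prompt: str) -> str:
    """Pick the model that should handle the prompt."""
    return TASK_MODELS[route(prompt)]
```

A real system would replace the keyword rules with the base model itself deciding the task, as the blog describes, but the control flow is the same.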

How OpenGPT 4o works


  • Date: May 21, 2024
  • Author: KingNish (Nishith Jain)
  • Key points:
    • OpenGPT 4o is an open-source alternative to GPT-4'o', built by combining various models and APIs into a single multifunctional model.
    • Super Chat module: user input is processed by Idefics 2, which answers questions; image-generation requests are served through Pollinations AI.
    • Voice Chat: a voice assistant built on the JARVIS codebase; an STT module converts the user's question to text, the Mixtral 8x7B API generates a response, and a TTS module converts it back to audio.
    • Live Chat: supports real-time interaction using the uform gen2 dpo model.
    • Integration: all modules run through Gradio on Hugging Face ZeroGPU.
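The voice-chat flow summarized above (STT, then the chat model, then TTS) is a simple three-stage pipeline. A hedged sketch with stub stages standing in for the real Whisper/Mixtral/Edge TTS calls (every function body here is a placeholder, not the project's code):

```python
from typing import Callable

# Sketch of one voice-chat turn: speech -> text -> LLM reply -> speech.
# Each stage is a stub; in OpenGPT 4o these would be the STT API,
# the Mixtral 8x7B inference API, and the TTS API respectively.

def stt(audio: bytes) -> str:
    # Placeholder: a real system would run speech-to-text here.
    return audio.decode("utf-8")

def llm(text: str) -> str:
    # Placeholder: a real system would call the chat model's API here.
    return f"Echo: {text}"

def tts(text: str) -> bytes:
    # Placeholder: a real system would synthesize audio here.
    return text.encode("utf-8")

def voice_chat(audio_in: bytes,
               stt_fn: Callable[[bytes], str] = stt,
               llm_fn: Callable[[str], str] = llm,
               tts_fn: Callable[[str], bytes] = tts) -> bytes:
    """Run one voice-chat turn: transcribe, respond, synthesize."""
    return tts_fn(llm_fn(stt_fn(audio_in)))
```

Keeping each stage behind a function parameter makes it easy to swap in a different STT or TTS backend, which is exactly the flexibility the Duct Tape Method relies on.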

Falcon2-11B


  • Model description: Falcon2-11B is an 11B-parameter causal decoder-only model built by TII, trained on over 5,000B tokens of RefinedWeb enhanced with curated corpora.
  • Supported languages: eleven languages, including English, German, Spanish, French, Italian, Dutch, Polish, Portuguese, Romanian, and Czech.
  • Key features: optimized for text generation and conversational use; released under the TII Falcon License 2.0, a permissive Apache 2.0-based license with an acceptable use policy.
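For rough capacity planning, the weight memory needed to load an 11B-parameter model can be estimated from the parameter count and numeric precision. A back-of-the-envelope sketch (the 11B figure is from the model card; the bytes-per-parameter values are the standard dtype sizes, and real serving needs extra memory for KV cache and activations):

```python
# Back-of-the-envelope weight-memory estimate for an 11B-parameter model.
# Weights only: KV cache, activations, and framework overhead add more.

PARAMS = 11e9  # parameter count, per the Falcon2-11B model card

def weight_gb(params: float, bytes_per_param: int) -> float:
    """Approximate weight memory in GiB for a given precision."""
    return params * bytes_per_param / 2**30

fp32 = weight_gb(PARAMS, 4)  # float32: roughly 41 GiB
bf16 = weight_gb(PARAMS, 2)  # bfloat16: roughly 20 GiB
int8 = weight_gb(PARAMS, 1)  # 8-bit quantized: roughly 10 GiB
```

This is why quantized or half-precision checkpoints are the usual way to run models of this size on a single GPU.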

SimPO: Simple Preference Optimization with a Reference-Free Reward

SimPO

  • Published: May 24, 2024
  • Authors: Yu Meng, Mengzhou Xia, Danqi Chen
  • Key points:
    • SimPO simplifies Direct Preference Optimization (DPO) by using the average (length-normalized) log probability of a sequence as the implicit reward, which aligns better with generation and removes the need for a reference model, making training more compute- and memory-efficient.
    • A target reward margin added to the Bradley-Terry objective encourages a larger gap between winning and losing responses, further improving performance.
    • SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard; the top model, built on Llama3-8B-Instruct, achieves a 44.7 length-controlled win rate on AlpacaEval 2 and a 33.8 win rate on Arena-Hard.
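The reward and loss described above can be written out explicitly (notation: π_θ is the policy, y_w and y_l the winning and losing responses, β a scaling constant, γ the target reward margin; this follows the paper's formulation as I understand it):

```latex
% SimPO implicit reward: length-normalized average log-probability
r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)

% SimPO loss: Bradley-Terry objective with a target reward margin \gamma
\mathcal{L}_{\text{SimPO}}(\pi_\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
    - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
    - \gamma \right) \right]
```

Note that no reference policy appears anywhere in the loss, which is what makes SimPO cheaper in compute and memory than DPO.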

Full details for each article are available from the source links below.

Sources
###
https://huggingface.co/blog/KingNish/decoding-gpt-4o
Decoding GPT-4'o': In-Depth Exploration of Its Mechanisms and Creating Similar AI.
Community Article
Published May 21, 2024
KingNish (Nishith Jain)
OpenAI has launched the groundbreaking AI GPT-4'o', a model that is a mixture of many models. In this blog post, we will discuss how GPT-4'o' works and how to create this kind of model.
0. GPT 4'o' Capabilities
Video Chat (a first-time feature).
Faster and human-like Voice Chat (it even shows emotions and changes tone).
Text Generation, Image Generation, Image QnA, Document QnA, Video QnA, Sequential Image Generation, Image-to-3D, and best of all, all of these are packed into one model.
Supports 50+ languages.
See Examples in OpenAI Post

1. How GPT 4'o' works.
Firstly, GPT 4'o's working is mainly divided into 3 parts.

1. SuperChat
As GPT-4 had already achieved sequential image generation and image QnA, OpenAI only had to add document QnA, video QnA, and 3D generation. For a tech giant like OpenAI, that is a piece of cake. It can be done with the methods we discuss at the end.

2. Voice Chat
OpenAI has integrated TTS (Text-to-Speech) and STT (Speech-to-Text) into a single module, removing the text-generation component they previously used. This means that when you speak, the AI analyzes your tone and words to create an audio response in real time, similar to how streaming is used in text generation. In my opinion, OpenAI made this model comparatively less powerful because it is primarily designed for human interaction, and the AI is trained accordingly.

3. Video Chat
Video chat is not actually a live video interaction. The AI captures an image at the start of the conversation and takes additional images as needed or when instructed. It then employs zero-shot image classification to respond to user queries. This module uses a more powerful model than voice chat, because the AI can address a wider range of requests when it has visual information. For example, it can identify people and places, solve complex mathematical problems, detect coding errors, and much more, far beyond what simple voice chat can do.
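Zero-shot image classification of the kind used here (CLIP/SigLIP-style) works by embedding the image and each candidate label into a shared space and picking the label with the highest cosine similarity. A conceptual sketch with toy vectors (the embeddings are made up for illustration, not real model outputs):

```python
import numpy as np

# Conceptual sketch of zero-shot image classification: score an image
# embedding against text-label embeddings by cosine similarity.
# The vectors below are toy values, not real CLIP/SigLIP outputs.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(image_emb: np.ndarray, label_embs: dict) -> str:
    """Return the label whose embedding is most similar to the image."""
    return max(label_embs, key=lambda lbl: cosine(image_emb, label_embs[lbl]))

# Toy example: the "image" points closest to the "cat" label direction.
labels = {
    "cat": np.array([1.0, 0.1, 0.0]),
    "dog": np.array([0.0, 1.0, 0.1]),
    "car": np.array([0.0, 0.0, 1.0]),
}
image = np.array([0.9, 0.2, 0.1])
```

Because the labels are just text embeddings, the classifier needs no retraining to handle a new set of candidate answers, which is what makes this approach attractive for open-ended video chat.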

[Figure: what people think of how OpenGPT-4 works ("What you think") vs. reality ("How it actually works")]

2. Creating an AI Like GPT 4o
We will also build 3 models, like OpenAI, but first there are two methods for creating any such model, and it's important to understand them.

1. MultiModalification or Mixture of Modal Method
This method combines 2 or more models according to their functionality to create a new, powerful, multifunctional model. It also requires further training.

2. Duct Tape Method
In this method, you just use different types of models or APIs to perform different tasks, without ANY training.

Making the SuperChat Model
MultiModalification or Mixture of Modal Method: To create the SuperChat model, we need to combine text generation, image generation, image classification, document classification, and video classification models, using the same process as Idefics 2. Idefics 2, which combines zero-shot image classification and text generation, can chat with you and answer questions based on images.

Duct Tape Method: Without an API, one base model is PROMPTED to identify which type of task a request is, forwards the user's prompt to the model for that task type, and then returns the output to the user. Optionally, a text-generation model at the end can add some words to make the answer more natural. With an API, one base model is prompted to call an API for a specific type of query. Copilot uses this method: when asked to create images, compose songs, run web searches, or answer questions about images, it calls the API for that task.

Recommended models from which you can create a SuperChat model as powerful as GPT 4o:

Base model - Llama 3 70B
Image Generation - PixArt Sigma or RealVisXL
Zero-Shot Image Classification - SigLIP
Zero-Shot Video Classification - X-CLIP
Sequential Image Generation - Control SDXL
Zero-Shot Doc Classification - idf
3D Generation - InstantMesh
Other Models - AnimateDiff Lightning
Making the VoiceChat Model
MultiModalification or Mixture of Modal Method: To develop a human-like speaking AI that also exhibits emotion, high-quality training data is essential. Additionally, an emotion-identification model is needed to recognize users' emotions, along with a text-generation model that understands them.

Duct Tape Method: One STT model encodes the user's prompt, with its emotion, for a text-generation model that encodes emotion in its answer; a TTS such as Parler-TTS Expresso can then further infuse emotion into the audio output.

Suggested Models

Speech to Text - Whisper
Chat model - Llama 3 8B
Text to Speech - Parler-TTS Expresso
Emotion identifier - Speech Emotion Recognition
Making the VideoChat Model
As previously mentioned, it only captures images. Thus, a zero-shot image classification model is necessary, while the rest remains the same as the voice-chat model. However, it also requires a highly intelligent model, due to the broader use cases that vision enables.

Suggested Models

Zero-Shot Image Classification - SigLIP
Speech to Text - Whisper
Chat model - Llama 3 8B
Text to Speech - Parler-TTS Expresso
Optional - Speech Emotion Recognition
Alternatively

Image QnA Model - Idefics 2
VoiceChat Model

###
https://huggingface.co/blog/KingNish/opengpt-4o-working
How OpenGPT 4o works

Community Article
Published May 21, 2024
KingNish (Nishith Jain)
In the previous blog, we discussed how ChatGPT 4o works. Today, we're going to talk about how I developed OpenGPT 4o, an open-source alternative to GPT 4o.
(Suggestion: Read previous blog post as this blog contains interconnected topics. Link - https://huggingface.co/blog/KingNish/decoding-gpt-4o )

Selecting the Method
There are 2 methods for creating an AI like GPT 4o.

1. MultiModalification or Mixture of Modal Method
This method combines 2 or more models according to their functionality to create a new, powerful, multifunctional model. It also requires further training.

2. Duct Tape Method
In this method, you just use different types of models or APIs to perform different tasks, without ANY training.

Since I don't have access to a GPU for training models, I've chosen the Duct Tape Method.

The next step is to select the models and APIs based on their performance, speed, and ease of implementation.

Models and APIs used:

Work                     Model/API                     Reason
Super Chat Model         Idefics 2                     Already made, eliminating the need to build from scratch.
Image Generation Model   Pollinations AI (API)         Fast and straightforward to implement.
Speech to Text           Nemo (API)                    Already used in another project (JARVIS).
Voice Chat (base model)  Mixtral 8x7B (Inference API)  Faster and more powerful than GPT-3.5 Turbo.
Text to Speech           Edge TTS (API)                Exceptionally fast text-to-speech conversion.
Live Chat (base model)   uform gen2 dpo                Small size and rapid performance.
As discussed in the previous blog, ChatGPT 4o's working is divided into 3 modules. Now let's discuss each module.

Super Chat Module
[Figure: Super Chat module flow]

Explanation: When a user provides input, it is processed by Idefics 2, which interprets the prompt and answers questions. If the user wants to generate an image, the model creates a Pollinations AI image link; the process for creating this link is explained in detail to the AI in its system prompt. Once the link is created, Pollinations AI begins generating the image, which becomes visible to the user upon completion.
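The image-link trick described here amounts to plain URL construction: the chat model never runs a diffusion model, it just emits a URL whose path encodes the prompt. A sketch of that idea (the endpoint shape and query parameters below are assumptions based on Pollinations' public prompt-in-URL style, not necessarily the exact format in the project's system prompt):

```python
from urllib.parse import quote

# Sketch of the Super Chat image-generation trick: the chat model emits an
# image URL whose path encodes the prompt, and the image service renders it
# on first request. The endpoint and parameters are assumed examples.
BASE = "https://image.pollinations.ai/prompt/"

def image_link(prompt: str, width: int = 1024, height: int = 1024) -> str:
    """Build a prompt-in-URL image link (query params are illustrative)."""
    return f"{BASE}{quote(prompt)}?width={width}&height={height}"
```

Because generating the "image" is just string formatting, this step can be delegated entirely to the chat model via its system prompt, which is exactly what the blog describes.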

System Prompt I used
Voice Chat
As I have already created JARVIS, a voice assistant, I simply reuse its code.

[Figure: voice chat flow]

Explanation: When a user asks the AI a question, it is directed to the STT (Speech to Text) module, which converts it into text and sends it to the Mixtral 8x7B API. This API processes the request and generates a response that is sent to the TTS (Text to Speech) module. This module then converts the response into audio and sends it back to the user.

Live Chat
For real-time interactions, the uform gen2 dpo model powers the live chat feature.

[Figure: video chat flow]

Explanation: Initially, the user provides input via webcam and text simultaneously. The AI then answers the user's query about the picture using "UForm Gen2", and the answer is returned to the user as text.

The Integration Process
All 3 modules run through Gradio on Hugging Face ZeroGPU.

Source Code: - https://github.com/KingNishHF/OpenGPT-4o

Conclusion
The creation of OpenGPT 4o using the duct tape method is a prime example of how diverse AI models can be woven together to create a comprehensive and multifaceted tool. It stands as a beacon of possibility in the realm of AI development, showcasing the power of collaboration between different AI technologies.

###
https://huggingface.co/tiiuae/falcon-11B
tiiuae/falcon-11B
Text Generation · Transformers · Safetensors
Dataset: tiiuae/falcon-refinedweb
Languages: English, German, Spanish, French, Italian, Dutch, Polish, Portuguese, Romanian, Czech
Tags: falcon, conversational, custom_code, text-generation-inference
🚀 Falcon2-11B
Falcon2-11B is an 11B-parameter causal decoder-only model built by TII and trained on over 5,000B tokens of RefinedWeb enhanced with curated corpora. The model is made available under the TII Falcon License 2.0, a permissive Apache 2.0-based software license which includes an acceptable use policy that promotes the responsible use of AI.


###
https://huggingface.co/papers/2405.14734
SimPO: Simple Preference Optimization with a Reference-Free Reward
Published on May 24, 2024
Authors: Yu Meng, Mengzhou Xia, Danqi Chen
Abstract
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3. We evaluated on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2 -- surpassing Claude 3 Opus on the leaderboard, and a 33.8 win rate on Arena-Hard -- making it the strongest 8B open-source model.