• Amazon에서는 새로운 파운데이션 모델 Amazon Nova를 발표하였습니다. 이 모델은 Amazon Bedrock을 통해 독점적으로 제공되며, 텍스트부터 비디오 생성까지 다양한 기능을 지원합니다.
  • Meta AI와 난양기술대학교 연구팀은 EfficientTAM을 공개하였습니다. 이는 경량의 Vision Transformer를 사용하여 비디오 객체 분할 및 추적의 효율성을 높였습니다.
  • IDEA Research에서는 DINO-X를 발표하였습니다. 이는 개방형 객체 감지와 이해를 위한 세계 최고의 비전 모델로, 다양한 입력 프롬프트와 다중 레벨의 출력 표현을 지원합니다.
  • 베이징교통대학 연구팀은 O1-CODER를 소개하였습니다. 이는 OpenAI의 O1 모델을 코딩 작업에 특화해 복제하려는 시도입니다.
  • Google DeepMind에서는 Generative Omnimatte를 발표하였습니다. 이 기술은 비디오를 레이어로 분해하여 객체와 그에 따른 효과를 분리합니다.
  • Tencent 연구팀은 HiFiVFS를 선보였습니다. 이는 고품질의 비디오 얼굴 교체를 가능하게 하는 프레임워크로, 복잡한 조건에서도 우수한 성능을 보여줍니다.
  • 싱가포르 국립대학교와 Microsoft 연구팀은 ShowUI를 공개하였습니다. 이는 어떤 UI에서도 동작할 수 있는 작은 종단 간 에이전트입니다.
  • Pydantic 팀은 PydanticAI를 출시하였습니다. 이는 Pydantic을 활용하여 LLM과 함께 사용할 수 있는 에이전트 프레임워크입니다.
  • Quivr에서는 MegaParse를 소개하였습니다. 이는 모든 유형의 문서를 처리할 수 있는 강력한 파서입니다.
  • Google Research에서는 LLM 임베딩을 활용한 회귀 분석에 대한 연구를 발표하였습니다.
  • Adobe 연구팀은 DynaSaur를 발표하였습니다. 이는 사전 정의된 액션을 넘어서는 대형 언어 모델 에이전트입니다.
  • ConsisID는 일관성 있는 아이덴티티 보존 텍스트-비디오 생성 모델을 제안하였습니다.
  • Showlab은 Hugging Face에 ShowUI-2B 모델을 공개하였습니다.
  • DataLab은 LLM을 활용한 비즈니스 인텔리전스를 위한 통합 플랫폼을 소개하였습니다.

Amazon, Amazon Nova 파운데이션 모델 발표

링크, 2024년 12월 3일

  • Amazon은 새로운 세대의 최첨단 파운데이션 모델인 Amazon Nova를 발표함.
  • Amazon Nova는 Amazon Bedrock을 통해 독점적으로 제공되며, 다양한 생성 AI 작업을 지원함.
  • Understanding Models와 Creative Content Generation Models의 두 가지 카테고리로 구성됨.
  • Understanding Models는 텍스트, 이미지, 비디오 입력을 받아 텍스트 출력을 생성하며, Micro, Lite, Pro, Premier의 네 가지 모델로 구성됨.
    • Amazon Nova Micro: 텍스트 전용 모델로, 128K 토큰의 컨텍스트 길이를 지원하며, 낮은 지연 시간과 비용으로 텍스트 요약, 번역, 분류 등에 최적화됨.
    • Amazon Nova Lite: 멀티모달 모델로, 최대 300K 토큰의 컨텍스트 길이를 지원하며, 이미지, 비디오, 텍스트 입력을 처리하여 텍스트 출력을 생성함.
    • Amazon Nova Pro: 고성능 멀티모달 모델로, 복잡한 추론 및 에이전트 워크플로우를 지원하며, 300K 토큰의 컨텍스트 길이를 가짐.
    • Amazon Nova Premier: 가장 강력한 멀티모달 모델로, 2025년 초 출시 예정임.
  • Creative Content Generation Models는 텍스트 및 이미지 입력을 받아 이미지 또는 비디오 출력을 생성함.
    • Amazon Nova Canvas: 스튜디오 품질의 이미지를 생성하며, 인페인팅, 아웃페인팅, 배경 제거 등의 편집 기능을 제공함.
    • Amazon Nova Reel: 텍스트 프롬프트와 이미지를 통해 전문적인 품질의 비디오를 생성함.
  • 모든 모델은 내장된 안전 제어 및 워터마크 기능을 포함하여 책임 있는 AI 사용을 촉진함.
  • Amazon Bedrock을 통해 사용자 지정 및 미세 조정이 가능하며, Retrieval-Augmented Generation(RAG), 함수 호출, 에이전트 응용 프로그램에 탁월한 성능을 보임.
  • 현재 미국 내 AWS 리전에서 제공되며, 가격은 모델과 사용량에 따라 다름.
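아래는 Amazon Bedrock Converse API로 Amazon Nova 모델을 호출하는 최소 예시 스케치임(하단 원문 블로그의 호출 패턴을 따랐으며, 리전과 모델 ID(amazon.nova-lite-v1:0)는 가정값임).

```python
# 가정: us-east-1 리전에서 Amazon Nova Lite에 대한 모델 접근 권한이 활성화되어 있음
import boto3

AWS_REGION = "us-east-1"
MODEL_ID = "amazon.nova-lite-v1:0"  # 모델 ID는 가정값

bedrock_runtime = boto3.client("bedrock-runtime", region_name=AWS_REGION)

messages = [
    {"role": "user", "content": [{"text": "다음 문서를 100단어로 요약해줘: ..."}]},
]

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=messages,
    inferenceConfig={"temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```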

Meta AI 및 난양기술대학교, EfficientTAM 발표

링크, 2024년 12월 5일

  • **EfficientTAM(Efficient Track Anything Model)**을 공개함.
  • 경량의 Vision Transformer(ViT)를 사용하여 비디오 객체 분할 및 추적의 효율성을 향상시킴.
  • iPhone 15에서 초당 10프레임 이상의 실시간 성능을 달성함.
  • **SAM 2(Segment Anything Model 2)**와 비교하여 유사한 성능을 유지하면서도 속도와 효율성을 향상시킴.
  • 효율적인 메모리 크로스 어텐션을 도입하여 메모리 계산을 최적화하고, 메모리 공간 임베딩의 강력한 지역성을 활용함.
  • 여러 비디오 세분화 벤치마크(예: Semi-supervised VOS, Promptable Video Segmentation)에서 우수한 성능을 입증함.
  • 다양한 프롬프트(포인트, 박스, 세그먼트 등)를 지원하며, 이미지 세분화에서도 향상된 성능을 보임.
  • 경량의 ViT와 효율적인 메모리 모듈을 결합하여 모델 크기와 지연 시간을 줄임.

IDEA Research, DINO-X 발표

링크, 2024년 11월 25일

  • DINO-X는 개방형 객체 감지 및 이해를 위한 최첨단 비전 모델임.
  • 이전 버전인 Grounding DINO 1.5 및 1.6 대비 성능을 크게 향상시킴.
  • COCO, LVIS-minival, LVIS-val 등의 제로샷 객체 감지 벤치마크에서 새로운 최고 성능을 달성함.
  • 특히 LVIS-minival 및 LVIS-val의 희귀 클래스에서 이전 최고 성능 대비 각각 5.8 AP 및 5.0 AP의 향상을 보임.
  • 다양한 입력 프롬프트(텍스트, 비주얼, 사용자 정의)를 지원하며, 바운딩 박스, 세그멘테이션 마스크, 포즈 키포인트, 객체 캡션 등 다중 레벨의 출력 표현을 제공함.
  • 개방형 객체 감지, 프레이즈 그라운딩, 비주얼 카운팅, 포즈 추정, 프롬프트 없는 객체 감지 및 인식 등 다양한 실용적인 작업을 지원함.
  • Grounding DINO와 **SAM(Segment Anything Model)**을 조합한 Grounded SAM과 비교하여, 통합된 단일 비전 모델로서 여러 단계의 추론 없이 영역별 마스크를 생성해 세그멘테이션 효율성을 크게 높임.

베이징교통대학, O1-CODER 소개

링크, 2024년 12월 4일

  • O1-CODER는 OpenAI의 O1 모델을 코딩 작업에 특화하여 복제한 시도임.
  • 강화 학습(RL)과 몬테카를로 트리 탐색(MCTS)을 통합하여 모델의 시스템 2 사고 능력을 향상시킴.
  • **테스트 케이스 생성기(TCG)**를 훈련하여 표준화된 코드 테스트를 수행함.
  • 자체 플레이와 강화 학습을 통해 모델이 추론 프로세스를 포함한 코드 데이터를 생성하고, 정책 모델을 반복적으로 최적화함.
  • 모델은 초기에는 의사 코드(pseudo-code)를 생성하고, 이후에 전체 코드를 생성하도록 훈련됨.
  • 실제 응용에서의 기회와 도전을 다루며, 시스템 2 패러다임으로의 전환을 제안함.
  • 모든 소스 코드, 데이터셋 및 모델을 공개할 예정임.
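아래는 보고서가 설명하는 테스트 케이스 생성기(TCG) 기반 보상 계산 아이디어를 크게 단순화한 가상의 스케치임(O1-CODER의 실제 구현이 아니며, 테스트 케이스 형식과 함수 이름은 모두 가정임).

```python
# 가상의 스케치: TCG가 생성했다고 가정한 테스트 케이스로 후보 코드를 실행해
# 통과 비율을 강화 학습 보상으로 사용
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (입력 인자, 기대 출력)

def reward_from_tests(candidate: Callable, test_cases: List[TestCase]) -> float:
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # 실행 오류는 실패로 처리
    return passed / len(test_cases) if test_cases else 0.0

# 사용 예: 두 수의 합을 구하는 후보 코드 평가
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
print(reward_from_tests(lambda a, b: a + b, tests))  # 1.0
```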

Google DeepMind, Generative Omnimatte 발표

링크, 2024년 12월 3일

  • Generative Omnimatte는 비디오를 레이어로 분해하여 각 객체와 그에 관련된 효과(그림자, 반사 등)를 분리하는 기술임.
  • 기존 방법들은 정적인 배경이나 단일 객체에 제한되었으나, 이 기술은 동적인 배경, 가려짐, 다중 객체의 효과 연관을 처리할 수 있음.
  • 사용자에게 다양한 비디오 편집 기능을 제공하여 객체 제거, 효과 제거 등의 작업을 가능하게 함.
  • 두 단계의 프로세스를 사용함:
    • Stage 1: 객체 및 효과 제거 모델인 Casper를 사용하여 클린 플레이트 배경과 단일 객체 비디오를 생성함.
    • Stage 2: 테스트 시 최적화를 통해 솔로 비디오와 배경 비디오로부터 옴니매트(omnimatte) 레이어를 재구성함.
  • 자기 주의(self-attention) 메커니즘을 활용하여 효과 연관을 시각화하고, 모델의 내부 이해를 분석함.
  • 사용자 지정 트리마스크(trimask)를 통해 결과를 개선하고, 다중 객체 시나리오에서 더 정확한 결과를 얻을 수 있음.

Tencent, HiFiVFS 공개

링크, 2024년 12월 1일

  • HiFiVFS는 고품질 비디오 얼굴 교체를 위한 프레임워크임.
  • **Stable Video Diffusion(SVD)**를 기반으로 하며, 시간적 안정성을 유지하면서 얼굴 교체를 수행함.
  • **세밀한 속성 학습(FAL)**에서는 아이덴티티 비민감화와 적대적 학습을 통해 속성을 분리하고 강화함.
  • **상세한 아이덴티티 학습(DIL)**을 통해 얼굴 교체에 더 적합한 ID 특징을 사용하여 아이덴티티 유사성을 향상시킴.
  • 복잡한 조명 조건, 극단적인 포즈, 가려짐, 다른 얼굴 형태 등의 어려운 시나리오에서도 우수한 성능을 보임.
  • 기존 방법들과 비교하여 새로운 최고 성능(SOTA)을 달성하였으며, 추가적인 후처리 없이도 고품질의 얼굴 교체 영상을 생성함.

싱가포르 국립대학교 및 Microsoft, ShowUI 발표

링크, 2024년 12월 5일

  • ShowUI는 어떤 UI에서도 동작할 수 있는 경량의 종단 간 에이전트임.
  • Qwen-2-VL-2B 모델을 기반으로 하여 작은 모델로도 GPT-4V와 같은 더 큰 모델을 능가하는 성능을 보임.
  • UI 스크린샷을 최대한 단순화하여 효율성을 높이고, 패치를 그룹화하여 시각적 복잡성을 줄임.
  • 데스크톱, 안드로이드, 웹 등 다양한 플랫폼에서 작동하며, 추가적인 텍스트 정보 없이도 UI를 탐색하고 작업을 수행함.
  • 다양한 데이터셋(OmniAct, Mind2Web, AMEX 등)을 통합하여 일반적인 UI 에이전트를 훈련함.
  • 반복적 개선(iterative refinement) 기능을 지원하여 그라운딩 정확도를 높일 수 있음.
  • **OOTB(Out-of-the-box)**를 통해 로컬에서 쉽게 실행할 수 있음.

Pydantic, PydanticAI 출시

링크, 2024년 12월 2일

  • PydanticAI는 Pydantic을 활용하여 LLM과 함께 사용할 수 있는 에이전트 프레임워크임.
  • 파이썬으로 제어 흐름과 구성을 수행하며, 별도의 DSL이나 복잡한 추상화가 필요 없음.
  • 타입 안전성을 제공하며, IDE 지원과 정적 타입 검사를 통해 개발 생산성을 향상시킴.
  • OpenAI, Gemini, vLLM, TGI 등 다양한 모델을 지원하며, 모델에 종속되지 않는 구조를 가짐.
  • 구조화된 응답 검증과 스트리밍 응답을 지원하여 안정적인 응용 프로그램 구축이 가능함.
  • 동적 런타임 컨텍스트 및 종속성 주입을 지원하여 테스트와 반복적인 개발을 용이하게 함.
  • Logfire 통합을 통해 LLM 기반 응용 프로그램의 디버깅과 성능 모니터링을 지원함.
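아래는 PydanticAI로 구조화된 응답을 검증하는 최소 예시 스케치임(출시 초기 문서의 일반적인 사용 패턴을 가정한 것으로, 세부 API와 파라미터 이름은 버전에 따라 다를 수 있음).

```python
# 가정: pydantic-ai가 설치되어 있고 OPENAI_API_KEY가 환경 변수로 설정되어 있음
from pydantic import BaseModel
from pydantic_ai import Agent

class CityInfo(BaseModel):
    city: str
    country: str

# 모델 식별자와 result_type 파라미터는 출시 당시 문서를 가정한 값임
agent = Agent(
    "openai:gpt-4o",
    result_type=CityInfo,
    system_prompt="도시에 대한 질문을 받으면 도시와 국가를 구조화해 반환하라.",
)

result = agent.run_sync("2012년 올림픽이 열린 도시는?")
print(result.data)  # Pydantic으로 검증된 CityInfo 인스턴스
```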

Quivr, MegaParse 소개

링크, 2024년 12월 4일

  • MegaParse는 모든 유형의 문서를 처리할 수 있는 강력하고 다재다능한 파서임.
  • 정보 손실 없이 문서를 파싱하는 데 중점을 두며, 텍스트, PDF, 파워포인트, 워드, 엑셀, CSV 등 다양한 파일 형식을 지원함.
  • 빠르고 효율적인 성능을 제공하며, 오픈 소스로 자유롭게 사용할 수 있음.
  • UnstructuredParser, MegaParseVision, LlamaParser 등 다양한 파서 모듈을 제공하여 사용자 필요에 따라 선택 가능함.
  • MegaParse Vision은 멀티모달 모델(예: GPT-4V, Claude 3.5)을 활용하여 이미지와 PDF의 파싱 성능을 향상시킴.
  • API로도 사용할 수 있으며, 간단한 설치와 사용법을 제공함.
  • 벤치마크 결과에서 기존의 파서보다 높은 유사도 비율을 달성하여 정확성을 입증함.
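아래는 MegaParse로 PDF를 파싱해 마크다운으로 저장하는 기본 흐름을 가정한 스케치임(파서 선택 → 로드 → 저장 순서이며, 클래스·모듈 경로는 설치된 버전에 따라 다를 수 있는 가정값임).

```python
# 가정에 기반한 스케치: 모듈 경로와 클래스 이름은 버전에 따라 다를 수 있음
from megaparse import MegaParse
from megaparse.parser.unstructured_parser import UnstructuredParser

parser = UnstructuredParser()              # LLM 없이 동작하는 기본 파서
megaparse = MegaParse(parser)

document_text = megaparse.load("./example.pdf")   # PDF, 워드, 파워포인트 등 지원
print(document_text[:500])

megaparse.save("./example.md")             # 파싱 결과를 마크다운으로 저장
```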

Google Research, LLM 임베딩을 활용한 회귀 분석 연구 발표

링크, 2024년 12월 2일

  • LLM 임베딩을 회귀 분석의 특징(feature)으로 사용하는 방법을 종합적으로 조사함.
  • LLM 임베딩을 특징으로 사용할 경우, 고차원 회귀 작업에서 전통적인 특징 공학보다 더 나은 성능을 보일 수 있음을 실험적으로 보여줌.
  • LLM 임베딩이 특징 공간에서 립시츠 연속성(Lipschitz continuity)을 보존한다는 것을 발견함.
  • 모델 크기와 언어 이해 능력이 회귀 성능 향상에 항상 기여하지는 않는다는 것을 관찰함.
  • 다양한 모델 효과(모델 크기, 언어 이해 등)의 기여도를 정량화하여 회귀 분석에서 LLM 임베딩의 활용 가능성을 제시함.
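아래는 'LLM 임베딩을 회귀의 특징으로 사용'하는 설정을 일반적인 도구로 재현해 본 스케치임(논문의 실험 코드가 아니며, embed_texts는 임의의 LLM 임베딩 API를 대신하는 자리표시자임).

```python
# 가상의 스케치: 문자열로 표현된 입력을 LLM 임베딩으로 바꾼 뒤 선형 회귀의 특징으로 사용
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def embed_texts(texts):
    """자리표시자: 실제로는 LLM 임베딩 API 호출로 대체한다고 가정."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768))  # 데모용 난수 임베딩

# 데모용 데이터: 하이퍼파라미터 설정을 문자열로 직렬화한 입력과 임의의 타깃
texts = [f"config: lr={lr:.3f}, depth={d}"
         for lr in np.linspace(0.001, 0.1, 20) for d in range(1, 6)]
y = np.arange(len(texts), dtype=float)

X = embed_texts(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reg = Ridge(alpha=1.0).fit(X_tr, y_tr)     # 임베딩을 특징으로 하는 회귀
print("held-out R^2:", reg.score(X_te, y_te))
```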

Adobe, DynaSaur 발표

링크, 2024년 12월 2일

  • DynaSaur는 사전 정의된 액션을 넘어서는 대형 언어 모델(LLM) 에이전트임.
  • 기존의 에이전트 시스템은 고정되고 사전 정의된 액션 세트에서 선택하지만, DynaSaur는 코드 생성을 통해 새로운 액션을 동적으로 생성하고 구성할 수 있음.
  • 에이전트가 파이썬 코드를 작성하여 프로그램을 실행하고, 생성된 액션을 축적하여 향후 재사용 가능함.
  • 코드로 행동함으로써 에이전트가 튜링 완전한 문제를 해결할 수 있으며, 일반성과 구성 가능성을 향상시킴.
  • GAIA 벤치마크에서 최고 성능을 달성하였으며, 기존의 정해진 액션 세트 기반의 방법보다 유연하고 강력한 문제 해결 능력을 보임.
  • 코드 에이전트의 성능이 크게 향상되었으며, 특히 예기치 않은 엣지 케이스나 새로운 상황에서 회복 능력이 뛰어남.
  • 오픈 소스로 코드가 공개되어 있으며, 다양한 응용 분야에서 활용 가능함.
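아래는 '코드 생성으로 액션을 동적으로 만들고 누적해 재사용'하는 개념을 크게 단순화한 가상의 스케치임(DynaSaur의 실제 구현이 아니며, LLM 호출부는 생략하고 생성된 코드 문자열이 이미 주어졌다고 가정함).

```python
# 가상의 스케치: LLM이 생성했다고 가정한 함수 정의를 실행해 액션 라이브러리에 등록하고 재사용
action_library = {}

def register_action(generated_code: str):
    """생성된 코드 문자열을 실행해 정의된 함수를 액션으로 등록(실제 시스템이라면 샌드박스 실행 필요)."""
    namespace = {}
    exec(generated_code, namespace)
    for name, obj in namespace.items():
        if callable(obj) and not name.startswith("__"):
            action_library[name] = obj

# 에이전트가 생성했다고 가정한 새 액션
generated = '''
def count_rows(rows):
    return len(rows)
'''
register_action(generated)

# 이후 스텝에서 등록된 액션을 도구처럼 재사용
print(action_library["count_rows"]([1, 2, 3]))  # 3
```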

ConsisID, 아이덴티티 보존 텍스트-비디오 생성 모델 제안

링크, 2024년 12월 1일

  • ConsisID는 일관성 있는 아이덴티티 보존 텍스트-비디오 생성 모델임.
  • 주파수 분해를 활용하여 얼굴 특징을 저주파 전역 특징(global features)과 고주파 고유 특징(intrinsic features)으로 분리함.
  • 글로벌 얼굴 추출기를 통해 참조 이미지와 얼굴 키포인트를 인코딩하여 저주파 정보를 포함한 특징을 추출함.
  • 로컬 얼굴 추출기를 통해 고주파 세부 정보를 캡처하고, 트랜스포머 블록에 주입하여 아이덴티티 보존 능력을 향상시킴.
  • 계층적 학습 전략을 통해 주파수 정보를 활용하여 아이덴티티를 보존함.
  • 튜닝 없이 단일 이미지와 텍스트 프롬프트만으로 고품질의 아이덴티티 일관성을 가진 비디오를 생성함.
  • 다양한 벤치마크에서 우수한 성능을 입증하였으며, Apache 2.0 라이선스로 공개되어 있음.

Showlab, Hugging Face에 ShowUI-2B 모델 공개

링크, 2024년 12월 5일

  • ShowUI-2B는 GUI 에이전트를 위한 경량 비전-언어-액션 모델임.
  • Qwen-2-VL-2B 모델을 기반으로 하여 다양한 UI에서 동작함.
  • 웹, 데스크톱, 안드로이드 등 다양한 인터페이스에서 작동하며, 추가적인 텍스트 정보 없이도 UI를 탐색하고 작업을 수행함.
  • UI 스크린샷을 최대한 단순화하여 시각적 복잡성을 줄이고, 패치를 그룹화하여 효율성을 높임.
  • 다양한 데이터셋을 통합하여 일반적인 UI 에이전트를 훈련하였으며, GPT-4V와 같은 더 큰 모델을 능가하는 성능을 보임.
  • HF Spaces 데모와 ShowUI-desktop-8K 데이터셋도 함께 공개되어 있음.
  • **OOTB(Out-of-the-box)**를 통해 로컬에서 쉽게 실행할 수 있음.
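아래는 Hugging Face transformers로 showlab/ShowUI-2B를 불러와 스크린샷에 대해 질의하는 과정을 가정한 스케치임(Qwen2-VL 계열의 일반적인 로딩·추론 패턴을 따른 것으로, 좌표 출력용 프롬프트 형식 등 세부 사항은 모델 카드와 다를 수 있음).

```python
# 가정에 기반한 스케치: Qwen2-VL 계열 모델의 일반적인 transformers 사용 패턴
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "showlab/ShowUI-2B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")

image = Image.open("screenshot.png")       # UI 스크린샷(예시 경로)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "검색 버튼의 위치를 알려줘."},
]}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
generated = output_ids[:, inputs["input_ids"].shape[1]:]  # 프롬프트 부분 제거
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```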

DataLab, LLM 기반 비즈니스 인텔리전스 플랫폼 발표

링크, 2024년 12월 3일

  • DataLab은 LLM 기반 에이전트를 통합한 통합 비즈니스 인텔리전스(BI) 플랫폼임.
  • 자연어 쿼리를 기반으로 자동으로 작업 계획, 추론, 실행을 수행하며, 데이터 분석 업무를 간소화함.
  • 다양한 BI 작업을 단일 환경에서 지원하여 데이터 전문가들이 여러 도구를 전환하지 않고도 작업 가능함.
  • 도메인 지식 통합 모듈을 설계하여 기업별 BI 작업을 지원하고, LLM이 기업 특유의 용어와 데이터에 적응할 수 있게 함.
  • 에이전트 간 통신 메커니즘을 통해 BI 워크플로우에서 정보 공유를 촉진하고, 협업을 강화함.
  • 셀 기반 컨텍스트 관리 전략을 도입하여 노트북 환경에서의 컨텍스트 활용을 최적화하고, 긴 컨텍스트 문제를 해결함.
  • 다양한 BI 작업에서 기존의 방법보다 우수한 성능을 보였으며, 실제 기업 데이터셋에서도 높은 효과와 효율성을 입증함.
  • Tencent의 실제 데이터에서 최대 58.58%의 정확도 향상과 61.65%의 토큰 비용 절감을 달성함.
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date


  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://aws.amazon.com/de/blogs/aws/introducing-amazon-nova-frontier-intelligence-and-industry-leading-price-performance/
Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance
by Danilo Poccia on 03 DEC 2024 in Amazon Bedrock, Amazon Machine Learning, Announcements, Artificial Intelligence, AWS re:Invent, Featured, Generative AI, Launch, News
Today, we’re thrilled to announce Amazon Nova, a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry leading price performance, available exclusively in Amazon Bedrock.

Unexpected. Amazon is back with Foundation Models. As part of re:Invent, they announced 6 new foundation models, from text-only to text-to-video! 👀 Nova models will be exclusively available through Amazon Bedrock.
TL;DR;
🧠 Micro (text-only), Lite (multimodal), Pro (high-capability), and Premier (coming 2025)
🎨 Canvas (image-generation) and Reel (video-generation)
📊 Context length up to 300K tokens and 200+ languages
🥇 Performance on benchmarks similar to Llama 3
🗺️ Models currently only available in AWS Regions in the US
🔒 Includes watermarking capabilities (no details here)
🔧 Can be fine-tuned inside Amazon Bedrock
💰 Micro: $0.035 / $0.14; Lite: $0.06 / $0.24; Pro: $0.80 / $3.20 per million input/output tokens


You can use Amazon Nova to lower costs and latency for almost any generative AI task. You can build on Amazon Nova to analyze complex documents and videos, understand charts and diagrams, generate engaging video content, and build sophisticated AI agents, from across a range of intelligence classes optimized for enterprise workloads.

Whether you’re developing document processing applications that need to process images and text, creating marketing content at scale, or building AI assistants that can understand and act on visual information, Amazon Nova provides the intelligence and flexibility you need with two categories of models: understanding and creative content generation.

Amazon Nova understanding models accept text, image, or video inputs to generate text output. Amazon creative content generation models accept text and image inputs to generate image or video output.

Understanding models: Text and visual intelligence
The Amazon Nova models include three understanding models (with a fourth one coming soon) designed to meet different needs:

Amazon Nova Micro – A text-only model that delivers the lowest latency responses in the Amazon Nova family of models at a very low cost. With a context length of 128K tokens and optimized for speed and cost, Amazon Nova Micro excels at tasks such as text summarization, translation, content classification, interactive chat and brainstorming, and simple mathematical reasoning and coding. Amazon Nova Micro also supports customization on proprietary data using fine-tuning and model distillation to boost accuracy.

Amazon Nova Lite – A very low-cost multimodal model that is lightning fast for processing image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy. The model processes inputs up to 300K tokens in length and can analyze multiple images or up to 30 minutes of video in a single request. Amazon Nova Lite also supports text and multimodal fine-tuning and can be optimized to deliver the best quality and costs for your use case with techniques such as model distillation.

Amazon Nova Pro – A highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Pro is capable of processing up to 300K input tokens and sets new standards in multimodal intelligence and agentic workflows that require calling APIs and tools to complete complex workflows. It achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX). Amazon Nova Pro demonstrates strong capabilities in processing both visual and textual information and excels at analyzing financial documents. With an input context of 300K tokens, it can process code bases with over fifteen thousand lines of code. Amazon Nova Pro also serves as a teacher model to distill custom variants of Amazon Nova Micro and Lite.

Amazon Nova Premier – Our most capable multimodal model for complex reasoning tasks and for use as the best teacher for distilling custom models. Amazon Nova Premier is still in training. We’re targeting availability in early 2025.

Amazon Nova understanding models excel in Retrieval-Augmented Generation (RAG), function calling, and agentic applications. This is reflected in Amazon Nova model scores in the Comprehensive RAG Benchmark (CRAG) evaluation, Berkeley Function Calling Leaderboard (BFCL), VisualWebBench, and Mind2Web.

What makes Amazon Nova particularly powerful for enterprises is its customization capabilities. Think of it as tailoring a suit: you start with a high-quality foundation and adjust it to fit your exact needs. You can fine-tune the models with text, image, and video to understand your industry’s terminology, align with your brand voice, and optimize for your specific use cases. For instance, a legal firm might customize Amazon Nova to better understand legal terminology and document structures.

You can see the latest benchmark scores for these models on the Amazon Nova product page.

Creative content generation: Bringing concepts to life
The Amazon Nova models also include two creative content generation models:

Amazon Nova Canvas – A state-of-the-art image generation model producing studio-quality images with precise control over style and content, including rich editing features such as inpainting, outpainting, and background removal. Amazon Nova Canvas excels on human evaluations and key benchmarks such as text-to-image faithfulness evaluation with question answering (TIFA) and ImageReward.

Amazon Nova Reel – A state-of-the-art video generation model. Using Amazon Nova Reel, you can produce short videos through text prompts and images, control visual style and pacing, and generate professional-quality video content for marketing, advertising, and entertainment. Amazon Nova Reel outperforms existing models on human evaluations of video quality and video consistency.

All Amazon Nova models include built-in safety controls and creative content generation models include watermarking capabilities to promote responsible AI use.

Let’s see how these models work in practice for a few use cases.

Using Amazon Nova Pro for document analysis
To demonstrate the capabilities of document analysis, I downloaded the Choosing a generative AI service decision guide in PDF format from the AWS documentation.

First, I choose Model access in the Amazon Bedrock console navigation pane and request access to the new Amazon Nova models. Then, I choose Chat/text in the Playground section of the navigation pane and select the Amazon Nova Pro model. In the chat, I upload the decision guide PDF and ask:

Write a summary of this doc in 100 words. Then, build a decision tree.

The output follows my instructions producing a structured decision tree that gives me a glimpse of the document before reading it.

Console screenshot.

Using Amazon Nova Pro for video analysis
To demonstrate video analysis, I prepared a video by joining two short clips (more on this in the next section):

This time, I use the AWS SDK for Python (Boto3) to invoke the Amazon Nova Pro model using the Amazon Bedrock Converse API and analyze the video:

import boto3

AWS_REGION = "us-east-1"
MODEL_ID = "amazon.nova-pro-v1:0"
VIDEO_FILE = "the-sea.mp4"

bedrock_runtime = boto3.client("bedrock-runtime", region_name=AWS_REGION)

with open(VIDEO_FILE, "rb") as f:
    video = f.read()

user_message = "Describe this video."

messages = [{
    "role": "user",
    "content": [
        {"video": {"format": "mp4", "source": {"bytes": video}}},
        {"text": user_message},
    ],
}]

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=messages,
    inferenceConfig={"temperature": 0.0},
)

response_text = response["output"]["message"]["content"][0]["text"]
print(response_text)
Amazon Nova Pro can analyze videos that are uploaded with the API (as in the previous code) or that are stored in an Amazon Simple Storage Service (Amazon S3) bucket.

In the script, I ask to describe the video. I run the script from the command line. Here’s the result:

The video begins with a view of a rocky shore on the ocean, and then transitions to a close-up of a large seashell resting on a sandy beach.

I can use a more detailed prompt to extract specific information from the video such as objects or text. Note that Amazon Nova currently does not process audio in a video.

Using Amazon Nova for video creation
Now, let’s create a video using Amazon Nova Reel, starting from a text-only prompt and then providing a reference image.

Because generating a video takes a few minutes, the Amazon Bedrock API introduced three new operations:

StartAsyncInvoke – To start an asynchronous invocation

GetAsyncInvoke – To get the current status of a specific asynchronous invocation

ListAsyncInvokes – To list the status of all asynchronous invocations with optional filters such as status or date

Amazon Nova Reel supports camera control actions such as zooming or moving the camera. This Python script creates a video from this text prompt:

Closeup of a large seashell in the sand. Gentle waves flow all around the shell. Sunset light. Camera zoom in very close.

After the first invocation, the script periodically checks the status until the creation of the video has been completed. I pass a random seed to get a different result each time the code runs.

import random
import time

import boto3

AWS_REGION = "us-east-1"
MODEL_ID = "amazon.nova-reel-v1:0"
SLEEP_TIME = 30
S3_DESTINATION_BUCKET = "<BUCKET>"

video_prompt = "Closeup of a large seashell in the sand. Gentle waves flow all around the shell. Sunset light. Camera zoom in very close."

bedrock_runtime = boto3.client("bedrock-runtime", region_name=AWS_REGION)

model_input = {
    "taskType": "TEXT_VIDEO",
    "textToVideoParams": {"text": video_prompt},
    "videoGenerationConfig": {
        "durationSeconds": 6,
        "fps": 24,
        "dimension": "1280x720",
        "seed": random.randint(0, 2147483648)
    }
}

invocation = bedrock_runtime.start_async_invoke(
    modelId=MODEL_ID,
    modelInput=model_input,
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": f"s3://{S3_DESTINATION_BUCKET}"}}
)

invocation_arn = invocation["invocationArn"]
s3_prefix = invocation_arn.split('/')[-1]
s3_location = f"s3://{S3_DESTINATION_BUCKET}/{s3_prefix}"
print(f"\nS3 URI: {s3_location}")

while True:
    response = bedrock_runtime.get_async_invoke(
        invocationArn=invocation_arn
    )
    status = response["status"]
    print(f"Status: {status}")
    if status != "InProgress":
        break
    time.sleep(SLEEP_TIME)

if status == "Completed":
    print(f"\nVideo is ready at {s3_location}/output.mp4")
else:
    print(f"\nVideo generation status: {status}")
I run the script:

Status: InProgress
. . .
Status: Completed

Video is ready at s3://BUCKET/PREFIX/output.mp4
After a few minutes, the script completes and prints the output Amazon Simple Storage Service (Amazon S3) location. I download the output video using the AWS Command Line Interface (AWS CLI):

aws s3 cp s3://BUCKET/PREFIX/output.mp4 ./output-from-text.mp4
This is the resulting video. As requested, the camera zooms in on the subject.

Using Amazon Nova Reel with a reference image
To have better control over the creation of the video, I can provide Amazon Nova Reel a reference image such as the following:

A seascape image.

This script uses the reference image and a text prompt with a camera action (drone view flying over a coastal landscape) to create a video:

import base64
import random
import time

import boto3

S3_DESTINATION_BUCKET = "<BUCKET>"
AWS_REGION = "us-east-1"
MODEL_ID = "amazon.nova-reel-v1:0"
SLEEP_TIME = 30
input_image_path = "seascape.png"
video_prompt = "drone view flying over a coastal landscape"

bedrock_runtime = boto3.client("bedrock-runtime", region_name=AWS_REGION)

# Load the input image as a Base64 string.
with open(input_image_path, "rb") as f:
    input_image_bytes = f.read()
    input_image_base64 = base64.b64encode(input_image_bytes).decode("utf-8")

model_input = {
    "taskType": "TEXT_VIDEO",
    "textToVideoParams": {
        "text": video_prompt,
        "images": [{"format": "png", "source": {"bytes": input_image_base64}}]
    },
    "videoGenerationConfig": {
        "durationSeconds": 6,
        "fps": 24,
        "dimension": "1280x720",
        "seed": random.randint(0, 2147483648)
    }
}

invocation = bedrock_runtime.start_async_invoke(
    modelId=MODEL_ID,
    modelInput=model_input,
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": f"s3://{S3_DESTINATION_BUCKET}"}}
)

invocation_arn = invocation["invocationArn"]
s3_prefix = invocation_arn.split('/')[-1]
s3_location = f"s3://{S3_DESTINATION_BUCKET}/{s3_prefix}"

print(f"\nS3 URI: {s3_location}")

while True:
    response = bedrock_runtime.get_async_invoke(
        invocationArn=invocation_arn
    )
    status = response["status"]
    print(f"Status: {status}")
    if status != "InProgress":
        break
    time.sleep(SLEEP_TIME)

if status == "Completed":
    print(f"\nVideo is ready at {s3_location}/output.mp4")
else:
    print(f"\nVideo generation status: {status}")
Again, I download the output using the AWS CLI:

aws s3 cp s3://BUCKET/PREFIX/output.mp4 ./output-from-image.mp4
This is the resulting video. The camera starts from the reference image and moves forward.

Building AI responsibly
Amazon Nova models are built with a focus on customer safety, security, and trust throughout the model development stages, offering you peace of mind as well as an adequate level of control to enable your unique use cases.

We’ve built in comprehensive safety features and content moderation capabilities, giving you the controls you need to use AI responsibly. Every generated image and video includes digital watermarking.

The Amazon Nova foundation models are built with protections that match its increased capabilities. Amazon Nova extends our safety measures to combat the spread of misinformation, child sexual abuse material (CSAM), and chemical, biological, radiological, or nuclear (CBRN) risks.

Things to know
Amazon Nova models are available in Amazon Bedrock in the US East (N. Virginia) AWS region. Amazon Nova Micro, Lite, and Pro are also available in the US West (Oregon), and US East (Ohio) regions via cross-Region inference. As usual with Amazon Bedrock, the pricing follows a pay-as-you-go model. For more information, see Amazon Bedrock pricing.

The new generation of Amazon Nova understanding models speaks your language. These models understand and generate content in over 200 languages, with particularly strong capabilities in English, German, Spanish, French, Italian, Japanese, Korean, Arabic, Simplified Chinese, Russian, Hindi, Portuguese, Dutch, Turkish, and Hebrew. This means you can build truly global applications without worrying about language barriers or maintaining separate models for different regions. Amazon Nova models for creative content generation support English prompts.

As you explore Amazon Nova, you’ll discover its ability to handle increasingly complex tasks. You can use these models to process lengthy documents up to 300K tokens, analyze multiple images in a single request, understand up to 30 minutes of video content, and generate images and videos at scale from natural language. This makes these models suitable for a variety of business use cases, from quick customer service interactions to deep analysis of corporate documentation and asset creation for advertising, ecommerce, and social media applications.

Integration with Amazon Bedrock makes deployment and scaling straightforward. You can leverage features like Amazon Bedrock Knowledge Bases to enhance your model with proprietary information, use Amazon Bedrock Agents to automate complex workflows, and implement Amazon Bedrock Guardrails to promote responsible AI use. The platform supports real-time streaming for interactive applications, batch processing for high-volume workloads, and detailed monitoring to help you optimize performance.

Ready to start building with Amazon Nova? Give the new models a try in the Amazon Bedrock console today, visit the Amazon Nova models section of the Amazon Bedrock documentation, and send feedback to AWS re:Post for Amazon Bedrock. You can find deep-dive technical content and discover how our Builder communities are using Amazon Bedrock at community.aws. Let us know what you build with these new models!

###
https://yformer.github.io/efficient-track-anything/
Efficient Track Anything
Yunyang Xiong☨, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Bala Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi☨, Bilge Soran☨, Vikas Chandra☨
12/5/24
▶ Meta AI ▶ Nanyang Technological University
☨Project Lead


Efficient Track Anything Model (EfficientTAM) 📣
🔥 Can run >10 frames per second with reasonable video segmentation performance on iPhone 15. Achieves comparable performance with SAM 2 with improved efficiency (⚡ fast!). Uses a vanilla lightweight ViT image encoder.


Abstract
Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computation complexity of multistage image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track anything models that produce high-quality results with low latency and model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and efficient memory module to build EfficientTAMs, and train the models on SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with vanilla ViT perform comparably to SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.





Method
Our proposed EfficientTAM takes a vanilla lightweight ViT image encoder for frame feature extraction. An efficient memory cross-attention is proposed to further improve the efficiency of EfficientTAM by leveraging the strong locality of memory spatial embeddings.


The overview of EfficientTAM framework.

Quantitative Results

Semi-supervised video object segmentation results across video object segmentation benchmarks.


Speed vs J&F accuracy.

Qualitative Results

Video segmentation and tracking: SAM 2 (top), EfficientTAM (bottom).


Image segmentation with point prompt, box prompt, and segment everything: Input (left), SAM, EfficientSAM, SAM 2, EfficientTAM (right).

###
https://deepdataspace.com/home
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
2024/11/25
IDEA Research

The World's Top-Performing Vision Model for Open-World Object Detection

The project provides examples for using DINO-X, which are hosted on DeepDataSpace.

Highlights
Beyond Grounding DINO 1.5, DINO-X has several improvements, taking a step forward towards becoming a more general object-centric vision model. The highlights of the DINO-X are as follows:

✨ The Strongest Open-Set Detection Performance: DINO-X Pro set new SOTA results on zero-shot transfer detection benchmarks: 56.0 AP on COCO, 59.8 AP on LVIS-minival and 52.4 AP on LVIS-val. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, improving the previous SOTA performance by 5.8 box AP and 5.0 box AP. Such a result underscores its significantly enhanced capacity for recognizing long-tailed objects.

🔥 Diverse Input Prompt and Multi-level Output Semantic Representations: DINO-X can accept text prompts, visual prompts, and customized prompts as input, and it outputs representations at various semantic levels, including bounding boxes, segmentation masks, pose keypoints, and object captions, with multiple perception heads.

🍉 Rich and Practical Capabilities: DINO-X can simultaneously support lots of highly practical tasks, including Open-Set Object Detection and Segmentation, Phrase Grounding, Visual-Prompt Counting, Pose Estimation, and Region Captioning. We further develop a universal object prompt to achieve Prompt-Free Anything Detection and Recognition.

Latest News
2024/12/04: Release the DINO-X Open-World Detection and Segmentation feature. Please check here for the API usage.
2024/12/03: Support DINO-X with SAM 2 for Open-World Object Segmentation and Tracking. Please check Grounded SAM 2 for more details.
2024/11/25: Release DINO-X API for Open-World Detection.
2024/11/22: Launch DINO-X project and init documentation.
Model Framework
DINO-X can accept text prompts, visual prompts, and customized prompts as input, and it can generate representations at various semantic levels, including bounding boxes, segmentation masks, pose keypoints, and object captions.


Performance
Side-by-Side Performance Comparison with Previous Best Methods

Zero-Shot Performance on Object Detection Benchmarks
| Model | COCO (AP box) | LVIS-minival (AP all) | LVIS-minival (AP rare) | LVIS-val (AP all) | LVIS-val (AP rare) |
|---|---|---|---|---|---|
| Other Best Open-Set Model | 53.4 (OmDet-Turbo) | 47.6 (T-Rex2 visual) | 45.4 (T-Rex2 visual) | 45.3 (T-Rex2 visual) | 43.8 (T-Rex2 visual) |
| DetCLIPv3 | - | 48.8 | 49.9 | 41.4 | 41.4 |
| Grounding DINO | 52.5 | 27.4 | 18.1 | - | - |
| T-Rex2 (text) | 52.2 | 54.9 | 49.2 | 45.8 | 42.7 |
| Grounding DINO 1.5 Pro | 54.3 | 55.7 | 56.1 | 47.6 | 44.6 |
| Grounding DINO 1.6 Pro | 55.4 | 57.7 | 57.5 | 51.1 | 51.5 |
| DINO-X Pro | 56.0 | 59.7 | 63.3 | 52.4 | 56.5 |
Performance: DINO-X Pro achieves SOTA performance on COCO, LVIS-minival, LVIS-val, zero-shot object detection benchmarks.
Effective Long-tail Object Detection: DINO-X Pro has significantly improved the model's performance on LVIS-rare classes, significantly surpassing the previous SOTA Grounding DINO 1.6 Pro model by 5.8 AP and 5.0 AP, respectively, demonstrating the exceptional capability of DINO-X in long-tailed object detection scenarios.
Zero-Shot Performance on Segmentation Benchmarks
| Model | COCO (AP mask) | LVIS-minival (AP mask) | LVIS-minival (AP mask rare) | LVIS-val (AP mask) | LVIS-val (AP mask rare) |
|---|---|---|---|---|---|
| **Assembled General Perception Model** | | | | | |
| Grounded SAM (1.5 Pro + Huge) | 44.3 | 47.7 | 50.2 | 41.8 | 46.0 |
| Grounded SAM 2 (1.5 Pro + Large) | 44.7 | 46.2 | 50.1 | 40.5 | 44.6 |
| DINO-X Pro + SAM-Huge | 44.2 | 51.2 | 52.2 | - | - |
| **Unified Vision Model** | | | | | |
| DINO-X Pro (Mask Head) | 37.9 | 43.8 | 46.7 | 38.5 | 44.4 |
Performance: DINO-X achieves mask AP scores of 37.9, 43.8, and 38.5 on the COCO, LVIS-minival, and LVIS-val zero-shot instance segmentation benchmarks, respectively. Compared to Grounded SAM and Grounded SAM 2, there is still a notable performance gap for DINO-X to close. We will further optimize the segmentation performance in a future release.
Efficiency: Unlike the Grounded SAM series, DINO-X significantly improves segmentation efficiency by generating the corresponding mask for each region without requiring multiple complex inference steps.
Practical Usage: Users can use the mask function of DINO-X based on their actual needs. If users require simultaneous object segmentation and tracking, we recommend using the latest Grounded SAM 2 (DINO-X + SAM 2), which we have already implemented here.
API Usage
Installation
Install the required packages
pip install -r requirements.txt
Note: If you encounter some errors with API, please install the latest version of dds-cloudapi-sdk:

pip install dds-cloudapi-sdk --upgrade
Register on Official Website to Get API Token
First-Time Application: If you are interested in our project and wish to try our algorithm, you will need to apply for the corresponding API Token through our request API token website for your first attempt.

Request Additional Token Quotas: If you find our project helpful and need more API token quotas, you can request additional tokens by filling out this form. Our team will review your request and allocate more tokens for your use in one or two days. You can also apply for more tokens by sending us an email.

Run local API demos
Open-World Object Detection and Segmentation
Set your API token in demo.py and run local demo
python demo.py
After running the local demo, the annotated image will be saved at: ./outputs/open_world_detection
Demo Image Visualization
Related Work
Grounding DINO: Strong open-set object detection model.
Grounding DINO 1.5: Previous SOTA open-set detection model.
Grounded-Segment-Anything: Open-set detection and segmentation model by combining Grounding DINO with SAM.
T-Rex/T-Rex2: Generic open-set detection model supporting both text and visual prompts.

🦖 Dino-X: Unified Object-Centric LVM 🦖🦖
👉IDEA unveils DINO-X: unified object-centric vision model for Open-World Detection/Segmentation, Phrase Grounding, Visual Counting, Pose Estimation, Prompt-Free Detection/Recognition, Dense Region Caption, & much more. Impressive skills! Demo & API announced💙
𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬:
✅Object-Centric Understanding towards LVM
✅SOTA open-world object detection performance
✅Grounding-100M: 100M+ HQ training samples
✅DINO-X supports textual & visual prompts
✅Universal prompt-free open-world detection
#artificialintelligence #machinelearning #ml #AI #deeplearning #computervision #AIwithPapers #metaverse #LLM

###
https://github.com/PKU-YuanGroup/ConsisID
2024.12.01
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
Published on Nov 26 · Submitted by BestWishYsh on Nov 27
Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan
Abstract
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.

ConsisID can generate high-quality Identity-Preserving videos using an input Image and a text prompt! 🤩 🚀
Apache 2.0 Licensed. ConsisID Gradio app is available on Hugging Face

###
https://arxiv.org/pdf/2411.01747
DynaSaur: Large Language Agents Beyond Predefined Actions
Dang Nguyen, Viet Dac Lai, Seunghyun Yoon, Ryan A. Rossi, Handong Zhao, Ruiyi Zhang, Puneet Mathur, Nedim Lipka, Yu Wang, Trung Bui, Franck Dernoncourt, Tianyi Zhou
12/2/24

Existing LLM agent systems typically select actions from a fixed and predefined set at every step. While this approach is effective in closed, narrowly-scoped environments, we argue that it presents two major challenges when deploying LLM agents in real-world scenarios: (1) selecting from a fixed set of actions significantly restricts the planning and acting capabilities of LLM agents, and (2) this approach requires substantial human effort to enumerate and implement all possible actions, which becomes impractical in complex environments with a vast number of potential actions. In this work, we propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step. Furthermore, generated actions are accumulated over time for future reuse. Our extensive experiments on the GAIA benchmark demonstrate that this framework offers significantly greater flexibility and outperforms previous methods. Notably, it allows an LLM agent to recover in scenarios where no relevant action exists in the predefined set or when existing actions fail due to unforeseen edge cases. At the time of writing, we hold the top position on the GAIA public leaderboard. Our code can be found at this https URL.

𝗔𝗱𝗼𝗯𝗲'𝘀 𝗰𝗼𝗱𝗲-𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁 𝗿𝗲𝗮𝗰𝗵𝗲𝘀 𝘁𝗵𝗲 𝘁𝗼𝗽 𝗼𝗳 𝗚𝗔𝗜𝗔 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 - and their paper cites
💡 Reminder: Broadly defined, an "Agent" is a system where a LLM is augmented with the ability to run multiple steps or call tools. This means when a user asks a question, the LLM's answer can be grounded in a web search, document or code execution instead of having to rely on the LLM's lackuster awareness of the outside world.
In short, Agents are a vehicle in which you put your LLM to allow it access to the outside world!
➡️ The team of researchers at Adobe started from the idea that current agentic systems lack the ability to define their own tools. So they decided to make an agent that writes actions as code, thus allowing it to write python functions that can be re-used later as tools!
Here's what the LLM generations can look like with the proper prompt:
Thought: I need to access the excel file using a different method.
Action:
```py
def access_excel_file(file_path):
    ...  # rest of the code (the agent writes it, but there isn't room in this post)
    return rows
```

Then your system executes this and appends the observation to the agent’s memory.
Why is this code formulation better than classical tool use formulation as JSON? The paper explains:
“Most existing work uses text or JSON as the representation of actions, which significantly lacks the two criteria mentioned earlier: generality and composability. In contrast, DynaSaur can utilize available actions or create new ones if necessary, using code as a unified representation. In principle, acting with code enables agents to solve any Turing-complete problem.”
The idea of using code is not new: in fact, we do it in transformers.agents (thus the citation that I got). They implementation adds further refinements, like using RAG to retrieve relevant functions before generating an action, which increases performance further.
And they observe that code agents perform much better, reaching the top of GAIA leaderboard! 🥇
Go take a look, it’s really clear and informative!

https://github.com/Francis-Rings/StableAnimator

🛟🛟 StableAnimator: HQ ID-aware Humans 🛟🛟
👉StableAnimator: first e2e ID-preserving diffusion for HQ videos without any post-processing. Input: single image + sequence of poses. Training, pre-processing, eval-dataset & pro-model announced. Code & base-model released💙
𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬:
✅First E2E ID-preserving human animation
✅Global content-aware Face Encoder + ID-Adapter
✅Incorporate face embeddings maintaining fidelity
✅Novel HJB equation-based face optimization
✅Authors: CS-FUDAN, CMIC, #Microsoft, Huya & CMU
#artificialintelligence #machinelearning #ml #AI #deeplearning #computervision #AIwithPapers #metaverse #LLM #tiktok

StableAnimator

StableAnimator: High-Quality Identity-Preserving Human Image Animation
Shuyuan Tu1, Zhen Xing1, Xintong Han3, Zhi-Qi Cheng4, Qi Dai2, Chong Luo2, Zuxuan Wu1
[1Fudan University; 2Microsoft Research Asia; 3Huya Inc; 4Carnegie Mellon University]

Pose-driven Human image animations generated by StableAnimator, showing its power to synthesize high-fidelity and ID-preserving videos. All animations are directly synthesized by StableAnimator without the use of any face-related post-processing tools, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.

Comparison results between StableAnimator and state-of-the-art (SOTA) human image animation models highlight the superior performance of StableAnimator in delivering high-fidelity, identity-preserving human image animation.

Overview
model architecture
The overview of the framework of StableAnimator.

Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively and face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.

News
[2024-12-4]:🔥 We are thrilled to release an interesting dance demo (🔥🔥APT Dance🔥🔥)! The generated video can be seen on YouTube and Bilibili.
[2024-11-28]:🔥 The data pre-processing codes (human skeleton extraction) are available! Other codes will be released very soon. Stay tuned!
[2024-11-26]:🔥 The project page, code, technical report and a basic model checkpoint are released. Further training codes, data pre-processing codes, the evaluation dataset and StableAnimator-pro will be released very soon. Stay tuned!

https://cxcx1996.github.io/HiFiVFS/
HiFiVFS: High Fidelity Video Face Swapping
Xu Chen*,1, Keke He*,1, Junwei Zhu†,1, Yanhao Ge2, Wei Li2, Chengjie Wang1
1Tencent, 2VIVO
12/1/24
*Indicates Equal Contribution †Corresponding Author
Special makeup
Different face shape
Complicated occlusion
Complex lighting conditions
Extreme poses
Special makeup
Different face shape
Complicated occlusion
Complex lighting conditions
Extreme poses
Special makeup
Different face shape
Complicated occlusion
Abstract
Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, We introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) in video face swapping, both qualitatively and quantitatively.

Comparisons in the wild
Method Overview
HiFiVFS
Pipeline of our proposed HiFiVFS, including training and inference phases. HiFiVFS is primarily trained based on the SVD framework, utilizing multi-frame input and a temporal attention to ensure the stability of the generated videos. In the training phase, HiFiVFS introduces fine-grained attribute learning (FAL) and detailed identity learning (DIL). In FAL, attribute disentanglement and enhancement are achieved through identity desensitization and adversarial learning. DIL uses more face swapping suited ID features to further boost identity similarity. In the inference phase, FAL only retains Eatt for attribute extraction, making the testing process more convenient. It is noted that HiFiVFS is trained and tested in the latent space, but for visualization purposes, we illustrate all processes in the original image space.

👺👺 HiFiVFS: Extreme Face Swapping 👺👺
👉#Tencent unveils a novel video face swapping method called HiFiVFS, which can consistently generate HQ face swapping videos even in extremely challenging scenarios (occlusion, makeup, lights, extreme poses, etc.). Impressive results, no code announced😢
𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬:
✅HiFiVFS: high-fidelity video face swapping
✅Temporal stability within the face swapping
✅FAL: Fine-grained Attributes Learning
✅DIL: Detailed Identity Learning preservation
✅It’s the new SOTA, especially with extreme clips
#artificialintelligence #machinelearning #ml #AI #deeplearning #computervision #AIwithPapers #metaverse #LLM

https://huggingface.co/showlab/ShowUI-2B
𝗦𝗵𝗼𝘄𝗨𝗜: 𝗮 𝘀𝗺𝗮𝗹𝗹 𝗲𝗻𝗱-𝘁𝗼-𝗲𝗻𝗱 𝗮𝗴𝗲𝗻𝘁 𝘁𝗵𝗮𝘁 𝗰𝗮𝗻 𝗻𝗮𝘃𝗶𝗴𝗮𝘁𝗲 𝗮𝗻𝘆 𝗨𝗜 𝗮𝗻𝗱 𝗼𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝘀 𝗺𝘂𝗰𝗵 𝗯𝗶𝗴𝗴𝗲𝗿 𝘀𝘆𝘀𝘁𝗲𝗺𝘀! 📲
12/5/24

A team from National University of Singapore and Microsoft just released an agent that can act on any UI (Desktop, Android, Web) without needing additional text information. It works extremely well : they applied their method on a tiny Qwen2-VL-2B, and they managed to beat methods that use either much more powerful vision models (like GPT-4V) without using any additional info (e.g. leveraging the DOM of a webpage) like previous methods did ! 👏👏
🧑‍🏫 Reminder: definition of an “Agent”: this is really a bloated word by now. It would be more precise to talk of “agency”, defined as “the ability to for an LLM to execute actions on its environment”. Just parsing an LLM output and using it to determine the workflow of your code could be called an “agent” already.
Anyway, they started from the idea that most existing methods rely heavily on text, which makes them less generalizable, while letting aside rich UI structure that user actually rely on when navigating this interfaces.
⚙️ They put several good ideas to work:
💡 Simplify screenshots to the max:
They prune a lot the heavy visual content of UI screenshots, by removing cloned image patches (like any vast patch of the same color will be reduced to a small patch, while maintaining positional embeddings), then group patches from the same GUI elements together to simplify even further
💡 Build a truly generalist dataset:
To train a general UI agent, you need trajectories from each possible UI, and express them in a common language. Authors merge datasets like OmniAct for Desktop, Mind2Web for websites, AMEX for Android trajectories to create a high-quality and diverse dataset.
➡️ Nice results ensued:
They fine-tune a tiny Qwen-2-VL-2B on their method, and it reaches SOTA on several task (element identification, web navigation), even beating methods that either use additional info from the DOM or use much bigger VLMS like GPT-4v! 🏆
And performance could certainly jump with a slightly bigger vision model. Let’s hope the community builds this soon! 🚀
ShowUI is a lightweight vision-language-action model for GUI agents.

You can easily run this model on Windows and macOS using OOTB!


If you like our project, please give us a star ⭐ for the latest update.

📢 News
[2024.11.16] showlab/ShowUI-2B is available at huggingface.
[2024.11.27] We release the arXiv paper, HF Spaces demo and ShowUI-desktop-8K.
[2024.12.1] We support iterative refinement to improve grounding accuracy. Try it at HF Spaces demo.
[2024.12.5] Major Update: ShowUI is integrated into OOTB for local run!

https://github.com/ADaM-BJTU/O1-CODER
O1-CODER: AN O1 REPLICATION FOR CODING
Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong & Jitao Sang ∗
12/4/24
School of Computer Science and Technology
Beijing Jiaotong University
Beijing, China
{yuxiangzhang, wushangxi, yqyang, jiangmingshu, jinlinx, 23120361,
jtsang}@bjtu.edu.cn
ABSTRACT
The technical report introduces O1-CODER, an attempt to replicate OpenAI’s o1 model with a focus on coding tasks. It integrates reinforcement learning (RL) and Monte Carlo Tree Search (MCTS) to enhance the model’s System-2 thinking capabilities. The framework includes training a Test Case Generator (TCG) for standardized code testing, using MCTS to generate code data with reasoning processes, and iteratively fine-tuning the policy model to initially produce pseudocode, followed by the generation of the full code. The report also addresses the opportunities and challenges in deploying o1-like models in real-world applications, suggesting transitioning to the System-2 paradigm and highlighting the imperative for environment state updates. Updated model progress and experimental results will be reported in subsequent versions. All source code, curated datasets, as well as the derived models will be disclosed at https://github.com/ADaM-BJTU/O1-CODER.

Overview
O1-CODER is an attempt to replicate OpenAI’s O1 model, focused on coding tasks. The approach combines Reinforcement Learning (RL) and Monte Carlo Tree Search (MCTS) to enhance the model’s System-2 thinking capabilities, aiming to generate more efficient and logical code.

Method
The core components of O1-CODER are:

Test Case Generator (TCG): Automatically generates standardized test cases to evaluate the correctness of the generated code.
Self-Play and Reinforcement Learning: The model generates reasoning data through self-play, and uses RL and MCTS to iteratively optimize the policy model. These methods work in an iterative cycle, continuously refining the model to improve systematic reasoning and optimization in coding tasks.

https://gen-omnimatte.github.io/

Generative Omnimatte
Learning to Decompose Video into Layers
Yao-Chih Lee1,2 Erika Lu1 Sarah Rumbley1 Michal Geyer1,3 Jia-Bin Huang2 Tali Dekel1,3 Forrester Cole1
1Google DeepMind 2University of Maryland College Park 3Weizmann Institute of Science

12/3/24

This AI Learned to Turn a Video Into Layers 🖼️
✨ Developed by Google DeepMind, this new method decomposes a video into complete layers, including objects and their associated effects (e.g., shadows, reflections). 🎬
🤔 Previous methods couldn’t handle:

  • dynamic backgrounds
  • occlusions
  • effect association for multiple objects
Video is liquid 💦

Input video
Omnimatte layers

lego

Our method decomposes a video into a set of RGBA omnimatte layers,
where each layer consists of a fully-visible object and its associated effects like shadows and reflections.
Our omnimattes enable a wide range of video editing for users.
(Scroll to view more videos)

Comparisons on Omnimattes

Input
Omnimatte
Omnimatte3D
OmnimatteRF
Ours
We compare our method with existing omnimatte methods (Omnimatte, Omnimatte3D, OmnimatteRF, and FactorMatte). Existing methods rely on restrictive motion assumptions, such as stationary background, resulting in dynamic background elements becoming entangled with foreground object layers. Omnimatte3D and OmnimatteRF may also produce blurry background layers (e.g., horses) because their 3D-aware background representations are sensitive to camera pose estimation quality. Furthermore, these methods lack a generative and semantic prior for completing occluded pixels and accurately associating effects with their corresponding objects.

Comparisons on Object and Effect Removal

[Video comparison columns: Input & object to remove | ProPainter | Lumiere-Inpainting | ObjectDrop | Ours]
We compare our object-effect-removal model, Casper, with existing methods for object removal. Video inpainting models (ProPainter and Lumiere-Inpainting) fail to remove soft shadows and reflections outside the input masks. ObjectDrop is an image-based model, and thus, it processes each video frame independently and inpaints regions without global context and temporal consistency. We use the same ratio of mask dilation for all the methods.

Method
Given an input video and binary object masks, we first apply our object-effect-removal model, Casper, to generate a clean-plate background and a set of single-object (solo) videos by applying different trimask conditions. The trimasks specify regions to preserve (white), regions to remove (black), and regions that potentially contain uncertain object effects (gray). In Stage 2, a test-time optimization reconstructs the omnimatte layers O_i from pairs of solo video and background video.
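To make the trimask format concrete, here is a toy sketch of how such a condition could be assembled from per-object binary masks. The specific rule used here (object to remove in black, other objects in white, everything else in gray) is our simplifying assumption for illustration, not necessarily the paper's exact construction.

```python
# Toy sketch of building a trimask from per-object binary masks (assumed rule:
# 0 = remove, 255 = preserve, 128 = uncertain regions that may contain effects).
import numpy as np

def build_trimask(masks: list[np.ndarray], remove_idx: int) -> np.ndarray:
    """masks: list of (H, W) boolean object masks; remove_idx: object to remove."""
    h, w = masks[0].shape
    trimask = np.full((h, w), 128, dtype=np.uint8)   # gray: uncertain effect regions
    for i, m in enumerate(masks):
        if i != remove_idx:
            trimask[m] = 255                          # white: preserve other objects
    trimask[masks[remove_idx]] = 0                    # black: object to remove
    return trimask

# Two toy square objects; remove object 0, preserve object 1.
h, w = 32, 32
obj0 = np.zeros((h, w), bool); obj0[4:12, 4:12] = True
obj1 = np.zeros((h, w), bool); obj1[18:28, 18:28] = True
tm = build_trimask([obj0, obj1], remove_idx=0)
print(np.unique(tm))  # [  0 128 255]
```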

Object and Effect Removal with Trimask Condition
We use different trimask conditions for an input video to obtain a set of single-object (solo) videos and a clean-plate background video (bottom row). Note that we do not cherry-pick the random seeds for the Casper model; we use the same random seed (0) for all input videos.

[Columns: Input | Trimask | Output removal]

Training data

We collect omnimatte results from existing omnimatte methods (Omnimatte, Omnimatte3D, and OmnimatteRF) to provide examples of cause-and-effect relationships in real videos.

Ablation Study on Training data of Casper

[Ablation columns: Input | Trimask | Omnimatte-only | + Tripod | + Kubric | + Object-Paste (full)]

We assess the individual contributions of each dataset category to our model's performance by incrementally adding each category to the training set. While the Omnimatte data provides basic examples of shadows in real-world videos, it primarily features static backgrounds and single objects. The Tripod data provides additional real-world scenarios to better handle water effects, such as reflections and boat wakes. Our Kubric synthetic data strengthens the model's ability to handle multi-object scenes. Finally, the Object-Paste data reduces undesired background changes and improves inpainting quality.

Ablation Study on Input Condition of Casper

[Columns: Input | Masked RGB + binary mask | Unmasked RGB + binary mask | Unmasked RGB + trimask (ours)]
Our proposed trimask explicitly defines the regions to be removed or preserved, thereby enabling more accurate handling of multi-object scenarios. In contrast, the model trained on binary masks is susceptible to ambiguity, potentially leading to undesired removal of objects meant to be preserved.

Our Limitations

[Columns: Input | Trimask | Output removal]
The removal model may not always produce the desired outcome, particularly in challenging multi-object cases.

User-specified trimask
We observe some cases where Casper associates unrelated dynamic background effects with a foreground layer, such as the waves in the example below. To mitigate this, our system allows the user to modify the trimask by specifying a coarse preservation region to better preserve the background waves.

Visualization of Effect Association in the Self-Attention of Video Generator

[Attention visualization columns: Input & target object (for visualization metric) | Lumiere T2V output & attention | Lumiere-Inpainting output & attention | Our Casper output & attention]

To investigate the inherent understanding of object-effect associations in the text-to-video (T2V) Lumiere generation model, we analyze its self-attention patterns during the denoising process using SDEdit. We hypothesize that the T2V model possesses an intrinsic understanding of effect associations, allowing us to train an effective object-effect-removal model with a relatively small dataset.

We further compare the attention behaviors of the original T2V model, the Lumiere-Inpainting model, and our Casper model, which is sequentially fine-tuned from the T2V model. To ensure accurate attention measurement, we do not dilate the input mask conditions for both Inpainting and Casper models.

The visualized value of each pixel indicates the strength of association between its query token and the key tokens in the target object mask region. We visualize the first, middle, and final attention blocks of the U-Net architecture at the sampling step t=0.125. For a detailed description of the attention visualization metric, please refer to Section 3.3 of our main paper.
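As a rough illustration of that metric (not the paper's implementation), the sketch below sums, for every query token, the attention weights it places on key tokens inside the target object mask; a random row-stochastic matrix stands in for the attention maps that would be extracted from the video model.

```python
# Sketch of the described visualization metric: per query token (pixel), sum
# its attention weights over key tokens inside the target object mask.
import numpy as np

def effect_association_map(attn: np.ndarray, key_mask: np.ndarray) -> np.ndarray:
    """attn: (Q, K) row-stochastic attention matrix; key_mask: (K,) boolean mask
    of key tokens inside the target object. Returns (Q,) association scores."""
    return attn[:, key_mask].sum(axis=1)

h = w = 16                                  # toy 16x16 token grid
q = k = h * w
rng = np.random.default_rng(0)
attn = rng.random((q, k))
attn /= attn.sum(axis=1, keepdims=True)     # normalize rows like a softmax output

key_mask = np.zeros((h, w), bool)
key_mask[4:8, 4:8] = True                   # toy "target object" region
scores = effect_association_map(attn, key_mask.ravel()).reshape(h, w)
print(scores.shape, scores.max())           # higher = stronger association with the object
```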

We observe that the T2V model’s object query tokens exhibit a strong focus on the object itself, as its primary task is to generate the object and its effects. This tendency may also be present in the Inpainting model when it attempts to fill the mask region with another object to justify shadows. In contrast, Casper’s object query tokens show less self-attention and more attention to the background region, suggesting a focus on background completion rather than object and effect generation.

In multi-object scenarios (boys-beach, five-beagles), the T2V and Inpainting models may associate different, similar objects with the target object. Our Casper model, however, demonstrates a lower attention response (darker) to similar objects, indicating a stronger ability to isolate individual objects.

We also analyzed the attention patterns of the failure case, five-beagles, where our Casper model did not remove the corresponding shadow completely. We hypothesize that the effect association is already weak in the T2V model, and our Casper model, inheriting knowledge from the pretrained models, struggles to handle such challenging cases.

https://arxiv.org/pdf/2412.02205
[Submitted on 3 Dec 2024 (v1), last revised 4 Dec 2024 (this version, v2)]
DataLab: A Unified Platform for LLM-Powered Business Intelligence
Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Peng Chen, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Wei Chen
Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools leads to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports a wide range of BI tasks for different data roles by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.
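As a purely generic illustration of what "cell-based context management" can mean in a notebook setting (this is not DataLab's actual algorithm), the sketch below scores previous cells against the current query and keeps only the most relevant ones within a token budget before a prompt is assembled.

```python
# Generic illustration of cell-based context selection in a notebook
# (hypothetical, not DataLab's implementation).
from dataclasses import dataclass

@dataclass
class Cell:
    source: str     # cell code or markdown
    output: str     # captured output, possibly truncated

def score(cell: Cell, query: str) -> float:
    """Toy lexical-overlap relevance score; a real system could use embeddings."""
    q = set(query.lower().split())
    c = set((cell.source + " " + cell.output).lower().split())
    return len(q & c) / max(len(q), 1)

def select_context(cells: list[Cell], query: str, token_budget: int = 200) -> list[Cell]:
    ranked = sorted(cells, key=lambda c: score(c, query), reverse=True)
    picked, used = [], 0
    for cell in ranked:
        cost = len((cell.source + cell.output).split())  # crude token estimate
        if used + cost <= token_budget:
            picked.append(cell)
            used += cost
    return picked

cells = [
    Cell("df = pd.read_csv('sales.csv')", "10000 rows"),
    Cell("df.groupby('region').revenue.sum()", "North 1.2M ..."),
    Cell("plt.plot(x, y)", ""),
]
print(len(select_context(cells, "total revenue by region")))
```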

https://ai.pydantic.dev/
PydanticAI
Agent Framework / shim to use Pydantic with LLMs
12/2/24

A new agent framework from Pydantic that looks like a combination of Instructor and OpenAI's Swarm concept: simple, model-agnostic, type-safe, and free of complex abstractions!
TL;DR:
🔧 Uses Python for control flow and composition - no special DSL or patterns
🔍 Type-safe by design with full IDE support and static type-checking
📈 Model-agnostic, supports OpenAI, Gemini, and open models from vLLM or TGI
📚 Structured response validation with Pydantic
🌊 Supports streaming responses with validation
🎮 Dynamic runtime context/dependencies, e.g. customer data
🧪 Dependency injection for testing and iterative development.
🤖 Supports function tools, reflection, and self-correction.


When I first found FastAPI, I got it immediately. I was excited to find something so innovative and ergonomic built on Pydantic.

Virtually every Agent Framework and LLM library in Python uses Pydantic, but when we began to use LLMs in Pydantic Logfire, I couldn’t find anything that gave me the same feeling.

PydanticAI is a Python Agent Framework designed to make it less painful to build production grade applications with Generative AI.

Why use PydanticAI
• Built by the team behind Pydantic (the validation layer of the OpenAI SDK, the Anthropic SDK, LangChain, LlamaIndex, AutoGPT, Transformers, CrewAI, Instructor and many more)
• Model-agnostic: OpenAI, Gemini, and Groq are currently supported, Anthropic support is coming soon, and there is a simple interface to implement support for other models
• Type-safe
• Control flow and agent composition are done with vanilla Python, allowing you to use the same Python development best practices you'd apply in any other (non-AI) project
• Structured response validation with Pydantic
• Streamed responses, including validation of streamed structured responses with Pydantic
• Novel, type-safe dependency injection system, useful for testing and eval-driven iterative development
• Logfire integration for debugging and monitoring the performance and general behavior of your LLM-powered application
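A minimal usage sketch, modeled on the project's hello-world style, is shown below; it assumes an OpenAI API key in the environment, and exact parameter names (e.g. result_type, result.data) may differ between PydanticAI versions.

```python
# Minimal PydanticAI usage sketch; parameter names may vary across versions.
from pydantic import BaseModel
from pydantic_ai import Agent

class CityInfo(BaseModel):      # structured response validated by Pydantic
    city: str
    country: str

# Assumes OPENAI_API_KEY is set in the environment.
agent = Agent(
    "openai:gpt-4o",
    result_type=CityInfo,
    system_prompt="Extract the city and country mentioned in the user message.",
)

result = agent.run_sync("I just moved to Paris for a new job.")
print(result.data)              # e.g. CityInfo(city='Paris', country='France')
```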

https://github.com/quivrhq/megaparse
12/4/24
MegaParse - Your Parser for Every Type of Document
MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has you covered. It focuses on having no information loss during parsing.

Key Features 🎯
Versatile Parser: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.
No Information Loss: Focus on having no information loss during parsing.
Fast and Efficient: Designed with speed and efficiency at its core.
Wide File Compatibility: Supports text, PDF, PowerPoint, Excel, CSV, and Word documents.
Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.
Support
Files: ✅ PDF ✅ PowerPoint ✅ Word
Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images
Example
Installation
pip install megaparse
Usage
Add your OpenAI or Anthropic API key to the .env file

Install poppler on your computer (images and PDFs)

Install tesseract on your computer (images and PDFs)

If you are on a Mac, you also need to install libmagic (brew install libmagic)

from megaparse import MegaParse
from megaparse.parser.unstructured_parser import UnstructuredParser

# Parse the PDF with the default Unstructured-based parser
parser = UnstructuredParser()
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # save the parsed content as Markdown
Use MegaParse Vision
Change the parser to MegaParseVision
import os

from megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.parser.megaparse_vision import MegaParseVision

# Vision parsing uses a multimodal LLM to read the document pages
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")
Note: The models supported by MegaParse Vision are multimodal ones such as Claude 3.5, Claude 4, GPT-4o, and GPT-4.

(Optional) Use LlamaParse for Improved Results
Create an account on Llama Cloud and get your API key.

Change the parser to LlamaParser

import os

from megaparse import MegaParse
from megaparse.parser.llama_parser import LlamaParser

parser = LlamaParser(api_key=os.getenv("LLAMA_CLOUD_API_KEY"))
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed doc in Markdown format
Use as an API
There is a Makefile for you; simply run make dev at the root of the project and you are good to go.

See localhost:8000/docs for more info on the different endpoints!

Benchmark

| Parser | similarity_ratio |
| --- | --- |
| megaparse_vision | 0.87 |
| unstructured_with_check_table | 0.77 |
| unstructured | 0.59 |
| llama_parser | 0.33 |

Higher is better.

Note: Want to evaluate and compare your MegaParse module with ours? Add your config in evaluations/script.py and then run python evaluations/script.py. If it performs better, open a PR; let's go higher together.

In Construction 🚧
• Improve the table checker
• Create checkers to add modular postprocessing ⚙️
• Add structured output; let's get computers talking 🤖

https://arxiv.org/abs/2411.14708

[Submitted on 22 Nov 2024 (v1), last revised 2 Dec 2024 (this version, v2)]
Understanding LLM Embeddings for Regression
Eric Tang, Bangding Yang, Xingyou Song
Google Research

With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than using traditional feature engineering. This regression performance can be explained in part due to LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.
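A minimal sketch of the embedding-as-features setup described in the abstract: string serializations of the inputs are encoded into fixed-length vectors, and a standard regressor is fit on top. The encoder here (sentence-transformers) and the synthetic dataset are stand-ins for illustration; the paper studies LLM embedders.

```python
# Minimal sketch of embedding-based regression: encode string representations
# into fixed vectors, then fit a standard regressor on those features.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy task: predict y = x1 + 2*x2 from a string serialization of the features.
X_num = rng.uniform(-1, 1, size=(200, 2))
y = X_num[:, 0] + 2 * X_num[:, 1]
texts = [f"x1: {a:.3f}, x2: {b:.3f}" for a, b in X_num]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model
emb = encoder.encode(texts)                          # (200, d) feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(emb, y, test_size=0.25, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("R^2 on held-out data:", model.score(X_te, y_te))
```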



</details>