In today's AI news, Mistral, OpenAI, NVIDIA, Apple, Microsoft, and several other companies made technical announcements. Mistral released Pixtral 12B, a new vision language model, beating Meta to a multimodal release. OpenAI shared examples of how ChatGPT is used in the creative process, showing how professional writers put AI to work. NVIDIA announced BigVGAN v2, a state-of-the-art audio generation model that substantially improves synthesis speed and quality. Apple revealed the iPhone 16's new on-device AI features along with example prompts. Microsoft open-sourced GraphRAG, a library for efficiently processing unstructured data, and FishAudio released Fish Speech 1.4, a multilingual text-to-speech model. ICTNLP announced LLaMA-3.1-8B Omni, a model that processes speech and text simultaneously, and Arcee.ai released SuperNova, built on Meta's LLaMA 3.1. In addition, AWS launched EC2 P5e instances for high-performance AI workloads, and GOT 2.0 demonstrated more advanced OCR (optical character recognition) capabilities. Finally, Upstage unveiled Solar Pro Preview, a model that delivers strong performance on a single GPU.

Mistral AI, Pixtral 12B Vision Language Model Released

Link, September 12, 2024

  • Mistral released Pixtral 12B, a new vision language model
  • Text backbone: Mistral Nemo 12B; vision adapter: 400M parameters
  • GeLU activation is used in the vision adapter and 2D RoPE in the vision encoder
  • Expanded vocabulary size: 131,072 tokens
  • Three new special tokens for image handling: img, img_break, img_end
  • Image size: 1024 x 1024 pixels; patch size: 16 x 16 pixels
  • Model weights are provided in bf16 and can be downloaded from Hugging Face (see the sketch after this list)
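
The image and patch sizes above fully determine the visual token budget: a 1024 x 1024 image cut into 16 x 16 patches gives a 64 x 64 grid. Below is a minimal sketch, assuming only the huggingface_hub client (official inference code had not been published at the time of the source post); it prints that patch count and downloads the bf16 checkpoint named in the source link.

```python
# Minimal sketch: patch-count arithmetic from the release notes, plus a weights
# download from the Hugging Face Hub. Not official inference code.
from huggingface_hub import snapshot_download

IMAGE_SIZE = 1024  # pixels per side, per the release notes
PATCH_SIZE = 16    # pixels per side of each patch

patches_per_side = IMAGE_SIZE // PATCH_SIZE    # 64
patches_per_image = patches_per_side ** 2      # 4,096 patch tokens, before the img_break/img_end markers
print(f"{patches_per_side} x {patches_per_side} = {patches_per_image} patches per full-size image")

# Downloads several GB of bf16 weights; repo id taken from the source link.
local_dir = snapshot_download(repo_id="mistral-community/pixtral-12b-240910")
print("weights downloaded to", local_dir)
```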

OpenAI, Writing with AI: How Writers Use ChatGPT

Link, September 10, 2024

  • Shares examples of how writers develop their ideas with ChatGPT
  • Writers use ChatGPT for brainstorming, story consulting, research assistance, and editorial feedback
  • Professional writers treat ChatGPT as a story consultant and a source of feedback
  • A "reverse interviewing" technique, in which ChatGPT asks the questions, helps draw out writers' own ideas (a prompt sketch follows this list)
  • ChatGPT is particularly useful for analogies, metaphors, and idea generation
  • Some writers use its story feedback to solve specific plot problems
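
The reverse-interview pattern amounts to a prompt that asks the model to pose questions rather than answer them. A minimal sketch follows, assuming the OpenAI Python client and a placeholder model name; the prompt wording is condensed from the example quoted in the sources.

```python
# Minimal sketch of a "reverse interview" prompt with the OpenAI Python client.
# The model name is a placeholder; the prompt is adapted from the quoted example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {
        "role": "user",
        "content": (
            "I want to write a short story. I'd like all the ideas to come from me, "
            "but I want your help eliciting them. Ask me 10 questions, one at a time, "
            "that spark my creativity and draw on my own memories. After each answer, "
            "ask two follow-up questions before moving on."
        ),
    }
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any chat-capable model works
    messages=messages,
)
print(response.choices[0].message.content)
```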

NVIDIA, BigVGAN v2 Audio Synthesis Model Announced

Link, September 5, 2024

  • NVIDIA announced BigVGAN v2, significantly improving performance in audio generation
  • Audio quality improves markedly over the previous model and rates highly across diverse audio types
  • Up to 3x faster audio synthesis; optimized CUDA kernels enable generation up to 240x faster than real time
  • Pretrained checkpoints support a variety of audio configurations
  • Supports sampling rates up to 44 kHz, covering the full range of human hearing (see the sketch after this list)
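
BigVGAN is a neural vocoder: a Mel spectrogram goes in and a waveform comes out. The sketch below is a rough illustration rather than the official pipeline: it builds the Mel front end with torchaudio and leaves the vocoder itself as a clearly marked placeholder to be loaded per the NVIDIA/BigVGAN repository's instructions.

```python
# Rough sketch of the vocoder workflow BigVGAN implements: Mel spectrogram in,
# waveform out. `vocoder` is a placeholder for a BigVGAN v2 checkpoint loaded
# per the NVIDIA/BigVGAN repo (loading API not reproduced here).
import torch
import torchaudio

SAMPLE_RATE = 44_000  # BigVGAN v2 supports sampling rates up to 44 kHz, per the post

wav, sr = torchaudio.load("speech.wav")                      # any local recording
wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=2048, hop_length=512, n_mels=128
)(wav)  # shape: [channels, n_mels, frames]; exact Mel settings are illustrative

def vocoder(mel_spec: torch.Tensor) -> torch.Tensor:
    """Placeholder: a loaded BigVGAN v2 model would map Mel frames back to audio."""
    raise NotImplementedError("load a BigVGAN v2 checkpoint here")

# waveform = vocoder(mel)  # would yield audio at SAMPLE_RATE
print("Mel frames:", mel.shape[-1])
```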

Apple, iPhone 16 On-Device AI Features Announced

Link, September 9, 2024

  • Apple's announcement highlights the iPhone 16's on-device AI features
  • Apple Intelligence adapters run on-device and support priority notifications and summarization
  • A 3B-parameter on-device model is provided, with adapters for localization
  • Prompts are used for features such as email rewriting, content safety classification, and priority notifications (see the example after this list)
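
The adapter prompts quoted in the sources are plain templates with a {{userContent}} slot. The snippet below is purely illustrative and is not Apple's runtime: it fills two of the quoted templates with a sample notification using ordinary string substitution.

```python
# Illustrative only: filling the Apple Intelligence prompt templates quoted in
# the sources with user content. Apple's templates write the slot as
# {{userContent}}; Python's str.format uses single braces.
PRIORITY_TEMPLATE = "{userContent} Is this urgent?"
SAFETY_TEMPLATE = (
    "You are a helpful assistant that classifies input as Safe or Unsafe. "
    "{userContent}"
)

notification = "Your flight BA283 departs in 45 minutes from gate B12."
print(PRIORITY_TEMPLATE.format(userContent=notification))
print(SAFETY_TEMPLATE.format(userContent=notification))
```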

Microsoft, GraphRAG Released as Open Source

Link, September 1, 2024

  • Microsoft released GraphRAG, a new open-source library
  • GraphRAG automatically processes unstructured text data to extract knowledge graphs
  • Combined with LLMs, it turns unstructured data into structured knowledge
  • The GraphRAG Solution Accelerator makes it easy to get started using Azure resources (a conceptual sketch follows this list)
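
The core idea is LLM-driven triple extraction feeding a graph. The sketch below illustrates that idea conceptually; it does not use the GraphRAG library's actual API, and the model name is a placeholder.

```python
# Conceptual sketch of the GraphRAG idea (NOT the library's actual API): ask an
# LLM for (subject, relation, object) triples, then load them into a graph.
import json
import networkx as nx
from openai import OpenAI

client = OpenAI()
text = "Microsoft open-sourced GraphRAG. GraphRAG extracts knowledge graphs with LLMs."

prompt = (
    "Extract (subject, relation, object) triples from the text below. "
    f"Reply with only a JSON list of 3-element lists.\n\n{text}"
)
reply = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
triples = json.loads(reply.choices[0].message.content)  # assumes a bare JSON reply

graph = nx.DiGraph()
for subj, rel, obj in triples:
    graph.add_edge(subj, obj, relation=rel)
print(list(graph.edges(data=True)))
```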

FishAudio, Fish Speech 1.4 Released

Link, September 11, 2024

  • A multilingual text-to-speech model trained on 700K hours of speech data
  • Supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic
  • Provides instant voice cloning and ultra-low latency
  • Model weights can be downloaded from Hugging Face

ICTNLP, LLaMA-3.1-8B Omni Released

Link, September 12, 2024

  • ICTNLP announced the LLaMA-3.1-8B Omni model
  • A low-latency, end-to-end speech model that generates speech and text simultaneously
  • Built on Llama 3.1 8B Instruct with a Whisper Large v3 speech encoder and a HuBERT-based speech decoder, enabling real-time speech interaction at under 250 ms latency
  • Model checkpoints are available on Hugging Face

Arcee.ai, SuperNova LLaMA 3.1 Released

Link, September 11, 2024

  • Arcee.ai released 70B and 8B models distilled from Meta's LLaMA 3.1 405B
  • SuperNova outperforms Meta's LLaMA 3.1 70B Instruct; the 70B model is available via API and the 8B model on Hugging Face
  • Performance was improved by combining offline knowledge distillation with synthetic-data fine-tuning (using Spectrum and EvolKit), DPO, and model merging

AWS, EC2 P5e Instances Launched

Link, September 11, 2024

  • AWS launched P5e instances for high-performance AI workloads
  • Based on NVIDIA H200 GPUs, with higher memory bandwidth that significantly improves throughput for large AI models
  • Greatly improves inference performance for Meta LLaMA 3.1 70B and 405B, with cost savings of up to 40%
  • Much larger GPU memory lets a single instance host large models, reducing the need for complex distributed setups (see the worked example after this list)
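
A quick back-of-the-envelope calculation shows why the single-instance claim holds for the 405B model. It uses only figures from the sources (141 GB per H200, 8 GPUs per instance, and P5e's 1.76x memory advantage over P5's 8 x 80 GB) plus the usual 2 bytes per parameter for bf16 weights; KV cache and activations would add to this, so treat it as a lower bound.

```python
# Memory math behind the "single instance" claim (lower bound: weights only).
PARAMS_405B = 405e9
BYTES_PER_PARAM_BF16 = 2

weights_gb = PARAMS_405B * BYTES_PER_PARAM_BF16 / 1e9   # ~810 GB of weights
p5e_memory_gb = 8 * 141                                  # 1128 GB per P5e instance (H200)
p5_memory_gb = 8 * 80                                    # 640 GB per P5 instance (H100)

print(f"Llama 3.1 405B bf16 weights: ~{weights_gb:.0f} GB")
print(f"P5e GPU memory: {p5e_memory_gb} GB -> fits on one instance: {weights_gb < p5e_memory_gb}")
print(f"P5 GPU memory:  {p5_memory_gb} GB -> fits on one instance: {weights_gb < p5_memory_gb}")
```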

GOT 2.0, Next-Generation OCR Technology Announced

Link, September 3, 2024

  • GOT 2.0 is a new OCR (optical character recognition) model that handles complex documents and formulas
  • Uses a vision encoder and a Qwen-0.5B decoder and can process an 8k context
  • Supports formulas and mathematical expressions, multi-page PDFs, and dynamic resolution
  • Outperforms LLaVA-NeXT and Qwen-VL-Max on OCR benchmarks (an edit-distance sketch follows this list)
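
The headline metric for plain-text OCR is normalized edit distance (GOT reports about 0.035). The snippet below is a small illustration of how that metric is computed: Levenshtein distance divided by the reference length, so 0.0 means a perfect transcription. The example strings are made up.

```python
# Normalized edit distance: Levenshtein distance / length of the reference text.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

reference = "E = mc^2 explains mass-energy equivalence."
ocr_output = "E = mc2 explains mass-energy equivalence."
score = levenshtein(ocr_output, reference) / len(reference)
print(f"normalized edit distance: {score:.3f}")
```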

Upstage, Solar Pro Preview: High-Performance LLM Announced

Link, September 10, 2024

  • Solar Pro Preview is a high-performance LLM designed to run on a single GPU
  • With 22B parameters, it performs comparably to LLaMA 3.1 70B while fitting in far less VRAM
  • Built by scaling up the Phi-3-medium model, with significantly improved performance
  • The official release is planned for November 2024, with broader language support and expanded capabilities (a loading sketch follows this list)
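
Below is a minimal loading sketch, assuming the standard Hugging Face transformers interface; the model card may require trust_remote_code and a recent transformers version, so treat the exact flags as assumptions and check the card before running.

```python
# Minimal sketch: load Solar Pro Preview with transformers and generate a reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upstage/solar-pro-preview-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # 22B params in bf16 -> roughly 44 GB of weights, within an 80 GB GPU
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Explain depth up-scaling in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```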
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date


  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://huggingface.co/mistral-community/pixtral-12b-240910
Mistral AI, 9/12/24
Mistral released Pixtral 12B Vision Language Model 🔥

Some notes on the release:

1. Text backbone: Mistral Nemo 12B

2. Vision Adapter: 400M

3. Uses GeLU (for vision adapter) & 2D RoPE (for vision encoder)

4. Larger vocabulary - 131,072

5. Three new special tokens - `img`, `img_break`, `img_end`

6. Image size: 1024 x 1024 pixels

7. Patch size: 16 x 16 pixels

8. Tokenizer support in mistral_common

9. Model weights in bf16

10. Haven't seen the inference code yet

11. Weights up on Hugging Face Hub 🤗

GG Mistral for successfully frontrunning Meta w/ Multimodal 🐐


###
https://openai.com/chatgpt/use-cases/writing-with-ai/
OpenAI, 9/10/24
Writing with AI
Five ways professional writers are leveraging ChatGPT

We’re sharing a few examples of how writers are using ChatGPT in their creative process.

Writers are using ChatGPT as a sounding board, story consultant, research assistant, and editor—to talk out ideas, find the right word, clarify their thinking, and get feedback on structure and flow.

Hi! I want to write a short story, but I need a little help getting started. Can you help me come up with an interesting and original premise?

I’d like all the ideas to come from me, but I want your help eliciting them. First, provide me with 10 questions to:

Inspire my creativity and imagination

Prompt me to juxtapose disparate concepts or settings to create novel ideas

Recall meaningful memories, funny stories, conversations, and emotional moments from my own life experiences

Then, ask me each question one at a time. For each response, ask two follow-up questions, one at a time, before moving on to the next question. Let's have a fun and playful conversation to spark my creativity!

Absolutely! Let's get creative! Here are the 10 questions I'll ask to guide your storytelling:

What was the most surprising or unexpected thing that ever happened to you?

What’s a hobby, job, or skill you’ve always wanted to master, but never had the time or opportunity to pursue?

Imagine a place where two completely different worlds collide. What would those worlds look like, and how would they interact?

Have you ever had a conversation that left a lasting impression on you? What was it about?

What’s the funniest or strangest coincidence you’ve ever experienced?

Is there a time in your life when you felt completely out of your element? What was that experience like?

What’s the most peculiar or mysterious place you’ve ever been to?

Think of an object you own that has sentimental value. How could that object be magical in a story?

Have you ever had a vivid dream that stuck with you long after you woke up? What happened in it?

If you could speak to your younger self, what would you say? How might that conversation spark a story?

Ready? Let’s dive in!

What was the most surprising or unexpected thing that ever happened to you?

“The instinct is to say, ‘Oh, this thing just writes for us.’ But I can also prompt it to ask me questions. It can get me thinking by pulling ideas and insights out of me.”
Stew Fortier, writer and founder
Here are five professional writers who use ChatGPT not to write for them, but as a tool to extend their own creativity.

Editorial Feedback
Writing is inherently solitary, and that solitude has a drawback: trapped in your own perspective, it’s hard to be objective about your work. That’s why writers need editors—to provide a fresh view on what works and what needs work.

But editors aren’t always available, and even when they are, they only see finished drafts. That’s where ChatGPT can help, offering instant feedback and guidance at every stage of the writing journey.

Screenwriter David Cornue sees ChatGPT as a 24/7 writers’ room, helping him develop story ideas and solve narrative problems. Here’s how he uses ChatGPT to get feedback on his screenplays and beat sheets.

“I use ChatGPT as a story collaborator and story evaluator—not a story generator.

I tell my fellow writers, if you try to have it write for you, you lose something in yourself. It’s like cheating on a robot’s homework to get an A+. I’d rather I did my own homework.

I never use it to say, “Rewrite this scene for me.” Writers need to ask themselves, “How can you use this in a way that feels good, that makes you feel like it’s still you and still your voice?” As long as you’re the creative engine behind it, this is a great tool to utilize.

I was very clear about how I wanted to use AI. I set boundaries the way one would in a relationship. I said, this is acceptable; this is unacceptable. I put up a creative firewall to say, “Do not write this for me. I’m the writer. You are not the writer.”

I use ChatGPT all day, every day. I think of it like a writers’ room. Right now I’m working on adapting a Stephen King novel. I’ll say, “Here’s my beat sheet. I feel like this is not dramatic enough. Thoughts?” It offers ideas and we have a back-and-forth to talk out the challenges as one would in a writers’ room.

Sometimes I ask it to give me feedback using storytelling principles I’ve developed. Or, I ask it to play the role of a veteran screenwriter. I say, “All right, pretend you’re such-and-such famous writer or famous director. What are your notes?”

I’m always in the driver’s seat. It’s this fantastic sounding board that’s faster, smarter, and better than anything else I could imagine.”

David Cornue

Wordfinding
To write well, you need to find the right words. But there’s a palpable difference between a word that’s almost right—serviceable but vague, directionally correct yet inexact—and one that conveys meaning with thrilling precision: what Flaubert called “le mot juste.” His advice: “Never settle for approximations.”

ChatGPT achieves what a thesaurus never could: it considers the sentence in which a word appears, offering nuanced, pertinent suggestions to help writers articulate ideas with absolute clarity.

Here’s how former New York Times columnist Farhad Manjoo uses ChatGPT for “wordfinding”:

“ChatGPT has become part of my workflow. Before I start writing, I sometimes have a conversation with it about the ideas I’m thinking of. It feels like brainstorming—like thinking an idea through, or thinking out loud.

I’ve also experimented with using it as my first reader. After I write a few paragraphs, I show it to ChatGPT and ask it for general criticism. Sometimes it's pretty good. It will suggest potential counterarguments, or lead me to ideas I hadn’t thought of. It’s not the same as talking to an editor, but it’s still great to have something to bounce my thoughts off.

My top use is what I wrote about in my New York Times article, which is wordfinding. I'm constantly trying to come up with a clever way of saying something; I guess that's what writing is.

Before ChatGPT, I’d look at dictionaries and thesauruses and go on wild-goose Google searches for the perfect word.

ChatGPT has replaced all of those for me. It’s also really helpful with analogies, metaphors and idioms. Often, I will think of half of an analogy and not be able to finish it in just the right way. Now, if I want some interesting comparison but don't know exactly the words to use, I ask it for help, and after a little back and forth it helps me find the thought I’d been searching for. I've never used any other tool that can help with that.”

Farhad Manjoo

Reverse Interviewing
Most people think of ChatGPT as a question-answering machine: ask a question, get an answer. But writer and founder Stew Fortier finds it most useful when he’s the one giving answers. He calls it “reverse interviewing”—a technique where ChatGPT guides the user through a series of probing questions, drawing out insights and nudging them to articulate their thoughts more clearly.

Here’s how Stew uses ChatGPT as a tool to unlock his own creativity:

“I think of it as a creative collaborator. The question should be, “How can I use AI to be more creative?”

One paradigm I think is underdiscussed is that you can actually prompt these tools to ask you questions, to get you thinking, to prompt you to start writing. The instinct is to say, “Oh, this thing just writes for us.” But it can also ask me questions. It can also get me thinking and shape my ideas—and that’s where great writing really starts.

What if instead of you becoming a prompt engineer, you see what it can prompt out of you? The AI can be a non-judgmental collaborator that helps pull out these great, unique insights from you.

I view it as an always-on editor that can help writers do better work—not replace the work they're doing. They can use it for brainstorming, research, doing the non-writing part of the writing process using the AI.

It’s like if you were in The Onion’s writer’s room. Six of you working together are going to be funnier together than you would be solo. There's a symbiosis that can exist.

When I'm honest with myself, I actually wouldn't ever want to give up the act of writing. Writing is like working out or meditating. It's one of those activities you value intrinsically. Even if the AI could write perfect prose, I’d still want to write because writing itself is a valuable act.

Writing is hard. There are so many things that are legit hard about it that make people give up. You haven't had your coffee yet, you're short on ideas, you can't track down a statistic or fact you need for a story… If AI can help with those things, then you as a writer are more likely to show up and do the work.”

Stew Fortier

Comedy Writing
Good comedy may seem effortless, but writing jokes actually requires research. Humor thrives on context, and as comedy writer Sarah Rose Siskind explains, knock-knock jokes fall flat precisely because they’re decontextualized. Sarah uses ChatGPT to research joke setups, uncovering common themes and exaggerated observations, so she can focus on the fun creative part: writing punchlines.

Here’s how Sarah uses ChatGPT to gather material:

“I use ChatGPT for comedy. It's not going to give me a finished joke, but it's going to start the conversation. I’ll be like, “I need you to embody Moses. And as Moses, hear the complaints of a girl named Megan who’s a present-day social media manager and try to console her, despite your objectively more difficult life.” I can have a conversation with anyone. I feel like I have the person there with me.

I find it useful when I’m writing the setup for a joke. I was writing a parody of Shark Tank and I asked, “What are the kinds of things people say on Shark Tank?”

With a parody, it's not one-to-one accuracy. You're taking things that are different and exaggerating them. So I'm looking for not just what are common things such as, “I’m looking for investment,” but what are uniquely Shark Tank things I can exaggerate? Such as, “Sharks, has this ever happened to you?”

I was writing a roast speech for a guy at a coding conference. I'm not a coder, so I asked ChatGPT to delve into the inside jokes of coding communities. I asked, “What are stereotypes about the 50-year-old programmer on your team?” ChatGPT responded with, among other suggestions: “They will make you listen to endless stories.” That reminded me how older people always talk about meeting celebrities before they were famous. I then prompted ChatGPT for some tech celebrities who might fit the bill and, of its list, Steve Wozniak felt like the funniest pick. In the end the joke was about how senior developers will endlessly regale their younger colleagues about that time they almost met Steve Wozniak at a party in 1998. I chose the direction, but ChatGPT helped me fill in the details.

What’s amazing to me is I do not have writer’s block anymore—like truly. I think writer’s block is the feeling of solipsism and it is the feeling of being totally alone. And I don’t feel alone anymore because of this tool.”

Sarah Rose Siskind

Watch Sarah demonstrate her joke-writing process
Research and Worldbuilding
Research is the writer’s gateway to new worlds. Writers are advised to write what they know, but as Tom Wolfe observed, relying solely on firsthand experience limits their creative scope. To authentically portray people and places never seen—real and imagined, past and future—writers must expand what they know through research and reportage.

But research can be agonizingly slow, requiring writers to sift through mountains of sources to uncover a single elusive fact. Novelist and Substack writer Elle Griffin explains how ChatGPT accelerates this process and fuels her imagination, delivering immediate answers to specific questions and inspiring unexpected connections and serendipitous discoveries.

“ChatGPT has completely revolutionized my writing.

By far the most difficult part of writing is researching. Before ChatGPT, I’d use Google to find studies, go down Wikipedia rabbit holes and look up all the footnotes, and read dozens of books from Amazon. I spent way too much time weeding through hundreds of webpages and paying for dozens of books just to read a few chapters from each one.

Now I just ask ChatGPT something like: “What research studies have been conducted about more and less emotive cultures?” and get right to the research I need. I always ask ChatGPT to include links to sources so I can make sure the information is accurate.

I also use ChatGPT to brainstorm my novel. When I wanted to imagine how humans might evolve in one million years, I asked if there’s evidence that humans are still evolving. ChatGPT led me to a study showing that waterfaring people, who hold their breath for long periods while diving, develop enlarged spleens. When researching technological systems, it led me to a study about transmitting data through water. These inspired two chapters of my utopian novel!

None of this querying in any way detracts from my own voice. It just helps me reach my own theses more quickly because I’m led to more relevant source materials. ChatGPT isn’t doing my writing or thinking for me. It’s just giving me better access to information so I can spend more time being creative and less time hunting for information. And none of this detracts from the deep reading and research I love to do. I have now read several incredible books that were very on-target with my interests thanks to ChatGPT finding the works most pertinent to my work.

I can't even begin to explain how much better this is than searching through page after page on Google, weeding through SEO-optimized lists written by bloggers and brands.”

Elle Griffin

###
https://developer.nvidia.com/blog/achieving-state-of-the-art-zero-shot-waveform-audio-generation-across-audio-types/?ncid=so-link-568047&linkId=100000287921594
NVIDIA, 9/5/24

📣 Exciting news for developers in the audio tech space 🌟 BigVGAN v2 is here to revolutionize audio synthesis, delivering unparalleled quality and speed. 🎧✨
https://nvda.ws/47nyYdp

✅ Top-notch Audio Quality
✅ Faster Synthesis
✅ Pre-trained Checkpoints
✅ High Sampling Rate Support
Dive into the future of audio synthesis with BigVGAN v2 and create sounds that are indistinguishable from the real thing 👀🌐💡

Achieving State-of-the-Art Zero-Shot Waveform Audio Generation across Audio Types
Sep 05, 2024
By Sang-gil Lee and Rafael Valle


Stunning audio content is an essential component of virtual worlds. Audio generative AI plays a key role in creating this content, and NVIDIA is continuously pushing the limits in this field of research. BigVGAN, developed in collaboration with the NVIDIA Applied Deep Learning Research and NVIDIA NeMo teams, is a generative AI model specialized in audio waveform synthesis that achieves state-of-the-art results. BigVGAN generates waveforms orders of magnitude faster than real time and shows strong robustness with various audio types, including speech, environmental sounds, and music.

This post discusses BigVGAN v2, which delivers significant improvements in speed and quality, empowering a future where generated audio is indiscernible from real audio. BigVGAN v2 highlights include:

State-of-the-art audio quality measured by diverse metrics across many audio types.
Up to 3x faster synthesis speed by leveraging optimized CUDA kernels.
Ready-to-use pretrained checkpoints supporting diverse audio configurations.
Support for a sampling rate of up to 44 kHz, which covers the highest sound frequency humans can hear.
BigVGAN: A universal neural vocoder
BigVGAN is a universal neural vocoder specialized in synthesizing audio waveforms using Mel spectrograms as inputs. Neural vocoders are a cornerstone method in audio generative AI that generate sound waves from compact acoustic features, such as Mel spectrogram. BigVGAN is available as open source through NVIDIA/BigVGAN on GitHub.

BigVGAN is a fully convolutional architecture (Figure 1) with several upsampling blocks using transposed convolution followed by multiple residual dilated convolution layers. It features a novel module, called anti-aliased multiperiodicity composition (AMP), which is specifically designed for generating waveforms.

AMP is specialized in synthesizing high-frequency and periodic sound waves, drawing inspiration from audio signal processing principles. It applies a periodic activation function, called Snake, which provides an inductive bias to the architecture in generating periodic sound waves. It also applies anti-aliasing filters to reduce undesired artifacts in the generated waveforms. To learn more, see BigVGAN: A Universal Neural Vocoder with Large-Scale Training.
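
As a rough illustration of the Snake activation described above (f(x) = x + sin²(αx)/α, with a learnable α), here is a minimal PyTorch sketch; BigVGAN's production module additionally applies the anti-aliasing filters and fused CUDA kernels mentioned in this post, which are not reproduced here.

```python
# Minimal sketch of the Snake periodic activation: x + (1/alpha) * sin^2(alpha * x),
# with one learnable alpha per channel. Illustrative only; not BigVGAN's full module.
import torch
import torch.nn as nn

class Snake1d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))  # per-channel alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, time]
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2

x = torch.randn(2, 128, 1024)
print(Snake1d(128)(x).shape)  # torch.Size([2, 128, 1024])
```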

Figure 1. BigVGAN architecture and anti-aliased multi-periodicity composition (AMP) modules using the periodic activation function (Snake1d) and low-pass filters for universal waveform synthesis
Generating every sound in the world
Waveform audio generation, a crucial component in building virtual worlds, has long been an active research area. Despite its importance, current vocoding methods often produce audio lacking fine details in high-frequency sound waves. BigVGAN v2 effectively addresses this issue, providing high-quality audio with enhanced fine details.

BigVGAN v2 is trained using NVIDIA A100 Tensor Core GPUs and up to more than 100x larger audio data than its predecessor. Aimed at encapsulating every sound in the world, the dataset includes speech in multiple languages, environmental sounds from everyday objects, and diverse instruments. As a result, BigVGAN v2 can generate high-quality sound waves from numerous domains with a single model.

Below, listen to audio comparisons of real recordings and generated samples from BigVGAN and BigVGAN v2 at the 24 kHz sampling rate. BigVGAN v2 generates high-quality sound waves.

[Audio comparison samples: Recordings (24 kHz) vs. BigVGAN vs. BigVGAN v2]
Reaching the highest frequency sound the human ear can detect
Previous waveform synthesizers were limited to sampling rates between 22 kHz and 24 kHz. BigVGAN v2, however, expands this range to a 44 kHz sampling rate, encapsulating the entire human auditory spectrum. This matches the highest frequencies the human ear can detect, which do not exceed a sampling rate of 40 kHz. As a result, BigVGAN v2 can reproduce comprehensive soundscapes, capturing everything from the robust reverberations of drums to the crisp shimmer of crash cymbals in music, for example.

Below, listen to audio comparisons of real recordings and generated samples from two BigVGAN v2 models, one for the 24 kHz sampling rate and another for the 44 kHz sampling rate.

[Audio comparison samples: Recordings (44 kHz) vs. BigVGAN v2 (24 kHz) vs. BigVGAN v2 (44 kHz)]
Faster synthesis with custom CUDA kernels
Compared to its predecessor, BigVGAN v2 also features accelerated synthesis speed by using custom CUDA kernels, with up to 3x faster inference speed than the original BigVGAN. The optimized inference CUDA kernels written for BigVGAN v2 can generate audio waveforms up to 240x faster than real time on a single NVIDIA A100 GPU.

BigVGAN v2 audio quality results
BigVGAN v2 24 kHz shows better audio quality for speech and general audio compared to its open-sourced predecessor—and by a significant margin (Figures 2 and 3).

Figure 2. BigVGAN v2 24 kHz versus BigVGAN results on LibriTTS-dev speech data
Figure 3. BigVGAN v2 24 kHz versus BigVGAN results on MUSDB18-HQ music data
In addition, the new BigVGAN v2 44 kHz model shows comparable audio quality to Descript Audio Codec (.dac), an open-source high-quality neural audio codec (Figures 4 and 5).

Figure 4. BigVGAN v2 44 kHz versus Descript Audio Codec results on HiFi-TTS-dev speech data
Figure 5. BigVGAN v2 44 kHz versus Descript Audio Codec results on MUSDB18-HQ music data
All results show the quality of generated waveforms with respect to the following metrics:

Perceptual Evaluation of Speech Quality (PESQ)
Virtual Speech Quality Objective Listener (ViSQOL)
Multi-Resolution Short-Time Fourier Transform (M-STFT)
Periodicity Root Mean Square Error (Periodicity)
Voice/Unvoiced F1 Score (V/UV F1)
Conclusion
NVIDIA is committed to delivering the best audio generative AI accessible to all. The release of BigVGAN v2 pushes neural vocoder technology and audio quality to new heights, even reaching the limits of human auditory perception.

BigVGAN v2 sets a new standard in audio synthesis, delivering state-of-the-art quality across all audio types, covering the full range of human hearing. Its synthesis is now up to 3x faster than the original BigVGAN, ensuring efficient processing for diverse audio configurations.

Before diving into BigVGAN v2, we encourage users to review the model card for a seamless experience.

###
https://github.com/Explosion-Scratch/apple-intelligence-prompts/blob/main/PROMPTS.md
Apple, 9/9/24
The new Apple iPhone 16 was announced 1 hour ago with a big focus on on-device AI and Apple Intelligence. But did you know the prompts of the Apple Intelligence Adapters for priority notifications or summarization are inspectable? 👀
The available 3B on-device Apple Intelligence Model has special prompts and adapters for features like priority notifications, localization, guardrails, etc. Below are some examples:
1️⃣ Priority Notification: A simple prompt like "{{userContent}} Is this urgent?" helps Apple Intelligence prioritize notifications.
2️⃣ Email Rewriting: “You are an assistant which helps the user respond to their mails..., do not hallucinate…{{userContent}}” enables automatic, concise, and factually accurate email replies.
3️⃣ Content Safety Classification: “You are a helpful assistant that classifies input as Safe or Unsafe...{{userContent}}” enables Apple Intelligence to detect unsafe content.
4️⃣ Localization Adaptation: A prompt such as "Respond in British English" adjusts communication based on regional preferences.

###
https://github.com/microsoft/graphrag
Microsoft, 9/1/24
Microsoft just open-sourced GraphRAG. It might be the best Python library to extract insights from unstructured text.
It uses LLMs to automate the extraction of knowledge graphs from your datasets and text documents, making it much more powerful than vanilla RAG.
GraphRAG
👉 Use the GraphRAG Accelerator solution
👉 Microsoft Research Blog Post
👉 Read the docs
👉 GraphRAG Arxiv

Overview
The GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.

To learn more about GraphRAG and how it can be used to enhance your LLM's ability to reason about your private data, please visit the Microsoft Research Blog Post.

Quickstart
To get started with the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.

Repository Guidance
This repository presents a methodology for using knowledge graph memory structures to enhance LLM outputs. Please note that the provided code serves as a demonstration and is not an officially supported Microsoft offering.

⚠️ Warning: GraphRAG indexing can be an expensive operation, please read all of the documentation to understand the process and costs involved, and start small.

Diving Deeper
To learn about our contribution guidelines, see CONTRIBUTING.md
To start developing GraphRAG, see DEVELOPING.md
Join the conversation and provide feedback in the GitHub Discussions tab!
Prompt Tuning
Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend fine-tuning your prompts by following the Prompt Tuning Guide in our documentation.

###
https://huggingface.co/fishaudio/fish-speech-1.4
FishAudio, 9/11/24
🚨 New powerful open Text to Speech model: Fish Speech 1.4 - trained on 700K hours of speech, multilingual (8 languages)🔥
> Instant Voice Cloning
> Ultra low latency
> ~1GB model weights
> Model weights on the Hub 🤗
> Play with the space in the comments 👇
Fish Speech V1.4
Fish Speech V1.4 is a leading text-to-speech (TTS) model trained on 700k hours of audio data in multiple languages.

Supported languages:

English (en) ~300k hours
Chinese (zh) ~300k hours
German (de) ~20k hours
Japanese (ja) ~20k hours
French (fr) ~20k hours
Spanish (es) ~20k hours
Korean (ko) ~20k hours
Arabic (ar) ~20k hours
Please refer to Fish Speech Github for more info.
Demo available at Fish Audio.

###
https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni
ICTNLP, 9/12/24
End to End Speech models are on fire - LLAMA-OMNI 8B - Apache licensed! 🔥
> Speech Encoder - Whisper Large v3
> LLM backbone - Llama 3.1 8B Instruct
> Speech Decoder - HuBERT (UnitY)
> Simultaneously generate Speech + Text
> Less than 250 ms latency
> Trained in less than 3 days on 4x GPUs
> Used 200K instruct pairs
> Model checkpoints on the Hub 🤗
> Space incoming!
🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng*

[Paper] [Model] [Code]

LLaMA-Omni is a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency and high-quality speech interactions, simultaneously generating both text and speech responses based on speech instructions.



💡 Highlights
💪 Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.

🚀 Low-latency speech interaction with a latency as low as 226ms.

🎧 Simultaneous generation of both text and speech responses.

♻️ Trained in less than 3 days using just 4 GPUs.




###
https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite
Arcee.ai, 9/11/24
First distilled Llama 3.1 released by Arcee.ai! 🦙 SuperNova is a distilled reasoning Llama 3.1 70B & 8B! 👀 Arcee distilled Meta Llama 3.1 405B using offline knowledge distillation and combined it with RLHF and model merging to create a new open LLM. SuperNova 70B is available via API, and 8B is available on Hugging Face. 🔥
Post Training pipeline:
1️⃣ Used offline Distillation to distill Llama 3.1 405B into 70B
2️⃣ Fine-tuned a second Llama 3.1 70B on synthetic data with Spectrum and Evol-Kit
3️⃣ Trained a third Llama 3.1 Instruct with DPO
4️⃣ Final model is merged in 2 steps: 2️⃣ + 3️⃣ > 1️⃣
Results:
🥇 Outperforms Meta Llama 3.1 70B instruct across benchmarks
🤗 SuperNova 70B is available via API, and 8B is available on Hugging Face.
🧠 Works with the <thinking> prompt of Reflection
📊 Best open LLM on IFEval, surpassing OpenAI and Anthropic models
🔬 Developed using Spectrum, EvolKit, and Direct Preference Optimization (DPO)


###
https://aws.amazon.com/de/blogs/machine-learning/amazon-ec2-p5e-instances-are-generally-available/
AWS, 9/11/24
H200s are now available on Amazon Web Services (AWS)! AWS launched P5e instances with NVIDIA H200 (141GB) via Amazon EC2 Capacity Blocks. Amazon EC2 Capacity Blocks is an option to reserve GPU capacity for AI workloads on Amazon EC2. P5e instances with 8x H200 start at $5.408/h/GPU, or $43.26/h per instance.
Here is how it works:
1️⃣ Reserve GPU instances for a future start date for a specified duration
2️⃣ Reservations can be made up to 8 weeks in advance
3️⃣ Available in cluster sizes of 1 to 64 instances (up to 512 GPUs)
4️⃣ Reservation durations range from 1-14 days in 1-day increments, up to 28 days total
5️⃣ Instances are colocated in EC2 UltraClusters for high-performance ML workloads

Amazon EC2 P5e instances are generally available
by Avi Kulkarni, Aman Shanbhag, Pavel Belevich, Khaled Rawashdeh, Karthik Venna, Maxime Hugues, and Shruti Koparkar | on 09 SEP 2024 | in Amazon EC2, Announcements, Compute
State-of-the-art generative AI models and high performance computing (HPC) applications are driving the need for unprecedented levels of compute. Customers are pushing the boundaries of these technologies to bring higher fidelity products and experiences to market across industries.

The size of large language models (LLMs), as measured by the number of parameters, has grown exponentially in recent years, reflecting a significant trend in the field of AI. Model sizes have increased from billions of parameters to hundreds of billions of parameters within a span of 5 years. As LLMs have grown larger, their performance on a wide range of natural language processing tasks has also improved significantly, but the increased size of LLMs has led to significant computational and resource challenges. Training and deploying these models requires vast amounts of computing power, memory, and storage.

The size of an LLM has a significant impact on the choice of compute needed for inference. Larger LLMs require more GPU memory to store the model parameters and intermediate computations, as well as greater computational power to perform the matrix multiplications and other operations needed for inference. Large LLMs take longer to perform a single inference pass due to this increased computational complexity. This increased compute requirement can lead to higher inference latency, which is a critical factor for applications that require real-time or near real-time responses.

HPC customers exhibit similar trends. With the fidelity of HPC customer data collection increasing and datasets reaching exabyte scale, customers are looking for ways to enable faster time to solution across increasingly complex applications.

To address customer needs for high performance and scalability in deep learning, generative AI, and HPC workloads, we are happy to announce the general availability of Amazon Elastic Compute Cloud (Amazon EC2) P5e instances, powered by NVIDIA H200 Tensor Core GPUs. AWS is the first leading cloud provider to offer the H200 GPU in production. Additionally, we are announcing that P5en instances, a network optimized variant of P5e instances, are coming soon.

In this post, we discuss the core capabilities of these instances and the use cases they’re well-suited for, and walk you through an example of how to get started with these instances and carry out inference deployment of Meta Llama 3.1 70B and 405B models on them.

EC2 P5e instances overview
P5e instances are powered by NVIDIA H200 GPUs with 1.7 times more GPU memory capacity and 1.5 times faster GPU memory bandwidth as compared to NVIDIA H100 Tensor Core GPUs featured in P5 instances.

P5e instances incorporate 8 NVIDIA H200 GPUs with 1128 GB of high bandwidth GPU memory, 3rd Gen AMD EPYC processors, 2 TiB of system memory, and 30 TB of local NVMe storage. P5e instances also provide 3,200 Gbps of aggregate network bandwidth with support for GPUDirect RDMA, enabling lower latency and efficient scale-out performance by bypassing the CPU for internode communication.

The following table summarizes the details for the instance.

Instance size: p5e.48xlarge | vCPUs: 192 | Instance memory: 2 TiB | GPU: 8 x NVIDIA H200 | GPU memory: 1128 GB HBM3e | Network bandwidth: 3,200 Gbps EFA | GPUDirect RDMA: Yes | GPU peer-to-peer: 900 GB/s NVSwitch | Instance storage: 8 x 3.84 TB NVMe SSD | EBS bandwidth: 80 Gbps
EC2 P5en instances coming soon
One of the bottlenecks in GPU-accelerated computing may lie in the communication between CPUs and GPUs. The transfer of data between these two components can be time-consuming, especially for large datasets or workloads that require frequent data exchanges. This challenge could impact a wide range of GPU-accelerated applications such as deep learning, high-performance computing, and real-time data processing. The need to move data between the CPU and GPU can introduce latency and reduce the overall efficiency. Additionally, network latency can become an issue for ML workloads on distributed systems, because data needs to be transferred between multiple machines.

EC2 P5en instances, coming soon in 2024, can help solve these challenges. P5en instances pair the NVIDIA H200 GPUs with custom 4th Generation Intel Xeon Scalable processors, enabling PCIe Gen 5 between CPU and GPU. These instances will provide up to four times the bandwidth between CPU and GPU and lower network latency, thereby improving workload performance.

P5e use cases
P5e instances are ideal for training, fine-tuning, and running inference for increasingly complex LLMs and multimodal foundation models (FMs) behind the most demanding and compute-intensive generative AI applications, including question answering, code generation, video and image generation, speech recognition, and more.

Customers deploying LLMs for inference can benefit from using P5e instances, which offer several key advantages that make them an excellent choice for these workloads.

Firstly, the higher memory bandwidth of the H200 GPUs in the P5e instances allows the GPU to fetch and process data from memory more quickly. This translates to reduced inference latency, which is critical for real-time applications like conversational AI systems where users expect near-instant responses. The higher memory bandwidth also enables higher throughput, allowing the GPU to process more inferences per second. Customers deploying the 70-billion-parameter Meta Llama 3.1 model on P5e instances can expect up to 1.87 times higher throughput and up to 40% lower cost [1] compared to using comparable P5 instances. ([1] Input Sequence Length 121, Output Sequence Length 5000, batch size 10, vLLM framework)

Secondly, the massive scale of modern LLMs, with hundreds of billions of parameters, requires an immense amount of memory to store the model and intermediate computations during inference. On the standard P5 instances, this would likely necessitate the use of multiple instances to accommodate the memory requirements. However, the P5e instances’ 1.76 times higher GPU memory capacity enables you to scale up by using a single instance to fit the entire model. This avoids the complexity and overhead associated with distributed inference systems, such as data synchronization, communication, and load balancing. Customers deploying the 405-billion-parameter Meta Llama 3.1 model on a single P5e instance can expect up to 1.72 times higher throughput and up to 69% lower cost [2] compared to using two P5 instances. ([2] Input Sequence Length 121, Output Sequence Length 50, batch size 10, vLLM framework)

Finally, the higher GPU memory of the P5e instances also enables the use of larger batch sizes during inference for better GPU utilization, resulting in faster inference times and higher overall throughput. This additional memory can be particularly beneficial for customers with high-volume inference requirements.

When optimizing inference throughput and cost, consider adjusting batch size, input/output sequence length, and quantization level, because these parameters can have a substantial impact. Experiment with different configurations to find the optimal balance between performance and cost for your specific use case.
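
Since the throughput figures above were measured with the vLLM framework, here is a minimal serving sketch under assumed settings (model name, tensor parallelism of 8 to match the instance's GPU count, and an illustrative context length); it is a starting point for the batch-size and sequence-length experiments described above, not a tuned configuration.

```python
# Minimal vLLM serving sketch on an 8-GPU instance; all settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,   # spread the model across the 8 GPUs of a p5e.48xlarge
    max_model_len=8192,       # bound sequence length to control KV-cache memory
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the benefits of higher GPU memory bandwidth."], params)
print(outputs[0].outputs[0].text)
```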

In summary, the combination of higher memory bandwidth, increased GPU memory capacity, and support for larger batch sizes make the P5e instances an excellent choice for customers deploying LLM inference workloads. These instances can deliver significant performance improvements, cost savings, and operational simplicity compared to alternative options.

P5e instances are also well-suited for memory-intensive HPC applications like simulations, pharmaceutical discovery, seismic analysis, weather forecasting, and financial modeling. Customers using dynamic programming (DP) algorithms for applications like genome sequencing or accelerated data analytics can also see further benefit from P5e through support for the DPX instruction set.

Get started with P5e instances
When launching P5e instances, you can use AWS Deep Learning AMIs (DLAMI), which support these instances. DLAMI provides ML practitioners and researchers with the infrastructure and tools to quickly build scalable, secure, distributed ML applications in preconfigured environments. You can run containerized applications on P5e instances with AWS Deep Learning Containers using libraries for Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).

P5e instances now available
EC2 P5e instances are now available in the US East (Ohio) AWS Region in the p5e.48xlarge sizes through Amazon EC2 Capacity Blocks for ML. For more information, refer to Amazon EC2 P5 Instances.

###
https://github.com/Ucas-HaoranWei/GOT-OCR2.0
9/3/24
OCR-2.0 is coming, and Generative AI and multimodal LLMs will power it! 🔍 GOT (General OCR Theory) is a 580M end-to-end OCR-2.0 model that outperforms all existing methods.
GOT consists of a vision encoder that converts images into tokens and a decoder that generates OCR outputs in various formats (e.g., plain text, markdown, Mathpix). GOT is designed to handle complex inputs like sheet music, formulas, and geometric shapes.
Implementation
1️⃣ Vision Encoder Pre-training: The encoder (VitDet) is trained using scene text and document OCR data to recognize both slice and whole-page inputs.
2️⃣ Joint-Training: The encoder is connected to the decoder (Qwen-0.5B) and both are trained on more general OCR tasks (e.g., formulas, sheet music, geometry).
3️⃣ Post-Training: Fine-tuned the model with specific tasks, such as fine-grained OCR, multi-page PDFs, and dynamic resolution, using new synthetic datasets.
Insights
🧠 Encoder-Decoder with 80M (VitDet) Encoder and 500M (Qwen2) Decoder with 8k context
🥇 Achieves a 0.035 edit distance and 0.972 BLEU score on plain OCR
📊 Outperforms LLaVA-NeXT and Qwen-VL-Max in document and scene text OCR
🧮 Can extract LaTeX formulas from Arxiv and convert them to Mathpix format
📃 Supports dynamic resolution and multi-page OCR
🖼️ Input resolutions up to 1024x1024

###
https://huggingface.co/upstage/solar-pro-preview-instruct
Upstage, 9/10/24
Solar Pro Preview: The most intelligent LLM on a single GPU
Summary
We introduce Solar Pro Preview, an advanced large language model (LLM) with 22 billion parameters designed to fit into a single GPU. Solar Pro Preview shows superior performance compared to LLMs with less than 30 billion parameters and delivers performance comparable to models over three times its size, such as Llama 3.1 with 70 billion parameters.

Solar Pro Preview is developed using an enhanced version of our previous depth up-scaling method, which scales a Phi-3-medium model with 14 billion parameters to 22 billion parameters, intended to run on a GPU with 80GB of VRAM. Our carefully curated training strategy and dataset have significantly enhanced performance from Phi-3-medium, particularly on the MMLU-Pro and IFEval benchmarks, both respected for evaluating a model’s knowledge and instruction-following abilities.

Solar Pro Preview is a pre-release version of the official Solar Pro, with limitations on language coverage and a maximum context length of 4K. However, we believe Solar Pro Preview not only stands out as a highly efficient and capable model, but also has the potential to be further extended to cover more languages and capabilities. The official version of Solar Pro will be released in November 2024 with expanded language support beyond English and longer context windows. To stay informed about the latest updates, please sign up for our mailing list. If you have any feedback or questions about the model, please visit our model discussion board and connect with us directly.