NVIDIA introduced Mistral-NeMo-Minitron 8B, a high-performance small language model, and AI21 Labs unveiled the Jamba 1.5 models, which specialize in long-context processing. Jina AI presented a new processing method for long-context embedding models, and Meta announced Sapiens, a new model family for human-centric vision tasks. LinkedIn and Neural Magic each released new tools for maximizing the efficiency of AI models, and KB Financial Group warned about the problem of 'AI washing', promoting AI capabilities that are overstated or not actually used.

NVIDIA, Mistral-NeMo-Minitron 8B: Small Language Model Released

Link, August 21, 2024

  • NVIDIA released Mistral-NeMo-Minitron 8B, a miniaturized version of the Mistral NeMo 12B model
  • The model maintains high accuracy while running at a lower compute cost
  • Pruning and distillation were combined to shrink the model while preserving its accuracy
  • It claims top performance among small language models and can be used across a variety of AI tasks
  • The model runs in real time on NVIDIA RTX workstations and can also be downloaded from Hugging Face
  • It posts strong results on nine major benchmarks, demonstrating language understanding, common-sense reasoning, mathematical reasoning, summarization, coding, and truthful answer generation
  • Through the NVIDIA AI Foundry platform, it can be downsized further into versions suited to embedded devices such as smartphones and robots

AI21 Labs, Jamba 1.5 Models Released

Link, August 22, 2024

  • AI21 Labs announced the new Jamba 1.5 model series (Mini and Large)
  • The Jamba 1.5 models use an SSM-Transformer architecture to maximize long-context handling, speed, and efficiency
  • They support a 256K context window, making them well suited to long-document summarization and analysis as well as agentic and RAG (retrieval-augmented generation) workflows
  • On the Arena Hard benchmark, Jamba 1.5 Mini (46.1) and Jamba 1.5 Large (65.4) lead their respective size classes
  • The models are multilingual and provide structured JSON output, function calling, document-object ingestion, and citation generation
  • They are available immediately on AI21 Studio, Google Cloud Vertex AI, Microsoft Azure, and NVIDIA NIM, with Amazon Bedrock, Databricks Marketplace, and Snowflake Cortex coming soon
  • A new quantization technique called ExpertsInt8 reduces memory usage for MoE (Mixture of Experts) models and lets Jamba 1.5 Large run on a single 8-GPU node

Jina AI, 'Late Chunking' Technique for Long-Context Embedding Models

Link, August 23, 2024

  • Jina AI announced 'Late Chunking', a new processing method for embedding models that handle long contexts
  • The method processes long document context effectively, improving retrieval quality in RAG (retrieval-augmented generation) systems
  • Instead of splitting a document up front, Late Chunking first runs the embedding model's transformer layers over the entire document, then generates chunk embeddings that reflect context from the whole text
  • The technique preserves contextual information in long documents more effectively and shows strong results on BEIR benchmarks
  • Compared with conventional chunking, Late Chunking noticeably improves retrieval accuracy, with larger gains on longer documents

Meta, 'Sapiens' Models for Human-Centric Vision Tasks

Link, August 23, 2024

  • Meta announced Sapiens, a family of models for human-centric vision tasks
  • The models cover four core tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction
  • They natively support inference at 1K resolution and are easy to fine-tune for individual tasks
  • Performance is boosted substantially by self-supervised pretraining on a dataset of more than 300 million human images
  • Sapiens surpasses prior state-of-the-art results across a range of human-centric benchmarks
  • Reported gains include 7.6 mAP on Humans-5K (pose), 17.1 mIoU on Humans-2K (part segmentation), 22.4% relative RMSE on Hi4D (depth), and 53.5% relative angular error on THuman2 (normals)

LinkedIn, Liger-Kernel Released

Link, August 23, 2024

  • LinkedIn's LLM research team released Liger-Kernel, a set of efficient GPU kernels for multi-GPU fine-tuning
  • The kernels deliver roughly 20% higher training throughput and 60% lower memory usage in multi-GPU setups
  • They are compatible with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed, and applicable to a range of LLMs
  • Hugging Face-compatible RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy implementations are provided
  • The Triton kernels compute exact results (no approximations) and enable longer context lengths and larger batch sizes
  • The kernels can be applied with a single line of code, and the project is open source with community-driven development

Neural Magic, LLM Compressor Framework Released

Link, August 14, 2024

  • Neural Magic released LLM Compressor, an open-source framework for compressing large language models (LLMs) to speed up inference
  • The framework offers a range of quantization techniques and sparsity options, maximizing flexibility for LLMs
  • The Neural Magic team has used the tool to compress a variety of models, including Llama 3.1 405B
  • Activation quantization exploits INT8 or FP8 tensor cores for up to 3x faster server/throughput deployments
  • LLM Compressor integrates directly with vLLM and works out of the box, supporting optimized model creation and deployment for AI researchers and enterprises

KB Financial Group, Warning on AI Washing

Link, August 19, 2024

  • KB Financial Group warned about the risks of 'AI washing': promoting companies or products as if they use AI when they actually do not
  • AI washing can erode consumer trust, waste investment resources, and create inflated expectations
  • Cited cases include Amazon's cashierless 'Amazon Go' stores and the recruiting startup 'Joonko'
  • The report stresses the need for stronger regulation of AI washing, along with transparent use of the technology and honest disclosure
  • Consumers and investors are advised to view AI claims critically and to demand transparent explanations

Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item in detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary with good detail; in the Summary section, explain the details starting with the company name, e.g., OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary 1 (concise bullet style)
  • detailed summary 2 (concise bullet style)
  • detailed summary N (concise bullet style)

company name, Title

링크, date

  • detailed summary 1 (concise bullet style)
  • detailed summary 2 (concise bullet style)
  • detailed summary N (concise bullet style)
###
https://blogs.nvidia.com/blog/mistral-nemo-minitron-8b-small-language-model/
Lightweight Champ: NVIDIA Releases Small Language Model With State-of-the-Art Accuracy
Mistral-NeMo-Minitron 8B is a miniaturized version of the recently released Mistral NeMo 12B model, delivering high accuracy combined with the compute efficiency to run the model across GPU-accelerated data centers, clouds and workstations.
August 21, 2024 by Kari Briski

Developers of generative AI typically face a tradeoff between model size and accuracy. But a new language model released by NVIDIA delivers the best of both, providing state-of-the-art accuracy in a compact form factor.

Mistral-NeMo-Minitron 8B — a miniaturized version of the open Mistral NeMo 12B model released by Mistral AI and NVIDIA last month — is small enough to run on an NVIDIA RTX-powered workstation while still excelling across multiple benchmarks for AI-powered chatbots, virtual assistants, content generators and educational tools. Minitron models are distilled by NVIDIA using NVIDIA NeMo, an end-to-end platform for developing custom generative AI.

“We combined two different AI optimization methods — pruning to shrink Mistral NeMo’s 12 billion parameters into 8 billion, and distillation to improve accuracy,” said Bryan Catanzaro, vice president of applied deep learning research at NVIDIA. “By doing so, Mistral-NeMo-Minitron 8B delivers comparable accuracy to the original model at lower computational cost.”

Unlike their larger counterparts, small language models can run in real time on workstations and laptops. This makes it easier for organizations with limited resources to deploy generative AI capabilities across their infrastructure while optimizing for cost, operational efficiency and energy use. Running language models locally on edge devices also delivers security benefits, since data doesn’t need to be passed to a server from an edge device.

Developers can get started with Mistral-NeMo-Minitron 8B packaged as an NVIDIA NIM microservice with a standard application programming interface (API) — or they can download the model from Hugging Face. A downloadable NVIDIA NIM, which can be deployed on any GPU-accelerated system in minutes, will be available soon.

State-of-the-Art for 8 Billion Parameters
For a model of its size, Mistral-NeMo-Minitron 8B leads on nine popular benchmarks for language models. These benchmarks cover a variety of tasks including language understanding, common sense reasoning, mathematical reasoning, summarization, coding and ability to generate truthful answers.

Packaged as an NVIDIA NIM microservice, the model is optimized for low latency, which means faster responses for users, and high throughput, which corresponds to higher computational efficiency in production.

In some cases, developers may want an even smaller version of the model to run on a smartphone or an embedded device like a robot. To do so, they can download the 8-billion-parameter model and, using NVIDIA AI Foundry, prune and distill it into a smaller, optimized neural network customized for enterprise-specific applications.

The AI Foundry platform and service offers developers a full-stack solution for creating a customized foundation model packaged as a NIM microservice. It includes popular foundation models, the NVIDIA NeMo platform and dedicated capacity on NVIDIA DGX Cloud. Developers using NVIDIA AI Foundry can also access NVIDIA AI Enterprise, a software platform that provides security, stability and support for production deployments.

Since the original Mistral-NeMo-Minitron 8B model starts with a baseline of state-of-the-art accuracy, versions downsized using AI Foundry would still offer users high accuracy with a fraction of the training data and compute infrastructure.

Harnessing the Perks of Pruning and Distillation
To achieve high accuracy with a smaller model, the team used a process that combines pruning and distillation. Pruning downsizes a neural network by removing model weights that contribute the least to accuracy. During distillation, the team retrained this pruned model on a small dataset to significantly boost accuracy, which had decreased through the pruning process.

The end result is a smaller, more efficient model with the predictive accuracy of its larger counterpart.

This technique means that a fraction of the original dataset is required to train each additional model within a family of related models, saving up to 40x the compute cost when pruning and distilling a larger model compared to training a smaller model from scratch.
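For intuition, here is a minimal PyTorch sketch of the distillation step described above, with the pruned 8B model as the student and the original 12B model as the teacher. The actual Minitron recipe (iterative width pruning, its specific losses, temperatures, and data mix) is more involved, so treat the function names and hyperparameters below as illustrative assumptions rather than NVIDIA's implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions,
    # scaled by T^2 as in standard knowledge distillation.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def distill_step(student, teacher, batch, optimizer):
    # One retraining step: the pruned student learns to match the teacher's outputs.
    # `student` and `teacher` are assumed to be causal LMs that return .logits.
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()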

Read the NVIDIA Technical Blog and a technical report for details.

NVIDIA also announced this week Nemotron-Mini-4B-Instruct, another small language model optimized for low memory usage and faster response times on NVIDIA GeForce RTX AI PCs and laptops. The model is available as an NVIDIA NIM microservice for cloud and on-device deployment and is part of NVIDIA ACE, a suite of digital human technologies that provide speech, intelligence and animation powered by generative AI.

Experience both models as NIM microservices from a browser or an API at ai.nvidia.com.


Nvidia just dropped Mistral NeMo Minitron 8B - Distilled + pruned from 12B, commercially permissive license, and beats the teacher (12B) on multiple benchmarks!
> Achieves similar benchmarks as Mistral NeMo 12B, beats Llama 3.1 8B
> MMLU - L3.1 8B (65), NeMo Minitron 8B (69.5), NeMo 12B (69)
> HumanEval - L3.1 8B (24.75), NeMo Minitron 8B (36.2), NeMo 12B (23.7)
> Trained on 380B tokens
> Iterative pruning and distillation
> Width-only pruning - pruned both embedding + MLP hidden representations
> Pruned MLP intermediate dimension from 14336 to 11520 and hidden size from 5120 to 4096
> Retain the attention head count and number of layers
> Works out of the box with Transformers 🤗
This is a solid base model for further fine-tuning, and task specific use-cases, pretty much the strongest <10B range - which makes it easy to deploy across variety of cheaper GPUs.
8B => bf16 (16 GB), fp8/ 8 bit (8 GB), 4 bit (4GB)
Perfect to deploy on a L4 :D
Kudos to Nvidia for releasing their research and the model weights! 🤗


###
https://huggingface.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251
AI21 releases new Jamba 1.5 models: a new standard for long-context use cases!
AI21 Labs used a different architecture to beat the status-quo Transformers models: Jamba architecture combines classic Transformers layers with the new Mamba layers, for which the complexity is a linear (instead of quadratic) function of the context length.
What does this imply?
The Jamba 1.5 Open Model Family: The Most Powerful and Efficient Long Context Models
August 22, 2024
The new family of open models from AI21, offering unrivaled speed, efficiency, and quality and the longest context window among open models.

Today, we are debuting the Jamba 1.5 family of open models: Jamba 1.5 Mini and Jamba 1.5 Large. Built on our novel SSM-Transformer architecture, these models demonstrate superior long context handling, speed, and quality—outranking competitors in their size class and marking the first time a non-Transformer model has been successfully scaled to the quality and strength of the market’s leading models.

We are releasing these models under the Jamba Open Model License, upholding our commitment to democratizing access to quality models and opening the door to further experimentation.

Today’s language models are impressive in their capabilities—but too often fail to deliver real value for businesses.

At AI21, we are on a mission to change this by designing AI systems that are purpose-built for the enterprise. These models are built keeping in mind the key measures large businesses care most about when it comes to GenAI implementation: resource efficiency, quality, speed, and ability to actually solve critical tasks.


Long context handling: With a 256K effective context window, the longest in the market, Jamba 1.5 models can improve the quality of key enterprise applications, such as lengthy document summarization and analysis, as well as agentic and RAG workflows
Speed: Up to 2.5X faster on long contexts and fastest across all context lengths in their size class
Quality: Jamba 1.5 Mini is the strongest open model in its size class with a score of 46.1 on the Arena Hard benchmark, surpassing larger models like Mixtral 8x22B and Command-R+. Jamba 1.5 Large, with a score of 65.4, outpaces both Llama 3.1 70B and 405B
Multilingual: In addition to English, the models support Spanish, French, Portuguese, Italian, Dutch, German, Arabic and Hebrew
Developer ready: Jamba natively supports structured JSON output, function calling, digesting document objects, and generating citations
Open for builders: Both models are available for immediate download on Hugging Face (and coming soon to leading frameworks LangChain and LlamaIndex)
Deploy anywhere: In addition to AI21 Studio, the models are available on cloud partners Google Cloud Vertex AI, Microsoft Azure, and NVIDIA NIM and coming soon to Amazon Bedrock, Databricks Marketplace, Snowflake Cortex, Together.AI as well as for private on-prem and VPC deployment
Resource-efficient hybrid architecture
Jamba 1.5 Large and Mini are built on the novel SSM-Transformer Jamba architecture, which weaves together Transformer’s outstanding quality with Mamba’s groundbreaking efficiency.

As a result, the models offer a lower memory footprint than competitors, allowing clients to handle context lengths up to 140K tokens on a single GPU using Jamba 1.5 Mini. The same advantage also makes fine-tuning over long contexts easier and more accessible than with transformer-based models. Thanks to this efficiency-optimized architecture, our models can deliver top quality and speed without skyrocketing costs.

Like all models in its size class, Jamba 1.5 Large can’t be loaded in full (FP32) or half (FP16/BF16) precision on a single node of 8 GPUs. Dissatisfied with currently available quantization techniques, we developed ExpertsInt8, a novel quantization technique tailored for MoE models.

With ExpertsInt8, we only quantize weights that are parts of the MoE (or MLP) layers, which for many MoE models account for over 85% of the model weights. In our implementation, we quantize and save these weights in INT8, an 8-bit precision format, and dequantize them at runtime directly inside the MoE GPU kernel.

This technique offers four advantages: It is fast, with quantization taking up to just a few minutes; it does not rely on calibration, a sometimes unstable process which ordinarily can take hours or days; it can still use BF16 to hold large activations; and, importantly, it allows Jamba 1.5 Large to fit on a single 8 GPU node, while utilizing its full context length of 256K. In our experiments, ExpertsInt8 proved to have the lowest latency of all vLLM quantization techniques for MoE models, without a loss in quality.
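Stripped of the fused GPU kernel, the core mechanic is symmetric INT8 weight quantization with dequantization at matmul time. The PyTorch sketch below illustrates only that idea; the real ExpertsInt8 applies it to the expert/MLP weights and dequantizes inside the MoE kernel with BF16 activations, and the per-channel scheme and shapes here are illustrative assumptions.

import torch

def quantize_int8(weight):
    # Symmetric per-output-channel quantization: the max magnitude in each row maps to 127.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale, dtype=torch.float32):
    # Recover an approximate full-precision weight at runtime, as the kernel would
    # (in production the recovered weights and activations would stay in BF16).
    return (q.to(torch.float32) * scale).to(dtype)

# A hypothetical expert MLP weight stored in INT8 (about 4x smaller than FP32)
w = torch.randn(11008, 4096)
q, s = quantize_int8(w)
x = torch.randn(1, 4096)
y = x @ dequantize_int8(q, s).t()   # dequantize just before the matmul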



Long context that actually delivers
The 256K context window offered by the Jamba 1.5 models is not only the longest amongst open models, but also the only one to back this claim on the RULER benchmark.


Whereas most other models claim a long context window but fail to sustain the same quality of performance at the upper limits of their context window, the Jamba 1.5 family maintains its long context handling throughout the entire span of its 256K context window.

A model that can effectively handle long context is crucial for almost every enterprise scale GenAI application. In addition to thoroughly and precisely summarizing and analyzing lengthy documents, a long context model substantially improves the quality of RAG and agentic workflows—and reduces their cost—by eliminating the need for continuous chunking and repetitive retrievals.

While it’s sometimes claimed that RAG is a substitute for long context, a successful enterprise AI system needs both. In pairing long context and RAG, the long context model improves the quality and cost-efficiency of RAG’s retrieval stage at scale.


Fastest on the market
For the use cases enterprises are interested in, such as customer support agent assistants and chatbots, rapid turnaround is essential. The model needs to be able to keep pace with the scale of operations, even as usage requests and batch sizes increase.

Both Jamba 1.5 models are faster than competitors of a similar size, with up to 2.5X faster inference on long contexts, offering customers major cost, quality, and speed gains under high utilization when deployed in their own environment.


The Jamba 1.5 Mini comparisons were done over 2xA100 80GB GPUs and the Jamba 1.5 Large comparisons were done over 8xA100 80GB GPUs. The test was performed on vLLM, with batch_size=1, output_tokens=512, input_tokens=(context_length-512)
Jamba 1.5 Mini and Jamba 1.5 Large show excellent speed and throughput results in tests run by Artificial Analysis, as can be seen in the chart below, with Jamba 1.5 Mini ranking as the fastest model on 10K contexts.


Output tokens per second on 10K contexts, as independently tested by Artificial Analysis.
Outstanding quality across the board
As measured on the Arena Hard benchmark, Jamba 1.5 Mini emerges as the strongest model in its size class, outshining competitors Claude 3 Haiku, Mixtral 8x22B and Command-R+. Jamba 1.5 Large similarly rises above leading models like Claude 3 Opus, Llama 3.1 70B, and Llama 3.1 405B, offering excellent value per cost for its size class.



Getting started
Build with Jamba 1.5 Mini or Jamba 1.5 Large wherever you like to work. The models are available on the following platforms and cloud partners:


AI21 Studio
Google Cloud Vertex AI
Hugging Face
Microsoft Azure
NVIDIA NIM
And coming soon to Amazon Bedrock, Databricks Marketplace, LangChain, LlamaIndex, Snowflake Cortex, and Together.AI.

For customers who wish to avoid a lengthy experimentation process and keep their data onsite, we offer private deployments and custom models. In this white-glove service, we tailor our models exactly to your needs and use case through continuous pre-training and fine-tuning so you can move more quickly from ideation to production.

###
https://jina.ai/news/late-chunking-in-long-context-embedding-models/
August 23, 2024
Late Chunking in Long-Context Embedding Models
Chunking long documents while preserving contextual information is challenging. We introduce the "Late Chunking" that leverages long-context embedding models to generate contextual chunk embeddings for better retrieval applications.
[Figure: diagram of the 'Late Chunking' and long-document embedding processes.]
Michael Günther, Han Xiao • 8 minute read
About a year ago, in October 2023, we released the world's first open-source embedding model with an 8K context length, jina-embeddings-v2-base-en. Since then, there has been quite some debate about the usefulness of long-context in embedding models. For many applications, encoding a document thousands of words long into a single embedding representation is not ideal. Many use cases require retrieving smaller portions of the text, and dense vector-based retrieval systems often perform better with smaller text segments, as the semantics are less likely to be "over-compressed" in the embedding vectors.

Retrieval-Augmented Generation (RAG) is one of the most well-known applications that requires splitting documents into smaller text chunks (say within 512 tokens). These chunks are usually stored in a vector database, with vector representations generated by a text embedding model. During runtime, the same embedding model encodes a query into a vector representation, which is then used to identify relevant stored text chunks. These chunks are subsequently passed to a large language model (LLM), which synthesizes a response to the query based on the retrieved texts.
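For reference, the retrieval step of that pipeline boils down to a nearest-neighbor search over stored chunk embeddings. A minimal NumPy sketch is below; a real system would use a vector database and the same embedding model for both queries and chunks, so the toy vectors here are placeholders.

import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    # Rank stored chunk embeddings by cosine similarity to the query embedding.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Chunk embeddings would normally come from an embedding model and a vector store.
chunk_vecs = np.random.randn(1000, 768)
query_vec = np.random.randn(768)
idx, scores = top_k_chunks(query_vec, chunk_vecs)
# The texts of the chunks in `idx` are then passed to the LLM as context.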

[Figure: A typical RAG pipeline of chunking-embedding-retrieving-generating.]
In short, embedding smaller chunks seems to be more preferable, partly due to the limited input sizes of downstream LLMs, but also because there’s a concern that important contextual information in a long context may get diluted when compressed into a single vector.

But if the industry only ever needs embedding models with a 512-context length, what’s the point of training models with an 8192-context length at all?

In this article, we revisit this important, albeit uncomfortable, question by exploring the limitations of the naive chunking-embedding pipeline in RAG. We introduce a new approach called "Late Chunking," which leverages the rich contextual information provided by 8192-length embedding models to more effectively embed chunks.

The Lost Context Problem
The simple RAG pipeline of chunking-embedding-retrieving-generating is not without its challenges. Specifically, this process can destroy long-distance contextual dependencies. In other words, when relevant information is spread across multiple chunks, taking text segments out of context can render them ineffective, making this approach particularly problematic.

In the image below, a Wikipedia article is split into chunks of sentences. You can see that phrases like "its" and "the city" reference "Berlin," which is mentioned only in the first sentence. This makes it harder for the embedding model to link these references to the correct entity, thereby producing a lower-quality vector representation.

[Figure: Berlin's Wikipedia article shown alongside its sentence-level chunks.]
This means, if we split a long article into sentence-length chunks, as in the example above, a RAG system might struggle to answer a query like "What is the population of Berlin?" Because the city name and the population never appear together in a single chunk, and without any larger document context, an LLM presented with one of these chunks cannot resolve anaphoric references like "it" or "the city."

There are some heuristics to alleviate this issue, such as resampling with a sliding window, using multiple context window lengths, and performing multi-pass document scans. However, like all heuristics, these approaches are hit-or-miss; they may work in some cases, but there's no theoretical guarantee of their effectiveness.
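One of those heuristics, resampling with a sliding window, fits in a few lines. The sketch below is only meant to make the idea concrete; the window and overlap sizes are arbitrary choices, not values from the article.

def overlapping_chunks(tokens, size=512, overlap=64):
    # Split a token list into fixed-size windows that share `overlap` tokens
    # with the previous window, so boundary context is partially duplicated.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]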

The Solution: Late Chunking
The naive encoding approach (as seen on the left side of the image below) involves using sentences, paragraphs, or maximum length limits to split the text a priori. Afterward, an embedding model is repetitively applied to these resulting chunks. To generate a single embedding for each chunk, many embedding models use mean pooling on these token-level embeddings to output a single embedding vector.

[Figure: An illustration of the naive chunking strategy (left) and the late chunking strategy (right).]
In contrast, the "Late Chunking" approach we propose in this article first applies the transformer layer of the embedding model to the entire text or as much of it as possible. This generates a sequence of vector representations for each token that encompasses textual information from the entire text. Subsequently, mean pooling is applied to each chunk of this sequence of token vectors, yielding embeddings for each chunk that consider the entire text's context. Unlike the naive encoding approach, which generates independent and identically distributed (i.i.d.) chunk embeddings, late chunking creates a set of chunk embeddings where each one is "conditioned on" the previous ones, thereby encoding more contextual information for each chunk.

Obviously to effectively apply late chunking, we need long-context embedding models like jina-embeddings-v2-base-en, which support up to 8192 tokens—roughly ten standard pages of text. Text segments of this size are much less likely to have contextual dependencies that require an even longer context to resolve.

It's important to highlight that late chunking still requires boundary cues, but these cues are used only after obtaining the token-level embeddings—hence the term "late" in its naming.

Naive chunking vs. late chunking:
  • Need for boundary cues: yes for both
  • Use of boundary cues: naive applies them directly in preprocessing; late applies them after getting the token-level embeddings from the transformer layer
  • Resulting chunk embeddings: i.i.d. for naive; conditional for late
  • Contextual information of nearby chunks: lost with naive chunking (heuristics like overlap sampling only partially help); well preserved by long-context embedding models with late chunking
Implementation and Qualitative Evaluation
Google Colab


The implementation of late chunking can be found in the Google Colab linked above. Here, we utilize our recent feature release in the Tokenizer API, which leverages all possible boundary cues to segment a long document into meaningful chunks. More discussion on the algorithm behind this feature can be found on X.
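A stripped-down version of the idea looks roughly like the sketch below (it is not the notebook itself). It assumes a Hugging Face encoder that exposes last_hidden_state and takes token-level chunk boundaries as given; in practice those boundaries would come from a segmenter such as the Tokenizer API above, and the model choice and span handling here are assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-small-en"  # any long-context embedding model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def late_chunking_embeddings(text, chunk_spans, max_length=8192):
    # chunk_spans: list of (start_token, end_token) boundaries within the tokenized text.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # Pool per chunk only AFTER the whole document has been encoded together,
    # so each chunk embedding is conditioned on the full document context.
    return [token_embeddings[start:end].mean(dim=0) for start, end in chunk_spans]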

Tokenizer API
Free API to tokenize text and segment long text into chunks.



When applying late chunking to the Wikipedia example above, you can immediately see an improvement in semantic similarity. For instance, in the case of "the city" and "Berlin" within a Wikipedia article, the vectors representing "the city" now contain information linking it to the previous mention of "Berlin," making it a much better match for queries involving that city name.

Cosine similarity between the query "Berlin" and each chunk:
  • "Berlin is the capital and largest city of Germany, both by area and by population." : 0.849 (naive chunking), 0.850 (late chunking)
  • "Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits." : 0.708 (naive), 0.825 (late)
  • "The city is also one of the states of Germany, and is the third smallest state in the country in terms of area." : 0.753 (naive), 0.850 (late)
These numbers compare the embedding of the term "Berlin" to various sentences from the article about Berlin using cosine similarity: the first value in each row uses a priori (naive) chunk embeddings, while the second uses the late chunking method.

Quantitative Evaluation on BEIR
To verify the effectiveness of late chunking beyond a toy example, we tested it using some of the retrieval benchmarks from BeIR. These retrieval tasks consist of a query set, a corpus of text documents, and a QRels file that stores information about the IDs of documents relevant to each query.

To identify the relevant documents for a query, the documents are chunked, encoded into an embedding index, and the most similar chunks are determined for each query embedding using k-nearest neighbors (kNN). Since each chunk corresponds to a document, the kNN ranking of chunks can be converted into a kNN ranking of documents (retaining only the first occurrence for documents appearing multiple times in the ranking). This resulting ranking is then compared to the ranking provided by the ground-truth QRels file, and retrieval metrics like nDCG@10 are calculated. This procedure is depicted below, and the evaluation script can be found in this repository for reproducibility.
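The chunk-to-document step is simple enough to spell out. The sketch below only illustrates the deduplication rule described above; the repository linked below contains the full evaluation script.

def chunk_ranking_to_doc_ranking(ranked_chunk_ids, chunk_to_doc):
    # Collapse a ranked list of chunk IDs into a ranked list of document IDs,
    # keeping only the first (highest-ranked) occurrence of each document.
    seen, doc_ranking = set(), []
    for chunk_id in ranked_chunk_ids:
        doc_id = chunk_to_doc[chunk_id]
        if doc_id not in seen:
            seen.add(doc_id)
            doc_ranking.append(doc_id)
    return doc_ranking

# Example: chunks 7 and 2 come from the same document, so "d1" is ranked once.
print(chunk_ranking_to_doc_ranking([7, 2, 5], {7: "d1", 2: "d1", 5: "d3"}))  # ['d1', 'd3']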

GitHub: jina-ai/late-chunking (code for explaining and evaluating late chunking, i.e. chunked pooling)
We ran this evaluation on various BeIR datasets, comparing naive chunking with our late chunking method. For getting the boundary cues, we used a regex that splits the texts into strings of roughly 256 tokens. Both the naive and late chunking evaluation used jina-embeddings-v2-small-en as the embedding model; a smaller version of the v2-base-en model that still supports up to 8192-token length. Results can be found in the table below.

Dataset      Avg. doc length (chars)   Naive chunking (nDCG@10)   Late chunking (nDCG@10)   No chunking (nDCG@10)
SciFact      1498.4                    64.20%                     66.10%                    63.89%
TRECCOVID    1116.7                    63.36%                     64.70%                    65.18%
FiQA2018     767.2                     33.25%                     33.84%                    33.43%
NFCorpus     1589.8                    23.46%                     29.98%                    30.40%
Quora        62.2                      87.19%                     87.19%                    87.19%
In all cases, late chunking improved the scores compared to the naive approach. In some instances, it also outperformed encoding the entire document into a single embedding, while in other datasets, not chunking at all yielded the best results (Of course, no chunking only makes sense if there is no need to rank chunks, which is rare in practice). If we plot the performance gap between the naive approach and late chunking against document length, it becomes evident that the average length of the documents correlates with greater improvements in nDCG scores through late chunking. In other words, the longer the document, the more effective the late chunking strategy becomes.

[Figure: Late chunking's improvement over naive chunking is correlated with the average document length.]
Conclusion
In this article, we introduced a simple approach called "late chunking" to embed short chunks by leveraging the power of long-context embedding models. We demonstrated how traditional i.i.d. chunk embedding fails to preserve contextual information, leading to suboptimal retrieval; and how late chunking offers a simple yet highly effective solution to maintain and condition contextual information within each chunk. The effectiveness of late chunking becomes increasingly significant on longer documents—a capability made possible only by advanced long-context embedding models like jina-embeddings-v2-base-en. We hope this work not only validates the importance of long-context embedding models but also inspires further research on this topic.


###
https://huggingface.co/OpenMeditron

🐝⛑️It’s here…meet #Meditron3
…and an opportunity to join the upcoming publication as a clinical expert by critically evaluating Meditron 🤖🤺 in the #MOOVE (a Massive Open Online Validation and Evaluation platform)
Meditron is the latest update to our suite of state-of-the-art open Large Medical Language Models trained from #Llama3point1 [8B] and [70B]
---
💯#OPEN: weights released here 🔗
https://lnkd.in/eF8q9A94

🤝#CODESIGNED: from day one with an interdisciplinary team of practicing clinicians from around the world, global humanitarian organizations, AI ethicists, and data scientists
📊#STATEOFTHEART: Out-performing GPT4 on the standard medical benchmarks (*see footnote)
🌍#REPRESENTATIVE: Continually pre-trained on an enormous corpus of expert-curated open-access medical literature and global clinical practice guidelines that is specifically inclusive of low-resource and humanitarian settings


###
https://huggingface.co/papers/2408.12569
Meta presents Sapiens
Foundation for Human Vision Models

We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.
Sapiens: Foundation for Human Vision Models
Published on Aug 23, submitted by akhaliq (#1 Paper of the day)
Authors: Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito

###
https://github.com/linkedin/Liger-Kernel
Big Update! 20% higher throughput and 60% memory reduction for multi-GPU fine-tuning with Hugging Face Transformers! 🤯 The LLM research team from LinkedIn released new efficient GPU Kernels (Liger Kernels) to speed up and reduce memory when fine-tuning LLMs! 🚀

TL;DR:

🚀 Boost multi-GPU training throughput by 20% and slash memory usage by 60%

💻 Works with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed

🧠 Supported models include Llama 3, Mistral, Mixtral, Gemma 2

🔧 Adds Hugging Face Compatible RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy

🧮 Exact computations with no approximations, longer context lengths, larger batch sizes

⚡️ Train Meta Llama 3 8b ~20% faster with over 40% memory reduction using 4 A100s with FSDP

🔌 Get started with pip install liger-kernel and add 1-line of code

🌟 Open-source and community-driven development

Github: https://lnkd.in/e4XKs4-F
[2024/8/31] CUDA MODE talk, Liger-Kernel: Real-world Triton kernel for LLM Training
[2024/8/23] Official release: check out our X post
Liger (Linkedin GPU Efficient Runtime) Kernel is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduces memory usage by 60%. We have implemented Hugging Face Compatible RMSNorm, RoPE, SwiGLU, CrossEntropy, FusedLinearCrossEntropy, and more to come. The kernel works out of the box with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed. We welcome contributions from the community to gather the best kernels for LLM training.

Supercharge Your Model with Liger Kernel

With one line of code, Liger Kernel can increase throughput by more than 20% and reduce memory usage by 60%, thereby enabling longer context lengths, larger batch sizes, and massive vocabularies.
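As a rough illustration of what that one line looks like in practice, here is a usage sketch based on the patching API shown in the project's examples at release time; the exact function names (for example apply_liger_kernel_to_llama) and the model name below are assumptions to verify against the current README.

import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Monkey-patch the Hugging Face Llama implementation with Liger's Triton kernels
# (fused RMSNorm, RoPE, SwiGLU, fused linear cross-entropy). Call this before
# instantiating the model so the patched modules are picked up.
apply_liger_kernel_to_llama()

model = transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# ...then train as usual with the Hugging Face Trainer, FSDP, or DeepSpeed.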

[Charts: speedup and memory reduction benchmarks]
Note:

Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
Hugging Face models start to OOM at a 4K context length, whereas Hugging Face + Liger Kernel scales up to 16K.
Examples
Basic
  • Hugging Face Trainer: Train LLaMA 3-8B ~20% faster with over 40% memory reduction on the Alpaca dataset using 4 A100s with FSDP (Lightning Studio: TBA)
  • Lightning Trainer: Increase throughput by 15% and reduce memory usage by 40% with LLaMA 3-8B on the MMLU dataset using 8 A100s with DeepSpeed ZeRO3 (Lightning Studio: TBA)
Advanced
  • Medusa Multi-head LLM (retraining phase): Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP (Lightning Studio: TBA)
Key Features
Ease of use: Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
Time and memory efficient: In the same spirit as Flash-Attn, but for layers like RMSNorm, RoPE, SwiGLU, and CrossEntropy! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with kernel fusion, in-place replacement, and chunking techniques.
Exact: Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
Lightweight: Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches!
Multi-GPU supported: Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
Target Audiences
Researchers: Looking to compose models using efficient and reliable kernels for frontier experiments.
ML Practitioners: Focused on maximizing GPU training efficiency with optimal, high-performance kernels.
Curious Novices: Eager to learn how to write reliable Triton kernels to enhance training efficiency.

###
https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/
📢 Introducing LLM Compressor: SOTA open-sourced framework for compressing LLMs (including Llamas)! 📢
Starting as an internal research project at Neural Magic, I could never imagine that nearly six years later, what I called neuralmagicML would evolve into the incredibly powerful and capable LLM Compressor framework. Thanks to the hard work of our incredible engineering team, this tool can compress LLMs of any size to remarkable levels while recovering full accuracy. (such as our recent Llama 3.1 405B results!)
We are thrilled to announce that we have donated this library and its innovative techniques to the vLLM community. By offering efficient, performant, and accurate solutions for large language models, we aim to empower researchers, hackers, and enterprises.
Why This Matters:
- Cutting-Edge Algorithms: Implement the latest techniques and best practices for top-tier model performance without extensive research.
- Superior Flexibility: Enjoy various compression techniques, quantization schemes, and sparsity options for a solution that fits your use cases.
- Community-Driven: This tool is open-sourced and seamlessly integrated with vLLM and Hugging Face to ensure compatibility and encourage future contributions.
We are excited to witness how the community will leverage this incredible tool! Dive deeper into LLM Compressor by exploring our blog and the repo:

Aug 14, 2024

Announcing LLM Compressor
We are excited to announce LLM Compressor, a unified library for creating compressed models for faster inference with vLLM. Neural Magic's research team has successfully utilized it to create our latest compressed models, including fully quantized and accurate versions of Llama 3.1, and with that, we are excited to open up the toolkit to the community with its first 0.1 release for general usage to compress your models!

[Figure: LLM Compressor architecture diagram.]
In recent months, the high-performance computing team at Neural Magic has brought performant inference for various quantization schemes to vLLM, including custom Marlin kernels for weight-only quantization and custom CUTLASS kernels for INT8 and FP8 activation quantization.

However, before today, creating quantized checkpoints required navigating a fragmented ecosystem of bespoke compression libraries such as AutoGPTQ, AutoAWQ, AutoFP8, etc. We built LLM Compressor from the ground up as a single library for applying the latest compression best practices, including GPTQ, SmoothQuant, SparseGPT, and RTN, with many more actively being added. It works natively with Hugging Face models for seamless ease of use in the open-source ecosystem, and vLLM supports directly loading checkpoints from LLM Compressor for accelerated inference.

Using LLM Compressor, you can create compressed, accurate versions of your models, including:

Activation and weight quantization for up to 3X faster server/throughput deployments. This includes FP8 models using RTN for NVIDIA's Ada Lovelace and Hopper GPUs, and INT8 models using SmoothQuant and GPTQ for Nvidia's Turing and Ampere GPUs.
Weight quantization for up to 4X faster latency with INT4 weight-only models using GPTQ for Nvidia's Ampere GPUs and newer.
Weight pruning for up to 1.5X faster general performance with 2:4, 50% sparse models utilizing SparseGPT for Nvidia's Ampere GPUs and newer.
Enabling Activation Quantization in vLLM
Thanks to LLM Compressor's flexibility, it enables a critical new feature: activation quantization.

The open-source compression ecosystem thus far has focused mainly on weight-only quantization, including AutoGPTQ and AutoAWQ. Weight-only quantization enables smaller models and faster latency, but with 16-bit activations, the compute runs through the same 16-bit tensor cores as the unquantized model. This leads to slower inference for compute-heavy workloads due to the penalty of dequantizing the weights. Activation quantization, where the inputs to each layer are quantized, combined with weight quantization, enables utilization of the faster INT8 or FP8 tensor cores for the matrix multiplies, doubling the performance for compute-bound inference.

Weight-only quantization often fails to deliver speed improvements in production serving deployments. These environments typically result in compute-bound workloads with minimal benefits from weight-only quantization. Activation quantization, however, offers a substantial performance boost in such high-compute scenarios and faster inference at lower queries per second (QPS). The chart below demonstrates a 1.6X speedup at 5 QPS for the INT8 weight and activation quantized model (w8a8) compared to the 16-bit baseline (w16a16), while the 4-bit weight quantized model (w4a16) shows little improvement.

Full replication instructions for the benchmark are available in the appendix.
Activation Quantization Performance in vLLM
Let’s take an example of a Llama 3.1 70B running in vLLM on a 4xA100 GPU setup to see if this analysis holds up!

We will compare the serving latency for three variants for Llama 3.1 70B

Unquantized FP16 (w16a16):
meta-llama/Meta-Llama-3.1-70B-Instruct
Weight and activation quantization to INT8 (w8a8):
neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
Weight-only quantization to INT4 (w4a16):
neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16
The chart below illustrates the average time to generate each new token (TPOT) across different server loads, measured in queries per second (QPS). Additionally, a deployment constraint of 5 seconds is set for the time to generate the first token (TTFT) to ensure the serving application maintains reasonable initial response times.

Full replication instructions for the benchmark are available in the appendix.
At low QPS, weight-only quantization offers improved latency relative to an unquantized model. However, as the server load increases and becomes compute-bound, the performance of the weight-only model levels off, matching the unquantized model. In contrast, the activation quantized model performs better under high load, supporting more queries per second before the system becomes overloaded and TTFT exceeds our limits for a responsive application.

For a 70B model on an A100 system, we see that the W8A8 model achieves similar latency performance with just 2 GPUs compared to the unquantized model running with 4, meaning similar latency guarantees with half the resources!

[Chart: Llama 3.1 70B time per output token, comparing w16a16 on 4 GPUs and w8a8 on 2 GPUs.]
Full replication instructions for the benchmark are available in the appendix.
Activation Quantization Accuracy
vLLM’s CUTLASS kernels for activation quantization offer flexible support for various schemes, allowing for a high degree of customization, including any combination of:

Per-tensor or per-channel quantization for weights
Per-tensor or per-token quantization for activations
Symmetric or asymmetric quantized activations (for int8).
Side note: We are doing a CUTLASS deep dive during our bi-weekly vLLM office hours on September 5, 2024. Sign up here.


This flexibility in vLLM, combined with LLM Compressor's advanced algorithms such as GPTQ and SmoothQuant, ensures that model accuracy is maintained even after quantization. The model card for neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 shows a negligible drop, using static per-channel weight scales and dynamic per-token activation scales, compared to the FP16 baseline on the Open LLM benchmarks:

[Table: Evaluation scores for the Meta Llama 3.1 70B model after optimization with LLM Compressor.]
This combination of fine-grained quantization and sophisticated algorithms enables users to achieve faster inference without compromising on the precision and reliability of their models.

Try LLM Compressor
The following snippet is a minimal example of quantizing meta-llama/Meta-Llama-3.1-8B-Instruct with INT8 weights and activations.

Install LLM Compressor via PyPi
LLM Compressor is available for installation via PyPI:

pip install llmcompressor
Apply Quantization with the LLM Compressor
Quantization is applied by selecting an algorithm and calling the oneshot API, which applies the selections in a post-training setting.

In this case, we apply SmoothQuant to make the activations easier to quantize and GPTQ to apply the weight and activation quantization. We apply these algorithms to all linear layers of the network using the built-in open_platypus dataset (note: see the examples for how to use your own calibration set).

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Select quantization algorithms. In this case, we:
# * apply SmoothQuant to make the activations easier to quantize
# * quantize the weights to int8 with GPTQ (static per channel)
# * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Apply quantization using the built-in open_platypus dataset.
oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
Inference Compressed Models with vLLM
The resulting model is ready to be loaded and run in vLLM out-of-the-box:

from vllm import LLM
model = LLM("./Meta-Llama-3.1-8B-Instruct-INT8")
output = model.generate("My name is")
print("Output:", output[0].outputs[0].text)
# Output: Jakob Schmid. I live in the Republic of South Moluccas
Under the hood, vLLM understands how to load and run the compressed model by looking at the config.yaml next to the weight files. Check out some of our more detailed examples to try out other quantization flows:

FP8 activation quantization with PTQ
INT8 activation quantization with GPTQ and SmoothQuant
INT4 weight-only quantization With GPTQ
LLM Compressor Roadmap
We have a robust roadmap planned to expand support for model compression in LLM Compressor. Our roadmap is prioritized across the following initiatives:

Expand model support: Mixture of Experts (MoE) and vision-language models
Expand algorithm and scheme support: AWQ, additional quantized floating point formats (fp8 and fp4), and KV cache quantization
Support for non-Nvidia hardware: We are actively collaborating with AMD, Google, and Intel teams to support models created by LLM Compressor on non-Nvidia hardware devices.
Tools for creating non-uniform quantization schemes
2:4 sparsity: Sparse foundation models, sparse fine-tuning from sparse foundational models, combining sparsity and quantization
Expand support for training aware methods: Quantization-Aware Training (QAT) and Low-Rank Adaptation (LoRA)
If you have any feature requests, large or small, please comment on our Roadmap Issue in GitHub.

Final Thoughts
At Neural Magic, we believe the Future of AI is Open, and we are on a mission to bring the power of open-source models and vLLM to every enterprise on the planet.

We offer nm-vllm, an enterprise distribution of vLLM, with:

Stable builds with bug fixes and selected model backporting
Enterprise support with SLAs for production deployments of vLLM
Tools and access to our teams for applying model optimizations via LLM Compressor
Pre-optimized model registry
Kubernetes reference architectures for production deployments


###
https://www.kbfg.com/kbresearch/report/reportView.do?reportId=2000501
Everyone's Doing AI? A Warning About 'AI Washing': Claiming AI in Name Only
2024-08-19
KB Financial Group

August 2024

□ Key characteristics and side effects of AI washing
□ AI washing cases
□ How to avoid AI washing



< Summary >

■ AI washing refers to promoting a company or product as if it makes extensive use of artificial intelligence when it does not actually do so; it is a marketing tactic that rides the AI boom.
  • It is similar to greenwashing, in which companies that are not environmentally friendly promote themselves as if they were.
  • By emphasizing its use of AI, a company can be seen as a more attractive investment target and attract more capital, and by building a more innovative, technology-leading image it can gain an edge over competitors and expect a lift in its corporate value.
■ [Problems] AI washing can lead to negative outcomes such as distorted allocation of investment resources, erosion of consumer trust, and inflated expectations.
■ [Cases] AI washing takes many forms, but the most typical is deceiving consumers by emphasizing AI in advertising and publicity even though AI's contribution to the product or service is unclear.
  • [Amazon Go] The 'Just Walk Out' system in Amazon's cashierless Amazon Go stores was presented as automatically charging customers as they leave, but it was revealed that employees at an Indian subsidiary were reviewing transactions manually.
  • [Joonko] The recruiting startup Joonko was charged by the U.S. Securities and Exchange Commission (SEC) and the Department of Justice for raising investment by spreading false claims that it recommends suitable candidates to companies based on AI.
  • [Delphia and Global Predictions] The investment advisers Delphia and Global Predictions were fined $225,000 and $175,000 respectively by the SEC for AI washing.
■ [Responses] As AI technology continues to advance, regulation of AI washing and consumer protection will only become more important; companies need to pursue genuine innovation by earning the trust of consumers and investors through transparent use of AI.
  • [Stronger regulation] Governments should tighten regulation of AI washing so that companies increase transparency about their AI use and refrain from unfounded claims.
  • [Transparent technology use and disclosure] Companies should provide transparent information that matches the AI technology they actually use, and keep records and documentation in case of regulatory audits or outside challenges.
  • [Consumer education] Governments and companies should run consumer education programs so that consumers understand AI better and are not taken in by exaggerated AI claims.
  • [A critical mindset] Consumers and investors should review and verify information for themselves, and look critically at whether a company is simply throwing around words like 'innovative' or 'intelligent'.
  • [Demanding transparent explanations] Consumers should demand transparent explanations of how AI systems work, prompting companies to act responsibly and provide accurate information, including the technology's real performance and limitations.



■ Key characteristics and side effects of AI washing

○ AI washing is the practice of promoting a company or product as if it makes extensive use of artificial intelligence when it does not actually do so; it is a marketing strategy that exploits the AI boom.
  - It resembles greenwashing, in which companies that are not environmentally friendly present themselves as if they were.
  - The broad and loose way 'AI' is defined is another factor behind the emergence of AI washing.

[Why companies engage in AI washing]
○ By emphasizing AI use, a company can be perceived as an attractive investment and raise more capital, and can build a more innovative, technology-leading image that gives it an edge over competitors.
  - [Attracting investment] According to Forbes, startups that mention AI when raising funds secure 15% to 50% more capital than those that do not, and more companies are overhyping their AI capabilities.[1]
    - Some companies exaggerate their AI capabilities to obtain funding, and founders tend to feel they will be at a disadvantage if they do not mention AI when raising money.[2] As a result, there is a substantial gap between companies that claim AI capabilities and genuine AI companies.
    - A 2019 survey by the UK venture capital fund MMC Ventures found that 40% of European AI startups did not in fact use AI at all.
  - [Securing market competitiveness] If a home appliance maker claims its products use AI, consumers are likely to see those products as smarter and more innovative and to be more willing to buy them.
    - Companies intend to emphasize technological superiority and position themselves as market leaders, but in practice this often amounts to no more than promoting something that simply works over the internet as an 'AI system' or adding a customer service chatbot.

[Problems with AI washing]
○ AI washing may look like just another common marketing tactic, but it can have negative consequences: it distorts the allocation of investment resources and inflates consumer expectations, ultimately eroding trust.

[1] Bernard Marr, Apr. 25, 2024, "Spotting AI Washing: How Companies Overhype Artificial Intelligence", Forbes
[2] According to OpenOcean, a new-technology investment firm headquartered in the UK and Finland, only 10% of tech startups mentioned using AI when raising funds in 2022; the share rose to more than a quarter in 2023 and is expected to exceed a third this year.



○ [Distorted resource allocation] Consumer and investor attention and money can become concentrated on companies that package themselves as holders of innovative technology.
  - If investment flows to AI-washing companies, companies and projects with genuinely innovative technology may struggle to obtain funding.
○ [Erosion of consumer trust] If AI washing becomes so widespread as a marketing device that most companies talk about AI, consumers will begin to doubt what AI can actually do, and as negative experiences accumulate through real use, trust in AI technology itself can decline.
  - Repeated exposure to exaggerated claims leads consumers to distrust AI, which can translate into avoidance of AI products and services and hurt companies offering genuine AI technology.
○ [Wasted corporate resources] If legitimate companies, conscious of AI-washing rivals' inflated claims, set and pursue unrealistic goals, the result can be not only losses from failed projects but also delayed innovation caused by a failure to choose and focus.
  - Pouring investment into superficial improvements instead of developing meaningful AI capabilities can slow technological progress.
  - It also complicates decision-making for companies seeking genuinely valuable AI solutions, hampering digital transformation efforts, stifling innovation, and worsening outcomes.

■ AI washing cases
○ [Amazon Go] The 'Just Walk Out' system at Amazon's cashierless Amazon Go stores was known as a system that automatically bills customers when they leave the store, but it emerged that employees at an Indian subsidiary were reviewing transactions manually.[3]
  - Amazon said that, rather than attaching sensors or chips to individual products, it applied deep learning with ceiling-mounted sensors tracking customers and products in real time; the AI-washing controversy arose when it was revealed that a substantial part of the system was operated by roughly 1,000 employees in India manually checking each store's camera feeds.
  - Amazon did not deny that humans were involved in the billing process, but maintained that employees merely annotate real data generated by the AI to improve the 'Just Walk Out' system and are not involved in overall operations.[4]

[3] Alex Bitter, Apr. 4, 2024, "Amazon's Just Walk Out actually uses 1,000 people in India", Business Insider
[4] Emma Roth, Apr. 18, 2024, "Amazon insists Just Walk Out isn't secretly run by workers watching you shop", The Verge

[Figure 1] Amazon Go system diagram (source: Wikipedia); [Figure 2] Amazon Go storefront (source: Wikipedia)
○ [Joonko] The recruiting startup Joonko was charged by the U.S. Securities and Exchange Commission (SEC) and the Department of Justice for raising investment by spreading false claims that it recommends suitable candidates to companies based on AI.
  - In June the SEC charged Ilit Raz, CEO and founder of the AI recruiting startup Joonko, with raising at least $21 million by falsely claiming to select candidates using AI and by making false and misleading statements about customer information, the number of candidates, and company revenue.
    - Joonko claimed to use AI to help companies build workforces that respect diversity, equity, and inclusion by surfacing diverse candidates, but the investigation found this to be untrue.
    - When raising funds, Raz also gave false information about business performance, claiming that Joonko had more than 100 customers including Fortune 500 companies, had earned more than $1 million in revenue, and was working with more than 100,000 active job seekers.
    - To conceal the falsehoods from investors who grew suspicious of these claims, Raz also provided forged bank statements and contracts.
○ [Delphia and Global Predictions] The SEC charged the investment advisers Delphia and Global Predictions with making false and misleading statements about the extent of their AI use. The two firms settled the charges by paying fines of $225,000 and $175,000 respectively.

[Figure 3] Joonko promotional image (source: Joonko LinkedIn account)



  - [Delphia] Claimed to use AI to predict which companies and trends would grow strongly so that clients could invest ahead of the curve, but the SEC concluded that Delphia did not actually possess the AI and machine learning technology it claimed.
    - Delphia was additionally charged with violating the Marketing Rule, which prohibits registered investment advisers from circulating advertisements that contain false statements of material fact.
  - [Global Predictions] Claimed on its website, social media, and in emails to be the first AI financial advisor offering expert AI-driven forecasts, but could not substantiate the claim and did not in fact use AI technology.
    - It violated SEC rules by publishing hypothetical performance data, rather than real client data, on its website and YouTube channel without disclosing that fact; it also promoted the excellence of its AI-based forecasting system even though it turned out not to use AI.
    - As a corrective measure for these violations, the firm agreed with the SEC to hire a compliance consultant to review its marketing and educational materials.
○ [Automators AI] Automators AI was charged by the Federal Trade Commission (FTC) for attracting investment with false claims that its AI technology could automatically increase online stores' sales.[5]
  - It told merchants selling on platforms such as Amazon and Walmart that using its AI technology would bring in $4,000 to $6,000 in monthly net profit and a 100% return on investment within eight months, raising $22 million, but the claims turned out to be false.
  - Contrary to those claims, no AI technology was actually used, and most customers failed to recover even their investment, let alone earn a profit. A federal court ordered the company to cease operations, permanently banned those involved from selling business opportunities, and ordered $22 million in compensation.

[5] In August 2023, Roman Cresto, John Cresto, and Andrew Chapman, who ran several companies including Automators AI, attracted investors by claiming that applying AI to Amazon and Walmart e-commerce could generate passive income, but it turned out that no AI technology was used (Federal Trade Commission, Feb. 27, 2024, "FTC Action Leads to Ban for Owners of Automators AI E-Commerce Money-Making Scheme").

[Figure 4] Delphia promotional image (source: Delphia website); [Figure 5] Global Predictions promotional image (source: Global Predictions website)

○ [Oral-B] P&G's oral care brand Oral-B sold a high-priced electric toothbrush with advertising that AI could detect tooth position and brightness to confirm whether teeth had been brushed properly, but The Washington Post reported that when it asked exactly how AI was applied in the toothbrush, the company could not answer.[6]
  - Oral-B claimed the toothbrush uses AI to analyze the user's brushing habits in real time and provide personalized feedback, but consumers and experts questioned how effective these features really are.
    - Some users complained that the feedback the toothbrush provides is hard to see as personalized, or is simpler than expected.
○ [Coca-Cola] Coca-Cola promoted a new beverage as having been created with AI, but drew criticism from the industry when it emerged as a case of AI washing.
  - Coca-Cola said its Y3000 flavor, promoted as imagining the year 3000, was co-developed with AI, but gave no explanation of how AI was involved in the development process.

[6] Shira Ovide, Apr. 5, 2024, "This $400 toothbrush is peak AI mania", The Washington Post

[Figure 6] Oral-B iO series marketing copy (source: Oral-B online store); [Figure 7] Coca-Cola Y3000 (source: Coca-Cola website)

■ How to avoid AI washing
○ As AI technology continues to advance, regulation of AI washing and consumer protection will become ever more important, and companies need to pursue genuine innovation by earning the trust of consumers and investors through transparent AI use.

[Government and companies]
○ [Stronger regulation] Governments should tighten regulation of AI washing so that companies increase transparency about their AI use and refrain from making unfounded claims.



  - The U.S. SEC, Department of Justice, and FTC have charged AI-washing companies or imposed heavy fines, sending a warning that false or misleading claims about AI can become targets of enforcement.
    - Late last year the SEC launched a full review of AI use at major financial firms such as JPMorgan Chase; its requests covered the AI algorithm models used in client portfolio management, related marketing documents, the status of third-party data sharing, and compliance training records.[7]
    - The FTC recommends that companies strengthen self-scrutiny with questions such as: "Are you exaggerating what your AI product can do? Are you promoting it as better than non-AI products? Does the product actually use AI?"
  - The UK Advertising Standards Authority has established rules and regulations on AI washing, including a ban on materially misleading statements about AI.
○ [Transparent technology use and disclosure] Companies should provide transparent information consistent with the AI technology they actually use, and keep records and documentation in case of regulatory audits or outside questions.
  - When promoting AI technology, companies should avoid false claims and provide relevant information transparently in order to earn the trust of consumers and investors.
    - In addition, departments across the company should cooperate to share accurate information about AI use, keep detailed records of how the technology is actually applied, and update them regularly.
○ [Consumer education] Governments and companies should run consumer education programs so that consumers understand AI better and are not taken in by exaggerated AI-related claims.
  - It is important to provide education so that consumers clearly understand the limits and realistic uses of AI and build the capacity not to be swayed by overblown advertising or fraudulent fundraising.

[Consumers]
○ [A critical mindset] Consumers and investors should take a critical attitude toward companies' AI claims: check whether there are concrete references to technical details such as the AI models and algorithms used, and whether the company maintains transparency about the relevant data and the types of algorithms involved.
  - It is important for consumers and investors to review and verify information for themselves, and to look critically at whether a company is simply throwing around words like 'innovative' or 'intelligent' when describing its AI.
○ [Demanding transparent explanations] Consumers should demand transparent explanations of how AI systems work, prompting companies to act responsibly and provide accurate information, including the technology's real performance and limitations.
  - They should ask about the AI system's data sources, how its algorithms work, and what steps have been taken to improve accuracy. If a company has no answer for data or algorithm bias or for avoiding AI hallucinations, there is a real possibility that it is not actually using AI.

[7] Kim Hyun-soo and Shin A-hyung, Dec. 12, 2023, "美 SEC '투자수익 과장하는 AI워싱 위험'… 월街 실태 전수조사" [U.S. SEC flags the risk of AI washing that inflates investment returns; launches a full review of Wall Street], The Dong-A Ilbo

<Senior research fellow Song Won-ho (wonho.song@kbfg.com), tel. 02-2073-5730>