Summary

Meta의 Yann LeCun은 연구소 관리는 명망 있는 과학자들이 맡아야 한다고 강조하며, 유망한 프로젝트를 상부 경영진의 감시로부터 보호하는 것을 포함한 7가지 관리 원칙을 제시했습니다. Gartner는 설문 조사 결과 생성 AI가 조직에서 가장 널리 배포된 AI 솔루션이 되었다고 발표했습니다. NVIDIA는 NIM을 활용한 LoRA 어댑터의 효율적 배포 방법을 설명했고, OpenAI는 모델 행동을 안내하는 새로운 Model Spec을 공개했습니다. DeepMind는 단백질을 넘어 DNA, RNA, 리간드 등 생체 분자 전반의 구조를 예측하는 AlphaFold 3을 발표했습니다. 이 밖에 Stanford의 DITTO(시연 기반 정렬), Intel의 Lunar Lake, Microsoft의 Copilot 도입 경험, LLM 제품 구축 교훈, 로컬 파일용 생성 검색 엔진 구축, 실행 가능한 코드 행동(CodeAct)을 통한 LLM 에이전트 개선 연구도 소개합니다.

연구실 관리의 중요성

링크, 2024년 6월 9일,
Meta

Yann LeCun
VP & Chief AI Scientist at Meta

연구소의 관리는 명망 있는 과학자들로 구성되는 것이 매우 중요하며, 이들의 주요 역할은 다음과 같습니다.

  1. 뛰어나고 창의적인 인재를 발굴하고, 채용하며, 유지하는 것.
  2. 이들이 최고의 연구를 할 수 있도록 환경, 자원, 자유를 제공하는 것.
  3. 유망한 연구 방향을 찾아내고(주로 연구자들이 제안하는 방향), 그 방향에 자원을 투자하는 것. 과학자들에게 책임을 맡기고 간섭하지 않는 것.
  4. 헛된 주장이나 비현실적인 아이디어를 잘 구별하는 것. 이는 과학자들이 부정직해서가 아니라 종종 자신을 속이기 쉬워서입니다. 자신이 대단한 발명을 했다고 생각하기 쉽습니다. 출판을 장려하고 오픈 소싱을 통해 연구 커뮤니티가 좋은 연구와 그렇지 않은 연구를 구별하도록 하는 방법이 있습니다.
  5. 연구자들이 야심찬 목표를 가진 연구 프로젝트에 참여하도록 동기를 부여하는 것. 단순한 개선 작업은 너무 쉽고 덜 위험할 수 있습니다.
  6. 단기적 성과와 단순한 지표(예: 논문 수)에 지나치게 집중하지 않는 방식으로 연구자들을 평가하는 것. 당신의 판단력을 사용하십시오. 그것이 당신이 높은 보수를 받는 이유입니다.
  7. 유망하지만 틀을 벗어난 프로젝트를 상부 경영진의 감시로부터 보호하는 것. 지켜보는 냄비는 절대 끓지 않습니다. 계획된 혁신과 6개월 단위의 마일스톤으로는 결코 돌파구를 만들 수 없습니다.

Gartner 설문 조사: 생성 AI가 가장 널리 사용되는 AI 솔루션

링크, 2024년 5월 7일,
Gartner

  • 2023년 4분기 미국·독일·영국의 644명 응답자를 대상으로 한 설문에서 29%가 생성 AI를 배포해 사용 중이라고 응답.
  • 생성 AI는 그래프 기술, 최적화 알고리즘, 규칙 기반 시스템 등을 제치고 가장 많이 사용됨.
  • Microsoft Copilot for 365와 Adobe Firefly와 같은 기존 응용 프로그램에 포함된 생성 AI 활용이 가장 일반적임.
  • AI 도입의 주요 장애물은 AI 프로젝트의 가치 추정 및 입증의 어려움.
  • 성숙한 AI 조직은 AI 운영 모델, AI 엔지니어링, 업스킬링, 신뢰 및 보안 관리에 중점.

LoRA 어댑터를 활용한 효율적 모델 배포

링크, 2024년 6월 7일,
NVIDIA

  • LoRA는 전체 모델을 업데이트하지 않고도 작은 수의 추가 매개변수만 튜닝.
  • 두 가지 LoRA 배포 방법: LoRA 어댑터 병합 및 동적 로드.
  • NIM을 통해 하나의 기반 모델 위에 다수의 LoRA 어댑터를 동시에 서빙하여 여러 작업의 요청을 혼합 배치(mixed batch)로 처리 가능.
  • NVIDIA NIM은 GPU 메모리와 호스트 메모리에서 어댑터를 동적으로 로드하여 성능 향상.

OpenAI의 새로운 모델 사양 공개

링크, 2024년 5월 8일,
OpenAI

  • 모델 사양(Model Spec)은 인간 라벨러가 모델 행동을 평가할 때 따르는 상위 수준 지침.
  • 5월 22일까지 공개 의견을 수렴 중이며, 반영 여부와 방식은 미정.
  • Anthropic의 헌법 AI(Constitutional AI)와 달리 RLHF(인간 피드백 기반 강화학습)를 활용해 모델 행동을 조정.
  • 플랫폼 규칙 우선, 법률 준수, 위험 정보 비공개, 지적 재산권 존중, 프라이버시 보호, 안전한 출력 유지 등 여섯 가지 행동 규칙 포함.

AlphaFold 3: 모든 생화학을 아우르는 혁신

링크, 2024년 5월 8일,
DeepMind

  • AlphaFold 3는 단백질뿐만 아니라 DNA, RNA, 리간드 등 모든 생물학적 활성 분자의 구조를 예측.
  • 아미노산으로 이루어지지 않은 분자는 개별 원자 집합으로 표현하고, 확산 기반 생성 모델로 3D 구조를 예측.
  • PoseBusters 벤치마크에서 약 77%의 예측 성공률 기록(비학습 기반 AutoDock Vina는 약 53%).
  • 단백질-단백질 상호작용 예측에서 77% 달성(AlphaFold Multimer 2.3은 67%).

DITTO: 시연 피드백을 통한 모델 정렬

링크, 2024년 6월 3일,
Hugging Face

  • Stanford 연구진이 제안한 DITTO는 10개 미만의 시연으로 LLM 출력을 사용자가 시연한 행동에 맞추는 방법.
  • 비교 데이터 생성 및 반복 학습을 통해 성능 향상.
  • 소수의 시연으로도 모델을 효과적으로 사용자 정의 가능.

Intel의 Lunar Lake: AI PC를 위한 새로운 코어와 GPU

링크, 2024년 6월 4일,
Intel

  • Lunar Lake는 새로운 코어 IP, GPU, NPU, 메모리 시스템을 갖춘 혁신적인 아키텍처.
  • P-코어와 E-코어의 성능 개선으로 IPC와 단일 스레드 부동 소수점 성능 향상.
  • 새로운 Xe2 Battlemage 아키텍처의 GPU는 50% 더 높은 그래픽 성능 제공.

Microsoft의 Copilot 사용 경험

링크, 2024년 6월 9일,
Microsoft

  • Copilot 도입 첫 해, AI가 업무에 미치는 영향 평가.
  • 초기 도입 부서: 판매, 고객 서비스, 인사.
  • Copilot 사용으로 생산성, 작업 즐거움, 워크라이프 밸런스 개선.

LLM 구축 경험에서 얻은 교훈

링크, 2024년 6월 8일,
Applied LLMs

  • LLM 제품 구축의 전술적, 운영적, 전략적 측면을 다룸.
  • 전술적: 프롬프트 작성, RAG, 흐름 엔지니어링, 평가 및 모니터링.
  • 운영적: 제품 배송의 일상적 문제와 효과적인 팀 구축.
  • 전략적: 장기적 관점과 시스템 중심 접근 방법 강조.

로컬 파일을 위한 생성 검색 엔진 구축

링크, 2024년 6월 8일,
Towards Data Science

  • 로컬 파일과 상호작용하는 오픈 소스 생성 검색 엔진 구현.
  • Qdrant를 벡터 저장소로, Streamlit을 사용자 인터페이스로 사용하며, Llama 3는 NVIDIA NIM API(70B) 또는 로컬 실행(8B)으로 활용.
  • 파일 인덱싱 및 쿼리 응답을 위한 구조와 사용자 인터페이스 설계.
  • 성능과 유연성을 높이기 위해 문서 청크화 및 벡터 유사성 메트릭 사용.

실행 가능한 코드 작업으로 LLM 에이전트 개선

링크, 2024년 2월 2일,
Hugging Face

  • LLM 에이전트의 행동 공간을 통합하기 위해 실행 가능한 Python 코드를 사용.
  • CodeAct를 통해 JSON이나 텍스트 대신 실행 가능한 코드로 작업을 수행.
  • API-Bank와 새로운 벤치마크에서 최대 20% 더 높은 성공률 달성.
  • Llama2와 Mistral에서 파인튜닝된 CodeActAgent를 통해 복잡한 작업 수행 가능.
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

# (today's date in 년 월 일) AI 소식,
## Summary
(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)
## Title,
한글제목 (title 이 영문이라면)
[링크](link), date,
company name
- detailed summary1, (개조식 문체 사용)
- detailed summary2, (개조식 문체 사용)
...
- detailed summary N, (개조식 문체 사용)
## Title,
한글제목 (title 이 영문이라면)
[링크](link), date,
company name
- detailed summary1, (개조식 문체 사용)
- detailed summary2, (개조식 문체 사용)
...
- detailed summary N, (개조식 문체 사용)
...
The report should be written in Korean and use the 개조식 문체 style. give the very deep details for each link as much as possible.
###
https://www.linkedin.com/posts/yann-lecun_it-is-of-paramount-importance-that-the-management-activity-7205642101001203714-4nug?utm_source=share&utm_medium=member_ios


Yann LeCun
VP & Chief AI Scientist at Meta

It is of paramount importance that the management of a research lab be composed of reputable scientists.

Their main jobs are to:
1. Identify, recruit, and retain brilliant and creative people.
2. Give them the environment, resources, and freedom to do their best work.
3. Identify promising research directions (often coming from the researchers themselves) and invest resources in them. Put the scientists in charge and get out of the way.
4. Be really good at detecting BS, not necessarily because scientists are dishonest, but often because they are self-deluded. It's easy to think you've invented the best thing since sliced bread. Encouraging publications and open sourcing is a way to use the research community to help distinguish good work from not-so-good work.
5. Inspire researchers to work on research projects that have ambitious goals. It's too easy and less risky to work on valuable improvements that are incremental.
6. Evaluate people in ways that don't overly focus on short-term impact and simple metrics (e.g. number of publications). Use your judgment. That's why you get paid the big bucks.
7. Insulate rogue-but-promising projects from the scrutiny of upper management. A watched pot never boils. Planned innovation and 6-months milestones never bring breakthroughs.

You can't do any of these cat-herding jobs unless you are an experienced, talented, and reputable scientist with a research record that buys you at least some legitimacy in the eyes of the scientists in your organization.

###
https://www.gartner.com/en/newsroom/press-releases/2024-05-07-gartner-survey-finds-generative-ai-is-now-the-most-frequently-deployed-ai-solution-in-organizations
Gartner Survey Finds Generative AI Is Now the Most Frequently Deployed AI Solution in Organizations
STAMFORD, Conn., May 7, 2024

Estimating and Demonstrating Business Value Is No. 1 AI Adoption Barrier
Generative artificial intelligence (GenAI) is the No. 1 type of AI solution deployed in organizations, according to a new survey by Gartner, Inc.

According to the survey conducted in the fourth quarter of 2023, 29% of the 644 respondents from organizations in the U.S., Germany and the U.K. said that they have deployed and are using GenAI, making GenAI the most frequently deployed AI solution. GenAI was found to be more common than other solutions like graph techniques, optimization algorithms, rule-based systems, natural language processing and other types of machine learning.

The survey also found that utilizing GenAI embedded in existing applications (such as Microsoft’s Copilot for 365 or Adobe Firefly) is the top way to fulfill GenAI use cases, with 34% of respondents saying this is their primary method of using GenAI. This was found to be more common than other options such as customizing GenAI models with prompt engineering (25%), training or fine-tuning bespoke GenAI models (21%), or using standalone GenAI tools, like ChatGPT or Gemini (19%).

“GenAI is acting as a catalyst for the expansion of AI in the enterprise,” said Leinar Ramos, Sr Director Analyst at Gartner. “This creates a window of opportunity for AI leaders, but also a test on whether they will be able to capitalize on this moment and deliver value at scale.”

Demonstrating AI Value Is Top Barrier to Adoption
The primary obstacle to AI adoption, as reported by 49% of survey participants, is the difficulty in estimating and demonstrating the value of AI projects. This issue surpasses other barriers such as talent shortages, technical difficulties, data-related problems, lack of business alignment and trust in AI (see Figure 1).

“Business value continues to be a challenge for organizations when it comes to AI,” said Ramos. “As organizations scale AI, they need to consider the total cost of ownership of their projects, as well as the wide spectrum of benefits beyond productivity improvement.”

Figure 1: Top Barriers to Implement AI Techniques (Sum of Top 3 Ranks)
Source: Gartner (May 2024)

"GenAI has increased the degree of AI adoption throughout the business and made topics like AI upskilling and AI governance much more important,” said Ramos. “GenAI is forcing organizations to mature their AI capabilities.”

Learnings from AI-Mature Organizations
“Organizations who are struggling to derive business value from AI can learn from mature AI organizations,” said Ramos. “These are organizations that are applying AI more widely across different business units and processes, deploying many more use cases that stay longer in production.”

The survey found 9% of organizations are currently AI-mature and found that what makes these organizations different is that they focus on four foundational capabilities:

A scalable AI operating model, balancing centralized and distributed capabilities.
A focus on AI engineering, designing a systematic way of building and deploying AI projects into production.
An investment on upskilling and change management across the wider organization.
A focus on trust, risk and security management (TRiSM) capabilities to mitigate the risks that come from AI implementations and drive better business outcomes.
“AI-mature organizations invest in foundational capabilities that will remain relevant regardless of what happens tomorrow in the world of AI, and that allows them to scale their AI deployments efficiently and safely,” said Ramos.

Focusing on these foundational capabilities can help organizations mature and alleviate the current challenge of bringing AI projects to production. The survey found that, on average, only 48% of AI projects make it into production, and it takes 8 months to go from AI prototype to production.

Gartner clients can read more in “Survey Shows How GenAI Puts Organizational AI Maturity to the Test.” Learn more in the complimentary Gartner webinar “What Mature Organizations Do Differently for AI Success.”

Gartner IT Symposium/Xpo
CIOs and IT executives will explore AI adoption and implementation at Gartner IT Symposium/Xpo. Follow news and updates from the conferences on Twitter using #GartnerSYM.

###
https://developer.nvidia.com/blog/seamlessly-deploying-a-swarm-of-lora-adapters-with-nvidia-nim/?ncid=so-link-634884&=&linkId=100000265563449
Technical Blog
NVIDIA

Generative AI
Seamlessly Deploying a Swarm of LoRA Adapters with NVIDIA NIM
Jun 07, 2024
By Shashank Verma, Neal Vaidya, Vinh Nguyen, Wei Du, Scot Junkin and BoYang Hsueh


The latest state-of-the-art foundation large language models (LLMs) have billions of parameters and are pretrained on trillions of tokens of input text. They often achieve striking results on a wide variety of use cases without any need for customization. Despite this, studies have shown that the best accuracy on downstream tasks can be achieved by adapting LLMs with high-quality, domain-specific datasets.

In many cases, smaller customized models can match or even outperform larger generic LLMs while offering significantly lower deployment costs. However, customizing models for specific downstream tasks can bring significant challenges, during both creation and deployment.

Full fine-tuning (that is, updating all parameters of the model) for the largest LLMs can be difficult due to the amount of computational infrastructure required to learn across the whole model. Infrastructure costs are also increased at deployment time, where users are required to either host multiple large models in memory or tolerate increased latency as entire models are swapped in and out. Low-rank adaptation (LoRA) is a technique for mitigating both of these issues.

This post provides a brief overview of LoRA, and explains the two ways to deploy LoRA fine-tuned models. We will also discuss our approach for enabling a heterogeneous LoRA deployment of a swarm of LoRA adapters, enabling mixed-batch inference requests.

Low-rank adaptation
In the past few years, LoRA has emerged as a popular technique that tunes a very small number of additional parameters, as compared to full fine-tuning. These additional parameters, called the LoRA adapter, represent the low-rank decomposition of the changes in the dense layers of the network. LoRA operates on the observation that LLMs are overparameterized, and that newly learned information during fine-tuning has a low “intrinsic rank.” In other words, the effective changes in the model parameters are confined to a lower-dimensional subspace of the entire, very high-dimensional parameter space. With LoRA, it’s possible to reduce the number of trainable parameters by 10,000x.

Figure 1 illustrates the parameters introduced in the form of trainable low-rank matrices A and B. The pretrained weights are frozen while A and B are trained during LoRA customization to represent the newly added information.
Figure 1. Parameters in A and B represent the newly added information. Image credit: LoRA: Low-Rank Adaptation of Large Language Models
Figure 1 depicts the core idea behind LoRA:

The weights of the pretrained model (W) are frozen during customization
Instead of updating W, two smaller trainable matrices A and B are injected, which learn task-specific information. The matrix multiplication B*A forms a matrix with the same dimensions as W, thus it can be added to W (= W + BA).
The ranks of A and B matrices are small values like 8, 16, and so on. Cumulatively, they have far fewer trainable parameters than W, which makes customization computationally and memory efficient. This rank (r) parameter is typically customizable at training time.

There exists a tradeoff between rank size and computational efficiency. A larger rank value enables better expressivity, so the model can capture more patterns relevant to the downstream task. Very high rank values (like 64) approach the capacity of learning information close to full supervised fine-tuning. That is, updating all the parameters in the model. On the downside, larger ranks are also more expensive to train and inference, both in terms of memory and compute requirements. In practice, LoRA fine-tuning with a rank value as small as 8 is already very effective, and is a good starting point for a variety of downstream tasks.
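To make the W + BA idea from Figure 1 concrete, here is a minimal, hypothetical PyTorch sketch of a LoRA-augmented linear layer; the layer sizes, rank, initialization, and scaling factor are illustrative assumptions for the sketch, not any particular framework's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                 # W stays frozen
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B: d_out x r, zero init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying W + scale * (B @ A) to x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=4096, d_out=4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # far fewer than 4096 * 4096
```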

Deploying a LoRA-tuned model
LoRA fine-tunes can be deployed in the following ways.

Option 1: Merging the LoRA adapter
The additional LoRA weights can be merged with the pretrained model to create a purpose-built variant that is structurally equivalent to its predecessor. This avoids incurring any additional inference latency of managing the adapter separately. Merging weights is a simpler approach, but less flexible. The disadvantage of this approach is that the whole model becomes “bespoke” and can only serve one task at a time—that is, the one it is fine-tuned for. This makes it difficult to batch together inputs for different tasks for efficiency in deployment. It is only recommended if you plan to serve a single task per deployment.
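As a rough, self-contained illustration of what merging means numerically (the shapes and scale factor below are assumptions for the sketch):

```python
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Return the merged weight W' = W + scale * (B @ A); W' has the same shape as W."""
    return W + scale * (B @ A)

# Toy example: a 1024x1024 base weight with a rank-8 adapter.
d, r = 1024, 8
W = torch.randn(d, d)
A = torch.randn(r, d) * 0.01   # low-rank factors (illustrative shapes)
B = torch.randn(d, r) * 0.01
W_merged = merge_lora(W, A, B)
assert W_merged.shape == W.shape  # structurally identical to the base model
```

After merging, the adapter branch disappears and the model serves only the one task it was tuned for, which is exactly the flexibility tradeoff described above.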

Option 2: Dynamically loading the LoRA adapter
LoRA adapters (A and B in Figure 1) are kept separate from the base model (W). At inference, the runtime dynamically loads the adapter weights corresponding to incoming requests to serve it. It enables flexibility in serving and batching inputs from various tasks concurrently to make the best use of the available compute, without having to maintain separate custom models.

Some use cases require several, and even hundreds or thousands of LoRAs over the same base model. For these, ‌dynamic LoRA adapter selection is a better path. Examples include:

Enterprises serving personalized models for their customers, for serving recommendations, or adapting to their specific personas or preferences.
A/B testing to compare between various LoRA fine-tunes of the same use case.
Enterprises serving multiple downstream use cases based on the same base foundation model. For example, IT service teams deploying a multi-LoRA setup for bug summarization, ticket routing and classification, implementing chatbots and knowledge retrieval over specific document corpuses, root cause analysis, and more.
NVIDIA NIM offers optimized inference microservices that support such dynamic loading of LoRA adapters and allow sending mixed-batch requests. The following sections take a deeper look at our approach.

Heterogenous, multiple LoRA deployment with NVIDIA NIM
With NIM, each inference microservice is associated with a single foundation model. This model can have any number of “customizations” in the form of low-rank adapters associated with it.

Adapters, trained using either the NVIDIA NeMo framework or Hugging Face PEFT library are placed into an adapter store and given a unique name.
When making a request to the NIM, clients can specify that they want a particular customization by including the LoRA model name.
When NIM receives a request for some customized model, it will pull the associated adapter from the adapter store into a multi-tier cache. Some adapters are resident in GPU memory and some in host memory, depending on how recently they were used.
During execution, NIM will run specialized GPU kernels that let data flow through both the foundation model and multiple different low-rank adapters simultaneously. This enables it to respond to requests for multiple different custom models at the same time.
Figure 2. NVIDIA NIM dynamic LoRA architecture (mixed batch input, GPU memory, adapter store, adapter cache, output batch), which enables sending a mixed batch of input over the same foundation model
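For illustration, a hedged sketch of what the client side of this flow might look like. It assumes the microservice exposes an OpenAI-compatible endpoint at a local URL and that two hypothetical adapters named "llama3-8b-bug-summary" and "llama3-8b-ticket-routing" are registered in the adapter store; only the idea of selecting a customization by its LoRA model name comes from the post itself.

```python
# Hypothetical client-side view of dynamic LoRA serving; the base URL, API key,
# and adapter names are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# Two requests targeting different LoRA customizations of the same base model.
# The serving layer can batch these together, loading each adapter on demand.
for adapter_name, prompt in [
    ("llama3-8b-bug-summary", "Summarize this bug report: ..."),
    ("llama3-8b-ticket-routing", "Route this ticket to the right team: ..."),
]:
    response = client.chat.completions.create(
        model=adapter_name,  # select the customization by its adapter name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    print(adapter_name, "->", response.choices[0].message.content)
```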
Handling a mixed batch of requests
The requests in one batch might use different LoRA adapters to support different tasks. Therefore, one traditional General Matrix Multiplication (GEMM) can’t be used to compute all the requests together. Computing them one-by-one sequentially would lead to significant additional overhead. To solve this problem, we used NVIDIA CUTLASS to implement a batched GEMM to fuse batched, heterogeneous request processing into a single kernel. This improves ‌GPU utilization and performance.

Furthermore, we found that the GPU utilization of the batched GEMM is not sufficiently high for the first matrix component of each adapter, because this first matrix has a very large input dimension and small output dimension. Each adapter has two matrix components, A (shaped d-by-r) and B (shaped r-by-d), as seen in Figure 1. Since d is typically much larger than the LoRA rank r, we applied the splitK method to split the GEMM into several tiles on more streaming multiprocessors (SMs), improving the GPU utilization, and use an additional reduction kernel to reduce the partial results after the splitK-batched-GEMM.

Best practices for performance benchmarking
Evaluating the latency and throughput performance of such a multi-LoRA deployment is nontrivial. In this section, we discuss several major considerations generally worth looking at when benchmarking the performance of an LLM LoRA inference framework.

Base model: Both small and large models can be used as base models for LoRA fine-tuning and inference, such as Llama 3 8B and Llama 3 70B. Smaller models excel at many tasks, especially traditional non-generative NLP tasks, such as text classification, while larger models excel at complex reasoning tasks. One of the advantages of LoRA is that even a large 70B model can be tuned on a single NVIDIA DGX H100 or A100 node with FP16, or even a single NVIDIA H100 or NVIDIA A100 GPU with 4-bit quantization.
Adapters: In practice, from the end user’s point of view, it’s desirable to have the flexibility to experiment and select the size that yields the best accuracy. System operators, on the other hand, may want to enforce a certain fixed size uniformly, for uniform LoRAs enable better batching and hence performance. Popular choices for LoRA ranks are 8/16/32/64.
Test parameters: Several other test parameters to be considered for benchmarking include:
Output length control: The ignore_eos parameter tells the inference framework to continue generating text until it reaches the max_token_length limit. This ensures the use case OSL (output sequence length) specification is met. This parameter is increasingly supported by LLM inference frameworks and significantly simplifies benchmarking setup. Notably, with ignore_eos you don’t have to train on “real” tasks for performance profiling purposes.
System load: Concurrency (number of concurrent users) is commonly used to drive load into the system. This should reflect real use cases, while also taking into account the max “batch size” that the system can effectively serve concurrently. For an 8B model on one GPU, consider up to 250 concurrent users for a realistic server load.
Task type: Both generative and non-generative tasks should be considered. These differ in the ISL (input sequence length) and OSL. ISL in the [200, 2000] token range, and OSL in the [1, 2000] token range reflect a wide range of LLM applications from text classification and summary, to translation and code generation.
Tooling: The benchmarking tool should support calling the LoRA models. GenAI-Perf is an LLM benchmarking tool designed with LoRA support. Adapters are called either uniformly at random or in a round-robin fashion, or following a distribution to reflect real usage patterns. For example, 20% of adapters account for 80% of requests.
Metrics: In the LLM domain, the main metrics are latency, measured as TTFT (time to first token) and ITL (inter-token latency), and throughput, measured as TPS (total system tokens per second).
Other supplementary metrics include total requests per second and end-to-end request latency.

Compared to serving a base model (or merged LoRA model), the addition of dynamic LoRAs—a single LoRA, multiple LoRAs of the same rank, or multiple LoRAs of different ranks—all induce increasing cost, both in latency and throughput. Ideally, this cost should be reasonable in exchange for the improved accuracy and flexibility that dynamic LoRAs provide.

In the coming weeks and months, we’ll have more to share on the performance characteristics of NIM when serving LoRA.

What’s next
There are exciting new enhancements to LoRA in research that aim to improve the efficiency or accuracy of fine-tuned models. Our future direction includes incorporating these into NIM.

Tied-LoRA
Tied-LoRA is a novel technique from NVIDIA Research that increases the parameter efficiency of LoRA. In LoRA, task-specific low-rank matrices are added that approximate the weight updates for each layer of the LLM. In Tied-LoRA, these low-rank matrices are shared (“tied”) between the various layers, further reducing the number of trainable parameters. Additionally, this technique allows selectively training or freezing of different components of LoRA (low-rank matrices, and scaling vectors) enabling the user to experiment with performance and parameter efficiency trade-offs.

Support for this method with NVIDIA NIM is planned for future releases.

DoRA
DoRA, another technique developed by NVIDIA Research, bridges the performance gap between fully fine-tuned models and LoRA tuning. It achieves this by decomposing pretrained weights into two components: magnitude and direction. For fine-tuning, DoRA specifically uses LoRA for directional updates, thereby minimizing the number of trainable parameters efficiently. This approach enhances the learning capacity and training stability of LoRA without incurring additional inference overhead. DoRA consistently outperforms LoRA in fine-tuning models like LLaMA, LLaVA, and VL-BART across various downstream tasks, including commonsense reasoning, visual instruction tuning, and image and video-text understanding.

Conclusion
NVIDIA NIM enables you to seamlessly deploy and scale multiple LoRA adapters. NIM is generally available now, starting with support for Meta Llama 3 8B and Llama 3 70B, and LoRA adapters in both NVIDIA NeMo and Hugging Face model formats. We’re committed to adding support for additional state-of-the-art community models in future releases.

To get started with multi-LoRA in NIM, check out the Jupyter Notebook tutorial on LoRA tuning a Llama 3 model using NVIDIA NeMo, deploying fine-tuned adapter(s) with NIM, and sending mixed inference requests. For more information about NIM, see the documentation.

###
https://www.deeplearning.ai/the-batch/issue-249/
Published May 16, 2024 · 14 min read
Dear friends,

In the last couple of days, Google announced a doubling of Gemini Pro 1.5's input context window from 1 million to 2 million tokens, and OpenAI released GPT-4o, which generates tokens 2x faster and 50% cheaper than GPT-4 Turbo and natively accepts and generates multimodal tokens. I view these developments as the latest in an 18-month trend. Given the improvements we've seen, best practices for developers have changed as well.

Since the launch of ChatGPT in November 2022, with key milestones that include the releases of GPT-4, Gemini 1.5 Pro, Claude 3 Opus, and Llama 3-70B, many model providers have improved their capabilities in two important ways: (i) reasoning, which allows LLMs to think through complex concepts and follow complex instructions; and (ii) longer input context windows.

The reasoning capability of GPT-4 and other advanced models makes them quite good at interpreting complex prompts with detailed instructions. Many people are used to dashing off a quick, 1- to 2-sentence query to an LLM. In contrast, when building applications, I see sophisticated teams frequently writing prompts that might be 1 to 2 pages long (my teams call them “mega-prompts”) that provide complex instructions to specify in detail how we’d like an LLM to perform a task. I still see teams not going far enough in terms of writing detailed instructions. For an example of a moderately lengthy prompt, check out Claude 3’s system prompt. It’s detailed and gives clear guidance on how Claude should behave.


This is a very different style of prompting than we typically use with LLMs’ web user interfaces, where we might dash off a quick query and, if the response is unsatisfactory, clarify what we want through repeated conversational turns with the chatbot.

Further, the increasing length of input context windows has added another technique to the developer’s toolkit. GPT-3 kicked off a lot of research on few-shot in-context learning. For example, if you’re using an LLM for text classification, you might give a handful — say 1 to 5 examples — of text snippets and their class labels, so that it can use those examples to generalize to additional texts. However, with longer input context windows — GPT-4o accepts 128,000 input tokens, Claude 3 Opus 200,000 tokens, and Gemini 1.5 Pro 1 million tokens (2 million just announced in a limited preview) — LLMs aren’t limited to a handful of examples. With many-shot learning, developers can give dozens, even hundreds of examples in the prompt, and this works better than few-shot learning.
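As a small illustration of many-shot in-context learning for text classification, here is a hedged sketch that simply packs labeled examples into one long prompt; the examples, labels, and the downstream model call are placeholders, not part of any specific product.

```python
# Build a many-shot classification prompt; with long-context models the
# examples list can hold dozens or hundreds of items instead of just a few.
labeled_examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
    # ... potentially hundreds more (text, label) pairs ...
]

def build_many_shot_prompt(examples, query: str) -> str:
    lines = ["Classify each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nLabel: {label}\n")
    lines.append(f"Review: {query}\nLabel:")
    return "\n".join(lines)

prompt = build_many_shot_prompt(labeled_examples, "Arrived broken and support never replied.")
print(prompt)  # send this string to any chat or completion endpoint
```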

When building complex workflows, I see developers getting good results with this process:

Write quick, simple prompts and see how it does.
Based on where the output falls short, flesh out the prompt iteratively. This often leads to a longer, more detailed, prompt, perhaps even a mega-prompt.
If that’s still insufficient, consider few-shot or many-shot learning (if applicable) or, less frequently, fine-tuning.
If that still doesn’t yield the results you need, break down the task into subtasks and apply an agentic workflow.
I hope a process like this will help you build applications more easily. If you’re interested in taking a deeper dive into prompting strategies, I recommend the Medprompt paper, which lays out a complex set of prompting strategies that can lead to very good results.

Keep learning!

Andrew

P.S. Two new short courses:

“Multi AI Agent Systems with crewAI” taught by crewAI Founder and CEO João Moura: Learn to take a complex task and break it into subtasks for a team of specialized agents. You’ll learn how to design agent roles, goals, and tool sets, and decide how the agents collaborate (such as which agents can delegate to other agents). You'll see how a multi-agent system can carry out research, write an article, perform financial analysis, or plan an event. Architecting multi-agent systems requires a new mode of thinking that's more like managing a team than chatting with LLMs. Sign up here!
“Building Multimodal Search and RAG” taught by Weaviate's Sebastian Witalec: In this course, you'll create RAG systems that reason over contextual information across text, images and video. You will learn how to train multimodal embedding models to map similar data to nearby vectors, so as to carry out semantic search across multiple modalities, and learn about visual instruction tuning to add image capabilities to large language models. Sign up here!
News

Why ChatGPT Acts That Way
OpenAI pulled back the curtain on revised rules that will guide its models.

What’s new: OpenAI published its Model Spec, high-level guidelines for use by human labelers to steer model behavior. The company is inviting public comments on the spec until May 22. It has not stated whether or how it will incorporate comments.

How it works: During training, human labelers rate a model’s responses so it can be fine-tuned to conform with human preferences in the process known as reinforcement from human feedback (RLHF). The Model Spec outlines the principles — some new, some previously in use — that will drive those ratings. The principles are arranged hierarchically, and each category will override those below it.

Three top-level objectives describe basic principles for model behavior: (i) “Assist the developer and end user” defines the relationship between humans and the model. (ii) “Benefit humanity” guides the model to consider both benefits and harms that may result from its behavior. (iii) “Reflect well on OpenAI” reinforces the company’s brand identity as well as social norms and laws.
Six rules govern behavior. In order, models are to prioritize platform rules above requests from developers, users, and tools; follow laws; withhold hazardous information; respect intellectual property; protect privacy; and keep their output “safe for work.” (These rules can lead to contradictions. For instance, the model will comply if a user asks ChatGPT to translate a request for drug-related information because the directive to follow requests from users precedes the one to withhold hazardous information.)
What OpenAI calls defaults govern the model’s interaction style. These include “ask clarifying questions when necessary,” “express uncertainty,” “assume an objective point of view,” and “don't try to change anyone's mind.” For example, if a user insists the Earth is flat, the model may respond, “Everyone's entitled to their own beliefs, and I'm not here to persuade you!”
The spec will evolve in response to the AI community’s needs. In the future, developers may be able to customize it. For instance, the company is considering allowing developers to lift prohibitions on “not safe for work” output such as erotica, gore, and some profanity.
Behind the news: OpenAI’s use of the Model Spec and RLHF contrasts with Anthropic’s Constitutional AI. To steer the behavior of Anthropic models, that company’s engineers define a constitution, or list of principles, such as “Please choose the response that is the most helpful, honest, and harmless” and “Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior.” Rather than human feedback, Anthropic relies on AI feedback to interpret behavioral principles and guide reinforcement learning.

Why it matters: AI developers require a degree of confidence that the models they use will behave as they expect and in their users’ best interests. OpenAI’s decision to subject its guidelines to public scrutiny could help to instill such confidence, and its solicitation of public comments might make its models more responsive to social and market forces.

We’re thinking: OpenAI’s openness with respect to its Model Spec is a welcome step toward improving its models’ safety and performance.


AlphaFold 3 Embraces All Biochemistry
The latest update of DeepMind’s AlphaFold model is designed to find the structures of not just proteins but all biologically active molecules as well as interactions between them.

What’s new: Google announced AlphaFold 3, which models the 3D shapes of biomolecules including proteins, DNA, RNA, and ligands (molecules that bind to proteins or DNA, which includes antibodies and many drugs) in any combination. AlphaFold Server provides access for noncommercial uses (with some limitations). Unlike earlier versions, AlphaFold 3 is not open source.

Key insight: Given a sequence of amino acids (the building blocks of proteins), the previous version of AlphaFold drew on an existing knowledge of amino acid structures, computed their locations and angles, and assembled them like Lego blocks. To adapt the system for molecules that aren’t made of amino acids, AlphaFold 3 represents them as collections of individual atoms and uses a generative model to find their positions in space.

How it works: Given a list of molecules, AlphaFold 3 generates their joint 3D structure, revealing how they fit together. Several transformers hone embeddings of proteins and amino acids, while a diffusion model (also a transformer) processes embeddings of atoms. The team trained the system on five datasets including ground-truth protein, DNA, and RNA structures and interactions in the Protein Data Bank. They also trained it on protein shapes computed by AlphaFold 2; that model’s explicit knowledge of amino acid structures helped overcome AlphaFold 3’s tendency to hallucinate in some instances. Among the key processes:

Given a protein’s amino acid sequence, a molecule’s set of atoms, or any combination thereof, AlphaFold 3 first represents each common amino acid, nucleotide, and individual atom (that isn’t a part of a common amino acid or nucleotide) with a single token.
For each token, the system draws on existing databases to compute a variety of features, which fall into five categories: (i) per-token features like position, (ii) features of proteins in the Protein Data Bank, (iii) features of a given molecule, (iv) features derived from a genetic search (for example, whether two amino acid sequences appear to be related evolutionarily) and (v) features that describe chemical bonds between two tokens.
Given these features, a transformer produces a single embedding that represents all tokens and pairwise embeddings that represent relationships between each pair of tokens. A second transformer refines the pairwise embeddings based on known molecules that share subsequences of amino acids or nucleotides with the input. A third transformer further refines the embeddings.
Given the features, embeddings, and a noisy point cloud of atoms, the diffusion model removes the noise. (That is, it learned to modify the atoms’ positions to match those in their dataset.)
AlphaFold 3 learned to optimize seven additional loss terms, including one that minimized the difference between the predicted and actual length of bonds between molecules and another that minimized the difference between predicted and actual distances between pairs of atoms.
Results: On PoseBusters, a database of protein and protein-molecule shapes, AlphaFold 3 successfully found the shapes of about 77 percent of examples, while AutoDock Vina (a non-learning program that models molecular interactions) achieved about 53 percent. On a Protein Data Bank evaluation set, AlphaFold 3 successfully found about 84 percent of protein shapes, while AlphaFold Multimer 2.3 (an update of AlphaFold 2) found 83 percent. Modeling protein-protein interactions, AlphaFold 3 achieved 77 percent, while AlphaFold Multimer 2.3 achieved 67 percent, according to DockQ (a metric for the quality of such interactions).

Behind the news: The original AlphaFold solved one of the most challenging problems in molecular biology by figuring out how long chains of amino acids would fold, giving scientists clear targets for designing new bioactive molecules. Google spun off Isomorphic Labs to apply AlphaFold 2 to drug discovery. That company will use AlphaFold 3 and control commercial access to it.

Why it matters: AlphaFold 3 is a triumph of machine learning. It extends the utility of the previous version beyond proteins, and it computes with unprecedented accuracy how biological molecules will combine, allowing for a more comprehensive understanding of how drugs interact with the body. Its ability to predict how antibodies will bind to proteins could help stave off future pandemics and other illnesses.

We’re thinking: Although Isomorphic Labs retains control of AlphaFold 3, biologists said the information in the paper is enough for other researchers to develop similar systems. We look forward to open versions!

###
https://huggingface.co/papers/2402.01030
Executable Code Actions Elicit Better LLM Agents
Published on Feb 2
Authors: Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji
Abstract
Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomously self-debug.
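A hedged, highly simplified sketch of the CodeAct-style loop the abstract describes: the model emits executable Python as its action, the interpreter runs it, and the observation is fed back for the next turn. The model call is a placeholder, and a real implementation would use a sandboxed interpreter rather than exec.

```python
# Minimal sketch of an agent whose action space is executable Python code.
import io, contextlib

def llm(history: list[dict]) -> str:
    raise NotImplementedError("plug in your chat model here")  # placeholder

def run_code(code: str) -> str:
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})          # CAUTION: a real agent needs a sandbox
    except Exception as e:
        return f"Error: {e!r}"
    return buf.getvalue() or "(no output)"

def codeact_agent(task: str, max_turns: int = 5) -> None:
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = llm(history)              # model returns executable Python
        observation = run_code(action)     # execute and capture the result
        history += [{"role": "assistant", "content": action},
                    {"role": "user", "content": f"Observation:\n{observation}"}]
```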

###
https://huggingface.co/papers/2406.00888
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Published on Jun 3 · Featured in Daily Papers on Jun 4
Authors: Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang
Abstract
Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number (<10) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns language model outputs to a user's demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users' demonstrations as preferred over output from the LLM and its intermediate checkpoints. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants (N=16). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19% points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs.

Humans learn faster by being shown rather than told. Well, LLMs also learn faster if you show them! 👀 DITTO from Stanford University proposes that LLMs can be tuned with less than 10 samples! 🤯
Implementation:
1️⃣ Collect a small number (<10) of User/Expert demonstrations (input & output)
2️⃣ Select the SFT Model you want to tune
3️⃣ Generate new negative samples for the demonstrations
4️⃣ Create Pairwise comparison data where (expert > generation)
5️⃣ SFT until defined breakpoint (loss), then apply DPO using the pairwise comparison data
🔄 Repeat 3-5, but in every new iteration, add 20% of “replay” data, with Current Iteration > previous iteration outputs pairs
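A hedged sketch of how steps 3 and 4 above might be realized: turning a handful of expert demonstrations plus samples from the current checkpoint into pairwise preference records for DPO-style training. The field names and the generation callback are illustrative assumptions, not the authors' code.

```python
# Build (chosen, rejected) preference pairs, treating the expert demonstration
# as preferred over outputs sampled from the current model checkpoint.
from typing import Callable

def build_ditto_pairs(
    demos: list[tuple[str, str]],            # (prompt, expert_output) pairs, fewer than 10
    generate: Callable[[str], str],          # samples one output from the current model
    negatives_per_demo: int = 10,
) -> list[dict]:
    pairs = []
    for prompt, expert_output in demos:
        for _ in range(negatives_per_demo):
            model_output = generate(prompt)  # negative sample from current checkpoint
            pairs.append({
                "prompt": prompt,
                "chosen": expert_output,     # expert demonstration is preferred
                "rejected": model_output,
            })
    return pairs

# The resulting records follow a layout commonly used by DPO trainers; iterate
# SFT -> sample -> DPO, mixing in earlier-iteration outputs as "replay" pairs.
```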
Insights:
📈 DITTO outperforms few-shot prompting
🔄 Generating 10 negative samples per demonstration improves performance.
📊 DITTO 22.34% relative Improvement
🚀 31.5% performance improvement from the first to the fourth iteration.
🏆 Outperforms SPIN > 10% on using ~10 seed demonstrations
🤗 Built with the @huggingface alignment-handbook

###
https://towardsdatascience.com/how-to-build-a-generative-search-engine-for-your-local-files-using-llama-3-399551786965

How to Build a Generative Search Engine for Your Local Files Using Llama 3
Use Qdrant, NVidia NIM API, or Llama 3 8B locally for your local GenAI assistant
Nikola Milosevic (Data Warrior)
Published in Towards Data Science · 12 min read

On the 23rd of May, I received an email from a person at Nvidia inviting me to the Generative AI Agents Developer Contest by NVIDIA and LangChain. My first thought was that there was very little time, and given that we had recently had a baby and my parents were due to visit, I would not have time to participate. But then second thoughts came, and I decided I could code something and submit it. I thought for a few days about what I could make, and one idea stuck with me: an open-source generative search engine that lets you interact with local files. Microsoft Copilot already provides something like this, but I thought I could make an open-source version, for fun, and share a bit of what I learned during the quick coding of the system.

System Design
In order to build a local generative search engine or assistant, we would need several components:

An index with the content of the local files, with an information retrieval engine to retrieve the most relevant documents for a given query/question.
A language model to use selected content from local documents and generate a summarized answer
A user interface
How the components interact is presented in a diagram below.


System design and architecture. Qdrant is used for vector store, while Streamlit is for the user interface. Llama 3 is either used via Nvidia NIM API (70B version) or is downloaded via HuggingFace (8B version). Document chunking is done using Langchain. Image by author
First, we need to index our local files into an index that can be queried for their content. Then, when the user asks a question, we use the created index, with asymmetric paragraph or document embeddings, to retrieve the most relevant documents that may contain the answer. The content of these documents and the question are passed to the deployed large language model, which uses the content of the given documents to generate an answer. In the instruction prompt, we also ask the large language model to return references to the documents it used. Ultimately, everything is visualized to the user on the user interface.

Now, let’s have a look in more detail at each of the components.

Semantic Index
We are building a semantic index that will provide us with the most relevant documents based on the similarity of the files' content to a given query. To create such an index we will use Qdrant as a vector store. Interestingly, the Qdrant client library does not require a full installation of the Qdrant server and can compute document similarity for data that fits in working memory (RAM). Therefore, all we need to do is pip install the Qdrant client.

We can initialize Qdrant in the following way (note that the hf embedding object is defined later in the story, but the Qdrant client already needs to know which vectorization method and metric will be used):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from langchain_qdrant import Qdrant  # LangChain wrapper used below

client = QdrantClient(path="qdrant/")
collection_name = "MyCollection"
if client.collection_exists(collection_name):
    client.delete_collection(collection_name)

client.create_collection(collection_name, vectors_config=VectorParams(size=768, distance=Distance.DOT))
qdrant = Qdrant(client, collection_name, hf)
In order to create a vector index, we will have to embed the documents on the hard drive. For embeddings, we have to select the right embedding method and the right vector comparison metric. Several paragraph, sentence, or word embedding methods can be used, with varied results. The main issue when creating vector search over documents is the problem of asymmetric search. Asymmetric search problems are common in information retrieval and happen when one has short queries and long documents. Word or sentence embeddings are often fine-tuned to provide similarity scores for documents of similar size (sentences, or paragraphs). When that is not the case, information retrieval may fail.

However, we can find an embedding methodology that would work well on asymmetric search problems. For example, models fine-tuned on the MSMARCO dataset usually work well. MSMARCO dataset is based on Bing Search queries and documents and has been released by Microsoft. Therefore, it is ideal for the problem we are dealing with.

For this particular implementation, I have selected an already fine-tuned model, called:

sentence-transformers/msmarco-bert-base-dot-v5
This model is based on BERT and was fine-tuned using dot product as a similarity metric. We have already initialized the Qdrant client to use dot product as the similarity metric in the following line (note this model has an embedding dimension of 768):

client.create_collection(collection_name, vectors_config=VectorParams(size=768, distance=Distance.DOT))
We could use other metrics, such as cosine similarity, however, given this model is fine-tuned using dot product, we will get the best performance using this metric. On top of that, thinking geometrically: Cosine similarity focuses solely on the difference in angles, whereas the dot product takes into account both angle and magnitude. By normalizing data to have uniform magnitudes, the two measures become equivalent. In situations where ignoring magnitude is beneficial, cosine similarity is useful. However, the dot product is a more suitable similarity measure if the magnitude is significant.
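A quick numerical check of that equivalence claim (a throwaway NumPy snippet, not part of the search engine code):

```python
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))
print(np.isclose(cosine, dot_of_normalized))  # True: on unit-norm vectors the two coincide
```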

The code for initializing the MSMARCO model is as follows (if you have a GPU available, by all means use it):

from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/msmarco-bert-base-dot-v5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
The next problem we need to deal with is that BERT-like models have a limited context size, due to the quadratic memory requirements of transformer models. In the case of many BERT-like models, this context size is set to 512 tokens. There are two options: (1) we can base our answer only on the first 512 tokens and ignore the rest of the document, or (2) create an index where one document is split into multiple chunks and stored as chunks. In the first case, we would lose a lot of important information, so we picked the second variant. To chunk documents, we can use a prebuilt chunker from LangChain:

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_text(file_content)
metadata = []
for i in range(0, len(texts)):
    metadata.append({"path": file})
qdrant.add_texts(texts, metadatas=metadata)
In the provided part of the code, we chunk text into the size of 500 tokens, with a window of 50 overlapping tokens. This way we keep a bit of context on the places where chunks end or begin. In the rest of the code, we create metadata with the document path on the user’s hard disk and add these chunks with metadata to the index.

However, before we add the content of the files to the index, we need to read it. Even before we read files, we need to get all the files we need to index. For the sake of simplicity, in this project the user can define a folder that they would like to index. The indexer retrieves all the files from that folder and its subfolders recursively and indexes the files of supported types (we will look at how to support PDF, Word, PPT, and TXT).

We can retrieve all the files in a given folder and its subfolders recursively:

from os import listdir
from os.path import isfile, isdir, join

def get_files(dir):
    file_list = []
    for f in listdir(dir):
        if isfile(join(dir, f)):
            file_list.append(join(dir, f))
        elif isdir(join(dir, f)):
            file_list = file_list + get_files(join(dir, f))
    return file_list
Once all the files are retrieved in the list, we can read the content of files containing text. In this tool, for start, we will support MS Word documents (with extension “.docx”), PDF documents, MS PowerPoint presentations (with extension “.pptx”), and plain text files (with extension “.txt”).

In order to read MS Word documents, we can use the python-docx library. The function reading a document into a string variable would look something like this:

import docx

def getTextFromWord(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
A similar thing can be done with MS PowerPoint files. For this, we will need to download and install the python-pptx library and write a function like this:

from pptx import Presentation

def getTextFromPPTX(filename):
    prs = Presentation(filename)
    fullText = []
    for slide in prs.slides:
        for shape in slide.shapes:
            fullText.append(shape.text)
    return '\n'.join(fullText)
Reading text files is pretty simple:

f = open(file,'r')
file_content = f.read()
f.close()
For PDF files we will in this case use the PyPDF2 library:

import PyPDF2

reader = PyPDF2.PdfReader(file)
for i in range(0, len(reader.pages)):
    file_content = file_content + " " + reader.pages[i].extract_text()
Finally, the whole indexing function would look something like this:

file_content = ""
for file in onlyfiles:
    file_content = ""
    if file.endswith(".pdf"):
        print("indexing " + file)
        reader = PyPDF2.PdfReader(file)
        for i in range(0, len(reader.pages)):
            file_content = file_content + " " + reader.pages[i].extract_text()
    elif file.endswith(".txt"):
        print("indexing " + file)
        f = open(file, 'r')
        file_content = f.read()
        f.close()
    elif file.endswith(".docx"):
        print("indexing " + file)
        file_content = getTextFromWord(file)
    elif file.endswith(".pptx"):
        print("indexing " + file)
        file_content = getTextFromPPTX(file)
    else:
        continue
    text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)
    texts = text_splitter.split_text(file_content)
    metadata = []
    for i in range(0, len(texts)):
        metadata.append({"path": file})
    qdrant.add_texts(texts, metadatas=metadata)
print(onlyfiles)
print("Finished indexing!")
As we stated, we use TokenTextSplitter from LangChain to create chunks of 500 tokens with 50 token overlap. Now, when we have created an index, we can create a web service for querying it and generating answers.

Generative Search API
We will create a web service using FastAPI to host our generative search engine. The API will access the Qdrant client with the indexed data we created in the previous section, perform a search using a vector similarity metric, use the top chunks to generate an answer with the Llama 3 model, and finally provide the answer back to the user.

In order to initialize and import libraries for the generative search component, we can use the following code:

from fastapi import FastAPI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import Qdrant
from qdrant_client import QdrantClient
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import environment_var
import os
from openai import OpenAI

class Item(BaseModel):
    query: str
    def __init__(self, query: str) -> None:
        super().__init__(query=query)
As previously mentioned, we use FastAPI to create the API interface. We will utilize the qdrant_client library to access the indexed data we created and the langchain_qdrant library for additional support. For embeddings and for loading the Llama 3 model locally, we will use the PyTorch and Transformers libraries. Additionally, we will make calls to the NVIDIA NIM API using the OpenAI library, with the API keys (for both NVIDIA and Hugging Face) stored in the environment_var file we created.

We create the class Item, derived from Pydantic's BaseModel, to pass as a parameter to the request functions. It has one field, called query.

Now we can start initializing our machine-learning models:

model_name = "sentence-transformers/msmarco-bert-base-dot-v5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

os.environ["HF_TOKEN"] = environment_var.hf_token
use_nvidia_api = False
use_quantized = True
if environment_var.nvidia_key != "":
    client_ai = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key=environment_var.nvidia_key
    )
    use_nvidia_api = True
elif use_quantized:
    model_id = "Kameshr/LLAMA-3-Quantized"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )
else:
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )
In the first few lines, we load weights for the BERT-based model fine-tuned on MSMARCO data that we have also used to index our documents.

Then, we check whether nvidia_key is provided; if it is, we use the OpenAI library to call the NVIDIA NIM API. Through the NVIDIA NIM API we can use the large, 70B-parameter version of the Llama 3 instruct model. If nvidia_key is not provided, we load Llama 3 locally. Locally, however, at least on most consumer hardware, it is not feasible to load the 70B-parameter model, so we load either the Llama 3 8B-parameter model or an additionally quantized version of it. Quantization saves space and lets the model run with less RAM: Llama 3 8B usually needs about 14GB of GPU RAM, while the quantized Llama 3 8B can run on about 6GB of GPU RAM. Therefore, we load either the full or the quantized model depending on a parameter.
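As an alternative to pulling a pre-quantized checkpoint, one could quantize on the fly with bitsandbytes. A minimal sketch (not part of the original code), assuming bitsandbytes is installed and you have access to the gated Meta-Llama-3-8B-Instruct weights:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization; roughly halves memory again versus the fp16 8B model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)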

We can now initialize the Qdrant client:

client = QdrantClient(path="qdrant/")
collection_name = "MyCollection"
qdrant = Qdrant(client, collection_name, hf)
We also initialize FastAPI and create a first mock GET function:

app = FastAPI()


@app.get("/")
async def root():
    return {"message": "Hello World"}
This function returns JSON in the format {"message": "Hello World"}.

However, for this API to be functional, we will create two endpoints: one that performs only semantic search, and another that performs the search, passes the top 10 chunks to the model as context, and generates an answer that references the documents it used.

@app.post("/search")
def search(Item:Item):
query = Item.query
search_result = qdrant.similarity_search(
query=query, k=10
)
i = 0
list_res = []
for res in search_result:
list_res.append({"id":i,"path":res.metadata.get("path"),"content":res.page_content})
return list_res

@app.post("/ask_localai")
async def ask_localai(Item:Item):
query = Item.query
search_result = qdrant.similarity_search(
query=query, k=10
)
i = 0
list_res = []
context = ""
mappings = {}
i = 0
for res in search_result:
context = context + str(i)+"\n"+res.page_content+"\n\n"
mappings[i] = res.metadata.get("path")
list_res.append({"id":i,"path":res.metadata.get("path"),"content":res.page_content})
i = i +1

rolemsg = {"role": "system",
"content": "Answer user's question using documents given in the context. In the context are documents that should contain an answer. Please always reference document id (in squere brackets, for example [0],[1]) of the document that was used to make a claim. Use as many citations and documents as it is necessary to answer question."}
messages = [
rolemsg,
{"role": "user", "content": "Documents:\n"+context+"\n\nQuestion: "+query},
]
if use_nvidia_api:
completion = client_ai.chat.completions.create(
model="meta/llama3-70b-instruct",
messages=messages,
temperature=0.5,
top_p=1,
max_tokens=1024,
stream=False
)
response = completion.choices[0].message.content
else:
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)


terminators = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
input_ids,
max_new_tokens=256,
eos_token_id=terminators,
do_sample=True,
temperature=0.2,
top_p=0.9,
)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:])
return {"context":list_res,"answer":response}
Both functions are POST methods, and we use our Item class to pass the query in the JSON body. The first method returns the 10 most similar document chunks with their paths and assigns document IDs from 0 to 9. It therefore performs a plain semantic search using dot product as the similarity metric (this was defined when the Qdrant collection was created during indexing; recall the line containing distance=Distance.DOT).
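To illustrate, a minimal sketch of calling the /search endpoint from Python once the API is running on localhost:8000; the question text is just an example:

import requests

# Example query; assumes the FastAPI app is already running on port 8000
payload = {"query": "What are the main findings in my reports?"}
resp = requests.post("http://127.0.0.1:8000/search", json=payload)

for chunk in resp.json():
    print(chunk["id"], chunk["path"])
    print(chunk["content"][:200])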

The second function, called ask_localai, is slightly more complex. It contains the same search mechanism as the first method (so it may be easier to go through that code first to understand the semantic search), but adds a generative part. It creates a prompt for Llama 3 containing instructions in a system message saying:

Answer the user’s question using the documents given in the context. In the context are documents that should contain an answer. Please always reference the document ID (in square brackets, for example [0],[1]) of the document that was used to make a claim. Use as many citations and documents as it is necessary to answer a question.

The user’s message contains a list of documents structured as an ID (0–9) followed by the document chunk on the next line. To maintain the mapping between IDs and document paths, we create a list called list_res, which includes the ID, path, and content. The user prompt ends with the word “Question” followed by the user’s query.

The response contains the context and the generated answer. The answer is produced by either the Llama 3 70B model (via the NVIDIA NIM API), the local Llama 3 8B model, or the local quantized Llama 3 8B model, depending on the configuration described above.

The API can be started from a separate file containing the following lines of code (given that our generative component is in a file called api.py, since the first argument to Uvicorn maps to the module name):

import uvicorn


if __name__ == "__main__":
    uvicorn.run("api:app", host='0.0.0.0', port=8000, reload=False, workers=3)
Simple User Interface
The final component of our local generative search engine is the user interface. We will build a simple user interface using Streamlit, which will include an input bar, a search button, a section for displaying the generated answer, and a list of referenced documents that can be opened or downloaded.

The entire Streamlit user interface is less than 45 lines of code (44 to be exact):

import re
import streamlit as st
import requests
import json

st.title('_:blue[Local GenAI Search]_ :sunglasses:')
question = st.text_input("Ask a question based on your local files", "")
if st.button("Ask a question"):
    st.write("The current question is \"", question + "\"")
    url = "http://127.0.0.1:8000/ask_localai"

    payload = json.dumps({
        "query": question
    })
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    answer = json.loads(response.text)["answer"]
    rege = re.compile(r"\[Document [0-9]+\]|\[[0-9]+\]")
    m = rege.findall(answer)
    num = []
    for n in m:
        num = num + [int(s) for s in re.findall(r'\b\d+\b', n)]

    st.markdown(answer)
    documents = json.loads(response.text)['context']
    show_docs = []
    for n in num:
        for doc in documents:
            if int(doc['id']) == n:
                show_docs.append(doc)
    a = 1244  # arbitrary starting value used as a unique key for the download buttons
    for doc in show_docs:
        with st.expander(str(doc['id']) + " - " + doc['path']):
            st.write(doc['content'])
            with open(doc['path'], 'rb') as f:
                st.download_button("Download file", f,
                                   file_name=doc['path'].split('/')[-1], key=a)
            a = a + 1
It will all end up looking like this:


An example of an answered question in the built user interface. Screenshot by author.
Availability
The entire code for the described project is available on GitHub at https://github.com/nikolamilosevic86/local-genAI-search. In the past, I have worked on several generative search projects that have also resulted in publications. You can have a look at https://www.thinkmind.org/library/INTERNET/INTERNET_2024/internet_2024_1_10_48001.html or https://arxiv.org/abs/2402.18589.

Conclusion
This article showed how to combine generative AI with semantic search using Qdrant. It is essentially a Retrieval-Augmented Generation (RAG) pipeline over local files, with instructions to reference claims against the local documents. The whole code is about 300 lines long, and we even added some complexity by giving the user a choice between three different Llama 3 models. For this use case, both the 8B- and 70B-parameter models work quite well.

I wanted to explain the steps I took, in case this is helpful for someone in the future. However, if you want to use this particular tool, the easiest way is to get it from GitHub; it is all open source!
https://github.com/nikolamilosevic86/local-genAI-search

###
https://www.linkedin.com/pulse/intel-unwraps-lunar-lake-ai-pcs-new-cores-gpu-npu-ryan-shrout-jtx4c/?utm_source=share&utm_medium=member_ios&utm_campaign=share_via
Intel Unwraps Lunar Lake for AI PCs: new cores, new GPU, new NPU
Ryan Shrout
Technology and Marketing


June 4, 2024
Intel might be the last of the big four silicon providers to present this week at Computex, but they definitely aren't going to be the least vocal. Much of the press and analyst corps has been in Taiwan with Intel for the better part of a full week, going through two days of briefings and talks about the new Lunar Lake product architecture and its plans for release. And today during the company's keynote, they let the details out and began to talk about how they see Lunar Lake changing the game.

Intel spent multiple days and seemingly 100 different sessions talking to the tech press and media about Lunar Lake, and while I plan to dive into it in more depth in a future story, it's worth spending a bit of time here on the key points that make Lunar Lake different from Meteor Lake, the currently shipping Core Ultra processors, and why Intel is confident that it can take on both Qualcomm and AMD in the AI PC segment that has garnered so much attention.

In short, everything changes with Lunar Lake. New core IP, new power delivery, new GPU, new NPU, new memory system; it’s kind of astounding how different this product is from previous ones. The most visible change is the move to an on-package memory system that supports LPDDR5x, four channels, and up to 32GB of total system memory. This on-package design means that Intel can save a tremendous amount of power on the PHY (up to 40% they claim) while also creating a smaller physical footprint.


The processor itself is broken up into two tiles, a compute tile and a platform controller tile. On the compute tile Intel has built a 4+4 design, with four new Lion Cove P-cores and four new Skymont E-cores. The P-cores have a significant number of architectural changes including an 18 execution port design, 8x wider prediction unit, finer clock intervals, and more. Intel claims this results in a 14% improvement in IPC compared to the Redwood Cove core on MTL.



The E-cores got even more attention this time around, with a significant upgrade that includes a larger 4MB L2 cache and deeper queuing, all with the goal of providing broader workload coverage than the previous generation. The result is 68% better single-threaded floating point performance versus Crestmont.

These are impressive results if they hold, and it means that Intel thinks it has a breakthrough in power and computing efficiency for x86. Clearly the company is targeting the perception that only an Arm-based design like the Snapdragon X Elite can bring the battery life and low-power capabilities needed to compete with the likes of Apple's M-series of CPUs. We'll be looking to see if this holds true for video playback, real-world workloads, and other use cases.
Another reason Intel has confidence in its power story is an improved scheduling system and a new iteration of Thread Director that does more to put and keep threads on the E-cores, and in particular, FEWER E-cores. There is a point to be made here about the dual nature of the E-core and hybrid design that Intel has built: on one hand you can use the E-cores for more multi-threaded performance in less die area for high-performance parts (think higher-TDP platforms or desktop systems), or for power-efficiency characteristics like the implementation we are seeing on Lunar Lake. In one example Intel highlighted, this combined efficiency showed a Teams conferencing workload using 35% less power than with the previous approach.


Moving to the new GPU, this is the first instance of the new Xe2 Battlemage architecture, and Intel claims we will see as much as 50% more graphics performance versus Meteor Lake. It adds notable new features such as XMX units, which accelerate AI functions significantly and offer 67 TOPS of performance. There are new vector units and improved ray tracing units, and overall the expectation is that the GPU on Lunar Lake will be outstanding. There was no information on power or efficiency here, so I do believe that's an area we'll want to look at, but Intel's emphasis on the GPU is strong this time around.
Other tidbits that Intel discussed include an improved video engine (an area where Intel already had the industry-leading integration) supporting a brand-new video codec called VVC, or H.266, which offers up to a 10% bitrate reduction over AV1 at the same image quality. They also integrated solid connectivity improvements with Bluetooth 5.4, Wi-Fi 7, and TBT4, all to make sure Lunar Lake is a complete platform package.

The new NPU, now called NPU 4 as it's the 4th generation of this technology from Intel, scales from 2 neural engines to 6, increases on-chip bandwidth by 2x, and includes 12 SHAVE DSPs that accelerate LLM and transformer operations. The net result is a 48 TOPS integration that is clearly intended to meet the 40 TOPS requirement of the Microsoft Copilot+ PC program launched in May.

Intel showed NPU 4 offering up to 2x the performance at ISO power compared to NPU 3 (retroactively naming the NPU on Meteor Lake), and up to 4x the peak performance thanks to more compute engines, a higher MAC count, higher frequencies, and baseline architecture modifications.


This brings the total platform AI capability of Lunar Lake to 120 TOPS. That’s an impressive number combined with potentially impressive power efficiency, though even Intel itself will tell you that a TOPS number is wildly ineffective at communicating real-world AI performance. Software, drivers, optimization layers and ISV / developer relations will end up making the difference between the haves and the have nots in this AI PC race.

Intel hasn't gotten too specific on the timing of system availability, only stating that it would happen in Q3. In my conversations, Intel is adamant that Q3 will see not just some kind of "shipping" announcement or vague availability of a single SKU in China, but that you will be able to get your hands on designs by the end of September, in plenty of time for the holiday shopping season. And with all the debate around which platforms other than the Snapdragon X Elite will have Copilot+ features enabled and running, and when, that availability window will be critically important for Intel to stay relevant and ensure there is not a mind-share gap to other silicon platforms.

###
https://www.microsoft.com/en-us/worklab/our-year-with-copilot-what-microsoft-has-learned-about-ai-at-work
Our Year with Copilot: What Microsoft Has Learned About AI at Work
Getting AI right requires intention, experimentation, and some unexpected heroes. Here’s how you can apply insights from our experience to your own organization.

A little while back, Jared Spataro got an email from someone he couldn’t immediately place. It’s an experience common for executives: someone reaches out, and it’s clear you have an existing relationship, but you just can’t recall how you know them. So Spataro, Microsoft Corporate Vice President of AI at Work, instinctively turned to Copilot, prompting the chat interface to search across all his meetings, chats, documents, and emails to find out. “It was the most beautiful response I’ve ever seen,” says Spataro, one of the early architects of Copilot for Microsoft 365. It told him who the man was and how he knew him, when they first met, and what they had talked about.

“That was when I realized, Wow, this is going to change business in a really significant way.”

Spataro has been using Copilot for a year, along with hundreds of thousands of other Microsoft employees and early customers. The company-wide rollout has been marked by creative experimentation, continual learning, and even a little soul searching about the role of AI within an organization. As our own “customer zero,” we had a lot to learn: How quickly would people develop new skills and AI habits? How was it going to change day-to-day work, entire functions, and even entire teams? And how could we quickly scale those lessons across the company?

“It’s been a year of learning, but we have started to discover what Copilot can unlock for individual employees and companies as a whole,” Spataro says. “Most days it can feel like we’re on a rocket ship. More specifically, like we’re riding on the rocket ship as we’re building it.”

As with any rocket launch, this one required multiple test flights. We’ve spent the past year experimenting to see what works and what doesn’t, learning from our experiences, and then sharing what we’ve learned across the company and with our customers. Now, as every leader looks to build the AI-powered organization of the future, we want to share what we’ve learned.

01. GO FOR THE BIG WINS (AND THE EASY ONES TOO)

Who should get AI first? We prioritized functions that would drive ROI fastest.

How We Did It

“Every company will have a slightly different approach,” says Nathalie D’Hers, Corporate Vice President of Microsoft Digital, who oversaw the internal rollout to our more than 200,000 employees. “In our case, we zeroed in first on the roles that we knew would gain a lot of benefit.”

It made sense for sales to get first access: After all, they need to know the product inside and out to communicate its value to customers. But beyond that, we found that salespeople are uniquely positioned to benefit from Copilot, whether it’s cutting down on email triage to prioritize leads or gathering relevant info ahead of a client meeting. In early results, our salespeople saved 90 minutes of time per week; 83 percent of them felt they were more productive; and 67 percent said they were able to parlay the time savings into more time with customers.

Next came customer service and support. Nine months ago, they rolled out Copilot to all of their support professionals at once, so they could get the entire organization familiar with the technology fast. They had four objectives: reduce time to expertise for agents, streamline access to knowledge, reduce repetitive administrative tasks (to allow people to focus more on customer support, their key priority), and reduce the high volume of inquiries that come in every day.



The investment has paid off. According to a study last year from our Chief Economist’s office of nearly 10,000 Microsoft support agents, several teams saw, on average, a 12 percent reduction in case handling time and a 10 percent boost in case resolution.

And once HR got access, the department retooled an AI-powered employee resource called Ask HR, which expedited the response time for more complex questions about benefits, payroll, and other HR topics. With HR service advisors using Copilot, employees now get faster and more accurate answers to questions that previously might have taken several days to compile and respond to.

“Our HR service professionals are able to handle employee inquiries more efficiently,” says Kathleen Hogan, Microsoft Executive Vice President and Chief People Officer. “So far we are seeing a 26 percent reduction in initial response time thanks to Copilot.”

From there, we used what we learned from those early adopters to help guide the rollout to the rest of our company.

How You Can Do It Too

Put Copilot where it’s most useful. Whatever department or role you’re targeting, clearly identifying goals before a rollout helps leaders and employees determine from the start what’s working and what’s not. It also helps set appropriate benchmarks for success, whether that’s response times or more effective meetings or other metrics. For guidance, look to our Copilot Scenario Library, which includes suggested use cases and key performance indicators to help orgs determine how Copilot can help.

Go for easy wins too. As you’re going after function-level transformation, use AI to improve simple tasks as well. Gaining confidence and ability early on (for example, asking Copilot to recap a meeting) helps users maintain a healthy growth mindset when they hit the inevitable road bumps.

Give it to entire teams. Rolling out Copilot to entire teams at once—even if they’re small ones—is crucial in promoting peer-to-peer learning: It encourages sharing and learning among the group members, multiplying the impact of the technology. It also allows organizations to see patterns to help identify what’s working (or what’s not).

Make sure to track the impact. To understand how AI is transforming workplace behavior, you’ll need a way to measure its usage. A platform like our Copilot Dashboard can help you plan and measure the impact.

02. FIND YOUR INTERNAL CHAMPIONS

Their enthusiasm and knack for sharing their AI skills with others will encourage use across the organization.

How We Did It

Many of our employees went through a period of experimentation and playing around with Copilot before they started to drill down on what it could do. That’s where internal champions come in. “They don’t need to be AI experts,” says Callie August, a Copilot champion in the marketing organization. “Just people who are willing to test, learn, and be okay with being wrong.”

Through managers and rollout leaders, we identified people who were most excited to dive into the technology and share what they learned with their peers. We then empowered them to lead internal trainings and create quick demo videos to share their skills. That grassroots approach allows others to see the potential—and inspires them to explore the technology for themselves.

New Words for a New Way of Working

Essential AI terms every leader should know:

AI Aptitude: The ability to work alongside AI naturally, including writing great prompts, evaluating creative work, and checking for bias. Take action: Encourage everyone in your organization to always be asking, "How can AI help me?"

Other terms in the glossary: Context, The Copilot System, Delegate, Digital Artifact, The 11-by-11 Tipping Point, Internal Champion, Islands of Intelligence, Post-processing.
How You Can Do It Too

Employ champions at every level. An early-in-career employee is going to use Copilot in a very different way than someone who’s been managing a team for 20 years. With advocates at all levels of the organization, everyone from individual contributors to the C-suite can see relevant prompts and use cases.

Find the connectors. While technical expertise is great, it’s not a must. Look for people with a natural aptitude for leadership who can take complex information and distill it down in a relatable way. After all, your internal champions will be spending most of their time teaching and interacting with other people, not programming.

Make it official. Once you’ve identified your champions, establish an AI council. As we describe in our adoption playbook, the makeup of that group will be unique to what your company needs, but it should include people from your IT enablement team, your change management team, an executive sponsor, and a representative from risk management. And it should meet regularly to ensure that organizational insights are shared effectively.

Recognize and incentivize. “You have to celebrate people who are adopting AI and showcase their efforts,” says Hossein Nowbar, Chief Legal Officer at Microsoft. “We had early adopters of AI join me onstage during our department-wide summit to talk about how they are leveraging AI and the efficiencies they gained.” This recognition inspires others to join the AI journey.

03. DOUBLE DOWN ON SKILLING UP

Make employee training a priority from the start; the training will evolve over time as both trainers and learners become more comfortable with Copilot.

How We Did It

We held live one-on-one and group training sessions where people could ask questions and practice prompting in a variety of different work situations. Internal champions created self-guided courses that employees could access on a SharePoint site and answered questions and offered guidance to employees on Viva Engage.

We also offered employees training that accommodated different work schedules and learning preferences: Some people might not have time to join an in-person session, so they can watch videos or snapshot demos. Others may want to join big interactive sessions where they can ask questions of an expert in a live environment. And we created incentives for taking and passing training courses—like digital badges that declare one a “Copilot Champ.”

Our trainings evolved as we learned what worked and what didn’t. “In the beginning, I usually did 30-minute sessions where we’d focus on one app at a time,” August says. “Now we’ll do more comprehensive training where we show one piece of every app.” August eventually took her training sessions public, with a series of short videos explaining everything from how to mitigate writer’s block to what to do if you’re late to a meeting. “I thought about pain points. What are the things I hate to do at work, and are there Copilot prompts that can solve those tasks?”

How You Can Do It Too

Don’t reinvent the wheel. Because we created a variety of training materials for our own people, organizations looking to roll out Copilot now have resources available. Look to our adoption playbook and guidance for support on both technical readiness and getting your people prepared.

But also, use what works best for you. Orgs can create interactive libraries of prompts tailored to the work they do, along with recommendations on which app or apps to use, so that everyone can share what works with other teams across the organization.

Remember your managers. “One of our early learnings was that we need to be sure we are engaging with managers as a direct leader of employees,” says Sandeep Bhanot, Microsoft Corporate Vice President of Engineering & Data, who leads the team that supports our commercial sales organization. “We found that unless managers were fully bought in and saw the value of Copilot, they weren’t able to be champions of Copilot for their teams, which is critical to success. This uncovered the need for manager training, too, getting them engaged, skilled, and bought in to the value of Copilot so they could lead by example.”





04. BUILD THE AI HABIT

In any AI rollout, some people will be eager to adopt the new technology, and others less so. Embrace a growth mindset when it comes to experimenting with AI and then using it regularly.

How We Did It

Throughout our rollout, leaders asked their teams to consider how AI could help them do whatever task they were setting out to do, big or small, before they set out to do it. “When it came to Copilot, we asked ourselves two questions,” D’Hers says. “Number one, how can an AI tool help us be more efficient in this task? And number two, is this something that artificial intelligence can just help us do better?”

Soon enough, users across the organization were developing their own new work habits, based upon early victories and time-saving hacks. After every meeting, they might ask Copilot what their action items are. Or they’ll use Copilot to find material that might live in an email, a chat, or a PowerPoint deck.

Then it clicks: “When people see that this is a way to enhance their work, not a usurping of their work, there’s this spark of realization,” says Chris Fernandez, Microsoft Corporate Vice President of HR Services and Digital Employee Experiences.

Like any new routine, building the Copilot habit takes time. Our internal research has found that a time savings of just 11 minutes a day is all it takes for users to see the value from Copilot. And it takes about a business quarter, or 11 weeks, for most people using Copilot to see improvement in four key areas: productivity, work enjoyment, work-life balance, and the ability to attend fewer meetings.

How You Can Do It Too

Remember that it’s an organizational challenge, not only an IT challenge. “When I talk to customers,” says Colette Stallbaumer, General Manager of Copilot, “one predictor of success is if they have involvement at every level of the organization—from senior leadership to functional leaders to grassroots employee activation.” This approach signifies that a company is thinking of it as a new way of working, and not just a new technology.

Start small. To start building the habit, encourage your teams to find the immediate wins in their workday that deliver from the start. Instead of searching through folders for a deck, for example, encourage your people to use Copilot to locate the file. Executives, meanwhile, can use it to summarize long documents or drawn-out email chains.

Understand that this is new—really new. Unlike other new technology, there’s an emotional component to adopting AI. The shift can be unsettling, so it’s important to help people understand how AI can be valuable—to their time, for instance, or the quality and purpose of their work. Consider the note-taking ability in Microsoft Teams. “Someone might say, ‘But I usually take the notes in meetings!’” says Claire Sisson, Principal Group Product Manager, Microsoft Digital, who helped lead the company-wide rollout. “So we tell them, ‘Instead of taking notes, you can be a full participant in the meeting. Now you can focus your attention on the critical thinking you can bring.’”

Our biggest lesson over the past year? We all have to be thoughtful, iterative, and willing to evolve. And while a project this intricate might seem daunting, it’s so valuable that you can’t afford to put it off. “Leaders who see the opportunity,” Spataro says, “who are able to think creatively about what AI can do to rewire every aspect of the organization, are going to be the ones who gain a competitive edge—and that will set them apart in this next era of work.”

###
https://applied-llms.org/
What We’ve Learned From A Year of Building with LLMs
A practical guide to building successful LLM products, covering the tactical, operational, and strategic.
AUTHORS
Eugene Yan

Bryan Bischof

Charles Frye

Hamel Husain

Jason Liu

Shreya Shankar

PUBLISHED
June 8, 2024

Also published on O’Reilly Media in three parts: Tactical, Operational, Strategic. Also see podcast.

It's an exciting time to build with large language models (LLMs). Over the past year, LLMs have become "good enough" for real-world applications, and they're getting better and cheaper every year. Coupled with a parade of demos on social media, this is driving an estimated $200B of investment in AI by 2025. Furthermore, provider APIs have made LLMs more accessible, allowing everyone, not just ML engineers and scientists, to build intelligence into their products. Nonetheless, while the barrier to entry for building with AI has been lowered, creating products and systems that are effective (beyond a demo) remains deceptively difficult.

We’ve spent the past year building, and have discovered many sharp edges along the way. While we don’t claim to speak for the entire industry, we’d like to share what we’ve learned to help you avoid our mistakes and iterate faster. These are organized into three sections:

Tactical: Some practices for prompting, RAG, flow engineering, evals, and monitoring. Whether you’re a practitioner building with LLMs, or hacking on weekend projects, this section was written for you.
Operational: The organizational, day-to-day concerns of shipping products, and how to build an effective team. For product/technical leaders looking to deploy sustainably and reliably.
Strategic: The long-term, big-picture view, with opinionated takes such as “no GPU before PMF” and “focus on the system not the model”, and how to iterate. Written with founders and executives in mind.
We intend to make this a practical guide to building successful products with LLMs, drawing from our own experiences and pointing to examples from around the industry.