Summary

Today's AI news covers the release of Hugging Face's new FineWeb and FineWeb-Edu datasets, OpenAI's start of training on its next-generation flagship model, Gartner's best practices for maximizing ROI from enterprise generative AI, and SK Telecom's transformation into an AI company.

FineWeb Technical Report and FineWeb-Edu Released

Hugging Face blog, May 31, 2024

  • FineWeb: a 15-trillion-token large-scale English web dataset derived from CommonCrawl
  • FineWeb-Edu: high-quality educational subsets of 1.3T and 5.4T tokens
  • A text classifier is used to filter for educational content, with quality annotated by Llama-3-70B-Instruct
  • Independent MinHash deduplication applied per dump
  • FineWeb-Edu outperforms other datasets on MMLU, ARC, and OpenBookQA
  • Released under the ODC-By 1.0 license; fully reproducible (datatrove, nanotron)

OpenAI Begins Training Its New Flagship AI Model

The New York Times, May 28, 2024

  • OpenAI has begun developing a successor model to GPT-4
  • The new model is expected to power a range of AI products, including ChatGPT
  • A newly formed Safety and Security Committee will discuss how to manage the technology's risks
  • Controversy over the GPT-4o voice that sounded similar to Scarlett Johansson's
  • The next-generation model is expected in roughly nine months to a year or more

Best Practices for Scaling Generative AI Across the Enterprise to Maximize ROI

Gartner report, April 2024

  • Establish a process for prioritizing use cases
  • Develop a decision framework for build versus buy
  • Pilot use cases with scalability in mind
  • Design a flexible (composable) generative AI platform architecture
  • Adopt 'responsible AI'
  • Invest in data and AI literacy

"40% of Full-Time Staff Work on the AI Business"... SKT Reborn as an AI Company

Daum News (Newsis), May 9, 2024

  • 40% of SK Telecom's full-time employees now work on AI-related business
  • Q1 revenue of KRW 4.4746 trillion and operating profit of KRW 498.5 billion
  • Data center and cloud revenue up 25.6% and 38.3% year over year, respectively
  • The AI service app 'A.' (A-dot) reached 4 million cumulative subscribers
  • Plans to localize its AI personal assistant in cooperation with the Global Telco AI Alliance

That concludes today's AI news. See each link for more details.

Sources
###
https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
FineWeb Technical Report and FineWeb Edu released! 🍷 FineWeb is a 15T token open-source English web dataset derived from CommonCrawl! 📚 FineWeb-Edu is a 1.3T & 5.4T high-quality subset. 😍
TL;DR:
🍷 15T tokens in FineWeb outperforming other open datasets
📚 1.3T highest-quality educational dataset FineWeb-Edu
🧠 5.4T high-quality educational tokens in FineWeb-Edu-2
✅ Text Classifier for educational content filtering trained on synthetic data
🤖 Used Llama-3-70B-Instruct for educational quality annotations
🧹 Independent MinHash deduplication per dump
🎓 FineWeb Edu outperforms other datasets on MMLU, ARC, OpenBookQA
🆓 Available under ODC-By 1.0 license
🛠️ Full reproducibility with datatrove and nanotron
FineWeb 15T:
https://lnkd.in/ehEPRCam
Technical Report:
https://lnkd.in/eQNrb58w
FineWeb Edu 5T:
https://lnkd.in/eQtHZ3qA
FineWeb Edu 1.3T:
https://lnkd.in/e22vD8_D

Kudos to Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Colin Raffel, Leandro von Werra, Thomas Wolf, and Loubna Ben Allal for their relentless push for open science and transparency! 🤗

FineWeb: decanting the web for the finest text data at scale
AUTHORS
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Colin Raffel, Leandro von Werra, Thomas Wolf
AFFILIATION
HuggingFace
PUBLISHED
May 31, 2024
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 [1] and Mixtral [2] are not publicly available and very little is known about how they were created.

Recently, we released 🍷 FineWeb, a new, large-scale (15-trillion tokens, 44TB disk space) dataset for LLM pretraining. FineWeb is derived from 96 CommonCrawl snapshots and produces better-performing LLMs than other open pretraining datasets. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we carefully documented and ablated all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset itself, 🍷 FineWeb, is available here.

We are extremely thankful to the whole distill.pub team (Christopher Olah, Shan Carter, Ludwig Schubert in particular) for creating the template on which we based this blog post. Thanks also for inspiring us with exquisitely crafted articles and blog posts.
In this report we also introduce 📚 FineWeb-Edu, a subset of FineWeb constructed using scalable automated high-quality annotations for educational value, and which outperforms all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA. 📚 FineWeb-Edu is available in two sizes/filtering levels: 1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens (all tokens are measured with the GPT2 tokenizer [3]). You can download it here.

Both datasets are released under the permissive ODC-By 1.0 license.

TLDR: This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.

Web data
Finding the raw data
A common question often asked regarding web datasets used to train LLMs is “where do they even get all that data?”. There are generally two options:

you either crawl it yourself, like companies such as OpenAI or Anthropic (among others) do (see here and here)
you use a public repository of crawled webpages, like the one maintained by the non-profit CommonCrawl
To build 🍷 FineWeb, following what has been done in the past by a number of LLM training teams, we used CommonCrawl (CC) as a starting point. The Common Crawl non-profit organization has been crawling the web since 2007 and releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually every 1 or 2 months.

As an example, the latest CC crawl (April 2024) contains 2.7 billion web pages, totaling 386 TiB of uncompressed HTML text content 1 . Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format. 2

Processing at scale
Given the sheer size of the data involved, one of the main challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate on our processing decisions and easily try out new ideas, while appropriately parallelizing our workloads and providing clear insights into the data.

For this purpose, we developed datatrove [4], an open-source data processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this library. You will find the exact scripts we used in the datatrove repository.
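To make the shape of such a pipeline concrete, here is a minimal, hedged sketch of a datatrove job in the spirit of the processing described in this report (reader, text extraction, filters, writer). The class names follow datatrove's documentation, but the constructor arguments, default settings, and the local paths are assumptions and may differ between library versions; the exact production scripts live in the datatrove repository mentioned above.

# Hedged sketch of a datatrove-style pipeline (not the exact FineWeb scripts).
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
    GopherQualityFilter,
    GopherRepetitionFilter,
    LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader("path/to/commoncrawl/warc"),  # hypothetical local folder of WARC files
        URLFilter(),                             # blocklist-based URL filtering
        Trafilatura(),                           # text extraction from the raw HTML
        LanguageFilter(),                        # fastText language identification
        GopherRepetitionFilter(),                # MassiveText-style repetition filters
        GopherQualityFilter(),                   # MassiveText-style quality filters
        JsonlWriter("output/base_filtered"),     # write surviving documents to JSONL
    ],
    tasks=4,
)

if __name__ == "__main__":
    executor.run()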

What is good data?
This is probably the main question to keep in mind when creating a dataset. In most contexts and, in particular, in the context of large language model pretraining 3 , "high quality" is not a very well defined term [5], and not even a property of documents that can always be clearly perceived through direct human observation alone. [6]

It is still common to train a model on a given corpus considered "clean" (typically Wikipedia 4 ) and use it to check the perplexity on the dataset that we were trying to curate [7]. Unfortunately this does not always correlate with improved performance on a set of downstream tasks of interest [8], and as a result another often used approach is to train small models 5 on a representative subset of our dataset and evaluate them on a set of evaluation tasks. Small models are used because training costs and time are a function of model size. In this second approach, it is important to choose a diverse and representative set of dataset-evaluation tasks and try not to overfit to any one individual benchmark as it would risk hurting the generality of the obtained LLM after pretraining.

Yet another way to compare different datasets would be to train a model on each dataset and have humans rate and compare the generations of the models (like on the LMSYS Chatbot Arena) [9]. This would arguably provide the most reliable results in terms of representing real model usage, but getting ablation results this way is unfortunately expensive and slow. It also often requires the models to have undergone an instruction finetuning stage to acquire conversational capabilities, as pretrained models are not directly designed to follow instructions and are thus much more sensitive to prompt details. [10]

In this work, we went with the approach of training small models and evaluating them on a set of "early-signal" benchmark tasks. We believe this is a reasonable proxy for the quality of the data used to train these models, when keeping in mind the above-mentioned caveat around overfitting on the evaluation benchmarks.

Ablations and evaluation setup
To compare the impact of a given processing step, we trained two models on two versions of the dataset, one version processed with the extra step (the one we wish to evaluate) and another version with this step ablated (cut/removed). Apart from the data, these two models would be otherwise identical: the same number of parameters, architecture hyper-parameters, and trained on an equal number of randomly sampled tokens from each version of the data, for a single epoch — the only difference being thus the training data. We then evaluated each model on the same set of tasks and compared average scores.

Our ablation models were trained using nanotron. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most ablations we trained on ~28B tokens (roughly the Chinchilla [11] optimal training size for this model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.

We'll make the configuration to reproduce these ablation models available soon in Nanotron.
We evaluated the models using lighteval. We carefully selected a set of benchmarks for the ablations, choosing benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few billion" tokens). We generally used the following criteria to select these benchmarks among all the benchmarks available in lighteval:

small variance between runs trained on different samplings of the same dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter's effect
performance increasing monotonically (or close) over a training run: ideally, as the number of seen tokens increases, the performance on a high-signal benchmark should not decrease (which would be indicative of unreliable results at a small scale)
performance above random baseline for this task by at least a few standard deviations: given our small ablation models and trainings we usually don't reach extremely high scores on any benchmark, but we want to make sure that the scores we get are above random noise.
After consideration, we selected the following list of benchmarks:

CommonSense QA [12]
HellaSwag [13]
OpenBook QA [14]
PIQA [15]
SIQA [16]
WinoGrande [17]
ARC [18]
MMLU [19]
To ensure our checkpoint evaluation stayed within a limited timeframe, we capped the longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5 min on a single node of 8 GPUs - done in parallel to the training).

You can find the full list of tasks and prompts we used here.
The 🍷 FineWeb recipe
In the next subsections we will explain each of the steps taken to produce the FineWeb dataset.


You can find a fully reproducible datatrove config here.
Starting point: text extraction
CommonCrawl data is available in two main formats: WARC and WET. WARC (Web ARChive format) files contain the raw data from the crawl, including the full page HTML and request metadata. WET (WARC Encapsulated Text) files provide a text only version of those websites.

A large number of datasets take the WET files as their starting point. In our experience the default text extraction used by Common Crawl to create these WET files is suboptimal for the goals of LLM pretraining 6 and there are a variety of open-source libraries that provide better text extraction. We extracted the text content from the WARC files using the trafilatura library [20], which from visual inspection of the results provided good quality extraction when compared to other libraries.
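As a quick illustration of this extraction step (independent of the datatrove pipeline), trafilatura can be called directly on a fetched page; the URL below is a stand-in, not one of the crawled pages.

# Minimal trafilatura usage: pull the main text out of an HTML page.
import trafilatura

html = trafilatura.fetch_url("https://example.com/")  # or the HTML payload of a WARC record
text = trafilatura.extract(html)                       # main text content, or None if nothing usable
print(text)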

You can find a benchmark comparing several text extraction libraries here.
To validate this decision, we processed the 2019-18 dump directly using the WET files and with text extracted from WARC files using trafilatura 7 . We applied the same processing to each one (our base filtering+minhash, detailed below) and trained two models. While the resulting dataset is about 25% larger for the WET data (around 254 billion tokens), it proves to be of much worse quality than the one that used trafilatura to extract text from WARC files (which is around 200 billion tokens). Visual inspection of some samples confirms that many of these additional tokens on the WET files are unnecessary page boilerplate.

It is important to note, however, that text extraction is one of the most costly steps of our processing, so we believe that using the readily available WET data could be a reasonable trade-off for lower budget teams.

[Interactive plot: Aggregate Score (rolling window: 0)]
Base filtering
Filtering is an important part of the curation process. It consists in removing part of the data (be it words, lines, or even full documents) that lowers the performance of the model and is thus deemed to be “lower quality” in our eval-driven process of dataset crafting.

As a basis for our filtering we used part of the setup from RefinedWeb [21]. Namely, we:

Applied URL filtering using a blocklist to remove adult content
Applied a fastText language classifier [22] [23] to keep only English text with a score ≥ 0.65
Applied quality and repetition filters from MassiveText [24] (using the default thresholds)
After applying this filtering to each of the text extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data 8 .
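For reference, the language-filtering step boils down to a check like the following standalone sketch, which assumes fastText's publicly available lid.176.bin language-identification model has been downloaded beforehand (the actual pipeline runs this inside datatrove's language filter).

# Keep only documents classified as English with probability >= 0.65.
import fasttext

model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, threshold: float = 0.65) -> bool:
    # fastText's predict() rejects newlines, so collapse them first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold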

Deduplicating the data
Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset.

WHY DEDUPLICATE?
The web has many aggregators, mirrors, templated pages or just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages can even be introduced by the crawler itself, when different links point to the same page.

Removing these duplicates (deduplicating) has been correlated with improvements in model performance [25] and a reduction in memorization of pretraining data [26], which might allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data. [27] [28]

There are different ways to identify and even define duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two documents (or lines, paragraphs, or whatever other granularity level being used) 9 .

OUR DEDUPLICATION PARAMETERS
Following RefinedWeb [21], we decided to apply MinHash, a fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams 10 and compute minhashes using 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.

This would mean that for two documents with a similarity (s) of 0.7, 0.75, 0.8 and 0.85, the probability that they would be identified as duplicates would be 56%, 77%, 92% and 98.8% respectively (1-(1-s^8)^{14}). See the plot below for a match probability comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450 buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):

While the high number of hash functions in RefinedWeb allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable trade off.
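The quoted probabilities follow from the standard MinHash-LSH banding formula P(match) = 1 - (1 - s^r)^b for b buckets of r hashes each; the small check below reproduces them for both the FineWeb setup (14 x 8) and the RefinedWeb setup (450 x 20).

# Duplicate-detection probability as a function of true document similarity s.
def match_probability(s: float, buckets: int, hashes_per_bucket: int) -> float:
    return 1 - (1 - s ** hashes_per_bucket) ** buckets

for s in (0.70, 0.75, 0.80, 0.85):
    print(
        f"s={s:.2f}  "
        f"FineWeb(14x8)={match_probability(s, 14, 8):.3f}  "
        f"RefinedWeb(450x20)={match_probability(s, 450, 20):.3f}"
    )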

It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.

MORE DEDUPLICATION IS ALWAYS BETTER, RIGHT?
Initially, we were operating under the assumption that more deduplication is always better, so our first approach was to take the entire dataset (all 90+ dumps) and deduplicate them together as one big dataset using MinHash.

We did this in an iterative manner: starting with the most recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump not only within itself, but removing any document matching any other documents in the previously processed dumps.

For instance, for the second most recent dump (2023-40 at the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older the dumps, the larger the number of dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps, the deduplication step removed more than 90% of the base filtered data).

Deduplicating the dataset in this manner resulted in 4 trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion tokens subset, our ablation models showed next to no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).

[Interactive plot: Aggregate Score (rolling window: 5)]
This challenged our assumption that more deduplication would inevitably result in higher benchmark scores, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:

pre deduplication, this dump had ~490 billion tokens
after our iterative MinHash, ~31 billion tokens remained (94% of data had been removed)
As an experiment, we tried training two models on 28 billion tokens sampled from the following data from 2013-48:

the fully deduplicated remaining ~31 billion tokens (originally kept data)
171 billion tokens obtained by individually deduplicating (without considering the other dumps) the ~460 billion tokens that had been removed from this dump in the iterative dedup process (originally removed data) 11
[Interactive plot: Aggregate Score (rolling window: 0)]
These results show that, for this older dump taken in isolation, the data that was kept (10% of the original data) was actually worse than the 90% of data we removed 12 . This is also confirmed by visual inspection: originally kept data contains far more ads, lists of keywords and generally badly formatted text than originally removed data.

TAKING A STEP BACK: INDIVIDUAL DUMP DEDUP
We decided to experiment with an alternative approach: we deduplicated each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion tokens of data.

When training on a random sample from this dataset we see that it now matches RefinedWeb’s performance (see curves below):

[Interactive plot: Aggregate Score (rolling window: 5)]
We hypothesize that the main improvement gained from deduplication is the removal of very large clusters that are present in every single dump (you will find some examples of these clusters in the RefinedWeb paper, each containing hundreds of thousands of documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number of dumps) actually harms performance: data that does not find a duplicate match in any other dump might actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data).

While you might see some performance improvement when deduplicating a few dumps together, at the scale of the entire dataset (all the dumps), the side effect of upsampling lower-quality data seems to have the larger impact.

One possibility to consider is that as filtering quality improves, this effect may not be as prevalent, since the filtering might be able to remove some of this lower quality data. We also experimented with applying different, and often “lighter”, deduplication approaches on top of the individually deduplicated dumps. You can read about them further below.

A NOTE ON MEASURING THE EFFECT OF DEDUPLICATION
Given the nature of deduplication, its effect is not always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when deduplicating across all CommonCrawl dumps, as some URLs/pages are recrawled from one dump to the next.

To visualize the effect of scaling the number of training tokens on measuring deduplication impact, we considered the following (very extreme and unrealistic regarding the degree of duplication observed) theoretical scenario:

there are 100 CommonCrawl dumps (roughly accurate)
each dump has been perfectly individually deduplicated (every single document is unique in this dump)
each dump is a perfect copy of each other (maximum possible duplication across dumps, effectively the worst case scenario)
each dump has 200 billion tokens (for a total of 20 trillion, the resulting size of our individual dedup above)
each dump is made up of documents of 1k tokens (200M documents per dump)
We then simulated uniformly sampling documents from this entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image below you can see how often each document would be repeated.

For 1B almost all documents would be unique (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of documents being repeated twice, and a few even 4-8 times. At the larger scale of 1T (5% of the total dataset), the majority of the documents are repeated up to 8 times, with some being repeated up to 16 times.

We ran our performance evaluations for the deduplicated data at the 350B scale, which would, under this theoretical scenario, be made up of a significant portion of documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been removed.
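A small Monte Carlo version of this thought experiment is easy to write; the sketch below is ours, not the authors' code, and uses a with-replacement (binomial) approximation of uniform sampling, tracking only 1 million of the 200 million unique documents to keep memory small.

# 100 identical dumps of 200M documents (1k tokens each): every unique document
# appears 100 times in the 20T-token dataset. Count how many copies of each
# sampled unique document end up in subsets of various sizes.
import numpy as np

rng = np.random.default_rng(0)
n_unique = 200_000_000
tokens_per_doc = 1_000

for budget in (1e9, 1e10, 1e11, 3.5e11, 1e12):
    n_draws = int(budget // tokens_per_doc)
    # Each draw hits a given unique document with probability 1 / n_unique,
    # so its copy count in the sample is ~ Binomial(n_draws, 1 / n_unique).
    counts = rng.binomial(n_draws, 1.0 / n_unique, size=1_000_000)
    counts = counts[counts > 0]
    print(f"{budget:.0e} tokens: mean copies per sampled doc = {counts.mean():.2f}, max = {counts.max()}")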

OTHER (FAILED) GLOBAL APPROACHES
To build on top of our newly found method (independently deduplicating each dump), we attempted to improve the performance by further deduplicating the independently minhash deduped 20 trillion tokens of data with alternative global (over all dumps) deduplication methods. We explored the following approaches:

URL deduplication, where we only kept one document per normalized (lowercased) URL (71.5% of tokens removed, 5.6T left) — FineWeb URL dedup
Line deduplication:
remove all but 1 (randomly chosen) occurrence of each duplicated line (77.8% of tokens dropped, 4.4T left) — FineWeb line dedup
same as above, but only removing duplicate lines with at least 10 words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens dropped, 2.9T left) — FineWeb line dedup w/ min words
remove all but 1 occurrence of each span of 3 duplicated lines, with each number treated as 0 when finding duplicates (80.9% of tokens removed, 3.7T left) — FineWeb 3-line dedup
The performance of the models trained on each of these was consistently worse (even if to different degrees) than that of the original independently deduplicated data:

[Interactive plot: Aggregate Score (rolling window: 5)]
Additional quality filtering
By this point we had reached the same performance as the previous work we attempted to reproduce and extend, RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset [29], showed stronger performance on some benchmarks of our evaluation suite.

We therefore set out to find new filtering steps that would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point was to look into the processing of C4 itself.

C4: A DATASET THAT HAS STOOD THE TEST OF TIME
The C4 dataset was first released in 2019. It was obtained from the 2019-18 CommonCrawl dump by removing non-English data, applying some heuristic filters on both the line and document level, deduplicating on the line level, and removing documents containing words from a word blocklist.

Despite its age and limited size by current standards (around 175B GPT2 tokens), this dataset is, to this day, a common subset of typical LLM training data, being used in models such as the relatively recent Llama1 [30]. This success is due to the strong performance that models trained on this dataset exhibit, excelling in particular on the HellaSwag benchmark [13], one of the benchmarks in our "early signal" group with the highest signal-to-noise ratio. We experimented with applying each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump:

[Interactive plot: HellaSwag (rolling window: 3)]
applying “All filters” (drop lines not ending on punctuation marks, mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem ipsum” or a curly bracket, {) allows us to match C4’s HellaSwag performance ("All filters" vs "C4" curves, respectively).
The curly bracket filter, and the word lengths filter only give a small boost, removing 2.8% and 4.3% of tokens, respectively
The terminal punctuation filter, by itself, gives the biggest individual boost, but removes around 30% of all tokens (!)
The lorem_ipsum, javascript and policy rules each remove <0.5% of training tokens, so we did not train on them individually
"All filters except the (very destructive) terminal_punct" performs better than terminal_punct by itself, while removing less in total (~7%)
We decided to apply all C4 filters mentioned above except the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in the next section.
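For orientation, the C4-style rules we kept amount to something like the following rough sketch; the exact phrase lists and length thresholds in C4 (and in our datatrove implementation) differ, so treat the constants here as placeholders.

# Rough reading of the retained C4-style filters (terminal-punctuation rule omitted).
from typing import Optional

DOC_TRIGGERS = ("lorem ipsum", "{")            # drop the whole document
LINE_TRIGGERS = ("javascript", "cookies")      # drop individual boilerplate lines

def c4_like_filter(text: str) -> Optional[str]:
    low = text.lower()
    if any(t in low for t in DOC_TRIGGERS):
        return None
    kept = [line for line in text.split("\n") if not any(t in line.lower() for t in LINE_TRIGGERS)]
    return "\n".join(kept) if kept else None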

A STATISTICAL APPROACH TO DEVELOP HEURISTIC FILTERS
To develop new heuristic filters and select their thresholds we devised a systematic process:

we started by collecting a very large list of high level statistics of our datasets (over fifty different metrics) ranging from common document-level metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (inspired by MassiveText), on both a high quality and a lower quality web dataset;
we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;
we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;
we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.
Due to our (new) assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the statistics at a macro level, by looking at the distribution of these metrics for each one.

Perhaps not too surprisingly given our findings for deduplication, we found significant disparities in most of the metrics for the two deduplication methods. For instance, the line-char-duplicates metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup (0.0053 for 2015-22 and 0.0058 for 2013-48), to the global dedup (0.011 for 2015-22 and 0.01 for 2013-48), indicating that the latter had higher inter-document repetition.
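The metric-selection step can be illustrated with a toy version of the procedure: compute each candidate metric on a higher-quality and a lower-quality split, rank metrics by the Wasserstein distance between the two distributions, and then eyeball histograms of the top-ranked ones to pick thresholds. The numbers below are synthetic stand-ins, not FineWeb's actual statistics.

# Rank candidate document metrics by how differently they are distributed
# on a high-quality vs lower-quality dataset (illustrative data only).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
candidates = {
    "line_char_duplicates": (rng.beta(1, 150, 50_000), rng.beta(1, 80, 50_000)),
    "short_line_fraction":  (rng.beta(2, 10, 50_000),  rng.beta(3, 8, 50_000)),
}
ranked = sorted(
    ((name, wasserstein_distance(hq, lq)) for name, (hq, lq) in candidates.items()),
    key=lambda item: -item[1],
)
for name, distance in ranked:
    print(f"{name}: Wasserstein distance = {distance:.4f}")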

Following the process listed above for these datasets yielded seventeen candidate metric-threshold pairs. In the image below, you can see three of these histograms:

[Interactive histograms: Lines Ended With Punctuation]
As an example, we inspected the histograms of "fraction of lines ending with punctuation" (see the image above) and observed an increased document density of global MinHash at around 0.12. We then filtered with this threshold and found that the removed data had a higher amount of short lists or consisted of only document layout text ("Home", "Sign up", etc).

We then assessed the effectiveness of these seventeen newly created filters, by conducting several of our 28 billion tokens ablation runs on the 2019-18 crawl. Out of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated the most significant improvements on the aggregate score:

Remove documents where the fraction of lines ending with punctuation ≤ 0.12 (10.14% of tokens removed) — vs the 30% from the original C4 terminal punct filter
Remove documents where the fraction of characters in duplicated lines ≥ 0.1 (12.47% of tokens removed) — the original MassiveText threshold for this ratio is ≥ 0.2
Remove documents where the fraction of lines shorter than 30 characters ≥ 0.67 (3.73% of tokens removed)
When applying the three together, ~22% of tokens were removed.
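Read literally, the three filters translate into document-level checks like the sketch below; the punctuation set and the exact way duplicated-line characters are counted are our assumptions, and the production implementation in datatrove may differ in those details.

# Plain-Python reading of the three custom FineWeb filters (True = keep document).
def passes_custom_filters(text: str) -> bool:
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return False

    # 1. fraction of lines ending with punctuation must be > 0.12
    terminal_punctuation = (".", "!", "?", '"')
    punct_fraction = sum(line.rstrip().endswith(terminal_punctuation) for line in lines) / len(lines)
    if punct_fraction <= 0.12:
        return False

    # 2. fraction of characters in duplicated lines must be < 0.1
    # (here: characters in repeated occurrences, over all non-empty line characters)
    seen, duplicated_chars, total_chars = set(), 0, 0
    for line in lines:
        total_chars += len(line)
        if line in seen:
            duplicated_chars += len(line)
        seen.add(line)
    if duplicated_chars / max(total_chars, 1) >= 0.1:
        return False

    # 3. fraction of lines shorter than 30 characters must be < 0.67
    short_fraction = sum(len(line) < 30 for line in lines) / len(lines)
    return short_fraction < 0.67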
[Interactive plot: Aggregate Score (rolling window: 3)]
These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.

The final 🍷 FineWeb dataset
The final 🍷 FineWeb dataset comprises 15T tokens and includes the following previously mentioned steps, in order, each providing a performance boost on our group of benchmark tasks:

base filtering
independent MinHash deduplication per dump
a selection of C4 filters
our custom filters (mentioned in the previous section)
[Interactive plot: Aggregate Score (rolling window: 5)]
COMPARISONS WITH OTHER WEB-SCALE DATASETS
We compared 🍷 FineWeb with the following datasets that are usually considered the highest quality openly accessible web-scale datasets (we also indicate for each the approximate number of tokens in the public version of the dataset):

RefinedWeb (500B tokens) [21]
C4 (172B tokens) [29]
Dolma v1.6 (3T tokens) (the CommonCrawl part) [31] 13
The Pile (340B tokens) [32]
SlimPajama (627B tokens) [33]
RedPajama2 (20T tokens) [34] (deduplicated)
and our new 🍷 FineWeb (15T tokens) (this report)
You will find the 350B-tokens-trained ablation models openly accessible and gathered in this collection. We have uploaded checkpoints at every 1000 training steps. You will also find our full evaluation results here.

[Interactive plot: Aggregate Score (rolling window: 5)]
🍷 FineWeb is thus – to the best of our knowledge – the open dataset leading to the current highest model performances while allowing training on several trillion tokens.

📚 FineWeb-Edu

📚 FineWeb-Edu outperforms 🍷 FineWeb and all other open web datasets on our group of evaluation tasks.
📚 FineWeb-Edu is an additional development of FineWeb that we are excited to introduce in this tech report and openly release. 📚 FineWeb-Edu is based on a new approach that has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was notably used in the trainings of Llama 3 [1] and Phi3 [35], but its large-scale impact on web data filtering has, in our opinion, thus far not been publicly explored to its full potential.

The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper [35] stating:

Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.
Similarly, the Llama 3 blog post [36] notes:

We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.
However, these classifiers and filtered datasets are not publicly available. To further enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama-3-70B-Instruct to create 📚 FineWeb-Edu.

Annotating for educational quality at scale
We used Llama-3-70B-Instruct to annotate 500k samples from 🍷 FineWeb, scoring each for their educational quality on a scale from 0 to 5.

We explored various prompt formats to automatically extract an educational score using an LLM and found that the additive scale by Yuan et al. [37] worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.

Prompt for LLM annotation
Prompt used for Llama3 annotations of the educational score, also available here.
In terms of open-weight models to use for annotating the data, we experimented with several models including Mixtral-8x7B-Instruct and Mixtral-8x22B-Instruct, Llama-3-70B-Instruct, as well as a jury gathering the scores from these three models [38]. In our experiments we found that using Llama3 alone gave the most reliable results.

Training a classifier
To scale our annotations to the trillions of tokens in FineWeb, we used the Llama3-70B annotations to train a small classifier. The model we used was a Snowflake-arctic-embed embedding model with a classification head with a single regression output on top of it. We trained this model on the 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.

We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.

The classifier is available at: HuggingFaceFW/fineweb-edu-classifier. The training and inference code is available on GitHub.
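A hedged usage sketch for the released classifier follows: since it is a regression head on top of an encoder, the educational score is read from the single logit and rounded to an integer between 0 and 5 (check the model card for the exact recommended post-processing).

# Score a document with HuggingFaceFW/fineweb-edu-classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

text = "Photosynthesis is the process by which plants convert light into chemical energy."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(round(score))  # FineWeb-Edu keeps documents scoring >= 3 (>= 2 for the score-2 variant)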

Filtering and results
We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that using a threshold of 3 gave the best overall results. Although using a threshold higher than 3 improves performance on knowledge and reasoning intensive benchmarks, it significantly degrades performance on HellaSwag and PIQA. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.

[Interactive plot: MMLU]
Note: this ablation was conducted on 8B tokens from the 2024-10 dump for both the FineWeb and FineWeb-Edu subsets, which might not be representative of the entire dataset. The next ablation shows that the findings for threshold 3 hold on a longer run of 350B tokens from all FineWeb dumps, except for HellaSwag, where we noticed a slight performance degradation.

We built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.3 trillion educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:

[Interactive plot: MMLU]
Here are the key highlights of the ablation results above:

📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.
It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.
This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.
Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under HuggingFaceFW/fineweb-edu-score-2.

You can find the two datasets along with the classifier used for the filtering in this collection.

Bonus: CommonCrawl over time
Just like fine wine, not all crawls are created equal.

While ablating filtering steps, we noticed that certain crawls outperformed others by a significant margin. We decided to investigate this phenomenon.

Benchmark performance by crawl
For each crawl, we trained two 1.8B models on 27 billion tokens randomly sampled from that crawl's data (after the base filtering and MinHash deduplication steps), where each run had a different random 27BT sampling of this data. We trained 192 such models, totaling over 60 thousand H100 GPU-hours. We subsequently took the last 3 checkpoints for both runs and plotted the average of these 6 data points per crawl.

The plot below clearly shows that some dumps perform far worse than others. Each year has a different color, and the number of crawls per year also varies.

[Interactive plot: Aggregate Score]
We investigated possible causes for this behaviour such as changes in the most common URLs of each dump, as well as potential benchmark contamination, but could not find any conclusive explanation. We leave further investigation for future work.

Synthetic data
We wondered if the strong performance of the last few crawls could be, in part, attributed to the presence of a larger quantity of synthetic data (data generated by LLMs). Such a change would not be surprising due to the recent increase in popularity of LLMs, notably of ChatGPT.

Since, to the best of our knowledge, there is no foolproof method to detect synthetic data, we opted to use a proxy metric: we measured the frequency of the following words in each crawl: "delve", "as a large language model", "it's important to note", "rich tapestry", "intertwined", "certainly!", "dive into", all of which are commonly used by ChatGPT.

It is important to note that not all samples containing one of these phrases were necessarily generated by ChatGPT (and also that many ChatGPT generated samples do not contain any of these phrases), but assuming that the amount of synthetic data did not change across crawls, one would expect these frequencies to remain approximately constant over time.
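The proxy itself is just phrase counting; a minimal version over a sample of documents from a given crawl could look like the sketch below (the phrase list is the one quoted above).

# Fraction of documents in a sample containing at least one ChatGPT-flavored phrase.
PHRASES = (
    "delve", "as a large language model", "it's important to note",
    "rich tapestry", "intertwined", "certainly!", "dive into",
)

def proxy_frequency(documents: list[str]) -> float:
    hits = sum(any(p in doc.lower() for p in PHRASES) for doc in documents)
    return hits / max(len(documents), 1)

# e.g. compare proxy_frequency(sample_2023_14) with proxy_frequency(sample_2024_10)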

The results are shown in the following plot:


While the frequency remained approximately constant until 2023-14 (ChatGPT was released at the end of 2022), we find a steep increase of our proxy metric in recent crawls. While this simple test is not enough to conclude that ChatGPT completions and other synthetic data is improving the quality of the most recent crawl, it at the very least does not seem to drastically harm it.

We expect to continue seeing increasing quantities of synthetic data on new CC crawls. However, while for relatively small trainings this data does not seem to harm performance (and might actually improve it), it is not clear that this holds for much larger trainings.

Conclusion and looking forward
Through our open science efforts we hope to keep shining a light on the black box that is the training of high performance large language models as well as to give every model trainer the ability to create state-of-the-art LLMs. We are excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.

In the short term, we are looking forward to applying the learnings from (English) FineWeb to other languages. While English currently dominates the LLM landscape, we believe that making high quality web data in other languages as accessible as possible would be incredibly impactful.

In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.

###
https://www.nytimes.com/2024/05/28/technology/openai-gpt4-new-model.html?smid=nytcore-ios-share&referringSource=articleShare&sgrp=c-cb
OpenAI Says It Has Begun Training a New Flagship A.I. Model
The advanced A.I. system would succeed GPT-4, which powers ChatGPT. The company has also created a new safety committee to address A.I.’s risks.



As Sam Altman’s OpenAI trains its new model, its new Safety and Security committee will work to hone policies and processes for safeguarding the technology, the company said. Credit: Jason Redmond/Agence France-Presse — Getty Images
By Cade Metz, reporting from San Francisco

May 28, 2024
OpenAI said on Tuesday that it had begun training a new flagship artificial intelligence model that would succeed the GPT-4 technology that drives its popular online chatbot, ChatGPT.

The San Francisco start-up, which is one of the world’s leading A.I. companies, said in a blog post that it expected the new model to bring “the next level of capabilities” as it strove to build “artificial general intelligence,” or A.G.I., a machine that can do anything the human brain can do. The new model would be an engine for A.I. products including chatbots, digital assistants akin to Apple’s Siri, search engines and image generators.

OpenAI also said it was creating a new Safety and Security Committee to explore how it should handle the risks posed by the new model and future technologies.

“While we are proud to build and release models that are industry-leading on both capabilities and safety, we welcome a robust debate at this important moment,” the company said.

OpenAI is aiming to move A.I. technology forward faster than its rivals, while also appeasing critics who say the technology is becoming increasingly dangerous, helping to spread disinformation, replace jobs and even threaten humanity. Experts disagree on when tech companies will reach artificial general intelligence, but companies including OpenAI, Google, Meta and Microsoft have steadily increased the power of A.I. technologies for more than a decade, demonstrating a noticeable leap roughly every two to three years.

OpenAI’s GPT-4, which was released in March 2023, enables chatbots and other software apps to answer questions, write emails, generate term papers and analyze data. An updated version of the technology, which was unveiled this month and is not yet widely available, can also generate images and respond to questions and commands in a highly conversational voice.

Days after OpenAI showed the updated version — called GPT-4o — the actress Scarlett Johansson said it used a voice that sounded “eerily similar to mine.” She said that she had declined efforts by OpenAI’s chief executive, Sam Altman, to license her voice for the product and that she had hired a lawyer and asked OpenAI to stop using the voice. The company said the voice was not Ms. Johansson’s.

Technologies like GPT-4o learn their skills by analyzing vast amounts of digital data, including sounds, photos, videos, Wikipedia articles, books and news articles. The New York Times sued OpenAI and Microsoft in December, claiming copyright infringement of news content related to A.I. systems.

Digital “training” of A.I. models can take months or even years. Once the training is completed, A.I. companies typically spend several more months testing the technology and fine-tuning it for public use.

That could mean that OpenAI’s next model will not arrive for another nine months to a year or more.

As OpenAI trains its new model, its new Safety and Security committee will work to hone policies and processes for safeguarding the technology, the company said. The committee includes Mr. Altman, as well as the OpenAI board members Bret Taylor, Adam D’Angelo and Nicole Seligman. The company said the new policies could be in place in the late summer or fall.

This month, OpenAI said Ilya Sutskever, a co-founder and one of the leaders of its safety efforts, was leaving the company. This caused concern that OpenAI was not grappling enough with the dangers posed by A.I.

Dr. Sutskever had joined three other board members in November to remove Mr. Altman from OpenAI, saying Mr. Altman could no longer be trusted with the company’s plan to create artificial general intelligence for the good of humanity. After a lobbying campaign by Mr. Altman’s allies, he was reinstated five days later and has since reasserted control over the company.

Dr. Sutskever led what OpenAI called its Superalignment team, which explored ways of ensuring that future A.I. models would not do harm. Like others in the field, he had grown increasingly concerned that A.I. posed a threat to humanity.

Jan Leike, who ran the Superalignment team with Dr. Sutskever, resigned from the company this month, leaving the team’s future in doubt.

OpenAI has folded its long-term safety research into its larger efforts to ensure that its technologies are safe. That work will be led by John Schulman, another co-founder, who previously headed the team that created ChatGPT. The new safety committee will oversee Dr. Schulman’s research and provide guidance for how the company will address technological risks.

###
https://cxotoday.com/specials/maximizing-roi-best-practices-for-scaling-generative-ai-across-the-enterprise/
Gartner: Best Practices for Scaling Generative AI Across the Enterprise to Maximize ROI
Generative AI has the potential to transform businesses across a wide range of industries. Business and technology leaders are convinced that its advantages outweigh the potential risks. However, a lack of understanding of generative AI best practices is one factor holding companies back from adopting it.
Gartner predicts that through 2025 at least 30% of generative AI projects will be abandoned after the proof-of-concept (POC) stage due to poor data quality, inadequate risk controls, rising costs, and similar issues. This is why chief information officers (CIOs) should draw on a range of best practices when scaling generative AI.
Establish a process for prioritizing use cases
The first step in deploying generative AI is to set the organization's AI goals and hold an upfront discussion about what is achievable. The next step is to gather potential use cases that can be piloted with generative AI technology. Prioritizing use cases is an essential strategic element for the organization. Prioritization should not be decided by the appeal of the technology or the "flashiest demo", but by a holistic assessment of the value proposition to the organization. Vendors sometimes propose discounted proofs of concept that reflect their own capabilities.
The key, however, is to deliver tangible business value, to identify highly feasible use cases, and to avoid growing risks and costs when scaling. Prioritization should therefore involve not only the technology teams but also the business units that will use the generative AI applications, along with security and risk teams.
Create a decision framework for build versus buy
Scaling generative AI requires a systematic approach to build-versus-buy decisions for potential use cases across the organization. Building is advisable when it can secure a competitive advantage and the organization has the skills and knowledge the process requires. CIOs should weigh all the pros and cons of each approach before deciding whether to build or buy generative AI.
Pilot use cases for scalability
Companies should pilot new ideas to internalize what the technology can do and learn through experimentation. During pilots, data, privacy, security, and usability must be examined carefully. Next, to decide whether to scale, refine, or stop, teams must review the use cases and adopt an agile mindset before testing.
It is also critical to establish a sandbox environment that allows safe experimentation across the organization. It should include appropriate security and privacy measures as well as access to multiple generative AI models for repeated experimentation within the sandbox. This gives developers the flexibility to choose the model best suited to each specific use case.
Design a flexible generative AI platform architecture
The generative AI landscape consists of four critical layers: infrastructure, models, AI engineering tools, and applications. Companies should confirm that their platform architecture is highly flexible and scalable and has governance built in. The generative AI model landscape is changing rapidly and will keep evolving in ways we cannot imagine today, as the rise of open-source and domain-specific models shows. Organizations therefore need a highly flexible architecture that lets them swap models later on.
'Responsible AI' at the forefront of generative AI
Generative AI offers companies great opportunities, but the risks are correspondingly high, which is why the term 'responsible AI' emerged. Responsible AI is an umbrella term covering every aspect of making appropriate business and ethical choices when adopting AI.
Without such a clear framework, organizations will struggle to balance the technology's benefits and risks. Organizations should define and publicize a vision for responsible AI by establishing clear principles and policies across key areas such as fairness, toxicity mitigation, ethics, risk management, privacy, sustainability, and regulatory compliance.
Invest in data and AI literacy
Unlike traditional AI, generative AI is used actively and directly by a large share of employees. Broad deployment of generative AI requires the ability to identify relevant use cases and to implement and operate the corresponding AI applications. Emphasis should also be placed on AI literacy, the ability to use AI in context.
Companies should run tailored training for business units and teach data and AI literacy skills to senior management. It is also essential to upskill technology teams with generative-AI-specific skills in areas such as prompt engineering, model validation and tuning, infrastructure management, and responsible AI.

Maximizing ROI: Best Practices for Scaling Generative AI Across the Enterprise
CXOtoday News Desk, by Arun Chandrasekaran

Generative artificial intelligence (GenAI) has the potential to revolutionize businesses in various industries. Most business and technology leaders are convinced that the advantages of GenAI outweigh any potential risks. However, lack of understanding about emerging industry best practices is constraining organization wide pilots and scalable production deployments. Through 2025, Gartner predicts that at least 30% of GenAI projects will be abandoned after proof of concept (POC) due to poor data quality, inadequate risk controls, escalating costs or unclear business value. To avoid obstacles to scaling GenAI, chief information officers (CIOs) must embrace the following emerging industry best practices.

Establish a Continuous Process to Prioritize Use Cases
The initial step in the GenAI journey is to establish the organization’s AI goals and engage in a preliminary discussion about what is achievable. The subsequent step involves gathering potential use cases that can be piloted with GenAI technologies. Prioritizing GenAI use cases is a strategic imperative for organizations. Such prioritization should not be driven solely by the appeal of technology, or the “flashiest demo,” but by a holistic assessment of its value proposition to the organization. While vendors may suggest discounted POCs reflecting their capabilities, the key is to identify use cases that deliver tangible business value and are the most technically feasible and avoid those that could lead to growing risks and costs when scaled in production. The task of prioritizing should be a collective decision, involving not only the technology teams but also the business lines that will utilize the GenAI application as well as security and risk teams.

Create a Decision Framework for Build Versus Buy
Scaling GenAI requires a systematic approach to build versus buy decisions for the many potential use cases in the organization. Ideally, businesses should consider building an AI product when it can provide a competitive advantage in their industry and when they have the necessary skills and knowledge for the process. In the context of GenAI, use cases where enterprises want to minimize risks for regulatory or brand equity reasons may also warrant a build approach. CIOs must evaluate all pros and cons of the approach before determining their build-versus-buy decisions for GenAI.

Pilot Use Cases for Scalability
Businesses must run pilots to try new ideas, build muscle memory within the organization on the art of the possible and learn by experimentation. They must ensure that pilots are built with scalability in mind by envisioning future data, privacy, security and usability needs. An agile mindset must be adopted before experimenting and testing the use cases to determine the next step — scale, refine or stop. A sandbox environment must be established to allow for safe experimentation throughout the organization. This should include appropriate security and privacy measures, as well as the availability of multiple GenAI models for experimentation and iteration within the sandbox. This allows developers to have the flexibility to select the most suitable models for each specific use case.

Design a Composable Generative AI Platform Architecture
The GenAI landscape consists of four critical layers — infrastructure, models, AI engineering tools and applications. Enterprises must ensure that their platform architecture is composable, scalable and embedded with governance upfront. The GenAI model landscape is fast-paced and will constantly evolve, often in ways we cannot envision today (such as the rise of open-source models and domain models). Organizations must ensure there is enough flexibility in their architecture to swap models through composability.

Responsible AI Is at the Forefront of All Generative AI Efforts
GenAI creates not only new opportunities, but also new risks. Responsible AI is an umbrella term for all the different aspects of making appropriate business and ethical choices when adopting AI. Without a clear responsible AI framework, organizations will struggle to balance the benefits and risks of this technology. Organizations need to define and publicize a vision for responsible AI with clear principles and policies across focus areas like fairness, toxicity mitigation, ethics, risk management, privacy, sustainability and regulatory compliance.

Invest in Data and AI Literacy
Unlike traditional AI, GenAI is poised for active and direct use by a large segment of employees. This broad deployment requires a strong emphasis on AI literacy: the ability to utilize AI in context with competency to identify relevant use cases, as well as implement and operate corresponding AI applications. Enterprises must create and conduct personalized training programs targeting various business functions and training senior management on the data and AI literacy skills. Upskilling the technology teams with GenAI-specific skills in areas such as prompt engineering, model validation and tuning, infrastructure management and responsible AI is crucial.

Additional analysis on GenAI for enterprises will be presented during the Gartner Data & Analytics Summit, taking place April 24-25 in Mumbai, India. (The author is Arun Chandrasekaran, Distinguished VP Analyst at Gartner, and the views expressed in this article are his personal)

Read more at: https://cxotoday.com/specials/maximizing-roi-best-practices-for-scaling-generative-ai-across-the-enterprise/

###
https://v.daum.net/v/20240509060115802
"40% of Full-Time Employees Work on the AI Business"... SKT Reborn as a True AI Company
By reporter Yoon Jung-min, May 9, 2024, 06:01
SKT discloses AI staffing figures, a rarity in mobile carrier earnings reports
AI revenue results becoming visible... "We will mobilize every option to secure funds for AI investment"

[Photo: SK Telecom's Q1 AI business results. SK Telecom said on May 8 that it posted consolidated Q1 revenue of KRW 4.4746 trillion, operating profit of KRW 498.5 billion, and net profit of KRW 361.9 billion. (Photo courtesy of SK Telecom)]

"Of our 5,286 full-time employees, 40% work on artificial intelligence (AI) related business."

SK Telecom has disclosed a metric not previously seen from mobile carriers: its AI headcount. Having declared its ambition to leap to a global AI company, SK Telecom put the figure forward to show it is moving beyond the mold of a traditional carrier.

With the 5G market reaching maturity and the budget (MVNO) phone market expanding, growth in the telecom service industry has been slowing. SK Telecom therefore focused early on non-telecom businesses related to AI and secured strong AI talent, and revenue from those businesses is now on a growth path.

Data centers and cloud boom on rising generative AI demand
Service results becoming visible, including 4 million A-dot subscribers
[Photo: SK Telecom said it will target the AI data center (AIDC) market with global server maker Supermicro and GPU cloud company Lambda; the photo shows SK Telecom CEO Ryu Young-sang and Supermicro's chief growth officer after signing an AIDC cooperation MOU at MWC24. (Photo courtesy of SK Telecom)]

[Photo: Students of the fifth 'SKT AI Fellowship' cohort, a program that fosters future AI talent through real corporate research projects, at their completion ceremony. (Photo courtesy of SK Telecom)]

According to SK Telecom on May 9, enterprise segment revenue in Q1 was KRW 415.4 billion, up 8.7% from a year earlier.

The company said its data center and cloud businesses, which belong to its AI infrastructure area, drove the enterprise revenue growth. Data center and cloud revenue came to KRW 58.3 billion and KRW 35.0 billion, up 25.6% and 38.3% year over year, respectively.

Data centers are facilities that secure data-processing capacity, and demand for high-performance data centers has been rising along with the recent growth in generative AI demand. SK Telecom's revenue grew on steadily rising utilization, and it plans to develop the business into AI data centers.

For example, SK Telecom is preparing an AI data center solution package that pools the capabilities of group affiliates such as SK hynix, SK Broadband, SK Enmove, and Sapeon, and is pursuing global partnerships with U.S. server maker Supermicro and GPU cloud company Lambda.

It also said it is pushing to build new data centers in the Seoul metropolitan area, expanding capacity to more than 200 megawatts (MW), double its current level, with the goal of becoming the No. 1 domestic operator.

For the cloud business, the company plans to expand around multi-cloud offerings as AI demand grows and to scale up in earnest around cost-optimization technology.

Delivering on its AI ambitions, however, requires a correspondingly large pool of skilled people, which is seen as the reason SK Telecom disclosed its AI headcount alongside its earnings. The company said that as of April 1, 40% (2,118) of its 5,286 full-time employees contributed directly or indirectly to AI business and development work, an increase of 573 from January 1 of last year.

[Photo: A veterinarian reads a cat's X-ray with X Caliber at an animal hospital; SK Telecom said it has partnered with ATX and Smitech, the largest medical device distributors in Australia and Singapore respectively, and expanded the service's diagnostic coverage from dogs to cats as it grows its pet AI healthcare business at home and abroad. (Photo courtesy of SK Telecom)]

Perhaps reflecting that AI talent, SK Telecom has kept improving its AI services. As a result, cumulative subscribers to its AI service app 'A.' (A-dot) reached 4 million as of the end of March, up 120% since its official launch in September last year, a rise attributed to features such as call recording and summarization and real-time call translation.
SK Telecom plans to work with the founding members of the Global Telco AI Alliance (GTAA), including Deutsche Telekom of Germany, e& of the UAE, Singtel of Singapore, and SoftBank of Japan, to localize A-dot as an AI personal assistant (PAA) service in their markets.

'X Caliber', a diagnostic-assist service that analyzes pets' X-ray images with AI to help veterinarians diagnose diseases, is now used by 570 hospitals, roughly five times more than a year earlier. X Caliber has already entered Australia and Singapore and will pursue commercialization in the U.S., Europe, and Southeast Asia within the year.

SK Telecom says it intends to secure capacity to invest in future growth areas such as AI. On the company's Q1 earnings call on May 8, CFO Kim Yang-seob said of capital allocation for AI investment: "Typically about KRW 1 trillion of cash flow is left over each year, but since we consistently pay out more than KRW 700 billion in cash dividends, it is true that we do not have much room to maneuver when it comes to investment and debt management."

He added, "We plan to pursue the creation of additional resources through every means the company can think of, including improving profitability through cost control, asset monetization, and more efficient investment."