Commit 370837e
1 Parent(s): 2783986
updates
web.py CHANGED
@@ -272,9 +272,10 @@ def web_data():
 of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
 documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
 """),
-
-
-
+view_data(
+"data/sample_terminal_punc.json",
+0,
+"Sample documents with lines that are removed by the rule of terminal punctuation",
 ),
 H4('2.1 Word "Javascript"'),
 P("""
@@ -284,9 +285,10 @@ def web_data():
 propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
 The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
 """),
-
-
-
+view_data(
+"data/sample_java.jsonl",
+0,
+"Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
 ),
 H4("2.2 Other Rules from RefinedWeb"),
 P("""
@@ -296,9 +298,10 @@ def web_data():
 - The line matches the pattern “r'^\\d+\\s+likes$'”,
 - The line contains only one word.
 """),
-
-
-
+view_data(
+"data/sample_refinedweb_line.json",
+0,
+"Sample documents with lines that are removed by the RefinedWeb rules",
 ),
 H4("2.3 Toxic Lines"),
 P("""
@@ -308,15 +311,19 @@ def web_data():
 line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
 the bad words from English but also consider the bad words from other languages.
 """),
-
-
-
+view_data_static(
+json.load(open("data/toxic_lines.json")),
+"Sample documents with toxic lines",
 ),
 H3("3. Document-Level Filtering"),
 P("""
 In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
-Overview of all the quality signals that are used for filtering.
-
+Overview of all the quality signals that are used for filtering."""),
+view_data_static(
+json.load(open("data/all_signals.json")),
+"Overview of all the quality signals that are used for filtering",
+),
+P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
 Most of these quality signals were initially introduced by Gopher [2] and subsequently adopted by later
 studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
 of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
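The prose touched by these hunks describes three line-level rules: the refined "javascript" rule (the word plus one extra keyword), the `^\d+\s+likes$` pattern, and single-word lines. A minimal sketch of how such rules might be applied is given below. It is not the code from `web.py` or from the underlying pipeline; the function names, the case-insensitive substring matching, and the whitespace stripping are all assumptions made for illustration.

```python
import re

# Keywords quoted in the diff text. Requiring one of them alongside
# "javascript" is the refinement described there; substring matching
# (so "enabled" also counts) is an assumption of this sketch.
JS_KEYWORDS = ("enable", "disable", "require", "activate", "browser")

# Pattern quoted in the diff text; lowercasing and stripping the line
# before matching is an assumption.
LIKES_PATTERN = re.compile(r"^\d+\s+likes$")


def is_javascript_notice(line: str) -> bool:
    """True if the line mentions "javascript" together with an extra keyword."""
    lowered = line.lower()
    return "javascript" in lowered and any(k in lowered for k in JS_KEYWORDS)


def is_likes_counter(line: str) -> bool:
    """True if the whole line is a counter such as "12 likes"."""
    return LIKES_PATTERN.match(line.strip().lower()) is not None


def is_single_word(line: str) -> bool:
    """True if the line contains exactly one whitespace-delimited token."""
    return len(line.split()) == 1


def filter_lines(text: str) -> str:
    """Drop every line that triggers one of the three rules above."""
    kept = [
        line
        for line in text.splitlines()
        if not (
            is_javascript_notice(line)
            or is_likes_counter(line)
            or is_single_word(line)
        )
    ]
    return "\n".join(kept)
```

With this sketch, a line such as "Please enable JavaScript in your browser." is dropped, while "JavaScript closures capture variables." is kept. That is the kind of false positive the added keyword is meant to avoid.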