What is this?
Luca Soldaini has collected documents from Common Crawl that mention a Creative Commons license. This was done using a series of regular expressions.

This process produces false positives, primarily because Creative Commons licenses sometimes apply only to elements of a page (e.g. pictures, part of the document) rather than to the full page.
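
To illustrate why a regex-only pass over-matches, here is a minimal sketch. The pattern and the three HTML snippets are hypothetical; the actual expressions used for the crawl are not shown in this sheet.

```python
import re

# Hypothetical pattern (an assumption, not the expressions actually used):
# matches a creativecommons.org license URL or the phrase "Creative Commons".
CC_PATTERN = re.compile(
    r"creativecommons\.org/(?:licenses|publicdomain)/[a-z\-]+(?:/\d\.\d)?"
    r"|creative\s+commons",
    re.IGNORECASE,
)


def mentions_cc(html: str) -> bool:
    """Return True if the text contains any Creative Commons mention."""
    return CC_PATTERN.search(html) is not None


# Made-up snippets for illustration only.
full_page = (
    '<footer>Content is licensed under '
    '<a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>.</footer>'
)
image_only = "<figcaption>Photo by Jane Doe, Creative Commons BY-SA.</figcaption>"
no_license = "<p>All rights reserved.</p>"

for name, html in [("full_page", full_page), ("image_only", image_only), ("no_license", no_license)]:
    print(name, mentions_cc(html))
```

Both `full_page` and `image_only` match, even though only the first license covers the page text. The second is the kind of false positive that the labeling step is meant to filter out.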

I've used the prompt shown below to refine this license extraction. The plan is to use GPT-4o or Llama-3-70b to generate silver-standard labels, and then train a classifier on those labels to run this filtering at scale.

Task
In the GPT-4o and Llama-3-70b sheets, review the label assigned in Column C and enter the corrected label in Column D.

"YES" means the page is OK for training; "NO" means that the license is not appropriate.
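
Once Column D is filled in, the model labels can be compared against the corrections. The following is a minimal sketch with made-up rows; reading the sheet itself (for example from a CSV export) is left out, and the function name is an assumption.

```python
from collections import Counter

# Hypothetical rows: (model label from Column C, human correction from Column D).
rows = [
    ("YES", "YES"),
    ("YES", "NO"),
    ("NO", "NO"),
    ("NO", "NO"),
    ("YES", "YES"),
]


def agreement(pairs):
    """Return the share of rows where both labels match, plus per-pair counts."""
    counts = Counter(pairs)
    total = sum(counts.values())
    agree = sum(n for (model, human), n in counts.items() if model == human)
    rate = agree / total if total else 0.0
    return rate, counts


rate, counts = agreement(rows)
print(f"agreement: {rate:.2f}")
print("model YES corrected to NO:", counts[("YES", "NO")])
```

The count of ("YES", "NO") pairs is the most relevant one here, since those are pages the model would have wrongly admitted for training.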

Prompt used for labeling
Given the following HTML snippets enclosed in ```quotes```, respond YES if the Creative Commons license mentioned in the snippets refers to the text content of the web page, otherwise respond NO.

Examples of "NO" include:
- The Creative Commons license refers to the images on the page.
- There is another license mentioned in the snippet.
- The license is not in an official Creative Commons format.
- The snippet mentions that "some", but not "all", of the content is licensed under Creative Commons.

Examples of "YES" include:
- Copyright or ©️ is mentioned on the page AND the text content is licensed under Creative Commons; it is OK if the page is copyrighted as long as a Creative Commons license is also mentioned.
- All "work" or "content" is described as being Creative Commons licensed.
- The Creative Commons tag appears in the footer of the page with no extra content.
- The content is in the public domain (i.e., a public domain license is mentioned).

You can use the source URL to help you make an assessment; for example, government and non-profit web pages are more likely to carry Creative Commons licenses. However, DO NOT rely EXCLUSIVELY on the source URL to make a decision.

DO NOT return anything other than "YES" or "NO".

Source URL: {source}

Snippet:
```
{snippet}
```
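
As a rough sketch of how the template might be filled in and the reply validated, the helpers below substitute the `{source}` and `{snippet}` slots and accept only a bare YES or NO. The function names are assumptions, the instructions string is a short stand-in for the full prompt, and the model call itself is omitted.

```python
from typing import Optional

# The snippet delimiter is built at runtime so this example contains no literal fence.
FENCE = "`" * 3

# Only the trailing slots of the prompt are reproduced here.
PROMPT_TAIL = "Source URL: {source}\n\nSnippet:\n" + FENCE + "\n{snippet}\n" + FENCE


def build_prompt(instructions: str, source: str, snippet: str) -> str:
    """Append the filled-in source URL and snippet to the instructions."""
    return instructions.rstrip() + "\n\n" + PROMPT_TAIL.format(source=source, snippet=snippet)


def parse_label(response: str) -> Optional[str]:
    """Normalize a model reply; return None if it is not a bare YES or NO."""
    cleaned = response.strip().strip("\"'.").upper()
    return cleaned if cleaned in {"YES", "NO"} else None


prompt = build_prompt(
    "Respond YES or NO.",  # stand-in for the full instructions
    "https://example.org/about",
    "<footer>Licensed under CC BY 4.0</footer>",
)
print(prompt)
print(parse_label(" yes.\n"))
print(parse_label("YES, because the footer says so"))
```

Returning `None` for anything other than a bare label keeps replies that ignore the "YES or NO only" instruction out of the silver data instead of guessing at them.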