1 | Collection ID | Description | Collection start date | Collection end date | Total files (number of URLs collected) | Total seeds (preserved sites) | Total volume of WARC files (TB) | Collection in production? (URL, page and image indexed) |
---|---|---|---|---|---|---|---|---|
2 | AWP1 | 1st complete crawl of the Portuguese web, mainly from the .PT domain, in 2008. | 2008-02-12 | 2008-03-06 | 56,046,288 | 154,787 | 1.60 | TRUE |
3 | AWP2 | 2nd complete crawl of the Portuguese web, mainly from the .PT domain, in 2008. | 2008-03-11 | 2008-05-30 | 48,718,404 | - | 1.60 | TRUE |
4 | AWP3 | 3rd complete crawl of the Portuguese web, mainly from the .PT domain, in 2008. | 2008-10-21 | 2008-12-10 | 51,863,006 | 193,294 | 2.00 | TRUE |
5 | AWP4 | 4th complete crawl of the Portuguese web, mainly from the .PT domain, in 2009. | 2009-05-01 | 2009-05-31 | 68,776,707 | 366,880 | 2.50 | TRUE |
6 | AWP5 | 5th complete crawl of the Portuguese web, mainly from the .PT domain, in 2009. | 2009-10-01 | 2009-10-31 | 119,135,566 | 373,323 | 3.80 | TRUE |
7 | AWP6 | 6th complete crawl of the Portuguese web, mainly from the .PT domain, in 2009. | 2009-12-01 | 2009-12-31 | 118,810,364 | 340,018 | 3.50 | TRUE |
8 | AWP7 | 7th complete crawl of the Portuguese web, mainly from the .PT domain, in 2010. | 2010-05-01 | 2010-05-31 | 87,988,812 | 389,957 | 2.90 | TRUE |
9 | AWP8 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2010. The AWP8 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP7 as baseline. Thus, the files that remained unchanged from the AWP7 complete crawl were not archived (duplicated) on the AWP8 incremental crawl. | 2010-08-01 | 2010-08-31 | 75,771,317 | 411,562 | 1.90 | TRUE |
10 | AWP9 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2011. The AWP9 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP7 as baseline. Thus, the files that remained unchanged from the AWP7 complete crawl were not archived (duplicated) on the AWP9 incremental crawl. | 2011-01-20 | 2011-03-22 | 81,114,575 | 473,588 | 2.10 | TRUE |
11 | AWP10 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2011. The AWP10 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP7 as baseline. Thus, the files that remained unchanged from the AWP7 complete crawl were not archived (duplicated) on the AWP10 incremental crawl. | 2011-05-17 | 2011-06-17 | 76,710,879 | 704,837 | 2.10 | TRUE |
12 | AWP11 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2011. The AWP11 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP7 as baseline. Thus, the files that remained unchanged from the AWP7 complete crawl were not archived (duplicated) on the AWP11 incremental crawl. | 2011-06-30 | 2011-08-05 | 69,790,126 | 509,280 | 2.30 | TRUE |
13 | AWP12 | Incremental crawl of the Portuguese web, mainly from the .PT domain, from December of 2011 to February of 2012. The AWP12 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP7 as baseline. Thus, the files that remained unchanged from the AWP7 complete crawl were not archived (duplicated) on the AWP12 incremental crawl. | 2011-12-30 | 2012-02-28 | 90,122,611 | 328,846 | 2.70 | TRUE |
14 | AWP15 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed between the 5th of November of 2013 and the 13th of January of 2014. The AWP15 crawl did NOT use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2013-11-05 | 2014-01-13 | 139,296,363 | 1,088,962 | 6.00 | TRUE |
15 | AWP16 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2014. The AWP16 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP15 as baseline. Thus, the files that remained unchanged from the AWP15 complete crawl were not archived (duplicated) on the AWP16 incremental crawl. | 2014-09-23 | 2014-11-24 | 203,407,698 | 609,201 | 8.50 | TRUE |
16 | AWP17 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed in 2015. The AWP17 crawl did NOT use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2015-04-10 | 2015-06-09 | 243,803,163 | 818,360 | 9.56 | TRUE |
17 | AWP18 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2015. The AWP18 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP17 as baseline. Thus, the files that remained unchanged from the AWP17 complete crawl were not archived (duplicated) on the AWP18 incremental crawl. | 2015-05-13 | 2015-11-05 | 214,527,044 | 518,848 | 7.82 | TRUE |
18 | AWP19 | Incremental crawl of the Portuguese web, mainly from the .PT domain, from November of 2015 to May of 2016. The AWP19 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP18 as baseline. Thus, the files that remained unchanged from the AWP18 crawl were not archived (duplicated) on the AWP19 incremental crawl. | 2015-11-12 | 2016-01-05 | 199,209,953 | 658,777 | 7.10 | TRUE |
19 | AWP20 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2016. The AWP20 crawl did NOT use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2016-02-05 | 2016-05-03 | 238,822,615 | 686,668 | 12.00 | TRUE |
20 | AWP21 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2016. The AWP21 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP17 as baseline. Thus, the files that remained unchanged from the AWP17 complete crawl were not archived (duplicated) on the AWP21 incremental crawl. | 2016-05-30 | 2016-08-03 | 193,212,877 | 660,385 | 7.20 | TRUE |
21 | AWP22 | Incremental crawl of the Portuguese web, mainly from the .PT domain, performed from October of 2016 to January of 2017. The AWP22 crawl is incremental because it was performed using DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/) taking the content of AWP21 as baseline. Thus, the files that remained unchanged from the AWP21 crawl were not archived (duplicated) on the AWP22 incremental crawl. | 2016-10-31 | 2017-01-04 | 162,188,798 | 767,310 | 6.50 | TRUE |
22 | AWP23 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2017. The AWP23 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2017-01-01 | 2017-05-07 | 225,221,781 | 925,138 | 13.00 | TRUE |
23 | AWP24 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2017. The AWP24 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2017-07-31 | 2017-09-11 | 165,161,477 | 905,973 | 7.80 | TRUE |
24 | AWP25 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed between December of 2017 and January of 2018. The AWP25 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2017-12-07 | 2018-01-18 | 152,505,397 | 705,296 | 6.70 | TRUE |
25 | AWP26 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2018. The AWP26 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2018-04-09 | 2018-07-02 | 233,145,629 | 711,105 | 14.00 | TRUE |
26 | AWP27 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2018. The AWP27 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2018-07-14 | 2018-07-28 | 111,848,303 | 640,912 | 13.00 | TRUE |
27 | AWP28 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2018. The AWP28 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2018-10-10 | 2018-11-02 | 363,393,207 | 710,924 | 21.00 | TRUE |
28 | AWP29 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2019. The AWP29 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2019-04-01 | 2019-04-26 | 386,225,779 | 835,462 | 22.90 | TRUE |
29 | AWP30 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2019. The AWP30 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2019-08-29 | 2019-11-03 | 632,266,742 | 1,038,954 | 38.00 | TRUE |
30 | AWP31 | Incremental crawl of the Portuguese web, mainly from the .PT domain, from December of 2019 to January 2020. The AWP31 used DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2019-12-06 | 2020-01-31 | 366,370,692 | 1,587,726 | 24.00 | TRUE |
31 | AWP32 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2020. The AWP32 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2020-03-03 | 2020-05-04 | 836,246,709 | 1,409,215 | 29.00 | FALSE |
32 | AWP33 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2020. The AWP33 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2020-06-01 | 2020-07-15 | 373,880,292 | 366,939 | 27.00 | FALSE |
33 | AWP34 | Incremental crawl of the Portuguese web, mainly from the .PT domain, in 2020. The AWP34 used DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2020-09-01 | 2020-10-08 | 115,489,181 | 302,504 | 8.40 | FALSE |
34 | AWP35 | Incremental crawl of the Portuguese web, mainly from the .PT domain, performed between December 2020 and January 2021. The AWP35 used DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2020-12-02 | 2021-01-07 | 111,563,667 | 294,671 | 7.70 | FALSE |
35 | AWP36 | Complete crawl of the Portuguese web, mainly from the .PT domain, in 2021. The AWP36 crawl did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2021-03-01 | 2021-04-07 | 634,166,744 | 264,098 | 17.00 | FALSE |
36 | AWP37 | Incremental crawl of the Portuguese web, mainly from the .PT domain, performed between June 2021 and July 2021. The AWP37 used DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2021-06-02 | 2021-07-16 | 568,450,007 | 273,468 | 9.80 | FALSE |
37 | AWP38 | Incremental crawl of the Portuguese web, mainly from the .PT domain, performed between October 2021 and November 2021. The AWP38 used DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2021-10-08 | 2021-11-26 | 560,843,698 | 291,518 | 12.00 | FALSE |
38 | AWP39 | Incremental crawl of the Portuguese web, mainly from the .PT domain, performed between January 2022 and February 2022. The AWP39 did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2022-01-19 | 2022-02-23 | 884,737,568 | 727,039 | 27.00 | FALSE |
39 | AWP40 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed between April 2022 and June 2022. The AWP40 did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2022-04-11 | 2022-06-06 | 1,886,902,750 | 926,745 | 45.00 | FALSE |
40 | AWP41 | Incremental crawl of the Portuguese web, mainly from the .PT domain, performed between July and August 2022. The AWP41 used DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2022-07-27 | 2022-08-29 | 669,362,162 | 891,452 | 18.00 | FALSE |
41 | AWP42 | Incremental crawl of the Portuguese web, mainly from the .PT domain, performed between October and November 2022. The AWP42 used DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2022-10-07 | 2022-11-18 | 137,515,323 | 889,714 | 7.50 | FALSE |
42 | AWP43 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed between January and February 2023. The AWP43 did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2023-01-18 | 2023-02-13 | 528,583,505 | 940,497 | 18.00 | FALSE |
43 | AWP44 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed between April and May 2023. The AWP44 did not use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2023-04-12 | 2023-05-08 | 417,476,124 | 686,889 | 15.00 | FALSE |
44 | AWP45 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed between July and September 2023. The AWP45 did use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2023-07-19 | 2023-09-04 | 233,899,488 | 658,636 | 21.00 | FALSE |
45 | AWP46 | Complete crawl of the Portuguese web, mainly from the .PT domain, performed from October 2023. The AWP46 did use DeDuplicator (http://landsbokasafn.GitHub.io/DeDuplicator/). | 2023-10-16 | | | 517,578 | | FALSE |
46 | BlogsSapo2018 | Special collection of blogs from the Portuguese website blogs.sapo.pt, in 2018. | 2018-08-02 | 2018-08-31 | 2,414,012 | | 0.20 | FALSE |
47 | Roteiro | Collection donated to the Arquivo.pt. Pages collected in 1996, from José Magalhães's book "Novo Roteiro Prático da Internet". | 1996-01-01 | 1996-12-31 | 75,174 | | 0.00 | TRUE |
48 | IA | Collection acquired from the Internet Archive. Pages of the Portuguese web collected by the Internet Archive between 1996 and 2007. | 1996-01-01 | 2007-12-31 | 123,889,349 | | 2.00 | TRUE |
49 | BN | Collection donated to the Arquivo.pt. Pages collected by the Biblioteca Nacional de Portugal and the Instituto de Engenharia de Sistemas e Computadores (INESC) as part of the "Recolha" ("collection") project. This partnership collected Web pages between 2004 and 2005 about the Portuguese "Legislativas" elections, held in February 2005. | 2004-01-01 | 2005-12-31 | 14,373,817 | | 0.17 | TRUE |
50 | Tomba | Collection integrated from the Tomba project that includes Web pages collected between 2005 and 2006. The Tomba project was the Portuguese web archive prototype, following the Tumba! project developed by the research group XLDB of the University of Lisbon and supported by FCCN. | 2005-01-01 | 2006-12-31 | 37,000,000 | | 1.30 | TRUE |
51 | Dinis | Collection donated to the Arquivo.pt. Web pages collected between 1997 and 2007, courtesy of Dinis Manuel Alves. | 2000-01-01 | 2007-12-31 | 4,000 | | 0.00 | TRUE |
52 | Weblog | Special collection of blogs from the hosting platform weblog.com.pt before being closed in 2012. | 2012-01-01 | 2012-12-31 | 563,350 | 7,012 | 0.03 | FALSE |
53 | UL | Special collection on the University of Lisbon domain (ul.pt), performed several times by the Arquivo.pt team as tests in the beginning of the service, in 2008. It brings together 6 small collections. | 2008-02-18 | 2008-03-03 | 411,171 | | 0.03 | TRUE |
54 | BlocoEsquerda | Special collection of the first website of the political party Bloco de Esquerda, performed as a test for special collections, in 2012. | 2012-10-01 | 2012-10-31 | 36 | 1 | 0.00 | FALSE |
55 | DEM-IST | Collection donated to the Arquivo.pt with the website content of the Department of Mechanical Engineering of the Instituto Superior Técnico, Lisboa. Files are dated from 1998 to 2006. | 1998-01-01 | 2006-12-31 | 3,536 | 4 | 0.00 | FALSE |
56 | DinisAlves2018 | Collection donated to the Arquivo.pt with content from two websites: www.portosdeportugal.pt, website of the Associação de Portos de Portugal, dated from August to December 2012, and portofigueiradafoz.pt, website of the Porto da Figueira da Foz, dated from February to December 2013. This collection is courtesy of Dinis Alves in 2018. | 2012-08-01 | 2013-12-18 | 11,349 | 2 | 0.00 | FALSE |
57 | NON | Collection donated to the Arquivo.pt, first as local files of the NON magazine website, one of the first Portuguese online magazines, then converted into WARC files and integrated into the Arquivo.pt. Courtesy of Rui Bebiano in 2020. Find the recovered zonanon.com website at Arquivo.pt. | 1996-01-01 | 2002-12-31 | 8,303 | 2 | 0.00 | FALSE |
58 | FAWP1 | 1st block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from March to July 2010. | 2010-03-23 | 2010-07-06 | 57,352,532 | 332 | 2.00 | TRUE |
59 | FAWP2 | 2nd block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2010. With Deduplicator from the second day. | 2010-07-07 | 2010-09-21 | 33,957,637 | 359 | 0.80 | TRUE |
60 | FAWP3 | 3rd block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from September to December 2010. With Deduplicator. | 2010-09-22 | 2010-12-31 | 45,623,908 | 360 | 0.87 | TRUE |
61 | FAWP4 | 4th block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2011. With Deduplicator. | 2011-01-01 | 2011-03-31 | 42,094,295 | 360 | 1.01 | TRUE |
62 | FAWP5 | 5th block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2011. With Deduplicator. | 2011-04-01 | 2011-06-30 | 41,941,367 | 360 | 1.30 | TRUE |
63 | FAWP6 | 6th block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2011. With Deduplicator. | 2011-07-01 | 2011-09-30 | 42,436,564 | 360 | 1.80 | TRUE |
64 | FAWP7 | 7th block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2011. With Deduplicator. | 2011-10-01 | 2011-12-31 | 43,833,826 | | 1.90 | TRUE |
65 | FAWP8 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2012. With Deduplicator. | 2012-01-01 | 2012-03-31 | 45,522,178 | | 2.10 | TRUE |
66 | FAWP10 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2012. With Deduplicator. | 2012-07-01 | 2012-09-30 | 25,254,390 | | 0.90 | TRUE |
67 | FAWP11 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2012. With Deduplicator. | 2012-10-01 | 2012-12-31 | 8,560,866 | | 0.30 | TRUE |
68 | FAWP12 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to June 2013. With Deduplicator. | 2013-01-01 | 2013-06-30 | 10,423,663 | | 0.48 | TRUE |
69 | FAWP14 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2013. With Deduplicator. | 2013-07-01 | 2013-09-30 | 27,686,676 | | 1.20 | TRUE |
70 | FAWP15 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2013. With Deduplicator. | 2013-10-01 | 2013-12-31 | 16,461,666 | | 0.77 | TRUE |
71 | FAWP17 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2014. With Deduplicator. | 2014-04-01 | 2014-06-30 | 18,800,556 | | 1.00 | TRUE |
72 | FAWP18 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2014. With Deduplicator. | 2014-07-01 | 2014-09-30 | 29,436,673 | | 1.60 | TRUE |
73 | FAWP19 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2014. With Deduplicator. | 2014-10-01 | 2014-12-31 | 39,843,502 | | 2.00 | TRUE |
74 | FAWP20 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2015. With Deduplicator. | 2015-01-01 | 2015-03-31 | 38,936,485 | 324 | 1.80 | TRUE |
75 | FAWP21 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2015. With Deduplicator. | 2015-04-01 | 2015-06-30 | 38,636,837 | 327 | 1.80 | TRUE |
76 | FAWP22 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2015. With Deduplicator. | 2015-07-01 | 2015-09-30 | 44,702,505 | 252 | 2.10 | TRUE |
77 | FAWP23 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2015. With Deduplicator. | 2015-10-01 | 2015-12-31 | 57,405,014 | 257 | 3.70 | TRUE |
78 | FAWP24 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2016. With Deduplicator. | 2016-01-01 | 2016-03-31 | 60,725,384 | 291 | 4.50 | TRUE |
79 | FAWP25 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2016. With Deduplicator. | 2016-04-01 | 2016-06-30 | 63,894,659 | 296 | 4.90 | TRUE |
80 | FAWP26 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2016. With Deduplicator. | 2016-07-01 | 2016-09-30 | 63,780,872 | 299 | 6.10 | TRUE |
81 | FAWP27 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2016. With Deduplicator. | 2016-10-01 | 2016-12-31 | 64,083,906 | 308 | 6.90 | TRUE |
82 | FAWP28 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2017. With Deduplicator. | 2017-01-01 | 2017-03-31 | 62,797,293 | 323 | 7.60 | TRUE |
83 | FAWP29 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2017. With Deduplicator. | 2017-04-01 | 2017-06-30 | 73,100,203 | 320 | 8.80 | TRUE |
84 | FAWP30 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2017. With Deduplicator. | 2017-07-01 | 2017-09-30 | 75,259,797 | 326 | 9.40 | TRUE |
85 | FAWP31 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2017. With Deduplicator. | 2017-10-01 | 2017-12-31 | 73,242,650 | 326 | 8.70 | TRUE |
86 | FAWP32 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2018. With Deduplicator. | 2018-01-01 | 2018-03-31 | 80,766,956 | 325 | 9.20 | TRUE |
87 | FAWP33 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2018. With Deduplicator. | 2018-04-01 | 2018-06-30 | 125,860,345 | 325 | 11.00 | TRUE |
88 | FAWP34 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2018. With Deduplicator. | 2018-07-01 | 2018-09-30 | 126,921,267 | 325 | 13.00 | TRUE |
89 | FAWP35 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2018. With Deduplicator. | 2018-10-01 | 2018-12-31 | 124,775,974 | 361 | 14.00 | TRUE |
90 | FAWP36 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2019. With Deduplicator. (3 Heritrix ARCs) | 2019-01-01 | 2019-03-31 | 137,178,831 | 361 | 14.00 | TRUE |
91 | FAWP37 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2019. With Deduplicator. (3 Heritrix ARCs) | 2019-04-01 | 2019-06-30 | 132,608,317 | 361 | 14.00 | TRUE |
92 | FAWP38 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2019. With Deduplicator. (3 Heritrix ARCs) | 2019-07-01 | 2019-09-30 | 140,120,566 | 361 | 15.00 | TRUE |
93 | FAWP39 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2019. With Deduplicator. (3 Heritrix ARCs & WARCs) | 2019-10-01 | 2019-12-31 | 154,710,796 | 361 | 19.00 | TRUE |
94 | FAWP40 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2020. With Deduplicator. (3 Heritrix WARC) | 2020-01-01 | 2020-03-31 | 142,757,027 | 361 | 21.00 | FALSE |
95 | FAWP41 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2020. With Deduplicator. (3 Heritrix WARC) | 2020-04-01 | 2020-06-30 | 178,792,290 | 256 | 21.00 | FALSE |
96 | FAWP42 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2020. With Deduplicator. (3 Heritrix WARC) | 2020-07-01 | 2020-09-30 | 119,671,522 | 256 | 17.00 | FALSE |
97 | FAWP43 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2020. Possibly without Deduplicator. (3 Heritrix WARC) | 2020-10-01 | 2020-12-31 | 282,254,451 | 156 | 32.00 | FALSE |
98 | FAWP44 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2021. With Deduplicator. (3 Heritrix WARC) | 2021-01-01 | 2021-03-31 | 85,092,238 | 154 | 15.00 | FALSE |
99 | FAWP45 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2021. With Deduplicator. (3 Heritrix WARC) | 2021-04-01 | 2021-06-30 | 81,529,382 | 154 | 13.00 | FALSE |
100 | FAWP46 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2021. With Deduplicator. (3 Heritrix WARC) | 2021-07-01 | 2021-09-30 | 85,239,435 | 140 | 12.00 | FALSE |
101 | FAWP47 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2021. With Deduplicator. (3 Heritrix WARC) | 2021-10-01 | 2021-12-31 | 111,910,859 | 140 | 19.00 | FALSE |
102 | FAWP48 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2022. With Deduplicator. (3 Heritrix WARC) | 2022-01-01 | 2022-03-31 | 115,394,656 | 140 | 18.00 | FALSE |
103 | FAWP49 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2022. With Deduplicator. (3 Heritrix WARC) | 2022-04-01 | 2022-06-30 | 134,942,148 | 140 | 21.00 | FALSE |
104 | FAWP50 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2022. Possibly without Deduplicator. (3 Heritrix WARC) | 2022-07-01 | 2022-09-30 | 158,087,167 | 140 | 21.00 | FALSE |
105 | FAWP51 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2022. With Deduplicator. (3 Heritrix WARC) | 2022-10-01 | 2022-12-31 | 152,594,620 | 140 | 24.00 | FALSE |
106 | FAWP52 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from January to March 2023. With Deduplicator. (3 Heritrix WARC) | 2023-01-01 | 2023-03-31 | 130,387,084 | 140 | 25.00 | FALSE |
107 | FAWP53 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from April to June 2023. With Deduplicator. (3 Heritrix WARC) | 2023-04-01 | 2023-06-30 | 124,928,811 | 140 | 23.00 | FALSE |
108 | FAWP54 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from July to September 2023. With Deduplicator. (3 Heritrix WARC) | 2023-07-01 | 2023-09-30 | 128,772,574 | 124 | 23.00 | FALSE |
109 | FAWP55 | Block of frequent collections of Portuguese web, mainly websites of news media, websites with frequent renewal of contents, government and public bodies, from October to December 2023. With Deduplicator. (3 Heritrix WARC) | 2023-10-01 | | | 124 | | FALSE |
110 | EAWP1 | Special collection about content related to web preservation. This collection contains web pages crawled between May 2011 and February 2012. | 2011-05-20 | 2012-02-05 | 3,087,727 | | 0.05 | TRUE |
111 | EAWP2 | Special collection of Portuguese Governmental Websites after the Portuguese Legislative Election, June 5, 2011. For example: https://arquivo.pt/wayback/20110621184154/http://www.governo.gov.pt/pt/GC17/Pages/Inicio.aspx | 2011-06-22 | 2011-06-30 | 727,975 | | 0.06 | TRUE |
112 | EAWP5 | Special collection of the prof2000.pt website, before the removal of the publisher port. Prof2000 was a program of remote training for school teachers in Portugal. The crawl was performed in October 2014. Available at https://arquivo.pt/wayback/20141023142253/http://prof2000.pt/ | 2014-10-01 | 2014-10-31 | 109,274 | | 0.00 | TRUE |
113 | EAWP6 | Special collection of the .EU domain, first crawl, performed by the Arquivo.pt between November and December 2014, to preserve content related to the European Community (EU) and its official bodies. This collection is the first attempt to crawl and preserve web sites hosted under the .EU domain within the scope of RESAW activities. RESAW is a European network that aims to create a Research Infrastructure for the Study of Archived Web Materials (resaw.eu). Find logs, reports, analyses and presentations at "A first attempt to archive the .EU domain". Search this collection through the prototype Research .EU. See also EAWP8 and EAWP15. | 2014-11-21 | 2014-12-16 | 129,793,987 | | 5.80 | TRUE |
114 | EAWP7 | Special collection about the Portuguese elections "Legislativas 2015" held on the 4th of October. Crawls were performed before and after the elections, between September and December 2015. The community contributed and suggested contents to be recorded: Web pages from the running political parties, news in the media about the elections, blogs, opinion articles, and satirical political Web pages. A list with 127 suggestions is available for download at Dados.gov. (Multiple Collection) Collections related to elections: EAWP7, EAWP9, EAWP16 EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2015-09-22 | 2015-12-08 | 2,802,407 | 124 | 0.27 | TRUE |
115 | EAWP8 | Special collection, second crawl of the .EU domain performed by the Arquivo.pt in January 2016, in order to preserve content related to the European Community (EU) and its official bodies. Continues the EAWP6 collection. Find logs, reports, analyses and presentations at "A first attempt to archive the .EU domain". Search this collection through the prototype Research .EU. See also EAWP6 and EAWP15. | 2016-01-07 | 2016-01-26 | 61,863,684 | 138,256 | 3.10 | TRUE |
116 | EAWP9 | Special collection about the Portuguese elections "Presidenciais" of January 24, 2016. The crawl was performed before and after the election in January 2016. See an example of a page in this collection. A list of 285 websites and pages is available for download. (Multiple Collection). Collections related to elections: EAWP7, EAWP9, EAWP16, EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2016-01-21 | 2016-01-28 | 551,672 | 4,461 | 0.07 | TRUE |
117 | EAWP10 | Special collection of European Research Projects Websites of the Seventh Framework Programme (FP7) in order to preserve contents related to European scientific projects. The Arquivo.pt team searched for the missing project URLs through an automatic procedure. Find more about the methodology at "Arquivo.pt preserved websites about Research & Development projects funded by the EU", the paper "Preserving Websites Of Research & Development Projects" and technical documentation on GitHub. A list of FP7 projects and URLs is available for download (xlsx). | 2016-03-14 | 2016-03-23 | 10,440,947 | 20,429 | 1.40 | TRUE |
118 | EAWP10-2 | Special collection of European Research Projects Websites of the Seventh Framework Programme (FP7). Seeds were curated manually to create a test collection. An experimental crawl was performed in April 2016. To learn more, follow the same references as for the EAWP10 collection. | 2016-04-21 | 2016-04-28 | 1,048,448 | 1,370 | 0.13 | TRUE |
119 | EAWP11 | Special collection, first crawl of European Research Projects Websites of the Fourth, Fifth and Sixth Framework Programme (FP4, FP5, FP6). Seeds were identified by an automatic process. The crawl was performed between May and June 2016. Find more about the methodology at "Arquivo.pt preserved websites about Research & Development projects funded by the EU", the paper "Preserving Websites Of Research & Development Projects" and technical documentation on GitHub. A list of projects and URLs from Cordis, EU, is available for download (xls format): FP4, FP5, and FP6. | 2016-05-13 | 2016-06-01 | 39,911,966 | 31,099 | 5.20 | TRUE |
120 | EAWP11-2 | Special collection, second crawl of European Research Projects Websites of the Fourth, Fifth and Sixth Framework Programme (FP4, FP5, FP6). Seeds were identified and curated manually. The crawl was performed in July 2016. To learn more, follow the same references as for the EAWP11 collection. | 2016-07-14 | 2016-07-15 | 762,772 | 1,095 | 0.24 | TRUE |
121 | EAWP12 | Special collection of Portuguese Research Projects funded by Fundação para a Ciência e a Tecnologia, I.P. (FCT). Seeds were identified through an automatic process. The crawl was performed in November 2016. Find a list of URLs and technical information on GitHub | 2016-11-18 | 2016-11-21 | 600,721 | 7,956 | 0.07 | TRUE |
122 | EAWP13 | Special collection of international contents, within the scope of the International Internet Preservation Consortium Content Development Working Group (IIPC-CDG) that aimed to promote collaborative collections. The collection EAWP13 includes the following topics: European Refugee Crisis, International Cooperation Organizations, 2016 Summer Olympics, World War I Commemoration and National Olympic and Paralympic Committees. The Arquivo.pt performed the crawl in November 2016. Find the list of seed URLs at the Portuguese open data portal Dados.gov. | 2016-11-18 | 2016-11-21 | 2,186,565 | 5,683 | 0.27 | TRUE |
123 | EAWP14 | Integration of the AWPJornais collection, a set of pages from Portuguese newspapers in the early 2000s. Pages are dated between 2000 and 2003. | 2000-12-14 | 2003-12-30 | 1,362,084 | - | 0.09 | TRUE |
124 | EAWP15 | Special collection, third crawl of the .EU domain performed by the Arquivo.pt in July 2017, in order to preserve content related to the European Community (EU) and its official bodies. Find logs, reports, analyses and presentations at "A first attempt to archive the .EU domain". Search this collection through the prototype Research .EU. See also EAWP6 and EAWP8. | 2017-06-02 | 2017-07-10 | 105,823,552 | 466,828 | 11.00 | TRUE |
125 | EAWP16 | Special collection, first crawl of the Portuguese local/municipal elections, "Eleições Autárquicas", held on October 1, 2017. The crawl was performed before the election, the 27th and 28th September. Find the list of URLs at the Portuguese open data portal Dados Gov. Collections related to elections: EAWP7, EAWP9, EAWP16 EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2017-09-27 | 2017-09-28 | 1,182,825 | 14,264 | 0.13 | TRUE |
126 | EAWP17 | Special collection, second crawl of the Portuguese local/municipal elections, "Eleições Autárquicas", held on October 1, 2017. The crawl was performed after the election, on the 10th and 11th of October. Find the list of URLs at the Portuguese open data portal Dados Gov. Collections related to elections: EAWP7, EAWP9, EAWP16, EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2017-10-10 | 2017-10-11 | 1,083,062 | 14,213 | 0.23 | TRUE |
127 | EAWP18 | Special collection of the website dgartes.gov.pt of the Direção Geral das Artes, the governmental body in Portugal for the arts and shows. The high quality crawl was performed between September and October 2017. See this website at Arquivo.pt. | 2017-09-27 | 2017-10-05 | 186,002 | 41 | 0.01 | TRUE |
128 | EAWP19 | Special collection donated, named "Israblog", containing Israeli blogs. Collected by Anat Ben-David between May 2018 and January 2019, integrated into Arquivo.pt in 2020. Custom search page: arquivo.pt/israblog/ | 2018-05-07 | 2019-01-14 | 24,520,849 | 110,229 | 0.55 | TRUE |
129 | EAWP20 | Special collection of Web pages recorded manually using the Webrecorder in order to test on demand recording and high-quality collection of specific websites or pages. WARC files obtained from webrecorder.io tools were integrated as a simple way to patch incomplete pages, for example, Portuguese newspaper websites that had no CSS. This collection includes records from Webrecorder in 2018. | 2018-01-01 | 2018-12-31 | 105,290 | | 0.01 | FALSE |
130 | EAWP21 | World War I Commemoration. Special collection within the scope of the International Internet Preservation Consortium Content Development Working-Group (IIPC-CDG) that aims to promote collaborative collections. The Arquivo.pt performed the crawl in November 2016. Find the list of seed URLs at the Portuguese open data portal Dados.gov. | 2019-03-14 | 2019-03-15 | 1,112,356 | 408 | 0.06 | TRUE |
131 | EAWP22 | Special collection of Web pages recorded manually using the Webrecorder in order to get a high quality collection of specific websites or pages. This collection includes files recorded with Webrecorder tools in 2019. | 2019-04-02 | 2019-12-31 | | | | FALSE |
132 | EAWP23 | European Elections 2019. Special collection about the European Parliament Election, held on May 26, 2019. This collection is multilingual, over 23 EU official languages. Seeds were obtained through an automatic process and validated manually. The EAWP23 collection is the result of 6 crawls, made before and after the election, between May and July 2019. Find lists of URLs at the Portuguese open data portal Dados.gov and information about the methodology, logs, presentations at Arquivo.pt. Custom search page: arquivo.pt/ee2019/. Collections related to elections: EAWP7, EAWP9, EAWP16, EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2019-05-22 | 2019-07-17 | 99,179,348 | 12,147 | 4.8 | TRUE |
133 | EAWP24 | World News collection, 2019. Special collection that contains news around the world between the 23rd and the 30th of July 2019. This collection is within the scope of the International Internet Preservation Consortium Content Development Working-Group (IIPC-CDG) and aims to promote collaborative collections. Find the list of seed URLs at the Portuguese open data portal Dados.gov. | 2019-07-23 | 2019-07-30 | 5,291,153 | 1,167 | 0.42 | TRUE |
134 | EAWP25 | Diário da República Eletrónico collection, 2019. Special collection of the dre.pt Website, the official channel of legislative publication of the Portuguese Government. Special and customized crawl in collaboration with Imprensa Nacional-Casa da Moeda. Crawls were performed between May and August 2019. | 2019-05-17 | 2020-08-05 | 24,446,517 | 5,134,629 | 1.5 | TRUE |
135 | EAWP26 | Portuguese Legislative Elections collection, 2019. Special collection of the Portuguese Legislative Elections, held on October 6, 2019. Seeds were obtained through an automatic process, before and after the elections, and through manual suggestion from people and organizations from the European Union (EU). The EAWP26 collection is the result of multiple crawls. (Multiple Collection). Find lists of URLs used to perform the collection. Collections related to elections: EAWP7, EAWP9, EAWP16, EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2019-09-30 | 2019-10-31 | 2,793,855 | 3,300 | 0.54 | TRUE |
136 | EAWP27 | Gaza War 2014 collection. Special collection of a set of seeds about the Israel/Gaza conflict in 2014. Seeds list created with the contributions of Anat Ben-David. The crawl was performed by Arquivo.pt in November 2019. | 2019-11-04 | 2019-11-14 | 22,857,543 | 124,355 | 1.5 | TRUE |
137 | EAWP28 | RCAAP - Portuguese science citations collection, 2019. Special collection of the rcaap.pt portal. Repositório Científico de Acesso Aberto de Portugal (RCAAP) is the portal that collects, aggregates and indexes open access scientific contents from Portuguese institutional repositories. Seeds were obtained from the existing citations in scientific articles. | 2019-11-14 | 2020-02-13 | 95,790,146 | 1,078,257 | 7 | TRUE |
138 | EAWP30 | Ciencia Vitae - Portuguese researchers' collection, 2020. Special collection of the cienciavitae.pt platform. Ciencia Vitae is a platform for Portuguese research CVs. URLs presented in scientific CVs were extracted by automatic process and collected. This was a collaboration between Arquivo.pt and PT-CRIS. The crawl was performed between February and March 2020. | 2020-02-03 | 2020-03-11 | 131,941,061 | 184,924 | 16 | FALSE |
139 | EAWP31 | FCT - Portuguese R&D units, 2020 - first collection. Special collection of scientific research results reported in the fct.pt website. Fundação para a Ciência e a Tecnologia (FCT) is the institution that supports scientific research in Portugal and funds more than 300 R&D units. Seeds were obtained from the list of funded units in 2019, publicly available, and are mostly its websites. Crawls were performed between January and July 2020. Find more information on the Arquivo.pt website. See also EAWP34, EAWP38. | 2020-01-23 | 2020-07-05 | 38,077,294 | 382 | 1.5 | FALSE |
140 | EAWP33 | Novel Coronavirus (Covid-19) outbreak collection. Special collection about the pandemic Covid-19, through collaborative, manual and automatic identification of seeds. This collection contains: 1) seeds from the collaborative list of IIPC CDG (International Internet Preservation Consortium - Content Development Group); 2) seeds obtained automatically through the Bing API; 3) seeds manually selected about the pandemic in Portugal, e.g., videos from Youtube and governmental bodies. Different tools were used to collect the content (Brozzler, Heritrix and Webrecorder). The thematic scope of the collection includes contents in many languages, countries and perspectives. Crawls were performed between March and September 2020. Find URLs lists and details about the methodology. | 2020-03-23 | 2020-09-04 | 29,267,832 | 17,131 | 4.5 | FALSE |
141 | EAWP34 | Portuguese research projects funded by FCT, 2020 - second collection. Special collection of scientific research results reported in the fct.pt website. Fundação para a Ciência e a Tecnologia (FCT) is the institution that supports scientific research in Portugal and funds more than 300 R&D units. Seeds were obtained by automatic process from the intermediate and final reports of funded projects. Crawls were performed between July and September 2020. Find more information on the Arquivo.pt website. See also EAWP31, EAWP38. | 2020-07-14 | 2020-09-14 | 9,023,347 | 246 | 1.6 | FALSE |
142 | EAWP35 | H2020 projects collection. Horizon 2020 is the EU Research and Innovation programme between 2014 and 2020. Seeds were obtained by automatic process using the Bing API. The collection is the result of different crawls with Heritrix, supplemented with Brozzler. The crawls were performed between December 2020 and March 2021 and between August 2021 and October 2021. Find lists of URLs projects available at the Portuguese open data portal Dados.gov and more information at Arquivo.pt. | 2020-12-29 | 2021-10-06 | 197,560,040 | 118,326 | 17 | FALSE |
143 | EAWP36 | Portuguese Foreign Affairs Websites collection. Special collection of the portaldiplomatico.mne.gov.pt website and related subsites of the Ministério dos Negócios Estrangeiros - Portuguese Foreign Affairs. The high-quality collection was done as part of a collaboration with this governmental body. The crawl was performed in January 2021. See the portaldiplomatico.mne.gov.pt on Arquivo.pt. | 2021-01-11 | 2021-01-18 | 67,359,912 | 301 | 6.1 | FALSE |
144 | EAWP37 | Presidential Elections 2021 collection. Special collection about the Portuguese Presidential Elections, held on January 24, 2021. Seeds were obtained by automatic process using Bing API, followed by human curation. The collection ran before and after the election, using different tools (Heritrix, Brozzler and also Webrecorder). Social pages of the candidates were recorded, namely Facebook and Twitter, even if the complete replay won't be possible for now. Crawls were performed between January and February 2021. Find the list of URLs at the Portuguese open data portal Dados.gov. Collections related to elections: EAWP7, EAWP9, EAWP16, EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2021-01-18 | 2021-02-11 | 2,839,323 | 18,956 | 0.626 | FALSE |
145 | EAWP38 | Portuguese research projects funded by FCT, 2021. The Fundação para a Ciência e a Tecnologia (FCT, fct.pt ) is the Portuguese body for scientific research in Portugal. Seeds were obtained by automatic process from the intermediate and final reports of funded projects and crawled. The crawl was performed in February 2021. Find more information at Arquivo.pt website. See also EAWP31, EAWP34. | 2021-02-12 | 2021-02-23 | 11,720,374 | 821 | 1.6 | FALSE |
146 | EAWP39 | Local Elections 2021 collection. Special collection about the Portuguese Local Elections, held on September 26, 2021. Seeds were obtained by automatic process using Bing API, followed by human curation. The collection ran before and after the election, using different tools (Heritrix, Brozzler). Social pages of the candidates were recorded, namely Facebook and Twitter even if the complete replay won't be possible for now. Crawls were performed between August and October 2021. Find the list of URLs available at the Portuguese open data portal Dados.gov. Collections related to elections: EAWP7, EAWP9, EAWP16 EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2021-07-23 | 2021-10-07 | 31,266,653 | 118,440 | 2.7 | FALSE |
147 | EAWP40 | Parliamentary Elections 2022 collection. Special collection about the Portuguese Parliamentary Elections, held on January 25, 2022. Seeds were obtained by automatic process using Bing API, followed by human curation. The collection ran before and after the election, using different tools (Heritrix, Brozzler and Webrecorder). Social pages of the candidates were recorded, namely Facebook and Twitter, even if the complete replay won't be possible for now. Crawls were performed between January and February 2022. Find the list of URLs at the Portuguese open data portal Dados.gov. See collections related to elections: EAWP7, EAWP9, EAWP16, EAWP17, EAWP23, EAWP26, EAWP37, EAWP39, EAWP40. | 2022-01-25 | 2022-02-10 | 1,207,073 | 3,052 | 0.249 | FALSE |
148 | EAWP41 | Collection of websites and social media related to cryptocurrencies (from coingecko.com, coinmarketcap.com and opensea.io). Five crawls were made between January and November 2022. "Life of cryptocurrencies in a year". Find the related data set at the Portuguese open data portal Dados.gov and more information on the Arquivo.pt website. | 2022-01-22 | 2022-11-28 | 122,491,428 | 105,253 | 12 | FALSE |
149 | EAWP42 | Collection of external links from Wikipedia using the Wikimedia dumps. Find technical information about this collection on GitHub. | 2023-02-15 | 2023-02-27 | 12,454,652 | 10,137,225 | 0.856 | FALSE |
150 | EAWP43 | RCAAP - Portuguese science citations collection, 2023. Special collection of the rcaap.pt portal. Repositório Científico de Acesso Aberto de Portugal (RCAAP) is the portal that collects, aggregates and indexes open access scientific contents from Portuguese institutional repositories. Seeds were obtained from the existing citations in scientific articles. | 2023-03-15 | 2023-03-31 | 99,100,177 | 1,904,691 | 9.8 | FALSE |
151 | EAWP44 | Ciencia Vitae - Portuguese researchers' collection, 2023. Special collection of the cienciavitae.pt platform. Ciencia Vitae is a platform for Portuguese research CVs. URLs presented in scientific CVs were extracted by automatic process and collected. This was a collaboration between Arquivo.pt and PT-CRIS. The crawl was performed between August and 2023. | 2023-08-31 | 1,294,776 | FALSE | |||
152 | RAQ2017 | High Quality Collection or RAQ (stands for Recolha de Alta Qualidade) - first block. Collection performed with Brozzler, a tool that uses a browser, in 2017. Find in this collection: 1) Websites later integrated in the Memorial service (arquivo.pt/memorial), for example, UMIC (umic.pt), Mundo na Escola program (mundonaescola.pt); 2) High Quality Collections (RAQ), for example Unplace project (unplace.org), PAN political party (pan.com.pt), Marlisco initiative (marlisco.eu), Greenpeace Spain (es.greenpeace.org/es), FCT (fct.pt); 3) Partial contents about the Local Portuguese Elections 2017 "Autárquicas" collected with Brozzler. | 2017-03-01 | 2017-12-28 | 3,277,067 | | 0.01 | FALSE |
153 | RAQ2018 | High Quality Collection or Recolha de Alta Qualidade (RAQ) - second block. Collection performed with Brozzler, in 2018. Find in this collection: Zappiens multimedia portal (zappiens.fccn.pt), Visibilidade.net website (visibilidade.net), TSF radio station (tsf.pt), Sem Planos blog (semplanos.com), mercadodolivro (mercadodolivro.pt), jornadas2018 (jornadas.fccn.pt), Aventuras de Jeremias (jeremias.com.pt), IPv6 Task Force (ipv6-tf.com.pt), ICOLC meeting (icolc.fccn.pt), Fundação Mário Soares website (fmsoares.pt), EEA Grants (eeagrants.cig.gov.pt), Consultorio de Ciência e Tecnologia (consultorioct.mct.pt), Encontro Ciência 2008 (cla.fccn.pt), Algarve Digital (cidadesdigitais.pt), Casa Comum (casacomum.pt), B-on (b-on.pt). Recordings were done using the Brozzler tool. | 2018-01-01 | 2018-12-31 | 2,366,581 | | 0.10 | FALSE |
154 | RAQ2019 | High Quality Collection or Recolha de Alta Qualidade (RAQ) - third block. Collection performed with Brozzler, in 2019. Includes Degois.pt. Found in this collection: Galeria Zé dos Bois (zedosbois.org), Robotica (robotica.pt), Os bichos (osbichos.pt), Mundo na Escola portal (mundonaescola.pt), Minema project (minema.di.fc.ul.pt), Marlisco.eu (marlisco.eu), Jornadas FCCN 2019 (jornadas.fccn.pt), Personal page of jlmiran (w2.estgp.pt/docentes/jlmiran), Internet Segura (internetsegura.pt), Twitter page of the Portuguese Government (twitter.com/govpt), Forcomp website of Evora University (forcomp.uevora.pt), Fraternidade Nun'Álvares (fna-escuteiros.org), Infraestruturas digitais de investigação (e-infras.pt), Degóis platform of researchers' CVs (degois.pt), Old websites owned by CEGER governmental network for IT management (ceger.gov.pt) (like natolisboa2010.gov.pt, dislikebullyinghomofobico.gov.pt, natomedicalconference2009.gov.pt, pepal.gov.pt, knetworks.gov.pt, disaster-recovery.gov.pt, redesdoconhecimento.gov.pt, diadomar.mdn.gov.pt, missaovenezuela.gov.pt, religare.gov.pt). Learn more about the collaboration with CEGER at https://www.youtube.com/watch?v=E8WuwF4OJnc | 2019-01-01 | 2019-12-31 | 7,810,127 | | 0.51 | TRUE |
155 | RAQ2020 | High Quality Collection or Recolha de Alta Qualidade (RAQ) - fourth block. Collection performed with Brozzler and Browsertrix. Found in this collection: Exposição FCSH Tempos de doença tempos de cura (fcsh.unl.pt/faculdade/bibliotecas/tempos-de-doenca-tempos-de-cura/); FCT website (fct.pt); Senior3045 program (senior3045.ipportalegre.pt); Teatro Nacional D. Maria II website recorded locally by Rita Carpinha using Webrecorder.io tools and integrated at Arquivo.pt (tndm.pt, 2019110); Revisionista (revisionista.pt), contents collected within the revisionista project and donated to the Arquivo.pt by André Mourão; Radiopax radio station (radiopax.com); Portugal Government portal (portugal.gov.pt); Old websites owned by FCT (ciencia2012.fct.pt, newsletter.fct.pt, esoiday.fct.pt, spaceforum.fct.pt, ticsociedade.pt, encontros.act.fct.pt, gow.fct.pt, eskills.fct.pt, curadoriadigital.fct.pt, arquivosuniversitarios.fct.pt, arquivoscientificos.fct.pt, arquivosap.fct.pt); Museu do Benfica (museubenfica.slbenfica.pt); Portuguese Ministry of Foreign Affairs website (mne.gov.pt); Jornadas FCCN 2020 (jornadas.fccn.pt); Instituto Politécnico de Portalegre summer school (sc2013.ipportalegre.pt, sc2014.ipportalegre.pt, sc2012-byday.ipportalegre.pt, sc2013-byday.ipportalegre.pt, sc2014-byday.ipportalegre.pt); Pages from Facebook to test the recording; ESPAP Public Administration portal websites (gerap.gov.pt, inst-informatica.pt, www.inst-informatica.pt); and some sites on the initiative of the digital curator. | 2020-01-01 | 2020-12-31 | 2,864,891 | 1.3 | FALSE | |
156 | RAQ2021 | High Quality Collection or Recolha de Alta Qualidade (RAQ) - fifth block. Collection performed with Brozzler, Browsertrix and ArchiveWeb.page. Find in this collection: World Health Organization website (who.int); Videos about Covid-19 to test the recording; Velo.city Lisbon Conference website (velo-city2021.com); TVI24 website (tvi24.iol.pt); Pages about tourism in Portugal in collaboration with MUVITUR (more information at https://sobre.arquivo.pt/en/virtual-museum-of-tourism-muvitur-creates-a-collection-of-preserved-websites/); Timor Lorosae (2020.tlsa.pt); Copy of contents from Sines, Portugal, sent by the Arquivo Municipal de Sines to be integrated at Arquivo.pt; Contents from researchers through PTCRIS portal (more information at https://sobre.arquivo.pt/en/arquivo-pt-preserves-websites-of-national-scientific-projects/); Art and galleries websites in collaboration with Art Library of Gulbenkian Foundation (more information at https://sobre.arquivo.pt/pt/memoria-de-festivais-e-eventos-de-arte-para-sempre/); ROSSIO first website (rossio.fcsh.unl.pt); O Corvo newspaper website (ocorvo.pt); Biblioteca do Instituto Politécnico de Leiria (ipleiria.pt/sdoc/); Research center IFILNOVA (ifilnova.pt); Research center IELT (ielt.fcsh.unl.pt); Architect João Rocha personal website (joaoalvarorocha.pt); NAU website for MOOCs (nau.edu.pt); Govtech competition promoted by AMA (govtech.gov.pt); Old websites owned by Faculdade de Ciências da Universidade de Lisboa (FCUL); Old websites owned by CEGER (oe2020.gov.pt, oe2021.gov.pt, oe2022.gov.pt, pfn.gov.pt, portugaldigital.gov.pt, prestarcontas.gov.pt, covid19estamoson.gov.pt); Old websites owned by AMA (tenhoumacrianca.pt, boaspraticasautarquicas.gov.pt); Câmara Municipal de Almada website (cm-almada.pt); Minima Linea (minimalinea.pt); FCCN website (fccn.pt); BAD librarians association website (bad.pt); Pages from Afghanistan (more information at https://sobre.arquivo.pt/en/afghanistan-websites-and-the-fall-of-the-regime-in-august-2021/); 500 years of Magellan's circumnavigation (magalhaes500.pt); MCTES 25th anniversary (see a related result at https://arquivo.pt/25anosmctes/). | 2021-01-01 | 2021-12-31 | 20,648,567 | | 7.5 | FALSE |
157 | RAQ2022 | High Quality Collection or Recolha de Alta Qualidade (RAQ) - sixth block. Collection performed with Brozzler, Browsertrix and ArchiveWeb.page. Find in this collection: Copy of contents from Sines, Portugal, sent by the Arquivo Municipal de Sines to be integrated at Arquivo.pt; Tecnico E-Escola website (e-escola.tecnico.ulisboa.pt); Brazilian Presidential Election 2022 collection; Portuguese political parties official websites; FCT (fct.pt, jornadas.fccn.pt); Ukrainian War impact on Ukrainians and Russians living in Portugal; Hemeroteca digital website (hemerotecadigital.cm-lisboa.pt); Afghan websites through the IIPC collection seeds (https://dados.gov.pt/pt/datasets/r/10b8b0b3-c932-44c8-a56d-e31536ac942b); Infraestruturas Digitais de Informação (e-infras.pt); Pages about editors and book sellers in Portugal; Crypto currencies websites (more information at https://sobre.arquivo.pt/en/open-dataset-about-cryptocurrency/); Banco de Portugal (bportugal.pt); Agência Portuguesa do Ambiente website (www.apambiente.pt); IndieLisboa Festival website (see a related result at https://arquivo.pt/indielisboa/); Old websites owned by CEGER (oe2020.gov.pt, oe2021.gov.pt, oe2022.gov.pt, pfn.gov.pt, portugaldigital.gov.pt, prestarcontas.gov.pt, covid19estamoson.gov.pt). | 2022-01-01 | 2022-12-31 | 21,109,185 | | 6.1 | FALSE |
158 | RAQ2023 | High Quality Collection or Recolha de Alta Qualidade (RAQ) - seventh block. Collection performed with Brozzler, a browser-based crawler, in 2023. | 2023-01-01 | | | | | FALSE |
159 | PATCHING2019 | First collection from Pywb Patching. Collection of lacking information inside web pages in order to improve the replay in old web pages, for example, recovering CSS files or images. Contents were collected from the live Web or from the Internet Archive using the software Pywb (Webrecorder.net). The patching was performed between September and December 2019. | 2019-09-19 | 2019-12-31 | 2,545,401 | | 0.04 | FALSE |
160 | PATCHING2020 | Second collection from Pywb Patching. Collection of lacking information inside web pages in order to improve the replay in old web pages, for example, recovering CSS files or images. Contents were collected from the live Web or from the Internet Archive using the software Pywb (Webrecorder.net). This collection includes patch crawls performed between January 2019 and January 2021. This collection includes the auto-patching of Publico.pt websites. https://sobre.arquivo.pt/wp-content/uploads/dominios-do-jornal-PUBLICO-no-Arquivopt-1996-2019.pdf | 2019-01-01 | 2021-01-21 | 12,066,082 | | 0.44 | FALSE |
161 | PATCHING2021 | Third collection from Pywb Patching. Collection of lacking information inside web pages in order to improve the replay in old web pages, for example, recovering CSS files or images. Contents were collected from the live Web or from the Internet Archive using the software Pywb (Webrecorder.net). This collection includes patch crawls and is a work in progress started in January 2021. | 2021-01-21 | 2021-12-31 | 106,543 | | 0.01 | FALSE |
162 | PATCHING2022 | Fourth collection from Pywb Patching. Collection of lacking information inside web pages in order to improve the replay in old web pages, for example, recovering CSS files or images. Contents were collected from the live Web or from the Internet Archive using the software Pywb (Webrecorder.net). This collection includes patch crawls and is a work in progress started in January 2022. | 2022-01-01 | 2022-12-31 | | | | FALSE |
163 | PATCHING2023 | Fifth collection from Pywb Patching. Collection of lacking information inside web pages in order to improve the replay in old web pages, for example, recovering CSS files or images. Contents were collected from the live Web or from the Internet Archive using the software Pywb (Webrecorder.net). This collection includes patch crawls and is a work in progress started in January 2023. | 2023-01-01 | | | | | FALSE |
164 | SAWP1 | First collection from the Pywb SavePageNow, which is an experimental service (May 2021) that allows anyone to record a web page to be integrated in Arquivo.pt. The Save Page Now (SAWP) service uses the software Pywb (Webrecorder.net). This collection includes patch crawls and is a work in progress started in January 2021. | 2021-04-01 | 2023-01-11 | | | | FALSE |
165 | SAWP2 | Second collection from the Pywb SavePageNow service that allows anyone to record a web page to be integrated in Arquivo.pt. The Save Page Now (SAWP) service uses the software Pywb (Webrecorder.net). This collection includes patch crawls and is a work in progress started in January 2023. | 2023-01-11 | | | | | FALSE |
166 | Curadoria | Curadoria 2020 collection. Collection of webpages using the Webrecorder, performed manually by the web curator or sent to the web curator by the community after the training on web recording. Resulting WARC files were integrated in the Arquivo.pt and the first records are dated 2020. | 2020-06-15 | | 260,019 | | 0.38 | FALSE |
167 | InternetMemory | Internet Memory collection. Donated by Julien Masanès, founder and head of the Internet Memory Foundation after its end in 2018. Two hard-disks were recovered and the resulting data was integrated in the Arquivo.pt. Contents in this collection between 2004 and 2010 are part of the first European web archive, the European Archive Foundation. | | | 143,212,251 | | 6.30 | TRUE |
168 | CEGER | CEGER Memorial, donated by CEGER in 2019. This collection includes the following websites: diadomar.mdn.gov.pt, missaovenezuela.gov.pt, religare.gov.pt. Three inactive websites, originally stored as HTML, CSS, scripts and images, were converted into WARC files and integrated in the Arquivo.pt Memorial. Centro de Gestão da Rede Informática do Governo (CEGER) is the Portuguese government computer network management center. File timestamps are between November 2017 and April 2019. | 2017-11-17 | 2019-04-12 | 184 | 3 | 0.00 | FALSE |
169 | Geocities | Geocities collection. Integration of the Geocities dump done by the Archive Team in 2009. The information was originally in Web files (HTML, CSS, Script and images) and was converted into WARC files by the Arquivo.pt. The integration was finished in 2021. Custom Search page: arquivo.pt/searchGeocities | | | 35,616,800 | | 0.38 | TRUE |
170 | FortuneCities | FortuneCities collection. Integration of the FortuneCity dump done by the Archive Team in 2012. The integration was finished in 2021. | | | | | | FALSE |
171 | Revisionista | News websites collection crawled and donated by Revisionista.pt. | 2020-03-13 | 2022-05-27 | 26,957,627 | 61 | 2.10 | FALSE |
172 | MAWP1 | 1st block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between May and September 2020. | 2020-05-21 | 2020-09-30 | 41,595,384 | 87 | 18.00 | FALSE |
173 | MAWP2 | 2nd block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to the volume of new content. Crawls were performed between October and December 2020. | 2020-10-01 | 2020-12-31 | 20,399,639 | 88 | 9.60 | FALSE |
174 | MAWP3 | 3rd block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between January and March 2021. | 2021-01-01 | 2021-03-31 | 30,035,262 | 88 | 9.1 | FALSE |
175 | MAWP4 | 4th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between April and June 2021. | 2021-04-01 | 2021-06-30 | 18,336,531 | 88 | 3.5 | FALSE |
176 | MAWP5 | 5th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between July and September 2021. | 2021-07-01 | 2021-09-30 | 18,947,796 | 88 | 7.4 | FALSE |
177 | MAWP6 | 6th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between October and December 2021. | 2021-10-01 | 2021-12-31 | 24,066,714 | 88 | 7.9 | FALSE |
178 | MAWP7 | 7th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between January and March 2022. | 2022-01-01 | 2022-03-31 | 27,220,701 | 88 | 6.3 | FALSE |
179 | MAWP8 | 8th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between April and June 2022. | 2022-04-01 | 2022-06-30 | 34,712,830 | 88 | 8 | FALSE |
180 | MAWP9 | 9th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between July and September 2022. | 2022-07-01 | 2022-09-30 | 30,912,031 | 88 | 8.3 | FALSE |
181 | MAWP10 | 10th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between October and December 2022. | 2022-10-01 | 2022-12-31 | 37,896,236 | 88 | 4.8 | FALSE |
182 | MAWP11 | 11th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between January and March 2023. | 2023-01-01 | 2023-03-31 | 37,357,999 | 88 | 4 | FALSE |
183 | MAWP12 | 12th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between April and June 2023. | 2023-04-01 | 2023-06-30 | 48,103,804 | 88 | 3.3 | FALSE |
184 | MAWP13 | 13th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between July and September 2023. | 2023-07-01 | 2023-09-30 | 29,203,723 | 75 | 2.7 | FALSE |
185 | MAWP14 | 14th block of Monthly Collections. Includes a selection of websites collected monthly because of their relevance, e.g., for Portuguese local communities, or according to their publication of new content. Crawls were performed between October and December 2023. | 2023-10-01 | 75 | FALSE |
1 | Collection ID prefix | Description |
---|---|---|
2 | AWP | Stands for "Archive Web Portuguese". Every three months, the domains on a list of all those registered under the .PT domain are crawled, using a conventional crawler because of the high volume of data involved. These content blocks, or "collections", are named using the AWP acronym plus a sequential number. In April 2023 we reached AWP44. |
3 | MAWP | Stands for "Monthly - Archive Web Portuguese". A monthly crawl takes place over the course of a month and runs for approximately 22 days. A list of selected websites is collected using a combination of browser-based crawlers. Most are newspaper websites, other media websites and government websites. Browser-based crawling records and collects media content better and improves quality when web content is replayed. On the other hand, it consumes more resources, so the periodicity of the MAWP collections has depended on the resources available. |
4 | FAWP | Stands for "Frequent - Archive Web Portuguese". A daily crawl records the same websites every day. Most are newspaper websites, other media websites and government websites. These are typically websites that publish new content every day. |
5 | EAWP | Special crawls ("Especial"): selected pages about a given theme or event crawled with varied frequency. |
6 | SAWP | Save Page Now crawls: pages web-archived by users in high quality using the arquivo.pt/savepagenow service. |
7 | RAQ | High-quality crawls: a list of selected websites that were carefully archived and curated with the highest quality possible using the best combination of technologies available. |
8 | PATCHING | Web-archived resources crawled by the CompletePage service after user interaction. |
9 | Varied prefixes (Roteiro, Dinis, Weblog, etc.) | Donated collections |
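
The naming scheme described above (a prefix followed by a sequential number, as in AWP44 or MAWP10) can be split apart mechanically. The following is a minimal sketch, not part of Arquivo.pt's tooling: the function name `split_collection_id`, the short labels in `KNOWN_PREFIXES`, and the assumption that every numbered ID is letters followed by digits are illustrative. IDs without a known prefix (for example donated collections such as Geocities) are returned with no label.

```python
import re

# Prefixes from the table above; the labels are short paraphrases (assumption).
KNOWN_PREFIXES = {
    "AWP": "quarterly crawl of the .PT domain",
    "MAWP": "monthly browser-based crawl",
    "FAWP": "frequent (daily) crawl",
    "EAWP": "special crawl on a theme or event",
    "SAWP": "Save Page Now crawl",
    "RAQ": "high-quality curated crawl",
    "PATCHING": "CompletePage patching crawl",
}

# Assumed ID shape: letters, optionally followed by a sequential number.
_ID_PATTERN = re.compile(r"^([A-Za-z]+)(\d*)$")


def split_collection_id(collection_id):
    """Return (prefix, number or None, label or None) for a collection ID."""
    text = collection_id.strip()
    match = _ID_PATTERN.match(text)
    if match is None:
        # ID does not fit the assumed shape; return it unchanged.
        return text, None, None
    prefix, digits = match.groups()
    number = int(digits) if digits else None
    # Exact lookup on the whole prefix, so "MAWP" is never mistaken for "AWP".
    return prefix, number, KNOWN_PREFIXES.get(prefix)


if __name__ == "__main__":
    for cid in ("AWP44", "MAWP10", "Geocities"):
        print(cid, split_collection_id(cid))
```

Matching the whole alphabetic prefix, rather than searching for "AWP" inside the ID, keeps MAWP, FAWP, EAWP and SAWP distinct from plain AWP.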