1 | European Data Science Academy Dataset Register | |
---|---|---|
2 | ||
3 | This database contains metadata relating to the datasets used and generated within the EDSA project. The database acts as a catalog providing background information on each dataset, its method of collection and, where published, how to access it. | |
4 | ||
5 | The EDSA data management plan can be found at: | |
6 | http://edsa-project.eu/edsa-data/uploads/2015/02/EDSA-2015-P-D55-FINAL.pdf | |
7 | ||
8 | Table of Contents | |
9 | Introduction | Metadata |
10 | V1.2 - DMP M12 | The dataset register |
11 | DMP Field Description | A list of the fields used to construct the register |
12 | DMP Code Lists | Codes and abbreviations used in specific fields |
13 | ||
14 | ||
15 | Licence | |
16 | ||
17 | ||
18 | Certificate | |
19 | This work has an Open Data Certificate | |
20 | ||
21 | Contact | |
22 | If you have any questions, suggestions or comments, please don't hesitate to email us at: training@theodi.org | |
23 | ||
24 | Published | |
25 | 26/05/2016 | |
26 | ||
27 | Last updated | |
28 | 07/07/2016 | |
29 | ||
30 | Please cite as | |
31 | The European Data Science Academy Register (2016) The Open Data Institute | |
32 | ||
1 | Data set reference and name | Data set description | Standards | Data sharing | Archiving and preservation | Data Ethics | ||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | Relevant to DMP "Has the dataset been used to infer results of research?" | Work Package | Organisation | Dataset Title | Dataset Identifier | Status - January 2018 (ongoing, in progress, due date) | New entry to data management plan since M18? Yes/No | Generated or collected | Origin | Scale | Who is this useful for? | Similar existing dataset and possibility for integration? Value of this new dataset? | What standards and methodologies will be utilised for data collection and management? | Outline the metadata, documentation or other supporting material that should accompany the data for it to be interpreted correctly | Status and location of metadata, documentation or other supporting material | Licensing, data protection, ownership and copyright | Can the data be published under an open licence? | If the data cannot be published openly, why? | How will the data be shared? (including access procedures, dissemination, software/tools needed for enabling reuse) | Which repository will be used for the data? Why this repository? | Is it ready to be published? | Current location of dataset | Dataset Link | Licence | How long should the data be preserved? How will it exceed the length of the project if necessary? | Approx end volume | Who is responsible in your organisation for the data management and curation? | Quality assurance and back up procedures? | Associated costs and how these will be covered - do you need to purchase storage? How much time will it take for a person to manage the data - how will this be covered? | Are there any ethical or legal issues that can have an impact on sharing this data? Y/N | What are the ethical or legal issues, if any, that can occur from sharing this data? | Is informed consent for data sharing and long term preservation included in questionnaires dealing with personal data? Y/N/NA |
3 | Yes | WP1 | ODI | Corpora of crawled web-based adverts from LinkedIn | WebSiteHarvest | Finished | No | Collected | 46 terms, 31 languages, 47 countries, 1 harvest per day, 2162 data points per day | Internal demand analysis only. | Many datasets are collected in this area; however, due to the specific nature of this study, collection of new data is required and integration with existing datasets is not viable. The value of this dataset comes from the provision of an up-to-date snapshot of current data science skills needs across the EU. | All data collected is translated into CSV format. | Data will not be available for reuse or accessible by anyone outside of the project. The data collected will be used for internal analysis to inform the creation of curriculum. | Metadata is not publicly available | The terms of the LinkedIn user agreement now forbid harvesting and collection of data without express permission. When the data was collected, this was not the case. https://www.linkedin.com/legal/user-agreement?trk=hb_ft_userag | No | The terms of the LinkedIn user agreement forbid harvesting and collection of data without express permission. "Use manual or automated software, devices, scripts robots, other means or processes to access, “scrape,” “crawl” or “spider” the Services or any related data or information;" https://www.linkedin.com/legal/user-agreement?trk=hb_ft_userag | Data will not be shared or available for reuse | Using Github so that the data stays close to its usage and can be used quickly and easily. | N/A | N/A | N/A | N/A | Until the end of the project | <1Gb | ODI lead data management and curation, other WP1 partners will contribute | Backed up to an internal ODI repository | Approximately 1 day person effort per month | Y | Internal demand analysis only | NA | |
4 | Yes | WP1 | ODI | Aggregated statistics of European skill demand based on web-based job adverts | WebSiteStatistics | Finished | No | Collected | Adzuna API, Trovit | Varied | Populating the dashboard, internal demand analysis and to inform curriculum development. | Many datasets are collected in this area; however, due to the specific nature of this study, collection of new data is required and integration with existing datasets is not viable. The value of this dataset comes from the provision of an up-to-date snapshot of current data science skills needs across the EU. | All data collected is translated into CSV format. | The Adzuna data is accessible via the Adzuna API. The Trovit data will not be available for reuse or accessible by anyone outside of the project. | Metadata is not publicly available | The data will be available for use via the EDSA dashboard. However, it will not be available to download as this contravenes Trovit’s terms and conditions. | No | Trovit’s terms of use prohibit the use of their data. The research exception allows us to use the data but not to make it available in raw format for others to consume for commercial purposes. | Via the EDSA dashboard | In an internal JSI repository | N/A | N/A | N/A | N/A | Until the end of the project | <1Gb | ODI lead data management and curation, other WP1 partners will contribute | Backed up in an internal JSI repository | Approximately 1 day person effort per month | Y | Data can only be used for research purposes | NA |
5 | Yes | WP1 | ODI | Individual results from demand analysis | IndividualResponses | Ongoing | No | Generated | Interviews and survey | 584 surveys, 108 interviews at present. Online survey still open. | Internal demand analysis. | A number of surveys exist in this domain but their data is not available to this project. This data will enable EDSA to build up a country by country view of current capacity and requirements for data science skills. | Data collection methods outlined in D1.4. Translated into CSV format. | Data will not be shared or available for reuse. The data collected will be used for internal analysis to inform the creation of curriculum. Anonymised data will be publicly available. | Metadata is not publicly available | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Data protection of personal data | Data will not be shared or available for reuse | Internal ODI repository | N/A | N/A | N/A | N/A | Until the end of the project | <100Mb | ODI lead data management and curation, other WP1 partners will contribute | Backed up to an internal ODI repository | Approximately 1 day effort per month | Y | Contains personal data | N |
6 | Yes | WP1 | ODI | Summary data from surveys and interviews | DemandAnalysisSummary | Finished | No | Generated | Interviews and survey | 585 surveys, 108 interviews. | External analysis of respondents who took the surveys and interviews. | None | Data collection methods outlined in D1.4. Translated into CSV format. | A README.md file is available detailing the data structure and basic usage. | https://theodi.github.io/edsa-demand-analysis-summary-data/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Yes | N/A | Data will be available to access from the EDSA website and the ODIs Github repository. | Github/ EDSA website | Yes | https://theodi.github.io/edsa-demand-analysis-summary-data/ | https://theodi.github.io/edsa-demand-analysis-summary-data/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | <100Mb | ODI lead data management and curation, other WP1 partners will contribute | Stored in external repositories - EDSA website and Github | Github free and public | No | None | N/A |
7 | Yes | WP1 | ODI | De-identified survey responses from demand analysis | DeidentifiedResponses | Ongoing | No | Generated | Survey | 496 survey results | External analysis of results and trends by anyone who wishes to gather survey data in the area of data science | There are a number of other surveys that have been aggregated that we can compare our results to, and we can use these results if necessary. This dataset has the same eventual value to others in the area | Data collection methods outlined in D1.4. Translated into CSV format. | A README.md file is available detailing the data structure and basic usage. | http://davetaz.github.io/quantitative-data-from-edsa-demand-analysis-/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Yes | N/A | Data will be available to view on the EDSA dashboard and accessible for free in the EDSA dashboard Github repository. | Github/ EDSA Dashboard on website | Yes | Github/ EDSA Dashboard on website | http://davetaz.github.io/quantitative-data-from-edsa-demand-analysis-/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | <100Mb | ODI lead data management and curation, other WP1 partners will contribute | Stored in external repositories - EDSA website and Github | Github free and public | No | None | N/A |
8 | Yes | WP1 | ODI | Recordings and transcriptions of interviews | InterviewTranscipts | Finished | No | Generated | Interviews | 108 transcripts, 108 recordings | Internal demand analysis only | No similar datasets exist that are usable for this project. The interviews provide insights and data points for use in the demand analysis. | Qualitative research methodology for collection outlined in D1.4 | Data will not be available for reuse or accessible by anyone outside of the project. The data collected will be used for internal analysis to inform the creation of curriculum. | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Data protection of personal data | Data will not be shared or available for reuse. The data collected will be used for internal review to inform the creation of curriculum and will only be available publicly as anonymous data | Internal ODI repository | N/A | N/A | N/A | N/A | Until the end of the project | < 3GB | ODI lead data management and curation, other WP1 partners will contribute | Backed up to an internal ODI repository | As part of the subcontracting costs of WP1 | Y | Raw data including personal data | No |
9 | No | WP1 | ODI | ideXlab search platform results | ExpertIdentification | Ongoing | No | Collected | Research publications | Final scale not yet known as collection is ongoing | Internal demand analysis and to inform curriculum development. Provides insights into offer side of skills analysis. | Not in this area. This dataset will provide validation of the demand analysis and form the basis for further insights. | The ideXlab search engine will use the sampling approach outlined in D1.2 for data collection. CSV data will be created | Data will not be available for reuse or accessible by anyone outside of the project. The data collected will be used for internal analysis to inform the creation of curriculum. | Accompanying document to explain data structure. This will not be made open. | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Data protection of personal data | The data will not be shared due to restrictions on the use of personal data. | ideXlab search platform | N/A | N/A | N/A | N/A | Until the end of the project | Est. 1000 returns | ideXlab lead data management and curation, other WP1 partners will contribute | Backed up to an internal ideXlab repository | Approx 2 person days per month. No other external costs | Y | Raw data including personal data | No |
10 | Yes | WP1 | JSI | Trainings repository | TrainingRepository | Ongoing | No | Both | Trainings APIs, training websites, Wikifier processing | >50.000 offline and online trainings (within different fields) | EDSA dashboard | This is an extensive cross-lingual dataset that covers a wide spectrum of trainings. Each training is processed and semantically annotated. | JSON | Data is available within the project; statistics are available on request. | For Wikifier annotations - Wikifier.org | Raw data will be owned by the project and unlicensed. | N/A | N/A | Data is shared via EDSA dashboard. | QMiner (JSI platform) | N/A | JSI server | N/A | N/A | At least until the end of the project | N/A | JSI participates in EDSA dashboard data collection and development | Backed up regularly | N/A | No | None | N/A |
11 | Yes | WP1 | JSI | Job repository | JobRepository | Ongoing | No | Both | Job APIs, Job websites, Wikifier processing | > 3.000.000 jobs | EDSA dashboard | This is an extensive cross-lingual dataset that covers a wide spectrum of European countries. Each job announcement is processed and semantically annotated. | JSON | Data is available within the project; statistics are available on request. | For annotation with skills - SARO ontology, for Wikifier annotations - Wikifier.org, for geolocations - Geonames | Raw data will be owned by the project and unlicensed. | No | N/A | Data is shared via EDSA dashboard. | QMiner (JSI platform) | N/A | JSI server | N/A | N/A | At least until the end of the project | N/A | JSI participates in EDSA dashboard data collection and development | Backed up regularly | N/A | N/A | N/A | N/A |
12 | Yes | WP2 | ODI | Related course data regarding similar modules and training offerings across the EU | DataScienceCourses | Finished | No | Collected | Course websites | 600 KB | Internal use for development of curricula and learning materials. External use for identifying useful courses | None. The data will provide a useful resource for those wishing to understand what courses are available. | Systematic search and review of available data science courses. The search terms were Data Science, Big Data, Data Analytics, Business Analytics, Machine Learning, Distributed Computing, Advanced Computing Data Science Stream, Data Analytics stream. | Metadata has been published alongside the data | https://theodi.github.io/data-science-courses-in-europe-2016/ | The data is licensed under a Creative Commons CC-BY 4.0 licence | Yes | N/A | GitHub/EDSA website | Github, EDSA website | Yes | https://theodi.github.io/data-science-courses-in-europe-2016/ | https://theodi.github.io/data-science-courses-in-europe-2016/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Until the end of the project | < 1GB | ODI lead data management and curation | Backed up to an internal ODI repository | 0.5 days per month | N | None | N/A |
13 | No | WP2 | Persontyle | Datasets for course examples and exercises | Using namespace notation to specify R packages: sml::poly4, sml::poly4b, sml::kmeans, sml::seeds, car::Duncan, car::Davis, datasets::car, datasets::HairEyeColor, datasets::Airquality, datasets::swiss, bestGLM::zprostate, MASS::menarche | Finished | No | Both | Various - many from third party R packages students download from CRAN. Some in an author developed package. | 12 small datasets. <1MB | Students in the "Essentials of Data Analytics and Machine Learning" course. | Third party R packages students download from CRAN. Some in an author developed package hosted on CRAN | None | The datasets will be used within learning activities offered as part of the "Essentials of Data Analytics and Machine Learning" course. They are stored in the sml R package. | Package documentation (except, currently, for those in the sml package) | GNU GPL V3, http://www.gnu.org/licenses/gpl-3.0.en.html | Yes | N/A | Via R packages, searchable online. | CRAN | Yes | CRAN, except for sml package which is currently available on the EDSA portal and will move to CRAN when finished. | https://vincentarelbundock.github.io/Rdatasets/datasets.html | GNU GPL V3, http://www.gnu.org/licenses/gpl-3.0.en.html | As long as the owners do not remove them. If the datasets are no longer accessible, other similar datasets will be used in the module. | < 1MB | Persontyle lead data management and curation, third parties for collected data | Relying on CRAN | None | No | None | N/A |
14 | No | WP2 | TU/e | Event log from a municipality process | Finished | No | Collected | Dutch municipality | 200 KB | Users interested in real life event logs. | We have a large collection of real life event logs at http://data.3tu.nl/repository/collection:event_logs_real | Management through 3TU data center | Includes number of traces, events, attributes, timespan, etc. | Non-commercial licence | No | The data is shared and publicly available for non-commercial reuse. Its non-commercial licence means it cannot be published openly. | Yes | unknown | The data is stored in long-term storage so will be available for years to come | 200 KB | 3TU | Reliant on third party. If the dataset becomes unavailable we will use a similar one in the online module. | none | No | None | No | ||||||
15 | No | WP3 | JSI | Repository statistics on downloads and views of educational resources | RepositoryStatistics | Available, regularly updated | No | Collected | videolectures.net | views and comments for each videolecture | internal analysis, curriculum development, external demand analysis | None. Provides evidence of resource usage and basis for improving curriculum, content and course structure. | CSV is used for Videolectures API | Videolectures REST api documentation. An MD Readme file is available for download | https://github.com/innanoval/edsa-videolectures-statistics-dataset-1/tree/gh-pages/data | The data is licensed under a Creative Commons CC-BY 4.0 licence | Yes | N/A | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | Yes | JSI server | https://github.com/innanoval/edsa-videolectures-statistics-dataset-1/tree/gh-pages/data | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | the data will be available after the project ends as part of the project's learning materials | < 1GB | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | Approximately 1 day per month during the project’s lifetime | No | None | N/A |
16 | No | WP3 | OU | Learning Analytics data generated from the EDSA Online Courses portal | EDSAOnlineCoursesLA | Ongoing. Generation of data started on 01/02/2016, together with the launch of the first EDSA courses | No | Generated | http://courses.edsa-project.eu | Not yet known | Course producers can get an understanding of how their courses are being used. Learners can monitor their learning progress. | Not many Learning Analytics datasets are publicly available. The OU has recently published a similar dataset: https://analyse.kmi.open.ac.uk/open_dataset | The xAPI specification is used for expressing the data; the open source Learning Locker software is used for storing and visualising the data. | Introduction to the xAPI (or Tin Can API): https://tincanapi.com/overview/. Introduction to Learning Locker: https://learninglocker.net | https://tincanapi.com/overview/ https://learninglocker.net https://alexmikro.github.io/learning-analytics-dataset-from-the-edsa-online-courses-portal/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Yes | N/A | Via the EDSA website / Github | We have setup a dedicated EDSA Learning Locker. This was chosen for the reasons outlined in https://learninglocker.net/benefits/ | Yes | EDSA Learning Locker | https://alexmikro.github.io/learning-analytics-dataset-from-the-edsa-online-courses-portal/ | CC-BY 4.0 | Until the end of project | Not yet known | OU lead data management and curation. | Relying on the backup procedures of the OU, as the dataset is hosted on an OU server. | Server storage has already been purchased. Effort for analysing the data has been allocated in Task 3.4. | No, this dataset is anonymised. | None | N/A |
17 | No | WP3 | JSI | Internal logs of elearning systems | InternalLogs | Available, regularly updated | No | Collected | videolectures.net | for videolectures: 20.000 videos, 17.431 lectures, 12.998 authors, 952 events, 579 categories | internal demand analysis | None. Provides evidence of resource usage and basis for improving curriculum, content and course structure. | JSON is used for Videolectures API | Videolectures REST api documentation | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Privacy. Data requires anonymisation and/or aggregation, and at the moment the use case for anonymised data is not clear. | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | N/A | JSI server | N/A | N/A | at least until the end of project | N/A | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | N/A | Yes | Privacy | N/A |
18 | No | WP3 | JSI | Statistics of course registration, participation and completion | StatisticsForCourses | Available, regularly updated | No | Collected | videolectures.net | for videolectures - available per videolecture, per viewer | internal demand analysis | None. Provides basis for improving curriculum, content and course structure. | JSON is used for Videolectures API | Videolectures REST api documentation | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Privacy. Data that does not contain privacy issues might be publishable | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | N/A | JSI server | N/A | N/A | at least until the end of project | < 1GB | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | N/A | No | None | N/A |
19 | No | WP3 | JSI | Aggregated statistics of engagement with the developed courses and educational resources | AggregatedStatistics | Available, regularly updated | No | Generated | videolectures.net | for videolectures - available per videolecture, per viewer | internal demand analysis | None. Provides evidence of adoption and basis for improving curriculum, content and course structure. | JSON is used for Videolectures API | Videolectures REST api documentation | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Privacy. Data that does not contain privacy issues might be publishable | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | N/A | JSI server | N/A | N/A | at least until the end of project | < 1GB | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | N/A | No | None | N/A |
20 | No | WP3 | TU/e | Recorded behavior of students following the first session of the process mining MOOC | CourseraMOOCprocmin001 | Ongoing | No | collected | coursera.org | several large tables | learning analytics within EDSA | every Coursera course has this data recorded | Data collection is managed by Coursera | There is no external link to the metadata | N/A | Raw data is owned by TU/e and cannot be shared due to Coursera restrictions of use. | No | Restrictions of use from the data provider | N/A | The data is collected by and stored on a Coursera repository. | No | Coursera | N/A | N/A | N/A | around 1 GB | Joos Buijs, Tu/e lead data management and curation | relying on coursera | N/A | Yes | Student identifiable data | No |
21 | No | WP4 | SOTON | Web server logs and Google analytics of project website access | WebsiteAnalytics | Ongoing | No | Collected | http://edsa-project.eu | 1 website | Internal analysis for dissemination and community analysis. Secondary use for implicit demand analysis. | None. Provides evidence of engagement and basis for UX improvement. | Quantitative recording of website traffic via Google Analytics dashboard, analysed using a variety of analytic tools. | Sessions, Page views, Demographics, User Flow, Bounce rate, | There is no external link to the metadata | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | User privacy. The data can be aggregated and published under an open license | Analysed data will be made available throughout deliverable reports in WP4. | Internal institutional Soton/OU repositories | N/A | https://analytics.google.com/ | N/A | N/A | at least until the end of project | < 1GB | OU lead data management and curation. Soton contribute | Backed up remotely | Free storage. 0.5 day per month | Yes | Google analytics and web log data contain information which could be used to identify individuals | N/A |
22 | No | WP4 | SOTON | Generated social media engagement data | SocialMediaEngagements | Ongoing | No | Collected | 1 Twitter Account | Internal analysis for community strength and project dissemination. | None that relate to EDSA. Provides evidence of engagement with the project and the effectiveness of dissemination activities. Provides basis for understanding what content users find most engaging. | Regular access of data from analytics.twitter.com | Tweets, Impressions, Profile Visits, Followers, Mentions | https://analytics.twitter.com/user/edsa_project/home | Data will be licensed in compliance with each social network's terms and conditions | No | Data sharing needs to comply with individual site licenses. However, the majority of social networks do not permit collection, harvesting and republication of data | Dashboard on EDSA website. Deliverable reports in WP4. | Internal institutional Soton repositories | Data not accessible directly without tools. Required Twitter harvester. | Until the end of the project | < 1GB | Soton lead data management and curation. | Backed up remotely | Free storage. 1 day per month | Yes | Subject to license conditions from the social network | N/A | ||||
23 | No | WP5 | ideXlab | List of project exploitation results – collaborations, institutional and geographical beneficiaries, | ProjectExploitation | Ongoing | No | Generated | Project partners | Variable | Internal analysis for results to be exploited and targets | None. Provides data on dissemination activity, network and results. | Report detailing results from interviews and exploitation activities | N/A | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Confidentiality, data obsolescence | Deliverable reports in WP5. | Google docs shared document | No | Data not accessible directly. | N/A | N/A | Until the end of the project | 200 KB | ideXlab lead data management curation | Backed up remotely | Free storage. 1 day per month | No | None | NA |
24 | No | WP5 | ODI | The EDSA Register | EDSARegister | Ongoing | No | Generated | Project partners | <500KB | Anyone interested in understanding the datasets used within the EDSA project. Internal management tool. | None | Project partners update every three months until the end of the project. ODI responsible for conversion to CSV and publication as open data. | A README.md file is available detailing the data structure and basic usage. | https://theodi.github.io/european-data-science-academy-register/ | This dataset is published on Github, under a CC-BY licence. | Yes | N/A | Via Github and via the EDSA website (http://edsa-project.eu/resources/datasets/) | Google docs shared document and Github | Yes | https://theodi.github.io/european-data-science-academy-register/ | https://theodi.github.io/european-data-science-academy-register/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | < 500MB | ODI lead data management and curation, other WP1 partners will contribute | Stored in external repositories - EDSA website and Github | Stored in external repositories - EDSA website and Github; approximately 2 days per month effort for maintenance. | Yes | Has links to de-anonymised data | No |
25 | No | WP3 | TU/e | FutureLearn course run data 'Introduction to process mining with ProM' | FLMOOC-procmin1 | Ongoing | Yes | Collected | https://www.futurelearn.com/admin/courses/process-mining/1/ | <10Mb | Learning analytics within EDSA | One of several datasets of FutureLearn course behavior data, interesting for comparison between sessions and runs. | Data collection is managed by FutureLearn | https://partners.futurelearn.com/data/datasets/ | N/A | Raw data is owned by TU/e and cannot be shared outside the EDSA project due to FutureLearn restrictions of use. | No | User privacy. The data can be aggregated and published under an open license | Analysed data will be made available throughout deliverable reports in WP3. | Local storage at TU/e, to store privacy sensitive data. | N/A | FutureLearn | N/A | N/A | N/A | <10MB | Joos Buijs, Tu/e lead data management and curation | Relying on FutureLearn | N/A | Yes | Student identifiable data | No |
26 | No | WP2 | ODI | Monthly Rainfall (mm) Totals for Selected Stations in Tanzania, 2014 | Tanzania_Rainfall | Ongoing | Yes | Collected | http://training.theodi.org/InPractice/inpractice1/course/en/exercises/Tanzania_Rainfall.pdf | <66KB | Anyone interested in understanding the exercises within the finding stories curriculum | None. | None | The datasets will be used within learning activities offered as part of the "Finding stories in Data" course. | Modules 4 - Gathering Data | This dataset is published on Github, under a CC-BY licence. | Yes | N/A | Via Github and via the EDSA website (http://courses.edsa-project.eu/course/view.php?id=52) | Github, EDSA website | N/A | EDSA Website | http://training.theodi.org/InPractice/inpractice1/course/en/exercises/Tanzania_Rainfall.pdf | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | N/A | <66KB | David Tarrant, ODI | Github | N/A | No | None | N/A |
27 | No | WP2 | ODI | BBC RSS Feed | BBCnews | Ongoing | Yes | Collected | http://feeds.bbci.co.uk/news/rss.xml | 1 Twitter account | Anyone interested in understanding the exercises within the finding stories curriculum | None | None | The datasets will be used within learning activities offered as part of the "Finding stories in Data" course. | Modules 4 - Gathering Data | This data is publicly available from an external source | Yes | N/A | Via Github and via the EDSA website (http://courses.edsa-project.eu/course/view.php?id=52) | N/A | N/A | EDSA Website | http://feeds.bbci.co.uk/news/rss.xml | N/A | N/A | Unknown | David Tarrant, ODI | N/A | N/A | No | None | N/A |
28 | No | WP2 | ODI | Health Facility list ratings Tanzania | healthfacilitiy | Ongoing | Yes | Collected | https://drive.google.com/file/d/0B1VBoooQ3X5jeEQycHo4OG4tclE/view | <22KB | Anyone interested in understanding the exercises within the finding stories curriculum | None | None | The datasets will be used within learning activities offered as part of the "Finding stories in Data" course. | Modules 4 - Gathering Data | This dataset is published on Github, under a CC-BY licence. | Yes | N/A | Via Github and via the EDSA website (http://courses.edsa-project.eu/course/view.php?id=52) | Github, EDSA website | N/A | EDSA Website | https://drive.google.com/file/d/0B1VBoooQ3X5jeEQycHo4OG4tclE/view | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | N/A | <22KB | David Tarrant, ODI | Github | N/A | No | None | N/A |
29 | No | WP2 | ODI | Louisiana Secretary of State Officials | Dataset1 | Ongoing | Yes | Collected | http://www.sos.la.gov/tabid/136/Default | >2.5MB | Anyone interested in understanding the exercises within the finding stories curriculum | None | None | The datasets will be used within learning activities offered as part of the "Finding stories in Data" course. | Module 6 - Cleaning Data | This dataset is published on Github, under a CC-BY licence. | Yes | N/A | Via Github and via the EDSA website (http://courses.edsa-project.eu/course/view.php?id=52) | Github, EDSA website | N/A | EDSA Website | http://training.theodi.org/resources/dataset1.xls | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | N/A | >2.5MB | David Tarrant, ODI | Github | N/A | No | None | N/A |
30 | No | WP2 | ODI | US Projects Dataset | Dataset2 | Ongoing | Yes | Collected | http://www.itdashboard.gov/data_feeds. | <2MB | Anyone interested in understanding the exercises within the finding stories curriculum | None. | None | The datasets will be used within learning activities offered as part of the "Finding stories in Data" course. | Module 6 - Cleaning Data | This dataset is published on Github, under a CC-BY licence. | Yes | N/A | Via Github and via the EDSA website (http://courses.edsa-project.eu/course/view.php?id=52) | Github, EDSA website | N/A | EDSA Website | http://training.theodi.org/resources/dataset2.csv | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | N/A | <2MB | David Tarrant, ODI | Github | N/A | No | None | N/A |
31 | No | WP2 | ODI | UK GP Earnings | Dataset3 | Ongoing | Yes | Collected | http://data.gov.uk/dataset/gp-earnings-and-expenses-2009-10 | <3KB | Anyone interested in understanding the exercises within the finding stories curriculum | None. | None | The datasets will be used within learning activities offered as part of the "Finding stories in Data" course. | Module 6 - Cleaning Data | This dataset is published on Github, under a CC-BY licence. | Yes | N/A | Via Github and via the EDSA website (http://courses.edsa-project.eu/course/view.php?id=52) | Github, EDSA website | N/A | EDSA Website | http://training.theodi.org/resources/dataset3.csv | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | N/A | <3KB | David Tarrant, ODI | Github | N/A | No | None | N/A |
1 | Data set reference and name | Data set description | Standards | Data sharing | Archiving and preservation | |||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | Work Package | Organisation | Dataset Title | Dataset Identifier | Status - July 2016 (ongoing, in progress, due date) | New entry to data management plan since M6? Yes/No | Generated or collected | Origin | Scale | Who is this useful for? | Similar existing dataset and possibility for integration? Value of this new dataset? | What standards and methodologies will be utilised for data collection and management? | Outline the metadata, documentation or other supporting material that should accompany the data for it to be interpreted correctly | Status and location of metadata, documentation or other supporting material | Licensing, data protection, ownership and copyright | Can the data be published under an open licence? | If the data cannot be published openly, why? | How will the data be shared? (including access procedures, dissemination, software/tools needed for enabling reuse) | Which repository will be used for the data? Why this repository? | Is it ready to be published? | Current location of dataset | Dataset Link | Licence | How long should the data be preserved? How will it exceed the length of the project if necessary? | Approx end volume | Who is responsible in your organisation for the data management and curation? | Quality assurance and back up procedures? | Associated costs and how these will be covered - do you need to purchase storage? How much time will it take for a person to manage the data - how will this be covered? |
3 | WP1 | ODI | Corpora of crawled web-based adverts from LinkedIn | WebSiteHarvest | Finished | No | Collected | 46 terms, 31 languages, 47 countries, 1 harvest per day, 2162 data points per day | Internal demand analysis only. | Many datasets are collected in this area; however, due to the specific nature of this study, collection of new data is required and integration with existing datasets is not viable. The value of this dataset comes from the provision of an up-to-date snapshot of current data science skills needs across the EU. | All data collected is translated into CSV format. | Data will not be available for reuse or accessible by anyone outside of the project. The data collected will be used for internal analysis to inform the creation of curriculum. | Metadata is not publicly available | The terms of the LinkedIn user agreement now forbid harvesting and collection of data without express permission. When the data was collected, this was not the case. https://www.linkedin.com/legal/user-agreement?trk=hb_ft_userag | No | The terms of the LinkedIn user agreement forbid harvesting and collection of data without express permission. "Use manual or automated software, devices, scripts robots, other means or processes to access, “scrape,” “crawl” or “spider” the Services or any related data or information;" https://www.linkedin.com/legal/user-agreement?trk=hb_ft_userag | Data will not be shared or available for reuse | Using Github so that the data stays close to its usage and can be used quickly and easily. | N/A | N/A | N/A | N/A | Until the end of the project | <1Gb | ODI lead data management and curation, other WP1 partners will contribute | Backed up to an internal ODI repository | Approximately 1 day person effort per month | |
4 | WP1 | ODI | Aggregated statistics of European skill demand based on web-based job adverts | WebSiteStatistics | Finished | No | Collected | Adzuna API, Trovit | Varied | Populating the dashboard, internal demand analysis and to inform curriculum development. | Many datasets are collected in this area; however, due to the specific nature of this study, collection of new data is required and integration with existing datasets is not viable. The value of this dataset comes from the provision of an up-to-date snapshot of current data science skills needs across the EU. | All data collected is translated into CSV format. | The Adzuna data is accessible via the Adzuna API. The Trovit data will not be available for reuse or accessible by anyone outside of the project. | Metadata is not publicly available | The data will be available for use via the EDSA dashboard. However, it will not be available to download as this contravenes Trovit’s terms and conditions. | No | Trovit’s terms of use prohibit the use of their data. The research exception allows us to use the data but not to make it available in raw format for others to consume for commercial purposes. | Via the EDSA dashboard | In an internal JSI repository | N/A | N/A | N/A | N/A | Until the end of the project | <1Gb | ODI lead data management and curation, other WP1 partners will contribute | Backed up in an internal JSI repository | Approximately 1 day person effort per month |
5 | WP1 | ODI | Individual results from demand analysis | IndividualResponses | Ongoing | No | Generated | Interviews and survey | 584 surveys, 108 interviews at present. Online survey still open. | Internal demand analysis. | A number of surveys exist in this domain but their data is not available to this project. This data will enable EDSA to build up a country-by-country view of current capacity and requirements for data science skills. | Data collection methods outlined in D1.4. Translated into CSV format. | Data will not be shared or available for reuse. The data collected will be used for internal analysis to inform the creation of curriculum. Anonymised data will be publicly available. | Metadata is not publicly available | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Data protection of personal data | Data will not be shared or available for reuse | Internal ODI repository | N/A | N/A | N/A | N/A | Until the end of the project | <100Mb | ODI lead data management and curation, other WP1 partners will contribute | Backed up to an internal ODI repository | Approximately 1 day effort per month |
6 | WP1 | ODI | Summary data from surveys and interviews | DemandAnalysisSummary | Finished | Yes | Generated | Interviews and survey | 585 surveys, 108 interviews. | External analysis of respondents who took the surveys and interviews. | None | Data collection methods outlined in D1.4. Translated into CSV format. | A README.md file is available detailing the data structure and basic usage. | https://theodi.github.io/edsa-demand-analysis-summary-data/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Yes | N/A | Data will be available to access from the EDSA website and the ODIs Github repository. | Github/ EDSA website | Yes | https://theodi.github.io/edsa-demand-analysis-summary-data/ | https://theodi.github.io/edsa-demand-analysis-summary-data/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | <100Mb | ODI lead data management and curation, other WP1 partners will contribute | Stored in external repositories - EDSA website and Github | Github free and public |
7 | WP1 | ODI | De-identified survey responses from demand analysis | DeidentifiedResponses | Ongoing | No | Generated | Survey | 496 survey results | External analysis of results and trends by anyone who wishes to gather survey data in the area of data science | There are a number of other surveys that have been aggregated that we can compare our results to and use these results if necessary. This dataset has the same eventual value to others in the area | Data collection methods outlined in D1.4. Translated into CSV format. | A README.md file is available detailing the data structure and basic usage. | http://davetaz.github.io/quantitative-data-from-edsa-demand-analysis-/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Yes | N/A | Data will be available to view on the EDSA dashboard and accessible for free in the EDSA dashboard Github repository. | Github/ EDSA Dashboard on website | Yes | Github/ EDSA Dashboard on website | http://davetaz.github.io/quantitative-data-from-edsa-demand-analysis-/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | <100Mb | ODI lead data management and curation, other WP1 partners will contribute | Stored in external repositories - EDSA website and Github | Github free and public |
8 | WP1 | ODI | Recordings and transcriptions of interviews | InterviewTranscipts | Finished | No | Generated | Interviews | 108 transcripts, 108 recordings | Internal demand analysis only | No similar datasets exist that are usable for this project. The interviews provide insights and data points for use in the demand analysis. | Qualitative research methodology for collection outlined in D1.4 | Data will not be available for reuse or accessible by anyone outside of the project. The data collected will be used for internal analysis to inform the creation of curriculum. | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Data protection of personal data | Data will not be shared or available for reuse. The data collected will be used for internal review to inform the creation of curriculum and will only be available publicly as anonymous data | Internal ODI repository | N/A | N/A | N/A | N/A | Until the end of the project | < 3GB | ODI lead data management and curation, other WP1 partners will contribute | Backed up to an internal ODI repository | As part of the subcontracting costs of WP1 |
9 | WP1 | ODI | ideXlab search platform results | ExpertIdentification | Ongoing | No | Collected | Research publications | Final scale not yet known as collection is ongoing | Internal demand analysis and to inform curriculum development. Provides insights into offer side of skills analysis. | Not in this area. This dataset will provide validation of the demand analysis and form the basis for further insights. | The ideXlab search engine will use the sampling approach outlined in D1.2 for data collection. CSV data will be created | Data will not be available for reuse or accessible by anyone outside of the project. The data collected will be used for internal analysis to inform the creation of curriculum. | Accompanying document to explain data structure. This will not be made open. | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Data protection of personal data | The data will not be shared due to restrictions on the use of personal data. | ideXlab search platform | N/A | N/A | N/A | N/A | Until the end of the project | Est. 1000 returns | ideXlab lead data management and curation, other WP1 partners will contribute | Backed up to an internal ideXlab repository | Approx 2 person days per month. No other external costs |
10 | WP2 | ODI | Related course data regarding similar modules and training offerings across the EU | DataScienceCourses | Finished | No | Collected | Course websites | 600 KB | Internal use for development of curricula and learning materials. External use for identifying useful courses | None. The data will provide a useful resource for those wishing to understand what courses are available. | Systematic search and review of available data science courses. The search terms were Data Science, Big Data, Data Analytics, Business Analytics, Machine Learning, Distributed Computing, Advanced Computing Data Science Stream, Data Analytics stream. | Metadata has been published alongside the data | https://theodi.github.io/data-science-courses-in-europe-2016/ | The data is licensed under a Creative Commons CC-BY 4.0 licence | Yes | N/A | GitHub/EDSA website | Github, EDSA website | Yes | https://theodi.github.io/data-science-courses-in-europe-2016/ | https://theodi.github.io/data-science-courses-in-europe-2016/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Until the end of the project | < 1GB | ODI lead data management and curation | Backed up to an internal ODI repository | 0.5 days per month |
11 | WP2 | Persontyle | Datasets for course examples and exercises | Using namespace notation to specify R packages: sml::poly4, sml::poly4b, sml::kmeans, sml::seeds, car::Duncan, car::Davis, datasets::car, datasets::HairEyeColor, datasets::Airquality, datasets::swiss, bestGLM::zprostate, MASS::menarche | Finished | Yes | Both | Various - many from third party R packages students download from CRAN. Some in an author developed package. | 12 small datasets. <1MB | Students in the "Essentials of Data Analytics and Machine Learning" course. | Third party R packages students download from CRAN. Some in an author developed package hosted on CRAN | None | The datasets will be used within learning activities offered as part of the "Essentials of Data Analytics and Machine Learning" course. They are stored in the sml R package. | Package documentation (except, currently, for those in the sml package) | GNU GPL V3, http://www.gnu.org/licenses/gpl-3.0.en.html | Yes | N/A | Via R packages, searchable online. | CRAN | Yes | CRAN, except for sml package which is currently available on the EDSA portal and will move to CRAN when finished. | https://vincentarelbundock.github.io/Rdatasets/datasets.html | GNU GPL V3, http://www.gnu.org/licenses/gpl-3.0.en.html | As long as the owners do not remove them. If the datasets are no longer accessible, other similar datasets will be used in the module. | < 1MB | Persontyle lead data management and curation, third parties for collected data | Relying on CRAN | None |
12 | WP2 | TU/e | Event log from a municipality process | Ongoing | Yes | Collected | Dutch municipality | 200 KB | Users interested in real life event logs. | We have a large collection of real life event logs at http://data.3tu.nl/repository/collection:event_logs_real | Management through 3TU data center | Includes number of traces, events, attributes, timespan, etc. | Non-commercial licence | No | The data is shared and publicly available for non-commercial reuse. Its non-commercial licence means it cannot be published openly. | Yes | unknown | As long as the owners do not remove them. If the datasets are no longer accessible, other similar datasets will be used in the module. | 200 KB | 3TU | Reliant on third party. If the dataset becomes unavailable we will use a similar one in the online module. | none | ||||||
13 | WP3 | JSI | Repository statistics on downloads and views of educational resources | RepositoryStatistics | Available, regularly updated | No | Collected | videolectures.net | views and comments for each videolecture | internal analysis, curriculum development, external demand analysis | None. Provides evidence of resource usage and basis for improving curriculum, content and course structure. | CSV is used for Videolectures API | Videolectures REST api documentation. An MD Readme file is available for download | https://github.com/innanoval/edsa-videolectures-statistics-dataset-1/tree/gh-pages/data | The data is licensed under a Creative Commons CC-BY 4.0 licence | Yes | N/A | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | Yes | JSI server | https://github.com/innanoval/edsa-videolectures-statistics-dataset-1/tree/gh-pages/data | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | the data will be available after the project ends as part of the project's learning materials | < 1GB | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | Approximately 1 day per month during the project’s lifetime |
14 | WP3 | OU | Learning Analytics data generated from the EDSA Online Courses portal | EDSAOnlineCoursesLA | Ongoing. Generation of data started on 01/02/2016, together with the launch of the first EDSA self-study courses | No | Generated | http://courses.edsa-project.eu | Not yet known | Course producers can get an understanding of how their courses are being used. Learners can monitor their learning progress. | Not many Learning Analytics datasets are publicly available. The OU has recently published a similar dataset: https://analyse.kmi.open.ac.uk/open_dataset | The xAPI specification is used for expressing the data; the open source Learning Locker software is used for storing and visualising the data. | Introduction to the xAPI (or Tin Can API): https://tincanapi.com/overview/. Introduction to Learning Locker: https://learninglocker.net | https://tincanapi.com/overview/ https://learninglocker.net https://alexmikro.github.io/learning-analytics-dataset-from-the-edsa-online-courses-portal/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Yes | N/A | Via the EDSA website / Github | We have setup a dedicated EDSA Learning Locker. This was chosen for the reasons outlined in https://learninglocker.net/benefits/ | Yes | EDSA Learning Locker | https://alexmikro.github.io/learning-analytics-dataset-from-the-edsa-online-courses-portal/ | CC-BY | At least until the end of project | Not yet known | OU lead data management and curation. | Relying on the backup procedures of the OU, as the dataset is hosted on an OU server. | Server storage has already been purchased. Effort for analysing the data has been allocated in Task 3.4. |
15 | WP3 | JSI | Internal logs of elearning systems | InternalLogs | Available, regularly updated | No | Collected | videolectures.net | for videolectures: 20.000 videos, 17.431 lectures, 12.998 authors, 952 events, 579 categories | internal demand analysis | None. Provides evidence of resource usage and basis for improving curriculum, content and course structure. | JSON is used for Videolectures API | Videolectures REST api documentation | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Privacy. Data requires anonymisation and/or aggregation, and at the moment the use case for anonymised data is not clear. | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | N/A | JSI server | N/A | N/A | at least until the end of project | N/A | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | N/A |
16 | WP3 | JSI | Statistics of course registration, participation and completion | StatisticsForCourses | Available, regularly updated | No | Collected | videolectures.net | for videolectures - available per videolecture, per viewer | internal demand analysis | None. Provides basis for improving curriculum, content and course structure. | JSON is used for Videolectures API | Videolectures REST api documentation | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Privacy. Data that does not contain privacy issues might be publishable | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | N/A | JSI server | N/A | N/A | at least until the end of project | < 1GB | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | N/A |
17 | WP3 | JSI | Aggregated statistics of engagement with the developed courses and educational resources | AggregatedStatistics | Available, regularly updated | No | Generated | videolectures.net | for videolectures - available per videolecture, per viewer | internal demand analysis | None. Provides evidence of adoption and basis for improving curriculum, content and course structure. | JSON is used for Videolectures API | Videolectures REST api documentation | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Privacy. Data that does not contain privacy issues might be publishable | Available to see at videolectures website; described as part of WP3 deliverables | videolectures repository. Proximity to data source. | N/A | JSI server | N/A | N/A | at least until the end of project | < 1GB | JSI lead data management and curation. OU contribute | videolectures - relying on internal quality assurance & back up procedures | N/A |
18 | WP3 | TU/e | Recorded behavior of students following the first session of the process mining MOOC | CourseraMOOCprocmin001 | Ongoing | Yes | Collected | coursera.org | several large tables | learning analytics within EDSA | every Coursera course has this data recorded | Data collection is managed by Coursera | There is no external link to the metadata | N/A | Raw data is owned by TU/e and cannot be shared due to Coursera restrictions of use. | No | Restrictions of use from the data provider | N/A | The data is collected by and stored on a Coursera repository. | No | Coursera | N/A | N/A | N/A | around 1 GB | Joos Buijs, TU/e lead data management and curation | relying on Coursera | N/A |
19 | WP4 | SOTON | Web server logs and Google analytics of project website access | WebsiteAnalytics | Ongoing | No | Collected | http://edsa-project.eu | 1 website | Internal analysis for dissemination and community analysis. Secondary use for implicit demand analysis. | None. Provides evidence of engagement and basis for UX improvement. | Quantitative recording of website traffic via Google Analytics dashboard, analysed using a variety of analytic tools. | Sessions, Page views, Demographics, User Flow, Bounce rate, | There is no external link to the metadata | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | User privacy. The data can be aggregated and published under an open license | Analysed data will be made available throughout deliverable reports in WP4. | Internal institutional Soton/OU repositories | N/A | https://analytics.google.com/ | N/A | N/A | at least until the end of project | < 1GB | OU lead data management and curation. Soton contribute | Backed up remotely | Free storage. 0.5 day per month |
20 | WP4 | SOTON | Generated social media engagement data | SocialMediaEngagements | Ongoing | No | Collected | 1 Twitter Account | Internal analysis for community strength and project dissemination. | None that relate to EDSA. Provides evidence for engagement with project, effectiveness of dissemination activities. Provides basis for understanding what content users find most engaging. | Regular access of data from analytics.twitter.com | Tweets, Impressions, Profile Visits, Followers, Mentions | https://analytics.twitter.com/user/edsa_project/home | Data will be licensed in compliance with each social network's terms and conditions | No | Data sharing needs to comply with individual site licenses. However, the majority of social networks do not permit collection, harvesting and republication of data | Dashboard on EDSA website. Deliverable reports in WP4. | Internal institutional Soton repositories | Data not accessible directly without tools. Requires Twitter harvester. | Until the end of the project | < 1GB | Soton lead data management and curation. | Backed up remotely | Free storage. 1 day per month | ||||
21 | WP5 | ideXlab | List of project exploitation results – collaborations, institutional and geographical beneficiaries, | ProjectExploitation | Ongoing | No | Generated | Project partners | Variable | Internal analysis for results to be exploited and targets | None. Provides data on dissemination activity, network and results. | Report detailing results from interviews and exploitation activities | N/A | N/A | Raw data will be owned by the project and unlicensed. It will not be available for reuse. | No | Confidentiality | Deliverable reports in WP5. | Google docs shared document | N/A | N/A | N/A | Until the end of the project | < 500MB | ideXlab lead data management and curation | Backed up remotely | Free storage. 1 day per month | |
22 | WP5 | ODI | The EDSA Register | EDSARegister | Ongoing | Yes | Generated | Project partners | <500KB | Anyone interested in understanding the datasets used within the EDSA project. Internal management tool. | None | Project partners update every three months until the end of the project. ODI responsible for conversion to CSV and publication as open data. | A README.md file is available detailing the data structure and basic usage. | https://theodi.github.io/european-data-science-academy-register/ | This dataset is published on Github, under a CC-BY licence. | Yes | N/A | Via Github and via the EDSA website (http://edsa-project.eu/resources/datasets/) | Google docs shared document and Github | Yes | https://theodi.github.io/european-data-science-academy-register/ | https://theodi.github.io/european-data-science-academy-register/ | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | < 500MB | ODI lead data management and curation, other WP1 partners will contribute | Stored in external repositories - EDSA website and Github | Stored in external repositories - EDSA website and Github; approximately 2 days per month effort for maintenance. |
1 | Category | Field | Description |
---|---|---|---|
2 | Dataset Reference and Name | Work Package | Work package in which the dataset was used or produced |
3 | Dataset Reference and Name | Organisation | Organisation responsible for collecting or using the data |
4 | Dataset Reference and Name | Dataset Title | Title of the dataset |
5 | Dataset Reference and Name | Dataset Identifier | Unique identifier for the dataset |
6 | Dataset Reference and Name | Status - January 2017 (ongoing, in progress, due date) | |
7 | Dataset Reference and Name | New entry to data management plan? Yes/No | |
8 | Data set description | Generated or collected | Whether the dataset was Collected (e.g. by harvesting) or Generated (e.g. by analysing another dataset) |
9 | Data set description | Origin (if collected) | Name of service or other data sources |
10 | Data set description | Scale | Description of size and content of dataset |
11 | Data set description | Who is this useful for? | |
12 | Data set description | Similar existing dataset and possibility for integration? Value of this new dataset? | |
13 | Standards | What standards and methodologies will be utilised for data collection and management? | |
14 | Standards | Outline the metadata, documentation or other supporting material that should accompany the data for it to be interpreted correctly | |
15 | Standards | Status and location of metadata, documentation or other supporting material | |
16 | Data sharing | Licensing, data protection, ownership and copyright | Notes on the copyright, rights and ownership of the dataset |
17 | Data sharing | Can the data be published under an open licence? | Indicates whether the data can be openly published (Yes, No, Maybe) |
18 | Data sharing | Reasons why the data cannot be shared | Description of why a dataset cannot or might not be suitable for open publication |
19 | Data sharing | How will the data be shared? (including access procedures, dissemination, software/tools needed for enabling reuse) | Notes on how the data might be shared with others |
20 | Data sharing | Which repository will be used for the data? Why this repository? | The platform or repository used to host the data, and why it was selected |
21 | Data sharing | Is it ready to be published? | Indicates whether the data is ready for publication (Yes, No, N/A if not suitable for publishing) |
22 | Data sharing | Current location of dataset | The current location of the data, e.g. the name of a platform or service |
23 | Data sharing | Dataset Link | This should be a link to the actual, downloadable dataset for any openly published data. N/A if data cannot be published |
24 | Data sharing | Licence | The licence applies to any openly published data. N/A if data cannot be openly published |
25 | Archiving and preservation | How long should the data be preserved? How will it exceed the length of the project if necessary? | |
26 | Archiving and preservation | Approx end volume | |
27 | Archiving and preservation | Who is responsible in your organisation for the data management and curation? | |
28 | Archiving and preservation | Quality assurance and back up procedures? | |
29 | Archiving and preservation | Associated costs and how these will be covered - do you need to purchase storage? How much time will it take for a person to manage the data - how will this be covered? | |
30 | Data Ethics | Are there any ethical or legal issues that can have an impact on sharing this data? Y/N | |
31 | Data Ethics | What are the ethical or legal issues that can occur from sharing this data? |
1 | Field | Code | Description |
---|---|---|---|
2 | Work Package | WP1 | Work package 1 – Demand analysis and advisory board |
3 | Work Package | WP2 | Work package 2 – Curricula and course development |
4 | Work Package | WP3 | Work package 3 – Training delivery and learning analytics feedback |
5 | Work Package | WP4 | Work package 4 – Dissemination and community building |
6 | Work Package | WP5 | Work package 5 – Exploitation |
7 | Organisation | ODI | Open Data Institute |
8 | Organisation | OU | Open University |
9 | Organisation | JSI | Jožef Stefan Institute |
10 | Organisation | SOTON | University of Southampton |
11 | Organisation | ideXlab | ideXlab |
12 | Organisation | TU/e | Eindhoven University of Technology |
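The code lists above can be applied mechanically when reading the register. The following is a minimal sketch, assuming the codes are held in a plain Python dictionary; the names `CODE_LISTS` and `expand_code` are hypothetical and are not part of the EDSA register or any EDSA tooling.

```python
# Code lists copied from the table above (field -> code -> description).
CODE_LISTS = {
    "Work Package": {
        "WP1": "Work package 1 – Demand analysis and advisory board",
        "WP2": "Work package 2 – Curricula and course development",
        "WP3": "Work package 3 – Training delivery and learning analytics feedback",
        "WP4": "Work package 4 – Dissemination and community building",
        "WP5": "Work package 5 – Exploitation",
    },
    "Organisation": {
        "ODI": "Open Data Institute",
        "OU": "Open University",
        "JSI": "Jožef Stefan Institute",
        "SOTON": "University of Southampton",
        "ideXlab": "ideXlab",
        "TU/e": "Eindhoven University of Technology",
    },
}


def expand_code(field: str, code: str) -> str:
    """Return the description for a code, or the code itself if it is not listed."""
    return CODE_LISTS.get(field, {}).get(code.strip(), code)


if __name__ == "__main__":
    print(expand_code("Work Package", "WP3"))
    print(expand_code("Organisation", "SOTON"))
```

The lookup is an exact, case-sensitive match, so a cell value written as "Soton" rather than "SOTON" would be returned unchanged.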
1 | Data set reference and name | Data set description | Standards | Data sharing | Archiving and preservation | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | Timeline - ongoing, in progress, due date | Identifier for the data set produced | Generated or collected | Origin (if collected) | Scale | Who is this useful for? | Similar existing dataset and possibility for integration? Value of this new dataset? | What standards and methodologies will be utilised for data collection and management? | Outline the metadata, documentation or other supporting material that should accompany the data for it to be interpreted correctly | Licensing, data protection, ownership and copyright | How will the data be shared? (including access procedures, dissemination, software/tools needed for enabling reuse) | Repository for the data. Why this repository? | If the data cannot be shared - why? | How long should the data be preserved? How will it exceed the length of the project if necessary? | Approx end volume | Who is responsible in your organisation for the data management and curation? | Quality assurance and back up procedures? | Associated costs and how these will be covered - do you need to purchase storage? How much time will it take for a person to manage the data - how will this be covered? | |||
3 | WP1 | ODI | Corpora of crawled web-based adverts from LinkedIn | Ongoing | WebSiteHarvest | Collected | 46 terms; 31 languages; 47 countries; 1 harvest per day; 2162 data points per day | Internal demand analysis. External research into job and skill demand | Many datasets are collected in this area; however, due to the specific nature of the question, collecting afresh is just as easy and reliable. Wider value is only relevant if you are also doing a data science study with the same methodology. | All data collected is translated into CSV format. | A README.md file is available detailing the data structure and basic usage. | CC-BY 4.0 | Data will be visible via the dashboard and available openly and for free in the dashboard Github repository. | Using Github so that the data stays close to its usage and can be used quickly and easily. | N/A | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | <1Gb | Everyone | No-one; we store everything with external providers, e.g. Github | Github free and public | |
4 | Aggregated statistics of European skill demand based on web-based job adverts | Ongoing | WebSiteStatistics | Generated | TBC - end of June | Internal demand analysis. External research into job and skill demand | as above | as above | as above | CC-BY 4.0 | Data will be made available to view online on the EDSA dashboard. Where possible, open data will be provided so that users are able to access the raw data themselves and look at the findings of the project. | Github/ EDSA Dashboard on website | N/A | N/A | <1Gb | Everyone | No-one; we store everything with external providers, e.g. Github | Github free and public | |||
5 | Individual results from online survey | Ongoing | OnlineResponses | Collected | Individuals | End of M18 | Internal demand analysis. External research into job and skill demand | N/A | JSON | as above | Not available | Not shared due to data protection | Internal ODI repository | Data protection of personal data | Until the end of the project | <100Mb | Project team | Project team | As part of the subcontracting costs of WP1 | ||
6 | Aggregated results from online survey | Ongoing | OnlineResponsesStatistics | Generated | Individuals | End of M18 | External analysis of results and trends by anyone who wishes to gather survey data in the area of data science | There are a number of other aggregated surveys that we can compare our results to, and we can use those results if necessary. This dataset has the same eventual value to others in the area | CSV/JSON | as above | CC-BY 4.0 | Publicly and freely available via the project's GitHub repository | Github/ EDSA Dashboard on website | N/A | As long as Github exists as a minimum. Beyond that a value judgement would have to be made. | <100Mb | Everyone | No-one; we store everything with external providers, e.g. Github | Github free and public | ||
7 | Recordings/transcriptions of interviews | Ongoing | N/A | Collected | Interviews | Approx 30 | Internal analysis only | N/A | Qualitative research methodology for collection outlined in D1.1 | N/A | Participants are asked if they are happy for their interview to be recorded for use within the bounds of the project only. All data will result in anonymised, aggregated data only and will not be able to be used to identify individuals. | Raw data will not be publicly available. The data collected will be used for internal review to inform the creation of curriculum and will only be available publicly as anonymous aggregated data | Internal ODI repository | Privacy | Until the end of the project | Approx 30 recordings, transcriptions and summaries | ODI | Backed up to an internal ODI repository | As part of the subcontracting costs of WP1 | ||
8 | Aggregated statistics of ideXlab search platform results | Sample end of June, ongoing after M6 | expertIdentification | Collected | Publications | TBC - end of June after implementation of application | Internal analysis, and curriculum development, external users of the demand analysis dashboard | Not in this area | Sampling approach outlined in D1.2. CSV data | N/A | Not available | Aggregated data will be available via the EDSA dashboard | ideXlab search platform | Privacy | Until the end of the project | Est. 1000 returns | ideXlab | Backed up to an internal ideXlab repository | Approx 2 person days per month. No other external costs | ||
9 | WP2 | OU | Linked open data sources, such as DBLP and GeoNames | Ongoing | N/A | Collected | DBLP, GeoNames, etc | Ongoing | Users of the project's curricula and learning materials (learners, educators, trainers, etc) | N/A | Systematic search and review of available datasets | The datasets will be used within learning activities offered as part of the project's learning materials | Creative Commons licenses | Will be made available via the interactive elements of the project's learning materials | DBLP, GeoNames, etc | N/A | The data will be available after the project ends as part of the project's learning materials | < 1GB | WP2 partners | Relying on the procedures of the external dataset providers. | Approx 2 person days per month for collecting the data |
10 | Publicly available governmental, financial, network and environmental datasets for each course. | Ongoing | N/A | Collected | Data.gov.uk, www.data.gov, etc | Ongoing | Users of the project's curricula and learning materials (learners, educators, trainers, etc) | N/A | Systematic search and review of available datasets | The datasets will be used within learning activities offered as part of the project's learning materials | Creative Commons licenses | Will be made available via the interactive elements of the project's learning materials | Data.gov.uk, www.data.gov, etc | N/A | The data will be available after the project ends as part of the project's learning materials | < 1GB | WP2 partners | Relying on the procedures of the external dataset providers. | Approx 2 person days per month for collecting the data | ||
11 | Related course data regarding similar modules and training offerings across the EU | Ongoing | DataScienceCourses | Collected | Course websites | TBC | Internal use for development of curricula and learning materials | N/A | Systematic search and review of available courses | Links to other courses etc. | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | EDSA Website | Internal Soton repository, EDSA dashboard | N/A | Until the end of the project | < 1GB | Soton | Backed up remotely, hosted on Google Docs | 0.5 days per month | ||
12 | Subsets derived from the existing data repositories produced for exercises in the learning resources | Ongoing | EDSAExercisesDatasets | Generated | N/A | Ongoing | Users of the project's curricula and learning materials (learners, educators, trainers, etc) | N/A | Building on top of existing datasets for the purposes of the EDSA exercises | The datasets will be used within learning activities offered as part of the project's learning materials | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Will be made available via the interactive elements of the project's learning materials | N/A | N/A | The data will be available after the project ends as part of the project's learning materials | < 500MB | WP2 partners | Backed up remotely, hosted on Google Docs | Approx 2 person days per month for generating the data | ||
13 | WP3 | JSI | Repository statistics on downloads and views of educational resources | for videolectures - available through API | Statistics | Collected | videolectures | views and comments for each videolecture | internal analysis, curriculum development, external demand analysis | N/A | JSON is used for Videolectures API; answers to some general, legal and technical questions are provided at this link: http://videolectures.net/faq/ | Videolectures REST API documentation | for videolectures: answers to some general, legal and technical questions are provided at this link: http://videolectures.net/faq/ | these statistics can be viewed on the videolectures website; described as part of WP3 deliverables | videolectures repository | N/A | the data will be available after the project ends as part of the project's learning materials | < 1GB | JSI(videolectures)/OU | videolectures - relying on internal quality assurance & back up procedures | N/A |
14 | Internal logs of elearning systems | for videolectures - available through API | InternalLogs | Collected | videolectures | for videolectures: 20,000 videos, 17,431 lectures, 12,998 authors, 952 events, 579 categories | internal analysis, demand analysis | N/A | JSON is used for Videolectures API | Videolectures REST API documentation | N/A | described as part of WP3 deliverables | videolectures repository | Privacy | at least until the end of project | N/A | JSI(videolectures)/OU | videolectures - relying on internal quality assurance & back up procedures | N/A | ||
15 | Statistics of course registration, participation and completion | for videolectures - ongoing | Statistics | Collected | videolectures | ongoing | internal analysis, demand analysis | N/A | JSON is used for Videolectures API | Videolectures REST API documentation | N/A | described as part of WP3 deliverables | videolectures repository | N/A | at least until the end of project | < 1GB | JSI(videolectures)/OU | videolectures - relying on internal quality assurance & back up procedures | N/A | ||
16 | Aggregated statistics of engagement with the developed courses and educational resources | for videolectures - ongoing | AggregatedStatistics | Generated | videolectures | ongoing | internal analysis, demand analysis | N/A | JSON is used for Videolectures API | Videolectures REST API documentation | N/A | described as part of WP3 deliverables | videolectures repository | N/A | at least until the end of project | < 1GB | JSI(videolectures)/OU | videolectures - relying on internal quality assurance & back up procedures | N/A | ||
17 | WP4 | SOTON | Web server logs and Google analytics of project website access | Ongoing | WebsiteAnalytics | Collected | http://edsa-project.eu | 1 website | Internal analysis for dissemination and community analysis. Secondary use for implicit demand analysis. | N/A | Quantitative recording of website traffic | Description of metric terms | Raw data will be owned by the project and unlicensed. | Analysed data will be made available throughout deliverable reports in WP4. | Internal Soton/OU repositories | N/A | < 1GB | OU / Soton | Backed up remotely | Free storage. 0.5 day per month | |
18 | Generated social media engagement data | Ongoing | SocialMediaEngagements | Collected | Twitter, LinkedIn, | 1 Twitter Account; Up to 30 LinkedIn Community Groups | Internal analysis for community strength and project dissemination. | N/A | Regular access of data from analytics.twitter.com | Descriptions of data attributes | Licensed in compliance with each social network's terms. | Dashboard on EDSA website. Deliverable reports in WP4. | Internal Soton repositories | Until the end of the project | < 1GB | Soton | Backed up remotely | Free storage. 1 day per month | |||
19 | Aggregated statistics of networking and engagement data | Ongoing | EngagementReports | Generated | Variable | Internal analysis for dissemination and community building. | N/A | Quantitative analysis of engagement data | Who attended each event, what type of presentation or activity took place, and where the event was held. | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Deliverable reports in WP4. | EDSA Dashboard | Until the end of the project | < 500MB | Soton | Backed up remotely | Free storage. 2 days per month | ||||
20 | Learning materials access data | Ongoing | LearningMaterialAccess | Collected | Various: MOOCs (Futurelearn, Coursera), project website, iBook Store | Variable | Internal analysis for dissemination and engagement with learning materials | Repository statistics on downloads and views of educational resources, and Statistics of course registration, participation and completion from WP3 | Quantitative recording of web server logs and page views | Description of terms | Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ | Deliverable reports in WP4. Dashboard on EDSA website. | EDSA Dashboard | Until the end of the project | < 1GB | OU / Soton | Backed up remotely | 1 day per month | |||
21 | WP5 | ideXlab | List of project exploitation results – collaborations, institutional and geographical beneficiaries, | Ongoing | ProjectExploitation | Generated | Project partners | Variable | Internal analysis for results to be exploited and targets | N/A | Results from interviews and exploitation activities | N/A | Not available | Deliverable reports in WP5. | Google docs shared document | Confidentiality | Until the end of the project | < 500MB | ideXlab | Backed up remotely | Free storage. 1 day per month |