Reviewer sign up | Lit Review ID | Assessment grouping | Tool Name | Designed for | Type | URL | Abstract | Other | Tool Creator/Maintainer | Source code / download URL | Documentation URL | GUI | CLI | Free? | OSS or proprietary | Written in
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Anna Neatrour | MA-002 | Interface (GUI) tools designed for assessing metadata | DPLA Aggregation tools | assessing metadata | tools package | https://github.com/ncdhc/dpla-aggregation-tools | This set of tools provides a way of visually browsing metadata from OAI-PMH feeds, with the option to check for values in required fields. Data is displayed in grids, allowing a user to assess an entire set or collection more effectively. Particularly useful for people who would like to assess the metadata available over OAI-PMH but are not comfortable reviewing raw XML. Although the tools are set up to review simple Dublin Core and a set of required fields specific to NCDHC, the code can be modified to review a qualified Dublin Core OAI-PMH feed, and the required-fields setting can also be adjusted. At the University of Utah, we are using these tools (as modified by the Mountain West Digital Library) to assess mappings and required-field values for legacy collections. | | North Carolina Digital Heritage Center | https://github.com/ncdhc/dpla-aggregation-tools | https://github.com/ncdhc/dpla-aggregation-tools/wiki | y | | y | OSS | 
Rachel T. | MA-005 | Command-line scripts designed for assessing metadata | Metadata Breakers | assessing metadata | stand-alone script | https://github.com/vphill/metadata_breakers | This Python script allows you to parse digital library metadata exposed in an OAI-PMH repository. The data comes in as OAI-PMH responses, and Metadata Breakers provides flexible options for outputting the data in a format that other tools can easily use. A more detailed explanation and examples of how the tool can be used are given in a 2013 Code4Lib Journal article: http://journal.code4lib.org/articles/7818 (See also the minimal OAI-PMH field-checking sketch below the table.) | | Mark Phillips | | | | y | | OSS | Python
 | MA-006 | Command-line scripts designed for assessing metadata | Completeness rating in Europeana | assessing metadata | tool | https://docs.google.com/document/d/1Henbc0lQ3gerNoWUd5DcPnNq4YxOxDW5SQ7g4f26Py0/edit#heading=h.l2fg46yn5tej | This Java program assigns point-based values to "score" individual metadata records for completeness and assumed "information value" (attractiveness) to humans. The score is used to increase the visibility of the best records in the Europeana portal by boosting their ranking. The logic for the points awarded to a record is laid out in the supporting documentation. (A simplified Python sketch of this kind of point-based completeness scoring appears below the table.) | Note from Borys Omelayenko (in-line comment via GitHub): "It gives rank from 0 to 10 for a record, that consists of two parts: up to 5 points for tags with values (potentially) coming from controlled vocabularies, and up to 5 points for free-text fields." (lines 27-52) | Hugo Manguinhas (editor) | https://github.com/europeana/uim-europeana/blob/master/workflow_plugins/europeana-uim-plugin-enrichment/src/main/java/eu/europeana/uim/enrichment/utils/RecordCompletenessRanking.java | | | | | | Java
Andrea L. | MA-017 | Other | Google Analytics | business intelligence | tool | https://analytics.google.com/ | Offers event tracking to discover which links are clicked and which files are downloaded, and can track how the search feature is used. APIs and tools such as Google Tag Manager can extract data into various formats. Dimensions and metrics can be customized to collect data on specific metadata fields. Can be used with R to extract data. | | Google | | | | | | | 
 | MA-010 | Command-line resources (languages, libraries, and interfaces) that can be used for assessing metadata | D3 | data visualization | programming language or library | https://d3js.org/ | D3 is a JavaScript library for visualizing data with HTML, SVG, and CSS. | | Mike Bostock (https://github.com/mbostock) | https://github.com/d3/d3 | https://github.com/d3/d3/wiki | | | y | OSS | JavaScript
 | MA-011 | Command-line resources (languages, libraries, and interfaces) that can be used for assessing metadata | Plot.ly | data visualization | programming language or library | https://plot.ly/ | An online analytics and data visualization tool, with graphing libraries for Python, R, and MATLAB. | | Plotly | | | | n | y/n (free for Python, R, and MATLAB) | both (OSS for Python, R, and MATLAB) | 
Kathryn G | MA-001 | Interface (GUI) tools designed for assessing metadata | OpenRefine | efficiency and assessment across large datasets | tool | http://openrefine.org/ | OpenRefine is a free, open-source data normalization and reconciliation tool that runs locally in a web browser. It can work with large sets of data, but does best processing fewer than 100k rows at a time. Users can apply faceted search and browsing to identify similar data, or rely on the built-in clustering algorithms that suggest "clusters" of data OpenRefine thinks can be normalized to a single value (including suggesting the "best" value based on relevancies defined in the algorithm). Very useful for assessing and migrating legacy metadata from different systems, and it works with many standard data storage formats (CSV and other delimited files, RDF, XML, JSON, etc.). Advanced users can explore OpenRefine as a tool for linking existing data to external sources (e.g., Freebase) or normalizing data using programming languages for complex queries. Relatively short learning curve for a basic level of usage: common actions have built-in buttons, navigation and design are intuitive, and import/export is very simple. (A minimal Python sketch of the fingerprint-style clustering idea appears below the table.) | Documentation notes: openrefine.org provides easy-to-understand video tutorials in addition to text-based documentation; there is also a documentation wiki at https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users | community-maintained (http://openrefine.org/community.html) | https://github.com/OpenRefine/OpenRefine | https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users | y | n | y | OSS | Java
Kathryn G | MA-004 | Interface (GUI) tools designed for assessing metadata | LODrefine | efficiency and assessment across large datasets | tool | https://github.com/sparkica/LODRefine | While still operational, this tool is no longer supported or maintained. It is a LOD-enabled version of OpenRefine: it builds on OpenRefine version 2.5 with integrated extensions that streamline the transition from tabular data to Linked Data. It was built by a post-doc at the now-defunct DERI institute and makes heavy use of the DERI RDF extension. Last updated in 2013. The documentation URL is no longer active. | Information on installing LODRefine or OpenRefine with the DERI extension: https://github.com/LODLAM/LODLAMTO16/blob/master/OpenRefine_Tutorial/Installation/README.md | Mateja Verlic (@sparkica) | https://github.com/sparkica/LODRefine | [url inactive] http://code.zemanta.com/sparkica | y | n | y | OSS | Java
Sara R. | MA-012 | Command-line resources (languages, libraries, and interfaces) that can be used for assessing metadata | Anaconda distribution of Python | efficiency and assessment across large datasets | programming language or library | https://www.python.org/ | Python is a widely-used programming language. The Anaconda distribution of Python comes bundled with packages useful for metadata assessment, including data analysis and visualization libraries (e.g., scikit-learn, pandas, NumPy, SciPy, NLTK, matplotlib), as well as the Jupyter (IPython) notebook interactive computational environment. | | Continuum Analytics | https://www.continuum.io/downloads | https://docs.continuum.io/anaconda/index | n | y | y | OSS | Python
Sara R. | MA-013 | Command-line resources (languages, libraries, and interfaces) that can be used for assessing metadata | Python pandas | efficiency and assessment across large datasets | programming language or library | http://pandas.pydata.org/ | A Python library for analyzing data. It is available as a standalone download or as part of the Anaconda distribution of Python (see above). (A short pandas sketch for checking field completeness appears below the table.) | | Wes McKinney (creator); PyData community (maintainer) | http://pandas.pydata.org/pandas-docs/stable/ | http://pandas.pydata.org/pandas-docs/stable/ | n | n | y | OSS | Python
Laura A. (?) | MA-014 | Infrastructure tools that make data processing more efficient and allow for assessment across large datasets | Apache Spark | efficiency and assessment across large datasets | computing framework | http://spark.apache.org/ | A fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. It is an open-source cluster computing framework. (A small PySpark sketch for profiling field completeness appears below the table.) | | Apache Software Foundation | | | | | y | OSS | 
 | MA-015 | Infrastructure tools that make data processing more efficient and allow for assessment across large datasets | Hadoop | efficiency and assessment across large datasets | computing framework | https://hadoop.apache.org/ | Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks or jobs. (Summary from www.sas.com/en_us/insights/big-data/hadoop.html.) | | Apache Software Foundation | | | | | y | OSS | 
Conal T | MA-021 | Interface (GUI) tools designed for assessing metadata | Gadget | efficiency and assessment across large datasets | tool | https://github.com/Conal-Tuohy/SIMILE-Gadget/wiki | Gadget is an XML inspector designed to create useful summaries of large XML datasets. It generates sparklines and displays frequencies of values clustered by XPath. | | Created by Stefano Mazzocchi for the MIT SIMILE project | https://github.com/Conal-Tuohy/SIMILE-Gadget | https://github.com/Conal-Tuohy/SIMILE-Gadget/wiki | y (website) | y | y | OSS | Java
 | MA-009 | Command-line resources (languages, libraries, and interfaces) that can be used for assessing metadata | RStudio | integrated development environment (IDE) | programming language or library | https://www.rstudio.com/ | A free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. | | RStudio, Inc. | | https://support.rstudio.com/hc/en-us | y | | y | OSS | 
 | MA-016 | Dataset | eCommons Metadata | sharing and testing | dataset | https://github.com/cmh2166/eCommonsMetadata | Review of the eCommons DSpace metadata as of Wednesday, February 3rd, 2016. | | Christina Harlow | | | | | | | 
Rachel T. | MA-018 | Dataset | Digital Public Library of America: Bulk Metadata Download Feb 2015 | sharing and testing | dataset | http://digital.library.unt.edu/ark:/67531/metadc502991/ | Dataset containing metadata (~8 million records) contributed to the Digital Public Library of America (DPLA) and normalized into their internal format. This provides an easy, ready-to-download example of DPLA data for testing and experimentation. The full DPLA dataset can also be accessed directly from DPLA. | | Mark Phillips | | | | | | | 
Laura A. | MA-019 | Dataset | UNT Libraries Metadata Edit Dataset | sharing and testing | dataset | http://digital.library.unt.edu/ark:/67531/metadc304852/ | This dataset contains data samples from metadata records (1,193,814 samples per file) extracted from the UNT Libraries' Digital Collections. It contains one sample per metadata record version in the system, with aggregate counts of fields as well as hash values for each element. Data was collected in March 2014 and covers record versions dated May 19, 2004 through February 4, 2014. | | | | | | | | | 
Kathryn G | MA-020 | Dataset | Internet Archive Dataset Collection | sharing and testing | dataset | https://archive.org/details/datasets | The Dataset Collection is an aggregation resource for large data archives from both organizations/sites and individuals. It is accessible via API (rather than bulk download). | | Internet Archive (for the collection); data sets are provided by individuals or organizations | each data set has a unique URL | each data set has a unique URL | y (website) | | y | N/A | 
Rachel T. | MA-003 | Interface (GUI) tools designed for statistical computing and which could be used for assessing metadata | SPSS | statistical computing | tool | http://www-01.ibm.com/software/analytics/spss/ | Statistical analysis tool widely used in the social sciences, commercially available from IBM. Useful for identifying meaningful relationships between variables. | | IBM | | | y | n | n | proprietary | Java
Rachel T. | MA-007 | Other | Tableau | statistical computing and graphics | tool | http://www.tableau.com/ | Tableau is a popular commercial tool for data analysis and visualization, designed to be usable by people without programming skills and often used in business settings for market analytics. It can ingest multiple types of data (spreadsheets, databases, and web data) and allows users to design a dashboard of visualizations from that data. | | Tableau Software | | | y | no, but scripts can be added | Tableau Public is free, but projects must be saved to public.tableau.com | proprietary | 
Andrea L. | MA-008 | Command-line resources (languages, libraries, and interfaces) that can be used for assessing metadata | R | statistical computing and graphics | programming language or library | https://www.r-project.org/ | R is a free (no cost) statistical computing software "environment" that can be used for analyzing data and displaying it graphically. It runs on UNIX, FreeBSD, Linux, Windows, and macOS. | | The R Foundation | | | | | y | OSS | R
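
The OAI-PMH-oriented entries above (DPLA Aggregation tools, Metadata Breakers) both come down to harvesting records from a `ListRecords` response and checking which Dublin Core fields are populated. The sketch below is a minimal standard-library illustration of that idea, not code from either project; the endpoint URL and the list of required fields are hypothetical placeholders.

```python
"""Minimal sketch: harvest one page of an OAI-PMH feed and check required fields.

An illustration of the general approach used by tools such as the DPLA
Aggregation tools and Metadata Breakers, not code from either project.
BASE_URL and REQUIRED_FIELDS are hypothetical placeholders.
"""
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://example.org/oai"           # hypothetical OAI-PMH endpoint
REQUIRED_FIELDS = {"title", "rights", "date"}  # hypothetical required DC fields

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest_page(base_url: str) -> ET.Element:
    """Fetch a single ListRecords page of simple Dublin Core records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urlopen(f"{base_url}?{query}") as response:
        return ET.parse(response).getroot()

def check_required(root: ET.Element, required: set) -> None:
    """Report records that are missing any required Dublin Core field."""
    for record in root.findall(".//oai:record", NS):
        identifier = record.findtext(".//oai:identifier", default="(no id)", namespaces=NS)
        dc = record.find(".//oai_dc:dc", NS)
        present = set()
        if dc is not None:
            # Element tags look like "{http://purl.org/dc/elements/1.1/}title"
            present = {el.tag.split("}")[-1] for el in dc if el.text and el.text.strip()}
        missing = required - present
        if missing:
            print(f"{identifier}: missing {sorted(missing)}")

if __name__ == "__main__":
    check_required(harvest_page(BASE_URL), REQUIRED_FIELDS)
```

This sketch only inspects the first page of results; a fuller harvester would follow the `resumptionToken` returned in each response, which is what the tools listed above handle for you.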
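The Europeana completeness ranking described above awards up to 5 points for fields whose values may come from controlled vocabularies and up to 5 points for free-text fields, for a total score of 0-10. The sketch below is a deliberately simplified Python illustration of that scoring idea; the actual logic lives in the Java class linked in the table, and the field groupings and thresholds here are assumptions, not Europeana's.

```python
"""Simplified sketch of a 0-10 point-based completeness score for a metadata record.

The real Europeana logic is implemented in the Java class
RecordCompletenessRanking (linked in the table); the field groupings below
are illustrative assumptions only.
"""

# Hypothetical field groupings: which fields count toward each half of the score.
VOCABULARY_FIELDS = ["subject", "type", "language", "format", "spatial"]
FREE_TEXT_FIELDS = ["title", "description", "creator", "publisher", "source"]

def completeness_score(record: dict) -> int:
    """Return a 0-10 score: up to 5 points per group, one point per populated field."""
    def points(fields):
        populated = sum(1 for f in fields if str(record.get(f, "")).strip())
        return min(populated, 5)
    return points(VOCABULARY_FIELDS) + points(FREE_TEXT_FIELDS)

if __name__ == "__main__":
    sample = {
        "title": "Map of Utrecht",
        "description": "Hand-coloured engraving, 1649.",
        "subject": "Maps",
        "language": "nl",
    }
    print(completeness_score(sample))  # 2 vocabulary points + 2 free-text points = 4
```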
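OpenRefine's key-collision clustering groups values that reduce to the same normalized "fingerprint" (lowercased, punctuation stripped, tokens deduplicated and sorted). The sketch below re-implements that fingerprint idea in a few lines of Python as an illustration; it is not OpenRefine code and omits the other clustering methods (n-gram fingerprints, nearest-neighbour distances) that OpenRefine also offers.

```python
"""Sketch of key-collision clustering in the spirit of OpenRefine's fingerprint
method: values that normalize to the same key are candidates for merging into
a single value. Not OpenRefine code; an illustration only.
"""
import re
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Lowercase, strip accents and punctuation, then dedupe and sort tokens."""
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(sorted(set(value.split())))

def cluster(values):
    """Group raw values by fingerprint; clusters with more than one member need review."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: members for key, members in groups.items() if len(members) > 1}

if __name__ == "__main__":
    names = ["Smith, John", "John Smith", "smith john", "Jane Doe"]
    print(cluster(names))  # one cluster: the three "John Smith" variants
```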
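Several of the entries above (the Anaconda distribution, pandas) are listed because they make quick field-level profiling easy. As a concrete illustration, the sketch below loads a metadata export into pandas and reports how complete each field is; the file name and the `rights` column are hypothetical placeholders.

```python
"""Sketch: per-field completeness report for a tabular metadata export using pandas.

Assumes a CSV export with one row per record; "metadata_export.csv" and the
"rights" column are hypothetical placeholders.
"""
import pandas as pd

# Load the export as strings and treat blank/whitespace-only cells as missing.
df = pd.read_csv("metadata_export.csv", dtype=str).replace(r"^\s*$", pd.NA, regex=True)

# Percentage of records with a non-empty value in each field, worst first.
completeness = df.notna().mean().mul(100).round(1).sort_values()
print(completeness.to_string())

# Frequency of the most common values in one field, e.g. to spot inconsistent terms.
print(df["rights"].value_counts(dropna=False).head(10))
```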
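For collections too large to profile comfortably on one machine, the same completeness question can be asked of Spark. The sketch below is a minimal PySpark illustration under the assumption that the records are available as newline-delimited JSON; "records.jsonl" is a hypothetical placeholder path, and this is not code from any of the projects listed above.

```python
"""Sketch: count populated values per field across a large metadata dump with PySpark.

Assumes records are stored as newline-delimited JSON; "records.jsonl" is a
hypothetical placeholder path.
"""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("metadata-completeness").getOrCreate()

df = spark.read.json("records.jsonl")
total = df.count()

# F.count() ignores nulls, so this counts records where each top-level field is populated.
populated = df.select(
    [F.count(F.col(name)).alias(name) for name in df.columns]
).first().asDict()

for name, count in sorted(populated.items(), key=lambda item: item[1]):
    print(f"{name}: {count}/{total} records populated")

spark.stop()
```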