Complexity and uncertainty in DH projects

PROVIDEDH.eu @PROVIDEDH

A co-design approach around data visualization

Eveline Wandl-Vogt, Enric Senabre, Roberto Theron

DH2019. Utrecht

Front page: Partners and funders

Welcome

Goal
introduce methods
further our research approach
jointly grow and learn from each other

Core Topics:

Open Innovation

Participatory Knowledge Creation
Visualization and Uncertainty

PROgressive VIsual DEcision-Making in Digital Humanities

The PROVIDEDH project aims to give Digital Humanities scholars a space to explore and assess the completeness and evolution of digital research objects, the degree of uncertainty that the models applied to the data incorporate, and to share their perspectives and insights with the project’s broad range of stakeholders.

PROVIDEDH is an interdisciplinary research project. The experience gained in other scientific areas where computing has played a much deeper and more constant role will be analysed and adapted to the humanities, especially regarding infrastructures, frameworks, models and tools that can be standardized across the different disciplines in the humanities.

exploration space

PROVIDEDH.eu @PROVIDEDH #explorations4u

is a physical as well as virtual space
aiming to foster innovation
and support digital transformation
against the background of humanities.

It is a best practice example

for Open Innovation
of the Austrian government.

Documentation

Introductions: roles and personal skills in research

Survey results: data visualization & co-design

To what extent are you familiar with data visualization in research?

To what extent are you familiar with co-design methods?

Survey results: open innovation

“sharing research and findings to promote a culture of building on each other's work, rather than being secretive for fear that someone else will ‘use’ or claim credit for your work”

“working and creating and developing in public, open spaces

making knowledge and knowledge design processes open to others to view and sometimes participate in”

“interaction between, e.g., science and a greater public to resolve research problems”

“platforms, data sets, research methods, and outputs that are open and transparent to all at a low cost”

“you share information about what you are doing in research and what you are discovering”

To what extent are you familiar with open innovation?

Survey results: complexity & uncertainty

To what extent do you face situations of complexity and/or uncertainty in DH in relation to the following:

Survey results: complexity & uncertainty

“there is uncertainty about how to interpret the historical records of traditional languages: how things were pronounced, the distribution of the languages; even the names of the languages”

“any time I try to fit early modern information into spreadsheet cells!”

“I am working in a hermeneutical perspective [...] since it is an emergent field, I often do not have the latest development at hand and have to shift my research data and questions accordingly”

“Crowd-sourced, digitised data derived from handwritten WWI military diaries hold many possible sources of uncertainty and ambiguity: missing diary pages, illegible parts, typos, reliability of soldiers or crowd-workers/transcribers, ambiguous entity names (places or persons)”

“the collection data I work with has uncertain dates, provenance, creator and subject information”

“I am working with a difficult historical subject which requires me to incorporate community voices while at the same time taking into consideration the historical context and landscapes of the area”

“I often reach a plateau where my dataset needs a stronger management and/or analysis tool than I know how to use or maintain”

“geocoding samples/collections when the original location is lost or is not derived from an exact location”

“when something does not fit neatly into a single discipline”

“adopting metadata standards can be a field of great uncertainty”

“with historical data, there is always a missing data point”

“trying to figure out what would be a valid methodology for computationally analyzing translated texts”

PART 1

data visualization

VISUALIZATION FOR THE DIGITAL HUMANITIES

Vis contribution types (DH2015-2018)

  • Visualization contributions at the DH conference peaked in 2015.
  • Since then, however, interest seems to have declined.
  • Why? Let’s look back.

Compare to technology hype (this already happened with the DH themselves).

Is visualization for the DH presenting the same symptoms?

Are we in a “Trough of Disillusionment” regarding visualization for the humanities?

Visualization techniques in DH2014

[1] Verbert, K. (2015, June-July). On the Use of Visualization for the Digital Humanities. Digital Humanities 2015, Sydney, Australia. Retrieved from http://dh2015.org/

  • Graph visualizations (especially force-directed layouts) stood out from the rest.
  • How were these visualizations built?
    • Mainly with generic tools (Gephi, Voyant Tools).
    • Or reusing standard code examples
  • But, how were they
    • designed?
    • or evaluated?

What is visualization used for in the DH?

The point here is that graph visualizations are often overused.

We are not saying this is necessarily bad, but sometimes it is not enough.

Might this be misleading some scholars into thinking that data visualization is useless?

What are DH Vis papers about?

Keywords from all contributions in DH2015-2018 containing “visualization” in their keywords, title or abstract (229 papers).

Complexity

Left: Correlation graph from all contributions in DH2015-2018 containing “visualization” in their keywords, title or abstract (229 papers).

Right: Radial view showing the keyword hierarchy: networks

Keyword pairs with a correlation factor > 0.56 are linked.
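
A minimal sketch of how such a correlation graph could be derived. The slides do not say which correlation measure was used; this example assumes a Pearson (phi) correlation between binary keyword-presence vectors, and the toy keyword sets are invented for illustration only.

```python
from itertools import combinations
from math import sqrt


def phi(x, y):
    """Pearson correlation of two equally long 0/1 vectors (phi coefficient)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A keyword present in every (or no) paper carries no signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_edges(papers, threshold=0.56):
    """Link every keyword pair whose correlation exceeds the threshold."""
    keywords = sorted({k for paper in papers for k in paper})
    vectors = {k: [1 if k in paper else 0 for paper in papers] for k in keywords}
    edges = []
    for a, b in combinations(keywords, 2):
        r = phi(vectors[a], vectors[b])
        if r > threshold:
            edges.append((a, b, round(r, 2)))
    return edges


# Hypothetical keyword sets, one per paper (not taken from the DH corpus).
toy_papers = [
    {"visualization", "networks", "graph"},
    {"visualization", "networks", "graph"},
    {"visualization", "maps", "gis"},
    {"visualization", "maps", "gis"},
    {"visualization", "text analysis"},
]
edges = correlation_edges(toy_papers)
```

In this toy corpus "visualization" appears in every paper, so it correlates with nothing and stays unlinked, which is one reason a threshold on correlation gives a sparser picture than raw co-occurrence counts.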

Too much diversity for the same visualizations (graph, map) to work out in all contexts?

What does this imply? What’s going on here?

Visualization papers’ authors in DH2013-2016

[2] Jänicke, S. (2016). Valuable Research for Visualization and Digital Humanities: A Balancing Act. Proc. 1st Workshop on Visualization for the Digital Humanities (VIS4DH). Presented at the 1st Workshop on Visualization for the Digital Humanities.

Challenges in close and distant reading (2015)[3]

  • Novel techniques for close reading.
  • Geospatial uncertainty.
  • Temporal uncertainty.
  • Reconstructing workflows with visualization.
  • Usability studies.

[3] Jänicke, S., Franzini, G., Cheema, M. F., & Scheuermann, G. (2015). On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In R. Borgo, F. Ganovelli, & I. Viola (Eds.), Eurographics Conference on Visualization (EuroVis) - STARs. https://doi.org/10.2312/eurovisstar.20151113

The vis community is currently making an important effort to enhance visualization for the humanities.

The “slope of enlightenment” of DH Vis

  • Design visualizations and interfaces that are able to capture the nuances of the humanities research workflow.
    • Less tool-oriented.
    • More task/need-centric.
  • Tailored software design methodologies?
    • User-centered and participatory design processes?
    • Involvement of larger and more diverse communities?
    • Agile?

Some humanities scholars wonder whether the Digital Humanities are too tool-oriented.

Ok, we are making tools for the humanities, but are these tools doing what they’re supposed to do?

How do we know? HCI to the rescue.

The “slope of enlightenment” of DH Vis

  • Better validation processes for tools and results.
    • How do we know DH Vis tools are doing what they are supposed to do?
      • Only users know!
      • We need to involve users in the design task.
  • Better categorization, assessment and communication of uncertainty.
    • Models
    • Data formats (e.g., TEI)
    • Visual channels (e.g., colors, shapes)
  • Expose algorithms and other computational methods to the final user?
    • Open the black box and allow finer control.
    • Allow users to know what is going on behind the scenes.

More visualization tools (for dealing with uncertainty)

Comparing / discussing with shared indicators

Usability

Interoperability

Innovation

...

PART 2

Open Innovation

Open Innovation

… is a distributed process …
… based on purposively managed knowledge flows
across organizational boundaries…
using pecuniary and non-pecuniary mechanisms
[Bogers and Chesbrough 2014]

Open Innovation versus Open Science and Citizen Science

    • Open Science (OS) is an “umbrella term encompassing a multitude of assumptions about the future of knowledge creation and dissemination” (Fecher et al.), especially regarding technological infrastructure, accessibility of knowledge creation, access to knowledge, measurement of impact and collaborative research.
    • Citizen Science (CS) refers to the “inclusion of members of the public in some aspect of scientific research” (Eitzel). ECSA puts forward 10 principles of what constitutes good Citizen Science.

Collaborative Research Practices

Source: Senabre Hidalgo (2017)

Open Innovation

Open Science, Citizen Science and Open Innovation

Towards strategic openness

  • Inclusiveness
    • Higher degree of novelty
    • Better problem-solving competences
    • Fostering / speeding up the knowledge creation process
    • Intensifying the knowledge exchange process
  • Access
    • Better visibility of the actors
    • Higher process efficiency
    • Better knowledge transfer
  • Transparency
    • Higher level of trust
    • Quality assurance

Impact potentials of strategic openness, cf. Blümel, Fecher, Leimüller 2018

area of work:

Embedding unusual suspects into
research and innovation projects

How to activate user innovation activity?

    • Lead User Method
      activate the best qualified user
    • User Crowdsourcing
      self selection
    • User Innovation Communities
      develop/connect to communities
    • Toolkits

How to access widely distributed knowledge for innovation?

Survey: overview on relevant criteria for DH Viz

Which of the following criteria and features do you consider most important for digital tools when approaching DH-related data?

PART 3

Uncertainty & complexity in DH

Matrix of knowledge uncertainty (based on Johari window)

PART 4

Visualization of uncertainty

Annotating and visualizing uncertainty in TEI documents

  • The necessity for a visual annotator.
    • Information available in TEI files
    • Common visual display of XML-based document editors
    • Goals
  • How documents are displayed
    • Annotation and uncertainty encoding
    • Additional information
    • Global status and evolution

The need for a visual annotator

Our goal is to allow collaboration between researchers in an uncertainty-aware context

<creation><date when="1662_12_19"/>
<placeName><country>Ireland</country></placeName></creation>
<!-- Points to keywords in keywords.xml file -->
<textClass><keywords scheme="1641"><list type="deposition_type"><xi:include href="keywords.xml" xpointer="nod51"/>

The Examinations followinge weare taken by Dortor Henry Jones Roger, the Eighteenth day of January 1641 at the Board by the hands of Dr Henry Jones Lord Bishopp of Meath and Henry Brereton Clerk the 19th day of December.

TEI file

Visual annotator

How documents are displayed

TEI is displayed in a transparent manner to the user, with a minimal interface.

How documents are displayed

The underlying markup is represented using different color scales.

How documents are displayed

Certainty is encoded as color saturation: “darker” colors depict higher degrees of uncertainty.
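
A minimal sketch of such an encoding, assuming an HSV color model in which the annotation category fixes the hue and the degree of uncertainty (a value between 0 and 1) drives the saturation. The function name `uncertainty_color` and the chosen hue are illustrative assumptions, not the annotator's actual implementation.

```python
import colorsys


def uncertainty_color(hue, uncertainty):
    """Return a hex color whose saturation grows with uncertainty.

    hue: position on the color wheel in [0, 1), e.g. one hue per annotation category.
    uncertainty: 0.0 (fully certain, washed out) to 1.0 (fully uncertain, saturated).
    """
    saturation = min(max(uncertainty, 0.0), 1.0)  # clamp out-of-range input
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


# Hypothetical category hue (0.0 is red on the HSV wheel).
certain = uncertainty_color(0.0, 0.0)
uncertain = uncertainty_color(0.0, 1.0)
```

With a fixed hue, a fully certain annotation fades to white while a fully uncertain one takes the pure category color, so readers can compare annotations of the same category by intensity alone.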

How documents are displayed

Own annotations (at full line height) are easily distinguishable from other authors’ annotations (half-line height).

How documents are displayed

Keeping track of the provenance, amount and types of uncertainty-related annotations found in a dataset is key in a collaborative research environment.

How documents are displayed

The tracking of provenance, amount and types of uncertainty-related annotations should be a key design goal of corpus annotation tools.

Example annotator

Sources of uncertainty in DH:

An example on name normalization.

1641 Depositions: A multivariate dataset

  • 7011 depositions in TEI format
  • TEI Header (metadata)
    • Creation date and place, keywords, nature, list of participants, responsible people...
  • TEI Body
    • Actual digital transcription of the original texts.
    • Original and normalized digital transcriptions exist.

User Stories

1641 Depositions: A multivariate dataset

  • 6013 TEI files define a list of participants in their headers.
    • Depositions that did not contain a list of participants were discarded for the experiment.
  • A case of uncertainty (gaps)
    • The data we visualize is the “useful” portion of the original dataset, according to the problem at hand (name matching).

1641 Depositions: Names

  • Derived variable: number of participants
  • 998 (out of 7011, 14.2%) depositions do not declare a list of participants in their header.
  • There is one big outlier with 488 participants!
  • 35364 distinct names detected in 6013 files.

mean 9.151671
std 14.485049
min 1.000000
25% 3.000000
50% 5.000000
75% 10.000000
max 488.000000
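
A summary like the one above can be reproduced from a list of per-deposition participant counts. The sketch below uses only the Python standard library and assumes the sample standard deviation and linearly interpolated quartiles; the slides do not state which tool produced the figures, and the toy counts are invented.

```python
import statistics


def describe(counts):
    """Summarize participant counts per deposition."""
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "std": statistics.stdev(counts),  # sample standard deviation
        "min": min(counts),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(counts),
    }


# Hypothetical counts; the real input would be one number per TEI header.
toy_counts = [1, 2, 3, 4, 5]
summary = describe(toy_counts)
```

Comparing the mean with the median is a quick check for outliers such as the 488-participant deposition: a single extreme value pulls the mean upwards while the median barely moves.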

1641 Depositions: Time

1641 Depositions: Space

  • The spatial resolution of the 1641 depositions dataset is the town/village (populated place).
  • Thus, data can be aggregated into counties and displayed on, for example, choropleth maps.
  • Among all counties, County Cork holds the largest number of depositions in the dataset (1053, 17.5%).
  • A non-negligible number of records (334, 5.5%) say nothing about the deponent’s place of residence.

Name matching (I)

  • Orthographic variations of a name may refer to the same person.
  • The aim of this task is to gather enough evidence to conclude that variations of a name found in different depositions refer to the same person.
  • An example:
    • “Anselmus Adams” and “Ancelmus Adams”
    • Graphing the two ego networks reveals that they have four other actors in common.
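
The ego-network comparison can be sketched as follows, assuming that two people are connected whenever they appear in the participant list of the same deposition. The deposition identifiers and the "Person A/B/C" names are hypothetical placeholders, not records from the dataset.

```python
from collections import defaultdict


def build_cooccurrence(depositions):
    """Map each person to the set of people named in the same deposition."""
    neighbours = defaultdict(set)
    for participants in depositions.values():
        for person in participants:
            neighbours[person].update(p for p in participants if p != person)
    return neighbours


def shared_actors(neighbours, a, b):
    """Actors present in both ego networks, excluding the two egos themselves."""
    return (neighbours[a] & neighbours[b]) - {a, b}


# Hypothetical participant lists keyed by deposition id.
toy_depositions = {
    "dep_1": ["Anselmus Adams", "Person A", "Person B"],
    "dep_2": ["Ancelmus Adams", "Person A", "Person B", "Person C"],
}
network = build_cooccurrence(toy_depositions)
common = shared_actors(network, "Anselmus Adams", "Ancelmus Adams")
```

A large overlap between the two neighbour sets is evidence, not proof, that both spellings denote one person; the final decision remains with the researcher.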

Other ego networks.

Name matching (II)

  • Nodes whose names are close variants (similarity ratio above a threshold, here 0.80) are merged.

  • This reveals more connections between “Anselmus” and “Ancelmus” and highlights the structural similarity between the two depositions.
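
A minimal sketch of threshold-based merging. The slides do not name the string metric behind the 0.80 ratio; this example substitutes the `ratio()` of Python's `difflib.SequenceMatcher` as a stand-in and groups names with a small union-find, so the resulting groups may differ from those of the actual tool.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_candidates(names, threshold=0.80):
    """Group names whose pairwise similarity reaches the threshold."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


toy_names = ["Anselmus Adams", "Ancelmus Adams", "john ffreman", "john freeman", "Henry Jones"]
groups = merge_candidates(toy_names)
```

Merging is transitive in this sketch: if A matches B and B matches C, all three end up in one group even when A and C fall below the threshold, which is one reason the threshold is worth exposing to the user.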

Name matching (III)

  • The researcher can explore the text from the two original depositions and compare the context in which the two names, “Anselmus” and “Ancelmus” were used. By reading the two contexts it seems to be clear that “Anselmus” and “Ancelmus” were the same person, whose name was transcribed in two different ways. These nodes can now be merged with a degree of confidence as perceived by the user.

Name matching (II)

O’{something}

O’neill

O’mulhollan

O’kelly

O’haggan

Mc{something}

john ffreman and john freeman

Unknown surname

Aleatory / Statistical noise

Forename/Surname matches

Orthographic variations

Name matching (III)

  • The nodes of “Anselmus” and “Ancelmus” can be merged with a degree of confidence based on different criteria such as:
    • The user’s perception
    • The normalized distance ratio (0.8)

  • The interface records the user’s mental state about the data at the time of interpreting results.

    • The matching may be annotated manually or automatically in the corresponding TEI files.

Name matching (IV): Tuning the uncertainty

  • The nodes of “Anselmus” and “Ancelmus” can be merged with a degree of confidence based on different criteria such as:
    • The user’s perception
    • The normalized distance ratio (0.8)

  • The matching may be annotated manually or automatically in the corresponding TEI files.
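
One way such an annotation could be written back is with the TEI `<certainty>` element. The sketch below is an assumption about a possible encoding, not the project's documented scheme: the helper name, the `xml:id` values and the choice of `locus="value"` are all illustrative.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def certainty_annotation(target_id, degree, resp_id):
    """Build a TEI <certainty> element for a name-matching decision.

    target_id: xml:id of the annotated element (hypothetical identifier).
    degree: confidence between 0 and 1, e.g. the user's rating or the
            normalized distance ratio.
    resp_id: xml:id of the person or process responsible for the annotation.
    """
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must lie between 0 and 1")
    element = ET.Element(f"{{{TEI_NS}}}certainty")
    element.set("target", f"#{target_id}")
    element.set("locus", "value")
    element.set("degree", f"{degree:.2f}")
    element.set("resp", f"#{resp_id}")
    return element


# Hypothetical identifiers for illustration only.
annotation = certainty_annotation("person_ancelmus_adams", 0.8, "annotator_1")
serialized = ET.tostring(annotation, encoding="unicode")
```

Keeping the responsible party next to the degree lets a manual judgement and an automatic score for the same match coexist in one file and be told apart later.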

Demo

apr 52 to june 52

Personas: focus on potential users and their needs

User stories: concrete scenarios of use

WHAT IF AS A...

WHAT IF AS A DH STUDENT

WHAT IF AS A DIGITAL ARTIST

...I COULD...

I COULD SHARE LEVELS OF UNCERTAINTY IN HISTORY DATA

I COULD DISCOVER NEW OPPORTUNITIES IN DH DATA

...WITH...

WITH OTHER PEOPLE THAT COULD CHECK AND IMPROVE IT

WITH AN APP MAPPING AREAS OF RESEARCH UNCERTAINTY

...IN ORDER TO...

IN ORDER TO GET HELP FROM A WIDER ONLINE COMMUNITY

IN ORDER TO INSPIRE MY WORK AND EXPAND IT

PROgressive VIsual DEcision-Making in Digital Humanities

The project can be broken down into the following research objectives:

  • To understand all the sources of uncertainty that can affect the DH practice.
  • To develop a set of metrics that convey the degree of uncertainty that research objects, data sets, and collections introduce as well as the different computational models applied to them.
  • To propose a framework that makes use of the uncertainty metrics, so any given representation of the data can be assessed according to its degree of uncertainty.
  • To propose a Progressive Visual Analytics solution that ensures that users can follow the behaviour and degree of uncertainty of the underlying dataset as it evolves, i.e., able to trace changes in data and its inherent uncertainty as well as in the way it is perceived.
  • To develop a web-based multimodal collaborative platform for the progressive visual analysis of different DH collections, both for scholars and citizen humanists.
  • To trigger the formation of a “community of practice” that humanists can build on to reinforce each other’s efforts to achieve metrics that are practical and of high quality.

Get in touch with us

Eveline
eveline.wandl-vogt@oeaw.ac.at | @caissarl

Enric
enric.senabre@oeaw.ac.at | @esenabre

Roberto
theron@usal.es | @robertotheron

PROVIDEDH Partners

PROVIDEDH Funders

The PROVIDEDH project is a three-year project funded within the CHIST-ERA call 2016 for the topic “Visual Analytics for Decision Making under Uncertainty - VADMU.”
