1 of 18

Measurement, Testing, & Scoring

Conformance - An Introduction


2 of 18

Goal

Goal: To review measurement, testing, and scoring in the First Public Working Draft (FPWD), along with the issues raised on it

Context: We will use alt text as the example to discuss today's challenges, but the broader conversation is about measuring in general


3 of 18

Terminology

  • Measuring
    • Used to determine success or failure at the most granular level
    • How do we measure success?
  • Testing
    • Methodology used to measure success or failure
  • Scoring
    • The process of taking the results of measuring (aka measurements) and aggregating them into a score
  • Conformance
    • How well a resulting score matches the requirements set out in the conformance document
  • View
    • All content visually or programmatically available without a substantive change.
    • Views vary based on the technology being tested. While these guidelines provide guidance on scoping a view, the tester will determine what constitutes a view and describe it. Views typically include state permutations based on that view, such as dialogs and alerts, but some states may deserve to be treated as separate views.


4 of 18

Relevant Requirements

  • Multiple ways to measure
    • All WCAG 3.0 guidance has tests or procedures so that the results can be verified. In addition to the current true/false success criteria, other ways of measuring (for example, rubrics, sliding scale, task-completion, user research with people with disabilities, and more) can be used where appropriate so that more needs of people with disabilities can be included.
  • Technology Neutral
    • Guidance should be expressed in generic terms so that it may apply to more than one platform or technology. The intent of technology-neutral wording is to provide the opportunity to apply the core guidelines to current and emerging technology, even if specific technical advice doesn't yet exist.
  • Motivation
    • The Guidelines motivate organizations to go beyond minimal accessibility requirements by providing a scoring system that rewards organizations which demonstrate a greater effort to improve accessibility.
  • Regulatory Environment
    • The Guidelines provide broad support, including:
      • Structure, methodology, and content that facilitates adoption into law, regulation, or policy, and
      • clear intent and transparency as to purpose and goals, to assist when there are questions or controversy.
  • Requirements Document


5 of 18

Atomic Tests vs. Holistic Tests (FPWD)

Atomic Tests

Atomic tests evaluate content, often at an object level, for accessibility. Atomic tests include the existing tests that support A, AA, and AAA success criteria in WCAG 2.X. They also include tests that may require additional context or expertise beyond tests that fit within the WCAG 2.X structure. In WCAG 3.0, atomic tests are used to test both processes and views. Test results are then aggregated across the selected views. Critical errors within selected processes are also totaled. Successful results of the atomic tests are used to reach a Bronze rating.

Atomic tests may be automated or manual. Automated evaluation can be completed without human assistance. These tests allow for a larger scope to be tested but automated evaluation alone cannot determine accessibility. Over time, the number of accessibility tests that can be automated is increasing, but manual testing is still required to evaluate most methods at this time.

Holistic Tests

Holistic tests include assistive technology testing, user-centered design methods, and both user and expert usability testing. Holistic testing applies to the entire declared scope and often uses the declared processes to guide the tests selected.


6 of 18

Atomic, Automated Tests for Functional Images (FPWD)

Procedure for HTML

  1. Run an automated test that displays the text alternative (or accessible name) for each image (a code sketch of this step follows the procedure).
  2. Check that functional images, informative images, and images of text have alternative text that serves an equivalent purpose to the image:
    • Functional images describe the function.
    • Informative images describe the image.
    • Images of text repeat the text or the equivalent purpose of the text.
  3. Check that decorative images are appropriately coded (see “Decorative Images” method) so they are hidden from assistive technology.

Expected Results: Checks #2 and #3 are true.
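
Step #1 is the automated part: a tool walks the content and lists each image together with the text alternative it exposes, so a human can then apply checks #2 and #3. Below is a minimal sketch of that listing step in TypeScript, assuming a browser/DOM environment; the function name, selector, and simplified accessible-name logic are illustrative, not part of the FPWD.

```typescript
// Minimal sketch: list every image and the text alternative it exposes.
// The accessible-name computation here is simplified; a real tool follows
// the AccName specification.
interface ImageReport {
  element: Element;
  accessibleName: string | null;
}

function listImageTextAlternatives(root: ParentNode = document): ImageReport[] {
  return Array.from(root.querySelectorAll('img, [role="img"]')).map((el) => {
    const labelledby = el.getAttribute('aria-labelledby');
    return {
      element: el,
      accessibleName:
        (labelledby ? document.getElementById(labelledby)?.textContent ?? null : null) ??
        el.getAttribute('aria-label') ??
        el.getAttribute('alt'),
    };
  });
}
```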


Procedure for Technology Agnostic

  1. Examine each image in the content.
  2. Check that each image that conveys meaning has a text alternative.
  3. If the image contains text that is not purely decorative, check that the text alternative contains the same text.
  4. If the image is within a link together with text, check that it is implemented so that it is ignored by assistive technology, or that the text alternative describes the image and supplements the link text.
  5. If the image is a button, check that the text alternative indicates the button's function.

Expected Results: Checks #2 and #3, or #2 and #4, or #2 and #5 are true.

Unit Tested: All Images

Measurement: Percentage (# passed/total # of img elements for all images)

  • This test is measured by the number of img elements in the HTML document or the number of images in non-HTML content being tested.
  • The percentage test result is the number passed divided by the total number of img elements or images.
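
A sketch of this calculation (the function name, the result shape, and the treatment of content with no images are assumptions, not FPWD definitions):

```typescript
// Percentage measurement: images that passed the checks divided by the total
// number of img elements (or images) in the tested content.
function imageAltTextPercentage(results: { passed: boolean }[]): number {
  if (results.length === 0) {
    return 100; // assumption: no images in scope means there is nothing to fail
  }
  const passed = results.filter((r) => r.passed).length;
  return (passed / results.length) * 100; // e.g. 3 of 4 images passing -> 75
}
```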

7 of 18

Atomic, Manual Tests for Functional Images (FPWD)

Procedure for HTML

  1. Examine each img element in the content.
  2. Check that each img element which conveys meaning contains an alt attribute.
  3. If the image contains text that is not purely decorative, check that the value of the alt attribute is the same as the text.
  4. If the image is within a link together with text, check that the img element within the a element has either a null alt attribute value or a value that supplements the link text and describes the image.
  5. If the image is a button, check that the text alternative indicates the button's function.

Expected Results: Checks #1 and #2 are true (see notes)


Procedure for Technology Agnostic

  1. Examine each functional image in the content.
  2. Check that each functional image that conveys meaning has a text alternative.
  3. If the image contains text that is not purely decorative, check that the text alternative contains the same text.
  4. If the image is within a link together with text, check that it is implemented so that it is ignored by assistive technology, or that the text alternative describes the image and supplements the link text.
  5. If the image is a button, check that the text alternative indicates the button's function.

Expected Results: Checks #2 and #3, or #2 and #4, or #2 and #5 are true.

Unit Tested: This test is measured by the number of the following elements, for “Functional Images” only, in the HTML document:

  • img elements
  • input elements with type=”image”

Measurement: Percentage (# passed/total # of img elements for “Functional Images”)
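
For HTML, the unit set above could be gathered with a simple query before the manual review begins; a minimal sketch (the function name is illustrative, and whether each candidate is truly a functional image remains a human judgment):

```typescript
// Collect the elements named above: img elements plus input elements with
// type="image". Classifying each one as functional, informative, or
// decorative is still up to the manual tester.
function collectFunctionalImageCandidates(root: ParentNode = document): Element[] {
  return Array.from(root.querySelectorAll('img, input[type="image"]'));
}
```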

8 of 18

Scoring "Text alternative available" (FPWD)


  • Rating 0: Less than 60% of all images have appropriate text alternatives OR there is a critical error in the process
  • Rating 1: 60%-69% of all images have appropriate text alternatives AND no critical errors in the process
  • Rating 2: 70%-79% of all images have appropriate text alternatives AND no critical errors in the process
  • Rating 3: 80%-94% of all images have appropriate text alternatives AND no critical errors in the process
  • Rating 4: 95%-100% of all images have appropriate text alternatives AND no critical errors in the process
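
Read as code, these bands translate into a simple threshold function; a sketch (the function name is illustrative):

```typescript
// FPWD rating bands for "Text alternative available".
// percentPassed is the atomic-test measurement (0-100); hasCriticalError
// reflects critical errors found in the declared processes.
function textAlternativeRating(percentPassed: number, hasCriticalError: boolean): 0 | 1 | 2 | 3 | 4 {
  if (hasCriticalError || percentPassed < 60) return 0;
  if (percentPassed < 70) return 1;
  if (percentPassed < 80) return 2;
  if (percentPassed < 95) return 3;
  return 4;
}
```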

9 of 18

Issues: Concern about Difficulty of Counting

  • 275 - “A standard measure is needed to better define vendor claims in VPATS/ACRs that a test condition is "partially compliant."...However, as a certified trusted tester in the DHS Accessibility Test Process for Web, using the measure of "a percentage of overall instances" to obtain this data will be problematic. It would be burdensome to expect a manual test process to "count" all the instances where a test condition applies. Even those using automated testing would not be able to fully determine the extent of applicability, since only a small percentage of testing can reliably be performed using automated tools.”
  • 278 - Rating sites for percentage of overall conformance by success criterion/guideline has the potential to create an added burden for manual testers, requiring a tally of conforming content in addition to identification of non-conforming content. If manually tested content is rated more qualitatively (rather than based on an overall percentage of conformance), then you might still encounter challenges comparing the qualitative rating to the quantitative ratings in order to define overall impact on users.
  • 498 - Concern expressed about the fact that new testing rules and tools will need to be created to support WCAG 3. “Depending on the number of outcomes that require manual testing, the ability to count instances (manually) of applicable content, and appropriately test all of the outcomes could cost significant time.”


10 of 18

Issues: Types of Measures

  • 280 - “I also understand that the atomic tests do not involve testing with AT. I feel that this is dangerous since most automated testing tools do not test functionality but rather WCAG 2.x issues...Functional testing with AT is essential for establishing web accessibility. Therefore, as an evaluator, I believe that testing core tasks with the users' preferred AT must be part of the Bronze conformance level.”
  • 486 - “if one wants to create a document that would be used in the legal and regulatory arena, then one has to stick with objective measures.”
  • 505 - “As the standard develops, we strongly encourage WCAG 3.0 to, where possible, provide expectations for outcomes without mandating how the implementer of the standard gets to that goal.”


11 of 18

Issues: Need for Clarity

  • 435 - 6.3 This approach, which allows the tester some flexibility in assigning scores, has the advantage of simplicity and allowing a tester to take the context into account beyond the simple percentages. How can unambiguity be assured?
  • 436 - This could work as long as the method is unambiguous and every tester comes to the same conclusion. Also, there should be a drive for organizations to make their websites as accessible as possible, by giving them some kind of platform. And there should be a simple way built into the report for people (with disabilities) for whom the website is not accessible to address the problem, so that the owner can rectify the situation or offer an alternative. This gives the owner an incentive to comply as fully as possible, since it reduces the costs of a helpdesk or chatbot.


12 of 18

Overall Complexity/Need for Simplicity

  • 417 - Although the intent of the proposed scoring and rating system is to provide a more flexible conformance approach, it will make testing significantly more complex. Having five possible rating "bands" (a scale of 0 to 4) for each outcome will make all testing, and in particular manual testing, extremely complex. In addition, having different scoring methods (e.g., pass/fail, rating scales) and different percentages, as well as adjectival ratings, for the various rating scale "bands" for different criteria adds yet another layer of complexity. We recommend that the proposed scoring and rating system be simplified to reduce the additional testing burden. For example, the scoring system should be more objective and data-driven, and avoid adjectival ratings, which potentially introduce ambiguity and whose meaning may change when translated into other languages.
  • 444 - “Outcome ratings vs. critical errors vs. conformance levels is confusing. Adding to the complexity is the objective pass/fail outcome vs adjectival ratings, making it even harder to follow. Understanding and meeting the outcomes is hard enough for people to learn that we don’t want to see there be a complex scoring added on top.”
  • 463 - “While the example guidelines included a lot of detail, the information regarding scoring is throughout the document leading to a lot of confusion on how it would really work. Having an alternative to pass/fail is a requirement when it comes to applying WCAG to complex enterprise level products, but it needs to be done in a way that everyone can understand. Additionally, the methods used for scoring are complex since each outcome will have its own scoring scale. The impact to testing groups for the amount of data that must be recorded, and tracking which scale is to be used on each outcome is going to lead to extensive time being spent on just tracking statistics of testing. We recommend a simplified scoring methodology be created to reduce this overhead.”
  • 457 - “There should be one model for conformance, based on rating and scoped by documenting the sampling, processes, and any other testing performed with an easy way to calculate the conformance level. If it is too complicated, it will be difficult for policy makers to understand what conformance levels can be reasonably expected for technologies to attain that can be included in regulations. More clarity in the specific proposals is needed to give a more detailed response to the question.”


13 of 18

Suggestions for Rating Scales

  • 277 - Suggestion to use common bug rating practices for accessibility (a sketch of this scale follows the list):
    • Severity 1 (aka Critical or Urgent) - the bug/issue renders the site/system/application inoperable, unusable, or untestable
    • Severity 2 (aka Major or High) - the issue/bug prevents the ability to perform some core functionality, but there could be some workaround
    • Severity 3 (aka Moderate or Medium) - the issue/bug affects some minor functionality, but there is no workaround
    • Severity 4 (aka Minor or Low) - the issue/bug affects minor functionality, and there is a workaround
    • [Bonus Severity 5 (aka Cosmetic)] - the issue/bug does not affect functionality, but diminishes the user experience in some way (e.g., misspellings, bad page spacing, poorly structured page content, etc.)
  • 459 - “The general concept of defining thresholds for adjectival ratings and scores is good conceptually, but several questions remain: [discussion of thresholds which we will discuss in future meetings] It would help evaluate and understand the scoring scheme if there was a sample scoring sheet provided. It is very hard to determine the difficulty of conformance scoring across different functional groupings and in aggregate. Understanding whether the model is workable without the ability to view it in this context is a substantial challenge.”
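
The scale suggested in issue 277 maps onto a simple severity type; a sketch (purely illustrative, not part of the FPWD):

```typescript
// Severity scale suggested in issue 277, with the slide's definitions as comments.
enum IssueSeverity {
  Critical = 1, // renders the site/system/application inoperable, unusable, or untestable
  Major = 2,    // prevents some core functionality, though a workaround may exist
  Moderate = 3, // affects some minor functionality, with no workaround
  Minor = 4,    // affects minor functionality, and a workaround exists
  Cosmetic = 5, // does not affect functionality, but diminishes the user experience
}
```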


14 of 18

Joint Silver/ACT Work to Date

  • Q2 and Q3: Revise the Methods template to:
  • Q4: Revise the Outcomes template to:
    • Reduce ambiguity / improve reliability
    • Design a new Outcome template
    • Include more specific definitions of terms to explain the outcome
    • Early draft of a new outcome for Programmatic Language (formerly Language of page and parts)


15 of 18

Example - Hamburger Menu

Quantitative Measure: Does it have alt text?

  • Yes/No
  • Percentage

Qualitative Measure: Equivalent purpose of the image?

  • “Hamburger menu”
  • Decorative
  • “Settings menu”
  • “Three lines”
  • 123.png


16 of 18

Example - Dogs

Quantitative: Does it have alt text?

  • Yes/No
  • Percentage

Qualitative: Equivalent purpose of the image?

  • “dog and dog and dog and dog”
  • “4 dogs”
  • “Four dogs on leashes sitting on a sidewalk.”


17 of 18

Example: Aggregate

  • Alt text
    • Select menu - No alt text
    • Dog photos - Adequate alt text
    • Dog bone - alt=””
  • Quantitative Scoring: Percentage
    • 75%
  • Qualitative Testing: Equivalent purpose of the image?
    • Step-by-step guide to passing or failing an image
  • Qualitative Scoring Possibilities
    • Pass/Fail
    • Rating scale (FPWD)
    • Points (Alternate Proposal)
  • Considerations
    • As objective as possible
    • Simplicity of measures
    • Consistency across guidelines
    • Interrater reliability
    • Scoring possibilities


18 of 18

Next Steps: Proposals for measuring/testing “A text alternative that serves the equivalent purpose.”

  • October 30th: Submit proposals on ways to test qualitative content to public-silver-editors@w3.org
    • Take into account GitHub Issues
    • Keep the proposal scoped only to measuring
    • Optionally give examples of how the measurements could possibly be scored
  • After we discuss proposals, we will try them with other guidance like headings and COGA examples
