1 of 32

Tool Search:

"Past, Present, and Future"

Ahmed Awan, Tyler Collins, Michelle Savage

Johns Hopkins University

September 15, 2022

2 of 32

Flaws With the Old Tool Search

  • Poor ordering of search results

  • The tool you searched for often appeared at an arbitrary position in the list of returned tools

  • Didn’t support more advanced filtering and search options

3 of 32

Tool Search GCC

Cameron Hyde

Tyler Collins

Michelle Savage

Ahmed Awan

4 of 32

The Backend Search: What is Whoosh?

Whoosh is a Python library of classes and functions for indexing text and then searching the index.

Essentially, Whoosh allows us to develop a custom search engine for searching the tools loaded on a Galaxy instance.

5 of 32

Previous Search Index Schema

The first step in building the backend search is defining an index schema that lists all of the fields that should be searchable, such as title or content.

6 of 32

Populating the Search Index

We then populate our index by looping through all of the available tools on a Galaxy instance.
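As a rough illustration of what populating an index involves, the sketch below builds a tiny inverted index in plain Python. It does not use the Whoosh API; the tool records, the `SEARCHABLE_FIELDS` tuple, and the `build_index` helper are all hypothetical names chosen for this example.

```python
from collections import defaultdict

# Hypothetical tool records; a real Galaxy instance would supply these from its toolbox.
TOOLS = [
    {"id": "cat1", "name": "Concatenate datasets", "help": "Concatenate datasets tail-to-head"},
    {"id": "sort1", "name": "Sort", "help": "Sort a dataset by a column"},
]

# Stand-in for the schema: the fields that should be searchable.
SEARCHABLE_FIELDS = ("name", "help")


def build_index(tools):
    """Map (field, term) -> {tool_id: term frequency} by looping over every tool."""
    index = defaultdict(dict)
    for tool in tools:
        for field in SEARCHABLE_FIELDS:
            for term in tool[field].lower().split():
                postings = index[(field, term)]
                postings[tool["id"]] = postings.get(tool["id"], 0) + 1
    return index


index = build_index(TOOLS)
```

Looking up `index[("name", "sort")]` then returns the tools whose name contains that term, together with how often it occurs.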

7 of 32

How Whoosh Searches the Index

  • Parse the Query and create a Query Object.
  • Compare the Query Object against the created Document Index.
  • Score the Results with some algorithm.
  • Return a Result Object ordered by Score.
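The four steps above can be sketched in a few lines of plain Python. This is a simplified stand-in, not Whoosh itself: the `DOCS` dictionary, the `parse` and `search` helpers, and the use of raw term frequency as the score are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical documents standing in for indexed tools.
DOCS = {
    "cat1": "concatenate datasets tail to head",
    "sort1": "sort a dataset by a column",
    "join1": "join two datasets side by side",
}


def parse(query):
    """Step 1: turn the raw query into a 'query object' (here, a list of terms)."""
    return query.lower().split()


def search(query):
    terms = parse(query)
    scores = {}
    for doc_id, text in DOCS.items():
        counts = Counter(text.split())
        # Step 2: compare the query terms against the document.
        # Step 3: score it (plain term frequency stands in for the real algorithm).
        score = sum(counts[term] for term in terms)
        if score:
            scores[doc_id] = score
    # Step 4: return the results ordered by score, best first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `search("datasets side")` ranks `join1` above `cat1`, because `join1` contains more occurrences of the query terms.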

8 of 32

Parsing the Query

Galaxy was using a custom n-gram implementation to help parse the query after it was converted to lowercase.

Galaxy was also configured to default to an ‘or’ search instead of an ‘and’ search.

Galaxy also added a wildcard search to the query.
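A minimal sketch of this kind of query parsing is shown below. It assumes a trailing wildcard on each term (the slides do not say where the wildcard was placed), and `parse_query` and `matches` are hypothetical helpers rather than Galaxy or Whoosh functions.

```python
def parse_query(raw_query):
    """Lowercase the query, append a trailing wildcard to each term, and OR the terms."""
    terms = raw_query.lower().split()
    return {"operator": "or", "terms": [term + "*" for term in terms]}


def matches(parsed, text):
    """True if any term (OR semantics) is a prefix of some word in the text."""
    words = text.lower().split()
    prefixes = [term.rstrip("*") for term in parsed["terms"]]
    return any(word.startswith(prefix) for prefix in prefixes for word in words)
```

With an 'or' default, `matches(parse_query("conc missing"), "Concatenate datasets")` is true even though only one of the two terms matches; an 'and' search would reject it.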

9 of 32

Old Search Implementation

Galaxy would then sum the scores of all the fields. Each field score was calculated with the BM25F scoring algorithm and multiplied by a boosting value that marked the importance of that field.

10 of 32

The BM25 Algorithm

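\[
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{\text{fieldLen}}{\text{avgFieldLen}}\right)}
\]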

  • qi is the ith query term.
  • The IDF component of our formula measures how often a term occurs in all of the documents and “penalizes” terms that are common.
  • fieldLen/avgFieldLen can be thought of as how long a document is relative to the average document length in the index.
  • If b is bigger, the effects of the length of the document compared to the average length are more amplified.
  • f(qi,D) boils down to this: the more times the query term(s) occur in a document, the higher its score will be.
  • k1 limits how much a single query term can affect the score of a given document.
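To make the roles of these terms concrete, here is a small sketch of the standard BM25 per-term score in plain Python. The function names, the particular IDF variant, and the defaults `k1=1.2` and `b=0.75` are assumptions for illustration and are not taken from Galaxy's or Whoosh's code.

```python
import math


def idf(doc_count, docs_with_term):
    """One common IDF variant; it shrinks as the term appears in more documents."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(term_freq, field_len, avg_field_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Score contribution of a single query term for a single document."""
    # Length normalization: b controls how strongly relative length matters.
    length_norm = 1 - b + b * (field_len / avg_field_len)
    # Term-frequency part: grows with term_freq but saturates, governed by k1.
    tf_part = term_freq * (k1 + 1) / (term_freq + k1 * length_norm)
    return idf(doc_count, docs_with_term) * tf_part
```

The total score for a query is the sum of `bm25_term_score` over its terms. Because the term-frequency part saturates, even a very large `term_freq` can never contribute more than `idf * (k1 + 1)`.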

11 of 32

What we did to Improve the Search

12 of 32

Changed Scoring Algorithm

  • The BM25F algorithm was a poor choice for many of the short fields in our index, such as name or id.

  • As a result, the scoring method for those fields was switched to a simple frequency scoring algorithm instead.

  • The BM25F algorithm is still used for the help text, with its parameters slightly modified to better fit Galaxy use cases.

13 of 32

Fixed Boosting

  • The old boosting method had minimal effect and didn’t succeed in differentiating field importance.
  • The previous implementation only modified the b term.
  • Instead, a new multiplicative boosting method was created that actually weights the fields differently.

  • If b is bigger, the effects of the length of the document compared to the average length are more amplified.
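Multiplicative boosting can be sketched as follows: each field's raw score is multiplied by a per-field weight before the scores are summed. The boost values and the `combined_score` helper are hypothetical; the real values would be set by a Galaxy admin.

```python
# Hypothetical boost values; the real ones would be configured by a Galaxy admin.
FIELD_BOOSTS = {"name": 9.0, "description": 2.0, "help": 0.5}


def combined_score(field_scores, boosts=FIELD_BOOSTS):
    """Sum each field's raw score multiplied by that field's boost (default 1.0)."""
    return sum(score * boosts.get(field, 1.0) for field, score in field_scores.items())
```

With these example weights, a single match in the name field (9.0) outranks four matches in the help text (2.0), which is the kind of differentiation between fields that changing b alone does not provide.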

14 of 32

Implemented Improved N-gram Search

  • The previous custom N-gram search in Galaxy relied on wildcards and wasn’t a true n-gram search.
  • We implemented a Whoosh version of the N-gram tokenizer and used it instead to parse the name field in the index.
  • We also made the min and max n-gram sizes configurable so that admins can adjust them easily.
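For readers unfamiliar with n-grams, the sketch below shows what an n-gram tokenizer with configurable minimum and maximum sizes produces. It is a plain-Python illustration rather than the Whoosh tokenizer, and the `ngrams` name and default sizes are assumptions.

```python
def ngrams(text, min_size=3, max_size=4):
    """Yield every substring of `text` whose length is between min_size and max_size."""
    text = text.lower()
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            yield text[start:start + size]
```

Indexing the name "Sort" this way stores the grams "sor", "ort", and "sort", so a partial query such as "ort" can match the tool without relying on wildcards.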

15 of 32

Implemented Whoosh Analyzers

  • An analyzer is a function or callable class that takes a Unicode string and returns a generator of tokens.
  • This lets us modify and transform the query string for each individual field that we define in our schema.
  • For example, it allows us to apply a stemming filter that removes suffixes from words in a search query to help find better matches.
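The idea of an analyzer can be illustrated with a plain-Python generator that lowercases the text, splits it into tokens, and strips a few suffixes. The crude suffix list and the `strip_suffix` and `analyzer` names are assumptions for this sketch; a real stemming filter is considerably more careful.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # deliberately tiny, illustrative suffix list


def strip_suffix(token):
    """Crude stand-in for a stemming filter: drop the first matching suffix."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyzer(text):
    """Take a string and yield lowercase, suffix-stripped tokens."""
    for match in re.finditer(r"\w+", text.lower()):
        yield strip_suffix(match.group())
```

Because "Sorting", "sorted", and "sorts" all reduce to the token "sort", a query using any of these forms can match a tool indexed under another.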

16 of 32

Fixed id Search

Additionally, we fixed the id field so that it is actually searchable.

17 of 32

Changed how Config Params are Accessed

We also exposed the boost parameters, n-gram sizes, and other new settings as customizable options in config_schema.yml.

18 of 32

Results After Changes

19 of 32

Why refactor (or clean) code?

Why (any of below)

  • readability (1+ devs)
  • persistent bugs (6+ months)
  • code alignment

credit: https://www.flickr.com/photos/mercer52/16141913875

20 of 32

API-to-Front End

21 of 32

Code Conventions

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

principal developer at Xerox PARC, Digital, Silicon Graphics, and Netscape (Karlton.org)

22 of 32

Code Conventions for Clarity*

Nice to Haves:

  • Consistent variable structure: (camelCase OR under_score, not both)
  • Object-oriented variable naming: ("data" = "tools", "results", "tags")
  • Static object naming ("inputField" = "query", "date", "category")
  • Consistent naming and ordering within methods (f[x,y] = p[x,y], z[x,y])
  • Full words, no abbreviations (non-native language devs)
  • Long variables/methods/functions (characters or lines) take priority over brevity** (large development teams)

____

* Unless a developer is working entirely on their own, the code must be easy for other developers to understand, maintain, and extend.

** Sometimes the coding language or technology limits name length, in which case brevity in naming must be adhered to.

23 of 32

Code Conventions for Clarity*

Before

After

24 of 32

SOLID coding principles

Key Takeaways (from code file top-to-bottom):

  • Single-responsibility principle
    • Concise code (methods/functions no more than 7 lines)
    • Libraries are fine for limited use
  • Open–closed principle
    • Wrapped methods in Classes for easier sharing between variables within code object
  • Liskov substitution principle
    • Seamless connectivity (ToolboxWorkflow.vue)
  • Interface segregation principle
    • Constants held in database/API or listed at top of files
  • Dependency inversion principle
    • Naming variables using objects, nouns

https://en.wikipedia.org/wiki/SOLID

25 of 32

Front End Test Coverage *

* Integration coverage is extremely helpful for more complex features too.

Unit (or Integration)   Filename              Existing or New?
Unit                    ToolSection.test.js   Existing
Unit                    ToolSearch.test.js    New

26 of 32

Advanced Tool Search

  • In 22.05, advanced search was added to the history panel.
  • Following the same pattern, the need arose for an advanced tool search.

27 of 32

Advanced Tool Search

Implemented and merged in dev, as a dropdown, advanced menu similar to the history:

28 of 32

Advanced Tool Search

29 of 32

  • The filter is not applied to the tool panel itself.
  • Results show up in the center panel in a richly formatted b-table instead.
  • The table shows tool name, short description, panel section, workflow compatibility, and tool target (local/not local).

30 of 32

Advanced Tool Search

Clicking on a tool name may show tool help text/information.

31 of 32

Advanced Tool Search

  • All tools are retrieved from api/tools.
  • The endpoint does not provide enough variables to query.
  • It could be handy to have tool input/output data type as a filter in the future.

32 of 32

Special Thanks

  • Cameron Hyde
  • Dannon Baker
  • Björn Grüning
  • Aysam Guerler
  • everyone else who added input!