FLUID-6378: Add ability to search infusion docs using phrases.

Metadata

Source
FLUID-6378
Type
Task
Priority
Major
Status
Closed
Resolution
Fixed
Assignee
Tony Atkins [RtF]
Reporter
Tony Atkins [RtF]
Created
2019-04-30T07:13:27.131-0400
Updated
2024-07-22T10:35:19.047-0400
Versions
N/A
Fixed Versions
N/A
Component
  1. Tech. Documentation

Description

FLUID-5722 added basic searching of the documentation. Antranig Basman and I have discussed improvements to this, namely:

  1. Support for quoted phrases or quoted words as exact matches.
  2. Support for plus and minus signs.

As Fuse.js does not support phrase matching, but does support "whole query" matching, one approach would be to create two Fuse instances, one "exact", one "fuzzy". We would then remove quoted phrases from the overall query string, and search for those individually using the "exact" search. The non-phrase data remaining would be run through the "fuzzy" search. The results would be interpolated such that:

  1. The "relevance" score for each "hit" (page and section) would be the sum of its relevance in the individual searches, such that a hit that matches multiple phrases would score higher than a hit that matches only one phrase.
  2. Any phrase with a plus would be used to filter "fuzzy" results to ensure that only pages that contain the exact phrase are allowed.
  3. Any phrase with a minus would be used to exclude matches from "fuzzy" results (and from other phrase matching).

This approach to phrase matching would multiply the search time, i.e. you'd need 50-75 ms extra per phrase, so we might need to add a loading animation, depending on how slow searches turn out to be.
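
The score interpolation in point 1 above could be sketched as follows. This is a minimal, library-free illustration: the `sumScores` helper, the `{ id, score }` hit shape, and the convention that a higher score means a better match are all assumptions (Fuse.js itself reports lower scores for better matches, so its scores would need inverting first). The plus and minus filtering in points 2 and 3 is not covered here.

```javascript
// Hypothetical helper: merge the hits from several independent searches
// (one per quoted phrase, plus one "fuzzy" search) into a single ranking.
// Assumes each hit looks like { id, score }, where a higher score is better.
function sumScores(resultSets) {
    const totals = new Map();
    for (const hits of resultSets) {
        for (const hit of hits) {
            totals.set(hit.id, (totals.get(hit.id) || 0) + hit.score);
        }
    }
    // Highest combined score first.
    return Array.from(totals, ([id, score]) => ({ id, score }))
        .sort((a, b) => b.score - a.score);
}

// Illustrative data only: two "exact" phrase searches and one "fuzzy" search.
const phraseOneHits = [{ id: "page-a#intro", score: 1 }, { id: "page-b#api", score: 1 }];
const phraseTwoHits = [{ id: "page-a#intro", score: 1 }];
const fuzzyHits = [{ id: "page-b#api", score: 0.4 }, { id: "page-c#faq", score: 0.7 }];

const ranked = sumScores([phraseOneHits, phraseTwoHits, fuzzyHits]);
```

In this example the section that matches both phrases ranks first, which is the behaviour point 1 asks for.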

Attachments

Comments

  • Antranig Basman commented 2019-06-30T18:03:49.769-0400

    If Fuse does any kind of fuzzy matching, it should at least be on the basis of some kind of recognisable stemming. Right now, a search for "reference" also hits "preference" which is nonsensical. Unfortunately each time nerds reinvent the wheel they throw out the last 30 years of its turns.

  • Tony Atkins [RtF] commented 2019-07-01T04:29:14.816-0400

    Looking at their demo as docs, it seems like the "tokenize" option might address this. We'd want to test both for performance and relevance to confirm, but with luck this part might simply be a change in configuration options.

  • Tony Atkins [RtF] commented 2019-07-01T04:53:31.661-0400

    It seems like we really need the tokenize option: further testing today highlights the problems with the threshold, distance, and location approach. Take for example something as simple as searching for promise rejection material.

    I've also noticed that the problem is a bit more severe than simply not supporting plus signs or quotes. For example:

    So it seems like plus signs, minus signs, and quotes at best result in confusing search hits, and are often harmful. I am thinking through the query string parsing strategy and other issues at the moment and may spike this improvement between other things.

  • Tony Atkins [RtF] commented 2019-07-01T05:36:22.937-0400

    There is no stemming support at all. It seems like our options within Fuse.js are:

    1. current behaviour: return matches containing each search word as a substring, i.e. spaces and other word boundaries are not meaningful. Order by threshold, location, and distance.
    2. tokenize: Only return matches containing one or more of the search terms, but only matching "whole words".
    3. tokenize and matchAllTokens: Only return matches containing all of the whole words (in any order) with matches in the correct order ranked higher.
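
The difference between these three behaviours can be illustrated without Fuse.js at all. The sketch below is a plain JavaScript approximation of the semantics described in the list above, not Fuse's actual algorithm (it ignores threshold, location, distance, and ranking). All function names are hypothetical.

```javascript
// Split text into lowercase "whole words".
function words(text) {
    return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// 1. Substring behaviour: word boundaries are not meaningful.
function matchesSubstring(text, query) {
    const haystack = text.toLowerCase();
    return words(query).some((term) => haystack.includes(term));
}

// 2. "tokenize"-like behaviour: at least one term matches a whole word.
function matchesAnyToken(text, query) {
    const tokens = new Set(words(text));
    return words(query).some((term) => tokens.has(term));
}

// 3. "tokenize" plus "matchAllTokens"-like behaviour: every term matches a whole word.
function matchesAllTokens(text, query) {
    const tokens = new Set(words(text));
    return words(query).every((term) => tokens.has(term));
}

const section = "Setting a preference for promise handling";
matchesSubstring(section, "reference");          // true: "reference" is inside "preference"
matchesAnyToken(section, "reference");           // false: no whole word "reference"
matchesAnyToken(section, "promise rejection");   // true: "promise" matches
matchesAllTokens(section, "promise rejection");  // false: "rejection" is missing
```

Whole-word matching avoids the "reference" versus "preference" hit, but it is still not stemming: "preferences" would not match "preference" under either token mode.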

    First, let's assume that:

    1. All words and phrases with plusses are "must have"
    2. All words and phrases with minuses are "must not have"
    3. All words and phrases with neither are "may have" (used to boost weighting).
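
These assumptions imply a small query parser. A minimal sketch follows. The `parseQuery` name, the output shape, and the regular expression are assumptions for illustration rather than the parser that was eventually written.

```javascript
// Hypothetical parser: split a query into "must have" (+), "must not have" (-)
// and "may have" (no prefix) terms. Double-quoted text is kept as one phrase.
function parseQuery(query) {
    const parsed = { must: [], mustNot: [], may: [] };
    // Optional +/- prefix, then either a quoted phrase or a run of non-space characters.
    const termPattern = /([+-]?)(?:"([^"]*)"|(\S+))/g;
    let match;
    while ((match = termPattern.exec(query)) !== null) {
        const [, prefix, phrase, word] = match;
        const text = (phrase !== undefined ? phrase : word).trim();
        if (!text) {
            continue; // Skip empty quotes such as "".
        }
        const term = { text, isPhrase: phrase !== undefined };
        if (prefix === "+") {
            parsed.must.push(term);
        } else if (prefix === "-") {
            parsed.mustNot.push(term);
        } else {
            parsed.may.push(term);
        }
    }
    return parsed;
}

const parsed = parseQuery('+"promise rejection" -jquery handler');
// parsed.must    holds the phrase "promise rejection"
// parsed.mustNot holds the word "jquery"
// parsed.may     holds the word "handler"
```

Because the prefix is only recognised at the start of a term, a hyphenated word such as "post-search" is kept whole rather than being read as an exclusion.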

    We could achieve something more reasonable than we have now with multiple passes, as follows:

    1. Each "must have" phrase is used in a "tokenize and match all tokens" pass, filtered to exact matches.
    2. Each "must not have" phrase is used in a "tokenize and match all tokens" pass, filtered to exact matches.
    3. All "must have" individual words are used in a "tokenize and match all tokens" pass.
    4. All "must not have" individual words are used in a "tokenize and match all tokens" pass.
    5. Each "may have" phrase is used in a "tokenize and match all tokens" pass, filtered to exact matches.
    6. All "may have" individual words are used in a "tokenize" pass.
    7. "may have" and "must have" results are combined into a single set of results in which "must haves" that also match "may haves" are upweighted.
    8. the combined results are updated to filter out all "must not have" results.

    My assumption in the above design is that only including "must not have" words and phrases is not supported. Filtering, upweighting, etc. are all handled using the section identifier, i.e. a page may contain excluded material in one or more sections but still have search hits from other sections returned.
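
Steps 7 and 8 could be sketched as below, keyed on the section identifier as described. The `combineResults` name, the `{ sectionId, score }` hit shape, and the simple additive upweighting are assumptions for illustration. The design does not spell out how a query with no "must have" terms is handled, so the sketch simply falls back to the "may have" hits in that case.

```javascript
// Hypothetical combiner for steps 7 and 8. Every hit is { sectionId, score },
// where sectionId identifies one section of one page and higher scores are better.
function combineResults(mustPasses, mayHits, mustNotHits) {
    const mayScores = new Map(mayHits.map((hit) => [hit.sectionId, hit.score]));
    const excluded = new Set(mustNotHits.map((hit) => hit.sectionId));

    let combined;
    if (mustPasses.length > 0) {
        // A section has to appear in every "must have" pass to survive.
        const [first, ...rest] = mustPasses;
        const restIds = rest.map((hits) => new Set(hits.map((hit) => hit.sectionId)));
        combined = first
            .filter((hit) => restIds.every((ids) => ids.has(hit.sectionId)))
            // Upweight "must have" sections that also matched a "may have" term.
            .map((hit) => ({
                sectionId: hit.sectionId,
                score: hit.score + (mayScores.get(hit.sectionId) || 0)
            }));
    } else {
        // Assumption: with no "must have" terms, the "may have" hits stand alone.
        combined = mayHits.map((hit) => ({ sectionId: hit.sectionId, score: hit.score }));
    }

    // Step 8: drop any section that matched a "must not have" term.
    return combined
        .filter((hit) => !excluded.has(hit.sectionId))
        .sort((a, b) => b.score - a.score);
}

// Illustrative data: one "must have" pass, one "may have" hit, one excluded section.
const results = combineResults(
    [[{ sectionId: "promises#rejection", score: 1 }, { sectionId: "promises#intro", score: 1 }]],
    [{ sectionId: "promises#intro", score: 0.5 }],
    [{ sectionId: "promises#rejection", score: 1 }]
);
```

Here one section of the page is excluded while another section of the same page is still returned, with its score raised by the "may have" match.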

  • Tony Atkins [RtF] commented 2019-07-08T10:48:23.070-0400

    In later work with fuse.js, I was having difficulty performing a search in which one or more search terms are required in a given search. This is a deal breaker for any meaningful improvement, and is a reason to convert to using lunr.js as Antranig and I have been discussing. In addition, although its indexing time is lower, fuse.js requires so much time per search that we are better off moving to lunr.js purely for performance reasons, as advanced searches might take four or five search passes and end up taking multiple seconds with fuse.js.

    I completed the search string parsing that is required for either approach earlier today and will move onto the fuse.js refactor shortly. With fuse.js I think it's possible to even achieve phrase matching as a post-search filtering step without bloating the load time too much.
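
Phrase matching as a post-search filtering step might look like the sketch below. It assumes each hit carries the section's plain text in a `text` field, and it normalises case and whitespace before comparing. The `filterByPhrase` name and the hit shape are hypothetical, and neither library's API is used.

```javascript
// Collapse whitespace and lowercase so line breaks and case don't defeat a match.
function normalise(text) {
    return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Hypothetical post-search filter: keep only the hits whose text contains the exact phrase.
function filterByPhrase(hits, phrase) {
    const wanted = normalise(phrase);
    return hits.filter((hit) => normalise(hit.text).includes(wanted));
}

// Illustrative hits, as they might come back from an ordinary word search.
const hits = [
    { id: "promises#rejection", text: "Handling promise\nrejection in a component" },
    { id: "promises#intro", text: "A rejection handler is attached to the promise" }
];
const phraseHits = filterByPhrase(hits, "promise rejection");
```

Only the first hit survives, because the second contains both words but not in sequence. This plain substring test is not aware of word boundaries, so a stricter version would need to compare token sequences instead.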

  • Tony Atkins [RtF] commented 2019-07-09T14:56:52.504-0400

    I completed the initial work with lunr.js. It definitely distinguishes "preference" (which stems down to "prefer") from "reference" (which stems down to "refer"). I need to update the highlighting and make a go at quick phrase searching, and then I'll put in the pull.

  • Tony Atkins [RtF] commented 2019-07-09T15:01:17.639-0400

    My work to address this will also remove the docpad metadata from the index and digest, so that we don't end up with results like this: