Parser

1. Inherit from Base Classes

Subclass one of the provided base classes, which supply a number of utility methods:

  • QueryStringParser → for query strings

  • QueryListParser → for numbered sub-query lists

2. Tokenization

Start by defining regex patterns for the different token types.

Recommended components:

PARENTHESIS_REGEX = r"[\(\)]"
LOGIC_OPERATOR_REGEX = r"\b(AND|OR|NOT)\b"
PROXIMITY_OPERATOR_REGEX = r"NEAR/\d+"
FIELD_REGEX = r"\b\w{2}="  # generic two-letter field tags, e.g., TI=, AB=
TERM_REGEX = r"\"[^\"]+\"|[^\s()]+"  # quoted phrase, or a run without spaces/parentheses

Join them into one pattern:

pattern = "|".join([
    PARENTHESIS_REGEX,
    LOGIC_OPERATOR_REGEX,
    PROXIMITY_OPERATOR_REGEX,
    FIELD_REGEX,
    TERM_REGEX
])
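
As a quick check of how the alternation behaves, the following sketch runs a combined pattern over two sample queries. The order of the alternatives matters: the specific patterns come first and the broad term pattern last. The term pattern used here excludes parentheses from unquoted terms so that a closing parenthesis is not absorbed into the preceding term. The helper name split_query and the sample queries are illustrative assumptions.

```python
import re

PARENTHESIS_REGEX = r"[\(\)]"
LOGIC_OPERATOR_REGEX = r"\b(AND|OR|NOT)\b"
PROXIMITY_OPERATOR_REGEX = r"NEAR/\d+"
FIELD_REGEX = r"\b\w{2}="
# Assumption for this sketch: unquoted terms stop at whitespace or parentheses.
TERM_REGEX = r"\"[^\"]+\"|[^\s()]+"

pattern = "|".join([
    PARENTHESIS_REGEX,
    LOGIC_OPERATOR_REGEX,
    PROXIMITY_OPERATOR_REGEX,
    FIELD_REGEX,
    TERM_REGEX,
])


def split_query(query_str):
    """Return the raw token strings in the order they appear (hypothetical helper)."""
    # match.group() is the whole match, regardless of the capturing group
    # inside LOGIC_OPERATOR_REGEX.
    return [match.group() for match in re.finditer(pattern, query_str)]


print(split_query('TI=("digital health" OR eHealth) NEAR/5 app'))
print(split_query("ANDROID NOT iOS"))
```

The word boundaries in LOGIC_OPERATOR_REGEX keep a term such as ANDROID from being split into AND plus a remainder.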

Note

Use a broad regex for field detection and validate the matched field values later in the linter.

Implement tokenize() to assign token types and positions:

for match in re.finditer(self.pattern, self.query_str):
    token = match.group().strip()
    start, end = match.span()

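    if re.fullmatch(self.PARENTHESIS_REGEX, token):
        # Distinguish opening from closing parentheses.
        token_type = (
            TokenTypes.PARENTHESIS_OPEN
            if token == "("
            else TokenTypes.PARENTHESIS_CLOSED
        )
    elif re.fullmatch(self.LOGIC_OPERATOR_REGEX, token):
        token_type = TokenTypes.LOGIC_OPERATOR
    else:
        # Add elif branches for the remaining patterns (proximity
        # operators, fields, terms) before falling back to UNKNOWN.
        token_type = TokenTypes.UNKNOWN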

    self.tokens.append(Token(value=token, type=token_type, position=(start, end)))
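
For experimenting outside the package, the classification step can be sketched as a standalone function. Token and TokenTypes below are minimal stand-ins, and the member names PROXIMITY_OPERATOR, FIELD, and TERM are assumptions made for this sketch; check the package's own definitions before relying on them.

```python
import re
from dataclasses import dataclass
from enum import Enum


class TokenTypes(Enum):
    """Stand-in enum; member names are assumptions for this sketch."""
    PARENTHESIS_OPEN = "PARENTHESIS_OPEN"
    PARENTHESIS_CLOSED = "PARENTHESIS_CLOSED"
    LOGIC_OPERATOR = "LOGIC_OPERATOR"
    PROXIMITY_OPERATOR = "PROXIMITY_OPERATOR"
    FIELD = "FIELD"
    TERM = "TERM"
    UNKNOWN = "UNKNOWN"


@dataclass
class Token:
    """Stand-in token carrying a value, a type, and a (start, end) position."""
    value: str
    type: TokenTypes
    position: tuple


# Checked in order: specific patterns first, the broad term pattern last.
TYPE_RULES = [
    (r"\b(AND|OR|NOT)\b", TokenTypes.LOGIC_OPERATOR),
    (r"NEAR/\d+", TokenTypes.PROXIMITY_OPERATOR),
    (r"\b\w{2}=", TokenTypes.FIELD),
    (r"\"[^\"]+\"|[^\s()]+", TokenTypes.TERM),
]
PATTERN = "|".join([r"[\(\)]"] + [regex for regex, _ in TYPE_RULES])


def classify(value):
    """Map a raw token string to a token type."""
    if value == "(":
        return TokenTypes.PARENTHESIS_OPEN
    if value == ")":
        return TokenTypes.PARENTHESIS_CLOSED
    for regex, token_type in TYPE_RULES:
        if re.fullmatch(regex, value):
            return token_type
    return TokenTypes.UNKNOWN


def tokenize(query_str):
    """Return typed tokens with their character spans in query_str."""
    return [
        Token(match.group(), classify(match.group()), match.span())
        for match in re.finditer(PATTERN, query_str)
    ]


for token in tokenize("TI=(eHealth OR mHealth)"):
    print(token.position, token.type.name, token.value)
```

Keeping the positions from match.span() lets the linter point at the exact characters of a problematic token.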

Use or override combine_subsequent_terms():

self.combine_subsequent_terms()

This joins adjacent term tokens into a single term, e.g., data and science become data science.
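
If the default behavior needs to be overridden, the merging logic could look roughly like the sketch below. It is not the base class implementation: tokens are modeled as plain (value, kind, (start, end)) tuples, and the kind label "TERM" is an assumption of this example.

```python
def combine_subsequent_terms(tokens):
    """Merge runs of adjacent TERM tokens into one token spanning the run."""
    combined = []
    for value, kind, (start, end) in tokens:
        if combined and kind == "TERM" and combined[-1][1] == "TERM":
            # Extend the previous term: join the values, keep its start,
            # and take the end of the current token.
            prev_value, _, (prev_start, _) = combined[-1]
            combined[-1] = (f"{prev_value} {value}", "TERM", (prev_start, end))
        else:
            combined.append((value, kind, (start, end)))
    return combined


tokens = [
    ("data", "TERM", (0, 4)),
    ("science", "TERM", (5, 12)),
    ("AND", "LOGIC_OPERATOR", (13, 16)),
    ("ethics", "TERM", (17, 23)),
]
print(combine_subsequent_terms(tokens))
```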

3. Build the Parse Methods

Call the linter to check the tokens for errors:

self.linter.validate_tokens()
self.linter.check_status()

Add artificial parentheses (with position (-1, -1), since they do not occur in the query string) to make implicit operator precedence explicit.
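
A minimal sketch of this step follows. It assumes that AND binds more tightly than OR (actual precedence rules differ between databases) and that the token list is flat, i.e., it contains no parentheses yet. Tokens are modeled as (value, position) tuples, and the function name is hypothetical.

```python
ARTIFICIAL_POSITION = (-1, -1)


def add_artificial_parentheses(tokens):
    """Wrap AND-connected runs in parentheses when OR is also present."""
    groups, separators, current = [], [], []
    for token in tokens:
        if token[0] == "OR":
            # Close the current run and remember the OR token itself.
            groups.append(current)
            separators.append(token)
            current = []
        else:
            current.append(token)
    groups.append(current)

    if not separators:
        # Only one operator type is used, so nothing is ambiguous.
        return list(tokens)

    result = []
    for index, group in enumerate(groups):
        if index > 0:
            result.append(separators[index - 1])
        if len(group) > 1:
            # The run contains at least one AND: make its grouping explicit.
            result.append(("(", ARTIFICIAL_POSITION))
            result.extend(group)
            result.append((")", ARTIFICIAL_POSITION))
        else:
            result.extend(group)
    return result


tokens = [("A", (0, 1)), ("OR", (2, 4)), ("B", (5, 6)), ("AND", (7, 10)), ("C", (11, 12))]
print(" ".join(value for value, _ in add_artificial_parentheses(tokens)))
```

Because the inserted tokens carry the position (-1, -1), later steps can tell them apart from parentheses that the user actually typed.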

Implement parse_query_tree() to build the query object, creating nested queries for parentheses.
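
The recursion can be sketched as a small top-down parser. Node is a hypothetical stand-in for the query class, tokens are plain strings, and the sketch assumes that the linter has already confirmed balanced parentheses and that each nesting level uses a single operator (for example, after adding artificial parentheses).

```python
from dataclasses import dataclass, field

OPERATORS = {"AND", "OR", "NOT"}


@dataclass
class Node:
    """Hypothetical stand-in for a query object: an operator or a term."""
    value: str
    children: list = field(default_factory=list)

    def __str__(self):
        if not self.children:
            return self.value
        return f"{self.value}[{', '.join(str(child) for child in self.children)}]"


def parse_query_tree(tokens):
    """Build a nested Node tree from a list of token strings."""
    tree, _ = _parse_level(tokens, 0)
    return tree


def _parse_level(tokens, index):
    """Parse one nesting level; return the node and the next unread index."""
    operands, operator = [], None
    while index < len(tokens):
        token = tokens[index]
        index += 1
        if token == "(":
            # Recurse: a parenthesized group becomes a nested query.
            child, index = _parse_level(tokens, index)
            operands.append(child)
        elif token == ")":
            break
        elif token in OPERATORS:
            if operator not in (None, token):
                raise ValueError("mixed operators on one level")
            operator = token
        else:
            operands.append(Node(token))
    if operator is None:
        if len(operands) != 1:
            raise ValueError("expected exactly one operand")
        return operands[0], index
    return Node(operator, operands), index


print(parse_query_tree(["(", "eHealth", "OR", "mHealth", ")", "AND", "app"]))
```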

Note

Parsers can be developed as top-down parsers (see PubMed) or bottom-up parsers (see Web of Science).

Parse NOT operators as a query with two children, where the second child is the negated part (i.e., the operator is interpreted as AND NOT). For example, A NOT B should be parsed as:

NOT[A, B]
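
To see why the two-child form reads as AND NOT, the sketch below evaluates such a tree against sets of matching record IDs. The tuple-based tree and the hits mapping are hypothetical stand-ins used only for illustration, not the package's query class.

```python
def evaluate(node, hits):
    """Return the record IDs matched by node.

    node is either a term (str) or an (operator, children) tuple;
    hits maps each term to the set of record IDs that contain it.
    """
    if isinstance(node, str):
        return hits.get(node, set())
    operator, children = node
    results = [evaluate(child, hits) for child in children]
    if operator == "AND":
        return set.intersection(*results)
    if operator == "OR":
        return set.union(*results)
    if operator == "NOT":
        # Exactly two children: keep the first, remove the second.
        first, second = results
        return first - second
    raise ValueError(f"unknown operator: {operator}")


hits = {"A": {1, 2, 3}, "B": {2}}
print(evaluate(("NOT", ["A", "B"]), hits))
```

With a single-child NOT, the parser would have no left-hand side to subtract from, which is why the two-child representation is easier to work with.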

Check whether SearchFields can be created for nested queries, e.g., TI=(eHealth OR mHealth), or only for individual terms, e.g., eHealth[ti] OR mHealth[ti].
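
When a syntax only allows fields on individual terms, a field set on a nested query has to be pushed down to its terms. The following sketch illustrates that idea on a tuple-based tree; the helper names and the term[field] rendering are assumptions for this example.

```python
def distribute_field(node, field):
    """Attach field to every term below node.

    node is either a term (str) or an (operator, children) tuple.
    """
    if isinstance(node, str):
        return f"{node}[{field}]"
    operator, children = node
    return (operator, [distribute_field(child, field) for child in children])


def render(node):
    """Render the tree as an infix query string."""
    if isinstance(node, str):
        return node
    operator, children = node
    return f" {operator} ".join(render(child) for child in children)


nested = ("OR", ["eHealth", "mHealth"])  # corresponds to TI=(eHealth OR mHealth)
print(render(distribute_field(nested, "ti")))
```

The render helper omits parentheses around nested operators, which is sufficient for this single-level example only.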

Parser Skeleton

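import re

# QueryStringParser, Query, Token, TokenTypes, and CustomLinter must be
# imported as well; the module paths depend on your package layout.


class CustomParser(QueryStringParser):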
    PARENTHESIS_REGEX = r"[\(\)]"
    LOGIC_OPERATOR_REGEX = r"\b(AND|OR|NOT)\b"
    FIELD_REGEX = r"\b\w{2}="
    TERM_REGEX = r"\"[^\"]+\"|[^\s()]+"

    pattern = "|".join(
        [PARENTHESIS_REGEX, LOGIC_OPERATOR_REGEX, FIELD_REGEX, TERM_REGEX]
    )

    def __init__(self, query_str, *, field_general=""):
        super().__init__(query_str, field_general=field_general)
        self.linter = CustomLinter(self)

    def tokenize(self):
        for match in re.finditer(self.pattern, self.query_str):
            token = match.group().strip()
            start, end = match.span()
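            # Assign token_type with the checks shown in the Tokenization
            # section; UNKNOWN is only a placeholder here.
            token_type = TokenTypes.UNKNOWN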
            self.tokens.append(
                Token(value=token, type=token_type, position=(start, end))
            )

        self.combine_subsequent_terms()

    def parse_query_tree(self, tokens: list) -> Query:
        # Build query tree here
        ...

    def parse(self) -> Query:
        self.tokenize()
        self.linter.validate_tokens()
        self.linter.check_status()
        query = self.parse_query_tree(self.tokens)
        self.linter.validate_query_tree(query)
        self.linter.check_status()
        return query

List Format Support

Subclass QueryListParser to handle numbered sub-queries and references such as #1 AND #2.

Note

To parse a list format, replace each reference (e.g., #1) with the corresponding numbered sub-query to create a single search string, which can then be parsed with the standard string parser. This avoids a redundant implementation.
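
The substitution step could be sketched as follows. The input layout (one "n. query" entry per line, with the last entry as the top-level query) and the function name are assumptions for this example; real list formats vary by database.

```python
import re


def resolve_list_query(query_list):
    """Inline all #n references and return one search string."""
    sub_queries = {}
    for line in query_list.strip().splitlines():
        match = re.match(r"\s*(\d+)\.\s*(.+?)\s*$", line)
        if match:
            sub_queries[match.group(1)] = match.group(2)

    def expand(text, seen=()):
        def replace(reference):
            number = reference.group(1)
            if number in seen:
                raise ValueError(f"circular reference to #{number}")
            if number not in sub_queries:
                raise ValueError(f"unknown reference #{number}")
            # Parenthesize the inlined sub-query to preserve its grouping.
            return "(" + expand(sub_queries[number], seen + (number,)) + ")"

        return re.sub(r"#(\d+)", replace, text)

    # Assumption: the entry with the highest number is the top-level query.
    last = max(sub_queries, key=int)
    return expand(sub_queries[last], (last,))


query_list = """
1. TI=eHealth OR TI=mHealth
2. AB=app
3. #1 AND #2
"""
print(resolve_list_query(query_list))
```

The resulting string can be handed to the string parser. A production version would also need to map token positions back to the original list entries so that linter messages point to the right line.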