Parser
============


1. Inherit from Base Classes
-----------------------------------

Inherit from one of the provided base classes, which offer a number of utility methods:

- ``QueryStringParser`` → for query strings
- ``QueryListParser`` → for numbered sub-query lists

2. Tokenization
-----------------------------------

Start by defining regex patterns for the different token types.

Recommended components::

    PARENTHESIS_REGEX = r"[\(\)]"
    LOGIC_OPERATOR_REGEX = r"\b(AND|OR|NOT)\b"
    PROXIMITY_OPERATOR_REGEX = r"NEAR/\d+"
    SEARCH_FIELD_REGEX = r"\b\w{2}="  # generic: like TI=, AB=
    SEARCH_TERM_REGEX = r"\"[^\"]+\"|[^\s()]+"  # quoted phrase, or a run without whitespace/parentheses

Join them into one pattern::

    pattern = "|".join([
        PARENTHESIS_REGEX,
        LOGIC_OPERATOR_REGEX,
        PROXIMITY_OPERATOR_REGEX,
        SEARCH_FIELD_REGEX,
        SEARCH_TERM_REGEX
    ])

.. note::
   Use a **broad** regex for field detection, and validate values later via the linter.
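
As a rough illustration of how the combined pattern splits a query, the following standalone sketch applies it with ``re.finditer``. The helper ``split_query`` and the sample query are hypothetical, and the term pattern here excludes parentheses (an assumption of this sketch) so that a closing parenthesis directly after a term becomes its own token.

.. code-block:: python

    import re

    PARENTHESIS_REGEX = r"[\(\)]"
    LOGIC_OPERATOR_REGEX = r"\b(AND|OR|NOT)\b"
    PROXIMITY_OPERATOR_REGEX = r"NEAR/\d+"
    SEARCH_FIELD_REGEX = r"\b\w{2}="
    # Assumption of this sketch: unquoted terms stop at whitespace and parentheses
    SEARCH_TERM_REGEX = r"\"[^\"]+\"|[^\s()]+"

    pattern = "|".join([
        PARENTHESIS_REGEX,
        LOGIC_OPERATOR_REGEX,
        PROXIMITY_OPERATOR_REGEX,
        SEARCH_FIELD_REGEX,
        SEARCH_TERM_REGEX,
    ])


    def split_query(query_str):
        """Return (value, span) pairs in the order they appear in the query."""
        return [(m.group(), m.span()) for m in re.finditer(pattern, query_str)]


    tokens = split_query('TI=("digital health" OR mHealth) NEAR/3 AB=care')
    values = [value for value, _ in tokens]
    print(values)

Because alternatives are tried from left to right, the more specific patterns need to precede the catch-all term pattern.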

Implement ``tokenize()`` to assign token types and positions::

    for match in re.finditer(self.pattern, self.query_str):
        token = match.group().strip()
        start, end = match.span()

        # Classify the token by testing it against each pattern in turn
        if re.fullmatch(self.PARENTHESIS_REGEX, token):
            token_type = (
                TokenTypes.PARENTHESIS_OPEN
                if token == "("
                else TokenTypes.PARENTHESIS_CLOSED
            )
        elif re.fullmatch(self.LOGIC_OPERATOR_REGEX, token):
            token_type = TokenTypes.LOGIC_OPERATOR
        else:
            # Add branches for the remaining patterns (proximity operators,
            # search fields, search terms) before falling back to UNKNOWN
            token_type = TokenTypes.UNKNOWN

        self.tokens.append(Token(value=token, type=token_type, position=(start, end)))

Use or override ``combine_subsequent_terms()`` to join adjacent term tokens, e.g., ``data`` ``science`` → ``data science``:

.. code-block:: python

    self.combine_subsequent_terms()
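
For parsers that override this step, the merging logic could look roughly like the following standalone sketch. ``Token`` and ``TokenTypes`` are simplified, hypothetical stand-ins for the library classes, and the function is not the library's implementation.

.. code-block:: python

    from dataclasses import dataclass
    from enum import Enum
    from typing import List, Tuple


    class TokenTypes(Enum):  # hypothetical stand-in for the library's enum
        SEARCH_TERM = "SEARCH_TERM"
        LOGIC_OPERATOR = "LOGIC_OPERATOR"


    @dataclass
    class Token:  # hypothetical stand-in for the library's token class
        value: str
        type: TokenTypes
        position: Tuple[int, int]


    def combine_subsequent_terms(tokens: List[Token]) -> List[Token]:
        """Merge runs of adjacent SEARCH_TERM tokens into a single token."""
        combined: List[Token] = []
        for token in tokens:
            previous = combined[-1] if combined else None
            if (
                previous is not None
                and previous.type == TokenTypes.SEARCH_TERM
                and token.type == TokenTypes.SEARCH_TERM
            ):
                # Extend the previous term and stretch its span to the new end
                combined[-1] = Token(
                    value=f"{previous.value} {token.value}",
                    type=TokenTypes.SEARCH_TERM,
                    position=(previous.position[0], token.position[1]),
                )
            else:
                combined.append(token)
        return combined


    tokens = [
        Token("data", TokenTypes.SEARCH_TERM, (0, 4)),
        Token("science", TokenTypes.SEARCH_TERM, (5, 12)),
        Token("AND", TokenTypes.LOGIC_OPERATOR, (13, 16)),
        Token("health", TokenTypes.SEARCH_TERM, (17, 23)),
    ]
    merged = combine_subsequent_terms(tokens)
    print([t.value for t in merged])  # ['data science', 'AND', 'health']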

3. Build the parse methods
-----------------------------------

Call the linter to check for errors:

.. code-block:: python

    self.linter.validate_tokens()
    self.linter.check_status()

To make implicit operator precedence explicit, add artificial parentheses. Give them the position ``(-1,-1)``, since they do not occur in the original query string.
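
One way to picture this step is the sketch below, which wraps each AND-group of a flat token list in artificial parentheses. It assumes that AND binds more tightly than OR (precedence rules differ between databases) and that the input contains no parentheses. The ``(value, position)`` tuples and the function name are hypothetical simplifications.

.. code-block:: python

    from typing import List, Tuple

    Token = Tuple[str, Tuple[int, int]]  # (value, position); simplified stand-in

    ARTIFICIAL = (-1, -1)  # position marker for tokens not present in the query


    def add_artificial_parentheses(tokens: List[Token]) -> List[Token]:
        """Wrap each AND-group of a flat token list in artificial parentheses."""
        values = [value for value, _ in tokens]
        if "AND" not in values or "OR" not in values:
            return tokens  # a single operator type leaves no ambiguity

        result: List[Token] = []
        group: List[Token] = []
        # A trailing sentinel OR flushes the last group
        for token in tokens + [("OR", ARTIFICIAL)]:
            if token[0] != "OR":
                group.append(token)
                continue
            if len(group) > 1:  # the group contains an AND: make it explicit
                result += [("(", ARTIFICIAL), *group, (")", ARTIFICIAL)]
            else:
                result += group
            result.append(token)
            group = []
        return result[:-1]  # drop the sentinel


    tokens = [("A", (0, 1)), ("OR", (2, 4)), ("B", (5, 6)), ("AND", (7, 10)), ("C", (11, 12))]
    wrapped = add_artificial_parentheses(tokens)
    print(" ".join(value for value, _ in wrapped))  # A OR ( B AND C )

A full parser would apply the same idea at every nesting level rather than only to a flat list.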

Implement ``parse_query_tree()`` to build the query object, creating nested queries for parentheses.

.. note::

    Parsers can be developed as top-down parsers (see PubMed) or bottom-up parsers (see Web of Science).

    It is recommended to parse a NOT operator as a query with two children, where the second child is the negated part (i.e., the operator is interpreted as ``AND NOT``). For example, ``A NOT B`` should be parsed as:

    .. code-block:: python

        NOT[A, B]
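
The following top-down sketch shows how nested queries for parentheses and the two-child NOT could be built. It works on plain string tokens, assumes that each nesting level uses a single operator type (precedence already made explicit) and that parentheses are balanced (checked by the linter beforehand). ``Node`` is a hypothetical stand-in for the library's query class.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import List

    OPERATORS = {"AND", "OR", "NOT"}


    @dataclass
    class Node:  # hypothetical stand-in for the library's query class
        value: str
        children: List["Node"] = field(default_factory=list)

        def __str__(self) -> str:
            if not self.children:
                return self.value
            return f"{self.value}[{', '.join(str(c) for c in self.children)}]"


    def parse_query_tree(tokens: List[str]) -> Node:
        """Parse tokens in which each nesting level uses a single operator."""
        operands: List[Node] = []
        operator = ""
        i = 0
        while i < len(tokens):
            token = tokens[i]
            if token == "(":
                # Find the matching closing parenthesis and recurse into the span
                depth, j = 1, i + 1
                while depth > 0:
                    depth += {"(": 1, ")": -1}.get(tokens[j], 0)
                    j += 1
                operands.append(parse_query_tree(tokens[i + 1 : j - 1]))
                i = j
                continue
            if token in OPERATORS:
                operator = token
            else:
                operands.append(Node(token))
            i += 1
        if not operator:
            return operands[0]
        return Node(operator, operands)


    print(parse_query_tree(["A", "NOT", "B"]))  # NOT[A, B]
    print(parse_query_tree(["(", "eHealth", "OR", "mHealth", ")", "AND", "care"]))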

Check whether ``SearchFields`` can be created for nested queries (e.g., ``TI=(eHealth OR mHealth)``) or only for individual terms (e.g., ``eHealth[ti] OR mHealth[ti]``).

**Parser Skeleton**

.. literalinclude:: parser_skeleton.py
   :language: python


List Format Support
-----------------------------------

Implement a ``QueryListParser`` to handle numbered sub-queries and references such as ``#1 AND #2``.

.. note::

   To parse a list format, replace each reference to a numbered sub-query with the sub-query it points to. The result is a single search string that can be parsed with the standard string parser, which avoids a redundant implementation.
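
The substitution can be sketched as follows. The dictionary layout and the helper ``resolve_references`` are hypothetical, and the sketch assumes that references are not circular.

.. code-block:: python

    import re
    from typing import Dict


    def resolve_references(query_list: Dict[str, str], node: str) -> str:
        """Recursively replace ``#n`` references with the referenced sub-query."""

        def substitute(match: re.Match) -> str:
            referenced = match.group(1)
            # Parenthesize the inserted sub-query to preserve its grouping
            return f"({resolve_references(query_list, referenced)})"

        return re.sub(r"#(\d+)", substitute, query_list[node])


    query_list = {
        "1": "TI=eHealth OR TI=mHealth",
        "2": "AB=care",
        "3": "#1 AND #2",
    }
    search_string = resolve_references(query_list, "3")
    print(search_string)  # (TI=eHealth OR TI=mHealth) AND (AB=care)

The resulting string can then be handed to the string parser. Note that token positions refer to the combined string rather than to the original list entries.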