# Parser

## Versioned parsers
Parsers live in versioned modules such as `search_query/pubmed/v1/parser.py`. Keeping previous versions allows reproducible parsing and backward compatibility. See the versioning policy for details.
The central registry in `search_query.parser` exposes a `PARSERS` mapping and resolves the appropriate version at runtime. Calling `parse(..., parser_version="latest")` loads the highest available version for the chosen platform.
When introducing a new parser version, copy the previous versioned directory, adjust the implementation, and register the new version in the `PARSERS` dictionary.
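
For illustration, here is a minimal sketch of how such a registry can resolve versions. The class names, the exact structure of `PARSERS`, and the `get_parser_class` helper are assumptions for this sketch, not the package's actual layout:

```python
class PubmedParserV1: ...  # placeholder for the v1 parser class
class PubmedParserV2: ...  # placeholder for the v2 parser class

# Hypothetical sketch of a versioned parser registry; the real PARSERS
# mapping in search_query.parser may be structured differently.
PARSERS = {
    "pubmed": {
        "v1": PubmedParserV1,
        "v2": PubmedParserV2,
    },
}

def get_parser_class(platform: str, parser_version: str = "latest"):
    versions = PARSERS[platform]
    if parser_version == "latest":
        # Resolve to the highest available version for the platform
        # (lexicographic max suffices for single-digit versions).
        parser_version = max(versions)
    return versions[parser_version]
```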
## 1. Inherit from base classes
Use the provided base classes, which offer a number of utility methods:

- `QueryStringParser` → for query strings
- `QueryListParser` → for numbered sub-query lists
## 2. Tokenization
Start by defining regex patterns for the different token types. Recommended components:

```python
PARENTHESIS_REGEX = r"[\(\)]"
LOGIC_OPERATOR_REGEX = r"\b(AND|OR|NOT)\b"
PROXIMITY_OPERATOR_REGEX = r"NEAR/\d+"
FIELD_REGEX = r"\b\w{2}="  # generic: matches fields like TI=, AB=
TERM_REGEX = r"\"[^\"]+\"|\S+"
```
Join them into one pattern:

```python
pattern = "|".join([
    PARENTHESIS_REGEX,
    LOGIC_OPERATOR_REGEX,
    PROXIMITY_OPERATOR_REGEX,
    FIELD_REGEX,
    TERM_REGEX,
])
```
**Note:** Use a broad regex for field detection, and validate values later via the linter.
Implement `tokenize()` to assign token types and positions:

```python
for match in re.finditer(self.pattern, self.query_str):
    token = match.group().strip()
    start, end = match.span()
    if re.fullmatch(self.PARENTHESIS_REGEX, token):
        token_type = (
            TokenTypes.PARENTHESIS_OPEN
            if token == "("
            else TokenTypes.PARENTHESIS_CLOSED
        )
    elif re.fullmatch(self.LOGIC_OPERATOR_REGEX, token):
        token_type = TokenTypes.LOGIC_OPERATOR
    else:
        token_type = TokenTypes.UNKNOWN
    self.tokens.append(Token(value=token, type=token_type, position=(start, end)))
```
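
To see what the matching loop produces, here is a standalone run of the combined pattern on a sample query (outside a parser class, so `self.` is dropped and type assignment is omitted; the query string is illustrative):

```python
import re

# Standalone demo of the token matching on a sample query.
pattern = "|".join([
    r"[\(\)]",
    r"\b(AND|OR|NOT)\b",
    r"NEAR/\d+",
    r"\b\w{2}=",
    r"\"[^\"]+\"|\S+",
])

for match in re.finditer(pattern, 'TI=(eHealth OR "digital health")'):
    print(match.group(), match.span())
# TI= (0, 3)
# ( (3, 4)
# eHealth (4, 11)
# OR (12, 14)
# "digital health" (15, 31)
# ) (31, 32)
```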
Use or override `combine_subsequent_terms()` to join adjacent term tokens such as `data` `science` → `data science`:

```python
self.combine_subsequent_terms()
```
## 3. Build the parse methods
Call the linter to check for errors:

```python
self.linter.validate_tokens()
self.linter.check_status()
```
Add artificial parentheses (position: `(-1, -1)`) to handle implicit operator precedence.
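
A minimal sketch of what this can look like inside the parser; the `group_start`/`group_end` indices are hypothetical and depend on how the parser detects precedence groups:

```python
# Hypothetical sketch: wrap a higher-precedence group in artificial
# parentheses so that "A AND B OR C" is treated as "(A AND B) OR C".
# Artificial tokens carry the position (-1, -1).
open_token = Token(value="(", type=TokenTypes.PARENTHESIS_OPEN, position=(-1, -1))
close_token = Token(value=")", type=TokenTypes.PARENTHESIS_CLOSED, position=(-1, -1))
self.tokens.insert(group_start, open_token)      # group_start: hypothetical index
self.tokens.insert(group_end + 2, close_token)   # +2 accounts for the inserted "("
```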
Implement `parse_query_tree()` to build the query object, creating nested queries for parentheses.
**Note:** Parsers can be developed as top-down parsers (see PubMed) or bottom-up parsers (see Web of Science).
For NOT operators, it is recommended to parse them as a query with two children, where the second child is the negated part (i.e., the operator is interpreted as `AND NOT`). For example, `A NOT B` should be parsed as:

```
NOT[A, B]
```
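
A hypothetical construction sketch; the token slices and the `Query` constructor arguments are illustrative rather than the package's exact API:

```python
# Hypothetical sketch for "A NOT B": the first child is the kept part,
# the second child is the negated part ("A AND NOT B" semantics).
query_a = self.parse_query_tree(tokens_a)  # tokens_a/tokens_b: hypothetical slices
query_b = self.parse_query_tree(tokens_b)
not_query = Query(value="NOT", operator=True, children=[query_a, query_b])
```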
Check whether `SearchFields` can be created for nested queries (e.g., `TI=(eHealth OR mHealth)`) or only for individual terms (e.g., `eHealth[ti] OR mHealth[ti]`).
## Parser Skeleton
```python
import re

# QueryStringParser, Token, TokenTypes, and Query are provided by the
# package; CustomLinter is your platform-specific linter.

class CustomParser(QueryStringParser):
    PARENTHESIS_REGEX = r"[\(\)]"
    LOGIC_OPERATOR_REGEX = r"\b(AND|OR|NOT)\b"
    FIELD_REGEX = r"\b\w{2}="
    TERM_REGEX = r"\"[^\"]+\"|\S+"

    pattern = "|".join(
        [PARENTHESIS_REGEX, LOGIC_OPERATOR_REGEX, FIELD_REGEX, TERM_REGEX]
    )

    def __init__(self, query_str, *, field_general=""):
        super().__init__(query_str, field_general=field_general)
        self.linter = CustomLinter(self)

    def tokenize(self):
        for match in re.finditer(self.pattern, self.query_str):
            token = match.group().strip()
            start, end = match.span()
            # assign token_type as shown above
            self.tokens.append(
                Token(value=token, type=token_type, position=(start, end))
            )
        self.combine_subsequent_terms()

    def parse_query_tree(self, tokens: list) -> Query:
        # Build the query tree here
        ...

    def parse(self) -> Query:
        self.tokenize()
        self.linter.validate_tokens()
        self.linter.check_status()
        query = self.parse_query_tree(self.tokens)
        self.linter.validate_query_tree(query)
        self.linter.check_status()
        return query
```
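
Hypothetical usage, assuming `CustomLinter` and the token-type assignment are implemented:

```python
parser = CustomParser('(eHealth OR mHealth) AND telemedicine')
query = parser.parse()
```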
## List format support

Implement a `QueryListParser` to handle numbered sub-queries and references like `#1 AND #2`.
**Note:** To parse a list format, replace the numbered sub-query references to create a single search string, which can then be parsed with the standard string parser. This avoids a redundant implementation.
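
A minimal sketch of this replacement step; the variable names and the sub-query data are illustrative:

```python
import re

# Hypothetical sub-queries collected from a numbered list.
sub_queries = {
    "1": "eHealth OR mHealth",
    "2": "telemedicine",
}

def expand_references(list_entry: str) -> str:
    # Replace each "#n" reference with the parenthesized sub-query,
    # then hand the resulting string to the standard string parser.
    return re.sub(
        r"#(\d+)",
        lambda m: f"({sub_queries[m.group(1)]})",
        list_entry,
    )

print(expand_references("#1 AND #2"))
# -> (eHealth OR mHealth) AND (telemedicine)
```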