Advanced String Parsing

Table of Contents

  1. The Challenge
  2. Our Solution
  3. Tokenization
  4. The Grammar
  5. Parse and Validate the String
  6. Final Code
  7. How It Works
  8. Conclusion

The Challenge

In our software, we have an API that accepts SQL-like strings as input. For example:

CLUSTER_NAME = "my_cluster" AND CLUSTER_STATUS IN ("ready", "installing")

Because this input arrives as a free-form string, we need a robust method to validate and parse it, transforming it into a parameterized query and extracting the values. Our goal is to convert the input into:

CLUSTER_NAME = ? AND CLUSTER_STATUS IN (?, ?)

And extract the values into an array:

values = []string{"my_cluster", "ready", "installing"}

Why implement something new instead of using Tree Search Language?

We needed to support JSONB, which Tree Search Language does not support.

Our Solution

We already had a SQL parser in the ClusterService project, so we decided to refactor it for general use and move it into ocm-common. The solution works as follows:

  1. The parser takes the input string and tokenizes it
  2. Each token is verified against a Grammar, which defines the list of valid tokens and the list of valid transitions from each token.

Let's now take a deep dive into how the parser works.

Tokenization

The first step in parsing is to break down the input string into its constituent tokens. This process is crucial for understanding the structure of the input.

The Scanner

To perform efficient tokenization, ocm-common provides a generic Scanner interface:

type Scanner interface {
    // Next advances the scanner to the next token, returning false when
    // the input is exhausted
    Next() bool
    // Peek reports whether another token exists and returns it without
    // consuming it
    Peek() (bool, *Token)
    // Token returns the current token
    Token() *Token
    // Init resets the scanner with a new input string
    Init(s string)
}

This interface serves as the foundation for our tokenization process, offering flexibility and extensibility for various parsing needs.
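
Because the interface is small, plugging in a custom scanner is straightforward. Here is a minimal sketch of a toy implementation that emits each whitespace-separated word as a token; it assumes the Token type exposes a Value field, which may differ from the actual ocm-common definition:

// wordScanner is a toy Scanner that treats every whitespace-separated word
// as a single token. Illustrative only.
type wordScanner struct {
    words []string // the words extracted from the input
    pos   int      // index of the current token
}

// Init resets the scanner with a new input string.
func (w *wordScanner) Init(s string) {
    w.words = strings.Fields(s) // requires the standard "strings" package
    w.pos = -1
}

// Next advances to the next word, returning false at end of input.
func (w *wordScanner) Next() bool {
    if w.pos+1 >= len(w.words) {
        return false
    }
    w.pos++
    return true
}

// Token returns the current word wrapped in a Token.
func (w *wordScanner) Token() *Token {
    return &Token{Value: w.words[w.pos]}
}

// Peek reports whether another token exists and returns it without consuming it.
func (w *wordScanner) Peek() (bool, *Token) {
    if w.pos+1 >= len(w.words) {
        return false, nil
    }
    return true, &Token{Value: w.words[w.pos+1]}
}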

SimpleScanner: A Basic Implementation

As a reference implementation, ocm-common includes a SimpleScanner. This scanner represents the most basic approach, where each character in the input string is treated as a separate token.

Example Usage

Here's a demonstration of how to use the SimpleScanner:

func main() {
    // Initialize the SimpleScanner
    scanner := string_scanner.NewSimpleScanner()
    scanner.Init(`CLUSTER_NAME = "my_cluster" AND CLUSTER_STATUS IN ("ready", "installing")`)

    // Iterate through tokens
    for scanner.Next() {
        fmt.Printf("%-10s\t%s\n", tokenTypeToString(scanner.Token()), scanner.Token().Value)
    }
}

Output

The above code produces the following output (truncated for brevity):

ALPHA       C
ALPHA       L
ALPHA       U
ALPHA       S
ALPHA       T
ALPHA       E
ALPHA       R
SYMBOL      _
ALPHA       N
ALPHA       A
ALPHA       M
ALPHA       E
SYMBOL       
SYMBOL      =
... (output truncated) ...
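
A quick note on the examples: the tokenTypeToString helper is not shown in the snippets. A minimal sketch of what it could look like follows, with the token-type constants and the Token type's numeric Type field assumed purely for illustration (the real definitions live in the ocm-common scanner packages):

// Hypothetical token-type constants, assumed only for this sketch.
const (
    ALPHA = iota
    SYMBOL
)

// tokenTypeToString maps a token's type to the label printed in the output.
func tokenTypeToString(t *Token) string {
    switch t.Type {
    case ALPHA:
        return "ALPHA"
    case SYMBOL:
        return "SYMBOL"
    default:
        return "UNKNOWN"
    }
}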

Advanced Tokenization with SQLScanner

To address our specific parsing needs, we introduce a more sophisticated tokenizer: the SQLScanner. This scanner implements the same Scanner interface but produces more meaningful tokens for SQL-like strings.

Example Usage

Let's modify our previous example to use the new SQLScanner:

func main() {
    // Initialize the SQLScanner
    scanner := sql_parser.NewSQLScanner()
    scanner.Init(`CLUSTER_NAME = "my_cluster" AND CLUSTER_STATUS IN ("ready", "installing")`)

    // Iterate through tokens
    for scanner.Next() {
        fmt.Printf("%-10s\t%s\n", tokenTypeToString(scanner.Token()), scanner.Token().Value)
    }
}

Output

The SQLScanner produces a more semantically meaningful output:

LITERAL    CLUSTER_NAME
OP         =
LITERAL    "my_cluster"
LITERAL    AND
LITERAL    CLUSTER_STATUS
LITERAL    IN
BRACE      (
LITERAL    "ready"
LITERAL    ,
LITERAL    "installing"
BRACE      )

The Grammar

The Grammar gives meaning to each token and defines what is allowed to follow it; each permitted move from one token to the next is called a 'Transition'.

Defining the tokens

In our SQL-like parsing system, token definitions play a crucial role in identifying and categorizing different elements of the input string.

Anatomy of a Token Definition

A token is typically defined as follows:

{
    Name:      openBrace,           // unique name identifying the token
    StateData: braceTokenFamily,    // extra state data; here it marks the token as part of the brace family
    Acceptor:  StringAcceptor(`(`), // accepts exactly the "(" character
}
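
The Acceptor is the predicate that decides whether a scanned value matches this token. As a conceptual sketch only (the actual StringAcceptor in ocm-common may have a different signature), it can be pictured as a predicate factory:

// Conceptual sketch: builds a predicate that accepts exactly one string.
func StringAcceptor(expected string) func(value string) bool {
    return func(value string) bool {
        return value == expected
    }
}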

Defining the valid transitions

In our SQL-like parsing system, defining valid transitions between tokens is crucial for ensuring the syntactic correctness of the input.

Anatomy of a Transition Definition

A typical transition definition looks like this:

{
    TokenName:        openBrace,                   // the token this rule applies to
    ValidTransitions: []string{column, openBrace}, // "(" may be followed by a column name or another "("
}
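
To see how token and transition definitions compose, here is a hedged sketch of a tiny grammar for expressions of the form NAME = 'value'. The Grammar, TokenDefinition, and TransitionDefinition shapes, the RegexpAcceptor helper, and the StartState/EndState markers are all assumptions for illustration rather than the actual ocm-common definitions:

// Illustrative only: a minimal grammar accepting NAME = 'value'.
grammar := Grammar{
    Tokens: []TokenDefinition{
        {Name: "COLUMN", Acceptor: RegexpAcceptor(`[A-Za-z_]+`)}, // hypothetical regex acceptor
        {Name: "EQ", Acceptor: StringAcceptor(`=`)},
        {Name: "VALUE", Acceptor: RegexpAcceptor(`'[^']*'`)}, // hypothetical regex acceptor
    },
    Transitions: []TransitionDefinition{
        {TokenName: StartState, ValidTransitions: []string{"COLUMN"}}, // must start with a column
        {TokenName: "COLUMN", ValidTransitions: []string{"EQ"}},
        {TokenName: "EQ", ValidTransitions: []string{"VALUE"}},
        {TokenName: "VALUE", ValidTransitions: []string{EndState}}, // may end after a value
    },
}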

Parse and Validate the String

Now that we have a scanner and a grammar defined, we can proceed to parse and validate SQL-like strings.

Setting Up the Parser

First, we create a parser using the StringParserBuilder:

parser := string_parser.NewStringParserBuilder().
    WithGrammar(sql_parser.BasicSQLGrammar()).
    WithScanner(sql_parser.NewSQLScanner()).
    Build()

Example Usage

Let's parse and validate a set of SQL-like strings:

func main() {
    // Initialize the parser
    parser := string_parser.NewStringParserBuilder().
        WithGrammar(sql_parser.BasicSQLGrammar()).
        WithScanner(sql_parser.NewSQLScanner()).
        Build()

    // Define test strings
    testStrings := []string{
        `CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing')`,
        `CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing)`,
        `CLUSTER_NAME = AND CLUSTER_STATUS IN ('ready', 'installing')`,
        `CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing'))`,
    }

    // Parse each string and print results
    for i, s := range testStrings {
        err := parser.Parse(s)
        fmt.Printf("%d) %s\n", i+1, s)
        if err != nil {
            fmt.Printf("   %v\n", err)
        } else {
            fmt.Printf("   CORRECT\n")
        }
    }
}

Final Code

The grammar alone cannot detect unbalanced parentheses: transitions only constrain which token may immediately follow which, so a query with an extra closing brace (like the fourth test string above) would slip through. To address this limitation, we've introduced a more sophisticated parsing mechanism that uses a transition interceptor.
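
The interceptor idea can be pictured with this simplified sketch (illustrative only, not the actual ocm-common interceptor API): while the grammar validates each transition, a counter tracks how deeply nested the braces are and rejects any imbalance.

// Illustrative sketch of parentheses balancing: the real implementation hooks
// into the parser's transitions instead of post-processing a token slice.
func checkBalance(tokens []string) error {
    depth := 0
    for _, tok := range tokens {
        switch tok {
        case "(":
            depth++ // entering a nested group
        case ")":
            depth--
            if depth < 0 {
                return fmt.Errorf("unexpected ')'") // closed more than was opened
            }
        }
    }
    if depth != 0 {
        return fmt.Errorf("%d unclosed '('", depth) // opened more than was closed
    }
    return nil
}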

Example Usage of SQLParser

Here's how to use the new SQLParser:

func main() {
    // Initialize the SQLParser
    parser := sql_parser.NewSQLParser()

    // Define test strings
    testStrings := []string{
        `CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing')`,
        `CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing)`,
        `CLUSTER_NAME = AND CLUSTER_STATUS IN ('ready', 'installing')`,
        `CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing'))`,
    }

    // Parse each string and print results
    for i, s := range testStrings {
        _, _, err := parser.Parse(s)
        fmt.Printf("%d) %s\n", i+1, s)
        if err != nil {
            fmt.Printf("   %v\n", err)
        } else {
            fmt.Printf("   CORRECT\n")
        }
    }
}

Looking at the example code, you will notice that the SQLParser returns three values instead of one (see the usage sketch after this list); the returned values are:

  • the parsed string. This is the input query with all literal values replaced by question marks (?). It represents a parameterized version of the original query.
  • an interface{}. A slice containing all the literal values extracted from the original query. These values correspond, in order, to the question marks in the parsed string.
  • an error. An error object, which is nil if parsing was successful, or contains error information if parsing failed.
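
As a hedged sketch of how those values might be consumed (the commented outputs reflect the goal stated in The Challenge and are illustrative):

parser := sql_parser.NewSQLParser()
query, values, err := parser.Parse(`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing')`)
if err != nil {
    panic(err) // parsing failed
}
fmt.Println(query)  // CLUSTER_NAME = ? AND CLUSTER_STATUS IN (?, ?)
fmt.Println(values) // the extracted literals: my_cluster, ready, installing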

How It Works

The Parser uses a StateMachine under the hood to enforce the Grammar. It translates each Token into a State and each Transition into a StateTransition. The Parser then uses the StateMachine to validate the sequence of tokens produced by the Scanner.
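
As a rough sketch of the idea (the names below are illustrative, not the actual ocm-common types), the heart of such a state machine can be very small:

// stateMachine sketches the core mechanism: each token name is a state, and
// the grammar's transitions define which states may legally follow it.
type stateMachine struct {
    current     string
    transitions map[string][]string // state name -> valid next states
}

// accept moves to the next state if the grammar allows the transition,
// otherwise it reports a validation error.
func (m *stateMachine) accept(token string) error {
    for _, next := range m.transitions[m.current] {
        if next == token {
            m.current = token
            return nil
        }
    }
    return fmt.Errorf("unexpected token %q after %q", token, m.current)
}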

Conclusion

By incorporating advanced features like parentheses balancing and utilizing a state machine approach, our SQL-like string parser offers a robust and comprehensive validation process. This implementation significantly improves the parser's ability to catch logical errors in query structures, leading to more reliable and secure query processing in applications that utilize SQL-like input strings.
