In our software, we have an API that accepts SQL-like strings as input. For example:
CLUSTER_NAME = "my_cluster" AND CLUSTER_STATUS IN ("ready", "installing")
As this input is received as a 'free string', we need a robust method to validate and parse it, transforming it into a parameterized query and extracting the values. Our goal is to convert the input into:
CLUSTER_NAME = ? AND CLUSTER_STATUS IN (?, ?)
And extract the values into an array:
values = []string{"my_cluster", "ready", "installing"}
We had the requirement of supporting JSONB, which is not supported by the Tree Search Language (TSL) library.
We already had a SQL parser in the ClusterService project, so we decided to refactor it for general usage and move it into ocm-common. The solution is built around three pieces: a Scanner, which breaks the input string into tokens; a Grammar, which defines the list of valid tokens and the list of valid transitions from each token; and a Parser, which validates the token sequence against the grammar.
Let's now deep dive into how the parser works.
The first step in parsing is to break down the input string into its constituent tokens. This process is crucial for understanding the structure of the input.
To perform efficient tokenization, ocm-common provides a generic Scanner interface:
type Scanner interface {
	Next() bool           // advance to the next token; returns false when the input is exhausted
	Peek() (bool, *Token) // look at the upcoming token without consuming it
	Token() *Token        // the token the scanner is currently positioned on
	Init(s string)        // reset the scanner with a new input string
}
This interface serves as the foundation for our tokenization process, offering flexibility and extensibility for various parsing needs.
As a reference implementation, ocm-common includes a SimpleScanner. This scanner represents the most basic approach, where each character in the input string is treated as a separate token.
Example Usage
Here's a demonstration of how to use the SimpleScanner:
func main() {
	// Initialize the SimpleScanner
	scanner := string_scanner.NewSimpleScanner()
	scanner.Init(`CLUSTER_NAME = "my_cluster" AND CLUSTER_STATUS IN ("ready", "installing")`)

	// Iterate through tokens. tokenTypeToString is a small helper (not shown)
	// that maps a token's type to a readable name, as seen in the output below.
	for scanner.Next() {
		fmt.Printf("%-10s\t%s\n", tokenTypeToString(scanner.Token()), scanner.Token().Value)
	}
}
Output
The above code produces the following output (truncated for brevity):
ALPHA C
ALPHA L
ALPHA U
ALPHA S
ALPHA T
ALPHA E
ALPHA R
SYMBOL _
ALPHA N
ALPHA A
ALPHA M
ALPHA E
SYMBOL
SYMBOL =
... (output truncated) ...
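Before moving on, note that the examples above only exercise Next and Token. The Scanner interface also offers Peek, which returns the upcoming token without consuming it; that is handy when the next action depends on what follows. A minimal sketch (the values in the comments assume the character-per-token behavior of the SimpleScanner shown above):

scanner := string_scanner.NewSimpleScanner()
scanner.Init("a=b")

// Look ahead without consuming the token.
if ok, next := scanner.Peek(); ok {
	fmt.Println(next.Value) // "a" (still unconsumed)
}

// Now actually consume it.
scanner.Next()
fmt.Println(scanner.Token().Value) // "a"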
To address our specific parsing needs, we introduce a more sophisticated tokenizer: the SQLScanner. This scanner implements the required interface and provides more meaningful tokenization for SQL-like strings.
Example Usage
Let's modify our previous example to use the new SQLScanner:
func main() {
	// Initialize the SQLScanner
	scanner := sql_parser.NewSQLScanner()
	scanner.Init(`CLUSTER_NAME = "my_cluster" AND CLUSTER_STATUS IN ("ready", "installing")`)

	// Iterate through tokens
	for scanner.Next() {
		fmt.Printf("%-10s\t%s\n", tokenTypeToString(scanner.Token()), scanner.Token().Value)
	}
}
Output
The SQLScanner produces a more semantically meaningful output:
LITERAL CLUSTER_NAME
OP =
LITERAL "my_cluster"
LITERAL AND
LITERAL CLUSTER_STATUS
LITERAL IN
BRACE (
LITERAL "ready"
LITERAL ,
LITERAL "installing"
BRACE )
The Grammar helps us give meaning to each token and configure what can happen after each token is encountered. This is called a 'Transition'.
In our SQL-like parsing system, token definitions play a crucial role in identifying and categorizing different elements of the input string.
A token is typically defined as follows:
{
	Name:      openBrace,            // token identifier, referenced by transition rules
	StateData: braceTokenFamily,     // groups related tokens into a family
	Acceptor:  StringAcceptor(`(`),  // matches the literal "(" character
}
In our SQL-like parsing system, defining valid transitions between tokens is crucial for ensuring the syntactic correctness of the input.
A typical transition definition looks like this:
{
	// After an open brace, the next token must be a column name or
	// another open brace (for nested expressions).
	TokenName:        openBrace,
	ValidTransitions: []string{column, openBrace},
}
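Putting tokens and transitions together, a grammar is essentially just those two lists. The fragment below sketches the shape of such a grammar for an IN (...) value list. The type names and helpers are stand-ins defined inline so the sketch is self-contained; they are illustrative assumptions, not the actual ocm-common declarations:

// Stand-in types; the real declarations live in ocm-common and may differ.
type Acceptor func(value string) bool

// StringAcceptor matches a token whose value is exactly `want`.
func StringAcceptor(want string) Acceptor {
	return func(value string) bool { return value == want }
}

type TokenDefinition struct {
	Name      string
	StateData string
	Acceptor  Acceptor
}

type TokenTransition struct {
	TokenName        string
	ValidTransitions []string
}

type Grammar struct {
	Tokens      []TokenDefinition
	Transitions []TokenTransition
}

// A grammar fragment for an `IN ( value, value, ... )` list.
var inListGrammar = Grammar{
	Tokens: []TokenDefinition{
		{Name: "OPEN_BRACE", StateData: "BRACE", Acceptor: StringAcceptor("(")},
		{Name: "CLOSED_BRACE", StateData: "BRACE", Acceptor: StringAcceptor(")")},
		{Name: "COMMA", StateData: "SEPARATOR", Acceptor: StringAcceptor(",")},
	},
	Transitions: []TokenTransition{
		// After "(": either a value or a nested "(".
		{TokenName: "OPEN_BRACE", ValidTransitions: []string{"VALUE", "OPEN_BRACE"}},
		// After a value: either "," (more values) or ")" (end of list).
		{TokenName: "VALUE", ValidTransitions: []string{"COMMA", "CLOSED_BRACE"}},
	},
}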
Now that we have a scanner and a grammar defined, we can proceed to parse and validate SQL-like strings.
First, we create a parser using the StringParserBuilder:
parser := string_parser.NewStringParserBuilder().
	WithGrammar(sql_parser.BasicSQLGrammar()).
	WithScanner(sql_parser.NewSQLScanner()).
	Build()
Let's parse and validate a set of SQL-like strings:
func main() {
	// Initialize the parser
	parser := string_parser.NewStringParserBuilder().
		WithGrammar(sql_parser.BasicSQLGrammar()).
		WithScanner(sql_parser.NewSQLScanner()).
		Build()

	// Define test strings
	strings := []string{
		`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing')`,
		`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing)`,
		`CLUSTER_NAME = AND CLUSTER_STATUS IN ('ready', 'installing')`,
		`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing'))`,
	}

	// Parse each string and print results
	for i, s := range strings {
		err := parser.Parse(s)
		fmt.Printf("%d) %s\n", i+1, s)
		if err != nil {
			fmt.Printf("   %v\n", err)
		} else {
			fmt.Printf("   CORRECT\n")
		}
	}
}
A grammar alone, however, cannot detect unbalanced parentheses: each transition only looks at the previous token and never counts how many braces are currently open, so a string like test string 4 above slips through. To address this limitation, we've introduced a more sophisticated parsing mechanism using a transition interceptor.
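Conceptually, the interceptor keeps a running count of brace depth as tokens stream by, failing as soon as a closing brace has no matching opener, or when input ends with braces still open. Here is a minimal, self-contained sketch of that idea (not the actual ocm-common implementation):

// checkBalanced illustrates the logic behind the transition interceptor:
// it tracks parenthesis depth over a stream of tokens.
func checkBalanced(tokens []string) error {
	depth := 0
	for _, t := range tokens {
		switch t {
		case "(":
			depth++
		case ")":
			depth--
			if depth < 0 {
				return fmt.Errorf("unexpected ')': no matching '('")
			}
		}
	}
	if depth != 0 {
		return fmt.Errorf("unbalanced parentheses: %d unclosed '('", depth)
	}
	return nil
}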
Here's how to use the new SQLParser:
func main() {
	// Initialize the SQLParser
	parser := sql_parser.NewSQLParser()

	// Define test strings
	testStrings := []string{
		`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing')`,
		`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing)`,
		`CLUSTER_NAME = AND CLUSTER_STATUS IN ('ready', 'installing')`,
		`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing'))`,
	}

	// Parse each string and print results
	for i, s := range testStrings {
		_, _, err := parser.Parse(s)
		fmt.Printf("%d) %s\n", i+1, s)
		if err != nil {
			fmt.Printf("   %v\n", err)
		} else {
			fmt.Printf("   CORRECT\n")
		}
	}
}
Looking at the example code, you will notice that the SQLParser returns three values instead of one; the returned values are:
string. The original query with each literal value replaced by a question mark (?). It represents a parameterized version of the original query.
[]interface{}. An array containing all the literal values extracted from the original query. These parameters correspond to the question marks in the parsed string.
error. An error object, which is nil if parsing was successful, or contains error information if parsing failed.
The Parser uses a StateMachine under the hood to enforce the Grammar. It translates each Token into a State and each Transition into a StateTransition. The Parser then uses the StateMachine to validate the sequence of tokens produced by the Scanner.
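For the first (valid) test string, the three return values line up with the example from the beginning of the post (shown here as comments for illustration):

query, values, err := parser.Parse(`CLUSTER_NAME = 'my_cluster' AND CLUSTER_STATUS IN ('ready', 'installing')`)
// query:  CLUSTER_NAME = ? AND CLUSTER_STATUS IN (?, ?)
// values: []interface{}{"my_cluster", "ready", "installing"}
// err:    nil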
By incorporating advanced features like parentheses balancing and utilizing a state machine approach, our SQL-like string parser offers a robust and comprehensive validation process. This implementation significantly improves the parser's ability to catch logical errors in query structures, leading to more reliable and secure query processing in applications that utilize SQL-like input strings.