正则语言没那么难
2025年3月20日程序员们肯定知道正则表达式,正则语言则属于theoratical computer science,但它没那么难。简单来说,能被正则表达式表示的,就是正则语言。
正面例子就不说了,大家肯定都有概念。我举个反例,什么不是正则语言呢?我用伪正则表达式\eqref{reg-n}来描述,a要重复n次,然后b也要重复n次。这种写法所有正则表达式引擎都不支持,因为正则表达式本身的数学定义就不支持backreference of a qualifier。从另一方面来说,正则表达式必须能被转换为有限自动机,必须有有限个节点。a{n}
跟a*
不同,前者需要记录匹配了多少次,以便被引用,而我们无法知道a最多被匹配多少次。
\begin{equation}
a\{n\}b\{n\}
\label{reg-n}
\end{equation}
如果我们知道\eqref{reg-n}里n的最大值呢?假设被匹配的字符串长度为20,那我们可以把这个表达式转换为\eqref{reg-10},这个正则表达式便代表了另一个正则语言。但这个正则语言的字符串长度是有限的。
\begin{equation}
a\{10\}b\{10\} \mid a\{9\}b\{9\} \mid a\{8\}b\{8\} \mid \cdots
\label{reg-10}
\end{equation}
在实际应用中,程序不知道输入的长度是多少,所以不能完全用\eqref{reg-10}来处理。
基本概念
An alphabet, by definition, is a finite, nonempty set of symbols.
A string (or word) is a finite sequence of symbols chosen from some alphabet.
If \(\Sigma\) is an alphabet, \(\Sigma^k\) is a language where all strings have length 5. \(\Sigma^\ast\) is a language with no limit to the length of its strings. Borrow the notation in regular expression [a-z]*
is a superset of the English language. It is a superset because some words, such as “aaa”, are not legal English words.
总的来说,正则语言只有三种运算:连接 (Concatenation)、并 (Union)、Kleene 星 (Kleene Star) or power。[1]
In automata theory, if \(\Sigma\) is an alphabet, and L is a language over \(\Sigma\), then the decision problem of L is:
Given a string w in \(\Sigma\), decide whether or not w is in L.
In other words, a decision problem is a function that takes a word, and returns a boolean value. Accordingly, a decision problem is not about how to do things, or what something will be. Other than decision problems, there are function problems, counting problems, etc.
Conversion between decision problems and function problems
Function f(x)=x+1 is a function problem because its output y is not true or false. Nevertheless, validation of x and y is a decision problem. For example, I say f(2)=4. The decision problem is whether 2+1 is 4.
总结
正则语言关心的problem是yes/no problem。虽然function problems and decision problems有时可以互相转化,但正则语言本身是围绕decision problems的研究。Regualr language does not “transform” strings.
References
- 圆方. 正则语言 (Regular Language). . 2023-02-06 [2025-03-20].↑