Objective
|
Lesson
A regular expression is a string. Python's re (regular expression) methods scan a supplied string to determine whether it contains text matching the regular expression. If such text is found, the method may report it, split the string at it, or replace it. A regular expression may be as simple as a few characters to be interpreted literally, or it may contain special characters such as '*' (any number of the preceding character), e.g.:
>>> import re
>>>
>>> re.search(r'abc*', '123abDEF')
<_sre.SRE_Match object; span=(3, 5), match='ab'>
>>>
>>> re.search(r'abc*', '123abcDEF')
<_sre.SRE_Match object; span=(3, 6), match='abc'>
>>>
>>> re.search(r'abc*', '123abcccccccDEF')
<_sre.SRE_Match object; span=(3, 12), match='abccccccc'>
>>>
>>> re.search(r'abc*', '123acccccccDEF')
>>>
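The three actions named above (report, split, replace) correspond to functions re.search, re.split and re.sub. A minimal sketch using the same 'abc*' pattern; the sample string is an assumption chosen for illustration:

```python
import re

text = '123abcccDEF'   # hypothetical sample string

# Report the first text that matches.
found = re.search(r'abc*', text).group()

# Split the string at the text that matches.
parts = re.split(r'abc*', text)

# Replace the text that matches.
replaced = re.sub(r'abc*', '-', text)

print(found, parts, replaced)
```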
|
Matching literal characters
A regular expression may be as simple as one character.
Search for 'e' in 'jumped':
>>> import re
>>> re.search('e', 'jumped')
<_sre.SRE_Match object; span=(4, 5), match='e'>
>>>
>>> 'jumped'[4:5] == 'e'
True
>>>
Search for each 'e' in a longer string, one occurrence at a time:
>>> s1 = 'jumped over everything'
>>> re.search('e', s1)
<_sre.SRE_Match object; span=(4, 5), match='e'> # 1st occurrence
>>> s1[4:5] == 'e'
True
>>>
>>> s2 = s1[5:] ; s2
'd over everything'
>>> re.search('e', s2)
<_sre.SRE_Match object; span=(4, 5), match='e'> # 2nd occurrence
>>> s2[4:5] == 'e'
True
>>>
>>> s3 = s2[5:] ; s3
'r everything'
>>> re.search('e', s3)
<_sre.SRE_Match object; span=(2, 3), match='e'> # 3rd occurrence
>>> s3[2:3] == 'e'
True
>>>
>>> s4 = s3[3:] ; s4
'verything'
>>> re.search('e', s4)
<_sre.SRE_Match object; span=(1, 2), match='e'> # 4th occurrence
>>> s4[1:2] == 'e'
True
>>>
>>> s5 = s4[2:] ; s5
'rything'
>>> re.search('e', s5)
>>>
Method re.findall(....) produces a list of all matches found:
>>> L1 = re.findall('e', s1) ; L1
['e', 'e', 'e', 'e']
>>>
Iterating over matches found:
>>> print ('\n'.join([ str(p) for p in re.finditer('e', s1 ) ]))
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(9, 10), match='e'>
<_sre.SRE_Match object; span=(12, 13), match='e'>
<_sre.SRE_Match object; span=(14, 15), match='e'>
>>>
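Each match object printed above can also be queried directly. The following sketch, using the same string, collects the spans and uses them to slice the original string; the names spans and slices are introduced here for illustration only:

```python
import re

s1 = 'jumped over everything'

# Each match object reports where the match was found.
spans = [m.span() for m in re.finditer('e', s1)]

# A span can be used to slice the original string.
slices = [s1[a:b] for (a, b) in spans]

print(spans)
print(slices)
```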
Modifying the search:
>>> print ('\n'.join([ str(p) for p in re.finditer('R', s1, re.IGNORECASE ) ]))
<_sre.SRE_Match object; span=(10, 11), match='r'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
>>>
Flag re.VERBOSE (short form re.X) allows white space and comments within the pattern:
>>> print ('\n'.join([ str(p) for p in re.finditer('R # looking for r or R', s1, re.IGNORECASE|re.VERBOSE ) ]))
<_sre.SRE_Match object; span=(10, 11), match='r'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
>>>
>>> print ('\n'.join([ str(p) for p in re.finditer('v # looking for v or V', s1.upper(), re.I|re.X ) ]))
<_sre.SRE_Match object; span=(8, 9), match='V'>
<_sre.SRE_Match object; span=(13, 14), match='V'>
>>>
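Because re.VERBOSE ignores white space in the pattern, a space that is meant literally has to be escaped or placed inside a set. A short sketch of that consequence, using the same string as above:

```python
import re

s1 = 'jumped over everything'

# Under re.VERBOSE the space in 'd o' is ignored, so the pattern means 'do'.
no_match = re.search('d o', s1, re.VERBOSE)

# An escaped space, or a space inside a set, is kept.
escaped = re.search(r'd\ o', s1, re.VERBOSE)
in_set = re.search('d[ ]o', s1, re.VERBOSE)

print(no_match, escaped.span(), in_set.span())
```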
Matching groups of characters
Regular expressions can become complicated and unintelligible quickly. It may help to name the more common expressions. By naming expressions you can specify exactly what you want. To match 'ee':
>>> pattern = 'e' * 2 ; pattern
'ee'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.') ]))
<_sre.SRE_Match object; span=(1, 3), match='ee'>
<_sre.SRE_Match object; span=(12, 14), match='ee'>
>>>
The special characters '{' and '}' specify how many of the preceding RE are to be matched:
>>> any = r'{0,}' # Match any number of the preceding RE.
>>> one_or_more = r'{1,}' # Match one or more of the preceding RE.
>>> zero_or_one = r'{0,1}' # Match zero or one of the preceding RE.
>>>
>>> 'e' + any
'e{0,}' # Match any number of 'e'.
>>> 'e' + one_or_more
'e{1,}' # Match one or more of 'e'.
>>> 'e' + zero_or_one
'e{0,1}' # Match zero or one of 'e'.
>>>
To match one or more of 'e':
>>> pattern = 'e' + one_or_more ; pattern
'e{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.' ) ]))
<_sre.SRE_Match object; span=(1, 3), match='ee'>
<_sre.SRE_Match object; span=(8, 9), match='e'>
<_sre.SRE_Match object; span=(12, 14), match='ee'>
>>>
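The brace forms used above have shorter equivalents: '*' for '{0,}', '+' for '{1,}' and '?' for '{0,1}'. The sketch below checks that each pair finds the same matches in the sample string; the names pairs and results are hypothetical:

```python
import re

text = 'Beets are sweet.'

# Each pair is (long form, short form) of the same quantifier.
pairs = [('e{0,}', 'e*'), ('e{1,}', 'e+'), ('e{0,1}', 'e?')]

# True where both forms produce identical lists of matches.
results = [re.findall(a, text) == re.findall(b, text) for a, b in pairs]

one_or_more_e = re.findall('e+', text)
print(results, one_or_more_e)
```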
|
Matching members of a set
A string within brackets '[' and ']' is a set; it matches any one of the characters listed in it.
Alpha-numeric
>>> pattern = 'abcdefghijklmnopqrstuvwxyz';len(pattern)
26
>>>
>>> lower = r'[' + pattern + r']' ; lower
'[abcdefghijklmnopqrstuvwxyz]'
>>> upper = r'[' + pattern.upper() + r']' ; upper
'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> alpha = r'[' + pattern + pattern.upper() + r']' ; alpha
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>>
>>> numeric = r'[0123456789]' ; numeric
'[0123456789]'
>>>
>>> alpha_numeric = alpha[:-1] + numeric[1:] ; alpha_numeric
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>> word = r'[_' + alpha_numeric[1:] ; word
'[_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>>
Find all groups of alpha characters:
>>> pattern = alpha + one_or_more ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(6, 9), match='are'>
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>
Find all groups of numeric characters:
>>> pattern = numeric + one_or_more ; pattern
'[0123456789]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 1), match='1'>
<_sre.SRE_Match object; span=(2, 3), match='2'>
<_sre.SRE_Match object; span=(4, 5), match='3'>
>>>
Find all words in the string that contain the letters 'ee':
>>> pattern = alpha + any + 'ee' + alpha + any ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}ee[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Beets'>
<_sre.SRE_Match object; span=(10, 15), match='sweet'>
>>>
Find all words in the string that contain at least 5 letters:
>>> pattern = alpha*5 + alpha + any ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>
It's OK to be lazy. The important thing is to define the pattern accurately and then let the re method make sense of it. However, with a little practice you will probably write the above search as:
>>> pattern = alpha + r'{5,}' ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{5,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>
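Sets that list every letter can also be written with ranges such as 'a-z'. The sketch below compares the spelled-out set with the range form on the same sample string; string.ascii_lowercase is used only to avoid typing the alphabet:

```python
import re
import string

text = '1,2,3 are numeric.'

# A range such as a-z stands for every character from 'a' through 'z'.
lower_long = '[' + string.ascii_lowercase + ']'
lower_short = '[a-z]'

long_result = re.findall(lower_long + '{5,}', text)
short_result = re.findall(lower_short + '{5,}', text)

# Ranges can be combined within one set.
words = re.findall('[a-zA-Z]+', text)
print(long_result, short_result, words)
```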
Non alpha-numeric
The caret '^' in first place within a set means 'any character that is not in the set':
>>> non_lower = r'[^' + lower[1:] ; non_lower
'[^abcdefghijklmnopqrstuvwxyz]'
>>> non_upper = r'[^' + upper[1:] ; non_upper
'[^ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>>
>>> non_alpha = r'[^' + alpha[1:] ; non_alpha
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>>
>>> non_numeric = r'[^' + numeric[1:] ; non_numeric
'[^0123456789]'
>>>
>>> non_alpha_numeric = r'[^' + alpha_numeric[1:] ; non_alpha_numeric
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>>
>>> non_word = r'[^' + word[1:] ; non_word
'[^_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>>
Find all groups that contain non numeric characters:
>>> pattern = non_numeric + one_or_more ; pattern
'[^0123456789]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(1, 2), match=','>
<_sre.SRE_Match object; span=(3, 4), match=','>
<_sre.SRE_Match object; span=(5, 18), match=' are numeric.'>
>>>
Find all groups containing non alpha characters:
>>> pattern = non_alpha + one_or_more ; pattern
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 6), match='1,2,3 '>
<_sre.SRE_Match object; span=(9, 10), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='.'>
>>>
White space
>>> white = '[ \t\n\r\f\v]' ; white
'[ \t\n\r\x0c\x0b]'
>>> pattern = white + one_or_more ; pattern
'[ \t\n\r\x0c\x0b]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(9, 10), match=' '>
>>>
Non white space
>>> non_white = r'[^' + white[1:] ; non_white
'[^ \t\n\r\x0c\x0b]'
>>>
Find all blocks of non white space:
>>> pattern = non_white + one_or_more ; pattern
'[^ \t\n\r\x0c\x0b]{1,}'
>>>
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='1,2,3'>
<_sre.SRE_Match object; span=(6, 9), match='are'>
<_sre.SRE_Match object; span=(10, 18), match='numeric.'>
>>>
Find all blocks of non white space that contain at least 4 characters:
>>> pattern = non_white*3 + non_white + one_or_more ; pattern
'[^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b]{1,}'
>>>
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='1,2,3'>
<_sre.SRE_Match object; span=(10, 18), match='numeric.'>
>>>
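The sets for white space and non white space have the shorthand forms '\s' and '\S'. With flag re.ASCII these stand for exactly the sets spelled out above; without it, '\s' also matches other Unicode white space. A minimal sketch comparing the two spellings on the same sample string:

```python
import re

text = '1,2,3 are numeric.'

white = '[ \t\n\r\f\v]'
non_white = '[^ \t\n\r\f\v]'

# With re.ASCII, \s and \S stand for exactly the sets above.
same_white = re.findall(white + '+', text) == re.findall(r'\s+', text, re.ASCII)
same_non_white = re.findall(non_white + '+', text) == re.findall(r'\S+', text, re.ASCII)

blocks = re.findall(r'\S+', text, re.ASCII)
print(same_white, same_non_white, blocks)
```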
|
Matching white space
In this section white space is any one of the characters new line, tab, vertical tab, form feed, or space. The regular expression that means 'any white character' is:
>>> new_line = '''
... '''
>>> white = '[' + new_line + '\t\v\f ]' ; white
'[\n\t\x0b\x0c ]'
>>>
Some special characters that tell the methods how to interpret the other characters in the regular expression are:
>>> any = r'*' # any number of
>>> one_or_more = r'+' # one or more of
>>> zero_or_one = r'?' # zero or one of
>>>
>>> white + any # any number of white characters
'[\n\t\x0b\x0c ]*'
>>> white + one_or_more # one or more white characters
'[\n\t\x0b\x0c ]+'
>>> white + zero_or_one # zero or one white characters
'[\n\t\x0b\x0c ]?'
>>>
Searching for white space:
>>> s1 = '\v\n \t abcd EFG \v\t \n\n 234 \f\f\n' # 4 blocks of white space.
>>>
>>> re.search(white + one_or_more, s1)
<_sre.SRE_Match object; span=(0, 5), match='\x0b\n \t '> # 1st block
>>>
>>> re.search(white + one_or_more, s1[5:])
<_sre.SRE_Match object; span=(4, 14), match=' '> # 2nd block.
>>>
>>> re.search(white + one_or_more, s1[5:][14:])
<_sre.SRE_Match object; span=(3, 13), match=' \x0b\t \n\n '> # 3rd block.
>>>
>>> re.search(white + one_or_more, s1[5:][14:][13:])
<_sre.SRE_Match object; span=(3, 8), match=' \x0c\x0c\n'> # 4th block
>>>
>>> re.search(white + one_or_more, s1[5:][14:][13:][8:])
>>> # no more.
>>> 5+14+13+8 == len(s1)
True
>>> L1 = re.findall(white + one_or_more, s1) ; L1
['\x0b\n \t ', ' ', ' \x0b\t \n\n ', ' \x0c\x0c\n'] # 4 blocks of white space.
>>>
Iterating over matches found:
>>> for p in re.finditer(white + one_or_more, s1 ) :
...     print (p)
...
<_sre.SRE_Match object; span=(0, 5), match='\x0b\n \t '>
<_sre.SRE_Match object; span=(9, 19), match=' '>
<_sre.SRE_Match object; span=(22, 32), match=' \x0b\t \n\n '>
<_sre.SRE_Match object; span=(35, 40), match=' \x0c\x0c\n'>
>>>
Anchoring the pattern:
>>> beginning = r'^' # Anchor pattern at beginning of string.
>>> end = r'$' # Anchor pattern at end of string.
>>>
>>> beginning + white + one_or_more # 1 or more white characters at beginning of string.
'^[\n\t\x0b\x0c ]+'
>>>
>>> white + one_or_more + end # 1 or more white characters at end of string.
'[\n\t\x0b\x0c ]+$'
>>>
Searching for white space at extremities of string:
>>> L2 = re.findall(white + one_or_more + end, s1) ; L2
[' \x0c\x0c\n']
>>> L2[0] == L1[-1]
True
>>> L3 = re.findall(beginning + white + one_or_more, s1) ; L3
['\x0b\n \t ']
>>> L3[0] == L1[0]
True
>>>
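The two anchored patterns can be combined with '|' and passed to re.sub to delete white space at both extremities in one step. This is only a sketch; the sample string s is an assumption, and for this set of white characters the result agrees with str.strip():

```python
import re

white = '[\n\t\x0b\x0c ]'
s = '\x0b\n \t abcd EFG \x0c\n'   # hypothetical sample string

# Delete white space anchored at the beginning or at the end of the string.
stripped = re.sub('^' + white + '+|' + white + '+$', '', s)
print(repr(stripped))
```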
|
Splitting on white space
>>> s1 = ' \n \t \n line 1a\n line 1b\n\n\t \n line 2a\n line 2b \n \t\t\n'
>>> print (s1)
line 1a
line 1b
line 2a
line 2b
>>>
Remove white space from beginning of s1, but preserve white space at beginning of line 1a:
>>> pattern = beginning + white + any + new_line ; pattern
'^[\n\t\x0b\x0c ]*\n'
>>> re.split(pattern, s1)
['', ' line 1a\n line 1b\n\n\t \n line 2a\n line 2b \n \t\t\n']
>>> s2 = re.split(pattern, s1)[1] ; s2
' line 1a\n line 1b\n\n\t \n line 2a\n line 2b \n \t\t\n'
Remove white space from end of s2, but preserve white space at end of line 2b:
>>> pattern = new_line + white + any + end ; pattern
'\n[\n\t\x0b\x0c ]*$'
>>> re.split(pattern, s2)
[' line 1a\n line 1b\n\n\t \n line 2a\n line 2b ', '']
>>> s3 = re.split(pattern, s2)[0] ; s3
' line 1a\n line 1b\n\n\t \n line 2a\n line 2b '
Split s3 into paragraphs:
>>> pattern = new_line + white + any + new_line ; pattern
'\n[\n\t\x0b\x0c ]*\n'
>>> re.split(pattern, s3)
[' line 1a\n line 1b', ' line 2a\n line 2b ']
>>> paragraphs = re.split(pattern, s3) ; paragraphs
[' line 1a\n line 1b', ' line 2a\n line 2b ']
Produce s4, equivalent to s1 without extraneous white space:
>>> s4 = '\n\n'.join(paragraphs) + new_line ; s4
' line 1a\n line 1b\n\n line 2a\n line 2b \n'
>>> print (s4,end='')
line 1a
line 1b
line 2a
line 2b
>>>
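The three steps above (trim the beginning, trim the end, split on white lines) can be gathered into one function. The sketch below is one possible arrangement, not the page's own code; the function name paragraphs_of and the sample string are hypothetical:

```python
import re

def paragraphs_of(text):
    """Return the paragraphs of text, split on lines that are empty or all white."""
    white = '[\n\t\x0b\x0c ]'
    # Drop white lines at the beginning and at the end, keeping indentation.
    text = re.sub('^' + white + '*\n', '', text)
    text = re.sub('\n' + white + '*$', '', text)
    # Split wherever one or more white lines separate two paragraphs.
    return re.split('\n' + white + '*\n', text)

sample = ' \n\t\n  line 1a\n  line 1b\n\n\t \n  line 2a\n  line 2b  \n \t\n'
result = paragraphs_of(sample)
print(result)
```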
|
Special characters
Special characters are sometimes called metacharacters: . ^ $ * + ? { } [ ] \ | ( )
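Outside a set, a metacharacter is matched literally when it is preceded by a backslash. A small sketch with '.', which otherwise matches any character except a new line; the sample string is an assumption:

```python
import re

text = 'Price: 3.14 or 3x14?'   # hypothetical sample string

# Unescaped, '.' matches any character except a new line.
loose = re.findall(r'3.14', text)

# Escaped with a backslash, '.' matches only a literal '.'.
strict = re.findall(r'3\.14', text)
print(loose, strict)
```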
|
International characters
The methods work with international characters:
>>> pattern = white + any + 'στο' + white + one_or_more ; pattern
'[ \t\n\r\x0c\x0b]{0,}στο[ \t\n\r\x0c\x0b]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(12, 17), match=' στο '>
>>>
Find all words that contain the letter 'α' (Greek alpha):
>>> pattern = non_white + any + 'α' + non_white + any ; pattern
'[^ \t\n\r\x0c\x0b]{0,}α[^ \t\n\r\x0c\x0b]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Καλώς'>
<_sre.SRE_Match object; span=(6, 12), match='ήρθατε'>
>>>
List all the words in the string:
>>> print ('\n'.join([ str(p) for p in re.finditer(r'\w+', 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Καλώς'>
<_sre.SRE_Match object; span=(6, 12), match='ήρθατε'>
<_sre.SRE_Match object; span=(13, 16), match='στο'>
<_sre.SRE_Match object; span=(17, 30), match='Βικιεπιστήμιο'>
>>>
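For string patterns, '\w' matches Unicode word characters by default, which is why the Greek words are found. With flag re.ASCII it is restricted to [a-zA-Z0-9_]. A minimal sketch of the difference; the mixed sample string is an assumption:

```python
import re

text = 'Καλώς ήρθατε στο Wiki'   # hypothetical sample, Greek and Latin words

# Default: Unicode word characters.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII: only a-z, A-Z, 0-9 and '_'.
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)
print(ascii_words)
```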
The special character '\w' matches any character that may be part of a word, including international letters.
|
Matching '^', '$', '*', '+', '?' literally
Within a set (within brackets '[' and ']') these characters are matched literally:
>>> pattern = r'[$]' + one_or_more ; pattern
'[$]+' # One or more of '$' literally.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['$$$$$']
>>>
>>> pattern = r'[*]' + one_or_more + r'[$]' + any; pattern
'[*]+[$]*' # One or more of '*' and any number of '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['***', '**', '**$$$$$']
>>>
Characters listed individually within brackets are alternatives; any one of them matches:
>>> pattern = r'[2aX?*$]' ; pattern
'[2aX?*$]' # '2' or 'a' or 'X' or '?' or '*' or '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['a', '*', '*', '*', '2', '*', '*', 'X', '?', '?', '*', '*', '$', '$', '$', '$', '$', '?', '?', '?', '?', '?']
>>>
>>> pattern = r'[2aX?*$]' + one_or_more ; pattern
'[2aX?*$]+' # One or more of '2' or 'a' or 'X' or '?' or '*' or '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['a', '***', '2', '**X', '??', '**$$$$$?????']
>>>
The caret '^' has a special meaning in first place within a set. To match it literally, escape it:
>>> pattern = r'[\^]' + one_or_more ; pattern
'[\\^]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^???' )
['^^']
>>>
or put it after first place in the set: >>> pattern = r'[$?^]' + one_or_more ; pattern
'[$?^]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^???' )
['??', '$$$$$??^^???']
>>>
Characters that may have a special meaning within a set include '^', ']' and backslash. Method re.escape(....) escapes them:
>>> pattern = r'123^]?*\ '[:-1] ; pattern
'123^]?*\\' # Backslash at end.
>>>
>>> pattern = r'[' + re.escape(pattern) + r']' ; pattern
'[123\\^\\]\\?\\*\\\\]'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['*', '*', '*', '1', '2', '3', '*', '*', '?', '?', '*', '*', '?', '?', '^', '^', '?', '?', '?', ']', ']', ']', '\\', '\\', '\\']
>>>
>>> pattern = pattern + one_or_more ; pattern
'[123\\^\\]\\?\\*\\\\]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['***123**', '??', '**', '??^^???', ']]]', '\\\\\\']
>>>
>>> pattern = r'^3]?*}{)(\ '[:-1] ; pattern
'^3]?*}{)(\\'
>>> pattern = r'[' + re.escape(pattern) + r']' + one_or_more ; pattern
'[\\^3\\]\\?\\*\\}\\{\\)\\(\\\\]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['***', '3**', '??', '**', '??^^???', ']]]', '(((', ')))', '{{{', '}}}}', '\\\\\\']
>>>
>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1] ; pattern
'\'"^3]?}\\'
>>> pattern = r'[' + re.escape(pattern) + r']' + one_or_more ; pattern
'[\\\'\\"\\^3\\]\\?\\}\\\\]+' # One or more of "'" or '"' or '^' or '3' or ']' or '?' or '}' or backslash.
>>> re.findall (pattern, r"""abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( )))'''' {{{ }}}} ||| \\\ """ )
['3', '??', '??^^???', ']]]', "''''", '}}}}', '\\\\\\']
>>>
>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1] ; pattern # Carefully define the pattern.
'\'"^3]?}\\'
>>> pattern = r'[^' + re.escape(pattern) + r']' + one_or_more ; pattern # Build the regular expression.
'[^\\\'\\"\\^3\\]\\?\\}\\\\]+' # One or more of any character that is not ("'" or '"' or '^' or '3' or ']' or '?' or '}' or backslash).
>>> re.findall (pattern, r"""abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( )))'''' {{{ }}}} ||| \\\ """ )
['abc***12', '**XYZ', 'q**$$$$$', ' [[[ ', ' ((( )))', ' {{{ ', ' ||| ', ' ']
>>>
You can see that regular expressions can become complicated and unintelligible quickly.
>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1] ; pattern
'\'"^3]?}\\'
>>> L1 = list(pattern) ; L1
["'", '"', '^', '3', ']', '?', '}', '\\'] # Each member of L1 is one character.
>>>
>>> pattern_escaped = re.escape(pattern) ; pattern_escaped
'\\\'\\"\\^3\\]\\?\\}\\\\'
>>> r'''\'\"\^3\]\?\}\\''' == pattern_escaped == r"\'" + r'\"' + r'\^' + '3' + r'\]' + r'\?' + r'\}' + r'\\'
True # All characters in pattern except A-Za-z0-9_ have been escaped.
>>>
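The exact output of re.escape depends on the Python version: since Python 3.7 only characters that can have a special meaning are escaped, so quotes are left alone and the comparison above may not hold. What does not change is that the escaped string matches the original text literally. A sketch that relies only on that property; the names literal and members are hypothetical:

```python
import re

literal = '^3]?}\\' + '\'"'   # characters to be matched literally

escaped = re.escape(literal)

# Whichever characters re.escape chooses to escape, the escaped pattern
# matches the original text exactly.
m = re.fullmatch(escaped, literal)
print(m is not None)

# Inside a set, the escaped characters are matched one at a time.
members = re.findall('[' + escaped + ']', 'a^b]c?d}e\\f')
print(members)
```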
|
Advanced Regular Expressions
Matching dates
A date has a format such as:
3 /9 / 1923
11/ 22/ 1987
Aug23,2017
Septe 4 , 2001
The ultimate regular expression will be:
pattern1 = r'''
\b # word boundary
\d{1,2} # 1 or 2 numeric
\s* # any white
/
\s* # any white
\d{1,2} # 1 or 2 numeric
\s* # any white
/
\s* # any white
\d{4} # 4 numeric
\b # word boundary
'''
pattern2 = r'''
\b # word boundary
''' + upper + lower + r'''{2,} # upper + 2 or more lower
\s* # any white
\d{1,2} # 1 or 2 numeric
\s* # any white
,
\s* # any white
\d{4} # 4 numeric
\b # word boundary
'''
pattern = pattern1 + '|' + pattern2
print (pattern)
\b # word boundary
\d{1,2} # 1 or 2 numeric
\s* # any white
/
\s* # any white
\d{1,2} # 1 or 2 numeric
\s* # any white
/
\s* # any white
\d{4} # 4 numeric
\b # word boundary
|
\b # word boundary
[ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,} # upper + 2 or more lower
\s* # any white
\d{1,2} # 1 or 2 numeric
\s* # any white
,
\s* # any white
\d{4} # 4 numeric
\b # word boundary
The above verbose format is much more readable than:
r'''\b\d{1,2}\s*/\s*\d{1,2}\s*/\s*\d{4}\b|\b[ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,}\s*\d{1,2}\s*,\s*\d{4}\b'''
s3 = ''' 7/4 / 1776 3/2/2001 12 / 19
/ 2007
Jul4,1776 July 4 , 1776 xbcvgdf ,,
vnhgb August13 ,2003... Nove 22, 2007,,,February14,1776 '''
print ('\n\n', '\n'.join([ str(p.group()) for p in re.finditer(pattern, s3 , re.VERBOSE) ]), sep='')
7/4 / 1776
3/2/2001
12 / 19
/ 2007
Jul4,1776
July 4 , 1776
August13 ,2003
Nove 22, 2007
February14,1776
Unix dates
A "Unix date" has format:
$ date
Wed Feb 14 08:24:24 CST 2018
In this section a regular expression to match a Unix date will accept:
Wed Feb 14 08:24:24 CST 2018
Wednes Feb 14 08:24:24 CST 2018 # More than 3 letters in name of day.
Wed Febru 14 08:24:24 CST 2018 # More than 3 letters in name of month.
Wed Feb 14 8:24 : 24 CST 2018 # White space in hh:mm:ss.
wed FeB 14 8:24 : 24 cSt 2018 # Bad capitalization.
Build parts of the regular expression.
mo='''January February March April
May June July August September
October November December
'''
s1 = '|\n'.join([
'|'.join([ month[:p] for p in range (len(month), 2, -1) ])
for month in mo.title().split()
])
print (s1)
January|Januar|Janua|Janu|Jan|
February|Februar|Februa|Febru|Febr|Feb|
March|Marc|Mar|
April|Apri|Apr|
May|
June|Jun|
July|Jul|
August|Augus|Augu|Aug|
September|Septembe|Septemb|Septem|Septe|Sept|Sep|
October|Octobe|Octob|Octo|Oct|
November|Novembe|Novemb|Novem|Nove|Nov|
December|Decembe|Decemb|Decem|Dece|Dec
da='''Sunday Monday Tuesday
Wednesday Thursday Friday
Saturday
'''
s2 = '|\n'.join([
'|'.join([ day[:p] for p in range (len(day), 2, -1) ])
for day in da.title().split()
])
print (s2)
Sunday|Sunda|Sund|Sun|
Monday|Monda|Mond|Mon|
Tuesday|Tuesda|Tuesd|Tues|Tue|
Wednesday|Wednesda|Wednesd|Wednes|Wedne|Wedn|Wed|
Thursday|Thursda|Thursd|Thurs|Thur|Thu|
Friday|Frida|Frid|Fri|
Saturday|Saturda|Saturd|Satur|Satu|Sat
Build the regular expression.
reg2 = (
r'''\b # Word boundary.
(?P<day>
''' + s2 + r'''
)
\s+
(?P<month>
''' + s1 + r'''
)
\s+
(?P<date> ([1-9]) | ([12][0-9]) | (3[01]) ) # 1 through 31
\s+
(?P<hours> ((0{0,1}|1)[0-9]) | (2[0-3]) ) # (0 or 00) through 23
\s*\:\s*
(?P<minutes> [0-5]{0,1}[0-9] ) # (0 or 00) through 59
\s*\:\s*
(?P<seconds> [0-5]{0,1}[0-9] ) # (0 or 00) through 59
\s+
(?P<time_zone> [ECMP][SD]T )
\s+
(?P<year> (19[0-9][0-9]) | (20[01][0-9]) ) # 1900 through 2019
\b''' # Word boundary.
)
print (reg2)
\b # Word boundary.
(?P<day>
Sunday|Sunda|Sund|Sun|
Monday|Monda|Mond|Mon|
Tuesday|Tuesda|Tuesd|Tues|Tue|
Wednesday|Wednesda|Wednesd|Wednes|Wedne|Wedn|Wed|
Thursday|Thursda|Thursd|Thurs|Thur|Thu|
Friday|Frida|Frid|Fri|
Saturday|Saturda|Saturd|Satur|Satu|Sat
)
\s+
(?P<month>
January|Januar|Janua|Janu|Jan|
February|Februar|Februa|Febru|Febr|Feb|
March|Marc|Mar|
April|Apri|Apr|
May|
June|Jun|
July|Jul|
August|Augus|Augu|Aug|
September|Septembe|Septemb|Septem|Septe|Sept|Sep|
October|Octobe|Octob|Octo|Oct|
November|Novembe|Novemb|Novem|Nove|Nov|
December|Decembe|Decemb|Decem|Dece|Dec
)
\s+
(?P<date> ([1-9]) | ([12][0-9]) | (3[01]) ) # 1 through 31
\s+
(?P<hours> ((0{0,1}|1)[0-9]) | (2[0-3]) ) # (0 or 00) through 23
\s*\:\s*
(?P<minutes> [0-5]{0,1}[0-9] ) # (0 or 00) through 59
\s*\:\s*
(?P<seconds> [0-5]{0,1}[0-9] ) # (0 or 00) through 59
\s+
(?P<time_zone> [ECMP][SD]T )
\s+
(?P<year> (19[0-9][0-9]) | (20[01][0-9]) ) # 1900 through 2019
\b
Regular expression reg2 will be used to search the following string:
dates = '''
MON Februar 12 0:30 : 19 CST 2018
Tue Feb 33 00:30:19 CST 2018 # Invalid.
Wed Feb 29 00:30:19 CST 1900 # Invalid.
Thursda feb 29 00:30:19 CST 1944
'''
List all valid dates in string dates:
d1 = dict ((
('Jan', 31), ('May', 31), ('Sep', 30),
('Feb', 28), ('Jun', 30), ('Oct', 31),
('Mar', 31), ('Jul', 31), ('Nov', 30),
('Apr', 30), ('Aug', 31), ('Dec', 31),
))
A listcomp accepts free-format Python:
L1 = [
    '\n'.join(( str(m), m[0], str(m.groupdict()) ))
    for m in re.finditer(reg2, dates, re.IGNORECASE|re.VERBOSE)
    for date in ( int(m['date']) ,)     # Equivalent to assignment: date = int(m['date'])
    for month in ( m['month'].title() ,)
    for year in ( int(m['year']) ,)
    for leap_year in (                  # 'else' in a listcomp
        (                               # equivalent to:
            year % 4 == 0,              # if year % 100 == 0:
            year % 400 == 0             #     leap_year = year % 400 == 0
        )[year % 100 == 0]              # else :
    ,)                                  #     leap_year = year % 4 == 0
    for max_date in (                   # if (month[:3] == 'Feb') and leap_year :
        (                               #     max_date = 29
            d1[month[:3]],              # else :
            29                          #     max_date = d1[month[:3]]
        )[(month[:3] == 'Feb') and leap_year]
    ,)
    if date <= max_date
]
print (
'\n\n'.join(L1)
)
<_sre.SRE_Match object; span=(1, 34), match='MON Februar 12 0:30 : 19 CST 2018'>
MON Februar 12 0:30 : 19 CST 2018
{'day': 'MON', 'month': 'Februar', 'date': '12', 'hours': '0', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '2018'}

<_sre.SRE_Match object; span=(155, 251), match='Thursda feb 29 > # Output here is clipped.
Thursda feb 29 00:30:19 CST 1944 # Correct data here.
{'day': 'Thursda', 'month': 'feb', 'date': '29', 'hours': '00', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '1944'}
To access the groupdict of a field that matches:
line = (L1[0].split('\n'))[2]
d2 = eval(line)
print ('d2 =', d2)
d2 = {'day': 'MON', 'month': 'Februar', 'date': '12', 'hours': '0', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '2018'}
A little philosophy
Because this example is contained within the page "Regular Expressions," there is much decision making contained within the code. The example:
1) focuses on matching alpha-numeric patterns. Verification that February 29, 1944 was in fact a Thursday is outside the scope of this section.
2) does not consider the possibility of leap seconds. Saturday December 31 23:59:60 UTC 2016 was a legitimate time. It seems that accurate time (and how to display it) is a field of science unto itself and not yet standardized.
3) is not complete until properly tested. Testing the code could consume 3-10 times as much effort as writing it.
4) highlights that a listcomp is an ideal place for (almost) format-free Python code.
5) shows that, as a regular expression becomes more complicated, you may have to write Python code just to produce the regular expression.
Matching integers and floats
Integers
An integer, as matched below, is an optional sign followed by one or more digits, with optional white space around each.
Do not rely on Python's eval(....) to recognize an integer:
>>> date = '12/3/4' ; eval(date) ; isinstance(eval(date), float)
1.0
True
>>>
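One alternative to eval is int(....), which raises ValueError for text that is not an integer. It tolerates white space around the number but not between the sign and the digits, which is one reason the regular expressions below are useful. A minimal sketch; the helper name is_int is hypothetical:

```python
def is_int(text):
    """Return True if int() accepts text, else False."""
    try:
        int(text)
        return True
    except ValueError:
        return False

# '12/3/4' is an expression, not an integer; '- 123' has white space after the sign.
checks = {t: is_int(t) for t in ['12/3/4', ' 123 ', '- 123']}
print(checks)
```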
Searching for integers:
>>> print (pattern)
^ # anchor at beginning
\s* # any white
[+-]? # 0 or 1 of ('+' or '-')
\s* # any white
\d+ # 1 or more numeric
\s* # any white
$ # anchor at end
>>> re.search (pattern, ' 123 ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 24), match=' 123 '>
>>> re.search (pattern, ' - 123 ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 25), match=' - 123 '>
>>> re.search (pattern, ' - 1 23 ', re.VERBOSE)
>>> # No match.
Method str.strip() produces (almost) a clean int:
>>> ' +13 '.strip()
'+13'
>>> ' + 13 '.strip()
'+ 13'
>>>
Method str.replace() hides errors:
>>> ' + 12 34 '.replace(' ', '') # Error in input.
'+1234' # Good output.
>>>
To produce a clean int:
>>> print (pattern)
^ # anchor at beginning
\s* # any white
([+-]?) # 0 or 1 of ('+' or '-'). Notice the '()' around the '[+-]?'.
\s* # any white
(\d+) # 1 or more numeric. Notice the '()' around the '\d+'.
\s* # any white
$ # anchor at end
>>> re.search (pattern, ' 123 ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 24), match=' 123 '>
>>> m = re.search (pattern, ' - 123 ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 25), match=' - 123 '>
>>> m.group()
' - 123 '
>>> m.group(0)
' - 123 '
>>> m.group(1,2)
('-', '123') # Values that match the expressions in '()' above.
>>> ''.join(m.group(1,2))
'-123'
>>>
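The grouping technique above can be wrapped in a small function that returns an int, or None when the text is not an integer. This is a sketch, not the page's own code; the names INTEGER and clean_int are hypothetical, and fullmatch is used in place of the '^' and '$' anchors:

```python
import re

INTEGER = re.compile(r'''
    \s*
    ([+-]?)   # optional sign
    \s*
    (\d+)     # 1 or more numeric
    \s*
    ''', re.VERBOSE)

def clean_int(text):
    """Return int for text such as ' - 123 ', or None if text is not an integer."""
    m = INTEGER.fullmatch(text)
    if m is None:
        return None
    return int(''.join(m.group(1, 2)))

values = [clean_int(' - 123 '), clean_int('+13'), clean_int(' + 12 34 ')]
print(values)
```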
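Floats
Examples of point floats are .3, 3. and 3.3. Examples of exponent floats are 0.123e+2 and 3.3E-12. An exponent float can contain a significand without a decimal point, such as 3e2. If not an exponent float, it must be a point float. This means a decimal point and at least one digit.
Matching a point float:
>>> print (pattern)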
# for point float
^ # anchor at beginning
\s* # any white
([+-]?) # 0 or 1 of ('+' or '-'). Notice the '()' around '[+-]?'.
\s* # any white
(\.\d+|\d+\.|\d+\.\d+) # .3 or 3. or 3.3
\s* # any white
$ # anchor at end
>>>
>>> m = re.search (pattern, ' .123 ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 26), match=' .123 '>
>>> m.group(1,2)
('', '.123')
>>> m = re.search (pattern, ' - 0.123 ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 27), match=' - 0.123 '>
>>> m.group(1,2)
('-', '0.123')
>>>
Matching an exponent float:
>>> print (patternE)
# for exponent float
^ # anchor at beginning
\s* # any white
([+-]?) # 0 or 1 of ('+' or '-'). Notice the '()' around '[+-]?'.
\s* # any white
(\.?\d+|\d+\.|\d+\.\d+) # 3 or .3 or 3. or 3.3
[eE]
([+-]?\d+) # exponent
\s* # any white
$ # anchor at end
>>>
>>> m = re.search (patternE, ' . 123 ', re.VERBOSE) ; m
>>> # No match.
>>> m = re.search (patternE, ' - 0.123e+2 ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 30), match=' - 0.123e+2 '>
>>> m.group(1,2,3)
('-', '0.123', '+2')
>>>
>>> m = re.search (patternE, ' - 3.3E-12 ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 29), match=' - 3.3E-12 '>
>>> m.group(1,2,3)
('-', '3.3', '-12')
>>> m.group(1) + m.group(2) + 'e' + m.group(3)
'-3.3e-12'
>>>
>>> [ m.group(p) for p in range(1, m.lastindex+1) ]
['-', '3.3', '-12']
>>>
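The pieces retrieved from the groups can be reassembled and handed to float(....). The sketch below combines the point and exponent cases in one pattern with named groups. It is an illustration under assumptions: the names FLOAT and clean_float are hypothetical, and unlike the patterns above it also accepts a plain integer significand without an exponent:

```python
import re

FLOAT = re.compile(r'''
    \s*
    (?P<sign>[+-]?)
    \s*
    (?P<significand>\d+\.\d*|\.\d+|\d+)   # 3.3 or 3. or .3 or 3
    (?:[eE](?P<exponent>[+-]?\d+))?       # optional exponent
    \s*
    ''', re.VERBOSE)

def clean_float(text):
    """Return float for text such as ' - 3.3E-12 ', or None if there is no match."""
    m = FLOAT.fullmatch(text)
    if m is None:
        return None
    exponent = m.group('exponent') or '0'
    return float(m.group('sign') + m.group('significand') + 'e' + exponent)

print(clean_float(' - 3.3E-12 '), clean_float(' .123 '), clean_float(' . 123 '))
```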
Matching any float
This example shows how substrings that match may be retrieved quickly and accurately from named groups.
import re
reg_exp = r'''
(?P<sign_of_float>[-+])? # Named group sign_of_float.
\s*
(?P<significand>\d*\.?\d*) # Named group significand.
(
[eE]
(?P<sign_of_exponent>[-+])? # Named group sign_of_exponent.
(?P<exponent>\d+) # Named group exponent.
)?
'''
The above reg_exp is applied to the following string:
s1 = ' + 5e2 5 + 5 - 5 .e3 . 5 e-2 5 . 5 .e 2 3.3 - 3.3E+1 '
s2 = '''
# Substring that matched reg_exp.
# Same as m['sign_of_float'].
# Same as m['significand'].
# Group not named.
# Same as m['sign_of_exponent'].
# Same as m['exponent'].
'''
L2 = [p.strip() for p in s2.split('\n') if re.search(r'\S', p)]
L1 = [ m for m in re.finditer(reg_exp, s1, re.VERBOSE)
# Extra conditions for float:
if (
m['significand'] and m['exponent'],
len(m['significand']) >= 2
)[ '.' in m['significand'] ]
]
for m in L1 :
    print (
'''
##########################
m = {}'''.format(m)
    )
    print ('\nInformation available in all groups:')
    for p in range(0, len(m.groups())+1) :
        if m[p] is None : s2 = 'm[{}] = None'.format( p )
        else : s2 = "m[{}] = '{}'".format( p, m[p] )
        s2 = (s2 + ' '*16)[:16] # Left justified in a string of 16 characters.
        print (s2, L2[p])
exit (0)
##########################
m = <_sre.SRE_Match object; span=(2, 7), match='+ 5e2'>

Information available in all groups:
m[0] = '+ 5e2' # Substring that matched reg_exp.
m[1] = '+' # Same as m['sign_of_float'].
m[2] = '5' # Same as m['significand'].
m[3] = 'e2' # Group not named.
m[4] = None # Same as m['sign_of_exponent'].
m[5] = '2' # Same as m['exponent'].

##########################
m = <_sre.SRE_Match object; span=(49, 53), match=' 3.3'>

Information available in all groups:
m[0] = ' 3.3' # Substring that matched reg_exp.
m[1] = None # Same as m['sign_of_float'].
m[2] = '3.3' # Same as m['significand'].
m[3] = None # Group not named.
m[4] = None # Same as m['sign_of_exponent'].
m[5] = None # Same as m['exponent'].

##########################
m = <_sre.SRE_Match object; span=(55, 63), match='- 3.3E+1'>

Information available in all groups:
m[0] = '- 3.3E+1 # Substring that matched reg_exp.
m[1] = '-' # Same as m['sign_of_float'].
m[2] = '3.3' # Same as m['significand'].
m[3] = 'E+1' # Group not named.
m[4] = '+' # Same as m['sign_of_exponent'].
m[5] = '1' # Same as m['exponent'].
Decoding a bytes object
L2 contains the contents of a bytes object, one byte per string:
L2 = (
    ['11001110', '10010010', '11001110', '10111001', '11001110', '10111010', '11001110'] +
    ['10111001', '00100000', '11101100', '10011100', '10000100', '11101101', '10000010'] +
    ['10100100', '11101011', '10110000', '10110000', '11101100', '10011011', '10000000'] +
    ['00100000', '01010111', '01101001', '01101011', '01101001']
)
Produce list L4 that contains L2 in a format that conforms to the UTF-8 standard, one character per string:
L3 = []
for p in range (len(L2)-1,-1,-1) :
    if re.search(r'^0[01]{7}$', L2[p]) :
        L3 += [L2[p]]
        continue
    if re.search(r'^110[01]{5}$', L2[p]) :
        if p+1 >= len(L2) : exit (99)
        if re.search(r'^10[01]{6}$', L2[p+1]) :
            L3 += [L2[p] + L2[p+1]]
            continue
        exit (98)
    if re.search(r'^1110[01]{4}$', L2[p]) :
        if p+2 >= len(L2) : exit (97)
        if re.search(r'^10[01]{6}$', L2[p+1]) and re.search(r'^10[01]{6}$', L2[p+2]) :
            L3 += [L2[p] + L2[p+1] + L2[p+2]]
            continue
        exit (96)
    if re.search(r'^10[01]{6}$', L2[p]) :
        if p == 0 : exit (95)
        continue
    exit (94)
L4 = L3[::-1]
print (
'''
L4 = (
{} + # Greek
{} + # '\\x20' is a space.
{} + # Korean
{} + # '\\x20' is a space.
{} ) # English
'''.format(L4[0:4], L4[4:5], L4[5:9], L4[9:10], L4[10:])
)
L4 = (
['1100111010010010', '1100111010111001', '1100111010111010', '1100111010111001'] + # Greek
['00100000'] + # '\x20' is a space.
['111011001001110010000100', '111011011000001010100100', '111010111011000010110000', '111011001001101110000000'] + # Korean
['00100000'] + # '\x20' is a space.
['01010111', '01101001', '01101011', '01101001'] ) # English
Decode L4:
L5 = []
for p in range (0, len(L4)) :
    if (len(L4[p]) == 8) :
        m = re.search (r'^0[01]{7}$', L4[p])
        if not m : exit (89)
        I1 = int(L4[p], base=2) ; L5 += chr(I1)
        continue
    if (len(L4[p]) == 16) :
        m = re.search (r'^110([01]{5})10([01]{6})$', L4[p])
        if not m : exit (88)
        if m.lastindex != 2 : exit (87)
        I1 = int(m.group(1) + m.group(2), 2) ; L5 += chr(I1)
        continue
    if (len(L4[p]) == 24) :
        m = re.search (r'^ 1110 ([01]{4}) 10 ([01]{6}) 10 ([01]{6}) $', L4[p], re.VERBOSE)
        if not m : exit (86)
        if m.lastindex != 3 : exit (85)
        I1 = int(m.group(1) + m.group(2) + m.group(3), 2) ; L5 += chr(I1)
        continue
    exit (84)
print ('L5 =', L5)
exit (0)
L5 = ['Β', 'ι', 'κ', 'ι', ' ', '위', '키', '배', '움', ' ', 'W', 'i', 'k', 'i'] |
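As a cross-check on the hand-written decoder above, the same bit strings can be packed into a bytes object and handed to Python's built-in UTF-8 decoder. This is only a minimal sketch; the names raw and text are introduced here for illustration and are not part of the lesson.

```python
# The bit strings from L2 above, one string per byte.
L2 = (
    ['11001110', '10010010', '11001110', '10111001', '11001110', '10111010', '11001110'] +
    ['10111001', '00100000', '11101100', '10011100', '10000100', '11101101', '10000010'] +
    ['10100100', '11101011', '10110000', '10110000', '11101100', '10011011', '10000000'] +
    ['00100000', '01010111', '01101001', '01101011', '01101001']
)

# Convert each bit string to an integer in range 0..255 and pack them into bytes.
raw = bytes(int(bits, 2) for bits in L2)

# Let the built-in decoder do what the regular expressions did by hand.
text = raw.decode('utf-8')
print(list(text))
```

If the regular expressions in the lesson are correct, list(text) should agree with L5.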
Compiling regular expressions
If a regular expression is complicated or is to be used frequently, it can be compiled to produce a pattern object.
>>> print (pattern)
([+-]{1} # 1 of ('+' or '-').
\s* # any white space.
\d+) # 1 or more numeric.
|
(\d+) # 1 or more numeric.
>>>
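The regular expression is compiled with flag re.VERBOSE so that the white space and comments in pattern are ignored:
>>> integer = re.compile(pattern, re.VERBOSE)
The compiled pattern called 'integer' has methods similar to the functions of module re:
>>> s1 = ' 123 - 456(( !!+++ 2345 !! -2##'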
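To see what re.VERBOSE buys, the sketch below compiles the commented pattern and a compact one-line equivalent and compares their matches. The names verbose, compact and sample are assumptions made for this illustration only.

```python
import re

# The lesson's pattern, written with white space and comments (needs re.VERBOSE).
verbose = re.compile(r'''
    ([+-]{1}   # 1 of ('+' or '-').
     \s*       # any white space.
     \d+)      # 1 or more numeric.
    |
    (\d+)      # 1 or more numeric.
''', re.VERBOSE)

# The same pattern on one line, without re.VERBOSE.
compact = re.compile(r'([+-]{1}\s*\d+)|(\d+)')

sample = 'a 12 b -  34 c'   # hypothetical test string

verbose_spans = [m.span() for m in verbose.finditer(sample)]
compact_spans = [m.span() for m in compact.finditer(sample)]
print(verbose_spans == compact_spans)

# With two groups, findall() returns one tuple per match;
# a group that did not take part appears as an empty string.
print(verbose.findall(sample))
```

The verbose form costs nothing at match time; the comments are discarded when the pattern is compiled.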
Displaying all matches
Displaying all matches manually, one after the other.
>>> integer.search(s1)
<_sre.SRE_Match object; span=(4, 7), match='123'>
>>> integer.search(s1[7:])
<_sre.SRE_Match object; span=(7, 13), match='- 456'>
>>> integer.search(s1[7:][13:])
<_sre.SRE_Match object; span=(11, 20), match='+ 2345'>
>>> integer.search(s1[7:][13:][20:])
<_sre.SRE_Match object; span=(4, 6), match='-2'>
>>> integer.search(s1[7:][13:][20:][6:])
>>>
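The method search() of a compiled pattern accepts an optional starting position, so the string need not be sliced and every span is reported relative to s1:
>>> m = integer.search(s1) ; m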
<_sre.SRE_Match object; span=(4, 7), match='123'>
>>> m = integer.search(s1, 7) ; m
<_sre.SRE_Match object; span=(14, 20), match='- 456'>
>>> m = integer.search(s1, 20) ; m
<_sre.SRE_Match object; span=(31, 40), match='+ 2345'>
>>> m = integer.search(s1, m.span()[1]) ; m
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>> m = integer.search(s1, m.span()[1]) ; m
>>>
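The difference between slicing the string and passing a starting position can be sketched as follows. The pattern below is a group-free variant of the lesson's pattern and text is a hypothetical sample string; both are assumptions for illustration.

```python
import re

integer = re.compile(r'[+-]\s*\d+|\d+')   # group-free variant of the lesson's pattern
text = 'a 12 b -  34 c'                   # hypothetical sample string

first = integer.search(text)              # finds '12'
end = first.end()                         # position just past '12'

by_slice = integer.search(text[end:])     # span is relative to the slice
by_pos = integer.search(text, end)        # span is relative to text

print(by_slice.span(), by_pos.span())

# Both calls find the same substring; only the reported positions differ.
print(by_slice.group() == by_pos.group())
print(by_slice.start() + end == by_pos.start())
```

Passing a position also avoids copying the remainder of the string for every search.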
Iterating through all matches.
>>> print ( '\n'.join([str(p) for p in integer.finditer(s1)]) )
<_sre.SRE_Match object; span=(4, 7), match='123'>
<_sre.SRE_Match object; span=(14, 20), match='- 456'>
<_sre.SRE_Match object; span=(31, 40), match='+ 2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>>
or:
v = 0
while True :
    m = integer.search(s1, v)
    if not m : break
    print (m)
    v = m.span()[1]
Output is the same as above.
Splitting the string
Preserving substrings that match
>>> s1
' 123 - 456(( !!+++ 2345 !! -2##'
>>>
>>> [m.groups() for m in integer.finditer(s1)]
[(None, '123'), # Match came from right hand side of '|' in pattern above.
('- 456', None), # Match came from left hand side of '|' in pattern above because it contains sign '-'.
('+ 2345', None), ('-2', None)]
>>>
>>> L1 = integer.split(s1) ; L1
[' ', None, '123', ' ', '- 456', None, '(( !!++', '+ 2345', None, ' !! ', '-2', None, '##']
>>> L1 # Edited for clarity:
[' ',
None, '123', # Groups of 1st match above.
' ',
'- 456', None, # Groups of 2nd match above.
'(( !!++',
'+ 2345', None, # Groups of 3rd match above.
' !! ',
'-2', None, # Groups of 4th match above.
'##']
>>>
>>> L2 = [p for p in L1 if p is not None]
>>> print ('L2 =', L2)
L2 = [' ', '123', ' ', '- 456', '(( !!++', '+ 2345', ' !! ', '-2', '##']
>>>
>>> s2 = ''.join(L2) ; s2
' 123 - 456(( !!+++ 2345 !! -2##'
>>> s2 == s1
True
>>>
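The None entries above arise because the pattern has two groups and only one of them takes part in each match. One possible variation, sketched below under the assumption that a single group is acceptable, wraps the whole alternation in one group so that split() returns no None values. The names integer1, text and parts are hypothetical.

```python
import re

# One capturing group around both alternatives.
integer1 = re.compile(r'([+-]\s*\d+|\d+)')

text = 'x 12 y -3 z'          # hypothetical sample string
parts = integer1.split(text)  # separators and matches alternate
print(parts)

# No filtering is needed before the pieces are joined again.
print(''.join(parts) == text)
```

The price of this variation is that the group number no longer tells which side of '|' produced the match.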
Without preserving substrings that match
In pattern_ the parentheses are omitted, so the pattern contains no groups:
>>> print (pattern_)
[+-]{1} # 1 of ('+' or '-').
\s* # any white space.
\d+ # 1 or more numeric.
|
\d+ # 1 or more numeric.
>>> integer_ = re.compile(pattern_, re.VERBOSE)
>>> s1 = ' 123 - 456(( !!+++ 2345 !! -2##'
>>> L1 = integer_.split(s1) ; L1
[' ', ' ', '(( !!++', ' !! ', '##'] # L1 does not contain the substrings that match.
>>>
Replacing all substrings that match
Replacing all integers in string s1:
After splitting the string
>>> L2
[' ', '123', ' ', '- 456', '(( !!++', '+ 2345', ' !! ', '-2', '##']
>>> L4 = ['INT_1', 'INT_2', 'INT_3', 'INT_4']
>>>
>>> L5 = [ (L2[p], L4[(p-1)>>1])[p & 1] for p in range (len(L2)) ] ; L5
[' ', 'INT_1', ' ', 'INT_2', '(( !!++', 'INT_3', ' !! ', 'INT_4', '##']
>>>
>>> s3 = ''.join(L5) ; s3
' INT_1 INT_2(( !!++INT_3 !! INT_4##'
>>>
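The same kind of replacement can also be sketched with the sub() method of a compiled pattern, which accepts a function that is called once per match. This is an alternative to the approach above, not the lesson's method; the pattern is a group-free variant and the names text, counter and result are hypothetical.

```python
import re
from itertools import count

integer = re.compile(r'[+-]\s*\d+|\d+')   # group-free variant of the lesson's pattern
text = 'a 12 b -  34 c +5'                # hypothetical sample string

counter = count(1)                        # yields 1, 2, 3, ...

# The function receives each match object and returns its replacement text.
result = integer.sub(lambda m: 'INT_{}'.format(next(counter)), text)
print(result)
```

Because sub() walks the string once from left to right, no index arithmetic is needed.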
Without splitting the string
print ("s2 =", "'"+s2+"'",'\n')
L1 = [m for m in integer.finditer(s2)]
print ( '\n'.join(['4 matches found:'] + [str(p) for p in L1]),'\n' )
print ("L4 =", L4,'\n')
# Replace from the last match to the first so that the spans of earlier matches remain valid.
for p in range (3,-1,-1) :
    m = L1[p]
    repl = L4[p]
    start,end = m.span()
    s2 = s2[:start] + repl + s2[end:]
    print (
'''s2 = '{}' after replacing span {}
'''.format(s2, m.span()),
    end=''
    )
s2 = ' 123 - 456(( !!+++ 2345 !! -2##'

4 matches found:
<_sre.SRE_Match object; span=(4, 7), match='123'>
<_sre.SRE_Match object; span=(14, 20), match='- 456'>
<_sre.SRE_Match object; span=(31, 40), match='+ 2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>

L4 = ['INT_1', 'INT_2', 'INT_3', 'INT_4']

s2 = ' 123 - 456(( !!+++ 2345 !! INT_4##' after replacing span (44, 46)
s2 = ' 123 - 456(( !!++INT_3 !! INT_4##' after replacing span (31, 40)
s2 = ' 123 INT_2(( !!++INT_3 !! INT_4##' after replacing span (14, 20)
s2 = ' INT_1 INT_2(( !!++INT_3 !! INT_4##' after replacing span (4, 7)
|
Python: truly international
Python, Emacs and the Wikiversity editor recognize a very large number of international characters. Some of them look exactly like their English counterparts:
>>> ord('Ρ') # Greek rho
929
>>> ord('P') # English P.
80
>>> ord('H') # English
72
>>> ord('Н') # Cyrillic
1053
>>>
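When two characters cannot be told apart on screen, the standard module unicodedata can name them. The sketch below writes the lookalike characters as escapes so that the source itself is unambiguous; the names pairs and no_match are introduced for illustration.

```python
import re
import unicodedata

# Pairs that look alike in many fonts: (Latin letter, lookalike).
pairs = [('P', '\u03a1'),   # Latin P and Greek capital rho
         ('H', '\u041d')]   # Latin H and Cyrillic capital en

for latin, other in pairs:
    print(ord(latin), unicodedata.name(latin))
    print(ord(other), unicodedata.name(other))

# A pattern containing the Latin letter does not match its lookalike.
no_match = re.search('P', '\u03a1')
print(no_match)
```

A regular expression compares code points, not shapes, so lookalike characters never match one another.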
A few well-chosen international characters can simplify the creation of a complicated regular expression. Let's revisit the matching of floats.
|
Assignments
Simplify the pattern?
Under "Compiling regular expressions" above the expression for integer is:
>>> print (pattern)
([+-]{1} # 1 of ('+' or '-').
\s* # any white space.
\d+) # 1 or more numeric.
|
(\d+) # 1 or more numeric.
>>>
Why not simplify the expression and use:
>>> print (pattern)
([+-]{0,1} # 0 or 1 of ('+' or '-').
\s* # any white space.
\d+) # 1 or more numeric.
>>>
Because, when the sign is absent, \s* still matches any white space in front of the digits, and this expression produces the following matches:
>>> print ( '\n'.join([str(p) for p in integer.finditer(s1)]) )
<_sre.SRE_Match object; span=(0, 7), match=' 123'> # Leading white space is included, so this match is not considered accurate.
<_sre.SRE_Match object; span=(14, 20), match='- 456'>
<_sre.SRE_Match object; span=(31, 40), match='+ 2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>>
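One possible way to keep a single branch without swallowing leading white space is to make the sign and the white space that follows it optional together, using a non-capturing group (?:...). The sketch below is one suggestion, not the lesson's answer; the sample string and the names loose and tight are assumptions.

```python
import re

text = '    123  -  456 ++  78'           # hypothetical sample string

# Optional sign, then white space: white space matches even without a sign.
loose = re.compile(r'[+-]{0,1}\s*\d+')

# White space is allowed only when it follows a sign.
tight = re.compile(r'(?:[+-]\s*)?\d+')

loose_matches = [m.group() for m in loose.finditer(text)]
tight_matches = [m.group() for m in tight.finditer(text)]

print(loose_matches)
print(tight_matches)
```

Only the unsigned integer is affected: the loose pattern returns it with its leading white space, the tight pattern without.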
Matching a float?
The reference offers regular expression
>>> reg = r'[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?'
>>> re.search(reg, '4')
<_sre.SRE_Match object; span=(0, 1), match='4'>
>>>
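A quick way to explore what reg accepts is to test whole strings with re.fullmatch() rather than re.search(), because search() is satisfied by any matching substring. The lists accepted and rejected below are hypothetical test data chosen for this sketch.

```python
import re

reg = r'[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?'

accepted = ['4', '3.', '.5', '1e10', '-2.5E-3']
rejected = ['.', 'e5', '1e', '', '+ 5e2']   # '+ 5e2' has white space after the sign

# fullmatch() succeeds only if the entire string matches reg.
accepted_ok = all(re.fullmatch(reg, s) for s in accepted)
rejected_ok = not any(re.fullmatch(reg, s) for s in rejected)
print(accepted_ok, rejected_ok)

# search() still finds a float inside a string that is not a float as a whole.
partial = re.search(reg, '1e').group()
print(partial)
```

Unlike the patterns in the section "Floats", reg does not allow white space between the sign and the digits.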
Floats with extra zeroes
In the section "Floats" above there are several examples of regular expressions that match floats. However, they do not consider the possibility of extra leading and trailing zeroes.
>>> eval ( ' + 0003.4000e-00005 ')
3.4e-05
>>>
How would you rewrite the expressions to remove unnecessary zeroes? |
Further Reading or Review
References
1. Python's documentation: "6.2. re — Regular expression operations," "Regular Expression HOWTO," "Common Problems"
"re.compile(pattern, flags=0)," "re.search(pattern, string, flags=0)," "re.split(pattern, string, maxsplit=0, flags=0)," "re.finditer(pattern, string, flags=0)," "re.escape(pattern)," "regex.search(string[, pos[, endpos]])," "regex.finditer(string[, pos[, endpos]])," "match.group([group1, ...])," "match.groupdict(default=None)," "match.span([group])"
|