A regular expression (or regex) is a string of characters, some of which are reserved control characters, that represents a pattern [1], i.e. a description designed to match a particular class of strings. Regular expressions are a basic tool for searching text and are ubiquitous in computing.
Getting started
There are many editors with regex functionality. Here are a few examples (please feel free to add to the list, or to remove entries if you find better ones):
- Regex tester - try your hand at regex here
- meta:User:Pathoschild/Scripts/Regex menu framework - a simple and useful wiki-editing JavaScript
- Codeproject
- - a useful editor with regex functionality
- Regexps manual - Emacs regular expression manual
- Regex tester - a Firefox add-on
Learning materials
A lightning introduction
There are several "dialects" of regular expressions (e.g. JavaScript, Perl, PHP, Python) which differ slightly in grammar. Let us focus on Python regex for the moment (because I happen to have a reference [2] for it).
Control characters
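- Python regex uses the following control characters (some of them, such as - < ! = : #, are special only in certain contexts, e.g. inside [] or directly after "(?"; the quantifier braces { } are also special):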
\-.*+?$<!=|()[]^:#
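Because these characters carry a special meaning, a literal occurrence of one of them has to be escaped with a backslash. As a minimal sketch using Python's standard re module (the sample text is made up for illustration), re.escape adds the backslashes automatically:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "The file is called notes.txt, not notesXtxt."

# Unescaped, the dot is a control character and matches any character,
# so both "notes.txt" and "notesXtxt" are found.
loose = re.findall(r"notes.txt", text)

# re.escape() backslash-escapes control characters so they match literally.
literal_pattern = re.escape("notes.txt")
strict = re.findall(literal_pattern, text)

print(loose)
print(strict)
```

Comparing the two printed lists shows the difference between a dot used as a control character and a dot matched literally.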
First examples
[please verify]
- Any string (e.g. abcdefg) that does not contain any control characters is trivially a regular expression ("regex") pattern. It matches only itself.
- The pattern [A-Z] matches a single character between A and Z (in the ASCII table), i.e. one uppercase letter.
- A backslash (\) followed by any control character, such as \. or even the backslash itself, \\, refers to that control character literally (this pattern is called an "escape"). In our examples, \. matches a single dot and \\ matches a single backslash.
- Combining the two examples above, the pattern [A-Za-z0-9\-] matches any single alphanumeric character or the dash "-".
- The pattern \n matches a newline.
- The pattern abc.xyz matches a string that starts with abc, ends with xyz, and has exactly one character in between: the dot stands for any character except a newline.
- The pattern a* matches a run of zero or more "a" characters, as many as possible; it therefore also matches the empty string "".
- Combining the previous two examples, we get a very common pattern: abc.*xyz matches a string that starts with "abc" and ends with "xyz", with the longest available string (possibly empty) of any characters except the newline in between.
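The examples above can be checked with a short script. This is only a sketch using Python's standard re module; the sample strings and the names cases, results and greedy are invented for illustration. re.fullmatch is used so that a pattern must account for the whole sample string.

```python
import re

# Each entry: (pattern, sample string, whether the whole string should match).
# The sample strings are invented for illustration.
cases = [
    (r"abcdefg", "abcdefg", True),       # literal text matches itself
    (r"[A-Z]", "Q", True),               # one uppercase ASCII letter
    (r"[A-Z]", "q", False),              # lowercase is outside A-Z
    (r"\.", ".", True),                  # escaped dot: a literal "."
    (r"\\", "\\", True),                 # escaped backslash: one backslash
    (r"[A-Za-z0-9\-]", "-", True),       # alphanumeric character or dash
    (r"\n", "\n", True),                 # newline
    (r"abc.xyz", "abc-xyz", True),       # dot stands for one character
    (r"abc.xyz", "abc\nxyz", False),     # by default the dot skips newlines
    (r"a*", "", True),                   # zero repetitions are allowed
    (r"a*", "aaaa", True),
    (r"abc.*xyz", "abcxyz", True),       # the middle part may be empty
]

results = [re.fullmatch(p, s) is not None for p, s, _ in cases]
for (p, s, expected), got in zip(cases, results):
    print(f"{p!r:18} {s!r:12} expected={expected} got={got}")

# Greediness: .* takes the longest available stretch, so the match
# runs to the last "xyz" in the text, not the first one.
greedy = re.search(r"abc.*xyz", "abc1xyz2xyz!").group()
print(greedy)  # abc1xyz2xyz
```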
Exercises
- Question: What is [A-Za-z0-9\-]?
- Write a regular expression for (a) the URL of any Wikiversity page; (b) the URL of any page on any Wikimedia site; (c) the e-mail addresses of all your friends. Check with a regex editor that your regex actually works.
Write your proposed solutions below
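One possible attempt at part (a), offered as a sketch rather than a definitive answer. It assumes that Wikiversity page URLs have the form https://<language code>.wikiversity.org/wiki/<page title>, and it uses only the constructs introduced above (character classes, escapes, the dot and the star). The function name looks_like_wikiversity_url is hypothetical.

```python
import re

# Assumed URL shape: https://<language code>.wikiversity.org/wiki/<page title>
# [a-z][a-z\-]* requires at least one lowercase letter for the language code.
WIKIVERSITY_URL = r"https://[a-z][a-z\-]*\.wikiversity\.org/wiki/.*"


def looks_like_wikiversity_url(url):
    """Return True if the whole string fits the assumed URL shape."""
    return re.fullmatch(WIKIVERSITY_URL, url) is not None


print(looks_like_wikiversity_url("https://en.wikiversity.org/wiki/Regular_expressions"))
print(looks_like_wikiversity_url("https://en.wikipedia.org/wiki/Regular_expression"))
```

This pattern is deliberately loose: the trailing .* accepts any page title, including an empty one, so it is a starting point to refine in a regex editor.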
Further lessons
[proposals]
- /Basics - the bare minimum needed to get started
- /Groups
- /How a regex engine works
- /Lookahead and lookbehind
- /Regex objects in python
- /The good and the bad
- /Cookbook
Wikimedia links
- b:regular expressions
- w:regular expressions
- mediawiki:titleblacklist - an application on wikiversity
External links
Notes
This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.