Objective
|
Lesson
Python Strings
The string is one of the simplest data types in Python. Strings can be created by putting either single quotation marks (') or double quotation marks (") around a sequence of characters.
>>> 'Hello!'
'Hello!'
>>> "Hello!"
'Hello!'
>>> "Hello," + " world!"
'Hello, world!'
>>> "Wiki" "versity" "!"
'Wikiversity!'
>>> print("hey" * 3)
heyheyhey
Examples of strings with embedded quotation marks:
>>> s = "This is John's shoe." ; s
"This is John's shoe."
>>>
>>> 'He said "I will come."'
'He said "I will come."'
>>>
>>> 'He said "I'll come."'
File "<stdin>", line 1
'He said "I'll come."'
^
SyntaxError: invalid syntax
>>>
>>> s = 'He said "I' "'" 'll come."' ; s
'He said "I\'ll come."' # more about escaped characters below.
Escape Characters
Some characters cannot be typed directly into a string literal. They are written as escape sequences: two or more characters that together stand for one character. In Python, an escape sequence begins with a backslash (\). For example, \n stands for a newline.
>>> "Hello, world!\n"
'Hello, world!\n'
>>> print("Hello, world!")
Hello, world!
>>> print("Hello, world!\n")
Hello, world!
>>>
>>> print("C:\new folder")
C:
ew folder
>>> print("C:\\new folder")
C:\new folder
>>> print(r"C:\new folder")
C:\new folder
>>>
You can easily assign strings to variables.
>>> spam = r"C:\new folder"
>>> print(spam)
C:\new folder
>>> s = 'He said "I\'ll come."\n' ; s ; print (s) # escaping the single quote
'He said "I\'ll come."\n'
He said "I'll come."
>>> s = "He said \"I'll come.\"\n" ; s ; print (s) # escaping the double quote
'He said "I\'ll come."\n'
He said "I'll come."
>>> s = r"He said \"I'll come.\"\n" ; s ; print (s) # a raw string
'He said \\"I\'ll come.\\"\\n'
He said \"I'll come.\"\n
>>>
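As a rough illustration of the escape rules above, the sketch below compares an escaped literal, a raw literal, and a literal in which \n is taken as a newline. The variable names are made up for this example.

```python
# Hypothetical names, chosen only for this illustration.
escaped = "C:\\new folder"   # backslash written as an escape sequence
raw = r"C:\new folder"       # raw string: the backslash is kept literally
broken = "C:\new folder"     # here "\n" is a single newline character

print(escaped == raw)          # the two spellings give the same string
print(len(raw), len(broken))   # the newline version is one character shorter
print(repr(broken))            # repr() shows the escape instead of acting on it
```

Comparing lengths is a quick way to check whether a backslash survived as a character or was consumed by an escape sequence.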
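The difference between displaying a string and printing a string: displaying shows the string's representation, with quotes and escapes, while print() shows the characters themselves.
>>> s3 = r'\:' ; s3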
'\\:'
>>> print("'{}'".format(s3))
'\:'
>>>
Newlines
Now, let's say you want to print some multi-line text. You could do it like this:
>>> print("Heya!\nHi!\nHello!\nWelcome!")
>>> print("""
... Heya!
... Hi!
... Hello!
... Welcome!
... """)
Heya!
Hi!
Hello!
Welcome!
>>>
>>> print("""\
... Heya!
... Hi!
... Hello!
... Welcome!""")
Heya!
Hi!
Hello!
Welcome!
>>>
>>> print("I love Wikiversity!", end="")
I love Wikiversity!>>>
>>> spam = ("Hello, \
... world!")
>>> print(spam)
Hello, world!
>>> spam = ("hello, hello, hello, hello, hello, hello, hello, hello, "
... "world world world world world world world world world world.")
>>> print (spam)
hello, hello, hello, hello, hello, hello, hello, hello, world world world world world world world world world world.
>>> spam = ("hello, hello, hello, hello, hello, hello, hello, hello, " '\n'
... "world world world world world world world world world world.")
>>> print (spam)
hello, hello, hello, hello, hello, hello, hello, hello,
world world world world world world world world world world.
>>>
|
Formatting
Strings in Python can be formatted, much like strings in C. Formatting makes it easier to produce well-laid-out output. You can format a string using the percent sign (%) operator, with conversion specifiers such as %d for an integer and %s for a string.
>>> print("The number three (%d)." % 3)
The number three (3).
>>> name = "I8086"
>>> print("Copyright (c) %s 2014" % name)
Copyright (c) I8086 2014
>>> name = "I8086"
>>> date = 2014
>>> print("Copyright (c) %s %d" % (name, date))
Copyright (c) I8086 2014
>>> name = "I8086"
>>> date = 2014
>>> print("Copyright (c) %s %d" % name, date)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> print ("The characters %s are used as a numeric specification thus: %d" % ('%d', 1234))
The characters %d are used as a numeric specification thus: 1234
>>>
or
>>> print ("The characters %d are used as a numeric specification thus: {}".format(1234))
The characters %d are used as a numeric specification thus: 1234
>>>
Examples of formatted strings:
>>> name = 'Fred'
>>> 'He said his name is {0}.'.format(name) # name replaces {0}
'He said his name is Fred.'
>>> 'He said his name is {{{0}}}.'.format(name) # '{{}}' to print literal '{}'
'He said his name is {Fred}.'
>>>
>>> '{0}, {1}, {2}'.format('a', 'b', 'c')
'a, b, c'
>>> '{0}, {1}, {1}, {0}, {2}'.format('a', 'b', 'c')
'a, b, b, a, c'
>>>
>>> ('The complex number {0} contains a real part {0.real} '
... 'and an imaginary part {0.imag}.').format(11+3j)
'The complex number (11+3j) contains a real part 11.0 and an imaginary part 3.0.'
>>>
>>> coordinates1 = (3, -2) ; coordinates2 = (-17, 13)
>>> 'X1 = {0[0]}; Y1 = {0[1]}; X2 = {1[0]}; Y2 = {1[1]}.'.format(coordinates1, coordinates2)
'X1 = 3; Y1 = -2; X2 = -17; Y2 = 13.'
>>>
>>> '{:<30}||'.format('left aligned')
'left aligned                  ||'
>>> '{:>30}||'.format('right aligned')
'                right aligned||'
>>> '{:.>30}||'.format(' right aligned') # right aligned with '.' fill
'................ right aligned||'
>>> '{:@^30}||'.format(' centered ') # centered with '@' fill
'@@@@@@@@@@ centered @@@@@@@@@@||'
>>>
>>> "int: {0:d}; hex: {0:x}; oct: {0:o}; bin: {0:b}".format(23) # conversions to different bases
'int: 23; hex: 17; oct: 27; bin: 10111'
>>>
>>> d = '${0:03.2f}'.format(123.456) ; d
'$123.46'
>>> 'Gross receipts are {0:.>15}'.format( ' ' + d )
'Gross receipts are ....... $123.46'
>>>
>>> 'Gross receipts for {}, {:d} {:.>15}'.format( # sequence {0} {1} {2} is the default.
... 'July', 2017,
... ' ' + '${:03.2f}'.format(123.456)
... )
'Gross receipts for July, 2017 ....... $123.46'
>>>
>>> d = 1234567.5678 ; d
1234567.5678 # d is float
>>> d = '{:03.2f}'.format(d) ; d
'1234567.57' # d is str
>>> d = '${:,}'.format( float(d) ) ; d # input to this format statement is float
'$1,234,567.57' # d is str formatted with '$' and commas
>>> 'Gross receipts for {}, {:d} {:.>20}'.format(
... 'July', 2017,
... ' ' + d
... )
'Gross receipts for July, 2017 ...... $1,234,567.57'
>>>
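The same format specifications can also be used in formatted string literals (f-strings), where the value is written directly inside the braces. This is a small sketch, assuming Python 3.6 or later; the variable names are illustrative only.

```python
# Assumes Python 3.6+, where f-strings are available.
name = "I8086"
year = 2014
amount = 1234567.5678

line1 = f"Copyright (c) {name} {year:d}"
line2 = f"Gross receipts: ${amount:,.2f}"   # thousands separator, 2 decimals
line3 = f"{'centered':@^20}"                # centered in 20 columns, '@' fill

print(line1)
print(line2)
print(line3)
```

Everything after the colon follows the same rules as in str.format(), so the examples above carry over unchanged.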
|
Indexing
Strings in Python support indexing, which lets you retrieve a single character of the string by its position. Positions are counted from 0, and negative positions count back from the end of the string. Some examples make the idea clear:
>>> "Hello, world!"[1]
'e'
>>> spam = "Hello, world!"
>>> spam[1]
'e'
>>> spam = "abc"
>>> spam[3]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> eggs = "Hello, world!"
>>> eggs[len(eggs)-1]
'!'
>>> spam = "I love Wikiversity!"
>>> spam[-1]
'!'
>>> spam[-2]
'y'
>>> spam = "Hello,"
>>> spam = spam + " world!"
>>> spam
'Hello, world!'
>>> spam = "Hello, world!"
>>> spam[3] = "y"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> spam[7] = " Py-"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> spam = "Hello, world!"
>>> spam = spam[:2] + "y" + spam[3:]
>>> spam
'Heylo, world!'
>>> spam = "Hello, world!"
>>> spam = spam[:6] + " Py-" + spam[7:]
>>> spam
'Hello, Py-world!'
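Because strings are immutable, every "change" builds a new string. The slice-and-concatenate pattern above can be wrapped in a small function; replace_at is a hypothetical helper written for this lesson, not part of Python's str API.

```python
def replace_at(text, start, end, new):
    """Return a copy of text with text[start:end] replaced by new.

    Hypothetical helper for illustration; not a built-in method.
    """
    return text[:start] + new + text[end:]

spam = "Hello, world!"
print(replace_at(spam, 2, 3, "y"))      # Heylo, world!
print(replace_at(spam, 6, 7, " Py-"))   # Hello, Py-world!
print(spam)                             # the original string is unchanged
```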
|
Slicing
Slicing is an important concept that you'll use often in Python. Slicing extracts a substring from a string. A substring is a string within a string, so "I", "love", and "Python" are all substrings of "I love Python.". When you slice, the colon (:) separates the start position, which is included, from the end position, which is not.
>>> spam = "I love Python."
>>> spam[0:1]
'I'
>>> spam[2:6]
'love'
>>> spam[7:13]
'Python'
>>> spam[0:3] == ( spam[0] + spam[1] + spam[2] )
True
>>>
Slicing like this is helpful, but what if you'd like everything from the start of a string up to some position, or everything after it? If you omit the start or the end of the slice, it defaults to the beginning or the end of the string:
>>> eggs = "Hello, world!"
>>> eggs[:6]
'Hello,'
>>> eggs[6:]
' world!'
>>> eggs = "Hello, world!"
>>> eggs[:6]+eggs[6:]
'Hello, world!'
>>> eggs[:6] + eggs[6:] == eggs
True
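The handling of out-of-range positions differs between indexing and slicing. Indexing raises an IndexError, while slicing quietly returns whatever part of the range exists, possibly an empty string.
>>> "Hiya!"[10]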
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> "Hiya!"[10:]
''
>>> "Hiya!"[10:11]
''
>>> "Hiya!"[:10]
'Hiya!'
>>> spam = "I love Wikiversity!"
>>> len(spam)
19
>>> spam[18]
'!'
>>> spam == spam[0:19]
True
>>>
>>> s1 = spam[-19:3] ; s2 = spam[3:9] ; s3 = spam[-10:10] ; s4 = spam[10:-1] ; s5 = spam[18] # Line 4 above
>>> spam == s1 + s2 + s3 + s4 + s5
True
>>>
>>> spam == s1[:] + s2[0:] + s3[-1:] + s4[0:8] + s5[-1:1]
True
>>> s4 == spam[10:18] == spam[10:-1] == spam[-9:18] == spam[-9:-1]
True
>>>
The last character, '!', can be obtained with several equivalent expressions:
>>> s5 = spam[-1:] ; s5 == '!'
True
>>> s5 = spam[-1] ; s5 == '!'
True
>>> s5 = spam[18:] ; s5 == '!'
True
>>> s5 = spam[18:19] ; s5 == '!'
True
>>> s5 = spam[18] ; s5 == '!'
True
>>>
>>> s5[-9:13] == '!' # slicing doesn't produce an error.
True
>>> s5[-9] # indexing produces an error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>>
>>> s5[12] # indexing produces an error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>>
a = 0b110001010110001111001000 # a is int
b = bin(a) # b is string
print ('a = ', b)
c = b[2:] # c is substring of b
print ('c = ', c)
insertPosition = len(c) - 4
print ('1)insertPosition = ' , insertPosition)
while insertPosition > 0 :
    print ('2)insertPosition = ' , insertPosition)
    c = c[:insertPosition] + '_' + c[insertPosition:]
    insertPosition -= 4
c = '0b_' + c
print ('c =', c)
(int(c,0) == a) or print ('error in conversion') # check result
Execute the Python code above and the result is:
a =  0b110001010110001111001000
c =  110001010110001111001000
1)insertPosition =  20
2)insertPosition =  20
2)insertPosition =  16
2)insertPosition =  12
2)insertPosition =  8
2)insertPosition =  4
c = 0b_1100_0101_0110_0011_1100_1000
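For comparison, a similar grouping can be produced without manual slicing. This sketch assumes Python 3.6 or later, where the format specification accepts '_' as a digit separator; note that it does not place an underscore directly after the 0b prefix, so the text differs slightly from the result above.

```python
a = 0b110001010110001111001000

# '#' adds the 0b prefix; '_' inserts an underscore every four binary digits.
grouped = format(a, '#_b')
print(grouped)

# int() with base 0 accepts the underscores, so the value round-trips.
print(int(grouped, 0) == a)
```

The slicing loop remains a useful exercise; the built-in form is simply shorter when the standard grouping is acceptable.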
|
Encoding
So we know what a string is and how it works, but what really is a string? The same text can be stored as different bytes depending on the encoding. The most prominent character encodings are ASCII and Unicode. ASCII is a simple encoding for the unaccented Latin letters plus digits, punctuation, and a few symbols. Unicode is far larger, with room for more than a million characters; its purpose is to provide one character set that can contain all of the world's alphabets, characters, and scripts. In Python 3, strings are Unicode by default, so we can put almost any character into a string and have it print correctly. This is great news for non-English text, because ASCII doesn't permit many types of characters. In fact, ASCII defines only 128 characters!
>>> print("Witaj świecie!")
Witaj świecie!
>>> print("Hola mundo!")
Hola mundo!
>>> print("Привет мир!")
Привет мир!
>>> print("שלום עולם!")
שלום עולם!
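To see the difference between characters and stored bytes, a string can be converted with str.encode() and back with bytes.decode(). The following is a minimal sketch that assumes UTF-8 as the target encoding; the variable names are illustrative.

```python
greeting = "Witaj świecie!"

utf8 = greeting.encode("utf-8")      # str -> bytes
print(len(greeting), len(utf8))      # characters versus bytes
print(utf8.decode("utf-8") == greeting)

# ASCII has no code for 'ś', so encoding to ASCII fails.
try:
    greeting.encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)
```

The byte count is larger than the character count because 'ś' needs two bytes in UTF-8, while each ASCII character needs one.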
A brief review of ASCII
Each ASCII character fits into one byte, specifically the least significant 7 bits of the byte.
Therefore each ASCII character has a value in the range 0x00 .. 0x7F. The built-in functions chr() and ord() convert between a value and its character:
>>> chr(65); chr(0x41)
'A'
'A'
>>> ord('a'); hex(ord('a'))
97
'0x61'
>>>
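The printable characters have values 0x20 .. 0x7E. The numbers '0' .. '9' have values 0x30 .. 0x39. The letters 'A' .. 'Z' have values 0x41 .. 0x5A. The letters 'a' .. 'z' have values 0x61 .. 0x7A. Control characters have values 0x00 .. 0x1F, plus 0x7F (delete). A control character such as ^A has the value of the letter 'a' or 'A' with all but the low 5 bits cleared:
>>> chr(0x01) == chr( ord('a') & 0x1F ) == chr( ord('A') & 0x1F )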
True
>>> chr(0x09) == chr( ord('i') & 0x1F ) == chr( ord('I') & 0x1F ) == '\t'
True
>>> chr(0x0C) == chr( ord('l') & 0x1F ) == chr( ord('L') & 0x1F ) == '\f'
True
>>>
Named control characters are:
\a Bell = ^G = 0x7
\b Backspace = ^H = 0x8
\f Form Feed = ^L = 0xc
\n Line Feed = ^J = 0xa
\r Carriage Return = ^M = 0xd
\t Horizontal Tab = ^I = 0x9
\v Vertical Tab = ^K = 0xb
See the table of Escape Sequences under "Escape Characters" above. To make sense of the named control characters, think of the 1970s and a typewriter that was 'modern' for the period. A proficient typist who did not look at the keyboard needed an audible alarm when the carriage approached the end of the line, hence "Bell". A Line Feed advanced the paper one line. A Form Feed advanced the paper to the next page. Perhaps the only control characters relevant today are Tab, Backspace and Return (which is interpreted as '^J').
num=0
while num <= 0x7F: print ( hex(num), '=', chr(num) ); num += 1
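For example, if you send the output to a file and open the file with emacs, you will see the control characters displayed in caret notation, such as ^A for chr(0x01).
A character can also be written as an octal escape:
>>> chr(0x80) == '\200'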
True
>>>
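Modern character sets
In times past, when hardware was expensive and the English-speaking world dominated computing, one character occupied seven bits (0-6) of one byte. Then bit 7 was used to provide 128 extra characters. Today a character value may need more than one byte, and Python string literals can express such values with hexadecimal escapes: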
'\xhh' = chr(0xhh) = '\u00hh' = '\U000000hh'
chr(0xhhhh) = '\uhhhh' = '\U0000hhhh'
chr(0x1hhhh) = '\U0001hhhh'
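where h is a hexadecimal digit. Each escape consumes a fixed number of digits: \x takes two, \u takes four, and \U takes eight.
>>> '\x10be' == '\x10' + 'be'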
True
>>>
>>> '\u004110be' == '\u0041' + '10be' == 'A10be'
True
>>>
>>> '\x61' == 'a' == '\u0061' == '\U00000061' == chr(0x61)
True
>>>
>>> ' \U000010be ' == ' Ⴞ ' == ' \u10be ' == ' ' + chr(0x10BE) + ' '
True
>>>
>>> ' \u004110be '
' A10be '
>>> ' \U004110BE '
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-10: illegal Unicode character
>>>
>>> '\u0041\u0042\u0043\u0044'
'ABCD'
>>>
>>> '\U0041\U0042\U0043\U0044'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
>>>
>>> '\U00000041\U00000042\U00000043\U00000044'
'ABCD'
>>>
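The three escape forms can be seen side by side in the sketch below, which also shows how many bytes each character needs once encoded. UTF-8 is an assumption here, chosen because it is a common encoding; the variable names are illustrative.

```python
# 'a', a Georgian letter, and a character above 0xFFFF.
text = "\x61\u10be\U00010022"

for ch in text:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} needs {len(encoded)} byte(s) in UTF-8")

print(len(text))   # len() counts characters, not bytes
```

Whatever escape is used to write it, each item is a single character of the string; only its encoded size differs.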
String as
|
Common String methods
In the lesson above we've seen some string methods in action, such as format(). More of the common ones are shown here.
Information about string
>>> '123456'.isnumeric()
True
>>>
>>> ' abcd efg abc '.isalnum()
False # not alpha-numeric
>>>
>>> 'abcd123efg789abc'.isalnum()
True # all alpha-numeric
>>>
>>> 'abcdABCefgZYXabc'.isalpha()
True # all alphabetic
>>>
>>> '01234567'.isdecimal()
True # all decimal
>>>
>>> ' '.isspace()
True
>>> ''.isspace()
False
>>>
Substring within string
Existence of substring
>>> 'abc' in ' abc 123 '
True
>>> 'abcd' in ' abc 123 '
False
>>>
Position of substring
>>> '   abcd efg abc   '.find('c')
5 # found 'c' in position 5
>>>
>>> '   abcd efg abc   '.find('x')
-1 # did not find 'x'
>>>
>>> '   abcd efg abc   '.find('c',7)
14 # found 'c' at or after position 7 at position 14.
>>>
>>> '   abcd efg abc   '.find('c',7,12)
-1 # did not find 'c' in range specified
>>>
>>> '   abcd efg abc   '.index('c',7)
14 # same as 'find()' above unless not found.
>>>
>>> '   abcd efg abc   '.index('c',7,12)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: substring not found
>>>
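The optional start argument of find() makes it possible to collect every position of a substring. The function below is a hypothetical helper sketched for this lesson, not a built-in method; it reports non-overlapping matches only.

```python
def find_all(text, sub):
    """Return the start positions of every non-overlapping match of sub.

    Hypothetical helper built on str.find(); not part of the str API.
    """
    if not sub:
        raise ValueError("empty substring")
    positions = []
    start = text.find(sub)
    while start != -1:              # find() returns -1 when nothing is left
        positions.append(start)
        start = text.find(sub, start + len(sub))
    return positions

sample = '   abcd efg abc   '
print(find_all(sample, 'c'))     # [5, 14]
print(find_all(sample, 'abc'))   # [3, 12]
print(find_all(sample, 'x'))     # []
```

Using find() rather than index() keeps the loop free of exception handling, since a missing substring is reported as -1.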
Formatting the string
>>> ' 1234 '.strip()
'1234' # remove leading and trailing whitespace
>>>
>>> '\t2345\t56\t678'.expandtabs()
'        2345    56      678' # default tab size is 8.
>>>
>>> '\t2345\t56\t678'.expandtabs(12)
'            2345        56          678'
>>>
>>> 'abc def'.zfill(30)
'00000000000000000000000abc def' # left fill with zeroes.
>>>
>>> 'abcd'.center(30)
'             abcd             ' # center given string in a length of 30
>>>
>>> ' abcd '.center(30,'+')
'++++++++++++ abcd ++++++++++++' # and fill with '+'
>>>
>>> 'value = {}'. format(6)
'value = 6'
>>>
Splitting the string
>>> ' 1234 456 23456 '.split()
['1234', '456', '23456'] # retain non-whitespace in a list
>>>
>>> ' 1234.567e-45 '.partition('e')
(' 1234.567', 'e', '-45 ') # split into 3 around string supplied
>>>
>>> '\n'.join(['abc\n','def','012\n\n']).splitlines(keepends=True)
['abc\n', '\n', 'def\n', '012\n', '\n']
>>>
Miscellaneous
>>> ' The quick brown fox .... '.replace('quick','lazy')
' The lazy brown fox .... '
>>>
>>> '_'.join(['abc','def','012'])
'abc_def_012'
>>>
Methods may be chained
>>> ' 1234.567E-45 '.strip().lower().partition('e')
('1234.567', 'e', '-45')
>>>
>>> ' The quick brown fox .... '.replace('quick','lazy').upper().split()
['THE', 'LAZY', 'BROWN', 'FOX', '....']
>>>
Methods recognize international text
>>> 'Βικιεπιστήμιο'.isupper()
False
>>> 'Βικιεπιστήμιο'.upper()
'ΒΙΚΙΕΠΙΣΤΉΜΙΟ'
>>> 'Βικιεπιστήμιο'.lower()
'βικιεπιστήμιο'
>>> 'Βικιεπιστήμιο'.isalpha()
True
>>>
|
More operations on strings
At this point we're familiar with strings as perhaps a single line. But strings can be much more than a single line. The whole of "War and Peace" could be a single string. In this part of the lesson we'll look at "paragraphs", where a paragraph contains one or more non-blank lines. Consider this string:
>>> a = '\n\n line 1 \n line 2 \n line 3 \n\n'
>>> print (a)
line 1
line 2
line 3
>>>
This string contains a paragraph surrounded by messy white space. We'll improve the appearance of the string by removing insignificant white space. First, strip both ends of the string:
>>> a = a.strip() ; a
'line 1 \n line 2 \n line 3'
>>>
>>> print (a)
line 1
line 2
line 3
Remove insignificant white space around each line:
>>> L1 = a.splitlines() ; L1
['line 1 ', ' line 2 ', ' line 3'] # a list containing 3 lines.
>>> for p in range(len(L1)-1, -1, -1) : L1[p] = L1[p].strip()
...
>>> L1
['line 1', 'line 2', 'line 3'] # each line has had beginning and ending white space removed, including '\n'.
>>>
>>> s = ''.join(L1) ; s # lines joined with ''.
'line 1line 2line 3'
>>>
>>> print (s)
line 1line 2line 3
>>>
>>> s = '\n'.join(L1) ; s # lines joined with '\n'.
'line 1\nline 2\nline 3'
>>> print (s)
line 1
line 2
line 3 # a clean paragraph
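The same clean-up can be written more compactly. This is one possible sketch using a generator expression in place of the index-based loop; the name clean is illustrative.

```python
a = '\n\n line 1 \n line 2 \n line 3 \n\n'

# Strip the whole string, strip each line, then rejoin with newlines.
clean = '\n'.join(line.strip() for line in a.strip().splitlines())

print(repr(clean))
print(clean)
```

Iterating over the lines directly avoids the range(len(...)) bookkeeping, because no line is deleted while looping.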
The next string is a "page" where a page contains two or more paragraphs.
>>> a = '''
...
...
... paragraph 1, line 1
... paragraph 1, line 2
...
...
...
... paragraph 2, line 1
... paragraph 2, line 2
... paragraph 2, line 3
...
...
... paragraph 3, line 1
...
...
... '''
>>>
With this page we'll do the same as above, that is, remove insignificant white space.
>>> b = a.strip() ; print (b)
paragraph 1, line 1
paragraph 1, line 2



paragraph 2, line 1
paragraph 2, line 2
paragraph 2, line 3


paragraph 3, line 1
>>>
Remove white space around each line (including blank lines):
>>> L1 = b.splitlines()
>>>
>>> for p in range(len(L1)-1, -1, -1) : L1[p] = L1[p].strip()
...
>>> print ('\n'.join(L1))
paragraph 1, line 1
paragraph 1, line 2



paragraph 2, line 1
paragraph 2, line 2
paragraph 2, line 3


paragraph 3, line 1
>>>
Remove extraneous lines between paragraphs:
>>> for p in range(len(L1)-1, 0, -1) : # terminator here is 0.
... if len( L1[p] ) == len( L1[p-1] ) == 0 : del L1[p]
...
>>> print ('\n'.join(L1))
paragraph 1, line 1
paragraph 1, line 2

paragraph 2, line 1
paragraph 2, line 2
paragraph 2, line 3

paragraph 3, line 1
>>>
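The three clean-up steps can be gathered into one function. The sketch below is an illustration only; tidy_page is a hypothetical name, and it builds a new list instead of deleting items from the list being traversed.

```python
def tidy_page(page):
    """Strip each line and collapse runs of blank lines into one.

    Hypothetical helper; the name and approach are illustrative only.
    """
    lines = [line.strip() for line in page.strip().splitlines()]
    result = []
    for line in lines:
        # Skip a blank line when the previously kept line is already blank.
        if line == '' and result and result[-1] == '':
            continue
        result.append(line)
    return '\n'.join(result)

page = '\n\n  p1, line 1 \n p1, line 2\n\n \n\n p2, line 1\n\n\n'
print(tidy_page(page))
```

Building a new list sidesteps the need to loop backwards, which the deletion-based version requires so that indices stay valid.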
Complicated strings simplified
If you are working with strings containing complicated sequences of escaped characters, or if the whole concept of escaped characters is difficult, you might try:
>>> back_slash = r'\ '[:1] ; back_slash ; len(back_slash) # raw string avoids an invalid-escape warning
'\\'
1
>>>
>>> new_line = """
... """ # Between """ and """ there is exactly one return.
>>> new_line ; len(new_line)
'\n'
1
>>>
>>> tab = ' ' ; tab ; len(tab) # Between ' and ' there is exactly one tab.
'\t'
1
>>>
>>> hex(ord(back_slash))
'0x5c'
>>> hex(ord(new_line))
'0xa'
>>> hex(ord(tab))
'0x9'
>>>
Then you can build long strings:
>>> 'abc' + back_slash*7 + '123' + (back_slash*5 + new_line*3)*5 + 'xyz'
'abc\\\\\\\\\\\\\\123\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\nxyz'
>>>
You can put the back_slash at the end of a string:
>>> a = 'abc' + back_slash ; a ; len (a)
'abc\\'
4
>>>
If you have a long string, splitting it might help to reveal significant parts:
>>> a = 'abc' + back_slash*7 + '123' + (back_slash*5 + new_line*3)*5 + 'xyz'
>>> a.split(back_slash)
['abc', '', '', '', '', '', '', '123', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\nxyz']
>>>
>>> a.split(back_slash*5)
['abc', '\\\\123', '\n\n\n', '\n\n\n', '\n\n\n', '\n\n\n', '\n\n\nxyz']
>>>
>>> a.split(back_slash*5 + new_line*3)
['abc\\\\\\\\\\\\\\123', '', '', '', '', 'xyz']
>>>
|
Assignments
It seems that characters with numeric values greater than 0xFFFF are not displayed consistently, because the result depends on the fonts and software in use. Consider character '\U00010022'. This displays on the interactive Python command line as a little elephant with three legs, within emacs and on the Unix command line as a Greek delta (almost), and in Wikiversity as, well, that depends.
>>> '\U00010022'
'𐀢' # Copied from Python in interactive mode.
>>> '\U00010022' == chr(0x10022) == '𐀢'
True
>>>
>>> '\U+10022'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape
>>>
Within interactive Python, when you move the cursor over the character '𐀢', it steps over one character. When you move the cursor over the escape sequence '\U00010022', it steps over ten characters. Copied from emacs, '\U00010022' may appear as mojibake such as 'ð'.
|
Further Reading or Review
|
References
4. Python's documentation:
"3.1.2. Strings", "4.7.1. String Methods", "String and Bytes literals", "String literal concatenation", "Formatted string literals", "Format String Syntax", "Standard Encodings", "Why are Python strings immutable?", "Why can’t raw strings (r-strings) end with a backslash?", "Unicode HOWTO"
5. Python's methods:
"bytes.decode()", "str.encode()"
6. Python's built-in functions: