Python Strings: A Character Might Not Be the Character You Thought

Have a deeper understanding of Python strings

Yong Cui

Better Programming

May 10, 2021 (Updated: January 6, 2022)

Text is one of the most common forms of data in almost every application. In Python, text is represented by strings, which are instances of the str class. Those who have used Python for some time should have a good understanding of the basic string features. One such feature is that Python strings are sequences of characters, a statement so natural that we may take it for granted. However, have you ever wondered what exactly a character in a string is?

Let's look at the following example first. Can you guess the length of the string text?

text = 'è'
print(len(text))

If you run the code in your console, you'll find out that 'è' has a length of two. Does that surprise you?

Probably yes? If you're not sure about the result, consider creating a list object by passing the string, as below. This operation is legal, because strings are iterables of characters, and the list constructor can take an iterable.

>>> list('è')
['e', '̀']

You'll see that there are two characters in the text string, which form a two-item list, as shown above. The second character looks funny because it's applied on top of the first single quote instead of occupying the position between the single quotes. Interested in knowing why this might have happened? Let's move on.

Characters — The Unicode Code Points

Before discussing the code that we've seen, it's necessary to know that Python strings consist of characters and the characters are actually Unicode characters, or, more precisely speaking, code points.

In case you're not familiar with it, Unicode is a widely adopted standard for representing and handling text. It defines well over 100,000 characters that cover almost all written languages. Each Unicode character is identified by a code point, which is written as U+ followed by four to six hexadecimal digits.

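In Python, when we create a string, we can write Unicode code points directly in the string literal. The syntax is to prefix the code point's four hexadecimal digits with \u (code points above U+FFFF use \U followed by eight hexadecimal digits instead). Observe some examples below.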

>>> '\u0045'
'E'
>>> '\u0065'
'e'
>>> '\u2660'
'♠'

As you can see, all these Unicode code points are correctly converted to the characters that they represent. Although these string literals are written with several characters in the source code, if you check the lengths of the resulting strings, you'll find that each contains just one.

>>> len('\u0045')
1
>>> len('\u0065')
1
>>> len('\u2660')
1

As you can see, all of these strings have a length of one. The reason is simple: Each of these strings has just one Unicode code point, which is equivalent to one character.
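To make the link between a character and its code point more concrete, here is a minimal sketch that prints a code point in the U+ notation. The helper name code_point_label is hypothetical and introduced only for illustration; ord and chr are Python built-ins.

```python
def code_point_label(char):
    """Return the U+XXXX label of a one-character string (hypothetical helper)."""
    # ord() returns the code point as an integer; format it with at least
    # four uppercase hexadecimal digits, as in the U+ notation.
    return f"U+{ord(char):04X}"


print(code_point_label("E"))        # U+0045
print(code_point_label("\u2660"))   # U+2660

# chr() goes the other way: from an integer code point to a string.
print(chr(0x2660) == "\u2660")      # True

# Code points above U+FFFF need the eight-digit \U escape,
# yet the resulting string still has a length of one.
print(len("\U0001F40D"))            # 1
```

Formatting with :04X pads short values to four digits but lets longer code points, such as U+1F40D, keep all of their digits.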

Special Characters

We know that Python has some special characters, which are typically written with a backslash prefix. For instance, \t represents a horizontal tab and \n represents a new line. Have you ever wondered whether they're single characters (having a length of one)?

>>> tab = "\t"
>>> len(tab)
1
>>> newline = "\n"
>>> len(newline)
1

As shown above, they're both single characters. By now, if you understand that characters in Python strings are Unicode characters, you'll probably expect that these special characters should have their corresponding Unicode code points. More importantly, you can predict that they're equivalent because behind the scenes, they're all Unicode code points. Observe the effects below.

>>> "\u0009" == tab == "\t"
True
>>> "\u000A" == newline == "\n"
True

In short, special characters are special in the sense that they don't mean what they appear to (\t is not a backslash followed by the letter t). However, special characters are not so special in that they're Unicode code points just like other regular characters.
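One way to see the difference between an escape sequence and the characters used to spell it is to compare a regular string literal with a raw one. The following is a small sketch that uses only built-ins; the variable names are illustrative.

```python
escaped = "\t"   # one code point: U+0009, the horizontal tab
raw = r"\t"      # raw literal: a backslash followed by the letter t

print(len(escaped))                 # 1
print(len(raw))                     # 2
print([hex(ord(c)) for c in raw])   # ['0x5c', '0x74']
print(escaped == chr(9))            # True
```

The raw literal keeps the backslash and the letter as two separate code points, while the regular literal turns them into the single tab character.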

Addressing the Question

Now we're ready to address the question that we raised in the beginning: Why does the string text have a length of two?

You can probably guess the answer by now: the string is made up of two Unicode code points.

Believe it or not, a character (from what it appears to be, such as 'è') can consist of two underlying Unicode characters. To find out the underlying code points, we can use the built-in ord function. This function returns the integer value of the code point of a Unicode character that is passed to the function, as shown below. Please note that as mentioned, the string 'è' has two characters, and thus we'll run the function on each of the two characters.

>>> ord('è'[0])
101
>>> ord('è'[1])
768

If you look up these values in any online Unicode table, you'll find that the two code points are named Latin Small Letter E and Combining Grave Accent, and that their hexadecimal values are 65 and 300 (that is, U+0065 and U+0300), respectively. With these values, we can reconstruct the string from its Unicode code points.

>>> "\u0065\u0300"
'è'
>>> 'è' == "\u0065\u0300"
True

As a side note, the code point Combining Grave Accent is unusual because it's rendered on top of the preceding character where applicable. For this reason, "\u0065\u0300" is displayed as what looks like one character.
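Instead of consulting an online table, the names can also be looked up from Python itself. The sketch below assumes the standard library's unicodedata module and writes the string with escapes so that its two code points are unambiguous.

```python
import unicodedata

text = "\u0065\u0300"   # the letter e followed by the combining grave accent

for char in text:
    print(
        f"U+{ord(char):04X}",
        unicodedata.name(char),
        # Canonical combining class: 0 for an ordinary letter,
        # nonzero for a combining mark such as the grave accent.
        unicodedata.combining(char),
    )
# U+0065 LATIN SMALL LETTER E 0
# U+0300 COMBINING GRAVE ACCENT 230
```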

Don't Be Fooled by the Appearance

We have been using the example of the string 'è'. It's possible that some people may have thought that it's the Latin letter e with the grave accent. Although they do appear the same, they're interpreted differently by Python.

>>> 'è' == 'è'
False

The result shouldn't be surprising to you, because the Latin letter e with the grave accent is a single Unicode code point, as shown below.

>>> '\u00E8'
'è'

With the understanding of Python strings as sequences of Unicode code points, you won't be surprised that the comparison above evaluates to False: the former string has two code points while the latter has only one.
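When two strings look identical but differ in their code points, one common way to compare them is to normalize both to the same form first. The following is a minimal sketch, assuming the standard library's unicodedata.normalize function; the variable names are illustrative only.

```python
import unicodedata

decomposed = "\u0065\u0300"   # e + combining grave accent (two code points)
precomposed = "\u00E8"        # e with grave accent as a single code point

print(decomposed == precomposed)   # False

# NFC composes characters where a precomposed form exists; NFD decomposes them.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(len(unicodedata.normalize("NFC", decomposed)))             # 1
```

Which form to normalize to depends on the application; what matters for comparison is that both sides use the same one.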

Conclusion

In this article, the central idea that we reviewed is that characters in Python strings are Unicode code points. What appears on screen as a single character may be one code point or several, and it's the underlying code points, not the appearance, that determine a string's length and whether two strings are equal.
