How can we tokenize a string into a list based on language? For example, if the string is "hi 안녕하세요 반가워요 hello", the resulting list should be ["hi", "안녕하세요 반가워요", "hello"].

To tokenize a string into a list by language, we can use a similar approach as before, using regular expressions to split the text based on Unicode character properties. Here's an example that tokenizes a string into a list by language using regular expressions:
```python
import re
```
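One way to fill this in, as a sketch rather than a definitive implementation: classify each whitespace-separated word as Korean or not (here using the Hangul syllable range U+AC00 to U+D7A3; the function name and exact ranges are assumptions), then merge consecutive words with the same classification using `itertools.groupby`:

```python
import re
from itertools import groupby

def tokenize_by_language(text):
    """Split text into runs of words that share the same script.

    A word is treated as Korean if it contains any Hangul syllable
    (U+AC00 to U+D7A3); everything else falls into one "other" group.
    Consecutive words with the same classification are joined back
    together with spaces.
    """
    def is_korean(word):
        # Assumption: checking only precomposed Hangul syllables;
        # extend the character class for Jamo if needed.
        return bool(re.search(r"[\uac00-\ud7a3]", word))

    words = text.split()
    return [" ".join(group) for _, group in groupby(words, key=is_korean)]

print(tokenize_by_language("hi 안녕하세요 반가워요 hello"))
# ['hi', '안녕하세요 반가워요', 'hello']
```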
This will output the following list of tokens: ['hi', '안녕하세요 반가워요', 'hello'].
Note that for Korean text, we use the regular expression `[^\s]+` to match any sequence of non-whitespace characters, which includes punctuation such as periods and commas. We then use the `re.sub` function to remove any remaining punctuation from each token.
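The punctuation-stripping step described above might look like the following; the exact character class is an assumption covering a few common ASCII marks, not a complete punctuation set:

```python
import re

# Strip trailing/embedded punctuation from each token after splitting.
# The character class [.,!?;:] is an illustrative assumption; a fuller
# solution could match the Unicode punctuation category instead.
tokens = ["안녕하세요.", "반가워요,", "hello!"]
cleaned = [re.sub(r"[.,!?;:]", "", t) for t in tokens]
print(cleaned)
# ['안녕하세요', '반가워요', 'hello']
```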