UTF-8 is not well suited for a general purpose string implementation because it ...

vardump · on May 19, 2015

UTF-16 is also a variable length encoding and addressing a character is a linear time operation. Then again, even UTF-32 can have composed characters, such as ¨a separately forming ä.

danbruc · on May 19, 2015

True but UTF-16 captures really a very large share of actually used code points. UTF-32 captures all of them but wastes at least 11 bit per code point, more right now because most of the code points are unassigned. It seems a good tradeoff for a lot of use cases and as you mentioned once operating on code points is no longer good enough you will have to face the issue regardless of the encoding.

vardump · on May 19, 2015

UTF-8 consumes less space and has pretty much same trade-offs in terms of iteration. No worries about endian (UTF-16LE vs UTF-16BE). Almost all input text is in UTF-8 and so is almost all output text. Conversion back and forth to UTF-16 is just wasted CPU time.

I think even counting number of UTF-8 code points in a string is faster in UTF-8 than in UTF-16, if you're allowed to use SSE2/AVX2/AVX-512, because all UTF-8 sequences start with a byte that has highest bit 0, all other bytes in the sequence have highest bit 1.

So just SIMD vector compare [1] to find all "positive" bytes (highest bit == 0), which gives you a nice mask. Then move the mask to a general purpose register [2] and popcount [3] it. 16/32/64 (SSE2/AVX2/AVX-512) bytes processed at a time, no branches other than loop control branch.

You can use the same idea to quickly scan UTF-8 string to approximately right position to retrieve a given (random) code point index. Still O(n), but with 10-50x smaller constant factor. If that's not enough, you can simply pre-index every n-th code point (say, every 64/128/256th) in a separate array for larger UTF-8 strings. That gives you constant time random access.

[1]: http://www.felixcloutier.com/x86/PCMPGTB:PCMPGTW:PCMPGTD.htm...

[2]: http://www.felixcloutier.com/x86/PMOVMSKB.html

[3]: http://www.felixcloutier.com/x86/POPCNT.html

Note: UTF-8 code points start either 0xxxxxxx or 11xxxxxx. Regardless of this the basic idea should work, just need to do two compares and to bitwise-or the masks. At the end, AVX-512 of course would use mask-register (k0-k7) and AVX2 would probably need to convert the mask in two parts, once for both 128-bit register halves.

Note 2: Thinking about it a bit more, I think it's enough to check if signed bytes are greater or equal than -64 (0xc0)! This covers bit patterns from 11000000 (-64) to 01111111 (127); all the sequences that can start a sequence. So no two compares and bitwise-or needed after all.

elFarto · on May 20, 2015

This page has the best implementation I've seen: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...

It basically counts continuation bytes (which all start 10xxxxxx) and substracts, rather than trying to count characters.

Additionally, if you know how many bytes are in the string, you can remove the check for the null terminator.

vardump · on May 20, 2015

Tested my idea quickly. Initial messy (but correct) version finishes in 31% of time that version you linked [1] (cp_strlen_utf8) takes to run. I think I can still improve it quite a bit.

Only tested with long 30 MB strings.

Edit: Now at 26%. But it can still be improved more. Both benchmarks are with hot cache.

  new_strlen_utf8 12352856 clock cycles
  cp_strlen_utf8  47544818 clock cycles

Edit 2: Well, 4x performance 32 bit, but compiling it 64-bit in VS2015RC manages to optimize cp_strlen_utf8 more, almost doubling performance. 45% then. Will try gcc 5, clang, etc. later. And it can still be optimized further.

Edit 3: Ended up at 35% (2.9x) execution time for 64-bit and 18% (5.5x) for 32-bit. My version is as fast in 32 and 64-bit, but cp_strlen_utf8 benefits quite a bit from 64-bit mode. Probably memory bandwidth limited at this point, but I didn't profile yet. In any case, it does utf-8 code point strlen at 16 GB/s at this point. CPU is i5-4430 CPU @ 3.00GHz, two memory channels @1600 MHz.

[1]: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...

deathanatos · on May 20, 2015

> and therefore addressing a character becomes a linear time operation. UTF-16 would probably be a better choice in most cases.

> but UTF-16 captures really a very large share of actually used code points.

That most characters[1] in use are a single code unit in UTF-16 is meaningless to code that needs to index[2] into a UTF-16 by code point (or grapheme): the only correct way to accomplish this in a typical UTF-16 string implementation is O(n).

[1]: I love emoji, and they are outside the BMP.

[2]: I think you'll find that most code does not need to index into a string. (Though languages that lack iterators on strings will make writing the code without indexing difficult.)

ta0967 · on May 19, 2015

UTF-8, UTF-16 and UTF-32 are different encodings of the same character set, so i'm not sure what you mean in UTF-16 captures really a very large share of actually used code points.

next, your claim that UTF-8 is not well suited for a general purpose string implementation because it is a variable length encoding and therefore addressing a character becomes a linear time operation is incoherrent: UTF-16 is a variable length encoding just as well. come out and say that you want to be lazy and pretend surrogate pairs don't exist.

kibibu · on May 19, 2015

Then you're not actually talking about UTF-16 but rather UCS-2

dlubarov · on May 19, 2015

Well now that we have supplementary planes, UTF-16 should also be considered variable-length.

vbezhenar · on May 20, 2015

I used Rust and Swift languages a little bit and I must admit that I almost never had to use indexing for characters. Iterating and slicing is enough for most algorithms.

Actually today it might be hard to understand what indexing is, even if you store your string in UCS-32 encoding. There are graphical symbols that may occupy variable number of UCS-32 items. And they are used out there (e.g. flags).