UTF-8 and UTF-16 are different encodings for the Unicode character set.
Let's discuss UTF-8 first. UTF-8 is what is known as a variable-length encoding. This means that the amount of storage a character takes up depends on which character it is. For example, if we store the character A, it takes up only one byte. In fact, ASCII is a subset of UTF-8: every ASCII character is stored as the same single byte in UTF-8. That means UTF-8 can read ASCII data unchanged.
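To make the ASCII point concrete, here is a small Python sketch. The variable names are just for illustration. It encodes the same text both ways and checks that the bytes match.

```python
# Encode the same character as ASCII and as UTF-8.
text = "A"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)  # True: the byte is identical
print(len(utf8_bytes))            # 1

# Bytes that were written as ASCII decode cleanly as UTF-8.
legacy = b"Hello, world"
decoded = legacy.decode("utf-8")
print(decoded)
```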
If you are new to computer storage, a byte is a very small amount of information. The smallest thing a computer can store is a bit: 1 or 0, on or off. There are 8 bits in a byte, 1024 bytes in a kilobyte, 1024 kilobytes in a megabyte, 1024 megabytes in a gigabyte, 1024 gigabytes in a terabyte, and 1024 terabytes in a petabyte. Considering that it is completely possible for a database to be multiple petabytes, you can understand that a byte is very small.
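If you want to see how those units stack up, the arithmetic can be sketched in Python. The constant names are made up for this example, and it uses the 1024-based units described above.

```python
# Storage units, each 1024 times the previous one.
BITS_PER_BYTE = 8
KILOBYTE = 1024
MEGABYTE = 1024 * KILOBYTE
GIGABYTE = 1024 * MEGABYTE
TERABYTE = 1024 * GIGABYTE
PETABYTE = 1024 * TERABYTE

print(PETABYTE)                  # 1125899906842624 bytes in one petabyte
print(PETABYTE * BITS_PER_BYTE)  # the same amount expressed in bits
```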
If you store a character outside the ASCII range, such as an accented letter or a character from a non-Latin script, its size in UTF-8 increases to 2, 3, or 4 bytes.
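Here is a quick Python sketch of that growth. The four sample characters are my own picks (A, an accented e, the euro sign, and an emoji), written as escape codes so the example does not depend on how this text itself is encoded.

```python
# One sample character for each UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character needs once encoded as UTF-8.
utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]

for ch, size in zip(samples, utf8_sizes):
    print(f"U+{ord(ch):04X} takes {size} byte(s) in UTF-8")
```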
If you think back to when we used the VARCHAR data type, we passed in 50 CHAR. The reason we throw in that CHAR is that by default Oracle measures the length in bytes, so a plain 50 means 50 bytes. Now you can understand why adding the CHAR might be important. If a character can take up multiple bytes, a 50-byte limit cannot guarantee room for 50 characters.
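The difference between counting bytes and counting characters can be sketched in Python. To be clear, fits_in_column is a hypothetical helper written for this example, not anything from Oracle, and it assumes the data is stored as UTF-8. Python's len counts code points, which is only an approximation of how a database counts characters.

```python
def fits_in_column(value, limit, semantics="BYTE", encoding="utf-8"):
    """Hypothetical check: would value fit in a column of the given size?"""
    if semantics == "CHAR":
        # Count characters (code points in Python).
        return len(value) <= limit
    # Count the bytes the value needs once encoded.
    return len(value.encode(encoding)) <= limit

# 30 accented characters: 30 characters, but 60 bytes in UTF-8.
name = "\u00e9" * 30

print(fits_in_column(name, 50, "CHAR"))  # True
print(fits_in_column(name, 50, "BYTE"))  # False
```

So the same 30-character value fits a 50-character limit but overflows a 50-byte limit.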
Now, on to UTF-16. UTF-16 is also a variable-length encoding, but it differs in that each character takes either 2 or 4 bytes. That means storing an A now takes two bytes rather than one. Even though a byte is so small, when you are storing billions of characters, an unnecessary byte per character adds up to a lot of wasted storage. We can only represent so many characters with 2 bytes (65,536 values at most). When we run out of options, we move to four bytes to allow for other characters.
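You can see the 2-byte and 4-byte cases in a short Python sketch. It uses the "utf-16-le" codec on purpose, because Python's plain "utf-16" codec adds a 2-byte byte order mark at the front, which would throw off the counts. The emoji is just a sample of a character that needs 4 bytes.

```python
# A fits in 2 bytes; the emoji U+1F600 needs 4 bytes.
letter = "A"
emoji = "\U0001F600"

letter_utf16 = letter.encode("utf-16-le")
emoji_utf16 = emoji.encode("utf-16-le")

print(len(letter_utf16), letter_utf16.hex())  # 2 4100
print(len(emoji_utf16), emoji_utf16.hex())    # 4 3dd800de

# For comparison, the same letter in UTF-8 is a single byte.
print(len(letter.encode("utf-8")))            # 1
```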
Which do we use? It often depends on what platform you are on and which languages you are working with. For example, if you are working with East Asian languages such as Chinese or Japanese, UTF-16 stores most common characters in 2 bytes while UTF-8 stores them in 3 bytes, so you could save space by using UTF-16. Additionally, UTF-16 fits naturally when you are writing code in Java or on Microsoft .NET, because both use UTF-16 for their strings. (UCS-2 is an older fixed-width 2-byte encoding that UTF-16 extends; it cannot represent the 4-byte characters.) Other than that, UTF-8 will be the one you want.
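A rough way to check the space trade-off for your own data is to encode a sample both ways and compare. This Python sketch uses a hypothetical helper called encoded_sizes and two short sample strings of my own choosing; real text will be a mix, so treat the numbers as an illustration only.

```python
def encoded_sizes(text):
    """Return the byte count of text in UTF-8 and in UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

english = "Hello"
japanese = "\u65e5\u672c\u8a9e"  # three Japanese characters

print(encoded_sizes(english))   # {'utf-8': 5, 'utf-16': 10}
print(encoded_sizes(japanese))  # {'utf-8': 9, 'utf-16': 6}
```

English text is half the size in UTF-8, while the Japanese sample is smaller in UTF-16.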
Now that we have built a pretty good foundation in character sets, we can continue our discussion of data types.
More content: http://CalebCurry.com