Unicode Converter

Convert text to Unicode code points and back. View character codes, decode escape sequences, and work with different Unicode formats.

How to Use the Unicode Converter

Enter text: Type or paste any text into the left input, including emojis, special characters, and text in any language.
Choose format: Select your preferred Unicode output format: U+ notation, JavaScript escapes, HTML entities, decimal, or hexadecimal.
Convert to Unicode: Click "To Unicode" to see each character's code point. The output shows the numeric representation of each character.
Convert to text: Paste Unicode codes into the left panel and click "To Text" to decode them back to readable characters.

What is Unicode?

Unicode is a universal character encoding standard that assigns a unique number (a code point) to each character it covers: letters and ideographs from the world's writing systems, plus symbols, emojis, and control characters. Before Unicode, different systems used incompatible encoding schemes, which made international text exchange difficult. Unicode replaces them with a single system that, as of Unicode 15, covers more than 149,000 characters across 161 scripts.

Each character has a code point written as U+ followed by a hexadecimal number. For example, the letter 'A' is U+0041, and the emoji 🌍 is U+1F30D. Unicode enables computers to reliably represent and exchange text in any language, making modern international software possible.
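As a rough illustration of the U+ notation, the sketch below converts text to code points and back in Python. The function names text_to_codes and codes_to_text are made up for this example and are not part of the converter itself.

```python
import re

def text_to_codes(text):
    """Return the U+ notation for each code point in text, space-separated."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def codes_to_text(codes):
    """Decode space-separated U+ tokens (for example 'U+0041') back to text."""
    chars = []
    for token in codes.split():
        match = re.fullmatch(r"[Uu]\+([0-9A-Fa-f]{1,6})", token)
        if match is None:
            raise ValueError(f"not a U+ code point: {token!r}")
        chars.append(chr(int(match.group(1), 16)))
    return "".join(chars)

print(text_to_codes("A\U0001F30D"))      # U+0041 U+1F30D
print(codes_to_text("U+0041 U+1F30D"))   # the letter A followed by the globe emoji
```

Code points are padded to at least four hex digits, which matches the convention used throughout this page.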

Unicode Formats Explained

U+ notation (U+0048): The standard Unicode format showing code points in hexadecimal. Most readable and commonly used in Unicode documentation and references.

JavaScript escape (\u0048): Used in JavaScript, Java, and JSON strings to represent Unicode characters. Characters above U+FFFF need a surrogate pair such as \uD83C\uDF0D; JavaScript (ES6 and later) also accepts the \u{1F30D} syntax, which Java and JSON do not.

HTML entity (&#72; or &#x48;): Used in HTML and XML. Decimal (&#72;) or hexadecimal (&#x48;) numeric character references that browsers convert to the actual character.

Decimal (72): The code point as a base-10 number. Simple and clear, often used in programming when working with character codes directly.

Hexadecimal (0x48): Code point in base-16 with 0x prefix. Common in low-level programming, memory dumps, and technical documentation.
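The five formats above are different spellings of one number. A minimal Python sketch, assuming a hypothetical helper named unicode_formats, shows them side by side for a single character:

```python
def unicode_formats(ch):
    """Return one code point in each of the notations listed above."""
    cp = ord(ch)
    # A four-digit escape covers values up to U+FFFF; larger values use
    # the ES6 brace form in this sketch.
    js = f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\u{{{cp:X}}}"
    return {
        "u_plus": f"U+{cp:04X}",
        "js_escape": js,
        "html_decimal": f"&#{cp};",
        "html_hex": f"&#x{cp:X};",
        "decimal": str(cp),
        "hex": f"0x{cp:X}",
    }

for name, value in unicode_formats("H").items():
    print(f"{name:13} {value}")
```

For 'H' this prints U+0048, \u0048, &#72;, &#x48;, 72, and 0x48.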

Common Use Cases

Debug encoding issues: When text displays as gibberish or question marks, converting to Unicode reveals the actual character codes being used, helping identify encoding problems.

Work with emojis: Emojis are Unicode characters. Converting them to code points helps debug emoji rendering issues or work with emoji databases.

Internationalization: When building multilingual applications, understanding character codes helps handle text from different writing systems correctly.

Escape special characters: Generate JavaScript or HTML escape sequences for characters that can't be typed directly or might cause parsing issues.
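To make the debugging use case concrete, the following Python sketch reproduces a common kind of gibberish: UTF-8 bytes decoded as Latin-1. The sample word and the helper name code_points are chosen only for illustration.

```python
def code_points(text):
    """List each character of text in U+ notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

original = "caf\u00e9"  # "cafe" with an acute accent on the e

# Simulate the bug: the bytes are UTF-8 but are read as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

print(code_points(original))  # ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
print(code_points(garbled))   # ['U+0063', 'U+0061', 'U+0066', 'U+00C3', 'U+00A9']

# Reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)   # True
```

Seeing U+00C3 followed by U+00A9 where a single U+00E9 was expected is a strong hint that UTF-8 data was decoded with a single-byte encoding.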

Understanding Code Points

Basic Latin (U+0000 to U+007F): ASCII characters including English letters, numbers, and common symbols. These are the most widely supported characters.

Latin Extended (U+0080 to U+024F): Accented characters and letters from European languages like French, German, and Spanish.

Emoji (U+1F300 to U+1F9FF): Modern pictographs including faces, animals, food, and symbols. Because these code points lie above U+FFFF, UTF-16-based environments such as JavaScript and Java store each one as a surrogate pair.

CJK (U+4E00 to U+9FFF): Chinese, Japanese, and Korean characters. The unified ideographs block contains thousands of characters used across East Asian languages.

Mathematical symbols (U+2200 to U+22FF): Math operators, set theory symbols, and other mathematical notation.
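The ranges above can be turned into a simple lookup. This Python sketch uses this page's groupings as labels (they are not the official Unicode block names), and the names RANGES and range_of are assumptions made for the example.

```python
# (start, end, label) taken from the ranges listed above.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x024F, "Latin Extended"),
    (0x2200, 0x22FF, "Mathematical symbols"),
    (0x4E00, 0x9FFF, "CJK"),
    (0x1F300, 0x1F9FF, "Emoji"),
]

def range_of(ch):
    """Return the label of the range containing ch, or 'Other'."""
    cp = ord(ch)
    for start, end, label in RANGES:
        if start <= cp <= end:
            return label
    return "Other"

for ch in ["A", "\u00e9", "\u2211", "\u4f60", "\U0001F30D", "\u03b1"]:
    print(f"U+{ord(ch):04X}", range_of(ch))
```

Greek alpha (U+03B1) falls outside the listed ranges, so it is reported as "Other".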

UTF-8, UTF-16, and Code Points

Unicode code points are abstract numbers. UTF-8 and UTF-16 are encoding schemes that convert these numbers to bytes for storage and transmission. UTF-8 uses 1 to 4 bytes per code point, which makes it efficient for English text (1 byte per character) while still supporting all of Unicode. UTF-16 uses 2 or 4 bytes per code point, which is compact for most East Asian text (2 bytes where UTF-8 needs 3) but doubles the size of ASCII text. Most web content uses UTF-8.

A code point like U+1F30D (🌍) is stored differently depending on encoding: UTF-8 uses 4 bytes (F0 9F 8C 8D), UTF-16 uses a surrogate pair (D83C DF0D), but the code point itself is always 1F30D. This tool shows code points, which are encoding-independent.
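The byte values quoted above can be checked with a short Python sketch. The surrogate_pair function is a hypothetical helper that applies the standard UTF-16 arithmetic; the result is compared with Python's built-in encoders.

```python
def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

globe = "\U0001F30D"
high, low = surrogate_pair(ord(globe))

print(f"{high:04X} {low:04X}")                    # D83C DF0D
print(globe.encode("utf-8").hex(" ").upper())     # F0 9F 8C 8D
print(globe.encode("utf-16-be").hex().upper())    # D83CDF0D
```

The code point stays 1F30D in every case; only the byte sequence changes with the encoding.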

Working with Escape Sequences

JavaScript strings: Use \uXXXX for characters up to U+FFFF. For higher characters like emojis, use \u{XXXXX} (ES6 and later) or a surrogate pair such as \uD83C\uDF0D.

JSON: Strictly uses \uXXXX format. Characters above U+FFFF require surrogate pairs, as JSON doesn't support \u{} syntax.

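CSS: Use a backslash followed by one to six hex digits: \48 for 'H'. If the escape has fewer than six digits and the next character could be read as a hex digit, end it with a space or pad it to six digits (\000048). Common for icon fonts and special characters in the CSS content property.

HTML: Use &#decimal; or &#xhex; anywhere in HTML content. Both work identically: &#72; and &#x48; both produce 'H'.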
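Python's standard library can be used to check several of these escape forms. The sketch below relies on json and html from the standard library; css_escape is a hypothetical helper written for this example.

```python
import html
import json

globe = "\U0001F30D"

# json.dumps escapes non-ASCII text by default, so U+1F30D comes out
# as a surrogate pair, which is the only form JSON allows.
encoded = json.dumps(globe)
print(encoded)                        # "\ud83c\udf0d"
print(json.loads(encoded) == globe)   # True

# Decimal and hex character references decode to the same letter.
print(html.unescape("&#72; &#x48;"))  # H H

def css_escape(ch):
    """Six-digit CSS escape, which needs no trailing space."""
    return f"\\{ord(ch):06X}"

print(css_escape("H"))                # \000048
```

The six-digit CSS form is used here because it avoids the question of whether a trailing space is required.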

Best Practices

Use UTF-8 everywhere: Set UTF-8 as your file encoding, database encoding, and HTML charset. This prevents encoding issues and ensures proper Unicode handling.

Don't rely on escapes: In modern systems, you can usually type characters directly rather than using escape sequences. Only escape when necessary for technical reasons.

Normalize text: Unicode has multiple ways to represent some characters (like é as one character or e + combining accent). Use Unicode normalization to ensure consistency.

Test with real data: Test your application with actual international text, not just ASCII. Many bugs only appear with non-English text or emojis.
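The normalization advice above can be sketched in Python with the standard unicodedata module. The helper name same_text is an assumption made for the example.

```python
import unicodedata

composed = "\u00e9"      # e with acute accent as one code point
decomposed = "e\u0301"   # letter e followed by a combining acute accent

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(composed, decomposed))  # True
```

Both strings usually render identically, which is why the mismatch is easy to miss without normalization.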

Frequently Asked Questions

Why does my emoji turn into two \u sequences?

Emojis like the globe (U+1F30D) live above the Basic Multilingual Plane (which ends at U+FFFF), so UTF-16 represents them as a surrogate pair: \uD83C\uDF0D. JavaScript's older \uXXXX escape can only encode values up to U+FFFF, which is why modern code uses \u{1F30D} or the full surrogate pair.

What's the difference between U+0041, \u0041, and &#65;?

They all represent the same character: capital 'A'. U+0041 is the formal Unicode standard notation, \u0041 is the JavaScript/JSON string escape, and &#65; is the HTML decimal entity (or &#x41; in hex). The underlying code point (65 decimal / 41 hex) is identical.

How many bytes does Unicode use per character?

It depends on the encoding. UTF-8 uses 1 byte for ASCII (U+0000 to U+007F), 2 bytes for Latin extended and Greek, 3 bytes for most CJK, and 4 bytes for emojis. UTF-16 uses 2 bytes for the BMP and 4 bytes for supplementary characters. UTF-32 always uses 4 bytes per code point.
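The byte counts above can be measured directly. This Python sketch uses the little-endian codec names so that no byte order mark is added to the result; byte_lengths is a hypothetical helper.

```python
def byte_lengths(ch):
    """Return the encoded size of ch in UTF-8, UTF-16, and UTF-32."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "A": "A",
    "alpha": "\u03b1",
    "ni": "\u4f60",
    "globe": "\U0001F30D",
}

for name, ch in samples.items():
    print(f"{name:6} U+{ord(ch):04X} {byte_lengths(ch)}")
```

With these samples, UTF-8 takes 1, 2, 3, and 4 bytes respectively, UTF-16 takes 2, 2, 2, and 4, and UTF-32 takes 4 every time.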

Common Mistakes to Avoid

Counting characters with .length: "hello".length in JavaScript returns 5, but "👨‍👩‍👧".length returns 8 because .length counts UTF-16 code units: three emojis stored as surrogate pairs (6 units) plus two zero-width joiners (ZWJ). Array.from() counts code points (5 here); use Intl.Segmenter to count user-perceived characters (graphemes), which gives 1.
Confusing code point with glyph: A single visible glyph like 'é' can be one code point (U+00E9) or two (U+0065 + U+0301 combining acute). Normalize both strings to the same form, such as NFC, before comparing them.
Using \uXXXX for emojis: \u1F30D does not produce the globe. The parser reads only four hex digits, so the result is U+1F30 (ἰ) followed by the letter 'D'. Use \u{1F30D} in ES6+ or the surrogate pair \uD83C\uDF0D.
Mixing decimal and hex entities: &#x48; is hex (H), while &#48; is decimal (0). The 'x' prefix changes the base; omitting it can produce a completely different character.
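Several of these mistakes can be reproduced in a few lines of Python. Python strings count code points rather than UTF-16 code units, so the JavaScript .length value is derived here from the UTF-16 byte length. Grapheme counting is not shown because the Python standard library has no grapheme segmenter.

```python
import html

# Family emoji: man, ZWJ, woman, ZWJ, girl.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

utf16_units = len(family.encode("utf-16-le")) // 2  # what JavaScript's .length reports
code_points = len(family)                           # Python counts code points
print(utf16_units, code_points)  # 8 5

# A four-digit escape stops after four hex digits, leaving a stray "D".
too_short = "\u1F30D"
print(len(too_short), too_short[1])  # 2 D

# The x prefix selects hex; without it the same digits are read as decimal.
print(html.unescape("&#x48;"), html.unescape("&#48;"))  # H 0
```

Python's four-digit escape behaves like JavaScript's here: both consume exactly four hex digits.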

Quick Reference

Block	Range	Example
Basic Latin (ASCII)	U+0000 - U+007F	A = U+0041
Latin-1 Supplement	U+0080 - U+00FF	e-acute = U+00E9
Greek	U+0370 - U+03FF	alpha = U+03B1
CJK Unified	U+4E00 - U+9FFF	ni = U+4F60
Emoticons	U+1F600 - U+1F64F	smile = U+1F600
Misc Symbols and Pictographs	U+1F300 - U+1F5FF	globe = U+1F30D