Watch 10 video solutions for UTF-8 Validation, a medium level problem involving Array, Bit Manipulation. This walkthrough by Venkatesh Thallam has 4,820 views views. Want to try solving it yourself? Practice on FleetCode or read the detailed text solution.
Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
0, followed by its Unicode code.n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.This is how the UTF-8 encoding would work:
Number of Bytes | UTF-8 Octet Sequence
| (binary)
--------------------+-----------------------------------------
1 | 0xxxxxxx
2 | 110xxxxx 10xxxxxx
3 | 1110xxxx 10xxxxxx 10xxxxxx
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
x denotes a bit in the binary form of a byte that may be either 0 or 1.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Example 1:
Input: data = [197,130,1] Output: true Explanation: data represents the octet sequence: 11000101 10000010 00000001. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.
Example 2:
Input: data = [235,140,4] Output: false Explanation: data represented the octet sequence: 11101011 10001100 00000100. The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that's correct. But the second continuation byte does not start with 10, so it is invalid.
Constraints:
1 <= data.length <= 2 * 1040 <= data[i] <= 255Problem Overview: You receive an array of integers where each value represents a byte (only the lowest 8 bits are used). Determine whether the sequence forms a valid UTF-8 encoded character stream. UTF-8 characters may use 1 to 4 bytes, and continuation bytes must follow strict bit patterns.
Approach 1: Iterative Bit Pattern Validation (O(n) time, O(1) space)
This approach scans the array once while tracking how many continuation bytes are expected for the current UTF‑8 character. UTF‑8 encodings follow specific leading bit patterns: 0xxxxxxx for 1 byte, 110xxxxx for 2 bytes, 1110xxxx for 3 bytes, and 11110xxx for 4 bytes. Continuation bytes must always match 10xxxxxx. Iterate through the array and inspect the highest bits using bit masking and shifting. If a leading byte is encountered, determine how many continuation bytes should follow. For each expected continuation byte, verify the 10 prefix. If a mismatch occurs or the sequence ends early, the encoding is invalid. This method uses constant extra memory and relies purely on bit checks, making it the most practical solution. Problems like this commonly appear in bit manipulation practice where understanding binary prefixes is critical.
Approach 2: Recursive Sequence Validation (O(n) time, O(n) recursion depth worst case)
The recursive approach models the validation as processing one UTF‑8 character at a time. Start from index i, inspect the leading byte, and determine the total number of bytes in the character using the prefix bits. Then recursively validate the required continuation bytes and move to the next character start index. Each continuation byte must match the 10xxxxxx pattern before recursion proceeds. This method makes the structure of UTF‑8 characters explicit, but recursion adds overhead and stack usage. While still linear time, it consumes additional call stack space and is less practical than the iterative scan. However, it can help when reasoning about parsing sequences in array traversal problems or recursive parsers.
Recommended for interviews: The iterative bit‑masking approach is what interviewers typically expect. It demonstrates that you understand UTF‑8 byte prefixes and can efficiently validate them with bit operations. The recursive approach proves conceptual understanding of character grouping, but the iterative version shows stronger control over bit manipulation and constant‑space streaming validation.
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Iterative Bit Pattern Validation | O(n) | O(1) | Best general solution. Single pass validation using bit masks and counters. |
| Recursive Sequence Validation | O(n) | O(n) worst case recursion | Useful for conceptual parsing of UTF‑8 characters or recursive stream validation. |