Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
0, followed by its Unicode code.n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.This is how the UTF-8 encoding would work:
Number of Bytes | UTF-8 Octet Sequence
| (binary)
--------------------+-----------------------------------------
1 | 0xxxxxxx
2 | 110xxxxx 10xxxxxx
3 | 1110xxxx 10xxxxxx 10xxxxxx
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
x denotes a bit in the binary form of a byte that may be either 0 or 1.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Example 1:
Input: data = [197,130,1] Output: true Explanation: data represents the octet sequence: 11000101 10000010 00000001. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.
Example 2:
Input: data = [235,140,4] Output: false Explanation: data represented the octet sequence: 11101011 10001100 00000100. The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that's correct. But the second continuation byte does not start with 10, so it is invalid.
Constraints:
1 <= data.length <= 2 * 1040 <= data[i] <= 255Problem Overview: You receive an array of integers where each value represents a byte (only the lowest 8 bits are used). Determine whether the sequence forms a valid UTF-8 encoded character stream. UTF-8 characters may use 1 to 4 bytes, and continuation bytes must follow strict bit patterns.
Approach 1: Iterative Bit Pattern Validation (O(n) time, O(1) space)
This approach scans the array once while tracking how many continuation bytes are expected for the current UTF‑8 character. UTF‑8 encodings follow specific leading bit patterns: 0xxxxxxx for 1 byte, 110xxxxx for 2 bytes, 1110xxxx for 3 bytes, and 11110xxx for 4 bytes. Continuation bytes must always match 10xxxxxx. Iterate through the array and inspect the highest bits using bit masking and shifting. If a leading byte is encountered, determine how many continuation bytes should follow. For each expected continuation byte, verify the 10 prefix. If a mismatch occurs or the sequence ends early, the encoding is invalid. This method uses constant extra memory and relies purely on bit checks, making it the most practical solution. Problems like this commonly appear in bit manipulation practice where understanding binary prefixes is critical.
Approach 2: Recursive Sequence Validation (O(n) time, O(n) recursion depth worst case)
The recursive approach models the validation as processing one UTF‑8 character at a time. Start from index i, inspect the leading byte, and determine the total number of bytes in the character using the prefix bits. Then recursively validate the required continuation bytes and move to the next character start index. Each continuation byte must match the 10xxxxxx pattern before recursion proceeds. This method makes the structure of UTF‑8 characters explicit, but recursion adds overhead and stack usage. While still linear time, it consumes additional call stack space and is less practical than the iterative scan. However, it can help when reasoning about parsing sequences in array traversal problems or recursive parsers.
Recommended for interviews: The iterative bit‑masking approach is what interviewers typically expect. It demonstrates that you understand UTF‑8 byte prefixes and can efficiently validate them with bit operations. The recursive approach proves conceptual understanding of character grouping, but the iterative version shows stronger control over bit manipulation and constant‑space streaming validation.
This approach involves solving the problem using an iterative method, where we use loops to perform the necessary calculations. This can be more efficient in terms of space complexity, especially if recursion would lead to excessive function call overhead.
In C, we iterate through a loop to perform operations for each element. This avoids recursion and reduces memory usage as we're only using a single loop with constant space.
C++
Java
Python
C#
JavaScript
Time Complexity: O(n), where n is the number of elements.
Space Complexity: O(1) since we are not using any extra space proportional to the input size.
This approach explores solving the problem through recursion, which can offer simplicity and expressiveness. However, care must be taken with recursion depth to avoid stack overflow.
The recursive solution in C uses a helper parameter `i` that tracks the current state, printing the current value, and making a recursive call with `i + 1` until `i` reaches `n`.
C++
Java
Python
C#
JavaScript
Time Complexity: O(n)
Space Complexity: O(n) due to the call stack.
| Approach | Complexity |
|---|---|
| Approach 1: Iterative Solution | Time Complexity: O(n), where n is the number of elements. |
| Approach 2: Recursive Solution | Time Complexity: O(n) |
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Iterative Bit Pattern Validation | O(n) | O(1) | Best general solution. Single pass validation using bit masks and counters. |
| Recursive Sequence Validation | O(n) | O(n) worst case recursion | Useful for conceptual parsing of UTF‑8 characters or recursive stream validation. |
UTF-8 Validation Leetcode • Venkatesh Thallam • 4,820 views views
Watch 9 more video solutions →Practice UTF-8 Validation with our built-in code editor and test cases.
Practice on FleetCode