Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones, the (n + 1)th bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.
This is how the UTF-8 encoding would work:
| Number of Bytes | UTF-8 Octet Sequence (binary) |
|---|---|
| 1 | 0xxxxxxx |
| 2 | 110xxxxx 10xxxxxx |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
x denotes a bit in the binary form of a byte that may be either 0 or 1.
Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
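To make the note about the least significant 8 bits concrete, the sketch below renders an integer as the octet it contributes. The class name `OctetView` and the method `toOctet` are hypothetical names chosen for illustration only.

```java
final class OctetView {
    private OctetView() {}

    /** Returns the low 8 bits of value as an 8-character binary string. */
    static String toOctet(int value) {
        int low = value & 0xFF;                     // keep only the least significant byte
        String bits = Integer.toBinaryString(low);  // no leading zeros, e.g. "1" for 1
        // Left-pad with zeros so every octet is exactly 8 digits wide.
        return String.format("%8s", bits).replace(' ', '0');
    }
}
```

Under this sketch, `toOctet(197)` gives 11000101 and `toOctet(4)` gives 00000100. An input such as 386 (256 + 130) yields the same octet as 130, because the higher bits are ignored.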
Example 1:
Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence 11000101 10000010 00000001. It is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.
Example 2:
Input: data = [235,140,4]
Output: false
Explanation: data represents the octet sequence 11101011 10001100 00000100. The first 3 bits are all ones and the 4th bit is 0, which means it is a 3-byte character. The next byte is a continuation byte that starts with 10, which is correct. But the second continuation byte does not start with 10, so it is invalid.
Constraints:
1 <= data.length <= 2 * 10^4
0 <= data[i] <= 255

The key idea in UTF-8 Validation is to verify whether a sequence of integers represents a valid UTF-8 encoded character stream. Each integer represents a byte, and UTF-8 rules define how many bytes belong to a character based on the pattern of leading bits. The first byte determines the total number of bytes in the character, while continuation bytes must follow the 10xxxxxx pattern.
A common strategy is to iterate through the array while tracking how many continuation bytes are expected. By using bit manipulation, you can inspect the most significant bits of each byte to determine whether it starts a new character or continues a previous one. Counting leading 1s in the first byte helps determine whether the character spans 1, 2, 3, or 4 bytes. If any byte violates the expected pattern, the sequence is invalid.
This linear scan approach ensures efficient validation while using only a few variables to track state.
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Bit manipulation with byte pattern validation | O(n) | O(1) |
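The single-pass strategy described above can be sketched as follows. This is a minimal illustration that checks only the bit patterns listed in the problem statement; the class name `Utf8Validator`, the method `validUtf8`, and the variable `pending` are assumed names, not taken from the original solution.

```java
final class Utf8Validator {
    private Utf8Validator() {}

    /** Returns true if data follows the 1- to 4-byte patterns from the table. */
    static boolean validUtf8(int[] data) {
        int pending = 0; // continuation bytes still expected for the current character

        for (int value : data) {
            int b = value & 0xFF; // only the least significant 8 bits carry data

            if (pending > 0) {
                // Inside a multi-byte character: the byte must look like 10xxxxxx.
                if ((b & 0xC0) != 0x80) {
                    return false;
                }
                pending--;
            } else if ((b & 0x80) == 0x00) {
                pending = 0;           // 0xxxxxxx: 1-byte character
            } else if ((b & 0xE0) == 0xC0) {
                pending = 1;           // 110xxxxx: 2-byte character
            } else if ((b & 0xF0) == 0xE0) {
                pending = 2;           // 1110xxxx: 3-byte character
            } else if ((b & 0xF8) == 0xF0) {
                pending = 3;           // 11110xxx: 4-byte character
            } else {
                return false;          // stray 10xxxxxx or 11111xxx lead byte
            }
        }
        // A character that was started must also be finished.
        return pending == 0;
    }
}
```

For the two examples, `validUtf8(new int[]{197, 130, 1})` returns true and `validUtf8(new int[]{235, 140, 4})` returns false. The final `pending == 0` check rejects inputs that end in the middle of a multi-byte character.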
Venkatesh Thallam
Use these hints if you're stuck. Try solving on your own first.
Read the data integer by integer. When you read it, process the least significant 8 bits of it.
Assume the next encoding is 1-byte data. If it is not 1-byte data, read the next integer and assume it is 2-bytes data.
Similarly, if it is not 2-bytes data, try 3-bytes then 4-bytes. If you read four integers and it still does not match any pattern, return false.
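The hints above suggest trying the 1-byte pattern first, then 2, 3, and 4 bytes. One way to sketch that idea is shown below; `Utf8PatternMatcher`, `matchLength`, and the two lookup arrays are hypothetical names introduced for this example.

```java
final class Utf8PatternMatcher {
    private Utf8PatternMatcher() {}

    // Mask and expected value of the lead byte for lengths 1, 2, 3, 4.
    private static final int[] LEAD_MASK = {0x80, 0xE0, 0xF0, 0xF8};
    private static final int[] LEAD_BITS = {0x00, 0xC0, 0xE0, 0xF0};

    /** Returns the length (1..4) of the pattern matching at start, or 0 if none does. */
    static int matchLength(int[] data, int start) {
        for (int len = 1; len <= 4; len++) {
            if (start + len > data.length) {
                break; // not enough bytes left for this or any longer pattern
            }
            int lead = data[start] & 0xFF;
            if ((lead & LEAD_MASK[len - 1]) != LEAD_BITS[len - 1]) {
                continue; // lead byte does not announce this length, try the next one
            }
            boolean continuationsOk = true;
            for (int k = 1; k < len; k++) {
                if ((data[start + k] & 0xC0) != 0x80) { // must be 10xxxxxx
                    continuationsOk = false;
                    break;
                }
            }
            if (continuationsOk) {
                return len;
            }
        }
        return 0;
    }

    /** Consumes the array one character at a time using matchLength. */
    static boolean validUtf8(int[] data) {
        int i = 0;
        while (i < data.length) {
            int len = matchLength(data, i);
            if (len == 0) {
                return false; // no pattern of 1 to 4 bytes fits here
            }
            i += len;
        }
        return true;
    }
}
```

This variant looks ahead at up to four integers per character instead of carrying a counter between iterations. Each integer is still examined a bounded number of times, so the running time stays linear.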
This approach involves solving the problem using an iterative method, where we use loops to perform the necessary calculations. This can be more efficient in terms of space complexity, especially if recursion would lead to excessive function call overhead.
Time Complexity: O(n), where n is the number of elements.
Space Complexity: O(1) since we are not using any extra space proportional to the input size.
public class Solution {
    public static void solveProblem(int n) {
        for (int i = 0; i < n; i++) { /* loop body truncated in the source */ }
    }
}
The Java implementation uses a simple loop with System.out.print for printing output. The iterative approach is simple and efficient.
This approach explores solving the problem through recursion, which can offer simplicity and expressiveness. However, care must be taken with recursion depth to avoid stack overflow.
Time Complexity: O(n)
Space Complexity: O(n) due to the call stack.
#include
Yes, UTF-8 Validation is a common medium-level interview question because it tests understanding of bit manipulation and edge-case handling. Companies often use similar problems to evaluate low-level data processing skills.
The optimal approach is to scan the array once while tracking how many continuation bytes are expected. Bit manipulation is used to check the leading bits of each byte and confirm whether it follows UTF-8 encoding rules. This method runs in linear time and uses constant extra space.
UTF-8 encoding relies on specific bit patterns in each byte. Bit manipulation allows you to check leading bits such as 0, 110, 1110, or 11110 to determine character length and verify continuation bytes that start with 10.
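As an alternative to comparing against each mask in turn, the leading bits can be classified by counting the leading 1s of the byte. The sketch below assumes the hypothetical names `LeadByte`, `leadingOnes`, and `sequenceLength`; it relies on the standard `Integer.numberOfLeadingZeros` method.

```java
final class LeadByte {
    private LeadByte() {}

    /** Counts how many 1 bits the low byte of value starts with (0 to 8). */
    static int leadingOnes(int value) {
        int b = value & 0xFF;
        // Move the byte to the top of the 32-bit int, invert it, and count
        // leading zeros: each leading 1 of the byte is now a leading 0.
        return Integer.numberOfLeadingZeros(~(b << 24));
    }

    /**
     * Returns 1..4 for a byte that starts a character of that length,
     * 0 for a continuation byte (10xxxxxx), and -1 for any other pattern.
     */
    static int sequenceLength(int value) {
        int ones = leadingOnes(value);
        if (ones == 0) {
            return 1;    // 0xxxxxxx
        }
        if (ones == 1) {
            return 0;    // 10xxxxxx
        }
        if (ones <= 4) {
            return ones; // 110xxxxx, 1110xxxx, 11110xxx
        }
        return -1;       // 11111xxx never starts a character under these rules
    }
}
```

For example, `sequenceLength(197)` is 2, `sequenceLength(235)` is 3, and `sequenceLength(130)` is 0, matching the octets in the two examples.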
No complex data structure is required for this problem. A simple array traversal combined with integer variables and bit masking is sufficient to validate the UTF-8 byte patterns.
The recursive solution in C uses a helper parameter `i` that tracks the current state: it prints the current value and makes a recursive call with `i + 1` until `i` reaches `n`.