Table: Fraud
+-------------+---------+ | Column Name | Type | +-------------+---------+ | policy_id | int | | state | varchar | | fraud_score | int | +-------------+---------+ policy_id is column of unique values for this table. This table contains policy id, state, and fraud score.
The Leetcode Insurance Corp has developed an ML-driven predictive model to detect the likelihood of fraudulent claims. Consequently, they allocate their most seasoned claim adjusters to address the top 5% of claims flagged by this model.
Write a solution to find the top 5 percentile of claims from each state.
Return the result table ordered by state in ascending order, fraud_score in descending order, and policy_id in ascending order.
The result format is in the following example.
Example 1:
Input: Fraud table: +-----------+------------+-------------+ | policy_id | state | fraud_score | +-----------+------------+-------------+ | 1 | California | 0.92 | | 2 | California | 0.68 | | 3 | California | 0.17 | | 4 | New York | 0.94 | | 5 | New York | 0.81 | | 6 | New York | 0.77 | | 7 | Texas | 0.98 | | 8 | Texas | 0.97 | | 9 | Texas | 0.96 | | 10 | Florida | 0.97 | | 11 | Florida | 0.98 | | 12 | Florida | 0.78 | | 13 | Florida | 0.88 | | 14 | Florida | 0.66 | +-----------+------------+-------------+ Output: +-----------+------------+-------------+ | policy_id | state | fraud_score | +-----------+------------+-------------+ | 1 | California | 0.92 | | 11 | Florida | 0.98 | | 4 | New York | 0.94 | | 7 | Texas | 0.98 | +-----------+------------+-------------+ Explanation - For the state of California, only policy ID 1, with a fraud score of 0.92, falls within the top 5 percentile for this state. - For the state of Florida, only policy ID 11, with a fraud score of 0.98, falls within the top 5 percentile for this state. - For the state of New York, only policy ID 4, with a fraud score of 0.94, falls within the top 5 percentile for this state. - For the state of Texas, only policy ID 7, with a fraud score of 0.98, falls within the top 5 percentile for this state. Output table is ordered by state in ascending order, fraud score in descending order, and policy ID in ascending order.
Problem Overview: You need to identify records that fall within the top fraud percentile based on a fraud-related metric (for example fraud rate, fraud score, or suspicious transaction count). The task is essentially a ranking problem: compute the relative position of each record and return only those that fall within the highest percentile threshold.
Approach 1: Aggregate + Threshold Subquery (O(n log n) time, O(n) space)
A straightforward approach computes the fraud metric for each entity (user, merchant, or account) using GROUP BY. After generating the aggregated results, sort them by the fraud metric and determine the cutoff for the desired percentile using a subquery. For example, you can compute the value at the 90th or 95th percentile and filter rows where the metric is greater than or equal to that threshold.
This approach works in most SQL systems but often requires multiple passes over the data. One query computes aggregated fraud metrics, another determines the percentile cutoff, and the outer query filters results. The database must perform sorting to determine percentile boundaries, giving roughly O(n log n) time due to ordering and O(n) space for intermediate results. This method is easy to reason about but becomes verbose and less flexible when you need multiple percentile-based filters.
Approach 2: Window Function Ranking (O(n log n) time, O(n) space)
The cleaner solution uses SQL window functions to compute the percentile rank directly for each row. After computing the fraud metric per entity, apply functions like PERCENT_RANK() or CUME_DIST() over an ordered window such as OVER (ORDER BY fraud_metric DESC). This assigns a percentile position between 0 and 1 to every record.
Once the percentile rank is available, filtering becomes trivial: keep rows where the percentile is greater than or equal to the required threshold (for example >= 0.9 for the top 10%). The database engine performs a single ranking pass after sorting the results. Sorting dominates the runtime, so the overall complexity is O(n log n), with O(n) space used for the ranking window.
This method is concise and expressive. It keeps the entire computation inside one query and avoids manually calculating percentile boundaries. Window functions are widely used in analytical SQL problems and are a core technique in database interview questions.
Recommended for interviews: The window function approach is the expected solution. Interviewers want to see that you understand ranking functions like PERCENT_RANK or CUME_DIST and can apply them to percentile-based filtering. Mentioning the aggregation + threshold method first shows you understand the baseline approach, but using a window function demonstrates stronger SQL skills and familiarity with modern analytical queries often used in SQL data analysis tasks.
We can use the RANK() window function to calculate the ranking of fraud scores for each state, then filter out the records with a rank of 1, and sort them as required by the problem.
MySQL
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Aggregate + Percentile Threshold Subquery | O(n log n) | O(n) | When window functions are unavailable or when computing percentile cutoffs manually |
| Window Function (PERCENT_RANK / CUME_DIST) | O(n log n) | O(n) | Preferred modern SQL approach for ranking and percentile filtering |
Leetcode MEDIUM 3055 - Top Percentile Fraud PERCENT_RANK() SQL - Explained by Everyday Data Science • Everyday Data Science • 550 views views
Watch 1 more video solutions →Practice Top Percentile Fraud with our built-in code editor and test cases.
Practice on FleetCodePractice this problem
Open in Editor