[Solved] get statistics information by SQL query efficiently for table with 3 columns and 800 million rows

Question

The OP probably already knows this, but here is how to get the answer, disregarding efficiency. First, cards per location, as described in the comments:

SELECT locationid, COUNT(DISTINCT cardID)
FROM table 
GROUP BY locationid

Next the same thing per state.

SELECT substring(locationid, 1, 2) state, COUNT(DISTINCT cardID)
FROM table 
GROUP BY substring(locationid, 1, 2)

For a single state,

select COUNT(DISTINCT cardID)
from table 
where substring(locationid, 1, 2) = 'NY'

The problem with the first query is that it will return locationID’s like NY_1234. If you have not memorized what NY_1234 is in layman’s terms, Rochester for example, you have to go outside your database to see what it is.

The second query will be inefficient because it has to apply the substring function on a large number of records. The third will be inefficient because it will have to scan the entire table.

If you had a relational model you could do stuff like this.

select municipality, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by municipality

select stateCode, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by stateCode

State codes would be NY, NJ, etc If you wanted New York, New Jersey, etc, that would simply be another field.

Finally,

select count(distinct cardID)
from table join location on table.locationid = location.locationID
where stateCode="NY"

The second two relational queries would be faster than the single table queries because they wouldn’t have to use functions. You could speed it up even more with an index on stateCode, but even if you didn’t, a much smaller table would be scanned.

Accepted Answer