Once I was working with a big table.
This table had a thousand numeric columns. Some of them were dodgy and I wanted to know which.
I wanted to find some basic summary statistics, like the min, max and mean of each column. That would help me.
What to do?
It’s the kind of thing you can do in one line in Python. In SQL? I really didn’t want to do this:
select min(c1), max(c1), mean(c1), min(c2), max(c2), mean(c2), min(c3) -- and so on for 1000 columns from xyz
Not for the first time I found myself wishing that SQL had better capabilities for this stuff. Luckily, there’s a better way.
Here’s the idea: write SQL, that generates SQL!
What does this mean? Here’s a code snippet, written in PostgreSQL:
SELECT 'select \'spacer\' AS column_name, 0 AS min_value, 0 as avg_value, 0 as max_value, 0 as n_nulls, 0 as n_non_nulls, 0 as n_rows' union all SELECT 'union all select \'' || "column" || '\', min(' || "column" || '), avg(cast(' || "column" || ' as float)), max(' || "column" || '), sum(case when ' || "column" || ' is null then 1 else 0 end) n_nulls, count(1) - sum(case when ' || "column" || ' is null then 1 else 0 end) n_non_nulls, count(1) n_rows from ' || tablename FROM pg_table_def WHERE tablename = 'XYZ' and type in ('integer', 'real');
If you run this statement, the output will be a table. Copy-paste the table into a new editor and you’ll see a SQL statement. Run this statement to get your summary statistics! Just remember to modify the WHERE clause to use the table name you are interested in.
How does it work? The statement uses the
pg_table_def table to find the column names in your table. It checks which ones are numeric, then constructs an elaborate string concatenation to generate a SQL query.
No typing required! Once you get the basic idea, you can extend the code above to only include particular metrics that you are interested. Above we find the
- number of nulls
- number of non-nulls
- total number of rows
for each column, but there are plenty more that you could include, like
- number of zeros
- standard deviation / variance
- percentiles (5th percentile, 25th percentile, 75th percentile, 95th percentile)
- percentage of outliers (numbers over a certain value
Just be careful if you have large data. The metrics that require sorting to work will be slow (like the median or percentiles) so you might have to do without those (or use a sortkey).
I’m finding this method of writing SQL very useful. I’ve found four main benefits:
- I type way less SQL
- I make less mistakes in my queries
- I’m way more likely to calculate summary statistics for my tables, which leads to
- being more likely to find errors in my data
This last point is the big one.
Crap data kills machine learning. Garbage in equals garbage out. The best model in the world will fail if its data sources aren’t good.
To improve the model, improve the data. Automatic SQL generation will help.