There are two methods. One (csvkit_only) is fast and easy, but doesn't get you set up to use other python tools. The second (at the bottom) takes a bit more setup but will get you set up to do more with python than just run CSVkit.
In this tutorial we will be doing some basic data analysis in the command prompt or terminal using bash and csvkit. We will be comparing it with python's pandas. Get More Here - Building ML Web. Installing csvkit with Yosimite or higher (OSX 10.10) - gist:5cbaf75c57a06ba75a8b8759737a6aa7.
CSVkit is a suite of utilities written inPython. It is available as a Python module, which means we can use one of Python's module installers to install it on your computer. OSX ships with easy_install by default, but like a lot of people, I prefer
pip
. But pip
isn't installed by default.As it turns out, OSX ships with
easy_install
. This isn't a python class or a programming class, it's a data class and all I really want is to walk you all through how you might dig in to a monster CSV file with some command line tools. For that, it really doesn't matter how you install CSVkit as long as you install it.So here are some good options:
- Use
sudo easy_install csvkit
to install CSVkit. Then read a bit about what it does. If you secretly think you'll never touch the terminal again after this semester, this is your best route. - Install Homebrew. Then do
pip install csvkit
at the command line. If you get an error that suggests you don't have permission to install it, trysudo pip install csvkit
to install with root privileges. Tryman sudo
if you want to understand what the command does. If you want to explore more programming, this is probably your best option. - Alternatively, you can install
pip
witheasy_install
(usingsudo easy_install pip
) and then install CSVkit withsudo pip install csvkit
. Choosing this option won't mean you can't install Homebrew later.
What if I can't sudo
?
If you don't have admin privileges on your computer you'll have a hard time following the instructions that come with most software. You can still install Python modules and plenty else without admin privileges. You just need to follow a few instructions. Your first step is to tell Python where to install packages so it doesn't try to put them in a system directory. In a text editor (like TextWrangler), create a new file.
Name the file
.pydistutils.cfg
(the dot at the beginning matters). If you already have a file called .pydistutils.cfg
, edit it and add these lines to it:You can confirm that you put it in the right place by doing
cat ~/.pydistutils.cfg
-- that should spit back exactly what you put into the file. If it says file not found, you didn't put it in your home directory.Next you're going to make the directories that you just told Python to use, by running each of the commands below.
Then do
easy_install csvkit
and you should be able to install it just fine. You'll still have a hard time running it, however. You'll need to create another text file, this time called .bashrc
(again, if it already exists, just add to it). In that file you're going to put:And then at the terminal, type
source .bashrc
-- then (Finally!) try running which csvcut
and you should see that it is installed at User/{your name}/Library/Python/bin
If you find that CSVkit isn't in your path next time you open Terminal.app, you probably need to open your Preferences and tell Terminal that 'Shells open with: Command' where the command is probably
/bin/bash
.Why? It's complicated
The csvkit installation instructions tell you to run
pip install csvkit
. They call that simple. It is simple if you have pip
installed and know what it is. But if you don't, we need to take a few steps back.csvkit
is a python package, pip
is a package manager for Python and python, is a programming language. So what you actually need to do is...First: make sure python
is installed.
Start by running
which python
or python --version
-- that ought to give you a clue about whether or not you've got python running already. Some versions of OSX ship with python pre-installed.Python-Guide has great instructions for getting started on Windows or OSX.
Jue Yang's walk through on getting setup is another good place to start.
Second: make sure pip
is installed.
If you have python but not homebrew, you'll start with
easy_install pip
(or possibly sudo easy_install pip
). If you went the homebrew route, brew install pip
should do it.Third: install csvkit
.
Now try running
pip install csvkit
-- if you still get an error, it is time to ask for help.This tutorial will show how to load a large CSV file into PostgreSQL using CSVKit, then turn the LAT/LON columns into geometry using PostGIS.
CSVKit is a utility used at command line / terminal and installed via Python.
Not only is CSVKit the best way to generate a SQL 'create table' expression from your CSV file (accurately determines the column types based on the data in the table), it allows you to directly load a CSV using that column definition AND the super-fast PostgreSQL COPY tool all in one command.
A lot of this is based on the Eleven Awesome Things you can do with CSVKit article.
Installing CSVKit
While in command line / terminal:
Note: this requires PIP for Python to be installed
Inspect a CSV
Next browse to a folder where you have a CSV loaded, for me in my Downloads folder. I will use the City and County of Denver Crimes dataset which you can download for free.
The downloaded file is called crime.csv
Use CSVKit to inspect the data and column types:
Since this is a large dataset, it may take a few minutes to run.
The output gives a summary of each column of data in the CSV. Here's a sample of the first column:
Next, you can get a SQL table definition from the CSV file (but don't run the SQL command, we'll load it all in the next step):
Note: I've added a flag to specify I want the SQL syntax in PostgreSQL format. You can get all the database types here.
Again, since this is a big file, it will take a few moments to run.
The output will give us a SQL statement that can be used to create an empty table in PostgreSQL (but don't do this yet):
Again, the next step will show how to load the data with the correct schema definition in one command. Otherwise, you could take that SQL statement, run it in the database, then use the Import / COPY command to load your CSV.
Load the CSV into PostgreSQL
Next, we'll load the data into the PostgreSQL database in one easy command skipping the CSVSQL command above and the COPY step.
This is the syntax to specify the database to connect to, the name of the table to create, and the csv file you want to load:
Again, this is a large dataset, but when complete, open up PGAdmin and see the new table:
Create Geometry from Lat/Lon
The table we created from the CSV doesn't have a geometry column, instead it comes with the LAT and LON values in both State Plane (GEO_X, GEO_Y) and WGS84 (GEO_LON, GEO_LAT).
Using the a couple of spatial functions, we can turn the GEO_LON and GEO_LAT columns into geometry in WGS84 (SRID: 4326)
First, our denver_crime table needs a geometry column. In the PGAdmin SQL window, run:
Then, fill the empty geometry column up with a geometry object built from the GEO_LON and GEO_LAT columns:
The above SQL uses the ST_MakePoint to assemble a geometry object from the two columns. Then, wrap that with the ST_SetSRID function to set the spatial reference of the points to 4326 (WGS84).
The denver_crime table now has geometry in the geom column, so you can now view the points in QGIS: