Jan 7, 2016

How to Extract Columns from a CSV file (without using a spreadsheet)

One of my colleagues who does data analysis had a CSV (comma-separated values) file that he could not open in Excel nor in any other spreadsheet program he tried.

Either due to its sheer filesize of 125MB, or its number of rows of more than 1,500,000, the programs would gag.

It turned out that for his purposes, he did not need all of the data, only a subset of the columns. Maybe extracting only what he needed into a smaller CSV would enable him to be able to work with it.

I had earlier read parts of the book The Linux Command Line, by William Shotts. I vaguely remembered mention of a utility to selectively pull fields out of a text file. That turned out to be the cut command.

Here's what the first few rows of the original data file looked like.

id,type,distance,userid,charityID,time,lat,lon
1003529743,walk,4.48342,1000545086,2166731,"2015-06-30 00:00:05",40.2501,-76.6714
1003529744,Run,4.087,1000402641,15048,"2015-06-30 00:00:21",45.5244,-89.7398
1003529745,run,2.631135,1000258381,61635018,"2015-06-30 00:00:23",41.6281,-87.193
1003529746,Bike,1.216703,1000505816,18010,"2015-06-30 00:00:24",43.0306,-78.7963
1003529747,walk,2.069664,1000015957,18010,"2015-06-30 00:00:25",39.2481,-76.5165
1003529748,Bike,6.174,1000126350,18010,"2015-06-30 00:00:25",29.5913,-82.4298
1003529749,run,1.47652,1000542869,92985044,"2015-06-30 00:00:26",40.7115,-89.4287


Only the type and userid columns were actually required. Using the following command I was able to generate another CSV with just those two fields.

$ cut -f 2,4 -d, july.csv > type-userid.csv

The -f option specifies which fields to extract. In this case, its the 2nd and the 4th fields. The -d option is the field delimiting character, which in our case is the comma. It defaults to tabs.

july.csv is the input file, type-userid.csv captures the standard-out.

The first few rows of the resulting file were

type,userid
walk,1000545086
Run,1000402641
run,1000258381
Bike,1000505816
walk,1000015957
Bike,1000126350
run,1000542869


And the filesize was reduced to about 24MB, which was usable and much more manageable.

(However, depending on which spreadsheet app you are using, the row count might exceed the limit. For example, both Excel 2007 and LibreOffice 4.2 Calc can handle a maximum of 1,048,576 rows.)

  ***

The cut command works blazingly fast. It only took about a second to process the input file. After I handed off the outputted  file, I later realized that I might have been able to save my colleague some tedium and waiting time by applying other CLI text-processing commands as well, such as sort, uniq, and wc, depending on what he wanted to do.

Source:

The Linux Command Line, by William Shotts
http://linuxcommand.org/tlcl.php
(You can buy the paper book or download the pdf for free)

No comments:

Post a Comment