Optimizely Module: Notes on Converting to Drupal 8: 2016

Nov 22, 2016

Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 72 bytes)

I ran into the above error message while modifying some custom Drupal code. At first, I thought I just needed to increase the memory limit due to some newly called core functions consuming more than what was allocated.

There's good discussion about different ways to change the PHP memory limit in the article Fatal error: Allowed memory size of X bytes exhausted (tried to allocate Y bytes)...

On my local stack, one relevant file is /etc/php5/apache2/php.ini which had a limit of 512M that I had set about two years ago. That seemed like a lot already. I upped it to 640M, but got the same error at exactly the same line of code.

So I fired up a debugger to trace through. When execution reached the loop body containing the reported line where execution died (line 4 below), I single-stepped, single-stepped, single-stepped, ... and it didn't leave after the expected number of iterations.

1: $dirname = pathinfo($current_path, PATHINFO_DIRNAME);
2: while ($dirname && $dirname != '.') {
3: $page['#cache']['tags'][] = 'optimizely:' . $dirname . '/*';
4: $dirname = pathinfo($dirname, PATHINFO_DIRNAME);
5: }

Line 3 kept adding entries to the $page array until memory was exhausted. It was my own coding mistake in inadvertently creating a runaway infinite loop.
∞

Sources:

Fatal error: Allowed memory size of X bytes exhausted (tried to allocate Y bytes)...
https://www.drupal.org/node/76156

Nov 3, 2016

Using Drupal 8 Cache Tags for Page Caching

This is a small case study in converting how page caching is done in the D7 version of the Optimizely module to Drupal 8.

The purpose of the module is to manage the insertion of certain <script> elements into designated pages of a site. To do so, the user creates one or more projects. Each project has one or more url paths. When a project is enabled, all pages that match one of its paths have the <script> element added.

To specify project paths, use of a trailing * wildcard is allowed, as are special page designators. For example, these are all valid paths.

/node/2
/node/*
/admin/config/system
/admin/*
*
<front>

In the case of /admin/* it would match against any of

/admin/
/admin/config/
/admin/config/system/
/admin/people
/admin/people/create
...

In the D7 version, invalidating is done through calls to cache_clear_all(), which takes three parameters. The function is used by the module in three different ways.

(1) To invalidate a particular page, e.g.

    cache_clear_all('/node/2', 'cache_page', FALSE);

(2) To invalidate a path with a trailing wildcard, e.g.

    cache_clear_all('/node/*', 'cache_page', TRUE);

(3) To invalidate all pages of the site,

    cache_clear_all('*', 'cache_page', TRUE);

In Drupal 8 the Cache API is completely different. Function cache_clear_all() has disappeared.

After some research, it looked like using cache tags would be the way to go. The article Cacheability of render arrays was especially helpful in how to think about caching as applied to page rendering.

There are two facets to implementing this. The first is what needs to be done when a page is rendered, the second is what to invalidate when triggering changes occur.

For page rendering, I already had an implementation of hook_page_attachments() that checked for inserting the element into any page whose path matched against any of the enabled project paths.

This hook function is where cache tags could be added to the page. This turned out to be a little tricky. A page must be invalidated for two different use cases: when the page contains the element which now needs to be removed, and when the page does not contain the element but it now needs to be added.

For the first case, I decided to just use the matching project path act as the cache tag (there can only be one because overlapping project paths are not allowed).

But for the second case, I had to cover all the possible project paths that might be enabled in the future. So, for example, for the page at /node/2 there are three such possible paths.

/node/2
/node/*
*

For every page rendered, it was necessary to attach all of these possible project paths as cache tags. Here is a snippet of code that shows how this is done in hook_page_attachments().

// Site-wide wildcard.
$page['#cache']['tags'][] = 'optimizely:*';

// Non site-wide wildcards. Repeat for every directory level.
$dirname = pathinfo($current_path, PATHINFO_DIRNAME);
while ($dirname && $dirname != '/') {
    $page['#cache']['tags'][] = 'optimizely:' . $dirname . '/*';
    $dirname = pathinfo($dirname, PATHINFO_DIRNAME);
}

// The specific page url.
$page['#cache']['tags'][] = 'optimizely:' . $current_path;

// Finally, if there is an alias for the page, tag it.
if ($current_path_alias) {
    $page['#cache']['tags'][] = 'optimizely:' . $current_path_alias;
}

The optimizely: prefix follows the convention of prefixing a group name as part of the tag where appropriate. For example, cache tags from core include node:1 and config:node.type.article.

Finally, there is the matter of what and how to invalidate when changes to the projects and their paths are submitted. This turned out to be fairly easy to implement.

An array of all relevant project paths is passed to a function that carries out the following.

$cache_tags = [];
foreach ($path_array as $path) {
    $cache_tags[] = 'optimizely:' . $path;
}

\Drupal::service('cache_tags.invalidator')->invalidateTags($cache_tags);

For debugging purposes, outputting X-Drupal-Cache-Tags in HTTP headers was extremely useful. See my earlier post Enable and Use X-Drupal-Cache-Tags in HTTP headers.

This post is about the caching that is done by Drupal itself. The D7 version also checks for the presence of the varnish module and calls a function of that module if it exists. I did not pursue a replacement for that functionality, but the articles Varnish and Use Drupal 8 Cache Tags with Varnish and Purge look promising.

Sources:

Function cache_clear_all() has been removed
https://optimizely-to-drupal-8.blogspot.com/2014/07/function-cacheclearall-has-been-removed.html

Cacheability of render arrays
https://www.drupal.org/developing/api/8/render/arrays/cacheability

Cache tags
https://www.drupal.org/developing/api/8/cache/tags

Allow to set #cache metadata in hook_page_attachments() https://www.drupal.org/node/2475749

public static function Cache::invalidateTags
https://api.drupal.org/api/drupal/core!lib!Drupal!Core!Cache!Cache.php/function/Cache%3A%3AinvalidateTags/8

Oct 17, 2016

Enable and Use X-Drupal-Cache-Tags in HTTP headers

While converting from Drupal 7's caching to that of Drupal 8, for debugging purposes I wanted to enable the display of X-Drupal-Cache-Tags in HTTP headers.

How to do this is documented in the article CacheableResponseInterface and is actually simple, but I had enough trouble with getting this to work that I'm providing a few of my own notes here.

In Drupal 8 core, there is the file /sites/default/default.services.yml Copy this file to the same directory, creating a new file named services.yml (assuming you don't already have such a file).

For our purposes, in services.yml the only key you need is http.response.debug_cacheability_headers.

In my situation, I ended up with a services.yml that only contained the following two lines, where the value of the key is changed to true. I deleted all of the other keys.

parameters:
http.response.debug_cacheability_headers: true

After creating and/or editing services.yml be sure to do "a container rebuild, which is necessary when changing a container parameter". One way to accomplish this is via Clear all caches in the admin UI.

Finally, to see the display of X-Drupal-Cache-Tags in HTTP headers, I use the Chrome browser on a Linux system. It happens to be version 39.0.

Navigate to: More tools > Developer tools > Network tab

After loading a page whose cache tags you are interested in, on the left side of the Developer tools panel, click on the url for the page. Then on the right, click on the Headers subtab and look under Response Headers for the X-Drupal-Cache-Tags property. You will see a number of cache tags that come with Drupal core in addition to your own custom ones.

Source:

CacheableResponseInterface
https://www.drupal.org/developing/api/8/response/cacheable-response-interface#debugging

Aug 30, 2016

Call of MySQL database function fails

This post describes (another) database issue that was tracked down by my colleague Luis Delacruz.

We had both a live site and a corresponding development site using the same MySQL database server. On the live site, carrying out a particular user task worked fine, but doing the same task on the development site would fail with the message "DB Error: unknown error".

The problem was eventually traced back to the fact that the contents of the database for the development site had been loaded by importing a backup of the live site.

The live site contained several database functions. For example, here's one of the exported function definitions.

CREATE DEFINER=`user1`@`localhost`
FUNCTION `clean_string`(in_str varchar(4096))
RETURNS varchar(4096) CHARSET latin1

BEGIN
/**
   * Function will strip all non-ASCII and unwanted ASCII characters in string
   *
   * @author Shay Anderson 10.11
   *
   * @param VARCHAR in_arg
   * @return VARCHAR
   */
    DECLARE i, len SMALLINT DEFAULT 1;
DECLARE ret CHAR(255) DEFAULT '';
DECLARE c CHAR(1);
    SET len = CHAR_LENGTH( in_str );
    REPEAT
      BEGIN
        SET c = MID( in_str, i, 1 );
    IF c REGEXP '[[:alnum:]]' THEN
      SET ret=CONCAT(ret,c);
        END IF;
        SET i = i + 1;
      END;
    UNTIL i > len END REPEAT;
    RETURN ret;
END

Luis noticed that the definitions of this database function as well as others included the clause DEFINER = 'user1'@'localhost'.

By default, the SQL SECURITY characteristic of a function is DEFINER. That means that when the routine is executed, it does so within the security context of the user specified as the DEFINER.

This worked fine for the live site because its database runs under user user1.

However, for the staging site, that user does not have access to the staging database, so it was denied. That is, the user did not have sufficient permissions for executing the function body as applied to the other database.

Deleting the DEFINER clause and reloading the function confirmed that the clause was the problem.

Also, a key factor was that both databases are managed by the same db server, within which user1 exists. If the function definition were imported into a different db server for which user1 does not exist, it could be that the problem would not occur.

Further research uncovered this advice:

"For a stored routine or view, use SQL SECURITY INVOKER in the object definition when possible so that it can be used only by users with permissions appropriate for the operations performed by the object. ".

In other words, set the SQL SECURITY characteristic to INVOKER so that the security context of whichever user invokes the function is in effect. This seems ideal.

So we modified the beginning part of the function definition as follows:

CREATE ~~DEFINER=`user1`@`localhost`~~
FUNCTION `clean_string`(in_str varchar(4096))
RETURNS varchar(4096) CHARSET latin1
SQL SECURITY INVOKER

BEGIN
    ....
END

(Incidentally, the error message "DB Error: unknown error" is apparently output by the Drupal 7 CiviCRM module.)

Sources:

Access Control for Stored Programs and Views
https://dev.mysql.com/doc/refman/5.7/en/stored-programs-security.html

May 28, 2016

Two sites on the same database server - PDOException: SQLSTATE[HY000][1129] Host 'nnn.nnn.nnn.nnn' is blocked because of many connection errors

A web search shows a number of forums and postings about the error message in the title of this post, but our particular problem involved two sites on the same web server and the same database server. One of the sites appeared to repeatedly bring down the other site.

We had had a working live site implemented in Drupal and wanted to add a separate staging site in order to have a more transparent workflow that is less risky.

So for the staging site, a separate Drupal instance was installed and a separate Drupal database was created using the same respective servers as for the live site.

The sites use the CiviCRM module, which has its own civicrm.settings.php file for configuration.

- - - - -

After staging was created, testing of the staging site was showing very odd errors in the form of inconsistent user profiles and such.

With help from the primary developers, I was able to track down that the civicrm.settings.php file was incorrectly specifying the live database for use by the staging site. Instead, it should have been specifying the staging database.

I corrected the configuration in civicrm.settings.php so that staging was accessing its own database via its own database user.

Here are the original constants defined in civicrm.settings.php

// These ip addresses are not the actual ones.
define( 'CIVICRM_UF_DSN', 'mysql://live_db_user:user_password@12.210.138.193/live_db?new_link=true' );

define( 'CIVICRM_DSN', 'mysql://live_db_user:user_password@12.210.138.193/live_db?new_link=true' );

And the revised constants in civicrm.settings.php

// These ip addresses are not the actual ones.
define( 'CIVICRM_UF_DSN', 'mysql://staging_db_user:user_password@12.210.138.193/staging_db?new_link=true' );

define( 'CIVICRM_DSN', 'mysql://staging_db_user:user_password@12.210.138.193/staging_db?new_link=true' );

I then emailed our tester about the changes late at night, without doing any of my own testing. Mea culpa.

That's when things got interesting.

- - - - -

The next morning, both the staging and the live sites were broken with the same error even though I had not changed any code on live.

PDOException: SQLSTATE[HY000][1129] Host '12.210.138.193' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts' in lock_may_be_available() (line 167 of /var/www/oursite.org/httpdocs/includes/lock.inc).

Since the error message stated "unblock with 'mysqladmin flush-hosts' ", I logged onto the database server and did flush-hosts on the live database in order to keep the live site running. Live came back up. And I sighed with relief thinking that was enough.

Then I browsed to the staging site again to troubleshoot it. Now it came up with a different error.

PDOException: SQLSTATE[28000] [1045] Access denied for user 'staging_db_user'@'12.210.138.193' (using password: YES) in lock_may_be_available() (line 167 of /var/www/staging.oursite.org/httpdocs/includes/lock.inc).

A bit later, I browsed to the live site again. And it was broken again.

The pattern was: (1) fix the live site by running the flush-hosts command, (2) browse to the staging site, (3) and the live site is broken again.

Since keeping the live site running was a priority, this was driving me crazy!

Those of you who are more familiar with MySQL than I was may already know what had happened. With hindsight, the issue is so obvious that it makes me reflect on my own thinking processes.

- - - - -

It was my colleague Luis who cracked this. The following explanation is due to him.

The basic problem is that the database user staging_db_user did not have privileges to access the database via the 12.210.138.193 host address (although it did have access through a different ip). So whenever I browsed to the staging site, connection errors would occur.

The number of connection errors would quickly accumulate and exceed the allowed maximum. The relevant variable in MySQL is max_connect_errors, with a default value of something like 1000.

When the maximum number of errors was reached, the MySQL server would then block any further connection requests from that host ip, including for the live site!

This explains why browsing to the staging site would result in access errors for the live site as well. Because of the common ip, the two sites were inadvertently coupled and mutually dependent.

In the end, we decided to use a different host ip address for the staging site to access the database precisely to avoid this cross-site brittleness.

Source:

Troubleshooting Problems Connecting to MySQL
https://dev.mysql.com/doc/refman/5.7/en/problems-connecting.html

Mar 15, 2016

Blocking Chinese and Korean spam for user-submitted content

I help maintain a Drupal 7 site called Power Poetry that is a platform for publishing poetry. It's really a great site with awesome writing that has grown steadily in both the amount of content and the number of page views.

Unfortunately, its popularity has apparently attracted some foreign spam content, particularly in Chinese and Korean.

Our initial take was that being U.S.-centric, we should limit submissions to only English and Spanish. So I did some research into language detection, found a service with an API that looked promising, and did an initial implementation. However, for technical reasons that I never did uncover, that API was not working on our hosting setup (even though it worked on my local system!).

We then changed our focus from language detection to the fact that our current problem was almost entirely due to submissions made in certain scripts (specific human writing systems).

I found that regular expressions in PHP support Unicode character properties. Those property codes include values that designate the Chinese script (which includes both traditional and simplified characters) and the Korean script.

We decided to take a zero-tolerance approach. A single Chinese or Korean character causes the user submission not to validate. So far, implementing this has drastically reduced the amount of inappropriate content and comments.

Here's the code, which is called by our Drupal form validation functions.

/**
* Check whether a string contains any characters from a
* banned script, such as Chinese or Korean.
*
* @param string $text
*   The piece of text to be checked.
*
* @return TRUE | FALSE
*   Returns TRUE if any Chinese or Korean characters detected.
*   Otherwise, returns FALSE.
*/
function _contains_banned_scripts($text) {

$unicode_modifier = 'u';

// Detect whether there are any Chinese characters.
$chinese_regex = '\p{Han}+';
$preg_regex = '/' . $chinese_regex . '/' . $unicode_modifier;
$chinese_found = preg_match($preg_regex, $text);

if ($chinese_found == 1) {
    return TRUE;
}

// Detect whether there are any Korean characters.
$korean_regex = '\p{Hangul}+';
$preg_regex = '/' . $korean_regex . '/' . $unicode_modifier;
$korean_found = preg_match($preg_regex, $text);

if ($korean_found == 1) {
    return TRUE;
}

return FALSE;
}

Sources:

Forum spam
https://en.wikipedia.org/wiki/Forum_spam

Unicode Regular Expressions
http://www.regular-expressions.info/unicode.html

Unicode Character Properties
https://secure.php.net/manual/en/regexp.reference.unicode.php

Feb 24, 2016

An Exercise in Pair Programming

I've spent many years as a software developer working in contented solitude: wrapping my mind around spaghetti code; stepping through a debugger; implementing an algorithm. I'm an introvert, so I'm comfortable being alone.

The past few years, though, I've felt pulled to collaborate more with others and to participate more actively as a member of a team. I've done this by communicating with end users, clients, and non-technical team members as well as doing informal mentoring and teaching.

I've known about pair programming for some time but have never practiced it. So when it turned out that one of my colleagues and I were both learning jQuery and JavaScript, I suggested we do an exercise together via pair programming.

Luis and I have pretty different backgrounds. He is a college student majoring in computer engineering. I'm a very seasoned programmer who later moved into web technologies and have re-entered that field after a long hiatus.

We built a client-side game for a person to scramble and then solve the 15-puzzle (source code in GitHub repo ). This was done over a few weeks during off hours when we were able to schedule snippets of time together.

(Update: play the puzzle )

It went really well, and I want to jot down a few thoughts about what enabled it to be a positive, constructive experience.

* We already had a good working relationship.

Luis and I are both part-time developers at a digital consulting agency. We had worked together before and felt personally compatible, so much so that we've made a point of going into the office on the same days so that we can readily consult with each other.

* We did it as part of a learning exercise.

This was done not as part of our paid work but of our individual learning endeavors. Luis has been using one of the online services, I've been going through a traditional book. This meant that not much was at stake in terms of producing results, which reduced the potential stress level considerably.

* Neither of us dominated.

Although I am far senior in terms of general software experience, I am a novice with respect to JavaScript and jQuery. And most recently, my career path has been a pretty choppy one with a lot of non-technical roads taken and committed to. These factors together with a strong preference to collaborate rather than compete means that I simply treat Luis as a peer.

* We were open to each other's criticism and suggestions.

If you can sufficiently suspend your own preferences, habits, and beliefs about what's "right", it's possible to learn a lot from another developer. Each of us was willing to try coding differently. This ran the gamut from how to do indentation to what data structures to use for the underlying model.

Small example: in jQuery there is consistent and frequent use of anonymous functions that are passed as parameters. Luis suggested giving names to those functions even though the names are never used. Although this appears to go against the most common coding practice, I ended up liking the additional clarity and expression of intent that such unnecessary function names provide.

* Debugging went much faster and more efficiently.

Case in point. Luis realized that certain event handlers were not working because they no longer existed -- the DOM elements they were attached to were being removed and then some of those elements were re-created as replacements, minus their handlers. Stuff like that is obvious when someone else points it out!

For my part, I correctly suspected that a recursive use of a particular jQuery selector was causing the failure of an iterator to affect all intended objects. It's a subtle bug that I still don't understand the root cause of, but my background in programming languages led to a good guess.

* Both of us have an academic background.

This allowed me, for example, to jump into explaining terminology such as "operator precedence" and "operator associativity" without Luis' eyes glazing over. He not only has the intellectual chops, he gets it that taking the time to learn basic concepts and acquire fundamental understanding is worth it.

Doing this exercise has given me momentum and encouragement to look for opportunities to do more pair programming and to expand further my own envelope. Nice!

Resources:

Pair Programming
http://guide.agilealliance.org/guide/pairing.html

Pair Programming Considered Harmful?
http://techcrunch.com/2012/03/03/pair-programming-considered-harmful/

Pair Programming vs. Code Reviews
https://blog.codinghorror.com/pair-programming-vs-code-reviews/

Feb 1, 2016

Unwanted Side Effect of Reducing Permissions on Drupal Text Formats

On a Drupal 7 site, we were getting malicious comments that were triggering undesired page redirects, as I wrote about in Malicious page redirects and Drupal's Filtered-HTML text format.

One cause of those redirects was malicious html using the <meta> tag. This tag was allowed because Authenticated Users had access to the Full HTML text format. So we decided to limit use of Full HTML only to Administrators.

Since then, I have learned that this caused an undesirable side-effect. The symptom was that when a user tried to edit content that they had previously created, in the text area for doing the editing they got the following message.

This field has been disabled because you do not have sufficient permissions to edit it

Very puzzling! Why would a user not be able to edit their own content that they had created?

It turns out that when text content is created, the text format in effect at that time is stored with the content and is applied when editing is done later.

If you limit the roles that have access to a particular text format, then text content created earlier by users who have that role may no longer be editable by those users. They are blocked from making any changes.

On our site, this is not a serious problem. Text content is almost always written and submitted once and not revised further.

* * *

However, a site with content that is actively edited is going to run into a wall. There are a couple of things you might be able to do.

If the number of content items is small, then while logged in as an Administrator or some other role with sufficient permissions, edit each item so that it uses a Text Format that the author has permission to use. In our case, it would have been from Full HTML to Filtered HTML.

If you have a lot of content and you're dangerous enough to hack around directly with the database, then a database global change will also work.

For example, in our case, we have two tables that are relevant. Executing these SQL commands changes the values.

update field_data_body
    set body_format = 'filtered_html'
    where body_format = 'full_html';

update field_revision_body
    set body_format = 'filtered_html'
    where body_format = 'full_html';

(And don't forget to clear cache!)

Thanks to David Needham for a particularly pithy and helpful comment in the second article cited below.

Sources:

Signature box for authenticated users - This field has been disabled because you do not have sufficient permissions to edit it
https://www.drupal.org/node/1034064

This field has been disabled because you do not have sufficient permissions to edit it
https://www.drupal.org/node/1064168

Jan 15, 2016

jQuery.get() Ajax call fails on localhost url

While going through the book Learning jQuery, 4th Edition, I've been doing the exercises by placing and editing code files in a local directory on the desktop and opening index.html as a file in Firefox.

This worked fine until I had to code an Ajax call that referred to a php file. This required the use of a web server, so I created a subdirectory under the local server root and copied the php file into the subdir.

Keeping all other code on the desktop, I then edited the Ajax call so that it requested from the url

http://localhost/learning-jquery/e.php

The call silently failed and returned nothing. Console windows would show nothing, no errors.

Eventually, I found out about the same-origin policy. In brief, this policy means that by default JavaScript is prevented from making requests across domain boundaries, where same-origin means that the protocol, hostname, and port number must be identical.

So I moved all code files into the subdir under the server root. Working from there rather than the desktop, the Ajax calls then succeeded.

(And I changed the requested url back to just e.php)

   ***

But what about carrying out cross-origin access? One possible way is to configure the server to allow it. For Apache on my Linux system, I added the following directive in /etc/apache2/apache2.conf

<Directory /var/www/html/learning-jquery>
Header set Access-Control-Allow-Origin "*"
</Directory>

However, checking this configuration change with the command

# apachectl -t

resulted in

AH00526: Syntax error on line 205 of /etc/apache2/apache2.conf:
Invalid command 'Header', perhaps misspelled or defined by a module not included in the server configuration
Action '-t' failed.
The Apache error log may have more information.

It turns out that I also had to enable a module named headers

# a2enmod headers

Then restart the server. That enabled cross-origin access.

Not yet satisfied, I was wondering if the wildcard "*" in the directive could be replaced by something more restrictive. For the way I'm developing locally, the following also works because opening a file via the file:// protocol results in an origin of null.

<Directory /var/www/html/learning-jquery>
Header set Access-Control-Allow-Origin "null"
</Directory>

   ***

The fact that the calls had failed silently was very troubling. One way to avoid this is to register a global Ajax error handler by calling ajaxError(). Here's a handler that simply puts up an alert box. (I'm using jQuery 1.9)

$(document).ready(
function setAjaxErrorHandler() {
    $(document).ajaxError(
      function alertError(event, jqxhr, settings, thrownErr) {
        alert('Ajax error handler: ' + jqxhr.status
                + ' ' + jqxhr.statusText);
      }
    );
}
);

For the cross-origin error, this puts up the message "Ajax error handler: 0 error", which is not exactly super helpful but better than nothing.

Another way is to attach an error handler to the particular request by using the fail() method. For example,

$.get('http://localhost/learning-jquery/e.php', { ... },
       function (data) { ... })
    .fail(function (jqxhr) {
        alert('GET error: ' + jqxhr.status + ' '
                + jqxhr.statusText);
    });

Sources:

jQuery.get()
https://api.jquery.com/jQuery.get/

Same-origin policy
https://en.wikipedia.org/wiki/Same-origin_policy

Why is CORS important?
http://enable-cors.org/

CORS on Apache
http://enable-cors.org/server_apache.html

Configure Apache To Accept Cross-Site XMLHttpRequests on Ubuntu
https://harthur.wordpress.com/2009/10/15/configure-apache-to-accept-cross-site-xmlhttprequests-on-ubuntu/

Jan 7, 2016

How to Extract Columns from a CSV file (without using a spreadsheet)

One of my colleagues who does data analysis had a CSV (comma-separated values) file that he could not open in Excel nor in any other spreadsheet program he tried.

Either due to its sheer filesize of 125MB, or its number of rows of more than 1,500,000, the programs would gag.

It turned out that for his purposes, he did not need all of the data, only a subset of the columns. Maybe extracting only what he needed into a smaller CSV would enable him to be able to work with it.

I had earlier read parts of the book The Linux Command Line, by William Shotts. I vaguely remembered mention of a utility to selectively pull fields out of a text file. That turned out to be the cut command.

Here's what the first few rows of the original data file looked like.

id,type,distance,userid,charityID,time,lat,lon
1003529743,walk,4.48342,1000545086,2166731,"2015-06-30 00:00:05",40.2501,-76.6714
1003529744,Run,4.087,1000402641,15048,"2015-06-30 00:00:21",45.5244,-89.7398
1003529745,run,2.631135,1000258381,61635018,"2015-06-30 00:00:23",41.6281,-87.193
1003529746,Bike,1.216703,1000505816,18010,"2015-06-30 00:00:24",43.0306,-78.7963
1003529747,walk,2.069664,1000015957,18010,"2015-06-30 00:00:25",39.2481,-76.5165
1003529748,Bike,6.174,1000126350,18010,"2015-06-30 00:00:25",29.5913,-82.4298
1003529749,run,1.47652,1000542869,92985044,"2015-06-30 00:00:26",40.7115,-89.4287

Only the type and userid columns were actually required. Using the following command I was able to generate another CSV with just those two fields.

$ cut -f 2,4 -d, july.csv > type-userid.csv

The -f option specifies which fields to extract. In this case, its the 2nd and the 4th fields. The -d option is the field delimiting character, which in our case is the comma. It defaults to tabs.

july.csv is the input file, type-userid.csv captures the standard-out.

The first few rows of the resulting file were

type,userid
walk,1000545086
Run,1000402641
run,1000258381
Bike,1000505816
walk,1000015957
Bike,1000126350
run,1000542869

And the filesize was reduced to about 24MB, which was usable and much more manageable.

(However, depending on which spreadsheet app you are using, the row count might exceed the limit. For example, both Excel 2007 and LibreOffice 4.2 Calc can handle a maximum of 1,048,576 rows.)

***

The cut command works blazingly fast. It only took about a second to process the input file. After I handed off the outputted file, I later realized that I might have been able to save my colleague some tedium and waiting time by applying other CLI text-processing commands as well, such as sort, uniq, and wc, depending on what he wanted to do.

Source:

The Linux Command Line, by William Shotts
http://linuxcommand.org/tlcl.php
(You can buy the paper book or download the pdf for free)