CSV parsers and wx

Released: Fri, 11 Apr 2014 17:35:00 +0000

Screenshot of ImportCSV so far
The latest Salstat module is ImportCSV. It's designed to be a combination of GUI and CSV parser that can deal with a lot of awkward cases.

It begins with a file open dialog and then displays a more complex dialog with options that let a user select delimiters, etc., before importing data. It also has a nice preview that lets users see what is happening.

The big problem is that CSV is not a strictly standardised format. In fact, given that much data science relies on data sets concatenated from different sources, there's no guarantee that a single file uses one delimiter consistently.

So the parser has to deal with a use case where delimiters are spaces, semi-colons, commas and maybe something else. Plus it has to handle tokens in quotes properly (i.e., a token begun with a double quote should finish on a double quote and not a single quote).

Python's csv module was too restrictive because it only seemed to allow a single delimiter to be defined. Real life is different, so I needed more flexibility. Regexes were getting too complex for me to understand when combined with the quotation mark issues, and the other module, shlex, although useful, didn't cope with non-quoted tokens, which are the primary unit of Salstat's interest.
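As a rough illustration of that single-delimiter restriction, here is a small sketch using only the standard csv module. The test line is my own example of mixed delimiters, and the variable names are just for illustration.

```python
import csv

# A line that mixes commas, spaces and semi-colons as delimiters.
line = "1,2 3.3;4 5,6"

# With a single delimiter, values separated by spaces or semi-colons
# stay glued together inside one field.
rows = list(csv.reader([line], delimiter=","))
print(rows)

# Asking for several delimiter characters at once is rejected,
# because csv expects a 1-character delimiter string.
try:
    list(csv.reader([line], delimiter=",; "))
    multi_ok = True
except TypeError as err:
    multi_ok = False
    print("csv refused the multi-character delimiter:", err)
```

Only the comma is honoured here, so the line comes back as three fields rather than six.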

In short, I had to write my own parser.

And it copes reasonably well with (say) the following file:

Col1, col2 col3; col4   "col5,;", col6
1,2 3.3;4 5,6
6,5 4;3 2.2,1

This tiny but awful monstrosity should make a QA tester weep. There's enough nonsense in there to cause most parsers issues: the csv module won't deal with it without awkward work-arounds because the file uses multiple delimiters; shlex won't cope with the non-quoted characters; and regexes will quickly become unwieldy.
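To make the idea concrete, here is a minimal sketch of a character-by-character splitter that accepts several delimiters and respects quotes. It is not the ImportCSV code from Salstat; the function name split_line and its parameters are hypothetical, and it assumes that runs of adjacent delimiters (such as a comma followed by a space) count as one separator, which means it cannot represent empty fields.

```python
def split_line(line, delimiters=",; ", quote_chars="\"'"):
    """Split a line on any of several delimiters, honouring quoted tokens.

    Assumptions (illustrative only): consecutive delimiters collapse into
    one separator, and a quoted run ends only on the same quote character
    that opened it.
    """
    tokens = []
    current = []        # characters of the token being built
    in_token = False    # True once the current token has started
    open_quote = None   # the quote character that opened the quoted run
    for ch in line:
        if open_quote is not None:
            if ch == open_quote:
                open_quote = None      # matching quote closes the run
            else:
                current.append(ch)     # delimiters inside quotes are literal
        elif ch in quote_chars:
            open_quote = ch
            in_token = True
        elif ch in delimiters:
            if in_token:               # finish the token, skip extra delimiters
                tokens.append("".join(current))
                current = []
                in_token = False
        else:
            current.append(ch)
            in_token = True
    if in_token:
        tokens.append("".join(current))
    return tokens


sample = [
    'Col1, col2 col3; col4   "col5,;", col6',
    "1,2 3.3;4 5,6",
    "6,5 4;3 2.2,1",
]
for row in sample:
    print(split_line(row))
```

Under those assumptions each of the three sample lines splits into six tokens, and the quoted header keeps its comma and semi-colon as col5,; rather than being cut short.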

With this widget, I can specify comma, space and semi-colon as delimiters, and the preview shows the obligatory 6 straight columns of data, but it's not perfect yet. I need to work on the logic further because the quoted token "col5,;" is displayed as just col5 when col5,; should be shown; finishing that should be easy enough, I feel.

Users can also select from the 3 common line endings (\n, \r, \r\n), and the system guesses the most likely one (the most common in the file) by parsing the file. If line endings are mixed, then there will be problems, so I might have more work to do yet.
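The "most common line ending" guess could be sketched roughly as below. This is an assumption about how such a guess might work, not the Salstat code, and guess_line_ending is a hypothetical name. The text should be read without newline translation (for example with open(path, newline="")), otherwise Python converts the endings before they can be counted.

```python
def guess_line_ending(text):
    """Return (most common line ending or None, counts per ending)."""
    crlf = text.count("\r\n")
    # A \r\n pair contains one \r and one \n, so subtract the pairs
    # to count only the stand-alone characters.
    lf = text.count("\n") - crlf
    cr = text.count("\r") - crlf
    counts = {"\r\n": crlf, "\n": lf, "\r": cr}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        return None, counts    # no line endings found at all
    return best, counts


ending, counts = guess_line_ending("a,b\r\nc,d\r\ne,f\ng,h")
print(repr(ending), counts)
```

The returned counts also show whether a file mixes endings: more than one non-zero entry means the guess may not suit every line.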

The entire widget has some way to go: the explanatory text, fixed-width files, the header row and the line to begin importing from all still need work; but I think most of it has been done now.

The code is on GitHub: https://github.com/salmoni/Salstat

Salstat is an open source project fostered by Thought Into Design Ltd
Thought Into Design Ltd is registered in England and Wales (Companies House number 7367421)