Excel CSV output using something other than commas

While dealing with the ICD 10 codes, we found that the SQL import process was adding spaces to text fields because of the tab delimiters we were using. A naive attempt to use the CSV format was quickly killed because the ICD 10 descriptions include commas, as I already knew from using J.

Notepad++ choked on the massive find-and-replace operations (I should probably see about submitting a bug). While it would complete the operation, somewhere along the way it would corrupt the file.

While something like dnGREP would probably be the right tool for find/replace in massive files, I instead used Excel 2010 to output the CSV using the pipe character ( | ) as the delimiter.

This is a kind of secret, as Excel doesn’t ask you what delimiter you want to use. Instead, you have to globally change the delimiter for Windows:

1. Open the Control Panel and select “Region and Language”.

2015-07-09 15_16_16-All Control Panel Items

2. Click on “Additional settings…”

2015-07-09 15_17_03-Region and Language

3. Change the “List separator” from a comma ( , ) to the character you want.

2015-07-09 15_16_44-Customize Format

And then remember to do this all again when you’re done, because you probably want your default to stay a comma.

Excel CSV output using something other than commas

ICD 10 code hierarchy, over-helpful Excel, and Jsoftware.com

I recently generated the parent-child and ancestor-progeny data for the ICD-10 diagnostic codes. You can find them on the GitHub repo I set up.

While generating the codes was a pretty straightforward matter of =vlookup() and =left(), I discovered only after exporting from Excel 2010 to a tab-separated text format that Excel or Word had converted my ASCII dash (-) and ellipses (…) into special dash (–) and ellipses (…) characters somewhere along the way.

I was using Notepad++ to review the file generated and I decided to determine just what characters existed in the CDC-provided data. I built up the following PCRE regex one piece at a time until I had the exact list (case insensitive):

[^-0-9A-Z \t\r\n.,\[\]/()'%<>=+%]

Of course, upon reflection, I just realized that J has a great way to find the characters:

/:~ ~. 1!:1 <'C:/Path/ICD10/icd10withHierarchy.txt'
%&'()+,-./0123456789<=>ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyz

view raw
FindNubWithJ.ijs
hosted with ❤ by GitHub

Right after the command is the result, which starts with a tab, line feed, carriage return, and then a space before the percent sign (%).

1!:1 <'filename' is just the notation for reading a file. ~. is the Nub verb, which removes all duplicates from an array (of which a string is a kind). /:~ determines the sort order for an array and applies it to itself.

ICD 10 code hierarchy, over-helpful Excel, and Jsoftware.com