While dealing with the ICD 10 codes, we found that the SQL import process was adding spaces to text fields because of the tab delimiters we were using. A naive attempt to use the CSV format was quickly killed because the ICD 10 descriptions include commas, as I already knew from using J.
Notepad++ choked on the massive find-and-replace operations (I should probably see about submitting a bug). While it would complete the operation, somewhere along the way it would corrupt the file.
While something like dnGREP would probably be the right tool for find/replace in massive files, I instead used Excel 2010 to output the CSV using the pipe character (
| ) as the delimiter.
This is a kind of secret, as Excel doesn’t ask you what delimiter you want to use. Instead, you have to globally change the delimiter for Windows:
1. Open the Control Panel and select “Region and Language”.
2. Click on “Additional settings…”
3. Change the “List separator” from a comma (
, ) to the character you want.
And then remember to do this all again when you’re done, because you probably want your default to stay a comma.
I recently generated the parent-child and ancestor-progeny data for the ICD-10 diagnostic codes. You can find them on the GitHub repo I set up.
While generating the codes was a pretty straightforward matter of
=left(), I discovered only after exporting from Excel 2010 to a tab-separated text format that Excel or Word had converted my ASCII dash (-) and ellipses (…) into special dash (–) and ellipses (…) characters somewhere along the way.
I was using Notepad++ to review the file generated and I decided to determine just what characters existed in the CDC-provided data. I built up the following PCRE regex one piece at a time until I had the exact list (case insensitive):
Of course, upon reflection, I just realized that J has a great way to find the characters:
|| /:~ ~. 1!:1 <'C:/Path/ICD10/icd10withHierarchy.txt'
Right after the command is the result, which starts with a tab, line feed, carriage return, and then a space before the percent sign (%).
1!:1 <'filename' is just the notation for reading a file.
~. is the Nub verb, which removes all duplicates from an array (of which a string is a kind).
/:~ determines the sort order for an array and applies it to itself.