
Nexosis @ Work & Play

5 Text Transforming Techniques for Total Techies

February 10, 2018 @ 10:52 AM | Musings, Technical

Guy shares some tools and techniques he uses to manipulate large amounts of data.

Text is the stock-in-trade of any developer. We spend significant portions of our lives manipulating, transforming, creating, and destroying text. A lot of this is code, but just as often we get some file full of data that isn't quite what we need. Here are some tools and techniques I use to manipulate large amounts of data.

1. Good old-fashioned search and replace

We all know this trick. If you have a text editor then you've done this. I usually use Visual Studio Code. Just search for some text, enter some replacement text, and select replace all.

This can be used a little less conventionally than one might think. I've used it to get rid of weird delimiters before, such as when some programmer decided to create a pipe (|) delimited file.

Gozer|Destructor|Gozerian LLC
Zuul|Gatekeeper|Gozerian LLC
Vinz Clortho|Keymaster|Gozerian LLC

Just search and replace all the pipes with the delimiter you need (like a comma) and you're good to go.

Gozer,Destructor,Gozerian LLC
Zuul,Gatekeeper,Gozerian LLC
Vinz Clortho,Keymaster,Gozerian LLC
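
If the file is too big to open comfortably in an editor, the same replace-all can be scripted. Here's a minimal Python sketch, assuming the data never contains a literal comma or a pipe inside a field (the variable names are just for illustration):

```python
# Sample pipe-delimited records from the example above.
pipe_delimited = """Gozer|Destructor|Gozerian LLC
Zuul|Gatekeeper|Gozerian LLC
Vinz Clortho|Keymaster|Gozerian LLC"""

# str.replace swaps every occurrence, just like "replace all" in an editor.
comma_delimited = pipe_delimited.replace("|", ",")

print(comma_delimited)
```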

You can also search and replace on newlines. This works well when newlines are being used to delimit fields in a record such as in a vCard.

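BEGIN:VCARD
FN:Gozer
TITLE:Destructor
ORG:Gozerian LLC
END:VCARD
BEGIN:VCARD
FN:Zuul
TITLE:Gatekeeper
ORG:Gozerian LLC
END:VCARD
BEGIN:VCARD
FN:Vinz Clortho
TITLE:Keymaster
ORG:Gozerian LLC
END:VCARD

Just replace the newlines with your delimiter of choice:

BEGIN:VCARD,FN:Gozer,TITLE:Destructor,ORG:Gozerian LLC,END:VCARD,BEGIN:VCARD,FN:Zuul,TITLE:Gatekeeper,ORG:Gozerian LLC,END:VCARD,BEGIN:VCARD,FN:Vinz Clortho,TITLE:Keymaster,ORG:Gozerian LLC,END:VCARD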

Then replace ,END:VCARD, with a newline:

BEGIN:VCARD,FN:Gozer,TITLE:Destructor,ORG:Gozerian LLC
BEGIN:VCARD,FN:Zuul,TITLE:Gatekeeper,ORG:Gozerian LLC
BEGIN:VCARD,FN:Vinz Clortho,TITLE:Keymaster,ORG:Gozerian LLC,END:VCARD

Replace BEGIN:VCARD, (including its trailing comma), FN:, TITLE:, and ORG: with an empty string:

Gozer,Destructor,Gozerian LLC
Zuul,Gatekeeper,Gozerian LLC
Vinz Clortho,Keymaster,Gozerian LLC,END:VCARD

And manually remove the last END:VCARD:

Gozer,Destructor,Gozerian LLC
Zuul,Gatekeeper,Gozerian LLC
Vinz Clortho,Keymaster,Gozerian LLC
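
The whole vCard sequence can also be scripted as a chain of plain string replacements. This is only a sketch: the input is inferred from the intermediate results shown above, it assumes every card has exactly the FN, TITLE, and ORG fields in that order, and the variable names are made up for illustration.

```python
# Input inferred from the intermediate results above (an assumption).
vcards = """BEGIN:VCARD
FN:Gozer
TITLE:Destructor
ORG:Gozerian LLC
END:VCARD
BEGIN:VCARD
FN:Zuul
TITLE:Gatekeeper
ORG:Gozerian LLC
END:VCARD
BEGIN:VCARD
FN:Vinz Clortho
TITLE:Keymaster
ORG:Gozerian LLC
END:VCARD"""

# Step 1: replace newlines with the delimiter.
text = vcards.replace("\n", ",")

# Step 2: turn each record boundary into a newline.
text = text.replace(",END:VCARD,", "\n")

# Step 3: replace the record marker and field labels with an empty string.
for label in ("BEGIN:VCARD,", "FN:", "TITLE:", "ORG:"):
    text = text.replace(label, "")

# Step 4: remove the last END:VCARD, which has no trailing comma to match on.
leftover = ",END:VCARD"
if text.endswith(leftover):
    text = text[: -len(leftover)]

print(text)
```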

2. Regular expressions

Perhaps you've heard the saying about regular expressions.

"I had a problem that I tried to solve with regular expressions. Now I have two problems."

Well, don't let the haters get to you. Regexes are super powerful and I use them to transform text regularly. You can use them to do basic search and replace, matching on a regex instead of a string, but the real power comes in when you match strings within what you are searching for and insert them into the replacement string.

Let's say you have the following data and you want to extract the first and last name as separate fields, keep the title, and get rid of the organization name. Sounds hard, right? Nope.

Gozer Gozerian,Destructor,Gozerian LLC
Zuul Zuulian,Gatekeeper,Gozerian LLC
Vinz Clortho,Keymaster,Gozerian LLC

The following regex will match all the records above and, because of the groups I've added with parentheses, it will extract the first name, last name, and title.

 ^([a-zA-Z]+) ([a-zA-Z ]+),([a-zA-Z ]+),[a-zA-Z ]+$

Now, we can replace it with text that uses the matched groups. Just use $1 for the first group, $2 for the second, and so on. So, if the replacement text is $2,$1,$3 then we will get the following transformed text:

Gozerian,Gozer,Destructor
Zuulian,Zuul,Gatekeeper
Clortho,Vinz,Keymaster

Nifty! You can use this trick to parse apart formatted strings like phone numbers or dates as well. Replace (\d{3})(\d{3})(\d{4}) with ($1) $2-$3 to neatly format a phone number from 8005552368 to (800) 555-2368. Convert a month-first date of 06/08/1984 to an ISO date of 1984-06-08 by replacing (\d{2})\/(\d{2})\/(\d{4}) with $3-$1-$2.
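
The same group-and-replace trick works outside an editor. Here's a rough Python sketch using the standard re module. Note that Python writes group references as \1, \2, \3 rather than the $1, $2, $3 an editor like Visual Studio Code uses, and the variable names are only for illustration.

```python
import re

records = """Gozer Gozerian,Destructor,Gozerian LLC
Zuul Zuulian,Gatekeeper,Gozerian LLC
Vinz Clortho,Keymaster,Gozerian LLC"""

# Same pattern as above: first name, last name, title, then the organization.
pattern = r"^([a-zA-Z]+) ([a-zA-Z ]+),([a-zA-Z ]+),[a-zA-Z ]+$"

# re.MULTILINE makes ^ and $ match at each line, the way an editor search does.
reordered = re.sub(pattern, r"\2,\1,\3", records, flags=re.MULTILINE)
print(reordered)

# Phone number: three groups of digits rearranged with punctuation.
phone = re.sub(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", "8005552368")
print(phone)

# Date: month/day/year reordered to year-month-day.
iso_date = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", "06/08/1984")
print(iso_date)
```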

3. Multiple cursors

A lot of text editors, including Visual Studio Code and Atom, support multiple cursors. Select some text and press Command+D or Control+D, and the next occurrence of that text will be selected as well. In Visual Studio Code, Command+Shift+L or Control+Shift+L will select all the text that matches what is currently selected. Other text editors have similar features. Hunt around a bit and you'll find them.

Once you have all this stuff selected, you have a cursor at every selected place. If you type, all the cursors will spit out text. If you delete, all the cursors will delete. If you move the cursor to the left, they will all move to the left.

This works really well with the search feature, especially if you use regular expressions. You can search for some text and then select all the text that matches. You can use this to select the end of each line in a file by searching for newlines or by searching for the end of a line with a regex of $. You can select the beginning of every line using a regex of ^. Or perhaps every line that starts with Venkman by searching for ^Venkman.*$
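
Multiple cursors are an editor feature, but the line anchors behave the same way in code. As a rough Python analogy (the sample lines are made up for illustration), re.MULTILINE lets ^ match at the start of every line, so a substitution on ^ acts like typing with a cursor at the beginning of each line:

```python
import re

# Made-up sample text, purely for illustration.
lines = """Venkman, Peter
Stantz, Ray
Venkman's slime sample
Spengler, Egon"""

# Find every line that starts with Venkman.
venkman_lines = re.findall(r"^Venkman.*$", lines, flags=re.MULTILINE)
print(venkman_lines)

# "Type" a prefix at the beginning of every line at once.
prefixed = re.sub(r"^", "- ", lines, flags=re.MULTILINE)
print(prefixed)
```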

4. Manipulating JSON

JSON data is already pretty useful but it's not always in the structure you want. You might want to flatten a hierarchy, rename some fields, or get rid of some. Your first instinct might be to just write some code to do this but, before you do, check out jq.

jq is a JSON query tool. It works kind of like sed but specifically for JSON files. I'm not going to go into the details here, as the jq website does a much better job of explaining how to use it. But, if you're looking to fix some JSON, this is definitely the tool to use.
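
For comparison, here is roughly what the hand-written version of that kind of reshaping looks like. This is Python, not jq syntax, and the record layout and field names are invented for illustration. It flattens a nested object, renames a field, and drops the rest:

```python
import json

# Hypothetical input; the structure and field names are assumptions.
raw = """[
  {"id": 2, "name": "Zuul", "job": {"title": "Gatekeeper", "employer": "Gozerian LLC"}}
]"""

def reshape(record):
    """Return a flat record containing only the fields we want to keep."""
    return {
        "full_name": record["name"],      # rename a field
        "title": record["job"]["title"],  # flatten the nested hierarchy
        # "id" and "employer" are dropped simply by not copying them
    }

reshaped = [reshape(record) for record in json.loads(raw)]
print(json.dumps(reshaped, indent=2))
```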

5. Code it

Sometimes, you just gotta write code. It's OK. I'm not judging. I do it sometimes too. My personal go-to language for this sort of stuff is Ruby as it has a fairly full-featured set of string manipulation functions and I can get it up and running quickly.
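
One case where code earns its keep is when plain search and replace would corrupt the data. In this sketch the organization name contains a comma (a made-up variation on the earlier sample), so the output needs quoting. Python stands in for Ruby here only as an illustration, using its standard csv module:

```python
import csv
import io

# Hypothetical pipe-delimited input where one field contains a comma.
source = io.StringIO(
    "Gozer|Destructor|Gozerian, LLC\n"
    "Zuul|Gatekeeper|Gozerian, LLC\n"
)
output = io.StringIO()

reader = csv.reader(source, delimiter="|")
# The writer quotes any field that contains the output delimiter.
writer = csv.writer(output, lineterminator="\n")

for name, title, org in reader:
    writer.writerow([name, title, org])

result = output.getvalue()
print(result)
```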

I'm sure you've got lots of tricks and techniques you use as well. I'd love to hear about them. Share them on our community discussion board.


Guy Royse

Guy is one of our developer evangelists at Nexosis. He spends his days sharing with developers why our API is so great and his nights reminiscing about Hogwarts and dreaming of retiring to his dream job: Santa Claus.