AWK - Merging two files | ziccardi home

25th Jul 2022

AWK - Merging two files

ACME Corporation is creating its new CRYPTO exchange platform. They receive crypto data in CSV format where the first column is the crypto coin's short name and the second column is the value in US dollars. The separator is ;. For their internal needs, it is required that new output is produced where the short crypto name is replaced with the human-readable name and the separator is replaced with a space.

Wile E. is in charge of such implementation.
Below are the input files:

human-readable.csv

BTC;Bitcoin
ETH;Ethereum
USDT;Tether
USDC;USD Coin
BNB;Binance Coin
BUSD;Binance USD
ADA;Cardano

current-exchange.csv

BTC;19280
ETH;1044
USDT;1
USDC;1
BNB;220.18
BUSD;1
ADA;0.4139

Wile: "Ah.... what a boring task..." After 30 minutes, he provides the team with the following go code that satisfies the requirement:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func check(e error) {
    if e != nil {
        panic(e)
    }
}

func fileToStringArray(path string) ([]string, error) {
    file, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    var lines []string
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }
    return lines, scanner.Err()
}

func stringAryToMap(ary []string) map[string]string {
    res := map[string]string{}
    for _, s := range ary {
        keyValuePair := strings.Split(s, ";")
        res[keyValuePair[0]] = keyValuePair[1]
    }
    return res
}

// First parameter: path to the file containing the coin's human-readable name
// Second parameter: path to the file containing the coin values in US dollars
func main() {
    // No sanity checks for readability
    descPath := os.Args[1]
    valuesPath := os.Args[2]
    descriptionsAry, err := fileToStringArray(descPath)
    check(err)
    valuesAry, err := fileToStringArray(valuesPath)
    check(err)

    descMap := stringAryToMap(descriptionsAry)

    for _, s := range valuesAry {
        coinValuePair := strings.Split(s, ";")
        fmt.Printf("%s %s\n", descMap[coinValuePair[0]], coinValuePair[1])
    }
}

Wile: "Ok. I think I deserve a coffee"
Wile is going to go when he receives an email from "beep.beep@acme.org":

Subject: What a waste of time!

That code is a waste of time. With AWK I could do the same with one line

Wile gets quite mad and answers back:

Subject: Re: What a waste of time!

Ok, I challenge you to make it with a one-liner then!

After a few seconds, Wile receives another email from Beep Beep:

Subject: Re: Re: What a waste of time!

BEGIN{FS=";"} NR==FNR{a[$1]=$2;next} {$1=a[$1]}1

Wile starts laughing and sends another email to Beep Beep:

Subject: Re: Re: Re: What a waste of time!

What happened? The cat walked on your keyboard?

But an answer arrives immediately:

Subject: Re: Re: Re: Re: What a waste of time!

Ah ah... Funny. Then I guess my cat outsmarted you.
Try to believe:
awk 'BEGIN{FS=";"} NR==FNR{a[$1]=$2;next} {$1=a[$1]}1' human-readable.csv current-exchange.csv

Wile runs the provided command and doesn't believe his eyes. It works! So he asks immediately to Beep Beep to come and explain how that magic happened.

Explanation

Before we can go into explaining the details we will describe the meaning of the awk keywords we used:

BEGIN: designates a block that is executed one time before awk starts reading the input(s). Easy to guess, there is an END block too that is executed after the inputs are processed.
FS: awk automatically splits each line of the input. The line is split using the Field Separator. The value of the Field Separator can be changed/read by setting the value of FS
NR: stands for 'Number of Records' and it is the total count of lines awk has processed so far in the current execution.
FNR: it is the count of lines awk has processed so far in the current file. The difference with NR is that if awk receives more than one input, FNR starts again from 0 on each new file, while NR is never resetted
next: it can be seen as continue in Java/C++/Go for loop. It immediately passes to the next line of input
$x: each of the fields resulting from the record split is saved into one of the $1, $2, $3...$n variables in the order they are found. $0 is the whole line.

So, here is how BEGIN{FS=";"} NR==FNR{a[$1]=$2;next} {$1=a[$1]}1 works:

It sets the field separator as ';'
Loops through all the lines of the first file (human-readable.csv) and creates a map called a where the key is the first column of the file, while the value is the second column
After the map is ready, loops through all the lines of the second file and replaces the value of the first column (the short crypto name) with the value of its corresponding description from the map
Print the line

I already hear your thoughts: "Ok, nice, but I still don't understand how it does those things".

Let's discover that step by step.

Step 1 - Do different elaborations for the first and second file

How can we accomplish this? Here NR and FNR will help us.
Since NR will never reset, while FNR will reset on any new file, we will have that, when processing the first file, NR and FNR will have the same value for each line of the input!
So, we can write some code like this:

#!/usr/bin/awk -f

BEGIN{
    FS=";"
}
{
    if (NR == FNR) {
        print "FIRST FILE: "$0
    } else {
        print "SECOND FILE: "$0
    }
}

How can we do that with a one-liner? AWK has the following syntax: 'condition {action}'. condition is much similar to C++: any non zero value is true.
By following that syntax, we can try:

$ awk 'BEGIN{FS=";"} NR==FNR { print "FIRST FILE: "$0 } { print "SECOND FILE: "$0 }' human-readable.csv current-exchange.csv

FIRST FILE: BTC;Bitcoin
SECOND FILE: BTC;Bitcoin
FIRST FILE: ETH;Ethereum
SECOND FILE: ETH;Ethereum
FIRST FILE: USDT;Tether
SECOND FILE: USDT;Tether
FIRST FILE: USDC;USD Coin
SECOND FILE: USDC;USD Coin
FIRST FILE: BNB;Binance Coin
SECOND FILE: BNB;Binance Coin
FIRST FILE: BUSD;Binance USD
SECOND FILE: BUSD;Binance USD
FIRST FILE: ADA;Cardano 
SECOND FILE: ADA;Cardano 
SECOND FILE: BTC;19280
SECOND FILE: ETH;1044
SECOND FILE: USDT;1
SECOND FILE: USDC;1
SECOND FILE: BNB;220.18
SECOND FILE: BUSD;1
SECOND FILE: ADA;0.4139

Ok, it is not what we were expecting. The reason is that the second block of code has no condition, so it is always executed. We can easily solve that by adding a next keyword so that awk will immediately jump to the next line:

$ awk 'BEGIN{FS=";"} NR==FNR { print "FIRST FILE: "$0; next } { print "SECOND FILE: "$0 }' human-readable.csv current-exchange.csv
FIRST FILE: BTC;Bitcoin
FIRST FILE: ETH;Ethereum
FIRST FILE: USDT;Tether
FIRST FILE: USDC;USD Coin
FIRST FILE: BNB;Binance Coin
FIRST FILE: BUSD;Binance USD
FIRST FILE: ADA;Cardano 
SECOND FILE: BTC;19280
SECOND FILE: ETH;1044
SECOND FILE: USDT;1
SECOND FILE: USDC;1
SECOND FILE: BNB;220.18
SECOND FILE: BUSD;1
SECOND FILE: ADA;0.4139

Now it looks way better!

Step 2 - Create a map with the values from the first file

Let's modify our script now so that it creates the map. That's quite easy:

#!/usr/bin/awk -f

BEGIN{
        FS=";"
}
{
    if (NR == FNR) {
        a[$1]=$2
        printf("a[%s]=%s\n", $1, $2)
    } else {
        print "SECOND FILE: "$0
    }
}

Let's do the same thing with a one-liner:

$ awk 'BEGIN{FS=";"} NR==FNR { a[$1]=$2; printf("a[%s]=%s\n", $1, $2); next } { print "SECOND FILE: "$0 }' human-readable.csv current-exchange.csv
a[BTC]=Bitcoin
a[ETH]=Ethereum
a[USDT]=Tether
a[USDC]=USD Coin
a[BNB]=Binance Coin
a[BUSD]=Binance USD
a[ADA]=Cardano 
SECOND FILE: BTC;19280
SECOND FILE: ETH;1044
SECOND FILE: USDT;1
SECOND FILE: USDC;1
SECOND FILE: BNB;220.18
SECOND FILE: BUSD;1
SECOND FILE: ADA;0.4139

Outstanding! Now we have our map! We can proceed to the next step!

Step 3 - Replacing crypto short name with the descriptive one

Again, let's start with updating the script:

#!/usr/bin/awk -f

BEGIN{
        FS=";"
}
{
    if (NR == FNR) {
        a[$1]=$2
    } else {
        # replace the first column with the mapping
        $1=a[$1]
        print $0
    }
}

And now let's do the same with a one-liner:

$ awk 'BEGIN{FS=";"} NR==FNR { a[$1]=$2; next } { $1=a[$1]; print }' human-readable.csv current-exchange.csv
Bitcoin 19280
Ethereum 1044
Tether 1
USD Coin 1
Binance Coin 220.18
Binance USD 1
Cardano  0.4139

That's it! We now have our output with just a few keystrokes!

However, the original code from Beep Beep was a bit shorter:

Our code:    BEGIN{FS=";"} NR==FNR{a[$1]=$2;next} {$1=a[$1];print }
Beep Beep's: BEGIN{FS=";"} NR==FNR{a[$1]=$2;next} {$1=a[$1]}1

How does that work? If you remember at the beginning of the post, I gave a piece of very important information. AWK syntax is as below:

condition {action}

There is now another very important info to know:

when the condition is not specified, it defaults to true
when action is not specified, it defaults to print $0.

So let's analyse now Beep Beep's code:

condition	action
NR==FNR	{ a[$1]=$2; next }
1 (implicit)	{$1=a[$1]}
1	print $0 (implicit)

The magic is finally explained: by adding that 1 (always true condition) at the end of the AWK one-liner, Beep Beep forced the execution of the default action!

Recap

AWK is very powerful, but sometimes it can look cumbersome. Let's split the one-liner piece by piece and recap what is the meaning of each piece
BEGIN{FS=";"} NR==FNR { a[$1]=$2; next } { $1=a[$1]}1

BEGIN{FS=";"} Action: before starting the elaboration, change the Field Separator to ';'
NR==FNR Condition: this is true only when the total count of lines elaborated by AWK (NR) is the same as the count of lines elaborated by AWK in the current file (NFR). This happens only on the first file, ie, the following code block is executed only on the first file. This is the condition for the following-up action.
{ a[$1]=$2; next } Action: map the value of the first column ($1) to the value of the second column ($2) then interrupt the process and move to the next line.
{ $1=a[$1]} Action: replace the content of the first column with its mapped value from the a map
1 Condition: this condition is always true. Since no action for this is specified, the default action is executed (print $0) and prints the whole line

Conclusions

When you need to manipulate text, awk and sed are your friends and can make you save a lot of time. It is worth spending some time studying how they work.

Previous Next