Unix Command comm: Compare Two Files
One lesser-known Unix command is comm, which is far less known than diff. comm needs two already sorted files, FILE1 and FILE2, and prints three columns: lines unique to FILE1, lines unique to FILE2, and lines that appear in both files. The following options suppress individual columns:
-1  suppress column 1 (lines unique to FILE1)
-2  suppress column 2 (lines unique to FILE2)
-3  suppress column 3 (lines that appear in both files)
For example, comm -12 F1 F2 prints all lines common to files F1 and F2.
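As a minimal sketch of the three option combinations, the following creates two small sorted files and compares them. The file contents are made up for illustration, and the scratch directory is only there so that no existing files are touched.

```shell
# Work in a scratch directory so no existing files are touched.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/F1"   # already sorted
printf 'banana\ncherry\ndate\n'  > "$dir/F2"   # already sorted

common=$(comm -12 "$dir/F1" "$dir/F2")    # suppress columns 1 and 2: lines in both files
only_f1=$(comm -23 "$dir/F1" "$dir/F2")   # suppress columns 2 and 3: lines unique to F1
only_f2=$(comm -13 "$dir/F1" "$dir/F2")   # suppress columns 1 and 3: lines unique to F2

printf 'common:\n%s\n' "$common"
printf 'only in F1: %s\n' "$only_f1"
printf 'only in F2: %s\n' "$only_f2"
rm -r "$dir"
```

With this sample data the common lines are banana and cherry, apple is unique to F1, and date is unique to F2.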
I once thought that comm had a bug, so I wrote a short Perl script to simulate the behaviour of comm. Of course, there was no bug; I had simply failed to notice that the records in the two files did not match because of white space.
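The white-space pitfall is easy to reproduce. The sketch below uses invented sample data with one trailing space, and assumes that stripping trailing white space with sed is an acceptable cleanup for the records in question. LC_ALL=C is set so that sort and comm agree on byte-wise ordering.

```shell
LC_ALL=C; export LC_ALL      # make sort and comm use the same byte-wise ordering
dir=$(mktemp -d)
printf 'alpha\nbeta\n'  > "$dir/F1"
printf 'alpha \nbeta\n' > "$dir/F2"    # note the trailing space after "alpha"

# "alpha" and "alpha " are different lines, so only "beta" is common.
before=$(comm -12 "$dir/F1" "$dir/F2")

# Strip trailing white space, then re-sort before comparing again.
sed 's/[[:space:]]*$//' "$dir/F2" | sort > "$dir/F2.clean"
after=$(comm -12 "$dir/F1" "$dir/F2.clean")

printf 'before cleanup:\n%s\n' "$before"
printf 'after cleanup:\n%s\n' "$after"
rm -r "$dir"
```

Before the cleanup only beta is reported as common; afterwards both alpha and beta are.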
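#!/bin/perl -W
# mycomm: print records according to the set of input files they occur in.
use strict;
use Getopt::Std;

# -d: debug output; -s N: bitmask of the files a record must occur in
my %opts = ('d' => 0, 's' => 0);
getopts('ds:',\%opts);
my $debug = ($opts{'d'} != 0);
my $member = defined($opts{'s'}) ? $opts{'s'} : 0;
my ($set,$prev) = (1,"");    # $set: bit of the current file (1, 2, 4, ...)
my %H;                       # record => bitmask of the files containing it

while (<>) {
    $prev = $ARGV if ($prev eq "");
    if ($ARGV ne $prev) {    # next input file: move on to the next bit
        $set *= 2;
        $prev = $ARGV;
    }
    chomp;
    $H{$_} |= $set;
    printf("\t>>\t%s: %s -> %d\n",$ARGV,$_,$H{$_}) if ($debug);
}

# Default without -s: records that occur in all input files
$member = 2*$set - 1 if ($member == 0);
printf("\t>>\tmember = %d\n",$member) if ($debug);
for my $i (sort keys %H) {
    printf("%s\n",$i) if ($H{$i} == $member);
}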
The Perl script above does not need sorted input files, because it stores all records of the files in memory, in a hash. It uses a bitmask as a set: each input file is assigned one bit. For example, mycomm -s2 F1 F2 prints only those records that are in file F2 but not in F1.
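To make the -s values concrete, here is a small sketch of the bitmask arithmetic in shell. The variable names are made up for this illustration, and the third file is hypothetical; the script itself simply doubles the bit for each new input file.

```shell
# Each input file gets one bit, in command-line order: 1, 2, 4, ...
f1=1
f2=$(( f1 * 2 ))    # 2
f3=$(( f2 * 2 ))    # 4 (a hypothetical third file)

only_f2=$f2                          # -s2: records in F2 and in no other file
in_f1_and_f2=$(( f1 | f2 ))          # -s3: records in F1 and F2 and in no other file
in_all_three=$(( f1 | f2 | f3 ))     # -s7: records in all three files

echo "$only_f2 $in_f1_and_f2 $in_all_three"   # prints "2 3 7"
```

Without -s the script selects the mask with all bits set, which is 3 for two files and 7 for three. For sorted input without duplicate lines, mycomm -s2 F1 F2 should therefore match comm -13 F1 F2.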