Two Ways To Get Started With diff For Bioinformatics: Using Python And Also Excel To Output A Bash Shell Script With Pairwise File Comparisons

Comparing two files is an essential and valuable process in any industry, including bioinformatics, and there are many reasons why a bioinformatician would want to do it reproducibly so that others can interpret, understand and reuse the results in the future. For example, when two bioinformaticians write a script together without using git, finding the changes each one contributed is important for understanding the history of the code project. Another useful scenario is quickly understanding the differences between the analysis files produced for the same project when it was analyzed by two different bioinformaticians in two different directories.

diff is a Unix command-line tool that compares the contents of two files or directories and prints their differences. More feature-rich alternatives to diff exist, which is appropriate given how data have evolved since diff was conceived in the 1970s, but they typically have to be installed, which can require admin privileges, while diff is built in to Unix.

Comparing two files using diff is very intuitive and simple:

> diff File_A.txt File_B.txt

The output shows the differences between the two files and can be redirected into a file with a descriptive, documentable name such as:

> diff File_A.txt File_B.txt > File_A_File_B_Comparison.txt

or, to be more pedantic and documentable, the current date (and also the bioinformatician's initials, etc.) can be added to the output file name as follows:

> diff_outfile="File_A_File_B_Comparison_$(date +%Y%m%d).txt"
> diff File_A.txt File_B.txt > "$diff_outfile"

The output of diff shows the lines where the two files differ.

> cat File_A.txt
The diff command in UNIX is used to compare files line-by-line. The 
utility comes with a variety of flags that can modify its behavior.

> cat File_B.txt
The diff command in UNIX is used to compare files line-by-line. The 
utility comes with a variety of flags that can modify its behavior

> diff File_A.txt File_B.txt > File_A.txt_File_B.txt_20230827.txt

> cat File_A.txt_File_B.txt_20230827.txt
2c2
< utility comes with a variety of flags that can modify its behavior.
---
> utility comes with a variety of flags that can modify its behavior

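Here 2c2 means that line 2 of the first file was changed to produce line 2 of the second file. Lines marked < come from File_A.txt and lines marked > come from File_B.txt, so we can see that the difference between the two files is the ending period.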
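The same comparison can also be sketched in Python with the standard library's difflib module. This is only an illustration, not a replacement for diff: the two example files are written here as in-memory lists of lines, and the output uses the unified format, so lines from the first input are prefixed with - and lines from the second with + rather than < and >.

```python
import difflib

# The two example files, held as lists of lines instead of files on disk
file_a = [
    "The diff command in UNIX is used to compare files line-by-line. The \n",
    "utility comes with a variety of flags that can modify its behavior.\n",
]
file_b = [
    "The diff command in UNIX is used to compare files line-by-line. The \n",
    "utility comes with a variety of flags that can modify its behavior\n",
]

# unified_diff yields two header lines, a hunk marker, then the context
# and changed lines ("-" from the first input, "+" from the second)
diff_lines = list(
    difflib.unified_diff(file_a, file_b, fromfile="File_A.txt", tofile="File_B.txt")
)
print("".join(diff_lines), end="")
```

To compare real files, the two lists could be read with readlines() from the open files instead.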

To use diff to compare two directories the following command is used:

> diff -r Directory_A Directory_B

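With the -r flag, diff descends recursively into subdirectories and reports the pairwise differences between files that have the same name in both directories. Files that exist in only one of the two directories are reported as "Only in".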
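For a quick overview of which same-named files differ before reading the full diff output, a similar recursive comparison can be sketched in Python with os.walk and filecmp. The function names list_files and compare_dirs are hypothetical, chosen only for this illustration, and the sketch reports which files differ rather than the line-level differences that diff prints.

```python
import filecmp
import os

def list_files(top):
    """Return every file under top as a path relative to top."""
    found = set()
    for root, _dirs, files in os.walk(top):
        for name in files:
            found.add(os.path.relpath(os.path.join(root, name), top))
    return found

def compare_dirs(dir_a, dir_b):
    """Return (differing, only_in_a, only_in_b) as sorted relative paths."""
    files_a, files_b = list_files(dir_a), list_files(dir_b)
    differing = [
        rel
        for rel in sorted(files_a & files_b)
        # shallow=False compares file contents, not just size and timestamp
        if not filecmp.cmp(os.path.join(dir_a, rel), os.path.join(dir_b, rel), shallow=False)
    ]
    return differing, sorted(files_a - files_b), sorted(files_b - files_a)
```

Calling compare_dirs("Directory_A", "Directory_B") would return three lists: same-named files whose contents differ, files found only in Directory_A, and files found only in Directory_B.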

There are times, however, when a bioinformatician wants to compare two directories whose file names are similar but not identical, such as File_A.txt and File_A_1.txt. On its own, diff cannot compare these file pairs; a script is needed to create the appropriate pairwise file relationships that serve as the inputs to diff.

The following are two of many possible methods for generating these pairwise file relationships: writing a Python script that uses os and outputs a bash script, or using Microsoft Excel to organize the files visually and write a bash script.

Python Script to Output A Bash Script

The following is a Python script (also found here on GitHub) that I wrote for this post. It loops through the file paths and, for each file in Directory A, finds the file to compare it with in Directory B:

import os

# Minimal grep: return the items of list l that contain the substring s
def grep(l, s):
    return [i for i in l if s in i]

# Return the paths of all *.txt files under a given directory path
def get_filepaths_from_directory(loc_dir):
    paths = []
    # Find every *.txt file in the directory and all of its subdirectories
    for root, dirs, files in os.walk(loc_dir):
        for file in files:
            if file.endswith(".txt"):
                paths.append(os.path.join(root, file))
    return paths

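loc_dirA = r'/Users/user/Documents/herlog/Directory_A'
path_dirA = loc_dirA
# File names at the top level of Directory A, sorted for a repeatable order
files_dirA = sorted(os.listdir(path_dirA))

loc_dirB = r'/Users/user/Documents/herlog/Directory_B'
path_dirB = loc_dirB
# File names at the top level of Directory B, sorted for a repeatable order
files_dirB = sorted(os.listdir(path_dirB))

# Full paths of every *.txt file found under each directory
paths_dirA = get_filepaths_from_directory(path_dirA)
paths_dirB = get_filepaths_from_directory(path_dirB)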

print("#!/bin/bash")

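for fileA in files_dirA:
    # File_A.txt -> File_A
    fileA_without_extension = fileA.split(".")[0]
    found_dirA_full_path = grep(paths_dirA, fileA)
    # Directory B file names that contain the Directory A base name
    found_dirB_files = grep(files_dirB, fileA_without_extension)
    if not found_dirA_full_path or not found_dirB_files:
        continue  # skip entries that are not *.txt files or have no partner in Directory B
    found_dirB_file = found_dirB_files[0]  # keep the first match
    found_dirB_full_path = grep(paths_dirB, found_dirB_file)
    if not found_dirB_full_path:
        continue  # the matching name in Directory B is not a *.txt file
    print("diff ", found_dirA_full_path[0], " ", found_dirB_full_path[0], " > ", fileA, "_", found_dirB_file, "_$(date +%Y%m%d).txt", sep = "")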

The output of this Python script is as follows:

#!/bin/bash
diff /Users/user/Documents/herlog/Directory_A/File_A.txt /Users/user/Documents/herlog/Directory_B/File_A_1.txt > File_A.txt_File_A_1.txt_$(date +%Y%m%d).txt
diff /Users/user/Documents/herlog/Directory_A/File_B.txt /Users/user/Documents/herlog/Directory_B/File_B_3.txt > File_B.txt_File_B_3.txt_$(date +%Y%m%d).txt
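One design point worth noting: the script pairs files by substring match and keeps the first hit, so File_A would also match a hypothetical File_AB_1.txt if that name happened to come first. A stricter pairing rule might look like the sketch below, which assumes that related files share a base name followed by an underscore. The function name pair_files and the File_AB_1.txt example are hypothetical and not part of the original script.

```python
import os

def pair_files(files_a, files_b):
    """Pair each Directory A file with the Directory B files that share its base name."""
    pairs = []
    for file_a in sorted(files_a):
        stem_a, ext_a = os.path.splitext(file_a)  # "File_A.txt" -> ("File_A", ".txt")
        for file_b in sorted(files_b):
            stem_b, ext_b = os.path.splitext(file_b)
            same_type = ext_a == ext_b
            # Accept an identical base name, or the base name followed by "_"
            related = stem_b == stem_a or stem_b.startswith(stem_a + "_")
            if same_type and related:
                pairs.append((file_a, file_b))
    return pairs

print(pair_files(["File_A.txt", "File_B.txt"],
                 ["File_A_1.txt", "File_AB_1.txt", "File_B_3.txt"]))
# [('File_A.txt', 'File_A_1.txt'), ('File_B.txt', 'File_B_3.txt')]
```

Unlike the first-match approach, this returns every Directory B file that qualifies, so one file in Directory A can produce several comparisons.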

After saving the output as an executable bash shell script and running it, the pairwise comparisons are made by diff and saved into appropriately named text files, which can be zipped and saved or emailed as an attachment.

File_A.txt_File_A_1.txt_20230827.txt
File_B.txt_File_B_3.txt_20230827.txt
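Instead of copying the printed output by hand, the generated lines could be written to a file and marked executable from Python. The sketch below assumes the lines have been collected in a list; the function name write_executable_script and the file name run_diffs.sh are hypothetical.

```python
import os
import stat

def write_executable_script(lines, script_path):
    """Write bash lines to script_path and add the owner-execute permission."""
    with open(script_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    mode = os.stat(script_path).st_mode
    os.chmod(script_path, mode | stat.S_IXUSR)  # same effect as chmod u+x
    return script_path
```

For example, write_executable_script(["#!/bin/bash", "diff File_A.txt File_A_1.txt"], "run_diffs.sh") would create a script that can then be run as ./run_diffs.sh.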

Using Excel to Output A Bash Script

Sometimes curating files visually in Excel suits the situation better than a loop that exhaustively generates a pairwise comparison for every file. Excel is also a great way for new and transitioning bioinformaticians to see how a repetitive command is built.

First generate the list of files in Directory A:

> cd Directory_A
> ls -1 #Output is one column with the -1 flag

Create and fill a new spreadsheet in Microsoft Excel as follows:

Figure: An Excel spreadsheet to generate pairwise comparisons using diff and unique file name piped outputs.

Copy the ls output into column C of the new spreadsheet. Generate the list of files in Directory B and copy that ls output into column E. Fill in the remaining columns by using Control+D to fill the same values down the rows. Column G is equal to column C, and column I is equal to column E. Fill in columns B and D by running pwd in each directory, and be sure to end each file path with /.

Once the spreadsheet is filled in, save the Excel file as a tab-delimited file. Replace the tabs with spaces, then replace "/ " with "/" so that each directory path joins directly onto its file name. Rename the file so that it ends in .sh and make it executable with chmod. The bash script should now run and produce the same output files as the Python script.
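The tab and slash clean-up can also be done in Python rather than in a text editor. The sketch below is a minimal illustration that removes the space left after each trailing /. It assumes a simplified row that holds only the diff command, the two directory paths and the two file names; the function name tsv_row_to_command is hypothetical, and the real spreadsheet has further columns for the output file name.

```python
def tsv_row_to_command(row):
    """Turn one tab-delimited spreadsheet row into a bash command line."""
    command = row.rstrip("\r\n").replace("\t", " ")  # tabs become spaces
    return command.replace("/ ", "/")  # rejoin each directory path with its file name

# Hypothetical row covering only the first five spreadsheet columns
row = (
    "diff\t/Users/user/Documents/herlog/Directory_A/\tFile_A.txt"
    "\t/Users/user/Documents/herlog/Directory_B/\tFile_A_1.txt\n"
)
print(tsv_row_to_command(row))
# diff /Users/user/Documents/herlog/Directory_A/File_A.txt /Users/user/Documents/herlog/Directory_B/File_A_1.txt
```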

