Two Ways To Get Started With diff For Bioinformatics: Using Python And Also Excel To Output A Bash Shell Script With Pairwise File Comparisons
Comparing two files is an essential and documentable process in any industry, including bioinformatics, and there are many reasons why a bioinformatician would want to do it reproducibly so that others can interpret, understand, and reuse the result. For example, when two bioinformaticians write a script together without using git, finding what each person contributed and changed is important for understanding the history of the code project. Another useful scenario is quickly understanding the differences between the analysis files produced for the same project by two different bioinformaticians working in two different directories.
diff is a Unix command-line tool that compares the contents of two files or directories and shows their differences as its output. More recent alternatives to diff exist, which is appropriate given how data have evolved since the conception of diff in the 1970s, but they typically have to be installed separately, which may require admin privileges, while diff is built into Unix.
Comparing two files using diff is simple:
> diff File_A.txt File_B.txt
The output shows the differences between the two files, and it can be redirected into a descriptively named file such as:
> diff File_A.txt File_B.txt > File_A_File_B_Comparison.txt
Or, to be more thorough and documentable, the current date (and also the bioinformatician's initials, etc.) can be added to the output file name as follows:
> diff_outfile="File_A_File_B_Comparison_$(date +%Y%m%d).txt"
> diff File_A.txt File_B.txt > "$diff_outfile"
The output of diff shows the lines where the two files differ:
> cat File_A.txt
The diff command in UNIX is used to compare files line-by-line. The
utility comes with a variety of flags that can modify its behavior.
> cat File_B.txt
The diff command in UNIX is used to compare files line-by-line. The
utility comes with a variety of flags that can modify its behavior
> diff File_A.txt File_B.txt > File_A.txt_File_B.txt_20230827.txt
> cat File_A.txt_File_B.txt_20230827.txt
2c2
< utility comes with a variety of flags that can modify its behavior.
---
> utility comes with a variety of flags that can modify its behavior
Here we can see that the only difference between the two files is the period at the end of line 2, which is present in File_A.txt and missing from File_B.txt.
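For readers who prefer to stay in Python, a similar line-by-line comparison can be sketched with the standard library's difflib module. This is only an illustration, not part of the original workflow: the function name unified_diff_lines is hypothetical, and difflib emits the unified format (lines prefixed with - and +) rather than the default diff format shown above.

```python
import difflib


def unified_diff_lines(text_a, text_b, name_a, name_b):
    """Return the unified diff between two strings as a list of lines."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",  # the inputs carry no newlines, so add none to the headers
        )
    )


if __name__ == "__main__":
    first_line = "The diff command in UNIX is used to compare files line-by-line. The"
    second_line = "utility comes with a variety of flags that can modify its behavior"
    file_a = first_line + "\n" + second_line + ".\n"  # File_A.txt ends with a period
    file_b = first_line + "\n" + second_line + "\n"   # File_B.txt does not
    for line in unified_diff_lines(file_a, file_b, "File_A.txt", "File_B.txt"):
        print(line)
```

Lines that start with - come from the first file and lines that start with + come from the second, which mirrors the < and > markers in the diff output.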
To use diff to compare two directories, the following command is used:
> diff -r Directory_A Directory_B
With the -r flag, diff recurses into subdirectories, and the output contains the pairwise differences between files that have the same name in the two directories.
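As a rough Python counterpart to the recursive comparison, the sketch below walks one directory tree and reports the same-named files whose contents differ in the other. The function name differing_files is hypothetical, and unlike diff -r the sketch only lists which files differ; it does not show the differing lines or the files that exist in only one directory.

```python
import filecmp
import os


def differing_files(dir_a, dir_b):
    """Return relative paths that exist in both trees but differ in content."""
    differing = []
    for root, _dirs, files in os.walk(dir_a):
        for name in files:
            path_a = os.path.join(root, name)
            relative = os.path.relpath(path_a, dir_a)
            path_b = os.path.join(dir_b, relative)
            # shallow=False forces a byte-by-byte comparison instead of
            # trusting file size and modification time.
            if os.path.isfile(path_b) and not filecmp.cmp(path_a, path_b, shallow=False):
                differing.append(relative)
    return sorted(differing)
```

Calling differing_files("Directory_A", "Directory_B") would return a sorted list of relative paths, which could then be passed to diff one pair at a time.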
There are times, however, when a bioinformatician wants to compare two directories in which the file names are similar but not identical, such as File_A.txt with File_A_1.txt. On its own, diff cannot compare these file pairs; a script is needed to build the appropriate pairwise file relationships that serve as the inputs for diff.
The following are two of the many ways to generate this pairwise file relationship: writing a Python script with the os module that outputs a bash script, or using Microsoft Excel to visually organize the files and write a bash script.
Python Script to Output A Bash Script
The following is a Python script (also available on GitHub) that I wrote for this post. It loops through the files in Directory A and, for each one, finds the matching file in Directory B:
import os

# Minimal grep: return the items of list l that contain the substring s
def grep(l, s):
    return [i for i in l if s in i]

# Return the paths of all *.txt files under a directory, including subdirectories
def get_filepaths_from_directory(loc_dir):
    paths = []
    for root, dirs, files in os.walk(loc_dir):
        for file in files:
            if file.endswith(".txt"):
                paths.append(os.path.join(root, file))
    return paths

loc_dirA = r'/Users/user/Documents/herlog/Directory_A'
loc_dirB = r'/Users/user/Documents/herlog/Directory_B'

# File names at the top level of each directory, sorted for a stable output order
files_dirA = sorted(os.listdir(loc_dirA))
files_dirB = sorted(os.listdir(loc_dirB))

# Full paths of every *.txt file in each directory tree
paths_dirA = get_filepaths_from_directory(loc_dirA)
paths_dirB = get_filepaths_from_directory(loc_dirB)

print("#!/bin/bash")
for fileA in files_dirA:
    # File_A.txt -> File_A, which is used to find File_A_1.txt in Directory B
    fileA_without_extension = fileA.split(".")[0]
    found_dirA_full_path = grep(paths_dirA, fileA)
    matches_dirB = grep(files_dirB, fileA_without_extension)
    if not found_dirA_full_path or not matches_dirB:
        continue  # not a *.txt file, or no counterpart in Directory B
    found_dirB_file = matches_dirB[0]
    found_dirB_full_path = grep(paths_dirB, found_dirB_file)
    if not found_dirB_full_path:
        continue  # the counterpart is not a *.txt file
    print("diff ", found_dirA_full_path[0], " ", found_dirB_full_path[0], " > ", fileA, "_", found_dirB_file, "_$(date +%Y%m%d).txt", sep = "")
The output of this Python script is as follows:
#!/bin/bash
diff /Users/user/Documents/herlog/Directory_A/File_A.txt /Users/user/Documents/herlog/Directory_B/File_A_1.txt > File_A.txt_File_A_1.txt_$(date +%Y%m%d).txt
diff /Users/user/Documents/herlog/Directory_A/File_B.txt /Users/user/Documents/herlog/Directory_B/File_B_3.txt > File_B.txt_File_B_3.txt_$(date +%Y%m%d).txt
After saving the output as an executable bash shell script and running it, the pairwise comparisons are made by diff and saved into appropriately named text files, which can be zipped and saved or emailed as an attachment.
File_A.txt_File_A_1.txt_20230827.txt
File_B.txt_File_B_3.txt_20230827.txt
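Saving the generated lines as an executable script can also be done from Python instead of redirecting the printed output by hand. The following is a minimal sketch under that assumption; the function name write_executable_script and the file name compare.sh are hypothetical and do not come from the original script.

```python
import os
import stat


def write_executable_script(lines, script_path):
    """Write the given lines to script_path and mark the file executable by its owner."""
    with open(script_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    current_mode = os.stat(script_path).st_mode
    # Add the owner-execute bit, similar to running chmod u+x on the file.
    os.chmod(script_path, current_mode | stat.S_IXUSR)
    return script_path


if __name__ == "__main__":
    script_lines = [
        "#!/bin/bash",
        "diff File_A.txt File_B.txt > File_A.txt_File_B.txt_$(date +%Y%m%d).txt",
    ]
    write_executable_script(script_lines, "compare.sh")
```

The resulting file can then be run with ./compare.sh from the directory where the comparison outputs should be written.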
Using Excel to Output A Bash Script
There are times when using Excel to visually curate files is more practical than using a loop that exhaustively generates a pairwise comparison for every file. Excel is also a great way for new and transitioning bioinformaticians to see how a repetitive command is built.
First, generate the list of files in Directory A:
> cd Directory_A
> ls -1 #Output is one column with the -1 flag
Create and fill a new spreadsheet in Microsoft Excel as follows:
Copy the ls output into column C of a new spreadsheet. Generate the list of files in Directory B and copy that ls output into column E. Fill in the rest of the columns by using Control+D to fill the same values down the rows. Column G is equal to column C, and column I is equal to column E. Fill in columns B and D by using pwd on the two directories, and be sure to end each file path with /.
Once the spreadsheet is all filled in, save the Excel file as a tab-delimited file. Replace the tabs with spaces, and then replace each "/ " (a slash followed by a space) with "/" so that each directory path joins directly to its file name. Rename the bash command file so that it ends in .sh and use chmod to make it executable. The bash script should now run and produce the same output files as the Python script.
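The cleanup of the tab-delimited file can also be sketched in Python. The column layout here is an assumption based on the description above: column A holds the command, columns B and D hold directory paths ending in /, and columns C and E hold the file names. Because the contents of the remaining columns are not spelled out in the text, this sketch ignores them and rebuilds the output file name from columns C and E instead. The function name commands_from_tsv is hypothetical.

```python
import csv
import io


def commands_from_tsv(tsv_text):
    """Build one diff command per spreadsheet row from tab-delimited text."""
    commands = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or incomplete rows
        # Assumed layout: B = directory A, C = file A, D = directory B, E = file B
        dir_a, file_a, dir_b, file_b = row[1:5]
        commands.append(
            f"diff {dir_a}{file_a} {dir_b}{file_b} "
            f"> {file_a}_{file_b}_$(date +%Y%m%d).txt"
        )
    return commands


if __name__ == "__main__":
    example_row = "diff\t/path/Directory_A/\tFile_A.txt\t/path/Directory_B/\tFile_A_1.txt\n"
    print("#!/bin/bash")
    for command in commands_from_tsv(example_row):
        print(command)
```

Joining the directory and file columns directly avoids the stray space after each trailing slash, so no separate find-and-replace step is needed.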