Rename downloaded files
Rename downloaded files to use descriptive file names
For most files included in data packages downloaded from NCBI Datasets, the default filenames are generic.
For example, protein files included in the NCBI Datasets genome data package are all named protein.faa.
Use the simple script below to rename the protein.faa files included in the NCBI Datasets genome data package to use descriptive file names that include the genome assembly accession.
Before running the file renaming script
# Note that all protein sequence files share the same generic name, protein.faa
ls ncbi_dataset/data/GC*/*protein.faa | head -3
ncbi_dataset/data/GCA_000774145.1/protein.faa
ncbi_dataset/data/GCA_009818265.1/protein.faa
ncbi_dataset/data/GCA_016735085.1/protein.faa
After running the file renaming script
# Note that each protein sequence file has been renamed to include the genome assembly accession
ls ncbi_dataset/data/GC*/*protein.faa | head -3
ncbi_dataset/data/GCA_000774145.1/GCA_000774145.1_protein.faa
ncbi_dataset/data/GCA_009818265.1/GCA_009818265.1_protein.faa
ncbi_dataset/data/GCA_016735085.1/GCA_016735085.1_protein.faa
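With many assemblies, listing only the first three files does not show whether every file was renamed. The following is a minimal sketch of one way to check, assuming the default ncbi_dataset/data layout shown above and a find that supports -mindepth and -maxdepth (GNU and BSD find do). The function name count_generic is hypothetical and not part of NCBI Datasets.

```shell
#!/bin/bash
# count_generic is a hypothetical helper name used for illustration only.
# It counts files that still carry the generic name protein.faa.
count_generic() {
  # Assumption: the data package was extracted to ncbi_dataset/data
  local root="${1:-ncbi_dataset/data}"
  # -name matches the whole base name, so GCA_..._protein.faa is not counted
  find "$root" -mindepth 2 -maxdepth 2 -type f -name 'protein.faa' 2>/dev/null | wc -l | tr -d ' '
}

count_generic "$@"
```

A result of 0 suggests that no file with the generic name remains in the accession directories.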
How to run the file renaming script
First, create a file called rename.sh and open it in the nano text editor by running the following.
nano rename.sh
Next, copy and paste the following script into nano.
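#!/bin/bash
# Prefix each protein.faa file with the accession taken from its parent directory name
for file in ncbi_dataset/data/*/protein.faa
do
  # Skip the unexpanded pattern if no files match
  [ -e "$file" ] || continue
  directory_name=$(dirname "$file")
  accession=$(basename "$directory_name")
  mv "${file}" "${directory_name}/${accession}_$(basename "$file")"
done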
Press Ctrl+X, then Y and Enter when prompted, to save the file and exit nano.
Finally, run the script while you are in the directory containing the extracted NCBI Datasets data package.
bash rename.sh
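Because mv changes files in place, it can help to preview the renames first. The following is a minimal dry-run sketch, not part of the original script: it walks the same ncbi_dataset/data/*/protein.faa pattern and prints each planned rename without moving anything. The function name preview_renames and its optional directory argument are hypothetical and used only for illustration.

```shell
#!/bin/bash
# preview_renames is a hypothetical helper name used for illustration only.
# It prints the renames that would be performed, without moving any file.
preview_renames() {
  # Assumption: the data package was extracted to ncbi_dataset/data
  local root="${1:-ncbi_dataset/data}"
  local file directory_name accession
  for file in "$root"/*/protein.faa
  do
    # Skip the unexpanded pattern if no files match
    [ -e "$file" ] || continue
    directory_name=$(dirname "$file")
    accession=$(basename "$directory_name")
    echo "${file} -> ${directory_name}/${accession}_protein.faa"
  done
}

preview_renames "$@"
```

If the printed pairs look right, running the renaming script itself should apply the same changes.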