Download large genome data packages
Use the datasets command-line tool to get large NCBI Datasets genome data packages
If you want to download data for more than 1,000 genomes, or if the genome data package exceeds 15 GB, you’ll need to use the datasets command-line tool (CLI).
The datasets CLI downloads a large NCBI Datasets genome data package as a dehydrated zip archive that contains only metadata and the location of the data on NCBI servers.
You can get the data in three steps:
- Download the dehydrated zip archive.
- Unzip the downloaded zip archive.
- Rehydrate the extracted directory to retrieve the data.
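The three steps can also be chained in a small shell function. This is a minimal sketch, not an official wrapper: the function name fetch_large_package is hypothetical, and it assumes that datasets and unzip are both on your PATH. The commands themselves are the ones described in the steps on this page.

```shell
#!/usr/bin/env bash
# Sketch: download, unzip, and rehydrate one genome data package.
# fetch_large_package is a hypothetical helper, not part of the datasets CLI.
fetch_large_package() {
  local accession="$1"   # e.g. GCF_000001405.40
  local zip_file="$2"    # e.g. human_GRCh38_dataset.zip
  local out_dir="$3"     # e.g. my_human_dataset

  # Step 1: download only the metadata and the data file locations.
  datasets download genome accession "$accession" --dehydrated --filename "$zip_file" || return 1
  # Step 2: extract the dehydrated archive into the target directory.
  unzip "$zip_file" -d "$out_dir" || return 1
  # Step 3: retrieve the data files from NCBI servers.
  datasets rehydrate --directory "$out_dir"
}
```

For example, fetch_large_package GCF_000001405.40 human_GRCh38_dataset.zip my_human_dataset runs all three steps and stops at the first one that fails.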
1. Download
Download a dehydrated data package (< 5 KB) for the human GRCh38 RefSeq genome using the datasets CLI.
datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
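When the request covers many genomes, the accessions can be supplied in a text file instead of on the command line. The sketch below assumes a file with one assembly accession per line and uses the --inputfile option of datasets download genome accession. The function name download_dehydrated_list and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: request one dehydrated package for many accessions listed in a file.
# download_dehydrated_list is a hypothetical helper name.
download_dehydrated_list() {
  local accession_file="$1"   # text file, one assembly accession per line
  local zip_file="$2"         # name of the dehydrated zip archive to write

  # Refuse to run with a missing or empty accession list.
  if [ ! -s "$accession_file" ]; then
    echo "accession list is missing or empty: $accession_file" >&2
    return 1
  fi
  datasets download genome accession --inputfile "$accession_file" --dehydrated --filename "$zip_file"
}
```

A call such as download_dehydrated_list accessions.txt my_genomes.zip would then be followed by the same unzip and rehydrate steps.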
2. Unzip
Unzip the dehydrated zip archive to a directory, for example my_human_dataset:
unzip human_GRCh38_dataset.zip -d my_human_dataset
The output will look like this:
Archive: human_GRCh38_dataset.zip
inflating: my_human_dataset/README.md
inflating: my_human_dataset/ncbi_dataset/data/assembly_data_report.jsonl
inflating: my_human_dataset/ncbi_dataset/fetch.txt
inflating: my_human_dataset/ncbi_dataset/data/dataset_catalog.json
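Before rehydrating, it can help to see how much data the package points to. The sketch below summarizes fetch.txt, assuming each line has three whitespace-separated fields in BagIt style: a URL, a size in bytes, and a relative file path. That layout is an assumption here, so check a line of your own fetch.txt first. The function name summarize_fetch is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count the files and total bytes listed in a fetch.txt file.
# Assumes each line is "URL SIZE_IN_BYTES PATH" (BagIt-style); verify on your file.
summarize_fetch() {
  local fetch_file="$1"   # e.g. my_human_dataset/ncbi_dataset/fetch.txt
  # Field 2 is treated as the size; lines with fewer than 3 fields are skipped.
  awk 'NF >= 3 { files++; bytes += $2 }
       END { printf "%d files, %.0f bytes\n", files, bytes }' "$fetch_file"
}
```

Running summarize_fetch my_human_dataset/ncbi_dataset/fetch.txt prints a single summary line, which gives a rough idea of the disk space that rehydration will need.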
3. Rehydrate
Run the rehydrate command to get the genome sequence:
datasets rehydrate --directory my_human_dataset/
A progress bar will indicate the number of files to be retrieved. When complete, the output looks like this:
Found 1 of 1 files for rehydration
Completed 1 of 1 [================================================] 100%
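After rehydration, you may want to confirm that every file named in fetch.txt is now on disk. The following sketch assumes the same three-field fetch.txt layout (URL, size, path) and assumes that each path is relative to the ncbi_dataset directory. Both are assumptions about the package layout, and the function name check_rehydrated is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report files listed in fetch.txt that are absent after rehydration.
# Assumes lines are "URL SIZE PATH", with PATH relative to ncbi_dataset/.
check_rehydrated() {
  local pkg_dir="$1"                 # e.g. my_human_dataset
  local root="$pkg_dir/ncbi_dataset"
  local missing=0 path

  # The third field (and anything after it) is read as the file path.
  while read -r _url _size path || [ -n "$path" ]; do
    [ -n "$path" ] || continue
    if [ ! -e "$root/$path" ]; then
      echo "missing: $path"
      missing=$((missing + 1))
    fi
  done < "$root/fetch.txt"

  # Succeed only when nothing is missing.
  [ "$missing" -eq 0 ]
}
```

check_rehydrated my_human_dataset returns a zero exit status when all listed files exist, and otherwise prints one line per missing file.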