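Introduction to 1000 Genomes
The 1000 Genomes Project (abbreviated 1KGP), launched in January 2008, was an international research effort to build the most detailed catalogue of human genetic variation to date. Scientists planned to sequence the genomes of at least one thousand anonymous participants from a range of ethnic groups within the following three years, using newly developed technologies that were faster and cheaper. In 2010 the project completed its pilot phase, which was described in detail in a publication in Nature. In 2012, the sequencing of 1,092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported the results, the completion of the project, and opportunities for future research. Many rare variants, restricted to closely related groups, were identified, and eight classes of structural variation were analysed.
The project brought together multidisciplinary research teams from institutes around the world, including China, Italy, Japan, Kenya, Nigeria, Peru, the United Kingdom and the United States. Each contributed to the enormous sequence dataset and to a refined map of the human genome, both of which are freely available to the scientific community and the public through public databases.
The goal of the 1000 Genomes Project was to discover variant sites with a frequency greater than 1% in the populations studied. By sequencing a large number of samples from different populations, it identified a large number of variant sites and provides a comprehensive resource for research on human genetic variation.
Genome-related sites: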
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
ftp://ftp.sanger.ac.uk/pub/1000genomes/
ftp://ftp.ebi.ac.uk/pub/databases/1000genomes
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp
UK10K project:
https://www.uk10k.org/data_access.html
A must-know FTP site for bioinformaticians, 1000genomes (a write-up that lists many usable sites):
http://www.bio-info-trainee.com/1841.html
VCF file format description:
http://www.omicsclass.com/article/6
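The project in detail
The human genome consists of roughly 3 billion DNA base pairs and is estimated to carry about 20,000 protein-coding genes. In designing the study, the consortium had to resolve several key questions about project metrics, such as technical challenges, data quality standards and sequence coverage.
The project was divided into four stages: a pilot stage and three main phases.
- To settle the final design of the full project, three pilot studies were designed, to be carried out within the project's first year.
- The first pilot aimed to genotype 180 individuals from 3 major geographic groups at low coverage (2x).
- In the second pilot, the genomes of two nuclear families (both parents and an adult child) were to be sequenced at deep coverage (20x per genome).
- The third pilot involved sequencing the coding regions (exons) of 1,000 genes in 1,000 individuals at deep coverage (20x).
- Of the main phases, only phase 1 and phase 3 produced data. The results showed that, on average, each person carries about 250-300 loss-of-function variants in annotated genes, and 50-100 variants previously implicated in inherited disorders.
- Low-coverage whole-genome sequencing of 180 individuals from 4 populations
- High-coverage sequencing of 2 trios (mother, father and child)
- Exon-targeted sequencing of 697 individuals from 7 populations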
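To get a feel for the scale of the pilot designs listed above, the coverage figures can be turned into a rough sequencing volume. This is a back-of-the-envelope sketch only: it assumes the approximate 3-billion-base genome size quoted earlier, treats coverage as the mean number of times each base is read, and ignores duplicates and unmapped reads. The function name `bases_required` is made up for illustration.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size quoted above


def bases_required(individuals: int, coverage: int,
                   genome_size_bp: int = GENOME_SIZE_BP) -> int:
    """Total sequenced bases needed to reach the given mean coverage per genome."""
    return individuals * coverage * genome_size_bp


# 180 individuals at 2x: about 1.08 trillion bases
low_coverage_pilot = bases_required(individuals=180, coverage=2)

# 2 trios (6 genomes) at 20x: about 0.36 trillion bases
trio_pilot = bases_required(individuals=6, coverage=20)
```

Under these assumptions, the six deeply sequenced genomes account for about a third as much raw sequence as the 180 low-coverage genomes.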
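The project ran from 2008 to 2013. The final release, dated 2 May 2013, contains SNP genotypes for 2,504 samples from 26 populations. In line with the Fort Lauderdale principles, all genome sequence data (including variant calls) were made freely available as the project progressed. The 1000 Genomes data are free and open, and can be downloaded via FTP.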
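1000 Genomes database structure - FTP
The data are stored on the FTP site. Its top-level directory is laid out as follows and includes several README files (*.md):
README_ebi_aspera_info.md–high-speed download instructions
README_file_formats_and_descriptions.md–file formats and descriptions
README_ftp_site_structure.md–FTP site file structure
README_missing_files.md–missing files
README_populations.md–populations
README_using_1000genomes_cram.md–reading CRAM files
data_collections–directory where the data are stored
README_ebi_aspera_info.md–high-speed download instructions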
How to download files using Aspera
Download Aspera
Aspera provides a fast method of downloading data. To use the Aspera service you need to download the Aspera connect software. This provides a bulk download client called ascp.
Browser
Our aspera browser interface no longer works. If you wish to download files using a web interface we recommend using the Globus interface we present. If you previously relied on the aspera web interface and wish to discuss the matter, please email us at info@1000genomes.org to discuss your options.
Command line
For the command line tool ascp, for versions 3.3.3 and newer, you need to use a command line like:
> ascp -i bin/aspera/etc/asperaweb_id_dsa.openssh -Tr -Q -l 100M -P33001 -L- fasp-g1k@fasp.1000genomes.ebi.ac.uk:vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ./
For versions 3.3.2 and older, you need to use a command line like:
> ascp -i bin/aspera/etc/asperaweb_id_dsa.putty -Tr -Q -l 100M -P33001 -L- fasp-g1k@fasp.1000genomes.ebi.ac.uk:vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ./
Note, the only change between these commands is that for newer versions of ascp asperaweb_id_dsa.openssh replaces asperaweb_id_dsa.putty. This change is noted by Aspera here. You can check the version of ascp you have using:
> ascp --version
The argument to -i may also be different depending on the location of the default key file. The command should not ask you for a password. All the IGSR data is accessible without a password but you do need to give ascp the ssh key to complete the command.
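The version rule described above can be captured in a small helper. The following is a minimal sketch and not part of the Aspera tooling: the function names are hypothetical, the version string is assumed to be purely numeric and dot-separated, and the default key directory `bin/aspera/etc` is simply the one used in the example commands, which may differ on your system.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '3.3.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def key_file_for_version(version: str) -> str:
    """Key file name: .openssh for ascp 3.3.3 and newer, .putty for older versions."""
    if parse_version(version) >= (3, 3, 3):
        return "asperaweb_id_dsa.openssh"
    return "asperaweb_id_dsa.putty"


def build_ascp_command(version: str, remote_path: str, dest: str = "./",
                       key_dir: str = "bin/aspera/etc") -> List[str]:
    """Assemble the same argument list as the example commands above."""
    key = f"{key_dir}/{key_file_for_version(version)}"
    return [
        "ascp", "-i", key, "-Tr", "-Q", "-l", "100M", "-P33001", "-L-",
        f"fasp-g1k@fasp.1000genomes.ebi.ac.uk:{remote_path}", dest,
    ]


command = build_ascp_command(
    "3.3.2",
    "vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz",
)
```

Returning a list rather than a single string means the command can be handed to `subprocess.run` without shell quoting concerns.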
For the above commands to work with your network’s firewall you need to open ports 22/tcp (outgoing) and 33001/udp (both incoming and outgoing) to the following EBI IPs:
- 193.62.192.6
- 193.62.193.6
- 193.62.193.135
If the firewall has UDP flood protection, it must be turned off for port 33001.
Further details
For further information, please contact info@1000genomes.org.
README_file_formats_and_descriptions.md–file formats and descriptions
File formats and descriptions
This file provides information on some of the file formats used to make data available on this site.
CRAM
.cram files use a reference-based format to store data. This format is being used to supply alignment data on this site. Detailed information on working with our .cram files is provided in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md.
CRAI
.crai files are index files which accompany CRAM files. They must be present in order to work with .cram files for most purposes.
BAS
.bas files are .cram or .bam statistic files with one line per readgroup and columns separated by tabs. The first line is a header that describes each column. The first seven columns provide meta information about each readgroup. The remaining columns provide various statistics about the readgroup, calculated by going through the release bams. Where data isn’t available to calculate the result for a column, the default value will be 0. Each column is described in detail below:
- Column 1 'bam_filename': The DCC bam file name in which the readgroup data can be found.
- Column 2 'md5': The md5 checksum of the bam file named in column 1.
- Column 3 'study': The SRA study id this readgroup belongs to.
- Column 4 'sample': The sample (individual) identifier the readgroup came from.
- Column 5 'platform': The sequencing platform (technology) used to sequence the readgroup.
- Column 6 'library': The name of the library used for the readgroup.
- Column 7 'readgroup': The readgroup identifier. This is unique per .bas file. The remaining columns summarise data for reads with this RG tag in the bam file given in column 1.
- Column 8 '#_total_bases': The sum of the length of all reads in this readgroup.
- Column 9 '#_mapped_bases': The sum of the length of all reads in this readgroup that did not have flag 4 (== unmapped).
- Column 10 '#_total_reads': The total number of reads in this readgroup.
- Column 11 '#_mapped_reads': The total number of reads in this readgroup that did not have flag 4 (== unmapped).
- Column 12 '#_mapped_reads_paired_in_sequencing': As for column 11, but also requiring flag 1 (== reads paired in sequencing).
- Column 13 '#_mapped_reads_properly_paired': As for column 11, but also requiring flag 2 (== mapped in a proper pair, inferred during alignment).
- Column 14 '%_of_mismatched_bases': Calculated by summing the read lengths of all reads in this readgroup that have an NM tag, summing the edit distances obtained from the NM tags, and getting the percentage of the latter out of the former to 2 decimal places.
- Column 15 'average_quality_of_mapped_bases': The mean of all the base qualities of the bases counted for column 9, to 2 decimal places.
- Column 16 'mean_insert_size': The mean of all insert sizes (ISIZE field) greater than 0 for properly paired reads (as counted in column 13) and with a mapping quality (MAPQ field) greater than 0. Rounded to the nearest whole number.
- Column 17 'insert_size_sd': The standard deviation from the mean of insert sizes considered for column 16. To 2 decimal places.
- Column 18 'median_insert_size': The median insert size, using the same set of insert sizes considered for column 16.
- Column 19 'insert_size_median_absolute_deviation': The median absolute deviation of the column 18 data.
- Column 20 '#_duplicate_reads': The number of reads which were marked as duplicates.
- Column 21 '#_duplicate_bases': The number of bases which were marked as duplicates.
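Because a .bas file is a tab-separated table with a header line, it can be read with a standard CSV reader. The sketch below is illustrative only: the helper names are hypothetical, and the single data row is invented so the example is self-contained (it does not come from a real release file). It reads the rows and derives the fraction of mapped reads from the '#_mapped_reads' and '#_total_reads' columns.

```python
import csv
import io
from typing import Dict, Optional

BAS_COLUMNS = [
    "bam_filename", "md5", "study", "sample", "platform", "library",
    "readgroup", "#_total_bases", "#_mapped_bases", "#_total_reads",
    "#_mapped_reads", "#_mapped_reads_paired_in_sequencing",
    "#_mapped_reads_properly_paired", "%_of_mismatched_bases",
    "average_quality_of_mapped_bases", "mean_insert_size", "insert_size_sd",
    "median_insert_size", "insert_size_median_absolute_deviation",
    "#_duplicate_reads", "#_duplicate_bases",
]


def read_bas(handle):
    """Return a reader that yields one dict per readgroup, keyed by header name."""
    return csv.DictReader(handle, delimiter="\t")


def mapped_read_fraction(row: Dict[str, str]) -> Optional[float]:
    """Mapped reads divided by total reads, or None when the total is 0."""
    total = int(row["#_total_reads"])
    if total == 0:  # 0 is the documented default when data is unavailable
        return None
    return int(row["#_mapped_reads"]) / total


# Invented example row, for illustration only.
fake_row = ["example.bam", "0" * 32, "SRP000000", "NA00000", "ILLUMINA", "lib1",
            "RG1", "1000", "900", "10", "9", "9", "8", "0.50", "30.00",
            "300", "10.00", "300", "5", "1", "100"]
demo_text = "\t".join(BAS_COLUMNS) + "\n" + "\t".join(fake_row) + "\n"

rows = list(read_bas(io.StringIO(demo_text)))
fraction = mapped_read_fraction(rows[0])
```

For a real file, pass an open file handle (for example `open(path, newline="")`) to `read_bas` in place of the `StringIO` object.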
INDEX
Various types of index file exist on the site. These are tab-delimited files where the data is arranged in columns. Immediately before the body of the file there is a header line, which starts with #, that gives the column names. In addition, index files may have further information at the head of the file. These lines start with ## and can provide descriptions of the columns, the date the index was generated and other pieces of information, as appropriate to the file and data set.
An example of the start of such a file, in this case an alignment index file, is below:
> ##FileDate=20150914
> ##Project=Illumina Platinum pedigree
> ##CRAM_FILE=Path to CRAM file - information on CRAM files can be found in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md and information on the alignment process accompanies the data collection
> ##CRAM_MD5=md5 for CRAM file
> ##CRAI_FILE=Path to CRAI file
> ##CRAI_MD5=md5 for CRAI file
> ##BAS_FILE=Path to BAS file
> ##BAS_MD5=md5 for BAS file
> #CRAM_FILE CRAM_MD5 CRAI_FILE CRAI_MD5 BAS_FILE BAS_MD5
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/illumina_platinum_pedigree/data/CEU/NA12893/alignment/NA12893.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram 4aa7f5b61d4365a556c980278b9be5a1 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/illumina_platinum_pedigree/data/CEU/NA12893/alignment/NA12893.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram.crai ae3d1ac0de67d58d192a96508bad85b4 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/illumina_platinum_pedigree/data/CEU/NA12893/alignment/NA12893.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.bam.bas 7fb61b29c6b5fc716e9167affab56d92
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/illumina_platinum_pedigree/data/CEU/NA12892/alignment/NA12892.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram 8acdcd17349546b5ca1b45111e30fc07 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/illumina_platinum_pedigree/data/CEU/NA12892/alignment/NA12892.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram.crai 555de96d6baf2e16630dadd9a6b5b038 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/illumina_platinum_pedigree/data/CEU/NA12892/alignment/NA12892.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.bam.bas 5079d6d9dfcb0eb4631d11035ec71b16
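The layout described above (optional `##` information lines, a single `#` column-header line, then a tab-delimited body) can be parsed in a few lines. This is a minimal sketch under those stated conventions; `parse_index` is a hypothetical helper name, and the example lines are shortened stand-ins rather than real index content.

```python
from typing import Dict, Iterable, List, Tuple


def parse_index(lines: Iterable[str]) -> Tuple[Dict[str, str], List[str], List[Dict[str, str]]]:
    """Split an index file into (## metadata, column names, body records)."""
    meta: Dict[str, str] = {}
    columns: List[str] = []
    records: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # '##KEY=value' information line; split on the first '=' only
            key, _, value = line[2:].partition("=")
            meta[key] = value
        elif line.startswith("#"):
            # single '#' line immediately before the body: the column names
            columns = line[1:].split("\t")
        else:
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Shortened, made-up example lines.
example = [
    "##FileDate=20150914\n",
    "##Project=Illumina Platinum pedigree\n",
    "#CRAM_FILE\tCRAM_MD5\tCRAI_FILE\tCRAI_MD5\tBAS_FILE\tBAS_MD5\n",
    "a.cram\tmd5a\ta.cram.crai\tmd5b\ta.bam.bas\tmd5c\n",
]
meta, columns, records = parse_index(example)
```

Keying each record by column name avoids hard-coding column positions, which can differ between index types.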
Further information
For further information, please contact info@1000genomes.org.
README_ftp_site_structure.md–FTP site file structure
Structure of the FTP site
This file provides an overview of the structure of the FTP site.
Top level of the site
At the top level of the site, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/, there are a number of files and directories.
Files present in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/:
- README files: These provide information on a variety of topics related to this FTP site and the data it contains.
- CHANGELOG file: This file records alterations made to this FTP site.
- current.tree file: This file lists all directories and files currently present on this FTP site.
Directories present in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/:
- changelog_details
- data
- data_collections
- historical_data
- phase1
- phase3
- pilot_data
- release
- technical
These directories are described in the next section of this file.
Directories
changelog_details
This directory contains a series of files detailing the changes made to the FTP site over time.
data
The data directory formerly housed data generated during the 1000 Genomes Project. The data that was previously located here has been integrated into data_collections and is present under ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data. Further information on this move can be found in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/README_data_has_moved.md.
data_collections
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the 1000 Genomes Project data.
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/README_data_collections.md.
historical_data
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/historical_data/README_historical_data.md.
phase1
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.
phase3
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.
pilot_data
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.
release
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.
Examples of release subdirectories are:
- ftp://ftp.1000genomes.ebi.ac.uk/release/2008_12/
- ftp://ftp-trace.ncbi.nih.gov/1000genomes/release/2008_12/
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.
For example, the directory ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ contains the release versions of SNP and indel calls based on the ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/historical_data/former_toplevel/sequence_indices/20100804.sequence.index file.
technical
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc. An example of data stored under technical is ftp://ftp.1000genomes.ebi.ac.uk/technical/simulations/.
WARNING: ftp://ftp.1000genomes.ebi.ac.uk/technical/working/
The working directory under technical contains data that has experimental (non-public release) status and is suitable for internal project use only. Please use with caution.
Further information
Should you require further assistance in navigating the FTP site, please contact info@1000genomes.org.
README_missing_files.md–missing files
The 1000 Genomes FTP site undergoes periodic rearrangements to accommodate new data sets coming in and to preserve old data in a new location.
If there is a particular file and you don’t know whether it still exists, the best place to look is our current.tree file:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree
This file contains five columns:
- relative path
- type (directory or file)
- size in bytes
- last updated time stamp
- md5
You should be able to search this file with your filename and see if the file still exists and find its new location.
If you still can’t find your file, email info@1000genomes.org and the support team will be able to tell you what has happened to it.
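Searching current.tree for a file name, as suggested above, might look like the following sketch. It assumes the five columns are tab-separated and appear in the order listed; the helper name `find_in_tree`, the `TreeEntry` type and the sample lines (including their timestamp format and checksums) are invented for illustration.

```python
import posixpath
from typing import Iterable, List, NamedTuple


class TreeEntry(NamedTuple):
    path: str
    kind: str      # "directory" or "file"
    size: int      # size in bytes
    updated: str   # last-updated time stamp, kept as text
    md5: str


def find_in_tree(lines: Iterable[str], filename: str) -> List[TreeEntry]:
    """Return every entry whose final path component equals filename."""
    hits: List[TreeEntry] = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        fields = (fields + [""] * 5)[:5]  # tolerate missing trailing columns
        path, kind, size, updated, md5 = fields
        if posixpath.basename(path.rstrip("/")) == filename:
            hits.append(TreeEntry(path, kind, int(size) if size.isdigit() else 0,
                                  updated, md5))
    return hits


# Made-up lines in the assumed five-column layout.
sample_tree = [
    "ftp/release/20100804\tdirectory\t4096\t2018-01-01 00:00:00\t\n",
    "ftp/release/20100804/example.vcf.gz\tfile\t123\t2018-01-01 00:00:00\td41d8cd98f00b204e9800998ecf8427e\n",
    "ftp/historical_data/example.vcf.gz\tfile\t123\t2015-09-01 00:00:00\td41d8cd98f00b204e9800998ecf8427e\n",
]
matches = find_in_tree(sample_tree, "example.vcf.gz")
```

Matching on the base name rather than the full path is what lets a moved file be found at its new location.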
README_populations.md–populations
Populations
This file describes the population codes which were assigned to samples collected for the 1000 Genomes project. These codes are used to organise the files in the data_collections’ project data directories and can also be found in column 11 of many sequence index files.
There are also two tsv files, which contain the population codes and descriptions for both the sub and super populations that were used in phase 3 of the 1000 Genomes Project:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20131219.populations.tsv
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20131219.superpopulations.tsv
Populations and codes
| Code | Population | Description |
| --- | --- | --- |
| CHB | Han Chinese | Han Chinese in Beijing, China |
| JPT | Japanese | Japanese in Tokyo, Japan |
| CHS | Southern Han Chinese | Han Chinese South |
| CDX | Dai Chinese | Chinese Dai in Xishuangbanna, China |
| KHV | Kinh Vietnamese | Kinh in Ho Chi Minh City, Vietnam |
| CHD | Denver Chinese | Chinese in Denver, Colorado (pilot 3 only) |
| CEU | CEPH | Utah residents (CEPH) with Northern and Western European ancestry |
| TSI | Tuscan | Toscani in Italia |
| GBR | British | British in England and Scotland |
| FIN | Finnish | Finnish in Finland |
| IBS | Spanish | Iberian populations in Spain |
| YRI | Yoruba | Yoruba in Ibadan, Nigeria |
| LWK | Luhya | Luhya in Webuye, Kenya |
| GWD | Gambian | Gambian in Western Division, The Gambia |
| MSL | Mende | Mende in Sierra Leone |
| ESN | Esan | Esan in Nigeria |
| ASW | African-American SW | African Ancestry in Southwest US |
| ACB | African-Caribbean | African Caribbean in Barbados |
| MXL | Mexican-American | Mexican Ancestry in Los Angeles, California |
| PUR | Puerto Rican | Puerto Rican in Puerto Rico |
| CLM | Colombian | Colombian in Medellin, Colombia |
| PEL | Peruvian | Peruvian in Lima, Peru |
| GIH | Gujarati | Gujarati Indian in Houston, TX |
| PJL | Punjabi | Punjabi in Lahore, Pakistan |
| BEB | Bengali | Bengali in Bangladesh |
| STU | Sri Lankan | Sri Lankan Tamil in the UK |
| ITU | Indian | Indian Telugu in the UK |
Should you have any queries, please contact info@1000genomes.org.
README_using_1000genomes_cram.md–reading CRAM files
IGSR CRAM Tutorial
From the first release of GRCh38 alignments onwards, we are releasing our alignment files in CRAM format. CRAM is a reference-based compression of sequence data.
Both htslib and picard can read CRAM files, so many standard tools should be able to read these files natively.
Here are details about how to view CRAM files, convert from CRAM to BAM, how we produced the cram files and the CRAM specification.
Using CRAM files
CRAM Files can be read by both samtools and picard. EMBL-EBI also provides a java API called cramtools (http://www.ebi.ac.uk/ena/software/cram-toolkit)
- Reading a CRAM file with samtools - samtools view commands work with CRAM files. This functionality needs samtools v1.2 or higher
>samtools view $input.cram -h chr22:1000000-1500000 | less
- Converting a CRAM file to a BAM File - some tools still need BAM files rather than CRAM. You can convert from CRAM to BAM easily
>java -jar cramtools-3.0.jar bam -I $input.cram -R $reference.fa -O $output.bam
Please note the first time you run these commands the program reading the CRAM file must download the reference sequence data from an online cache. This process can be speeded up if you download the required reference file and build a local copy of the cache in advance. This process is described below in the CRAM reference registry section.
The CRAM reference registry
Because CRAM does not contain the same level of sequence data as BAM files, it relies on the CRAM reference registry to provide reference sequences for CRAM to output uncompressed sequences. The reference must be available at all times. Losing it is equivalent to losing all your read sequences. Retrieval of reference data from the registry is supported by using MD5 or SHA1 checksums using the following URLs:
www.ebi.ac.uk/ena/cram/md5/<hashvalue>
www.ebi.ac.uk/ena/cram/sha1/<hashvalue>
The md5 values for all GRCh38 reference chromosomes and contigs are included in CRAM file headers. These are mandatory fields in the CRAM specification.
The following process can be used to download and prepare the cache in advance to speed up initial reads of the sequence data from CRAM files.
- Download the reference file from our FTP site. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
- Run the seq_cache_populate.pl script from samtools - http://www.htslib.org/workflow/#mapping_to_cram
>perl samtools/misc/seq_cache_populate.pl -root /path/to/cache /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa
- Set the cache environment variables
>export REF_PATH=/path/to/cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s
>export REF_CACHE=/path/to/cache/%2s/%2s/%s
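The `%2s/%2s/%s` pattern in these variables spreads cached sequences over nested directories named after the MD5 checksum: each `%Ns` takes the next N characters of the checksum and a bare `%s` takes whatever remains. The sketch below mimics that expansion for a single template so the resulting layout is easier to picture. It is an illustration of the documented behaviour, not htslib code; `expand_ref_template` is a hypothetical name and the checksum is a placeholder.

```python
import re


def expand_ref_template(template: str, md5: str) -> str:
    """Fill %Ns / %s fields from left to right with successive pieces of md5."""
    remaining = md5

    def consume(match: "re.Match") -> str:
        nonlocal remaining
        width = match.group(1)
        take = int(width) if width else len(remaining)
        piece, remaining = remaining[:take], remaining[take:]
        return piece

    return re.sub(r"%(\d*)s", consume, template)


placeholder_md5 = "0123456789abcdef0123456789abcdef"
cache_file = expand_ref_template("/path/to/cache/%2s/%2s/%s", placeholder_md5)
# cache_file == "/path/to/cache/01/23/456789abcdef0123456789abcdef"
```

Splitting on the first four characters keeps any single directory from accumulating thousands of reference files.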
How did we compress the BAMs to CRAMs
We use cramtools to convert the BAM files our alignment pipeline produces to CRAM files. The files are compressed using a lossy mode, which bins all quality scores into the 8-binning scheme defined by Illumina.
>java -jar cramtools-3.0.jar cram --ignore-tags OQ:CQ:BQ --capture-all-tags --lossy-quality-score-spec '*8' --preserve-read-names -O $output.cram -R GRCh38_full_analysis_set_plus_decoy_hla.fa -I $input.bam
More information about CRAM format.
As mentioned above, CRAM represents a reference-based compression of sequence data. After aligning sequence reads to a reference genome, rather than storing every base pair of a sequence read, the approach stores only the difference between the read and the reference, hence reducing the space needed for storing sequence reads. Additional compression can be achieved in lossy mode by controlled loss of quality information and unaligned reads, and by dropping read names and other information. The level of compression can be fine tuned based on users’ experiment design. With associated tools, the compressed data can be seamlessly uncompressed and fed into downstream analysis.
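To make the idea of reference-based compression concrete, here is a toy sketch. It is not the CRAM codec: it handles substitutions only (no insertions, deletions, quality scores or read names), and the names `encode_read` and `decode_read` are invented for illustration. A read is stored as its start position, its length, and the few bases that differ from the reference.

```python
from typing import List, Tuple

# (start position, read length, [(offset within read, differing base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, pos: int, read: str) -> Encoded:
    """Keep only the positions where the read differs from the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read; this is impossible without the same reference."""
    pos, length, diffs = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
read = "GTACCT"  # matches reference[2:8] except at offset 4
encoded = encode_read(reference, 2, read)
restored = decode_read(reference, encoded)
```

The decoder needs the reference to do anything at all, which is why losing the reference is described above as equivalent to losing the read sequences.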
This compression method was first developed by Ewan Birney’s group in European Bioinformatics Institute (EBI) (Hsi-Yang Fritz, et al. (2011). Genome Res. 21:734-740). The specification itself is maintained by the HTSlib group alongside the BAM, VCF and BCF specifications http://samtools.github.io/hts-specs/CRAMv3.pdf.
If you have any questions about our CRAM files or our alignment pipeline, please email info@1000genomes.org
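1000 Genomes database structure - data_collections
This directory mainly holds the data for the project's 2,000-plus individuals. The README at this level reads as follows: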
Data collections
Collections
The International Genome Sample Resource (IGSR) provides access to data from multiple projects, including the 1000 Genomes Project. As a consequence of this, data is organised in collections to reflect these different projects. Each of the directories in this directory contains data for a given collection. Data from the 1000 Genomes projects can be found under 1000_genomes_project.
Collection layout
Within each collection directory, you will find information in README files, which describe the data and any processing which has been done.
In addition, index files provide a catalogue of the files available for that collection. The body of each index file is tab delimited. Index files also have a header section with lines starting with ##, providing information about the file, data and the columns. The column header is a line starting with a single # immediately before the body of the file. The sequence indices contain the locations of the sequence data in the European Nucleotide Archive (ENA).
The data directory in each collection directory is where the data is housed. Data is organised by population and then by sample, for example, 1000_genomes_project/data/YRI/NA19150/.
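Since files are organised as collection, then population, then sample, those three pieces can be recovered from any file path under data_collections. The following is a small sketch under that layout assumption; `locate_sample` and `SampleLocation` are hypothetical names.

```python
from typing import NamedTuple, Optional


class SampleLocation(NamedTuple):
    collection: str
    population: str
    sample: str


def locate_sample(path: str) -> Optional[SampleLocation]:
    """Parse .../data_collections/<collection>/data/<population>/<sample>/..."""
    parts = [p for p in path.split("/") if p]
    if "data_collections" in parts:
        start = parts.index("data_collections") + 1
    else:
        start = 0  # path given relative to data_collections
    # need <collection>, the literal 'data', <population> and <sample>
    if len(parts) < start + 4 or parts[start + 1] != "data":
        return None
    return SampleLocation(parts[start], parts[start + 2], parts[start + 3])


location = locate_sample("1000_genomes_project/data/YRI/NA19150/")
```

Applied to the CRAM URLs in an alignment index, the same function yields the population and sample for each row without consulting a separate table.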
Further information
If you are looking for a specific file, a list of all files on the site can be found in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree. Should you have any queries, please contact info@1000genomes.org.
1000G_2504_high_coverage
This is the high-coverage data collection for the 2,504 samples of the 1000 Genomes Project. Its README follows:
1000 Genomes 2504 phase 3 panel sequenced to high coverage
This README refers to 30x Illumina NovaSeq sequencing of 2504 samples from the 1000 Genomes project phase 3 sample set. These data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1. Please email service@nygenome.org with questions or interest in undertaking collaborative analysis of this dataset. All cell lines were obtained from the Coriell Institute for Medical Research and were consented for full public release of genomic data. Please see Coriell (https://www.coriell.org) for more information about specific cell lines. The following cell lines/DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: [NA06984, NA06985, NA06986, NA06989, NA06994, NA07000, NA07037, NA07048, NA07051, NA07056, NA07347, NA07357, NA10847, NA10851, NA11829, NA11830, NA11831, NA11832, NA11840, NA11843, NA11881, NA11892, NA11893, NA11894, NA11918, NA11919, NA11920, NA11930, NA11931, NA11932, NA11933, NA11992, NA11994, NA11995, NA12003, NA12004, NA12005, NA12006, NA12043, NA12044, NA12045, NA12046, NA12058, NA12144, NA12154, NA12155, NA12156, NA12234, NA12249, NA12272, NA12273, NA12275, NA12282, NA12283, NA12286, NA12287, NA12340, NA12341, NA12342, NA12347, NA12348, NA12383, NA12399, NA12400, NA12413, NA12414, NA12489, NA12546, NA12716, NA12717, NA12718, NA12748, NA12749, NA12750, NA12751, NA12760, NA12761, NA12762, NA12763, NA12775, NA12776, NA12777, NA12778, NA12812, NA12813, NA12814, NA12815, NA12827, NA12828, NA12829, NA12830, NA12842, NA12843, NA12872, NA12873, NA12874, NA12878, NA12889, NA12890].
The sequence data is available in ENA and a listing of files, with metadata, is available in the accompanying index.
Analysis
Analysis work is being done by a number of groups, working toward variant calling, including identification of structural variation.
Initial analysis has been done by NYGC, including aligning the data to GRCh38, creating the CRAMs in ENA. The document NYGC_b38_pipeline_description.pdf contains a description of that analysis work and details of the alignment pipeline.
Should you have questions about this data please contact info@1000genomes.org
1000_genomes_project
This directory stores the original 1000 Genomes Project data.
This directory contains sequence data generated by the 1000 Genomes Project, and alignment data of the reads to GRCh38.
Subsequent analysis results such as variant call sets will be in this directory when they become available.
Sequence Data
Moving forward we are aligning the raw sequence data unfiltered. As such, we no longer rehost the fastq files; instead, our sequence index files point you to the FTP URL for the fastq files on ENA servers.
The data in the sequence index are categorised into 3 analysis groups (column 26) based on the library strategy used:
low coverage - Low coverage whole genome sequencing
exome - Whole exome sequencing
high coverage - PCR-free high coverage whole genome sequencing
When aligning to GRCh38, data from the three analysis groups were aligned independently, producing three sets of CRAM files.
Some of the runs have a withdrawn flag (column 21); these runs were not included in subsequent analysis.
Column 22 indicates the reason why a run is withdrawn. Below are the possible reasons:
SUPPRESSED IN ARCHIVE - runs were withdrawn by submitters so the fastq file is no longer available from ENA/SRA
TOO_SHORT - all reads in a whole genome sequencing run are shorter than 70bp or in an exome run shorter than 68bp
NOT_ILLUMINA - a run generated on a platform that is not Illumina. For data consistency, we only include data produced on the Illumina platform for subsequent analysis.
Alignment Data
The sequence data described by the sequence index file (1000genomes.sequence.index) were aligned to GRCh38 using ALT-aware bwa-mem. Please see details of the alignment process in README.1000genomes.GRCh38DH.alignment. Sample level CRAMs were produced; the CRAM files and ancillary files are listed in alignment index files in this directory.
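Returning to the sequence index described above, the withdrawn flag (column 21), the withdrawal reason (column 22) and the analysis group (column 26) are enough to tally which runs feed into the alignments. The sketch below is illustrative and rests on several assumptions not stated in the README: the file is tab-delimited, header lines start with `#`, and a flag value of `1` marks a withdrawn run. `summarise_runs` is a hypothetical helper, and the demo rows are placeholders rather than real index content.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# 1-based column numbers as given in the README text
WITHDRAWN_COL, REASON_COL, GROUP_COL = 21, 22, 26


def summarise_runs(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count retained runs per analysis group and withdrawn runs per reason."""
    kept: Counter = Counter()
    reasons: Counter = Counter()
    for raw in lines:
        if raw.startswith("#") or not raw.strip():
            continue  # skip header and blank lines
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < GROUP_COL:
            continue  # malformed or truncated line
        if fields[WITHDRAWN_COL - 1] == "1":  # assumption: "1" means withdrawn
            reasons[fields[REASON_COL - 1]] += 1
        else:
            kept[fields[GROUP_COL - 1]] += 1
    return kept, reasons


def make_row(withdrawn: str, reason: str, group: str) -> str:
    """Build a placeholder 26-column line with only the three fields filled in."""
    fields: List[str] = ["."] * GROUP_COL
    fields[WITHDRAWN_COL - 1] = withdrawn
    fields[REASON_COL - 1] = reason
    fields[GROUP_COL - 1] = group
    return "\t".join(fields) + "\n"


demo = [
    "#placeholder header line\n",
    make_row("0", "", "low coverage"),
    make_row("0", "", "exome"),
    make_row("1", "TOO_SHORT", "low coverage"),
    make_row("1", "NOT_ILLUMINA", "exome"),
    make_row("0", "", "low coverage"),
]
kept, reasons = summarise_runs(demo)
```

Before relying on the result for a real index, check the flag values and column positions against that file's own header.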
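Summary
The 1000 Genomes data files are very large, on the order of 1,000 GB, so storage alone is a significant problem and working with the data demands substantial compute and storage resources. Algorithmically, describing and interpreting genetic questions well also calls for methods that operate in very high dimensions. Most methods in current use are traditional ones, because they rest on simple calculations that can be expressed with well-defined formulas, whereas methods from artificial intelligence have not yet described this problem well. One current direction is to use AI methods to fit the ideas that the traditional formulas express, and then compute with the fitted neural network to obtain more precise interpretations. This carries some risk, however: whether an AI method can capture the model at all, and whether it yields a better or more precise interpretation once it does, remain hard questions to answer.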