Microarray Data - Laboratorio di Evoluzione Microbica e Molecolare
Bioinformatics Laboratory, Lecture #2
Dr. Marco Fondi
Contact: [email protected], www.unifi.it/dblemm/, tel. 0552288308
Dip.to di Biologia Evoluzionistica, Laboratorio di Evoluzione Microbica e Molecolare, Università di Firenze

Lecture #2
a) Web resources for bioinformatics
b) BLAST (Basic Local Alignment Search Tool)

Wet-Lab experiments → DATA → WEB Databases
• Bibliographic Databases
• Taxonomic Databases
• Nucleotide Databases
• Genomic Databases
• Protein Databases
• Microarray Databases
Knowledge bases = Biological databases
The starting point of any bioinformatic (and non-bioinformatic) analysis.
Melanie

DataBase overview
• Sequence Data/Genome Data (…atgctggactgagtaatcct… / …MQYYLERRSQMPGYTRYMML…)
• Gene Prediction (ORF finding)
• Protein Structure
• Taxonomy
• Metabolic pathways information
• Expression profiles (Microarray Data)
Example databases: EMBL-EBI, GenBank, PDB (Protein DataBank), JGI

Database sequence in FASTA Format
FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line
>gi|193425|gb|M60978.1|MUSGAPDS
• gi number: 193425
• Database identifier: gb
• Accession number: M60978.1
• Locus Name: MUSGAPDS

Database Identifiers
gb    GenBank
emb   EMBL
dbj   DDBJ
sp    SWISS-PROT
pdb   Protein Databank
pir   PIR
ref   RefSeq

Two ways to query a DB
• "Text" search
• BLAST: sequence similarity search, starting from a sequence in FASTA format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
(same sequence as on the FASTA Format slide)

Gene Prediction (ORF finding)
DNA molecule
Sequence in FASTA format:
>Chromosome (TITLE)
ATCATTATTGATCCTGATCGGTTAGCAT
CGTATTTCCTTACCGGGACCCCATGATC
GATACAGTAAACCTTAGGATGATTATTG
ATGCTGATCGGTTAGCATCGTATTTCCT
TACCGGGACCCCATGATCGATACAGTA
AACCTTAGGTGATTATTGATCCTGATCG
GTTAGCATCGTATTTCCTTACCGGGACC
CCATGATCGATACAGTAATAATTAGGAT
GATTATTGATCCTGATCGGTTAGCATCG
TATTTCCTTACCGGGACCCCATGATCGA
TACAGTAAACCTTAGGATGATTATTGAT
CCTGATCGGTTAGCATCGTATTTCCTTA
CCGGGACCCCATGATCGATACAGTAAA
CCTTAGATGATTATTGATCCTGATCGGT
ATGCATCGTATTTCCTTACCGGGACCCC
ATGATCGATACAGTAAACCTTAGGTTGA
ATCGTATTTCCTTACCGGGACCCCATGA
TCGATACAGTAAACCTTAGGTAGCATCG
TATTTCCTTACCGGGACCCCATGATCGA
ATGAGTAAACCTTAGGTAGCATTGAATT
TCCTTACCGGGACCCCATGATCGATACA
GTAAACCTTAGG…..
ORF Finder @ NCBI

Metabolic pathways information
• I have a gene (a sequence): which metabolic process is it involved in?
• Given a metabolic process, which genes are involved?
Metabolic pathways information @ KEGG
• Apoptosis in Homo sapiens
• Apoptosis in Monodelphis domestica

Protein Structure
Every protein has its own 3D structure.
Amino acid sequence:
NLKTEWPELVGKSVEE
AKKVILQDKPEAQIIVL
PVGTIVTMEYRIDRVR
LFVDKLDNIAEVPRVG
Folding!
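Returning to the labelled FASTA definition line on the Database Identifiers slide: the pipe-delimited fields can be taken apart with a few lines of Python. This is only a minimal sketch. It assumes the exact layout shown on the slide (gi number, database tag, accession number, locus name, then a free-text description), and the names `DB_TAGS` and `parse_defline` are hypothetical, introduced here for illustration rather than taken from any library.

```python
# Database tags exactly as listed on the "Database Identifiers" slide.
DB_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "ref": "RefSeq",
}


def parse_defline(defline):
    """Split a 'gi|number|db|accession|locus description' line into fields.

    Assumption: the line follows the five-field layout shown on the slide.
    Other FASTA headers use different layouts and are rejected here.
    """
    # Drop the leading '>' and separate the identifier block from the description.
    header, _, description = defline.strip().lstrip(">").partition(" ")
    fields = header.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession|locus definition line")
    return {
        "gi": fields[1],
        "db_tag": fields[2],
        "database": DB_TAGS.get(fields[2], "unknown"),
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


if __name__ == "__main__":
    record = parse_defline(
        ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald"
    )
    for key, value in record.items():
        print(f"{key}: {value}")
```

For the slide's example, the `gb` tag resolves to GenBank and the accession number is M60978.1. Headers that do not follow this five-field layout raise a `ValueError` instead of being guessed at.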
Protein Structure in the WEB
• Known structures
• Structure predictions
If prediction = true: Protein structure prediction
Protein structure @ NCBI
Applications:
• Drug design
• Protein-protein docking
• Evolution
• Proteomics
• Functional assignment

Expression profiles (Microarray Data)
• Array Analysis
• Hierarchical Clustering
• Gene Expression @ NCBI
Expression profile:
• Protein-protein interactions
• Functional assignment
• Proteomics

NCBI (http://www.ncbi.nlm.nih.gov/)
• Entrez interface to databases
– Medline/OMIM
– Genbank/Genpept/Structures
• BLAST server(s)
– Five-plus flavors of blast
• Draft Human Genome
• Much, much more…
INTEGRATION!!!

Things to know and remember about using web server-based tools
• You are using someone else's computer
• You are (probably) using a restricted subset of the available options
• Very useful for preliminary, "quick" analyses. For more accurate and complex analyses it is preferable to use databases and software locally
• Practice and (intelligent!!!) mistakes are the best way to learn

Sequence Comparison
BLAST: Basic Local Alignment Search Tool

Why compare sequences?
• To identify which other organisms possess the gene under study (the query), e.g. antibiotic production, drug targets
• For a preliminary functional assignment (hypothetical protein, putative function)

Functional assignment
AACGT TTGCC TATAG
1. Protein X, function unknown
2. Sequence comparison (BLAST) against a sequence database
3. Similar sequences: protein 1 to protein 8, all with function A
4. Transfer of the information about function: protein X, function A

Sequence in FASTA Format (QUERY)
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
(same sequence as on the FASTA Format slide)
QUERY → BLAST → DB → list of sequences similar to the query

BLAST in the web @NCBI

Using Basic BLAST Methods
• Example: MASH-1 protein sequence from mouse
• Can I find similar proteins in Human?

Submitting Your Query
• Input query sequence
– FASTA
– Raw
– Accession/ID
• Choose Database
– Many available; varies with program
– For complete list follow the link to:

Submitting Your Query
• CD Search
– Finds conserved domains in query sequence
– Compares to patterns and profiles of CDs
• Limit by Entrez query
– Restricts results to single organism etc.
• E-value cut off
– Restricts results to ones falling below defined e-value
– Default = 10
– Will revisit concept of e-value

Submitting Your Query
• Low complexity filtering
– Low complexity sequence can lead to spurious alignments
– Filtering "hides" these regions
– On by default
– SEG (proteins) or DUST (nucleic acids)
– Should turn it off in some cases… what if your entire sequence gets filtered?

Submitting Your Query
• Choice of scoring matrix
– Different ones available
– BLOSUM matrices based on observed frequencies of a.a. substitutions
– Each tailored to different levels of sequence divergence and length
– BLOSUM 62 = default
– Shown to be best at detecting most protein similarities… don't usually need to change
– Follow link for detailed information

Submitting Your Query
• Gap Penalties
– Account for insertions and deletions in different sequences
– Scores are penalized for gaps to prevent aberrant alignments
– Opening penalty is high; extension penalty is lower
– Defaults may change depending on matrix choice
– Rarely need to change default value

Protein Words
Query: GTQITVEDLFYNIATRRKALKN
• Word size = 3 (default)
• Word size can only be 2 or 3
• Make a lookup table of words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...

Query: GTQITVEDLFYNIATRRKALKN
Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...
Match! When a word (e.g. GTQ) is found in a DB sequence, the match is extended in both directions.
DB:
TVEDLFRRLKIAGTQEDLRRT
GGHPYTTFWWYQLMERGTQ
GRTHPYTTTWWEWHHRGTQ
GRTHPYTTTWWEWHHRGTQ
GRTHPYTTTWWEWHHRGTQ
GRTHPYTTTWWEWHHRGTQ
…..

Query: GTQITVEDLFYNIATRRKALKN
Each DB sequence receives a Score; scores are reported as E-values and Bit Scores.
(Results page labels: "Click for more info", "Take note")

Basic BLAST programs and databases
Program   Query                                         Database
blastn    Nucleotide sequence                           Nucleotide DB
blastp    Protein sequence                              Protein DB
blastx    Nucleotide sequence, translated in 6 frames   Protein DB
tblastn   Protein sequence                              Nucleotide DB, translated in 6 frames
tblastx   Nucleotide sequence, translated in 6 frames   Nucleotide DB, translated in 6 frames
Translated DBs contain amino acid sequences.
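The "Protein Words" step can be sketched in Python using the query and the first DB sequence from the slides. This is a deliberately simplified illustration, not the actual BLAST implementation. It assumes exact word matches only, whereas real BLAST also accepts similar "neighborhood" words scored with the substitution matrix, and it stops at the seeds without extending or scoring them. The function names `words`, `build_lookup` and `find_seeds` are hypothetical.

```python
def words(seq, k=3):
    """Return all overlapping words of length k (k = 3 is the slide's default)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_lookup(query, k=3):
    """Map each query word to the list of positions where it starts in the query."""
    table = {}
    for pos, word in enumerate(words(query, k)):
        table.setdefault(word, []).append(pos)
    return table


def find_seeds(query, subject, k=3):
    """Return (word, query_pos, subject_pos) for every exact word match.

    Simplification: only identical words count as matches here.
    """
    table = build_lookup(query, k)
    seeds = []
    for s_pos, word in enumerate(words(subject, k)):
        for q_pos in table.get(word, []):
            seeds.append((word, q_pos, s_pos))
    return seeds


if __name__ == "__main__":
    query = "GTQITVEDLFYNIATRRKALKN"      # query from the slide
    subject = "TVEDLFRRLKIAGTQEDLRRT"     # first DB sequence from the slide
    print(words(query)[:8])
    for word, q_pos, s_pos in find_seeds(query, subject):
        print(f"{word}: query position {q_pos}, DB position {s_pos}")
```

Each seed found this way is a candidate starting point. In BLAST proper, the extension step would then grow the match in both directions and score the resulting alignment.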