MACSE

Multiple Alignment of Coding SEquences



Proposed by: Vincent RANWEZ, Sebastien HARISPE, Frederic DELSUC, Emmanuel DOUZERY
Version 0.9b1


Citation: Ranwez V, Harispe S, Delsuc F, Douzery EJP (2011) MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons. PLoS ONE 6(9): e22594. doi:10.1371/journal.pone.0022594

Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment.
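The second pitfall can be illustrated with a minimal Python sketch. It is not part of MACSE; the toy sequence and the helper name translate are assumptions made for illustration, and the codon table is the standard genetic code (NCBI table 1). Translating in a single frame, one extra nucleotide changes every downstream codon and here introduces premature stop codons:

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate seq in its first reading frame; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# Hypothetical toy coding sequence (6 codons).
cds = "ATGGCTAAAGGTGAACTG"
# The same sequence with a single extra nucleotide inserted after position 4.
shifted = cds[:4] + "C" + cds[4:]

print(translate(cds))      # MAKGEL
print(translate(shifted))  # MA*R*T
```

Aligning the two translations would report little similarity after the second residue, although the nucleotide sequences differ by a single insertion.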
We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.
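For reference, the classical Needleman-Wunsch recurrence mentioned above can be sketched as follows. This is the textbook global alignment with a linear gap cost, not the frameshift-aware algorithm of MACSE; the function name nw_score and the scoring values are assumptions chosen for illustration. The table has (len(a)+1) x (len(b)+1) cells, hence the quadratic time and space.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b (linear gap cost)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[n][m]


print(nw_score("ACGT", "AGT"))  # 1: three matches and one gap
```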
MACSE is distributed as an open-source executable Java (jar) file with freely available source code, and can also be used via this web interface.





Help


Output

prefix of the output files. Two files will be created: outputF_AA_MACS_E_v0.9.fas (for AA) and outputF_DNA_MACS_E_v0.9.fas (for DNA)

Genetic codes

default genetic code
indicate the default genetic code
Genetic Codes File
file listing the genetic code to use for each sequence

Advanced

initial DNA alignment
initial DNA alignment in fasta format

Less reliable sequences

Less reliable sequences
input file of less reliable sequences (pseudogenes, reads) in fasta format
Type of sequences
select the type of less reliable sequences
Frameshift cost within a less reliable sequence
cost of a frameshift within a less reliable sequence (negative value)
Stop codon cost within a less reliable sequence
cost of a stop codon not at the end of a less reliable sequence (negative value)

Sequences

Sequences
sequence input file in fasta format
Gap creation cost
cost of creating a gap (negative value)
Gap extension cost
cost of a gap extension (negative value)
Frameshift cost
cost of a frameshift within the sequence (negative value)
Stop codon cost
cost of a stop codon not at the end of the sequence (negative value)
Select the name of a file or the actual data
You can select a file with the file browser of your web browser (Browse button),
OR you can type your data in the text area, or cut and paste it from another application
(but not both).
Sequence format
Fasta
Command line Help


This is release v0.9b1 of MACS_E, a multiple sequence alignment program dedicated to coding DNA sequences.
MACS_E is able to handle some sequence errors and is thus especially suited to aligning reads or ESTs together with more reliable sequences.
The most basic use of MACS_E only requires specifying an input fasta file containing the sequences to be aligned and an output prefix under which the resulting alignments will be written:
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta
Note that since the program is quite memory consuming, you may have to give extra memory to the Java virtual machine using the Xmx option:
java -jar -Xmx600m MACS_E_v0.9.jar -i input.fasta -o output.fasta

Other options are provided to specify the genetic code that must be used, the cost of introducing frameshifts within normal sequences or less reliable sequences (reads or pseudogenes, for instance), and so on. Further command-line examples are provided at the end of this help.
usage: java -jar -Xmx600m testJar_macse.jar
-a,--initial_alignment <DNA_alignment.fasta> initial DNA alignment in fasta format
-c,--geneticCodeF <gc_file.txt> file listing the genetic code to use for each sequence
-d,--defaultGC <The_Standard_Code> indicate the default genetic code
-f,--frameshiftC <-30> cost of a frameshift within the sequence (negative value)
-f_lr,--frameshiftC_lr <-10> cost of a frameshift within a less reliable sequence (negative value)
-g,--gap_creationC <-7> cost of creating a gap (negative value)
-i,--inputF <input.fasta> sequence input file in fasta format
-i_lr,--input_less_reliable_input_F <input_lr.fasta> less reliable (pseudogenes, reads) sequence input file in fasta format
-o,--outpuF <output> prefix of the output files. Two files will be created: outputF_AA_MACS_E_v0.9.fas (for AA) and outputF_DNA_MACS_E_v0.9.fas (for DNA)
-s,--stopC <-100> cost of a stop codon not at the end of the sequence (negative value)
-s_lr,--stopC_lr <-60> cost of a stop codon not at the end of a less reliable sequence (negative value)
-x,--gap_extensionC <-1> cost of a gap extension (negative value)
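
When MACS_E is called from a script, the options listed above can be assembled programmatically. The following Python sketch only builds the argument list; the short option names are copied from the help text above, while the function name build_macse_command, its keyword names and the default jar name are hypothetical conveniences, not part of MACS_E. JVM options are placed before -jar, which is the usual convention.

```python
# Short option names are taken from the MACS_E v0.9 help text above.
# The Python-side keyword names are hypothetical.
COST_FLAGS = {
    "gap_creation": "-g",
    "gap_extension": "-x",
    "frameshift": "-f",
    "stop": "-s",
    "frameshift_lr": "-f_lr",
    "stop_lr": "-s_lr",
}


def build_macse_command(input_fasta, output_prefix, jar="MACS_E_v0.9.jar",
                        max_heap="600m", default_gc=None, gc_file=None,
                        less_reliable_fasta=None, **costs):
    """Return the argument list for one MACS_E run (nothing is executed)."""
    cmd = ["java", f"-Xmx{max_heap}", "-jar", jar,
           "-i", input_fasta, "-o", output_prefix]
    if default_gc is not None:
        cmd += ["-d", str(default_gc)]
    if gc_file is not None:
        cmd += ["-c", gc_file]
    if less_reliable_fasta is not None:
        cmd += ["-i_lr", less_reliable_fasta]
    for name, value in costs.items():
        if name not in COST_FLAGS:
            raise ValueError(f"unknown cost option: {name}")
        if value >= 0:
            # The help text states that every cost is a negative value.
            raise ValueError(f"{name} must be negative, got {value}")
        cmd += [COST_FLAGS[name], str(value)]
    return cmd


print(" ".join(build_macse_command("input.fasta", "output.fasta", default_gc=2)))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where Java and the jar file are available.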

Here are some command-line examples for various scenarios.

example 1: aligning mitochondrial sequences
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -d 2
Available genetic codes are those of the NCBI:
  • 1 The_Standard_Code
  • 2 The_Vertebrate_Mitochondrial_Code
  • 3 The_Yeast_Mitochondrial_Code
  • 4 The_Mold_Protozoan_and_Coelenterate_Mitochondrial_Code_and_the_Mycoplasma_Spiroplasma_Code
  • 5 The_Invertebrate_Mitochondrial_Code
  • 6 The_Ciliate_Dasycladacean_and_Hexamita_Nuclear_Code
  • 9 The_Echinoderm_and_Flatworm_Mitochondrial_Code
  • 10 The_Euplotid_Nuclear_Code
  • 11 The_Bacterial_Archaeal_and_Plant_Plastid_Code
  • 12 The_Alternative_Yeast_Nuclear_Code
  • 13 The_Ascidian_Mitochondrial_Code
  • 14 The_Alternative_Flatworm_Mitochondrial_Code
  • 15 Blepharisma_Nuclear_Code
  • 16 Chlorophycean_Mitochondrial_Code
  • 21 Trematode_Mitochondrial_Code
  • 22 Scenedesmus_obliquus_mitochondrial_Code
  • 23 Thraustochytrium_Mitochondrial_Code

example 2: aligning sequences using different genetic codes
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -c GC_correspondance.txt
GC_correspondance.txt is a file containing a list of associations relating sequence names to specific genetic codes. The advantage is that you can build this file once and use it for various alignments. We follow the TranslatorX convention; each line contains:
sequence_name <tabulation> genetic_code_number
Here is an example of a GC_correspondance.txt file:
my_mito_yeast_seq	3
my_nuc_yeast_seq	12
my_nuc_pan_seq	1
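
A file in this format (sequence name, a tabulation, then the genetic code number) can be written and checked with a few lines of Python. This is a minimal sketch; the function names write_gc_file and read_gc_file are hypothetical, and the set of accepted numbers is simply the list of NCBI codes given in example 1.

```python
from pathlib import Path

# Genetic code numbers listed in example 1 of this help.
KNOWN_CODES = {1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 21, 22, 23}


def write_gc_file(path, codes):
    """Write one 'sequence_name<TAB>genetic_code_number' line per sequence."""
    lines = []
    for name, code in codes.items():
        if "\t" in name or "\n" in name:
            raise ValueError(f"invalid sequence name: {name!r}")
        if code not in KNOWN_CODES:
            raise ValueError(f"unknown genetic code: {code}")
        lines.append(f"{name}\t{code}")
    Path(path).write_text("\n".join(lines) + "\n")


def read_gc_file(path):
    """Parse a correspondence file back into a {sequence_name: code} dict."""
    codes = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, code = line.split("\t")
        codes[name] = int(code)
    return codes
```

For instance, write_gc_file("GC_correspondance.txt", {"my_mito_yeast_seq": 3, "my_nuc_yeast_seq": 12, "my_nuc_pan_seq": 1}) produces the three-line example file shown above.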
example 3: aligning sequences using different costs for frameshifts, stop codons and gaps
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -g -7 -x -1 -f -30 -s -100
This launches the program with its default option values: a penalty of 7 for gap creation (divided between gap opening and gap closing), a penalty of 1 for gap extension, a very high penalty of 100 for a stop codon appearing inside the sequence, and a frameshift penalty of 30.
example 4: aligning sequences with reads or ESTs (advised parameters)
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -i_lr reads.fasta -s_lr -60 -f_lr -10
lr stands for less reliable sequences. In this case, the set of more reliable sequences is aligned first, then the less reliable ones are sequentially added to this initial alignment. The presence of frameshifts and stop codons must be less penalized in less reliable sequences. In reads, they may be due to sequencing errors. For this kind of sequence, frameshifts are frequent in homopolymer regions and are thus less penalized than stop codons.
example 5: aligning sequences with pseudogenes (advised parameters)
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -i_lr reads.fasta -s_lr -10 -f_lr -20
lr stands for less reliable sequences. In this case, the set of more reliable sequences is aligned first, then the less reliable ones are sequentially added to this initial alignment. The presence of frameshifts and stop codons must be less penalized in less reliable sequences. In pseudogenes, they may have appeared since the pseudogenisation. Yet the sequence has been strongly shaped by its history as a gene, so stop codons and frameshifts are still penalized. Since stop codons result from mutations while frameshifts result from insertions/deletions, the latter are more penalized for pseudogenes.

References:
Citation: Ranwez V, Harispe S, Delsuc F, Douzery EJP (2011) MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons. PLoS ONE 6(9): e22594. doi:10.1371/journal.pone.0022594