Citation: Ranwez V, Harispe S, Delsuc F, Douzery EJP (2011) MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons. PLoS ONE 6(9): e22594. doi:10.1371/journal.pone.0022594
Until now, the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align the amino acid translations before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon prevents the use of such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both an aberrant translation and an aberrant alignment.
We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.
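To illustrate the frameshift pitfall described above, the following minimal sketch translates a toy coding sequence before and after a single nucleotide is inserted. It is not part of MACSE: the `translate` helper and both sequences are hypothetical, and only the standard genetic code (NCBI translation table 1) is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate in reading frame 1, ignoring a trailing incomplete codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Hypothetical toy coding sequence (7 codons, ending with a stop codon).
reference = "ATGGCCAAGTTGACCGGATAA"
# Same sequence with one extra nucleotide inserted after the second codon.
shifted = reference[:6] + "A" + reference[6:]

print(translate(reference))  # MAKLTG*
print(translate(shifted))    # MAKVDRI
```

Everything downstream of the inserted nucleotide is translated differently and the final stop codon disappears, so an alignment computed on the translated sequences would be misled from that position onwards.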
MACSE is distributed as an open-source executable Java archive (jar) with freely available source code, and it can also be used via this web interface.
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta
java -jar -Xmx600m MACS_E_v0.9.jar -i input.fasta -o output.fasta
java -jar -Xmx600m testJar_macse.jar
-a,--initial_alignment <DNA_alignment.fasta> | initial DNA alignment in fasta format |
-c,--geneticCodeF <gc_file.txt> | file listing the genetic code to use for each sequence |
-d,--defaultGC <The_Standard_Code> | indicate the default genetic code |
-f,--frameshiftC <-30> | cost of a frameshift within the sequence (negative value) |
-f_lr,--frameshiftC_lr <-10> | cost of a frameshift within a less reliable sequence (negative value) |
-g,--gap_creationC <-7> | cost of creating a gap (negative value) |
-i,--inputF <input.fasta> | sequence input file in fasta format |
-i_lr,--input_less_reliable_input_F <input_lr.fasta> | less reliable (pseudogenes, reads) sequence input file in fasta format |
-o,--outpuF <output> | prefix of the output files. Two files will be created: outputF_AA_MACS_E_v0.9.fas (for AA) and outputF_DNA_MACS_E_v0.9.fas (for DNA) |
-s,--stopC <-100> | cost of a stop codon not at the end of the sequence (negative value) |
-s_lr,--stopC_lr <-60> | cost of a stop codon not at the end of a less reliable sequence (negative value) |
-x,--gap_extensionC <-1> | cost of a gap extension (negative value) |
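To make the relationship between the default costs above easier to see, here is a small arithmetic sketch. It assumes that the penalties are purely additive, which is a simplification for illustration and not a description of MACSE's actual scoring; the `total_penalty` helper and the event names are hypothetical.

```python
# Default costs copied from the option table above.
RELIABLE = {"frameshift": -30, "stop": -100, "gap_creation": -7, "gap_extension": -1}
# Less reliable sequences only override the frameshift and stop codon costs.
LESS_RELIABLE = {**RELIABLE, "frameshift": -10, "stop": -60}


def total_penalty(event_counts, costs):
    """Sum cost * count over all events (assumes purely additive penalties)."""
    return sum(costs[event] * count for event, count in event_counts.items())


# One frameshift and one internal stop codon in the same sequence.
events = {"frameshift": 1, "stop": 1}

print(total_penalty(events, RELIABLE))       # -130
print(total_penalty(events, LESS_RELIABLE))  # -70
```

With these defaults, the same pair of events is penalised much less in a sequence declared as less reliable, so frameshifts and internal stop codons are accepted more readily in pseudogenes or reads than in the reliable sequences.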
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -d 2
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -c GC_correspondance.txt
my_mito_yeast_seq 3
my_nuc_yeast_seq 12
my_nuc_pan_seq 1
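The three lines above show the content of the file passed with -c: one sequence name followed by a genetic code number per line. As a minimal sketch, such a file could be generated and checked as follows. The helper names are hypothetical, the single-space separator is inferred from the example, and the numbers are assumed to follow the NCBI translation table numbering (1 standard, 3 yeast mitochondrial, 12 alternative yeast nuclear).

```python
def format_gc_file(codes):
    """Return the file content, one 'sequence_name code' pair per line."""
    return "".join(f"{name} {code}\n" for name, code in codes.items())


def parse_gc_file(text):
    """Parse 'sequence_name code' lines back into a dict, skipping blank lines."""
    codes = {}
    for line in text.splitlines():
        if line.strip():
            name, code = line.split()
            codes[name] = int(code)
    return codes


# Sequence names must match the names used in the input fasta file.
codes = {"my_mito_yeast_seq": 3, "my_nuc_yeast_seq": 12, "my_nuc_pan_seq": 1}
content = format_gc_file(codes)
print(content, end="")

# To produce the file used in the command above:
# with open("GC_correspondance.txt", "w") as handle:
#     handle.write(content)
```

Parsing the generated text back with `parse_gc_file` is a cheap way to catch malformed lines before launching an alignment.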
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -g -7 -x -1 -f -30 -s -100
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -i_lr reads.fasta -s_lr -60 -f_lr -10
java -jar MACS_E_v0.9.jar -i input.fasta -o output.fasta -i_lr reads.fasta -s_lr -10 -f_lr -20