DNA synthesis and sequencing technologies are advancing rapidly, allowing for the design of high-throughput experiments that were previously hindered by technological constraints. Massively Parallel Reporter Assays (MPRAs) are a novel method that uses these technological advances, and the associated reduction in cost, to study the regulatory activities of tens or hundreds of thousands of DNA oligonucleotides. The oligonucleotides, each of which contains a uniquely identifiable barcode, are synthesized using microarrays. Next, the oligonucleotides are amplified, put into vectors with a reporter gene and transfected into cells. By measuring the expression levels of the reporter gene, the regulatory properties of the corresponding sequence can be inferred. MPRAs and other similar high-throughput methods therefore enable the systematic examination of synthetic regulatory sequences, either to identify rules that govern the regulation of gene expression or to examine the regulatory role of specific elements or of motif positioning.
For instance, some questions that can be explored using synthetic sequences are:
Additionally, MPRAs allow the effect of regulatory Single Nucleotide Polymorphisms (SNPs), small insertions or small deletions on gene expression to be studied. By designing oligonucleotide sequences that differ specifically at a SNP, or that contain a small insertion or a small deletion, the regulatory role of variants found in the human genome can be examined.
Overall, MPRAs and associated technologies are very powerful, but systematically designing these types of experiments currently remains very challenging. We provide MPRAnator, a novel tool for the design of MPRA experiments to study motifs or polymorphisms in a systematic manner.
This page allows the user to synthesize oligonucleotides for MPRA experiments to investigate the rules that govern transcription factor occupancy. A set of variables provides fine control over how motifs are placed into the sequences. The user is able to specify the locations at which to substitute the motifs, and to insert restriction sites, adapter sites and barcodes.
Parameters:
Enter FASTA sequences: The FASTA sequences as text.
Upload FASTA sequences: The FASTA sequences as a text file.
Enter your motifs: The motifs in FASTA format.
Reverse complement sequence: The option to reverse complement the sequences before motif substitution.
Minimum Spacing: The minimum spacing between the motifs.
Maximum Spacing: The maximum spacing between the motifs.
Distance from left edge: The minimum distance from the left-most nucleotide of the motif to the left edge of the sequence.
Distance from right edge: The minimum distance from the right-most nucleotide of the motif to the right edge of the sequence.
Interval of substitution of motifs: The interval length at which the motifs are inserted.
Barcode length: The length of each barcode.
Minimum barcode GC content (%): The minimum GC content (in percent) of each barcode.
Maximum barcode GC content (%): The maximum GC content (in percent) of each barcode.
Barcode edit distance: The Levenshtein distance between each barcode. The default is 2.
Number of barcodes per sequence: The number of barcodes inserted per sequence (replicates).
Restriction sites: The ability to add up to 2 restriction sites into the final product.
Adapter sites: The ability to add up to 2 adapter sites into the final product.
Ordering: Allows the user to order the constituent parts, providing flexibility in the design of the experiment.
The result page displays the synthesized oligonucleotides in FASTA format. The user is able to download the generated oligonucleotides as a plain text file.
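To make the spacing and edge parameters above more concrete, the following is a minimal sketch of how placements could be enumerated. It is not MPRAnator's own code: the function name `place_motifs`, the 0-based positions, and the exact reading of "interval" (step size of the first motif's start position) are assumptions made for illustration, and the tool's real enumeration may differ.

```python
from itertools import product


def place_motifs(background, motifs, min_spacing, max_spacing,
                 left_dist, right_dist, interval):
    """Yield (starts, sequence) for each allowed placement of `motifs`, in order.

    Assumptions (not taken from the tool): motifs overwrite the background
    (substitution), positions are 0-based, and `interval` is the step by
    which the first motif's start position is advanced.
    """
    n = len(background)
    motif_len = sum(len(m) for m in motifs)
    gap_choices = range(min_spacing, max_spacing + 1)
    # One gap is chosen between each pair of neighbouring motifs.
    for gaps in product(gap_choices, repeat=len(motifs) - 1):
        span = motif_len + sum(gaps)
        last_start = n - right_dist - span  # keep the right-edge distance
        for first in range(left_dist, last_start + 1, interval):
            seq = list(background)
            starts = []
            pos = first
            for i, motif in enumerate(motifs):
                seq[pos:pos + len(motif)] = motif  # substitute, same length
                starts.append(pos)
                pos += len(motif)
                if i < len(gaps):
                    pos += gaps[i]
            yield starts, "".join(seq)


# Toy usage: two short motifs in a 20 bp background of lowercase 'a'.
for starts, seq in place_motifs("a" * 20, ["TTT", "GG"], 1, 2, 2, 2, 5):
    print(starts, seq)
```

Because the motifs are substituted rather than inserted, every generated sequence keeps the length of the background, which keeps the final oligonucleotide length predictable.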
The description line (header) has information about the options chosen by the user during submission. A header is composed of one or more DESCRIPTORs and each DESCRIPTOR is composed of a LABEL and INFO. The descriptors are delimited by a |, i.e. a "pipe".
The image below is an excerpt of the results page (showing 8 nucleotides). The nucleotides in red are the substituted motifs.
The format of the FASTA header for each sequence is shown below. Note that the order of the DESCRIPTORs in the header is insignificant.
> <LABEL> - <INFO> | <LABEL> - <INFO> | <LABEL> - <INFO> | ...
The LABEL is one of the options listed under "LABEL types" below and INFO is either a number or a word which describes the LABEL in more detail.
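As a rough illustration of this layout, a header can be split into its descriptors with a few lines of Python. The function name `parse_header` is hypothetical, and the sketch assumes that labels themselves contain no hyphen, since it splits each descriptor at the first "-"; it is not part of MPRAnator.

```python
def parse_header(header):
    """Split a FASTA header into a list of (label, info) pairs.

    Descriptors are separated by '|'; within a descriptor the label and
    info are separated by '-'. A descriptor without '-' gets info None.
    """
    body = header.lstrip(">").strip()
    descriptors = []
    for field in body.split("|"):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing pipe
        label, sep, info = field.partition("-")
        descriptors.append((label.strip(), info.strip() if sep else None))
    return descriptors


# The example header used on this page.
for label, info in parse_header(
        "> ATGTG - 53|AAAAA-61|RESTRICTION - 1|RESTRICTION - 2"):
    print(label, info)
```

Since the order of the descriptors is not significant, downstream code would typically look descriptors up by label rather than by position.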
LABEL types:
<MOTIF>: A particular motif inserted by the user.
BARCODE: Identifies the barcode variant of the same sequence.
RESTRICTION: The restriction site added by the user.
ADAPTER: The presence of the adapter sequence of the specified number.
DUPLICATE_RESTRICTION_SITES: The restriction site which has multiple copies present.
Example of a header:
> ATGTG - 53|AAAAA-61|RESTRICTION - 1|RESTRICTION - 2
There are 4 DESCRIPTORs.
ATGTG - 53 is the motif starting at position 53 in the background sequence.
AAAAA - 61 is the motif starting at position 61 in the background sequence.
RESTRICTION - 1 signifies the presence of restriction site 1 in the final sequence.
RESTRICTION - 2 signifies the presence of restriction site 2 in the final sequence.
This page allows the user to synthesize oligonucleotides for MPRA experiments to study the effects of SNPs. The user can choose to include or exclude combinations of SNPs when designing MPRA experiments.
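The difference between including and excluding SNP combinations can be pictured with a small sketch. This is an illustration under stated assumptions rather than the tool's implementation: the function name `snp_variants` is hypothetical, each SNP is given as a precomputed 0-based offset into the sequence (the mapping from VCF coordinates via the FASTA header is not shown), and only sequences carrying at least one ALT allele are generated.

```python
from itertools import combinations


def snp_variants(sequence, snps, make_combinations=True):
    """Return {name: sequence} for SNP-substituted variants.

    `snps` is a list of (name, offset, ref, alt) with a 0-based offset
    (an assumption of this sketch). With make_combinations=False only
    single-SNP variants are produced; otherwise every non-empty subset
    of the SNPs is substituted.
    """
    for name, offset, ref, alt in snps:
        if sequence[offset].upper() != ref.upper():
            raise ValueError(f"{name}: REF {ref} does not match the sequence")
    max_size = len(snps) if make_combinations else 1
    variants = {}
    for size in range(1, max_size + 1):
        for subset in combinations(snps, size):
            seq = list(sequence)
            for name, offset, ref, alt in subset:
                seq[offset] = alt
            variants["+".join(name for name, *_ in subset)] = "".join(seq)
    return variants


# Toy usage with two made-up SNPs.
toy_snps = [("rs1", 1, "C", "T"), ("rs2", 4, "A", "G")]
for name, seq in snp_variants("ACGTACGT", toy_snps).items():
    print(name, seq)
```

With n SNPs in one sequence, all combinations give 2^n - 1 variant sequences, so the number of oligonucleotides grows quickly when combinations are enabled.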
Parameters:
Enter FASTA sequences: Sequence in FASTA format. The header contains the location information.
Upload FASTA sequences: Sequence in FASTA format as a text file. The header contains the location information.
Enter your SNPs (VCF Format): SNPs in VCF format (as text).
Make Snp Combinations?: Create all combinations of SNPs to be substituted into the sequences.
Barcode length: The length of each barcode.
Minimum barcode GC content (%): The minimum GC content (in percent) of each barcode.
Maximum barcode GC content (%): The maximum GC content (in percent) of each barcode.
Barcode edit distance: The Levenshtein distance between each barcode. The default is 2.
Number of barcodes per sequence: The number of barcodes inserted per sequence (replicates).
Restriction sites: The ability to add up to 2 restriction sites into the final product.
Adapter sites: The ability to add up to 2 adapter sites into the final product.
Ordering: Allows the user to order the constituent parts, providing flexibility in the design of the experiment.
The result page (plain text view) displays the synthesized oligonucleotides in FASTA format. The description line (header) has information about the options chosen by the user during submission. A header is composed of one or more DESCRIPTORs and each DESCRIPTOR is composed of a LABEL and INFO. The descriptors are delimited by a |, i.e. a "pipe".
> <LABEL> - <INFO> | <LABEL> - <INFO> | <LABEL> - <INFO> | ...
LABEL types:
<SNP>: Name of the SNP.
<NUCLEOTIDE>: REF / ALT.
BARCODE: The unique identifier for the variants.
RESTRICTION: The restriction site(s) added by the user.
ADAPTER: The adapter site(s) added by the user.
DUPLICATE_RESTRICTION_SITES: Reports the restriction site which has multiple copies present in the oligo.
The input will accept all nucleic acid IUPAC letters.
Example of a header:
> sequence2|Mutated_nucleotides - 3|Scrambled - No| Reverse - No | Complemented - Yes
There are 5 DESCRIPTORs.
sequence2: This is the header of the sequence.
Mutated_nucleotides - 3: This is the number of randomly chosen mutated nucleotides.
Scrambled - No: Denotes whether the sequence was scrambled.
Reverse - No: Denotes whether the sequence was reversed.
Complemented - Yes: Denotes whether the sequence was complemented.
There are 2 ways to use this tool.
PWM Seq-Gen allows for the stochastic conversion of Position Weight Matrices (PWMs) into motif sequences, using the weight of each nucleotide at each position of the corresponding matrix. The output of PWM Seq-Gen can be used as input to the MPRA Motifs tool, allowing for the design of MPRA experiments using PWMs.
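The idea of stochastic conversion can be sketched as drawing one nucleotide per position, with probabilities proportional to the weights in that column. The code below is an illustrative assumption, not the tool's implementation: the function names, the representation of a PWM as a list of per-position dictionaries, the toy matrix, and the numbering of simulations from 1 are all made up for the example. Only the header layout follows the example shown on this page.

```python
import random


def sample_motif(pwm, rng):
    """Draw one sequence from `pwm`, a list of {base: weight} dicts.

    Each position is sampled independently, with probability
    proportional to the (non-negative) weight of each base.
    """
    bases = "ACGT"
    return "".join(
        rng.choices(bases, weights=[column[b] for b in bases], k=1)[0]
        for column in pwm
    )


def simulate_motifs(name, pwm, n, seed=0):
    """Return n (header, sequence) pairs sampled from the PWM."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return [
        (f">{name} | Simulation number - {i}", sample_motif(pwm, rng))
        for i in range(1, n + 1)
    ]


# Toy 3-position matrix (hypothetical values, not a real JASPAR entry).
toy_pwm = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 5, "G": 5, "T": 0},
    {"A": 1, "C": 1, "G": 1, "T": 7},
]
for header, seq in simulate_motifs("toy_motif", toy_pwm, 4):
    print(header)
    print(seq)
```

Sampling several sequences per matrix, rather than taking only the consensus, lets the resulting MPRA design cover weaker variants of the same motif.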
Parameters:
Example of a header:
> MA0056.1 MZF1_1-4 | Simulation number - 4
There are 2 DESCRIPTORs.
MA0056.1 MZF1_1-4: This is the header of the simulated motif.
Simulation number - 4: This is the simulation number of the PWM.
Inspired by the results presented by Nguyen et al., we would like to investigate the effects of AP1 (TGACTCA), ELK1 (ACCGGAAGT) and RFX (CGTTGCTAGGCAACG) on gene expression. In the original study, the authors focused on one motif at a time to show that each motif has a strong impact on the ability of a sequence to act either as a promoter or as an enhancer. It was also shown that the effect of each motif can vary significantly depending on the background.
To explore the impact of different backgrounds, different motif combinations and different spacings, we use the following settings:
First, we pick two regions from the mouse genome that do not show any regulatory activity according to the sources available through the UCSC Genome Browser. The center of each tile (in mm9 coordinates) is reported in the header.
>bg1_chr6:77195320
tgtgtcttaaaaaaacaaacaaacaaacaaaatcccgaaataaaacacaacaaaaaaaaccccaccccataatcttcaggacagtctgtc
>bg2_chr9:37271330
gtatctactctctgcccttacaacctcctcccagaaagaataaaatgtttctcatcctggaagctacagtgtgtcacacagtatactctt
The following motifs are used:
>AP1
TGACTCA
>ELK1
ACCGGAAGT
>RFX
CGTTGCTAGGCAACG
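Both the backgrounds and the motifs are plain FASTA text, so it can be useful to check the motif lengths before choosing spacing values. The following is a minimal sketch of such a check; the `read_fasta` helper is hypothetical and only handles the simple layout shown here.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}; sequences may span lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


motif_fasta = """>AP1
TGACTCA
>ELK1
ACCGGAAGT
>RFX
CGTTGCTAGGCAACG
"""

motifs = read_fasta(motif_fasta)
for name, seq in motifs.items():
    print(name, len(seq))  # AP1 7, ELK1 9, RFX 15
```

Together the three motifs occupy 31 bp, which limits how much spacing can be explored within a single background.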
Here we assume that the total tile length is 117 base pairs. For the experimental design, two restriction sequences are necessary: CACGTG and CAATTG. The restriction sites are rearranged so that they flank the background.
Since this gives us 90 bp to place three motifs and a barcode, we use 15 bp for the barcodes. To reduce bias, we limit the GC content of the barcodes to between 40% and 60%. We require an edit distance of 3 and use 6 different barcodes for each sequence.
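The barcode settings used here can be illustrated with a simple rejection-sampling sketch: random candidates are drawn and kept only if they satisfy the GC window and are far enough from every barcode accepted so far. This is an assumption about one reasonable way to meet the constraints, not a description of how MPRAnator generates barcodes. The function names are hypothetical, and "edit distance of 3" is read as a minimum pairwise Levenshtein distance.

```python
import random


def gc_percent(seq):
    """GC content of `seq` in percent."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def make_barcodes(count, length=15, min_gc=40, max_gc=60,
                  min_dist=3, seed=0, max_tries=100000):
    """Return `count` random barcodes meeting the GC and distance limits."""
    rng = random.Random(seed)
    barcodes = []
    for _ in range(max_tries):
        if len(barcodes) == count:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if not min_gc <= gc_percent(cand) <= max_gc:
            continue
        if all(levenshtein(cand, b) >= min_dist for b in barcodes):
            barcodes.append(cand)
    if len(barcodes) < count:
        raise RuntimeError("could not find enough barcodes")
    return barcodes


# Six barcodes for one sequence, using the settings of this example.
for barcode in make_barcodes(6):
    print(barcode, round(gc_percent(barcode), 1))
```

For a 15 bp barcode, the 40% to 60% window corresponds to between 6 and 9 G or C bases. A full design would need barcodes that are distinct across all sequences, so `count` would be the total number of oligonucleotides rather than 6.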
This results in a total of 5856 sequences.
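As a sanity check on the length budget of this example, the parts can be added up: two 6 bp restriction sites, a 90 bp background and a 15 bp barcode give 6 + 90 + 6 + 15 = 117 bp. The sketch below assembles one tile under the assumption that the 90 bp refers to the background and that the barcode follows the second restriction site. The part order, the function name `assemble_tile`, and the placeholder background and barcode are illustrative choices, since the actual order is set by the user through the Ordering option.

```python
RESTRICTION_SITES = ("CACGTG", "CAATTG")


def assemble_tile(background, barcode, sites=RESTRICTION_SITES):
    """Join the parts as: site 1, background, site 2, barcode.

    The restriction sites flank the background, as in the example; the
    position of the barcode is an assumption of this sketch.
    """
    left, right = sites
    return left + background + right + barcode


# Placeholder parts with the lengths used in the example.
background = "N" * 90           # stands in for a 90 bp background
barcode = "ACGTACGTACGTACG"     # arbitrary 15 bp barcode
tile = assemble_tile(background, barcode)
print(len(tile))  # 117
```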