Nature biotechnology. Feb/29/2012; 30(3): 271-277

Published online Feb/25/2012

PMID: 22371084

PMC: 3297981

doi: 10.1038/nbt.2137

Rapid dissection and model-based optimization of inducible enhancers in human cells using a massively parallel reporter assay

Curtis G. Callan+4 authors

Abstract

INTRODUCTION

Genetic analysis and engineering would greatly benefit from an improved understanding of how transcriptional regulatory elements are encoded in DNA. Evolutionary analysis and chromatin-state mapping have revealed myriad regulatory elements across the human genome^1–2, but we are largely unable to explain why an element is active in a specific cell type or to predict the effect of a specific mutation. Moreover, synthetic regulatory elements can provide powerful tools for genetics, high-throughput screening and gene therapy^3–5, but our ability to engineer such elements is similarly limited - particularly in mammalian systems, where traditional functional assays⁶ suffer from low throughput and directed evolution is generally impractical.

To enable systematic dissection and optimization of transcriptional regulatory elements, we developed a massively parallel reporter assay (MPRA; Fig. 1). MPRA is similar in principle to the in vitro transcription-based assay of ref. 7, but is suitable for use with mammalian cells. Briefly, we first synthesize tens of thousands of oligonucleotides⁸ that contain a library of regulatory elements, each coupled to a short tag. We use the oligonucleotides to generate a pool of plasmids, where each plasmid contains one of the regulatory elements, an optional invariant promoter, an arbitrary open reading frame (ORF) and an identifying sequence tag. We co-transfect these plasmids into cells, where active elements drive transcription of mRNAs containing the tags in their 3' UTRs. To estimate their relative activities, we sequence and count the tags in the reporter mRNAs and the plasmids pools, and then take the ratios of these counts. The resulting data is amenable to a variety of analyses, including high-resolution footprinting and quantitative modeling⁹.

As a proof of concept, we applied MPRA to study two inducible enhancers: a synthetic cAMP-regulated enhancer (CRE), which is widely used as a cellular cAMP sensor⁴, and the virus-inducible enhancer of the human interferon beta (IFNB) gene, which is one of the most comprehensively studied mammalian regulatory elements¹⁰. These sequences represent two different enhancer architectures¹¹. The synthetic CRE is a “billboard” enhancer that contains multiple non-overlapping binding sites for the cAMP-responsive transcription factor CREB. In contrast, the IFNB enhancer contains overlapping binding sites for six different transcription factors that assemble into a highly cooperative “enhanceosome”.

In this report, we first establish that MPRA can accurately identify functional sequence features in both enhancers at nucleotide resolution. Next, we use MPRA data to train quantitative sequence-activity models (QSAM)^9,12–13 that describe the activity of the enhancers in their induced or uninduced states. Finally, we demonstrate that these QSAMs can be combined to identify mutations that increase enhancer inducibility (the ratio of induced versus uninduced activity).

RESULTS

Experimental design and mutagenesis strategies

We synthesized 142mer oligonucleotide pools containing 87 nt CRE and INFB enhancer variants, as well as 10 nt tags and various invariant sequences required for cloning (Supplementary Fig. 1).

We tested two different mutagenesis strategies. The first was a “single-hit scanning” strategy⁷ where we assayed ~1,000 specific enhancer variants, including all possible single substitutions, multiple series of consecutive substitutions, and small insertions at all positions (Supplementary Table 1). Each scanning variant was linked to 13 tags for a total of 13,000 distinct enhancer-tag combinations. This redundancy provides parallel measurements for each variant, which can be used to both quantify and to reduce the impact of experimental noise, including tag-dependent bias (Supplementary Fig. 2). The second was a “multi-hit sampling” strategy⁹ where we assayed ~27,000 distinct enhancer variants (Supplementary Table 2), each linked to a single tag. These variants were constructed by introducing random nucleotide substitutions into the enhancers at a rate of 10% per position. Because the variants were designed in silico and then synthesized, they provide a uniform mutational spectrum. This strategy has the advantages that each substitution is assayed in a larger fraction of the variants and that the use of multiple substitutions enables detection of interactions; a disadvantage is less accurate measurements for individual variants.

We cloned oligonucleotide pools synthesized according to both strategies into identical plasmid backbones, inserted a minimal TATA-box promoter and a luciferase gene between the variants and tags, and transfected the resulting plasmid pools into human embryonic kidney (HEK293T) cells. To induce the CRE or IFNB enhancer, we treated the transfected cells with forskolin or infected them with Sendai virus, respectively. To estimate the relative activities of the enhancer variants, we sequenced 20–120 million PCR-amplified mRNA and plasmid tags from each transfection.

Assay validation

We validated the resulting data using several different approaches (Supplementary Fig. 3).

First, we examined the distributions of plasmid tag counts. We found that the vast majority (≥99.6%) of the tags we designed were indeed present in each pool, and that their relative concentrations were similar (coefficient of variation = 0.45–1.0). This confirms successful generation of high complexity plasmid pools.

Second, we synthesized and transfected each of the two CRE plasmid pools twice. We found that the ~13,000 and ~27,000 pairs of mRNA/plasmid tag ratios obtained from the single-and multi-hit pools, respectively, were highly correlated (Pearson r² = 0.61 and 0.67, least significant p < 10⁻¹⁰⁰). The medians of the 13 tag ratios from each distinct variant in the replicate single-hit pools were even more similar (r² = 0.89, p < 10⁻¹⁰⁰). This indicates that MPRA is robust, and that the noise level can be controlled by adjusting the number of distinct tags linked to each distinct variant.

Finally, we subcloned 24 plasmids from each of two CRE pools and individually measured their luciferase expression levels following forskolin treatment. We found a linear relationship between the MPRA- and luciferase-based activities for both pools (r² = 0.45 and 0.75, p < 0.0002). This indicates that MPRA is directly comparable to traditional reporter assays.

Single-hit scanning

We began our analysis by attempting to dissect the two induced enhancers using the scanning mutagenesis data. We estimated the relative activity of each variant by comparing the median of its 13 mRNA/plasmid tag ratios to the median ratio for tags linked to the corresponding “wild-type” enhancer⁷.

We first focused on the CRE, which contains two consensus CREB dimer binding sites (denoted as sites 1 and 4 in Fig. 2a) separated by two monomer sites (sites 2 and 3). We found that 154 of the 261 possible single substitutions significantly altered its activity (5% FDR), with the majority (79%) resulting in decreased activity (Fig. 2b). The substitutions that resulted in the largest decreases were in or immediately flanking the CREB sites. Substitutions in the promoter-proximal CREB site 4 had the largest effects, which is consistent with reports of the cAMP-responsiveness of CREB sites being inversely correlated with their distance to a TATA-box¹⁴. Within the two dimer sites, substitutions in the central CGs were the most deleterious. This is consistent with biochemical data that show that this dinucleotide is critical for high affinity CREB-DNA interactions¹⁵.

Substitutions at 47 of 61 positions outside of the CREB sites also caused significant, although generally more subtle, changes in activity. This may reflect the effects of cryptic non-CREB binding sites. In particular, two substitutions upstream of CREB site 1, as well as almost every substitution in a C-rich motif flanking CREB site 4, resulted in increased CRE activity. These substitutions may therefore cause either increased recruitment of activating factors or decreased recruitment of repressors.

Scanning the CRE with blocks of eight consecutive substitutions caused changes that were consistent with the single substitutions, but often more deleterious (Fig. 2c; Supplementary Fig. 4). Notably, while most single substitutions in CREB site 1 had no detectable effects, this site was clearly identified by the introduction of multiple substitutions.

Insertions of both 5 and 10 nucleotides were well-tolerated at multiple positions between CREB sites 1 and 2 and between sites 3 and 4 (Fig. 2d; Supplementary Fig. 5). This implies that the CRE activity is not dependent on specific spacing or phasing between these sites. In contrast, insertions between sites 2 and 3 resulted in decreased activity, despite single substitutions having small effects in the same region. This may reflect a direct interaction between proteins at these two sites, which was also suggested by a study of these sites in their natural context¹⁶.

We next focused on the IFNB enhancer, which is a 44 nt sequence containing overlapping, non-consensus binding sites for an ATF-2/c-Jun heterodimer, two IRF-3 and two IRF-7 proteins, and a p50/RELA (NF-κB) heterodimer (Fig. 3a)¹⁰. We included a small amount of flanking genomic sequence, for a total length of 87 nt. We found that 83 of the 261 possible single substitutions altered its activity in virus-infected cells (5% FDR), and that almost all (92%) of these were within the 44 nt core (Fig. 3b). Scanning with consecutive substitutions did not reveal any unambiguously functional sequences outside of this core (Fig. 3c; Supplementary Fig. 6).

Within the core, there were only nine positions where all alternate nucleotides could be introduced without affecting its activity. Strikingly, seven of these positions were in gaps between the 5′- and 3′-halves of IRF sites, where these proteins primarily interact with the DNA backbone¹⁰. Insertions were also largely deleterious within the core (Fig. 3d; Supplementary Fig. 7). Both 5 and 10 nt insertions were, however, tolerated between IRF-7 site 2 and the p50/RELA site, which is consistent with the absence of a known protein or interaction spanning this gap.

Finally, seven single substitutions within the core caused a significant increase in activity. At least four of these would be predicted to increase the affinity of a protein-DNA interaction, by introducing a central CG into the ATF-2/c-Jun site (TGACATAG to TGACGTAG), changing the 3′-halves of IRF-3 site 1 or 2 to its consensus (AAAA or GAGA toGAAA) or changing the NF-κB 5′ half-site to a sequence specifically preferred by the p50 subunit (GGGAA to GGGGA)¹⁷. We note that introduction of such consensus sites are, however, likely to decrease the specificity of the enhancer towards viral infection (see below and ref. 18).

Multi-hit information footprints

We next attempted to dissect the two enhancers using the multi-hit sampling data⁹. To quantify the dependency between enhancer activity and substitutions at a specific position, we estimated the mutual information between the nucleotides at that position and the corresponding tag ratios across the ^~27,000 variants. To infer the effect of substitutions on the basal enhancer activities, we also assayed the variants in untreated cells. The resulting “information footprints”^9,19 are shown in Figures 4 and 5.

We found that the 27 most informative positions in the induced CRE footprint were all located in or immediately flanking the four CREB sites (Fig. 4a). The more symmetric footprint of dimeric CREB site 4 compared to site 1 likely reflects the palindromic flanks of the former (ATTGACGTCAATvs.AGTGACGTCAGC). The information contents of CREB sites 2–4 were substantially lower in the uninduced state, which is consistent with cAMP-dependence. In contrast, the information contents of CREB site 1 and the cryptic binding sites near CREB sites 1 and 4 were higher in the uninduced footprint. This is again consistent with the most promoter-distal CREB site being less cAMP-dependent¹⁴ and suggests that these sites may be important for controlling the basal CRE activity.

The IFNB enhancer footprint from virus-infected cells shows, as expected, that its functionally relevant nucleotides are concentrated in the 44 nt core (Fig. 5a). Indeed, 35 of 46 positions that had significant information content (5% FDR) are located in the core. Strikingly, the uninduced IFNB footprint revealed only 8 informative positions, compared to 73 in the uninduced CRE footprint. This likely reflects the very low basal activity of the IFNB enhancer (at least 5-fold lower than the uninduced CRE in luciferase assays).

Quantitative sequence-activity models

We next attempted to develop QSAMs^9,12–13 for the two enhancers, with the goal of predicting the activity of novel variants. As a first step, we used linear regression to train QSAMs where each nucleotide position is simply assumed to contribute additively to the log-transformed activity of the enhancers in the induced or uninduced states^12–13.

Linear QSAMs trained on the multi-hit data are shown in Figures 4b and 5b (see Supplementary Figs. 8 and 9 for models trained on single-hit data). Inspection revealed good qualitative correspondence with the sequence features described above. For example, the two CRE models show that CREB site 1 is critical for maximizing the induced activity, while site 4 has the largest influence on the basal activity.

To quantify how well the linear models describe our data, we compared their predictions to the observed activities for both the ^~27,000 variants in the multi-hit training sets and the 261 single substitutions in the independent single-hit data. For the CRE, we found that the linear model for the induced state generates predictions that are highly correlated with the observed activities of both multi- and single-hit variants (r² = 0.63, p < 10⁻¹⁰⁰ and r² = 0.79, p < 10⁻⁸⁹, respectively). Remarkably, this model therefore explains ^~90% of the non-technical variance in both data sets (compare to r² = 0.67 and 0.89 between replicates, see above). The large number of multi-hit measurements ensures that this is not the result of over-fitting (r² ≥ 0.62 on five-fold cross-validation). In contrast, the induced IFNB model performed significantly better on single-hit variants (r² = 0.61, p < 10⁻⁵⁴) than on multi-hit variants (r² = 0.071, p < 10⁻¹⁰⁰), despite being trained on the latter set.

The difference in the fit of linear models appears to reflect the different architectures of the enhancers. Most CRE multi-hit variants disrupt one or more of the non-overlapping consensus CREB sites, which caused large (median = 4.7-fold) and roughly additive reductions in its induced activity, until an apparent minimum is reached (Supplementary Fig. 8b). Multiple substitutions in the induced IFNB enhancer generally caused weaker (median = 1.8-fold) and non-additive reductions in activity, which may reflect its initially weaker non-consensus binding sites or more complex interactions between its transcription factors.

Since both enhancers showed evidence of nonlinear responses, we next attempted to refine our QSAMs by incorporating functional nonlinearities. We fit a variety of QSAMs to the data, including ones describing either dinucleotide interactions or biophysical interactions between DNA-bound proteins (Supplementary Notes, Supplementary Tables 5 and 6). Model parameters were optimized using linear regression or mutual information maximization⁹. For the CRE, the best performing QSAM was a “linear-nonlinear” model²⁰ in which each nucleotide position is assumed to contribute additively to a linear activation measure, and a sigmoidal function of that measure then gives the transcriptional response. The optimal parameters for the linear part of this model are virtually identical (r² = 0.98) to the strictly linear QSAM, but the two additional parameters that describe the sigmoidal nonlinearity allow the model to describe both minimum and maximum activation levels. Interestingly, this nonlinearity appears to capture much of the remaining non-technical variance in the induced CRE data (r² = 0.72, p < 10⁻¹⁰⁰, compared to r² = 0.67 between the two replicates). For the IFNB enhancer, the best performing models were those that incorporated dinucleotide interactions, which is consistent with its more complex architecture, although no model provided more than a modest improvement over the linear QSAM (up to r² = 0.10, p < 10⁻¹⁰⁰). Thus, while linear QSAMs are imperfect representations of the underlying biological systems, in these cases they appear to provide a reasonable trade-off between complexity and predictive power.

Model-based optimization

Linear QSAMs have previously proven useful for engineering regulatory elements in bacteria^12,21. To explore the potential for model-based optimization of synthetic regulatory elements in mammals, we next attempted to design enhancers with modified activities.

We first attempted a “greedy” approach to maximize the induced enhancer activities. We selected, for each position, the nucleotide predicted to make the largest activity contribution according to the corresponding linear model. This resulted in changing the CRE at 36 of 87 positions (CRE-A1 in Fig. 6a). These changes left the consensus CREB sites intact, but introduced predicted activating mutations into the flanks of CREB sites 1–3 and into the two cryptic binding sites. For the IFNB enhancer, we limited modifications to the 44 nt core. This resulted in changes at 15 positions (IFNB-A1 in Fig. 6c) including conversion of every non-consensus IRF half-site to the GAAA consensus and strengthening the p50 half-site. We individually synthesized these two variants and then compared them to their “wild-types” using a luciferase assay. We found that both new variants had significantly higher induced activities (2.1-fold for CRE-A1 and 2.6-fold for IFNB-A1; Fig. 6b, d). Interestingly, the increase for CRE-A1 (2.1-fold) was substantially lower than predicted by the simple linear model (32-fold), but close to the value predicted by the linear-nonlinear model (1.7-fold). In contrast, the increase for IFNB-A1 (2.6-fold) was close to the value predicted by its linear model (2.1-fold). This difference likely reflects that the wild-type CRE is composed of consensus activator sites and therefore operates much closer to saturation than the IFNB enhancer. We also found, however, that both new variants had disproportionately higher uninduced activities (19-fold for CRE-A1 and 17-fold for IFNB-A1). This suggests that mutations that increase the induced activity of an enhancer may often decrease its inducibility, which would likely be detrimental in most biological and engineering contexts.

Accordingly, we attempted instead to maximize the inducibility of the two enhancers. We simultaneously considered the induced and uninduced linear QSAMs and, for each position, selected the nucleotide predicted to maximize inducibility, without (1) increasing the uninduced activity or (2) decreasing the induced activity relative to the wild-type. For the CRE, we synthesized three variants (CRE-I1 to -I3 in Fig. 6a). CRE-I1 and -I2 were predicted by QSAMs trained on each of the two replicate CRE data sets and contained 10 and 12 substitutions. CRE-I3 contained only the five substitutions that were shared between the first two. Only one variant (CRE-I2) contained any activating substitutions in the cryptic motifs near CREB sites 1 and 4. We found that all three variants showed a significant increase in induced activity without the large decrease in inducibility seen for CRE-A1 (Fig 6b). Moreover, CRE-I3 showed no increase in uninduced activity, which resulted in a ~25% increase in inducibility relative to the wild-type (~44-fold vs. ~35-fold). Notably, we failed to isolate variants with similar or higher inducibilities from the original random variants (Supplementary Fig. 10). For the IFNB enhancer, we synthesized one variant containing five substitutions in the core, none of which modified the non-consensus sites (IFNB-I1 in Fig. 6c). This variant also showed increased inducibility relative to the wild-type (~100-fold vs. ~67-fold).

DISCUSSION

We have developed a massively parallel reporter assay (MPRA) that enables functional analysis of transcriptional regulatory elements at significantly higher throughput than traditional bioluminescence- and fluorescence-based assays. In our initial experiments, we used MPRA to map functional transcription factor binding sites at single-nucleotide resolution and to train simple quantitative sequence-activity models. The ability to infer QSAMs of arbitrary functional form using data similar to ours has been demonstrated in bacteria⁹. Applied to mammalian cells, this approach may help elucidate the biophysical basis of inducible and cell type-specific enhancer activity.

MPRA can be readily adapted to other experimental designs by varying the oligonucleotide composition, promoter-ORF insert, plasmid backbone or transfected cell types. Advances in DNA synthesis technology promise to enable analysis of longer elements in the near future^22–23 and transposon- or virus-derived sequences can be included in the backbone to support genomic integration. We therefore expect that the assay will provide a powerful tool for screening and dissecting the large variety of regulatory elements that are being identified by the ENCODE Project¹, the NIH Roadmap Program on Epigenomics²⁴ and similar efforts.

Beyond studying variants of naturally-occurring DNA sequences, the flexibility and decreasing cost of DNA synthesis is enabling development of novel regulatory elements. Strong synthetic promoters have previously been selected from combinatorial libraries using FACS^25–26. It may be challenging, however, to design direct selection strategies for elements with more complex characteristics, such as optimal inducibility, dynamic range or cell type-specificity. Model-based optimization represents an alternative to direct selection. In this approach, all synthesized elements are first profiled in multiple cell states, with the resulting data being integrated to identify sequences that optimize complex objectives. This approach can be applied iteratively, which would be conceptually similar to genetic algorithm-based optimization²⁷. With the development of more sophisticated mutagenesis and modeling strategies, we expect that this approach will provide a powerful tool for synthetic biology.

METHODS

Oligonucleotide library design and synthesis

142mer oligonucleotides were designed to contain, in order, the universal primer site ACTGGCCGCTTCACTG, an 87 nt variable sequence, Kpnl/Xbal restriction sites (GGTACCTCTAGA), a 10 nt variable tag sequence and the universal primer site AGATCGGAAGAGCGTCG (see Supplementary Fig. 1). The “wild-type” CRE sequence was derived from pGL4.29 (Promega). The wild-type interferon beta enhancer sequence was derived from the NCBl36/hg18 human genome reference assembly. The enhancer variants were designed as described in `Experimental design and mutagenesis strategies', and 100 distinct wild-type enhancer-tag pairs were included in each scanning pool. The distinct tags were selected from randomly generated 10 nt sequences, with the following constraints: (1) must contain all four nucleotides, (2) must not contain a run of more than four identical nucleotides, (3) must not contain a Kpnl or Xbal restriction site, and (4) must not contain a known mammalian microRNA seed sequence (obtained from http://www.targetscan.org, April 2009).

The resulting oligonucleotide libraries were synthesized by Agilent, Inc. as previously described⁸. Sanger sequencing of subcloned MPRA plasmids suggested that the synthesis error rate was 1 in 200–300, with small deletions being the most common failure mode (data not shown).

Plasmid construction

Oligonucleotide libraries were resuspended in TE 0.1 buffer (10 mM Tris-HCl, 0.1 mM EDTA, pH 8.0) and amplified using 8–12 cycles of PCR using Phusion High-Fidelity PCR Master Mix with HF buffer (NEB) and primers ACTGGCCGCTTCACTG and CGACGCTCTTCCGATCT. The resulting PCR products were size-selected on 4% NuSieve 3:1 agarose gels (Lonza), purified using QIAquick G e l Extraction kits (Qiagen) and re-amplified with primers GCTAAGGGCCTAACTGGCCGCTTCACTG and GTTTAAGGCCTCCGAGGCCGACGCTCTTC to add Sfil sites.

To generate the plasmid backbone for the MPRA constructs, the luc2 reporter gene was removed from pGL4.10[luc2] (Promega) by Hindlll-Xbal digestion. The 5' extension of the Hindlll site was filled in with Klenow fragment of DNA polymerase I (NEB) and the Xbal site was eliminated by treatment with Mung Bean nuclease (NEB). The resulting linear plasmid was self-ligated to generate cloning vector pGL4.10M.

To insert the variable regions into the MRPA vector, purified oligonucleotide PCR products were digested with Sfil (NEB) and directionally cloned into Sfil digested pGL4.10M using One Shot® TOP10 Electrocomp™ E. coli cells (Invitrogen). To preserve library complexity, the efficiency of transformation was maintained at >3×10⁸ cfu/μg. Isolated plasmid pools were digested with Kpnl/Xbal to cut between the enhancer variants and tags, ligated with the 1.78 kb Kpnl-Xbal fragment of pGL4.23[luc2/minP] (Promega), which contains a minimal TATA-box promoter and the luc2 ORF, and then transformed into E. coli as described above. Finally, to remove vector background, the resultant plasmid pools were digested with Kpnl, size selected on a 1% agarose gel, self-ligated and re-transformed into E. coli.

For validation of QSAM optimized enhancers, each variant was individually synthesized with the constant flanking sequences CTGGCCTAACTGGCCGCTTCACTG and GGTACCTGAGCTCGC (IDT). The oligonucleotides were PCR amplified as described above with primers CTGGCCTAACTGGCC and GCGAGCTCAGGTACC, cloned into pGL4.24[luc2P/minP] (Promega) using the In-Fusion® PCR Cloning System (Clontech) and verified by Sanger sequencing prior to transfection.

Cell culture and transfection

HEK293T/17 cells (ATCC CRL-11268) were cultured in DMEM (Mediatech) supplemented with 10% FBS and L-glutamine/penicillin/streptomycin.

For transfection of a plasmid pool, 4×10⁶ cells were grown to 40–50% confluence in a 10 cm culture dish. Cells were transfected with 10 μg DNA from each plasmid pool in 1 mL OptiMEM® I Reduced Serum Medium (Invitrogen) using 30 μL Lipofectamine™ LTX and 10 μL Plus Reagent (Invitrogen). The transfection mixtures were removed by media exchange after 5 hours. After 24 hours, cells transfected with CRE plasmid pools were treated for 5 hours with 100 μM forskolin (Sigma) in DMSO (induced state) or an equivalent volume of DMSO only (uninduced state). Cells transfected with IFNB plasmid pools were infected with Sendai virus (ATCC VR-907) at an MOI of 10 (induced state) or mock infected (uninduced state) for 16 hours. Immediately following these treatments, cells were lysed in RLT buffer (Qiagen) and frozen at −80 °C. Total RNA was isolated from cell lysates using RNeasy kits (Qiagen).

For transfection of individual validation plasmids, 2.3 ×10⁴ cells were seeded into each well of 96-well plates. Each well was transfected with 15 μl of Opti-MEM® I Reduced Serum Medium (Invitrogen) containing 100 ng of luc2 reporter plasmid with CRE- or IFNB-derived variants and 10 ng of pGL4.73[hRluc/SV40] (Promega) for normalization, 0.25 μL Lipofectamine™ LTX and 0.1 μL Plus Reagent (Invitrogen). Cell were treated with forskolin or infected with Sendai virus as described above. Luciferase activities were measured using Dual-Glo® Luciferase Assay (Promega) and an EnVision 2103 Multilabel Plate Reader (PerkinElmer).

Tag-Seq

mRNA was extracted from total RNA using MicroPoly(A)Purist™ kits (Ambion) and treated with DNase I using the Turbo DNA-free™ kit (Ambion). First-strand cDNA was synthesized from 400–700 ng mRNA using High Capacity RNA-to-cDNA kits (Applied Biosystems).

Tag-Seq sequencing libraries were generated directly from 12% of a cDNA reaction or 50 ng plasmid DNA by 26 cycle PCR using Pfu Ultra HS DNA polymerase 2× master mix (Agilent) and primers AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT and CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGAGGTGCCTAAAGG (where XXXXXXXX is a library-specific index sequence). The resultant PCR products were size-selected using 2% agarose E-Gel EX (Invitrogen). The libraries were sequenced in indexed pools of eight, or individually, using 36 nt single-end reads on Illumina HiSeq 2000 instruments.

To infer the tag copy numbers in each Tag-Seq library, all sequence reads were examined, regardless of their quality scores. If the first ten nucleotides of a read perfectly matched one of the 13,000 or 27,000 designed tags and the remaining nucleotides matched the expected upstream MPRA construct sequence, this was counted as one occurrence of that tag. All reads that did not meet this criterion were discarded. All tags that did not have a count of at least 20 in every sequenced CRE or IFNB enhancer plasmid pool were also discarded. The mRNA/plasmid tag ratios were normalized by multiplying by the ratio of the total number of plasmid and mRNA tag counts from the corresponding Tag-Seq libraries.

Analysis of single-hit scanning variants

To estimate the relative activity of each distinct enhancer variant, the median of its 13 mRNA/plasmid tag ratios were compared to the median of the mRNA/plasmid ratios for tags linked to the corresponding WT enhancer. To increase the accuracy of this comparison, 65 distinct WT enhancer-tag pairs were included in each pool design. Significant differences in the median ratios were inferred by applying the Mann-Whitney U-test to all variant-WT pairs and then applying the Benjamini-Hochberg procedure to identify the 5% false discovery rate (FDR) threshold²⁸.

Analysis of multi-hit sampling variants

Information footprints were generated as described in ref. 9. Briefly, the mRNA/plasmid tag ratios from each transfection experiment were first quantized by partitioning into five equally sized bins. The mutual information values between the bases at each position and the quantized activities were then estimated using the Treves-Panzeri limited sample correction ²⁹:I(bi;μ)≈∑bi,μf(bi,μ)log2f(bi,μ)f(bi)f(μ)−6Nlog2ewhere b_i is the base at the ith position, μ is the quantized activity, f() gives the corresponding joint and marginal frequency distributions and N is the number of assayed variants.

Error bars on these values were determined by computing uncorrected mutual information estimates Inaive50%(bi;μ) for 10,000 random sub-samples that each contained 50% of the enhancer variants. The uncertainties in I(b_i;μ) were computed from the variance of these estimates:δI(bi;μ)=12var(Inaive50%(bi;μ))

To identify positions with significant information content, empirical null distributions for I(b_i;μ) were generated from 10,000 random permutations of the mapping between the quantized activities and the enhancer variants. The probability of the absence of information at the ith position was estimated as (n_i+1)/10,000, where n_i is the number of random permutations for which I(b_i;μ) exceeded the original value. The Benjamini-Hochberg procedure was then applied to identify the 5% FDR threshold²⁸.

Quantitative sequence-activity modeling

The method of ordinary least squares was used to train linear QSAMs of the formlog(activity(σ))=∑b,iAbixbiwhere A_bi is the activity contribution of base b at the ith position, and x_bi is an indicator variable that is 1 if the enhancer variant σ contains base b at the ith position and 0 otherwise. Other models, including nonlinear QSAMS, are described in Supplementary Note 1.

Model-based optimization of the induced activity of each enhancer was performed by identifying and synthesizingargmaxσactivityinduced(σ)based on the corresponding linear QSAMs (without interaction terms).

Model-based optimization of the inducibility of each enhancer was performed by identifying and synthesizingargmaxσactivityinduced(σ)activityuninduced(σ)based on the corresponding linear QSAMs, with the constraintsAσiinduced≥AWTiinducedAσiuninduced≤AWTiuninducedwhere WTi is the base at the ith position of the wild-type enhancer.

Supplementary Material

NIHMS351608-supplement-1.pdf

NIHMS351608-supplement-2.xlsx

NIHMS351608-supplement-3.xlsx

NIHMS351608-supplement-4.xlsx

NIHMS351608-supplement-5.xlsx

Figure 1

Overview of MPRA

Oligonucleotides containing enhancer variants coupled to distinguishing tags are first generated using microarray-based DNA synthesis. The variants and tags are separated by two common restriction sites (circle/square). The oligonucleotides are PCR amplified from universal primer sites (not shown) and directionally cloned into a plasmid backbone. An invariant promoter-ORF segment is then inserted between the variants and tags by double digestion and directional ligation. The resulting reporter plasmid pool is cotransfected into a population of cells. The relative regulatory activities of the transfected variants are inferred by sequencing and counting their corresponding tags from the cellular mRNA and the transfected plasmid pool. See Supplementary Figure 1 for additional details.

Figure 2

Single-hit scanning mutagenesis of the cAMP-responsive enhancer

(a) The CRE sequence with known and putative transcription factor binding sites indicated. (b) Changes in induced activity due to single nucleotide substitutions. Each bar shows the log-ratio of the median variant and wild-type activity estimates. (c) Changes in induced activity due to 8 consecutive substitutions. The plot shows the medians of three different types of substitutions, see Supplementary Fig. 3. Each bar is located at the fourth nucleotide in the corresponding 8 nt substitution. (d) Changes in induced activity due to 5 nt (top) and 10 nt (bottom) insertions. The plots show the means of two different insertions, see Supplementary Fig. 4. Each bar is located one nucleotide to the right of the insertion. Error bars show the first and third quartile. Red indicates a significant change from wild-type (Mann-Whitney U-test, 5% FDR). Numerical values are provided in Supplementary Table 3.

Figure 3

Single-hit scanning mutagenesis of the virus-inducible IFNB enhancer

(a) The IFNB enhancer with known transcription factor binding sites indicated. (b) Changes in induced activity due to single nucleotide substitutions. Each bar shows the log-ratio of the median variant and wild-type activity estimates. (c) Changes in induced activity due to 8 consecutive substitutions. The plot shows the medians of three different types of substitutions, see Supplementary Fig. 5. Each bar is located at the fourth nucleotide in the corresponding 8 nt substitution. (d) Changes in induced activity due to 5 nt (top) and 10 nt (bottom) insertions. The plots show the means of two different insertions, see Supplementary Fig. 6. Each bar is located one nucleotide to the right of the insertion. Error bars show the first and third quartile. Red indicates a significant change from wild-type (Mann-Whitney U-test, 5% FDR). Numerical values are provided in Supplementary Table 4.

Figure 4

Multi-hit sampling mutagenesis of the cAMP-responsive enhancer

(a) Information footprints of the CRE in its induced (top) and uninduced (bottom) states. Red indicates significant information content at the corresponding position (permutation test, 5% FDR). Error bars show uncertainties inferred from sub-sampling. (b) Visual representations of linear QSAMs of the CRE in its induced (top) and uninduced (bottom) states. The color in each entry represents the estimated additive contribution of the corresponding nucleotide to the log-transformed activity of the enhancer. The matrices are re-scaled such that the lowest entry in each column is zero and the highest entry anywhere is one. Both matrices are shown on the same scale. Numerical values are provided in Supplementary Table 3.

Figure 5

Multi-hit sampling mutagenesis of the virus-inducible IFNB enhancer

(a) Information footprints of the IFNB enhancer in its induced (top) and uninduced (bottom) states. Red indicates significant information content at the corresponding position (permutation test, 5% FDR). Error bars show uncertainties inferred from sub-sampling. (b) Visual representations of linear QSAMs of the IFNB enhancer in its induced (top) and uninduced (bottom) states. The color in each entry represents the estimated additive contribution of the corresponding nucleotide to the log-transformed activity of the enhancer. The matrices are re-scaled such that the lowest entry in each column is zero and the highest entry anywhere is one. Both matrices are shown on the same scale. Numerical values are provided in Supplementary Table 4.

Figure 6

Model-based optimization

(a) CRE variants predicted to maximize induced activity (A1) or inducibility (I1ȓI3) based on linear QSAMs trained on multi-hit data. Differences from WT are indicated by red shading. Darker shading indicates a higher predicted contribution to the change in activity. (b) Luciferase activity of the wild-type (WT) and optimized CRE variants in untreated and forskolin-treated cells. (c) Inducibility of the CRE variants in response to cAMP elevation caused by forskolin treatment. (d) IFNB enhancer variants predicted to maximize induced activity (A1) or inducibility (I1) based on linear QSAMs trained on multi-hit data. (e) Luciferase activity of the WT and optimized IFNB enhancer variants in uninfected and virus-treated cells. (f) Inducibility of the IFNB enhancer variants in response to virus infection. Blue bars show mean activity across 12 replicates in the induced or uninduced states. Error bars show standard errors of the means (SE). All statistical comparisons are relative to WT in the same state; n.s., not significant; *, p ≤ 0.05; ***, p ≤ 0.0001; two-tailed t-test. Orange bars show the ratio of the corresponding induced and uninduced mean activities. Error bars show the range from (induced mean - induced SE)/(uninduced mean + uninduced SE) to (induced mean + induced SE)/(uninduced mean - uninduced SE).

Footnotes

AUTHOR CONTRIBUTIONS A.Mel., X.Z., P.R., A.G. and T.S.M. developed MPRA and performed the molecular biology experiments. L.W. performed the cell culture, plasmid transfections and luciferase assays. A.Mur., T.T., S.F., C.G.C., J.B.K., M.K., E.S.L. and T.S.M. analyzed the data. T.S.M. wrote the main text with substantial input from all authors. C.G.C. and J.B.K. wrote the Supplementary Notes with substantial input from A.Mur and T.S.M.

COMPETING FINANCIAL INTERESTS A patent application describing ideas presented in this article has been filed by the Broad Institute.

ACCESSION NUMBERS All analyzed sequence data has been deposited in NCBI GEO under accession GSE31982.

ACKNOWLEDGEMENTS

The authors would like to thank E. M. LeProust and S. Chen of Agilent, Inc. for oligonucleotide library synthesis, R. P. Deering for assistance with Sendai virus infections and the staff of the Broad Institute and the Bauer Core facilities for assistance with data generation. This project was supported by funds from the Broad Institute, the Harvard Stem Cell Institute (T.S.M.), NHGRI grant R01HG004037 (M.K.), the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (J.B.K.), NSF grant PHY-0957573 (C.G.C., T.T.) and NSF grant PHY-1022140 (A. Mur.).

References

1. BirneyEIdentification and analysis of functional elements in 1% of the human genome by the ENCODE pilot projectNature2007447799816[PubMed][Google Scholar]
2. LanderESInitial impact of the sequencing of the human genomeNature2011470187197[PubMed][Google Scholar]
3. DorerDENettelbeckDMTargeting cancer by transcriptional control in cancer gene therapy and viral oncolysisAdv Drug Deliv Rev200961554571[PubMed][Google Scholar]
4. FanFWoodKVBioluminescent assays for high-throughput screeningAssay Drug Dev Technol20075127136doi:10.1089/adt.2006.053[PubMed][Google Scholar]
5. LoewRHeinzNHampfMBujardHGossenMImproved Tet-responsive promoters with minimized background expressionBMC Biotechnol20101081[PubMed][Google Scholar]
6. CareyMPetersonCLSmaleSTTranscriptional regulation in eukaryotes : concepts, strategies, and techniques2009Cold Spring Harbor Laboratory Press2nd edn
7. PatwardhanRPHigh-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesisNat Biotechnol20092711731175[PubMed][Google Scholar]
8. LeProustEMSynthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled processNucleic Acids Res20103825222540[PubMed][Google Scholar]
9. KinneyJBMuruganACallanCGJr.CoxECUsing deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequenceProc Natl Acad Sci U S A201010791589163[PubMed][Google Scholar]
10. PanneDManiatisTHarrisonSCAn atomic model of the interferon-beta enhanceosomeCell200712911111123[PubMed][Google Scholar]
11. ArnostiDNKulkarniMMTranscriptional enhancers: Intelligent enhanceosomes or flexible billboards?J Cell Biochem200594890898[PubMed][Google Scholar]
12. JonssonJNorbergTCarlssonLGustafssonCWoldSQuantitative sequence-activity models (QSAM)--tools for sequence designNucleic Acids Res199321733739[PubMed][Google Scholar]
13. StormoGDSchneiderTDGoldLQuantitative analysis of the relationship between nucleotide sequence and functional activityNucleic Acids Res19861466616679[PubMed][Google Scholar]
14. MayrBMontminyMTranscriptional regulation by the phosphorylation-dependent factor CREBNat Rev Mol Cell Biol20012599609[PubMed][Google Scholar]
15. BenbrookDMJonesNCDifferent binding specificities and transactivation of variant CRE's by CREB complexesNucleic Acids Res19942214631469[PubMed][Google Scholar]
16. FinkJSThe CGTCA sequence motif is essential for biological activity of the vasoactive intestinal peptide gene cAMP-regulated enhancerProc Natl Acad Sci U S A19888566626666[PubMed][Google Scholar]
17. KunschCRubenSMRosenCASelection of optimal kappa B/Rel DNA-binding motifs: interaction of both subunits of NF-kappa B with DNA is required for transcriptional activationMol Cell Biol19921244124421[PubMed][Google Scholar]
18. FalvoJVParekhBSLinCHFraenkelEManiatisTAssembly of a functional beta interferon enhanceosome is dependent on ATF-2-c-jun heterodimer orientationMol Cell Biol20002048144825[PubMed][Google Scholar]
19. SchneiderTDStormoGDExcess information at bacteriophage T7 genomic promoters detected by a random cloning techniqueNucleic Acids Res198917659674[PubMed][Google Scholar]
20. BishopCMPattern recognition and machine learning2006Springer
21. De MeyMMaertensJLequeuxGJSoetaertWKVandammeEJConstruction and model-based analysis of a promoter library for E. coli: an indispensable tool for metabolic engineeringBMC Biotechnol2007734[PubMed][Google Scholar]
22. QuanJParallel on-chip gene synthesis and application to optimization of protein expressionNat Biotechnol201129449452[PubMed][Google Scholar]
23. MatzasMHigh-fidelity gene synthesis by retrieval of sequence-verified DNA identified using high-throughput pyrosequencingNat Biotechnol20102812911294[PubMed][Google Scholar]
24. BernsteinBEThe NIH Roadmap Epigenomics Mapping ConsortiumNat Biotechnol20102810451048[PubMed][Google Scholar]
25. EdelmanGMMeechROwensGCJonesFSSynthetic promoter elements obtained by nucleotide sequence variation and selection for activityProc Natl Acad Sci U S A20009730383043[PubMed][Google Scholar]
26. SchlabachMRHuJKLiMElledgeSJSynthetic design of strong promotersProc Natl Acad Sci U S A201010725382543[PubMed][Google Scholar]
27. HollandJHAdaptation in natural and artificial systems : an introductory analysis with applications to biology, control, and artificial intelligence1992MIT Press1st MIT Press edn
28. BenjaminiYHochbergYControlling the False Discovery Rate: A Practical and Powerful Approach to Multiple TestingJournal of the Royal Statistical Society. Series B (Methodological)199557289300[Google Scholar]
29. TrevesAPanzeriSThe upward bias in measures of information derived from limited samplesNeural Comput19957399407[Google Scholar]