Normalizing Illumina 450 DNA methylation data

Leo Schalkwyk, Ruth Pidsley, Manuela Volta, Chloe Wong, Katie Lunnon and Jon Mill
Institute of Psychiatry
Social, Genetic and Developmental Psychiatry
April 20, 2012

Metrics for 450 array normalization
Leonard C Schalkwyk - SGDP - IoP - 2012

450K DNA methylation array
- processing methods as recommended by the manufacturer are very simple
- the bead intensities are local-background adjusted and averaged
- an estimate of methylation fraction for each CG feature is calculated from Methylated (M) and Unmethylated (U) intensities: β = M/(M + U + 100)
- the product, like its 27K predecessor, produces stable βs with a characteristic bimodal distribution
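
As a minimal sketch of the β calculation above: the function name, the feature labels and the intensity values are hypothetical and chosen only for illustration. Only the formula β = M/(M + U + 100) comes from the slide.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Methylation fraction estimate: beta = M / (M + U + offset)."""
    return methylated / (methylated + unmethylated + offset)


# Hypothetical background-adjusted, averaged bead intensities (M, U)
# for three CpG features; these are not real array data.
features = {
    "mostly_methylated": (4500.0, 400.0),
    "mostly_unmethylated": (300.0, 5600.0),
    "intermediate": (2450.0, 2450.0),
}

betas = {name: beta_value(m, u) for name, (m, u) in features.items()}
for name, beta in betas.items():
    print(f"{name}: beta = {beta:.3f}")
```

The offset of 100 keeps β defined when both intensities are close to zero, at the cost of pulling low-intensity features slightly toward 0.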

Need for normalization
- Illumina claims normalization is not required because of the division by total intensity M + U
- one of the possible problems with this: background inflates both the numerator and the denominator of β = M/(M + U + 100)
- this moves β toward 0.5 by a possibly variable amount
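
The pull toward 0.5 can be made explicit with a short worked equation. This is a sketch under a simplifying assumption that is not in the slides: the same additive background b is present in both the M and the U intensity.

```latex
\beta' = \frac{M + b}{M + U + 2b + 100},
\qquad
\beta' - \tfrac{1}{2}
  = \left(\beta - \tfrac{1}{2}\right)
    \frac{M + U + 100}{M + U + 100 + 2b}
```

Under that assumption the distance of β from 0.5 shrinks by a factor that depends on b relative to the total intensity, so it can vary between arrays and between features. For example, with M = 900, U = 0 and b = 500, β = 0.9 becomes β' = 1400/2000 = 0.7.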

Experience from gene expression and allelotyping
- nothing you can do with normalization replaces careful experimental design
  - a nuisance variable confounded with what you want to test can't be fixed, e.g. case and control in different batches
- nothing you can do with normalization replaces rigorous QC
  - samples with unusual raw intensity distributions
  - multivariate methods such as PCA often identify mislabeled samples
- background correction is generally counterproductive

Quantile normalization
- Bolstad, Irizarry et al 2003
- descendant of nonparametric techniques where values are replaced by their ranks within the array
- here, instead, values are replaced by the mean across arrays of the values of the same rank
- steps illustrated: raw data, sorted, adjusted, reordered
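
The sort, adjust and reorder steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not the reference implementation: the function name and the toy intensities are made up, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of values per array).

    Each value is replaced by the mean, across arrays, of the values
    that share its rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of features")
    # "sorted": feature indices of each array in ascending order of value.
    orders = [sorted(range(n), key=a.__getitem__) for a in arrays]
    # "adjusted": mean across arrays of the values at each rank.
    rank_means = [
        sum(a[order[rank]] for a, order in zip(arrays, orders)) / len(arrays)
        for rank in range(n)
    ]
    # "reordered": put each rank mean back at the feature's original position.
    normalized = []
    for order in orders:
        out = [0.0] * n
        for rank, feature in enumerate(order):
            out[feature] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities: three arrays, four features each.
raw = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(raw)
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization every array contains exactly the same set of values, and only their assignment to features differs.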

Quantile normalization
[Figure: density plots of raw intensity and normalized intensity; x axis intensity, y axis density]
- depends on the majority of the profile being similar
- shown to perform well for gene expression using dilution and spike-in data

Possible problems applying quantile normalization to 450 arrays
- no standard testing data sets
- single probe pair for each CpG site
- two different assays on the same array
  - I: M and U same colour, different beads
  - II: M and U different colour, same bead
- mid-range beta values are the most interesting and few in number
  - known potential problem with quantile normalization where quantiles are far apart
  - this is one argument against the common practice of quantile-normalizing betas

Dedeurwaerder et al 2011
- recognised a distribution difference between assay I and assay II
- they devised an ad hoc scaling to force the distributions together

Dedeurwaerder et al 2011
- testing: yield of differences in methylation between wild type and DNMT double KO cell lines
- does not distinguish real from spurious positives

Processing ideas
- adjust background difference, assay I vs II
  - likely variance penalty
- quantile normalize M and U separately
- quantile normalize I and II separately

Assay I vs II equalization
[Figure: assay I is black, assay II is red; methylated solid, unmethylated dashed]

Data sets
- NIH Alzheimer Project: 90+ arrays for each of
  - A: Dorsolateral prefrontal cortex BA9 (90)
  - E: Entorhinal cortex (94)
  - F: Superior temporal gyrus BA22 (94)
  - H: cerebellum (89)
- Schizophrenia project
  - cortex (44)
  - cerebellum (42)
- Autism project
  - cerebellum (24)

Possible metrics of normalization goodness
several sets of probes with behaviour we can develop tests for:
- imprinting DMRs: monoallelic methylation
- non-pseudoautosomal X chromosome: differs male vs female
- SNP probes: distinct AA, AB, BB genotypes

Imprinting DMR
- known DMR from https://atlas.genetics.kcl.ac.uk/ (Reiner Schulz)
  - these are a conservative subset
- 308 CpGs assayed on the array are within known DMR
- at least one each in 33 of the 38 documented DMR
- joint distribution is quite tight and peaks at β = 0.56
- can look at variance across samples (the main issue for normalization) but also at consistency across loci
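
One way the two DMR measures might be computed is sketched below. The function names, probe identifiers and β values are hypothetical, and the use of standard deviations as the summary is an assumption; the slides only say that variance across samples and consistency across loci can be examined.

```python
from statistics import mean, pstdev


def dmr_sample_spread(betas_by_probe):
    """Mean over DMR probes of the SD of beta across samples."""
    return mean([pstdev(values) for values in betas_by_probe.values()])


def dmr_locus_spread(betas_by_probe):
    """SD across DMR probes of each probe's mean beta (consistency across loci)."""
    return pstdev([mean(values) for values in betas_by_probe.values()])


# Hypothetical betas for three DMR probes in four samples,
# before and after some normalization method (not real data).
raw_betas = {
    "probe_a": [0.40, 0.60, 0.50, 0.66],
    "probe_b": [0.45, 0.70, 0.52, 0.61],
    "probe_c": [0.38, 0.58, 0.49, 0.63],
}
normalized_betas = {
    "probe_a": [0.54, 0.56, 0.55, 0.57],
    "probe_b": [0.55, 0.58, 0.56, 0.57],
    "probe_c": [0.53, 0.56, 0.55, 0.56],
}

for label, data in (("raw", raw_betas), ("normalized", normalized_betas)):
    print(label, round(dmr_sample_spread(data), 4), round(dmr_locus_spread(data), 4))
```

In this sketch a smaller spread across samples is read as better normalization, on the expectation that imprinted DMR probes should sit at a similar intermediate β in every sample.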

X-inactivation
- less baseline data than you might expect
- expect full monoallelic methylation in females
- expect less methylation in hemizygous males
- should be able to detect the male-female difference

X-inactivation
- there are 11232 X chromosome features in the array annotation
  - none of them are in the pseudoautosomal regions
- in the cerebellum data set, for example:
  - 9796 Bonferroni-significant sex differences
  - 8969 of them on X
- the Bonferroni test recovers 80% of X chromosome probes
- 91% of the Bonferroni differences are X chromosome
- lends itself to ROC analysis
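
A ROC-style summary of the sex-difference test could look like the sketch below. It is illustrative only: the probe names and β values are hypothetical, the Welch t statistic is an assumed choice of test (the slides do not name one), and the area under the ROC curve is computed with the rank-based Mann-Whitney formulation.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t statistic for a difference in means between two groups."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def roc_auc(scores, is_positive):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney form of the area under the ROC curve).
    """
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical probes: (name, on X chromosome?, female betas, male betas).
probes = [
    ("x1", True, [0.52, 0.48, 0.50], [0.10, 0.14, 0.12]),
    ("x2", True, [0.45, 0.55, 0.50], [0.20, 0.30, 0.22]),
    ("x3", True, [0.40, 0.42, 0.44], [0.41, 0.39, 0.45]),
    ("a1", False, [0.80, 0.82, 0.81], [0.80, 0.83, 0.82]),
    ("a2", False, [0.30, 0.20, 0.25], [0.24, 0.28, 0.26]),
    ("a3", False, [0.60, 0.62, 0.64], [0.61, 0.66, 0.62]),
]

scores = [abs(welch_t(female, male)) for _, _, female, male in probes]
on_x = [flag for _, flag, _, _ in probes]
auc = roc_auc(scores, on_x)
print(f"AUC for separating X from non-X probes: {auc:.3f}")
```

Comparing this area between normalization methods avoids committing to a single significance threshold such as the Bonferroni cut-off.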

[Figure legend: Illumina blue, beta6 green]

Genotyping
- the genotyping assay is not very different from the methylation assay
- 65 selected SNPs on the array (very useful for checking identities)
- βs should fall in 3 genotype classes

Genotype variances
- genotypes assigned by one-dimensional k-means clustering
- within-cluster sums of squares summed across SNPs
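
The genotype metric can be sketched with a small one-dimensional k-means. The starting centres of 0, 0.5 and 1 (for the three genotype classes), the function names and the β values are assumptions for illustration; the slides do not specify how the clustering was initialised.

```python
def kmeans_1d(values, centers=(0.0, 0.5, 1.0), max_iter=100):
    """One-dimensional k-means; returns (centers, within-cluster sum of squares)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assign every value to its nearest centre.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda k: abs(value - centers[k]))
            clusters[nearest].append(value)
        # Move each centre to the mean of its cluster (empty clusters stay put).
        new_centers = [
            sum(cluster) / len(cluster) if cluster else centers[k]
            for k, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wss = sum(
        (value - centers[k]) ** 2
        for k, cluster in enumerate(clusters)
        for value in cluster
    )
    return centers, wss


def genotype_metric(betas_by_snp):
    """Within-cluster sums of squares, summed across SNP probes."""
    return sum(kmeans_1d(values)[1] for values in betas_by_snp.values())


# Hypothetical betas for two SNP probes across six samples (not real data).
raw_snps = {
    "snp_1": [0.05, 0.15, 0.45, 0.60, 0.85, 0.95],
    "snp_2": [0.10, 0.20, 0.40, 0.55, 0.80, 0.98],
}
normalized_snps = {
    "snp_1": [0.04, 0.06, 0.49, 0.51, 0.94, 0.96],
    "snp_2": [0.05, 0.07, 0.50, 0.52, 0.95, 0.97],
}
print("raw:", round(genotype_metric(raw_snps), 4))
print("normalized:", round(genotype_metric(normalized_snps), 4))
```

A lower total indicates tighter genotype clusters, which is the sense in which the slides use this sum to compare methods.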

Conclusions
- bgeqqn wins almost every time
- requires a better name

Next steps
- package these tests up
- test a wider variety of data
  - bad data!
- separate analysis of performance for I and II
- worry about dye bias
- test other methods
- effect size and power