Chemistry and reactions from non-US patents Daniel Lowe and Roger Sayle NextMove Software Cambridge, UK
Topics Coverage of USPTO vs EPO patents Extraction of non-chemical-compound data Can text-mining provide insights into reaction informatics?
What am I missing by only using the USPTO data? 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
EPO and USPTO background USPTO: 303k grants published in 2013 EPO: 67k grants published in 2013 EPO patents must be in one or more of English, French or German USPTO documents are all in English
EPO/USPTO compounds (using StdInChI) [Venn diagram region counts: 445,588; 3,345,877; 2,674,069] USPTO = 1976-2013 patent grants + 2001-2013 patent applications EPO = 1978-2013 patent grants and applications
EPO/USPTO reactions (using separately sorted StdInChIs for reactants/agents and products) [Venn diagram region counts: 99,681; 706,524; 317,533] USPTO = 1976-2013 patent grants + 2001-2013 patent applications EPO = 1978-2013 patent grants and applications
Effect of different languages [Figure: number of structures extracted, 1978-2013, split by source language (English, German, French).] Translation performed by Lexichem v2014.jun
First Disclosure Occurrence by Language All compounds Excluding compounds disclosed by USPTO (1976-2013) patents
Timeliness Competitive intelligence requires up-to-date information Methodology: Consider compounds present in both EPO and USPTO data and not disclosed by either prior to 2006 Compare the earliest publication date for each compound
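The methodology above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the record layout (compound key, office, publication date), the function names, the placeholder compound keys and the sign convention (USPTO date minus EPO date) are all assumptions made for the example.

```python
from datetime import date

def earliest_dates(records):
    """Map each compound key to its earliest publication date per office."""
    earliest = {}
    for key, office, pub_date in records:
        per_office = earliest.setdefault(key, {})
        if office not in per_office or pub_date < per_office[office]:
            per_office[office] = pub_date
    return earliest

def disclosure_lags(records, cutoff=date(2006, 1, 1)):
    """Return {compound: lag in days} for compounds seen in both offices
    and not disclosed by either before the cutoff.
    Lag is USPTO date minus EPO date, so a negative value means USPTO was first."""
    lags = {}
    for key, per_office in earliest_dates(records).items():
        if "EPO" not in per_office or "USPTO" not in per_office:
            continue  # only compounds present in both data sets
        if min(per_office.values()) < cutoff:
            continue  # already disclosed before the cutoff
        lags[key] = (per_office["USPTO"] - per_office["EPO"]).days
    return lags

# Placeholder keys stand in for StdInChI strings.
records = [
    ("compound-A", "USPTO", date(2007, 3, 1)),
    ("compound-A", "EPO", date(2008, 3, 1)),
    ("compound-A", "USPTO", date(2009, 1, 1)),  # later republication, ignored
    ("compound-B", "EPO", date(2005, 6, 1)),    # disclosed before 2006, excluded
    ("compound-B", "USPTO", date(2007, 6, 1)),
    ("compound-C", "EPO", date(2010, 1, 1)),    # EPO only, excluded
]
lags = disclosure_lags(records)
```

Binning the resulting lags (within a week, 1-4 weeks, 1-3 months, and so on) would give a histogram of the kind shown on the following slides.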
Average lag between EPO/USPTO [Figure: histogram of compound disclosures by lag between first EPO and first USPTO publication; bins run from more than 5 years on one side, through within a week, to more than 5 years on the other.] Mean: US 540 days Compounds first disclosed between 2006-2013
Applications vs Applications [Figure: histogram of compound disclosures by lag, same bins as above.] Mean: US 223 days Compounds first disclosed between 2006-2013
Grants vs Grants [Figure: histogram of compound disclosures by lag, same bins as above.] Mean: US 287 days Compounds first disclosed between 2006-2013
EPO/US 1976-2013 and ChEMBL 19 overlap [Venn diagram region counts: 1,155,034; 6,278,048; 187,486]
Average lag between patents/ChEMBL 19 [Figure: histogram of compounds by lag between first patent disclosure and ChEMBL 19; bins run from "Patents 4-5" through "Patents 0-1" on each side.] Mean: Patents 345 days Compounds first disclosed between 2006-2013
Other data in patents Genes/proteins Reactions including role of reagents Physical quantities Spectra Diseases Organisms Companies
Gene/protein identification Identify trends in drug target popularity Count number of patents (in a year) mentioning a given gene or its gene products and map to the HGNC symbol Many genes are referenced by short ambiguous terms hence great care is required to have good precision
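The counting step above can be sketched as follows. This is a toy illustration under stated assumptions: the synonym table is a small hand-written stand-in for a dictionary that would in practice be built from HGNC data, and the function names are hypothetical. The bare term "MET" is deliberately left out of the table because it collides with the ordinary word "met", which is the kind of ambiguity that hurts precision.

```python
import re
from collections import defaultdict

# Toy synonym table (illustrative only), mapping surface forms to HGNC symbols.
SYNONYM_TO_HGNC = {
    "EGFR": "EGFR",
    "epidermal growth factor receptor": "EGFR",
    "hepatocyte growth factor receptor": "MET",
    "mTOR": "MTOR",
    "mammalian target of rapamycin": "MTOR",
    "tissue plasminogen activator": "PLAT",
}
# Short terms are matched case-sensitively to limit false positives.
CASE_SENSITIVE = {"EGFR", "mTOR"}

def genes_in_text(text):
    """Return the set of HGNC symbols mentioned in one patent's text."""
    found = set()
    for term, symbol in SYNONYM_TO_HGNC.items():
        flags = 0 if term in CASE_SENSITIVE else re.IGNORECASE
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags):
            found.add(symbol)
    return found

def patents_per_year(patents):
    """patents: iterable of (year, text). Each patent counts once per gene."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, text in patents:
        for symbol in genes_in_text(text):
            counts[symbol][year] += 1
    return counts

patents = [
    (2010, "An EGFR inhibitor; the epidermal growth factor receptor is ..."),
    (2010, "We met the target: hepatocyte growth factor receptor antagonists"),
    (2011, "Inhibitors of mTOR"),
]
counts = patents_per_year(patents)
```

Dividing each yearly count by the total for that year would give percentages comparable to those plotted on the following slides.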
Gene/protein identification EGFR (Epidermal growth factor receptor) and MET (Hepatocyte growth factor receptor) [Figure: for each gene, percentage of protein/gene mentions in that year, 1976-2012, in patents and in ChEMBL.]
Gene/protein identification MTOR (Mammalian target of rapamycin) and PLAT (Tissue plasminogen activator) [Figure: for each gene, percentage of protein/gene mentions in that year, 1976-2012, in patents and in ChEMBL.]
Melting/Boiling Point Extraction
Melting/Boiling Point Extraction Over 99k so far!
Chemical reactions
Predicting yield Features to consider: 2D fingerprints (especially around the reaction centers) Reaction type Temperature Time Change in complexity e.g. chiral centres
Data extraction Can pull out yields, quantities, times and temperatures Can sanity check text-mined yield by calculating from amounts (or even masses)
$$\frac{\text{mass of compound (g)}}{\text{molar mass [calc from structure] (g mol}^{-1}\text{)}} = \text{mols of compound}$$
$$\frac{\text{mols of product}}{\text{mols of limiting reactant}} \times 100 = \%\,\text{yield}$$
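The sanity check above can be sketched as a short calculation. This is a minimal example under simplifying assumptions: 1:1 stoichiometry, so the limiting reactant is simply the reactant with the fewest moles; agents such as catalysts and solvents are excluded from the reactant list; and the 5-point tolerance and all function names are illustrative choices.

```python
def moles(mass_g, molar_mass_g_per_mol):
    """Amount of substance from a mass and a molar mass calculated from structure."""
    return mass_g / molar_mass_g_per_mol

def calculated_yield_percent(product, reactants):
    """product and each reactant are (mass in g, molar mass in g/mol) pairs.
    Assumes 1:1 stoichiometry: the limiting reactant has the fewest moles."""
    product_mol = moles(*product)
    limiting_mol = min(moles(mass, molar_mass) for mass, molar_mass in reactants)
    return 100.0 * product_mol / limiting_mol

def yield_is_consistent(reported_percent, calculated_percent, tolerance=5.0):
    """Flag text-mined yields that disagree with the value derived from amounts."""
    return abs(reported_percent - calculated_percent) <= tolerance

# 2.0 g of a 200 g/mol reactant (0.01 mol) is limiting; 1.8 g of a 240 g/mol
# product is 0.0075 mol.
reactants = [(2.0, 200.0), (3.0, 150.0)]
product = (1.8, 240.0)
calc = calculated_yield_percent(product, reactants)
```

A text-mined yield far from the calculated value (for example a reported 95% here) is a candidate extraction error, or a genuine inconsistency in the patent text.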
Outliers inevitable
Scale vs yield 2001-2013 US applications, Suzuki couplings
Temperatures
Identify Synthetic Routes [Figure: bar chart of occurrences against number of steps, for intermediates and terminal products.]
Number of steps   Intermediates   Terminal Products
1                 197702          385149
2                 103114          149445
3                 56611           81837
4                 31403           47579
5                 17268           27670
6                 9230            16619
7                 5057            9320
8                 2701            5263
9                 1256            2511
10                639             1330
11                301             678
12                136             373
13                58              111
14                15              63
15                5               8
16                2               6
17                -               5
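One way such step counts could be derived is by chaining reactions whose product appears as a reactant in another reaction. The sketch below is an assumption about the approach, not the authors' method: it takes the number of steps to be the longest chain of reactions leading to a compound, calls a product "terminal" if no other reaction consumes it, and uses hypothetical names throughout.

```python
def count_steps(reactions):
    """reactions: list of (reactants, product) with hashable compound keys.
    Returns {product: (number of steps, "intermediate" or "terminal")}."""
    producers = {}
    for reactants, product in reactions:
        producers.setdefault(product, []).append(tuple(reactants))
    consumed = {r for reactants, _ in reactions for r in reactants}

    def depth(compound, path):
        # Starting materials (never produced) contribute zero steps;
        # the path check guards against cycles in the reaction graph.
        if compound not in producers or compound in path:
            return 0
        path = path | {compound}
        best = 0
        for reactants in producers[compound]:
            for reactant in reactants:
                best = max(best, depth(reactant, path))
        return best + 1

    result = {}
    for product in producers:
        kind = "intermediate" if product in consumed else "terminal"
        result[product] = (depth(product, frozenset()), kind)
    return result

# A + B -> C, C + D -> E, E -> F
reactions = [(["A", "B"], "C"), (["C", "D"], "E"), (["E"], "F")]
steps = count_steps(reactions)
```

Here C and E are intermediates at steps 1 and 2, and F is a terminal product reached in 3 steps.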
Number of steps vs Complexity
$$\text{ringComplexityScore} = \log_{10}(n_{\text{ringBridgeAtoms}} + 1) + \log_{10}(n_{\text{spiroAtoms}} + 1)$$
$$\text{stereoComplexityScore} = \log_{10}(n_{\text{stereocenters}} + 1)$$
$$\text{macrocyclePenalty} = \log_{10}(n_{\text{macrocycles}} + 1)$$
$$\text{sizePenalty} = n_{\text{atoms}}^{1.005} - n_{\text{atoms}}$$
Ertl & Schuffenhauer 2009 doi:10.1186/1758-2946-1-8
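The complexity terms above can be computed directly from structural counts. The sketch below takes the counts as plain integers (how they are obtained from a structure is out of scope here), reads the size penalty as natoms^1.005 minus natoms following the cited paper, and sums the four terms into one score; the summation and the function names are assumptions for illustration.

```python
import math

def ring_complexity_score(n_ring_bridge_atoms, n_spiro_atoms):
    return math.log10(n_ring_bridge_atoms + 1) + math.log10(n_spiro_atoms + 1)

def stereo_complexity_score(n_stereocenters):
    return math.log10(n_stereocenters + 1)

def macrocycle_penalty(n_macrocycles):
    return math.log10(n_macrocycles + 1)

def size_penalty(n_atoms):
    return n_atoms ** 1.005 - n_atoms

def complexity(n_ring_bridge_atoms, n_spiro_atoms, n_stereocenters,
               n_macrocycles, n_atoms):
    """Sum of the four terms (the combination is an assumption of this sketch)."""
    return (ring_complexity_score(n_ring_bridge_atoms, n_spiro_atoms)
            + stereo_complexity_score(n_stereocenters)
            + macrocycle_penalty(n_macrocycles)
            + size_penalty(n_atoms))

def change_in_complexity(reactant_counts, product_counts):
    """Difference in complexity between product and reactant count tuples."""
    return complexity(*product_counts) - complexity(*reactant_counts)

# Hypothetical counts: the product gains one stereocenter and a few atoms.
delta = change_in_complexity((0, 0, 0, 0, 20), (0, 0, 1, 0, 25))
```

Because every term grows with its count, a reaction that adds stereocenters, spiro or bridge atoms, macrocycles or heavy atoms gives a positive change.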
Number of steps vs Complexity [Figure: average number of product heavy atoms against number of steps (1-11).]
Yield vs Change in Complexity 2001-2013 US applications, all reactions
Trends in Reaction Types [Figure: Suzuki couplings as a percentage of reactions in a year, 1976-2013.] Classification performed using NameRXN
Trends In Solvent Use [Figure: percentage of reactions in that year using each solvent, 1976-2013. Solvents plotted: Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene.]
Are solvents getting greener?
1976                      2013
Water (21%)               Tetrahydrofuran (15%)
Ethanol (11%)             Dichloromethane (14%)
Benzene (8%)              Water (13%)
Methanol (7%)             Dimethylformamide (10%)
Tetrahydrofuran (5%)      Methanol (8%)
Dichloromethane (4%)      Ethyl acetate (7%)
Dimethylformamide (4%)    Ethanol (5%)
Acetic acid (4%)          1,4-Dioxane (4%)
Chloroform (3%)           Toluene (3%)
Acetone (3%)              Acetonitrile (3%)
Total for top 10: 71%     Total for top 10: 82%
Conclusions A significant amount of the novel chemistry in EPO patents comes from the non-English patents. Compounds disclosed by both the USPTO and EPO are on average published first by the USPTO, but for many compounds an EPO patent will be the first disclosure. Gene/protein identification can identify clear changes in patenting behaviour over time. Text mining provides the tools to answer many reaction informatics questions.
Acknowledgements Funding provided by:
Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog daniel@nextmovesoftware.com