Machine learning system with two encoder towers for semantic matching

  • Publication Date:
    January 07, 2025
  • Additional Information
    • Patent Number:
      12,191,004
    • Appl. No:
      17/850763
    • Application Filed:
      June 27, 2022
    • Abstract:
      This disclosure describes a machine learning system that includes a contrastive learning based two-tower model for retrieval of relevant chemical reaction procedures given a query chemical reaction. The two-tower model uses attention-based transformers and neural networks to convert tokenized representations of chemical reactions and chemical reaction procedures to embeddings in a shared embedding space. Each tower can include a transformer network, a pooling layer, a normalization layer, and a neural network. The model is trained with labeled data pairs that include a chemical reaction and the text of a chemical reaction procedure for that chemical reaction. New queries can locate chemical reaction procedures for performing a given chemical reaction as well as procedures for similar chemical reactions. The architecture and training of the model make it possible to perform semantic matching based on chemical structures. The model is highly accurate, providing an average recall at K=5 of 95.9%.
    • Inventors:
      MICROSOFT TECHNOLOGY LICENSING, LLC (Redmond, WA, US)
    • Assignees:
      MICROSOFT TECHNOLOGY LICENSING, LLC (Redmond, WA, US)
    • Claim:
      1. A machine learning system for identifying one or more candidate chemical reaction procedures from a chemical reaction sketch, the system comprising: a processor; a memory comprising computer-readable instructions executable by the processor; a datastore comprising a corpus of chemical reaction procedures; an interface configured to receive the chemical reaction sketch from a user computing device; a reaction encoder configured to create a reaction embedding of the chemical reaction sketch, the reaction encoder comprising a reaction transformer network, a reaction pooling layer, a reaction normalization layer, and a reaction neural network; a procedure encoder configured to create procedure embeddings of the chemical reaction procedures in the corpus of chemical reaction procedures, the procedure encoder comprising a procedure transformer network, a procedure pooling layer, a procedure normalization layer, and a procedure neural network; a similarity-assessing mechanism configured to determine a similarity between the reaction embedding and the procedure embeddings in a shared embedding space; and an output mechanism configured to provide to the interface a predetermined number of candidate chemical reaction procedures from the corpus of chemical reaction procedures, the candidate chemical reaction procedures corresponding to procedure embeddings identified by the similarity-assessing mechanism as having the highest similarity to the reaction embedding.
    • Claim:
      2. The machine learning system of claim 1, wherein the reaction transformer network comprises multiple layers each including a multi-head attention layer and a fully-connected feed-forward layer configured to generate a reaction transformer output and the procedure transformer network comprises multiple layers each including a multi-head attention layer and a fully-connected feed-forward layer configured to generate a procedure transformer output.
    • Claim:
      3. The machine learning system of claim 2, wherein the reaction pooling layer generates a first high-dimensional vector from the reaction transformer output and the procedure pooling layer generates a second high-dimensional vector from the procedure transformer output.
    • Claim:
      4. The machine learning system of claim 3, wherein the reaction normalization layer generates a first normalized vector from the first high-dimensional vector and the procedure normalization layer generates a second normalized vector from the second high-dimensional vector.
    • Claim:
      5. The machine learning system of claim 4, wherein the reaction neural network and the procedure neural network are both fully connected, feed-forward, multilayer neural networks and the reaction neural network is configured to generate the reaction embedding from the first normalized vector and the procedure neural network is configured to generate the procedure embedding from the second normalized vector.
    • Claim:
      6. The machine learning system of claim 1, wherein the reaction encoder and the procedure encoder are trained using contrastive learning on labeled pair-wise data of training chemical reaction procedures and training representations of chemical reactions, the training chemical reaction procedures provided to the procedure encoder and the training representations of chemical reactions provided to the reaction encoder.
    • Claim:
      7. A computer-implemented method of identifying one or more chemical reaction procedures from a chemical reaction sketch comprising: receiving from a user computing device the chemical reaction sketch; tokenizing the chemical reaction sketch to create a reaction token sequence; generating a reaction embedding from the reaction token sequence by a reaction encoder of a contrastive learning based two-tower model, the contrastive learning based two-tower model trained by contrastive loss on training data that includes training chemical reactions and training chemical reaction procedures for performing the training chemical reactions; determining similarity between the reaction embedding and procedure embeddings in a shared embedding space, the procedure embeddings generated by a procedure encoder of the contrastive learning based two-tower model from chemical reaction procedures in a corpus of chemical reaction procedures; and outputting a predetermined number of candidate chemical reaction procedures corresponding to procedure embeddings having a highest similarity to the reaction embedding.
    • Claim:
      8. The computer-implemented method of claim 7, wherein the chemical reaction sketch is a simplified molecular-input line-entry system (SMILES) representation of all or part of a chemical reaction.
    • Claim:
      9. The computer-implemented method of claim 7, wherein the similarity is a semantic similarity based on functional groups and carbon backbone structures.
    • Claim:
      10. The computer-implemented method of claim 7, wherein the reaction encoder comprises: a transformer network that generates a transformer output from the reaction token sequence; a pooling layer that generates a high-dimensional vector from the transformer output; a normalization layer that generates a normalized vector from the high-dimensional vector; and a neural network that generates the reaction embedding in the shared embedding space from the normalized vector.
    • Claim:
      11. The computer-implemented method of claim 10, wherein the transformer network has six layers, the pooling layer comprises a max pooler, the high-dimensional vector has 512 dimensions, and the neural network has two layers.
    • Claim:
      12. The computer-implemented method of claim 7, wherein the procedure encoder comprises: a transformer network that generates a transformer output from procedure token sequences that are tokenizations of the chemical reaction procedures; a pooling layer that generates a high-dimensional vector from the transformer output; a normalization layer that generates a normalized vector from the high-dimensional vector; and a neural network that generates the procedure embeddings in the shared embedding space from the normalized vector.
    • Claim:
      13. The computer-implemented method of claim 12, wherein the transformer network has six layers, the pooling layer comprises a max pooler, the high-dimensional vector has 512 dimensions, and the neural network has two layers.
    • Claim:
      14. A computer-implemented method of training a machine learning system for identifying chemical reaction procedures from chemical reaction sketches comprising: accessing training data from a training datastore, the training data comprising labeled data pairs of training chemical reactions and training chemical reaction procedures for performing the chemical reactions; tokenizing the training chemical reactions from the training data to create reaction token sequences; providing the reaction token sequences to a reaction encoder that generates reaction embeddings in a shared embedding space; tokenizing the training chemical reaction procedures from the training data to create procedure token sequences; providing the procedure token sequences to a procedure encoder that generates procedure embeddings in the shared embedding space; and training the reaction encoder and the procedure encoder with the training data by backpropagation to minimize a loss function between corresponding pairs of the reaction embeddings and the procedure embeddings.
    • Claim:
      15. The computer-implemented method of claim 14, further comprising cleaning the training data by separating the training chemical reactions into reactants and products.
    • Claim:
      16. The computer-implemented method of claim 14, further comprising cleaning the training data by removing any representations of a chemical reaction that indicates a valence of an atom that exceeds a maximum valence for the atom.
    • Claim:
      17. The computer-implemented method of claim 14, wherein the reaction encoder comprises a reaction transformer network followed by a reaction pooling layer followed by a reaction normalization layer followed by a reaction neural network and the procedure encoder comprises a procedure transformer network followed by a procedure pooling layer followed by a procedure normalization layer followed by a procedure neural network.
    • Claim:
      18. The computer-implemented method of claim 17, wherein the reaction transformer network and the procedure transformer network both comprise one or more layers where each layer comprises a multi-head attention layer and a fully-connected feed-forward layer.
    • Claim:
      19. The computer-implemented method of claim 17, wherein the reaction neural network and the procedure neural network are both fully-connected, feed-forward, multilayer neural networks.
    • Claim:
      20. The computer-implemented method of claim 14, further comprising initializing the reaction encoder with weights developed from a machine learning model configured to convert between a first representation of chemical entity and a second representation of chemical entity and initializing the procedure encoder with weights from a BERT-based language model of scientific publications.
    • Patent References Cited:
      4591948 May 1986 Sato et al.
      4922052 May 1990 Shimizu et al.
      4944848 July 1990 Kaufhold
      5082358 January 1992 Tabata et al.
      5221721 June 1993 Selvig
      7115589 October 2006 Weigele et al.
      7148245 December 2006 Bernardon et al.
      7435837 October 2008 Gross et al.
      7991730 August 2011 Wagner et al.
      8299267 October 2012 Ghosh et al.
      8926998 January 2015 Hedrick et al.
      8927117 January 2015 Buesing et al.
      8927281 January 2015 Boitano et al.
      2022/0270711 August 2022 Feala
      2023/0410950 December 2023 Godin
      2024/0135172 April 2024 Hyuga
    • Other References:
      “Formulation Examples”, Retrieved from: https://web.archive.org/web/20210518173423/https://www.jrspharma.com/pharma_en/resources/formulation-examples/, May 18, 2021, 3 Pages. cited by applicant
      Beltagy, et al., “SciBERT: A Pretrained Language Model for Scientific Text”, In Repository of arXiv:1903.10676v3, Sep. 10, 2019, 6 Pages. cited by applicant
      Chithrananda, et al., “ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction”, In Repository of arXiv:2010.09885v2, Oct. 23, 2020, 7 Pages. cited by applicant
      Devlin, et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, In Repository of arXiv:1810.04805v2, May 24, 2019, 16 Pages. cited by applicant
      Finelli, Luca, “The Art of Drug Design in a Technological Age”, Retrieved from: https://www.novartis.com/stories/art-drug-design-technological-age?utm_source=LT, Nov. 18, 2021, 5 Pages. cited by applicant
      Guo, et al., “MM-Deacon: Multimodal Molecular Domain Embedding Analysis via Contrastive Learning”, In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, May 22, 2022, 21 Pages. cited by applicant
      Lowe, Daniel, “Chemical Reactions from US Patents (1976-Sep. 2016)”, Retrieved from: https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873, Jun. 13, 2017, 1 Page. cited by applicant
      Rajan, et al., “DECIMER: Towards Deep Learning for Chemical Image Recognition”, In Journal of Cheminformatics, vol. 12, Article 65, Oct. 27, 2020, 9 Pages. cited by applicant
      Rajan, et al., “STOUT: SMILES to IUPAC Names using Neural Machine Translation”, In Journal of Cheminformatics, vol. 13, Article 34, Apr. 27, 2021, 14 Pages. cited by applicant
      Vaswani, et al., “Attention Is All You Need”, In Repository of arXiv:1706.03762v5, Dec. 6, 2017, 15 Pages. cited by applicant
      Guo, et al., “MM-Deacon: Multimodal molecular domain embedding analysis via contrastive learning”, Retrieved From: https://www.biorxiv.org/content/10.1101/2021.09.17.460864v1.full.pdf, Sep. 20, 2021, 21 Pages. cited by applicant
      Guo, et al., “Multilingual Molecular Representation Learning via Contrastive Pre-training”, In Repository of arXiv:2109.08830v3, Apr. 18, 2022, 13 Pages. cited by applicant
      “International Search Report and Written Opinion Issued in PCT Application No. PCT/US23/022629”, Mailed Date: Sep. 13, 2023, 12 Pages. cited by applicant
      Schwaller, et al., “Mapping the Space of Chemical Reactions Using Attention-Based Neural Networks”, In Repository of arXiv:2012.06051v1, Dec. 9, 2020, 36 Pages. cited by applicant
    • Primary Examiner:
      Islam, Mohammad K
    • Attorney, Agent or Firm:
      Keim, Benjamin
      Newport IP, LLC
    • Accession Number:
      edspgr.12191004
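
The retrieval step recited in claim 7 (rank procedure embeddings by similarity to a reaction embedding and return a predetermined number of candidates) and the recall at K figure quoted in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the record does not name the similarity measure, so cosine similarity is an assumption, and the function names, procedure ids, and two-dimensional toy embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity (assumed measure; the record does not specify one)."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def top_k_procedures(reaction_embedding, procedure_embeddings, k):
    """Return ids of the k procedures most similar to the reaction embedding."""
    scored = sorted(
        procedure_embeddings.items(),
        key=lambda item: cosine_similarity(reaction_embedding, item[1]),
        reverse=True,
    )
    return [proc_id for proc_id, _ in scored[:k]]


def recall_at_k(queries, procedure_embeddings, k):
    """Fraction of queries whose labeled procedure appears in the top k."""
    hits = sum(
        1
        for embedding, true_id in queries
        if true_id in top_k_procedures(embedding, procedure_embeddings, k)
    )
    return hits / len(queries)


# Hypothetical toy corpus: ids mapped to embeddings in a shared 2-D space.
procedures = {
    "proc_a": [1.0, 0.0],
    "proc_b": [0.0, 1.0],
    "proc_c": [0.7, 0.7],
}
# Each query pairs a reaction embedding with the id of its labeled procedure.
queries = [
    ([0.9, 0.1], "proc_a"),
    ([0.1, 0.9], "proc_b"),
    ([1.0, 0.0], "proc_b"),  # deliberately mislabeled so recall at 2 is below 1
]
```

In a real deployment the procedure embeddings would be computed once for the whole corpus and searched with a vector index; the linear scan here only keeps the sketch short.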
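
Claim 14 describes training both encoders to minimize a loss between corresponding reaction and procedure embeddings, and claim 7 calls that loss contrastive. The record does not give the formula, so the sketch below assumes one common choice, a symmetric in-batch softmax (InfoNCE-style) loss over unit-length embeddings. The function names and the `temperature` parameter are hypothetical, and a real training loop would compute this with an autodiff framework rather than plain Python.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def in_batch_contrastive_loss(reaction_embs, procedure_embs, temperature=0.1):
    """Symmetric in-batch contrastive loss (an assumed form, not from the record).

    reaction_embs[i] and procedure_embs[i] are a labeled pair; every other
    procedure in the batch acts as a negative for reaction i, and vice versa.
    Embeddings are assumed to be L2-normalized already.
    """
    n = len(reaction_embs)
    if n == 0 or n != len(procedure_embs):
        raise ValueError("need equally sized, non-empty batches")
    sims = [
        [_dot(r, p) / temperature for p in procedure_embs]
        for r in reaction_embs
    ]
    total = 0.0
    for i in range(n):
        row = sims[i]                        # reaction i against all procedures
        col = [sims[j][i] for j in range(n)]  # procedure i against all reactions
        total += _logsumexp(row) - row[i]
        total += _logsumexp(col) - col[i]
    return total / (2 * n)


# Toy batch: correctly paired embeddings should give a much smaller loss
# than the same embeddings paired with the wrong partner.
reactions = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(reactions, [[1.0, 0.0], [0.0, 1.0]])
mismatched = in_batch_contrastive_loss(reactions, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this shape pulls each reaction embedding toward its own procedure and away from the other procedures in the batch, which is the behavior the shared embedding space in the claims relies on.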
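
Claim 15 mentions cleaning the training data by separating reactions into reactants and products, and claim 8 notes that a reaction sketch can be a SMILES representation. Assuming the training reactions are stored as reaction SMILES in the conventional `reactants>agents>products` layout, with `.` separating molecules (the record does not state the storage format), the separation step might look like the following Python sketch. The function name `split_reaction_smiles` is hypothetical.

```python
def split_reaction_smiles(reaction_smiles):
    """Split a reaction SMILES string into reactant, agent, and product lists.

    Assumes the conventional 'reactants>agents>products' layout, where the
    agents field may be empty and '.' separates individual molecules.
    """
    parts = reaction_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(
            "expected 'reactants>agents>products', got: %r" % reaction_smiles
        )
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Esterification of acetic acid with ethanol, written without agents.
example = split_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
```

The valence check in claim 16 would need a cheminformatics toolkit to parse each molecule, so it is left out of this sketch.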