Feature prediction for minority class data augmentation

  • Publication Date:
    February 04, 2025
  • Additional Information
    • Patent Number:
      12,217,875
    • Appl. No:
      17/978009
    • Application Filed:
      October 31, 2022
    • Abstract:
      A method for generating synthetic training records for use in training a model to predict low-incidence events. A synthetic training record is generated from a minority-class training record by substituting a different value for a feature in the minority-class training record, where the probability of the different value occurring in the minority-class training record exceeds a probability threshold. Also disclosed are a non-transitory storage medium comprising minority-class training records and synthetic training records and a method of training a machine-learning model using training records augmented with synthetic training records. An exemplary synthetic training record is a synthetic medical record for use in training a model to predict drug overdoses.
    • Inventors:
      Pulselight Holdings, Inc. (Austin, TX, US)
    • Assignees:
      Pulselight Holdings, Inc. (Austin, TX, US)
    • Claim:
      1. A computer-implemented method of generating synthetic minority-class training records for machine learning, the method performed by a computer system, said computer system comprising one or more processors and computer-usable non-transitory storage media operationally coupled to the one or more processors, comprising: storing in the non-transitory storage media a plurality of original minority-class training records, including a first minority-class training record, wherein each of the plurality of original minority-class training records is labeled with a same first label and comprises a feature value for each of a plurality of features, including a first feature, and wherein the first minority-class training record comprises a first feature value for the first feature; using a computational process performed by the one or more processors executing software instructions stored in the computer-usable non-transitory storage media, determining that the probability of the first feature having a different second feature value in the first minority-class training record exceeds a pre-determined probability threshold; and generating a first synthetic minority-class training record from the first minority-class training record, comprising changing the feature value of the first feature in the first minority-class training record from the first feature value to the second feature value, and storing the modified version of the first minority-class training record as the first synthetic minority-class training record in the non-transitory storage media, thereby augmenting the plurality of original minority-class training records with the first synthetic minority-class training record.
    • Claim:
      2. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the first feature can be measured with respect to health.
    • Claim:
      3. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the original minority-class training records comprise medical records.
    • Claim:
      4. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the original minority-class training records are high-dimensional.
    • Claim:
      5. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the first label describes a low incidence event.
    • Claim:
      6. The computer-implemented method of claim 5 of generating synthetic minority-class training records for machine learning, wherein the first label describes a health care incident.
    • Claim:
      7. The computer-implemented method of claim 6 of generating synthetic minority-class training records for machine learning, wherein the first label describes a medication incident.
    • Claim:
      8. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the computational process comprises logistic regression.
    • Claim:
      9. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the probability threshold is within the range 0.175 to 0.7.
    • Claim:
      10. The computer-implemented method of claim 9 of generating synthetic minority-class training records for machine learning, wherein the probability threshold is 0.2.
    • Claim:
      11. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the first feature is binary.
    • Claim:
      12. The computer-implemented method of claim 1 of generating synthetic minority-class training records for machine learning, wherein the first feature is categorical.
    • Claim:
      13. A computer-usable non-transitory storage medium comprising an augmented plurality of minority-class training records for machine learning, comprising: a plurality of original minority-class training records wherein each of the plurality of original minority-class training records is labeled with a same first label and comprises a feature value for each of a plurality of features; and one or more synthetic minority-class training records having the same first label, wherein each of the one or more synthetic minority-class training records has been generated using a computational method implemented by one or more processors executing software instructions, the computational method comprising: changing the feature value of a first feature in an original minority-class training record from a first feature value to a different second feature value, wherein the probability of the first feature having the different second feature value in the original minority-class training record exceeds a pre-determined probability threshold.
    • Claim:
      14. The computer-usable non-transitory storage medium comprising an augmented plurality of minority-class training records for machine learning of claim 13, wherein the first feature can be measured with respect to health.
    • Claim:
      15. The computer-usable non-transitory storage medium comprising an augmented plurality of minority-class training records for machine learning of claim 13, wherein the first label describes a low-incidence event.
    • Claim:
      16. A computer-implemented method of training a machine learning model to predict a low-incidence event, the method performed by a computer system, said computer system comprising one or more processors and computer-usable non-transitory storage media operationally coupled to the one or more processors, comprising: storing in the non-transitory storage media a plurality of machine-learning training records, the plurality of machine-learning training records comprising: a plurality of majority-class training records; a plurality of original minority-class training records, each original minority-class training record having a same minority-class label describing a low-incidence event; and one or more synthetic minority-class training records having the same minority-class label, wherein each of the one or more synthetic minority-class training records has been generated using a computational method, the method comprising: changing the feature value of a first feature in an original minority-class training record from a first feature value to a different second feature value, wherein the probability of the first feature having the different second feature value in the original minority-class training record exceeds a pre-determined probability threshold; and using a computational machine-learning method performed by one or more processors executing software instructions stored in the computer-usable non-transitory storage media, training a machine learning model with the plurality of machine-learning training records to predict the low-incidence event.
    • Claim:
      17. The computer-implemented method of training a machine learning model to predict a low-incidence event of claim 16, wherein the first feature can be measured with respect to health.
    • Claim:
      18. The computer-implemented method of training a machine learning model to predict a low-incidence event of claim 16, wherein each of the plurality of minority-class training records comprises medical records.
    • Claim:
      19. The computer-implemented method of training a machine learning model to predict a low-incidence event of claim 16, wherein the low-incidence event comprises a health care incident.
    • Claim:
      20. The computer-implemented method of training a machine learning model to predict a low-incidence event of claim 19, wherein the low-incidence event comprises a medication incident.
    • Patent References Cited:
      11488723 November 2022 Mugan
      11977991 May 2024 Mugan
      2017/0330109 November 2017 Maughan
      2018/0052961 February 2018 Shrivastava
      WO-2016016459 February 2016
    • Other References:
      Li, X., et al. “Using machine learning to predict opioid overdoses among prescription opioid users.” Value in Health 21 (2018): S245. (Year: 2018). cited by examiner
    • Assistant Examiner:
      Siozopoulos, Constantine
    • Primary Examiner:
      Dunham, Jason B
    • Attorney, Agent or Firm:
      Williams, Jr., J. Roger
    • Accession Number:
      edspgr.12217875
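
Illustrative sketch (not part of the patent record): the augmentation described in claims 1 and 8 through 11, followed by the training step of claim 16, might look roughly like the Python below. It is a minimal sketch under several assumptions that the record does not confirm: records are lists of binary features, the feature-prediction model is a small logistic regression fitted by gradient descent on the minority-class records only, and the downstream model is also a logistic regression. All function names, parameters, and the toy data are hypothetical. The default threshold of 0.2 is the value recited in claim 10.

```python
import math


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic-regression weights (bias stored last) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            err = predict_proba(w, row) - target
            for j, xi in enumerate(row):
                grad[j] += err * xi
            grad[-1] += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def predict_proba(w, row):
    """Probability that the target equals 1 for one record."""
    return sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, row)))


def augment_minority(minority, threshold=0.2):
    """Return synthetic records, each a minority record with one feature flipped.

    Assumption: each binary feature is predicted from the remaining features
    by a model fitted on the minority-class records themselves.
    """
    synthetic = []
    for f in range(len(minority[0])):
        others = [r[:f] + r[f + 1:] for r in minority]
        target = [r[f] for r in minority]
        w = fit_logistic(others, target)
        for rec, ctx in zip(minority, others):
            p_one = predict_proba(w, ctx)
            flipped = 1 - rec[f]
            p_flipped = p_one if flipped == 1 else 1.0 - p_one
            # Keep the change only if the different value is plausible enough.
            if p_flipped > threshold:
                new_rec = list(rec)
                new_rec[f] = flipped
                synthetic.append(new_rec)
    return synthetic


# Toy data, invented purely for illustration.
minority = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 1, 0]]
majority = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0],
            [1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]

synthetic = augment_minority(minority, threshold=0.2)

# Train on the augmented set; synthetic records keep the minority label (1).
X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
model = fit_logistic(X, y)
```

Raising `threshold` admits fewer, more plausible substitutions, and lowering it admits more. Claim 9 recites a range of 0.175 to 0.7 for this value.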