DATA DIGITIZATION VIA CUSTOM INTEGRATED MACHINE LEARNING ENSEMBLES

  • Publication Date:
    January 23, 2025
  • Additional Information
    • Document Number:
      20250029417
    • Appl. No:
      18/355198
    • Application Filed:
      July 19, 2023
    • Abstract:
      The present disclosure relates generally to the digitization of documents and more particularly, to a system, method and computer program which integrates multiple trained machine learning ensembles to identify, extract, and map a data set. The method, for example, includes receiving a data set from sources; identifying ensembles, each ensemble comprising machine learning models and each ensemble to determine an outcome; identifying a type for the data set based on a vendor type and the data set; executing a section detection module to identify sections of the data set and classify the sections; executing a page classification module; generating associations between the sections and the classifications; transforming, based on the association, the sections, the classifications, and the type, the data set into a second file type; and presenting the transformed data set for integration into a capital management system.
    • Assignees:
      ADP, Inc. (Roseland, NJ, US)
    • Claim:
      1. A system, comprising: one or more processors, coupled with memory, to: receive a data set comprising sheets in a first file type from a plurality of sources, the data set in one of a plurality of formats corresponding to one or more of the plurality of sources; identify a plurality of ensembles, each ensemble of the plurality of ensembles comprising one or more machine learning models and each ensemble to determine an outcome based on an outcome of each machine learning model of each respective ensemble; identify, using a first ensemble of the plurality of ensembles, a type for each sheet of the data set based on a vendor type and the data set; execute, using a second ensemble of the plurality of ensembles, a section detection module to identify sections for each sheet of the data set based on the respective type for each sheet and images and text within each sheet; execute, using a third ensemble of the plurality of ensembles, a page classification module to identify classifications within each sheet based on the data set; generate an association between the sections and the classifications for each type of each sheet of the data set; transform, using a fourth ensemble of the plurality of ensembles based on the association, the sections, the classifications, and the type, the data set into a format of a second file type different from the plurality of formats; and provide, for render by a display device coupled with the one or more processors, the transformed data set for integration into an electronic transaction system.
    • Claim:
      2. The system of claim 1, comprising the one or more processors to: receive a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generate, using the first subset of data, the plurality of ensembles, each ensemble of the plurality of ensembles comprising a subset of the one or more machine learning models and each ensemble to be generated sequentially; and determine, using the second subset of data, that each machine learning model of each ensemble of the plurality of ensembles is below a threshold error.
    • Claim:
      3. The system of claim 1, comprising the one or more processors to: determine that an error of one or more ensembles of the plurality of ensembles is greater than or equal to a threshold error; aggregate a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generate, using the first subset of data, a second plurality of ensembles for each ensemble of the plurality of ensembles with its error greater than or equal to the threshold error, each ensemble of the second plurality of ensembles comprising a subset of the one or more machine learning models; determine, using the second subset of data, that each machine learning model of each ensemble of the second plurality of ensembles is below the threshold error; and replace the plurality of ensembles with the second plurality of ensembles for each ensemble of the plurality of ensembles determined to have its error greater than or equal to the threshold error.
    • Claim:
      4. The system of claim 1, comprising the one or more processors to: validate, using a fifth ensemble of the plurality of ensembles responsive to executing the section detection module, a label for each of the sections by comparing the text of the respective sheet to the label of the respective section, the label for each of the sections assigned by the section detection module.
    • Claim:
      5. The system of claim 1, comprising the one or more processors to: determine, responsive to executing the section detection module, that the sections comprise at least entities or tables.
    • Claim:
      6. The system of claim 1, wherein executing the section detection module comprises the one or more processors to: identify, using the second ensemble, the sections of each sheet of the data set by performing object recognition on the images of each sheet; and assign, using the second ensemble, a label to each section of each sheet of the data set by parsing the text for an indication of the label.
    • Claim:
      7. The system of claim 1, wherein executing the section detection module comprises the one or more processors to: determine, using a fifth machine learning ensemble of the plurality of ensembles, that one or more of the sections is an entity by identifying a paired pattern of the text of each section; and determine, using a sixth machine learning ensemble of the plurality of ensembles, that one or more of the sections is a table by parsing the text of each section.
    • Claim:
      8. The system of claim 1, wherein executing the page classification module comprises the one or more processors to: identify, using the third ensemble, the classifications within each sheet based on the data set by parsing the text of each sheet of the data set for a relation to the classifications.
    • Claim:
      9. The system of claim 1, wherein the classifications comprise balances and totals.
    • Claim:
      10. The system of claim 1, comprising the one or more processors to parallelly execute the section detection module and the page classification module.
    • Claim:
      11. A method comprising: receiving, by one or more processors coupled with memory, a data set comprising sheets in a first file type from a plurality of sources, the data set in one of a plurality of formats corresponding to one or more of the plurality of sources; identifying, by the one or more processors, a plurality of ensembles, each ensemble of the plurality of ensembles comprising one or more machine learning models and each ensemble to determine an outcome based on an outcome of each machine learning model of each respective ensemble; identifying, by the one or more processors using a first ensemble of the plurality of ensembles, a type for each sheet of the data set based on a vendor type and the data set; executing, by the one or more processors using a second ensemble of the plurality of ensembles, a section detection module to identify sections for each sheet of the data set based on the respective type for each sheet and images and text within each sheet; executing, by the one or more processors using a third ensemble of the plurality of ensembles, a page classification module to identify classifications within each sheet based on the data set; generating, by the one or more processors, an association between the sections and the classifications for each type of each sheet of the data set; transforming, by the one or more processors using a fourth ensemble of the plurality of ensembles based on the association, the sections, the classifications, and the type, the data set into a format of a second file type different from the plurality of formats; and provide, for rendering by a display device coupled with the one or more processors, the transformed data set for integration into a capital management system.
    • Claim:
      12. The method of claim 11, comprising: receiving, by the one or more processors, a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generating, by the one or more processors using the first subset of data, the plurality of ensembles, each ensemble of the plurality of ensembles comprising a subset of the one or more machine learning models and each ensemble to be generated sequentially; and determining, by the one or more processors using the second subset of data, that each machine learning model of each ensemble of the plurality of ensembles is below a threshold error.
    • Claim:
      13. The method of claim 11, comprising: determining, by the one or more processors, that an error of one or more ensembles of the plurality of ensembles is greater than or equal to a threshold error; aggregating, by the one or more processors, a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generating, by the one or more processors using the first subset of data, a second plurality of ensembles for each ensemble of the plurality of ensembles with its error greater than or equal to the threshold error, each ensemble of the second plurality of ensembles comprising a subset of the one or more machine learning models; determining, by the one or more processors using the second subset of data, that each machine learning model of each ensemble of the second plurality of ensembles is below the threshold error; and replacing, by the one or more processors, the plurality of ensembles with the second plurality of ensembles for each ensemble of the plurality of ensembles determined to have its error greater than or equal to the threshold error.
    • Claim:
      14. The method of claim 11, comprising: validating, by the one or more processors using a fifth ensemble of the plurality of ensembles responsive to executing the section detection module, a label for each of the sections by comparing the text of the respective sheet to the label of the respective section, the label for each of the sections assigned by the section detection module.
    • Claim:
      15. The method of claim 11, wherein executing the section detection module comprises: identifying, by the one or more processors using the second ensemble, the sections of each sheet of the data set by performing object recognition on the images of each sheet; and assigning, by the one or more processors using the second ensemble, a label to each section of each sheet of the data set by parsing the text for an indication of the label.
    • Claim:
      16. The method of claim 11, wherein executing the section detection module comprises: determining, by the one or more processors using a fifth machine learning ensemble of the plurality of ensembles, that one or more of the sections is an entity by identifying a paired pattern of the text of each section; and determining, by the one or more processors using a sixth machine learning ensemble of the plurality of ensembles, that one or more of the sections is a table by parsing the text of each section.
    • Claim:
      17. The method of claim 11, wherein executing the page classification module comprises: identifying, by the one or more processors using the third ensemble, the classifications within each sheet based on the data set by parsing the text of each sheet of the data set for a relation to the classifications.
    • Claim:
      18. The method of claim 11, comprising: executing, by the one or more processors, the section detection module and the sheet classification module in parallel.
    • Claim:
      19. A non-transitory computer-readable medium, executing instructions embodied thereon, the instructions to cause one or more processors to: receive a data set comprising sheets in a first file type from a plurality of sources, the data set in one of a plurality of formats corresponding to one or more of the plurality of sources; identify a plurality of ensembles, each ensemble of the plurality of ensembles comprising one or more machine learning models and each ensemble to determine an outcome based on an outcome of each machine learning model of each respective ensemble; identify, using a first ensemble of the plurality of ensembles, a type for each sheet of the data set based on a vendor type and the data set; execute, using a second ensemble of the plurality of ensembles, a section detection module to identify sections for each sheet of the data set based on the respective type for each sheet and images and text within each sheet; execute, using a third ensemble of the plurality of ensembles, a page classification module to identify classifications within each sheet based on the data set; generate an association between the sections and the classifications for each type of each sheet of the data set; transform, using a fourth ensemble of the plurality of ensembles based on the association, the sections, the classifications, and the type, the data set into a format of a second file type different from the plurality of formats; and provide, for render by a display device coupled with the one or more processors, the transformed data set for integration into an electronic transaction system.
    • Claim:
      20. The non-transitory computer-readable medium of claim 19, comprising the instructions to cause the one or more processors to: validate, using a fifth ensemble of the plurality of ensembles responsive to executing the section detection module, a label for each of the sections by comparing the text of the respective sheet to the label of the respective section, the label for each of the sections assigned by the section detection module.
    • Current International Class:
      06; 06; 06
    • Accession Number:
      edspap.20250029417
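
Illustrative sketch (not part of the record): claims 1, 3, 9 and 10 describe ensembles whose outcome is based on the outcomes of their member models, a threshold-error check that triggers replacement of an ensemble, classifications such as balances and totals, and parallel execution of the section detection and page classification modules. The Python below is a minimal sketch of those ideas under stated assumptions, not the applicant's implementation. The majority-vote rule, the keyword-based toy models, the sample sheets, and every name (`keyword_model`, `ensemble_outcome`, `error_rate`, `needs_replacement`, `process_sheet`, `SECTION_MODELS`, `PAGE_MODELS`, `HOLDOUT`) are hypothetical; the claims do not specify how member outcomes are combined or how error is measured.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

# Hypothetical model type: maps the text of one sheet to a label.
Model = Callable[[str], str]


def keyword_model(keyword: str, label: str, default: str) -> Model:
    """Toy stand-in for a trained model: emits `label` if `keyword` occurs."""
    return lambda text: label if keyword in text.lower() else default


def ensemble_outcome(models: Sequence[Model], sheet_text: str) -> str:
    """Combine member outcomes by majority vote (an assumed combination rule)."""
    votes = Counter(model(sheet_text) for model in models)
    return votes.most_common(1)[0][0]


def error_rate(predict: Model, holdout: Sequence[Tuple[str, str]]) -> float:
    """Fraction of held-out (text, expected_label) pairs predicted wrongly."""
    wrong = sum(predict(text) != expected for text, expected in holdout)
    return wrong / len(holdout)


def needs_replacement(models: Sequence[Model],
                      holdout: Sequence[Tuple[str, str]],
                      threshold: float) -> bool:
    """True when the ensemble's error is greater than or equal to the threshold."""
    return error_rate(lambda t: ensemble_outcome(models, t), holdout) >= threshold


# Assumed toy ensembles: one labels a sheet's section as entity or table,
# the other classifies the page as balances or totals.
SECTION_MODELS = [
    keyword_model("|", "table", "entity"),
    keyword_model("qty", "table", "entity"),
    keyword_model(":", "entity", "table"),
]
PAGE_MODELS = [
    keyword_model("total", "totals", "balances"),
    keyword_model("balance", "balances", "totals"),
    keyword_model("sum", "totals", "balances"),
]

# Assumed held-out data used only to measure error.
HOLDOUT = [
    ("Invoice total: 120.00", "totals"),
    ("Closing balance 40.00", "balances"),
]


def process_sheet(sheet_text: str) -> dict:
    """Run both ensembles in parallel and associate their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        section = pool.submit(ensemble_outcome, SECTION_MODELS, sheet_text)
        page = pool.submit(ensemble_outcome, PAGE_MODELS, sheet_text)
        return {"section": section.result(), "classification": page.result()}


if __name__ == "__main__":
    print(process_sheet("Invoice total: 120.00"))
    print(needs_replacement(PAGE_MODELS, HOLDOUT, threshold=0.25))
```

Majority voting is used here only because it is the simplest way to derive one outcome from several model outcomes; a weighted or stacked combination would fit the claim language equally well. Threads are likewise just one way to run the two modules in parallel.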