DATA DIGITIZATION VIA CUSTOM INTEGRATED MACHINE LEARNING ENSEMBLES

Publication Date:
January 23, 2025

Additional Information
- Document Number:
  20250028734
- Appl. No:
  18/355186
- Application Filed:
  July 19, 2023
- Abstract:
  Data digitization via custom integrated machine learning ensembles is provided. For example, a system integrates multiple trained machine learning ensembles to identify, extract, and map data. The system receives a data set from sources. The system identifies ensembles, each of which can include machine learning models that can determine an outcome. The system filters a subset of data from the data set. The system identifies a layout for the data set based on a vendor type, a data type, and the data set. The system executes a block detection module to identify blocks of the layout. The system executes a header detection module. The system executes a policy detection module to identify the headers as policies. The system transforms the data set into a second file type based on the headers, the layout, the blocks, and the policies, and presents the transformed data set for integration into a capital management system.
- Assignees:
  ADP, Inc. (Roseland, NJ, US)
- Claim:
  1. A system, comprising: one or more processors, coupled with memory, to: receive a data set comprising sheets in a first file type from a plurality of sources, the data set in one of a plurality of formats corresponding to one or more of the plurality of sources; identify a plurality of ensembles, each ensemble of the plurality of ensembles comprising one or more machine learning models and each ensemble to determine an outcome based on an outcome of each machine learning model of each respective ensemble; filter, using a first ensemble of the plurality of ensembles, a subset of data from the data set based on a threshold of the first ensemble; identify, using a second ensemble of the plurality of ensembles, a layout for each sheet of the data set based on a vendor type, data type, and the data set; execute, using a third ensemble of the plurality of ensembles, a block detection module to identify blocks for each sheet of the data set based on the layout, wherein each block of the blocks comprises a subset of the data set; execute, using a fourth ensemble of the plurality of ensembles, a header detection module to identify headers of each sheet of the data set according to the layout; execute, using a fifth ensemble of the plurality of ensembles, a policy detection module to identify one or more of the headers as policies using a comparison of a first header of the headers to a second header of the headers for each of the headers; transform, using a sixth ensemble of the plurality of ensembles based on the headers, the layout, the blocks, and the policies, the data set into a format of a second file type different from the plurality of formats; and present, by a display device coupled with the one or more processors, the transformed data set for integration into an electronic transaction system.
- Claim:
  2. The system of claim 1, comprising the one or more processors to: receive a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generate, using the first subset of data, the plurality of ensembles, each ensemble of the plurality of ensembles comprising a subset of the one or more machine learning models and each ensemble to be generated sequentially; and determine, using the second subset of data, that the output of the one or more machine learning models is below a threshold error.
- Claim:
  3. The system of claim 1, comprising the one or more processors to: determine that an error of one or more ensembles of the plurality of ensembles is greater than or equal to a threshold error; aggregate a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generate, using the first subset of data, a second plurality of ensembles for each ensemble of the plurality of ensembles with its error greater than or equal to the threshold error, each ensemble of the second plurality of ensembles comprising a subset of the one or more machine learning models; determine, using the second subset of data, that each machine learning model of each ensemble of the second plurality of ensembles is below the threshold error; and replace the plurality of ensembles with the second plurality of ensembles for each ensemble of the plurality of ensembles determined to have its error greater than or equal to the threshold error.
- Claim:
  4. The system of claim 1, comprising the one or more processors to: classify, using a seventh ensemble of the plurality of ensembles responsive to filtering the data set, the data set into the vendor type for each sheet of the data set; and classify, using an eighth ensemble of the plurality of ensembles, subsections of each sheet of the data set into one or more data types, wherein the one or more data types comprises indicative data and non-indicative data.
- Claim:
  5. The system of claim 1, comprising the one or more processors to: identify, responsive to executing the block detection module, using a seventh ensemble of the plurality of ensembles, rows and columns of each sheet as one of at least two labels based on the layout.
- Claim:
  6. The system of claim 1, comprising the one or more processors to: identify, responsive to executing the header detection module, duplicate headers of the headers; and remove the duplicate headers from the data set.
- Claim:
  7. The system of claim 1, comprising the one or more processors to: validate, using a seventh ensemble of the plurality of ensembles, the headers for each sheet to categorize each header, based on a threshold of the seventh ensemble.
- Claim:
  8. The system of claim 1, wherein executing the block detection module comprises the one or more processors to: identify, using the third ensemble, the subset of the data set by filtering the layout for corresponding rows and columns within each sheet; and generate, using the third ensemble, the blocks for each sheet of the data set based on the subset of the data set associated with the corresponding rows and columns.
- Claim:
  9. The system of claim 1, wherein executing the header detection module comprises the one or more processors to: identify, using the fourth ensemble, nested headers of the headers, wherein the nested headers comprise one or more of the headers hierarchically arranged under a different header of the headers and wherein each header of the headers is a titular apex for corresponding data of the data set; select, using the fourth ensemble, one header of the nested headers for each of the corresponding data of the data set for each nested header; evaluate, using the fourth ensemble based on the headers, the corresponding data, and the layout, each sheet of the data set through one or more decision trees to converge on pre-defined header categories; and identify, using the fourth ensemble, each header of the headers as one of the pre-defined header categories, responsive to evaluating each sheet through the one or more decision trees.
- Claim:
  10. The system of claim 1, wherein executing the policy detection module comprises the one or more processors to: identify, using the fifth ensemble, the headers responsive to executing the header detection module, determine, using the fifth ensemble, based on the layout and the headers, neighboring headers, wherein the neighboring headers are adjacent to one another for each sheet of the data set; determine, using the fifth ensemble, a probability of each header of the headers being a policy based on the neighboring headers of the header; and identify, using the fifth ensemble, one or more policies from the headers based on the probability of each header being at or above a threshold of the fifth ensemble.
- Claim:
  11. A method comprising: receiving, by one or more processors coupled with memory, a data set comprising sheets in a first file type from a plurality of sources, the data set in one of a plurality of formats corresponding to one or more of the plurality of sources; identifying, by the one or more processors, a plurality of ensembles, each ensemble of the plurality of ensembles comprising one or more machine learning models and each ensemble to determine an outcome based on an outcome of each machine learning model of each respective ensemble; filtering, by the one or more processors using a first ensemble of the plurality of ensembles, a subset of data from the data set based on a threshold of the first ensemble; identifying, by the one or more processors using a second ensemble of the plurality of ensembles, a layout for each sheet of the data set based on a vendor type, data type, and the data set; executing, by the one or more processors using a third ensemble of the plurality of ensembles, a block detection module to identify blocks of the layout for each sheet of the data set, wherein each block of the blocks comprises a subset of the data set; executing, by the one or more processors using a fourth ensemble of the plurality of ensembles, a header detection module to identify headers of each sheet of the data set according to the layout; executing, by the one or more processors using a fifth ensemble of the plurality of ensembles, a policy detection module to identify one or more of the headers as policies using a comparison of a first header of the headers to a second header of the headers for each of the headers; transforming, by the one or more processors using a sixth ensemble of the plurality of ensembles based on the headers, the layout, the blocks, and the policies, the data set into a format of a second file type different from the plurality of formats; and presenting, by a display device coupled with the one or more processors, the transformed data set for integration into a capital management system.
- Claim:
  12. The method of claim 11, comprising: receiving, by the one or more processors, a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generating, by the one or more processors using the first subset of data, the plurality of ensembles, each ensemble of the plurality of ensembles comprising a subset of the one or more machine learning models and each ensemble to be generated sequentially; and determining, by the one or more processors using the second subset of data, that each machine learning model of each ensemble of the plurality of ensembles is below a threshold error.
- Claim:
  13. The method of claim 11, comprising: determining, by the one or more processors, that an error of one or more ensembles of the plurality of ensembles is greater than or equal to a threshold error; aggregating, by the one or more processors, a second data set comprising a first subset of data to be an input into the one or more machine learning models and a second subset of data to compare against an output of the one or more machine learning models; generating, by the one or more processors using the first subset of data, a second plurality of ensembles for each ensemble of the plurality of ensembles with its error greater than or equal to the threshold error, each ensemble of the second plurality of ensembles comprising a subset of the one or more machine learning models; determining, by the one or more processors using the second subset of data, that each machine learning model of each ensemble of the second plurality of ensembles is below the threshold error; and replacing, by the one or more processors, the plurality of ensembles with the second plurality of ensembles for each ensemble of the plurality of ensembles determined to have its error greater than or equal to the threshold error.
- Claim:
  14. The method of claim 11, comprising: classifying, by the one or more processors using a seventh ensemble of the plurality of ensembles responsive to filtering the data set, the data set into the vendor type for each sheet of the data set; and classifying, by the one or more processors using an eighth ensemble of the plurality of ensembles, subsections of each sheet of the data set into one or more data types, wherein the one or more data types comprises indicative data and non-indicative data.
- Claim:
  15. The method of claim 11, comprising: identifying, by the one or more processors responsive to executing the header detection module, duplicate headers of the headers; and removing, by the one or more processors, the duplicate headers from the data set.
- Claim:
  16. The method of claim 11, wherein executing the block detection module comprises: identifying, by the one or more processors using the third ensemble, the subset of the data set by filtering the layout for corresponding rows and columns within each sheet; and generating, by the one or more processors using the third ensemble, the blocks for each sheet of the data set based on the subset of the data set associated with the corresponding rows and columns.
- Claim:
  17. The method of claim 11, wherein executing the header detection module comprises: identifying, by the one or more processors using the fourth ensemble, nested headers of the headers, wherein the nested headers comprise one or more of the headers hierarchically arranged under a different header of the headers and wherein each header of the headers is a titular apex for corresponding data of the data set; selecting, by the one or more processors using the fourth ensemble, one header of the nested headers for each of the corresponding data of the data set for each nested header; evaluating, by the one or more processors using the fourth ensemble based on the headers, the corresponding data, and the layout, each sheet of the data set through one or more decision trees to converge on pre-defined header categories; and identifying, by the one or more processors using the fourth ensemble, each header of the headers as one of the pre-defined header categories, responsive to evaluating each sheet through the one or more decision trees.
- Claim:
  18. The method of claim 11, wherein executing the policy detection module comprises: identifying, by the one or more processors using the fifth ensemble, the headers responsive to executing the header detection module, determining, by the one or more processors using the fifth ensemble, based on the layout and the headers, neighboring headers, wherein the neighboring headers are adjacent to one another for each sheet of the data set; determining, by the one or more processors using the fifth ensemble, a probability of each header of the headers being a policy based on the neighboring headers of the header; and identifying, by the one or more processors using the fifth ensemble, one or more policies from the headers based on the probability of each header being at or above a threshold of the fifth ensemble.
- Claim:
  19. A non-transitory computer-readable medium, executing instructions embodied thereon, the instructions to cause one or more processors to: receive a data set comprising sheets in a first file type from a plurality of sources, the data set in one of a plurality of formats corresponding to one or more of the plurality of sources; identify a plurality of ensembles, each ensemble of the plurality of ensembles comprising one or more machine learning models and each ensemble to determine an outcome based on an outcome of each machine learning model of each respective ensemble; filter, using a first ensemble of the plurality of ensembles, a subset of data from the data set based on a threshold of the first ensemble; identify, using a second ensemble of the plurality of ensembles, a layout for each sheet of the data set based on a vendor type, data type, and the data set; execute, using a third ensemble of the plurality of ensembles, a block detection module to identify blocks for each sheet of the data set based on the layout, wherein each block of the blocks comprises a subset of the data set; execute, using a fourth ensemble of the plurality of ensembles, a header detection module to identify headers of each sheet of the data set according to the layout; execute, using a fifth ensemble of the plurality of ensembles, a policy detection module to identify one or more of the headers as policies using a comparison of a first header of the headers to a second header of the headers for each of the headers; transform, using a sixth ensemble of the plurality of ensembles based on the headers, the layout, the blocks, and the policies, the data set into a format of a second file type different from the plurality of formats; and present, by a display device coupled with the one or more processors, the transformed data set for integration into an electronic transaction system.
- Claim:
  20. The non-transitory computer-readable medium of claim 19, comprising the instructions to cause the one or more processors to: classify, using a seventh ensemble of the plurality of ensembles responsive to filtering the data set, the data set into the vendor type for each sheet of the data set; and classify, using an eighth ensemble of the plurality of ensembles, subsections of each sheet of the data set into one or more data types, wherein the one or more data types comprises indicative data and non-indicative data.
- Current International Class:
  06
- Identifier:
  edspap.20250028734
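
Illustrative sketch (not part of the application): claims 1 and 10 describe an ensemble whose outcome is derived from the outcomes of its member models, and a policy detection step that scores each header using its adjacent headers and keeps those at or above the ensemble's threshold. The Python below is a minimal sketch of that idea under stated assumptions. The names `Ensemble`, `neighbors_of`, `detect_policies`, the two toy models, the sample headers, the 0.75 threshold, and the choice of averaging member outputs are all hypothetical; the record does not specify how member outcomes are combined or what the models look like.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical model type: maps a header and its adjacent headers to a probability.
Model = Callable[[str, Sequence[str]], float]


@dataclass
class Ensemble:
    models: Sequence[Model]
    threshold: float

    def probability(self, header: str, neighbors: Sequence[str]) -> float:
        # Assumption: the ensemble outcome is the mean of its members' outcomes.
        scores = [model(header, neighbors) for model in self.models]
        return sum(scores) / len(scores)


def neighbors_of(headers: Sequence[str], i: int) -> List[str]:
    # Headers immediately to the left and right of position i, when they exist.
    return [headers[j] for j in (i - 1, i + 1) if 0 <= j < len(headers)]


def detect_policies(headers: Sequence[str], ensemble: Ensemble) -> List[str]:
    # Keep each header whose ensemble probability is at or above the threshold.
    return [
        header
        for i, header in enumerate(headers)
        if ensemble.probability(header, neighbors_of(headers, i)) >= ensemble.threshold
    ]


# Toy stand-ins for trained models (assumptions for illustration only).
def keyword_model(header: str, neighbors: Sequence[str]) -> float:
    return 1.0 if "plan" in header.lower() else 0.0


def neighbor_model(header: str, neighbors: Sequence[str]) -> float:
    return 1.0 if any("premium" in n.lower() for n in neighbors) else 0.0


headers = ["Employee ID", "Dental Plan", "Premium", "Hire Date"]
ensemble = Ensemble(models=[keyword_model, neighbor_model], threshold=0.75)
result = detect_policies(headers, ensemble)
print(result)
```

With these toy models, only a header that both contains "plan" and sits next to a "Premium" header reaches the 0.75 threshold. Real member models would be trained classifiers, and a different combination rule (for example, majority vote) would fit the claim language equally well.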
