Item request has been placed! ×
Item request cannot be made. ×
loading  Processing Request

System and method for input data fault recovery in a massively parallel real time computing system

Item request has been placed! ×
Item request cannot be made. ×
loading   Processing Request
  • Publication Date:
    September 18, 2018
  • معلومة اضافية
    • Patent Number:
      10078,559
    • Appl. No:
      15/166724
    • Application Filed:
      May 27, 2016
    • نبذة مختصرة :
      A massively parallel real-time computing system receives input data events across many compute nodes, each with a processing algorithm in its processing pipeline. An Event Manager is placed before the algorithm processing pipelines, receives metadata about each incoming event, and collects and organizes it in a database. A fast histogram compares the metadata about each event to that of all the other events, in a processing interval. For sufficiently matching metadata, the events are forwarded to the processing nodes as “regular” events for processing. If the metadata for a processing interval does not match sufficiently, the histogram decides which events are the “correct” events and which events are “incorrect.” The “correct” events are sent on for processing and the “incorrect” events are combined with the “correct” metadata and sent back to the processing nodes to supplement or modify their incoming data to match the other nodes' expectations.
    • Inventors:
      Raytheon Company (Waltham, MA, US)
    • Assignees:
      Raytheon Company (Waltham, MA, US)
    • Claim:
      1. A method of preventing buffer overflow resulting from data faults in a parallel computing system, comprising: creating a two dimensional (2D) array space in computer storage, the 2D array space configured to store event notifications received from j respective input data streams via one or more input buffers of N compute nodes, where j≥N, during T processing interval time slots, wherein the T time slots have incrementable values from a first value to a T th value, wherein the event notifications comprise metadata including at least a timestamp, wherein arrival of a first event notification causes initiation of a timeout count for all j data streams, and wherein T is the maximum number of open time slots for a given timeout count; creating and initializing to zero in the 2D array space in the computer storage a first data array of next time slots of size T respectively associated with individual ones of the j input data streams; receiving a first event notification from one of the j input data streams, during a first time slot, and, responsive to receiving the first event notification, beginning a first timeout count for all j input data streams, and saving metadata of the received event notification in the first time slot in the 2D array space in the computer storage; incrementing the value of the timeslot for the one of the j input data streams by one to arrive at the value of the next time slot and, if the incremented value is equal to or greater than T, setting the value to the value of the first time slot; determining whether the 2D array space in the computer storage for any one of the T time slots stores metadata from all j input data streams; responsive to determining that array storage for any one of the T time slots stores metadata from all j input data streams, determining a value of the time stamp for each metadata stored in the 2D array space in the computer storage for the one of the T time slots; and responsive to determining that the value of the time stamp, within a tolerance, for at least one of the j input data streams of the first time slot, is greater than the time stamp for any other of the j input data streams for the first time slot, delaying event notification for the at least one of the j input data streams by at least one time slot, transmitting a pseudo-event notification to a compute algorithm for the at least one of the j input data streams, and transmitting an event notification to respective compute algorithms for the j input data streams other than the at least one of the j input data streams, wherein data overflow of the one or more input buffers is prevented.
    • Claim:
      2. The method of claim 1 further comprising determining whether any open time slot in the 2D array space has timed out and, responsive to determining that any open time slot has timed out, delaying the event notification for the open time slot that has timed out and transmitting a pseudo-event notification to a compute algorithm for the input data stream associated with the open time slot that has timed out.
    • Claim:
      3. The method of claim 2 further comprising transmitting event notifications to respective compute algorithms for each of the input data streams associated with open time slots that have not timed out.
    • Claim:
      4. The method of claim 1 wherein the pseudo-event notification indicates incorrect data has arrived in an input data stream.
    • Claim:
      5. The method of claim 4 wherein the pseudo-event notification comprises one of blanked-out data or a combination of metadata.
    • Claim:
      6. The method of claim 1 further comprising receiving a second event notification comprising metadata from any of the j input data streams, during a next time slot, and, responsive to receiving the second event notification, beginning a second timeout count for all j input data streams for the next time slot, and saving the metadata of the received second event notification in the next time slot in the 2D array space.
    • Claim:
      7. The method of claim 1 wherein the 2D array space is a component of a system comprising N compute nodes, each compute node comprising at least one data input stream, wherein each of the at least one input data stream has associated therewith an input buffer and a plurality of processing threads.
    • Claim:
      8. The method of claim 1 wherein the pseudo-events and the events are transmitted to the compute algorithms during any time slot before the respective compute algorithms begin processing during a processing interval associated with the any time slot.
    • Claim:
      9. The method of claim 1 wherein data overflow of at least one of the one or more input buffers is prevented by a pseudo-event notification.
    • Claim:
      10. One or more computer-readable hardware storage devices having embedded therein a set of instructions which, when executed by one or more processors of a computer, causes the computer to execute operations comprising: creating a two dimensional (2D) array space in a computer storage, the 2D array space in the computer storage configured to store event notifications received from j respective input data streams via one or more input buffers of N compute nodes, where j≥N, during T processing interval time slots, wherein the T time slots have incrementable values from a first value to a Tth value, wherein the event notifications comprise metadata including at least a timestamp, wherein arrival of a first event notification causes initiation of a timeout count for all j data streams, and wherein T is the maximum number of open time slots for a given timeout count; creating and initializing to zero in the 2D array space in the computer storage a first data array of size T of next time slots respectively associated with individual ones of the j input data streams; receiving a first event notification from one of the j input data streams, during a first time slot, and, responsive to receiving the first event notification, beginning a first timeout count for all j input data streams, and saving metadata of the received event notification in the first time slot in the 2D array space in the computer storage; incrementing the value of the timeslot for the one of the j input data streams by one to arrive at the value of the next time slot and, if the incremented value is equal to or greater than T, setting the value to the value of the first time slot; determining whether the 2D array space in the computer storage for any one of the T time slots stores metadata from all j input data streams; responsive to determining that array storage for any one of the T time slots stores metadata from all j input data streams, determining a value of the time stamp for each metadata stored in the 2D array for the one of the T time slots; and responsive to determining that the value of the time stamp, within a tolerance, for at least one of the j input data streams of the first time slot, is greater than the time stamp for any other of the j input data streams for the first time slot, delaying event notification for the at least one of the j input data streams by at least one time slot, transmitting a pseudo-event notification to a compute algorithm for the at least one of the j input data streams, and transmitting an event notification to respective compute algorithms for the j input data streams other than the at least one of the j input data streams, wherein data overflow of the one or more input buffers is prevented.
    • Claim:
      11. The one or more computer-readable hardware storage devices of claim 10 , the operations further comprising determining whether any open time slot in the 2D array space has timed out and, responsive to determining that any open time slot has timed out, delaying the event notification for the open time slot that has timed out and transmitting a pseudo-event notification to a compute algorithm for the input data stream associated with the open time slot that has timed out.
    • Claim:
      12. The one or more computer-readable hardware storage devices of claim 11 , the operations further comprising transmitting event notifications to respective compute algorithms for each of the input data streams associated with open time slots that have not timed out.
    • Claim:
      13. The one or more computer-readable hardware storage devices of claim 10 wherein the pseudo-event notification indicates that incorrect data has arrived in an input data stream.
    • Claim:
      14. The one or more computer-readable hardware storage devices of claim 13 wherein the pseudo-event notification comprises one of blanked-out data or a combination of metadata.
    • Claim:
      15. The one or more computer-readable hardware storage devices of claim 10 , the operations further comprising receiving a second event notification comprising metadata from any of the j input data streams, during a next time slot, and, responsive to receiving the second event notification, beginning a second timeout count for all j input data streams for the next time slot, and saving the metadata of the received second event notification in the next time slot in the 2D array space.
    • Claim:
      16. The one or more computer-readable hardware storage devices of claim 10 wherein the 2D array space is a component of a system comprising N compute nodes, each compute node comprising at least one data input stream, wherein each of the at least one input data stream has associated therewith an input buffer and a plurality of processing threads.
    • Claim:
      17. The one or more computer-readable hardware storage devices of claim 10 wherein the pseudo-events and the events are transmitted to the compute algorithms during any time slot before the respective compute algorithms begin processing during a processing interval associated with the any time slot.
    • Claim:
      18. The one or more computer-readable hardware storage devices of claim 10 wherein a overflow of at least one of the one or more input buffers is prevented by a pseudo-event notification.
    • Claim:
      19. A system for preventing input buffer overflow resulting from data faults in a parallel processing computer, comprising: computer storage configured to store computer-readable instructions; and one or more computer processor associated with the computer storage and configured to execute at least some of the computer-readable instructions to perform operations comprising: creating a two dimensional (2D) array space in the computer storage, the 2D array space configured to store event notifications received from j respective input data streams via one or more input buffers of N compute nodes, where j≥N, during T processing interval time slots, wherein the T time slots have incrementable values from a first value to a T th value, wherein the event notifications comprise metadata including at least a timestamp, wherein arrival of a first event notification causes initiation of a timeout count for all j data streams, and wherein T is the maximum number of open time slots for a given timeout count; creating and initializing to zero in the 2D array space in the computer storage a first data array of next time slots of size T respectively associated with individual ones of the j input data streams; receiving a first event notification from one of the j input data streams, during a first time slot, and, responsive to receiving the first event notification, beginning a first timeout count for all j input data streams, and saving the metadata of the received event notification in the first time slot in the 2D array space in the computer storage; incrementing the value of the timeslot for the one of the j input data streams by one, to arrive at the value of the next time slot and, if the incremented value is equal to or greater than T, setting the value to the value of the first time slot; determining whether the 2D array space in the computer storage for any one of the T time slots stores metadata from all j input data streams; responsive to determining that the 2D array space in the computer storage for any one of the T time slots stores metadata from all j input data streams, determining a value of a time stamp for each metadata stored in the 2D array space in the computer storage for the one of the T time slots; and responsive to determining that the value of the time stamp, within a tolerance, for at least one of the j input data streams of the first time slot, is greater than the time stamp for any other of the j input data streams for the first time slot, delaying event notification for the at least one of the j input data streams by at least one time slot, transmitting a pseudo-event notification to a compute algorithm for the at least one of the j input data streams, and transmitting an event notification to respective compute algorithms for the j input data streams other than the at least one of the j input data streams, wherein data overflow of the one or more input buffers of each of the N compute nodes is prevented.
    • Claim:
      20. The system of claim 19 , the instructions further comprising determining whether any open time slot in the 2D array space in the computer storage has timed out and, responsive to determining that any open time slot has timed out, delaying the event notification for the open time slot that has timed out and transmitting a pseudo-event notification to a compute algorithm for the input data stream associated with the open time slot that has timed out and transmitting event notifications to respective compute algorithms for each of the input data streams associated with open time slots that have not timed out.
    • Patent References Cited:
      5481719 January 1996 Ackerman et al.
      8156493 April 2012 Ellis et al.
      8260492 September 2012 Stange et al.
      2004/0019744 January 2004 Boucher
      2008/0189573 August 2008 Darrington
      2010/0042632 February 2010 Johnson
      2010/0293532 November 2010 Andrade
      2011/0191785 August 2011 Archer et al.
      2011/0283262 November 2011 Ceze et al.
      2012/0137164 May 2012 Uhlig
      2015/0347131 December 2015 Howe et al.
      2016/0092317 March 2016 Akiyama
      1330900 September 2007
      2866144 April 2015
      201741878 December 2017
      WO-2017204893 November 2017




























    • Other References:
      “U.S. Appl. No. 14/289,852, Advisory Action dated Apr. 14, 2017”, 3 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Advisory Action dated Jul. 13, 2016”, 6 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Examiner Interview Summary dated May 17, 2017”, 3 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Examiner Interview Summary dated Jun. 9, 2016”, 3 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Examiner Interview Summary dated Jul. 28, 2017”, 3 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Final Office Action dated Jan. 12, 2017”, 13 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Final Office Action dated Mar. 11, 2016”, 12 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Non Final Office Action dated Jun. 28, 2017”, 12 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Non Final Office Action dated Jul. 30, 2015”, 11 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Non Final Office Action dated Aug. 26, 2016”, 13 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Response filed Mar. 8, 2017 to Final Office Action dated Jan. 12, 2017”, 12 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Response filed Aug. 9, 2017 to Non-Final Office Action dated Jun. 28, 2017”, 10 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Response filed Nov. 30, 2015 to Non-Final Office Action dated Jul. 30, 2015”, 13 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Response filed Dec. 1, 2016 to Non-Final Office Action dated Aug. 26, 2016”, 13 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Response filed May 10, 2017 to Final Office Action dated Jan. 12, 2017”, 9 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Examiner Interview Summary dated Mar. 3, 2017”, 5 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Response filed Jun. 7, 2016 to Final Office Action dated Mar. 11, 2016”, 11 pgs. cited by applicant
      “International Application Serial No. PCT/US2017/023989, International Search Report dated Jul. 13, 2017”, 5 pgs. cited by applicant
      “International Application Serial No. PCT/US2017/023989, Written Opinion dated Jul. 13, 2017”, 10 pgs. cited by applicant
      “Linux Programmer's Manual—pthread_setcancelstate”, (Mar. 20, 2012), 3 pgs. cited by applicant
      Fagg, Graham E, et al., “Building and Using a Fault-Tolerant MPI Implementation”, (2004), 10 pgs. cited by applicant
      Howe, Ben, “System and Method for Input Data Fault Recovery in a Massively Parallel Realtime Computing System”, Raytheon Proprietary, (Aug. 30, 2015), 8 pgs. cited by applicant
      Kumar, Arvind, et al., “Fault Tolerance in Real Time Distributed System”, International Journal on Computer Science and Engineering. vol. 3 No. 2., (Feb. 2011), 933-939. cited by applicant
      Peng, Li, et al., “Deadlock avoidance for streaming computations with filtering”, Parallelism in Algorithms and Architectures, ACM, 2 Penn Plaza, Suite 701 New York NY 10121-0701 USA,, (Jun. 13, 2010), 243-252. cited by applicant
      Probe, et al., “MPI—2.2 documentation”, (Sep. 10, 2009), 5 pgs. cited by applicant
      Stephen, Cleary, “Detection of Half-Open (Dropped) Connections”, (May 16, 2009), 19 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Non Final Office Action dated Dec. 26, 2017”, 11 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Response filed Apr. 25, 2018 to Non-Final Office Action dated Dec. 26, 2017”, 12 pgs. cited by applicant
      “U.S. Appl. No. 14/289,852, Examiner Interview Summary dated Apr. 5, 2018”, 3 pgs. cited by applicant
    • Primary Examiner:
      Manoskey, Joseph D
    • Attorney, Agent or Firm:
      Schwegman, Lundberg & Woessner, P.A.
    • الرقم المعرف:
      edspgr.10078559