
Method and system for recognizing human action in apron based on thermal infrared vision

  • Publication Date:
    March 04, 2025
  • Additional Information
    • Patent Number:
      12,243,314
    • Appl. No:
      18/007,599
    • Application Filed:
      December 06, 2021
    • Abstract:
      The present disclosure provides a method and system for recognizing human action in an apron based on thermal infrared vision. The method comprises: acquiring a plurality of video sequences from an infrared monitoring video; labeling a set target in each image frame in each video sequence with a target box to obtain a target tracking result; intercepting, for each image frame in the video sequence, a target-box enlarged area according to the labeled target box; adding, for each image frame in the video sequence, the position information of the image labeled with the target box to the target-box enlarged area to obtain a three-channel sub-image; training an action recognition model by using three-channel sub-image sequences corresponding to a plurality of video sequences as a training set, to obtain a trained action recognition model; obtaining a to-be-recognized video sequence from another infrared monitoring video, and obtaining a three-channel sub-image sequence corresponding to the to-be-recognized video sequence; and inputting the three-channel sub-image sequence corresponding to the to-be-recognized video sequence into the trained action recognition model to output a target action type.
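
The abstract's central construction, an enlarged square crop around the tracked target box stacked with abscissa and ordinate position channels, can be illustrated with a short NumPy sketch. This is a hedged reading, not the patented implementation: the (x, y, w, h) box format, the α·max(w, h) enlargement rule, 8-bit thermal frames, the coordinate normalization, and the function name are all assumptions.

```python
# A minimal sketch of the three-channel sub-image described in the abstract.
import numpy as np

def three_channel_sub_image(frame, box, alpha=1.5):
    """Crop an enlarged square around the target box and stack it with
    abscissa and ordinate position channels. Hypothetical helper."""
    x, y, w, h = box
    H, W = frame.shape
    side = int(alpha * max(w, h))             # assumed enlargement rule
    cx, cy = x + w // 2, y + h // 2           # target-box center
    left, top = cx - side // 2, cy - side // 2
    # Clip the square to the frame and paste it into a zero-padded canvas.
    x0, y0 = max(left, 0), max(top, 0)
    x1, y1 = min(left + side, W), min(top + side, H)
    crop = np.zeros((side, side), dtype=np.float32)
    crop[y0 - top:y1 - top, x0 - left:x1 - left] = frame[y0:y1, x0:x1] / 255.0
    # The abscissa / ordinate channels encode where each pixel of the crop
    # sits in the full frame, carrying the labeled position information.
    gx = np.arange(left, left + side, dtype=np.float32) / W
    gy = np.arange(top, top + side, dtype=np.float32) / H
    x_chan = np.tile(gx, (side, 1))           # abscissa channel image
    y_chan = np.tile(gy[:, None], (1, side))  # ordinate channel image
    return np.dstack([x_chan, y_chan, crop])  # shape (side, side, 3)
```

Applying this function to every frame of a tracked sequence, in chronological order, yields the three-channel sub-image sequence the abstract feeds to the action recognition model.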
    • Inventors:
      Nanjing University of Aeronautics and Astronautics (Jiangsu, CN)
    • Assignees:
      Nanjing University of Aeronautics and Astronautics (Jiangsu, CN)
    • Claim:
      1. A method for recognizing a human action in an apron based on thermal infrared vision, comprising: acquiring a plurality of video sequences from an infrared monitoring video, wherein the plurality of video sequences comprise a plurality of types of preset target actions; labeling a set target in each image frame in each video sequence with a target box to obtain a target tracking result, wherein the target tracking result comprises position information of an image labeled with the target box, in each frame; intercepting, for each image frame in each video sequence, a target-box enlarged area according to the labeled target box, wherein a side length of the target-box enlarged area is greater than a maximum side length of the corresponding labeled target box; adding, for each image frame in each video sequence, the position information of the image labeled with the target box to the target-box enlarged area to obtain a three-channel sub-image, wherein the three-channel sub-image includes an abscissa channel image, an ordinate channel image and an image corresponding to the target-box enlarged area, and various three-channel sub-images are arranged in chronological order to form a three-channel sub-image sequence; training an action recognition model by using a plurality of three-channel sub-image sequences corresponding to a plurality of video sequences as a training set, to obtain a trained action recognition model; obtaining a to-be-recognized video sequence from another infrared monitoring video, and obtaining a three-channel sub-image sequence corresponding to the to-be-recognized video sequence; and inputting the three-channel sub-image sequence corresponding to the to-be-recognized video sequence into the trained action recognition model to output a target action type.
    • Claim:
      2. The method according to claim 1, wherein the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, and an output of the spatial feature extraction network is connected to an input of the spatiotemporal feature extraction network; the spatial feature extraction network includes six convolutional layers and three maximum pooling layers; and the spatiotemporal feature extraction network includes three ConvLSTM layers.
    • Claim:
      3. The method according to claim 1, wherein an input of the action recognition model is a three-channel sub-image sequence of 30 frames.
    • Claim:
      4. The method according to claim 1, wherein the action recognition model further includes a Softmax function, and the Softmax function is used to determine classification results.
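
Claims 2 through 4 together outline the recognition network: six convolutional layers and three maximum pooling layers extracting per-frame spatial features, three ConvLSTM layers extracting spatiotemporal features, a 30-frame input, and a Softmax classifier. A minimal Keras sketch of one architecture consistent with those claims follows; the filter counts, kernel sizes, pooling placement, input resolution, number of action classes, and the pooling-plus-dense head are assumptions the patent does not specify.

```python
# A minimal sketch of an architecture consistent with claims 2-4.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN = 30        # claim 3: 30-frame three-channel sub-image sequence
SIDE = 64           # assumed sub-image resolution
NUM_CLASSES = 5     # hypothetical number of preset target action types

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, SIDE, SIDE, 3)),
    # Spatial feature extraction network (claim 2): six convolutional
    # layers and three maximum pooling layers, applied frame by frame.
    layers.TimeDistributed(layers.Conv2D(32, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.Conv2D(32, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(128, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.Conv2D(128, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    # Spatiotemporal feature extraction network (claim 2): three ConvLSTM layers.
    layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True),
    layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True),
    layers.ConvLSTM2D(64, 3, padding="same", return_sequences=False),
    # Softmax classification head (claim 4).
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```

Training would then follow claim 1, e.g. model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]) and model.fit(...) on the labeled three-channel sub-image sequences.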
    • Claim:
      5. The method according to claim 1, wherein the target-box enlarged area is a square, and a side length of the square is expressed as: [mathematical expression included] where L_i represents the side length of the target-box enlarged area corresponding to the i-th frame image in the video sequence, α is a scale coefficient, w_i represents the short side length of the target box, and h_i represents the long side length of the target box.
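
The record elides the expression itself, so the placeholder above is kept as-is. Since h_i is defined as the long side of the target box, one formula consistent with the listed variables and with claim 1's requirement that the square's side exceed the box's maximum side length is, purely as an illustrative assumption,

$$L_i = \alpha \, h_i, \qquad \alpha > 1,$$

which matches the α·max(w_i, h_i) rule assumed in the NumPy sketch after the abstract; the elided original evidently involves w_i as well.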
    • Claim:
      6. A system for recognizing a human action in an apron based on thermal infrared vision, comprising: a video sequence obtaining module configured to obtain a plurality of video sequences from an infrared monitoring video, wherein the plurality of video sequences include a plurality of types of preset target actions; a target box labeling module configured to label a set target in each image frame in each video sequence with a target box, to obtain a target tracking result, wherein the target tracking result includes position information of an image labeled with the target box, in each frame; a target box enlargement module configured to, for each image frame in each video sequence, intercept a target-box enlarged area according to the labeled target box, wherein a side length of the target-box enlarged area is greater than a maximum side length of the corresponding labeled target box; a three-channel sub-image sequence determining module configured to, for each image frame in each video sequence, add the position information of the image labeled with the target box to the target-box enlarged area so as to obtain a three-channel sub-image, wherein the three-channel sub-image includes an abscissa channel image, an ordinate channel image, and an image corresponding to the target-box enlarged area; and various three-channel sub-images are arranged in chronological order to form a three-channel sub-image sequence; an action recognition model training module configured to train an action recognition model by using a plurality of three-channel sub-image sequences corresponding to a plurality of video sequences as a training set, so as to obtain a trained action recognition model; a to-be-recognized video sequence obtaining module configured to obtain a to-be-recognized video sequence from another infrared monitoring video, and to obtain a three-channel sub-image sequence corresponding to the to-be-recognized video sequence; and a target action recognition module configured to input the three-channel sub-image sequence corresponding to the to-be-recognized video sequence into the trained action recognition model, so as to output a target action type.
    • Claim:
      7. The system according to claim 6, wherein the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, and an output of the spatial feature extraction network is connected to an input of the spatiotemporal feature extraction network; the spatial feature extraction network includes six convolutional layers and three maximum pooling layers; and the spatiotemporal feature extraction network includes three ConvLSTM layers.
    • Claim:
      8. The system according to claim 6, wherein an input of the action recognition model is a three-channel sub-image sequence of 30 frames.
    • Claim:
      9. The system according to claim 6, wherein the action recognition model further includes a Softmax function, and the Softmax function is used to determine classification results.
    • Claim:
      10. The system according to claim 6, wherein the target-box enlarged area is a square, and a side length of the square is expressed as: [mathematical expression included] where L_i represents the side length of the target-box enlarged area corresponding to the i-th frame image in the video sequence, α is a scale coefficient, w_i represents the short side length of the target box, and h_i represents the long side length of the target box.
    • Patent References Cited:
      US 11,055,872, July 2021, Chen
      CN 108985259, December 2018
      CN 110378259, October 2019
      CN 109255284, February 2021
      CN 113158983, July 2021

    • Other References:
      Xu, Lu, Xian Zhong, Wenxuan Liu, Shilei Zhao, Zhengwei Yang, and Luo Zhong. “Subspace enhancement and colorization network for infrared video action recognition.” In Pacific Rim International Conference on Artificial Intelligence, pp. 321-336. Cham: Springer International Publishing, 2021. (Year: 2021). cited by examiner
      Ding, Meng, Yuan-yuan Ding, Xiao-zhou Wu, Xu-hui Wang, and Yu-bin Xu. “Action recognition of individuals on an airport apron based on tracking bounding boxes of the thermal infrared target.” Infrared Physics & Technology 117 (2021): 103859. (Year: 2021). cited by examiner
    • Primary Examiner:
      Chan, Carol W
    • Attorney, Agent or Firm:
      Spencer Fane LLP
    • Accession Number:
      edspgr.12243314