- Patent Number:
11,735,174
- Appl. No:
17/964637
- Application Filed:
October 12, 2022
- Abstract:
A method of training a natural language neural network comprises obtaining at least one constituency span; obtaining a training video input; applying a multi-modal transform to the video input, thereby generating a transformed video input; comparing the at least one constituency span and the transformed video input using a compound Probabilistic Context-Free Grammar (PCFG) model to match the at least one constituency span with corresponding portions of the transformed video input; and using results from the comparison to learn a constituency parser.
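For orientation, the sketch below (Python/PyTorch) illustrates the training flow the abstract describes: transform the video input, compare it with constituency spans, and update the parser. The module names, shapes, and the placeholder pcfg_matching_loss are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class MultiModalTransform(nn.Module):
    """Stand-in for the multi-modal transform applied to the video features."""
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video_feats):           # (batch, seq_len, dim)
        return self.encoder(video_feats)      # transformed video input

def training_step(span_reprs, video_feats, mmt, pcfg_matching_loss, optimizer):
    """One step: transform the video, compare constituency spans with the
    transformed video through a compound-PCFG-style loss, and update."""
    transformed = mmt(video_feats)
    loss = pcfg_matching_loss(span_reprs, transformed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```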
- Inventors:
TENCENT AMERICA LLC (Palo Alto, CA, US)
- Assignees:
TENCENT AMERICA LLC (Palo Alto, CA, US)
- Claim:
1. A natural language neural network training method, performed by a computer device, the method comprising: obtaining at least one constituency span; obtaining a training video input, which includes at least one of the following rich features: action, object, scene, audio, face, optical character recognition (OCR); applying a multi-modal transform to the video input, thereby generating a transformed video input; comparing the at least one constituency span and the transformed video input using a compound Probabilistic Context-Free Grammar (PCFG) model to match the at least one constituency span with corresponding portions of the transformed video input; and using results from the comparison to learn a constituency parser, wherein the at least one constituency span and the transformed video input are compared according to the following formulas: [mathematical expression included] where c is a representation of the constituency span, Ξ={ξ_i}_{i=1}^{M} is an expert embedding projected via a gated embedding module, {u_i}_{i=1}^{M} are learned weights, and Ξ′ and Ψ′ are unmatched counterparts of Ξ and Ψ, respectively.
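The comparison formula itself is elided in this record ("[mathematical expression included]"). As a hedged sketch only, the following shows one common way to build a span/video score from the quantities the claim names: a gated embedding module projecting the span representation c, expert embeddings Ξ = {ξ_i}, and learned mixture weights {u_i}. The layer shapes and the exact form of the score o(Ξ, Ψ) are assumptions, not the claimed expression.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEmbedding(nn.Module):
    """Gated embedding unit: linear projection, sigmoid gate, L2 normalization."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)

    def forward(self, x):
        h = self.fc1(x)
        h = h * torch.sigmoid(self.fc2(h))          # gating
        return F.normalize(h, dim=-1)

class SpanVideoScore(nn.Module):
    """Assumed score: mixture of per-expert cosine similarities."""
    def __init__(self, span_dim, expert_dim, num_experts):
        super().__init__()
        self.gates = nn.ModuleList(
            [GatedEmbedding(span_dim, expert_dim) for _ in range(num_experts)])
        self.u = nn.Linear(span_dim, num_experts)   # learned mixture weights u_i

    def forward(self, c, xi):
        # c:  (B, span_dim)        span representation
        # xi: (B, M, expert_dim)   expert embeddings Ξ = {ξ_i}
        psi = torch.stack([g(c) for g in self.gates], dim=1)   # Ψ, (B, M, D)
        sims = F.cosine_similarity(psi, xi, dim=-1)            # (B, M)
        w = torch.softmax(self.u(c), dim=-1)                   # mixture weights
        return (w * sims).sum(dim=-1)                          # o(Ξ, Ψ), (B,)

# Toy usage with assumed dimensions: 6 experts, 4 span/video pairs.
score = SpanVideoScore(span_dim=256, expert_dim=512, num_experts=6)(
    torch.randn(4, 256), torch.randn(4, 6, 512))
```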
- Claim:
2. The natural language neural network training method according to claim 1, wherein after obtaining the training video input, and before the multi-modal transform is applied, the training video input is divided into feature sequence projections (F) according to the formula F^m = {f_i^m}_{i=1}^{L_m}, where f_i^m and L_m are the ith feature and the total number of features of the mth expert, the expert being an extraction of a video representation from M models trained on different tasks.
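A minimal sketch of the per-expert feature sequences F^m of claim 2, assuming one raw feature sequence per expert and a shared projection width; the expert names and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

# Assumed raw feature widths for the rich-feature experts named in the claims.
EXPERT_DIMS = {"action": 1024, "object": 2048, "scene": 512,
               "audio": 128, "face": 512, "ocr": 300}

class ExpertProjections(nn.Module):
    """Project each expert's sequence F^m = {f_i^m} to a shared width."""
    def __init__(self, shared_dim=512):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in EXPERT_DIMS.items()})

    def forward(self, expert_feats):
        # expert_feats: {name: (L_m, dim_m)} raw features of the mth expert
        return {name: self.proj[name](f) for name, f in expert_feats.items()}

# Example: two experts with different sequence lengths L_m.
proj = ExpertProjections()
feats = {"action": torch.randn(8, 1024), "audio": torch.randn(20, 128)}
F_seqs = proj(feats)   # each value now has shape (L_m, 512)
```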
- Claim:
3. The natural language neural network training method according to claim 2 , wherein the feature sequence projections (F) are used as an input to the multi-modal transform.
- Claim:
4. The natural language neural network training method according to claim 3, wherein the feature sequence projections (F), before being used as the input to the multi-modal transform, are concatenated together and take the form: X = [f_avg^1, f_1^1, . . . , f_{L_1}^1, . . . , f_avg^M, f_1^M, . . . , f_{L_M}^M], where f_avg^m is an averaged feature of {f_i^m}_{i=1}^{L_m}.
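Claim 4's concatenation can be sketched as follows, assuming all expert sequences have already been projected to a shared width; each expert block is prefixed with its averaged feature f_avg^m.

```python
import torch

def build_mmt_input(F_seqs):
    """F_seqs: list of per-expert sequences, each of shape (L_m, D)."""
    blocks = []
    for f_m in F_seqs:
        f_avg = f_m.mean(dim=0, keepdim=True)    # f_avg^m, shape (1, D)
        blocks.append(torch.cat([f_avg, f_m], dim=0))
    return torch.cat(blocks, dim=0)              # X, shape (sum_m (L_m + 1), D)

X = build_mmt_input([torch.randn(8, 512), torch.randn(20, 512)])
print(X.shape)   # torch.Size([30, 512])
```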
- Claim:
5. The natural language neural network training method according to claim 1, wherein a hinge loss for the video input is given by h_vid(Ξ, Ψ) = E_{c′}[o(Ξ′, Ψ) − o(Ξ, Ψ) + ε]_+ + E_{Ψ′}[o(Ξ, Ψ′) − o(Ξ, Ψ) + ε]_+, where ε is a positive margin.
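A hedged sketch of the hinge loss of claim 5, where the expectations over unmatched embeddings are approximated with pre-computed negative scores; how the negatives are drawn (for example, in-batch sampling) is an assumption.

```python
import torch

def hinge_loss(o_pos, o_neg_video, o_neg_span, eps=0.2):
    # o_pos:       (N,) tensor of scores o(Ξ, Ψ) for matched pairs
    # o_neg_video: (N,) tensor of scores o(Ξ′, Ψ) with unmatched video embeddings
    # o_neg_span:  (N,) tensor of scores o(Ξ, Ψ′) with unmatched span embeddings
    # eps:         positive margin ε
    term_video = torch.clamp(o_neg_video - o_pos + eps, min=0.0).mean()
    term_span = torch.clamp(o_neg_span - o_pos + eps, min=0.0).mean()
    return term_video + term_span
```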
- Claim:
6. The natural language neural network training method according to claim 5, wherein a video-text matching loss is defined as: s_vid(V, σ) = Σ_{c∈σ} p(c|σ) h_vid(Ξ, Ψ).
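Claim 6 weights the per-span hinge loss by the span posterior p(c|σ); a direct transcription, assuming the posteriors and hinge losses for all spans of a sentence are supplied as tensors:

```python
import torch

def video_text_matching_loss(span_posteriors, span_hinge_losses):
    # span_posteriors:   (num_spans,) tensor of p(c|σ) for each span c of σ
    # span_hinge_losses: (num_spans,) tensor of h_vid(Ξ, Ψ) for each span
    return (span_posteriors * span_hinge_losses).sum()   # s_vid(V, σ)
```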
- Claim:
7. The natural language neural network training method according to claim 6, wherein the PCFG model is optimized using the following formula: L(ϕ, θ) = Σ_{(V, σ)∈Ω} [−ELBO(σ; ϕ, θ) + α·s_vid(V, σ)], where α is a hyper-parameter balancing the loss terms and Ω is a set of video-sentence pairs.
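The overall objective of claim 7 sums, over the video-sentence pairs in Ω, the negative ELBO of the compound PCFG and the α-weighted matching loss; a direct transcription under the assumption that both terms are already computed per pair:

```python
def total_loss(batch, alpha=1.0):
    # batch: iterable of (neg_elbo, s_vid) for each video-sentence pair (V, σ) in Ω
    # alpha: hyper-parameter α balancing the two loss terms
    return sum(neg_elbo + alpha * s_vid for neg_elbo, s_vid in batch)
```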
- Claim:
8. The natural language neural network training method according to claim 7 , wherein, during inference, the method further includes predicting a most likely tree t* given a sentence σ without accessing videos.
- Claim:
9. The natural language neural network training method according to claim 8, wherein t* is estimated with the following approximation: t* = argmax_t ∫_z p_θ(t|z) p_θ(z|σ) dz ≈ argmax_t p_θ(t|σ, μ_ϕ(σ)), where μ_ϕ(σ) is a mean vector of a variational posterior q_θ(z|σ) and t* is obtained using a Cocke-Younger-Kasami algorithm.
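Claims 8 and 9 describe video-free inference: replace the integral over the latent z with the posterior mean μ_ϕ(σ) and decode the best tree with CYK. The sketch below assumes hypothetical encoder, pcfg_rule_scores, and cyk_decode components; none of these names come from the patent.

```python
import torch

def predict_tree(sentence_ids, encoder, pcfg_rule_scores, cyk_decode):
    """Approximate t* = argmax_t p_θ(t|σ, μ_ϕ(σ)) without accessing videos."""
    with torch.no_grad():
        mu, _log_var = encoder(sentence_ids)          # μ_ϕ(σ) from the posterior
        rule_log_probs = pcfg_rule_scores(sentence_ids, z=mu)
        return cyk_decode(rule_log_probs)             # best tree via CYK
```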
- Claim:
10. The natural language neural network training method according to claim 1 , wherein the model implements a Cocke-Younger-Kasami (CYK) algorithm.
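For completeness, a generic weighted CYK decoder (claim 10) over per-span split scores; this is the standard dynamic program, not the patent's specific parameterization.

```python
def cyk_best_tree(n, span_score):
    """n: sentence length; span_score(i, k, j): log score for splitting the
    span [i, j) into [i, k) and [k, j). Returns (best log score, tree)."""
    best, back = {}, {}
    for i in range(n):                       # width-1 spans are leaves
        best[(i, i + 1)] = 0.0
    for width in range(2, n + 1):            # fill the chart bottom-up
        for i in range(n - width + 1):
            j = i + width
            cands = [(best[(i, k)] + best[(k, j)] + span_score(i, k, j), k)
                     for k in range(i + 1, j)]
            best[(i, j)], back[(i, j)] = max(cands)
    def build(i, j):                         # recover the bracketing
        if j - i == 1:
            return i
        k = back[(i, j)]
        return (build(i, k), build(k, j))
    return best[(0, n)], build(0, n)

# Example with a toy scoring function that prefers right-branching splits.
score, tree = cyk_best_tree(4, lambda i, k, j: -abs(k - i - 1))
print(score, tree)   # 0.0 (0, (1, (2, 3)))
```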
- Claim:
11. An apparatus for training a natural language neural network, the apparatus comprising: at least one memory configured to store computer program code; and at least one processor configured to access the at least one memory and operate according to the computer program code, the computer program code comprising: first obtaining code configured to cause the at least one processor to obtain at least one constituency span; second obtaining code configured to cause the at least one processor to obtain a training video input, which includes at least one of the following rich features: action, object, scene, audio, face, optical character recognition (OCR); applying code configured to cause the at least one processor to apply a multi-modal transform to the video input, thereby generating a transformed video input; comparing code configured to cause the at least one processor to compare the at least one constituency span and the transformed video input using a compound Probabilistic Context-Free Grammar (PCFG) model to match the at least one constituency span with corresponding portions of the transformed video input; and learning code configured to cause the at least one processor to, using results from the comparison, learn a constituency parser, wherein the comparing code is configured such that the at least one constituency span and the transformed video input are compared according to the following formulas: [mathematical expression included] where c is a representation of the constituency span, Ξ={ξ_i}_{i=1}^{M} is an expert embedding projected via a gated embedding module, {u_i}_{i=1}^{M} are learned weights, and Ξ′ and Ψ′ are unmatched counterparts of Ξ and Ψ, respectively.
- Claim:
12. The apparatus according to claim 11, wherein the computer program code further comprises dividing code, configured to cause the at least one processor to, after executing the second obtaining code and before executing the applying code, divide the training video input into feature sequence projections (F) according to the formula F^m = {f_i^m}_{i=1}^{L_m}, where f_i^m and L_m are the ith feature and the total number of features of the mth expert, the expert being an extraction of a video representation from M models trained on different tasks.
- Claim:
13. The apparatus according to claim 12, wherein the dividing code is configured such that the feature sequence projections (F) are used as an input to the applying code.
- Claim:
14. The apparatus according to claim 13, wherein the dividing code is configured such that the feature sequence projections (F), before being used as the input to the applying code, are concatenated together and take the form: X = [f_avg^1, f_1^1, . . . , f_{L_1}^1, . . . , f_avg^M, f_1^M, . . . , f_{L_M}^M], where f_avg^m is an averaged feature of {f_i^m}_{i=1}^{L_m}.
- Claim:
15. The apparatus according to claim 11, wherein the comparing code is further configured such that a hinge loss for the video input is given by h_vid(Ξ, Ψ) = E_{c′}[o(Ξ′, Ψ) − o(Ξ, Ψ) + ε]_+ + E_{Ψ′}[o(Ξ, Ψ′) − o(Ξ, Ψ) + ε]_+, where ε is a positive margin.
- Claim:
16. The apparatus according to claim 15, wherein the comparing code is further configured such that a video-text matching loss is defined as: s_vid(V, σ) = Σ_{c∈σ} p(c|σ) h_vid(Ξ, Ψ).
- Claim:
17. The apparatus according to claim 16, wherein the comparing code is further configured such that the PCFG model is optimized using the following formula: L(ϕ, θ) = Σ_{(V, σ)∈Ω} [−ELBO(σ; ϕ, θ) + α·s_vid(V, σ)], where α is a hyper-parameter balancing the loss terms and Ω is a set of video-sentence pairs.
- Claim:
18. The apparatus according to claim 17 , wherein the comparing code is further configured such that during inference, the comparing code causes the at least one processor to predict a most likely tree t* given a sentence σ without accessing videos.
- Claim:
19. The apparatus according to claim 12 , wherein the model implements a Cocke-Younger-Kasami (CYK) algorithm.
- Claim:
20. A non-transitory computer-readable storage medium storing instructions that cause at least one processor to: obtain at least one constituency span; obtain a training video input, which includes at least one of the following rich features: action, object, scene, audio, face, optical character recognition (OCR); apply a multi-modal transform to the video input, thereby generating a transformed video input; compare the at least one constituency span and the transformed video input using a compound Probabilistic Context-Free Grammar (PCFG) model to match the at least one constituency span with corresponding portions of the transformed video input; and, using results from the comparison, learn a constituency parser, wherein the at least one constituency span and the transformed video input are compared according to the following formulas: [mathematical expression included] where c is a representation of the constituency span, Ξ={ξ_i}_{i=1}^{M} is an expert embedding projected via a gated embedding module, {u_i}_{i=1}^{M} are learned weights, and Ξ′ and Ψ′ are unmatched counterparts of Ξ and Ψ, respectively.
- Patent References Cited:
20140236571 August 2014 Quirk et al.
20140369596 December 2014 Siskind
20190303797 October 2019 Javali
20200251091 August 2020 Zhao
20220270596 August 2022 Song
- Other References:
Xu, N., Liu, A. A., Wong, Y., Zhang, Y., Nie, W., Su, Y., & Kankanhalli, M. (2018). Dual-stream recurrent neural network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology, 29(8), 2482-2493. (Year: 2018). cited by examiner
Zhang, S., Su, J., & Luo, J. (Oct. 2019). Exploiting temporal relationships in video moment localization with natural language. In Proceedings of the 27th ACM International Conference on Multimedia (pp. 1230-1238). (Year: 2019). cited by examiner
Malmaud, J., Huang, J., Rathod, V., Johnston, N., Rabinovich, A., & Murphy, K. (2015). What's cookin'? Interpreting cooking videos using text, speech and vision. arXiv preprint arXiv:1503.01558. (Year: 2015). cited by examiner
Xu, R., Xiong, C., Chen, W., & Corso, J. (Feb. 2015). Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the AAAI Conference on Artificial Intelligence (vol. 29, No. 1). (Year: 2015). cited by examiner
Yang, Y., Li, Y., Fermuller, C., & Aloimonos, Y. (Mar. 2015). Robot learning manipulation action plans by "watching" unconstrained videos from the world wide web. In Proceedings of the AAAI Conference on Artificial Intelligence (vol. 29, No. 1). (Year: 2015). cited by examiner
International Search Report dated Mar. 3, 2022 from the International Searching Authority in International Application No. PCT/US2021/063787. cited by applicant
Written Opinion dated Mar. 3, 2022 from the International Searching Authority in International Application No. PCT/US2021/063787. cited by applicant
- Assistant Examiner:
Lam, Philip H
- Primary Examiner:
Mehta, Bhavesh M
- Attorney, Agent or Firm:
Sughrue Mion, PLLC
- Identifier:
edspgr.11735174