This study explores text-to-motion conversion using large language models, such as the generative pre-trained transformer (GPT) series, and other artificial intelligence (AI) models, such as Large Language Model Meta AI (LLaMA) and T5. It analyzes the structure and function of these models in detail and compares their performance in generating motion videos. Experiments with different techniques, such as root mean square normalization (RMSNorm) and absolute encoding, were conducted to identify the best approach to text-to-motion conversion. The results indicate that the T5 model performs better at generating motion from textual descriptions, particularly in presenting key motions and avoiding unnecessary ones. These findings offer valuable insights for the future development of motion generation technology.