This thesis proposes a new learning-based method for generating dance poses from given music clips. Prior approaches often address the choreography generation task with models built on recurrent networks or transformers, which makes them hardware-demanding and time-consuming. We instead propose a network architecture based on convolutional layers to explore how far a lightweight approach can go. Experimental results on in-the-wild videos establish a baseline on several beat-related metrics and a new self-similarity metric for dance sequence generation, and validate the effectiveness of our method.