Homework 2

Character-level text generation. Complete the following steps and show your work in a Jupyter notebook. The notebook requirements are emphasized in bold.

Find and download a suitable dataset for training a character-level text generation model.
1. The text material can be English, Chinese, or Code.
2. The training data should contain at least 5000 sentences or a comparable amount of data.
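As a quick sanity check on the size requirement, the sketch below estimates how many sentences a text contains. The helper names `load_text` and `count_sentences` are hypothetical, and splitting on end-of-sentence punctuation and newlines is only a rough heuristic; for source code, counting non-empty lines is probably a more meaningful measure.

```python
import re

def load_text(path, encoding="utf-8"):
    """Read the whole training file into one string (path is whatever you downloaded)."""
    with open(path, encoding=encoding) as f:
        return f.read()

def count_sentences(text):
    """Rough sentence count: split on English/Chinese end punctuation and newlines."""
    parts = re.split(r"[.!?。！？\n]+", text)
    return sum(1 for p in parts if p.strip())

# In-memory sample so the cell runs without a file; replace with load_text(...) in practice.
sample = "Hello world. How are you? 你好。再见！"
print("approximate number of sentences:", count_sentences(sample))
```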
Gather the statistics of $p(x_t|x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ for the training dataset, where $x_i$ are characters in the training dataset. Do the following for at least 2 different values of $n$:
1. Store the information in a counter.
2. Show how many distinct tuples $(x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ there are in the training data.
3. Which tuple $(x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ appears most often in the dataset?
4. Show the top 3 candidates for $x_t$ that are most likely to appear right after the above tuple.
5. Generate 3 paragraphs of text according to the statistics. You may start from any tuple that appears in the dataset.
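The counting steps above can be sketched with the standard library alone. This is a minimal illustration on a toy string, not a reference solution: the function names are hypothetical, context tuples are stored in reading order (oldest character first, the reverse of the $(x_{t-1}, \ldots, x_{t-n})$ notation), and a context is counted only when a character actually follows it.

```python
import random
from collections import Counter, defaultdict

def build_ngram_counts(text, n):
    """Map each n-character context to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(n, len(text)):
        context = tuple(text[i - n:i])   # the n characters before position i
        counts[context][text[i]] += 1
    return counts

def most_common_context(counts):
    """Return (context, occurrences) for the most frequent context."""
    totals = ((ctx, sum(c.values())) for ctx, c in counts.items())
    return max(totals, key=lambda pair: pair[1])

def generate(counts, n, start, length, rng):
    """Sample up to `length` characters, each drawn in proportion to its count."""
    out = list(start)
    for _ in range(length):
        followers = counts.get(tuple(out[-n:]))
        if not followers:                # context never seen with a successor: stop
            break
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

# Toy data; substitute the real training text and at least two values of n.
text = "abababac"
for n in (1, 2):
    counts = build_ngram_counts(text, n)
    context, total = most_common_context(counts)
    print(f"n={n}: {len(counts)} distinct contexts")
    print("  most frequent context:", context, "seen", total, "times")
    print("  top 3 next characters:", counts[context].most_common(3))
    print("  sample:", generate(counts, n, context, 30, random.Random(0)))
```

Dividing each count by the total for its context gives the empirical estimate of $p(x_t|x_{t-1}, \ldots, x_{t-n})$; `random.choices` performs that normalization implicitly when sampling.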
Warm-up for neural networks and deep learning
1. Choose any deep learning framework. PyTorch Lightning and Keras are recommended.
2. Instead of using a counter, train a neural network to learn $p(x_t|x_{t-1}, x_{t-2}, \ldots, x_{t-n})$ or $p(x_t|x_{t-1}, x_{t-2}, \ldots)$.
3. The network can be an RNN, a 1D-CNN, or an MLP. You can limit $x_i$ to a subset of the characters, e.g. only the 100 to 1000 most frequent characters.
4. Generate 3 paragraphs of text using this neural network model.
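Whichever framework is chosen, the text first has to be turned into integer inputs and targets. The sketch below shows one possible framework-independent way to do that, assuming a fixed context length $n$: it keeps only the most frequent characters, maps everything else to a placeholder, and builds (context, next character) pairs. The `<unk>` token and all function names are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for characters outside the vocabulary

def build_vocab(text, max_size):
    """Keep the max_size most frequent characters; index 0 is reserved for UNK."""
    frequent = [ch for ch, _ in Counter(text).most_common(max_size)]
    itos = [UNK] + frequent                      # index -> character
    stoi = {ch: i for i, ch in enumerate(itos)}  # character -> index
    return stoi, itos

def encode(text, stoi):
    """Convert characters to integer ids, sending unknown characters to 0."""
    return [stoi.get(ch, 0) for ch in text]

def make_examples(ids, n):
    """Return (contexts, targets): n preceding ids and the id that follows them."""
    contexts = [ids[i - n:i] for i in range(n, len(ids))]
    targets = [ids[i] for i in range(n, len(ids))]
    return contexts, targets

# Toy data; with real text, max_size would be somewhere in the 100 to 1000 range.
stoi, itos = build_vocab("aaabbc", max_size=2)
ids = encode("aaabbc", stoi)
contexts, targets = make_examples(ids, n=2)
print(itos, ids, contexts, targets)
```

The resulting lists can be converted to the framework's tensor type and fed to, for example, an embedding layer followed by an MLP, 1D-CNN, or RNN whose output is a softmax over the `len(itos)` vocabulary entries, trained with a cross-entropy loss against `targets`.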