Add the notebook on the sampling recipe

693bd3a5 · Marco Kuhlmann · daa79130 · 693bd3a5
Commit 693bd3a5 authored 3 years ago by Marco Kuhlmann
--- a/labs/l1/Sampling_words_by_frequency.ipynb
+++ b/labs/l1/Sampling_words_by_frequency.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "8e7907e8",
+   "metadata": {},
+   "source": [
+    "# Sampling words by frequency"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "67167697",
+   "metadata": {},
+   "source": [
+    "This notebook illustrates the sampling recipe mentioned in lab L1."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4167876e",
+   "metadata": {},
+   "source": [
+    "## Goal\n",
+    "\n",
+    "We want to sample words from a vocabulary with a probability that is proportional to their counts (absolute frequencies) in some given text. That is, if we have two words $w_1$ and $w_2$, where $w_2$ appears $k$ times as often as $w_1$, then the expected number of times we sample $w_2$ should be $k$ times the expected number of times we sample $w_1$."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "81d0f6a3",
+   "metadata": {},
+   "source": [
+    "## Sampling recipe\n",
+    "\n",
+    "Imagine all the words in the vocabulary covering a line marked with numbers between 0 and the sum of all word frequencies, where each word covers an interval whose length equals its frequency. To sample a word, we choose a point on that line uniformly at random and return the word whose interval includes the chosen point. In doing so, we sample each word with a probability that is proportional to its frequency."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1ded3597",
+   "metadata": {},
+   "source": [
+    "## Example\n",
+    "\n",
+    "We illustrate the sampling recipe with a concrete example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bf86a87a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import torch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cc9f8460",
+   "metadata": {},
+   "source": [
+    "Here is a list of counts for words in a ten-word vocabulary:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "864d01f4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "counts = np.array([14507, 5014, 4602, 4529, 4000, 3219, 3010, 2958, 2225, 1271])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32bf71c5",
+   "metadata": {},
+   "source": [
+    "To implement the sampling recipe, we need the cumulative sums of these counts. We can get them with the function [`torch.cumsum()`](https://pytorch.org/docs/stable/generated/torch.cumsum.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f73b493e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cumulative_sums = torch.cumsum(torch.from_numpy(counts), dim=0)\n",
+    "cumulative_sums"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8cf8c78e",
+   "metadata": {},
+   "source": [
+    "To choose a random point on the counts line, we sample a uniformly distributed random number between 0 and 1 and multiply it by the sum of all counts, which is the last entry in the list of cumulative sums. Here we choose $5$ such points."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8f0b6933",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "random_points = torch.rand(5) * cumulative_sums[-1]\n",
+    "random_points"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0838e73",
+   "metadata": {},
+   "source": [
+    "To return the word whose interval on the counts line includes a chosen point, we use the function [`torch.searchsorted()`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). This function takes a sorted sequence and a tensor of values and, for each value, finds the index in the sorted sequence at which the value could be inserted such that the order of the sorted sequence is preserved. With the cumulative sums as the sorted sequence, this index is the index of the word whose interval contains the point."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d6035c26",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torch.searchsorted(cumulative_sums, random_points)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2e7f0b8c",
+   "metadata": {},
+   "source": [
+    "Good luck with the lab!"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
+%% Cell type:markdown id:8e7907e8 tags:
+
+# Sampling words by frequency
+
+%% Cell type:markdown id:67167697 tags:
+
+This notebook illustrates the sampling recipe mentioned in lab L1.
+
+%% Cell type:markdown id:4167876e tags:
+
+## Goal
+
+We want to sample words from a vocabulary with a probability that is proportional to their counts (absolute frequencies) in some given text. That is, if we have two words $w_1$ and $w_2$, where $w_2$ appears $k$ times as often as $w_1$, then the expected number of times we sample $w_2$ should be $k$ times the expected number of times we sample $w_1$.
+
+%% Cell type:markdown id:81d0f6a3 tags:
+
+## Sampling recipe
+
+Imagine all the words in the vocabulary covering a line marked with numbers between 0 and the sum of all word frequencies, where each word covers an interval whose length equals its frequency. To sample a word, we choose a point on that line uniformly at random and return the word whose interval includes the chosen point. In doing so, we sample each word with a probability that is proportional to its frequency.
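+
+The claim about proportionality can be spelled out in a short formula. The symbols $c_i$, $C_i$, $n$, and $u$ are introduced here for illustration only and are not part of the lab. Suppose word $w_i$ has count $c_i$, let $C_i = c_1 + \dots + c_i$ with $C_0 = 0$, and let $u$ be a point drawn uniformly from the line between $0$ and $C_n$. Word $w_i$ covers the interval from $C_{i-1}$ to $C_i$, so
+
+$$P(C_{i-1} < u \le C_i) = \frac{C_i - C_{i-1}}{C_n} = \frac{c_i}{\sum_{j=1}^{n} c_j}.$$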
+
+%% Cell type:markdown id:1ded3597 tags:
+
+## Example
+
+We illustrate the sampling recipe with a concrete example.
+
+%% Cell type:code id:bf86a87a tags:
+
+``` python
+import numpy as np
+import torch
+```
+
+%% Cell type:markdown id:cc9f8460 tags:
+
+Here is a list of counts for words in a ten-word vocabulary:
+
+%% Cell type:code id:864d01f4 tags:
+
+``` python
+counts = np.array([14507, 5014, 4602, 4529, 4000, 3219, 3010, 2958, 2225, 1271])
+```
+
+%% Cell type:markdown id:32bf71c5 tags:
+
+To implement the sampling recipe, we need the cumulative sums of these counts. We can get them with the function [`torch.cumsum()`](https://pytorch.org/docs/stable/generated/torch.cumsum.html).
+
+%% Cell type:code id:f73b493e tags:
+
+``` python
+cumulative_sums = torch.cumsum(torch.from_numpy(counts), dim=0)
+cumulative_sums
+```
+
+%% Cell type:markdown id:8cf8c78e tags:
+
+To choose a random point on the counts line, we sample a uniformly distributed random number between 0 and 1 and multiply it by the sum of all counts, which is the last entry in the list of cumulative sums. Here we choose $5$ such points.
+
+%% Cell type:code id:8f0b6933 tags:
+
+``` python
+random_points = torch.rand(5) * cumulative_sums[-1]
+random_points
+```
+
+%% Cell type:markdown id:d0838e73 tags:
+
+To return the word whose interval on the counts line includes a chosen point, we use the function [`torch.searchsorted()`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). This function takes a sorted sequence and a tensor of values and, for each value, finds the index in the sorted sequence at which the value could be inserted such that the order of the sorted sequence is preserved. With the cumulative sums as the sorted sequence, this index is the index of the word whose interval contains the point.
+
+%% Cell type:code id:d6035c26 tags:
+
+``` python
+torch.searchsorted(cumulative_sums, random_points)
+```
+
+%% Cell type:markdown id:2e7f0b8c tags:
+
+Good luck with the lab!
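
The recipe in the notebook above can also be checked without PyTorch. The following is a minimal sketch, not part of the commit, that uses only the Python standard library: `itertools.accumulate` stands in for `torch.cumsum()` and `bisect.bisect_left` stands in for `torch.searchsorted()` with its default settings. The helper name `sample_word_index`, the seed, and the sample size are assumptions made for illustration.

```python
import bisect
import itertools
import random

# Counts from the notebook's ten-word vocabulary.
counts = [14507, 5014, 4602, 4529, 4000, 3219, 3010, 2958, 2225, 1271]

# Right endpoint of each word's interval on the counts line.
cumulative_sums = list(itertools.accumulate(counts))
total = cumulative_sums[-1]


def sample_word_index(rng):
    """Return the index of the word whose interval contains a random point.

    Hypothetical helper for illustration; not defined in the notebook.
    """
    point = rng.random() * total  # uniform point on the counts line
    # First index i with cumulative_sums[i] >= point.
    return bisect.bisect_left(cumulative_sums, point)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
num_samples = 100_000
hits = [0] * len(counts)
for _ in range(num_samples):
    hits[sample_word_index(rng)] += 1

# Compare the relative count of each word with its observed sampling rate.
for index, (count, hit) in enumerate(zip(counts, hits)):
    expected = count / total
    observed = hit / num_samples
    print(f"word {index}: expected {expected:.3f}, observed {observed:.3f}")
```

With enough samples, the observed rates should come close to the relative counts, which is the behaviour the Goal section asks for.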