"This notebook illustrates the sampling recipe mentioned in lab L1."
]
},
{
"cell_type": "markdown",
"id": "4167876e",
"metadata": {},
"source": [
"## Goal\n",
"\n",
"We want to sample words from a vocabulary with a probability that is proportional to their counts (absolute frequencies) in some given text. That is, if we have two words $w_1$ and $w_2$, where $w_2$ appears $k$ times as often as $w_1$, then the expected number of times we sample $w_2$ should be $k$ times the expected number of times we sample $w_1$."
]
},
{
"cell_type": "markdown",
"id": "81d0f6a3",
"metadata": {},
"source": [
"## Sampling recipe\n",
"\n",
"Imagine all the words in the vocabulary covering a line marked with numbers between 0 and the sum of all word frequencies, where each word covers an interval whose length equals its frequency. To sample a word, we choose a random point on that line and return the word whose interval includes the chosen point. In doing so, we sample each word with a probability that is proportional to its frequency."
]
},
{
"cell_type": "markdown",
"id": "1ded3597",
"metadata": {},
"source": [
"## Example\n",
"\n",
"We illustrate the sampling recipe with a concrete example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf86a87a",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import torch"
]
},
{
"cell_type": "markdown",
"id": "cc9f8460",
"metadata": {},
"source": [
"Here is a list of counts for words in a ten-word vocabulary:\n",
"\n",
"To implement the sampling recipe, we need the cumulative sums of these counts. We can get them with the function [`torch.cumsum()`](https://pytorch.org/docs/stable/generated/torch.cumsum.html).\n",
"\n",
"To choose a random point on the counts line, we sample a random number between 0 and 1 and multiply it by the sum of all counts, which is the last entry in the list of cumulative sums. Here we choose $5$ such points.\n",
"\n",
"To return the word whose interval on the counts line includes a chosen point, we use the function [`torch.searchsorted()`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). This function takes a sorted sequence and a tensor of values and finds the indices from the sorted sequence such that, if the corresponding values were inserted before the indices, the order of the corresponding dimension within the sorted sequence would be preserved. Here the sorted sequence is the list of cumulative sums, so the index returned for a point is the index of the word whose interval contains that point."
This notebook illustrates the sampling recipe mentioned in lab L1.
%% Cell type:markdown id:4167876e tags:
## Goal
We want to sample words from a vocabulary with a probability that is proportional to their counts (absolute frequencies) in some given text. That is, if we have two words $w_1$ and $w_2$, where $w_2$ appears $k$ times as often as $w_1$, then the expected number of times we sample $w_2$ should be $k$ times the expected number of times we sample $w_1$.
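Stated as a formula, the goal can be written as follows. The notation $c(w)$ for the count of word $w$ and $V$ for the vocabulary is an assumption introduced here for illustration:

$$P(w) = \frac{c(w)}{\sum_{w' \in V} c(w')}$$

For two words with $c(w_2) = k \cdot c(w_1)$, this gives $P(w_2) = k \cdot P(w_1)$.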
%% Cell type:markdown id:81d0f6a3 tags:
## Sampling recipe
Imagine all the words in the vocabulary covering a line marked with numbers between 0 and the sum of all word frequencies, where each word covers an interval whose length equals its frequency. To sample a word, we choose a random point on that line and return the word whose interval includes the chosen point. In doing so, we sample each word with a probability that is proportional to its frequency.
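Before turning to PyTorch, the recipe can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the three-word vocabulary, its counts, and the helper `sample_word` are hypothetical and do not come from the lab.

```python
import random
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy vocabulary and counts, chosen only for illustration.
words = ["the", "cat", "sat"]
counts = [5, 3, 2]

# Right endpoints of the intervals on the line from 0 to 10: 5, 8, 10.
endpoints = list(accumulate(counts))


def sample_word(rng):
    """Sample one word with probability proportional to its count."""
    # Choose a random point in [0, total).
    point = rng.random() * endpoints[-1]
    # Return the word whose interval contains that point.
    return words[bisect_right(endpoints, point)]


rng = random.Random(0)
samples = [sample_word(rng) for _ in range(10_000)]
relative_frequencies = {w: samples.count(w) / len(samples) for w in words}
```

With these counts, the relative frequencies should come out close to 0.5, 0.3, and 0.2.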
%% Cell type:markdown id:1ded3597 tags:
## Example
We illustrate the sampling recipe with a concrete example.
%% Cell type:code id:bf86a87a tags:
``` python
import numpy as np
import torch
```
%% Cell type:markdown id:cc9f8460 tags:
Here is a list of counts for words in a ten-word vocabulary:
To implement the sampling recipe, we need the cumulative sums of these counts. We can get them with the function [`torch.cumsum()`](https://pytorch.org/docs/stable/generated/torch.cumsum.html).
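As a minimal sketch of this step, the following computes the cumulative sums for a ten-word vocabulary. The count values are hypothetical and chosen only for illustration, since the notebook's actual counts are not shown here; the variable name `counts` is likewise an assumption.

```python
import torch

# Hypothetical counts for a ten-word vocabulary (not the notebook's original values).
counts = torch.tensor([10, 3, 8, 1, 6, 2, 9, 4, 7, 5])

# Running totals along dimension 0; entry i is the right endpoint
# of the interval that word i covers on the counts line.
cumulative_sums = torch.cumsum(counts, dim=0)
cumulative_sums
# tensor([10, 13, 21, 22, 28, 30, 39, 43, 50, 55])
```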
To choose a random point on the counts line, we sample a random number between 0 and 1 and multiply it by the sum of all counts, which is the last entry in the list of cumulative sums. Here we choose $5$ such points.
%% Cell type:code id:8f0b6933 tags:
``` python
random_points = torch.rand(5) * cumulative_sums[-1]
random_points
```
%% Cell type:markdown id:d0838e73 tags:
To return the word whose interval on the counts line includes a chosen point, we use the function [`torch.searchsorted()`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). This function takes a sorted sequence and a tensor of values and finds the indices from the sorted sequence such that, if the corresponding values were inserted before the indices, the order of the corresponding dimension within the sorted sequence would be preserved. Here the sorted sequence is the list of cumulative sums, so the index returned for a point is the index of the word whose interval contains that point.
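The lookup can be sketched as follows. The counts and the hand-picked points are hypothetical, and storing the counts as floats is an assumption made here so that the cumulative sums and the points share a dtype. The hand-picked points make it easy to check each result against the interval boundaries 10, 13, 21, 22, ..., 55.

```python
import torch

# Hypothetical counts, stored as floats so they share a dtype with the points.
counts = torch.tensor([10.0, 3.0, 8.0, 1.0, 6.0, 2.0, 9.0, 4.0, 7.0, 5.0])
cumulative_sums = torch.cumsum(counts, dim=0)  # 10, 13, 21, 22, 28, 30, 39, 43, 50, 55

# Hand-picked points on the counts line, to make the lookup easy to verify.
points = torch.tensor([0.5, 10.5, 12.9, 21.5, 54.9])
word_indices = torch.searchsorted(cumulative_sums, points)
word_indices
# tensor([0, 1, 1, 3, 9])

# The same lookup applied to 5 randomly chosen points.
random_points = torch.rand(5) * cumulative_sums[-1]
sampled_indices = torch.searchsorted(cumulative_sums, random_points)
```

Each returned index is the position of the first cumulative sum that is greater than or equal to the point, which is the word whose interval covers that point.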