1. Repeat before Forgetting: Spaced Repetition for Efficient and Effective Training of Neural Networks
2. A Simple Theoretical Model of Importance for Summarization
3. Attending to Future Tokens for Bidirectional Sequence Generation
4. Sequence Generation: From Both Sides to the Middle

## [Repeat before Forgetting: Spaced Repetition for Efficient and Effective Training of Neural Networks]

### Method

$d$ is the sample's current loss; $t$ is the number of epochs until the sample is reviewed again; $s$ is the performance on the validation data.

$x$ is the input variable shared by the different kernel functions:

$f_{gau}(x, \tau)=\exp\left(-\tau x^{2}\right)$

$f_{lap}(x, \tau)=\exp(-\tau x)$

$f_{lin}(x, \tau)=\begin{cases}1-\tau x & x<\frac{1}{\tau} \\ 0 & \text{otherwise}\end{cases}$

$f_{sec}(x, \tau)=\frac{2}{\exp\left(-\tau x^{2}\right)+\exp\left(\tau x^{2}\right)}$
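A minimal sketch of how these kernels could drive a review schedule (the function and parameter names here, e.g. `next_review_delay` and `max_delay`, are assumptions for illustration, not from the paper):

```python
import math

# Candidate kernels: map x (e.g., a sample's loss d) and a rate tau
# to a retention score in (0, 1]; a higher score means the sample can
# safely wait longer before its next review.
def f_gau(x, tau):
    return math.exp(-tau * x ** 2)

def f_lap(x, tau):
    return math.exp(-tau * x)

def f_lin(x, tau):
    return 1 - tau * x if x < 1 / tau else 0.0

def f_sec(x, tau):  # hyperbolic secant: 2 / (e^{-tau x^2} + e^{tau x^2})
    return 2 / (math.exp(-tau * x ** 2) + math.exp(tau * x ** 2))

# Hypothetical scheduler: easy samples (low loss) get long review delays,
# hard samples (high loss) are reviewed again on the next epoch.
def next_review_delay(loss, tau=1.0, max_delay=8, kernel=f_gau):
    return max(1, round(max_delay * kernel(loss, tau)))
```

With this sketch, a sample with near-zero loss is deferred for `max_delay` epochs, while a high-loss sample is re-reviewed immediately.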

## [A Simple Theoretical Model of Importance for Summarization]

$D$ is the source document; $S$ is the candidate summary.

### Redundancy

Redundancy is measured by the entropy of $S$: $Red(S) = H_{max} - H(S)$. A low-entropy summary repeats itself and is therefore redundant.

### Relevance

A summary with low expected surprise leaves low uncertainty about the original source. The summary is a lossy compression of the source document, so this information loss should be as small as possible; formally, $Rel(S, D) = -CE(S, D)$.

The connection between relevance and redundancy:

The KL divergence $KL(S \| D) = CE(S, D) - H(S)$ is the information lost when $D$ is used as an approximation of $S$. Minimizing it simultaneously minimizes $CE(S, D)$ (maximizing Relevance) and maximizes $H(S)$ (minimizing Redundancy), so a summarizer that minimizes the KL divergence minimizes Redundancy while maximizing Relevance.

### Informativeness

Relevance ignores background knowledge. Bringing it in as $K$: a summary is informative if it induces, for a user, a great change in her knowledge about the world.

For Informativeness, the cross-entropy $CE(S, K)$ between $S$ and the background knowledge $K$ should be high, because it measures the amount of new information the summary adds to our knowledge.
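The three quantities above can be sketched on unigram distributions. This is an illustrative toy (the smoothing constant `eps` and the example sentences are assumptions, not from the paper):

```python
import math
from collections import Counter

def unigram_dist(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(p):
    return -sum(pw * math.log(pw) for pw in p.values())

def cross_entropy(p, q, eps=1e-12):
    # CE(P, Q) = -sum_w P(w) log Q(w); eps smooths words unseen in Q.
    return -sum(pw * math.log(q.get(w, eps)) for w, pw in p.items())

S = unigram_dist("the cat sat on the mat".split())          # summary
D = unigram_dist("the cat sat on the mat by the door".split())  # source
K = unigram_dist("dogs chase cats in the park".split())     # background

red = -entropy(S)                       # higher H(S) -> less redundant
rel = -cross_entropy(S, D)              # Rel(S, D) = -CE(S, D)
inf = cross_entropy(S, K)               # Inf(S, K) = CE(S, K), should be high
kl = cross_entropy(S, D) - entropy(S)   # KL(S||D) = CE(S, D) - H(S)
```

Note how `kl` recovers the decomposition above: minimizing it pushes `rel` up and `red` down at the same time.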

## [Attending to Future Tokens for Bidirectional Sequence Generation]

### Method

- One-step greedy: recover all placeholders simultaneously in a single step.
- Highest probability: uncover one token per step, choosing the position with the highest-probability prediction.
- Lowest entropy: uncover the position whose predictive distribution has the lowest entropy, i.e., the least uncertainty.
- Left-to-right: uncover tokens from left to right, but the model can still attend to the future placeholders.
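The middle two uncovering orders can be sketched as follows. This is a simplification under an assumed interface: `probs[i]` is the model's distribution at placeholder position `i`, and in the real model the distributions would be recomputed after each uncovering step rather than held fixed:

```python
import math

def highest_probability_order(probs):
    # One position per step: pick the position whose best token is
    # predicted with the highest probability.
    remaining = set(range(len(probs)))
    order = []
    while remaining:
        i = max(remaining, key=lambda j: max(probs[j].values()))
        order.append(i)
        remaining.remove(i)
    return order

def lowest_entropy_order(probs):
    # Pick positions in order of increasing entropy (least uncertainty first).
    def H(p):
        return -sum(v * math.log(v) for v in p.values() if v > 0)
    return sorted(range(len(probs)), key=lambda i: H(probs[i]))
```

For peaked distributions the two orders often agree, but they can diverge: a flat-but-high-max distribution ranks well under highest probability while ranking poorly under lowest entropy.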

## [Sequence Generation: From Both Sides to the Middle]

Decoding can be sped up at inference time, and each direction can attend to tokens generated from the other end (the future), which alleviates under-translation.
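A hypothetical sketch of the decoding loop (the `step_l2r`/`step_r2l` interface is assumed for illustration, not the paper's actual synchronous decoder): two directions emit tokens alternately, one left-to-right and one right-to-left, and generation stops when the halves meet in the middle.

```python
def decode_both_ends(step_l2r, step_r2l, max_len):
    # step_l2r / step_r2l: given the current left and right halves,
    # return the next token for that direction. Each step sees the
    # other half, i.e., its own "future" tokens.
    left, right = [], []
    while len(left) + len(right) < max_len:
        left.append(step_l2r(left, right))
        if len(left) + len(right) >= max_len:
            break
        right.insert(0, step_r2l(left, right))
    return left + right
```

Since roughly two tokens are produced per round, the number of decoding rounds is about half that of a purely left-to-right decoder, which is where the inference speedup comes from.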