Reading January

Christian Engels January 24, 2023 [TCS, Data structures]

Paper 1

Golovnev, Gur, Shinkar - Derandomization of Cell Sampling

Result

Previously there was a problem on $n$ inputs and $m$ possible queries:
- Such that any data structure that answers queries by probing $t$ memory cells requires space $s\geq \tilde \Omega(n (m/n)^{1/t})$
This paper improves this to $s\geq \tilde\Omega(n (m/n)^{1/(t-1)})$ for $t\geq 2$ for non-adaptive data structures.

Proof

Theorem: Let $G$ be a large enough $t$-hypergraph with at least $|V| (1+\epsilon)$ edges for large enough $\epsilon$. If $|E| \geq 3|V|\left( \frac{2^{t+3} |V| \log |V|}{k}\right)^{t-2}$ for $2^{t+2}\log |V| \leq k \leq |V|$. Then there exists a subset $S$ of size $|S|\leq k$ that spans at least $|S| + \frac{k}{2^{t+1}\log s}$ hyperedges.

How to use this to get the data structure lower bound?

A function $f:F^n \rightarrow F^m$ is called $k$-wise independent if for every possible output $S$ the uniform distribution of the $n$ inputs induces the uniform distribution of the outputs.
Proof:
- We build a data structure for the $k$-wise independent problem.
- We need to read at least $k$ memory cells.
- For $t=2$ construct a multigraph with $s$ vertices corresponding to the memory cells.
- Each edge will be a pair of cells read for a query.
- Then by the graph theorem, there exists a set of $k$ queries that depend on $k-1$ memory cells.
- Hence, it must satisfy that $s\geq m/(1+16 \log s /k)$.

Paper 2

Lovett, Zhang - Streaming Lower Bounds and Asymmetric Set-Disjointness

Result

Needle problem: $t$, number of samples, $n$ universe size, $p$ probability of needle.
Distinguish in a streaming setting with $s$ space the following two distributions:
- Uniformly sampled from $[n]$.
- Sample each with probability $p$ to be the needle $x\in [n]$ and with $1-p$ sample uniformly from $[n]$.
We assume with high probability all elements are unique by setting $n=\Omega(t^2)$.
$p=\Omega(1/t)$ as otherwise the distributions are too close.
Algorithms examples:
- Test if two adjacent elements are equal.
  - $t=\Omega(1/p^2)$ samples and space $\Theta(\log n)$.
- Store the entrie stream in memory and check for repeated elements.
  - $t=\Theta(1/p)$ but space $s=t\log n$.
Conjecture: Any single-pass streaming algorithm which can solve the needle problem satisfies $p^2st=\Omega(1)$.
Result: Any $l$ pass streaming algorithm satisfies $lp^2st\log t = \Omega(1)$.

Proof Aproach

Reduction to unique set disjointness in communication complexity.
Partition the stream into intervals $I_1,\dots,I_k$.
Any streaming algorithm can be transformed into a communication protocol that solves the $k$-party unqiue set disjointness.
This requires one needle per interval.
The rest of the paper is devoted to fixing this.

Paper 3

Benoit, Demaine, Munro, Raman, Raman, Rao-Representing Trees of Higher Degree
The lower bound for $k$-ary trees is $\log C_n^k = nk\log k -n(k-1)\log (k-1) - O(\log kn)$.
Their bound is $n\log k + n\log e$ (for a slowly growing function of $k(n)$).
- This answers queries of parent, $i$th child, subtree size in constant time.
There exists a data structure of size $\log C_n^k + o(n+\log k)$. See Paper 4
- This answers queries of parent, $i$th child with constant time and subtree size needs more time.
What strikes me odd is that it seems difficult to store labels in the tree and access this label efficiently.
- While some of this can be fixed it is unclear how to keep the query time for this constant.
- Take for example, storing the labels in an array with the order of the vertices being the index in the label.
- Getting from the an arbitrary node to the parent of the child takes probably the subtree encoding which is non constant in the optimal way.
- But I am not sure.

Proof

Jacobson Ordinal tree encoding:
- Represent a node of degree $d$ by $1^d0$.
- Write them in a level order traversal for the entire treee.
- This needs rank and select to answer child queries.
Munro and Raman encoding:
- Isomorphism with Ordinal tree (ordered trees). Use now a balanced parenthesis encoding.
This encoding: Write the unary degree encoding in DFS order.

Paper 4

Raman, Raman, Satti - Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets

Paper 5

Ferragina, Luccio, Manzini, Muthukrishnan-Structuring labeled trees for optimal succinctness and beyond
This should allowe trees with arbitrary labels and the operations you want with almost optimal encoding.
They define the following transcoding:
- Let $t=n+l$ where $n$ is the number of internal nodes and $l$ the size of the labels.
- Visit $T$ in pre-order and create for a vertex $u$: $(\text{last}(u), \alpha(u),\pi(u))$.
  - $\text{last}(u)$ is the binary flag if it is the last child of the parent.
  - $\pi(u)$ is the string obtained by concatenating the symbols on the upward path to the root.
  - $\alpha(u)$ is the label of $u$.
- Stable Sort this lexicographically according to $\pi(u)$.
- Let this array be called $S$ and let us call $S_\text{last}$ to call the projection to the first part of the tuple.
- $S_\text{last}$ has $n$ bits set to one.
  - I do not quite understand this. The upper bound is true but not every internal node is the last node of a parent.
- The other bits are set to zero.
- $S_\alpha$ contains all the labels of the nodes.
- $S_\pi$ contains all the upward labelled paths of $T$.
- Now we can find the following structure which comes from the sorting:
  - The first tuple is the root.
  - If $u,u'$ are two nodes with $\pi(u)=\pi(u')$. Then they have the same depth and $u$ is to the left of $u'$ if it preceeds it.
  - The children of a node $u_1,\dots,u_c$ lie continuously in the array and the last one has the bit set to one.
  - Let $u,u'$ be two nodes in $T$. If they preceed in the array they preceeded in the tree.
For navigation we need rank and select.
- Rank$_c(1,q)$ is the number of times c appears in $S[1..q]$.
- select$_c(S,q)$ is the position of the $q$th occurence of $c$ in $S$.
- These queries can be done with constant time using $\log \binom{|S|}{m} + o(m) + O(\log \log s)$ where $m$ is the number of ones.
Navigation on this array.
- GetChildren(i)
  - Get the label of the current node.
  - Find the first occurrence of this label (this is now the first child).
  - Find where the one is for this block of children.
  - The in between are now all the children of $i$.
- GetParent(i)
  - Find the symbol of the parent.
  - Compute the offset.

Paper 6

Brodal, Faderberg - Dynamic Representations of Sparse Graphs
Arboricity of a graph is given as $\max_J \frac{\lvert E(J)\rvert}{\lvert V(J)-1\rvert}$ for any subgraph $J$ with $\lvert V(J)\rvert\geq 2$.
This contains graph of bounded treewidth, planar and bounded genus.
Operations on the data structure:
- Adjacent(u,v) => true if $(u,v)\in E$
- Insert(u,v) => $E= (u,v)\cup E$
- Delete(u,v) => $E = E\setminus (u,v)$
- Build(G) => Construct the data structure.
Data structure uses $O(m+n)$ space and supports queries in adjacent in worst case $O(c)$, Insert in armortized $O(1)$ and Delete in armortized $O(c+\log n)$ where c is the arboricity of the graph.
Arboricity needs to be known.
The data structure is a simple adjacency list where we assign an orientation to every edge such that every vertex has outdegree $O(c)$.

Paper 7

(Raman, Raman, Rao - Succinct Dynamic Data Structures)[https://www.imsc.res.in/~vraman/pub/wads_01.pdf]
Solving Dynamic Data Structures for Partial Sums and Arrays
Partial sums solved with $kn+o(n)$ space and time $O(\log n/\log \log n)$ with $k$ being the logarithm of the size of the entries.
Array uses $o(n)$ extra space with $O(\log n/\log \log n)$ operation time.

Paper 8

(Chung, Larsen - Stronger 3SUM Indexing Lower bounds)[http://arxiv.org/abs/2203.09334v1]
3SUM:
- Given an abelian group $G$ and and two subsets of group elements $A_1,A_2$ does there for a given $z$ exists $a_1 +a_2=z$ with $a_1\in A_1, a_2\in A_2$.
There exists various hardness conjectures for this problem.
Old Theorem:
- Any non-adaptive data structure for 3SUM using $S$ words of size $w$ must have query time $T=\Omega(\lg n/\lg (Sw/n))$.
The make this theorem work for adaptive data structures.
Proof by reduction to butterfly graphs.

Proof

Butterfly graph:
- $d+1$ layers each having $B^d$ nodes.
- We look at a layer as a $d$ digit number in base $B$.
- Every vertex has the label of its number in $B$-ary representation.
- There is an edge from $i$ to $j$ if $i$ and $j$ differ only on the $k$th digit.
For any source sink pair there exists a unique path.
- This is easy to see. Otherwise, we would have two paths which flip different bits which can never be unflipped.
Butter fly graph reachability has the following lower bound
- $T=\Omega(d)$, then $B=\Omega(w^2)$ and $\lg(B) =\Omega(\lg (Sd/dB^d)$.
Construction of 3SUM reduction, i.e., given a butterfly graph, construct a 3SUM query:
- $A_1$:
  - The digits will be broken into 5 blocks.
  - First block encodes the layer the edge is from.
  - Second block encodes the presence of the edge $e(i,j)$ in layer $k$.
  - Third block encodes the $d-k$ most significant bits of $i$ followed by $k$ zeroes.
  - Fourth block encodes the $d-k-1$ zeroes followed by $k+1$ least significant digits of $j$.
  - Fifth block holds 2 zeroes.
- $A_2$:
  - First block holds some value $-k$.
  - Second block is zero.
  - Third block is $d-k$ zeroes followed by any possible $k$ digit value.
  - Fourth block holds any possible $d-k-1$ digit value followed by $k+1$ zeroes.
  - Fifth block holds any possible digit from $[0,B-1]$.
- Translting Reachability Query:
  - First notice that carries do not matter here.
    - A lot of the construction needs to make sure and that is why there are certain zeroes.
  - If there does not exist a path then there is an edge missing.
  - Then there is just all the parts fit together.
  - If there does exist a path and all have 1 at the second place.
  - This can never be made into a value zero at the second place.

Non Adaptive Bit Probe 3SUM Lower Bound

Lower bound of $\Omega(\min { \lg |G|, \lg (Sw/n), n/w})$.
Let $\Delta = n/2w$.
By an averaging counting argument there is a set of cells answering at least $|G| \binom{S-T}{\Delta-T}/ \binom{S}{\Delta}$ queries.
Now assume $T\leq \Delta/2$ as otherwise we are done.
This is at least $|G|^{1-o(1)} > n$.
Now we will prove that the queries cannot be answered from such few cells for a given distribution over $A_1,A_2$.
Lemma: There exists an input distribution for which the event of any element in a subset $Q$ being included in the sum $A_1 +A_2$ is fully independent.
This gives us an entropy of $n$.
But these cells chosen above have are $n/2w$ having $n/2$ bits.
As their addresses are fixed this gives us a contradiction.