Reading January
Christian Engels January 24, 2023 [TCS, Data structures]Paper 1
Result
- Previously there was a problem on $n$ inputs and $m$ possible queries:
- Such that any data structure that answers queries by probing $t$ memory cells requires space $s\geq \tilde \Omega(n (m/n)^{1/t})$
- This paper improves this to $s\geq \tilde\Omega(n (m/n)^{1/(t-1)})$ for $t\geq 2$ for non-adaptive data structures.
Proof
- Theorem: Let $G$ be a large enough $t$-hypergraph with at least $|V| (1+\epsilon)$ edges for large enough $\epsilon$. If $|E| \geq 3|V|\left( \frac{2^{t+3} |V| \log |V|}{k}\right)^{t-2}$ for $2^{t+2}\log |V| \leq k \leq |V|$. Then there exists a subset $S$ of size $|S|\leq k$ that spans at least $|S| + \frac{k}{2^{t+1}\log s}$ hyperedges.
How to use this to get the data structure lower bound?
- A function $f:F^n \rightarrow F^m$ is called $k$-wise independent if for every possible output $S$ the uniform distribution of the $n$ inputs induces the uniform distribution of the outputs.
- Proof:
- We build a data structure for the $k$-wise independent problem.
- We need to read at least $k$ memory cells.
- For $t=2$ construct a multigraph with $s$ vertices corresponding to the memory cells.
- Each edge will be a pair of cells read for a query.
- Then by the graph theorem, there exists a set of $k$ queries that depend on $k-1$ memory cells.
- Hence, it must satisfy that $s\geq m/(1+16 \log s /k)$.
Paper 2
Result
- Needle problem: $t$, number of samples, $n$ universe size, $p$ probability of needle.
- Distinguish in a streaming setting with $s$ space the following two distributions:
- Uniformly sampled from $[n]$.
- Sample each with probability $p$ to be the needle $x\in [n]$ and with $1-p$ sample uniformly from $[n]$.
- We assume with high probability all elements are unique by setting $n=\Omega(t^2)$.
- $p=\Omega(1/t)$ as otherwise the distributions are too close.
- Algorithms examples:
- Test if two adjacent elements are equal.
- $t=\Omega(1/p^2)$ samples and space $\Theta(\log n)$.
- Store the entrie stream in memory and check for repeated elements.
- $t=\Theta(1/p)$ but space $s=t\log n$.
- Test if two adjacent elements are equal.
- Conjecture: Any single-pass streaming algorithm which can solve the needle problem satisfies $p^2st=\Omega(1)$.
- Result: Any $l$ pass streaming algorithm satisfies $lp^2st\log t = \Omega(1)$.
Proof Aproach
- Reduction to unique set disjointness in communication complexity.
- Partition the stream into intervals $I_1,\dots,I_k$.
- Any streaming algorithm can be transformed into a communication protocol that solves the $k$-party unqiue set disjointness.
- This requires one needle per interval.
- The rest of the paper is devoted to fixing this.
Paper 3
- Benoit, Demaine, Munro, Raman, Raman, Rao-Representing Trees of Higher Degree
- The lower bound for $k$-ary trees is $\log C_n^k = nk\log k -n(k-1)\log (k-1) - O(\log kn)$.
- Their bound is $n\log k + n\log e$ (for a slowly growing function of $k(n)$).
- This answers queries of parent, $i$th child, subtree size in constant time.
- There exists a data structure of size $\log C_n^k + o(n+\log k)$. See Paper 4
- This answers queries of parent, $i$th child with constant time and subtree size needs more time.
- What strikes me odd is that it seems difficult to store labels in the tree and access this label efficiently.
- While some of this can be fixed it is unclear how to keep the query time for this constant.
- Take for example, storing the labels in an array with the order of the vertices being the index in the label.
- Getting from the an arbitrary node to the parent of the child takes probably the subtree encoding which is non constant in the optimal way.
- But I am not sure.
Proof
- Jacobson Ordinal tree encoding:
- Represent a node of degree $d$ by $1^d0$.
- Write them in a level order traversal for the entire treee.
- This needs rank and select to answer child queries.
- Munro and Raman encoding:
- Isomorphism with Ordinal tree (ordered trees). Use now a balanced parenthesis encoding.
- This encoding: Write the unary degree encoding in DFS order.
Paper 4
Paper 5
- Ferragina, Luccio, Manzini, Muthukrishnan-Structuring labeled trees for optimal succinctness and beyond
- This should allowe trees with arbitrary labels and the operations you want with almost optimal encoding.
- They define the following transcoding:
- Let $t=n+l$ where $n$ is the number of internal nodes and $l$ the size of the labels.
- Visit $T$ in pre-order and create for a vertex $u$: $(\text{last}(u), \alpha(u),\pi(u))$.
- $\text{last}(u)$ is the binary flag if it is the last child of the parent.
- $\pi(u)$ is the string obtained by concatenating the symbols on the upward path to the root.
- $\alpha(u)$ is the label of $u$.
- Stable Sort this lexicographically according to $\pi(u)$.
- Let this array be called $S$ and let us call $S_\text{last}$ to call the projection to the first part of the tuple.
- $S_\text{last}$ has $n$ bits set to one.
- I do not quite understand this. The upper bound is true but not every internal node is the last node of a parent.
- The other bits are set to zero.
- $S_\alpha$ contains all the labels of the nodes.
- $S_\pi$ contains all the upward labelled paths of $T$.
- Now we can find the following structure which comes from the sorting:
- The first tuple is the root.
- If $u,u'$ are two nodes with $\pi(u)=\pi(u')$. Then they have the same depth and $u$ is to the left of $u'$ if it preceeds it.
- The children of a node $u_1,\dots,u_c$ lie continuously in the array and the last one has the bit set to one.
- Let $u,u'$ be two nodes in $T$. If they preceed in the array they preceeded in the tree.
- For navigation we need rank and select.
- Rank$_c(1,q)$ is the number of times c appears in $S[1..q]$.
- select$_c(S,q)$ is the position of the $q$th occurence of $c$ in $S$.
- These queries can be done with constant time using $\log \binom{|S|}{m} + o(m) + O(\log \log s)$ where $m$ is the number of ones.
- Navigation on this array.
- GetChildren(i)
- Get the label of the current node.
- Find the first occurrence of this label (this is now the first child).
- Find where the one is for this block of children.
- The in between are now all the children of $i$.
- GetParent(i)
- Find the symbol of the parent.
- Compute the offset.
- GetChildren(i)
Paper 6
- Brodal, Faderberg - Dynamic Representations of Sparse Graphs
- Arboricity of a graph is given as $\max_J \frac{\lvert E(J)\rvert}{\lvert V(J)-1\rvert}$ for any subgraph $J$ with $\lvert V(J)\rvert\geq 2$.
- This contains graph of bounded treewidth, planar and bounded genus.
- Operations on the data structure:
- Adjacent(u,v) => true if $(u,v)\in E$
- Insert(u,v) => $E= (u,v)\cup E$
- Delete(u,v) => $E = E\setminus (u,v)$
- Build(G) => Construct the data structure.
- Data structure uses $O(m+n)$ space and supports queries in adjacent in worst case $O(c)$, Insert in armortized $O(1)$ and Delete in armortized $O(c+\log n)$ where c is the arboricity of the graph.
- Arboricity needs to be known.
- The data structure is a simple adjacency list where we assign an orientation to every edge such that every vertex has outdegree $O(c)$.
Paper 7
- (Raman, Raman, Rao - Succinct Dynamic Data Structures)[https://www.imsc.res.in/~vraman/pub/wads_01.pdf]
- Solving Dynamic Data Structures for Partial Sums and Arrays
- Partial sums solved with $kn+o(n)$ space and time $O(\log n/\log \log n)$ with $k$ being the logarithm of the size of the entries.
- Array uses $o(n)$ extra space with $O(\log n/\log \log n)$ operation time.
Paper 8
- (Chung, Larsen - Stronger 3SUM Indexing Lower bounds)[http://arxiv.org/abs/2203.09334v1]
- 3SUM:
- Given an abelian group $G$ and and two subsets of group elements $A_1,A_2$ does there for a given $z$ exists $a_1 +a_2=z$ with $a_1\in A_1, a_2\in A_2$.
- There exists various hardness conjectures for this problem.
- Old Theorem:
- Any non-adaptive data structure for 3SUM using $S$ words of size $w$ must have query time $T=\Omega(\lg n/\lg (Sw/n))$.
- The make this theorem work for adaptive data structures.
- Proof by reduction to butterfly graphs.
Proof
- Butterfly graph:
- $d+1$ layers each having $B^d$ nodes.
- We look at a layer as a $d$ digit number in base $B$.
- Every vertex has the label of its number in $B$-ary representation.
- There is an edge from $i$ to $j$ if $i$ and $j$ differ only on the $k$th digit.
- For any source sink pair there exists a unique path.
- This is easy to see. Otherwise, we would have two paths which flip different bits which can never be unflipped.
- Butter fly graph reachability has the following lower bound
- $T=\Omega(d)$, then $B=\Omega(w^2)$ and $\lg(B) =\Omega(\lg (Sd/dB^d)$.
- Construction of 3SUM reduction, i.e., given a butterfly graph, construct a 3SUM query:
- $A_1$:
- The digits will be broken into 5 blocks.
- First block encodes the layer the edge is from.
- Second block encodes the presence of the edge $e(i,j)$ in layer $k$.
- Third block encodes the $d-k$ most significant bits of $i$ followed by $k$ zeroes.
- Fourth block encodes the $d-k-1$ zeroes followed by $k+1$ least significant digits of $j$.
- Fifth block holds 2 zeroes.
- $A_2$:
- First block holds some value $-k$.
- Second block is zero.
- Third block is $d-k$ zeroes followed by any possible $k$ digit value.
- Fourth block holds any possible $d-k-1$ digit value followed by $k+1$ zeroes.
- Fifth block holds any possible digit from $[0,B-1]$.
- Translting Reachability Query:
- First notice that carries do not matter here.
- A lot of the construction needs to make sure and that is why there are certain zeroes.
- If there does not exist a path then there is an edge missing.
- Then there is just all the parts fit together.
- If there does exist a path and all have 1 at the second place.
- This can never be made into a value zero at the second place.
- First notice that carries do not matter here.
- $A_1$:
Non Adaptive Bit Probe 3SUM Lower Bound
- Lower bound of $\Omega(\min { \lg |G|, \lg (Sw/n), n/w})$.
- Let $\Delta = n/2w$.
- By an averaging counting argument there is a set of cells answering at least $|G| \binom{S-T}{\Delta-T}/ \binom{S}{\Delta}$ queries.
- Now assume $T\leq \Delta/2$ as otherwise we are done.
- This is at least $|G|^{1-o(1)} > n$.
- Now we will prove that the queries cannot be answered from such few cells for a given distribution over $A_1,A_2$.
- Lemma: There exists an input distribution for which the event of any element in a subset $Q$ being included in the sum $A_1 +A_2$ is fully independent.
- This gives us an entropy of $n$.
- But these cells chosen above have are $n/2w$ having $n/2$ bits.
- As their addresses are fixed this gives us a contradiction.