Investigating how real syntactic dependency trees differ from
random baselines across English, German, Hindi, and Japanese
using Surface-Syntactic Universal Dependencies treebanks.
Understanding the cognitive limitations that shape the dependency trees in our everyday language.
When we speak, every sentence forms a dependency tree: a rooted tree in which each word attaches to exactly one head. The central question of our research is: are these trees just random structures, or do they follow hidden, universal constraints?
We hypothesize that natural language trees are not accidental. Because of human memory and cognitive processing limits, languages evolve toward trees that are compact and easy to parse compared to what sheer randomness would produce.
If our hypothesis holds, human language should behave like an optimized system, exhibiting measurably bounded properties when compared to structurally matched random trees:
To test our hypothesis, we built an algorithmic pipeline that compares real sentences from the Surface-Syntactic Universal Dependencies (SUD) treebanks against random baselines across four typologically diverse languages: English, German, Hindi, and Japanese.
After computing graph-theoretic structural metrics, we trained a Multilingual GNN Classifier to distinguish real language structures from random noise based purely on extracted structural features such as depth and arity.
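The two core metrics in the pipeline, arity (maximum branching factor) and depth, are straightforward to compute. A minimal sketch, assuming a CoNLL-style 0-indexed head array where the root's head is marked −1; the actual `tree_arity` and `tree_depth` functions in our codebase operate on the built graph instead:

```python
from collections import Counter

def tree_arity(heads: list[int]) -> int:
    """Maximum branching factor: the most dependents attached to any single head."""
    counts = Counter(h for h in heads if h != -1)
    return max(counts.values(), default=0)

def tree_depth(heads: list[int]) -> int:
    """Longest chain of head links from any word up to the root, counted in edges."""
    def depth(i: int) -> int:
        d = 0
        while heads[i] != -1:
            i = heads[i]
            d += 1
        return d
    return max(depth(i) for i in range(len(heads)))

# "She reads books quickly": "reads" (index 1) is the root,
# and the other three words all attach directly to it.
heads = [1, -1, 1, 1]
print(tree_arity(heads))  # 3 (three dependents of "reads")
print(tree_depth(heads))  # 1 (flat, one-level tree)
```

A wide, flat sentence like this maximizes arity while keeping depth at 1; a fully nested chain would do the opposite. These are exactly the two quantities the violin plots below contrast against random baselines.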
Sentences analyzed from
Surface-Syntactic Universal Dependencies treebanks
Typologically diverse languages:
English, German, Hindi & Japanese
Higher arity & shallower depth in
real trees compared to random baselines
Violin plots comparing structural metrics of real dependency trees vs. randomly generated baselines with matched crossing counts.
Real trees exhibit significantly higher arity (max branching factor) than random baselines across all four languages, suggesting natural languages prefer wider, flatter structures.
Real dependency trees are consistently shallower than random counterparts, indicating that human sentence structures minimize deep nesting — likely for cognitive efficiency.
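The random baselines here are generated via Prüfer codes (the approach the Future Work section proposes to refine). A sketch of the standard decoding, which turns a uniformly random Prüfer sequence into a uniformly random labeled tree on n nodes; matching the crossing counts of the real sentence would be an additional filtering step not shown here:

```python
import random

def random_tree_from_pruefer(n: int, rng=random) -> list[tuple[int, int]]:
    """Decode a random Prüfer sequence into a uniform random labeled tree on n nodes."""
    if n < 2:
        return []
    seq = [rng.randrange(n) for _ in range(n - 2)]
    # Each node starts with degree 1; every appearance in the sequence adds 1.
    degree = [1] * n
    for s in seq:
        degree[s] += 1
    edges = []
    for s in seq:
        # Attach the smallest remaining leaf to the current sequence element.
        leaf = min(i for i in range(n) if degree[i] == 1)
        edges.append((leaf, s))
        degree[leaf] -= 1
        degree[s] -= 1
    # Exactly two nodes of degree 1 remain; they form the final edge.
    u, v = [i for i in range(n) if degree[i] == 1]
    edges.append((u, v))
    return edges

edges = random_tree_from_pruefer(10, random.Random(42))
print(len(edges))  # 9 edges, as any tree on 10 nodes must have
```

Because every Prüfer sequence corresponds to exactly one labeled tree, this sampler is uniform over all n^(n−2) labeled trees, which is what makes it a clean null model to compare real syntax against.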
We trained a classifier to distinguish real dependency trees from randomly generated ones based on structural features like arity, depth, and crossing patterns. Explore it live on our hosted environment.
from src.metrics import tree_arity, tree_depth
from src.graph_builder import build_dependency_graph

# `sentence` is a parsed SUD sentence; `model` is the trained classifier
# Build graph from parsed sentence
G = build_dependency_graph(sentence)

# Extract structural features
features = {
    "arity": tree_arity(G),
    "depth": tree_depth(G),
    "n_nodes": len(G.nodes),
}

# Classify: real or random?
prediction = model.predict(features)
print(f"Tree is {prediction}")
Directions for extending this research beyond the current study.
Extend the analysis to 20+ languages from diverse language families including Dravidian, Bantu, and Slavic languages, using the full UD repository. Removing the sentence length filter (currently ≤12 words) would capture more complex syntactic patterns.
Replace the Prüfer-code approach with more linguistically motivated random baseline models, such as random projective trees or constraint-based generation that preserves specific sub-tree patterns while randomizing overall structure.
Explore Graph Attention Networks (GAT) and GraphSAGE for richer representation learning on the dependency trees. Incorporate edge features (dependency relation types) in addition to structural metrics.
Investigate whether there are universal upper bounds on arity and depth across all human languages, and test formal hypotheses about dependency length minimization and cognitive processing constraints.
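Dependency length minimization, one of the formal hypotheses mentioned above, can be tested with a single scalar per sentence: the sum of linear distances between each word and its head. A sketch, assuming the same 0-indexed head array with −1 marking the root:

```python
def total_dependency_length(heads: list[int]) -> int:
    """Sum of |position of word - position of its head| over all non-root words."""
    return sum(abs(i - h) for i, h in enumerate(heads) if h != -1)

# "She reads books quickly": root "reads" at index 1,
# dependents at indices 0, 2, 3 -> distances 1 + 1 + 2.
print(total_dependency_length([1, -1, 1, 1]))  # 4
```

Under the minimization hypothesis, real sentences should show systematically lower totals than random linearizations of the same tree, which makes this metric a natural companion to the arity and depth comparisons in the current study.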
Correlate structural tree metrics with eye-tracking and reading-time data to establish direct links between syntactic tree complexity and human cognitive processing effort.