Investigating how real syntactic dependency trees differ from
random baselines across English, German, Hindi, and Japanese
using Surface-Syntactic Universal Dependencies treebanks.
Understanding the cognitive limitations that shape the dependency trees in our everyday language.
When we speak, every sentence forms a dependency tree: a rooted tree in which each word attaches to exactly one head. The central question of our research is: are these trees just random structures, or do they follow hidden, universal constraints?
We hypothesize that natural language trees are not accidental. Because of human memory and cognitive processing limits, languages evolve toward trees that are compact and easy to parse compared to what sheer randomness would produce.
If our hypothesis holds, human language should behave like an optimized system, exhibiting measurably bounded properties when compared to structurally matched random trees:
To test our hypothesis, we built an algorithmic pipeline that compares real sentences from the Surface-Syntactic Universal Dependencies (SUD) treebanks against random baselines across four typologically diverse languages: English, German, Hindi, and Japanese.
After computing graph-theoretic structural metrics, we trained a Multilingual GNN Classifier to distinguish real language structures from random noise based purely on extracted structural features such as depth and arity.
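The two core metrics in the pipeline, arity (maximum branching factor) and depth, are straightforward to compute. A minimal sketch, assuming a CoNLL-style 0-indexed head array where the root's head is marked −1; the actual `tree_arity` and `tree_depth` functions in our codebase operate on the built graph instead:

```python
from collections import Counter

def tree_arity(heads: list[int]) -> int:
    """Maximum branching factor: the most dependents attached to any single head."""
    counts = Counter(h for h in heads if h != -1)
    return max(counts.values(), default=0)

def tree_depth(heads: list[int]) -> int:
    """Longest chain of head links from any word up to the root, counted in edges."""
    def depth(i: int) -> int:
        d = 0
        while heads[i] != -1:
            i = heads[i]
            d += 1
        return d
    return max(depth(i) for i in range(len(heads)))

# "She reads books quickly": "reads" (index 1) is the root,
# and the other three words all attach directly to it.
heads = [1, -1, 1, 1]
print(tree_arity(heads))  # 3 (three dependents of "reads")
print(tree_depth(heads))  # 1 (flat, one-level tree)
```

A wide, flat sentence like this maximizes arity while keeping depth at 1; a fully nested chain would do the opposite. These are exactly the two quantities the violin plots below contrast against random baselines.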
Sentences analyzed from
Surface-Syntactic Universal Dependencies treebanks
Typologically diverse languages:
English, German, Hindi & Japanese
Higher arity & shallower depth in
real trees compared to random baselines
Violin plots comparing structural metrics of real dependency trees vs. randomly generated baselines with matched crossing counts.
Real trees exhibit significantly higher arity (max branching factor) than random baselines across all four languages, suggesting natural languages prefer wider, flatter structures.
Real dependency trees are consistently shallower than random counterparts, indicating that human sentence structures minimize deep nesting — likely for cognitive efficiency.
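The random baselines here are generated via Prüfer codes (the approach the Future Work section proposes to refine). A sketch of the standard decoding, which turns a uniformly random Prüfer sequence into a uniformly random labeled tree on n nodes; matching the crossing counts of the real sentence would be an additional filtering step not shown here:

```python
import random

def random_tree_from_pruefer(n: int, rng=random) -> list[tuple[int, int]]:
    """Decode a random Prüfer sequence into a uniform random labeled tree on n nodes."""
    if n < 2:
        return []
    seq = [rng.randrange(n) for _ in range(n - 2)]
    # Each node starts with degree 1; every appearance in the sequence adds 1.
    degree = [1] * n
    for s in seq:
        degree[s] += 1
    edges = []
    for s in seq:
        # Attach the smallest remaining leaf to the current sequence element.
        leaf = min(i for i in range(n) if degree[i] == 1)
        edges.append((leaf, s))
        degree[leaf] -= 1
        degree[s] -= 1
    # Exactly two nodes of degree 1 remain; they form the final edge.
    u, v = [i for i in range(n) if degree[i] == 1]
    edges.append((u, v))
    return edges

edges = random_tree_from_pruefer(10, random.Random(42))
print(len(edges))  # 9 edges, as any tree on 10 nodes must have
```

Because every Prüfer sequence corresponds to exactly one labeled tree, this sampler is uniform over all n^(n−2) labeled trees, which is what makes it a clean null model to compare real syntax against.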
We trained a classifier to distinguish real dependency trees from randomly generated ones based on structural features like arity, depth, and crossing patterns. Explore it live on our hosted environment.
from src.metrics import tree_arity, tree_depth
from src.graph_builder import build_dependency_graph

# `sentence` is a parsed SUD sentence; `model` is the trained classifier
# Build graph from parsed sentence
G = build_dependency_graph(sentence)

# Extract structural features
features = {
    "arity": tree_arity(G),
    "depth": tree_depth(G),
    "n_nodes": len(G.nodes),
}

# Classify: real or random?
prediction = model.predict(features)
print(f"Tree is {prediction}")
Directions for extending this research beyond the current study.
Extend the analysis to 20+ languages from diverse language families including Dravidian, Bantu, and Slavic languages, using the full UD repository. Removing the sentence length filter (currently ≤12 words) would capture more complex syntactic patterns.
Replace the Prüfer-code approach with more linguistically motivated random baseline models, such as random projective trees or constraint-based generation that preserves specific sub-tree patterns while randomizing overall structure.
Explore Graph Attention Networks (GAT) and GraphSAGE for richer representation learning on the dependency trees. Incorporate edge features (dependency relation types) in addition to structural metrics.
Investigate whether there are universal upper bounds on arity and depth across all human languages, and test formal hypotheses about dependency length minimization and cognitive processing constraints.
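Dependency length minimization, one of the formal hypotheses mentioned above, can be tested with a single scalar per sentence: the sum of linear distances between each word and its head. A sketch, assuming the same 0-indexed head array with −1 marking the root:

```python
def total_dependency_length(heads: list[int]) -> int:
    """Sum of |position of word - position of its head| over all non-root words."""
    return sum(abs(i - h) for i, h in enumerate(heads) if h != -1)

# "She reads books quickly": root "reads" at index 1,
# dependents at indices 0, 2, 3 -> distances 1 + 1 + 2.
print(total_dependency_length([1, -1, 1, 1]))  # 4
```

Under the minimization hypothesis, real sentences should show systematically lower totals than random linearizations of the same tree, which makes this metric a natural companion to the arity and depth comparisons in the current study.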
Correlate structural tree metrics with eye-tracking and reading-time data to establish direct links between syntactic tree complexity and human cognitive processing effort.