CGS410 — Cognitive Science Research Project

Structural Constraints in
Human Language
Dependency Trees

Investigating how real syntactic dependency trees differ from
random baselines across English, German, Hindi, and Japanese
using Surface-Syntactic Universal Dependencies treebanks.

Deconstructing Syntax

Understanding the cognitive limitations that shape the dependency trees in our everyday language.

The Problem Statement

When we speak, every sentence forms a dependency tree: a rooted, directed tree in which each word attaches to exactly one head. The central question of our research is: are these trees just random structures, or do they follow hidden, universal constraints?
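Concretely, a parse can be stored as a head array, where entry i gives the position of word i's head and 0 marks the root. The helper below is a hypothetical sketch (not from the project's codebase) of what "each word attaches to exactly one head" requires: one root, and every word reaching it without cycles.

```python
def is_valid_tree(heads):
    """Check that a head array encodes a rooted tree: exactly one word
    attaches to the artificial root (0), and following head links from
    any word reaches the root without revisiting a node (no cycles)."""
    n = len(heads)                    # words are numbered 1..n
    if heads.count(0) != 1:           # exactly one root
        return False
    for i in range(1, n + 1):
        seen = set()
        node = i
        while node != 0:
            if node in seen:          # revisited a node: cycle
                return False
            seen.add(node)
            node = heads[node - 1]    # move to this word's head
    return True

# "She saw the dog": she <- saw, saw <- root, the <- dog, dog <- saw
print(is_valid_tree([2, 0, 4, 2]))   # True
print(is_valid_tree([2, 1]))         # False: no word attaches to the root
```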

Our Hypothesis

We hypothesize that natural language trees are not accidental. Owing to human memory and cognitive-processing limits, languages should evolve toward trees that are more compact and easier to parse than sheer randomness would produce.

Structural Constraints

If our hypothesis holds, human language should behave like an optimized system, exhibiting tightly bounded structural properties when compared with structurally equivalent random graphs:

Bounded Arity (Hubs) Real dependency trees should exhibit a bounded branching factor. We expect hubs to form around verbs, which connect multiple dependents.
Shallower Depth Real trees should avoid long dependency chains. Deep nesting is computationally heavy for the human brain to parse in real-time.
Matching Crossings To ensure a fair baseline, we generated random trees using Prüfer codes while controlling for the number of edge crossings (Gap Degree).
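The Prüfer-code baseline rests on a classical fact: every labelled tree on n nodes corresponds to a unique sequence of n−2 labels, so sampling the sequence uniformly samples labelled trees uniformly. The decoder below is a minimal sketch of that correspondence, assuming nodes labelled 1..n; the project's own generator may differ in details.

```python
import heapq
import random

def prufer_to_tree(seq, n):
    """Decode a Prüfer sequence (labels 1..n, length n-2) into the
    edge list of the unique labelled tree it encodes."""
    degree = [1] * (n + 1)            # degree[i] for node i; index 0 unused
    for x in seq:
        degree[x] += 1
    leaves = [i for i in range(1, n + 1) if degree[i] == 1]
    heapq.heapify(leaves)
    edges = []
    for x in seq:
        leaf = heapq.heappop(leaves)  # smallest current leaf
        edges.append((leaf, x))
        degree[x] -= 1
        if degree[x] == 1:            # x has just become a leaf
            heapq.heappush(leaves, x)
    # the two remaining leaves form the last edge
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

# Sampling the sequence uniformly yields a uniform random labelled tree
n = 6
seq = [random.randint(1, n) for _ in range(n - 2)]
tree = prufer_to_tree(seq, n)
```

For example, the sequence [2, 2] on four nodes decodes to a star centred on node 2.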

The Scientific Pipeline

To test our hypothesis, we built an algorithmic pipeline to compare real sentences from the Surface-Syntactic Universal Dependencies (SUD) treebanks against random baselines across four typologically diverse languages: English, German, Hindi, and Japanese.

After computing graph-theoretic metrics for each tree, we trained a multilingual GNN classifier to distinguish real language structures from random noise based purely on the extracted structural metrics, such as arity and depth.
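The internals of the GNN are not spelled out above, so as an illustration only, here is a minimal plain-Python sketch of the message-passing idea it relies on (the function names and one-dimensional features are hypothetical): in each round a node's feature becomes the mean of its own and its neighbours' features, and a final mean readout summarises the whole tree for the classifier.

```python
from collections import defaultdict

def message_pass(features, edges):
    """One round of mean-neighbour aggregation over an undirected tree:
    each node's new feature is the average of its own feature and its
    neighbours' features."""
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return {
        node: (f + sum(features[m] for m in nbrs[node])) / (1 + len(nbrs[node]))
        for node, f in features.items()
    }

def readout(features):
    """Mean-pool node features into a single graph-level summary."""
    return sum(features.values()) / len(features)

# Star tree: node 1 heads nodes 2 and 3; a marker feature on the root
feats = {1: 1.0, 2: 0.0, 3: 0.0}
feats = message_pass(feats, [(1, 2), (1, 3)])
print(readout(feats))
```

A real GNN would use learned weight matrices and several rounds, but the aggregate-then-pool shape is the same.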

01
Parse SUD Treebanks Load CoNLL-U formatted sentences and construct directed dependency graphs.
02
Compute Tree Metrics Calculate arity (max branching factor) and depth (longest root-to-leaf path).
03
Generate Baselines Create random trees with Prüfer codes, matched by crossing count via rejection sampling.
04
GNN Classification Train a Graph Neural Network to detect natural language structures.
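Step 03's matching can be sketched as rejection sampling: redraw random trees until the crossing count equals the real sentence's. In this hypothetical sketch, `random_tree_edges` is a simplified stand-in for the Prüfer generator (it attaches each node to a random earlier one), and two arcs count as crossing when their linear spans properly interleave.

```python
import random

def count_crossings(edges):
    """Count pairs of dependency arcs whose spans properly interleave
    in the linear word order, i.e. a < c < b < d."""
    cross = 0
    for i in range(len(edges)):
        a, b = sorted(edges[i])
        for j in range(i + 1, len(edges)):
            c, d = sorted(edges[j])
            if a < c < b < d or c < a < d < b:
                cross += 1
    return cross

def random_tree_edges(n):
    """Simplified stand-in for the Prüfer generator: attach each node
    to a uniformly random earlier node."""
    return [(random.randrange(1, i), i) for i in range(2, n + 1)]

def sample_matched_baseline(n, target_crossings, max_tries=10000):
    """Rejection sampling: keep redrawing random trees until the
    crossing count matches the real sentence's."""
    for _ in range(max_tries):
        edges = random_tree_edges(n)
        if count_crossings(edges) == target_crossings:
            return edges
    return None                       # give up if the target is too rare
```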
13,396

Sentences analyzed from
Surface-Syntactic Universal Dependencies treebanks

4 langs

Typologically diverse languages:
English, German, Hindi & Japanese

Real vs Random

Higher arity & shallower depth in
real trees compared to random baselines

Results across Languages

Violin plots comparing structural metrics of real dependency trees vs. randomly generated baselines with matched crossing counts.

Real vs Random Tree Arity across English, German, Hindi, Japanese

Tree Arity Comparison

Real trees exhibit significantly higher arity (max branching factor) than random baselines across all four languages, suggesting natural languages prefer wider, flatter structures.

Real vs Random Tree Depth across English, German, Hindi, Japanese

Tree Depth Comparison

Real dependency trees are consistently shallower than random counterparts, indicating that human sentence structures minimize deep nesting — likely for cognitive efficiency.

Try our ML Model

We trained a classifier to distinguish real dependency trees from randomly generated ones based on structural features like arity, depth, and crossing patterns. Explore it live on our hosted environment.

Graph Neural Network classifier
Trained on 13,396 real & random trees
Multi-language support (EN, DE, HI, JA)
Launch Model
predict.py
from src.metrics import tree_arity, tree_depth
from src.graph_builder import build_dependency_graph

# `sentence` is a parsed CoNLL-U sentence and `model` is the trained
# classifier, both loaded earlier in the pipeline

# Build a directed dependency graph from the parsed sentence
G = build_dependency_graph(sentence)

# Extract structural features
features = {
    "arity": tree_arity(G),
    "depth": tree_depth(G),
    "n_nodes": len(G.nodes),
}

# Classify: real or random?
prediction = model.predict(features)
print(f"Tree is {prediction}")

Future Scope

Directions for extending this research beyond the current study.

Extend the analysis to 20+ languages from diverse language families including Dravidian, Bantu, and Slavic languages, using the full UD repository. Removing the sentence length filter (currently ≤12 words) would capture more complex syntactic patterns.

Replace the Prüfer-code approach with more linguistically motivated random baseline models, such as random projective trees or constraint-based generation that preserves specific sub-tree patterns while randomizing overall structure.
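One simple scheme for the projective baseline mentioned above is sketched below (hypothetical, not the project's code): recursively pick a head for each contiguous span, so every subtree occupies a contiguous block of words and no arcs can cross. It is not uniform over all projective trees, since each head receives at most one child per side, but projectivity holds by construction.

```python
import random

def random_projective_tree(lo, hi, heads):
    """Pick a head for the span [lo, hi]; the words to its left and to
    its right each form a contiguous subtree, so no arcs can cross.
    Fills `heads` (child -> head) and returns the span's head."""
    h = random.randint(lo, hi)
    for sub_lo, sub_hi in ((lo, h - 1), (h + 1, hi)):
        if sub_lo <= sub_hi:
            child = random_projective_tree(sub_lo, sub_hi, heads)
            heads[child] = h          # subtree root attaches to h
    return h

heads = {}
root = random_projective_tree(1, 8, heads)
heads[root] = 0                       # attach the sentence root to 0
```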

Explore Graph Attention Networks (GAT) and GraphSAGE for richer representation learning on the dependency trees. Incorporate edge features (dependency relation types) in addition to structural metrics.

Investigate whether there are universal upper bounds on arity and depth across all human languages, and test formal hypotheses about dependency length minimization and cognitive processing constraints.

Correlate structural tree metrics with eye-tracking and reading-time data to establish direct links between syntactic tree complexity and human cognitive processing effort.