Author Topic: What is the performance of structure learning?  (Read 12961 times)

Offline Anders L Madsen

  • HUGIN Expert
  • Hero Member
  • *****
  • Posts: 2295
    • View Profile
What is the performance of structure learning?
« on: August 14, 2008, 08:58:18 »
The data used for learning is assumed to be a sample of cases generated by a process that can be described as a Bayesian network. The DAG structure of this Bayesian network is a referred to as the graph underlying the data generating process. Structure learning is the task of identifying the graph of the data generating process or an equivalent graph where two DAGs are equivalent if they induce the same set of (conditional) dependence and independence statements. 

Structure learning in HUGIN software is implemented using the PC and NPC algorithms. The two algorithms are constraint-based approaches with the NPC algorithm being an extension of the PC algorithm with a Necessary Path Condition.

A constraint-based structure learning algorithm identifies the structure of the underlying graph by performing a set of statistical tests for pairwise independence between each pair of variables. In principle, the independence of any pair of variables given any subset of other variables is tested.

The Necessary Path condition is applied to determine the validity of an independence statement derived by the statistical tests. The NPC algorithm is usually slower than the PC algorithm.

The HUGIN implementation of the NPC/PC algorithm performs statistical tests conditional on subsets of size 0,1,2,3. It stops testing for independence between a pair of variables once an independence statement has been found. The implementation also uses the structure of the graph to guide the selection of the test to perform next. This is referred to as the PC* algorithm in the literature.

The more sparse the underlying graph is the fewer tests have to be performed (as the algorithm stops testing when an independence statement has been found). If the graph is very dense a large number of independence tests have to fail. This may take up a significant amount of time.

In general it is very difficult to say what running times the user should expect as it depends on the actual data and the hardware. We know of users who have performed learning on data sets with 100s of variables and 1000s of cases.
HUGIN EXPERT A/S