I. Introduction

Automatically classifying forum posts into categories (such as question, answer, feedback, and off-topic posts) can help in summarizing threads and enables efficient information retrieval. Previous approaches to this problem are either supervised or unsupervised. Supervised approaches [2, 3, 5] perform this classification task adequately, but their success comes at a great cost: a large amount of labelled data is required to reach that level of performance. As datasets grow and forum membership increases, manual labelling quickly becomes infeasible. Alternative approaches [1, 6, 7] do away with labelled data and opt for an unsupervised solution, which often comes with a decrease in performance. In this study, we explored novel statistical techniques for automatically clustering forum posts into dialogue acts using a semi-supervised approach. Our work on the unsupervised classification algorithm is discussed elsewhere.

II. Approach

Our semi-supervised algorithm expands on previous work. Barzilay and Lee [1] proposed an unsupervised approach involving a Hidden Markov Model (HMM) at the sentence level, tailored to match clusters of sentences to particular topics. Joty and colleagues [4] improved the model by introducing structural features, along with a Gaussian Mixture Model (GMM) for emission probabilities. Here, we propose an HMM that incorporates both structural and textual features. Furthermore, we explored representing the emission probabilities of the HMM with a GMM. Both models were implemented in a semi-supervised fashion. More generally, we believe that an HMM is an appropriate choice for representing sequential data, as it can implicitly factor in human knowledge (e.g., a solution cannot come before a question), and the GMM is reported to help reduce topical clustering, which is a problem in unsupervised techniques.
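For reference, the textbook HMM factorization that such a model builds on can be written as below. This is the standard formulation, not necessarily the exact parameterization used in this study. The symbols \(s_t\) (dialogue act of the \(t\)-th post in a thread), \(x_t\) (feature vector of that post), \(K\) (number of mixture components), and \(\pi_{s,k}, \mu_{s,k}, \Sigma_{s,k}\) (mixture weights, means, and covariances for state \(s\)) are notation introduced here for illustration.

\[P(x_{1:T}, s_{1:T}) = P(s_1)\,P(x_1 \mid s_1)\prod_{t=2}^{T} P(s_t \mid s_{t-1})\,P(x_t \mid s_t)\]

\[P(x_t \mid s_t = s) = \sum_{k=1}^{K} \pi_{s,k}\,\mathcal{N}(x_t;\ \mu_{s,k}, \Sigma_{s,k}), \qquad \sum_{k=1}^{K} \pi_{s,k} = 1\]

Under this reading, an ordering constraint such as "a solution cannot come before a question" would show up as a small initial or transition probability for the corresponding states, while the second equation is the GMM variant of the emission term.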

The semi-supervised approach proceeds as follows:

  1. Vectorize all posts by means of word n-gram frequency counts and feature occurrences.

  2. Cluster vectors that have a given gold label (semi-supervised aspect).

  3. Construct a Hidden Markov Model (each cluster obtained in step 2 corresponds to a hidden state, and each post corresponds to an observation from the given state). Then run the Expectation-Maximization algorithm:

    1. Expectation Step:

      1. Construct an n-gram+Feature language model for each state or fit a GMM for each state. This will be used to calculate emission probabilities of a post.

      2. Estimate the initial state probabilities given the observed state frequency counts.

    2. Maximization Step:

      1. Run the Viterbi algorithm to obtain the most likely state sequence and the updated HMM parameters.
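The counting and decoding steps in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `estimate_hmm` and `viterbi`, the add-alpha smoothing constant `alpha`, and the representation of emission scores as per-post dictionaries are all hypothetical. The emission log-probabilities are taken as given; in the approach described here they would come from the per-state n-gram+feature language model or the per-state GMM.

```python
import math
from collections import Counter


def estimate_hmm(labelled_threads, states, alpha=1.0):
    """Estimate initial and transition log-probabilities from labelled threads.

    labelled_threads: list of label sequences, one per thread.
    alpha: add-alpha smoothing constant (an assumption made for this sketch).
    """
    init = Counter(seq[0] for seq in labelled_threads if seq)
    trans = Counter(
        (a, b) for seq in labelled_threads for a, b in zip(seq, seq[1:])
    )
    n = len(states)
    init_total = sum(init.values())
    log_init = {
        s: math.log((init[s] + alpha) / (init_total + alpha * n)) for s in states
    }
    log_trans = {}
    for a in states:
        row_total = sum(trans[(a, b)] for b in states)
        for b in states:
            log_trans[(a, b)] = math.log(
                (trans[(a, b)] + alpha) / (row_total + alpha * n)
            )
    return log_init, log_trans


def viterbi(log_emit, states, log_init, log_trans):
    """Return the most likely state sequence for one thread.

    log_emit: one dict per post, mapping state -> log P(post | state).
    """
    if not log_emit:
        return []
    # Best log-score of any path ending in each state at the first post.
    best = {s: log_init[s] + log_emit[0][s] for s in states}
    backpointers = []
    for obs in log_emit[1:]:
        new_best, pointer = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[(p, s)])
            new_best[s] = best[prev] + log_trans[(prev, s)] + obs[s]
            pointer[s] = prev
        best = new_best
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path
```

Working in log space avoids numerical underflow on long threads, and the smoothing keeps transitions that never occur in the labelled portion from receiving a probability of exactly zero.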

To put the performance of our semi-supervised approaches in context, we also constructed a fully supervised baseline. Following a proven approach by Catherine and colleagues [2], we implemented a fully supervised Support Vector Machine (SVM) as an approximation of the upper limit on dialogue act classification performance. Specifically, we trained a Weka SMO (Sequential Minimal Optimization) SVM classifier on both n-grams and features.

III. Analysis

The following evaluation measures were used for each category C:

\[\textit{Precision} := \frac{\text{number of actual C posts predicted as C}}{\text{number of posts predicted as C}}\]

\[\textit{Recall} := \frac{\text{number of actual C posts predicted as C}}{\text{number of actual C posts}}\]

\[F_1\textit{-measure} := \frac{2 \times P \times R}{P + R}\]

where \(P\) and \(R\) denote Precision and Recall, respectively.
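As a small illustration of the three measures, the sketch below computes them for one category from parallel lists of gold and predicted labels. The function name `per_class_scores` is hypothetical, and returning 0 when a denominator is zero is a convention assumed here, not something stated in the text.

```python
def per_class_scores(gold, predicted, category):
    """Return (precision, recall, F1) for a single category C."""
    # Actual C posts that were also predicted as C.
    true_pos = sum(
        1 for g, p in zip(gold, predicted) if g == category and p == category
    )
    predicted_c = sum(1 for p in predicted if p == category)
    actual_c = sum(1 for g in gold if g == category)
    # Assumed convention: a measure with a zero denominator is reported as 0.
    precision = true_pos / predicted_c if predicted_c else 0.0
    recall = true_pos / actual_c if actual_c else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with gold labels `["Problem", "Solution", "Solution", "Feedback"]` and predictions `["Problem", "Solution", "Feedback", "Feedback"]`, the Solution category has one correct prediction out of one predicted and two actual posts, giving a precision of 1.0 and a recall of 0.5.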

The category-wise evaluation measures for the described techniques are listed in Table 1. As expected, the fully supervised technique outperforms the semi-supervised techniques. However, the semi-supervised techniques perform relatively well, with the HMM performing at a level close to the fully supervised method (macro-averaged \(F_1\) of 0.41 versus 0.46).

The methods perform adequately in most categories with the exception of Clarification and Clarification Request, both of which suffer from a lack of training examples.

Supervised Approach (SVM)

| Category | Precision | Recall | \(F_1\) |
|---|---|---|---|
| Problem | 0.73 | 0.78 | 0.76 |
| Solution | 0.65 | 0.75 | 0.69 |
| Clarification | 0.3 | 0.2 | 0.24 |
| Clarification Request | 0 | 0 | 0 |
| Feedback | 0.5 | 0.53 | 0.52 |
| Other | 0.62 | 0.52 | 0.57 |
| Macro-Avg | 0.47 | 0.46 | 0.46 |

Semi-Supervised (HMM)

| Category | Precision | Recall | \(F_1\) |
|---|---|---|---|
| Problem | 0.63 | 0.71 | 0.67 |
| Solution | 0.57 | 0.7 | 0.63 |
| Clarification | 0.25 | 0.15 | 0.19 |
| Clarification Request | 0.14 | 0.09 | 0.11 |
| Feedback | 0.39 | 0.29 | 0.33 |
| Other | 0.58 | 0.43 | 0.5 |
| Macro-Avg | 0.43 | 0.40 | 0.41 |

Semi-Supervised (HMM+GMM)

| Category | Precision | Recall | \(F_1\) |
|---|---|---|---|
| Problem | 0.91 | 0.60 | 0.72 |
| Solution | 0.73 | 0.22 | 0.34 |
| Clarification | 0.05 | 0.1 | 0.07 |
| Clarification Request | 0.04 | 0.16 | 0.07 |
| Feedback | 0.18 | 0.15 | 0.17 |
| Other | 0.32 | 0.61 | 0.42 |
| Macro-Avg | 0.37 | 0.31 | 0.34 |

Table 1: Supervised (SVM): results of the 10-fold cross-validation experiment for the fully supervised model described in Approach. Semi-Supervised (HMM): results of the 5-fold cross-validation experiment for the semi-supervised conversation model with POS tags and features. Semi-Supervised (HMM+GMM): results of the 5-fold cross-validation experiment for the semi-supervised model with GMM and features. Cross-validation used the smaller split for training and the larger split for testing (the opposite of conventional supervised machine learning practice).
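The inverted split described in the caption of Table 1 can be illustrated with the sketch below, in which each fold in turn serves as the training set and the remaining folds form the test set. The function name `inverted_kfold` is hypothetical, and the use of contiguous, unshuffled folds over item indices is an assumption; the text does not say how the folds were drawn (for instance, whether whole threads were kept together).

```python
def inverted_kfold(n_items, k):
    """Yield (train_indices, test_indices) pairs for inverted k-fold CV.

    Each of the k folds is used once as the (small) training set,
    while the other k - 1 folds together form the (large) test set.
    """
    indices = list(range(n_items))
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        train = indices[start:start + size]
        test = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

With `k = 5`, each model would be trained on roughly 20% of the items and tested on the remaining 80%, which mirrors a setting where labelled data is scarce.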

IV. Conclusion

The results of our study suggest that semi-supervised techniques are promising: they achieve a respectable middle ground between the low cost of unsupervised techniques and the high performance of fully supervised techniques. For future work, we hope to explore higher-order Markov chains, which would allow the model to learn longer-range dependencies between the categories. Our experiments have also highlighted another hurdle in the forum-post classification problem: a post can contain multiple dialogue acts (e.g., a single post can both offer a Solution to a Problem and give Feedback on another Solution). The model does not account for this; we suggest that summarization might be an important technique for retaining the overall meaning of a post while cutting out parts (dialogue acts) that are not representative.


I am grateful to Krish Perumal and Professor Graeme Hirst for their help and insight during the research process, and for their critical reading of the abstract. This work was completed with data provided by VerticalScope Inc. The research was supported by NSERC and VerticalScope. I also thank Afsaneh Fazly for comments and insight that greatly improved this work.


[1] R. Barzilay and L. Lee, "Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization," in Proc. of HLT-NAACL, 2004, pp. 113–120.

[2] R. Catherine et al., "Does Similarity Matter? The Case of Answer Extraction from Technical Discussion Forums," in Proc. of the 24th Int. Conf. on Computational Linguistics (COLING), 2012, pp. 175–184.

[3] L. Hong and B. D. Davison, "A Classification-Based Approach to Question Answering in Discussion Boards," in Proc. of the 32nd Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2009, pp. 171–178.

[4] S. Joty et al., "Unsupervised Modeling of Dialog Acts in Asynchronous Conversations," in Proc. of Int. Joint Conf. on Artificial Intelligence (IJCAI), 2011, pp. 1807–1813.

[5] S. Kim et al., "Tagging and Linking Web Forum Posts," in Proc. of the 14th Conf. on Computational Natural Language Learning (CoNLL), 2010, pp. 192–202.

[6] A. Ritter et al., "Unsupervised Modeling of Twitter Conversations," in Proc. of the 2010 Annual Conf. of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 172–180.

[7] Z. Qu and Y. Liu, "Finding Problem Solving Threads in Online Forum," in Proc. of 5th Int. Joint Conf. on Natural Language Processing (IJCNLP), 2011, pp. 1413–1417.