Sampling can be hard to beat, for text understanding
Wednesday October 21, 2020
If you have a lot of related documents (like thousands of reviews of a product, or answers to a survey question), you might be tempted to use an automatic procedure to summarize them (maybe making a word cloud, or applying LDA, or something fancier). No automated process is flawless, but if you need to work with many groups of documents, such an approach could be called for. Whether or not you use an automated technique, you should still start by reading a sample of the documents, and in fact reading may be the best way to understand what people have said.
This is a special case of a few more general principles:
- Look at the data
- Simple techniques can be very effective
- Start with a baseline for comparison
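Drawing the sample to read takes very little code. Here is a minimal sketch, assuming the documents are already loaded into a Python list of strings. The function name `sample_documents`, its parameters, and the stand-in `reviews` data are all made up for illustration. It picks documents uniformly at random without replacement, with a fixed seed so you can come back to the same sample later.

```python
import random


def sample_documents(documents, k=20, seed=0):
    """Return up to k documents chosen uniformly at random, without replacement."""
    # A dedicated generator with a fixed seed makes the sample reproducible.
    rng = random.Random(seed)
    # Never ask for more documents than exist.
    k = min(k, len(documents))
    return rng.sample(documents, k)


# Hypothetical stand-in data; in practice these would be real reviews
# or survey answers.
reviews = [f"review text {i}" for i in range(1000)]

for doc in sample_documents(reviews, k=5):
    print(doc)
```

Reading a few dozen documents chosen this way is also a reasonable baseline to compare any fancier summary against: if the automated output disagrees with what you saw in the sample, that is worth investigating.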