The semantics are latent (hidden in the text); that is what you are trying to analyse.

documentation pool // wiki // lsa matrix applied to texts // understanding for dummies


SVD needs SciPy or NumPy
kwargs, args //

tf-idf transform // In sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TF-IDF (Term Frequency – Inverse Document Frequency). Under this method, the count in each cell is replaced by its TF-IDF weight.
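A minimal sketch of the weighting step. The exact formula is not recorded in these notes, so this assumes one common variant: weight = (count of word i in document j / total words in document j) * log(number of documents / number of documents containing word i). The function name `tfidf` and the toy counts are made up for illustration.

```python
import numpy as np

def tfidf(counts):
    """Weight a word-by-document count matrix (rows: words, columns: documents)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    words_per_doc = counts.sum(axis=0)        # total words in each document
    docs_per_word = (counts > 0).sum(axis=1)  # documents containing each word
    # np.maximum guards against division by zero for empty rows or columns.
    tf = counts / np.maximum(words_per_doc, 1)
    idf = np.log(n_docs / np.maximum(docs_per_word, 1))
    return tf * idf[:, np.newaxis]

# Hypothetical counts: word 0 appears in every document, word 1 in only one.
counts = np.array([[1, 1, 1, 1],
                   [2, 0, 0, 0]])
weighted = tfidf(counts)
```

With this variant, a word that appears in every document gets weight 0 everywhere, while the rare word keeps a positive weight in the one document that contains it.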


  1. Documents are represented as “bags of words”: the order of the words in a document is not important, only how many times each word appears in it.
  2. Concepts are represented as patterns of words that usually appear together in documents. For example, “leash”, “treat”, and “obey” might usually appear in documents about dog training.
  3. Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks), but it makes the problem tractable.
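The bag-of-words assumption from point 1 can be sketched as a word-by-document count matrix. This is only an illustration: the toy corpus and the helper `count_matrix` are hypothetical, and tokenising by whitespace is a simplification.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "leash treat obey treat",
    "bank loan bank",
    "river bank",
]

def count_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word i in document j."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set(word for c in counts for word in c))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix

vocabulary, matrix = count_matrix(docs)
```

Word order is discarded, and “bank” gets a single row even though it means different things in the second and third documents, which is exactly the simplification in point 3.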

Code samples

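    from numpy import diag, dot
    from numpy.linalg import svd

    # matrix: word-by-document array; k: number of singular values to drop.
    u, sigma, vt = svd(matrix, full_matrices=False)
    for i in range(-k, 0):
        sigma[i] = 0  # Zero out the k smallest singular values (sigma is sorted descending).
    matrix = dot(u, dot(diag(sigma), vt))  # Rebuild the reduced matrix.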


The U matrix gives us the coordinates of each word in our “concept” space, and the Vt matrix gives us the coordinates of each document in that space. The S matrix of singular values (sigma in the code above) gives us a clue as to how many dimensions or “concepts” we need to include.
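A rough sketch of reading those three pieces off the SVD. The toy matrix is hypothetical, and picking the number of concepts by keeping 85% of the squared singular values is an arbitrary rule of thumb, not something these notes prescribe.

```python
import numpy as np

# Hypothetical word-by-document matrix (rows: words, columns: documents).
matrix = np.array([[2.0, 1.0, 0.0, 0.0],
                   [1.0, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 2.0],
                   [0.0, 0.0, 2.0, 1.0]])

u, sigma, vt = np.linalg.svd(matrix, full_matrices=False)

# Cumulative share of the squared singular values covered by the first n dimensions.
energy = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
# Smallest number of dimensions whose share reaches the (assumed) 0.85 threshold.
n_concepts = int(np.searchsorted(energy, 0.85) + 1)

word_coords = u[:, :n_concepts]     # one row per word
doc_coords = vt[:n_concepts, :].T   # one row per document
```

Here the singular values fall off sharply after the second one, so two concepts are kept, and each word and each document gets a two-dimensional coordinate.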

Link to the example in the patterns library: