Latent (semantic) Analysis

The semantics are latent: that hidden meaning is what you're trying to analyse.

http://upload.wikimedia.org/wikipedia/commons/thumb/4/47/SVM_with_soft_margin.pdf/page1-640px-SVM_with_soft_margin.pdf.jpg

documentation pool

http://en.wikipedia.org/wiki/Latent_semantic_analysis // wiki
http://blog.josephwilk.net/projects/latent-semantic-analysis-in-python.html
http://lsa.colorado.edu/cgi-bin/LSA-matrix.html // lsa matrix applied to texts
http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?showall=1
http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/ // understanding for dummies
https://groente.puscii.nl/lsa-thesis.pdf

function/library

SVD needs NumPy or SciPy (numpy.linalg.svd or scipy.linalg.svd)
kwargs, args //

tf-idf Transform // In sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TFIDF (Term Frequency – Inverse Document Frequency). Under this method, the count in each cell is replaced by TFIDF(i,j) = (N(i,j) / N(*,j)) * log(D / D(i)), where N(i,j) is the number of times word i appears in document j, N(*,j) is the total number of words in document j, D is the number of documents, and D(i) is the number of documents in which word i appears.
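
A minimal sketch of that weighting with NumPy, assuming a word-by-document count matrix (rows are words, columns are documents) and the common variant where term frequency is the count divided by the document length and inverse document frequency is the log of documents over documents containing the word. The function name `tfidf` and the toy counts are made up for illustration.

```python
import numpy as np

def tfidf(counts):
    """Apply TF-IDF weighting to a word-by-document count matrix.

    Rows are words, columns are documents (an assumption of this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    words_per_doc = counts.sum(axis=0)        # total words in each document
    docs_per_word = (counts > 0).sum(axis=1)  # documents containing each word
    n_docs = counts.shape[1]                  # number of documents

    # Replace zero denominators with 1 so empty documents and unused
    # words do not cause a division by zero; their weights stay 0.
    tf = counts / np.where(words_per_doc == 0, 1, words_per_doc)
    idf = np.log(n_docs / np.where(docs_per_word == 0, 1, docs_per_word))
    return tf * idf[:, np.newaxis]

# Toy data (hypothetical): 3 words across 4 documents.
counts = np.array([
    [1, 1, 1, 1],   # appears in every document
    [2, 0, 0, 0],   # appears in one document only
    [0, 1, 1, 0],   # appears in half of the documents
])
weighted = tfidf(counts)
```

A word that appears in every document ends up with weight zero in this variant, because log(D / D) is 0, while the word confined to a single document keeps the largest weight.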

methods

https://technowiki.files.wordpress.com/2011/08/diagram1.png?w=640

Documents are represented as “bags of words”, where the order of the words in a document is not important, only how many times each word appears in a document.
Concepts are represented as patterns of words that usually appear together in documents. For example “leash”, “treat”, and “obey” might usually appear in documents about dog training.
Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks) but it makes the problem tractable.
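
The three assumptions above can be sketched as a small count-matrix builder. This is an illustration only: the helper name `count_matrix`, the whitespace tokenisation, and the toy documents are assumptions, not taken from any of the linked tutorials.

```python
from collections import Counter

def count_matrix(documents):
    """Build a word-by-document count matrix from raw strings.

    Each document becomes a bag of words: word order is discarded
    and only the per-document counts are kept.
    """
    bags = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*bags))
    # One row per word, one column per document.
    matrix = [[bag[word] for bag in bags] for word in vocabulary]
    return vocabulary, matrix

# Toy documents (hypothetical).
docs = [
    "leash treat obey treat",
    "obey the leash",
    "river bank",
    "bank loan bank",
]
vocabulary, matrix = count_matrix(docs)
```

Note that "bank" gets a single row here, so the river sense and the financial sense are counted together, which is exactly the one-meaning simplification described above.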

Code samples

http://www.puffinwarellc.com/lsa.py

    from numpy.linalg import svd
    from numpy import dot, diag

    u, sigma, vt = svd(matrix, full_matrices=False)
    for i in range(-k, 0):
        sigma[i] = 0  # Reduce k smallest singular values.
    matrix = dot(u, dot(diag(sigma), vt))

[[meh]]
[[words]]
[[dontknow]]

---------
The U matrix gives us the coordinates of each word on our “concept” space, the Vt matrix gives us the coordinates of each document in our “concept” space, and the S matrix of singular values gives us a clue as to how many dimensions or “concepts” we need to include.
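
A hedged sketch of reading those three pieces off `numpy.linalg.svd`. The toy matrix is made up, and the 80% "energy" threshold used to pick the number of concepts is an arbitrary choice for illustration, not a rule from the tutorials.

```python
import numpy as np

# Toy word-by-document count matrix (made-up data):
# rows are words, columns are documents.
matrix = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

u, s, vt = np.linalg.svd(matrix, full_matrices=False)

# Cumulative share of the squared singular values covered by the
# first 1, 2, ... concepts; a rough guide to how many to keep.
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.8) + 1)  # smallest k reaching 80%

word_coords = u[:, :k]     # one row per word, k concept coordinates
doc_coords = vt[:k, :].T   # one row per document, k concept coordinates
```

Here the singular values fall off sharply after the second one, so two concepts are kept; on real data the drop is usually less clean and the cut-off is a judgment call.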

Link to the LSA example in the Pattern library:

    https://github.com/clips/pattern/blob/820cccf33c6ac4a4f1564a273137171cfa6ab7cb/examples/05-vector/03-lsa.py




http://tech.blog.aknin.name/2011/12/11/walking-python-objects-recursively/