Search Engine Marketers and their LSI Myths
Saturday, June 23, 2007
As with many IR topics, LSI is a subject that from time to time surfaces in the search engine marketing industry through forums, conferences and events. Often, these discussions are limited to partially quoting or interpreting IR papers and patents.
At the time of writing, this industry doesn't provide its members with stepwise how-to instructions for implementing LSI, even though most of the information is available online. Consequently, many search marketers don't understand LSI. In particular, they don't seem to grasp the advantages and limitations of the technique, what is and is not LSI, or what it can or cannot do for them or their clients.
The result is the dissemination of inaccurate information. For instance, some marketers have assigned a meaning to the terms "latent" and "semantic" that is not in the LSI literature. Others have become "experts" at quoting each other's hearsay. In an effort to sell their services, still others have come up with "LSI-based" software, videos, "lessons", tools, etc., that are at best a caricature of how a search engine or IR system implements LSI. Whatever these tools score is probably not what a search engine like Google or Yahoo might be scoring.
Some of these marketing companies even display a tag cloud of words and try to sell the idea that they have a real or unique "LSI technology". Such clouds are easy to construct and link to search result pages. They can be built from any lookup list, thesaurus or search log file. No SVD is needed. In an effort to save face and avoid litigation from consumers, some of these purveyors of falsehood, like other crooks and their friends, play with words and call theirs "LSI-like", "LSI-based" or "LSI-driven" technology, or use similar slippery phrases. The funny thing is that other SEOs, bloggers, and marketers fall for these tactics. And how can we forget the "LSI and Link Popularity" half lies and half crap promoted by those who offer link-based services? As usual, SEO bloggers repeat such hearsay like parrots, since most of them don't really know how to SVD a simple matrix.
Since there is now a crew of search marketing firms claiming to sell all sorts of LSI-based SEO services and making a profit out of the ignorance of consumers, I am making public my case against these firms in the post:
Latest SEO Incoherences (LSI) - My Case Against "LSI based" Snakeoil Marketers
Stay away from such marketing firms, their claims and businesses.
By providing an SVD-LSI tutorial, complete with step-by-step how-to calculations and examples, I hope to put to rest the many myths and misquotes disseminated by these search marketers. Here is a list of the most common misconceptions:
Latent Semantic Indexing (LSI) ...
is a query operator, like a proximity operator ("~").
is limited to English documents.
is limited to text.
is theming (analysis of themes).
is used by search engines to find all the nouns and verbs, and then associate them with related (substitution-useful) nouns and verbs.
allows search engines to "learn" which words are related and which noun concepts relate to one another.
is a form of on-topic analysis (term scope/subject analysis).
can be applied to collections of any size.
has no problem addressing polysemy (terms with different meanings).
is a kind of "associative indexing" used in stemming.
is document indexing.
can be implemented by a search engine if the system can understand the query.
is really important only when you have several keywords that are related by category.
is not too computationally expensive.
is a Google update.
is implemented as LSI/IDF.
is an anchor text thing.
is a link building thing.
scores regular text and anchor text (text placed in anchor tags) differently.
looks at the title tag and the textual content of the page that your link is on.
ensures that anchor text variance will not dilute a link popularity building campaign.
scores links from specific URL domains differently.
is applied by search engines by going to each page and analyzing the importance of a page as per a matrix of words.
accounts for word order (e.g., keyword sequences).
grants contextuality between terms.
is co-occurrence.
compares documents against a "master document".
is disconnected or divorced from term vector theory.
is Applied Semantics's CIRCA technology.
can be used as an SEO optimization technique to make "LSI-Friendly" documents.
was invented by Google.
was patented by Google.
is ontology.
can be used by SEOs to improve rankings in SERPs.
This list of misconceptions, myths or plain lies was compiled from SEOBook, SearchEngineWatch, Cre8asiteforums, SEOChat, SEOMOZ, SeoRoundTable, Webmasterworld and similar forums. A sample of the Latest SEO Incoherences ("LSI") is available online.
This spreading of incorrect knowledge through electronic forums gives rise to a bursting phenomenon that in the past we have referred to as blogonomies. In our view, knowledge, citation importance or link weight transmitted through such bursts can be considered corrupted. Thus, this tutorial aims to dispel search marketing blogonomies relevant to SVD and LSI.
The fact is that query operators are not part of LSI. In addition to English text, LSI has been applied to text in Spanish and in other languages. LSI makes no presumptions regarding the words in documents or queries, whether these are or should be nouns, verbs, adjectives or other forms of tokens. LSI is not on-topic analysis or what SEOs like to call "theming". Current LSI algorithms ignore word order (term sequences), though a Syntagmatic Paradigmatic Model and Predication Algorithm has been proposed to work around this.
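To see why word order is lost, here is a minimal Python sketch. It assumes that the term-document matrix is built from plain token counts; the function name term_counts and the two sample sentences are illustrative only and do not come from any particular LSI implementation.

```python
from collections import Counter

def term_counts(text):
    # Lowercase, split on whitespace, and count each token.
    # Token positions are discarded, so sequence information is gone.
    return Counter(text.lower().split())

doc_1 = "dog bites man"
doc_2 = "man bites dog"

# Both documents yield the same counts, hence the same column
# in a term-document matrix built this way.
same_vector = term_counts(doc_1) == term_counts(doc_2)
print(same_vector)
```

Two sentences with opposite meanings end up with identical term vectors, so anything computed from those vectors, SVD included, cannot tell them apart.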
Another misconception is that latent semantic indexing is co-occurrence. Actually, it is not; at least, not first-order co-occurrence. LSI works great at identifying terms that induce similarity in a reduced space, but research from Dr. Tom Landauer and his group at the University of Colorado (19) indicates that over 99% of word pairs whose similarity is induced never appear together in a paragraph. Readers should be reminded that synonyms, or terms conveying a synonymity association, don't tend to co-occur, but tend to occur in the same, similar or related contexts. While LSI itself is not co-occurrence, term co-occurrence is important in LSI studies.
A persistent myth in search marketing circles is that LSI grants contextuality; i.e., that related terms occur in the same context. This is not always the case. Consider two documents X and Y and three terms A, B and C, where:
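The idea of similarity being induced in a reduced space can be sketched with NumPy. The three terms and the tiny term-document matrix below are made-up toy data, and taking the term coordinates as U_k scaled by the singular values is one common convention, assumed here. With k = 1 the cosine can only be +1 or -1, so this toy case shows the direction of the effect and nothing more.

```python
import numpy as np

# Toy term-document matrix (made-up data): rows are terms, columns are documents.
terms = ["car", "automobile", "engine"]
td = np.array([
    [1.0, 0.0],  # "car" appears only in document 1
    [0.0, 1.0],  # "automobile" appears only in document 2
    [1.0, 1.0],  # "engine" appears in both documents
])

def cosine(u, v):
    # Cosine of the angle between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Original space: "car" and "automobile" share no document.
raw_sim = cosine(td[0], td[1])

# Reduced space: keep the largest singular value only (k = 1).
U, s, Vt = np.linalg.svd(td, full_matrices=False)
k = 1
term_coords = U[:, :k] * s[:k]
reduced_sim = cosine(term_coords[0], term_coords[1])

print(raw_sim, reduced_sim)
```

The two terms never co-occur, yet they end up pointing the same way in the reduced space because both occur with "engine".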
A and B do not co-occur.
X mentions terms A and C.
Y mentions terms B and C.
:. A---C---B
The common denominator is C, so we define this relation as an in-transit co-occurrence since both A and B occur while in transit with C. This is called second-order co-occurrence and is a special case of high-order co-occurrence.
However, the mere fact that terms A and B are in transit with C does not grant contextuality, as the terms can be mentioned in different contexts in documents X and Y. For example, this would be the case if X and Y discuss different topics. Long documents are more prone to this.
Even if X and Y are monotopic, they might be discussing different subjects. Thus, it would be fallacious to assume that high-order co-occurrence between A and B while in transit with C equates to a contextuality relationship between the terms. Add polysemy to this and the scenario worsens, as LSI can fail to address polysemy.
There are other things to think about. LSI is computationally expensive and its overhead is amplified with large-scale collections. Certainly LSI is not associative indexing or root (stem) indexing, as some have suggested. It is not document indexing either, but is used with already indexed collections whose document terms have been prescored according to a particular term weight scheme. Furthermore, understanding a query (i.e., the assumption that the query must be of the natural language type) is not a requirement for implementing LSI.
In addition, the claim that terms must come from a specific portion of a document, such as title tags, anchor text, links or a specific URL domain, is false; none of this is a requisite for implementing LSI. These false concepts have been spread for a while, mostly by those who sell link-based services and who conveniently provide no mathematical evidence of how LSI works, since they cannot do the math.
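The A---C---B relation can be checked with a few lines of NumPy. This is only a sketch of the counting argument, assuming a binary term-document matrix for documents X and Y; the variable names are illustrative. Note that it says nothing about context, which is exactly the point made above.

```python
import numpy as np

# Rows are terms A, B, C; columns are documents X, Y (1 = term is mentioned).
terms = ["A", "B", "C"]
td = np.array([
    [1, 0],  # A is mentioned only in X
    [0, 1],  # B is mentioned only in Y
    [1, 1],  # C is mentioned in both X and Y
])

# First-order co-occurrence: number of documents two terms share.
first_order = td @ td.T
np.fill_diagonal(first_order, 0)

# Second-order co-occurrence: number of intermediate terms that
# co-occur with both members of a pair.
adjacency = (first_order > 0).astype(int)
second_order = adjacency @ adjacency
np.fill_diagonal(second_order, 0)

a, b = terms.index("A"), terms.index("B")
print(first_order[a, b])   # documents shared by A and B
print(second_order[a, b])  # terms linking A and B (here, only C)
```

The counts show a link between A and B through C, but they carry no information about whether X and Y use these terms in the same context.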
It is true that some papers on large-scale distributed LSI mention the word "domain" in connection with LSI, but the term is used in reference to information domains, not URL domains or what are known as web sites. It is also true that LSI can be applied to collections that have been precategorized by web site domain, but this is merely filtering and preclassification and is not part of the SVD algorithm used in LSI.
Let me mention that the singular value decomposition technique used in LSI is not an AI algorithm but a matrix decomposition technique; its mathematics predates computers, and the numerical algorithms in common use were developed in the sixties. SVD has nonetheless been used in many environments, including AI (1-14). Roughly speaking, SVD is just one matrix decomposition technique. There is certainly more than one way of decomposing and analyzing a given matrix, and plenty of alternative techniques are documented online (e.g., LU, QR, etc.).
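For readers who have never decomposed a matrix, here is a short sketch using NumPy's np.linalg.svd. The 4 x 3 matrix of term weights is made up for illustration, and k = 2 is an arbitrary choice; the sketch shows only the decomposition and the rank-k truncation, not a complete LSI system.

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix of term weights (made-up values).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# A = U * diag(s) * Vt, with the singular values in s sorted
# from largest to smallest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values to get a rank-k approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The full decomposition reproduces A; the truncated one only approximates it.
reconstruction_ok = np.allclose(U @ np.diag(s) @ Vt, A)
rank_of_A_k = np.linalg.matrix_rank(A_k)
print(reconstruction_ok, rank_of_A_k)
```

Nothing here involves learning, grammar or link data: it is plain linear algebra applied to a matrix of numbers.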
It is true that SVD, like NMF (non-negative matrix factorization), has been used to conduct email forensics. It is also true that SVD has been used as an eavesdropping tool for identifying word patterns from web communities (15-18), but LSI is not a secret weapon from the Government designed to read your mind, at least not yet. :)
Another misconception is that LSI is CIRCA, a technology developed by Applied Semantics (acquired by Google). As mentioned on this IR Thoughts blog, this is another SEO blogonomy. CIRCA is based on ontologies, not on SVD. LSI is not based on ontologies, but on SVD.
When you think it through, There is No Such Thing as "LSI-Friendly" Documents. This is just another SEO myth promoted by certain search engine marketing firms to better market whatever they sell. In the last tutorial of this series (SVD and LSI Tutorial 5: LSI Keyword Research and Co-Occurrence Theory), we explain in detail why there is no such thing as "LSI-Friendly" documents and why SEOs cannot use LSI to optimize any document for ranking purposes.
Tags:
Search Engine Optimization