
DATA VISUALIZATION

By

S.V.V.D.Jagadeesh

Sr. Assistant Professor

Dept of Artificial Intelligence & Data Science

LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

Thursday, October 9, 2025

Previously Discussed Topics

  • Session Outcomes
  • Visualization of Spatial Data
  • Types of Spatial Data
  • One-dimensional data
  • Two-dimensional data
  • Probing two-dimensional data
  • Three-dimensional data

Session Outcomes

At the end of this session, students will be able to:

  • Understand Text Data and the Vector Space Model (Understand-L2)

Text Data

  • We define a collection of documents as a corpus (plural: corpora). We deal with objects within corpora.
  • These objects can be words, sentences, paragraphs, documents, or even collections of documents. We may even consider images and videos.
  • Often these objects are considered atomic with respect to the task, analysis, and visualization.
  • Text and documents are often minimally structured and may be rich with attributes and metadata, especially when focused on a specific application domain.

Text Data

  • We can compute statistics about documents.
  • For example, the number of words or paragraphs, or the word distribution or frequency, can all be used to assess author authenticity.
  • Are there any paragraphs that repeat the same words or sentences?
  • We can also identify relationships between paragraphs or documents within a corpus.

Levels of Text Representation

  • Lexical Level
  • Syntactic Level
  • Semantic Level

Lexical Level

  • The lexical level is concerned with transforming a string of characters into a sequence of atomic entities, called tokens.
  • Lexical analyzers process the sequence of characters with a given set of rules into a new sequence of tokens that can be used for further analysis.
  • Tokens can include characters, character n-grams, words, word stems, lexemes, phrases, or word n-grams, all with associated attributes.
  • Many types of rules can be used to extract tokens, the most common of which are finite state machines defined by regular expressions.
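A tokenizer of this kind can be sketched with a single regular expression; the rule set and sample sentence below are illustrative, not taken from the slides:

```python
import re

# A small rule set: words (optionally with an internal apostrophe),
# numbers, and standalone punctuation marks each form a token.
TOKEN_RULES = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]")

def tokenize(text):
    """Lexical analysis: turn a string of characters into a sequence of tokens."""
    return TOKEN_RULES.findall(text)

print(tokenize("Lexical analyzers don't parse; they tokenize."))
```

Each alternative in the pattern is one token rule; the regular-expression engine compiles the alternatives into a single finite state machine.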

Syntactic Level

  • The syntactic level deals with identifying and tagging (annotating) each token's function.
  • We assign various tags, such as sentence position or whether a word is a noun, expletive, adjective, dangling modifier, or conjunction.
  • Tokens can also have attributes, such as whether they are singular or plural, or their proximity to other tokens.
  • Richer tags include date, money, place, person, organization, and time.
  • The process of extracting these annotations is called named entity recognition (NER).
  • The richness and wide variety of language models and grammars (generative, categorical, dependency, probabilistic, and functionalist) yield a wide variety of approaches.
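Full NER relies on trained statistical models, but the flavor of extracting rich tags can be shown with a toy rule-based annotator; the two patterns below (for money amounts and ISO-style dates) are illustrative assumptions, not a real NER system:

```python
import re

# Toy rule-based annotator for two of the "rich" tags above.  Real NER
# uses trained models; these two regexes are illustrative only.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d{2})?"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def annotate(text):
    """Return sorted (tag, matched text) pairs found in the text."""
    found = []
    for tag, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((tag, match.group()))
    return sorted(found)

print(annotate("Invoice dated 2025-10-09 totals $42.50."))
# [('DATE', '2025-10-09'), ('MONEY', '$42.50')]
```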

Two-Dimensional Data

  • 4. A scatterplot results if, at each location on the plot, the data value(s) control the color, shape, or size of a marker. Note that, unlike for images, no interpolation is performed.
  • 5. A map results if the data contains linear and area features, as well as point objects. A linear feature, such as a road or stream, is represented as a sequence of connected coordinates, which are plotted as a series of line segments.
  • 6. A contour or isovalue map conveys boundary information extracted from an image depicting a continuous phenomenon, such as elevation or temperature. The term isovalue means "single value," and thus a contour on such a map indicates the boundary between points above this value and points below it.

Vector Space Model

  • Computing term vectors is an essential step for many document and corpus visualization and analysis techniques.
  • In the vector space model, a term vector for an object of interest (paragraph, document, or document collection) is a vector in which each dimension represents the weight of a given word in that document.
  • Typically, to clean up noise, stop words (such as "the" or "a") are removed (filtering), and words that share a word stem are aggregated together (stemming).
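A minimal sketch of building a term vector with filtering and stemming; the stop list is a tiny illustrative sample, and the suffix-stripping below stands in for a real stemmer such as Porter's:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # tiny sample list

def crude_stem(word):
    """Very rough stemming: strip a few common suffixes
    (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_vector(text):
    """Map a document to {term: count} after filtering and stemming."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(terms)

print(term_vector("The visualizations of documents and document collections"))
```

Note how "documents" and "document" collapse into one dimension after stemming, while the stop words contribute nothing.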


Vector Space Model

  • The paragraph contains 98 string tokens, 74 terms, and 48 terms when stop words are removed.
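(The paragraph referred to here appeared as a figure on the original slide and is not reproduced.) The three counts can be computed for any text along these lines; the stop list is an illustrative sample:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "we", "or"}

def counts(text):
    tokens = re.findall(r"\w+", text.lower())            # string tokens
    terms = set(tokens)                                  # distinct terms
    content = {t for t in terms if t not in STOP_WORDS}  # stop words removed
    return len(tokens), len(terms), len(content)

sample = ("We define a collection of documents as a corpus. "
          "We deal with objects within corpora.")
print(counts(sample))  # → (15, 13, 10)
```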

Computing Weights

  • The vector space model requires a weighting scheme for assigning weights to terms in a document.
  • There exist many such methods, the most well known of which is term frequency-inverse document frequency (tf-idf).
  • Let Tf(w) be the term frequency, the number of times the word w occurs in the document, and let Df(w) be the document frequency, the number of documents that contain the word.
  • Let N be the number of documents. We define TfIdf(w) = Tf(w) × log(N / Df(w)).
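A sketch of computing these weights over a toy corpus. The base of the logarithm is a matter of convention (natural log here), and a word that appears in every document gets weight zero, since log(N/N) = 0:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists.  Returns one {term: weight} dict per
    document, using TfIdf(w) = Tf(w) * log(N / Df(w))."""
    n = len(docs)
    df = Counter()                     # Df(w): number of docs containing w
    for doc in docs:
        df.update(set(doc))
    return [
        {w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
        for doc in docs
    ]

vecs = tf_idf_vectors([["data", "viz"], ["data", "model"], ["zipf"]])
# "data" appears in 2 of the 3 documents, so its weight is 1 * log(3/2).
```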


Zipf's Law

  • The normal and uniform distributions are the ones we are most familiar with.
  • The power law distribution is common today with the large data sizes we encounter, which reflect scalable phenomena.
  • The economist Vilfredo Pareto stated that a company's revenue is inversely proportional to its rank, a classic power law, resulting in the famous 80-20 rule, in which 20% of the population holds 80% of the wealth.
  • The Harvard linguist George Kingsley Zipf modeled the distribution of words in natural language corpora using a discrete power law distribution called the Zipfian distribution.

Zipf's Law

  • Zipf's Law states that in a typical natural language document, the frequency of any word is inversely proportional to its rank in the frequency table.
  • Plotting the Zipf curve on a log-log scale yields a straight line with a slope of -1.
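This slope can be checked numerically on ideal Zipfian data (synthetic frequencies proportional to 1/rank, not a real corpus):

```python
import math

# Ideal Zipfian frequencies: f(r) proportional to 1/r for ranks 1..1000.
ranks = range(1, 1001)
freqs = [1.0 / r for r in ranks]

# Slope of the log-log curve between the first and the last point.
slope = (math.log(freqs[-1]) - math.log(freqs[0])) / (
    math.log(ranks[-1]) - math.log(ranks[0]))
print(round(slope, 6))  # -1.0 on ideal Zipfian data
```

Real corpora only approximate this line; the measured slope typically hovers near -1 rather than hitting it exactly.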


Zipf's Law

  • One immediate implication of Zipf's Law is that a small number of words describe most of the key concepts in small documents.
  • There are numerous examples of text summarization that permit a full description with just a few words.
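A crude illustration of this idea: take the most frequent non-stop words as summary keywords. The stop list and cutoff are illustrative choices, not a real summarization method:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "that", "any", "its"}

def keywords(text, k=3):
    """Return the k most frequent non-stop words: a crude extractive summary."""
    words = [w for w in re.findall(r"[a-z]{2,}", text.lower())
             if w not in STOP_WORDS]
    return [w for w, _ in Counter(words).most_common(k)]

text = ("Zipf's Law states that the frequency of any word is inversely "
        "proportional to its rank in the frequency table.")
print(keywords(text))  # "frequency" ranks first
```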


Tasks Using the Vector Space Model

  • The vector space model, when accompanied by some distance metric, allows one to perform many useful tasks.
  • We can use tf-idf and the vector space model to identify documents of particular interest.
  • For example, a distance metric over term vectors lets us answer questions such as: Which documents are similar to a given one? Which documents are relevant to a given collection of documents? Which documents are most relevant to a search query? In each case, we find the documents whose term vectors are most similar to the given document's vector, to the average vector of the collection, or to the query's vector.
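A minimal sketch of similarity search over term vectors, using cosine similarity as the distance metric and raw term counts as weights (tf-idf weights would slot in the same way); the toy documents and query are illustrative:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two {term: weight} vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(query_vec, doc_vecs):
    """Index of the document whose term vector is closest to the query."""
    return max(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]))

docs = [Counter("zipf law word frequency rank".split()),
        Counter("term vector document weight".split()),
        Counter("spatial map contour elevation".split())]
query = Counter("word frequency".split())
print(most_similar(query, docs))  # document 0 shares the query's terms
```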

Tasks Using the Vector Space Model

  • Another, indirect task is helping the user make sense of an entire corpus.
  • The user may be looking for patterns or structures, such as a document's main themes, clusters, and the distribution of themes through a document collection.
  • This often involves visualizing the corpus in a two-dimensional layout, or presenting the user with a graph of connections between documents or entities to navigate.
  • The visualization pipeline maps well to document visualization: we take the data (corpus), transform it into vectors, run algorithms based on the tasks of interest (e.g., similarity, search, clustering), and generate the visualizations.
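The pipeline can be compressed into a sketch: corpus in, term vectors out, then a pairwise similarity matrix ready for clustering or a two-dimensional layout. Raw term counts stand in for tf-idf weights for brevity, and the three-document corpus is illustrative:

```python
import math
from collections import Counter

def term_vectors(corpus):
    """Stages 1-2: corpus in, one term-count vector per document out."""
    return [Counter(doc.lower().split()) for doc in corpus]

def similarity_matrix(vecs):
    """Stage 3: pairwise cosine similarities, ready to feed a clustering
    or two-dimensional layout algorithm in the visualization stage."""
    def cos(u, v):
        dot = sum(u[t] * v[t] for t in u if t in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return [[cos(a, b) for b in vecs] for a in vecs]

corpus = ["zipf law frequency", "vector space model", "zipf frequency rank"]
sim = similarity_matrix(term_vectors(corpus))
# Documents 0 and 2 share two terms, so sim[0][2] > sim[0][1].
```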

Summary

  • Session Outcomes
  • Text Data
  • Levels of Text Representation
  • Vector Space Model
  • Computing Weights
  • Zipf's Law
  • Tasks Using the Vector Space Model