Text analysis is now a vital part of web technology. Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. The Natural Language Toolkit (NLTK) provides the pre-existing libraries and components required for this application, and the implementation language is Python. The system is designed to output the sentiment of a given sentence; in layman's terms, it indicates whether the input text is positive, neutral, or negative. This technology is already used in Google search recommendations, YouTube search recommendations, and similar systems.
Everything we express (either verbally or in writing) carries huge amounts of information. The topic we choose, our tone, and our selection of words all add information that can be interpreted and from which value can be extracted. In theory, we can understand and even predict human behaviour using that information.
But there is a problem: one person may generate hundreds or thousands of words in a single statement, each sentence with its corresponding complexity. If you want to scale up and analyze hundreds, thousands, or millions of people or statements in a given geography, the situation becomes unmanageable.
Data generated from conversations, declarations, or even tweets are examples of unstructured data. Unstructured data does not fit neatly into the row-and-column structure of relational databases, yet it represents the overwhelming majority of data available in the real world. It is messy and hard to manipulate. Nevertheless, thanks to advances in disciplines like machine learning, an enormous revolution is under way in this area. Nowadays it is no longer about interpreting a text based on its keywords, but about understanding the meaning behind those words (the cognitive approach). This makes it possible to detect figures of speech such as irony, or even to perform sentiment analysis.
NLP is a part of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.
NATURAL LANGUAGE PROCESSING
The essence of Natural Language Processing lies in making computers understand natural language. That is not an easy task. Computers can understand structured data such as spreadsheets and database tables, but human languages, texts, and voices form an unstructured category of data that is difficult for a computer to understand, and this is where Natural Language Processing comes in. There is a lot of natural language data out there in various forms, and things would become much easier if computers could understand and process that data. We can train models to produce the expected output in different ways. Humans have been writing for thousands of years, so a great deal of literature is available, and it would be valuable to make computers understand it. But the task is never going to be easy. There are various challenges, such as understanding the correct meaning of a sentence, correct Named-Entity Recognition (NER), correct prediction of the various parts of speech, and coreference resolution (arguably the most challenging of all). Computers cannot truly understand human language. If we feed in enough data and train a model properly, it can distinguish and attempt to categorize the various parts of speech (noun, verb, adjective, etc.) based on previously seen data. When it encounters a new word, it makes its nearest guess, which can be embarrassingly wrong at times.
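The idea of tagging parts of speech from previously seen data, with a fallback guess for unseen words, can be sketched in a few lines of plain Python. This is a toy illustration with hypothetical training data, not NLTK's actual tagger (NLTK provides a trained tagger via nltk.pos_tag):

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data.
training = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("a", "DET"), ("loud", "ADJ"), ("dog", "NOUN"),
]

# Count how often each word was seen with each tag.
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def guess_tag(word):
    word = word.lower()
    if word in counts:
        # most frequent tag observed for this word
        return counts[word].most_common(1)[0][0]
    # nearest guess for unseen words -- where the model can be
    # "embarrassingly wrong"
    return "NOUN"

print([(w, guess_tag(w)) for w in "the dog sleeps".split()])
print(guess_tag("quickly"))  # unseen word: guessed NOUN, actually an adverb
```

A real tagger replaces the crude "NOUN" fallback with contextual features (suffixes, neighbouring tags), but the principle of guessing from previously fed data is the same.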
It is very difficult for a computer to extract the exact meaning from a sentence. For example: "The boy radiated fire-like vibes." Did the boy have a very motivating personality, or did he actually radiate fire? As you can see, parsing English with a computer is going to be complicated.
Percy Liang, a Stanford CS professor and NLP expert, breaks down the various approaches to NLP / NLU into four distinct categories:
- Distributional
- Frame-based
- Model-theoretical
- Interactive learning

A. DISTRIBUTIONAL APPROACH

Distributional approaches include the large-scale statistical tactics of machine learning and deep learning. These methods typically turn content into word vectors for mathematical analysis and perform quite well at tasks such as part-of-speech tagging (is this a noun or a verb?), dependency parsing (does this part of a sentence modify another part?), and semantic relatedness (are these different words used in similar ways?). These NLP tasks don't rely on understanding the meaning of words, but rather on the relationship between the words themselves.
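The distributional intuition, that words used in similar contexts get similar vectors, can be demonstrated with a minimal sketch: build co-occurrence vectors over a tiny hypothetical corpus and compare them with cosine similarity. Real systems use far larger corpora and learned embeddings, but the mechanics are the same.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: "cat" and "dog" appear in similar contexts,
# "market" does not.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks rose on the market today",
]

WINDOW = 2  # how many neighbouring words count as context

def context_vector(target):
    """Count the words appearing within WINDOW positions of the target."""
    vec = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                lo, hi = max(0, i - WINDOW), min(len(words), i + WINDOW + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[words[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cat, dog, market = (context_vector(w) for w in ("cat", "dog", "market"))
print(cosine(cat, dog))     # high: "cat" and "dog" share contexts
print(cosine(cat, market))  # lower: different contexts
```

Note that no meaning is involved anywhere: relatedness falls out purely from which words occur near which others.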
B. FRAME-BASED APPROACH
"A frame is a data-structure for representing a stereotyped situation," explains Marvin Minsky in his seminal 1974 paper, "A Framework For Representing Knowledge." Think of frames as a canonical representation for which specifics can be interchanged.
C. MODEL-THEORETICAL APPROACH

The third category of semantic analysis falls under the model-theoretical approach. To understand this approach, we introduce two important linguistic concepts: "model theory" and "compositionality".
Model theory refers to the idea that sentences refer to the world, as in the case with grounded language (i.e. the block is blue). In compositionality, meanings of the parts of a sentence can be combined to deduce the whole meaning.
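These two ideas can be made concrete in a toy evaluator (illustrative only; the "world", nouns, and adjectives below are invented for the example): the world is a set of facts, and the truth of "the &lt;noun&gt; is &lt;adjective&gt;" is computed compositionally from the meanings of its parts.

```python
# A toy model-theoretic world: pairs of (object, property) facts.
world = {
    ("block", "blue"),
    ("ball", "red"),
}

def adjective_meaning(adj):
    # Model theory: an adjective denotes the set of objects it applies to.
    return {obj for (obj, prop) in world if prop == adj}

def sentence_meaning(sentence):
    # Compositionality: the truth of "the X is Y" is built from the
    # meanings of "X" and "Y" -- X must lie in the denotation of Y.
    _, noun, _, adj = sentence.split()
    return noun in adjective_meaning(adj)

print(sentence_meaning("the block is blue"))  # True: matches the world
print(sentence_meaning("the block is red"))   # False
```

The grammar here is deliberately rigid (exactly four words); real model-theoretic systems pair a full parser with this kind of denotational evaluation.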
Python is a widely used general-purpose, high-level programming language. It was initially designed by Guido van Rossum in 1991 and is developed by the Python Software Foundation. It was mainly developed with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code.
Python is a programming language that lets you work quickly and integrate systems more efficiently. There are two major Python versions- Python 2 and Python 3. Both are quite different.
The features of Python are:
When we say the word ‘easy’, we mean it in different contexts.
a. Easy to Code
Python is very easy to code. Compared to other popular languages like Java and C++, it is easier to code in Python. Anyone can learn Python syntax in just a few hours. Of course, mastering Python requires learning all its advanced concepts, packages, and modules, and that takes time. Still, it is programmer-friendly.
b. Expressive

First, let's learn about expressiveness. Suppose we have two languages, A and B, and every program that can be written in A can be written in B using local transformations. However, there are some programs that can be written in B, but not in A, using local transformations. Then B is said to be more expressive than A. Python provides a myriad of constructs that help us focus on the solution rather than on the syntax. This is one of the outstanding Python features that tells you why you should learn Python.
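A small example of this expressiveness in practice: the same computation written as an explicit loop and as a single comprehension that states the solution directly.

```python
numbers = [3, 1, 4, 1, 5, 9, 2, 6]

# Explicit loop: square the even numbers.
squares_of_evens = []
for n in numbers:
    if n % 2 == 0:
        squares_of_evens.append(n * n)

# The same idea as a one-line comprehension.
concise = [n * n for n in numbers if n % 2 == 0]

print(squares_of_evens)             # [16, 4, 36]
print(concise == squares_of_evens)  # True
```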
c. Free and Open-Source
Firstly, Python is freely available. You can download it from the Python Website.
Secondly, it is open-source. This means that its source code is available to the public. You can download it, change it, use it, and distribute it. This is called FLOSS (Free/Libre and Open Source Software). As the Python community, we are all headed toward one goal: an ever-improving Python.
A large amount of data that is generated today is unstructured, which requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history. The process of analyzing natural language and making sense out of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task, which involves classifying texts or parts of texts into a pre-defined sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.
Tokenization is the process of splitting a string or text into a list of tokens, typically performed before transforming the text into vectors; it also makes it easier to filter out unnecessary tokens. One can think of a token as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph, so a document can be tokenized into sentences and a sentence into words.
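A minimal sketch of both levels of tokenization using regular expressions. This illustrates the idea only; NLTK's nltk.word_tokenize and nltk.sent_tokenize handle many edge cases (abbreviations, contractions, quotes) that this naive version does not.

```python
import re

def tokenize_words(text):
    # Sequences of letters, digits, or apostrophes count as word tokens.
    return re.findall(r"[A-Za-z0-9']+", text)

def tokenize_sentences(text):
    # Naive split on sentence-ending punctuation.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

paragraph = "NLTK is great. It makes NLP easy!"
print(tokenize_sentences(paragraph))
# ['NLTK is great', 'It makes NLP easy']
print(tokenize_words("It makes NLP easy!"))
# ['It', 'makes', 'NLP', 'easy']
```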
Stop words are the most commonly occurring words in a language; they are not relevant in the context of the data and contribute no deeper meaning to the phrase. In sentiment analysis, they carry no sentiment.
Words which look different due to casing or alternative spellings but are the same in meaning need to be processed correctly. Normalisation ensures that these words are treated equally, for example by changing numbers to their word equivalents or converting all of the text to the same casing.
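Both steps above can be sketched with a small hand-written stop-word list (illustrative; NLTK ships a fuller list via nltk.corpus.stopwords):

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def normalise(tokens):
    # Casing normalisation: treat "Movie" and "movie" identically.
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "The movie is a Great movie".split()
cleaned = remove_stop_words(normalise(tokens))
print(cleaned)  # ['movie', 'great', 'movie']
```

Note that normalisation must come first: without lowercasing, "The" would slip past the stop-word filter.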
Preparing Data for the Model
Sentiment analysis is the process of identifying the author's attitude toward the topic being written about. You will create a training data set to train a model. It is a supervised machine learning process, which requires you to associate each item in the dataset with a "sentiment" label for training. In this tutorial, your model will use the "positive" and "negative" sentiments.
Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.
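To make the two-category setup concrete, here is a toy Naive Bayes sentiment classifier with add-one (Laplace) smoothing, trained on a hypothetical hand-labelled dataset. It is a self-contained sketch of the algorithm the project relies on; in practice NLTK provides an equivalent via nltk.NaiveBayesClassifier.

```python
import math
from collections import Counter

# Hypothetical labelled training data.
train = [
    ("i love this movie", "positive"),
    ("what a great film", "positive"),
    ("this was a wonderful experience", "positive"),
    ("i hate this movie", "negative"),
    ("what a terrible film", "negative"),
    ("this was an awful experience", "negative"),
]

# Count words per label and build the vocabulary.
word_counts = {"positive": Counter(), "negative": Counter()}
label_counts = Counter()
vocab = set()
for text, label in train:
    label_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def classify(text):
    scores = {}
    for label in label_counts:
        # log prior P(label)
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # add-one smoothed log likelihood P(word | label)
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("i love this film"))  # positive
print(classify("an awful movie"))    # negative
```

Smoothing matters here: without the "+ 1", any word unseen under a label would drive that label's probability to zero.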
Subjectivity and sentiment analysis is an emerging field in NLP with very interesting applications. A lot can be learned from the large amount of unstructured and structured information on the web, which can aid subjectivity and sentiment analysis.
The expected output of this project is the simple sentiment of the given text, which is obtained efficiently through the use of Python, NLTK, and the Naïve Bayes algorithm. Annotations, abbreviations, and sarcasm are the main challenges faced in sentiment analysis.