A social media network is a social structure of individuals or groups who are related to each other, directly or indirectly, through a common relation or interest such as friendship or trust. Social media analysis is the study of these networks to understand user behavior. It has become a popular term in recent years because of its use in applications ranging from product marketing (for example, viral marketing) to search engines and organizational management. Interest in social media analysis has also grown quickly within the data mining community. The basic motivation is the increasing demand to extract knowledge from the large amounts of data being collected and to understand the social behavior of users in online environments. Data mining techniques are proving useful for analyzing social media data, especially for datasets too large to be handled by a traditional database management system.
Big Data is a very popular term today. Companies and organizations of all types are talking about their Big Data solutions and analytics systems. The sources of the data fed into these systems vary, but one type of data that is of great interest to most companies and organizations is social media data. Social media applications like Twitter, Facebook and Instagram are used by a large population around the world. The ability to instantly connect with and reach other people and organizations over large distances is an important part of today's society. Social media applications allow users to share opinions, comments, ideas, and media with friends, family, organizations, and businesses. The data and information contained in these comments, ideas, and media are useful to many types of organizations. Through data analysis and mining, it is possible to predict specific behavior of the users of these applications.
Over the past years, social media has grown considerably. It helps people communicate and lets users share their thoughts. Many organizations promote their products and services through social media. Many social networks exist, and there are over a thousand social media sites available on the internet. On Twitter, users post millions of tweets per day. These tweets contain a large amount of information that an organization can use to improve the quality of its products. The main objective of this research is to identify different techniques for analyzing this huge amount of social media data. This paper covers techniques that can help reveal a competitor's marketing strategy, including its data, people, and messages.
- This will help organizations grow quickly.
- This will help organizations make decisions faster.
- This will be an evolution in marketing tactics.
- This will help in understanding current social culture.
Every day social media users produce a massive volume of data that can be used to analyze their opinions about any event, movie, product, or political topic. Big Data tools such as Flume make this possible. Flume is a distributed and reliable tool for efficiently collecting and transferring large amounts of streaming data into HDFS (Hadoop Distributed File System). Once the data is in HDFS, we use Big Data technology to analyze it.
2.1 Introduction to Technology:
2.1.1 Big Data Definition:
- Big data is a term that describes huge amounts of data, both structured and unstructured.
- Big data refers to datasets so massive and complex that traditional database management systems or software cannot handle them.
- Big data also refers to the processes used when traditional data mining and handling techniques cannot extract insights and useful information from large datasets.
2.1.2 Big Data Characteristics:
Big Data has five characteristics, also known as the five V's of Big Data: velocity, variety, volume, value, and veracity.
Velocity is the speed at which huge datasets are produced, collected, and analyzed. The number of social media messages, emails, photos, videos, and so on is growing very rapidly around the globe, and datasets grow larger every day. Beyond analysis, the speed of transmitting and accessing the data must remain near-instantaneous to allow real-time uses such as credit card verification, websites, and instant messaging. Big data technology now allows us to analyze data while it is being produced, without first saving it to a database.
Variety refers to the different forms that data can take: unstructured, structured, and semi-structured.
Unstructured Data: - this type of data either has no fixed data model or is not stored in a fixed manner. It is usually text-heavy and can take any form, such as text, audio, and video. We cannot tell what the data means just by looking at it unless we apply human understanding to it.
Structured Data: - this type of data stays in a fixed field within a record. It includes data held in relational databases and spreadsheets. It has the advantage of being easy for users to enter, save, query, and analyze.
Semi Structured Data: - this type of data has some organizational structure, such as tags or keys, but does not conform to the formal structure of the data models used by relational databases or other data tables.
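To make the three categories concrete, the following sketch shows one small sample in each form. The field names and sample values are illustrative assumptions, not data from this study.

```python
import csv
import io
import json

# Structured: fixed fields in a fixed order, as in a relational table or spreadsheet.
structured_csv = "user_id,followers,country\n101,250,IN\n102,90,US\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing keys, but fields may be missing or nested.
semi_structured = [
    json.loads('{"id": 1, "text": "Great phone!", "user": {"name": "asha"}}'),
    json.loads('{"id": 2, "text": "Too slow", "hashtags": ["review"]}'),
]

# Unstructured: free text with no data model; its meaning needs interpretation.
unstructured = "Just watched the new movie, loved the ending but the music was loud."

# Every structured row exposes exactly the same columns.
same_columns = all(set(r) == {"user_id", "followers", "country"} for r in rows)

# Semi-structured records need defensive access because keys vary per record.
hashtags = [record.get("hashtags", []) for record in semi_structured]

print(same_columns)
print(hashtags)
```

The structured rows can be queried by column name without any checks, while the semi-structured records need a default value for every optional key.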
Volume refers to the huge amount of data produced every second by social media, cell phones, photographs, cars, credit cards, videos, and so on. Datasets have become so large that they can no longer be stored in a traditional database system such as an RDBMS. We now use distributed systems, in which parts of the data are saved in different locations and brought together by software for analysis.
Value refers to the worth of the data that is generated. Having endless amounts of data is one thing, but the data is useless if it cannot be turned into value.
Veracity means the quality or trustworthiness of the data: how correct is it? Consider, for example, all the Twitter posts with hashtags and how accurate and reliable they are. Collecting more and more data is of no use if its quality or trustworthiness is poor.
2.2 Tools and Framework:
- Data Analyzing Tools
- Data Ingestion Tools
- Data Visualization Tool
- Power BI
Hadoop is a distributed framework from the Apache Software Foundation. It is open source and written in Java. Hadoop is built for distributed processing: it runs jobs efficiently across the nodes of a cluster and processes large datasets on clusters of commodity hardware. Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It originated in 2002 in work on an open-source web search engine. Hadoop follows a master and slave architecture. The Name node (master node) holds information about all other nodes, while the data itself resides on the slave nodes. A slave node (Data node) performs computation on the data stored on that node.
The Hadoop ecosystem has three main components: HDFS, MapReduce, and YARN.
HDFS (Hadoop Distributed File System):
HDFS is a distributed file system that allows users to store very large files. Its design is based on the Google File System (GFS). It is designed to run on commodity hardware. HDFS offers fault tolerance, high availability, data reliability, data replication, and scalability. It is useful for applications that produce huge amounts of data, and it is regarded as a highly reliable data storage system.
HDFS follows a master and slave architecture. The master node is the Name Node, which stores metadata about all other nodes, and the slave nodes are Data Nodes, which store the actual data that users want to keep. An HDFS cluster contains one master node, and all other nodes are slave nodes.
Fig1- HDFS Architecture
Name Node (Master Node):-
The Master Node is also known as the Name node. It stores metadata such as the number of data blocks, the replication of blocks, and other information about the data. This metadata is kept in the Name node's memory for fast access. The Master Node maintains and manages the Data nodes (slave nodes) and assigns tasks to them.
Data Node (Slave Node):-
The Data Node is also known as the Slave node. In HDFS, Data Nodes store the actual data. A Data node performs read and write operations at the request of the user.
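The division of labor between the two node types can be sketched with a toy in-memory model. This is a simplified illustration, not Hadoop code: the class names, the 4-byte block size, and the round-robin placement are assumptions chosen for readability (real HDFS blocks default to 128 MB, and real clients exchange block data with Data Nodes directly).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


@dataclass
class DataNode:
    """Slave node: stores the actual block contents."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


@dataclass
class NameNode:
    """Master node: keeps only metadata about where each block lives."""
    datanodes: list
    metadata: dict = field(default_factory=dict)  # file -> [(block_id, node name)]

    def put(self, filename, content):
        placements = []
        for offset in range(0, len(content), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{filename}#{index}"
            node = self.datanodes[index % len(self.datanodes)]  # round-robin
            node.write(block_id, content[offset:offset + BLOCK_SIZE])
            placements.append((block_id, node.name))
        self.metadata[filename] = placements

    def get(self, filename):
        by_name = {node.name: node for node in self.datanodes}
        return b"".join(by_name[n].read(b) for b, n in self.metadata[filename])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes)
nn.put("tweets.txt", b"hello hadoop")
restored = nn.get("tweets.txt")
print(nn.metadata["tweets.txt"])
```

Note that the `NameNode` object never holds file contents in its `metadata`; it only records which block sits on which node, which is the point the architecture description makes.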
Features of HDFS:-
Fault tolerance is an important feature of HDFS. It describes a system's ability to keep working under adverse conditions. HDFS is highly fault tolerant because data is partitioned into blocks and multiple replicas of each block are stored on different nodes across the cluster. If any node in the cluster becomes inactive or crashes, the user can still access the data from other nodes that hold replicas of the same blocks. HDFS also places replicas of blocks on other racks in the system, so if a node goes down, the user can get the data from a node in another rack.
HDFS is a highly available file system because data is replicated across the nodes of the cluster. When a user or client requests data, it is served from the closest node in the cluster that holds the required block. Under unfortunate conditions, such as a failed or dead node, the user can easily get the data from other nodes that hold replicas, because multiple copies of each block of the user's dataset exist on other nodes in the cluster.
HDFS provides reliable storage of data on the Hadoop framework. It can store data in the range of hundreds of petabytes. It partitions data into blocks and stores these blocks on the nodes of the cluster. It stores data reliably through replication: it creates copies of each block on other nodes in the cluster, which gives the user fault tolerance. If a node that contains data becomes inactive, the client can quickly access that data from another node holding a copy. By default HDFS keeps three copies of each data block, because its default replication factor is three. Data therefore remains quickly available to all users, and users do not face data loss in the file system. Hence HDFS provides high reliability.
Replication is one of the most distinctive and important features of HDFS. Data is replicated so that the user is protected from data loss under unfortunate conditions such as hardware failure or a node crash. Data is divided into blocks, and the blocks are replicated across many nodes in the cluster. HDFS checks and maintains the replication of blocks at regular intervals, creating replicas of user data on many nodes of the cluster. When nodes in the cluster become inactive or die, the client can get the data from another active node that holds blocks of that dataset. Data stored in HDFS is therefore very unlikely to be lost.
HDFS stores data on many nodes in a cluster, and the cluster can be scaled when more capacity is required. Two scalability mechanisms are available. Vertical scalability means adding more resources such as CPU, memory, and disk to the existing nodes of the cluster. Horizontal scalability means adding more machines to the cluster. The horizontal approach is used more often because a cluster can be scaled, for example from 10 nodes to 100 nodes, with little downtime.
All of these features are achieved through the distributed storage and replication of data. HDFS stores data in a distributed manner on the nodes of an HDFS cluster (a collection of nodes). Data is divided into blocks, and these blocks are stored on the nodes available in the cluster. Replicas of every block are then created and stored on other nodes in the cluster. If a node in the cluster crashes, the data can easily be recovered from the other nodes that contain its replicas.
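The effect of replication on data availability can be illustrated with a small simulation. This is only a sketch: the round-robin placement below is an assumption made for simplicity (actual HDFS placement takes racks into account), and the node and block names are hypothetical.

```python
REPLICATION_FACTOR = 3  # the HDFS default mentioned above


def place_replicas(block_ids, node_names, factor=REPLICATION_FACTOR):
    """Assign each block to `factor` distinct nodes in round-robin order.

    Assumes factor <= len(node_names) so the chosen nodes are distinct.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            node_names[(i + k) % len(node_names)] for k in range(factor)
        ]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)

survives_two = readable_blocks(placement, failed={"dn1", "dn2"})
survives_three = readable_blocks(placement, failed={"dn1", "dn2", "dn3"})
print(sorted(survives_two))
print(sorted(survives_three))
```

With three replicas per block, any two node failures leave every block readable in this layout. A block is lost only when all three of its holders fail together, as happens to `b0` in the second case.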
MapReduce is the processing layer of the Hadoop framework. It is a programming model developed to process huge amounts of data in parallel by partitioning the work into sets of independent tasks. The user only needs to supply the logic for the map and reduce steps; everything else is handled by the Hadoop framework. The complete job submitted by the user to the master node is divided into small units of work known as tasks, and these tasks are assigned to the slave nodes. MapReduce programs are written in a fixed style influenced by functional programming constructs, specifically those for processing lists or collections of data. MapReduce takes its input as a list and converts it into an output that is also a list. MapReduce is the heart of Hadoop: Hadoop is powerful because MapReduce processes data in parallel. Working of MapReduce:
Fig2- Working of MapReduce
The input data provided to the mapper is processed by a user-defined function written for the mapper. All the complex logic is placed at the mapper level, so the heavy processing is completed by the mappers in parallel; the number of mappers is typically larger than the number of reducers. The mapper produces intermediate output, which is used as input to the reducer. This intermediate output is then processed by a user-defined function written for the reducer, which produces the final output. The reducer performs comparatively light processing. The final output is saved in HDFS, where replicas of the data are created.
Yet Another Resource Negotiator (YARN) is the resource management component of Hadoop. YARN was introduced in Hadoop version 2. YARN acts like an operating system for Hadoop: just as an operating system manages resources, YARN manages resources for Hadoop. It does not only manage resources; it also schedules jobs in Hadoop. YARN extends Hadoop so that other emerging technologies can make use of HDFS. The architecture of Hadoop version 2 provides a general-purpose data processing environment that is not limited to MapReduce. It allows users to run different types of frameworks on the same hardware on which the Hadoop framework is installed.
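The mapper, intermediate output, and reducer stages described above can be sketched as a single-process word count. This is a minimal illustration in plain Python, not the Hadoop API: the function names are hypothetical, and the grouping step that Hadoop performs between map and reduce is written out explicitly as `shuffle`.

```python
from collections import defaultdict


def mapper(line):
    """User-defined map function: emit a (word, 1) pair for each word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """User-defined reduce function: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    # Map phase: each input line is processed independently, so in Hadoop
    # these calls could run in parallel on different nodes.
    intermediate = [pair for line in lines for pair in mapper(line)]
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in shuffle(intermediate).items())


counts = run_job(["big data is big", "data is everywhere"])
print(counts)
```

Both the input and the output are collections of records, which matches the list-in, list-out view of MapReduce given earlier.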
- Pig was first developed as a research project at Yahoo in 2006. It is a tool for analyzing huge datasets stored in HDFS.
- Pig is a component of the Hadoop ecosystem.
- Its workflow is similar to SQL: it first loads the data, applies filters, and then dumps the data in the format the client requires.
- Pig allows Hadoop programmers to write data analysis programs in Pig Latin, a high-level language.
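Pig scripts themselves are written in Pig Latin. The Python sketch below only mirrors the load, filter, and dump steps listed above to show the shape of the data flow; the tab-separated input, its field names, and the helper functions are all assumptions made for illustration.

```python
import io

# Hypothetical tab-separated input: user, language, retweet count.
RAW = "asha\ten\t12\nbruno\tpt\t3\nchen\ten\t40\n"


def load(stream):
    """LOAD step: parse each line into a typed record."""
    for line in stream:
        user, lang, retweets = line.rstrip("\n").split("\t")
        yield {"user": user, "lang": lang, "retweets": int(retweets)}


def keep(records, predicate):
    """FILTER step: pass through only the records that satisfy the predicate."""
    return (record for record in records if predicate(record))


def dump(records):
    """DUMP step: materialize the result in the form the client wants."""
    return [(record["user"], record["retweets"]) for record in records]


popular_english = dump(
    keep(load(io.StringIO(RAW)), lambda r: r["lang"] == "en" and r["retweets"] > 10)
)
print(popular_english)
```

Each step consumes the output of the previous one lazily, so nothing is materialized until `dump` is called.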
Hive is an open-source data warehouse and a component of the Hadoop ecosystem. It is built on Hadoop. Hive is a data analysis tool for querying and analyzing large amounts of data. It mainly performs three functions: data summarization, query, and analysis. Hive uses HiveQL (HQL), a language similar to SQL. Hive was developed at Facebook and later became an Apache project. Many organizations, such as Netflix and Amazon, now use Hive. Hive can process data stored on local file systems such as ext4 as well as on HDFS. It can load data into internal and external tables. When a user creates a table in Hive, it is an internal table by default; to create an external table, the user must specify the EXTERNAL keyword at table creation. Hive provides two important concepts, partitioning and bucketing. Partitioning groups data of the same kind based on a column or partition key, and bucketing divides data into a fixed number of buckets based on the hash of a column.
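The difference between partitioning and bucketing can be sketched in a few lines of Python. This is an illustration of the idea only: the row fields, the bucket count, and the use of `zlib.crc32` as the hash are assumptions, and Hive applies its own hash function internally.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count for the example


def partition_and_bucket(rows, partition_key, bucket_key, num_buckets=NUM_BUCKETS):
    """Group rows by partition value, then assign each row to a hash bucket."""
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        part = row[partition_key]
        # A stable hash keeps the bucket assignment repeatable across runs.
        bucket = zlib.crc32(str(row[bucket_key]).encode()) % num_buckets
        layout[part][bucket].append(row)
    return layout


rows = [
    {"user_id": 1, "country": "IN", "text": "first post"},
    {"user_id": 2, "country": "US", "text": "hello"},
    {"user_id": 1, "country": "IN", "text": "second post"},
    {"user_id": 3, "country": "IN", "text": "nice"},
]
layout = partition_and_bucket(rows, "country", "user_id")
for country, buckets in sorted(layout.items()):
    print(country, {bucket: len(items) for bucket, items in buckets.items()})
```

A query restricted to one country only has to read that partition, and all rows for a given `user_id` within it land in the same bucket, which is what makes lookups and joins on that column cheaper.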
- Apache Sqoop is a data ingestion tool and part of the Hadoop ecosystem.
- It transfers data between relational database systems and the Hadoop Distributed File System.
- Flume is a data ingestion tool and is also part of the Hadoop ecosystem.
- Flume is a distributed and reliable tool for efficiently collecting and moving large amounts of streaming data into HDFS (Hadoop Distributed File System). The data flow model of Flume is shown in the figure below:
Figure 3- Flume Data Flow
A Flume source takes events as input from an external source, for example a web server. The external source sends events to Flume in a format that the target Flume source recognizes; in our project the external source is Twitter. When a Flume source receives an event, it stores the event in one or more channels. A channel is temporary storage that holds events until they are consumed by a sink. The sink removes events from the channel and either writes them to an external file system such as HDFS or forwards them to the source of the next agent if the flow has multiple hops. The source and sink of a given agent run asynchronously, with events staged in the channel.
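The source, channel, and sink roles can be sketched with a bounded queue and two threads. This is an analogy in plain Python rather than Flume itself: the function names, the list standing in for HDFS, and the `None` end-of-stream marker are assumptions made for the demonstration.

```python
import queue
import threading


def source(events, channel):
    """Source: accept events from an external producer and stage them."""
    for event in events:
        channel.put(event)  # blocks while the channel is full
    channel.put(None)  # sentinel marking the end of this demo stream


def sink(channel, store):
    """Sink: drain events from the channel and write them to the store."""
    while True:
        event = channel.get()
        if event is None:
            break
        store.append(event)


channel = queue.Queue(maxsize=2)  # bounded buffer, like a capacity-limited channel
hdfs_stub = []                    # stands in for files written to HDFS
events = [f"tweet-{i}" for i in range(5)]

producer = threading.Thread(target=source, args=(events, channel))
consumer = threading.Thread(target=sink, args=(channel, hdfs_stub))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(hdfs_stub)
```

The source and sink never call each other; they only share the channel, which is why they can run asynchronously and at different speeds.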
Power BI:
Power BI is a data visualization tool. It is a suite of business analytics tools that deliver insights. It produces reports from the data, and these reports are useful to businesses.
3.1.1 Extracting Social Media Data with Flume:
The social media Streaming API provides a constant stream of social media data. The data coming from the application must be stored securely in HDFS. Security is ensured through the keys that are generated when an application is created on the social media platform.
3.1.2 Querying JSON Data with Hive:
Hive expects input data in a delimited row format, but social media data arrives in JSON format. To handle this type of data, Hive uses the SerDe interface to interpret the incoming data. SerDe stands for Serializer and Deserializer; these interfaces let Hive convert data to and from a form it can process. The Deserializer is used when data is read from disk and converts it into a format that Hive can manipulate. The data has some structure, but it is not rigid: certain fields may or may not exist. This semi-structured nature makes the data very hard to query in a traditional database, but Hive can process it. Hive can likewise handle web server log files, which may be in CSV, TSV, or other unstructured or semi-structured formats.
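What a deserializer does with semi-structured input can be sketched as follows. This is not a Hive SerDe implementation; the column list, the JSON field names, and the sample records are assumptions chosen to show how optional fields are mapped onto a fixed row layout.

```python
import json

COLUMNS = ["id", "text", "user_name", "retweet_count"]  # assumed table schema


def deserialize(line):
    """Turn one JSON record into a fixed-width row; missing fields become None."""
    record = json.loads(line)
    return (
        record.get("id"),
        record.get("text"),
        record.get("user", {}).get("name"),  # nested field, may be absent
        record.get("retweet_count"),
    )


raw_lines = [
    '{"id": 1, "text": "Loving the new release", "user": {"name": "asha"}, "retweet_count": 4}',
    '{"id": 2, "text": "Not impressed"}',
]
table = [deserialize(line) for line in raw_lines]
for row in table:
    print(dict(zip(COLUMNS, row)))
```

Every row ends up with the same four columns regardless of which keys the original record carried, which is what lets a SQL-like query run over the data.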
3.2 Analysis Diagrams:
3.2.1 Block Diagram:
Fig4- Block Diagram
3.2.2 Data Flow Diagram:
Fig5- Data Flow Diagram
3.2.3 Use Case Diagram:
Fig6- Use Case Diagram
3.2.4 Activity Diagram:
Fig7- Activity Diagram
3.2.5 Class Diagram:
3.2.6 Sequence Diagram:
Fig9- Sequence Diagram
HDFS: HDFS stands for Hadoop Distributed File System. It is a distributed file system that allows users to store very large files. It is based on the Google File System (GFS) and is designed to run on commodity hardware. HDFS offers fault tolerance, high availability, data reliability, data replication, and scalability. It is useful for applications that produce large amounts of data, and it is regarded as a highly reliable data storage system.
This study gives an overview of Big Data technology, its characteristics, features, and classifications. The report covers the basic and technical information about social media analysis and its architecture in a Big Data setting. It explains the components of Hadoop such as HDFS (Hadoop Distributed File System), MapReduce, YARN, Pig, Hive, HBase, Sqoop, and Flume. The report describes the process of analyzing social media using Big Data technology and the Hadoop framework. As the number of social media users grows, the data grows very rapidly, and Big Data technology is widely used across organizations to analyze these large datasets.
4.2 Future Scope:
As time passes, the number of users on social media platforms increases. Social media platforms allow these users to share their comments, opinions, ideas, and media with friends, family, businesses, and organizations. The data contained in these comments, ideas, and media is valuable to many types of organizations. With the help of the analyzed data and insights, organizations can customize their products and improve product quality according to users' needs. Analysis of social media data already helps many organizations, and over time the number of organizations that use it to enhance their product quality will increase.