The keyword in data analysis is analysis, not data. The mere volume of data does not make it significant; what makes data significant is the value we can extract from it and the questions it lets us answer. We may have 300 MB of data while the answer we require can be derived from just a few kilobytes. The trending field of Big Data would have flamed out if data analysis were only about the data and not the analysis.
Building software for data analysis is the process of generalizing part of the analysis. A specific data analysis requires the use of certain tools and procedures; software bundles all of these tools so that they can be applied again and again in various settings. Software also provides a systematic representation of the procedure, so that different people can understand it at any given time.
Software builds a standardized interface to the analysis procedure and therefore eases the use of the various tools and procedures behind it. In building software for data analysis, the most important aspect is specifying the inputs and outputs, since the value of the analysis depends on its output and the information extracted from it. For example, most statistical packages have a linear regression function with a well-defined interface: the inputs are the data set and, optionally, weights. The user need not know the gory details of the regression algorithm; the only things to specify are the outcome and the predictors, and the function can then be applied in any setting.
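To make the idea concrete, here is a minimal sketch of such a well-defined interface, assuming simple one-predictor least squares; the function name `fit_line` is illustrative, not taken from any particular package:

```python
def fit_line(xs, ys):
    """A regression interface in miniature: the caller supplies the
    data (inputs) and receives (intercept, slope) back (outputs),
    without needing to know how the fit is computed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Usage: the user specifies only the outcome and the predictor values.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Real packages add weights, diagnostics, and many more options, but behind the same kind of interface: inputs and outputs are stated, internals are hidden.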
Abstraction Levels and Rules
Like most software, the software built for data analysis consists of three levels of abstraction. At the lowest level, it contains the code and procedures, bundled together with the automation of those procedures. The next level is the function interface, where the inputs and outputs have to be specified properly. This is the most important level of abstraction, since the user only needs to understand the inputs and outputs; in a plotting function, for example, the user only has to supply the data set. The highest level is a software package. Here the most important aspect is the interface, which should be convenient and easy to use.
Often the question arises at what point common tasks should be systematized rather than rewriting the code for every new project. Answering it requires communication with the team and a clear understanding of the type of data analysis being done and whether it will be done again. Often, an analysis is required more than once. The rules of software engineering say:
- If the analysis is going to be used only once, then the only requirement is writing the code and documenting it well. Clear code and good documentation will be helpful if the analysis is ever required again.
- If the analysis is going to be required twice, then a function should be written, with a well-defined interface and properly stated inputs and outputs. Only a small piece of code needs to be abstracted.
- If the analysis is going to be required three or more times, then a small package should be built. All the operations performed in the analysis should be encompassed in the package, and the package has to be well documented so that its application in different analyses is clearly understood.
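As a sketch of the "required twice" rule, a repeated step becomes a documented function with a clear interface rather than copied code; the function and its output fields here are illustrative:

```python
def summarize(values):
    """Reusable analysis step with a stated interface.

    Input:  a non-empty sequence of numbers.
    Output: a dict with the count, mean, and range of the values.
    """
    n = len(values)
    return {"n": n,
            "mean": sum(values) / n,
            "range": max(values) - min(values)}

# The same function now serves every project that needs this step.
stats = summarize([2, 4, 6, 8])
```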
Phases of a Data Analysis Project
There are five phases that structure a data analysis project. Stating and refining the question is the most important phase of the process; it is analogous to the requirements-gathering phase of software engineering. This phase involves asking questions and identifying what you are interested in learning. Specifying and refining the question is very important, as it will define the data you obtain and the type of analysis you do. Specifying the question also involves identifying its type. There are six types of questions: descriptive, exploratory, inferential, causal, predictive, and mechanistic. Figuring out the type of question is an important step, and a lot of time should be spent on it. Obtaining the data from a source also falls under this phase.
The next phase is exploratory data analysis (EDA). There are two important goals in this phase. The first is determining whether the data obtained are suitable for answering the question specified. Questions about the data such as 'Is there enough data?', 'Are there any missing variables in the data set?', and 'Do we need to collect more data to get those missing variables?' need to be answered here. The second goal is making a sketch of the solution. If the data are sufficient, a basic sketch of the solution should be drawn out to get a better picture of what the answer looks like; the sketch will also confirm that the data obtained are suitable. This is done without any formal modeling or statistical testing. Data visualization is an essential tool for EDA, since it is generally easier to absorb information and recognize patterns in a graphical display.
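The first EDA goal, checking that the data can answer the question, can be sketched as a small suitability check; the record layout, column names, and threshold below are assumptions for illustration:

```python
def check_suitability(rows, required, min_rows):
    """First-pass EDA: list problems that would prevent the data
    from answering the question (too few rows, missing variables)."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows, need {min_rows}")
    for col in required:
        if any(row.get(col) is None for row in rows):
            problems.append(f"variable '{col}' is missing for some rows")
    return problems

# Two illustrative records; 'income' is missing for one of them.
data = [{"age": 34, "income": 52000}, {"age": 41, "income": None}]
issues = check_suitability(data, required=["age", "income"], min_rows=2)
```

An empty problem list would mean the data pass this first screen; any entry signals that more data (or different data) must be collected before continuing.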
The third phase, formal modeling, involves writing down the parameters to be estimated. The most important task here is challenging the answers obtained through the sketch of the solution in the exploratory phase. Just because answers were already obtained in the previous phase does not mean they are exactly what we require. A formal framework that embodies these challenges examines the sensitivity of the solution to different assumptions and thus reveals how robust the sketched solution is.
The fourth phase is interpretation. This involves interpreting the results obtained and checking whether they conform to the answers you expected before you had data. By this point you have probably done more than one analysis, so assembling the different pieces of information is also challenging: you have to figure out which piece is more reliable, which is more certain, and which carries more value. This makes interpretation a complex task.
Lastly, the findings of data analysis are required to be communicated to the audience. This audience may be internal or external to the organization, may be just a few people or even a large audience. This is an essential part of the process because it turns data analysis findings into actions.
Most of the time, a data analysis is carried out to make a decision or support some action, and the findings will inform that decision or action, which is then taken by the organization or its stakeholders.
Often a data set will already be available before the first phase. In that case, the data set is used to generate questions rather than answers. This is called hypothesis generating, because the aim is to produce questions, not answers. The remaining phases, from exploratory data analysis onward, then proceed as before.
One thing to be wary of is introducing bias into the analysis. This happens when exploratory data analysis is done on one data set and the same data set is then used to answer the questions. Instead, two independent data sets should be used: one for specifying the questions and the other for answering them.
Data Analysis Cycle
Data analysis is an iterative, non-linear process. It can be shown as a series of epicycles. An epicycle is a small circle whose center moves around the circumference of a larger circle. The data analysis process can be viewed as an epicycle because an iterative three-step process is repeated for each phase of the procedure. The three steps, repeated for each of the five phases described earlier, are:
- Setting expectations. This is the deliberate act of thinking about what you expect before you do anything. Experienced data analysts develop expectations automatically and subconsciously; it is an important habit to cultivate. For example, a person who has to withdraw money from an ATM before going out for dinner already has some expectation about the cost of the meal. This is a priori knowledge, and such a priori information can be used to develop expectations when you look at the data. This applies to each phase of the analysis procedure.
- Collecting information and comparing it to your expectations. This includes collecting information about your question and about your data. For the question, it means doing a literature search or asking experts to make sure the question is a good one. For the data, once you have expectations, you examine the data by performing some operations; the result of these operations is the data you collect, and you then decide whether it matches your expectations. In the restaurant analogy, the bill you receive is the data collected. The next step is comparing the collected data with your expectations. If they match, you can move to the next phase. If they do not, there are two possibilities: either your expectations were incorrect and need to be revised, or the collected data contain an error that must be fixed. Suppose you expected the bill to be Rs. 500 and the amount was Rs. 600; expectation and data do not match, so you either revise your expectation or find that there was an error in the calculation of the bill.
- Revising your expectations or fixing the collected data when the two do not match. One key indicator of how well the analysis is going is the ease or difficulty with which you can match your expectations to the collected data.
As you go through each phase, this epicycle is repeated to refine questioning, exploratory data analysis, formal modeling, interpretation, and communication. Hence, the outer circle contains the five phases and the inner circle contains the three steps. Together they form the epicycle of data analysis.
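The three epicycle steps can be sketched as a single comparison routine applied at every phase. The tolerance and return strings below are illustrative; the Rs. 500 versus Rs. 600 bill is the restaurant example from the text:

```python
def compare(expected, observed, tolerance):
    """Epicycle step 2: compare collected data with the expectation.
    A match lets you move on; a mismatch (step 3) means either the
    expectation must be revised or the data must be fixed."""
    if abs(observed - expected) <= tolerance:
        return "match: move to the next phase"
    return "mismatch: revise expectation or fix data"

# Restaurant analogy: expected a bill of Rs. 500, received Rs. 600.
outcome = compare(expected=500, observed=600, tolerance=50)
```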
To understand the epicycle of the data analysis procedure, consider an example. Suppose a company wants to find out how big the market is for a new asthma drug, and the initial question is the prevalence of asthma among adults. We need to perform each of the five phases.
For the first phase, you need to refine the question, develop expectations, collect information about the question, and check whether the expectations and the information match. Your expectation in this case is that the answer to the question of the prevalence of asthma among adults is unknown but answerable. A simple Internet search, however, reveals that the answer is readily available, so you have to reconsider the question.
Your company clarifies that the drug will target those whose asthma is not currently controlled by the available medication. This leads you to refine the question, which now becomes: 'How many people have asthma that is not controlled by the currently available medication, and what are the predictors of uncontrolled asthma?'
Next, you find a source from which you can download a data set covering the adult population. The codebook that comes with the data set states that there are 10,000 rows, each representing an individual and that individual's age. So you expect 10,000 rows in the data set. However, you find only 5,000.
You return to the codebook to validate your expectation and find that it was correct. So you head to the website from which you downloaded the data set and discover that there were two files, each containing 5,000 rows, and that you had downloaded only one. You download the other file, and now your expectation matches the data.
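The row-count check in this step is exactly the kind of expectation that can be made explicit in code; the file contents below are stand-ins for the downloaded data:

```python
def matches_codebook(rows, expected_rows):
    """Compare the downloaded data with the codebook's stated size."""
    return len(rows) == expected_rows

# Stand-ins for the two downloaded files of 5,000 rows each.
part1 = [{"id": i} for i in range(5000)]
part2 = [{"id": i} for i in range(5000, 10000)]

ok_before = matches_codebook(part1, 10000)          # one file: mismatch
ok_after = matches_codebook(part1 + part2, 10000)   # both files: match
```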
The third phase is building the statistical model, which is needed to relate the demographic information to the outcome and hence predict whether someone has uncontrolled asthma. The statistical model helps you specify precisely how you want to use your data, whether to make a prediction or to estimate a parameter, and it provides a formal way to challenge your findings. You find that age, gender, income, body mass index, race, and smoking status are the best predictors of uncontrolled asthma.
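A model of this kind could be sketched as a logistic score over the predictors. The coefficients below are invented purely for illustration and are not estimates from any real data:

```python
import math

def predict_prob(features, coefs, intercept):
    """Logistic model sketch: turn a linear combination of predictor
    values into a probability of uncontrolled asthma."""
    z = intercept + sum(coefs[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented coefficients for three of the predictors named above.
coefs = {"age": 0.02, "smoker": 0.8, "bmi": 0.05}
p = predict_prob({"age": 50, "smoker": 1, "bmi": 30},
                 coefs, intercept=-3.0)
```

In a real analysis the coefficients would be estimated from the data, and the fitted model would then be challenged against different assumptions, as described above.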
Next, you move to interpreting the results. On the whole, the findings match your expectations: the predictors are positively related to uncontrolled asthma. For females, however, the predictors are inversely related, which creates a mismatch between your expectations and the results. You check whether your expectations need to be adjusted or whether there is an error in the results, and you find that you had coded the gender variable as 0 for male and 1 for female while the codebook assigned 1 to male and 0 to female. The error was therefore in the interpretation of the results, not in the expectations.
The last phase is communicating your results. You create an informal report to effectively communicate your findings to the company. Your boss asks two questions: 'How recently was the data set collected?' and 'How would projected demographic changes in the coming years affect the prevalence of uncontrolled asthma?' Although you knew the answer to the first question, you had not included it in the report, so you add it. You also realize that the second question requires a new data analysis to be tackled. Hence, the first data analysis introduced new questions, which is a characteristic of a good data analysis.
Characteristics of a Good Question
The key element in good data analysis is asking the correct question; this aspect cannot be emphasized enough. As mentioned earlier, there are six types of questions. A descriptive question involves summarizing a characteristic, for example finding out the number of people who visited a website in the last 24 hours or the average height among adults; here you are summarizing a particular feature of your data. An exploratory question is one in which you find and examine trends, relationships between variables, and patterns in the data set. These are generally called hypothesis-generating questions because they help in creating a hypothesis. An inferential question asks whether results obtained in one data set also hold in another; if they do, the inference is taken to hold more generally. Suppose there is a data set about people and their diets. A predictive question would focus on finding out what type of people will eat a diet high in fruits and vegetables; it is not interested in the causes behind people eating fruits and vegetables. A causal question asks whether changing one factor results in a change in another factor; often, the underlying design of the data set determines whether such causal questions can be asked. Finally, a question that asks how one factor leads to another is called a mechanistic question. For example, the question of how including fruits and vegetables in the diet leads to a decrease in viral illness is a mechanistic question.
Two other points are as important as the types of questions. First, it may happen that answering one type of question requires you to answer other types first, before your original question can be addressed. Second, the type of question you can ask depends largely on the data set you have.
There are five key characteristics of a good question. These are:
- The question should be of particular interest to the audience and depends on the context of the data. The question of whether cheese sales are higher when cheese is placed next to pizza crust will interest grocery store owners, but not other industries. Similarly, the question of a city's pollution index will interest the people regulating pollution, but may be of no interest to grocery store owners.
- It is important to verify that your question hasn’t been answered yet. With so much data available on the Internet, it is very common that your question has already been answered.
- The question should also be plausible. The question of whether cheese sales are higher when cheese is placed next to pizza crust is a plausible one, since pizza requires cheese as an ingredient.
- The question should be answerable: a data set and the resources to feasibly find the answer should be available.
- The question should be specific, not general. For example, 'Does eating healthy reduce illness?' is a general question, whereas 'Does eating at least 3 servings of fruits and vegetables reduce illness?' is a specific one.
Software engineering plays an important role in the field of data analysis. It defines the structure of a data analysis project in five phases: questioning, exploratory data analysis, modeling, interpretation, and communication. For each of these phases, three steps are taken: setting expectations, comparing the collected information against them, and revising them. This creates a cyclical framework called the epicycle of data analysis, meaning that all three steps are performed for each of the five phases. Furthermore, asking the right questions is a key factor in the success of an analysis project. The questions have six types, and answering one type often involves answering other types as well. The characteristics of good questions should also be kept in mind while designing them.