Introduction
CHAPTER 1. E-COMMERCE, BIG DATA AND SENTIMENT ANALYSIS
1.1. How knowledge of Customer Attitude helps in E-Commerce
1.2. Big Data from Social Networking Sites
1.3. Extract and Assess Customers’ Attitudes with NLP and SA
1.4. Research Gap
1.5. Summary of Chapter 1
CHAPTER 2. RESEARCH METHODOLOGY
2.1. Research Design
2.2. Business understanding
2.3. Data understanding
2.4. Data preparation
2.5. Modelling
2.6. Evaluation
2.7. Summary of Chapter 2
CHAPTER 3. POLARITY CLASSIFICATION MODELING
3.1. Initial model creation
3.2. Comparison of initial models with existing services
3.3. Final polarity classification model
3.4. Application of final model to Russian E-Commerce companies
3.5. Summary of Chapter 3
Discussion and conclusions
Discussion of the findings
Theoretical contributions
Managerial implications
Limitations of the study
Prospects for future research
References
Appendices
E-Commerce is booming around the globe, growing at double-digit rates (Statista, 2017). In Russia, online sales exceeded 1 trillion rubles and the E-Commerce market grew by 13% in 2017, even as offline retail was severely affected by the economic crisis (AITC, 2018). At the same time, cross-border trade is growing faster than local E-Commerce, and several large international players, such as Aliexpress, Asos and JD.com, have entered the market in the last two years (Data Insight, 2017). Along with the increase in supply, consumers’ purchasing power declined by 6% in 2016 (FSSS, 2017). These factors have led to more intense competition for customers’ attention and approval. To earn this approval, a company should identify and fulfill customers’ demands and needs. The question every marketing and sales specialist in E-Commerce should ask is ‘how to change customers’ attitude in favor of our company’. Efficient communication between firms and customers can promote the sale of particular products, services and brands. Thus, understanding consumers’ attitudes is at the core of making marketing decisions (Sheng et al., 2017).
In E-Commerce, where all company-customer interactions happen online and often in real time, the rational venues for collecting data on customers’ attitudes are internal sources (sales data, marketing research) or external sources, such as online forums and social networking sites (SNS). SNS are a vital part of people’s daily lives and are widespread around the globe (Statista, 2018). Such diffusion leads to the creation of huge amounts of User-Generated Content (UGC), which in many cases contains customers’ sentiments towards companies and their brands, products and services. UGC is a form of online word-of-mouth behavior and takes the form of unstructured data. The existence of this kind of information allows a company to find consumer opinion about its products and those of its competitors without relying on surveys, focus groups and external consultants (Liu, 2010). Collecting and interpreting this type of unstructured data is more efficient with the application of Big Data (BD) techniques and tools.
‘Big Data’ is a buzzword of the early 21st century. Numerous companies in different industries, from small fashion boutiques to multinational pharmaceutical conglomerates, utilize this technology trend to achieve and confirm their competitive advantage, since it positively affects both companies’ strategy and operations (Hagen, 2013). It enables design-driven innovations and changes the paradigm of company-customer relationships (Morabito, 2015). BD has proved to be especially useful from a marketing perspective, since it allows companies to gather and analyze unstructured data and use it to study consumer behavior and hidden consumer sentiment (Michael & Miller, 2013).
There is a wide variety of possible applications of BD extracted from SNS that are beneficial to a company: it can predict adoption probability (Fang et al., 2013), improve consumer-retailer loyalty (Rapp et al., 2013), and boost advertising and revenue growth (Shriver et al., 2013). Through analysis of company-related UGC, a company is capable of identifying consumers’ attitudes towards its products and services (Pozzi, 2017). This is attractive to companies because knowledge of customers’ attitudes allows a company to tune its marketing strategy (including niche market identification and brand positioning) and its interaction with customers (Bahtar & Muda, 2016), both of which directly influence end users’ decisions to purchase the company’s products and services (Ding et al., 2015). This is why having a flexible and powerful (in many cases, free) toolkit for extracting knowledge about SNS users’ attitudes from openly accessible brand-related UGC has turned from an optional feature into a ‘must-have’ (MongoDB White Paper, 2016).
This concurrent learning of users’ behavior enables real-time, intent-based optimal interventions, which increase purchase likelihood (Ding et al., 2015). However, many E-Commerce companies in Russia do not even try to benefit from this type of information despite its appealing potential outcomes, or capture only a fraction of the potential value of the data for improving their sales efforts (Tadviser, 2017). One of the reasons may be the lack of a theoretical base that clearly aligns the application of innovative BD techniques with digital marketing benefits (Amado et al., 2018).
Since most of the data from SNS (online reviews, UGC, online ratings) is raw textual data, toolkits such as Natural Language Processing (NLP) may be applied. This frontier domain of BD and Artificial Intelligence (AI) is aimed at text extraction, preparation and analysis, and deals with human-computer language interaction (Devika et al., 2016). It is applied in such areas as spam filtering, search recommendations and chat bots. One of the NLP subdomains, Sentiment Analysis (SA), is specifically designed to work with attitudes within textual data (Pozzi, 2017).
How can SA benefit business? SA of UGC allows extracting knowledge about customers’ attitudes and thus making efficient data-driven decisions on brands’ digital marketing activities (Sheng et al., 2017; Rambocas & Pacheco, 2018). Implementing established NLP practices of this type can be beneficial for any type of company. For E-Commerce companies, integrating such SA types as opinion mining or polarity classification into the marketing process may be used for online evaluation of customer satisfaction and for a better understanding of consumers and the market (Nassirtoussi et al., 2014).
Research on SA of the English language is comprehensive and includes numerous studies on various SA tasks (Devika et al., 2016; Mantyla et al., 2018). Research on SA of the Russian language is more limited in its variety and is concentrated on sentiment lexicon generation (Klekovkina & Kotelnikov, 2012; Rubtsova, 2013; Rubtsova, 2015), opinion search and retrieval (Kravchenko, 2012) and polarity classification (Kotelnikov, 2012; Loukachevitch et al., 2015). However, when it comes to more business-oriented research with actionable outcomes of SA tasks in the Russian language, the number of studies is very limited (Ermakov, 2009; Polyakov et al., 2012; Kirilenko & Stepchenkova, 2017). Moreover, based on an overview of 300+ articles and conference presentations on SA of the Russian language from the last 7 years (Dialog-21, 2012-2017; ROMIP, 2010-2015; RUSSIR, 2016-2017), it is reasonable to claim that there is no business-oriented research related to the application of such SA tasks as subjectivity and polarity classification in E-Commerce companies. Because SA algorithms are tailored to a specific language, given the complexity of handling the many lexical variations and errors introduced by the people generating content (Tellez et al., 2017), research on applications of SA of the English language in E-Commerce cannot be seamlessly applied to SA of the Russian language in E-Commerce.
All of the abovementioned facts indicate that there is a research gap, which this master’s thesis aims to fill. The research gap lies in the lack of empirical studies on applications of polarity classification of the Russian language that are beneficial for managers of E-Commerce companies.
To address the stated research gap, the following research objectives were formulated:
• To review what knowledge of Customers’ Attitude (CA) may help E-Commerce companies and how it may be extracted from UGC on SNS;
• To review the applicability of BD and DM approaches in UGC collection and analysis;
• To become acquainted with NLP as a toolkit for textual data analysis;
• To become acquainted with SA fundamentals, types (with a focus on subjectivity and polarity classification), models and the value it brings to E-Commerce companies;
• To review cutting-edge studies on SA of the English and Russian languages along with multilingual SA (with a focus on polarity classification);
• To review research on applications of polarity classification of the English and Russian languages (with a focus on applications in E-Commerce);
• To review different models of polarity classification of the English and Russian languages and how their performance baselines are measured;
• To identify the criteria for choosing the most efficient models of polarity classification applicable to the Russian language and E-Commerce business.
The following research goal was stated:
• To create and test a polarity classification model that allows managers of Russian E-Commerce companies to extract additional knowledge about customers’ attitudes towards their companies from user-generated content.
To achieve this goal, the following research questions were stated:
First, other types of Twitter data may be used for managerial purposes. Along with textual data, Twitter contains several other types of data. Geospatial data may be used to perform precise segmentation and solve problems at the level of individual stores. Numerical data on user behavior on SNS (number of followers, number of followings, number of tweets) may be used to identify relationships between users, influencers and patterns of information spread within these internal ‘friendship’ networks. Attached images and videos, which usually either show the Customers’ Attitude or show some real-life situation within stores, may be analyzed to gain additional information about customers. Meta-level features can be extracted for the same purposes.
Second, the current model can be improved with more technically complex and sophisticated approaches to the SA modeling steps. For lexicon-based polarity classification, the overall accuracy of the model may be improved by integrating wider or domain-specific dictionaries into the training data. In addition, lexicons that have shown high results in numerous studies, such as MPQA, Bing Liu’s and LIWC, may be translated and used as training data. For ML/DL-based polarity classification models (which proved to be the most accurate), every step of the modelling process may be improved. At the feature selection step, a larger training corpus (e.g., a corpus closed for private use, such as the Russian National Corpus, 2018) may be used to achieve higher accuracy. For ensemble models, additional rules may be added to make feature selection more precise. At the weight assignment step, sentiment-labeled lexicons, such as the Twitter Sentiment Analysis Dataset, may be translated and integrated into the training corpora. Preliminary subjectivity classification may be improved and trained not only to separate tweets into subjective and objective, but also to mine opinions from objective tweets and to test whether they may be useful for managers. Aspect extraction may be improved to not only identify the aspects themselves, but also automatically summarize (or cluster and summarize) the Customers’ Attitudes towards the selected aspects.
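To make the lexicon-based direction more concrete, a minimal sketch is given below. It is an illustration only and not the model built in this thesis: the toy lexicon, its weights, the negation rule and the function names are all assumptions introduced for this example. Tweets with no lexicon match are labeled ‘objective’, which roughly stands in for a subjectivity step before polarity is assigned.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight. A real study would load
# a full Russian sentiment dictionary instead of these illustrative entries.
LEXICON = {
    "отлично": 1.0,
    "хороший": 1.0,
    "быстро": 0.5,
    "плохо": -1.0,
    "ужасно": -1.0,
    "долго": -0.5,
}
# Assumed negation particles that flip the polarity of the next word.
NEGATIONS = {"не", "нет"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def classify(text, lexicon=LEXICON):
    """Return (label, score) for one tweet.

    Tweets without any lexicon hit are labeled 'objective'; otherwise the
    sign of the summed weights gives the polarity.
    """
    tokens = tokenize(text)
    score, hits = 0.0, 0
    for i, token in enumerate(tokens):
        if token not in lexicon:
            continue
        weight = lexicon[token]
        # Flip the weight when the previous token is a negation particle.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            weight = -weight
        score += weight
        hits += 1
    if hits == 0:
        return "objective", 0.0
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score
```

Under these assumptions, integrating a wider or domain-specific dictionary amounts to passing an extended mapping, for example `classify(text, lexicon={**LEXICON, "брак": -1.0})`, where the added entry is again purely hypothetical.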
Third, with deeper research on online consumer behavior and customers’ psychographic characteristics, it is possible to advance SA and retrieve more relevant and sophisticated data on customers’ shopping intentions and motives. With a more profound psychology-based approach, consumer behavior may be interpreted in a more relevant manner. For example, with a better understanding of the interrelation between building trust and increasing purchase intention, it is possible to use SA as a tool to measure customers’ trust towards a brand and find ways to enhance it.
Fourth, there are a number of research topics within Sentiment Analysis that are closely related to polarity classification itself, for example, ‘fake opinion’ detection, slang preprocessing and automatic handling of grammatical errors (Tubishat et al., 2018). The most problematic ones are related to the extraction of implicit data. Theoretically, all this implicit data may be extracted with a more advanced approach to topic modeling. Along with this, such phenomena within text as sarcasm and hate may be detected more precisely and interpreted more correctly.
1. Agarwal, A., Xie, B., Vovsha, I., Rambow, O., Passonneau, R. (2011). Sentiment Analysis of Twitter Data. In: Proceedings of the Workshop on Languages in Social Media.
2. Aguwa, C. (2017). Modeling of fuzzy-based voice of customer for business decision analytics. Knowledge-Based Systems, vol. 125: 136-145.
3. AITC. (2017). M.Video online sales showed record growth in the last two years. Retrieved from http://www.akit.ru/онлайн-продажи-м-видео-показали-рек/.
4. AITC. (2018). Russian E-Commerce market volume exceeded 1 trillion rubles. Retrieved from http://www.akit.ru/оборот-российского-рынка-интернет-ри/.
5. Amado, A., Cortez, P., Rita, P., & Moro, S. (2018). Research Trends on Big Data in Marketing: A text mining and topic modeling based literature analysis. European Research on Management and Business Economics, 24(1): 1-7.
6. Anastasyev, D., Andrianov, A., Indenbom, E. (2017). Part-of-Speech tagging with rich language description. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2017”.
7. Arefyev, N. (2015). Evaluating Three Corpus-based Semantic Similarity Systems for Russian. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2015”.
8. Arkhipenko, K., Kozlov, I., Trofimovich, J., Skorniakov, K., Gomzin, A., Turdakov, D. (2016). Comparison of neural network architectures for sentiment analysis of Russian tweets. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2016”.
9. Baccianella, S., Esuli, A., Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC ’10.
10. Baek, H., Ahn, J., & Choi, Y. (2012). Helpfulness of online consumer reviews: Readers' objectives and review cues. International Journal of Electronic Commerce, 17(2): 99-126.
11. Bahtar, A., Muda, M. (2016). The Impact of User-Generated Content (UGC) on Product Reviews towards Online Purchasing - A Conceptual Framework. Procedia Economics and Finance, vol. 37: 337-342.
12. Balahur, A., & Perea-Ortega, J. M. (2015). Sentiment analysis system adaptation for multilingual processing: The case of tweets. Information Processing & Management, 51(4): 547-556.
13. Bao, H., Li, Q., Liao, S. S., Song, S., & Gao, H. (2013). A new temporal and social PMF-based method to predict users' interests in micro-blogging. Decision Support Systems, 55(3): 698-709.
14. Barbosa, L., Feng, J. (2010). Robust sentiment detection on Twitter from biased and noisy data. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10: 36-44.
15. Benko, V., Zakharov, V. (2016). Very Large Russian Corpora: New Opportunities and New Challenges. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2016”.
16. Bifet, A., Holmes, G., Pfahringer, B., & Gavalda, R. (2011). Detecting sentiment change in Twitter streaming data.
17. Chamlertwat, W., Bhattarakosol, P., Rungkasiri, T., Haruechaiyasak, C. (2012). Discovering consumer insight from twitter via sentiment analysis. Journal of Universal Computer Science, 18(8): 973-992.
18. Chang, V. (2017). A proposed social network analysis platform for big data analytics. Technological Forecasting and Social Change.
19. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R. (1999). CRISP-DM 1.0. 1-76.
20. Chen, C. P., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275: 314-347.
21. Chetviorkin, I., Loukachevitch, N. (2013). Evaluating Sentiment Analysis Systems in Russian. In: Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing: 12-17.
22. ComNews. (2017). Retrieved from https://www.comnews.ru/digital-economy/content/110448/opinions/2017-11-13/mvideo-onlayn-blokcheyn-i-made-russia.
23. Coussement, K., & Van den Poel, D. (2009). Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers. Expert Systems with Applications, 36(3): 6127-6134.
24. Data Insight. (2017). Top-100 Russian E-Commerce companies. Retrieved from http://datainsight.ru/top100/.
25. Dean, J. (2014). Big Data, Data Mining and Machine Learning. New Jersey, Wiley & Sons.
26. Deng, S., Sinha, A. P., & Zhao, H. (2017). Adapting sentiment lexicons to domain-specific social media texts. Decision Support Systems, 94: 65-76.
27. Devika, M., Sunitha, C., Ganesh, A. (2016). Sentiment Analysis: A comparative study on different approaches. Procedia Computer Science, vol. 87: 44-49.
28. Dialog-21. (2012-2017). Retrieved from http://www.dialog-21.ru/.
29. Ding, A., Li, S., Chatterjee, P. (2015). Learning User Real-Time Intent for Optimal Dynamic Web Page Transformation. Information Systems Research, vol. 26, issue 2: 339-359.
30. Du, J., Xu, H., & Huang, X. (2014). Box office prediction based on microblog. Expert Systems with Applications, 41(4): 1680-1689.
31. Dubatovka, A., Kurochkin, Yu., Mikhailova, E. (2016). Automatic Generation of the Domain-Specific Sentiment Russian Dictionaries. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2016”.
32. E-Commerce Foundation. (2016). Russia B2C E-Commerce Report. Retrieved from http://www.ecommercefoundation.org/.
33. Engel, J. F., Blackwell, R. D., & Miniard, P. W. (1995). Consumer behavior (8th ed.). New York: Dryden.