OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation

Finding social influencers is a fundamental task in many online applications, ranging from brand marketing to opinion mining. Existing methods heavily rely on the availability of expert labels, whose collection is usually a laborious process even for domain experts. Using open-ended questions, crowdsourcing provides a cost-effective way to find a large number of social influencers in a short time. Individual crowd workers, however, only possess fragmented knowledge that is often of low quality. To tackle these issues, we present OpenCrowd, a unified Bayesian framework that seamlessly incorporates machine learning and crowdsourcing for effectively finding social influencers. To infer a set of influencers, OpenCrowd bootstraps the learning process with a small number of expert labels and then jointly learns a feature-based answer quality model and the reliability of the workers. Model parameters and worker reliability are updated iteratively, allowing their learning processes to benefit from each other until an agreement on the quality of the answers is reached. We derive a principled optimization algorithm based on variational inference with efficient updating rules for learning OpenCrowd parameters. Experimental results on finding social influencers in different domains show that our approach substantially improves the state of the art by 11.5% AUC. Moreover, we empirically show that our approach is particularly useful in finding micro-influencers, who are directly engaged with smaller, niche audiences.


INTRODUCTION
Social influence is an important mechanism shaping the dynamics of social networks. Social influencers are users who regularly produce authoritative or novel content on specific topics and who can reach and engage a potentially large group of followers. Finding social influencers has become a fundamental task in many online applications, ranging from brand marketing [34,43] and opinion mining [20,28] to expert finding for question answering [32], minimizing misinformation propagation [37], and analyzing presidential elections [6].
The task of finding social influencers is challenging due to the complexity of quantifying user engagement, the subjectivity in perceiving social influence, and the need for expert knowledge in determining the authenticity of user-generated content. Existing techniques mainly tackle this problem with supervised machine learning approaches that rely on a training set hand-labeled by domain experts [9,19,46]. While models trained in this fashion are effective at finding social influencers similar to those in the training data, they are intrinsically limited by the availability of expert labels, which are typically very hard to gather. As an example, our collaboration with the largest European fashion retailer 1 reveals that an expert can recognize no more than 200 fashion influencers on Twitter over a 3-week period. Finding social influencers is, therefore, a long and usually laborious process even for domain experts [16].
Compared to an individual expert, online crowds possess as a whole a broader knowledge of social influencers in several domains, e.g., fashion, fitness, or information technology. As an example, while it is generally difficult for an expert to come up with a long list of fashion influencers in a short period of time, it is much easier to obtain such a list by asking online workers.

Figure 1: The OpenCrowd framework (a) creates a worker-answer matrix from workers' answers; (b) extracts social features for the candidate influencers (i_1, ..., i_4); (c) uses the social features and the expert labels to train an answer quality model and estimate unknown labels; (d) (E-step) uses both the worker-answer matrix and the labels generated by the answer quality model to estimate worker reliability and candidate influencer labels; (M-step) uses the new candidate influencer labels (influencer quality) to retrain the answer quality model; and (e) generates labels for the unlabeled candidate influencers.

We therefore advocate a human computation approach that crowdsources the task of finding social influencers in the form of open-ended question-answering, a popular and crucially important, yet severely understudied class of crowdsourcing [29]. Specifically, we consider a task where the crowd is asked to name as many social influencers as possible in a predefined domain. By aggregating the answers from a large number of crowd workers, we can collect the identities (e.g., Twitter usernames) of a large number of social influencers in an efficient and cost-effective manner. Despite its obvious benefits, aggregating answers from open-ended crowdsourcing campaigns is challenging: individual crowd workers may only possess fragmented knowledge that is of low quality. Unlike Boolean crowdsourcing, where crowd workers are asked to classify an existing closed pool of data instances into predefined classes, open-ended crowdsourcing results in open pools of answers, often of large size, that were all deemed relevant by crowd workers. The input data for open-ended answers aggregation is, therefore, a positive-only worker-answer matrix, where each entry indicates the "given by" relationship between an answer and a certain worker, as illustrated in Figure 1. This comes in contrast to the input data for aggregating answers from Boolean crowdsourcing, where each entry indicates a class (e.g., 0 or 1 in the binary case) assigned by a worker to a data instance. As a consequence, existing answers aggregation methods [10,47,48,53], which are designed to leverage the disagreement between workers' answers, do not yield good performance for open-ended answers aggregation (cf. Section 5).
To address the problem of open-ended answers aggregation, we introduce a human-AI collaborative approach that integrates machine learning and crowdsourcing. We present OpenCrowd, a Bayesian framework that models the true label of a candidate influencer as dependent on both the features of the candidate and the reliability of the workers who named the candidate. To infer the true labels, OpenCrowd leverages a small number of expert labels to bootstrap the inference process.
It then jointly learns a feature-based model for the quality of the answers and the reliability of the crowd workers. The model parameters and worker reliability are updated in an iterative manner, allowing their learning processes to benefit from each other until an agreement on answer quality is reached. The overall learning process is illustrated in Figure 1. We formalize this learning process with a principled optimization algorithm based on variational expectation-maximization. In particular, we derive updating rules that allow both model parameters and worker reliability to be updated incrementally at each iteration. By doing so, OpenCrowd parameters can be learned efficiently, with little extra computational cost compared to training a feature-based answer quality model alone.
To the best of our knowledge, we are the first to adopt a human-AI collaborative approach for finding social influencers. Our proposed framework is generic and can incorporate any machine learning model with crowdsourcing. Moreover, as it solicits contributions directly from crowd workers, the framework is effective in finding a particular type of influencers known as "micro-influencers" [2,49]. These social influencers are deeply connected to specific niche audiences and are thus able to effectively deliver messages to a highly relevant audience. Unlike macro-influencers, who have a huge number of followers (e.g., millions), micro-influencers often have relatively few followers, yet they enjoy a more trustworthy reputation (e.g., a higher conversion rate in product promotion) and a more direct relationship with their followers.
In summary, we make the following key contributions:
• We propose OpenCrowd, a Bayesian framework for finding social influencers through open-ended answers aggregation;
• We derive an efficient learning algorithm based on variational inference with incremental updating rules for OpenCrowd parameter estimation;
• We conduct an extensive evaluation on two domains, fashion and information technology, and show that OpenCrowd substantially improves the state of the art by 11.5% AUC.

RELATED WORK
In this section, we first review existing answers aggregation methods applicable to finding social influencers. Then, we discuss metric-based and feature-based techniques proposed to solve this problem.

Answers Aggregation
Influence is defined as "the power of producing an effect on the character or behavior of someone" (Oxford Dictionary). This concept is intrinsically difficult to quantify, especially in a large-scale context. Answers aggregation provides an efficient and cost-effective way to identify a large number of social influencers. Aggregation methods have been developed mainly for Boolean crowdsourcing. Typical methods include majority voting [35] and methods based on expectation-maximization (EM), which simultaneously estimate the true labels and parameters related to the annotation process, such as worker reliability and task difficulty [53]. Dawid and Skene [10] make a seminal contribution by proposing to model a worker's reliability with a confusion matrix for answers aggregation. Demartini et al. [11] address a similar problem while modeling worker reliability as a scalar parameter, which can be less expressive but more robust for highly sparse worker-answer matrices, as we discuss in our experiments. Whitehill et al. [48] introduce a similar method and further propose to model task difficulty in addition to worker reliability. Closest to our method is LFC, proposed by Raykar et al. [31], which models worker reliability as a latent variable with a prior distribution and is thus capable of quantifying the uncertainty of the inference. Unlike these techniques, our proposed framework further incorporates existing labels and social features, thus extending the applicability of answers aggregation to open-ended tasks.
While little work has focused on open-ended answers aggregation [29], some techniques consider features of answers or tasks. A seminal work by Welinder et al. [47] considers the implicit dimensions of worker expertise and task domains and proposes a probabilistic model where such dimensions are modeled as feature vectors of tasks to be learned from the worker-answer matrix. A similar line of work takes advantage of explicit task features to learn such dimensions [12,22,52]. Ma et al. [22] propose a joint model that captures the task domain from textual descriptions and worker reliability from the worker-answer matrix. Zheng et al. [52] further consider external knowledge bases to better capture task domains. Fan et al. introduce iCrowd [12], which measures the topical similarity of tasks by employing topic modeling techniques (e.g., LDA [5]) and leverages such similarity for a better estimate of worker reliability. A similar idea is investigated by Lakkaraju et al. [18], who also consider the similarity among workers based on their features. All these works, however, rely on unsupervised topic models or ad-hoc modeling of specific features to help estimate worker reliability. Our proposed framework is different in that it integrates crowdsourcing with supervised models that can consume any features.
Human-AI Collaboration. Our work is related to the emerging paradigm of human-AI collaboration arising at the intersection of human computation and machine learning [44]. Human computation has been used to enhance machine learning systems by generating data [50,53] before model training, or by providing interpretations for model decisions [33] and debugging the system or the data [26,51] after model training. Typically, human computation and machine learning are treated as disentangled processes. Our approach provides a way to deeply integrate human and machine intelligence in a Bayesian framework, where human characteristics (i.e., reliability) and model parameters are iteratively inferred in a mutually boosting manner until the decisions from aggregated human answers and those from the model agree with each other. Our work can be seen as a development of the "learning-from-crowds" line of research [31,41,50], which considers the machine learning problem in the context of noisy labels contributed by the crowd. Unlike existing work, our framework is generic in that it does not assume any particular type of machine learning model, and is thus applicable to a wider range of problems and application domains.

Social Influencer Finding
Existing methods for finding social influencers can be categorized into two classes: metric-based and feature-based. Common metrics for identifying social influencers include the number of followers, the number of mentions, and the ratio of the number of comments/likes to the number of followers [15,17]. These metrics are often insufficient to fully capture the degree of influence because of the difficulty of measuring content authenticity and engagement with the audience. The latter dimension, i.e., engagement, has become a key consideration with the industry's shift of focus from finding macro-influencers (including celebrities) to micro-influencers [2,49].
An alternative approach to finding social influencers is machine learning, which can detect influencers by weighting a large number of social features. Existing work has considered a variety of social features, including metadata features such as the number of followers and followees [9,19] and the number of retweets and mentions [8], semantic features such as the topics of a candidate influencer's microposts [32,46], and features derived from user behavioral data such as the activeness of a candidate influencer in online activities [1,19]. In addition, several pieces of work consider a specific type of feature, namely the structure of the social network among the influencers and other online users, to improve the accuracy of finding social influencers [13,21,27,30,39,40]. For instance, Tang et al. [39,40] propose to find influencers as the nodes from which the spread of information is maximized. Qiu et al. [30] adopt a deep learning framework where the network embedding and some user-specific features are fed into a deep neural network for predicting social influencers. Bi et al. [3] introduce a model that incorporates the content of tweets and the followee distribution of microblogs. Similarly, Pal et al. [27] propose a probabilistic clustering method that produces a ranked list of influencers using node degree, information diffusion, and metrics related to tweet content.
A less discussed aspect of the machine learning approach is the creation of training examples, which is generally performed by experts through manual screening. The process involves careful examination of content quality and feed consistency, and estimation of the rate of high-quality interactions with the audience. Such a process does not scale to a large number of influencers. Unlike existing methods, our proposed framework only requires a small number of expert labels and shifts the burden of label creation to online workers through crowdsourcing, which is fast, scalable, and cost-effective.

PROBLEM FORMULATION
In this section, we first introduce the notations used in the paper and then formally define our problem.
Notations. We use boldface lowercase letters to denote vectors and boldface uppercase letters to denote matrices. For an arbitrary matrix M, we use M_{i,j} to denote the entry at the i-th row and j-th column. We use capital letters (e.g., P) in calligraphic math font to denote sets. We use x ∝ y to denote that the two variables x and y are proportionally related, i.e., x = ky, where k is a constant. Table 1 summarizes the notations used throughout this paper. We denote the set of unique candidate social influencers named by the crowd workers as I and the set of workers as J. We use A_{i,j} = 1 to denote that candidate influencer i is an answer provided by worker j, and A_{i,j} = 0 otherwise. Because an individual worker can only provide a limited number of candidate influencers, A is a sparse matrix where only a small proportion of the entries are non-zero. For each candidate influencer i ∈ I, we collect her social features as described in detail in Section 5.1 and denote the resulting feature vector by x_i. The subset I_L ⊂ I contains the candidate influencers associated with expert labels y_i (for i ∈ I_L). Note that we only have a relatively small number of candidate influencers with expert labels, namely |I_L| ≪ |I|, and that we aim at estimating the true labels of the candidate influencers in I \ I_L.
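As a concrete illustration, the following minimal sketch (in Python, with hypothetical worker names and usernames) builds the sparse positive-only worker-answer matrix A from raw open-ended answers:

```python
# A sketch of building the positive-only worker-answer matrix A, where
# A[i, j] = 1 means worker j named candidate i; names are hypothetical.
import numpy as np
from scipy.sparse import csr_matrix

answers = {                       # worker id -> set of named Twitter usernames
    "worker_1": {"@alice", "@bob"},
    "worker_2": {"@bob", "@carol", "@dave"},
    "worker_3": {"@alice"},
}

workers = sorted(answers)                              # J
candidates = sorted(set().union(*answers.values()))    # I (unique candidates)
w_idx = {w: j for j, w in enumerate(workers)}
c_idx = {c: i for i, c in enumerate(candidates)}

pairs = [(c_idx[c], w_idx[w]) for w, named in answers.items() for c in named]
rows, cols = zip(*pairs)
A = csr_matrix((np.ones(len(pairs)), (rows, cols)),
               shape=(len(candidates), len(workers)))  # |I| x |J|, sparse
print(A.toarray())
```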
Problem Definition. Let I be a set of candidate social influencers and I_L the subset labeled by experts. Let also J be the set of workers who collectively nominated I, where each candidate influencer can be named by a different number of workers. We aim at inferring the true labels z_i ∈ {0, 1} for all candidate influencers in I \ I_L.
Note that in an open-ended answers aggregation setting, we do not control the number of answers provided by each worker [29]. Hence, the number of workers relevant to a candidate influencer can vary from one to many, rendering the aggregation task highly challenging. This comes in contrast to the conventional crowdsourcing setting, where the number of workers is usually fixed for every data instance (e.g., five workers per instance), which simplifies answers aggregation that relies on worker disagreement.

THE OPENCROWD FRAMEWORK
OpenCrowd is a unified Bayesian framework that incorporates both supervised learning and crowdsourcing for identifying true social influencers via open-ended answers aggregation. In this section, we first describe the model and then present our variational inference algorithm for learning OpenCrowd parameters.

OpenCrowd as a Generative Model
We represent the generative process of answers as conditioned on both the true labels of the answers and the reliability of the workers. We model the true label of a candidate influencer z_i ∈ {0, 1} with a Bernoulli distribution:

    p(z_i | x_i; W_I) = Bernoulli(z_i | θ_i),    θ_i = σ(f(x_i; W_I)),    (1)

where θ_i is the parameter of the distribution predicted from the social features of the candidate influencer through a feature-based answer quality model, denoted by f(·); W_I is the set of the model parameters; σ(·) is the sigmoid function. We treat f(·) as a generic function that can be instantiated with any supervised learning model, be it a linear model or a neural network. Note that W_I is shared across all candidate influencers [14], which allows us to exploit the similarity among candidate influencers.

We represent worker reliability as r_j ∈ [0, 1] (j ∈ J), where r_j = 1 indicates that the worker is fully reliable and r_j = 0 the opposite. In practice, we would like a measure of confidence when estimating the reliability of workers who provide different numbers of answers: we should be more confident in the estimated reliability of a worker who provides 50 answers than of one who provides only 5. To quantify this confidence, we adopt a Bayesian treatment of r_j by introducing a prior, thus modeling r_j as a latent variable. Given that r_j is a continuous variable in [0, 1], we choose a Beta distribution as its prior:

    p(r_j) = Beta(r_j | A, B),    (2)

where A and B are the parameters of the distribution. The incorporation of confidence makes our framework more robust to overfitting, as we show later in Section 5.2.

We now define the likelihood of a worker j naming a candidate influencer i as the probability conditioned on the worker's reliability r_j and the true label of the candidate z_i:

    p(A_{i,j} | z_i, r_j) = r_j^{1[A_{i,j} = z_i]} (1 − r_j)^{1[A_{i,j} ≠ z_i]},    (3)

where 1[·] is an indicator function returning 1 if the statement is true and 0 otherwise. Note that Eq. (3) considers a worker to be reliable if she does not name a candidate influencer who is indeed not a real influencer. It is, however, likely that a worker does not name a candidate influencer i simply because she did not think of i. This means we can only partly treat the non-named candidate influencers as those the worker considers non-influencers. It is, therefore, necessary to introduce negative sampling into the inference algorithm.
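For concreteness, the following minimal sketch (in Python) simulates the generative process of Eqs. (1)-(3), with a linear model standing in for the generic answer quality model f(·); all feature values and prior parameters are hypothetical.

```python
# A sketch of the generative process (Eqs. (1)-(3)); a linear model stands
# in for the generic answer quality model f(.), and all values are toy.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

W_I = rng.normal(size=5)           # shared parameters of f(.)
x_i = rng.normal(size=5)           # social features of candidate i

theta_i = sigmoid(W_I @ x_i)       # Eq. (1): theta_i = sigma(f(x_i; W_I))
z_i = rng.binomial(1, theta_i)     # true label z_i ~ Bernoulli(theta_i)

A_prior, B_prior = 2.0, 2.0        # hypothetical Beta prior parameters
r_j = rng.beta(A_prior, B_prior)   # Eq. (2): reliability r_j ~ Beta(A, B)

def likelihood(a_ij, z_i, r_j):
    """Eq. (3): probability of the observed entry A[i, j] given z_i, r_j."""
    return r_j if a_ij == z_i else 1.0 - r_j

print(theta_i, z_i, r_j, likelihood(1, z_i, r_j))
```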
Negative Sampling. Negative sampling consists in taking a random sample of the candidate influencers not nominated by a worker and treating them as her implicit answers of non-influencers. Such negative samples are useful to improve worker reliability inference, as we show in our experiments in Section 5.4. For each worker, we consider the candidate influencers she named and the negatively sampled ones as the candidate influencers relevant to her. Similarly, to estimate the quality of a candidate influencer, we consider not only the workers who nominated the influencer but also those whose negatively sampled answers contain the influencer.
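A minimal sketch of per-worker negative sampling, assuming a dense 0/1 worker-answer matrix A and a hypothetical sampling rate s_rate (discussed in Section 5.1):

```python
# A sketch of per-worker negative sampling; A is a (candidates x workers)
# 0/1 matrix and s_rate is the sampling rate (both hypothetical here).
import numpy as np

def negative_sample(A, j, s_rate, rng):
    """Indices of candidates treated as worker j's implicit non-influencers."""
    positives = np.flatnonzero(A[:, j])                 # candidates she named
    not_named = np.setdiff1d(np.arange(A.shape[0]), positives)
    n_neg = min(int(s_rate * len(positives)), len(not_named))
    return rng.choice(not_named, size=n_neg, replace=False)

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(8, 3))                     # toy worker-answer matrix
print(negative_sample(A, j=0, s_rate=1.0, rng=rng))
```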
The overall OpenCrowd framework is depicted in Figure 2. Model learning consists of parameter learning for W_I and posterior inference for the latent variables z_i and r_j.

Variational Inference for OpenCrowd
Learning the parameters of OpenCrowd amounts to maximizing the following likelihood function:

    p(A | X; W_I) = ∫∫ p(A | z, r) p(z | X; W_I) p(r) dz dr,    (4)

where z and r are the latent true labels of all candidate influencers and the reliabilities of all workers, respectively, and X denotes the social features of all candidates. Eq. (4) consists of an integral over two latent variables, rendering it computationally infeasible to optimize directly [42]. Instead, we consider the log of our likelihood function, i.e.,

    log p(A | X; W_I) = L(W_I, q) + KL( q(z, r) ∥ p(z, r | A, X; W_I) ),
    L(W_I, q) = E_{q(z,r)}[ log( p(A, z, r | X; W_I) / q(z, r) ) ],    (5)

where q(z, r) is any probability density function and KL(·∥·) is the KL-divergence between two distributions. By doing so, the two parts of the objective function can be optimized iteratively with a variational expectation-maximization method [42]. Specifically, we iterate between two steps: 1) the E-step, where we approximate the posterior of the latent variables p(z, r | A, X; W_I) with the variational distribution q(z, r) by minimizing the KL-divergence; and 2) the M-step, where we maximize the first term L(W_I, q) of Eq. (5) given the newly inferred latent variables.
E-step. We use the mean-field variational inference approach [4], assuming that q(z, r) factorizes over the latent variables:

    q(z, r) = ∏_{i∈I} q(z_i) ∏_{j∈J} q(r_j).    (6)

We further assume the following forms for the factor functions:

    q(z_i) = Bernoulli(z_i | θ_i),    q(r_j) = Beta(r_j | α_j, β_j),    (7)

where θ_i, α_j, and β_j are variational parameters optimized to minimize the KL-divergence. The latter can be minimized using coordinate ascent, where we update one factor while keeping all the others fixed, and iterate until convergence.
In the following, we derive the update rules for the variational distributions q(z_i) and q(r_j). We start with q(z_i). Let p(z_i | x_i; W_I) be the distribution of z_i predicted by the answer quality model from the last iteration. The KL-divergence in Eq. (5) can be easily simplified [4], by keeping only the terms that depend on z_i, to the following:

    log q(z_i) ∝ log p(z_i | x_i; W_I) + Σ_{j∈J_i} g_{q(r_j)}( p(A_{i,j} | z_i, r_j) ),    (8)

where J_i is the set of workers relevant to candidate influencer i and g_x(·) denotes the expectation term E_x[log(·)] with x being a variational distribution. Based on this equation, we show in the next lemma how to efficiently update q(z_i) using the feature-based answer quality model and the worker reliability parameters from the previous iteration.
Lemma 4.1 (Incremental Answer Quality). The true label distribution q(z_i) of a candidate influencer i can be incrementally updated from the output of the answer quality model θ_i and the worker reliability parameters α_j and β_j (j ∈ J_i) from the previous iteration:

    q(z_i = 1) ∝ θ_i ∏_{j∈J_i} exp( 1[A_{i,j} = 1] Ψ(α_j) + 1[A_{i,j} = 0] Ψ(β_j) − Ψ(α_j + β_j) ),    (9)

where Ψ(·) is the digamma function. For q(z_i = 0), θ_i is replaced with (1 − θ_i), and Ψ(α_j) and Ψ(β_j) are swapped.

Proof. We show the proof only for z_i = 1, since the proof for z_i = 0 follows similarly. Using Eq. (1), we have:

    p(z_i = 1 | x_i; W_I) = θ_i.    (10)

We substitute the probabilities p(z_i | x_i; W_I) and p(A_{i,j} | z_i, r_j) in Eq. (8) by their respective definitions in Eq. (10) and Eq. (3) and get:

    log q(z_i = 1) ∝ log θ_i + Σ_{j∈J_i} ( 1[A_{i,j} = 1] g_{q(r_j)}(r_j) + 1[A_{i,j} = 0] g_{q(r_j)}(1 − r_j) ).    (11)

By computing the geometric mean of the Beta distribution [23], we can evaluate the expectations g_{q(r_j)}(·) as follows:

    g_{q(r_j)}(r_j) = Ψ(α_j) − Ψ(α_j + β_j),    g_{q(r_j)}(1 − r_j) = Ψ(β_j) − Ψ(α_j + β_j).    (12)

Putting (12) into (11), the update equation can be simplified as:

    q(z_i = 1) ∝ θ_i ∏_{j∈J_i} exp( 1[A_{i,j} = 1] Ψ(α_j) + 1[A_{i,j} = 0] Ψ(β_j) − Ψ(α_j + β_j) ),    (13)

which concludes the proof. □

Next, we show how to efficiently update the variational distribution q(r_j). Let p(r_j) be the variational distribution of r_j from the last iteration, θ'_i the true label distribution from the current iteration, and I_j the set of all candidate influencers relevant to worker j. The KL-divergence in Eq. (5) can be simplified, similarly to Eq. (8), by keeping only the terms that depend on r_j, to get:

    log q(r_j) ∝ log p(r_j) + Σ_{i∈I_j} g_{q(z_i)}( p(A_{i,j} | z_i, r_j) ).    (14)

The following lemma shows how to solve Eq. (14) using an incremental updating rule.
Lemma 4.2 (Incremental Worker Reliability). The reliability distribution q(r_j) of worker j can be incrementally updated using the reliability parameters α_j and β_j from the last iteration and the true label distribution θ'_i from the current iteration:

    q(r_j) = Beta(r_j | α'_j, β'_j),
    α'_j = α_j + Σ_{i∈I_j} ( 1[A_{i,j} = 1] θ'_i + 1[A_{i,j} = 0] (1 − θ'_i) ),
    β'_j = β_j + Σ_{i∈I_j} ( 1[A_{i,j} = 1] (1 − θ'_i) + 1[A_{i,j} = 0] θ'_i ).    (15)

Proof. We replace the probability p(r_j) in Eq. (14) by the Beta distribution with parameters α_j and β_j from the previous iteration:

    log q(r_j) ∝ log Beta(r_j | α_j, β_j) + Σ_{i∈I_j} g_{q(z_i)}( p(A_{i,j} | z_i, r_j) ).    (16)

The expectation term in Eq. (16) can be evaluated as follows:

    g_{q(z_i)}( p(A_{i,j} | z_i, r_j) ) = θ'_i log r_j + (1 − θ'_i) log(1 − r_j),    if A_{i,j} = 1,
    g_{q(z_i)}( p(A_{i,j} | z_i, r_j) ) = (1 − θ'_i) log r_j + θ'_i log(1 − r_j),    if A_{i,j} = 0.    (17)

In the case when A_{i,j} = 1, we use the expressions from Eq. (17) to replace the second term in Eq. (16) as follows:

    log q(r_j) ∝ log Beta(r_j | α_j, β_j) + Σ_{i∈I_j} ( θ'_i log r_j + (1 − θ'_i) log(1 − r_j) ).    (18)

The probability density function of r_j's distribution is given by:

    Beta(r_j | α_j, β_j) = r_j^{α_j − 1} (1 − r_j)^{β_j − 1} / B(α_j, β_j),    (19)

where B(·, ·) is the Beta function. Putting (19) into (18), we get:

    q(r_j) ∝ r_j^{α_j − 1 + Σ_{i∈I_j} θ'_i} (1 − r_j)^{β_j − 1 + Σ_{i∈I_j} (1 − θ'_i)}.    (20)

Thus, if A_{i,j} = 1 we have:

    q(r_j) = Beta( r_j | α_j + Σ_{i∈I_j} θ'_i , β_j + Σ_{i∈I_j} (1 − θ'_i) ).    (21)

Following the same steps, we obtain the corresponding expression for the case A_{i,j} = 0, which concludes the proof. □

M-step. Given the true labels of the candidate influencers and the worker reliability inferred by the E-step, the M-step maximizes the first term of Eq. (5) to learn the model parameters:

    L(W_I, q) = E_q[ log p(A | z, r) + log p(r) ] + Σ_{i∈I} E_{q(z_i)}[ log p(z_i | x_i; W_I) ] + const. = M_1 + M_2 + const.,    (22)

where const. = E_q[ log( 1 / q(z, r) ) ] is a constant. Only the second part of L(W_I, q), i.e.,

    M_2 = Σ_{i∈I} ( q(z_i = 1) log θ_i + q(z_i = 0) log(1 − θ_i) ),    (23)

depends on the model parameters. M_2 is exactly the negative of the cross-entropy between q(z_i) and p(z_i | x_i; W_I), which is widely used as the loss function for many classifiers. M_2 can, therefore, be optimized using standard methods [25] (e.g., back-propagation in the case of a neural network).
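To make the two incremental updating rules concrete, here is a minimal sketch in Python/NumPy; the function names and array conventions are ours, not the paper's implementation:

```python
# Minimal sketches of the two incremental updates; function names and
# array conventions are illustrative, not the paper's implementation.
import numpy as np
from scipy.special import psi  # digamma, the Psi(.) above

def update_qz(theta_i, a_i, alpha, beta):
    """Lemma 4.1: q(z_i = 1) from the model output theta_i, the 0/1 entries
    a_i[j] for the workers relevant to i, and their Beta parameters."""
    e_log_r = psi(alpha) - psi(alpha + beta)       # E[log r_j], cf. Eq. (12)
    e_log_1mr = psi(beta) - psi(alpha + beta)      # E[log(1 - r_j)]
    log_p1 = np.log(theta_i) + np.where(a_i == 1, e_log_r, e_log_1mr).sum()
    log_p0 = np.log(1 - theta_i) + np.where(a_i == 1, e_log_1mr, e_log_r).sum()
    m = max(log_p1, log_p0)                        # normalize in log space
    return np.exp(log_p1 - m) / (np.exp(log_p1 - m) + np.exp(log_p0 - m))

def update_qr(alpha_j, beta_j, a_j, theta_prime):
    """Lemma 4.2: new Beta parameters for worker j from her 0/1 entries
    a_j[i] and the current label estimates theta_prime[i] = q(z_i = 1)."""
    agree = np.where(a_j == 1, theta_prime, 1 - theta_prime)
    return alpha_j + agree.sum(), beta_j + (1 - agree).sum()
```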

Algorithm
The overall optimization algorithm is given in Algorithm 1. It iterates over the E-step (rows 2-5) and the M-step (rows 6-7) until the objective function converges. In rows 2-3, we iterate through all candidate influencers and, for each candidate influencer i, update q(z_i) using Lemma 4.1. Similarly, in rows 4-5, we iterate through all workers and, for each worker j, update q(r_j) using Lemma 4.2. In rows 6-7, W_I is incrementally updated starting from its values in the previous iteration. Convergence is reached when the answer quality q(z_i) is no longer modified by the worker reliability from the previous iteration (Eq. (8)) and no longer updates the parameters of the answer quality model (Eq. (23)).
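Putting the pieces together, the following self-contained sketch mirrors the structure of Algorithm 1, reusing update_qz and update_qr from the sketch above. For brevity it treats every worker as relevant to every candidate (i.e., all non-named entries count as negatives) rather than performing per-worker negative sampling, and it instantiates f(·) as a logistic regression retrained on the soft labels via duplicated, weighted examples (a common workaround, since scikit-learn classifiers have no native soft-label loss):

```python
# A self-contained sketch mirroring Algorithm 1; simplifying assumptions
# are noted in the lead-in and comments. X: (candidates x features),
# A: dense 0/1 (candidates x workers), y_expert: labels for labeled_idx.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_opencrowd(X, A, y_expert, labeled_idx, n_iters=20):
    n_cand, n_work = A.shape
    alpha = np.full(n_work, 2.0)          # hypothetical Beta priors A, B
    beta = np.full(n_work, 2.0)
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled_idx], y_expert)   # bootstrap with expert labels
    for _ in range(n_iters):
        theta = model.predict_proba(X)[:, 1].clip(1e-6, 1 - 1e-6)
        # E-step: answer quality per candidate (Lemma 4.1) ...
        q = np.array([update_qz(theta[i], A[i], alpha, beta)
                      for i in range(n_cand)])
        q[labeled_idx] = y_expert         # keep known labels fixed
        # ... then worker reliability per worker (Lemma 4.2).
        for j in range(n_work):
            alpha[j], beta[j] = update_qr(alpha[j], beta[j], A[:, j], q)
        # M-step: retrain f(.) on the soft labels (cross-entropy, Eq. (23)).
        X2 = np.vstack([X, X])
        y2 = np.r_[np.ones(n_cand), np.zeros(n_cand)]
        model.fit(X2, y2, sample_weight=np.r_[q, 1 - q])
    return model, q, alpha, beta
```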
The iterations through the relevant workers for each candidate influencer (rows 2-3) require a time complexity of O(|A|), where |A| denotes the number of non-zero entries of A. Similarly, the time complexity of rows 4-5 is O(|A|). The overall complexity of the algorithm is, therefore, O(#iter × |A| + T_W), where #iter is the number of iterations of the variational inference algorithm and T_W is the complexity of learning W_I in a supervised learning setting.

EXPERIMENTS AND RESULTS
This section presents experimental results evaluating the performance of OpenCrowd 2 on two different domains, by comparing it against state-of-the-art Boolean and feature-based aggregation methods. In addition, we investigate the properties of our approach such as the impact of negative sampling on the performance. We start by introducing our experimental setup below before presenting the results of our experiments.

Experimental Setup
Crowdsourcing Task. We consider the problem of finding social influencers in two domains: fashion and information technology. For both domains, we published question-answering tasks on Figure Eight 3, asking workers to name social influencers they know. To set the context and prompt workers to reflect on their experience, we asked workers to assess their domain-specific knowledge (five-point scale), estimate how often they read social media posts from influencers (never, rarely, sometimes, always), and describe how they got to know the influencers. Workers name candidate influencers by providing their Twitter usernames, from which we retrieve social features (see "Social Feature Extraction"). 4 The task took 2 minutes to complete on average. Workers who completed the task received a reward of 30 cents (USD), with an additional bonus: they were paid 10 additional cents (up to 50 cents) for every social influencer they provided after naming 3 influencers.
Datasets. We collected two datasets of candidate influencers in two domains: Fashion and InfoTech. The sizes of the collected datasets are comparable to those of typical datasets for Boolean answers aggregation [36]. Key statistics of these datasets are reported in Table 2. Our manual analysis (see "Expert Assessment") revealed that 30.64% and 43.39% of the crowd answers designate true influencers for Fashion and InfoTech, respectively. The relatively large number of crowd answers collected in a short period of time (<10 hours for both Fashion and InfoTech) confirms our assumption that crowdsourced open-ended question-answering can drastically speed up data collection for finding social influencers. Moreover, the high sparsity of the answer matrices (Table 2) and the fact that the majority of the answers are incorrect substantiate the necessity of open-ended answers aggregation that takes the workers' reliability into account.

Expert Assessment. We conducted a series of interviews with experts from three leading companies 5 that connect brands to social influencers. We distilled four main characteristics of influencer assessments: authenticity, dedication, branding, and communication. Following their guidelines and examples, three of the authors randomly selected 40% of the candidate influencers and labeled them by manually examining their profile and content on Twitter. In more detail, a candidate was considered a real influencer whenever she: 1) tweets about a specific topic; 2) posts new content regularly; 3) keeps a consistent and unique style in her posts; and 4) communicates with her followers through comments (mostly for micro-influencers). The authors reached an initial agreement of over 80%. In cases of disagreement, they discussed until reaching a decision.
Social Feature Extraction. The features used in our framework are extracted from the Twitter accounts of the named candidate influencers. These features include metadata features, such as the number of followers, the number of followees, and the number of tweets, and semantic features, such as the topics of a candidate influencer's tweets. To extract the topics from the tweets, we first represent all tweets as a bag of words. Then, we apply a grid search in {5, 10, 20, 50, 100} to set a threshold on word frequency. For our experiments, we keep only the words that appear more than 20 times. We finally compute the TF-IDF scores of the constructed bag of words and use the scores, together with the other features, to train our answer quality model.
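The semantic feature pipeline can be sketched as follows (the tweet texts are hypothetical, and the frequency threshold is lowered so the toy corpus retains some words; the paper uses 20):

```python
# A sketch of the semantic feature extraction: bag of words over each
# candidate's tweets, a corpus-level frequency cutoff, then TF-IDF.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [                        # one concatenated string of tweets per candidate
    "new sneaker drop today new streetwear style",
    "kubernetes tips and a new devops post",
]

counts = CountVectorizer().fit_transform(docs)
word_freq = np.asarray(counts.sum(axis=0)).ravel()
MIN_FREQ = 1                    # the paper uses 20; lowered for this toy corpus
keep = np.flatnonzero(word_freq > MIN_FREQ)
tfidf = TfidfTransformer().fit_transform(counts[:, keep])
print(tfidf.toarray())
```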
Comparison Methods. Due to the lack of existing open-ended answers aggregation methods (cf. Section 2), we first compare against the following state-of-the-art closed-pool (Boolean) aggregation methods: 1) ZenCrowd [11], an expectation-maximization (EM) method that estimates worker reliability as a model parameter; 2) Dawid-Skene (DS) [10], an EM method that learns worker reliability as a confusion matrix; 3) GLAD [48], an EM method that simultaneously learns worker reliability and task difficulty; and 4) LFC [31], an EM method that incorporates priors in modeling worker reliability. Then, we compare against existing techniques that take task features into account for answers aggregation: 1) LFC_SoT [41], a statistical model that estimates both worker reliability and task clarity by clustering workers into groups; 2) CUBAM [47], a Bayesian probabilistic model that learns worker reliability and task domains as a feature vector from the worker-answer matrix; and 3) iCrowd [12], a crowdsourcing framework that considers the topical similarity of tasks based on their textual description for worker reliability inference. We compare all the EM-based methods in a semi-supervised setting by fixing the known labels in the EM algorithm [36] (see [38,45] for the case of semi-supervised DS). Then, in order to apply these methods to our problem, we use negative sampling to simulate a worker's answers of non-influencers by sampling the candidate influencers she does not name. We empirically determine the optimal sampling rate for each comparison method. Furthermore, for the techniques that model a task based on its textual description, we use the textual social features as input to model a candidate influencer.

Table 3: Performance (accuracy and AUC) comparison of aggregation techniques on two datasets with supervision degree s_deg from 50% to 90%. The best performance is highlighted in bold; the second best performance is marked by '*' for accuracy and by '+' for AUC.
To further investigate the benefits of taking the worker-answer matrix into account, we compare OpenCrowd against feature-based methods: logistic regression (LR) and a multi-layer perceptron (MLP). We define two variants of our framework: 1) OpenCrowd-EM, a variant that aggregates workers' answers but models worker reliability as a fixed parameter; and 2) OpenCrowd, our full framework, which models worker reliability as a latent variable.
Parameter Settings. The parameters of our framework and those for model training are set empirically. We search for the best model architecture for MLP, and for the predictor f in OpenCrowd-EM and OpenCrowd, with 0, 1, and 2 hidden layers, and apply a grid search in {64, 128, 256, 512, 1024} for the dimension of the hidden layers. For model training, we select learning rates from {0.0001, 0.001, 0.01, 0.1, 1} for learning W_I in all variants of our framework, as well as for learning r_j in OpenCrowd-EM. To investigate the impact of negative sampling, we experiment with sampling rates (s_rate) in {0, 0.1, 1, 10, 100}, where, e.g., s_rate = 10 indicates that for each worker the negative samples are ten times as many as the candidate influencers she named (i.e., deemed positive). For OpenCrowd, we set the priors A and B by sampling from a uniform distribution over [0, 10] and update them in the E-step according to Lemma 4.2.
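For reference, the search space described above can be written down as a simple grid (a sketch; the actual search selects the best configuration on the validation set):

```python
# A sketch of the hyperparameter grid described above.
from itertools import product

grid = {
    "hidden_layers": [0, 1, 2],
    "hidden_dim": [64, 128, 256, 512, 1024],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    "s_rate": [0, 0.1, 1, 10, 100],
}
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
print(len(configs), "candidate configurations")
```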
Evaluation Protocols. We split the labeled subset of candidate influencers into training, validation, and test sets. OpenCrowd is trained on the answers in the training set, tuned on the validation set, and evaluated on the test set. To investigate the impact of the degree of supervision (s_deg) on OpenCrowd's performance, we split the labeled subset by s_deg ∈ {50%, 60%, 70%, 80%, 90%}, where s_deg = 60% means that 60% of the labeled subset is used for training and the rest for validation and test with an equal split. We use accuracy and the area under the precision-recall curve (AUC) to measure performance. Higher values of accuracy and AUC indicate better performance. Note that, given the imbalanced classes in our datasets, accuracy is dominated by the results on the non-influencers; similarly, the area under the ROC curve would also be biased toward the non-influencers. In contrast, the AUC we use, the area under the precision-recall curve, is more indicative of performance in our context, as we are more interested in detecting real influencers among the workers' answers [7].
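The reported AUC can be computed as follows (a minimal sketch with hypothetical test labels and scores):

```python
# A sketch of the evaluation metric: area under the precision-recall
# curve, computed here on hypothetical test labels and scores.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

y_true = np.array([1, 0, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.5, 0.6])

precision, recall, _ = precision_recall_curve(y_true, scores)
print(f"PR-AUC: {auc(recall, precision):.3f}")
```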

Comparison to Boolean Aggregation
Table 3 summarizes the performance of Boolean answers aggregation methods on our two datasets under different supervision degrees. We make several observations. First, ZenCrowd outperforms DS and GLAD in terms of accuracy and has comparable performance in terms of AUC. Recall that ZenCrowd is less expressive than DS and GLAD, as it only models worker reliability as a scalar parameter. In comparison, DS models worker reliability as a confusion matrix, and GLAD further models task difficulty (in our context, the ambiguity of a candidate influencer being a true influencer). This result indicates that, in our context, more expressive models do not necessarily lead to higher performance, likely due to the high sparsity of the worker-answer matrices, which can easily lead to overfitting. Second, methods that model worker reliability as a latent variable with a prior distribution, namely LFC and our framework OpenCrowd, outperform the other methods. This confirms the necessity of modeling worker reliability as a latent variable, as it helps to account for the confidence in estimating model parameters, which is particularly important for improving model robustness on sparse datasets such as ours. We provide more results on this point in Section 5.4.

Most importantly, OpenCrowd achieves the best performance among all answers aggregation methods under comparison: it improves the state of the art by 6.94% accuracy and 62.06% AUC on Fashion, and by 17.56% accuracy and 33.54% AUC on InfoTech. This significant improvement clearly demonstrates the effectiveness of our framework in open-ended answers aggregation.

Impact of Supervision Degree. The supervision degree s_deg controls the number of observed labels in model training. We observe that the performance of our framework increases with s_deg, as measured by both accuracy and AUC. This is natural, as more labeled data provides more information for discriminating influencers from non-influencers. Such a pattern, however, is not observed for the other methods under comparison, likely because they do not take advantage of the social features, which serve as a means to propagate labels to non-labeled candidate influencers. These results show that OpenCrowd is better at utilizing existing labels for answers aggregation.
Robustness. The learning of OpenCrowd involves two sources of randomness: the random initialization of the parameters (e.g., W_I) and negative sampling. To investigate their impact on OpenCrowd's performance, we measure the standard deviation of its performance over 10 runs. The standard deviation in terms of accuracy is 0.017 and 0.018 on Fashion and InfoTech, respectively; in terms of AUC, it is 0.023 and 0.028 on Fashion and InfoTech, respectively. These standard deviations are small compared to the absolute accuracy and AUC values, and the result is consistent across different supervision degrees, signifying the robustness of OpenCrowd across runs.

Comparison to Feature-Based Aggregation
We now compare the performance of our method against feature-based aggregation techniques. Table 4 shows the results of this comparison in terms of accuracy and AUC on both the Fashion and InfoTech datasets (with a supervision degree of 60%). From these results, we make the following observations. Among the baselines, LFC_SoT achieves the best accuracy yet the lowest AUC. In fact, LFC_SoT cannot handle the case where some workers do not give an answer to some tasks and hence cannot properly support negative sampling. Since the worker-answer matrix is very sparse in our setting, LFC_SoT labels most candidate influencers as negative (more than 75%) and thus infers most true influencers to be non-influencers, which explains its low AUC. In contrast, iCrowd achieves better AUC than CUBAM and LFC_SoT. Recall that iCrowd takes into account the social features of candidate influencers and combines them with worker reliability to infer the truth, whereas CUBAM models task difficulty as a vector but relies solely on the worker-answer matrix. This result confirms the necessity of taking the social features into account to identify real influencers among the candidates.
Overall, OpenCrowd achieves the best performance among all feature-based aggregation methods: it outperforms the second best method by 3.7% accuracy and 16.27% AUC on Fashion and by 13.18% accuracy and 6.7% AUC on InfoTech (on average: 8.44% accuracy and 11.5% AUC). Unlike the baseline methods that do not use social features (e.g., CUBAM) or rely only on textual features (e.g., iCrowd), OpenCrowd is able to leverage any type of social features, including non-textual ones. More importantly, unlike the unsupervised topic modeling used by iCrowd, the supervised answer quality model in OpenCrowd learns from the labeled data the "weights" of social features, thereby making it better at influencer identification.

Properties of OpenCrowd
The comparison of the OpenCrowd variants against the feature-based methods (see "Comparison Methods") is shown in Figures 3(a,b). OpenCrowd-EM outperforms LR and MLP by 18.25% and 5.82% accuracy and by 13.69% and 18.05% AUC, respectively. These results show the importance of considering worker reliability when aggregating workers' answers. Among the two variants, OpenCrowd outperforms OpenCrowd-EM by 7.37% accuracy and 31.62% AUC. This result indicates that modeling worker reliability as a latent variable with a prior distribution not only makes the model more robust, but also improves aggregation performance.

Figure 5: Examples of workers and their nominated candidate influencers in the InfoTech domain. We show for workers the inferred reliability (Rel.) and inference confidence (Conf.), and for candidate influencers the predicted quality as well as the ground-truth labels. Profile pictures are from public data sources and are randomly assigned to anonymize user identities.
Impact of Sampling Rate. The sampling rate s_rate controls the number of randomly sampled candidate influencers used in estimating the workers' reliability. The results are shown in Figure 4. We observe that, as the sampling rate increases from 0 to 100, the performance first increases and then decreases. This trend is consistent on both datasets, measured by both accuracy and AUC. The optimal performance is reached at s_rate = 10 for Fashion and s_rate = 0.1 for InfoTech, indicating that candidates a worker does not name are more safely treated as negatives on Fashion than on InfoTech. Overall, the variation of performance across different values of s_rate indicates the importance of selecting the optimal sampling rate, while the similar performance variation across the two datasets again demonstrates the robustness of OpenCrowd.
Interpretation of Learning Results. The results of OpenCrowd can be explained in terms of the social features of candidate influencers and the correlation between worker answers. We show in Figure 5 the learning results for real-world examples of three workers and seven candidate influencers from the InfoTech dataset. We also show the mean and confidence (differential entropy [24]) of the worker reliability distribution (r_j), the predicted quality (θ_i), and the ground-truth labels of the candidate influencers. We observe that workers who name real influencers are inferred by OpenCrowd to have a high reliability, and a low reliability otherwise. For example, the three influencers named by worker j_1, who has the highest reliability, are all real influencers. Among them, candidates i_1 and i_2 clearly exhibit influencer characteristics, e.g., they have a large number of followers and tweets dedicated to InfoTech. These results indicate that our approach is able to correctly infer the reliability of workers by leveraging the social features of the candidate influencers they named. Thanks to worker reliability, micro-influencers with a smaller number of followers, such as candidate i_3, can also be successfully detected by our approach. We also observe that worker j_3 has the same high reliability as j_1, despite the fact that only one of the candidates she named (i_2) exhibits influencer characteristics. This is because OpenCrowd leverages the correlation between worker answers in reliability inference: i_2 is named by both workers j_1 and j_3. The difference between the numbers of answers provided by j_1 and j_3 is captured through the confidence measure: j_1 has a higher confidence than j_3. Most importantly, we observe that the high reliability inferred for j_3 helps to detect an additional micro-influencer she named, i.e., i_7. These results demonstrate that OpenCrowd can find micro-influencers through reliable workers, whose reliability can be inferred either through further named candidate influencers or through similar workers.

CONCLUSION
In this paper, we have presented OpenCrowd, a unified Bayesian framework that seamlessly incorporates machine learning and crowdsourcing for social influencer identification. Our framework aggregates open-ended answers while modeling both the quality of the workers' answers and their reliability. We derived a principled optimization algorithm based on variational inference with efficient incremental update rules for learning OpenCrowd parameters. Extensive validation on two real-world datasets shows that OpenCrowd is an effective and robust framework that substantially outperforms state-of-the-art answers aggregation methods. Results further show that our framework is particularly useful in finding micro-influencers by exploiting the social features and the correlation between worker answers.