Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A non-transitory computer-readable medium having stored thereon computer-readable instructions that when executed by a computing device cause the computing device to: compute a mean vector for each cluster of a plurality of clusters from a plurality of observation vectors, wherein the plurality of observation vectors includes a plurality of classified observation vectors and a plurality of unclassified observation vectors, wherein each observation vector of the plurality of observation vectors includes a value for each variable of the plurality of variables, wherein a target variable value is defined to represent a class for each respective observation vector of the plurality of classified observation vectors, wherein the target variable value is unlabeled for each respective observation vector of the plurality of unclassified observation vectors; compute an inverse precision parameter value for each cluster of the plurality of clusters from the plurality of observation vectors; initialize a responsibility parameter vector for each observation vector of the plurality of unclassified observation vectors, wherein the responsibility parameter vector includes a probability value of a cluster membership in each cluster of the plurality of clusters for each respective observation vector, wherein the plurality of unclassified observation vectors are distributed across a plurality of threads, and the responsibility parameter vector is initialized by each thread on which the plurality of unclassified observation vectors are distributed on each computing device of one or more computing devices; (A) compute beta distribution parameter values for each cluster using a predefined mass parameter value and the responsibility parameter vector; (B) compute parameter values for a normal-Wishart distribution for each cluster using a predefined concentration parameter value, a predefined degree of freedom parameter value, the computed mean vector, the computed inverse precision parameter value, each observation vector of the plurality of observation vectors, and each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors; (C) update each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors using the computed beta distribution parameter values, the computed parameter values for the normal-Wishart distribution, and a respective observation vector of the plurality of unclassified observation vectors; (D) compute a convergence parameter value; (E) repeat (A) to (D) until the computed convergence parameter value indicates the responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors is converged; determine a cluster membership for each observation vector of the plurality of unclassified observation vectors from the plurality of clusters using a respective updated responsibility parameter vector; and output the determined cluster membership for each observation vector of the plurality of unclassified observation vectors.
This invention relates to a machine learning system for clustering and classifying observation vectors, particularly in scenarios involving both labeled and unlabeled data. The system addresses the challenge of efficiently processing large datasets with mixed classification states using probabilistic models. The method computes mean vectors and inverse precision parameters for clusters derived from observation vectors, which include both classified (labeled) and unclassified (unlabeled) data. Each observation vector contains values for multiple variables, with a target variable indicating class membership for labeled data and remaining unlabeled for unclassified data. The system initializes responsibility parameter vectors for unclassified observations, representing cluster membership probabilities, and distributes these computations across multiple threads and computing devices. It iteratively refines cluster assignments by computing beta distribution parameters and normal-Wishart distribution parameters, updating responsibility vectors, and checking for convergence. The process repeats until cluster assignments stabilize, after which the system assigns final cluster memberships to unclassified observations and outputs the results. This approach leverages parallel processing to improve scalability and efficiency in clustering tasks.
2. The non-transitory computer-readable medium of claim 1, wherein the inverse precision parameter value is computed using $\Psi_{0,k}^{-1} = \rho\,\sigma_k^c + (1-\rho)\,\sigma_u$, $k = 1, 2, \ldots, K_{max}$, where $\Psi_{0,k}^{-1}$ is an inverse precision parameter matrix for a k-th cluster of the plurality of clusters, wherein the inverse precision parameter matrix includes the computed inverse precision parameter value, $\rho$ is a predefined labeling coefficient, $\sigma_k^c$ is a first standard deviation matrix for the k-th cluster computed using the plurality of classified observation vectors, $\sigma_u$ is a second standard deviation matrix computed using the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
This invention relates to computing the inverse precision parameter matrix used by the clustering algorithm of claim 1 when both classified and unclassified observation vectors are available. For each cluster, the matrix is a weighted combination of two standard deviation matrices. The first is computed from the classified observation vectors assigned to that cluster, while the second is computed from the unclassified observation vectors and is shared by all clusters. A predefined labeling coefficient ρ sets the balance: the first matrix receives weight ρ and the second receives weight 1−ρ, so a larger ρ places more trust in the labeled data. The computation is carried out once per cluster, for k = 1 up to the number of clusters K_max. Blending the two sources lets the starting precision estimate draw on cluster-specific labeled data where it exists while still reflecting the overall spread of the unlabeled data, which helps when labeled observations are scarce.
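As a rough illustration of the weighted combination in claim 2, the following Python sketch blends two small matrices element by element. The function name `blend_inverse_precision` and the sample values are hypothetical and chosen only for illustration; the patent does not prescribe an implementation.

```python
def blend_inverse_precision(sigma_k_c, sigma_u, rho):
    """Return rho * sigma_k_c + (1 - rho) * sigma_u, element by element.

    sigma_k_c: matrix from the classified observations of one cluster
    sigma_u:   matrix from the unclassified observations
    rho:       predefined labeling coefficient
    """
    return [
        [rho * a + (1.0 - rho) * b for a, b in zip(row_c, row_u)]
        for row_c, row_u in zip(sigma_k_c, sigma_u)
    ]


# Hypothetical 2-variable example.
sigma_k_c = [[2.0, 0.5], [0.5, 1.0]]
sigma_u = [[4.0, 0.0], [0.0, 3.0]]
psi0_inv = blend_inverse_precision(sigma_k_c, sigma_u, rho=0.75)
print(psi0_inv)
```

With rho set to 1 the result is the labeled-data matrix alone, and with rho set to 0 it is the unlabeled-data matrix alone.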
3. The non-transitory computer-readable medium of claim 2, wherein the first standard deviation matrix is computed using $\sigma_k^c = \frac{1}{n_k^c - 1} \sum_{i=1}^{n_k^c} (x_i - m_k^c)(x_i - m_k^c)^{\top}$, $k = 1, 2, \ldots, K_{max}$, where $x_i$ is an i-th observation vector of the plurality of classified observation vectors that is included in the k-th cluster, $m_k^c$ is a mean vector for the k-th cluster computed using the plurality of classified observation vectors included in the k-th cluster, wherein the mean vector includes the mean vector value computed for each variable of the plurality of variables, $n_k^c$ is a number of observation vectors of the plurality of classified observation vectors that is included in the k-th cluster, and $\top$ indicates a transpose.
This invention relates to computational methods for analyzing clustered data, specifically focusing on the calculation of standard deviation matrices for clusters of observation vectors. The problem addressed involves accurately quantifying variability within clusters of multivariate data, which is essential for tasks such as anomaly detection, pattern recognition, and statistical modeling. The invention describes a method for computing a first standard deviation matrix for each of a set of clusters, where each cluster contains a subset of classified observation vectors. For a given cluster, the standard deviation matrix is calculated using a formula that incorporates the observation vectors, their mean vector, and the number of vectors in the cluster. The mean vector is derived from the observation vectors in the cluster, with each element representing the mean value of a corresponding variable across all vectors in the cluster. The formula sums the outer products of deviations from the mean for each observation vector, normalized by the number of vectors minus one. This approach ensures that the standard deviation matrix accurately reflects the covariance structure of the data within each cluster, which is critical for downstream analyses that rely on cluster-specific variability measures. The method is applicable to any domain where multivariate data clustering is performed, such as machine learning, bioinformatics, and financial modeling.
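The outer-product formula of claim 3 can be sketched in plain Python as follows. This is a minimal illustration, assuming the classified observations of one cluster are given as a list of equal-length lists and that the cluster holds at least two of them (otherwise the divisor n − 1 is zero). The function name `sample_covariance` and the data are hypothetical.

```python
def sample_covariance(vectors):
    """Return (mean vector, matrix of summed outer products / (n - 1))."""
    n = len(vectors)
    d = len(vectors[0])
    # Mean of each variable across the vectors in the cluster.
    mean = [sum(x[j] for x in vectors) / n for j in range(d)]
    # Accumulate the outer product of each deviation with itself.
    cov = [[0.0] * d for _ in range(d)]
    for x in vectors:
        dev = [x[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += dev[a] * dev[b]
    # Normalize by the number of vectors minus one.
    return mean, [[c / (n - 1) for c in row] for row in cov]


# Hypothetical cluster with three classified observations of two variables.
cluster_k = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
m_k_c, sigma_k_c = sample_covariance(cluster_k)
print(m_k_c, sigma_k_c)
```

The off-diagonal entries carry the co-variation between variables, which is what distinguishes this form from a per-variable calculation.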
4. The non-transitory computer-readable medium of claim 3, wherein the second standard deviation matrix is computed using $\sigma_u = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i,j} - m_{0,j})^2$, $j = 1, 2, \ldots, d$, where $d$ is a number of the plurality of variables, $x_{i,j}$ is a variable value for a j-th variable of the i-th observation vector of the plurality of unclassified observation vectors, $n$ is a number of the plurality of unclassified observation vectors, and $m_{0,j}$ is a mean value computed from the plurality of unclassified observation vectors for the j-th variable of the plurality of variables.
The invention relates to statistical analysis of multivariate data, specifically computing the second standard deviation matrix from the unclassified observation vectors. Here the computation is done one variable at a time: for each variable j, the squared differences between each unclassified observation's value and that variable's mean are summed and divided by n-1. The formula is σ_u = 1/(n-1) * Σ (x_i,j - m_0,j)^2 for each variable j = 1, ..., d, where n is the number of unclassified observations, x_i,j is the value of the j-th variable in the i-th observation, and m_0,j is the mean of the j-th variable across the unclassified observations. This is the ordinary sample variance of each variable, so the result captures per-variable spread but not correlations between variables; claim 5 covers the alternative that uses the full outer-product form. The result feeds the inverse precision computation of claim 2.
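A short sketch of the per-variable calculation in claim 4, using `statistics.variance` from the Python standard library, which divides by n − 1 as the claim's formula does. The function name `per_variable_variance` and the data are hypothetical.

```python
from statistics import variance


def per_variable_variance(vectors):
    """Return one sample variance (n - 1 divisor) per variable."""
    d = len(vectors[0])
    return [variance([x[j] for x in vectors]) for j in range(d)]


# Hypothetical unclassified observations with two variables each.
unclassified = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
sigma_u_diag = per_variable_variance(unclassified)
print(sigma_u_diag)
```

How these d values are arranged into a matrix (for example, along a diagonal) is not spelled out in the claim, so the sketch returns them as a plain list.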
5. The non-transitory computer-readable medium of claim 3, wherein the second standard deviation matrix is computed using $\sigma_u = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - m_0)(x_i - m_0)^{\top}$, where $d$ is a number of the plurality of variables, $x_i$ is the i-th observation vector of the plurality of unclassified observation vectors, $m_0$ is the mean vector computed from the plurality of unclassified observation vectors, and $n$ is a number of the plurality of unclassified observation vectors.
This invention relates to statistical analysis of multivariate data, specifically computing a second standard deviation matrix for unclassified observation vectors. The problem addressed is accurately measuring the spread of multivariate data points around a computed mean vector, which is essential for clustering, anomaly detection, and other machine learning applications. The invention describes a method for calculating a second standard deviation matrix (σ_u) from a set of unclassified observation vectors. The computation involves determining the mean vector (m_0) of the observation vectors, then using this mean to compute the covariance-like matrix. For each observation vector (x_i) in the dataset, the difference between the observation and the mean vector is calculated, outer products of these differences are summed, and the result is scaled by the inverse of the number of observations minus one (n-1). This approach provides a measure of variability in the data, accounting for correlations between variables. The method is particularly useful in scenarios where data points are not yet classified, such as in initial stages of clustering algorithms or when analyzing raw sensor data. The computed matrix can be used to assess data distribution, detect outliers, or initialize further statistical models. The formula ensures mathematical consistency with standard deviation calculations while extending the concept to multivariate data.
6. The non-transitory computer-readable medium of claim 1, wherein the responsibility parameter vector is initialized for each observation vector using random draws from a multinomial distribution such that $\sum_{k=1}^{K_{max}} r_{i,k} = 1$ for $i = 1, 2, \ldots, n$, where $r_{i,k}$ is a responsibility parameter value for an i-th observation vector of the plurality of unclassified observation vectors and a k-th cluster of the plurality of clusters, $n$ is a number of the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
This invention relates to clustering algorithms, specifically improving initialization in probabilistic clustering methods. The problem addressed is the sensitivity of clustering performance to initial parameter settings, which can lead to suboptimal or unstable results. The solution involves initializing responsibility parameters for each observation vector using random draws from a multinomial distribution. The initialization ensures that the sum of responsibility parameters for each observation vector across all clusters equals one, maintaining a valid probability distribution. This approach helps avoid poor starting points that could bias the clustering process. The method is particularly useful in algorithms like Gaussian Mixture Models (GMMs) or Expectation-Maximization (EM) clustering, where proper initialization is critical for convergence to meaningful clusters. By using a multinomial distribution, the system ensures that the initial responsibilities are properly normalized, reducing the risk of degenerate solutions. The technique is applicable to any clustering scenario where probabilistic assignments of observations to clusters are involved.
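One simple way to read the random initialization in claim 6 is sketched below: for each unclassified observation, a single cluster is drawn at random and the responsibility vector is set to 1 for that cluster and 0 elsewhere, which satisfies the sum-to-one constraint. This single-draw, equal-probability reading is an assumption; the claim does not fix the number of draws or the category probabilities. The function name `init_responsibilities` and the `seed` parameter are hypothetical.

```python
import random


def init_responsibilities(n, k_max, seed=0):
    """Return n responsibility vectors of length k_max, each summing to 1."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    resp = []
    for _ in range(n):
        row = [0.0] * k_max
        # One draw per observation: pick one cluster with equal probability.
        row[rng.randrange(k_max)] = 1.0
        resp.append(row)
    return resp


r = init_responsibilities(n=5, k_max=3)
for row in r:
    print(row)
```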
7. The non-transitory computer-readable medium of claim 1, wherein the responsibility parameter vector is initialized using $\sum_{k=1}^{K_{max}} r_{i,k} = 1/K_{max}$ for $i = 1, 2, \ldots, n$, where $r_{i,k}$ is a responsibility parameter value for an i-th observation vector of the plurality of unclassified observation vectors and a k-th cluster of the plurality of clusters, $n$ is a number of the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
This invention relates to initializing the responsibility parameters in the clustering algorithm of claim 1. Instead of the random draws of claim 6, this variant uses a fixed starting point tied to 1/K_max, where K_max is the number of clusters. As printed, the formula sets the sum of r_i,k over all clusters to 1/K_max. Since claim 1 defines each responsibility value as a probability of cluster membership, the apparent intent is that each individual value r_i,k starts at 1/K_max, so that every unclassified observation is initially treated as equally likely to belong to any cluster and its responsibilities sum to one. A uniform start of this kind favors no cluster and is fully reproducible, since it involves no randomness. It requires only that the number of candidate clusters K_max be fixed in advance.
8. The non-transitory computer-readable medium of claim 1, wherein the beta distribution parameter values include a first beta distribution parameter value and a second beta distribution parameter value, wherein the first beta distribution parameter value is computed using $\gamma_{k,1} = 1 + n_k^c + \sum_{i=1}^{n_u} r_{i,k}$, $k = 1, 2, \ldots, K_{max}$, where $\gamma_{k,1}$ is the first beta distribution parameter value, $n_k^c$ is a number of the plurality of classified observation vectors included in a k-th cluster of the plurality of clusters, $r_{i,k}$ is a responsibility parameter value of the responsibility parameter vector defined for an i-th observation vector of the plurality of unclassified observation vectors and the k-th cluster of the plurality of clusters, $n_u$ is a number of the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
This invention relates to a method for computing beta distribution parameter values in a clustering algorithm, specifically for determining the first parameter (γk,1) of a beta distribution used in probabilistic clustering models. The problem addressed involves accurately estimating the parameters of a beta distribution to improve the classification of observation vectors into clusters. The method computes the first beta distribution parameter (γk,1) for each cluster (k) by summing the responsibilities (r_i,k) of unclassified observation vectors assigned to that cluster, adding the number of classified vectors (n_kc) in the cluster, and incorporating a base value of 1. The responsibility parameter (r_i,k) represents the likelihood that an unclassified observation vector (i) belongs to a given cluster (k). The total number of unclassified vectors (n_u) and the maximum number of clusters (K_max) are also considered. This approach enhances the accuracy of probabilistic clustering by refining the parameter estimation process, particularly in models like Dirichlet process mixtures or other non-parametric clustering techniques. The method ensures that the beta distribution parameters are dynamically adjusted based on both classified and unclassified data, improving clustering performance.
9. The non-transitory computer-readable medium of claim 8, wherein the second beta distribution parameter value is computed using $\gamma_{k,2} = \alpha_0 + \sum_{l=k+1}^{K_{max}} \sum_{i=1}^{n} r_{i,l}$, where $\gamma_{k,2}$ is the second beta distribution parameter value, and $\alpha_0$ is the predefined mass parameter value.
This invention relates to computing the second beta distribution parameter value (γ_k,2) for each cluster. The index l in the formula runs over clusters, not over levels of a data hierarchy: for cluster k, the responsibilities r_i,l are summed over all observations i from 1 to n and over every cluster l that comes after k in the ordering, from k+1 to K_max. The predefined mass parameter (α_0) is then added to this sum. In effect, γ_k,2 measures how much responsibility mass is currently assigned to the clusters that follow cluster k, while γ_k,1 from claim 8 measures the mass assigned to cluster k itself. For the last cluster (k = K_max) the sum is empty and γ_k,2 equals α_0. Pairs of beta parameters of this form are characteristic of the stick-breaking constructions used in Dirichlet process mixture models, although the claim itself does not name that construction.
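The two beta distribution parameter formulas of claims 8 and 9 can be sketched together. This is an illustration only, assuming the responsibilities are stored as one list per unclassified observation and that the sum over i in claim 9 runs over those same unclassified observations. The function name `beta_parameters` and the sample values are hypothetical.

```python
def beta_parameters(n_c, resp, alpha0):
    """Return (gamma1, gamma2), each a list with one value per cluster.

    n_c:    number of classified observations in each cluster
    resp:   responsibility vectors of the unclassified observations
    alpha0: predefined mass parameter value
    """
    k_max = len(n_c)
    # Total responsibility assigned to each cluster.
    col = [sum(row[k] for row in resp) for k in range(k_max)]
    # Claim 8: 1 + classified count + responsibility mass of cluster k.
    gamma1 = [1.0 + n_c[k] + col[k] for k in range(k_max)]
    # Claim 9: alpha0 + responsibility mass of all clusters after k.
    gamma2 = [alpha0 + sum(col[k + 1:]) for k in range(k_max)]
    return gamma1, gamma2


# Hypothetical example: three clusters, two unclassified observations.
n_c = [2, 0, 1]
resp = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
gamma1, gamma2 = beta_parameters(n_c, resp, alpha0=1.0)
print(gamma1, gamma2)
```

For the last cluster no later clusters exist, so its second parameter is just the mass parameter.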
10. The non-transitory computer-readable medium of claim 1, wherein computing the parameter values for the normal-Wishart distribution comprises: computing a first parameter value for the normal-Wishart distribution for each cluster using the predefined concentration parameter value, the computed mean vector, each observation vector of the plurality of unclassified observation vectors, and each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors; computing a second parameter value for the normal-Wishart distribution for each cluster using the predefined concentration parameter value and each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors; computing a third parameter value for the normal-Wishart distribution for each cluster using the predefined degree of freedom parameter value and each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors; and computing a fourth parameter value for the normal-Wishart distribution for each cluster using the predefined concentration parameter value, the computed mean vector, the computed first parameter value, the computed inverse precision parameter value, each observation vector of the plurality of observation vectors, and each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors.
This invention relates to a method for computing parameter values of a normal-Wishart distribution in a clustering algorithm, particularly for unsupervised machine learning tasks. The normal-Wishart distribution is used to model the covariance structure of data points in a probabilistic clustering framework, such as Gaussian mixture models or variational Bayesian inference. The method involves computing four distinct parameter values for each cluster in the model. The first parameter value is derived using a predefined concentration parameter, a computed mean vector, each observation vector from the unclassified data, and responsibility parameter vectors that indicate the likelihood of each observation belonging to a cluster. The second parameter value is computed using the predefined concentration parameter and the responsibility parameter vectors. The third parameter value is calculated using a predefined degree of freedom parameter and the responsibility parameter vectors. The fourth parameter value is derived using the predefined concentration parameter, the computed mean vector, the first parameter value, an inverse precision parameter, each observation vector, and the responsibility parameter vectors. This approach ensures that the normal-Wishart distribution parameters are accurately estimated based on the observed data and the probabilistic assignments of observations to clusters, improving the robustness and accuracy of the clustering algorithm. The method is particularly useful in applications requiring high-dimensional data clustering, such as image recognition, bioinformatics, and anomaly detection.
11. The non-transitory computer-readable medium of claim 10, wherein the first parameter value is computed using $m_k = \frac{\beta_0 m_{0,k} + u_k}{\beta_0 + q_k}$ for $k = 1, \ldots, K_{max}$, where $m_k$ is a first parameter vector that includes the first parameter value for each variable of the plurality of variables for a k-th cluster of the plurality of clusters, $\beta_0$ is the predefined concentration parameter value, $m_{0,k}$ is the mean vector for the k-th cluster of the plurality of clusters, $u_k = \sum_{i=1}^{n_k^c} x_i^c + \sum_{i=1}^{n_u} r_{i,k}\, x_i^u$, $k = 1, 2, \ldots, K_{max}$, $q_k = n_k^c + \sum_{i=1}^{n_u} r_{i,k}$, $x_i^c$ is an i-th observation vector of the plurality of classified observation vectors included in the k-th cluster, $r_{i,k}$ is a responsibility parameter value for the i-th observation vector of the plurality of unclassified observation vectors and the k-th cluster, $x_i^u$ is the i-th observation vector of the plurality of unclassified observation vectors, $n_k^c$ is a number of the plurality of classified observation vectors included in the k-th cluster, $n_u$ is a number of the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
This invention relates to computing the first normal-Wishart parameter, a mean vector m_k for each cluster, from both classified and unclassified observation vectors. Two intermediate quantities are formed per cluster. The vector u_k adds up the classified observation vectors in cluster k plus every unclassified observation vector weighted by its responsibility r_i,k for that cluster. The scalar q_k is the matching count: the number of classified observations in the cluster plus the sum of those responsibilities, that is, an effective number of observations. The first parameter is then m_k = (β_0 m_0,k + u_k) / (β_0 + q_k), a weighted average of the initially computed mean vector m_0,k and the data, with the predefined concentration parameter β_0 setting how strongly the initial mean is weighted. Classified observations count fully toward their cluster, while unclassified observations count only in proportion to their responsibility, which indicates how likely each is to belong to the cluster.
12. The non-transitory computer-readable medium of claim 11, wherein the second parameter value is computed using $\beta_k = \beta_0 + q_k$, $k = 1, \ldots, K_{max}$, where $\beta_k$ is the second parameter value for the k-th cluster.
This invention relates to computing the second normal-Wishart parameter for each cluster. For the k-th of the K_max clusters it is β_k = β_0 + q_k, where β_0 is the predefined concentration parameter value and q_k is the quantity defined in claim 11: the number of classified observation vectors in the cluster plus the sum of the responsibilities of the unclassified observation vectors for that cluster. Because q_k acts as an effective observation count, β_k grows as more observations, or more responsibility mass, are attributed to the cluster. The value is recomputed whenever the responsibilities change, so it tracks the current cluster assignments during the iterations of claim 1.
13. The non-transitory computer-readable medium of claim 12, wherein the third parameter value is computed using $v_k = v_0 + q_k$, $k = 1, \ldots, K_{max}$, where $v_k$ is the third parameter value for the k-th cluster, and $v_0$ is the predefined degree of freedom parameter value.
This invention relates to computing the third normal-Wishart parameter for each cluster. The formula is v_k = v_0 + q_k for k = 1 to K_max, where v_k is the third parameter value for the k-th cluster, v_0 is the predefined degree of freedom parameter value, and q_k is the effective observation count from claim 11 (the number of classified observation vectors in the cluster plus the summed responsibilities of the unclassified observation vectors). The update has the same form as the second parameter in claim 12: the predefined starting value is increased by the amount of data attributed to the cluster.
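The first three normal-Wishart parameters from claims 11 to 13 share the quantity q_k, so they can be sketched in one function for a single cluster. This is an illustration under the assumption that vectors are plain Python lists. The function name `update_cluster_parameters` and all sample values are hypothetical.

```python
def update_cluster_parameters(x_c, x_u, r_k, m0_k, beta0, v0):
    """Return (m_k, beta_k, v_k) for one cluster k.

    x_c:  classified observation vectors in cluster k
    x_u:  all unclassified observation vectors
    r_k:  responsibility of each unclassified vector for cluster k
    m0_k: initially computed mean vector for cluster k
    """
    d = len(m0_k)
    # q_k: classified count plus total responsibility (claim 11).
    q_k = len(x_c) + sum(r_k)
    # u_k: classified vectors plus responsibility-weighted unclassified ones.
    u_k = [
        sum(x[j] for x in x_c) + sum(r * x[j] for r, x in zip(r_k, x_u))
        for j in range(d)
    ]
    m_k = [(beta0 * m0_k[j] + u_k[j]) / (beta0 + q_k) for j in range(d)]
    beta_k = beta0 + q_k  # claim 12
    v_k = v0 + q_k        # claim 13
    return m_k, beta_k, v_k


m_k, beta_k, v_k = update_cluster_parameters(
    x_c=[[2.0, 4.0]],
    x_u=[[4.0, 0.0], [0.0, 8.0]],
    r_k=[0.5, 0.5],
    m0_k=[1.0, 1.0],
    beta0=2.0,
    v0=3.0,
)
print(m_k, beta_k, v_k)
```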
14. The non-transitory computer-readable medium of claim 13, wherein the fourth parameter value is computed using $\Psi_k = \left(\Psi_0^{-1} + \beta_0 (m_k - m_0)(m_k - m_0)^{\top} + s_k - u_k m_k^{\top} - m_k u_k^{\top} + q_k m_k m_k^{\top}\right)^{-1}$, $k = 1, \ldots, K_{max}$, where $\Psi_k$ is a d by d-dimensional matrix that includes the fourth parameter value for each variable of the plurality of variables by each variable of the plurality of variables and for the k-th cluster, $\Psi_0^{-1}$ is a d by d-dimensional inverse precision parameter matrix that includes the computed inverse precision parameter value for each variable of the plurality of variables by each variable of the plurality of variables, $s_k = \sum_{i=1}^{n_k^c} x_i^c {x_i^c}^{\top} + \sum_{i=1}^{n_u} r_{i,k}\, x_{i,j}\, x_{i,j}^{\top}$, $j = 1, 2, \ldots, d$, $k = 1, 2, \ldots, K_{max}$, $d$ is a number of the plurality of variables, and $\top$ indicates a transpose.
The invention relates to computing the fourth normal-Wishart parameter, a d-by-d matrix Ψ_k for each cluster, where d is the number of variables. The computation starts from the inverse precision parameter matrix Ψ_0^-1 (claim 2) and adds terms built from quantities defined in the earlier claims. The term β_0 (m_k - m_0)(m_k - m_0)^T measures how far the updated mean m_k (claim 11) has moved from the initial mean. The terms s_k - u_k m_k^T - m_k u_k^T + q_k m_k m_k^T together form the responsibility-weighted scatter of the observations around m_k. Here s_k is the sum of outer products of the classified observation vectors in the cluster plus the responsibility-weighted outer products of the unclassified observation vectors, and u_k and q_k are as defined in claim 11. The resulting matrix is then inverted to give Ψ_k, so Ψ_k is the inverse of the accumulated matrix rather than the accumulated matrix itself. The superscript T denotes a transpose, and the computation is repeated for each of the K_max clusters.
15. The non-transitory computer-readable medium of claim 14, wherein the responsibility parameter vector is updated using $r_{i,k} \propto \exp\Big(\Gamma^{(1)}(\gamma_{k,1}) - \Gamma^{(1)}(\gamma_{k,1} + \gamma_{k,2}) + \sum_{l=1}^{k-1}\big(\Gamma^{(1)}(\gamma_{l,2}) - \Gamma^{(1)}(\gamma_{l,1} + \gamma_{l,2})\big) + \frac{1}{2}\Gamma_d^{(1)}\big(\frac{v_k}{2}\big) + \frac{1}{2}\log|\Psi_k| - \frac{1}{2}(x_i - m_k)^{\top} v_k \Psi_k (x_i - m_k) - \frac{d}{2}\beta_k^{-1}\Big)$ for $k = 1, 2, \ldots, K_{max}$, $i = 1, 2, \ldots, n$, where $\gamma_{k,1}$ is a first beta distribution parameter value of the beta distribution parameter values for the k-th cluster, $\gamma_{k,2}$ is a second beta distribution parameter value of the beta distribution parameter values for the k-th cluster, $\gamma_{l,1}$ is the first beta distribution parameter value of the beta distribution parameter values for the l-th cluster, $\gamma_{l,2}$ is the second beta distribution parameter value of the beta distribution parameter values for the l-th cluster, $\Gamma^{(1)}$ indicates a digamma function, and $\Gamma_d^{(1)}$ indicates a d-dimensional digamma function.
The invention relates to updating the responsibility parameter vector in a probabilistic clustering model. The problem addressed is computing cluster responsibilities, which express how strongly each observation is associated with each cluster while accounting for uncertainty in the cluster parameters. Claim 15 gives the update rule: the responsibility r_i,k of cluster k for observation x_i is proportional to the exponential of a sum of terms. The first group of terms applies the digamma function Γ(1) to the beta distribution parameters γ_k,1 and γ_k,2 of cluster k and adds a sum, over the preceding clusters l = 1, ..., k − 1, of digamma terms in γ_l,1 and γ_l,2. The second group uses the remaining cluster parameters: half the d-dimensional digamma function Γ_d(1) evaluated at v_k/2, half the log term in Ψ_k, a quadratic term that weights the deviation x_i − m_k by v_k Ψ_k, and a correction equal to d/2 divided by β_k. Because the rule is stated as a proportionality, the values for each observation are normalized across the K_max clusters to give membership probabilities. The rule is applied for every cluster k and every observation i in each iteration of training.
16. The non-transitory computer-readable medium of claim 10, wherein after determining the cluster membership for each observation vector of the plurality of unclassified observation vectors, the computer-readable instructions further cause the computing device to: determine a number of clusters of the plurality of clusters that include at least one observation vector of the plurality of observation vectors; and output the determined number of clusters.
This invention relates to clustering unclassified observation vectors and, in claim 16, to reporting how many clusters are actually in use. Training starts from a plurality of clusters, and after cluster membership has been determined for each unclassified observation vector, some of those clusters may contain no observation vectors. The system counts the clusters that include at least one observation vector and outputs that count. The number of occupied clusters is therefore a result of the fit rather than a value the user must fix in advance, which is useful in automated segmentation tasks such as pattern recognition, anomaly detection, or customer segmentation.
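The counting step of claim 16 can be sketched in a few lines. This is a minimal illustration, assuming the cluster memberships are available as a list of integer cluster indices; the function name `count_nonempty_clusters` and the input layout are hypothetical and do not come from the patent.

```python
def count_nonempty_clusters(memberships):
    """Return the number of distinct clusters holding at least one observation.

    `memberships` is a hypothetical list with one cluster index per
    observation vector, as produced by the membership step.
    """
    return len(set(memberships))


# Seven observations that ended up in clusters 0, 2, and 5.
memberships = [0, 2, 2, 5, 0, 5, 5]
print(count_nonempty_clusters(memberships))  # prints 3
```

Clusters that never appear in the list are simply not counted, which mirrors the claim's "at least one observation vector" condition.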
17. The non-transitory computer-readable medium of claim 16, wherein after determining the number of clusters, the computer-readable instructions further cause the computing device to: output the first parameter value and the fourth parameter value computed for each cluster that includes at least one observation vector of the plurality of observation vectors.
This invention relates to reporting the fitted parameters of a clustering model. After the number of occupied clusters has been determined (claim 16), the system outputs the first parameter value and the fourth parameter value computed for each cluster that includes at least one observation vector. These are per-cluster parameter values computed during training; the fourth parameter value is the d-by-d matrix Ψk described under claim 14. Parameters of empty clusters are not output, so the result describes only the clusters that the data support. The output values can be used for further analysis or visualization, or to assign new observation vectors to clusters as in claim 18.
18. The non-transitory computer-readable medium of claim 17, wherein, after determining the number of clusters, the computer-readable instructions further cause the computing device to: read a new observation vector from a dataset; assign the read new observation vector to a cluster of the determined number of clusters based on the read new observation vector, the first parameter value, and the fourth parameter value computed for each cluster that includes at least one observation vector; and output the assigned cluster.
This invention relates to assigning new data points to the clusters of an already trained model. After the number of occupied clusters has been determined, the system reads a new observation vector from a dataset, assigns it to one of those clusters, and outputs the assigned cluster. The assignment is based on three inputs: the new observation vector itself, and the first and fourth parameter values computed for each cluster that includes at least one observation vector. The claim does not prescribe a specific distance or likelihood computation. Because only the stored parameter values are needed, new data can be assigned without re-running training, which suits applications that cluster streaming data in real time or near real time.
19. The non-transitory computer-readable medium of claim 1, wherein each thread computes $q_{k,w,t} = \sum_{i=1}^{n_{w,t}} r_{i,k}$, $u_{k,w,t} = \sum_{i=1}^{n_{w,t}} r_{i,k} x_i$, and $s_{k,w,t} = \sum_{i=1}^{n_{w,t}} r_{i,k} x_i x_i^T$ for each cluster $k = 1, \ldots, K_{max}$, where $n_{w,t}$ is a number of observation vectors on which a computing device $w$ and a thread $t$ of the computing device $w$ initializes the responsibility parameter vector, $r_{i,k}$ is a responsibility parameter value for an $i$th observation vector of the plurality of unclassified observation vectors and the $k$th cluster on which a computing device $w$ and a thread $t$ of the computing device $w$ initialize the responsibility parameter vector, $x_i$ is the $i$th observation vector of the plurality of unclassified observation vectors on which a computing device $w$ and a thread $t$ of the computing device $w$ initialize the responsibility parameter vector, $K_{max}$ is a number of the plurality of clusters, and $T$ indicates a transpose.
This invention relates to parallelized clustering of large datasets. The problem addressed is the cost of computing cluster statistics over a large or high-dimensional dataset on a single processor. In the solution, the unclassified observation vectors are split across computing devices w and threads t, and each thread computes three sums over only the n_w,t observation vectors it holds. For each cluster k (where k ranges from 1 to K_max, the number of clusters), the thread calculates: (1) q_k,w,t, the sum of the responsibility parameter values r_i,k, (2) u_k,w,t, the sum of the observation vectors x_i weighted by r_i,k, and (3) s_k,w,t, the sum of the outer products x_i x_i^T weighted by r_i,k. The sums run over all of the thread's observation vectors, not only over those assigned to cluster k; the responsibility value r_i,k, which indicates how strongly observation x_i is associated with cluster k, determines how much each observation contributes. Because each thread works on its own block of data, the computations proceed in parallel across threads and devices. This is particularly useful in applications requiring real-time or near-real-time clustering, such as data mining, pattern recognition, and machine learning.
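The three per-thread sums can be sketched as follows. This is a minimal illustration in plain Python, assuming one thread's block of data is given as a list of observation vectors and a matching list of responsibility vectors; the function name `thread_statistics` and the list-based layout are hypothetical, and the block is assumed to be non-empty.

```python
def thread_statistics(x, r, k_max):
    """Compute q_k, u_k, and s_k for one thread's block of observations.

    x: list of n observation vectors, each a list of d floats
    r: list of n responsibility vectors, each a list of k_max floats
    Returns (q, u, s) where q[k] is a scalar, u[k] a length-d vector,
    and s[k] a d-by-d matrix.
    """
    d = len(x[0])
    q = [0.0] * k_max
    u = [[0.0] * d for _ in range(k_max)]
    s = [[[0.0] * d for _ in range(d)] for _ in range(k_max)]
    for xi, ri in zip(x, r):
        for k in range(k_max):
            rik = ri[k]
            q[k] += rik                               # sum of r_ik
            for a in range(d):
                u[k][a] += rik * xi[a]                # sum of r_ik * x_i
                for b in range(d):
                    s[k][a][b] += rik * xi[a] * xi[b]  # sum of r_ik * x_i x_i^T
    return q, u, s


# Two observations in two dimensions, two clusters.
x = [[1.0, 2.0], [3.0, 4.0]]
r = [[0.5, 0.5], [1.0, 0.0]]
q, u, s = thread_statistics(x, r, k_max=2)
print(q)     # prints [1.5, 0.5]
print(u[0])  # prints [3.5, 5.0]
```

Each quantity is a plain sum over observations, so results from different threads can be combined by element-wise addition; how the patented system combines them is not stated in this claim.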
20. The non-transitory computer-readable medium of claim 19, wherein the responsibility parameter vector is updated by each thread on which the plurality of unclassified observation vectors are distributed on each computing device of the one or more computing devices.
This invention relates to distributed updating of the responsibility parameter vector. Building on claim 19, in which the unclassified observation vectors are distributed across threads on one or more computing devices, claim 20 specifies that the update is performed by those same threads: each thread updates the responsibility parameter vectors of the observation vectors it holds. The update therefore runs in parallel across all threads and computing devices, and the per-observation work is divided among them rather than concentrated on a single device. This is particularly useful when large datasets must be clustered in real time or near real time.
21. The non-transitory computer-readable medium of claim 20, wherein the cluster membership is determined for each observation vector of the plurality of unclassified observation vectors using a respective updated responsibility parameter vector by each thread on which the plurality of unclassified observation vectors are distributed on each computing device of the one or more computing devices.
This invention relates to distributed determination of cluster membership for unclassified observation vectors. The problem addressed is processing large datasets efficiently when the observation vectors are spread across one or more computing devices. Building on claim 20, each thread that holds a subset of the unclassified observation vectors determines the cluster membership of each of its vectors from that vector's updated responsibility parameter vector. Membership is therefore decided where the data reside, using the same distribution of work as the initialization and update steps, and the final step scales with the number of threads and computing devices in the same way as training. The approach is particularly useful in applications requiring real-time or near-real-time analysis of high-dimensional data, such as machine learning, data mining, and pattern recognition.
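The claim says only that membership is determined "using" the updated responsibility vector. One common choice, shown here as an assumption rather than as the patented rule, is to pick the cluster with the largest responsibility. The function name `assign_memberships` and the list-based layout are hypothetical.

```python
def assign_memberships(responsibilities):
    """For each observation, return the index of the cluster with the
    largest responsibility value (ties go to the lowest index).

    Assumption: membership is the arg-max of the responsibility vector;
    the claim itself does not spell out the selection rule.
    """
    return [max(range(len(r)), key=r.__getitem__) for r in responsibilities]


# One thread's block: two observations, three clusters.
block = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(assign_memberships(block))  # prints [1, 0]
```

Because each observation is handled independently, every thread can run this on its own block without exchanging data with other threads.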
22. The non-transitory computer-readable medium of claim 1, wherein the target variable value selected for each observation vector of the plurality of unclassified observation vectors identifies a characteristic of a respective observation vector.
This invention relates to machine learning systems for classifying observation vectors. Claim 22 specifies that the target variable value selected for each unclassified observation vector identifies a characteristic of that vector, such as a class label, category, or other distinguishing attribute. The selected value thus labels previously unlabeled data, which can then be used for further analysis or decision-making. This is particularly useful in applications where classification matters, such as medical diagnosis, fraud detection, or quality control in manufacturing.
23. The non-transitory computer-readable medium of claim 1, wherein the mean vector is computed using $m_{0,k} = \rho m_k^c + (1 - \rho) m^u$, $k = 1, 2, \ldots, K_{max}$, where $m_{0,k}$ is the mean vector for a $k$th cluster of the plurality of clusters, $\rho$ is a predefined labeling coefficient, $m_k^c$ is a first mean vector for the $k$th cluster computed using the plurality of classified observation vectors, $m^u$ is a second mean vector computed using the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
The invention relates to computing the mean vector of each cluster from both classified and unclassified observation vectors. The problem addressed is that classified data alone may be too scarce to locate a cluster reliably, while unclassified data alone carry no class information. The solution computes, for each cluster k, a weighted combination of two means: m_k^c, the mean of the classified observation vectors in cluster k, and m^u, the mean of the unclassified observation vectors, which is the same for every cluster. The weight is the predefined labeling coefficient ρ, applied as ρ to the classified mean and 1 − ρ to the unclassified mean. With ρ = 1 the result is the classified mean alone, and with ρ = 0 it is the unclassified mean alone. The resulting vector m_0,k is the computed mean vector that claim 1 uses when computing the normal-Wishart parameter values. The method is applicable in machine learning, data analysis, and pattern recognition tasks where only part of the data is labeled.
24. The non-transitory computer-readable medium of claim 23, wherein the first mean vector is computed using $m_k^c = \frac{1}{n_k^c} \sum_{i=1}^{n_k^c} x_i$, $k = 1, 2, \ldots, K_{max}$, where $x_i$ is an $i$th observation vector of the plurality of classified observation vectors that is included in the $k$th cluster, and $n_k^c$ is a number of observation vectors of the plurality of classified observation vectors that is included in the $k$th cluster.
This invention relates to machine learning and data clustering, specifically to computing the first mean vector used in claim 23. For each cluster k, the first mean vector is the arithmetic mean of the classified observation vectors included in that cluster: m_k^c = (1/n_k^c) Σ x_i, where x_i is a classified observation vector in cluster k, n_k^c is the number of classified observation vectors in cluster k, and the sum runs over those n_k^c vectors. Unclassified observation vectors do not enter this mean. The result represents the central tendency of the labeled data for cluster k and is one of the two inputs to the weighted mean of claim 23.
25. The non-transitory computer-readable medium of claim 24, wherein the second mean vector is computed using $m^u = \frac{1}{n^u} \sum_{i=1}^{n^u} x_i^u$, where $x_i^u$ is the $i$th observation vector of the plurality of unclassified observation vectors, and $n^u$ is a number of the plurality of unclassified observation vectors.
This invention relates to computing the second mean vector used in claim 23, which is derived from the unclassified observation vectors. The calculation is m^u = (1/n^u) Σ x_i^u, with the sum running from i = 1 to n^u, where x_i^u is the i-th unclassified observation vector and n^u is the total number of unclassified observation vectors. This is the arithmetic mean of all unclassified vectors. It does not depend on the cluster index k, so a single vector m^u is shared by every cluster and supplies the unlabeled-data component of the weighted mean in claim 23.
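Claims 23 to 25 together define the mean vector of one cluster. The following is a minimal sketch in plain Python, assuming the classified vectors of cluster k and the unclassified vectors are given as non-empty lists of equal-length vectors; the function names `mean_vector` and `blended_mean` are hypothetical.

```python
def mean_vector(vectors):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    d = len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(d)]


def blended_mean(classified_in_cluster, unclassified, rho):
    """m_0k = rho * m_k^c + (1 - rho) * m^u for a single cluster k."""
    m_c = mean_vector(classified_in_cluster)  # first mean vector (claim 24)
    m_u = mean_vector(unclassified)           # second mean vector (claim 25)
    return [rho * c + (1.0 - rho) * u for c, u in zip(m_c, m_u)]


classified_k = [[1.0, 1.0], [3.0, 3.0]]  # mean is [2.0, 2.0]
unclassified = [[0.0, 4.0], [4.0, 8.0]]  # mean is [2.0, 6.0]
print(blended_mean(classified_k, unclassified, rho=0.5))  # prints [2.0, 4.0]
```

In a full run the unclassified mean would be computed once and reused for every cluster, since it does not depend on k.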
26. A system comprising: a processor; and a computer-readable medium operably coupled to the processor, the computer-readable medium having computer-readable instructions stored thereon that, when executed by the processor, cause the system to compute a mean vector for each cluster of a plurality of clusters from a plurality of observation vectors, wherein the plurality of observation vectors includes a plurality of classified observation vectors and a plurality of unclassified observation vectors, wherein each observation vector of the plurality of observation vectors includes a value for each variable of the plurality of variables, wherein a target variable value is defined to represent a class for each respective observation vector of the plurality of classified observation vectors, wherein the target variable value is unlabeled for each respective observation vector of the plurality of unclassified observation vectors; compute an inverse precision parameter matrix for each cluster of the plurality of clusters from the plurality of observation vectors; initialize a responsibility parameter vector for each observation vector of the plurality of unclassified observation vectors, wherein the responsibility parameter vector includes a probability value of a cluster membership in each cluster of the plurality of clusters for each respective observation vector, wherein the plurality of unclassified observation vectors are distributed across a plurality of threads, and the responsibility parameter vector is initialized by each thread on which the plurality of unclassified observation vectors are distributed on each computing device of one or more computing devices; (A) compute beta distribution parameter values for each cluster using a predefined mass parameter value and the responsibility parameter vector; (B) compute parameter values for a normal-Wishart distribution for each cluster using a predefined concentration parameter value, a predefined degree of freedom 
parameter value, the computed mean vector, the computed inverse precision parameter matrix, each observation vector of the plurality of observation vectors, and each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors; (C) update each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors using the computed beta distribution parameter values, the computed parameter values for the normal-Wishart distribution, and a respective observation vector of the plurality of unclassified observation vectors; (D) compute a convergence parameter value; (E) repeat (A) to (D) until the computed convergence parameter value indicates the responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors is converged; determine a cluster membership for each observation vector of the plurality of unclassified observation vectors from the plurality of clusters using a respective updated responsibility parameter vector; and output the determined cluster membership for each observation vector of the plurality of unclassified observation vectors.
The system is designed for clustering and classification in machine learning, specifically addressing the challenge of assigning unclassified observation vectors to clusters while leveraging both labeled and unlabeled data. The system processes a dataset containing observation vectors, each representing multiple variables, where some vectors are classified (with a target variable indicating class membership) and others are unclassified (missing the target variable). The system computes mean vectors and inverse precision parameter matrices for each cluster in the dataset. It initializes responsibility parameter vectors for unclassified observations, representing the probability of each observation belonging to each cluster. These vectors are distributed across multiple threads and computing devices for parallel processing. The system iteratively refines cluster assignments using Bayesian inference, computing beta distribution parameters and normal-Wishart distribution parameters for each cluster. Responsibility vectors are updated based on these parameters and the observation vectors. The process repeats until convergence is achieved, at which point the system assigns each unclassified observation to a cluster based on the final responsibility vectors. The results are then output. This approach improves scalability and accuracy in clustering tasks by efficiently handling both labeled and unlabeled data in a distributed computing environment.
27. A method of providing distributed training of a clustering model, the method comprising: computing, by a computing device, a mean vector for each cluster of a plurality of clusters from a plurality of observation vectors, wherein the plurality of observation vectors includes a plurality of classified observation vectors and a plurality of unclassified observation vectors, wherein each observation vector of the plurality of observation vectors includes a value for each variable of the plurality of variables, wherein a target variable value is defined to represent a class for each respective observation vector of the plurality of classified observation vectors, wherein the target variable value is unlabeled for each respective observation vector of the plurality of unclassified observation vectors; computing, by the computing device, an inverse precision parameter matrix for each cluster of the plurality of clusters from the plurality of observation vectors; initializing a responsibility parameter vector for each observation vector of the plurality of unclassified observation vectors, wherein the responsibility parameter vector includes a probability value of a cluster membership in each cluster of the plurality of clusters for each respective observation vector, wherein the plurality of unclassified observation vectors are distributed across a plurality of threads, and the responsibility parameter vector is initialized by each thread on which the plurality of unclassified observation vectors are distributed on each computing device of one or more computing devices; (A) computing, by the computing device, beta distribution parameter values for each cluster using a predefined mass parameter value and the responsibility parameter vector; (B) computing, by the computing device, parameter values for a normal-Wishart distribution for each cluster using a predefined concentration parameter value, a predefined degree of freedom parameter value, the computed mean vector, 
the computed inverse precision parameter matrix, each observation vector of the plurality of observation vectors, and each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors; (C) updating, by the computing device, each responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors using the computed beta distribution parameter values, the computed parameter values for the normal-Wishart distribution, and a respective observation vector of the plurality of unclassified observation vectors; (D) computing, by the computing device, a convergence parameter value; (E) repeating (A) to (D), by the computing device, until the computed convergence parameter value indicates the responsibility parameter vector defined for each observation vector of the plurality of unclassified observation vectors is converged; determining, by the computing device, a cluster membership for each observation vector of the plurality of unclassified observation vectors from the plurality of clusters using a respective updated responsibility parameter vector; and outputting, by the computing device, the determined cluster membership for each observation vector of the plurality of unclassified observation vectors.
The invention relates to distributed training of clustering models, specifically for handling both classified and unclassified observation vectors. The method computes mean vectors and inverse precision parameter matrices for each cluster from a dataset containing labeled and unlabeled observations. Each observation vector includes values for multiple variables, with a target variable indicating class membership for labeled data and remaining unlabeled for unclassified data. The method initializes responsibility parameter vectors for unclassified observations, representing cluster membership probabilities, and distributes these vectors across multiple threads on one or more computing devices. It then iteratively computes beta distribution parameters and normal-Wishart distribution parameters for each cluster using predefined mass and concentration parameters, mean vectors, inverse precision matrices, and responsibility vectors. The responsibility vectors are updated in each iteration, and the process repeats until convergence is detected. Finally, the method assigns cluster memberships to unclassified observations based on the converged responsibility vectors and outputs the results. This approach enables scalable, distributed training of clustering models for large datasets with mixed labeled and unlabeled data.
28. The method of claim 27, wherein the mean vector is computed using $m_{0,k} = \rho m_k^c + (1 - \rho) m^u$, $k = 1, 2, \ldots, K_{max}$, where $m_{0,k}$ is the mean vector for a $k$th cluster of the plurality of clusters, $\rho$ is a predefined labeling coefficient, $m_k^c$ is a first mean vector for the $k$th cluster computed using the plurality of classified observation vectors, $m^u$ is a second mean vector computed using the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
This invention relates to clustering techniques in data analysis, specifically computing cluster mean vectors from both classified and unclassified observation vectors. The problem addressed is that a mean based solely on classified data can be unreliable when labeled data is scarce or noisy. The method computes the mean vector for each cluster by combining a first mean vector derived from classified observation vectors with a second mean vector derived from unclassified observation vectors. The combination is weighted by a predefined labeling coefficient ρ. The formula for the mean vector of the k-th cluster is m_0,k = ρ m_k^c + (1 − ρ) m^u, where m_k^c is the mean vector computed from the classified observations of cluster k and m^u is the mean vector computed from the unclassified observations. The formula is applied to each cluster k = 1, 2, ..., K_max, where K_max is the number of clusters. This is the method-claim counterpart of claim 23.
29. The method of claim 27, wherein the inverse precision parameter value is computed using $\Psi_{0,k}^{-1} = \rho \sigma_k^c + (1 - \rho) \sigma^u$, $k = 1, 2, \ldots, K_{max}$, where $\Psi_{0,k}^{-1}$ is an inverse precision parameter matrix for a $k$th cluster of the plurality of clusters, wherein the inverse precision parameter matrix includes the computed inverse precision parameter value, $\rho$ is a predefined labeling coefficient, $\sigma_k^c$ is a first standard deviation matrix for the $k$th cluster computed using the plurality of classified observation vectors, $\sigma^u$ is a second standard deviation matrix computed using the plurality of unclassified observation vectors, and $K_{max}$ is a number of the plurality of clusters.
This invention relates to computing the inverse precision parameter matrix of each cluster in a probabilistic clustering model. The method addresses the challenge of estimating cluster-specific spread when only part of the data is labeled. It computes the inverse precision parameter matrix Ψ_0,k^−1 for cluster k as a weighted combination of two standard deviation matrices. The first matrix, σ_k^c, is computed from the classified observation vectors of the k-th cluster, while the second matrix, σ^u, is computed from the unclassified observation vectors. The weighting factor ρ is the same predefined labeling coefficient used for the mean vector: ρ multiplies the classified matrix and 1 − ρ multiplies the unclassified matrix. The computation is performed once for each cluster k = 1, 2, ..., K_max, where K_max is the number of clusters. By incorporating both classified and unclassified data, the method lets unlabeled observations inform the initial spread of each cluster. This approach is useful in applications such as anomaly detection, data segmentation, and pattern recognition.
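The weighted combination in claim 29 is an element-wise blend of two matrices. The following is a minimal sketch in plain Python that takes the two standard deviation matrices as given inputs, since the claim does not state how they are computed; the function name `blend_inverse_precision` and the nested-list layout are hypothetical.

```python
def blend_inverse_precision(sigma_c_k, sigma_u, rho):
    """Return rho * sigma_k^c + (1 - rho) * sigma^u, element by element.

    sigma_c_k: d-by-d matrix for cluster k from classified observations
    sigma_u:   d-by-d matrix from unclassified observations
    Both are assumed to be precomputed nested lists of the same shape.
    """
    return [
        [rho * c + (1.0 - rho) * u for c, u in zip(row_c, row_u)]
        for row_c, row_u in zip(sigma_c_k, sigma_u)
    ]


sigma_c_k = [[1.0, 0.0], [0.0, 1.0]]
sigma_u = [[3.0, 1.0], [1.0, 5.0]]
print(blend_inverse_precision(sigma_c_k, sigma_u, rho=0.75))
# prints [[1.5, 0.25], [0.25, 2.0]]
```

With rho close to 1 the result stays near the classified matrix, and with rho close to 0 it approaches the unclassified matrix shared by all clusters.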
30. The method of claim 27, wherein each thread computes $q_{k,w,t} = \sum_{i=1}^{n_{w,t}} r_{i,k}$, $u_{k,w,t} = \sum_{i=1}^{n_{w,t}} r_{i,k}\, x_i$, and $s_{k,w,t} = \sum_{i=1}^{n_{w,t}} r_{i,k}\, x_i x_i^{T}$ for each cluster $k = 1, \ldots, K_{max}$, where $n_{w,t}$ is a number of observation vectors on which a computing device $w$ and a thread $t$ of the computing device $w$ initializes the responsibility parameter vector, $r_{i,k}$ is a responsibility parameter value for an $i$th observation vector of the plurality of unclassified observation vectors and the $k$th cluster on which a computing device $w$ and a thread $t$ of the computing device $w$ initialize the responsibility parameter vector, $x_i$ is the $i$th observation vector of the plurality of unclassified observation vectors on which a computing device $w$ and a thread $t$ of the computing device $w$ initialize the responsibility parameter vector, $K_{max}$ is a number of the plurality of clusters, and $T$ indicates a transpose.
This invention relates to parallelized clustering algorithms for processing large datasets. The problem addressed is the computational inefficiency of traditional clustering methods when applied to high-dimensional or large-scale data, where sequential processing becomes impractical. The solution involves a parallelized approach where multiple computing devices and threads work concurrently to compute key statistical parameters for clustering. Each thread t of each computing device w independently calculates three values for each cluster k: the sum of the responsibility parameters (q_k,w,t), the responsibility-weighted sum of the observation vectors (u_k,w,t), and the responsibility-weighted sum of the outer products of the observation vectors (s_k,w,t). Each sum runs over the n_w,t unclassified observation vectors assigned to that thread. The responsibility parameter r_i,k is the probability that the i-th observation vector belongs to the k-th cluster, the observation vectors x_i are the data points being clustered, and K_max is the number of clusters. By distributing these computations across multiple threads and devices, the method accelerates the clustering process, making it feasible for large-scale datasets. Because each thread computes its contribution to the cluster statistics independently, the per-thread results can then be aggregated to form the final cluster parameters. This approach enhances scalability and efficiency in clustering tasks.
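The three per-thread sums in claim 30 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the names `thread_statistics` and `merge_statistics` are hypothetical, data is held in plain Python lists, and the element-wise merge is only one plausible way to aggregate the per-thread results, since claim 30 itself recites only the per-thread computation.

```python
def thread_statistics(x, r, k_max):
    """Accumulate q_k, u_k and s_k over the observations held by one thread.

    x: list of n observation vectors (each a list of d floats)
    r: list of n responsibility vectors (each a list of k_max floats)
    """
    d = len(x[0])  # assumes the thread holds at least one observation
    q = [0.0] * k_max                                          # sum of r_ik
    u = [[0.0] * d for _ in range(k_max)]                      # sum of r_ik * x_i
    s = [[[0.0] * d for _ in range(d)] for _ in range(k_max)]  # sum of r_ik * x_i x_i^T
    for x_i, r_i in zip(x, r):
        for k in range(k_max):
            r_ik = r_i[k]
            q[k] += r_ik
            for a in range(d):
                u[k][a] += r_ik * x_i[a]
                for b in range(d):
                    s[k][a][b] += r_ik * x_i[a] * x_i[b]
    return q, u, s


def merge_statistics(left, right):
    """Element-wise sum of two (q, u, s) triples (hypothetical aggregation step)."""
    q1, u1, s1 = left
    q2, u2, s2 = right
    q = [a + b for a, b in zip(q1, q2)]
    u = [[a + b for a, b in zip(ua, ub)] for ua, ub in zip(u1, u2)]
    s = [
        [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(sa, sb)]
        for sa, sb in zip(s1, s2)
    ]
    return q, u, s


# Illustrative values only: two observations, two variables, two clusters.
x = [[1.0, 2.0], [3.0, 4.0]]
r = [[1.0, 0.0], [0.5, 0.5]]
q, u, s = thread_statistics(x, r, k_max=2)
print(q, u, s)
```

Because each statistic is a plain sum over observations, splitting the data across threads and adding the partial results gives the same totals as a single pass, which is what makes the per-thread computation safe to run independently.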
December 15, 2020