Right now I'm trying to run very large nested for loops for a task, roughly 8e+12 iterations in total. I tried C++11 threading, but it is not nearly as fast as I need. My system has 8 GB of RAM, an i5 CPU and Intel HD Graphics 4000. Would OpenMP be better, or do I need an NVIDIA GPU and CUDA for this task? My code is below:

Code:

// One thread per p0: scans every (p1, p2) pair against the fixed point p0.
// v is taken by const reference so the index list is not copied on each call.
void thread_function(pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr cloudB, const vector<int>& v, int p0) {
    for (size_t p1 = 0; p1 < v.size() && ros::ok(); ++p1) {
        // Distance p0-p1, scaled by 1000 and truncated to int
        int p0p1 = sqrt(pow(cloudB->points[v[p1]].x - cloudB->points[v[p0]].x, 2)
                      + pow(cloudB->points[v[p1]].y - cloudB->points[v[p0]].y, 2)
                      + pow(cloudB->points[v[p1]].z - cloudB->points[v[p0]].z, 2)) * 1000;
        if (p0p1 > 10) {
            for (size_t p2 = 0; p2 < v.size() && ros::ok(); ++p2) {
                // Distances p0-p2 and p1-p2, same scaling
                int p0p2 = sqrt(pow(cloudB->points[v[p2]].x - cloudB->points[v[p0]].x, 2)
                              + pow(cloudB->points[v[p2]].y - cloudB->points[v[p0]].y, 2)
                              + pow(cloudB->points[v[p2]].z - cloudB->points[v[p0]].z, 2)) * 1000;
                int p1p2 = sqrt(pow(cloudB->points[v[p2]].x - cloudB->points[v[p1]].x, 2)
                              + pow(cloudB->points[v[p2]].y - cloudB->points[v[p1]].y, 2)
                              + pow(cloudB->points[v[p2]].z - cloudB->points[v[p1]].z, 2)) * 1000;
                if (p0p2 > 10 && p1p2 > 10) {
                }
            }
        }
    }
    x[p0] = 3;
    cout << "ended thread = " << p0 << endl;
}

This task is essential for my algorithm. I need suggestions on how to make these loops run much faster. In the code above, thread_function is the main function that currently holds the for loops. Is there any way to improve its performance?
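
For reference, here is a minimal sketch of what the OpenMP variant in question might look like for a single p0. It is only an illustration under assumptions, not tested on the real data: Point is a hypothetical stand-in for pcl::PointXYZRGB so the snippet compiles without PCL or ROS, count_triples and min_dist are made-up names, and the test compares squared distances against a plain threshold instead of the integer-truncated comparison used above. The counter is also an assumption, since the innermost if body in the posted code is empty.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for pcl::PointXYZRGB (assumption, not the PCL type).
struct Point { float x, y, z; };

// Squared Euclidean distance; comparing squared values avoids sqrt and pow.
inline float dist2(const Point& a, const Point& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Counts ordered pairs (p1, p2) for which p0, p1 and p2 are pairwise
// more than min_dist apart. Build with -fopenmp; without that flag the
// pragma is ignored and the loop simply runs serially.
long long count_triples(const std::vector<Point>& pts, std::size_t p0, float min_dist) {
    assert(p0 < pts.size());
    const float min2 = min_dist * min_dist;
    const long long n = static_cast<long long>(pts.size());
    long long count = 0;
    // Each thread keeps a private count; the reduction sums them at the end.
    #pragma omp parallel for reduction(+:count) schedule(dynamic)
    for (long long p1 = 0; p1 < n; ++p1) {
        if (dist2(pts[p1], pts[p0]) <= min2) continue;  // p1 too close to p0
        for (long long p2 = 0; p2 < n; ++p2) {
            if (dist2(pts[p2], pts[p0]) > min2 && dist2(pts[p2], pts[p1]) > min2) {
                ++count;
            }
        }
    }
    return count;
}
```

The dynamic schedule is a guess on my part: since some p1 iterations skip the inner loop entirely, the work per iteration is uneven, and a static split could leave some cores idle.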