Heh, I don't have the WGL_ARB_render_texture either.
You can always copy from the frame buffer into a texture without the data being returned to the CPU. It's not as fast as render to texture - but our textures are likely to be microscopic compared to most 3D graphic textures.
I think it would go like this: First, generate texture maps with positions/velocities/etc. in a square texture or not (if you have GL_ARB_texture_non_power_of_two), then render a quad in ortho mode with the exact 2D dimensions of the texture (say an 8x8 pixel quad). This will get you exact mapping: at each pixel you'll get the data you put in (no interpolation). Then in the pixel shader, compute stuff, and render to a texture but...
1) you can't write back into the textures the new positions/velocities (as far as I know, although it would be cool)
No - you can't. But (at least on most hardware) you can render into a texture instead of onto the screen - so you have to segment your math into simple chunks - you need to run: velocity += acceleration * time on all million moving objects at once - then do: position += velocity * time on all million objects in parallel. Each step involves a lot of messing around switching textures, changing shaders and figuring out what polygon to draw - but since you're doing the work on a million objects at once, these overheads are negligible. The hassle is in structuring your code this way...it's a very different way of thinking.
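To make the "one simple chunk per pass" idea concrete, here's a rough CPU-side sketch in C. It is not GPU code: plain arrays stand in for the textures, each function stands in for one shader pass, and the names (State, velocity_pass, position_pass, step) are made up for illustration. Every pass reads from one buffer and writes to a different one, which is my assumption about how you'd get around not being able to write back into the texture you're reading.

```c
#include <assert.h>
#include <stddef.h>

/* Two copies of each "texture": one to read from, one to render into. */
typedef struct {
    float *pos[2];
    float *vel[2];
    int    cur;      /* index of the buffers holding the current state */
} State;

/* Pass 1: velocity += acceleration * time, for every object. */
void velocity_pass(const float *vel_in, const float *accel,
                   float *vel_out, size_t n, float dt)
{
    assert(vel_in != vel_out);   /* never read and write the same buffer */
    for (size_t i = 0; i < n; i++)
        vel_out[i] = vel_in[i] + accel[i] * dt;
}

/* Pass 2: position += velocity * time, for every object. */
void position_pass(const float *pos_in, const float *vel,
                   float *pos_out, size_t n, float dt)
{
    assert(pos_in != pos_out);
    for (size_t i = 0; i < n; i++)
        pos_out[i] = pos_in[i] + vel[i] * dt;
}

/* One time step: run both passes, then swap which buffers are current. */
void step(State *s, const float *accel, size_t n, float dt)
{
    int src = s->cur;
    int dst = 1 - s->cur;
    velocity_pass(s->vel[src], accel, s->vel[dst], n, dt);
    position_pass(s->pos[src], s->vel[dst], s->pos[dst], n, dt);
    s->cur = dst;
}
```

On the real hardware each of those inner 'for' loops disappears - drawing the polygon is what visits every object - and the buffer swap becomes switching which texture you read and which you render into.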
The good thing is that you have all the data you need at each object/pixel, just by knowing the X,Y of the texture and the number of objects (presumably the square texture is not all filled) given by a shader parameter. You could even loop through every other object (that is, if GLSL supports a 'for' loop - I haven't worked with it yet),
You have a deep misunderstanding of what the shaders do. You actually load a PAIR of GLSL shader programs and draw your polygon(s). The first shader program runs on every vertex of the polygon(s) - and is therefore of little interest to us here. The second shader runs once from the start of 'main' until it falls off the bottom of main FOR EVERY PIXEL THAT THE POLYGON TOUCHED ON THE SCREEN.
(The first shader is called the "Vertex Shader" - the second is called the "Fragment Shader" because technically it runs on polygon fragments - which just happen to be pixels)
So if you have (say) a 10x10 pixel square - the Vertex shader runs four times - once for each of the four vertices. Then the hardware chops up the resulting polygon into pixel-sized fragments and runs the Fragment Shader once for each of the 100 pixels that the polygon touched. So there is generally no 'looping' involved in your GLSL code. The 'main()' of the fragment shader runs from start until completion for every single pixel!
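If it helps, here's a toy C model of what the hardware does with that square. It's only a sketch: draw_quad, fragment_main and Counts are hypothetical names, and the loops here belong to the "hardware", not to your shader. The point is that the fragment code itself contains no loop and only ever sees its own pixel.

```c
#include <assert.h>

typedef struct {
    int vertex_runs;
    int fragment_runs;
} Counts;

/* Stand-in for the fragment shader's main(): it knows one pixel only. */
static float fragment_main(int x, int y)
{
    return (float)(x + y);
}

/* Stand-in for the hardware drawing a width x height quad into 'out'. */
Counts draw_quad(int width, int height, float *out)
{
    assert(width > 0 && height > 0);
    Counts c = {0, 0};

    /* The vertex shader runs once per corner of the quad. */
    for (int v = 0; v < 4; v++)
        c.vertex_runs++;

    /* The fragment shader runs once per pixel the quad touched. */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            out[y * width + x] = fragment_main(x, y);
            c.fragment_runs++;
        }
    }
    return c;
}
```

Calling draw_quad(10, 10, out) with a 100-element array gives 4 vertex runs and 100 fragment runs, matching the numbers above.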
Furthermore, whilst you *can* have loops and if/then/else statements in your shaders, these things are HORRIBLY inefficient and must be avoided wherever possible. The hardware of the fragment shaders consists of lots and lots of very simple CPUs. These are all fed with the same machinecode instructions in lock-step. So if you did write "if (test) then_code ; else else_code ;" - then what actually happens is that all of the dozens of little CPUs evaluate the 'test' code. Those that get a "FALSE" result effectively write-protect all of their on-board memory. Now, all of those CPUs run the "then_code" (even the ones that got a "FALSE" result from the test) - but the execution of that code has no effect on the ones for which the test failed. When we hit the "else" clause, all of the CPUs flip their write-protect status flags - then all of the CPUs run the "else_code" - so the ones that failed the initial 'if' test will be able to change their registers - those that already executed the 'then' code will be wasting time.
So you can see that you should NEVER use an 'if' statement to try to skip over code you don't need to execute - because the processors will execute it anyway!
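Here's a small C emulation of that write-protect trick, in case the description is hard to picture. It's a sketch under my own assumptions: eight "lanes" stand in for the little CPUs, the test is "input >= 0", and lockstep_if_else and LANES are made-up names. Each block of work that all lanes step through together counts as one step.

```c
#include <assert.h>
#include <stdbool.h>

#define LANES 8   /* number of little CPUs running in lock-step */

/* Emulates: if (in >= 0) out = in * 2 ; else out = -in ;
   Returns how many steps the whole group of lanes spent. */
int lockstep_if_else(const int in[LANES], int out[LANES])
{
    assert(in != out);
    bool writable[LANES];
    int steps = 0;

    /* Every lane evaluates the test; failing lanes are write-protected. */
    for (int i = 0; i < LANES; i++)
        writable[i] = (in[i] >= 0);
    steps++;

    /* Every lane runs the 'then' code, but only writable lanes keep it. */
    for (int i = 0; i < LANES; i++) {
        int result = in[i] * 2;
        if (writable[i])
            out[i] = result;
    }
    steps++;

    /* At the 'else', every lane flips its write-protect flag. */
    for (int i = 0; i < LANES; i++)
        writable[i] = !writable[i];

    /* Every lane runs the 'else' code too. */
    for (int i = 0; i < LANES; i++) {
        int result = -in[i];
        if (writable[i])
            out[i] = result;
    }
    steps++;

    return steps;
}
```

The step count comes out the same whether every lane takes the 'then' side or the lanes are split half and half - which is exactly why the 'if' saves you nothing.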
Now - the implication for 'for' loops is that the compiler needs to unroll *all* loops so that the shaders can all execute all of the iterations. So, if you write:
j = 6 ;
for ( i = 0 ; i < 100 ; i++ )
  if ( i >= j ) break ;
...then all of your CPUs will run a hundred loop iterations to execute this code - even though you'd think that none of them went past iteration number 6!
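The same kind of toy C emulation works for the loop. This assumes hardware that behaves the way I described (fully unrolled loop, 'break' just masks the lane off); lockstep_loop and MAX_ITER are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITER 100   /* the loop bound the compiler unrolls to */

/* Emulates one lane running: for (i = 0; i < 100; i++) if (i >= j) break;
   Returns the iterations stepped through; *useful gets the ones that
   did real work before the 'break'. */
int lockstep_loop(int j, int *useful)
{
    assert(useful != NULL);
    bool active = true;
    int executed = 0;

    *useful = 0;
    for (int i = 0; i < MAX_ITER; i++) {
        if (i >= j)
            active = false;   /* the 'break' only masks the lane off */
        if (active)
            (*useful)++;
        executed++;           /* the lane still steps through the slot */
    }
    return executed;
}
```

With j = 6 the lane does useful work in 6 iterations but still steps through all 100.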
The numerical effect of these restrictions is never obvious to the programmer - GLSL is just like C or C++ or whatever - everything works as you'd expect. The problem is that if you aren't aware of the unusual hardware architecture, you can end up writing some truly, amazingly inefficient programs!
However, the nature of the things that GLSL is useful for means that almost all programs are about 10 lines long and have no loops or conditionals in them at all.
if not, you'd have to dynamically generate the shader with computations for all the objects,
Dynamically generating code is painful because it has to be compiled from source code.
There is an ancient OpenGL machinecode language - but it's becoming obsolete so fast that you'd have to be crazy to write new code using it. The GLSL compiler is built into the OpenGL driver - and you pass shaders to it as source code in string variables and it loads the code into the hardware and hands you back a handle to it so you can decide which pair of shaders you want to run each time you draw a bunch of polygons.
However, you can pre-compile a bazillion little shaders on startup and select between them at runtime. An entire shader is likely to be something like:
uniform sampler2D old_velocity ; // A "sampler2D" is a 2D texture map handle
uniform sampler2D force ;        // Force on each object, one per texel
uniform sampler2D mass ;         // Mass of each object, one per texel
uniform float delta_t ;          // A 'uniform' is a variable that's
                                 // the same for the entire polygon
varying vec2 texcoord ;          // A 'varying' is automatically interpolated
                                 // across the polygon (the vertex shader sets it)

void main ()
{
  vec3 acceleration = texture2D ( force, texcoord ).xyz /
                      texture2D ( mass, texcoord ).x ;
  vec3 new_velocity = texture2D ( old_velocity, texcoord ).xyz + acceleration * delta_t ;
  gl_FragColor = vec4 ( new_velocity, 1.0 ) ;  // The output pixel IS the new velocity
}
...so you can have a LOT of shaders!