I am currently reading about the P-RAM (Parallel Random Access Machine), which is kind of the Turing Machine for parallel algorithms.
This machine consists of a bunch of CPUs acting on a shared memory. It is very similar to current SIMD and multiprocessing implementations, apart from the memory model. Memory access in a P-RAM comes in different flavors: reads and writes can each be either concurrent or exclusive. Usually, when you talk about the P-RAM, you mean the CREW P-RAM, which allows concurrent reads but only exclusive writes. This sounds rather natural: reading (from RAM, or rather from a cache) can happen concurrently on most real machines, but writes should be exclusive, since otherwise it is not clear which value actually ends up being stored. Real hardware does not enforce exclusive writes, though, so a bit of trickery is needed to guarantee them; there seem to be publications on this topic.
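To make the write rules concrete, here is a toy sketch of a single synchronous write step under the exclusive-write rule. This is entirely my own illustration, not something from the book, and the helper name is made up:

    def exclusive_write(memory, writes):
        # EW rule: at most one processor may target each cell per step.
        targets = [addr for addr, _ in writes]
        if len(targets) != len(set(targets)):
            raise RuntimeError("EW violation: two processors wrote the same cell")
        for addr, value in writes:
            memory[addr] = value

    memory = {0: 0, 1: 0}
    exclusive_write(memory, [(0, 1), (1, 7)])      # fine: distinct cells
    try:
        exclusive_write(memory, [(0, 1), (0, 2)])  # which value should cell 0 get?
    except RuntimeError as e:
        print(e)

The second call is exactly the ambiguous situation the exclusive-write rule forbids: two processors target the same cell with different values, and the model would have to pick a winner.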
Much nicer would be a real CRCW P-RAM, which in the literature is also called the W-RAM. This is a very general parallel machine, and it allows for some very efficient algorithms. For example, take a look at the following program, which computes the minimum value of an array:
    import random

    minimum = None
    n = 10
    M = [0] * n        # M[i] == 0 will mean "C[i] is the minimum"
    C = [int(1000 * random.random()) for _ in range(n)]
    print(C)

    # Step 1: mark every element that some other element beats.
    for i in range(n):
        for j in range(n):
            if C[i] < C[j]:
                M[j] = 1

    # Step 2: if the minimum occurs several times, keep only its first index.
    for i in range(n):
        for j in range(n):
            if M[i] == 0 and i < j:
                M[j] = 1

    # Step 3: the single remaining unmarked position holds the minimum.
    for i in range(n):
        if M[i] == 0:
            minimum = C[i]
    print(minimum)
On a sequential machine, this algorithm obviously runs in O(n^2) time and O(n) space, while your standard sequential array-minimum computation runs in O(n) time with O(1) space. However, if you had a CRCW P-RAM, you could turn all of the for-loops into parallel-for loops, similar to what you can do in OpenMP. Note that every concurrent write in the program stores the same value, 1, so even the weakest (common) CRCW model can resolve the conflicts. Assuming that you have O(n^2) processors available, which is not that unthinkable nowadays, this reduces the whole time complexity to O(1), with O(n) space complexity. Awesome, isn't it?
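For illustration, here is a sketch of what the parallel version could look like. The thread pool is my own stand-in for the n^2 processors, and Python threads of course will not deliver real O(1) time; the point is only the structure, where each (i, j) pair becomes an independent task whose only write is the constant 1:

    from concurrent.futures import ThreadPoolExecutor
    from itertools import product
    import random

    n = 10
    C = [int(1000 * random.random()) for _ in range(n)]
    M = [0] * n

    def step1(ij):
        i, j = ij
        if C[i] < C[j]:
            M[j] = 1   # every writer stores the same value, 1

    def step2(ij):
        i, j = ij
        if M[i] == 0 and i < j:
            M[j] = 1

    with ThreadPoolExecutor() as pool:
        # each map plays the role of one synchronous PRAM step;
        # list() forces completion and so acts as the barrier in between
        list(pool.map(step1, product(range(n), repeat=2)))
        list(pool.map(step2, product(range(n), repeat=2)))

    minimum = next(C[i] for i in range(n) if M[i] == 0)
    print(C, minimum)

The barrier between the two maps matters: step 2 reads the marks that step 1 produced, so the steps must not overlap, just as consecutive instructions on a P-RAM execute in lockstep.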