Simulation of a large DNA Profile Database




A purely mathematical database consisting of 2 million unrelated 'DNA profiles' will, on average, contain one match. Because the generation is totally random, one 2-million run might produce no matches, and another might produce 2 matches in 2 million. The UK NDNAD contains 2 million profiles with this one match plus many more, due to the inescapable fact that most people in the UK have ancestors in common, so there is more chance of shared alleles and consequently of matches.
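The 'one match, on average, in 2 million' figure can be sanity-checked with a short calculation (my sketch, not part of the original page; the per-pair match probability below is an assumed illustrative value, chosen only to reproduce the stated result):

```python
from math import comb

def expected_matches(n_profiles: int, p_pair: float) -> float:
    """Expected number of matching pairs among n independent profiles:
    each of the C(n, 2) pairs matches with probability p_pair."""
    return comb(n_profiles, 2) * p_pair

# An assumed per-pair full-profile match probability of 5e-13
# (illustrative only) gives roughly one expected match among 2 million:
print(expected_matches(2_000_000, 5e-13))
```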


Despite official government sites linking to these files, there are still corrupt persons knocking out my sites. For the purposes of search engines cross-linking them: files no longer available on the original web hosting sites were on http://www.nutteing.50megs.com/dnas.htm , http://www.nutteing.freeisp.co.uk/dnas.htm , http://www.nutteing.batcave.net/dnas.htm , http://home.graffiti.net/nutteing/dnas.htm , http://nutteing.no-frills.net/dnas.htm and http://nutteing3.no-frills.net/dnas.htm (the last 2 now down due to host failure).
Details of that match at the end of this file.
If you found this file in an archive then use keyword "nutteingd" in a search engine to find an updated version or related pages.
Updated file August 2006
Please contact me if you notice any error that would affect 
the results. 
( 'allele' 4 on 'locus' D19 was slightly erroneous at 0.715 previously, 
now corrected to 0.719, so generated 'profiles' will differ slightly 
at D19 / 4 from the results displayed in this file )
I am not a programmer, so don't bother communicating 
about my lack of structure etc. I know the flags are poorly chosen, 
the external Random calls should be function calls, and so on. 
Before going into the Visual Basic Editor, go into ordinary 
Word and open anything in the directory you want 
the VB files to go into, as no directory is designated in the 
following code.
Using Notepad (plain text handling, no line wrap), copy and paste 
from this file as displayed in a browser, or from the source / text file, 
into a Visual Basic / macro handler between Sub and End Sub, 
reset, and Run. I am not familiar with VB and so get tied up in 
knots concerning procedures, modules, functions etc. 
My choice of file names (date-wise, sept25- etc.) is for 
ease of deleting, because of disk space constraints. 

If using straight VB6, designate the directory 
for files by "replace all" occurrences of sept25 with 
c:\vb\sept25 or whatever; also add a sound progress 
indicator before the [ next x ] line 
 
If x/1000 = Int (x/1000 ) Then Beep
 
before highlighting and copying.

In VB6 open a New Project.
In Form1 open up a Command1 button.
Double-click this button to open the command 
code window, and copy and paste the 'DNA' VB code 
between the Private Sub Command1_Click ()
and the End Sub.
Then Run / Start.
Press Command1.
Wait until the beeps / clicks cease.
I had to ditch 3 random number generators because 
they were producing repeats too often, considering that 
at times I was making 200 million calls to the RNG 
for 10 million profiles.
The results and background are after the VB code.
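For reference, the RNG embedded in the code below is a linear congruential generator using the 214013 / 2531011 multiplier-increment pair, reduced modulo 2^24 via the floating-point Fix() trick. A minimal Python sketch of the same recurrence (the seed value here is arbitrary; the VB code seeds from Timer):

```python
from itertools import islice

A, C, M = 214013, 2531011, 2 ** 24  # constants from the VB code below

def lcg_stream(seed: int):
    """Yield uniform-looking floats in [0, 1) from x -> (A*x + C) mod M,
    the recurrence the VB code computes via Fix()."""
    x = seed
    while True:
        x = (A * x + C) % M
        yield x / M

draws = list(islice(lcg_stream(12345), 5))
print(draws)
```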


The first task is to generate a file simulating '10 loci', that is, an array of 10 pairs of numbers. These number pairs are constrained to represent the allele frequencies of the published UK Caucasian population. The average number of alleles per locus in the UK NDNAD (Caucasian) SGM Plus system is 11.3, with 2 chosen per locus, times the 10 chosen loci. But as derived from the biochemistry, the inheritance of these alleles is not equally likely. If occurrence were indeed equal across the 11.3 alleles x 2 per locus x 10 loci, false matches would be very much rarer than in real life. To simplify, I have standardised on a choice of 10 alleles (0 to 9), with the rarer alleles lumped together in the '0' subset.

For purists it is an easy matter to extend 0 to 9 with "A", "B" etc., as the data is now strings, for complete modelling of all alleles on loci FGA, D21, D18, D2 and D19. In the generator section, at the start of each j loop, have
pb(j) = "Z"
then amend the generator characteristics, e.g.
If ph(j) < 0.337 Then pb(j) = "A"
If ph(j) < 0.437 Then ph(j) = 2
If ph(j) < 0.444 Then pb(j) = "B"
etc., instead of just 0 to 9; then before the end of each j loop, have
If pb(j) <> "Z" Then ph(j) = pb(j)

I would suggest using the letters for only the rare alleles, rather than going 0 to 9, A, B, C, D etc. The first 3 loci (6 digits) will not contain alphanumerics, but the 7th digit or later would, so beware if subdividing on the 7th digit or more. In principle I tried adapting it and it processes through to final match checking, but I have not done a full run fully enlarged. The final macro for converting back to standard notation would need altering, or at least the A, B, Cs etc. manually converting back to alleles. One general result along the way was that rare alleles become very much rarer, proportionally, in any matches. 
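The If-chains in each locus section of the generator below implement inverse-transform sampling from a cumulative frequency table: a uniform draw u falls into the first bin whose cumulative threshold exceeds it. A Python sketch of the idea, using the vWA thresholds from the generator (the function name is mine):

```python
# Cumulative thresholds for the vWA locus, as in the generator code:
# a uniform draw u maps to the first bin whose threshold exceeds it.
# Alleles rarer than 0.001 are lumped into the '0' subset.
VWA_CUMULATIVE = [
    (0.001, 0),
    (0.106, 1),
    (0.186, 2),
    (0.402, 3),
    (0.672, 4),
    (0.891, 5),
    (0.984, 6),
    (0.998, 7),
    (1.000, 8),
]

def sample_allele(u: float, table=VWA_CUMULATIVE) -> int:
    """Map a uniform draw u in [0, 1) to an allele digit."""
    for threshold, allele in table:
        if u < threshold:
            return allele
    return table[-1][1]

print(sample_allele(0.5))  # 0.402 <= 0.5 < 0.672, so allele 4
```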
Because of the large numbers involved, and my PC being of 1997 vintage, there is a lot of saving to disk, and only partial sets are processed rather than trying to process the full 10 million 'profiles'; my sensible limit is about 2 million processed in their entirety. Others, with more powerful computers, should be able to tackle the full 10 million. If the long conditional statements break in this HTML file then you will have to re-concatenate them before use. The order of the loci is the most commonly portrayed order for UK NDNAD profiles.

My PC is a 200 MHz, 64 MB machine with about 200 MB of hard disk space free, so the requirements are not daunting. To generate and process all 2 million profiles, expect about 5 hours, and that is once you are familiar with the routines; on faster PCs reduce this time, as most of it relates to the sort routines. Putting a conditional If / End If statement in the generator file where the output write is, to restrict output to profiles in areas where matches are known to have occurred, will reduce process time. Anyway, I suggest starting by generating only 20,000 profiles, then 200,000, and eventually 2 million, to get the hang of things.

The macro has been modified for data input and output as strings, rather than the numeric data of the earlier version. The Visual Basic / macro code for the separate macros is between horizontal rules. FGA, vWA etc. are the 10 loci, and the associated generating tables are from the allele frequency tables in the forensic science literature cited in file dnapr.htm .
' Generating 10 loci x2 profiles
' directing pairs and first divider
Dim ph(20)
' initialising Random Number Generator - RNG
count9 = 0
count8 = 0
Randomize
a = 214013
c = 2531011
x0 = Timer
z = 2 ^ 24
' 1 file 'sept25g' for the original, un-directed pairs, source data.
' This file is necessary to check on the performance of the RNG:
' when a matched pair is found, it is highly unlikely that
' both sequences as generated, before pair directing, would
' be the same - more likely a manifestation of a repeat within the RNG
' (the reason for adopting the 214013 / 2531011 RNG).
' Use the 'Word' find function on part of the sequences, including pair
' reversals; with luck they would include a 'homozygotic' pair, eg (3,3)
' say, so no reversal on that pair
Open "sept25g" For Output As #1
' outputs directed and divided by first digit
Open "sept25-0" For Output As #10
Open "sept25-1" For Output As #11
Open "sept25-2" For Output As #12
Open "sept25-3" For Output As #13
Open "sept25-4" For Output As #14
Open "sept25-5" For Output As #15
Open "sept25-6" For Output As #16
Open "sept25-7" For Output As #17
Open "sept25-8" For Output As #18
Open "sept25-9" For Output As #19
' change for different total size eg 199999 for 200,000
For x = 0 To 1999999
  For j = 0 To 1 ' vWA, first locus
    ' RNG random number generator
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.001 Then ph(j) = 11
    If ph(j) < 0.106 Then ph(j) = 1
    If ph(j) < 0.186 Then ph(j) = 2
    If ph(j) < 0.402 Then ph(j) = 3
    If ph(j) < 0.672 Then ph(j) = 4
    If ph(j) < 0.891 Then ph(j) = 5
    If ph(j) < 0.984 Then ph(j) = 6
    If ph(j) < 0.998 Then ph(j) = 7
    If ph(j) < 1 Then ph(j) = 8
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 2 To 3 ' THO1
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.002 Then ph(j) = 11
    If ph(j) < 0.243 Then ph(j) = 1
    If ph(j) < 0.437 Then ph(j) = 2
    If ph(j) < 0.545 Then ph(j) = 3
    If ph(j) < 0.546 Then ph(j) = 4
    If ph(j) < 0.686 Then ph(j) = 5
    If ph(j) < 0.99 Then ph(j) = 6
    If ph(j) < 1 Then ph(j) = 7
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 4 To 5 ' D8
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.018 Then ph(j) = 11
    If ph(j) < 0.031 Then ph(j) = 1
    If ph(j) < 0.125 Then ph(j) = 2
    If ph(j) < 0.191 Then ph(j) = 3
    If ph(j) < 0.334 Then ph(j) = 4
    If ph(j) < 0.667 Then ph(j) = 5
    If ph(j) < 0.876 Then ph(j) = 6
    If ph(j) < 0.964 Then ph(j) = 7
    If ph(j) < 0.995 Then ph(j) = 8
    If ph(j) < 1 Then ph(j) = 9
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 6 To 7 ' FGA
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.025 Then ph(j) = 11
    If ph(j) < 0.081 Then ph(j) = 1
    If ph(j) < 0.224 Then ph(j) = 2
    If ph(j) < 0.411 Then ph(j) = 3
    If ph(j) < 0.576 Then ph(j) = 4
    If ph(j) < 0.587 Then ph(j) = 5
    If ph(j) < 0.726 Then ph(j) = 6
    If ph(j) < 0.872 Then ph(j) = 7
    If ph(j) < 0.947 Then ph(j) = 8
    If ph(j) < 0.982 Then ph(j) = 9
    If ph(j) < 1 Then ph(j) = 0 ' 1.8% not generated
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 8 To 9 ' D21
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.031 Then ph(j) = 11
    If ph(j) < 0.191 Then ph(j) = 1
    If ph(j) < 0.417 Then ph(j) = 2
    If ph(j) < 0.675 Then ph(j) = 3
    If ph(j) < 0.702 Then ph(j) = 4
    If ph(j) < 0.771 Then ph(j) = 5
    If ph(j) < 0.864 Then ph(j) = 6
    If ph(j) < 0.882 Then ph(j) = 7
    If ph(j) < 0.972 Then ph(j) = 8
    If ph(j) < 0.994 Then ph(j) = 9
    If ph(j) < 1 Then ph(j) = 0 ' 0.5% not generated
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 10 To 11 ' D18
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.012 Then ph(j) = 11
    If ph(j) < 0.151 Then ph(j) = 1
    If ph(j) < 0.276 Then ph(j) = 2
    If ph(j) < 0.44 Then ph(j) = 3
    If ph(j) < 0.585 Then ph(j) = 4
    If ph(j) < 0.722 Then ph(j) = 5
    If ph(j) < 0.837 Then ph(j) = 6
    If ph(j) < 0.917 Then ph(j) = 7
    If ph(j) < 0.958 Then ph(j) = 8
    If ph(j) < 0.975 Then ph(j) = 9
    If ph(j) < 1 Then ph(j) = 0 ' 2.5% not generated
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 12 To 13 ' D2S1338
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.037 Then ph(j) = 11
    If ph(j) < 0.222 Then ph(j) = 1
    If ph(j) < 0.309 Then ph(j) = 2
    If ph(j) < 0.419 Then ph(j) = 3
    If ph(j) < 0.557 Then ph(j) = 4
    If ph(j) < 0.589 Then ph(j) = 5
    If ph(j) < 0.613 Then ph(j) = 6
    If ph(j) < 0.725 Then ph(j) = 7
    If ph(j) < 0.867 Then ph(j) = 8
    If ph(j) < 0.978 Then ph(j) = 9
    If ph(j) < 1 Then ph(j) = 0 ' 2.2% not generated
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 14 To 15 ' D16
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.019 Then ph(j) = 11
    If ph(j) < 0.148 Then ph(j) = 1
    If ph(j) < 0.202 Then ph(j) = 2
    If ph(j) < 0.491 Then ph(j) = 3
    If ph(j) < 0.779 Then ph(j) = 4
    If ph(j) < 0.965 Then ph(j) = 5
    If ph(j) < 0.994 Then ph(j) = 6
    If ph(j) < 1 Then ph(j) = 7
    If ph(j) > 10 Then ph(j) = 0
  Next j
  For j = 16 To 17 ' D19
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.087 Then ph(j) = 11
    If ph(j) < 0.309 Then ph(j) = 1
    If ph(j) < 0.322 Then ph(j) = 2
    If ph(j) < 0.704 Then ph(j) = 3
    If ph(j) < 0.719 Then ph(j) = 4
    If ph(j) < 0.896 Then ph(j) = 5
    If ph(j) < 0.934 Then ph(j) = 6
    If ph(j) < 0.975 Then ph(j) = 7
    If ph(j) < 0.992 Then ph(j) = 8
    If ph(j) < 0.997 Then ph(j) = 9
    If ph(j) < 1 Then ph(j) = 0
    If ph(j) > 10 Then ph(j) = 0 ' 0.3% not generated
  Next j
  For j = 18 To 19 ' D3
    ' RNG
    temp = x0 * a + c
    temp = temp / z
    x1 = (temp - Fix(temp)) * z
    x0 = x1
    phj = x1 / z
    ph(j) = phj
    If ph(j) < 0.001 Then ph(j) = 11
    If ph(j) < 0.007 Then ph(j) = 1
    If ph(j) < 0.139 Then ph(j) = 2
    If ph(j) < 0.404 Then ph(j) = 3
    If ph(j) < 0.651 Then ph(j) = 4
    If ph(j) < 0.846 Then ph(j) = 5
    If ph(j) < 0.987 Then ph(j) = 6
    If ph(j) < 1 Then ph(j) = 7
    If ph(j) > 10 Then ph(j) = 0
  Next j
  ' output the original generated file
  Write #1, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
  ' Because in real DNA profiles, without further info, no one
  ' knows which allele in each pair came from the mother or father;
  ' by convention they are written smaller, larger (or equal).
  ' The following directs each pair
  For j = 0 To 18 Step 2
    If ph(j + 1) < ph(j) Then
      jjj = ph(j)
      ph(j) = ph(j + 1)
      ph(j + 1) = jjj
    End If
  Next j
  ' put extra conditional statements here to reduce
  ' the number of files, or just delete some of the following
  '
  ' dividing on first column, file by file
  If ph(0) = 0 Then
    Write #10, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count0 = count0 + 1
  End If
  If ph(0) = 1 Then
    Write #11, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count1 = count1 + 1
  End If
  If ph(0) = 2 Then
    Write #12, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count2 = count2 + 1
  End If
  If ph(0) = 3 Then
    Write #13, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count3 = count3 + 1
  End If
  If ph(0) = 4 Then
    Write #14, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count4 = count4 + 1
  End If
  If ph(0) = 5 Then
    Write #15, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count5 = count5 + 1
  End If
  If ph(0) = 6 Then
    Write #16, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count6 = count6 + 1
  End If
  If ph(0) = 7 Then
    Write #17, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count7 = count7 + 1
  End If
  If ph(0) = 8 Then
    Write #18, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count8 = count8 + 1
  End If
  If ph(0) = 9 Then
    Write #19, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) & ph(12) & ph(13) & ph(14) & ph(15) & ph(16) & ph(17) & ph(18) & ph(19)
    count9 = count9 + 1
  End If
Next x
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Close #1
' count file for data to fix For - Next loops in successive dividings
Open "sept25-c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
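For readers without VB, the two post-generation steps in the macro above (writing each allele pair smaller-first, then splitting the output by first digit) amount to the following sketch (the helper names are mine):

```python
def direct_pairs(profile):
    """Order each allele pair as (smaller, larger), the convention used
    because no one knows which allele came from which parent."""
    out = []
    for i in range(0, len(profile), 2):
        a, b = profile[i], profile[i + 1]
        out.extend((min(a, b), max(a, b)))
    return out

def bucket_by_first_digit(profiles):
    """Split directed profiles into ten buckets keyed on the first
    digit, mirroring the sept25-0 .. sept25-9 output files."""
    buckets = {d: [] for d in range(10)}
    for p in profiles:
        buckets[p[0]].append(p)
    return buckets

print(direct_pairs([3, 1, 5, 5, 2, 0]))  # [1, 3, 5, 5, 0, 2]
```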
To reduce the file sizes so they can be sorted, it is necessary to subdivide by successive leading digits. If a 5th or 6th column divider is required, make the appropriate changes.
' Dividing file into 10 by second digit
Dim ph(20)
Dim ps As String
' xxxx = count size from count file
xxxx =
' input file
Open "sept25-1" For Input As #1
' 10 divided files
Open "sept25-10" For Output As #10
Open "sept25-11" For Output As #11
Open "sept25-12" For Output As #12
Open "sept25-13" For Output As #13
Open "sept25-14" For Output As #14
Open "sept25-15" For Output As #15
Open "sept25-16" For Output As #16
Open "sept25-17" For Output As #17
Open "sept25-18" For Output As #18
Open "sept25-19" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
  Input #1, ps
  a2$ = Mid(ps, 2, 1)
  ph(1) = Val(a2$)
  If ph(1) = 0 Then
    Write #10, ps
    count0 = count0 + 1
  End If
  If ph(1) = 1 Then
    Write #11, ps
    count1 = count1 + 1
  End If
  If ph(1) = 2 Then
    Write #12, ps
    count2 = count2 + 1
  End If
  If ph(1) = 3 Then
    Write #13, ps
    count3 = count3 + 1
  End If
  If ph(1) = 4 Then
    Write #14, ps
    count4 = count4 + 1
  End If
  If ph(1) = 5 Then
    Write #15, ps
    count5 = count5 + 1
  End If
  If ph(1) = 6 Then
    Write #16, ps
    count6 = count6 + 1
  End If
  If ph(1) = 7 Then
    Write #17, ps
    count7 = count7 + 1
  End If
  If ph(1) = 8 Then
    Write #18, ps
    count8 = count8 + 1
  End If
  If ph(1) = 9 Then
    Write #19, ps
    count9 = count9 + 1
  End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
' output counts
Open "sept25-1c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by third digit
Dim ph(20)
Dim ps As String
' enter count in xxxx
xxxx =
Open "sept25-11" For Input As #1
Open "sept25-110" For Output As #10
Open "sept25-111" For Output As #11
Open "sept25-112" For Output As #12
Open "sept25-113" For Output As #13
Open "sept25-114" For Output As #14
Open "sept25-115" For Output As #15
Open "sept25-116" For Output As #16
Open "sept25-117" For Output As #17
Open "sept25-118" For Output As #18
Open "sept25-119" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
  Input #1, ps
  a3$ = Mid(ps, 3, 1)
  ph(2) = Val(a3$)
  If ph(2) = 0 Then
    Write #10, ps
    count0 = count0 + 1
  End If
  If ph(2) = 1 Then
    Write #11, ps
    count1 = count1 + 1
  End If
  If ph(2) = 2 Then
    Write #12, ps
    count2 = count2 + 1
  End If
  If ph(2) = 3 Then
    Write #13, ps
    count3 = count3 + 1
  End If
  If ph(2) = 4 Then
    Write #14, ps
    count4 = count4 + 1
  End If
  If ph(2) = 5 Then
    Write #15, ps
    count5 = count5 + 1
  End If
  If ph(2) = 6 Then
    Write #16, ps
    count6 = count6 + 1
  End If
  If ph(2) = 7 Then
    Write #17, ps
    count7 = count7 + 1
  End If
  If ph(2) = 8 Then
    Write #18, ps
    count8 = count8 + 1
  End If
  If ph(2) = 9 Then
    Write #19, ps
    count9 = count9 + 1
  End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "sept25-11c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by fourth digit
Dim ph(20)
Dim ps As String
' enter count in xxxx
xxxx =
Open "sept25-131" For Input As #1
Open "sept25-1310" For Output As #10
Open "sept25-1311" For Output As #11
Open "sept25-1312" For Output As #12
Open "sept25-1313" For Output As #13
Open "sept25-1314" For Output As #14
Open "sept25-1315" For Output As #15
Open "sept25-1316" For Output As #16
Open "sept25-1317" For Output As #17
Open "sept25-1318" For Output As #18
Open "sept25-1319" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
  Input #1, ps
  a4$ = Mid(ps, 4, 1)
  ph(3) = Val(a4$)
  If ph(3) = 0 Then
    Write #10, ps
    count0 = count0 + 1
  End If
  If ph(3) = 1 Then
    Write #11, ps
    count1 = count1 + 1
  End If
  If ph(3) = 2 Then
    Write #12, ps
    count2 = count2 + 1
  End If
  If ph(3) = 3 Then
    Write #13, ps
    count3 = count3 + 1
  End If
  If ph(3) = 4 Then
    Write #14, ps
    count4 = count4 + 1
  End If
  If ph(3) = 5 Then
    Write #15, ps
    count5 = count5 + 1
  End If
  If ph(3) = 6 Then
    Write #16, ps
    count6 = count6 + 1
  End If
  If ph(3) = 7 Then
    Write #17, ps
    count7 = count7 + 1
  End If
  If ph(3) = 8 Then
    Write #18, ps
    count8 = count8 + 1
  End If
  If ph(3) = 9 Then
    Write #19, ps
    count9 = count9 + 1
  End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "sept25-131c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by fifth digit
Dim ph(20)
Dim ps As String
' enter count in xxxx
xxxx =
Open "dec14-3412" For Input As #1
Open "dec14-34120" For Output As #10
Open "dec14-34121" For Output As #11
Open "dec14-34122" For Output As #12
Open "dec14-34123" For Output As #13
Open "dec14-34124" For Output As #14
Open "dec14-34125" For Output As #15
Open "dec14-34126" For Output As #16
Open "dec14-34127" For Output As #17
Open "dec14-34128" For Output As #18
Open "dec14-34129" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
  Input #1, ps
  a5$ = Mid(ps, 5, 1)
  ph(4) = Val(a5$)
  If ph(4) = 0 Then
    Write #10, ps
    count0 = count0 + 1
  End If
  If ph(4) = 1 Then
    Write #11, ps
    count1 = count1 + 1
  End If
  If ph(4) = 2 Then
    Write #12, ps
    count2 = count2 + 1
  End If
  If ph(4) = 3 Then
    Write #13, ps
    count3 = count3 + 1
  End If
  If ph(4) = 4 Then
    Write #14, ps
    count4 = count4 + 1
  End If
  If ph(4) = 5 Then
    Write #15, ps
    count5 = count5 + 1
  End If
  If ph(4) = 6 Then
    Write #16, ps
    count6 = count6 + 1
  End If
  If ph(4) = 7 Then
    Write #17, ps
    count7 = count7 + 1
  End If
  If ph(4) = 8 Then
    Write #18, ps
    count8 = count8 + 1
  End If
  If ph(4) = 9 Then
    Write #19, ps
    count9 = count9 + 1
  End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "dec14-3412c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
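The four near-identical macros above split a file on its 2nd, 3rd, 4th and 5th digit in turn. The whole scheme is prefix bucketing, which can be sketched in one function (hypothetical, not in the original):

```python
def subdivide(lines, depth):
    """Group profile strings by their first `depth` characters, as the
    repeated digit-by-digit file splits do on disk."""
    buckets = {}
    for line in lines:
        buckets.setdefault(line[:depth], []).append(line)
    return buckets

groups = subdivide(["1345", "1367", "2299"], 2)
print(sorted(groups))  # ['13', '22']
```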
The next step is sorting, using Word Tables / Sort. Before using it, make a test batch of numbers, as there are various sort outcomes; now that I'm using string data, Text sort gave the right form on my machine. Use Ctrl+Shift+Home (or End) to highlight the text up or down. After the sort, and before saving to disk, press the up or down arrow to select which way the text is returned to you.

My set-up was limited to no more than 15,000 lines. To sort, say, 28,000: sort the upper half, then the lower half, then cut and paste, say, the 0-to-2 section of the lower half into the top of the upper half; re-sort the expanded 0-to-2 section, then re-sort the remainder. If selecting, say, the 2-to-3 section, then cut and paste at the junction of 2 and 3 in the other block, to save some repeated sorting. At other times it is quicker to over-sort, then backtrack / overlap on the next sort.

Many of the subdivision files are empty because of the pair directing: they consist of e.g. 4,4.. 4,5.. etc., never 4,0.. or 4,1.. etc., and a number of 8 and 9 sections are absent, tracing back to the generator characteristics, e.g. only the first 8 of 10 digits being used. When you know all files are under 15,000 lines (or whatever the sort limit is), use the next macro (simply a recorded macro) to sort 10 related files in one go. An empty file will stop the macro, so edit out empty files before running.
' Sort 10 related files in one go
'
Documents.Open FileName:="sept25-130", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-131", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-132", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-133", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-134", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-135", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-136", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-137", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-138", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-139", ConfirmConversions:=False, _
    ReadOnly:=False, AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, WritePasswordDocument:="", _
    WritePasswordTemplate:="", Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
    SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
    FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, _
    SortOrder2:=wdSortOrderAscending, FieldNumber3:="", _
    SortFieldType3:=wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, _
    Separator:=wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, _
    LanguageID:=wdLanguageNone
ActiveDocument.Save
' empty files will append spurious carriage returns at the
' head or tail of files, so check for this before the final match routine;
' otherwise use Insert / File to merge files
'
' merge 10 related files back to one
' for convenience I named these re-concatenated
' files as .txt so they were obvious in a listing
' compared to the no-suffix ones
'
Documents.Add Template:="", NewTemplate:=False
Selection.InsertFile FileName:="sept25-130", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-131", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-132", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-133", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-134", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-135", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-136", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-137", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-138", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-139", Range:="", _
    ConfirmConversions:=False, Link:=False, Attachment:=False
ActiveDocument.SaveAs FileName:="sept25-13.txt", FileFormat:=wdFormatText, _
    LockComments:=False, Password:="", AddToRecentFiles:=True, _
    WritePassword:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts:=False, _
    SaveNativePictureFormat:=False, SaveFormsData:=False, _
    SaveAsAOCELetter:=False
End Sub
Copy and paste all these subfiles together to submit to the next section. The final match finding runs first on 12 digits, then on 14, 16 and 18, and finally on 20 if 18 shows something. After the hours of dividing, sorting and re-merging, this routine takes only seconds to complete.
' Find matching pairs in 12 digits
' xxxx is count = ????
xxxx =
b$ = "0"
Count = 0
Dim ps As String
Open "sept25-24.txt" For Input As #1
Open "sept25-24m12.txt" For Output As #2
' change the 12 in the #2 file name above and
' the Left function below to suit number of matches
xxxx = xxxx - 1
For x = 0 To xxxx
    Input #1, ps
    a$ = Left(ps, 12)
    If a$ = b$ Then
        Write #2, ps
        Count = Count + 1
    End If
    b$ = a$
Next x
Write #2, "Count ", Count
Close #1
Close #2
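For readers who would rather not drive Word's VBA, the same sorted-file pair detection can be sketched in a few lines of Python. This is an illustrative equivalent, not part of the original macros; the function name and sample data are mine:

```python
def find_matching_pairs(lines, prefix_len=12):
    """Return each line whose first prefix_len characters equal those of
    the previous line, mirroring the VBA macro's a$/b$ comparison.
    The input must already be sorted so duplicates are adjacent."""
    matches = []
    prev = None
    for line in lines:
        key = line[:prefix_len]
        if key == prev:
            matches.append(line)
        prev = key
    return matches
```

As in the macro, the list must be sorted first so that duplicate 12-digit prefixes sit on adjacent lines.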
' Find matching triples in 12 digits
' xxxx is count from the count files
xxxx =
b$ = "0"
c$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-1.txt" For Input As #1
Open "sept25-1trip.txt" For Output As #2
' change the 12 in the #2 file name above and
' the Left function below to suit number of matches
For x = 0 To xxxx
    Input #1, ps
    a$ = Left(ps, 12)
    a2$ = ps
    If a$ = c$ Then
        Write #2, a2$, b2$, c2$
        Count = Count + 1
    End If
    If a$ = b$ Then
        c$ = b$
        c2$ = b2$
    End If
    b$ = a$
    b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
' Find matching quadruples in 12 digits
' xxxx is from the count files
xxxx =
b$ = "0"
c$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-3.txt" For Input As #1
Open "sept25-3quad.txt" For Output As #2
' change the 12 in the #2 file name above and
' the Left function below to suit number of matches
For x = 0 To xxxx
    Input #1, ps
    a$ = Left(ps, 12)
    a2$ = ps
    If a$ = d$ Then
        Write #2, a2$, b2$, c2$, d2$
        Count = Count + 1
    End If
    If a$ = c$ Then
        d$ = c$
        d2$ = c2$
    End If
    If a$ = b$ Then
        c$ = b$
        c2$ = b2$
    End If
    b$ = a$
    b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
' Find matching quintuples in 12 digits
' xxxx is from the count files
xxxx =
b$ = "0"
c$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-4.txt" For Input As #1
Open "sept25-4quin.txt" For Output As #2
' change the 12 in the #2 file name above and
' the Left function below to suit number of matches
For x = 0 To xxxx
    Input #1, ps
    a$ = Left(ps, 12)
    a2$ = ps
    If a$ = e$ Then
        Write #2, a2$, b2$, c2$, d2$, e2$
        Count = Count + 1
    End If
    If a$ = d$ Then
        e$ = d$
        e2$ = d2$
    End If
    If a$ = c$ Then
        d$ = c$
        d2$ = c2$
    End If
    If a$ = b$ Then
        c$ = b$
        c2$ = b2$
    End If
    b$ = a$
    b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
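The cascading b$/c$/d$/e$ buffers need one more variable for each multiplicity. A hash-based count finds pairs, triples, quadruples and quintuples in one pass and needs no prior sort. This Python sketch (names mine, not from the macros) shows the idea:

```python
from collections import Counter

def match_groups(lines, prefix_len=12, size=2):
    """Return {prefix: occurrences} for every prefix that occurs at
    least `size` times. The Counter replaces the chain of buffer
    variables (b$, c$, d$, e$) in the VBA, and no sorting is needed."""
    counts = Counter(line[:prefix_len] for line in lines)
    return {key: n for key, n in counts.items() if n >= size}
```

Asking for `size=3` reports triples (and anything rarer still), which the author otherwise had to spot by eye in the sorted files.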
' Converting integer values back to DNA loci / alleles
' xxxx is the number of profiles to be converted
xxxx =
' one digit-to-allele lookup table per locus, 0 where no allele is modelled
' locus order: vWA, THO1, D8, FGA, D21, D18, D2, D16, D19, D3
Dim lut As Variant
lut = Array( _
    Array(13, 14, 15, 16, 17, 18, 19, 20, 21, 0), _
    Array(5, 6, 7, 8, 8.3, 9, 9.3, 10, 0, 0), _
    Array(8, 9, 10, 11, 12, 13, 14, 15, 16, 17), _
    Array(18, 19, 20, 21, 22, 22.2, 23, 24, 25, 26), _
    Array(27, 28, 29, 30, 30.2, 31, 31.2, 32, 32.2, 33.2), _
    Array(11, 12, 13, 14, 15, 16, 17, 18, 19, 20), _
    Array(16, 17, 18, 19, 20, 21, 22, 23, 24, 25), _
    Array(8, 9, 10, 11, 12, 13, 14, 15, 0, 0), _
    Array(12, 13, 13.2, 14, 14.2, 15, 15.2, 16, 16.2, 17), _
    Array(12, 13, 14, 15, 16, 17, 18, 19, 0, 0))
Dim pj(19)
Dim ps As String
Open "sept25-m12.txt" For Input As #1
Open "sept25-mr12.txt" For Output As #2
For x = 1 To xxxx
    Input #1, ps
    ' digit j of the profile string indexes the table for locus j \ 2
    For j = 0 To 19
        pj(j) = lut(j \ 2)(Val(Mid(ps, j + 1, 1)))
    Next j
    Write #2, ""; pj(0), pj(1); ""; pj(2), pj(3); ""; pj(4), pj(5); ""; _
        pj(6), pj(7); ""; pj(8), pj(9); ""; pj(10), pj(11); ""; _
        pj(12), pj(13); ""; pj(14), pj(15); ""; pj(16), pj(17); ""; _
        pj(18), pj(19); ""
Next x
Close #1
Close #2
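The digit-to-allele conversion is a pure table lookup, so it can also be expressed compactly in Python. This sketch transcribes the ten tables from the macro above (locus order vWA, THO1, D8, FGA, D21, D18, D2, D16, D19, D3, with 0 standing in for unmodelled alleles); the function name is mine:

```python
# per-locus digit -> allele lookup tables, transcribed from the macro
LUT = [
    [13, 14, 15, 16, 17, 18, 19, 20, 21, 0],           # vWA
    [5, 6, 7, 8, 8.3, 9, 9.3, 10, 0, 0],               # THO1
    [8, 9, 10, 11, 12, 13, 14, 15, 16, 17],            # D8
    [18, 19, 20, 21, 22, 22.2, 23, 24, 25, 26],        # FGA
    [27, 28, 29, 30, 30.2, 31, 31.2, 32, 32.2, 33.2],  # D21
    [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],          # D18
    [16, 17, 18, 19, 20, 21, 22, 23, 24, 25],          # D2
    [8, 9, 10, 11, 12, 13, 14, 15, 0, 0],              # D16
    [12, 13, 13.2, 14, 14.2, 15, 15.2, 16, 16.2, 17],  # D19
    [12, 13, 14, 15, 16, 17, 18, 19, 0, 0],            # D3
]

def decode_profile(digits):
    """Convert a 20-digit profile string to ten (allele, allele) pairs."""
    return [(LUT[i // 2][int(digits[i])], LUT[i // 2][int(digits[i + 1])])
            for i in range(0, 20, 2)]
```

For example, the digit string 45355624333444451346 decodes to the 10-locus match reported at the end of this file, (17,18);(8,9);(13,14);(20,22);(30,30);(14,15);(20,20);(12,13);(13,14);(16,18).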
2 million profile sub-division counts. Anyone repeating the exercise will have very similar numbers For 1 million divide numbers by 2, for 200,000 divide by 10 etc For my set up any profile count over 15,000 would not be sorted by Word. all / first dividing 0,4019,1,398036,2,273611,3,609940,4,499104,5,191390,6,23392,7,501,8,7,9,0 1................... 0,,1,22058,2,33588,3,91034,4,113493,5,92135,6,38947,7,5927,8,854,9,0 11.................. 0,88,1,9262,2,5621,3,2518,4,19,5,2354,6,2194,7,2,8,0,9,0 12.................. 120,121,1,14174,2,8609,3,3670,4,31,5,3589,6,3394,7,,8,0,9,0 13.................. 0,371,1,38677,2,22983,3,10158,4,78,5,9802,6,8956,7,9,8,0,9,0 131................. 0,,1,5305,2,8541,3,4746,4,46,5,6233,6,13372,7,434,8,0,9,0 132................. 0,,1,,2,3291,3,3769,4,31,5,4887,6,10666,7,339,8,0,9,0 14.................. 0,456,1,47943,2,29091,3,12592,4,94,5,12133,6,11173,7,11,8,0,9,0 141 0,,1,6488,2,10745,3,5863,4,48,5,7669,6,16576,7,554,8,0,9,0 142................. 0,,1,5305,2,8541,3,4746,4,46,5,6233,6,13372,7,434,8,0,9,0 15.................. 0,376,1,38889,2,23703,3,10039,4,88,5,9828,6,9200,7,12,8,0,9,0 151................. 0,,1,5350,2,8576,3,4743,4,40,5,6204,6,13535,7,441,8,0,9,0 152................. 0,,1,,2,3415,3,3929,4,36,5,4940,6,10984,7,399,8,0,9,0 16.................. 0,162,1,16607,2,9981,3,4262,4,36,5,4113,6,3784,7,2,8,0,9,0 2................... 0,,1,,2,12917,3,68968,4,86770,5,70116,6,29621,7,4535,8,684,9,0 23.................. 0,260,1,29219,2,17652,3,7616,4,63,5,7440,6,6709,7,9,8,0,9,0 24.................. 0,337,1,36823,2,22145,3,9418,4,75,5,9414,6,8549,7,9,8,0,9,0 241................. 0,,1,5015,2,8205,3,4468,4,46,5,5986,6,12698,7,405,8,0,9,0 242................. 0,,1,,2,3307,3,3664,4,38,5,4563,6,10213,7,360,8,0,9,0 25.................. 0,280,1,29665,2,17938,3,7755,4,67,5,7519,6,6882,7,10,8,0,9,0 251................. 0,,1,4115,2,6585,3,3575,4,34,5,4802,6,10203,7,351,8,0,9,0 252................. 
0,,1,,2,2570,3,2966,4,30,5,3803,6,8281,7,288,8,0,9,0 26.................. 0,109,1,12495,2,7572,3,3238,4,22,5,3161,6,3017,7,7,8,0,9,0 3................... 0,,1,,2,,3,93386,4,233176,5,188883,6,80487,7,12264,8,1744,9,0 33.................. 0,366,1,39623,2,23922,3,10316,4,79,5,9959,6,9112,7,9,8,0,9,0 331................. 0,,1,5321,2,8850,3,4920,4,43,5,6275,6,13774,7,440,8,0,9,0 332................. 0,,1,,2,3543,3,3904,4,37,5,5170,6,10888,7,380,8,0,9,0 34.................. 0,922,1,98703,2,59774,3,25677,4,233,5,24954,6,22893,7,20,8,0,9,0 341................. 0,,1,13568,2,21954,3,12210,4,114,5,15676,6,34066,7,1115,8,0,9,0 342................. 0,,1,,2,8847,3,9825,4,93,5,12693,6,27433,7,883,8,0,9,0 343................. 0,,1,,2,,3,2694,4,50,5,7065,6,15350,7,518,8,0,9,0 345................. 0,,1,,2,,3,,4,,5,4449,6,19839,7,666,8,0,9,0 346................. 0,,1,,2,,3,,4,,5,,6,21486,7,1407,8,0,9,0 35.................. 0,736,1,80070,2,48356,3,20742,4,188,5,20292,6,18483,7,16,8,0,9,0 351................. 0,,1,10979,2,17776,3,9753,4,102,5,12661,6,27911,7,888,8,0,9,0 352................. 0,,1,,2,7096,3,8007,4,59,5,10369,6,22127,7,698,8,0,9,0 353................. 0,,1,,2,,3,2158,4,33,5,5739,6,12399,7,413,8,0,9,0 355................. 0,,1,,2,,3,,4,,5,3714,6,16060,7,518,8,0,9,0 356................. 0,,1,,2,,3,,4,,5,,6,17319,7,1164,8,0,9,0 36.................. 0,321,1,34027,2,20677,3,8876,4,92,5,8566,6,7921,7,7,8,0,9,0 361................. 0,,1,4752,2,7491,3,4162,4,34,5,5425,6,11758,7,405,8,0,9,0 362................. 0,,1,,2,3021,3,3428,4,31,5,4284,6,9608,7,305,8,0,9,0 4................... 0,,1,,2,,3,,4,145639,5,236123,6,100130,7,15027,8,2185,9,0 44.................. 0,544,1,61636,2,37389,3,15876,4,141,5,15655,6,14386,7,12,8,0,9,0 441................. 0,,1,8497,2,13443,3,7579,4,77,5,9872,6,21480,7,688,8,0,9,0 4416................ 0,810,1,536,2,3761,3,2412,4,4533,5,7152,6,1963,7,289,8,24,9,0 442................. 0,,1,,2,5491,3,6096,4,59,5,7952,6,17248,7,543,8,0,9,0 4426................ 
0,614,1,444,2,2979,3,1928,4,3701,5,5628,6,1683,7,249,8,21,9,1 443................. 0,,1,,2,,3,1699,4,32,5,4355,6,9480,7,310,8,0,9,0 45.................. 0,960,1,99982,2,60165,3,25896,4,218,5,25443,6,23434,7,25,8,0,9,0 451................. 0,,1,13926,2,22066,3,12193,4,110,5,16034,6,34516,7,1137,8,0,9,0 4512................ 0,780,1,610,2,3810,3,2399,4,4661,5,7412,6,2058,7,311,8,23,9,2 4516................ 0,1207,1,858,2,5971,3,3889,4,7295,5,11576,6,3206,7,474,8,40,9,0 452................. 0,,1,,2,8919,3,9771,4,86,5,12634,6,27853,7,902,8,0,9,0 4526................ 0,994,1,698,2,4849,3,2986,4,5844,5,9369,6,2656,7,417,8,40,9,0 453................. 0,,1,,2,,3,2744,4,55,5,7136,6,15444,7,517,8,0,9,0 455................. 0,,1,,2,,3,,4,,5,4601,6,20190,7,652,8,0,9,0 456................. 0,,1,,2,,3,,4,,5,,6,21951,7,1483,8,0,9,0 46.................. 0,399,1,42396,2,25654,3,11094,4,102,5,10657,6,9822,7,6,8,0,9,0 461................. 0,,1,5900,2,9361,3,5178,4,51,5,6755,6,14631,7,520,8,0,9,0 462................. 0,,1,,2,3800,3,4153,4,44,5,5403,6,11654,7,400,8,0,9,0 47.................. 0,60,1,6295,2,3908,3,1694,4,12,5,1620,6,1437,7,1,8,0,9,0 5................... 0,,1,,2,,3,,4,,5,95938,6,81661,7,12100,8,1691,9,0 55.................. 0,386,1,40439,2,24634,3,10444,4,87,5,10438,6,9497,7,13,8,0,9,0 56.................. 0,309,1,34369,2,20890,3,8991,4,80,5,8921,6,8097,7,4,8,0,9,0 6................... 0,,1,,2,,3,,4,,5,,6,17375,7,5293,8,724,9,0

Background and Results

Background and results, reported to the usenet group uk.legal over a number of weeks in 2003, in a thread titled "R v. Watters - Court of Appeal judgement 2000/2001". My first idea was to download UK football results data and analyse them, because they come in pairs of numbers with a bias towards the lower numbers.
Which brings me to the statistical research I would like to do concerning such multi-modal sets and the conjectured increase in matches among close-to-modal sets, like the 'biblical' analysis on this same thread. For the moment I am trying to get some weighted numerical data: a complete historical record of football results going back 100 years or so. The theory is that, restricting to divisional football to avoid mismatched teams like Man U. v. Barnstoneworth United, score lines containing 0s, 1s and 2s should be more common than those containing 3s, 4s and 5s. Order is retained, e.g. 2-1 and 1-2 are distinct, to equate to my negative normalised elements. Then for each 10 games (non-void) over all the decades, find how many 20-figure numerical matches there are. I suspect many more among scores of 0s to 2s only, rather than those containing 3s, 4s etc.
From http://www.rsssf.com/engpaul/FLA/league.html up to about 1950 "cross-table" (good keyword) results so perhaps 60 blocks of paired data like block below. Concattenating 60 seasons and breaking into 5 pairs then repeat on 6 pairs etc ,unlikely (my guess) up to 10 pairs, and testing for matches would be interesting. example below for England 1936/7,no particular reason one score of 10 changed to * and from original xxx deleted as well as text. digit count 0 197 1 292 2 218 3 114 4 55 5 31 6 12 7 4 8 nil 9 nil 10 1 which appears to have about the right sort of weighting to equate with normalised DNA profiles 1-1 0-0 1-1 1-1 4-1 2-2 3-2 0-0 1-1 4-1 1-0 1-3 1-1 5-3 4-0 4-1 1-1 0-0 4-1 2-0 3-0 1-3 1-1 4-0 1-2 0-0 0-1 2-0 2-3 4-2 2-1 5-0 2-2 2-2 0-0 2-1 1-0 1-1 2-4 2-0 1-1 1-0 0-5 0-0 2-2 2-1 2-1 1-3 1-2 1-2 2-2 2-1 0-1 0-2 0-4 1-3 1-0 0-0 1-0 0-0 1-1 4-1 1-2 2-0 2-1 2-2 4-2 1-0 6-2 2-2 2-3 1-1 4-1 5-2 2-6 4-0 4-1 4-0 1-1 2-1 2-1 3-3 2-1 3-2 0-2 2-2 1-0 2-1 1-0 2-0 2-0 1-0 1-0 1-0 1-1 1-1 3-0 2-2 0-0 3-1 1-0 2-0 3-1 4-2 4-0 2-0 1-3 0-1 2-1 3-0 1-1 4-0 3-2 0-0 2-1 2-0 4-4 4-2 1-0 1-1 0-0 1-1 1-0 1-3 3-0 0-1 5-4 3-1 3-0 2-3 5-0 1-1 3-1 3-1 3-3 5-3 4-1 0-5 5-4 0-2 1-3 1-2 3-2 2-2 3-0 1-0 5-1 1-1 3-3 3-2 3-0 2-2 0-0 7-0 3-0 2-1 7-1 2-0 1-1 2-3 2-3 4-0 2-2 3-1 1-1 3-0 4-2 1-0 1-3 1-1 3-1 2-0 0-1 3-0 3-4 1-0 2-2 4-1 2-1 5-3 6-2 5-1 1-0 6-4 5-1 1-3 6-0 2-3 1-1 0-0 1-1 2-0 1-1 1-2 4-2 2-0 0-3 0-3 3-0 4-0 1-1 3-1 2-0 1-2 4-2 1-0 2-1 2-1 1-1 4-0 3-4 0-2 2-2 3-1 2-0 2-3 2-0 3-0 2-0 2-1 2-0 1-1 2-1 5-0 3-1 1-0 1-1 2-1 3-0 3-1 0-1 2-1 2-0 0-0 2-2 1-2 1-1 3-3 3-2 7-1 1-1 3-0 0-5 2-0 0-2 0-0 1-1 2-2 2-1 4-0 1-2 1-0 2-0 1-1 2-2 2-1 1-1 0-0 3-2 4-1 1-1 3-0 4-0 5-1 1-0 2-1 3-1 4-1 4-1 2-1 2-4 6-2 4-1 2-0 1-2 1-0 1-3 0-0 0-0 2-2 2-1 1-1 3-1 0-0 2-5 3-2 2-1 0-1 1-1 1-1 2-1 2-1 2-2 1-1 1-1 3-1 2-0 3-0 1-1 2-0 1-3 2-0 0-0 5-0 4-2 3-3 2-0 3-2 2-2 2-1 2-0 1-0 5-5 4-1 1-0 1-5 2-1 1-1 1-3 0-1 4-1 1-2 2-2 2-1 1-0 3-0 6-2 2-1 2-1 2-1 0-1 1-0 1-0 3-2 5-3 1-1 1-3 2-2 1-2 1-1 0-0 1-0 5-2 1-0 3-2 
1-1 1-0 3-1 2-5 3-1 2-0 1-1 1-1 0-1 2-0 3-2 1-3 0-0 0-3 2-0 0-2 3-1 1-1 2-3 6-4 2-1 2-2 1-2 1-2 5-1 1-0 1-0 0-0 0-1 0-0 2-0 2-3 1-3 0-0 2-0 2-2 5-1 1-1 2-0 1-2 2-1 2-0 1-1 2-1 1-1 2-2 3-0 6-2 2-4 0-2 1-0 5-3 *-3 2-1 1-1 4-0 3-0 4-1 1-0 2-3 3-2 3-1 5-1 3-2 2-1 4-2 1-3 1-1 4-1 3-2 3-0 2-1 3-0 1-0 6-2 2-4 3-2 0-2 1-0 1-2 2-0 1-3 2-1 4-2 2-1 3-0 3-1 2-2 1-0 3-1 3-1 0-0 2-3 2-2 6-4 2-1 2-0 2-1 2-3 4-0 6-1 1-2 3-1 7-2 5-2 3-1 3-0 2-0 2-1 3-1 0-1 1-1 5-0 4-3 2-1 1-1 5-2 Each row is one team in turn, results,playing each of the others in the season. For my normalising purposes I would perhaps leave 0=0 1=1,2=> -1,3=> 2,4=> -2,5=> 3,6=> -3,7=> 4,8=> -4,9 or 10=>5, 11 or 12=>-5 or some such transformation. Then add a transformation normalisation profile element by element and the results would have very much the look and feel of DNA profiles Point to note - greater likelihood of home side with larger score so perhaps convert all pairs to right number equal or greater than left number,before match processing,then a part negative transformation. After all in the real world of DNA profiles they never know which parent contributes which so always right number larger or same as left number of each pair.
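The folding transformation suggested above (0 stays 0, 1 stays 1, 2 becomes -1, 3 becomes 2, 4 becomes -2, and so on) plus the 'smaller goal count first' redirection can be sketched as follows. This is one possible reading of the proposed scheme, since the text leaves the exact mapping open ("or some such transformation"); the function names are mine:

```python
def normalise(score):
    """Fold a goal count towards zero: 0->0, 1->1, 2->-1, 3->2, 4->-2,
    5->3, 6->-3, ... (odd counts positive, even counts negative)."""
    if score == 0:
        return 0
    return (score + 1) // 2 if score % 2 else -(score // 2)

def normalise_result(result):
    """Direct a 'h-a' score line so the smaller count comes first,
    as with DNA allele pairs where the contributing parent is
    unknown, then fold both counts."""
    small, large = sorted(int(goals) for goals in result.split("-"))
    return normalise(small), normalise(large)
```

Applying this element by element to a season's results gives data with very much the look and feel of normalised DNA profiles, as the text argues.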
This evening I tested my technique just on that 1936/37 block posted earlier. Broke it into triplets, so 147 triples of pairs. Only one match, 3-0,2-2,0-0, so already finding evidence against my theory. I expected any matches most likely to consist of 0s, 1s and 2s, weighted like DNA profiles, but my first match includes a 3. Giving it 'chapter and verse', these are Charlton v Man U, Middlesb, Pompey and Everton v Brentford, Charlton, Chelsea. Now I know my analytical technique works, I will download the other 60-odd blocks, break them into '5 loci' and '6 loci' pairs and analyse.
Football results analysis. I broke down the 1888 to 1938 data into 5 pairs, giving 3038 sets of 5 pairs, discarding surplus columns after splitting each year's data into 5-pair-wide chunks. Within them, 6 pairs of matches only - no triples:
0-1,2-2,2-0,2-1,3-1 for Arsenal 1911 and Shef Wed 1933
0-2,1-1,1-0,2-2,4-0 for Man C 1899 and Blackpool 1937
1-1,1-0,1-1,3-0,0-0 for 1921 Middlesb and 1923 Huddersfield
1-1,2-1,1-0,1-0,1-0 for 1900 Wolves, 1911 Bradford
2-2,1-0,0-0,3-0,3-0 for 1905 Notts C and 1906 Derby
3-0,0-1,1-0,2-2,3-1 for 1899 Burnley and 1903 Bury
so only 1 involving a 4. (Approximate for 0 to 5) digit occurrence counts in total / matched pairs:
0 4900 / 20
1 8400 / 21
2 7300 / 12
3 4840 / 6
4 2650 / 1
5 1400
6 358
7 146
8 51
9 22
10 9
12 2
11 & 13-19 nil
There is no point in doing a 6-pair analysis for this data, but I may try a 4-pair analysis to see if there is something like a correlation between the overall number count and a roughly similar distribution within the matched pairs.
I repeated the football result analysis on sets of 4 pairs. This gave 54 matches and a single triple match, on 2-0 2-1 1-0 1-1. The digit distribution in the matches was:
0 298
1 334
2 158
3 72
4 20
5 6
nothing higher. For all scores the digit counts were:
0 6771
1 9331
2 5840
3 3538
4 1746
5 759
6 304
7 118
8 39
9 17
>=10 9
Again a rough correlation, with proportionally more of the higher-frequency digits in the matches. There must have been someone here before me with some weighted but otherwise random process, not necessarily DNA inheritance. Is there a rule relating a known weighted generator (approximately a multi-modal 'normal' distribution) predetermining the weighting of any matches occurring?
Well that was an interesting exercise; I had not tried composing Visual Basic macros before. I tailored the pseudo-random generator to the desired characteristic and checked the output against it, then determined matches and plotted the digit distribution of the match cases: 32000 digits divided into 8 columns. Not quite as I predicted - one match of 44,43,14,44, so a single 1 crept in. 10 matched sequences in total, no triples.
Desired weighting of the generator, to roughly equate to vWA:
0 _ 0.002  1 _ 0.015  2 _ 0.100  3 _ 0.133  4 _ 0.25  5 _ 0.25  6 _ 0.133  7 _ 0.100  8 _ 0.015  9 _ 0.002
Actual weighting of the output:
0 _ 0.00203  1 _ 0.01569  2 _ 0.0996  3 _ 0.13103  4 _ 0.25147  5 _ 0.25044  6 _ 0.13412  7 _ 0.09775  8 _ 0.016  9 _ 0.00187
Weighting within matched pairs, 80 digits, no triples:
0 _ 0  1 _ 0.012  2 _ 0.05  3 _ 0.137  4 _ 0.4  5 _ 0.225  6 _ 0.1  7 _ 0.075  8 _ 0  9 _ 0
Matched sequences were
24425554 27565445 33457434 44431444 44535467 46544564 54634754 55244574 57643443 73563434
So my prediction was not quite right for this one-off, as that single 1 intruded, and there is an interesting skew in the centre which hopefully would clarify with repeated processing. I suspect it relates to the piecewise 'quantisation', as 3 say is something like between 2.5 and 3.5. But I can sum up as attenuation at the tails and an enlarged modal group - more of an inverted U or V characteristic. So in the original 'population' 50% have 4 or 5, increasing to 62.5% within any matches, and 76.7% have 3, 4, 5 or 6, increasing to 86.2% within matches. Tentative evidence that any unrelated matches in the NDNAD are going to be concentrated around the multi-modal groups. So if anyone does get around to resolving those unresolved matches in the NDNAD, then any matches involving rareish (< 2% allele frequency, say) alleles can be ignored in the first instance, as they are probably repeats due either to clerical error or use of aliases.
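The tailored pseudo-random generator amounts to drawing digits 0-9 with fixed weights. A minimal Python equivalent of the run above (the original is a VBA macro; the function name and constant are mine):

```python
import random

# vWA-like target weighting from the text (index = digit, value = probability)
VWA_WEIGHTS = [0.002, 0.015, 0.100, 0.133, 0.25, 0.25, 0.133, 0.100, 0.015, 0.002]

def generate_digits(n, weights, seed=None):
    """Draw n digits 0-9 with the given weighting - the role played by
    the tailored pseudo-random generator described in the text."""
    rng = random.Random(seed)
    return rng.choices(range(10), weights=weights, k=n)
```

With 32000 draws the observed frequencies should land within a fraction of a percent of the targets, much as in the actual-versus-desired table above.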
Concentrate investigation / cross-correlation with the dermal fingerprint database, or whatever, on those matches nearest the 'average Joe'. All the above concerns undirected numbers - the data in the NDNAD is of course directed pairs, e.g. (14,16) never (16,14). Also some loci are more distributed than vWA, while others are less distributed / more skewed. In theory one could model each locus/allele frequency distribution and simulate a large DNA database, given a big enough number cruncher. So in case I have discovered some previously unknown mathematical law, I should repeat with the weighting of a genuine 'normal distribution' f(x) of the form EXP[-(x-mu)^2], repeat many times to try to put some sort of f(x) to the match characteristic, and also see how far I can push the number crunching on my pc - to 10 or more digit-sets and 100,000 or more digits. I have just done an 8 digit times 10,000 run yielding 63 matches. Unfortunately my method does not, as it stands, pick up triples; I have to check the source file one by one, which is alright for 10 matches but 63 is a bit much. Weighting within matched sequences, 504 digits, no triples (checked in the central 45...... to 54...... region), a single 1 again, in 1,5,5,4,4,5,5,5:
0 _ 0  1 _ 0.002  2 _ 0.0536  3 _ 0.1151  4 _ 0.3353  5 _ 0.3571  6 _ 0.1012  7 _ 0.0357  8 _ 0  9 _ 0
so for 4,5 69.2% and for 3,4,5,6 90.9%.
I tailored my generator for Afro-Caribbean vWA, which has no nulls over 10 adjacent alleles and is more symmetric than the caucasian. Projected allele frequency characteristic of the generator:
0_ 0.005  1_ 0.016  2_ 0.079  3_ 0.218  4_ 0.208  5_ 0.211  6_ 0.161  7_ 0.068  8_ 0.029  9_ 0.005
Actual characteristic of 500,000 digits:
0_ 0.0051  1_ 0.0157  2_ 0.0794  3_ 0.217  4_ 0.2085  5_ 0.212  6_ 0.1606  7_ 0.0679  8_ 0.0288  9_ 0.005
And characteristic of the 28 matches (no triples):
0_ 0  1_ 0.0036  2_ 0.0464  3_ 0.2071  4_ 0.2679  5_ 0.2714  6_ 0.1821  7_ 0.0179  8_ 0.0036  9_ 0
So again serious attenuation of the normal/binomial distribution tails and an increase in the take of the modal group: 3,4,5 originally 63.7% increasing to 74.6%; 0,1,2 originally 10% decreasing to 5%; and 6,7,8 originally 26% decreasing to 22%. Or 3,4,5,6 79% up to 92.8% and 0,1,2,7,8,9 21% down to 7.2%. These are the 28 matches for 50,000 spins of 'vWA - Afro-Caribbean allele frequencies'. No 0 or 9, and one each of 1 and 8:
3,3,6,5,3,6,6,5,4,4
3,4,3,5,6,6,5,3,3,3
3,4,4,5,4,5,5,4,6,6
3,4,5,3,5,5,7,6,5,3
3,4,6,4,4,3,4,5,6,4
3,4,6,7,4,5,4,4,4,4
3,5,3,4,3,3,3,3,3,5
3,5,3,4,4,4,7,4,4,5
3,5,3,5,3,3,5,4,4,5
3,5,5,5,4,5,5,6,2,3
3,5,6,3,6,2,6,4,1,4
3,6,5,4,3,3,3,5,5,3
3,6,6,6,5,4,5,5,3,5
4,3,3,6,5,3,3,5,5,4
4,3,6,4,4,5,3,3,3,2
4,4,3,4,5,2,6,2,5,6
4,4,4,5,5,5,6,7,4,5
4,4,7,3,4,3,4,5,5,3
4,5,5,3,6,5,3,6,5,4
4,5,5,4,6,6,6,5,5,4
4,5,6,3,2,5,4,5,5,4
4,6,3,4,5,3,6,6,6,2
4,6,4,6,5,5,2,6,2,6
5,4,3,4,4,4,5,2,5,5
5,4,6,5,5,6,5,5,4,5
6,4,6,3,4,4,6,6,6,6
6,5,5,4,5,4,4,2,6,4
8,2,6,6,4,4,5,6,2,5
For my next run I think I will model each of the 10 UK loci/alleles in my generator and spin 50,000 times to simulate a 10-locus, single-allele database of 50,000 profiles. I will lose the distinction between null and 0, but 0s have not appeared in any match so far.
THO1 would use only 7 of the possible 10 values in the array; others, like D21 with about 16 possible alleles, I will truncate to the 10 modal / most frequent (undecided yet). The triple-peaked D2 (equal peaks at 17, 20 and 24) I will truncate to the 10 around the 'Anglo-Saxon' group of 17-20, leaving out the 2% alleles 26 and 27 at the 'Celtic 24' end. For any mathematical runs I cannot decide whether to use a binomial quantised/piecewise distribution for the generator, closer to this use, or the normal function f(x), with more chance of a numerically derived f'(x) for the match distribution. A 100,000 x 10 run would be possible I think, but a bit of a work-up. I may also try 6 loci with paired alleles, so 50,000 x 12, simulating the earlier 6 NDNAD loci characteristics, which should give an idea of how many 'Raymond Eaton' cases there would be in the earlier NDNAD form. But I would have to build another macro to direct the pairs before match checking.
I have converted my generator to 6 loci and pairs so 12 digit 'profiles' . So far I've only done one run of 12000 x12 spins to check the characteristics. Continuing on and directing pairs and checking for matches produced no matches with 12000 '6 loci profiles'. I deliberately added 2 matches to the data and it found those 2(4) as a check of functioning. Anyone care to predict how many matches for runs of 20,000 / 50,000 / 100,000 and 200,000 ? Nulls are either due to no FSS data for that allele or to keep my selection down to a maximum of 10 digits. UK Caucasian Tabulated as FSS data eg vWA allele 14 corresponds to a digit 1 in my modelling Allele / desired frequency / modelled frequency for VWA 11 0.000 NULL 13 0.001 0.0012 14 0.105 0.1065 15 0.080 0.0794 15.2 0.000 NULL 16 0.216 0.2146 17 0.270 0.2717 18 0.219 0.2183 19 0.093 0.0926 20 0.014 0.0137 21 0.002 0.0022 ^ 9 only modelled THO1 5 0.002 0.0012 6 0.241 0.2439 7 0.194 0.1972 8 0.108 0.1027 8.3 0.001 0.0011 9 0.140 0.1385 9.3 0.304 0.3051 10 0.012 0.0103 10.3 0.000 NULL ^ 8 only modelled D8 D8S1179 / D6 8 0.018 0.018 9 0.013 0.0143 10 0.094 0.0953 11 0.066 0.0656 12 0.143 0.1442 13 0.333 0.330 14 0.209 0.2081 15 0.088 0.0886 16 0.031 0.030 17 0.004 0.0057 18 0.000 NULL FGA 18 0.025 0.0426 18.2 0.000 null 19 0.056 0.0577 19.2 0.000 null 20 0.143 0.1432 20.2 0.002 null 21 0.187 0.1838 21.2 0.002 null 22 0.165 0.1631 22.2 0.011 0.0116 23 0.139 0.1411 23.2 0.004 null 24 0.146 0.1462 24.2 0.002 null 25 0.075 0.0758 25.2 0.000 null 26 0.035 0.0348 27 0.007 null 28 0.000 null 29 0.000 null 30 0.001 null 30.2 0.000 null 31 0.000 null 45.2 0.000 null 46.2 0.000 null ^ 0 (allele 18 ) is inflated by 1.8% nulls D21 D21S11 53 (24) 0.000 null 54 0.001 null 57 (26) 0.001 null 59 (27) 0.031 0.0368 61 (28) 0.160 0.1559 63 (29) 0.226 0.2289 64.1 0.000 null 64 0.000 null 65 (30) 0.258 0.2571 66 0.027 0.0264 67 (31) 0.069 0.0666 68 0.093 0.0965 69 (32) 0.018 0.0179 70 0.090 0.0922 71 (33) 0.001 null 72 0.022 0.0217 73 (34) 
0.000 null 74 0.002 null 75 (35) 0.000 null 77 0.000 null ^ 0 (allele 27) is inflated by 0.5% nulls D18 D18S51 8 0.000 null 9.2 0.001 null 10 0.008 null 11 0.012 0.0335 12 0.139 0.1405 13 0.125 0.1254 14 0.164 0.1686 14.2 0.000 null 15 0.145 0.1447 16 0.137 0.1342 17 0.115 0.1167 18 0.080 0.0767 19 0.041 0.0419 19.2 0.000 null 20 0.017 0.0177 21 0.010 null 22 0.005 null 23 0.001 null 24 0.002 null ^ 0 (allele 11) is inflated by 2.5% nulls Remainder 7 to 10 loci yet to be modelled D2 D2S1338 16 0.037 17 0.185 18 0.087 19 0.110 20 0.138 21 0.032 22 0.024 23 0.112 24 0.142 25 0.111 26 0.019 27 0.002 28 0.000 D16 D16S539 5 0.000 8 0.019 9 0.129 10 0.054 11 0.289 12 0.288 13 0.186 14 0.029 15 0.005 D19 D19S433 10 0.000 10.2 0.000 11 0.000 12 0.087 12.2 0.000 13 0.222 13.2 0.013 14 0.382 14.2 0.015 15 0.177 15.2 0.038 16 0.041 16.2 0.017 17 0.005 17.2 0.000 18 0.000 18.2 0.002 19.2 0.001 D3 D3S1358 12 0.001 13 0.006 14 0.132 15 0.265 16 0.247 17 0.195 18 0.141 19 0.014
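Putting the tables above to work: the generation step is a per-locus weighted draw of two alleles, directed so the smaller comes first. This Python sketch uses just two of the loci transcribed from the FSS figures above for brevity (the full model covers ten); the function name is mine:

```python
import random

# allele frequencies transcribed from the FSS tables above
# (two loci only, for illustration; nulls omitted)
LOCI = {
    "vWA":  {13: 0.001, 14: 0.105, 15: 0.080, 16: 0.216, 17: 0.270,
             18: 0.219, 19: 0.093, 20: 0.014, 21: 0.002},
    "THO1": {5: 0.002, 6: 0.241, 7: 0.194, 8: 0.108, 8.3: 0.001,
             9: 0.140, 9.3: 0.304, 10: 0.012},
}

def generate_profile(rng):
    """One simulated profile: draw two alleles per locus according to
    the frequency table, then direct the pair (smaller allele first)."""
    profile = {}
    for locus, freqs in LOCI.items():
        alleles = rng.choices(list(freqs), weights=list(freqs.values()), k=2)
        profile[locus] = tuple(sorted(alleles))
    return profile
```

Directing the pair before match checking is what makes (14,16) and (16,14) count as the same genotype, as in the real NDNAD representation.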
Trumpet these results for a simulated DNA database for UK caucasians. For a 20,000-profile, 6 loci / 12 allele run, the first run gave 5 matched pairs, no triples:
1,6,1,6,2,6,1,3,2,3,2,6
3,4,2,6,5,5,2,3,1,3,4,5
3,5,2,6,5,6,2,3,1,3,1,6
5,6,1,6,5,5,2,3,1,1,1,6
5,6,2,6,5,6,7,7,2,8,1,8
Second run, one pair only:
4,5,6,6,6,6,3,6,1,3,5,5
The real NDNAD had 45,000 6-loci profiles back in 1991, which just shows what dangerous nonsense these databases are for nabbing false suspects, and suggests the number of 'unresolved' pairs in the real NDNAD. The multi-modal 'average Joe' 6-loci profile for UK Caucasians, for vWA,THO1,D8,FGA,D21,D18, is (17,17)(6,9.3)(13,14)(21,21)(29,30)(14,14), corresponding to 4,4,1,6,5,6,3,3,2,3,3,3 in this representation - so little agreement with my other hypothesis, although individual pairs seem to tally in 4 of the 6 loci. To convert one representation to the other, use the tables in my previous posting. Then a single 50,000 x 12 run gave 27 matched pairs, no triples, processed down from 600,000 data points:
1,3,1,6,4,7,4,8,2,3,2,2
1,5,1,6,4,5,3,6,2,3,4,6
1,5,1,6,5,5,2,3,2,2,1,2
1,5,6,6,5,5,4,8,1,3,1,3
1,5,6,6,5,6,2,3,2,3,2,3
2,6,1,5,5,5,3,6,1,3,5,5
3,3,2,5,5,6,3,6,2,3,1,3
3,4,1,6,5,5,3,4,3,8,1,3
3,4,2,6,3,5,3,4,2,8,4,5
3,4,2,6,5,5,4,8,5,6,4,6
3,5,1,2,5,6,3,4,2,3,0,7
3,5,1,5,5,5,2,8,2,3,1,6
3,5,1,5,5,6,4,4,3,8,5,5
3,5,5,6,4,5,3,6,2,8,0,2
4,4,1,5,4,6,7,7,2,3,1,3
4,4,1,6,5,6,3,4,1,3,3,7
4,4,5,6,0,2,2,3,1,3,1,5
4,4,5,6,4,6,2,4,2,3,2,4
4,5,1,1,5,5,2,9,3,3,5,5
4,5,1,2,5,6,2,3,2,3,1,3
4,5,1,6,5,5,2,7,3,3,1,6
4,5,2,6,4,5,3,9,1,3,3,7
4,5,2,6,5,6,1,7,2,3,4,7
4,5,2,6,5,6,3,7,2,3,1,3
4,6,1,6,4,6,2,3,3,3,1,2
4,6,5,6,5,5,2,3,2,8,5,6
5,7,2,5,5,6,4,6,2,5,3,5
There is one further bit of analysis which could probably do with another macro. For each pair of columns in the above 27 match-rows, do a frequency plot of each digit and compare it to the generating characteristic for each 'locus', looking for the attenuated-tails and enlarged-modal-group effect.
And which are the most commonly occurring paired alleles in each locus? e.g. (4,5) for vWA and (1,6) for THO1, or whatever. To go any further I must make a software restructure, basically swapping disk space for memory and re-concatenating, to go to 7 loci, then 8, 9 and 10 loci and more than 50,000 'profiles'. Would anyone care to speculate on the number of matches in 50,000, 100,000 and 200,000 profiles for 6, 7, 8, 9 and 10 loci data-sets? Or even the general case, extended to 2 million or even 60 million. At the moment, for 6 loci, it looks as though the number of matches follows a square law, about [N/(10^4)]^2 where N = number of 'profiles'. I have to gauge when to post this stuff to the Yahoo forensic group, Prof Sir Alec Jeffreys etc.
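The square-law observation is essentially the birthday calculation: the expected number of unrelated matching pairs is the number of profile pairs, N(N-1)/2, times the per-pair match probability, so doubling N roughly quadruples the matches. Calibrating that probability from the 27 pairs seen at 50,000 profiles reproduces the rule of thumb (a back-of-envelope check of mine, not part of the original macros):

```python
from math import comb

def expected_matching_pairs(n_profiles, p_match):
    """Expected unrelated matches: number of profile pairs times the
    per-pair match probability - the calculation behind the observed
    square law."""
    return comb(n_profiles, 2) * p_match

# per-pair 6-locus match probability implied by 27 pairs in 50,000 profiles
p6 = 27 / comb(50000, 2)   # roughly 2.2e-8
```

At N = 100,000 this predicts about 108 pairs, four times the 27 seen at 50,000, consistent with the [N/(10^4)]^2 rule (which would give 100).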
I've now converted all macros to 7 loci / 14 data-points. This is the result for a 50,000-profile run simulating vWA,THO1,D8,FGA,D21,D18,D2:
1,4,1,6,4,6,4,7,1,3,4,7,3,4
4,5,5,6,5,5,3,6,1,3,3,5,1,4
5,6,1,6,4,6,6,8,1,3,3,3,2,9
3 pairs, no triples. This 7th locus, D2, is the most removed from a normal distribution, having 3 distinct, separated peaks for UK caucasians. The final 3 loci are more normally distributed, but I will certainly have to increase to 100,000 profiles and more. At the moment the pc processing time, on a 1997-vintage AMD K6 with 64M RAM, for 50,000 x 14 is:
1/ generating profiles constrained to allele frequencies - 32 seconds
2/ redirecting pairs - 20 s
3/ splitting into 10 files by first digit (0 to 9) - 17 seconds
4/ sorting the biggest file (3........... in this case, but no pairs in this file) - 85 seconds
5/ pair matching - 3 seconds
6/ visual check of the sorted file to confirm presence of matches and also see if there is a triple
7/ for files that don't reveal a match, repeat with a seeded match in the data to check the macro does pick it up
Repeat processes 4, 5, 6, 7 on each of the remaining 9 files. The sort is alphanumeric rather than numeric. If the files become too big to sort (process 4) then I will just subdivide on the second digit and proceed as before. So would anyone care to predict the number of matches in 100,000 and 200,000 runs for 8, 9 and 10 loci? Results so far:
4,000 8 digit, undirected, 10 pairs
10,000 8 digit, ", 63 pairs
10,000 10 digit, undirected, 28 pairs
12,000 6 loci, directed, no pairs
20,000 6 loci, ", 1 to 5 matches
50,000 6 loci, ", 27 pairs
50,000 7 loci, directed, 3 pairs
I've now converted all macros to 8 loci, 16 data-points. This is the result for one 100,000 profile run simulating vWA,THO1,D8,FGA,D21,D18,D2,D16. Just one match:
4,4,2,6,6,6,1,4,2,3,3,3,1,8,3,3
Results so far
4,000 , 8 digit, undirected, 10 pairs
10,000 , 8 digit, ", 63 pairs
10,000 , 10 digit, undirected, 28 pairs
12,000 , 6 loci, directed, no pairs
20,000 , 6 loci, ", 1 to 5 matches
50,000 , 6 loci, ", 27 pairs
50,000 , 7 loci, directed, 3 pairs
100,000 , 8 loci, directed, 1 pair
I've now converted all macros to 9 loci, 18 data-points. No matches for one 200,000 profile run simulating vWA,THO1,D8,FGA,D21,D18,D2,D16,D19.
Results so far
4,000 , 8 digit, undirected, 10 pairs
10,000 , 8 digit, ", 63 pairs
10,000 , 10 digit, undirected, 28 pairs
12,000 , 6 loci, directed, no pairs
20,000 , 6 loci, ", 1 to 5 matches
50,000 , 6 loci, ", 27 pairs
50,000 , 7 loci, directed, 3 pairs
100,000 , 8 loci, directed, 1 pair
200,000 , 9 loci, directed, no pairs
For anyone coming after me, this is a breakdown by 'vWA' leading digits, as it is quite bunched and matches are presumably more likely in the bigger groups (eg 1,4... ; 3,4... ; 3,5... ; 4,4... ; 4,5... ; 4,6...), and probably much the same proportions for the 10 loci case:
0,0.... to 0,9.... 400 'profiles'
1,1.... 2100
1,2.... 3400
1,3.... 9000
1,4.... 11300
1,5.... 9300
1,6.... 3900
1,7.... to 1,9.... 700
2,2.... 1300
2,3.... 6800
2,4.... 8900
2,5.... 7100
2,6.... 2900
2,7.... to 2,9.... 500
3,3.... 9500
3,4.... 23400
3,5.... 18900
3,6.... 8000
3,7.... to 3,9.... 1400
4,4.... 14700
4,5.... 23400
4,6.... 10100
4,7.... to 4,9.... 1800
5,5.... 9500
5,6.... 8000
5,7.... to 5,9.... 1400
6,6.... to 6,9.... 2200
7,0.... to 7,9.... 50
8,0.... to 8,9.... 1
Now converted all macros to 10 loci x2, and also a macro for converting back to the usual representation. For a run of 600,000: a single 10 loci match of vWA,THO1,D8,FGA,D21,D18,D2,D16,D19,D3
(17,18);(8,9);(13,14);(20,22);(30,30);(14,15);(20,20);(12,13);(13,14);(16,18)
Then cutting back on the same output array:
the same single match on 9 loci and no other
9 (18) matches on 8 loci, including the 9 and 10 one
102 (204) matches on 7 loci, including the first 7 pairs of the 8,9,10 ones
2907 (x2) matches on 6 loci, including the first 6 pairs of the 7,8,9,10 ones
No triples on the 8 loci set, and I've not checked for the 7 and 6 loci. If there are 6 loci records on the NDNAD they must be next to useless. About 3000 matches in 600,000, so if it were still 6 loci, and square law, then there would have been about 3000 x 3 squared or 27,000 x 2 matches. My 'average Joe' is
(17,17);(6,9.3);(13,14);(21,21);(29,30);(14,14);(17,20);(11,12);(14,14);(15,16)
and my (slightly altered) profile is
(17,19);(8,9.3);(13,13);(20,22);(29,29);(13,15);(18,19);(12,12);(12,14);(16,18)
4,6;3,6;5,5;2,4;2,2;2,4;2,3;4,4;0,3;4,6
even closer to the numerically derived first match, normalised to
(0,1);(0,1);(0,-1);(0,0);(-1,-1);(-1,0);(-2,-1);(0,-1);(-1,0);(0,0)
There were 70,764 'profiles' with a first, vWA, pair of (17,18); that contained the 10 loci match and was the largest sub-set. The next largest was 70,131 for vWA (16,17). I may do a one million run for the sheer hell of it, but maybe only fully sort the (17,18) subset.
I think there is a problem with the Rnd function despite using the Randomize adjunct. Did a (4,5) subset of 2 million profiles, which took 25 minutes, giving 236,345 x 20 digit 'profiles' (4,5);.......... Much processing later......... The same matched pair as before, which all looks highly suspicious. And again the same single match for 9 loci; 22 matches for the 8 loci subset, 214 for 7 loci and 6113 for 6 loci. Inspecting the 22 matches compared to full 10 loci, there were 3 near misses in adjacent sorted sequences, first 16 digits matched, final 4 digits
1,3,3,6 and 1,8,3,6
1,3,3,6 and 3,5,3,6
1,3,4,6 and 3,3,4,6
so 3 separate 9 loci matches, and 2 separate 9.5 loci if I had chosen loci 1,2,3,4,5,6,7,8,10 instead of the straight sequence. I will have to research the Rnd pseudo random number generator, as my macros seem to check out ok.
I did some further checking back in the original generated, undirected '2 million profile' file, and what becomes a match started as these two sequences
5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6
5,4,3,5,6,5,2,4,3,3,4,3,4,4,5,4,3,1,6,4
which when directed both become
4,5,3,5,5,6,2,4,3,3,3,4,4,4,4,5,1,3,4,6
which is not in the original at all, so not a manifestation of the Rnd function repeating itself. There is no way the Rnd function would 'know' what I was going to do with the output. In other words, what looked highly suspicious (1 10-loci and only 1 9-loci match) would seem genuine after all. Fascinating stuff. There is no repeated sequence turning up in the generator file, as that would carry through and be picked up by the matching macro. Unfortunately, due to constraints of disk space / enforced deleting of files, I don't have the original undirected source file (23.4 MByte) for the 600,000 profiles where the same sequence later emerged, only the directed file, but probably the same effect. Generating a new (5,4) + (4,5) 2 million subset, the sequences differed from the previous run, so Randomize was working. BUT - I checked the original undirected/unsorted file for the central 3x2 group 3,3,3,4,4,4 and
5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6
emerged again, but in a different place in the file. The following sequences also matched. So Rnd seems ok within one run, but repeat a run and the same result is likely to emerge somewhere in a long run, despite using Randomize so the Rnd function starts at a different point. Bear in mind that although I'm only selecting the 5,4 / 4,5 subset, the Rnd function is being called 2 million x 20 times, and 2^24 in the inbuilt Rnd function is only about 16 million; Rnd produces an exact figure based on the previous call. I've buried a superfluous Rnd call in the subroutine that writes the (4,5)..... file, very approximately on average every 20 profiles, so it should disrupt the sequence as far as the numbers used in the loci generator are concerned.
This write call would not be the same for each run. I've so far done another 230,000-odd (4,5).... profiles and that sequence does not reappear; I will fully process and see what emerges.
Some right fun and games with Linear Congruential Generators for random numbers, from sources http://www.geocities.com/SiliconValley/Campus/7071/rnd.html and www.kaner.com/pdfs/random.pdf . I am now using the Microsoft form of the Rnd, but in this form it has 15 digit precision rather than being truncated to 7, and I seem to be getting more convincing results. I tried the Kaner/Vokey with z = 2^40, trying each of a = 27182819621, c = 3 and a = 8413453205, c = 99991 in exactly the same Visual Basic code as below, but there was horrendous repeating of 'random' numbers. I've no idea what the problem is, if someone else would like to fabricate a fairly simple RNG or check the following code in a VB procedure. Dimensioning variables to Double made no difference.
---------------
' initialising
a = 214013
c = 2531011
x0 = Timer ' timer sets start seed to number of seconds after midnight
z = 2 ^ 24
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
result = x1 / z ' 0 < result < 1
-----------------------
So far just processed one run of a 1 million generator using the above form of RNG, outputting to disk just the 4,5 and 5,4 subsets - directed, giving 118,193 4,5......... Then sorting just 12,878 of the divided 4,5,3............. profiles: no 16, 18 or 20 loci matches, 3 '14 loci matches' and 88 '12 loci matches' lopping back.
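For what it's worth, the VB routine above is a linear congruential generator, x' = (a*x + c) mod z; the divide / Fix / multiply dance is just extracting that modulus via the fractional part. A Python re-expression (same constants; the fixed seed here is for illustration only, where the VB code seeds from Timer):

```python
# Constants from the VB code above
A, C, Z = 214013, 2531011, 2 ** 24

def lcg(seed):
    # x' = (A*x + C) mod Z, yielding x'/Z in [0, 1) on each call
    x = seed
    while True:
        x = (A * x + C) % Z
        yield x / Z

gen = lcg(12345)  # fixed seed for illustration; VB used Timer
draws = [next(gen) for _ in range(5)]
print(draws)
```

Note that with z = 2^24 the period can be at most 2^24, about 16 million values, which is exactly the concern raised earlier about a 2 million x 20-call run wrapping round the sequence.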
For a 1 million run each time: one run extracting 4,5 and another run extracting 4,4.
4,5.... (including the 4,5,1.. I mentioned yesterday), of 105,315 profiles: 2x 8 loci matches only, no 9 or 10.
4,5,1,6,6,7,3,4,0,3,4,5,3,8,3,4
4,5,2,6,6,6,3,4,1,2,3,7,1,8,4,5
The 0 above represents only 3.1%. The sequences convert to
17,18 ; 6,9.3 ; 14,15 ; 21,22 ; 27,30 ; 15,16 ; 19,24 ; 11,12
17,18 ; 7,9.3 ; 14,14 ; 21,22 ; 28,29 ; 14,18 ; 17,24 ; 12,13
36x 7 loci matches, 1345x 6 loci matches.
And another 1 million run for 4,4 only: one 8 loci match in 73,259 4,4,........ profiles.
4,4,3,6,5,6,4,6,3,3,3,7,0,1,4,5
The 0 here represents 3.7%.
18x 7 loci matches, 647x 6 loci matches.
I now have confidence in the RNG and have ramped up to 10 million profiles. It took 2 hours 12 minutes to generate and save to disk the subset of 174,017 profiles 4,5,1,6... 4,5,6,1... 5,4,1,6... and 5,4,6,1....., which when directed gave 4,5,1,6..... profiles only. In them were 2 matches on 10 loci:
4,5,1,6,2,5,0,4,1,5,1,7,1,4,4,4,1,3,5,6
4,5,1,6,5,6,2,6,1,2,2,3,1,4,3,4,3,3,3,5
which convert to vWA;THO1;D8;FGA;D21;D18;D2;D16;D19;D3
17,18 ; 6,9.3 ; 10,13 ; 18,22 ; 28,31 ; 12,18 ; 17,20 ; 12,12 ; 13,14 ; 17,18
17,18 ; 6,9.3 ; 13,14 ; 20,23 ; 28,29 ; 13,14 ; 17,20 ; 11,12 ; 14,14 ; 15,17
The remaining processing, because only 174,017, took much the same time as previous processing but with a narrower 'catch'. Other results in the usual sequence, ie ordered 9 (excluding D3), not perm any 9 from 10, which would give higher numbers, but as I rely on a sort routine I cannot do that determination.
9 loci - 7 matches
8 loci - 103 matches
7 loci - 1078 matches
6 loci - 21,113 matches
The 7x 9 loci and 2x 10 loci result is not too surprising, because the 10th locus is D3 and very biased in the 3/4/5 area.
9 loci match analysis: 4 pairs were 4,5,1,6,2,5,.... including the 10 loci one.
8 loci analysis:
12 were 4,5,1,6,2,5.....
17 were 4,5,1,6,4,5.....
35 were 4,5,1,6,5,5.....
17 were 4,5,1,6,5,6.....
So ramping up from 2 million to 10 million, a factor of 5, these results agree with the square law against the 2 million results if restricted to 4,5,1,6... also. Remember someone has decided to halt the NDNAD when it reaches 3 million. It looks suspiciously like he has done the same processing as me: 3m is likely the figure [ <10/(2^-.5) and > 2 million (square law assumed) ] where you are likely to get one match in the most frequently occurring (first) loci. Returning to the 10 million result, I still have no idea whether there would be more 10 loci matches in the remaining (10m minus 174,017) = 9,825,983 profiles I did not save and test. From the 2m runs and the 8 loci results for subsets 4,5,2,6... and 4,4,....
I would suggest there are, but I cannot put a likely figure on it. What 8, 9, 10 loci matches do emerge are not being found wholly in the multi-modal areas where I intuitively expected them to be, so it seems they could appear anywhere, perhaps with a majority of modal matches. I will try another 174,017 subset of 10m in a block away from 4,5,1,6.... ; perhaps 4,4........ and see what emerges. I will also write it up with the macros for anyone else to have a go - independent replication of such analysis is fundamental. I used Visual Basic macros with Word 97 on a 6-year-old pc. The next area of exploration is the common ancestor, ie parent and 10 alleles in common at least, grand-parent and 5 alleles in common, on average, at least. What is the probability of someone related having these 5 to 10 as a starting point, then also matching on the remaining 15 to 10 just by chance, and the probability of that person also being in the NDNAD? Remember we are talking real ancestry here, not the nice comfy (sham) ancestry of the genealogy community. The milkman factor, lovers, one night stands etc mean that up to 30% of people have a genetic father different to their accredited father.
The nearest to 174,017 I could find as a convenient rarer subset of 10 million profiles was for 2,6.... & 6,2....., giving 150,105 'profiles'. Results:
10 loci - 0 matches
9 loci - 1 match
8 loci - 3 matches
7 loci - 39 matches
6 loci - 1262 matches
The 9 loci match was on 2,6,2,6,5,5,3,7,2,3,3,5,7,8,4,5,3,5 which started as
2,6,2,6,5,5,3,7,2,3,3,5,8,7,5,4,3,5,6,2 and
2,6,6,2,5,5,7,3,3,2,3,5,8,7,4,5,3,5,3,4
so confidence in the RNG. Previously I did a similar 2,6 & 6,2 run, but included a variation: adding in calls, 1 in 20, to the built-in Rnd function, added to the external Rnd, on the assumption that adding a poor rand to a reasonable rand would make it better. Not so. Processed through and checked for matches: apparently 3 10-loci matches. Going back to the generator array, except for the pair-directing, the sequence appeared twice, exactly the same, in different places in that array. I repeated for the second 'match' and again found a pair of sequences in the original. I did not bother checking the third result and scrapped the lot. I downloaded the Sunny-beach RRnd but haven't got anywhere with it: the help file doesn't come up and it doesn't like my sound-card. Knowing what (regular rather than random) hash appears on radio reception close to a computer, I would have thought any analogue noise derived from a sound card would be heavily contaminated with all the repetitive digital noise.
171,122 subset 3,4,1,6.......... of a 10m run. Results:
10 loci - 0 matches
9 loci - 5 matches
8 loci - 91 matches
7 loci - 1079 matches
6 loci - 22,113 matches
The 5x 9 loci results were all 3,4,1,6,5,6...... For anyone wishing to replicate these processes I've put the macros and some background on http://www.nutteing2.freeservers.com/dnas.htm . Over the next week I will write up the rest of this simulation experiment and add it to that file (and mirror sites). The next run will probably be subset 3,5......, which should be about 946,000, processing 3,5,1,6....... first and then the remainder. I am trying to think my way around the co-ancestry conundrum. Should it help anyone else, I did some processing on the final sorted arrays for 15 alleles and 10 alleles. In the first instance, assuming a match on 10 digits 1,2,.....10 of 20 is, for this purpose, much the same as 1,3,5... 19 of 20, and for the moment ignoring the perm 1 from 2.
For the rarer 2,6...... profiles (150,105 out of 10m):
15 allele matches - 9
10 allele matches - 16,939, and I would guess about 1 in 50 were quadruples, repeated pairs.
For the common 3,4,1,6.... profiles (171,122 out of 10m):
15 alleles - 271 matches
10 alleles - 71,876 matches, including I would guess 1 in 10 quadruples.
What is the probability of a related person (parent-wise, so 10 loci in common already) also having, by chance, a match on the other 10? What is the probability of a related person (grandparent-wise, so 5 loci in common already, on average) also having, by chance, a match on the other 15? This week I've started reading the Spencer Wells book The Journey of Man: A Genetic Odyssey - the Y chromosome derivation of human migration since an African 'Adam', like the mitochondrial 'Eve'.
A quote from it, relevant here (Kidd's paper on the Amerindian study, coincidentally, I should have received this week from the British Library): " The geneticist Kenneth Kidd, of Yale University, has pointed out that if we double the number of ancestors in each generation (around 25 years), when we go back in time about 500 years each of us must have had over a million living ancestors. If we go back a thousand years, our calculation tells us that we must have had one trillion (1,000,000,000,000) ancestors - far more than the total number of people that have existed in the whole of human history. ................... ....... The error in our ancestor tally is not from a malfunctioning calculator, but from the assumption that each of the people in our genealogy is completely unrelated to the others"
Good news for the anti-FSS brigade: found another 10 loci match, in a different area. I thought THO1 had the maximum possibility of pairs of alleles, with max frequencies of .241 and .304, but it is actually loci 8 and 9 in the standard FSS order: D19 at .382 and .222, and D16 at .289 and .288. So I rejigged things, generating 10 million profiles but only saving to disk those directed to become ..............3,4,1,3.., giving 283,201 'profiles'. Then divided for THO1 (1,6), so 41,551 profiles of form ..1,6..........3,4,1,3.. Then divided, sorted, reconcatenated and match-checked, giving one match of
4,5,1,6,6,7,3,7,2,3,3,3,1,8,3,4,1,3,3,4
converted back as
17,18 ; 6,9.3 ; 14,15 ; 21,24 ; 29,30 ; 14,14 ; 17,24 ; 11,12 ; 13,14 ; 15,16
This match started as 54,61,67,37,32,33,18,43,31,43 and 54,61,67,73,32,33,18,34,13,34 so no obvious problem with the RNG. Cutting back on the final array for lower matches (perhaps not too relevant, as the 3,4,1,3 columns are all the same):
9 loci - 4 matches
7 and 8 - as 9
6 loci - 162 matches
9 loci result:
4,5,1,6,4,4,2,7,1,3,3,3,3,7,3,4,1,3
3,5,1,6,5,6,2,7,1,2,1,3,1,8,3,4,1,3
4,5,1,6,5,6,6,7,4,8,1,1,1,2,3,4,1,3
4,5,1,6,6,7,3,7,2,3,3,3,1,8,3,4,1,3
By reconfiguring the columns and resorting:
10 loci - 1 match, as before
9 loci - 4 matches, as before in effect
8 loci - 162 matches
7 loci - 3,682 matches
6 loci - 15,172 matches
I will probably process the next biggest batch of the generated 283,201 profiles before ditching, ie ..2,6..............3,4,1,3.. It really requires someone with a bigger number-crunching computer to structure the multiple sort processes into one macro, or a different process altogether, and crunch all 10 million in one go.
I have now found the first 10 loci match in an area where I was not expecting one. Continuing from yesterday's results (the 9 loci result and the reconfigured-column counts above), I processed the remaining ..2,*...........3,4,1,3.. For 72,578 profiles:
10 loci - 0 matches
9 loci - 3
8 loci - 117
7 loci - 3,855
6 loci - 21,646 matches
Then processed the remaining ..1*..............3,4,1,3.. but * not = 6, for 78,434 profiles:
10 loci - 0
9 loci - 4
8 loci - 154
7 loci - 4,028
6 loci - 23,401
Then the remaining ..a*..............3,4,1,3.. , a not = 1 or 2, for 90,634 profiles:
10 loci - 1 match
9 loci - 5 matches
8 loci - 159
7 loci - 4,327
6 loci - 25,318
The match was for
4,5,3,6,5,6,2,6,1,3,4,6,1,8,3,4,1,3,3,4
converted back to
17,18 ; 8,9.3 ; 13,14 ; 20,23 ; 28,30 ; 15,17 ; 17,24 ; 11,12 ; 13,14 ; 15,16
generated originally as
5,4,3,6,5,6,2,6,3,1,4,6,8,1,4,3,3,1,4,3 and
4,5,6,3,6,5,6,2,1,3,6,4,8,1,4,3,3,1,4,3
so a good RNG. This has (4,5), one of the main modal groups, so to properly test for extra-multi-modal matches I will probably derive 300,000 profiles selected to be of form (other than 4 or 5)(other than 1 or 6) .............. (other than 3 or 4)(other than 1 or 3) .. and process through. At the moment, for all matches found:
2 matches in an expected batch of 174,017
1 match in an expected batch of 41,551
1 match in an unexpected 90,634 plus 72,578 plus 78,434
So for the moment the best guess for 10 loci matches in 10 million totally unrelated profiles is >4 and less than 40.
I tried 300,000 profiles selected to be of form (not a 5)(not a 6) .............. (not a 3)(not a 3) .. and processed through. I hadn't realised this gives only 6.6%. Of 300,000 such profiles, no 10 loci or 9 loci matches.
8 loci - 2
7 loci - 40
6 loci - 1387
Next I will probably try something like the opposite in a 2 million run.
2 million run, with the processed 137,190 profiles containing at least one of the four most common alleles on each of loci 0,1,....7,8.
10 loci matches - 0
9 - 0
8 - 0
7 - 22
6 - 987
I was playing around with kinship (coin-tossing) statistics, simulating with the RNG, so approximate only, as only using variously 100,000 and 10,000 x 10 and 20 'tosses'. As far as I see it, 50% chance per allele. For two people with the same mother and father, the chances of inheriting the same (unspecified, ie no particular order) N alleles:
N percent probability
20 low
19 .003
18 .02
17 .12
16 .5
15 1.5
14 3.6
13 7.4
12 12.0
11 16.3
10 17.3
9 16.3
8 12.0
7 7.4
6 3.6
5 1.5
4 .5
3 .12
2 .02
1 .003
0 low
For one parent, concerning inheritance of a matching 1 allele in each pair of 10 loci:
N %
10 .11
9 1.0
8 4.2
7 11.6
6 20.6
5 24.6
4 20.6
3 11.6
2 4.2
1 1.0
0 .11
For a common grandparent, one allele on each of 10 loci, 25% chance for each allele:
N %
10 low
9 .001
8 .04
7 .36
6 1.64
5 5.43
4 14.3
3 25.4
2 28.4
1 18.8
0 5.7
For a common great-grandparent, 12.5% chance each:
N %
7 .004
6 .06
5 .4
4 2.3
3 9.3
2 24.1
1 37.8
0 25.6
For a common gg-grandparent, 6.25%:
N %
7 .001
6 .001
5 .015
4 .13
3 2.0
2 10.4
1 35.2
0 52.3
For a common ggg-grandparent, 3.125%:
N %
5 .01
4 .28
3 2.0
2 10.5
1 34.2
0 53.0
Is there a mistake here, in the residual 30-odd percent chance of inheriting one allele over 5 generations? Anyone know the figures for the number of people alive today, legitimate (real and assumed) and illegitimate, having the same 2 parents, one parent, one grandparent, one great-grandparent etc? How to meld this sort of data with multi-allele matching probability in a NDNAD? Is there a numerical / simulation way to determine how much co-ancestry will increase the number of matches within a database?
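The simulated coin-tossing tables above are binomial distributions, so they can be checked exactly: the chance of exactly k shared alleles out of n, at per-allele probability p, is C(n,k) p^k (1-p)^(n-k). A Python sketch (exact values, for comparison against the simulated figures; this is a check of the tables, not the author's RNG method):

```python
from math import comb

def binom_pmf(n, k, p):
    # Probability of exactly k successes in n independent trials at rate p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Exact counterparts of the simulated tables above:
sib_10 = binom_pmf(20, 10, 0.5)   # full siblings, 10 of 20 alleles: ~17.6%
parent_5 = binom_pmf(10, 5, 0.5)  # one parent, 5 of 10 loci: ~24.6%
gp_2 = binom_pmf(10, 2, 0.25)     # one grandparent, 2 of 10 loci: ~28.2%
print(round(sib_10 * 100, 1), round(parent_5 * 100, 1), round(gp_2 * 100, 1))
```

The exact 17.6% for 10 of 20 suggests the simulated 17.3% is slightly low, consistent with the caveat above that the tosses are approximate only; the 24.6% one-parent figure agrees exactly.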
I've now joined the redirection macro and the first divider macro to the generator macro, and added a save of the original undirected array of profiles as number strings to halve the disk space requirement. This is probably the proper way to do all the processing: repeated application of the dividing routine on successive columns until there is nothing left to divide. To do this automatically would be alright if it were not for such variable divided file sizes/counts, from a few to 10,000s in the same dividing. I suspected I was wasting my time, but I decided to do a million run and save it all to disk for later processing. So far just processed profiles of form 4..................., numbering about a quarter of a million - 250,942. Results:
10 loci matches - 0
9 - 0
8 - 2 matches, both starting 45,
7 - 60
6 loci - 2465 matches
That jump from 1 million to 10 million makes all the difference. Now I've started, I will have to carry on through the remainder before trying a 2 or 3 million run - disk space permitting. For the next large run I will probably change the order in the generator array from the clumpy vWA,THO1.... to D2,D18,FGA...... to partially even out some of this clumpiness.
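The repeated dividing routine described here is a bucket split on successive digit positions. A Python sketch of one division step (in-memory dictionaries standing in for the author's per-file splits; the demo strings are made up):

```python
from collections import defaultdict

def divide(profiles, position):
    # Bucket the profile strings by the digit at the given position,
    # mirroring the 'split into files by first digit' step
    buckets = defaultdict(list)
    for p in profiles:
        buckets[p[position]].append(p)
    return buckets

demo = ["451625", "341313", "451625", "261234"]
by_first = divide(demo, 0)
# profiles starting '4' land together, ready for sorting and match-checking;
# applying divide() again on position 1 within a bucket subdivides further
print({k: len(v) for k, v in by_first.items()})
```

Dividing before sorting is what kept each sort under the Word/Sort size limit: matches can only occur between profiles that share every leading digit, so each bucket can be sorted and checked independently.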
Perhaps a starting simulation could be: 5 males and 5 females of totally random, unconnected, but otherwise generic UK caucasian profiles. Generate 4 or 5 'children' for each pairing, who in turn are only allowed to mate with random-profile outsiders. Add in a bit of second cousin / cousin / incest matings/pairings, repeat for perhaps 5 or 10 generations and see what emerges. Then repeat with outsiders constrained to come only from, say, 5 similarly generated 'communities', etc.
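The inheritance step of such a simulation is straightforward: each child receives one allele per locus from each parent, chosen at random. A Python sketch (the function name and the 3-locus example profiles are hypothetical, just to show the mechanism):

```python
import random

def child_profile(mother, father, rng=random):
    # Each parent passes one allele per locus; profiles are lists of
    # (allele, allele) pairs, one pair per locus, kept in sorted order
    return [tuple(sorted((rng.choice(m_pair), rng.choice(f_pair))))
            for m_pair, f_pair in zip(mother, father)]

# Two hypothetical 3-locus parents
mum = [(15, 17), (6, 9.3), (13, 14)]
dad = [(16, 18), (8, 9.3), (13, 13)]
kid = child_profile(mum, dad)
print(kid)
```

Run over several generations, with some pairings drawn from within the same 'community', this is exactly the co-ancestry effect being proposed: shared ancestors seed shared alleles, raising the match rate above the unrelated-profiles baseline.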
I worked out how to do VB random access files, Get and Put, and made a macro to detect matches in datafiles in string form. But it would take forever and a day for the macro to process through like that. It looks as though it will have to be a quick-sort macro, or Word/Sort for the subdivided files and then my match macro after re-uniting the sub-files. Now, storing the data as strings not only reduces the file size: the standard Word/Sort (un-highlighted columns or text, default Text type of sort) now works too. I thought the smaller files would increase the handling size of Word/Sort from 15,000, but it's still the same limit. I may make a macro within Word that inputs in turn each of the subdivided (<15,000 profile) files, sorts each file, saves each file, then some sort of macro to copy and paste all these sorted subfiles into one file to match-check. I've accessed a number of VB sort code procedures, but will try the repeated Word/Sort macro first, as I suspect going down that route will be quicker in actual processing time.
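The sort-then-match approach described here (sort the profile strings, then compare adjacent entries) can be sketched in a few lines of Python; runs of identical neighbours in the sorted order are exactly the pairs/triples/quadruples being counted:

```python
def find_matches(profiles):
    # Sort the fixed-width profile strings, then scan adjacent entries:
    # any run of identical strings is a match (pair, triple, ...)
    ordered = sorted(profiles)
    matches, i = [], 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[i]:
            j += 1
        if j > i:
            matches.append((ordered[i], j - i + 1))  # (profile, run length)
        i = j + 1
    return matches

demo = ["451625", "341313", "451625", "261234", "451625"]
print(find_matches(demo))  # → [('451625', 3)]: one profile occurring as a triple
```

This is why the alphanumeric (rather than numeric) sort is harmless here: any consistent ordering brings identical strings together, which is all the adjacent-comparison scan needs.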
I thought I was wasting my time processing the remaining 3/4 million profiles, but no. I was going to leap to 3 million now I have changed to 'string' data blocks and handling. Now I know what I'm doing, have all the macros to hand, and know how the profiles sub-divide into various amounts. If I repeated a whole 1 million run again I reckon it would take only about 3 hours in total to generate, through dividing, batch sorting, batch file merging and final match checking. So, with a very near miss (below) for a 10 loci / 20 digit match in 1 million, the next run will be for 2 million. File size for 1 million profiles as strings: 22.8 MB. Results for 1 million profiles, all saved and processed through:
1.............. (198,191 profiles)
6 loci/12 - 938 matches
7 - 29
8 - 1
2............... (135,851)
6 - 546
7 - 19
8 - 3
9 - 1
3............ (305,269)
6 - 2972
7 - 71
8 - 5
9 - 0
4................... (previously reported) (250,942)
6 loci - 2465 matches
7 - 60
8 - 2 matches
9 - 0
5................ (95,969)
6 - 474
7 - 13
8 - 2
9 - 0
Remainder 0...,6....,7......,8...... (13,778 profiles)
6 - 11 matches
7 - 1
8 - 0
So match totals in 1 million profiles:
6 loci - 7,406
7 loci - 193
8 loci - 13
9 loci - 1
Can now also easily check for triples; so far they have only emerged on 6 loci matches (reconfigured macro for quadruples also):
1..... - 18 triples (1 quadruple)
2........ - 7 triples (0 quad)
3....... - 84 triples (3 quadruples)
4........ - data no longer retained
5... - 11 triples (0 quad)
remainder - 0
So >=120 triples and >=4 quadruples on 6 loci. Needle in a haystack: the near miss on 2........
profiles was actually also a match for 19 digits. The 2 profiles were
"24162378233401331125" and
"24162378233401331122"
Conversion to standard notation:
(15,17)(6,9.3)(10,11)(24,25)(29,30)(14,15)(16,17)(11,11)(13,13)(14,17)
(15,17)(6,9.3)(10,11)(24,25)(29,30)(14,15)(16,17)(11,11)(13,13)(14,14)
Again most, but not all, are common alleles:
vWA / 15 - allele frequency .08
D8 / 10,11 - af .094, .066
FGA / 25 - .075
and D2 / 16 is only af 0.037
These started life as
"42613287323401331152" and
"24162378323410331122"
so nothing suspect about the Rand function. Anyone care to lay bets on a match or matches being contained within 2 million profiles? For anyone not aware of all the previous research: this simulation is for the artificial situation where all profiles are generated absolutely randomly, within the constraint of the distributions found in UK caucasians. It does not assume any co-ancestry, ie all profiles are totally independent of one another, with no common ancestors bequeathing any allele or alleles down the generations. That is the next research/simulation.
The final reckoning: a single 10 loci match in 2 million profiles. Breakdown of results in standard loci order.
For 1.............. (398,036 profiles):
6 loci/12 - 3,644 matches, 105 triples, 5 quads
7 - 91, 0 triples
8 - 5
9 - 0
For 2............... (273,611):
6 - 2,118, 48 triples, 1 quad
7 - 69, 0 triples
8 - 4
9 - 0
For 3............ (609,940):
6 - 9,950, 597 triples, 52 quadruples, 7 quintuples
7 - 255, 0 triples
8 - 28
9 - 2
10 LOCI - 1 match
For 4................... (499,104):
6 loci - 9,865, 540 triples, 49 quads, 7 quins
7 - 268, 0 triples
8 - 28
9 - 0
For 5................ (191,390):
6 - 1,564, 40 triples, 3 quads
7 - 28, 0 triples
8 - 2
9 - 0
Remainder 0...,6....,7......,8...... (27,921 profiles):
6 - 27 matches, 1 triple
7 - 1
8 - 0
So match totals in 2 million profiles:
6 loci - 27,168
7 loci - 712
8 loci - 67
9 loci - 2
10 loci - 1
For 6 loci: 1231 triples, 110 quadruples, 14 quintuples.
The 3... subset's 9 and 10 loci numbers look suspicious, but that is just the way things have panned out, including somewhat similar before. If I wanted to fiddle these results, the first thing I would do is make the 9 loci match number larger. Hopefully anyone repeating this experiment will find similar numbers. For anyone so doing, I will add the count breakdown of the sub-divisions to the dnas.htm file tomorrow. You need a plan to work to, because of the serious disparity of numbers in the sub-divisions.
*****************************************
THE 10 LOCI MATCH in 2 MILLION is
"34,66,56,24,33,13,17,45,13,45"
which converted back, in standard form, is
(16,17)(9.3,9.3)(13,14)(20,22)(30,30)(12,14)(17,23)(12,13)(13,14)(16,17)
all in the more common allele frequencies.
*****************************************
The lowest is D2 / 23, at 11.2% allele frequency. This match started life as
"34,66,65,24,33,13,71,54,13,45" and
"34,66,65,42,33,13,71,45,31,54"
so nothing suspect about the Rand function. Previous results suggested the number of 10 loci matches in 10 million would be between 4 and 40.
Assuming the square law, then 5x 2 million leads to 5^2 = 25 approx matches in 10 million profiles, and an implied 625 in 50 million. More repeats of this experiment, or perhaps even 3m or 4m runs, will show whether 1 in 2 million is average, below or above average. My hunch, from the near miss in 1m, is that it is below average, ie implying between 25 and 40 matches in 10 million. I now have population data for the UK from 1700 to the present day to work on for the next simulation. No data yet for interbreeding factors: father/daughter, brother/sister, uncle/niece, first cousin marriages, second cousin marriages etc. I will place the modified macros, other 'tools' and results on the ftp'd dnas.htm file Sunday, and notify the forensic science lot Sunday or Monday. Is all the above and preceding a first? I've not come across even a hint of anyone publishing this sort of simulation.
Up to Sept 28,2003 -f207

Email Paul Nutteing by removing 4 of the 5 dots
or email Paul Nutteing ,remove all but one dot
Or a message on the usenet group uk.legal has reached me recently a couple of times.
A lot of the contents of this file plus other material 'peer reviewed' on the main forensic science usergroup

Background
A simulation of DNA profile 'families'
A simulation of DNA profile families with consanguinity
A simulation of DNA profile 'families' for 6 generations
dnas.htm revisited with all alleles represented
dnas.htm revisited for >8 percent allele frequency subset (similar ancestry )
Simulation of Taiwanese Tao and Rukai populations to explore the effect of within and without ancestral clusters
Basques autochthonous DNA profiles simulation, 9 loci
Australian Capital Caucasian 9 loci simulation
Australian Capital Caucasian 9 loci simulation, >= 5% allele frequency
Australian Capital Caucasian 9 loci simulation, >= 5% allele frequency
CODIS, 13 Loci Caucasian Simulation
Automating the macros
Exploring other DNA profile match scenarios
Suspect familial matching
Return to co-ancestry factor in the NDNAD simulations
144 random matches in 65,000 -- ONLY?

