Data compare algorithm [review]

  • 31-07-2009 7:20pm
    #1
    Registered Users, Registered Users 2 Posts: 8,070 ✭✭✭


    I have 9 CSV files containing only email addresses, about 250k records in total. I know there are overlaps/duplicates, and since they are email addresses I don't want recipients to get the email twice.

    So I wanted a quick algorithm to get the unique values. Anyway, I just wanted someone to look at this and see if it was legit.
    [I know this is terrible, but I'm really stuck for time here.]

    [PHP]// Read one CSV (one email per line) into an array of lines
    $array2 = file("3.csv");

    foreach ($array2 as $line1)
    {
        echo $line1;
        echo "<br>";

        // Check whether this address is already in the holding table
        $queryString = "SELECT email FROM tempo WHERE email = '$line1'";
        $result      = mysql_query($queryString);
        $num_rows    = mysql_num_rows($result);

        // Only insert if it hasn't been seen before
        if ($num_rows == 0)
        {
            $queryString2 = "INSERT INTO tempo(id,email) VALUES ('','$line1')";
            $result2 = mysql_query($queryString2);
        }
    }[/PHP]


    231,808 total values
    48,533 was the clean result I got

    Bit sketchy, but could it be right?


Comments

  • Registered Users, Registered Users 2 Posts: 2,894 ✭✭✭TinCool


    Logic looks fine to me.


  • Registered Users, Registered Users 2 Posts: 8,070 ✭✭✭Placebo


    Thanks TinCool. FYI, the problem was due to me not using trim, e.g.

    $line2 = trim($line1);

    and then using $line2 in the queries instead.

    Ended up with 203k records.
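
    A sketch of the same loop with the trim fix folded in (same tempo table as above; the mysql_real_escape_string call is an extra precaution, not part of the original code):

    [PHP]$array2 = file("3.csv");
    foreach ($array2 as $line1)
    {
        // file() keeps the trailing newline on each line, so strip it first
        $line2 = trim($line1);
        if ($line2 == "") {
            continue; // skip blank lines
        }

        // Escape before interpolating into SQL (added here, not in the original)
        $email = mysql_real_escape_string($line2);

        $result    = mysql_query("SELECT email FROM tempo WHERE email = '$email'");
        $num_rows  = mysql_num_rows($result);

        if ($num_rows == 0)
        {
            mysql_query("INSERT INTO tempo(id,email) VALUES ('','$email')");
        }
    }[/PHP]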


  • Registered Users, Registered Users 2 Posts: 2,297 ✭✭✭Ri_Nollaig


    I haven't dealt with PHP much, but I don't think it's a good idea to do 250k selects followed by 203k insertions.
    Instead, I'd say it would be much better for performance if you fetched all existing rows once, stored them in an array or similar (a map data type would be best), and checked against that first before doing the insertion.
    That way you would only execute 203k + 1 SQL statements, and I'm sure batch insertion must be possible as well: queue up 100 or even 1000 insert statements and execute them as one.
    Just in case you thought it was going slow :)
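
    In code, that approach might look roughly like this (a sketch only, assuming the same tempo table with an auto-increment id so the column can be omitted; the batch size of 1000 is an arbitrary choice):

    [PHP]// Pull everything already in the table with a single SELECT
    $seen   = array();
    $result = mysql_query("SELECT email FROM tempo");
    while ($row = mysql_fetch_assoc($result)) {
        $seen[$row['email']] = true;     // associative array used as a map/set
    }

    $batch = array();
    foreach (file("3.csv") as $line) {
        $email = trim($line);
        if ($email == "" || isset($seen[$email])) {
            continue;                    // blank or already present
        }
        $seen[$email] = true;
        $batch[] = "('" . mysql_real_escape_string($email) . "')";

        // Flush every 1000 rows as one multi-row INSERT
        if (count($batch) >= 1000) {
            mysql_query("INSERT INTO tempo (email) VALUES " . implode(",", $batch));
            $batch = array();
        }
    }
    if (count($batch) > 0) {
        mysql_query("INSERT INTO tempo (email) VALUES " . implode(",", $batch));
    }[/PHP]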


  • Registered Users, Registered Users 2 Posts: 8,070 ✭✭✭Placebo


    I think it's always going to be a bandwidth problem. The best solution I can think of is to read straight from the CSVs and add to a new CSV, checking the new CSV first, of course.

    foreach = life saver.
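
    A rough sketch of that CSV-to-CSV route, skipping the database entirely (the file names here are placeholders for the nine source files; lowercasing is an extra normalisation step, not something from the thread):

    [PHP]$seen = array();
    $out  = fopen("clean.csv", "w");                         // placeholder output file

    foreach (array("1.csv", "2.csv", "3.csv") as $file) {    // placeholder input names
        foreach (file($file) as $line) {
            $email = strtolower(trim($line));                // normalise before comparing
            if ($email == "" || isset($seen[$email])) {
                continue;                                    // blank line or already written
            }
            $seen[$email] = true;
            fwrite($out, $email . "\n");
        }
    }
    fclose($out);[/PHP]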


  • Closed Accounts Posts: 577 ✭✭✭Galtee


    Placebo wrote: »
    I have 9 CSV files containing only email addresses, about 250k records in total. I know there are overlaps/duplicates, and since they are email addresses I don't want recipients to get the email twice.

    So I wanted a quick algorithm to get the unique values. Anyway, I just wanted someone to look at this and see if it was legit. [...]

    Maybe I missed something along the way, but unless you can guarantee that there is only one email address per line checked, i.e. each line of the CSV holds exactly one email address, you can't be sure you won't end up with duplicates using the logic above.
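
    If a line can carry more than one address, splitting each line before the check would cover that case; a minimal sketch (the comma separator is an assumption):

    [PHP]foreach (file("3.csv") as $line) {
        // A single line might hold several addresses, so split it first
        foreach (explode(",", $line) as $piece) {
            $email = trim($piece);
            if ($email == "") {
                continue;
            }
            // ...run the existing duplicate check / insert on $email here...
        }
    }[/PHP]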

