#55 PK check, quick mode

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2018-11-30
Created: 2018-11-15
Private: No

Had a breakdown on one of those wrong-PK configuration cases today.
It took 9 hours to get to the migration step... which then failed due to a non-unique PK.

You've added
[x] Check primary keys

But it currently checks 100% of the records, which for me is overkill.
One of the tables has several billion records and weighs several thousand gigabytes.

Maybe you'd consider a light mode that checks just a few million records?
In my case it would have shown the problem beautifully.
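
A light mode could simply run the same uniqueness test over a limited prefix of the table. A minimal T-SQL sketch of the idea (table and column names are hypothetical):

select PK_COL1, PK_COL2, count(*) as cnt
from (
    -- TOP without ORDER BY picks an arbitrary sample of rows
    select top (5000000) PK_COL1, PK_COL2
    from BIG_TABLE
) s
group by PK_COL1, PK_COL2
having count(*) > 1;

Any returned row disproves the configured PK; a clean sample, of course, proves nothing about the rest of the table.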

Related

Feature Requests: #55

Discussion

  • Ralf Wisser

    Ralf Wisser - 2018-11-15

    Is there an index on the columns of the big table that have been configured as the PK?
    An index should be created for these columns to speed up both the verification and the migration itself.
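
    For illustration, such an index might look like this (names are hypothetical):

    -- covering index over the configured PK columns
    create index IX_BIG_TABLE_PK_CHECK on BIG_TABLE (PK_COL1, PK_COL2);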

     
  • Ralf Wisser

    Ralf Wisser - 2018-11-15

    While it is true that an appropriate index should be defined for each PK configuration, creating an index just to determine whether a PK configuration is correct would be at least as expensive as running the test queries themselves.

    I'll think about what to do.

    For the time being, I have changed the PK check so that the queries stop immediately if an error is detected. In addition, the queries are processed in multiple threads (as many as specified in the "Parallel Threads" field).
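
    For context, the uniqueness probe behind such a check can be sketched like this (hypothetical names; whether the engine can truly short-circuit depends on the execution plan):

    -- one returned row is enough to prove the configured PK non-unique
    select top 1 PK_COL1, PK_COL2
    from BIG_TABLE
    group by PK_COL1, PK_COL2
    having count(*) > 1;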

     
    • Alexander Petrossian (PAF)

      I feel this is perfect. In my case it would have stopped very quickly = 9 hours of test time saved ;)

      (not to speak of energy wasted and entropy increased)

       
  • Ralf Wisser

    Ralf Wisser - 2018-11-15

    An idea:
    Instead of searching for an appropriate primary key configuration for tables without a primary key, you could give such tables a primary key column.
    Ideally one with autogenerated values, for example:

    alter table table_without_pk add id int identity(1,1) primary key not null
    

    After a "Analyse Datebase", these columns would also automatically be recognized as a primary key.

     
    • Alexander Petrossian (PAF)

      Not an option for me, because the DBA gave me only read-only access to the production data...

       
      • Ralf Wisser

        Ralf Wisser - 2018-11-15

        DBAs are the natural enemies of the Jailer ;)

         
        • Alexander Petrossian (PAF)

          ;)
          Now I got a real treat: one day before vacation, almost everything done (thanks to one very unusual Ralf person),
          and a 300 GB table with over a billion records, many of them full duplicates.

          Just rows with the same values in all columns.
          They go in pairs.
          And they don't show up in the first 200M records, they start after that ;)

          I'm sure it's a mistake on the app side (and a stupid non-unique key).

          But that's a no-go for the beloved Jailer.
          Any ideas for a workaround? A quick hack?
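
          A query along these lines should confirm the full duplicates (hypothetical names; the GROUP BY must list every column of the table):

          select COL_A, COL_B, COL_C, count(*) as cnt
          from DUP_TABLE
          group by COL_A, COL_B, COL_C
          having count(*) > 1;   -- each returned group is a set of identical rows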

           

          Last edit: Alexander Petrossian (PAF) 2018-11-16
          • Alexander Petrossian (PAF)

            I see the code in the upsert, and I'm considering hacking in a GROUP BY on all columns.
            But I feel that may be a considerable slowdown.

            Please share your feelings about this. Maybe you've tried something like that and know it would fail?
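
            The GROUP BY idea in sketch form (hypothetical names; for a plain column list this is equivalent to SELECT DISTINCT):

            -- collapses full duplicates before the insert
            insert into TARGET_TABLE (COL_A, COL_B, COL_C)
            select COL_A, COL_B, COL_C
            from SOURCE_TABLE
            group by COL_A, COL_B, COL_C;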

             
            • Alexander Petrossian (PAF)

              I guess the duplicates were already removed during collection, when we filled JAILER_ENTITY,
              so the only real problem is the duplicate export in the MERGE operation.

               
              • Ralf Wisser

                Ralf Wisser - 2018-11-16

                Are the duplicates in the source database?

                 
              • Ralf Wisser

                Ralf Wisser - 2018-11-16

                The merge operations are only used if the target tables already contained data before the export; otherwise, inserts are executed.
                What result do you get if the target tables are initially empty?
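
                For context: on SQL Server, a MERGE fails outright when two source rows match the same target row, which would explain the errors. A self-contained sketch:

                create table #t (id int, val varchar(10));
                insert into #t values (1, 'old');

                merge into #t as t
                using (values (1, 'x'), (1, 'x')) as s(id, val)   -- duplicate source keys
                on t.id = s.id
                when matched then update set val = s.val
                when not matched then insert (id, val) values (s.id, s.val);
                -- fails with: "The MERGE statement attempted to UPDATE or DELETE
                -- the same row more than once."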

                 
                • Alexander Petrossian (PAF)

                  I haven't tried that yet, but I may try it today.
                  As I understand it, it would do one insert, because there would be only one relevant entry in ENTITIES.

                  Great hint. Thanks, Ralf.

                  Does it matter if the other tables are filled? Is it enough if I truncate only the problematic one?

                   
                  • Ralf Wisser

                    Ralf Wisser - 2018-11-16

                    No: if there are duplicate rows in the source table, it would also insert duplicates into the target table. (Wouldn't that be the correct behavior?)

                    As a dirty quick hack, you could try "insert into ... select distinct ...". To do so, apply the attached patch.
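
                    The hack in sketch form (all names are hypothetical; the actual patch changes the statements Jailer generates):

                    insert into TARGET_TABLE (ID, VAL)
                    select distinct s.ID, s.VAL          -- DISTINCT collapses full duplicates
                    from JAILER_ENTITY e
                    join SOURCE_TABLE s on s.ID = e.ID;  -- join condition is illustrative only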

                     
                    • Alexander Petrossian (PAF)

                      Thanks, I would try that (though I have big worries about a slowdown).

                      Just to clarify my understanding: when we fill the ENTITIES table we also do the duplicate check. Could it be that that check deduplicates the input junk?

                      Also, I'm a bit at a loss about how the junk managed to get into the migrated table [it did! :(].

                       
                      • Alexander Petrossian (PAF)

                        UPD: I think I got it.
                        The duplicate check probably checks only the previous day(s)?
                        So on the initial run, when the target table was empty, inserts were indeed used, and they put in those duplicates that I see there now (without errors).

                        Then I did subsequent runs, the target table was nonempty, Jailer switched to merge, and naturally I got errors.

                        (I'm still a bit worried about a possible slowdown.)

                         
                        • Ralf Wisser

                          Ralf Wisser - 2018-11-16

                          I also suspect that the performance will suffer.

                          Is this a one-time migration, or should it be done on a regular basis?

                          If this is a one-time migration, would it be possible to clone the database and export data out of the clone? Then it would be possible to insert an identity column, which would solve just about any problem.

                          Maybe you could also delete the duplicates after the export in the target table?
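
                          Deleting them afterwards works even without a usable key, e.g. with row_number() partitioned over all columns (hypothetical names):

                          with numbered as (
                              select *,
                                     row_number() over (partition by COL_A, COL_B, COL_C
                                                        order by (select null)) as rn
                              from TARGET_TABLE
                          )
                          delete from numbered
                          where rn > 1;   -- keeps one row per set of full duplicates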

                           
                          • Alexander Petrossian (PAF)

                            They plan to do it regularly.
                            And I don't think they would risk changing the origin table.

                            They give me a frozen copy from production to work on, but I don't know their plans for the data flow.
                            One would be crazy to do this against production servers,
                            so hopefully there is a chance they would allow me to add an ID or cleanse the junk beforehand.

                            Thanks for the hints.
                            I'll report back if it turns out to be too slow. Worried about a temp-tables nightmare...

                             
                      • Ralf Wisser

                        Ralf Wisser - 2018-11-16

                        "when we fill the ENTITIES table we also do the duplicate check. Could it be that that check deduplicates the input junk?"

                        Yes, ENTITIES contains keys without duplicates.
                        But when the rows are exported, ENTITIES is joined to the source table.

                        insert into target select ... from ENTITIES join source on ...
                        

                        So if the source table already contains duplicates, several rows are also exported per ENTITIES element.
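
                        A tiny self-contained demonstration of the effect:

                        create table #entities (id int);
                        create table #source (id int, val varchar(10));

                        insert into #entities values (1);               -- one key, deduplicated
                        insert into #source values (1, 'x'), (1, 'x');  -- two identical source rows

                        select s.*
                        from #entities e
                        join #source s on s.id = e.id;                  -- exports 2 rows, not 1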

                         
                        • Alexander Petrossian (PAF)

                          got it!

                           
                    • Alexander Petrossian (PAF)

                      Alas, it failed before even getting to my table, with a sad

                      The text data type cannot be selected as DISTINCT because it is not comparable.

                      I'll try to limit it to only my one problematic table, just for a test.

                      My taxi to vacation comes in 5 hours, and soon there will be a break for us all ;)
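
                      That error is SQL Server refusing DISTINCT over text/ntext/image columns; casting such columns first is a known workaround, sketched here with hypothetical names:

                      select distinct ID, cast(TEXT_COL as varchar(max)) as TEXT_COL
                      from SOURCE_TABLE;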

                       
                      • Alexander Petrossian (PAF)

                        Ralf, (back in the office now; I'm afraid I'm going to bug you some more)

                        I found that distinct for all tables does not work for me, see above this strange "data type cannot be ... DISTINCT" business.

                        And anyway, I'm pretty sure it would slow things down considerably.

                        Maybe you would consider adding some table-level option that would do the "distinct" magic for a certain manually configured table?

                        So far I have seen only one table like this in my whole 4K+ tables project.

                        (My idea is to switch to the release version and not give the Customer anything hand-patched by me.)

                         

                        Last edit: Alexander Petrossian (PAF) 2018-11-28
                        • Ralf Wisser

                          Ralf Wisser - 2018-11-28

                          How many rows does the problematic table contain?

                           
                          • Ralf Wisser

                            Ralf Wisser - 2018-11-28

                            And which version of MSSQL do you use? (Maybe the pseudo column "rid" can be helpful.)

                             
