Had a breakdown on one of those wrong-PK-configuration cases today.
It took 9 hours to get to the migration... and then it failed due to a non-unique PK.
You've added
[x] Check primary keys
But it is currently checking 100% of the records, which for me is overkill.
One of the tables has several billion records and weighs several thousand gigabytes.
Maybe you'd consider a light mode that checks only a few million records?
In my case it would have shown the problem beautifully.
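Something like this, perhaps (a rough T-SQL sketch of the idea; the table and key column names are made up, and this is not Jailer's actual query):

    -- Hypothetical light-mode check: look for duplicate key values
    -- in only the first few million rows (made-up names).
    SELECT pk_col1, pk_col2, COUNT(*) AS cnt
    FROM (SELECT TOP (5000000) pk_col1, pk_col2
          FROM dbo.big_table) AS sample
    GROUP BY pk_col1, pk_col2
    HAVING COUNT(*) > 1;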
Anonymous
Is there an index on the columns of the big table that have been configured as the PK?
An index should be created for these columns to speed up both the verification and the migration itself.
While it is true that an appropriate index should be defined for each PK configuration, creating an index just to determine whether a PK configuration is correct would be at least as expensive as running the check queries themselves.
I'll think about what to do.
For the time being, I have changed the PK check so that the queries stop immediately if an error is detected. In addition, the queries are processed in multiple threads (as many as specified in the "Parallel Threads" field).
I feel this is perfect. In my case it would have stopped very quickly = 9 hours of test time saved ;)
(not to speak of energy wasted and entropy increased)
An idea:
Instead of searching for an appropriate primary key configuration for tables without a primary key, you could give such tables a primary key column.
Ideally one with autogenerated values. So for example:
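(A minimal sketch, assuming SQL Server syntax; the table and column names are made up:)

    -- Sketch: add an auto-generated surrogate key column.
    -- dbo.my_table and surrogate_id are hypothetical names.
    ALTER TABLE dbo.my_table
        ADD surrogate_id BIGINT IDENTITY(1,1) NOT NULL;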
After a "Analyse Datebase", these columns would also automatically be recognized as a primary key.
Not an option for me, because the DBA gave me only read-only access to the production data...
DBAs are the natural enemies of the Jailer ;)
;)
Now I got a real treat: one day before vacation, almost everything done (thanks to one very unusual Ralf person),
and a 300 GB table with over a billion records, many of them full duplicates.
Just rows with the same values in all columns.
They go in pairs.
And they don't show up in the first 200M records; they start after that ;)
I'm sure it's a mistake on the app side (and a stupid non-unique key).
But that's a no-go for the beloved Jailer.
Any ideas on workaround? Quick hack?
Last edit: Alexander Petrossian (PAF) 2018-11-16
I see the code in upsert, and am considering hacking in a GROUP BY on all columns.
But I feel that may cause a considerable slowdown.
Please share your feelings about this. Maybe you've tried something like that and know it would fail?
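To make the idea concrete, roughly this query shape (table and column names are made up):

    -- Rough sketch of the GROUP BY idea (made-up names):
    -- grouping on every column collapses full duplicates.
    SELECT col_a, col_b, col_c
    FROM dbo.source_table
    GROUP BY col_a, col_b, col_c;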
I guess collect-duplicates were removed when we filled in JAILER_ENTITY,
so the only real problem is export-duplicates at the MERGE operation.
Are the duplicates in the source database?
The merge operations are only used if the target tables already contained data before the export. Otherwise, inserts will be executed.
What result do you get if the target tables are initially empty?
I didn't try that yet. But may try today.
As I understand it, it would do one insert, because there would be only one relevant entry in ENTITIES.
Great hint. Thanks, Ralf.
Does it matter if the other tables are filled? Is it enough if I truncate only the problematic one?
No, if there are duplicate rows in the source table, it would also insert duplicates into the target table. (Wouldn't that be the correct behavior?)
As a quick and dirty hack, you could try "insert into ... select distinct ...". To do so, apply the attached patch.
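Schematically, the statement would have this shape (made-up names; not the patch's exact SQL):

    -- Insert only distinct rows (hypothetical table/column names).
    INSERT INTO target_table (col_a, col_b, col_c)
    SELECT DISTINCT col_a, col_b, col_c
    FROM source_table;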
Thanks, I will try that (though I have big worries about a slowdown).
Just to clarify my understanding: when we fill the ENTITIES table, we also do the duplicate check. Could it be that that check deduplicates the input junk?
Also, I'm a bit at a loss about how the junk managed to get into the migrated table [it did! :(].
UPD: I feel I got it.
The duplicate check probably checks only against the previous day(s)?
So on the initial run, when the target table was empty, inserts were indeed used, and they put in those duplicates that I see there now (without errors).
Then I did subsequent runs with a nonempty target table, so Jailer switched to merge, and naturally I got errors.
(Am still a bit worried about a possible slowdown.)
I also suspect that the performance will suffer.
Is this a one-time migration, or should it be done on a regular basis?
If this is a one-time migration, would it be possible to clone the database and export the data out of the clone? Then it would be possible to insert an identity column, which would solve just about any problem.
Maybe you could also delete the duplicates in the target table after the export?
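For example, assuming the target is also SQL Server, one common pattern (with made-up names) would be:

    -- Delete full duplicates, keeping one row of each group.
    -- Assumes SQL Server; table and column names are made up.
    WITH numbered AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY col_a, col_b, col_c
                   ORDER BY (SELECT NULL)
               ) AS rn
        FROM target_table
    )
    DELETE FROM numbered WHERE rn > 1;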
They plan to do it regularly,
and I don't think they would risk changing the origin table.
They give me a frozen copy from production to work on, but I don't know their plans for the data flow.
One would be crazy to do this against production servers,
so hopefully there is a chance they would allow me to add an ID or cleanse the junk beforehand.
Thanks for the hints.
I'll report back if it turns out to be too slow. Worried about a temp-table nightmare...
Yes, ENTITIES contains keys without duplicates.
But when the rows are exported, ENTITIES is joined to the source table.
So if the source table already contains duplicates, several rows are also exported per ENTITIES element.
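Schematically (this is not Jailer's actual SQL, and the key column name is made up):

    -- ENTITIES holds each key once, but the join brings back
    -- every source row with that key, duplicates included.
    SELECT s.*
    FROM JAILER_ENTITY e
    JOIN source_table s ON s.pk_col = e.pk_col;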
got it!
Alas, it failed before even getting to my table, with a sad "data type can not be ... DISTINCT" error.
I'll try to limit it to only my one problematic table, just for a test.
My taxi to vacation comes in 5 hours, and soon there will be a break for us all ;)
Ralf (back in the office now; I'm afraid I'm going to bug you some more),
I found that DISTINCT for all tables does not work for me; see the strange "data type can not be ... DISTINCT" business above.
And anyway, I'm pretty sure it would slow things down considerably.
Maybe you would consider adding some table-level option that would do the "distinct" magic for a certain manually configured table?
So far I have seen only one table like this in my whole 4K+ table project.
(My idea is to switch to a release build and not give the customer anything hand-patched by me.)
Last edit: Alexander Petrossian (PAF) 2018-11-28
How many rows does the problematic table contain?
And which version of MSSQL do you use? (Maybe the pseudo column "rid" can be helpful.)
You could try the following:
- define a column %%physloc%% as the primary key of the problematic table. Type is "varbinary", length 8.
- define a filter that excludes this column from the export. (See screenshots attached).
see https://stackoverflow.com/questions/909155/equivalent-of-oracles-rowid-in-sql-server
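%%physloc%% is an undocumented SQL Server virtual column; selecting it looks like this (the table name is made up):

    -- %%physloc%% returns the physical row locator (file:page:slot).
    SELECT %%physloc%% AS row_locator, t.*
    FROM dbo.big_table AS t;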
Last edit: Ralf Wisser 2018-11-28
18K duplicates in 1.3B records.
0.001% (cries out loud)
version:
https://support.microsoft.com/en-us/help/4013104/cumulative-update-8-for-sql-server-2012-sp3
One cannot use it in a WHERE clause (invalid column name '%%physloc%%'), so I feel that idea would not work. Trying anyway...
Last edit: Alexander Petrossian (PAF) 2018-11-28