What are the challenges you face when working across database platforms? Take the survey
Options

Log Shipping over WAN

csmcsm Posts: 17
edited May 25, 2007 2:15PM in SQL Backup Previous Versions
Hello!

I have recently begun to roll out Log Shipping via SQL Backup between 2 SQL servers across our WAN. We have a 1.5mb private line connecting two locations, if the private line goes down the network will automatically failover to IPSEC VPN. I plan to schedule the shipping to occur every 15 minutes with a 5 minute delay.

I'm curious where you might recommend we store the files for sharing -- should the log backup job store the log file locally on the source server, or should it copy them to the destination server.

Advantage to storing locally on the source server is that the file is written quickly, and should pretty much be guaranteed to be written as there are no network variables involved. Should the network go down, the log backups would be queued up waiting to be picked up by the Log Restore Job on the desination server.

Disadvantage of storing locally on the source server is that the file must be read twice across the WAN by the destination server -- Once for the restore, and another for the MOVETO to the local backup folder once the restore is complete.

What does everyone recommend? Best practice?

Thank you!
Colby

Comments

  • Options
    When performing the backups itself, you should always back up to a local drive.

    The reason for this is two-fold:
    * Firstly, backing up across a network introduces additional points of failure (the network, switches etc) and places that corruption could occur.
    * Secondly, performance across the network will result in a very slow backup, the performance being a tiny fraction of that experienced when backing up to the local drive and copying afterwards.

    The major caveat with this is that when performing the backup, where possible it should not be to the same local drive - in the event a disk problem occurs, there is less chance of everything being lost in one go.

    When using the Log Shipping Wizard, after the backup has been performed the files are moved (using the COPYTO keyword) to the network share - in the version 5 user interface this is step 4. If you choose to have the network share on the remote machine, you get the following:

    * Log is backed up on a local disk;
    * Log is copied (using COPYTO) from the local disk, to the shared directory (which could be on the remote server);
    * Log is restored from the shared directory (which if on the remote server is a 'local' restore, which gives higher performance);
    * Log is then moved to the processed directory (if both locations are on the remote server, it doesn't need to cross the WAN again).

    In the above, the file is only copied across the WAN once, during the COPYTO phase (point 2)... the MOVETO at the end could move the file from one place on the remote server to another, meaning that it does not need to go across the WAN again.

    I hope that helps with your questions - if you need any more details or more explaination, feel free to ask.

    Jason
  • Options
    Thank you for your reply...

    I agree with everything you said, and understand how the default setup, using the destination server as the share, functions in respect to reducing network traffic, copying/moving files locally, etc...

    However if there was network downtime, or the destination server was unavailable, it wouldn't be as clear which files were pending shipping... the backup would complete into the local folder, but would not be able to copy to a share. So, once connectivity with the destination server was restored, it would need to be determined which logs would need to be delivered, correct?

    Keeping the shipping file share on the source server would queue those up, until the destination server was ready to pull them... however, it would result in duplicate network traffic from the restore, then the move.
  • Options
    Jason CookJason Cook Posts: 185
    edited May 25, 2007 1:44PM
    In regards to the network downtime, that would be correct.

    If the file could not be copied because the network was unavailable, you should see warnings appear for the backup (warning code 141, 142 or 143). This can be used to identify which logs need to be transferred to bring the log shipping up to date.

    Likewise, if the network reappears and later backups are copied over such that one or more have not been copied correctly, by checking the last log in the 'processed' folder and the first in the 'shared' folder, you should be able to identify the range of backups that need to be copied into the 'shared' folder to resolve the problem. You'll also get failure notifications because the restore phase will fail once later logs are copied over.

    This does require manual intervention as you have mentioned... that's the one downside to this method.

    --

    When restoring from the first machine, the major problem is that you are performing a network-based restore, which will incur a significant performance penalty (can easily be a factor of 10 worse than restoring from a local disk).

    By performing a remote restore it makes the restore phase take significantly longer, which could result in the log frequency having to be reduced. You also add more load to the server and network by requiring two network transfers, as you've already mentioned.

    Additionally, by keeping the log on the local machine, you increase the window for a machine crash to wipe out all copies of the log backup. At least if you copy it remotely during the backup phase you have a second copy remotely if the primary server was to crash.

    For example:

    Log backup every 15 minutes, offset restore by 5 minutes.

    Performing the copy on backup, the worst case (assuming no network failure) is losing 15 minutes of data.

    Performing the transfer on restore, the worst case becomes 20 minutes.

    Finally, Unlike the previous scenario, you will get no errors or warnings during the backup phase, and only a 220 warning if the network share is unavailable ("No log files found to be restored"), such that if you are using a MAILTO_ONERROR command, you actually get no emails because it hasn't "failed" as such (or get so many emails using MAILTO that it's easy to miss when a problem actually occured).

    So even ignoring the performance issues, the tradeoff of the second model becomes ease of management vs. knowledge that the log shipping is functioning correctly.

    --

    Hopefully that answers your question, I realise there is a lot to read so if I've not made anything clear or I've missed something, feel free to follow up.

    Hope that helps,
    Jason
  • Options
    Point taken!

    Question...

    If the network or destination server were unavailable for 1 hour, 3-4 transaction logs backups would not make it into the queue, however the following would.

    What would occur on the destination server's restore job? It would be missing 3-4 transaction logs in the sequence.

    Thanks again!
  • Options
    In the first model:

    The 4 backups would pass with a warning because the COPYTO has failed (warning code 141, 142, 143).

    The 4 follow-on restores would pass, citing that "No log files found to be restored", and issue a warning code (220).

    When the network returns and the backup successfully copies, the next restore will fail, citing that the log sequence is missing logs. If you are using the MAILTO_ONERROR you will then get an email telling you a problem has occured. You can then resolve the problem and bring the log shipping back online.

    Once the initial problem has been resolved, the log shipping should be back up to date almost instantly.

    --

    In the second model:

    The 4 backups would pass because the COPYTO is to a local location.

    The 4 restores would also pass, citing that "No log files found to be restored", and issue a warning code (220).

    Once the network returns, the restore will be able to copy the files across, but because of the performance penalty it'll have a knock-on delayed effect until it catches up, depending on the timing issues this could leave a window (minutes? hours?) where the standby server is more than 15 minutes out of date.

    The problem is, you'd never know this was happening if you used the MAILTO_ONERROR keyword, you'd only find out by trawling the MAILTO emails, or scanning the user interface after the event... the first can easily be missed, and the second will only happen after the event.

    Hope that helps,
    Jason
  • Options
    Fair enough! Decision made... Share will be on destination server!

    Thank you for the insightful discussion.
Sign In or Register to comment.