Failing on "not enough space" when plenty is available

marclallen · June 12, 2007 11:50AM

Ok, so it might not be a flaky router.

I've had three backups fail to a remote share. In all three cases, the first error was from "backupmedium::reportioerror" claiming that there was not enough space to write to the virtual backup device.

However, there is plenty of space on the share. Well.. not plenty but enough. I was about 35GB into an 84GB backup file, with over 60GB remaining when I received the error.

The earlier errors had even more space available.

Is this something that the Red Gate virtual device is returning an error for? Is it guessing how much space will be needed based on compression ratios or something? Or did it just get a write error from the share?

Thanks,

Marc

marclallen · June 12, 2007 4:20PM

Even though your test application successfully handed 4MB blocks to the remote share, I changed the max block size to 1MB. We'll see how that works. The first test was successful, but it also tested OK with 4MB yesterday.

Marc

marclallen · June 13, 2007 9:21AM

One of my backups failed again, last night. Although probably a coincidence, the failure occurred when the free space on the target share drive was around 55GB, close to the 60GB + that was on the target previously.

The backup would have required about another 30GB+ or so to complete. I deleted all the old backups, releasing about 150GB to a total of about 200GB and reran the backup. It worked fine. This is still a small sample set, so I don't know how relevant the actual available space left is.

This was with the 1MB max block size. Both parameters were set to 1MB.

Does anyone have any suggestions where else to look?

Oh, and I had already performed the upgrade to 5.1.

Thanks,

Marc

marclallen · June 14, 2007 9:16AM

I tried changing it to delete all previous backups before starting.

Still, two backups failed again, claiming not enough space. Now, these two backups start at the same time (1AM). They are running on two separate machines, but both backing up to directories on the same network share.

Each is set for three threads, with both packet sizes set to 1MB.

Another interesting note:

Three nights ago, the same two backups failed (for the same reason), and each failed at almost the exact same time, within a minute.

I'll try again using a single thread for each, in case there is a problem with that.

Any other suggestions would be welcome.

Marc

marclallen · June 14, 2007 10:08AM

Running both backups simultaneously, each with a single thread, didn't work. Both died within 5 minutes with the dreaded 112 (not enough space) error.

I'm starting to suspect it's the simultaneous execution which makes the problem occur more often.

I'm rerunning the larger of the two by itself and with three threads, again.

Marc

marclallen · June 14, 2007 11:08AM

At least it failed quickly this time. Although it ended, it never showed up in the activity history. Odd.

I simply restarted it. We'll see if it finishes this time.

Is anyone actually reading this? If so, I have a couple of questions:

Who actually detects the disk low status? Is it Red Gate in the virtual backup device? Is it SQL Server, somehow? Is it the source or destination Windows 2003 server?

Marc

Brian Donahue · June 14, 2007 11:14AM

As far as I know, SQL Backup only stops when the share actually runs out of space. Is this about that SQL 112 error? Because that is an error code that indicates 'out of space' but what it actually means is that it could not allocate enough buffer memory, so changing the MAXTRANSFERSIZE can help eliminate the problem.

marclallen · June 14, 2007 1:00PM

Brian,

Thanks for the response. If you read back a number of posts, you will see that I have already reduced the transfer size to 1MB. Are you suggesting I make it even smaller?

Also, in some other posts, I saw reference to a program that tested buffer sizes. That program runs fine with no errors on my system.

I do not believe it is a buffer issue, or at least not as simple as you make it out to be.

Marc

marclallen · June 14, 2007 1:41PM

Dropping both MaxTransferSize and MaxDataBlock (or whatever they are called) to 512K failed.

So, if it's an SQL Buffer allocation error, the error is occurring on the machine being backed up, correct? So, there should be a SQL performance counter that would be in play. Any idea on which one it would be?

Also, I should note that this machine is running with both the PAE and the 3GB switch on boot.

Marc

marclallen · June 14, 2007 2:34PM

Some probably useless SQL Log entries I've gotten. These are from the last failure:

Internal I/O request 0x4986DE78: Op: Write, pBuffer: 0x12620000, Size: 131072, Position: 53128658944, UMS: Internal: 0x0, InternalHigh: 0x20000, Offset: 0xAD800000, OffsetHigh: 0x2A, m_buf: 0x12620000, m_len: 131072, m_actualBytes: 0, m_errcode: 112, BackupFile: SQLBACKUP_39817ADE-5343-4C13-A25E-B066011B344401

BackupMedium::ReportIoError: write failure on backup device 'SQLBACKUP_39817ADE-5343-4C13-A25E-B066011B3444'. Operating system error 112(There is not enough space on the disk.).

There are more, but they're all pretty much like the above. It looks like I got both of these for each thread involved.

As I understand it, if there is a problem obtaining enough contiguous memory for a buffer, the log would indicate that as a contiguous memory problem. So, I'm not sure that is it.

Marc

Brian Donahue · June 15, 2007 4:25AM

Hi Marc,

SQL Backup needs 6*MAXTRANSFERSIZE memory from SQL Server's memory space in order to do the backup. This memory needs to be contiguous. If you set MAXTRANSFERSIZE all the way down to 64KB and the backup still fails, then there just isn't enough contiguous memory. You can check the contiguous memory using the master..sqbmemory stored procedure.

There are lots of reasons why SQL Server runs out of memory. Other extended stored procedures, COM objects allocated out of stored procedures using sp_oaCreate that are not freed, and memory configuration problems come into play. As you may have noticed, this is the single most common failure in SQL Backup, so I never suggested the fix would be easy...

As far as SQL Server memory configuration, here is some information I have found after conducting a bit of research into this area:

If you have:    set:
3-4 GB          /3GB
4-8 GB          /3GB /PAE
16+ GB          /PAE (/3GB will cripple memory over 8GB).

-When you have set /PAE, go into SQL Server's configuration and set the option to use AWE to ON. If you do this, however, you also need to specify a maximum memory value in SQL Server. If you do not, and automatic memory management is in use, SQL Server will take all but 128MB of the computer's memory.

-The user who runs the SQL Server needs to have the 'lock pages in memory' user right in the local security policy, or it will have problems allocating the memory for SQL Backup's extended stored procedure. If you have checked everything above, please check this as well.

-There is a -g startup option in SQL Server to control how much memory SQL Server will leave free for extended stored procedure code. This is important to do when you have more than 500 databases on a server: each database that's online will use 64KB of the free memory. Microsoft recommends:

500 databases: -g288
1000 databases -g372

After setting -g, you need to restart SQL Server. Also, AWE is not available on all editions of SQL Server. If you have hundreds of databases on a SQL Server Standard, think about going to Enterprise or get another instance of SQL to hold the databases.

marclallen · June 15, 2007 9:25AM

Brian,

I am currently running 5GB, using /PAE and /3GB, and SQL Server 2000 Enterprise. AWE is enabled, and I have only a single instance running 1 user database plus the 6 or so system databases.

I have not set the -g switch, but the max server memory is set to 4GB.

According to BOL, SQL Server Setup assigns the right to lock pages in memory to the MSSQL server application by default. I assume I do not need to do anything else. I did check, and that service runs under 'Administrator', who has permission to lock pages.

I do not know how to interpret the output of sqbmemory, but here is the idle (no backup running) output of it:

Type             Minimum              Maximum              Average              Blk count            Total                
---------------- -------------------- -------------------- -------------------- -------------------- -------------------- 
Commit           4096                 39514112             20582                28044                577212416
Reserve          8192                 3944448              86979                27068                2354364416
Free             4096                 187760640            609648               475                  289583104
Private          4096                 39514112             52827                54382                2872881152
Mapped           4096                 1060864              53835                314                  16904192
Image            4096                 7581696              100460               416                  41791488

If you can show me where I can identify low contiguous memory, I would appreciate it.
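One possible way to read that output, offered only as a rough sketch: it assumes the Maximum column of the Free row is the largest contiguous free block in bytes (that interpretation is an assumption, not something confirmed in this thread), and it conservatively treats the 6 × MAXTRANSFERSIZE figure quoted above as one block. The helper names are hypothetical.

```python
# Sketch: compare sqbmemory's largest free block with 6 * MAXTRANSFERSIZE.
# Assumption: the "Maximum" column of the "Free" row is the largest
# contiguous free block in SQL Server's address space, in bytes.

SQBMEMORY_OUTPUT = """\
Commit           4096                 39514112             20582                28044                577212416
Reserve          8192                 3944448              86979                27068                2354364416
Free             4096                 187760640            609648               475                  289583104
"""

def largest_free_block(output: str) -> int:
    """Return the Maximum value of the Free row (hypothetical helper)."""
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Free":
            return int(fields[2])  # columns: Type, Minimum, Maximum, ...
    raise ValueError("no Free row found")

def required_bytes(max_transfer_size: int, buffers: int = 6) -> int:
    """6 * MAXTRANSFERSIZE, the figure quoted earlier in the thread."""
    return buffers * max_transfer_size

free_max = largest_free_block(SQBMEMORY_OUTPUT)
for size_kb in (64, 512, 1024, 4096):
    need = required_bytes(size_kb * 1024)
    print(f"{size_kb:>5} KB transfer size: need {need:>10} bytes, "
          f"fits in largest free block: {need <= free_max}")
```

On these idle figures, every transfer size tried in this thread would fit with room to spare, so the numbers alone do not point to a contiguous memory shortage. The picture while a backup is running may differ.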

Finally, as I noted earlier, all documentation for SQL Server indicates that the log should show something if a buffer allocation fails due to insufficient contiguous memory.

I will retry my backups again at 64K MAXTRANSFERSIZE and let you know how it works.

Thanks,

Marc

marclallen · June 15, 2007 9:48AM

Egad! Dropping my MAXTRANSFERSIZE to 64KB makes the backups take FOREVER. I don't think that it will be acceptable, even if it works. Even 256K is just too slow.

Can you provide any insight into how I can show, for certain, that the issue is contiguous memory?

Marc

marclallen · June 15, 2007 2:07PM

I discovered the following:

Here are the times my backups failed. Note the minute and seconds for each failure. They are practically the same every time!

SQLBackup@TOTALIZER1	ERROR - full backup [multiple databases] on TOTALS1 at 6/15/2007 1:37:54 PM
SQLBackup@TOTALIZER1	ERROR - full backup [multiple databases] on TOTALS1 at 6/14/2007 9:37:49 AM
SQLBackup@TOTALIZER1	ERROR - full backup [multiple databases] on TOTALS1 at 6/14/2007 1:37:50 AM
SQLBackup@TOTALIZER1	ERROR - full backup [multiple databases] on TOTALS1 at 6/12/2007 1:38:03 AM

SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/15/2007 1:24:51 PM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/15/2007 12:24:59 PM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/15/2007 11:24:49 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/15/2007 2:24:52 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/14/2007 3:24:52 PM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/14/2007 1:24:52 PM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/14/2007 11:24:51 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/14/2007 10:24:52 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/14/2007 9:24:52 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/14/2007 1:24:53 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/13/2007 2:24:54 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/12/2007 11:24:54 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/12/2007 1:24:57 AM
SQLBackup@Transaction1	ERROR - full backup [multiple databases] on DATABASE2 at 6/11/2007 10:24:55 AM

So, failures apparently occur at the same minute past the hour every time, and that minute is different on each machine. The minute/second timepoint appears to correspond fairly closely to the last system startup time. I believe there is something happening on an hourly timer from system startup which is causing my problem.
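The clustering can be checked mechanically. The sketch below takes a few of the failure times listed above and reduces each to seconds past the top of the hour, per host. The function name is hypothetical, and the approach assumes the cluster does not straddle an hour boundary.

```python
from collections import defaultdict
from datetime import datetime

# A sample of the failure times listed above: (host, timestamp).
FAILURES = [
    ("TOTALIZER1", "6/15/2007 1:37:54 PM"),
    ("TOTALIZER1", "6/14/2007 9:37:49 AM"),
    ("TOTALIZER1", "6/14/2007 1:37:50 AM"),
    ("TOTALIZER1", "6/12/2007 1:38:03 AM"),
    ("Transaction1", "6/15/2007 1:24:51 PM"),
    ("Transaction1", "6/15/2007 12:24:59 PM"),
    ("Transaction1", "6/14/2007 1:24:53 AM"),
    ("Transaction1", "6/11/2007 10:24:55 AM"),
]

def offsets_into_hour(failures):
    """Group failures by host as seconds past the top of the hour."""
    by_host = defaultdict(list)
    for host, stamp in failures:
        t = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        by_host[host].append(t.minute * 60 + t.second)
    return by_host

for host, secs in offsets_into_hour(FAILURES).items():
    lo, hi = min(secs), max(secs)
    print(f"{host}: {lo // 60}:{lo % 60:02d} to {hi // 60}:{hi % 60:02d} "
          f"past the hour, spread {hi - lo}s")
```

A spread of a few seconds across several days is hard to explain by chance, which supports the idea of an hourly timer on each machine.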

Any ideas?

At first glance, it appears that it must be something happening on both machines. I've checked all the SQL jobs, and nothing seems to be executing at those times. There are also no scheduled tasks running at those times.

The only other thing the two machines have in common (aside from both running Windows 2003 SP1) is that they both use NOD32 Anti-Virus. I have disabled it on one machine, but it still failed. However, I may try uninstalling it completely.

Comments would be more than welcome. If anyone at Red Gate has any ideas on how to identify the problem, I would greatly appreciate it. How can an external job affect either the share's availability or SQL's internal memory space?

Marc

Brian Donahue · June 16, 2007 9:29AM

Hi Marc,

I have noticed that, with Symantec's product anyway, network drive scanning practically cripples I/O to shares. Any of our internal workstations that need to access data through a share get this feature disabled. Maybe that's an area to look at?

I'd also run into some information a few days ago that the NETBIOS protocol can reset the virtual circuit (causing a disconnect at the TCP socket level) for 'security reasons'. This is a known issue with the SMB protocol across a VPN that traverses a NAT interface when two clients attempt to set up a virtual circuit to the same share. I'd think this situation would be extremely rare, but it may be worth a mention...

petey · June 17, 2007 9:56PM

Could you pls post the contents of the logs for the backups that failed? The default log folder is:

<system drive>:\Documents and Settings\All Users\Application Data\Red Gate\SQL Backup\Log\<instance name>

Thanks.

marclallen · June 18, 2007 7:58AM

Brian,

Good thoughts. We are not running Symantec, although NOD32 might have similar issues. I did shut down the actual AV checking for a number of tests, with no luck; however, I know that it installs filters in the network stack that cannot be bypassed without uninstalling it.

Personally, I don't think NOD32 is to blame, though. There have been a number of complaints against it, and, as far as I can tell, it turned out there was always a problem somewhere else. Still, if need be, I will remove it and try more.

The affected servers are on a simple LAN, with the two source machines on static IPs and the destination on DHCP (why? I must've forgotten to give it one.) They are connected through a single GB switch. So, no special network architecture here.

Interestingly enough, my final attempt at one of the two databases did work after I disabled the Windows Update service. Unfortunately, the second host still failed after the same change. I will test more on that today.

Petey, the log follows. I need to check, but it appears that the second host (which is not attached) reports a "share not available" in the log, but the event in the event log still indicates an error 112. I am thinking it's because it is trying with single threading.

I have posted the SQL Log entries in an earlier thread.

6/15/2007 3:09:09 PM: Backing up OutsiteContent (full database) to:
6/15/2007 3:09:09 PM: \\Fileserver\HostBackups\Transaction1\FULL_(local)_OutsiteContent_20070615_150909.sqb

6/15/2007 3:09:09 PM: BACKUP DATABASE [OutsiteContent] TO DISK = '\\Fileserver\HostBackups\Transaction1\<AUTO>.sqb' WITH NAME = '<AUTO>', DESCRIPTION = '<AUTO>', INIT, ERASEFILES_ATSTART = 6h, MAILTO = 'mlallen@outsitenetworks.com', COMPRESSION = 1, THREADCOUNT = 3

6/15/2007 3:24:46 PM: Thread 0 error:
Process terminated unexpectedly. Error code: -2139684860
Process terminated unexpectedly. Error code: -2139684860
6/15/2007 3:24:46 PM: Thread 1 error:
Process terminated unexpectedly. Error code: -2139684860
6/15/2007 3:24:46 PM: Thread 2 error:
Process terminated unexpectedly. Error code: -2139684860

6/15/2007 3:24:47 PM: Database size : 319.679 GB
6/15/2007 3:24:47 PM: Compressed data size: 19.148 GB
6/15/2007 3:24:47 PM: Compression rate : 94.01%

SQL error 3013: SQL error 3013: BACKUP DATABASE is terminating abnormally.
SQL error 3202: SQL error 3202: Write on 'SQLBACKUP_D8C9E2E5-ED1F-4DD2-AB9F-E2EE215B5A8D01' failed, status = 112. See the SQL Server error log for more details.
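As a side note on the "Error code: -2139684860" lines above: a negative code like this is often an HRESULT-style value printed as a signed 32-bit integer. The sketch below only performs that conversion and splits the standard HRESULT fields; treating the value as an HRESULT is an assumption, and the function name is hypothetical.

```python
# Sketch: view a negative 32-bit error code as an unsigned HRESULT-style value.
# Assumption: the logged code follows the standard HRESULT bit layout.

def decode_hresult(code: int) -> dict:
    """Split a signed 32-bit code into severity, facility, and code fields."""
    value = code & 0xFFFFFFFF            # two's complement to unsigned 32-bit
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),    # top bit set means failure
        "facility": (value >> 16) & 0x7FF,
        "code": value & 0xFFFF,
    }

print(decode_hresult(-2139684860))
```

The hex form is easier to search for than the decimal one. The 0x8077 prefix looks like the range used by SQL Server's Virtual Device Interface errors, but that mapping is not confirmed anywhere in this thread and should be checked against the VDI headers.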

Marc

marclallen · June 18, 2007 8:06AM

Brian,

If this NETBIOS reset is something that MS widely added to their OS for security purposes, then don't you think that a simple retry would take care of it?

I mean, file copies and such don't seem to have problems.

Would it be beneficial for SQL Backup to simply retry three or four times? Or is there a problem with not knowing if a write was committed or not?
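To illustrate the question, here is a minimal sketch of the kind of retry meant. It is not how SQL Backup works (nothing in this thread describes its internals); it only shows that when each block is written at an explicit offset, repeating a write whose outcome is unknown overwrites the same bytes rather than duplicating them. All names and parameters are hypothetical.

```python
import io
import time

def write_block_with_retry(f, offset, data, attempts=4, delay=0.5, sleep=time.sleep):
    """Write data at a fixed offset, retrying on OSError (hypothetical helper).

    Returns the attempt number that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            f.seek(offset)          # always reposition, so a repeat overwrites
            f.write(data)
            f.flush()
            return attempt
        except OSError:
            if attempt == attempts:
                raise               # give up and surface the last error
            sleep(delay * attempt)  # simple linear back-off

# Demo against an in-memory file; a real target would be the backup file.
buf = io.BytesIO()
write_block_with_retry(buf, 0, b"block-0")
write_block_with_retry(buf, 7, b"block-1")
print(buf.getvalue())
```

This only covers the file-level question. Whether the SMB session is still usable after a reset, and whether SQL Server's virtual device would tolerate the delay, are separate matters the sketch does not address.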

Marc

marclallen · June 18, 2007 8:31AM

Other information from MS support:

http://support.microsoft.com/kb/843515

I'm not sure if it applies, but since I don't know who's actually issuing the 112 error, I don't know if perhaps it's reporting the wrong error.

Marc

marclallen · June 18, 2007 3:54PM

Disabling NETBIOS over TCP/IP on the share computer seems to be showing good results. I am on a second round of backups with no failures so far.

I would prefer not keeping NETBIOS disabled, as I have to retool various portions of my network to handle that. So, please don't abandon the problem.

I also checked again, and the second machine that was still failing after I disabled Windows Update... well, I guess I didn't disable it. I thought I did... I checked and I did, but when I checked this morning, it was running. And no.. the system didn't reboot.

So, the one machine on which I'm sure I did disable the Windows Update service did manage to complete the backup.

Does anyone know if the Windows Update service operates on an hourly schedule from boot?

Marc

Brian Donahue · June 19, 2007 4:02AM

Hi Marc,

Windows Update will run on a schedule that you have set. You can either set this using the updates tab in My Computer properties, or have it set for you by a domain administrator who has configured a group policy for it.

marclallen · June 19, 2007 9:06AM

Brian,

That is not true as far as I can tell. I can set some options, yes, but I can only control when updates are actually installed, not when they are checked for or how often.

If you search Google on SVCHOST CPU 100 or similar, you'll find tons of threads about how the periodic Windows Update check can own a system for 10 minutes or more. It often happens on bootup and, for me, usually happens at least once in the afternoon, sometimes more often.

I have found no documentation on when Windows Update does its internal checks or what else it does besides simply updating.

In these particular cases, both machines had pending updates which I hadn't found time to install yet. I have no idea if that changes the equation or not.

Finally, all I am doing is reporting what I have found. And that is, that at least one backup succeeded after several failures when I disabled Windows Update.

I apologize if this seems rude, but when I am only able to receive a single communication from Red Gate each day, even though I often provide supplemental information early in your day, it irritates me that I receive unhelpful information.

Even had I not known about a way to schedule Windows Update, and even if your suggestion had fixed it, it does not relieve Red Gate of the responsibility of identifying why that would be causing a problem if, in fact, it is. I am under the impression that Red Gate feels this is not an issue with SQL Backup but a series of external, uncontrollable influences. If so, please explain why a standard SQL backup to the same share, which often takes two to three times longer, does not have any issues. I know this has been reported by other users, as I am in communication with one.

As you don't appear to be actually working on it, perhaps you could take some time and answer some questions for me so that, maybe, I can troubleshoot it better:

1) If, as you say, the error may be due to a lack of contiguous memory, why doesn't the SQL Server Log indicate it? They have several warning messages for that sort of scenario. Also, why is it reported as a lack of disk space?

2) Who is actually reporting that error? Is SQL Server doing something, getting an error and passing it on to you? Or does the server grab a pile of data, pass it on to you, and then you error out and return the error back to SQL Server?

3) If the Red Gate virtual backup device is the one generating the error, why is there no retry logic? Or is there? Especially in the case where it's actually a buffer management problem?

Some additional information:

When still failing, I ran NETMON.EXE on the share machine. It did not show any sort of error, or network reset. It simply stopped in the middle of a long SMB write.

Marc

Brian Donahue · June 19, 2007 11:12AM

Marc,

If you have got a current support contract, please please write an email to support@red-gate.com.

I have to say that since a considerable percentage of the workload in the support department is actually spent troubleshooting SQL Backup failures, particularly to network shares, we are not taking it lightly. So if you've got a serious problem, I'd appreciate a direct contact so we can get things rolling.

marclallen · June 19, 2007 11:55AM

Brian,

I have sent an email to support referencing this thread.

I didn't realize that I needed to do that for this problem to receive direct attention.

My apologies.

Marc
